Listing Price Prediction Using Different Feature Selection Techniques on the Seattle Airbnb Datasets
In this post we go through the process of analysing the Seattle Airbnb datasets available on Kaggle.
We are provided with 3 datasets by Kaggle.
- Listings, including full descriptions and average review score.
- Reviews, including unique id for each reviewer and detailed comments.
- Calendar, including listing id and the price and availability for that day.
We first need to analyse each dataset thoroughly, looking for fields of interest and checking for NULLs, unique values, outliers and other statistically important information.
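A first pass over each dataset could look something like the sketch below. The helper name `summarise` and the tiny sample frame are made up for illustration; in practice the frame would come from something like `pd.read_csv("listings.csv")`, where the file name is an assumption.

```python
import pandas as pd


def summarise(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, NULL count, NULL share (%) and unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "null_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),  # NULLs are not counted as a distinct value
    })


# Hypothetical sample standing in for one of the Kaggle files.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": ["$85.00", "$150.00", None, "$85.00"],
    "bedrooms": [1.0, 2.0, 1.0, None],
})

summary = summarise(sample)
print(summary)
```

Numeric outliers can then be inspected with `df.describe()`, which reports the minimum, maximum and quartiles of each numeric column.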
We are mainly driven by the below questions for our analysis:
- How do the availability of listings and price vary over the course of the year? This would help us understand which months are busy for rentals in Seattle.
- Explore monthly price variation over the course of the year in greater depth and also see how listing prices vary between weekdays and weekends.
- Use feature selection techniques (univariate analysis and Recursive Feature Elimination) to arrive at the features with the most influence on listing price.
- Build two linear regression models using the selected features and compare their performance.
Listing Availability and Price variation
Some listings in the listings dataset are available throughout the year and some are available for a limited period of time. The average availability is 244 days.
Fewer than 25% of listings are available for less than 125 days in a year. These are the casual hosts. The hosts at or above the 75th percentile are the more serious ones, who probably use Airbnb as a regular income source and probably care more about reviews.
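Figures like these can be read straight off a summary of the availability column. The sketch below assumes the column is called `availability_365` and uses made-up values, so the numbers it prints are not the ones quoted above.

```python
import pandas as pd

# Made-up values standing in for listings["availability_365"] (column name assumed).
availability = pd.Series(
    [0, 60, 125, 200, 300, 340, 365, 365], name="availability_365"
)

# describe() reports count, mean, std, min, the quartiles and max in one call.
stats = availability.describe()
print(stats[["mean", "25%", "50%", "75%"]])

# Share of listings available for fewer than 125 days a year.
casual_share = (availability < 125).mean()
print(f"Share below 125 days: {casual_share:.0%}")
```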
Let us now plot the monthly availability along with the monthly average listing price. While computing the monthly average, let us only consider available listings.
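One possible way to compute both monthly series from the calendar data is sketched below. It assumes the calendar has `listing_id`, `date`, `available` (with `"t"`/`"f"` flags) and `price` stored as text such as `"$85.00"`; the rows are made up, and counting distinct available listings per month is just one reasonable definition of availability.

```python
import pandas as pd

# Made-up rows in the shape assumed for the calendar dataset.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 1, 2],
    "date": ["2016-01-04", "2016-01-05", "2016-01-04",
             "2016-07-01", "2016-07-02", "2016-07-02"],
    "available": ["t", "f", "t", "t", "f", "f"],
    "price": ["$80.00", None, "$120.00", "$1,150.00", None, None],
})

calendar["date"] = pd.to_datetime(calendar["date"])
calendar["month"] = calendar["date"].dt.month

# Strip "$" and thousands separators, then convert to a number (missing stays NaN).
calendar["price"] = pd.to_numeric(
    calendar["price"].str.replace(r"[$,]", "", regex=True)
)

# Only rows where the listing is actually available feed the monthly figures.
available = calendar[calendar["available"] == "t"]
monthly = available.groupby("month").agg(
    available_listings=("listing_id", "nunique"),
    avg_price=("price", "mean"),
)
print(monthly)
```

Calling `monthly.plot(subplots=True)` would then draw the two series one above the other.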
- We see that the available listings drop in Summer months only to peak from October to December.
- During the early part of the year, more listings are available in March.
- The average listing price peaks during the summer months, when the number of available listings hits bottom, hinting at a demand-supply effect. As expected, price is high when availability is low and vice versa. However, there could be other factors at play as well, such as more visitors in the summer months increasing demand.
- As a host, it makes sense to list during Summer months for maximum earnings and use the winter months for maintenance and repairs.
Explore Price variation Further
Continuing from our previous question, price in general is lowest until April and tends to peak from June to August. However, we would like to probe further and see whether this trend holds for all listings or only for listings above a certain price range.
To answer this we plot the first quartile (Q1), median and third quartile (Q3) of the monthly price averages and observe the trend. We also plot the average prices on weekends and weekdays to observe how they vary.
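The two summaries can be sketched as follows, under a couple of assumptions: each listing's prices are first averaged per month and the quartiles are then taken across listings, and "weekend" means Saturday and Sunday. The post does not spell out either choice, and the rows are made up.

```python
import pandas as pd

# Made-up calendar rows: three listings, one weekday and one weekend price per month.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
    "date": pd.to_datetime(
        ["2016-01-04", "2016-01-09"] * 3 + ["2016-07-04", "2016-07-09"] * 3
    ),
    "price": [50, 60, 100, 110, 200, 220, 50, 60, 120, 140, 300, 340],
})
calendar["month"] = calendar["date"].dt.month

# Step 1: average price per listing per month.
per_listing = calendar.groupby(["month", "listing_id"])["price"].mean()

# Step 2: quartiles of those averages across listings, one row per month.
quartiles = (
    per_listing.groupby(level="month").quantile([0.25, 0.5, 0.75]).unstack()
)
quartiles.columns = ["Q1", "median", "Q3"]
print(quartiles)

# Weekday vs weekend averages; dayofweek is 0 for Monday and 6 for Sunday.
calendar["is_weekend"] = calendar["date"].dt.dayofweek >= 5
weekday_weekend = (
    calendar.groupby(["month", "is_weekend"])["price"].mean().unstack()
    .rename(columns={False: "weekday", True: "weekend"})
)
print(weekday_weekend)
```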
From the above plots, we can suggest the following:
- The price of cheaper listings seems to be consistent across the year and does not change much, especially from March onwards. This is understandable, as hosts are not expected to further discount prices that are already at budget level.
- Prices of listings in the third quartile seem to be more affected by seasonal changes. In other words, expensive listings become even more expensive during the summer months.
- When it comes to weekday vs weekend price averages, prices on weekends are always higher than on weekdays. The two curves move together, which suggests that most listings apply a roughly constant markup for weekend pricing.
Apply Feature selection Techniques to find ‘Price influencers’
Apart from seasonal demand, there should be many other factors influencing the price of a listing. These can be quantitative or categorical variables.
In this blog post we focus on feature selection techniques that can be applied to quantitative variables. A number of feature selection techniques are available, and we limit ourselves to the following:
- Univariate analysis
- Recursive Feature Elimination (RFE)
First let us visualise the correlation between our numeric features using the Seaborn library’s heatmap. We use both Pearson’s correlation and Spearman’s rank correlation to see whether they give different results. A correlation coefficient takes a value between -1 and +1 and measures both the strength and the direction of the association between two variables: Pearson’s captures linear relationships, while Spearman’s captures monotonic ones.
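A minimal sketch of this step is shown below. The data is synthetic (generated so that `price` depends on `accommodates`), the column names are borrowed from the listings dataset for readability, and `plot_heatmap` is a hypothetical helper that is defined but not called here.

```python
import numpy as np
import pandas as pd

# Synthetic numeric features; price is built to depend on accommodates.
rng = np.random.default_rng(0)
n = 200
accommodates = rng.integers(1, 9, size=n)
df = pd.DataFrame({
    "accommodates": accommodates,
    "bedrooms": np.ceil(accommodates / 2),
    "minimum_nights": rng.integers(1, 8, size=n),
    "price": 40 * accommodates + rng.normal(0, 30, size=n),
})

# Pairwise correlation matrices under both definitions.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")
print(pearson.round(2))
print(spearman.round(2))


def plot_heatmap(corr: pd.DataFrame, title: str) -> None:
    """Draw an annotated heatmap of a correlation matrix (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_title(title)
    plt.show()
```

Calling `plot_heatmap(pearson, "Pearson")` and `plot_heatmap(spearman, "Spearman")` would draw the two heatmaps for comparison.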
The results from both heatmaps (using Pearson and Spearman correlation) suggest pretty much the same thing. From the above we can see that:
a. The number of bedrooms, the number of bathrooms, and the number of beds are highly correlated with each other and with price. It’s natural to think that the more rooms, beds, and toilets a listing has, the more expensive it becomes. However, the number of people a listing can accommodate had the strongest correlation with price among them (0.65).
b. The features describing whether a listing can be rented for a long time have no significant correlation with any features other than each other.
c. The minimum and maximum lengths of stay are not significantly correlated with other features.
Features selected using Univariate analysis:
For this we use a feature selector available in scikit-learn called SelectKBest, which allows us to select the ‘K’ best features. In our case we set K to 6. SelectKBest lets us choose the scoring function used to rank the features, and we chose f_regression, which scores each feature individually with an F-test on its linear relationship with the target. We chose this because we want to use the features to predict price, and price prediction is a regression problem.
The following features were selected for our dataset using SelectKBest:
accommodates, bathrooms, bedrooms, beds, guests_included and reviews_per_month
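The selection step can be sketched as below. This is an illustration on synthetic data rather than the actual notebook code: the target is generated from `accommodates` and `bathrooms` only, the other two columns are noise, and `k=2` is used instead of 6 because the toy frame has just four features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: price depends on accommodates and bathrooms only.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n).astype(float),
    "bathrooms": rng.integers(1, 4, size=n).astype(float),
    "minimum_nights": rng.integers(1, 8, size=n).astype(float),
    "number_of_reviews": rng.integers(0, 100, size=n).astype(float),
})
y = 20 * X["accommodates"] + 40 * X["bathrooms"] + rng.normal(0, 10, size=n)

# Keep the k features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)

selected = X.columns[selector.get_support()].tolist()
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(selected)
print(scores.round(1))
```

Because each feature is scored on its own, this approach does not account for redundancy between features: two highly correlated columns can both make the cut.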
Features selected using Recursive Feature Elimination (RFE):
This technique selects features by recursively considering smaller subsets of features by pruning the least important feature at each step. RFE is available as an estimator in scikit-learn.
We used Linear Regression as the underlying estimator, and the following features were selected:
host_listings_count, host_total_listings_count, accommodates, bathrooms, bedrooms and review_scores_location.
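A sketch of RFE wrapped around a linear regression is shown below, again on synthetic data where only `accommodates` and `bathrooms` drive the target. With a linear model, RFE ranks features by the magnitude of their coefficients, so the features are standardised first; that scaling step is an assumption of this sketch and not something the analysis above is known to have done.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: price depends on accommodates and bathrooms only.
rng = np.random.default_rng(7)
n = 300
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n).astype(float),
    "bathrooms": rng.integers(1, 4, size=n).astype(float),
    "minimum_nights": rng.integers(1, 8, size=n).astype(float),
    "number_of_reviews": rng.integers(0, 100, size=n).astype(float),
})
y = 20 * X["accommodates"] + 40 * X["bathrooms"] + rng.normal(0, 10, size=n)

# Put all features on the same scale so coefficient sizes are comparable.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# Refit repeatedly, dropping the weakest feature each round until two remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
rfe.fit(X_scaled, y)

selected = X.columns[rfe.support_].tolist()
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(selected)
print(ranking)  # 1 = kept; larger numbers were eliminated earlier
```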
Categorical Variables as features
There are techniques such as two-way ANOVA that can be used to estimate how the mean of a quantitative variable changes according to the levels of two categorical variables. However it is a more detailed topic and not considered in scope for this study.
We did some exploratory analysis (please refer to my GitHub repo for details), which showed that price varies greatly across property types, room types and neighbourhoods. So along with the quantitative variables obtained from the feature selection techniques, we will also use categorical variables such as neighbourhood_cleansed and property type to try to predict the price of a listing.
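Categorical columns need to be turned into numbers before a linear model can use them. One common way is one-hot encoding with `pd.get_dummies`, sketched below on made-up rows; the column name `property_type` and the category values are illustrative assumptions.

```python
import pandas as pd

# Made-up listings with one numeric and two categorical columns.
listings = pd.DataFrame({
    "accommodates": [2, 4, 6],
    "neighbourhood_cleansed": ["Ballard", "Belltown", "Ballard"],
    "property_type": ["House", "Apartment", "Apartment"],
})

# One 0/1 column per category level; drop_first removes one level per
# variable so the dummy columns are not perfectly collinear.
encoded = pd.get_dummies(
    listings,
    columns=["neighbourhood_cleansed", "property_type"],
    drop_first=True,
    dtype=int,
)
print(encoded)
```

The dropped level of each variable acts as the baseline, so each remaining dummy coefficient in a linear model is read as a price difference relative to that baseline.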
Predicting Prices using Linear Regression
To predict the price we aim to build a straightforward linear regression model. In fact we will build two different models, supply them with the features from univariate analysis and RFE respectively, and validate their performance.
In each case, as part of the model-building exercise, we split the listings dataset into train and test subsets and create a linear model to predict listing price using the features that we found to have a high correlation with price.
The root mean square errors of the two models are 3672.29 and 3742.02, which, along with the r-squared values, suggest that the models can be further improved.
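The train, predict and evaluate loop for two competing feature sets could look roughly like this. The data is synthetic, and the feature sets `set_a` and `set_b`, the `evaluate` helper and the 70/30 split are illustrative assumptions, so the scores printed here are unrelated to the ones reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: price depends on accommodates and bathrooms, not on reviews.
rng = np.random.default_rng(1)
n = 500
data = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n).astype(float),
    "bathrooms": rng.integers(1, 4, size=n).astype(float),
    "reviews_per_month": rng.uniform(0, 10, size=n),
})
price = 25 * data["accommodates"] + 30 * data["bathrooms"] + rng.normal(0, 20, size=n)


def evaluate(features: list[str]) -> dict[str, float]:
    """Fit a linear model on the given columns and score it on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[features], price, test_size=0.3, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    return {"mse": mse, "rmse": float(np.sqrt(mse)), "r2": r2_score(y_test, pred)}


# Two hypothetical feature sets, evaluated on the same split.
results = {
    "set_a": evaluate(["accommodates", "bathrooms"]),
    "set_b": evaluate(["accommodates", "reviews_per_month"]),
}
for name, scores in results.items():
    print(name, {k: round(v, 2) for k, v in scores.items()})
```

RMSE is the square root of MSE and is expressed in the same units as price, which makes it the easier of the two to interpret.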
Based on our analysis we can draw the following conclusions:
- Listing prices peak in the summer months, but the prices of already low-priced listings tend not to change much throughout the year.
- Fewer listings are available in summer, and the number increases towards the end of the year.
- Not surprisingly, prices were generally high during periods when fewer listings were available.
- Listing price has a strong correlation with the number of bedrooms, the number of bathrooms, and the number of beds. It’s natural to think that the more rooms, beds, and toilets a listing has, the more expensive it becomes. However, the number of people a property can accommodate had the strongest correlation with price.
- We used two different techniques to select features: RFE and univariate analysis. We found that accommodates, bedrooms and bathrooms are common to both approaches.
- We were then able to build two linear regression models to predict the listing price, each based on the quantitative variables from one approach plus categorical variables such as neighbourhood, property type and cancellation policy.
- The results showed that the features selected using univariate analysis gave a slightly better r-squared score.
If you would like to see in more depth how I arrived at these findings, you can head over to my GitHub.