Hi data science enthusiasts! It is my pleasure to write to you for the first time, and I am happy to share my passion for data science and machine learning with the community. Today I will be analyzing the Airbnb market in Amsterdam.
Amsterdam is a very popular destination, receiving well over 15 million tourists a year. Amsterdam is the first city to implement taxation specific to Airbnb renting (a 10% tourist tax, which is a lot compared to other jurisdictions), and as such provides an interesting case study for the questions I want to answer.
The questions I will answer using Airbnb data are:
How many hosts are running a business with multiple listings and where are they located in Amsterdam?
What are the top 10 Airbnb hosts based on the number of listings they have and how much do they constitute as a percent of the Amsterdam Airbnb market?
Can I develop an accurate regression model that would predict the price of an Airbnb listing in Amsterdam using available attributes? Which attributes are good predictors for price?
You can find my Python code on GitHub and follow along.
As Airbnb does not publish its own data on listings, I will be using the Amsterdam Airbnb dataset from Inside Airbnb, which is an independent third party that publishes datasets on Airbnb listings from major cities across the world.
The datasets are very handy for this kind of analysis. They are updated frequently and lend themselves well to regression analysis for predicting nightly listing prices, to time-based analysis at the listing level, and to visualizing these interesting data points.
I am using the Amsterdam Airbnb dataset (as of 18.08.2020) which can be found here
Part I: How many hosts are running a business with multiple listings in Amsterdam?
The straightforward answer is 2161 hosts. Percentage-wise, that’s roughly 13.17% of all hosts, which means that slightly more than one in ten Airbnb hosts in Amsterdam have multiple listings and are probably running a home rental business.
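A count like this comes down to grouping listings by host. Here is a minimal sketch in plain Python, assuming each listing row carries a `host_id`; the toy rows are made up for illustration and are not the Amsterdam data:

```python
from collections import Counter

# Hypothetical toy data: one (listing_id, host_id) pair per listing.
listings = [
    (101, "A"), (102, "A"), (103, "B"),
    (104, "C"), (105, "C"), (106, "C"), (107, "D"),
]

# Number of listings per host.
listings_per_host = Counter(host_id for _, host_id in listings)

# Hosts with more than one listing.
multi_hosts = [host for host, n in listings_per_host.items() if n > 1]

# Share of hosts (not of listings) that have multiple listings.
share = len(multi_hosts) / len(listings_per_host)
print(f"{len(multi_hosts)} of {len(listings_per_host)} hosts "
      f"({share:.2%}) have multiple listings")
```

With pandas, `df["host_id"].value_counts()` gives the same per-host counts in one call.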
Part II: What are the top 10 Airbnb hosts based on the number of listings they have and how much do they constitute as a percent of the Amsterdam Airbnb market?
The top Airbnb hosts based on the number of listings they have are as follows:
Together, these ten hosts list 351 homes and make up around 1.85% of the Airbnb homes in Amsterdam, which is a remarkable finding.
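The ranking and the market share can be computed in one pass over the host IDs. The sketch below uses a hypothetical helper, `top_hosts_share`, and made-up host IDs; it only illustrates the calculation and does not reproduce the figures above:

```python
from collections import Counter

# Hypothetical host_id per listing (made-up values, not the real dataset).
host_ids = ["A", "A", "A", "B", "B", "C", "D", "E", "E", "E", "E", "F"]

def top_hosts_share(host_ids, n):
    """Return the n hosts with the most listings and their combined share of all listings."""
    counts = Counter(host_ids)
    top = counts.most_common(n)
    share = sum(count for _, count in top) / len(host_ids)
    return top, share

top, share = top_hosts_share(host_ids, n=2)
print(top)
print(f"{share:.2%} of all listings")
```

Note that the denominator here is the number of listings, whereas the multi-listing share in Part I is measured against the number of hosts.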
Part III: Can I develop an accurate regression model that would predict the price of an Airbnb listing in Amsterdam using available attributes? Which attributes are good predictors for price?
I ran three different models that differ only in feature selection:
- Model 1: Feature selection using correlation threshold of 10%
- Model 2: All features
- Model 3: Subset of features depending on Information Gain feature selection
Measuring model performance
I will be using R-squared (R²) as my performance score. It indicates the proportion of variance in the target variable that the model explains: 1 is a perfect fit and 0 is no better than always predicting the mean. The score can also be negative (typically on data the model was not fitted on), which means the model does worse than that mean baseline.
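R² is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares, so a model that predicts worse than the mean of the target ends up below zero. A small sketch with made-up prices (the function is written out by hand here for clarity; it is not necessarily how the scores in this post were computed):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up nightly prices and three sets of predictions.
prices = [100.0, 150.0, 200.0, 250.0]
perfect = list(prices)                   # predicts every price exactly
mean_only = [175.0] * 4                  # always predicts the mean price
way_off = [400.0, 50.0, 500.0, 20.0]     # worse than predicting the mean

print(r2_score(prices, perfect))
print(r2_score(prices, mean_only))
print(r2_score(prices, way_off))
```

The three calls give a perfect score, a score of zero, and a negative score respectively.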
Feature Selection using correlation threshold of 10%
Only 17% of the variance in the price can be explained by this model.
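Selecting by a correlation threshold can be sketched as follows. This assumes the rule is "keep a feature if the absolute Pearson correlation with the price is at least 0.10"; whether the absolute value was used is my assumption for this illustration, and the data and the `noise_feature` column are made up:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_by_correlation(features, target, threshold=0.10):
    """Keep features whose absolute correlation with the target meets the threshold."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) >= threshold]

# Made-up data: six listings with a price and two candidate features.
price = [80, 120, 150, 200, 260, 300]
features = {
    "accommodates": [1, 2, 2, 4, 5, 6],   # rises with price
    "noise_feature": [3, 1, 2, 2, 1, 3],  # hypothetical, unrelated to price
}

selected = select_by_correlation(features, price)
print(selected)
```

A filter like this only sees linear, one-feature-at-a-time relationships, which is one reason it can miss useful predictors.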
The predictions are as follows:
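All features
We get a very bad performance score of -8.798 when running the model on all features. A negative R² means the model predicts worse than simply using the mean price, which is not surprising for a model with over 200 features, as it most likely overfits.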
The predictions are as follows:
Subset of features (k = 4) depending on Information Gain
We have a poor R-squared score of 0.096 when running the model on just 4 features selected by Information Gain. The model predicts better than Model 2, but reaches only about half the score of Model 1.
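For a continuous target such as price, information gain is commonly estimated as mutual information. One way to pick the top k features with scikit-learn is sketched below; using `SelectKBest` with `mutual_info_regression` is an assumption for illustration, not necessarily the exact setup behind the score above, and the data is randomly generated:

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)

# Made-up data: 200 listings, 10 numeric features; price depends on features 0 and 3.
X = rng.normal(size=(200, 10))
y = 50 * X[:, 0] + 30 * X[:, 3] + rng.normal(scale=5, size=200)

# Score each feature by its estimated mutual information with the target
# and keep the k = 4 highest-scoring ones. random_state makes the estimate repeatable.
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func=score_func, k=4)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)
print(selector.get_support(indices=True))  # column indices that were kept
```

With only four features kept out of more than 200, a low score is plausible: much of the signal may sit in the features that were dropped.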
It seems Model 1 outperforms the other models by a significant amount. From Model 1 we will look at the coefficients to determine feature importance when it comes to predicting the price. See below.
The features that matter most for the price of a stay make sense: property_type_Room in aparthotel and room_type_Hotel room. In other words, a listing that is a room in an aparthotel or a hotel room can add a significant amount to the price.
Whether or not a listing is located in the neighbourhood of ‘Centrum-West’ is also a strong predictor for price.
Furthermore, the feature accommodates, which stands for the number of guests that the listing can accommodate, is a valuable indicator for the price.
Lastly, whether the listing is a private room in an apartment is also a good indicator for the price.
In this post, we took a look at Airbnb data from Amsterdam and were able to determine some interesting insights:
- There are 2161 hosts who rent out multiple listings on Airbnb in Amsterdam. They make up around 13.17% of all hosts, which means that slightly more than one in ten Airbnb hosts in Amsterdam have multiple listings and are probably running a home rental business.
- The top 10 hosts by number of listings list 351 homes together and make up around 1.85% of the Airbnb homes in Amsterdam.
- Overall, our price prediction models were not very successful at predicting the price of a stay in Amsterdam. Our best model is Model 1, which uses a correlation threshold. The features that matter most for the price are whether the listing is a room in an aparthotel or a hotel room. Whether the listing is located in the neighbourhood of ‘Centrum-West’ is also a strong predictor. The number of guests the listing can accommodate is a valuable indicator as well, as is whether the listing is a private room in an apartment.
The low scores could be due to the independent variables not being good predictors of price, or to the imputation and variable selection/transformation choices I made.
Perhaps the model’s performance could be improved by employing other feature selection algorithms, such as Wrapper methods (RFE, Backward Elimination…).
Another option is to impute missing values using KNN imputation instead of mode and mean imputation, which might yield more accurate predictions in this context.
Of course, it always makes sense to have more data: one could combine Airbnb data from different cities, run different types of machine learning models on it and see if they perform better :)
I hope you enjoyed this read, and I am looking forward to posting more interesting stories in the future!
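As a rough idea of what KNN imputation looks like, here is a minimal sketch using scikit-learn's `KNNImputer` on a made-up feature matrix. The column meanings and `n_neighbors=2` are assumptions for illustration only:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up feature matrix (columns: accommodates, bedrooms) with one missing value.
X = np.array([
    [2.0, 1.0],
    [2.0, np.nan],   # bedrooms unknown for this listing
    [4.0, 2.0],
    [6.0, 3.0],
])

# Fill each missing value with the average of that column
# over the 2 most similar rows (similarity based on the observed columns).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Unlike mean imputation, the filled-in value depends on which listings look similar, so small listings are filled from other small listings rather than from the overall average.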
Stay tuned for more!
#datascience #machinelearning #tableau #python #udacity #airbnb #amsterdam