Predicting churn on big data using PySpark and AWS cloud computing

Jovan Gligorević
12 min read · Dec 31, 2020
Sparkify: a fictitious music streaming platform

Hi data science enthusiasts! It is my pleasure to write to you for the second time and to share my passion for data science and machine learning with the community! Today I will walk through a churn prediction model and the PySpark machine learning pipeline I developed for a fictitious music streaming platform called Sparkify, as part of Udacity’s Data Scientist Nanodegree. Enjoy the read!

Churn prediction is one of the most common use cases for machine learning in subscription-based businesses. Acquiring a new customer is generally far more costly (around five times as costly!) than retaining an existing one, and roughly 80% of traffic or revenue typically comes from 20% of customers (the Pareto rule).

Thus it is crucial for businesses to accurately identify potential churners before they churn and offer them an incentive to stay, such as a discount.

In this blog, I will be discussing the following:

  • Overview of the Sparkify dataset
  • A note on cloud computing in Amazon Web Service (AWS)
  • Data Understanding
  • Data Exploration
  • Feature Engineering
  • Modelling
  • Results
  • Conclusion

Overview of the Sparkify dataset

Sparkify is a fictitious music streaming service. The dataset resembles the data captured by platforms such as Spotify or Pandora.

Millions of users play their favorite songs through music streaming services on a daily basis, either through a free tier plan that plays advertisements, or by using a premium subscription model, which offers additional functionalities and is typically ad-free.

Users are able to upgrade or downgrade their subscription any time, but they can also cancel it altogether.

I will focus on predicting churn in the strict sense: users who cancel their subscription completely and stop using Sparkify, not just those who downgrade their plan.

Cloud computing using Amazon Web Services (AWS) EMR

Amazon EMR is a cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. With EMR you can run petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark.

To be able to run models on such a big dataset, I leveraged AWS’s EMR service. I decided to use more powerful machines, with more cores and memory, than was recommended for this job, although at a higher price. You should weigh the costs carefully when choosing the right machines for your cluster.

My cluster consisted of 3 nodes, with the following instances:

Master: 1x r5.12xlarge (48 vCore, 384 GiB memory, EBS only storage)

Core nodes: 2x r5.12xlarge (48 vCore, 384 GiB memory, EBS only storage)

For my custom configuration of the Spark context, please refer to my GitHub repo containing the Jupyter notebook code. It seems that Amazon EMR’s default setting for maximum resource allocation does not actually use the entire cluster’s capacity, so I implemented some best practices for setting up an EMR cluster. See here.
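As a rough sketch of what such a configuration could look like for this instance type (the values below are illustrative assumptions, not my actual settings; those are in the repo):

from pyspark.sql import SparkSession

# Hypothetical configuration sketch: ~5 cores per executor and executor memory
# sized to fit within the 384 GiB per node. The exact numbers are assumptions.
spark = (SparkSession.builder
         .appName("Sparkify churn prediction")
         .config("spark.executor.cores", "5")
         .config("spark.executor.instances", "17")
         .config("spark.executor.memory", "36g")
         .config("spark.driver.memory", "36g")
         .config("spark.default.parallelism", "170")
         .getOrCreate())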

I was able to run all of my models within one hour using this setup.

Data Understanding

First of all, no documentation was provided for the data, so I performed data exploration to understand its structure and the relationships between variables.

The dataset is a JSON file consisting of 26,259,199 usage log records from 22,278 Sparkify users, captured from October 1, 2018 until December 1, 2018.
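Loading the log into a Spark dataframe is then a one-liner (the S3 path below is a placeholder, not the actual location):

# Placeholder path for illustration; substitute the real location of the log file.
df = spark.read.json("s3n://sparkify-bucket/sparkify_event_data.json")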

Of these interactions, 778,479 logs come from users who are not registered on the platform, so they were filtered out. I was able to identify them by their empty names and missing registration date.

Contrary to my expectations, the churn rate in this dataset is 22.45%, meaning roughly every fifth user is a churner. Churn prediction usually involves imbalanced classes, with only a small percentage of users being churners. Theoretically, then, over- or undersampling techniques, stratified train/test splits and stratified k-fold cross-validation are not strictly necessary here, and we can proceed with random splits of the data and still expect a decent result.

The log data has a number of irregularities. Apart from the unregistered users, there were also some records with missing user IDs, which had to be removed.
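As a minimal sketch of this cleaning step, assuming unregistered users appear with an empty or null userId and a null registration field:

from pyspark.sql import functions as F

# Drop logs from unregistered users and records with missing user IDs.
df = df.filter((F.col("userId").isNotNull()) &
               (F.col("userId") != "") &
               (F.col("registration").isNotNull()))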

I also created a new field called State, derived from the Location field, which might come in handy for the analysis.
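Since locations look like ‘Dallas-Fort Worth-Arlington, TX’, the state code can be split off after the comma, for example:

# Derive State from Location, e.g. "Dallas-Fort Worth-Arlington, TX" -> "TX".
# Some locations list multiple states (e.g. "NY-NJ-PA"); this sketch keeps them as-is.
df = df.withColumn("state", F.trim(F.split(F.col("location"), ",").getItem(1)))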

The raw fields are as follows:

root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)

The log data is represented in the following format:

Row(artist='Popol Vuh', auth='Logged In', firstName='Shlok', gender='M', itemInSession=278, lastName='Johnson', length=524.32934, level='paid', location='Dallas-Fort Worth-Arlington, TX', method='PUT', page='NextSong', registration=1533734541000, sessionId=22683, song='Ich mache einen Spiegel — Dream Part 4', status=200, ts=1538352001000, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='1749042')

Apart from the identifiers for the user, their name and the agent they used to access Sparkify, there are some interesting attributes that seem worth exploring to gain insight into what differentiates a churner from a non-churner.

Data Exploration

There are a lot of different ways to predict churn, depending on the context. One concept of predictors that stands out in the churn prediction literature is RFM: Recency, Frequency and Monetary value.

Recency stands for how recently a user has used a service. It can be measured in time units within a certain time period, usually from the latest session or from a certain point in time. Active users use a service often, have less time between sessions, and have a lower propensity to churn.

Frequency stands for how often a user uses a service in a particular timeframe. The more frequently a user uses the service, the less likely he/she is to churn.

Monetary value stands for the amount of money spent by a user: the more is spent, the more committed a user is to the service and, theoretically, the less likely he/she is to churn.

I will implement the above concept in a unique way for Sparkify users, although the last component, monetary value, is not readily applicable in this situation.

However, we can replace it with features such as the number of different page visits, the average number of sessions in a time period, and session duration, which are all logical substitutes for monetary value in this context.

Page visits

First, let’s have a look at how page visits and their frequency differ across churners and non-churners.
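Per-user counts of each page type can be obtained with a pivot; a minimal sketch:

# Count visits per page type per user; each page becomes its own column.
page_counts = (df.groupBy("userId")
                 .pivot("page")
                 .count()
                 .fillna(0))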

The chart types vary between plain histograms and Seaborn’s distplot (kernel density plot) to show how the data is distributed between churners and non-churners.

The chart below illustrates the frequency of different page visits by churners and non-churners.

Distribution of page visits by churners and non churners

There do not seem to be significant differences in the distribution of page visits between churners and non-churners, although non-churners do seem to give more Thumbs Up on the songs they like compared to churners.

It also looks like churners receive more ads (Roll Advert) than non-churners, which makes sense given that there are more churners on the free plan than on the paid plan.

Days since registration

Let’s have a look at the tenure of a particular user, namely the number of days since they registered. Hypothetically, the longer a user has used a service, the less likely he/she is to churn.
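Since both ts and registration are epoch timestamps in milliseconds, tenure can be sketched as the gap between a user’s last event and their registration date:

# Days since registration per user (ts and registration are in milliseconds).
tenure = (df.groupBy("userId")
            .agg(((F.max("ts") - F.first("registration")) / (1000 * 60 * 60 * 24))
                 .alias("days_since_registration")))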

Kernel density plot of the number of days since registration for churners and non-churners

It is clear from the kernel density plot that non-churners have used the service relatively longer than churners: the relative probability of having only a small number of days since registration is higher for churners than for non-churners.

Number of Sessions (daily/monthly)

Let’s have a look at the average number of sessions on a daily and monthly level.
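One way to sketch this is to count distinct sessions per user per calendar day (or month) and then average:

# Average number of distinct sessions per user per day; use
# F.date_format(..., "yyyy-MM") instead of F.to_date for the monthly variant.
daily_sessions = (df.withColumn("date", F.to_date(F.from_unixtime((F.col("ts") / 1000).cast("long"))))
                    .groupBy("userId", "date")
                    .agg(F.countDistinct("sessionId").alias("n_sessions"))
                    .groupBy("userId")
                    .agg(F.avg("n_sessions").alias("avg_daily_sessions")))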

Distribution of daily and monthly average number of sessions of churners and non churners

On a daily level, the average number of sessions seems to be distributed equally between churners and non-churners.

On a monthly level, however, there is a difference: churners tend to have relatively fewer sessions per month than non-churners.

Average Daily/Monthly Session Duration

You probably saw this one coming: since we know the number of sessions per time period and are given a timestamp for each interaction, we might as well compare session duration between churners and non-churners.
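Session duration follows directly from the first and last timestamp within each session; a sketch:

# Session duration in minutes per session, then averaged per user.
session_length = (df.groupBy("userId", "sessionId")
                    .agg(((F.max("ts") - F.min("ts")) / (1000 * 60)).alias("duration_min"))
                    .groupBy("userId")
                    .agg(F.avg("duration_min").alias("avg_session_duration_min")))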

Distribution of daily and monthly session duration of churners and non churners

The distributions follow a similar pattern in the case of session duration.

Recency

Let’s have a look at Recency, as we discussed previously. In the context of Sparkify, I define recency as the number of days between the last session and the session before that.
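This can be computed with a window function over each user’s session start times; a sketch:

from pyspark.sql import Window

# Start time of each session, per user.
session_starts = (df.groupBy("userId", "sessionId")
                    .agg(F.min("ts").alias("session_start")))

# Order sessions newest-first, so lead() points at the previous session in time.
w = Window.partitionBy("userId").orderBy(F.desc("session_start"))
recency = (session_starts
           .withColumn("prev_start", F.lead("session_start").over(w))
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)  # keep only each user's latest session
           .withColumn("recency_days",
                       (F.col("session_start") - F.col("prev_start")) / (1000 * 60 * 60 * 24))
           .select("userId", "recency_days"))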

Distribution of Recency between churners and non churners

It seems that non-churners tend to have fewer days between their last session and the one before it than churners; this is especially visible in the second bar, where the relative probability for that number of days is higher for non-churners than for churners.

Number of unique artists listened to

This one also seems like a good predictor, assuming that churners generally listen to less music and may also listen to fewer artists than non-churners, because they may use other platforms for the other artists they listen to (which could have to do with licensing) or simply do not listen to much music.

Distribution of distinct artists listened to between churners and non churners

The distributions follow a similar skewed pattern; however, the relative probability of a churner falling in the first bucket of listened artists seems higher (0.0014) than the same bucket for non-churners (around 0.0013), although this may not be significant.

Number of distinct songs listened to

The same as above, but now looking at the number of distinct songs listened to.
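Both counts come from the same aggregation over song plays (the NextSong page marks an actual play):

# Distinct artists and songs played per user.
variety = (df.filter(F.col("page") == "NextSong")
             .groupBy("userId")
             .agg(F.countDistinct("artist").alias("n_artists"),
                  F.countDistinct("song").alias("n_songs")))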

Distribution of distinct songs listened to by churners and non churners

Churners seem to listen to slightly fewer unique songs compared to non churners.

Churn by State

Let’s have a look at the numbers per state: is there a significant difference between churners and non-churners with regard to their location?

The left chart shows the distribution of non-churners across states, sorted by the percentage of users coming from each state. This makes it easy to compare the charts and see that some states in the right chart have relatively more churners, most notably New York, or somewhat fewer, as in the case of Illinois.

Distribution in percentages of churners and non-churners across states

Churn by Gender

I was hesitant to include this field, as I expected churn to be equal between genders. From the chart below, however, it is evident that there are more male (M) than female (F) users on the platform, and that the distribution of churners leans male: there are slightly more male churners than female churners.

Distribution of churners and non churners by Gender

Feature Engineering

I have developed the features for the machine learning model according to the following steps:

  • Engineering the features used to identify churned users by aggregating the above data per user. I defined functions to aggregate or pivot the data and join the calculated fields back to the main dataframe.
  • Defining the binary response variable: churned users get a ‘1’ label, non-churners a ‘0’ label (see the sketch after this list). As stated at the beginning, churn is defined as users who completely cancel their subscription, regardless of whether they are on a paid plan.
  • Transforming the original dataset (one row per user log) into a dataset with user-level statistics (one row per user).
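As a sketch of the labeling step, assuming cancellations show up in the log as a ‘Cancellation Confirmation’ page event (the exact page name is an assumption) and that user_features is the per-user table built from the aggregations above:

# Users who reached the cancellation page get churn = 1; everyone else gets 0.
churners = (df.filter(F.col("page") == "Cancellation Confirmation")
              .select("userId").distinct()
              .withColumn("churn", F.lit(1)))

# user_features is the hypothetical one-row-per-user feature table.
user_features = (user_features.join(churners, on="userId", how="left")
                              .fillna(0, subset=["churn"]))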

Modelling

In the modelling phase, I proceeded to do the following:

  • defining machine learning pipelines using Spark MLlib for each model (sketched after this list): indexing the string variables (‘gender’, ‘state’), one-hot encoding (‘level’, ‘gender’, ‘state’), MinMax scaling of the numerical features (because of the highly skewed data), feature assembly, and a binary classifier (logistic regression, random forest and gradient boosted trees)
  • splitting the dataset into training and test sets using an 80/20 train/test split
  • training and tuning using grid search with 3-fold cross-validation on the training data
  • for logistic regression, the grid contained: regularization parameter: 0, 0.1; maximum iterations: 10
  • for random forest: number of trees: 10, 20; maximum tree depth: 10, 20
  • for gradient boosted trees: maximum iterations: 5, 10; maximum tree depth: 5, 10, 20
  • analyzing each model’s performance in cross-validation using F1 and evaluating the model on the test set (using accuracy, F1 score and AUC)
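A minimal sketch of one such pipeline (the random forest variant), assuming the per-user feature columns introduced earlier and the Spark 3 OneHotEncoder API; the details may differ from my actual notebook:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, MinMaxScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Index and one-hot encode the categorical columns, then assemble and scale features.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in ["gender", "state", "level"]]
encoder = OneHotEncoder(inputCols=["gender_idx", "state_idx", "level_idx"],
                        outputCols=["gender_vec", "state_vec", "level_vec"])
assembler = VectorAssembler(
    inputCols=["gender_vec", "state_vec", "level_vec", "days_since_registration",
               "avg_daily_sessions", "avg_session_duration_min", "recency_days",
               "n_artists", "n_songs"],
    outputCol="raw_features")
scaler = MinMaxScaler(inputCol="raw_features", outputCol="features")
rf = RandomForestClassifier(labelCol="churn", featuresCol="features")

pipeline = Pipeline(stages=indexers + [encoder, assembler, scaler, rf])

# Grid search with 3-fold cross-validation, selecting on F1.
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [10, 20])
        .addGrid(rf.maxDepth, [10, 20])
        .build())
evaluator = MulticlassClassificationEvaluator(labelCol="churn", metricName="f1")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

train, test = user_features.randomSplit([0.8, 0.2], seed=42)
cv_model = cv.fit(train)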

Evaluation metrics

Since churn prediction is a classification problem with usually highly imbalanced classes, it is necessary to evaluate models not only by accuracy, but also by recall, precision and AUC.

Recall and precision can be evaluated using the F1 score, which combines the two metrics into a single one by taking their harmonic mean.
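For reference, with P for precision and R for recall:

F1 = 2 · P · R / (P + R)

The test-set evaluation can then be sketched as follows, continuing from the pipeline above (BinaryClassificationEvaluator computes areaUnderROC by default):

from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

# Score the held-out test set with the best model found by cross-validation.
predictions = cv_model.transform(test)
accuracy = MulticlassClassificationEvaluator(
    labelCol="churn", metricName="accuracy").evaluate(predictions)
f1 = MulticlassClassificationEvaluator(
    labelCol="churn", metricName="f1").evaluate(predictions)
auc = BinaryClassificationEvaluator(
    labelCol="churn", metricName="areaUnderROC").evaluate(predictions)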

Results

The results are given per model:

Logistic Regression

The accuracy on the test set is 83.33%

The F1 score on the test set is 80.37%

The areaUnderROC (AUC) on the test set is 81.02%

The hyperparameters of the best model are:

maximum iterations: 10

regularization parameter: 0

Random Forest

The accuracy on the test set is 94.49%

The F1 score on the test set is 83.74%

The areaUnderROC (AUC) on the test set is 84.44%

The hyperparameters of the best model are:

number of trees: 20

maximum depth: 20

Gradient Boosted Trees

The accuracy on the test set is 85.41%

The F1 score on the test set is 82.73%

The areaUnderROC (AUC) on the test set is 82.58%

The hyperparameters of the best model are:

maximum iterations: 10

maximum depth: 5

Conclusion

The good performance of all three models means that we managed to develop a prediction model that can accurately identify churners based on log data containing users’ interactions and usage patterns.

The best performing model was the random forest, which reached very good prediction results on the test set, with an accuracy of 94.49% and high class prediction metrics as well.

The trained model could be deployed to help Sparkify identify current users at risk of churning, making it easier to offer an individual discount to those users.

Reflecting on the work, the most challenging part was coming up with good predictors, but it was also the most fun. With no documentation given about the data, it was fun to get creative and seek out features that could be helpful in predicting churn.

Potential improvements

There are several improvements possible:

  • Stratified cross-validation with a stratified train and test sample could be implemented to see whether it improves the model’s performance (a sketch follows this list), although in this case, where the classes are not that imbalanced, I am not sure it would greatly improve prediction accuracy.
  • Other user data (if available) could be leveraged, particularly data on listening behavior, such as songs that are skipped, which playlists a user listens to, how many personal playlists the user has created or deleted, etc.
  • A more comprehensive grid search could be run, although the models’ complexity and run time might increase dramatically. That said, with the current hyperparameters the models already perform quite well, and extra tuning might not necessarily improve results. In the case of random forest, the maximum tree depth could perhaps be increased slightly, since with the other parameters held constant, a higher tree depth seems to result in a higher F1 score.
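For the first point, a stratified split could be sketched with sampleBy on the churn label:

# Hypothetical sketch: sample 80% from each class for training, the rest for testing.
fractions = {0: 0.8, 1: 0.8}
train = user_features.sampleBy("churn", fractions=fractions, seed=42)
test = user_features.subtract(train)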

I hope you enjoyed this read. Leave a clap if you liked it! :)

For the full code please refer to my github repo.
