Causal Inference
Was your Marketing Campaign Effective? Let Regression Discontinuity Design Help You! — A Practical Python Tutorial
Using simulated data, learn a practical, step-by-step application of Regression Discontinuity Design (RDD) for measuring the impact of marketing interventions run by your organization.
You are a marketing data analyst, and your marketing team has decided to roll out a specific campaign for consumers who spend more than 100 USD per week on your website. They call these customers “loyal customers”. You have now been tasked with showing the leaders of the organization whether this campaign was worth it. Because your organization decided not to run an A/B test, you are going to reach for Regression Discontinuity Design from your causal inference toolbox. Unfortunately, you do not know how!
In this post, I am going to explain, step by step and with the help of code and charts, how you can do an RDD analysis. I will first talk about what RDD is, why we use it, why it is the closest thing to the gold standard of Randomized Controlled Trials (RCTs), and how you can analyze the effect of a treatment to judge a marketing campaign. If you are new to causal inference, you can read my previous article published on Towards Data Science here. Let’s go!
Regression Discontinuity Design
According to the World Bank, “Regression Discontinuity Design (RDD) is a quasi-experimental impact evaluation method used to evaluate programs that have a cutoff point determining who is eligible to participate. RDD allows researchers to compare the people immediately above and below the cutoff point to identify the impact of the program on a given outcome.” Let us break this definition apart.
- The first is that RDD is quasi-experimental. This means that it works on observational data and there is no randomized assignment.
- The second is that, because there is no randomized assignment, the units (consumers in this case) are assigned to the treatment and control groups according to a particular threshold. In our case, the threshold is 100 USD. If a consumer spent more than 100 USD, they got the treatment; otherwise they did not.
- The third is that we will compare the people just below and just above this threshold to see if our marketing campaign was impactful. The next part answers why we do this.
Why RDD? How RDD?
RDD estimates the local average treatment effect at the cutoff, where treated and comparison units are most similar. It is assumed that a person spending 99 USD is similar to a person spending 101 USD. The closer the units on either side are to the cutoff, the more alike they are. The only difference is that the customer who spent 101 USD got the specialized marketing campaign, while the customer who spent 99 USD did not. Within a narrow window around the cutoff, assignment is therefore approximately as good as random, as in a randomized controlled trial. This is the logic RDD relies on. For more information, one can read the following article.
RDD Assumptions
- There must be a continuous variable on which the cutoff rests, also called a running variable. In our case, the consumer spending is a continuous variable.
- There should be no sign of individuals manipulating their eligibility in order to increase their chances of being included in or excluded from the program. In our case, this assumption is fulfilled because the customers do not know about the 100 USD threshold and do not decide whether they receive the campaign, so they have no reason to adjust their spending to end up on one side of the cutoff.
- Individuals close to the cutoff point should be very similar, on average, in observed and unobserved characteristics. We take this assumption to hold because 100 USD is an arbitrary number, so it is reasonable to assume that people who spent 101 USD are similar to people who spent 99 USD.
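The no-manipulation assumption can be probed informally by checking whether customers bunch up just above the cutoff. The sketch below is a simplified stand-in for a formal density test, not a replacement for one: it counts customers in a narrow window on each side of the threshold and uses a binomial test to ask whether the split is compatible with 50/50. The helper name `bunching_check`, the window of 5 USD, and the simulated spending values are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import binomtest

def bunching_check(running, cutoff=100.0, window=5.0):
    """Count units just below/above the cutoff and test for an even split.

    Assumes the density of the running variable is roughly flat inside
    the window, so that without manipulation about half of the units
    should fall on each side.
    """
    running = np.asarray(running, dtype=float)
    n_below = int(np.sum((running >= cutoff - window) & (running <= cutoff)))
    n_above = int(np.sum((running > cutoff) & (running <= cutoff + window)))
    result = binomtest(n_above, n_below + n_above, p=0.5)
    return n_below, n_above, result.pvalue

# Hypothetical spending data: normal with mean 100 and standard deviation 25
rng = np.random.default_rng(0)
spending = rng.normal(100, 25, 1000)
n_below, n_above, p_value = bunching_check(spending)
print(f"below: {n_below}, above: {n_above}, p-value: {p_value:.3f}")
```

A very small p-value with many more customers just above the cutoff than just below it would be a warning sign that people are sorting themselves into the treatment.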
Python Code
For this tutorial, I am going to generate a dataset. With the help of this dataset, we will find out whether the marketing campaign was effective.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
pd.options.mode.copy_on_write = True
# Simulate customer data
np.random.seed(42)
# Number of customers
n_customers = 1000
# Simulating customer spending with random values
spending = np.random.normal(100, 25, n_customers)
# Treatment assignment (1 if spending > 100, 0 otherwise)
treatment = (spending > 100).astype(int)
# Simulating an outcome variable (e.g., future spending) based on treatment effect
# The treatment is expected to have a positive effect on future spending
future_spending = spending + 25 * treatment + np.random.normal(0, 15, n_customers) # Adding noise to future spending
# Creating a DataFrame
data = pd.DataFrame({
'customer_id': range(1, n_customers + 1),
'spending': spending,
'treatment': treatment,
'future_spending': future_spending
})
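Before plotting, it can help to see why a plain treated-versus-control comparison would mislead here. The sketch below rebuilds the same simulated data so that it runs on its own, then compares the difference in mean future spending across all customers with the difference inside a narrow window around the cutoff. The window of 5 USD on each side and the helper name `mean_difference` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Re-create the simulated data from this tutorial
np.random.seed(42)
n_customers = 1000
spending = np.random.normal(100, 25, n_customers)
treatment = (spending > 100).astype(int)
future_spending = spending + 25 * treatment + np.random.normal(0, 15, n_customers)
df = pd.DataFrame({'spending': spending, 'treatment': treatment,
                   'future_spending': future_spending})

def mean_difference(frame):
    """Treated minus control difference in mean future spending."""
    means = frame.groupby('treatment')['future_spending'].mean()
    return means[1] - means[0]

# Naive comparison: all treated vs. all control customers
naive_diff = mean_difference(df)

# Local comparison: only customers within 5 USD of the cutoff
window = df[(df['spending'] - 100).abs() <= 5]
local_diff = mean_difference(window)

print(f"Naive difference: {naive_diff:.1f}")
print(f"Local difference (within 5 USD): {local_diff:.1f}")
```

Treated customers already spent more to begin with, so the naive difference mixes the campaign's effect (25 in this simulation) with that pre-existing gap. Narrowing the window removes most of the gap; what remains is handled by modeling the trend in spending explicitly, which is what the regression step does.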
The next step is to take a first look at the discontinuity in how consumers reacted to the marketing campaign.
# Plotting the data to visualize the discontinuity
plt.figure(figsize=(10, 6))
plt.scatter(data['spending'], data['future_spending'], c=data['treatment'], cmap='coolwarm', alpha=0.5)
plt.axvline(x=100, color='red', linestyle='--', label='Treatment Threshold')
plt.title('RDD: Marketing Campaign Effectiveness')
plt.xlabel('Customer Spending')
plt.ylabel('Future Spending')
plt.legend()
plt.show()
The next step is to visualize the trendlines for treatment and control group. This can be achieved with the help of the following code.
# Fit linear regression models for the data on both sides of the cutoff
left_data = data[data['spending'] <= 100]
right_data = data[data['spending'] > 100]
# Create the X and Y arrays for each side (including constant for intercept)
X_left = sm.add_constant(left_data['spending'])
y_left = left_data['future_spending']
X_right = sm.add_constant(right_data['spending'])
y_right = right_data['future_spending']
# Fit the regression models
model_left = sm.OLS(y_left, X_left).fit()
model_right = sm.OLS(y_right, X_right).fit()
# Create a range of spending values for plotting the trendlines
spending_range_left = np.linspace(left_data['spending'].min(), 100, 100)
spending_range_right = np.linspace(100, right_data['spending'].max(), 100)
# Predict the future spending based on the regression models
pred_left = model_left.predict(sm.add_constant(spending_range_left))
pred_right = model_right.predict(sm.add_constant(spending_range_right))
# Plotting the data and the trendlines
plt.figure(figsize=(10, 6))
plt.scatter(data['spending'], data['future_spending'], c=data['treatment'], cmap='coolwarm', alpha=0.5, label='Customer Data')
plt.axvline(x=100, color='red', linestyle='--', label='Treatment Threshold')
# Plot the trendlines
plt.plot(spending_range_left, pred_left, color='blue', label='Trendline (Below Threshold)', linewidth=2)
plt.plot(spending_range_right, pred_right, color='black', label='Trendline (Above Threshold)', linewidth=2)
plt.title('RDD: Marketing Campaign Effectiveness')
plt.xlabel('Customer Spending')
plt.ylabel('Future Spending')
plt.legend()
plt.show()
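The vertical gap between the two trendlines at the threshold is the quantity RDD is after. As a rough sketch, the gap can be read off numerically by evaluating both fitted lines at a spending of 100. To keep the block runnable on its own, it re-creates the simulated data and uses `np.polyfit` instead of statsmodels; that substitution is only for brevity and is not part of the original walkthrough.

```python
import numpy as np

# Re-create the simulated data from this tutorial
np.random.seed(42)
n_customers = 1000
spending = np.random.normal(100, 25, n_customers)
treatment = (spending > 100).astype(int)
future_spending = spending + 25 * treatment + np.random.normal(0, 15, n_customers)

cutoff = 100
below = spending <= cutoff
above = spending > cutoff

# Fit a straight line on each side (polyfit returns [slope, intercept])
slope_left, intercept_left = np.polyfit(spending[below], future_spending[below], 1)
slope_right, intercept_right = np.polyfit(spending[above], future_spending[above], 1)

# Evaluate both lines at the cutoff and take the difference
value_left = slope_left * cutoff + intercept_left
value_right = slope_right * cutoff + intercept_right
jump_at_cutoff = value_right - value_left
print(f"Estimated jump at the cutoff: {jump_at_cutoff:.2f}")
```

Because the simulation adds 25 to future spending for treated customers, the printed gap should sit in the neighborhood of 25, up to sampling noise.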
Fitting the OLS Estimators
With a bandwidth of five, I use ordinary least squares (OLS) to fit the regression on the customers closest to the cutoff. The code below fits the model and displays its summary.
# Define a bandwidth around the cutoff (we'll use a small bandwidth to focus on the discontinuity)
bandwidth = 5
# Selecting data within the bandwidth of the cutoff
data_rdd = data[(data['spending'] >= 100 - bandwidth) & (data['spending'] <= 100 + bandwidth)]
# Create the running variable centered around the cutoff (spending - 100)
data_rdd['running_var'] = data_rdd['spending'] - 100
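# Regressors: treatment dummy, centered running variable, and their interaction
# (the interaction lets the slope differ on each side of the cutoff)
X = data_rdd[['treatment', 'running_var']].copy()
X['treatment_x_running_var'] = X['treatment'] * X['running_var']
# Adding constant for the intercept in the regression model
X = sm.add_constant(X)
y = data_rdd['future_spending']
# Fit the regression model; the coefficient on 'treatment' is the estimated jump at the cutoff
model = sm.OLS(y, X).fit()
# Display the summary of the regression model
model.summary()
In this model, the coefficient on treatment is the estimated jump in future spending at the cutoff, that is, the local effect of the marketing campaign. The coefficient on running_var is only the slope of future spending with respect to current spending below the cutoff, so it should not be read as the campaign's effect. Since the data were simulated with a jump of 25, the estimate should be in that neighborhood, although with so few customers inside the bandwidth it will be noisy. The p-value on treatment tells you whether the effect is statistically significant. Whether it is significant business-wise is a separate question that can be debated.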
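The choice of bandwidth is a judgment call: a narrow window keeps treated and control customers comparable but leaves few observations, while a wide window does the opposite. One way to see how much the conclusion depends on it is to re-estimate the jump for several bandwidths. The sketch below does this under stated assumptions: it re-creates the simulated data, fits a separate slope on each side of the cutoff with plain NumPy least squares, and the helper name `estimate_jump` and the list of bandwidths are made up for illustration.

```python
import numpy as np

# Re-create the simulated data from this tutorial
np.random.seed(42)
n_customers = 1000
spending = np.random.normal(100, 25, n_customers)
treatment = (spending > 100).astype(int)
future_spending = spending + 25 * treatment + np.random.normal(0, 15, n_customers)

def estimate_jump(running, outcome, treated, cutoff, bandwidth):
    """Local linear estimate of the jump at the cutoff.

    Regresses the outcome on an intercept, the treatment dummy, the
    centered running variable, and their interaction, using only
    observations within `bandwidth` of the cutoff.
    Returns (estimated jump, number of observations used).
    """
    centered = running - cutoff
    keep = np.abs(centered) <= bandwidth
    r, t, y = centered[keep], treated[keep], outcome[keep]
    design = np.column_stack([np.ones_like(r), t, r, t * r])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1], int(keep.sum())

results = {}
for bandwidth in [5, 10, 20, 50]:
    jump, n_used = estimate_jump(spending, future_spending, treatment, 100, bandwidth)
    results[bandwidth] = (jump, n_used)
    print(f"bandwidth={bandwidth:>2}: jump={jump:6.2f} (n={n_used})")
```

If the estimates stay in the same neighborhood as the bandwidth changes, the result is less likely to be an artifact of one particular window.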
More Charts
Boxplot
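fig, ax = plt.subplots(figsize=(10, 6))
# Pass ax so that pandas draws on this figure instead of opening a second, empty one
data_rdd.boxplot(column='future_spending', by='treatment', grid=False, patch_artist=True, ax=ax,
boxprops=dict(facecolor='lightblue', color='black'), medianprops=dict(color='red'))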
plt.title('Boxplot: Future Spending (Treated vs Control)')
plt.suptitle('')
plt.xlabel('Group')
plt.ylabel('Future Spending')
plt.xticks([1, 2], ['Control', 'Treated'])
plt.show()
Cumulative Distribution of Future Spending
# Split future spending by treatment status for the cumulative distributions
control = data_rdd[data_rdd['treatment'] == 0]['future_spending']
treated = data_rdd[data_rdd['treatment'] == 1]['future_spending']
plt.figure(figsize=(10, 6))
plt.hist(control, bins=30, density=True, cumulative=True, histtype='step', color='blue', label='Control')
plt.hist(treated, bins=30, density=True, cumulative=True, histtype='step', color='orange', label='Treated')
plt.title('Cumulative Distribution of Future Spending')
plt.xlabel('Future Spending')
plt.ylabel('Cumulative Probability')
plt.legend()
plt.show()
Some Caution
Because RDD estimates local average treatment effects around the cutoff point, the estimate does not necessarily apply to units with scores further away from the cutoff point. Therefore, we need to know what we are trying to answer using this experiment.
If the evaluation primarily seeks to answer whether the treatment should exist or not, then the RDD will not provide a definitive answer. However, if the question of interest is whether the treatment should be cut or expanded at the margin, then the RDD produces the local estimate of interest to inform this decision.
Sources: