CAUSAL INFERENCE

What Is a Counterfactual? A Guide for Elementary Causal Inference Scientists

The crux of causal inference lies in comparison. It is therefore essential for a scientist to understand, define, and develop valid counterfactuals, because valid counterfactuals lead to valid conclusions and, consequently, valid decisions. In this article, I explain what a counterfactual is.

Aayush Malik

Source https://matheusfacure.github.io/python-causality-handbook/_images/potential_outcomes.png

This is my second post on causal inference. In my first post, I talked about the basis of causal inference and how it is used for making business decisions. In this post, I discuss counterfactuals, validity, and biases in causal inference in detail.

What if?

Humans have been fascinated by this question since antiquity. What if I had taken the other route to reach the office today? What if the British had not set up the East India Company in India? What if I had chosen to study engineering over a natural science degree? What if we had given a discount to all our customers instead of high-value buyers? What if we had chosen to invest more in marketing as compared to product development?

All of the questions above force us to think about something unobserved and unreal: an imaginary construct. Because in the physical world we cannot observe alternate scenarios for the same unit at the same time, we need a way of imagining this alternate world. This imagined world is called the counterfactual: what would have happened in the absence of the intervention. Estimating an empirical counterfactual lets us measure the impact of our actions and interventions. The difference between the outcomes we actually observe and the outcomes that would have occurred without the intervention is the true measure of impact. Therefore, the key challenge before us is identifying and measuring valid counterfactual estimates.
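The idea of "actual outcome minus counterfactual outcome" can be made concrete with a small simulation. This is only an illustrative sketch: the names (`TRUE_EFFECT`, `potential_outcomes`) and all numbers are assumptions made up for the example, not values from any real data. In a simulation we can see both worlds for every customer, which is exactly what real data never allows.

```python
import random

random.seed(0)

TRUE_EFFECT = 5.0  # assumed effect of the intervention, chosen for illustration


def potential_outcomes(baseline):
    """Return (outcome without intervention, outcome with intervention)."""
    y0 = baseline                # the "no intervention" world
    y1 = baseline + TRUE_EFFECT  # the "intervention" world
    return y0, y1


# Hypothetical customers with different baseline spending.
baselines = [random.gauss(100, 15) for _ in range(1000)]
units = [potential_outcomes(b) for b in baselines]

# Inside a simulation both worlds are visible, so the impact is known exactly.
individual_effects = [y1 - y0 for y0, y1 in units]
average_effect = sum(individual_effects) / len(individual_effects)

# In real data only ONE outcome per unit is observed;
# the other one is the counterfactual and has to be estimated.
treated = [random.random() < 0.5 for _ in units]
observed = [y1 if t else y0 for (y0, y1), t in zip(units, treated)]
```

Everything that follows in causal inference is an attempt to recover something like `average_effect` when only `observed` and `treated` are available.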

Problems with Non-Counterfactual Methods

  1. Reflexive Comparison, also known as a before-after comparison: This simple method measures the same population before and after an intervention. For example, measuring the maths and reading scores of a set of students, training them for a year, and measuring the scores again after the training. Another example could be rolling out a special loyalty discount to all customers and measuring their average purchase values before and after the discount. The problem with this method is that the estimate is confounded: anything else that changed over the same period (for example, seasonality, or the students simply being a year older) is mixed in with the effect of the intervention. In simple words, we cannot be sure that the before-after change is due to the intervention.
  2. Cross-Sectional Comparison, also known as an apples-to-oranges comparison: This flawed method compares people who participated in a program with different people who did not. For example, comparing high-income, high-value buyers with low-income, low-value buyers. In simple words, this method does not make an apples-to-apples comparison and thus leads to wrong conclusions.
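To see how far off these two comparisons can be, here is a rough simulation. It is a sketch under invented assumptions: a true effect of 5, a time trend of 10 that affects everyone (`TREND`), and customers who join the program only when their baseline spend is above 100. None of these numbers come from real data.

```python
import random
from statistics import mean

random.seed(1)

TRUE_EFFECT = 5.0  # assumed true impact of the discount
TREND = 10.0       # assumed change over time that affects everyone (e.g. season)

n = 5000
baseline = [random.gauss(100, 20) for _ in range(n)]

# Reflexive (before-after) comparison: everyone receives the discount.
before = baseline
after = [b + TREND + TRUE_EFFECT for b in baseline]
reflexive_estimate = mean(after) - mean(before)  # trend and effect are mixed together

# Cross-sectional comparison: only customers who already spend a lot join.
joined = [b > 100 for b in baseline]
outcome = [b + TREND + (TRUE_EFFECT if j else 0.0) for b, j in zip(baseline, joined)]
participants = [o for o, j in zip(outcome, joined) if j]
non_participants = [o for o, j in zip(outcome, joined) if not j]
cross_sectional_estimate = mean(participants) - mean(non_participants)

print(f"true effect:              {TRUE_EFFECT:.1f}")
print(f"before-after estimate:    {reflexive_estimate:.1f}")
print(f"cross-sectional estimate: {cross_sectional_estimate:.1f}")
```

The before-after estimate absorbs the time trend, and the cross-sectional estimate absorbs the pre-existing gap between high and low spenders, so both overstate the effect in this setup.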

Therefore, our goal is to make the comparison unconfounded by other factors. In other words, we would like to create a situation in which there is a clear empirical comparison group that is statistically similar to the intervention group, the only difference being that one group receives the treatment and the other doesn’t. The whole field of impact evaluation and causal inference is therefore concerned with creating the best estimator of the counterfactual. In the following sections, we learn how this is done.

Control Group OR Comparison Group

One of the approaches that scientists suggest is using a control group or comparison group. Many authors use the two terms interchangeably, but it is good to make a distinction here. The Control Group is used for experimental methods, whereas the Comparison Group is used for quasi-experimental methods or natural experiments. Suppose we would like to know whether a loyalty discount or free shipping affects the buying behavior of our customers. To measure that, we take customers with similar demographics and buying patterns and divide them randomly into two groups: one group receives the discount or free shipping and the other doesn’t. Because it’s an experiment, it’s appropriate to use the term “control group” for the group that doesn’t receive discounts or free shipping. However, we still need to check that the two groups are balanced, that is, that they share similar characteristics.
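A minimal sketch of random assignment followed by a balance check might look like the following. The customer attributes (`age`, `past_spend`) and the helper `standardized_mean_difference` are hypothetical names introduced for illustration; the standardized mean difference is one common way to compare groups, not the only one.

```python
import random
from statistics import mean, stdev

random.seed(2)

# Hypothetical customer table; the attributes are made up for the example.
customers = [
    {"id": i, "age": random.gauss(40, 12), "past_spend": random.gauss(100, 30)}
    for i in range(2000)
]

# Random assignment: shuffle a copy, then split it in half.
shuffled = customers[:]
random.shuffle(shuffled)
half = len(shuffled) // 2
treatment, control = shuffled[:half], shuffled[half:]


def standardized_mean_difference(group_a, group_b, key):
    """Difference in group means, scaled by the pooled standard deviation."""
    xa = [unit[key] for unit in group_a]
    xb = [unit[key] for unit in group_b]
    pooled_sd = ((stdev(xa) ** 2 + stdev(xb) ** 2) / 2) ** 0.5
    return (mean(xa) - mean(xb)) / pooled_sd


balance = {
    key: standardized_mean_difference(treatment, control, key)
    for key in ("age", "past_spend")
}
print(balance)
```

Values close to zero suggest that randomization produced comparable groups on the attributes that were checked. A large value for any attribute would be a reason to look at the assignment again before trusting the comparison.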

Types of Designs

The designs can be divided into two major classes: Experimental Designs and Non-Experimental Designs. The design forms the strategy of how the counterfactual is defined, created, and estimated.

Experimental Designs are also called Randomized Controlled Trials or A/B Tests. They have frequently been termed the gold standard in evaluation design. Industry uses the term “A/B Testing”, while research and econometrics use the term “Randomized Controlled Trials”. It is not always possible to run a randomized controlled trial, for social, economic, political, ethical, or operational reasons.

Non-Experimental Designs use observational data that we get after an action has been taken. The major difference is that the researcher is not able to control the assignment mechanism. In other words, the researcher cannot control who gets the intervention and who doesn’t. In these situations, the researcher receives the data instead of contributing to its creation. These methods can be divided further into three groups.

  1. Natural Experiments: In these “experiments”, a homogeneous population is divided into two groups by circumstances outside the researcher’s control, and one group receives an intervention because of where it is located. For example, a village is divided by a highway, and we compare the health outcomes of people living on the side where the Primary Health Care (PHC) center is with those of people living on the non-PHC side.
  2. Quasi-Experimental Methods: These methods use statistical techniques to create a counterfactual. Some of the methods include difference-in-differences, propensity score matching, and regression discontinuity design.
  3. Regression-Based Methods: Participation in the intervention is captured by a dummy variable in a regression model. Some of the methods include endogenous treatment models, instrumental variables, switching regressions, and doubly robust regression.
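As one example from the list above, difference-in-differences can be sketched with nothing more than four group means. The setup below is simulated and every number in it (`TRUE_EFFECT`, `TREND`, `GROUP_GAP`) is an assumption for illustration. The method relies on the assumption that, without the intervention, both groups would have followed the same trend over time.

```python
import random
from statistics import mean

random.seed(3)

TRUE_EFFECT = 5.0  # assumed impact of the intervention
TREND = 10.0       # assumed change over time shared by both groups
GROUP_GAP = 30.0   # assumed pre-existing difference between the groups


def simulate(n, treated_group, post_period):
    """Simulate outcomes for one group in one period."""
    level = 100.0
    level += GROUP_GAP if treated_group else 0.0
    level += TREND if post_period else 0.0
    level += TRUE_EFFECT if (treated_group and post_period) else 0.0
    return [random.gauss(level, 10) for _ in range(n)]


n = 20000
treated_pre = simulate(n, True, False)
treated_post = simulate(n, True, True)
comparison_pre = simulate(n, False, False)
comparison_post = simulate(n, False, True)

# Change in the treated group minus change in the comparison group.
change_treated = mean(treated_post) - mean(treated_pre)
change_comparison = mean(comparison_post) - mean(comparison_pre)
did_estimate = change_treated - change_comparison

print(f"difference-in-differences estimate: {did_estimate:.2f}")
```

The change in the comparison group stands in for the counterfactual change of the treated group, so the shared trend and the fixed gap between the groups both cancel out.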

I will write in more detail about each of these methods. The methods above require a large number of observations. The number of observations needed to detect an effect with statistical confidence can be obtained from power calculations, which I will not go into here. Sometimes, however, it is not possible to have a large number of units, for example when a policy is enacted in a single country or organization. In these situations we instead have many observations of the same unit over a longer period of time. Two methods can be used to analyze them.

  1. Synthetic Control Method: The synthetic control method is a statistical method to evaluate treatment effects in comparative case studies. It builds a weighted combination of untreated units that tracks the treated unit before the intervention, and uses that combination as the counterfactual afterwards.
  2. Interrupted Time Series Design: An interrupted time series (ITS) design involves collecting data consistently before and after an interruption. This means introducing and withdrawing your digital product or service, or some part of it, and then seeing whether anything changes in the outcome you’re assessing.
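A very simple version of an interrupted time series analysis fits a trend to the pre-interruption data, projects it forward as the counterfactual, and compares it with what was actually observed. The sketch below uses simulated weekly data; `LEVEL_SHIFT`, `launch_week`, and the helper `fit_line` are illustrative assumptions. Real analyses usually also deal with seasonality and autocorrelation, which are ignored here.

```python
import random
from statistics import mean

random.seed(4)

LEVEL_SHIFT = 8.0  # assumed jump in the outcome caused by the interruption
launch_week = 40   # hypothetical week in which the feature was introduced

weeks = list(range(60))
series = [
    50 + 0.5 * t + (LEVEL_SHIFT if t >= launch_week else 0.0) + random.gauss(0, 1)
    for t in weeks
]


def fit_line(x, y):
    """Ordinary least squares for a straight line; returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return slope, y_bar - slope * x_bar


# Fit the trend on pre-interruption data only.
slope, intercept = fit_line(weeks[:launch_week], series[:launch_week])

# Project that trend forward: this is the estimated counterfactual.
post_weeks = weeks[launch_week:]
counterfactual = [intercept + slope * t for t in post_weeks]
observed_post = series[launch_week:]

estimated_shift = mean([obs - cf for obs, cf in zip(observed_post, counterfactual)])
print(f"estimated level shift: {estimated_shift:.2f}")
```

The projected pre-period trend plays the role of the "no intervention" world, which is why a long and stable pre-period matters so much for this design.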

These methods help us make evidence-based decisions. To ensure that the decisions are valid, we need to keep the concept of validity in mind, and in particular three kinds of validity.

  1. Internal Validity: This means that the control group and the intervention group are balanced, so that a difference in outcomes can be attributed to the intervention. If the groups are not similar, we cannot say with confidence whether our intervention worked.
  2. External Validity: This means that the findings from the analysis can be generalized to a wider population. The best way to achieve that is to draw the randomized units from a representative sampling frame.
  3. Construct Validity: This means that the indicators used to measure a phenomenon are a valid construct for that phenomenon. For example, scores in mathematics and language are a good construct for measuring the scholastic abilities of a student. Another example is the buying behavior of customers, which could be a good construct for how much they like our platform. Subject-matter expertise is advisable for choosing a reliable construct.

Biases and Challenges in Causal Inference

Some challenges arise when one uses causal inference for making business decisions. They cannot always be prevented, but acknowledging that they may exist and clearly stating the assumptions under which the analysis is performed helps us exercise judgment and make more evidence-based decisions. Four require special mention.

  1. Confounding: Confounding refers to the situation in which a third, unaccounted-for variable influences both our dependent and independent variables, causing a spurious correlation. If something other than the intervention differs between the treated and untreated groups, then we cannot conclusively say that any difference observed in the outcome of interest between the two groups is due solely to the intervention. Such a difference could also plausibly be due to the other variables that differ between the groups.
  2. Selection Bias: Those who take part in an activity differ systematically from those who do not. The approach adopted to deal with selection bias and other confounding factors is called the identification strategy. A strong identification strategy leads to a strong analysis. For example, it is wrong to compare the buying outcomes of those who buy less from our website with those who buy more and conclude that the intervention didn’t work.
  3. Contamination or Contagion: This issue is more likely to occur in development-sector evaluations. The same population receives benefits from multiple agencies, so the effect of one program is contaminated by the others. In controlled online experiments, it is less likely to be an issue.
  4. Spillover Effects: Sometimes the effects of an intervention reach units that did not receive it. It is recommended to have a reasonable geographical separation between the treated and untreated groups. Once more, this issue is more likely to be seen in development evaluation programs.
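Confounding in particular is easy to demonstrate with simulated data. In the sketch below, a hypothetical `high_income` flag influences both who receives the intervention and how much they spend. All probabilities and amounts are invented for the example. Comparing treated and untreated customers within each income level, and then averaging, is one simple way to adjust for a confounder that has been observed; it does nothing for confounders that were never measured.

```python
import random
from statistics import mean

random.seed(5)

TRUE_EFFECT = 5.0  # assumed impact of the intervention

rows = []
for _ in range(20000):
    high_income = random.random() < 0.5
    # The confounder drives who gets treated...
    treated = random.random() < (0.8 if high_income else 0.2)
    # ...and also drives the outcome.
    spend = (
        100.0
        + (40.0 if high_income else 0.0)
        + (TRUE_EFFECT if treated else 0.0)
        + random.gauss(0, 10)
    )
    rows.append((high_income, treated, spend))


def diff_in_means(data):
    """Mean spend of treated rows minus mean spend of untreated rows."""
    treated_spend = [s for _, t, s in data if t]
    untreated_spend = [s for _, t, s in data if not t]
    return mean(treated_spend) - mean(untreated_spend)


# Naive comparison ignores income entirely.
naive_estimate = diff_in_means(rows)

# Adjusted comparison: estimate within each income level, then take a weighted average.
strata = [[r for r in rows if r[0] == level] for level in (True, False)]
adjusted_estimate = sum(diff_in_means(s) * len(s) for s in strata) / len(rows)

print(f"naive estimate:    {naive_estimate:.1f}")
print(f"adjusted estimate: {adjusted_estimate:.1f}")
```

The naive estimate is far above the true effect because treated customers are mostly high-income, while the stratified estimate lands close to it.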

I hope you now have an understanding of counterfactuals and of the different types of designs that can be used to measure the effects of an intervention, in both commercial and research applications. To know more about me and my work, contact me on LinkedIn at https://www.linkedin.com/in/aayushmalik/

Happy Learning!
