Learn about correlation versus causation and how to differentiate these terms from one another when describing the relationship between variables.
In analytics, correlation and causation both describe relationships between variables. However, the two terms are not interchangeable. Causation indicates that one event causes another, while correlation only identifies that a relationship exists between two events or outcomes.
When two variables respond similarly to an event, you may assume that one event caused the other or that the two are directly connected. However, this isn’t always the case, making it important to be able to distinguish between correlation and causation.
The question of correlation versus causation is whether two events are simply related or whether one caused the other to happen. It is an important consideration since the presence of a correlation between two variables doesn’t mean one causes the other. When a clear relationship exists between variables, it can be tempting to conclude that a cause-and-effect relationship is present.
The problem with making this observation is that you may fail to consider other factors or variables that could cause the correlation. The correlation you observe may be causation, as both can be true, but correlation alone isn’t enough to declare causation.
Correlation measures the linear relationship between variables. In a positive correlation, when the value of one variable goes up, the other does as well. When one variable goes down, the other variable descends, too.
A negative correlation describes the opposite—one variable goes up, and the other goes down, with the two variables moving in opposite directions. If no relationship exists between variables, you would say the correlation is zero.
You can represent the strength of the relationship between variables using a correlation coefficient ranging from -1 to +1, where the closer the coefficient is to zero, the weaker the linear relationship is:
1 = Perfect positive correlation
0.5 = Moderate positive correlation
0 = Zero correlation
-0.5 = Moderate negative correlation
-1 = Perfect negative correlation
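As a minimal sketch of how such a coefficient can be computed, the snippet below implements the Pearson correlation coefficient, which is the usual measure of linear correlation. The function name `pearson_r` and the small data sets are made up for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]          # deviations from the mean of x
    dy = [y - mean_y for y in ys]          # deviations from the mean of y
    co_movement = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return co_movement / spread

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # moves exactly with hours
falling = [10, 8, 6, 4, 2]    # moves exactly against hours
mixed = [3, 1, 4, 2, 5]       # tends to rise with hours, but not consistently

print(pearson_r(hours, rising))    # 1.0
print(pearson_r(hours, falling))   # -1.0
print(pearson_r(hours, mixed))     # 0.5
```

The three results line up with the top, bottom, and upper-middle entries of the scale above.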
You can also use scatter plots to visualise correlations. With a positive correlation, the points on the scatter plot trend upward from left to right; with a negative correlation, they trend downward from left to right. A scatter plot representing variables with no correlation will have points that appear spread throughout the graph with no clear trend.
Limitations exist regarding how much you can learn from correlations, as correlation alone isn’t enough to prove causation. Additionally, a correlation coefficient only captures linear relationships between variables, so two variables can be strongly related in a non-linear way and still show a correlation near zero.
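The linear-only limitation can be illustrated with a small sketch. Here `y` is completely determined by `x` (it is simply `x` squared), yet the Pearson coefficient comes out as zero because the relationship is a symmetric curve rather than a line. The helper `pearson_r` and the data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    co_movement = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return co_movement / spread

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]      # y depends entirely on x, but not linearly

r_curve = pearson_r(xs, ys)
print(r_curve)                 # zero, despite the perfect dependence
```

A scatter plot of these points would show a clear U shape, which is one reason to look at the plot as well as the coefficient.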
Even when variables are strongly correlated, it doesn’t prove a change in one variable caused the change in the other. To be able to do that, you must establish causation.
Causation occurs when one variable is directly responsible for the change in the other. In other words, a change in one variable causes a change in another variable. Proving this relationship tends to be more difficult than correlation and requires experimentation using both independent and controlled variables.
To prove causation, you need a properly designed experiment that demonstrates these three conditions:
Temporal sequencing: X, the variable causing the change, comes before Y, the variable that changes.
Non-spurious relationship: You can demonstrate that the relationship between X and Y is very unlikely to have occurred simply by chance.
Elimination of alternative causes: The relationship between X and Y isn’t due to other outside variables that aren’t considered part of the experiment.
Although it’s possible for both correlation and causation to occur at the same time, correlation doesn’t imply causation. This is because the relationship between variables could either be due to a third variable or simply a coincidence.
If you were to collect data on the sale of mangoes and air conditioners throughout the year, you would likely find a strong positive correlation between the two, as sales of both increase during the summer months. If you make the mistake of assuming correlation implies causation, you would incorrectly claim that an increase in mango sales causes people to buy air conditioners. However, this isn’t the case since you can attribute the increase in both to another variable, likely the warmer weather people experience during the summer. So, although a correlation is present, you can't support causation.
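The mango and air conditioner example can be sketched as a small simulation. All numbers below are invented: temperature drives both sales figures, and neither product influences the other. The sketch then removes the straight-line effect of temperature from each series and correlates what is left over, which is one simple way (not the only one) to check for a third variable. The helper names `pearson_r` and `residuals` are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    co_movement = sum(a * b for a, b in zip(dx, dy))
    return co_movement / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def residuals(ys, xs):
    """What remains of ys after subtracting its best straight-line fit on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
temperature = [rng.uniform(10, 40) for _ in range(500)]     # invented daily highs
mangoes = [5 * t + rng.gauss(0, 20) for t in temperature]   # depends on weather only
air_cons = [2 * t + rng.gauss(0, 8) for t in temperature]   # depends on weather only

raw = pearson_r(mangoes, air_cons)
adjusted = pearson_r(residuals(mangoes, temperature),
                     residuals(air_cons, temperature))

print(f"mangoes vs air conditioners: {raw:.2f}")
print(f"same, after accounting for temperature: {adjusted:.2f}")
```

In this made-up data the raw correlation is strong, while the adjusted one sits close to zero, which is the pattern you would expect when a third variable explains the relationship.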
In another correlation versus causation example, it may not be as easy to identify whether causation is present with two variables. For example, you could find a correlation between the amount someone exercises and their reported happiness levels. While an increase in exercise may be causing an increase in happiness, you can't say for sure that it’s the cause since there could be another unknown variable that significantly influences a person's mood.
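To reliably determine causation, you can perform randomised A/B/n testing, which works the same way as an A/B test but compares any number of additional variants. Because participants are randomly assigned to each variant, other possible factors are balanced across the groups on average, so a difference in outcomes can be attributed to the variant itself.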
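The following is a rough simulation of a two-variant randomised test, under invented assumptions: every participant has a hidden baseline that is never measured, and variant B adds a fixed lift (`TRUE_EFFECT`) to the outcome. Because assignment is a coin flip, the hidden baseline ends up nearly equal in both groups, and the difference in average outcomes lands close to the true lift.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the sketch is repeatable
TRUE_EFFECT = 2.0         # assumed lift from variant B, in made-up units

baselines = {"A": [], "B": []}
outcomes = {"A": [], "B": []}
for _ in range(20000):
    baseline = rng.gauss(50, 5)       # hidden factor that is never measured
    arm = rng.choice(["A", "B"])      # random assignment ignores the baseline
    lift = TRUE_EFFECT if arm == "B" else 0.0
    baselines[arm].append(baseline)
    outcomes[arm].append(baseline + lift + rng.gauss(0, 1))

# Randomisation should leave the hidden factor roughly balanced across arms.
baseline_gap = statistics.fmean(baselines["B"]) - statistics.fmean(baselines["A"])
# The difference in mean outcomes is then an estimate of the variant's effect.
estimated_effect = statistics.fmean(outcomes["B"]) - statistics.fmean(outcomes["A"])

print(f"gap in hidden baseline between arms: {baseline_gap:.2f}")
print(f"estimated effect of variant B: {estimated_effect:.2f}")
```

An A/B/n test would follow the same pattern with more entries in the list passed to `rng.choice`.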
Another method used when determining causation is hypothesis testing. In hypothesis testing, you test your primary (alternative) hypothesis against a null hypothesis, which states that no effect or relationship exists. If your data would be very unlikely under the null hypothesis, you reject it in favour of your primary hypothesis, which helps you be as certain as possible about your results.
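One simple way to run a hypothesis test on two groups is a permutation test, sketched below. It is only one of several possible tests, chosen here because it needs no extra libraries. The null hypothesis is that the group labels make no difference, so the code shuffles the labels many times and counts how often the shuffled difference in means is at least as large as the observed one. The function name `permutation_p_value` and all data are illustrative.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_b) - mean(group_a))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)   # reassign group labels at random
        shuffled = abs(mean(pooled[n_a:]) - mean(pooled[:n_a]))
        if shuffled >= observed:
            at_least_as_extreme += 1
    # Add one to both counts so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

control = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15]
variant = [18, 21, 17, 19, 22, 20, 18, 21, 19, 20]   # clearly higher values
similar = [13, 14, 12, 15, 13, 14, 12, 16, 13, 15]   # much like the control

p_variant = permutation_p_value(control, variant)
p_similar = permutation_p_value(control, similar)

print(f"control vs variant: p = {p_variant:.4f}")
print(f"control vs similar: p = {p_similar:.4f}")
```

A small p-value, as in the first comparison, is grounds to reject the null hypothesis; a large one, as in the second, is not. On its own, such a test says the difference is unlikely to be chance; it supports a causal claim only when the groups come from a properly designed experiment.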
In analytics, distinguishing between correlation and causation is crucial because correlation only indicates a relationship between variables, while causation confirms that one variable directly influences the other. Establishing causation requires rigorous experimentation to prove that one event leads to another, eliminating the possibility of other influencing factors.
To develop important analytical skills, such as data collection, calculations, and analysis, consider earning a Google Data Analytics Professional Certificate on Coursera. With this certificate, you can qualify for in-demand positions in less than six months, such as a data analyst or junior data analyst.
The University of Colorado Boulder’s Statistical Inference and Hypothesis Testing in Data Science Applications and Data Analysis Tools from Wesleyan University on Coursera are also great courses to learn more about how you can properly implement hypothesis testing.
Editorial Team