On our final day in the Critical Thinking unit, we are going to discuss the problem of causation versus correlation. How can we tell when something causes another thing to happen versus two things that seem related (correlated) but really have no deeper linkage? We will also discuss some basics of statistics and introduce the normal distribution, otherwise known as the “bell curve”.

Physicists manipulate and study subatomic particles. Chemists study molecules. Even biologists can study cells or non-intelligent animals. Only the social scientists have to deal with humans – who are aware they are being observed. This is known as the “Hawthorne Effect” after a factory in which it was first noticed that workers worked harder and increased their productivity when they were aware they were being monitored. To get around the Hawthorne Effect, social scientists often use statistical methods to try to understand human behavior.

Some notes on terminology – a sample is a subset of the population and a statistic is any measure taken from a sample. If our sample is a good representation of our population, then the behavior of the sample should match the behavior of the overall population. Sampling is a really effective way of understanding the truth of a larger population through the process of inference. For example, if I sample one piece of a cake, my experience with that slice should be the same as my experience with any other slice. But what if I just sample the frosting? It’s important (critical, even) that my sample resemble the population as closely as possible for my inferences to be valid.

One way to get a representative sample is to take a random sample. The idea is that if I pick my sample randomly, every member of the population has an equal chance of being chosen, so each group should appear in my sample in roughly its true proportion. And thus, my random sample will tend to be representative of my population. Getting a truly random sample is a lot tougher than you might think. In this example, the sampling method was a mail-in survey sent out to NCSU alumni. But… such a sampling method tends to miss those graduates with no address on file or with incomes so low they feel ashamed to report them. It is biased in favor of graduates with higher incomes. Thus, the results are flawed because the sample is biased and not representative of the overall population. Link for salary statistics: http://www.bizjournals.com/triangle/stories/2010/07/19/daily66.html
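To make the mail-in problem concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the salary numbers, the `responds` rule (higher earners are assumed more likely to mail the survey back) and the sample sizes. It is not the NCSU data, just a toy population that behaves the way the paragraph above describes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 graduate salaries (invented numbers).
population = [random.lognormvariate(10.8, 0.5) for _ in range(10_000)]


def responds(salary):
    """Assumed response rule: the higher the salary, the likelier a reply."""
    probability = min(1.0, salary / 100_000)
    return random.random() < probability


# A true random sample: every graduate has the same chance of being picked.
random_sample = random.sample(population, 500)

# A voluntary mail-in sample: graduates pick themselves via responds().
mail_in_sample = [salary for salary in population if responds(salary)]

true_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
mail_in_mean = statistics.mean(mail_in_sample)

print(f"population mean salary:     {true_mean:,.0f}")
print(f"random sample mean salary:  {random_mean:,.0f}")
print(f"mail-in sample mean salary: {mail_in_mean:,.0f}")
```

Under these assumptions the random sample lands close to the population mean, while the mail-in sample comes out noticeably too high, even though it contains far more responses. A bigger sample does not fix a biased one.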

When a sample is not representative of the population, we say such a sample is biased. Voluntary responses automatically create bias (see NCSU example). Face-to-face surveying also creates bias, as shy people and those uncomfortable with the surveyors will avoid the process. Using phone calls creates bias, since people who have no phone or who do not answer unknown callers are left out. Sometimes, statisticians will try to create a sample based on demographics. If the population is two-thirds white and one-third people of color, they will construct a sample with the same racial proportions in the hopes of minimizing bias. Another good way (used by sites like fivethirtyeight.com) is to aggregate many small samples together, in the hopes that the various kinds of bias will cancel each other out when the samples are combined.

Another problem we encounter in statistics is the misuse of precise mathematical terms. Laypeople use “average” to describe the mean, the median or the mode and the three terms are not always identical.

Here is an example where the presence of outliers pulls the mean, and to a lesser extent the median, away from the most common result (the mode). Income often looks like this when graphed out: a lot of poor and middle class people in the “hump” and a very few, very wealthy people out to the right. The presence of the wealthy few distorts the overall picture. Here, the mode is the most representative result. The mean and median are both higher than the mode, the mean especially so. All three could be carelessly described as “average”.
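A quick way to see the three "averages" come apart is to compute them on a small made-up data set. The incomes below are invented (in thousands of dollars): nine people in the "hump" plus two very wealthy outliers.

```python
import statistics

# Invented incomes in thousands of dollars, sorted from low to high.
# The last two values are the wealthy outliers.
incomes = [20, 25, 30, 30, 30, 35, 40, 45, 50, 400, 900]

mode = statistics.mode(incomes)      # most common value: 30
median = statistics.median(incomes)  # middle value: 35
mean = statistics.mean(incomes)      # sum / count: about 145.9

# The same calculation with the two outliers removed.
hump_only = incomes[:-2]
hump_median = statistics.median(hump_only)  # 30
hump_mean = statistics.mean(hump_only)      # about 33.9

print(f"with outliers:    mode={mode}, median={median}, mean={mean:.1f}")
print(f"without outliers: median={hump_median}, mean={hump_mean:.1f}")
```

Two rich people out of eleven are enough to drag the mean from about 34 to about 146, a number that describes nobody in the data. The median moves only from 30 to 35, and the mode stays put at 30.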

Whenever we graph how often each value occurs across the range of possible values, we call that a distribution. Like the previous example, here the presence of some wealthy outliers will skew the median and mean away from the mode.

There is one very, very important distribution where the mean, median and mode are all equal. It is called the normal distribution. It is the most important distribution in statistics. Many human traits (height, IQ and, more roughly, weight and lifespan) are distributed approximately normally. By the way, test scores also tend to follow a roughly normal distribution. Because the normal distribution is symmetric, we can make lots of useful inferences once we know the mean. For example, if test scores are distributed normally and the mean score is a “C”, I can be fairly confident that the number of A’s and B’s will be roughly equal to the number of D’s and F’s.
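The symmetry argument can be checked with Python's built-in `statistics.NormalDist`. The grading scale here is an assumption made up for the sketch: scores have a mean of 75 and a standard deviation of 10, and a "C" covers 70 to 80, so anything at 80 or above is an A or B and anything below 70 is a D or F.

```python
from statistics import NormalDist

# Assumed scale: mean 75 (the middle of the "C" band), standard deviation 10.
scores = NormalDist(mu=75, sigma=10)

# For a normal distribution the mean, median and mode coincide.
print(scores.mean, scores.median, scores.mode)

# Share of students in each group, using the assumed grade bands.
share_d_or_f = scores.cdf(70)                # area below 70
share_a_or_b = 1 - scores.cdf(80)            # area at or above 80
share_c = scores.cdf(80) - scores.cdf(70)    # area between 70 and 80

print(f"D or F: {share_d_or_f:.3f}")
print(f"C:      {share_c:.3f}")
print(f"A or B: {share_a_or_b:.3f}")
```

Because 70 and 80 sit the same distance on either side of the mean, the two tails come out the same size (a bit under a third of the class each with these numbers). Change the mean or the bands so they are no longer centered and the two shares stop matching.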

Here, we see that men’s height is normally distributed around a mean of about five foot ten inches.

The last topic we will cover today is the difference between causation and correlation. Statistics are often used to make it seem like one thing causes another. Causality is actually a pretty bold claim and hard to prove conclusively. Correlation is just two things that move together and is much easier to find. We make errors when we confuse correlation with causation. Another area of confusion lies with the direction of causation – what is exactly causing what to happen? Are kids prone to violence because they play violent video games? Or… do violent kids prefer violent games?

The classic example of correlation getting confused for causation is the apparent connection between shark attacks and ice cream sales. Both peak during the summer months. But if I want to test a hypothesis that ice cream consumption somehow causes shark attacks, I need to have a mechanism, an explanation that connects the two things to each other. Causation requires a mechanism, an explanation. Correlation does not. Correlation is easy to establish. Causation is hard.

By removing scales from the graph, I can make the two graphs look very similar. The underlying explanation for this correlation is temperature. When it gets hot, people enjoy ice cream. When it gets hot, people go to the beach. And the beach is near the ocean, where the sharks live. In statistical terms, temperature is a hidden variable, a confounder, that causes ice cream sales and shark attacks to move together.
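Here is a small simulation sketch of a confounder at work. All the numbers are invented: a year of daily temperatures, ice cream sales that depend only on temperature plus noise, and a made-up "shark attack index" that also depends only on temperature plus noise. Ice cream never appears in the shark formula, and sharks never appear in the ice cream formula. The helper names `pearson` and `residuals` are my own, written by hand so the sketch needs nothing beyond the standard library.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line effect of xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


# Invented data: temperature drives both series; they never touch each other.
temperature = [random.uniform(0, 35) for _ in range(365)]
ice_cream_sales = [20 + 3 * t + random.gauss(0, 10) for t in temperature]
shark_attack_index = [0.2 * t + random.gauss(0, 1.5) for t in temperature]

# Raw correlation between the two series.
raw_r = pearson(ice_cream_sales, shark_attack_index)

# Correlation after the temperature effect is removed from both series.
adjusted_r = pearson(residuals(ice_cream_sales, temperature),
                     residuals(shark_attack_index, temperature))

print(f"ice cream vs. sharks, raw:                  r = {raw_r:.2f}")
print(f"ice cream vs. sharks, temperature removed:  r = {adjusted_r:.2f}")
```

The raw correlation comes out strong even though neither series causes the other. Once the effect of temperature is subtracted from both, the correlation should fall to roughly zero, which is what we expect when a hidden variable is doing all the work.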

Does being a teen mom mean your baby is destined to not graduate high school? Or – do most teen mothers come from low-income families, and do poor people traditionally have lower rates of educational attainment? If this is causation, explain exactly how having a baby as a teenager will cause your baby to grow up to be a high school drop-out. A more likely explanation is that this is correlation, with income being the hidden confounding variable.

By playing with the scale of the graph (note that neither axis starts at zero), we can really make a strong case that lemon imports and car crashes are deeply connected. But in the absence of any plausible explanation for how these two things are connected, we have to conclude this is mere correlation.

A personal favorite! Using Internet Explorer made me think murderous thoughts but the notion of these two things being connected through causality is just silly.