LibGuides: Evaluating information & data: Lies and statistics

Lies and statistics

Students and researchers alike love using statistics because they can help transform complex data into something easily digestible. Statistics can also be turned into charts and graphs, giving the reader an easier way to picture what you are trying to tell them. For example, one could say that a survey of 957 students found that 651 of them like using statistics in their research, but it may be clearer to the audience to say that 68% of surveyed students like to use statistics in their research.

However, not everything is as it seems with statistics. Corporations, agencies, and even some scientists manipulate the facts and figures to make the numbers say something that isn’t true. Statistics can be misused to give a feeling of insight and weightiness when there is in fact a lack of both, and it is important to understand how statistics can be manipulated.

For a fun introduction to one statistical problem called “Simpson’s paradox,” watch the TED Talk “How statistics can be misleading” by Mark Liddell, in which he explains lurking variables and how grouping data in different ways can give contrasting results.
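To make the idea of contrasting results concrete, here is a minimal sketch of Simpson's paradox. The counts are invented purely for illustration (they are not from the talk or from any study), and the names `results` and `rate` are hypothetical.

```python
# Hypothetical counts, invented for illustration only: (successes, trials)
# for two treatments, split by a lurking variable (case severity).
results = {
    "mild cases":   {"A": (90, 100),   "B": (800, 1000)},
    "severe cases": {"A": (300, 1000), "B": (20, 100)},
}

def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials

# Within each subgroup, treatment A has the higher success rate.
for group, arms in results.items():
    a, b = rate(*arms["A"]), rate(*arms["B"])
    print(f"{group}: A = {a:.0%}, B = {b:.0%}")

# Pool the subgroups and the ranking flips: B now looks better,
# because A was mostly given to the harder, severe cases.
totals = {}
for arm in ("A", "B"):
    successes = sum(arms[arm][0] for arms in results.values())
    trials = sum(arms[arm][1] for arms in results.values())
    totals[arm] = rate(successes, trials)
    print(f"overall {arm}: {totals[arm]:.1%}")
```

In this made-up example, anyone reporting only the pooled numbers and anyone reporting only the subgroup numbers would reach opposite conclusions from the same data.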

How statistics can be manipulated

Much like anything, numbers and data can be manipulated to say something that isn’t true. Corporations, agencies, and even some scientists will try to make the data they are presenting as impactful as possible even if that means stretching the truth, messing with graphs, or reporting false positives. Here, we look at some of the ways that statistics can be made to say something that isn’t entirely true.

Just lie about it. The easiest and most unscrupulous way to lie with statistics is to do just that and lie about them. There is nothing easier than to write that 68% of 957 surveyed students like using statistics without having any data to back that up.

Calvin and Hobbes by Bill Watterson

Using the wrong “average.” When we hear “the average” of something, we typically think of the mean: the sum of all the numbers in a set divided by how many numbers the set contains (the mean of 5, 5, 5, 8, 12, 14, 21, 33, 38 is 141 / 9, or about 15.67). However, there are two other averages. The median is the middle number when the set is sorted and contains an odd count of numbers, or the value halfway between the two middle numbers when the count is even (the median of 5, 5, 5, 8, 12, 14, 21, 33, 38 is 12), while the mode is the number or value that occurs most frequently in a series (the mode of 5, 5, 5, 8, 12, 14, 21, 33, 38 is 5). Depending on which average one chooses, one can produce very different results, and even numbers that did not exist in the original data set. As this figure from the Harvard Business School shows, a data set can have a mean of 4 even though 4 does not appear in it.

The Use and Misuse of Statistics (2006), Harvard Management Update. Mar. 2006, Vol. 11 Issue 3, Special section, 3 Fig 1.
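The three averages for the nine numbers used in the text can be checked with Python's standard `statistics` module. This is only a quick sketch; the variable names are chosen here for illustration.

```python
from statistics import mean, median, mode

# The example set used in the text.
values = [5, 5, 5, 8, 12, 14, 21, 33, 38]

avg_mean = mean(values)      # sum of the values (141) divided by their count (9)
avg_median = median(values)  # middle value of the sorted set
avg_mode = mode(values)      # most frequently occurring value

print(round(avg_mean, 2), avg_median, avg_mode)  # prints: 15.67 12 5

# The mean is a number that does not occur anywhere in the data.
print(avg_mean in values)  # prints: False
```

Three defensible "averages" of the same data range from 5 to roughly 15.67, which is why it matters to ask which one a report is using.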

Statistical versus practical significance. In statistical analysis, a result is generally considered significant if it has a p value of 0.05 or less, meaning that a result at least this extreme would be expected no more than 5% of the time if chance alone were at work. A result with a p value higher than 0.05 is typically discarded as not significant. However, as Parks and Yeh put it, “The 0.05 threshold for significance that is commonly agreed upon is also somewhat arbitrary. A result with a p value of 0.06 (6% probability of a result caused by chance) would commonly be discarded because of a lack of statistical significance, even with a large effect size, whereas a result with a p value of 0.04 (4% probability of a result caused by chance) may be interpreted as unassailable truth… It is important to remember what statistical significance actually means and that it does not have anything to do with practical significance” (Parks and Yeh 2021: 612).
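The gap between statistical and practical significance can be sketched with a simple two-group z-test. Everything here is an assumption made for illustration: the helper `two_sided_p` is hypothetical, both groups are assumed to share a known standard deviation and the same size, and the effect sizes and sample sizes are invented.

```python
import math

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p value from a z-test for a difference between two group means.

    Assumes both groups have the same known standard deviation `sd`
    and the same size `n_per_group` (a simplification for illustration).
    """
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical: a negligible 0.1-point difference on a scale with sd = 10.
p_small_study = two_sided_p(0.1, sd=10, n_per_group=100)
p_huge_study = two_sided_p(0.1, sd=10, n_per_group=1_000_000)

# Hypothetical: a sizeable 5-point difference, but only 10 people per group.
p_large_effect = two_sided_p(5, sd=10, n_per_group=10)

print(f"tiny effect, 100 per group:       p = {p_small_study:.3f}")
print(f"tiny effect, 1,000,000 per group: p = {p_huge_study:.2e}")
print(f"large effect, 10 per group:       p = {p_large_effect:.3f}")
```

Under these assumptions the same negligible difference moves from far above 0.05 to far below it simply because the sample grew, while a much larger difference in a small study stays above the threshold. The p value alone says nothing about whether an effect is big enough to matter.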

Correlation does not equal causation. Correlation is a statistical measure that expresses the extent to which two variables are linearly related, meaning they change together at a constant rate. However, when one data set is correlated with another, this does not mean that the first caused the changes in the second, or vice versa. A connection between two variables that appears to be causal but is not is called a spurious correlation. Tyler Vigen, in his book “Spurious Correlations,” shows just how wrong it is to assume that correlation equals causation. Here are two charts from his book that demonstrate that data can be correlated without being causally related.
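As a rough sketch of how easily a strong correlation arises without any causal link, the example below computes the Pearson correlation coefficient for two invented series that both simply rise over time. The figures and names (`library_visits`, `coffee_price_cents`, `pearson_r`) are hypothetical and are not taken from Vigen's book.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Invented yearly figures for 2010-2019; neither series influences the other.
years = list(range(2010, 2020))
library_visits = [1000 + 50 * (year - 2010) for year in years]
coffee_price_cents = [250 + (year - 2010) ** 2 for year in years]

r = pearson_r(library_visits, coffee_price_cents)
print(f"r = {r:.2f}")
```

The coefficient comes out above 0.9, yet the only thing the two series share is that both grow with the passage of time, which acts as the lurking variable.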

Data dredging or p-hacking. Data dredging is the practice of probing data in unplanned ways after it has been collected in order to produce a statistically significant p value. Again, as Parks and Yeh describe it, “One common form of data dredging comes in the form of testing multiple subgroups to produce a positive result. This, however, is likely to produce a positive finding solely as a result of chance. With a significance threshold of 5%, if enough tests are performed on the same set of data, obtaining a false-positive result is practically guaranteed… The impact of data dredging is unknown, but likely widespread; a large proportion of studies with statistically significant results come with a p value just under the 0.05 threshold” (Parks and Yeh 2021: 614). What this means is that even if a study reports a statistically significant result, we cannot always know whether it reflects a real effect or is just a random false positive that was massaged out of the data.
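The claim that enough tests practically guarantee a false positive can be illustrated with a short calculation. This sketch assumes the tests are independent of each other and that no real effect exists in any of them, which is a simplification; the function name is hypothetical.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    comes out 'significant' when no real effect exists anywhere."""
    # Each test stays (correctly) non-significant with probability 1 - alpha,
    # so all of them do with probability (1 - alpha) ** num_tests.
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {chance_of_false_positive(k):.1%}")
# prints:
#   1 tests: 5.0%
#   5 tests: 22.6%
#  20 tests: 64.2%
# 100 tests: 99.4%
```

Under these assumptions, testing 20 subgroups already makes a spurious "finding" more likely than not.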

Lying with figures and charts. How one chooses to visualize the data can greatly affect how that information is understood by the reader. In the two examples below, Parks and Yeh demonstrate how changing the vertical axis affects how significant Treatment A appears compared to Treatment B, and how a three-dimensional pie chart can make the sections at the front appear as large as or larger than those behind, and thus as significant or more so, even though the front section represents a smaller percentage.

Parks, J., & Yeh, D. D. (2021). How To Lie with Statistics and Figures. Surgical Infections, 22(6), 616, 618 Figs 2 and 4.
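The effect of a truncated vertical axis can be put into numbers. The sketch below compares how tall two bars are drawn relative to each other depending on where the axis starts. The treatment values are invented for illustration and are not the data behind the figures above; `apparent_ratio` is a hypothetical helper.

```python
def apparent_ratio(value_a, value_b, axis_start):
    """How many times taller bar A is drawn than bar B
    when the vertical axis begins at `axis_start`."""
    return (value_a - axis_start) / (value_b - axis_start)

# Hypothetical outcomes for two treatments.
treatment_a, treatment_b = 52.0, 50.0

honest = apparent_ratio(treatment_a, treatment_b, axis_start=0)
truncated = apparent_ratio(treatment_a, treatment_b, axis_start=49)

print(f"axis starts at 0:  bar A is drawn {honest:.2f}x as tall as bar B")
print(f"axis starts at 49: bar A is drawn {truncated:.2f}x as tall as bar B")
```

With the axis starting at zero, the 4% difference looks like a 4% difference; starting the axis at 49 draws bar A three times as tall as bar B although the underlying numbers have not changed.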

There are of course more ways that numbers can be made to lie. To learn more, read the two articles listed under Sources below, on which much of this information is based. They are also linked in the Suggested box.

Sources:

Parks, J., & Yeh, D. D. (2021). How To Lie with Statistics and Figures. Surgical Infections, 22(6), 611-619.

(2006). The Use and Misuse of Statistics: How good are those numbers you’re looking at, anyway? Don’t rely on statistical analysis unless you know the pitfalls. Harvard Management Update. Mar. 2006, Vol. 11 Issue 3, Special section, 3-4.

Assess original dataset
When finding quantitative information like a number, a table, or a whole dataset, we want to be able to judge its quality to help us select the right information.

Suggested

How To Lie with Statistics and Figures
How to look for statistical and financial data
Searching for quantitative information such as data or statistics involves a different preparation than when searching for literature in bibliographic or full-text databases. For example, after selecting a database, your next step could be identifying and selecting datatypes or variables to include in your search. The Erasmus Data Service Centre is here to help you.
Spurious Correlations
The Use and Misuse of Statistics

