Understanding the Misuse of P-Values in Statistical Analysis
Chapter 1: The Nature of P-Values
P-values are crucial metrics in statistics; however, they are frequently misinterpreted and misapplied. To navigate the complexities surrounding p-values, it's essential to clarify what they represent and their inherent limitations. Let’s illustrate this with a relatable example to foster a better understanding.
What is a P-Value?
In the context of the American legal system, a foundational principle is the presumption of innocence, requiring the prosecution to establish guilt "beyond a reasonable doubt." In civil matters, the burden is lighter, relying on a "preponderance of the evidence." While these standards could be roughly interpreted as 99% and 51% certainty levels, quantifying evidence strength this way is often impractical. For the sake of discussion, let’s assume we can.
By conducting hypothesis tests, we can evaluate the significance of new evidence. The null hypothesis posits that the evidence is insignificant, while the alternative hypothesis suggests the contrary. We then calculate the probability of observing evidence at least this extreme if the null hypothesis were true — not, crucially, the probability that the null hypothesis is true given the evidence.
Consider a scene from the classic film My Cousin Vinny. Vinny Gambini (played by Joe Pesci) is defending two clients accused of a robbery and shooting. The case heavily relies on the testimonies of three witnesses who identify the defendants. Under the presumption of innocence, how likely is it that three witnesses could wrongly identify the suspects?
Vinny contemplates the improbability: is it possible that two men could enter the store, select 22 specific items, interact with the clerk, and then leave, only to have two different men drive up in a similar vehicle, commit a robbery, and shoot the clerk? While not impossible, it's exceedingly unlikely—akin to a one in a million chance. This scenario illustrates the p-value: the probability of observing such evidence if the null hypothesis were true. A lower p-value indicates stronger evidence against the null hypothesis. In this case, one in a million corresponds to a p-value of 0.0001%, or a confidence level of 99.9999%.
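The courtroom logic above can be sketched in a few lines of code. The 1% per-witness error rate and the assumption of independent witnesses are illustrative inventions, chosen only because they reproduce the one-in-a-million figure:

```python
# Null hypothesis: the defendants are innocent. Under the null, every
# witness must have independently misidentified an innocent suspect.

def p_value_all_misidentify(error_rate: float, n_witnesses: int) -> float:
    """Probability that all witnesses misidentify an innocent suspect,
    assuming independent witnesses sharing the same error rate."""
    return error_rate ** n_witnesses

# If each witness misidentifies 1% of the time, three independent
# misidentifications are roughly a one-in-a-million event:
p = p_value_all_misidentify(0.01, 3)
print(p)  # about 1e-06
```

The independence assumption is doing a lot of work here, which is exactly the kind of hidden premise a defense attorney would attack.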
While prosecutors may emphasize the low p-value to persuade the jury, the defense attorney seeks to reframe what that number actually means.
What a P-Value Is Not
It's vital to recognize that the confidence level does not equate to the probability of guilt; that requires a comprehensive evaluation of all evidence. The defense's aim is to present a narrative where the evidence against their clients seems plausible if they are innocent.
For example, if investigators discover the defendant's DNA and fingerprints at a crime scene, the p-value would initially appear very low—implying a strong case. However, if the crime scene turns out to be the defendant's own home, the evidence becomes far less compelling, and the p-value would increase.
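The DNA example is really a point about Bayes' theorem: the probability of innocence given the evidence depends on context, not just on how surprising the evidence would be for an innocent person. Here is a minimal sketch; every number in it is an invented assumption for illustration:

```python
# P(innocent | evidence) via Bayes' theorem. The p-value analogue is
# p_evidence_if_innocent; note it is NOT the posterior probability.

def posterior_innocent(prior_innocent: float,
                       p_evidence_if_innocent: float,
                       p_evidence_if_guilty: float) -> float:
    """Probability the defendant is innocent after seeing the evidence."""
    p_evidence = (p_evidence_if_innocent * prior_innocent
                  + p_evidence_if_guilty * (1 - prior_innocent))
    return p_evidence_if_innocent * prior_innocent / p_evidence

# DNA at an arbitrary crime scene: very unlikely if innocent,
# so the posterior probability of innocence collapses.
print(posterior_innocent(0.5, 0.001, 0.99))

# DNA at the defendant's own home: near-certain either way,
# so the same kind of evidence barely moves the prior.
print(posterior_innocent(0.5, 0.99, 1.0))
```

The second call shows the article's point numerically: identical evidence, wildly different implications once context sets the likelihoods.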
P-Values in Scientific Research
In research, scientists often assume the role of prosecutors, trying to demonstrate the existence of a particular effect—such as the influence of CO2 on global temperature. The null hypothesis suggests no relationship, while the alternative posits that increased CO2 correlates with higher temperatures. Unlike legal situations, researchers can accurately calculate p-values, and the scientific community determines whether findings merit publication.
This scenario creates a pressure for researchers to achieve significant findings, typically aiming for a confidence level of 95% (or a p-value of 5%). But what happens when results only reach a 94% confidence level?
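The arbitrariness of the 5% cutoff is easy to demonstrate: two experiments with nearly identical test statistics can land on opposite sides of the threshold. The sketch below computes two-sided p-values from z-statistics directly (the specific z values are invented to sit just either side of 1.96):

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

p_a = two_sided_p(1.97)  # just under 0.05 -> "significant"
p_b = two_sided_p(1.94)  # just over 0.05 -> "not significant"
print(p_a < 0.05, p_b < 0.05)  # True False
```

Nothing about the underlying effects differs meaningfully between these two results; only the binary label does.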
P-Hacking
Abandoning the research is not an option, and acquiring more data can be cost-prohibitive. This scenario can lead to unethical practices, such as selectively omitting variables that negatively affect results. Peer review serves as a safeguard, allowing other researchers to replicate findings and uncover these shortcomings. While this doesn't invalidate all studies, it highlights a significant issue: most funding is directed towards new studies rather than replication.
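The mechanics of p-hacking can be shown with a simulation: if you test enough pure-noise variables, roughly 5% will clear the 5% threshold by chance alone. This is a sketch with invented parameters (100 noise variables, 200 observations each):

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

def noise_p_value(n: int = 200) -> float:
    """p-value for the mean of pure noise being nonzero (z-test sketch)."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(xs) / n
    z = mean / (1 / math.sqrt(n))  # standard error of the mean is 1/sqrt(n)
    return two_sided_p(z)

# Test 100 variables that have no real effect whatsoever:
false_positives = sum(noise_p_value() < 0.05 for _ in range(100))
print(false_positives)  # typically around 5 of 100 look "significant"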
According to the National Academies of Sciences, Engineering, and Medicine, concerns regarding reproducibility in research have gained attention in both scientific and public spheres. While solutions have been proposed, the problem remains unresolved.
Does this imply that scientific studies are inherently unreliable? Not at all. Although the challenges of replication persist, they can be managed. The same report emphasizes that "research synthesis and meta-analysis" are effective methods for evaluating the reliability of research.
Why Meta-Analysis is Important
To illustrate this, picture a darts game with four players aiming for a bullseye. Each player's throws display considerable bias, making it unlikely for any of them to win. However, an observer analyzing their throws could potentially improve accuracy in future attempts.
Similarly, meta-analyses evaluate a broader range of studies on a subject, considering various biases to create models that better approximate reality. While these models offer valuable insights, they also come with higher variances, indicating they should be treated as aids rather than definitive answers.
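The simplest form of this pooling is a fixed-effect meta-analysis with inverse-variance weighting: precise studies count for more, and the pooled estimate is more precise than any single study. The effect estimates and standard errors below are invented to play the role of the four biased dart throwers:

```python
# Fixed-effect meta-analysis sketch: weight each study by 1/SE^2.

def pooled_effect(effects: list, std_errors: list) -> tuple:
    """Combine study estimates via inverse-variance weighting.
    Returns (pooled estimate, pooled standard error)."""
    weights = [1 / se ** 2 for se in std_errors]
    estimate = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    return estimate, pooled_se

# Four noisy, individually unconvincing estimates of the same effect:
effects = [0.30, 0.10, 0.25, 0.15]
std_errors = [0.10, 0.08, 0.12, 0.09]
estimate, se = pooled_effect(effects, std_errors)
print(round(estimate, 3), round(se, 3))
```

Note that the pooled standard error comes out smaller than any individual study's, which is the statistical version of the observer who learns from all four dart players at once. Real meta-analyses must also model between-study heterogeneity, which this sketch omits.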
Other Considerations
It's essential to understand that p-hacking can sometimes be unintentional. If a data point skews results, one must assess its validity. Researchers face complex decisions, balancing various stakeholder interests while determining which data to include.
A sounder approach is to report the p-value as obtained and then explore the limitations of the findings. However, documenting those limitations accurately is crucial to prevent misuse. Unfortunately, the nuances of research are often lost in popular media interpretations.
P-Values in Regression Models
Beyond p-hacking, it's critical to note that p-values indicate how confident we can be that a relationship exists between variables; they do not measure the strength of that relationship.
For instance, in a study analyzing the effect of state gun laws on homicide rates, a p-value of 0.14% (confidence level of 99.86%) suggests a relationship. However, examining the data visually reveals many exceptions. To quantify this relationship's strength, we utilize the r-squared score, which indicates what percentage of variance in homicide rates can be attributed to state gun laws. In this case, the r-squared value is only 6.73%.
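The gap between significance and strength is easy to reproduce: with enough data, even a weak relationship produces a tiny p-value alongside a small r-squared. The simulation below uses invented data (a true slope of 0.25 buried in unit noise) rather than the gun-law dataset from the article:

```python
import math
import random

random.seed(1)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.25 * xi + random.gauss(0, 1) for xi in x]  # weak true effect

# Pearson correlation computed from sums of squares:
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
syy = sum((yi - my) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2                                # variance explained
t = r * math.sqrt((n - 2) / (1 - r_squared))      # test statistic
p = math.erfc(abs(t) / math.sqrt(2))              # normal approx., large n

print(round(r_squared, 3), p < 0.001)  # small R^2, yet highly significant
```

The relationship is real and the p-value says so emphatically, but the r-squared shows it explains only a small slice of the outcome's variance, exactly the pattern in the gun-law example.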
Takeaways
P-values are not inherently flawed; they serve as valuable tools for assessing the significance of evidence. However, relying too heavily on a single metric or study to support broader claims can be misleading. Research consumers must maintain a holistic perspective, and media outlets should avoid sensationalizing individual study results. Ultimately, while p-values are useful, their application requires careful consideration.