I've been reading about the replication crisis in psychology and other fields, and p-hacking and statistical significance fishing seem to be a big part of the problem.
The idea that researchers can manipulate their analysis until they get a p-value below 0.05 is really concerning. But how widespread is this practice actually?
Have you encountered p-hacking in your work or studies? What are some red flags to look for when reading research papers? And what can be done to reduce this problem?
P-hacking and statistical significance fishing are probably more widespread than people realize. The pressure to publish, especially in academia, creates strong incentives to find "significant" results.
I've seen researchers try 20 different analysis methods until one gives p < 0.05, then only report that one. Or they collect data until they hit significance, then stop. Or they test dozens of hypotheses but only report the significant ones.
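These are all forms of p-hacking. The problem is that at a 5% significance level, about 1 in 20 tests of a nonexistent effect will come out significant by chance alone. If you try enough things, you're very likely to find something: with 20 independent tests, the chance of at least one false positive is already about 64%.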
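To make the "collect data until you hit significance" case concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: there is no real effect (data are drawn from a standard normal), the test is a two-sided z-test with known standard deviation 1, and the hypothetical researcher checks after every 10 observations up to 100. The function names are made up.

```python
import random
from statistics import NormalDist


def p_value_two_sided(sample):
    """Two-sided z-test p-value for mean 0, assuming a known standard deviation of 1."""
    z = sum(sample) / len(sample) ** 0.5  # sample mean times sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(n_sims=2000, batch=10, max_n=100, alpha=0.05, seed=0):
    """Return (rate with peeking, rate with one fixed-size test) when no effect exists."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        data = [rng.gauss(0.0, 1.0) for _ in range(max_n)]
        # Peeking: test after every batch and stop at the first p < alpha.
        if any(p_value_two_sided(data[:n]) < alpha
               for n in range(batch, max_n + 1, batch)):
            peeking_hits += 1
        # Honest version: a single test on the full, pre-planned sample.
        if p_value_two_sided(data) < alpha:
            fixed_hits += 1
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives with peeking:      {peeking_rate:.3f}")
print(f"false positives with fixed sample: {fixed_rate:.3f}")
```

The fixed-sample rate should land near the nominal 5%, while the peeking rate comes out noticeably higher, even though every individual test looks legitimate.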
Red flags for p-hacking: a paper reports only p-values, without effect sizes or confidence intervals. It tests many things but only discusses the significant results. It uses unconventional analysis methods without justification.
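Also, be suspicious when a paper's key results are all just barely significant (p = 0.049 and the like). A single p-value near 0.05 doesn't prove anything on its own, but with true effects you'd expect a range of p-values, many of them well below 0.05, not all clustered right at the threshold.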
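A small calculation can illustrate why clustering just under 0.05 looks odd. This is only a sketch under stated assumptions: a two-sided z-test whose statistic is normal with standard deviation 1, and a "true effect" that shifts the statistic's mean to 2.8 (roughly 80% power at the 5% level). The function names and the 0.04 cutoff are my own choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_p_below(p, shift):
    """P(p-value < p) for a two-sided z-test whose statistic is N(shift, 1)."""
    if p <= 0:
        return 0.0
    if p >= 1:
        return 1.0
    crit = STD_NORMAL.inv_cdf(1 - p / 2)
    # The p-value is below p exactly when |statistic| exceeds the critical value.
    return (1 - STD_NORMAL.cdf(crit - shift)) + STD_NORMAL.cdf(-crit - shift)


def share_just_under_threshold(shift, lo=0.04, alpha=0.05):
    """Among significant results (p < alpha), the share with lo < p < alpha."""
    significant = prob_p_below(alpha, shift)
    return (significant - prob_p_below(lo, shift)) / significant


print("no effect:  ", round(share_just_under_threshold(0.0), 3))
print("true effect:", round(share_just_under_threshold(2.8), 3))
```

With no effect, p-values are uniform, so a fifth of the significant ones fall between 0.04 and 0.05. With a real effect of this size the share is much smaller, because most significant p-values sit far below the threshold.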
What can be done? Preregistration of studies helps - stating your hypotheses and analysis plan before collecting data. Reporting all results, not just significant ones. Using stricter significance thresholds when testing many hypotheses, for example a Bonferroni correction, which divides the threshold by the number of tests.
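As a minimal sketch of the Bonferroni idea mentioned above: each p-value is compared against alpha divided by the number of tests. The p-values below are invented for illustration, and the function name is hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values survive a Bonferroni-corrected threshold."""
    if not p_values:
        raise ValueError("need at least one p-value")
    threshold = alpha / len(p_values)  # stricter threshold as tests are added
    return [p <= threshold for p in p_values]


# Made-up p-values from five hypothetical tests in one study.
p_values = [0.001, 0.012, 0.030, 0.049, 0.200]
print(bonferroni_significant(p_values))
# prints [True, False, False, False, False]
```

Four of these five results would pass an uncorrected 0.05 cutoff, but only one survives the corrected threshold of 0.01. Bonferroni is conservative, so this is one option rather than the only reasonable correction.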
The multiple comparisons problem is closely related to p-hacking. If you test 20 different things at the 5% level and none of the effects are real, you'd still expect about one to be significant by chance alone.
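The arithmetic behind this is short enough to write out. The sketch below assumes the tests are independent and that no real effects exist; the function names are just for illustration.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Average number of chance 'hits' when no effect is real."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 20, 50):
    print(k,
          round(expected_false_positives(k), 2),
          round(familywise_error_rate(k), 3))
```

For 20 tests this gives one expected false positive and roughly a 64% chance of seeing at least one.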
Some fields are starting to address this. In genomics, they use much stricter significance thresholds because they're testing thousands of genes simultaneously. But in psychology or social sciences, the multiple comparisons problem often gets ignored.
I think part of the solution is better education. Many researchers don't fully understand the statistics they're using. They learn to run tests in software and interpret p < 0.05 as "real," without understanding all the ways that can go wrong.