[quote]EyeDentist wrote:
[quote]Bill Roberts wrote:
[quote]ActivitiesGuy wrote:
[quote]mertdawg wrote:
If p is less than alpha=.05 doesn’t it mean that the null hypothesis is <5% likely to be true? Well doesn’t that mean that there is a 95% chance of at least some real relationship between the variables? [/quote]
[/quote]
Something I would add to AG's reply is that there can be a high probability, even near certainty, that the result occurred by chance despite this p value.
Let’s say we ask a bunch of people to come up with the most idiotic ideas they can think of as to what treatments might change various measured blood values of rats.
One comes up with a treatment wherein the water for the rats' water bottles is swirled as it's added, instead of just poured in as usual. Another conceives of a treatment wherein the researchers are required to speak only in pig Latin while working with the rats, instead of English. Still another changes brands of light bulbs in the labs, keeping lighting intensity and color temperature the same. In total, 100 idiot ideas are conceived.
And 10 blood test variables are examined for each study.
Most likely, out of these 1000 possible "effects" being studied, 50 or so will "show an effect" at p <= 0.05.
It will be almost certain that not a single real effect existed.
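Here's a minimal simulation sketch of that arithmetic, if anyone wants to see it run. It's Python, and everything in it (group sizes, normal data, a plain t-test) is an illustrative assumption, not anything from an actual study:

[code]
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_tests = 1000   # 100 idiot ideas x 10 blood variables each
n_rats = 20      # rats per group; arbitrary illustrative choice
alpha = 0.05

# Every null hypothesis is true by construction: "treated" and
# "control" come from the same distribution, so any result at
# p <= alpha below is pure chance.
false_positives = 0
for _ in range(n_tests):
    control = rng.normal(0.0, 1.0, n_rats)
    treated = rng.normal(0.0, 1.0, n_rats)  # identical distribution
    _, p = stats.ttest_ind(control, treated)
    if p <= alpha:
        false_positives += 1

print(false_positives)  # typically around 50, i.e. ~5% of 1000
[/code]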
I know my original post was eye-glazing due to length and probably writing style, but it goes over this in more detail. A p-value by itself does not calculate or show the probability that a real causal effect exists, or even necessarily indicate much likelihood of one.
The more variables measured in a study, and the less plausible the effects in the first place (most candidate treatments in fact provide no benefit), the more likely it is that an apparent outcome was not caused by the treatment but arose from chance alone.
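To put a rough number on the plausibility point, here is a back-of-the-envelope Bayes calculation. The prior, power, and alpha are all made-up illustrative values, not measurements of anything:

[code]
# Illustrative assumptions only: 1% of hypotheses tested are
# actually true, tests have 80% power, significance at alpha = 0.05.
prior_true = 0.01
power = 0.80
alpha = 0.05

# P(significant) = P(sig | true) * P(true) + P(sig | false) * P(false)
p_sig = power * prior_true + alpha * (1 - prior_true)

# Bayes' rule: P(true effect | significant result)
p_true_given_sig = power * prior_true / p_sig

print(round(p_true_given_sig, 3))
# ~0.139: under these assumptions, a "significant" result is a
# false positive about 86% of the time.
[/code]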
There are many studies which measure 20 things at a time. Most such studies will generate, from chance alone and not from any real cause, an "effect" if effect is judged by meeting p <= 0.05: with 20 independent tests at that threshold, the chance of at least one false positive is 1 - 0.95^20, about 64%. Most authors will at least suggest that their data support a causal relation. The reader must beware.
[/quote]
What you're talking about is sometimes referred to as the 'familywise error rate'--the probability that at least one of multiple statistical tests will come up significant by chance alone. Fortunately, methods abound for controlling the familywise error rate. So to be fair, no reputable journal would publish a study of the sort you're describing, i.e., one in which the authors shotgunned a bunch of tests at the p = .05 level.
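To make that concrete: for k independent tests at the .05 level, the familywise error rate is 1 - .95^k, and the standard corrections are one-liners in common stats packages. Here's a quick sketch using Python's statsmodels, with made-up p-values:

[code]
from statsmodels.stats.multitest import multipletests

alpha = 0.05

# Familywise error rate for k independent tests:
# P(at least one false positive) = 1 - (1 - alpha)^k
for k in (1, 10, 20, 1000):
    print(k, round(1 - (1 - alpha) ** k, 3))  # 0.05, 0.401, 0.642, 1.0

# Holm's step-down procedure (Bonferroni is similar but more
# conservative); the p-values are invented for illustration.
pvals = [0.002, 0.02, 0.04, 0.06, 0.30]
reject, p_adjusted, _, _ = multipletests(pvals, alpha=alpha, method="holm")
print(reject)      # which comparisons survive the correction
print(p_adjusted)  # the adjusted p-values
[/code]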
[/quote]
Actually, reputable journals do routinely publish studies with a bunch of tests at the p = .05 level. I have done so myself, and I have reviewed several papers that do the same.
Here's the thing: it's not always wrong. Several prominent statisticians have published editorials about the problems with adjusting for multiple comparisons. The biggest argument, and IMO a valid one, is that we can squabble endlessly over which comparisons count in the "how many comparisons should we adjust for?" decision.
Suppose I do a randomized clinical trial. Two treatment arms.
My primary comparison of interest is whether Drug A reduces all-cause mortality over 5 years of follow-up compared to Drug B. My secondary analyses of interest are whether Drug A reduces a composite cardiovascular outcome (MI, stroke, and whatever else), the incidence of new peripheral vascular disease, and a few other outcomes such as quality of life, healthcare economics, etc. For those of you who haven't published papers on large clinical trials, these analyses are generally divided among several working groups.
It turns out that there's a slight difference in five-year mortality (Drug A: 10% vs. Drug B: 13%) which is not quite significant, p=0.06 (bear with me; I'm making numbers up to prove a point). Same deal for cardiovascular mortality (Drug A: 8% vs. Drug B: 10%, p=0.08). There are no significant differences in PVD incidence or healthcare economics either, although Drug A generally looks "slightly" better than Drug B on both counts. However, there is an indication that patients receiving Drug A experienced significantly better quality of life during follow-up, p=0.02.
How do we interpret this result? Can I make a statement that Drug A “significantly” improves QoL over Drug B because my p-value is less than 0.05? If I publish a paper that’s focused solely on QoL outcomes, this may be the main comparison presented in my paper.
Or do I have to couch it by noting that this is a secondary outcome, and that when I adjust the p-value for multiple comparisons (some of which may not be published yet because they're being written up as separate papers) it is no longer statistically significant?
What if the group writing the PVD outcomes paper looks at multiple comparisons (say, low ABI as the primary diagnosis of PVD, along with more severe sequelae like amputation or lower-extremity revascularization)? Is my group, writing the QoL paper, supposed to know that the PVD group added a few additional PVD-relevant outcomes, and to take that into consideration when adjusting our p-values?
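To see how arbitrary this gets, here's a toy calculation using the made-up numbers above; the family sizes are exactly the thing nobody can agree on:

[code]
p_qol = 0.02  # the hypothetical quality-of-life p-value from above

# Bonferroni: multiply the p-value by the size of the "family."
# Adjust for the five outcomes in the original analysis plan...
print(min(1.0, p_qol * 5))  # 0.10 -> no longer "significant"

# ...or for eight, once the PVD group adds three endpoints the
# QoL group never even sees?
print(min(1.0, p_qol * 8))  # 0.16 -> even further from 0.05

# The conclusion flips on a bookkeeping decision made in a
# different working group.
[/code]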
That’s an impossible way of doing things.
IMO, we really should get away from p-values entirely, or at least from the strict view of results as "significant" or "not significant" based on whether they fall above or below one magical threshold. Decisions about results of this type are much more nuanced than "Is it statistically significant?" and yet that is too often the only way they are evaluated.
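For what it's worth, reporting the estimate with a confidence interval already does better than the binary verdict. Here's a rough sketch with counts loosely matching the made-up mortality numbers above; the 800-per-arm sample size is my own arbitrary choice, and a real trial would use something sturdier than a Wald interval:

[code]
import numpy as np

# Hypothetical counts: 5-year mortality of 10% vs. 13%, 800 per arm.
deaths_a, n_a = 80, 800
deaths_b, n_b = 104, 800

p_a, p_b = deaths_a / n_a, deaths_b / n_b
diff = p_a - p_b

# Wald 95% CI for the risk difference (rough sketch only).
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"risk difference {diff:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
# ~ -0.030 (-0.061, 0.001): an estimate and its precision say far
# more than "p = 0.06, not significant" ever could.
[/code]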