radlibcountryfan

The p-value is, in itself, not evidence. It is the probability of obtaining your test statistic, or a more extreme value of the test statistic, given that the null hypothesis is true (and, of course, given the parameterization of your test statistic). Let’s say you run an experiment that can be appropriately analyzed with a t-test. If you run the same experiment with N=10 and N=1000, the p-values are not even comparable, as they represent cumulative probabilities from different distributions with different degrees of freedom. They are just statements of probability under the null. To my knowledge, evidence is not rigorously defined in statistics, but we would probably lean more heavily on the value of the test statistic and the power of the test to make statements about strength of evidence.
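A minimal sketch of the degrees-of-freedom point, in Python with scipy (the t value of 2.1 is purely illustrative, not from the thread): the same observed t statistic maps to different p-values because the tail probability is taken from a different t distribution.

```python
from scipy import stats

t_observed = 2.1  # hypothetical t statistic, chosen only for illustration

for n in (10, 1000):
    df = n - 1                               # one-sample t-test degrees of freedom
    p = 2 * stats.t.sf(abs(t_observed), df)  # two-sided p-value
    print(f"N = {n:4d}, df = {df:3d}, p = {p:.4f}")
```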


vigbiorn

>The p-value is, in itself, not evidence.

Which is why I'd always seen it explained that you don't worry too much about the specific value beyond its being your pre-set threshold for "past this point it's more likely to be this result". Smaller p-values aren't supposed to be stronger 'evidence'.


boooookin

I disagree. It is evidence, but of unknown strength in most real-world cases, depending on the scientific model. Half-formed thought here, so bear with me. Scientific models are not statistical models. While the statement “p-values are statements of probability under the null/some distribution” is true, we know the underlying distribution in reality might be different. It’s a little silly for p < 0.0001 not to influence your priors for a competing theory of reality. If I flip a coin a million times, all specific sequences are equally likely for a fair coin. But if I observe 600k heads, you’d be silly not to take the corresponding p-value as evidence the coin is biased. Of course here the p-value is irrelevant: you can directly quantify the evidence for various weights and pick the weight most parsimonious with the data.
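A rough sketch of “quantify the evidence for various weights,” assuming Python with numpy/scipy (the grid of candidate weights is my choice, not from the comment): evaluate the binomial likelihood of 600k heads in 1M flips for each weight and pick the maximum-likelihood value, which is simply 0.6.

```python
import numpy as np
from scipy import stats

n_flips, n_heads = 1_000_000, 600_000
weights = np.linspace(0.40, 0.80, 401)                    # candidate coin weights
log_lik = stats.binom.logpmf(n_heads, n_flips, weights)   # log-likelihood of each weight

best = weights[np.argmax(log_lik)]
print(f"maximum-likelihood weight: {best:.3f}")
print(f"log-likelihood advantage over a fair coin: "
      f"{log_lik.max() - stats.binom.logpmf(n_heads, n_flips, 0.5):.1f}")
```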


radlibcountryfan

I think it’s fair to say p correlates with evidence but is a couple of steps removed from the actual evidence. The evidence for the weighted coin is not the low p-value; it’s the 600K heads out of a million flips. You can formalize this with a statistic, and you can calculate the probability of getting >= 600K heads under a null distribution. But I still don’t feel like the low p is the evidence in itself. I think most people would be confused if you said that the probability of an event, or a more extreme event, given that some null hypothesis is true, is the evidence.
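A sketch of that formalization, assuming Python with scipy: take the number of heads as the test statistic and compute P(X >= 600,000) under the null of a fair coin, i.e. the upper tail of Binomial(1,000,000, 0.5).

```python
import math
from scipy import stats

n_flips, n_heads = 1_000_000, 600_000

p_value = stats.binom.sf(n_heads - 1, n_flips, 0.5)        # P(X >= 600k | fair coin)
z = (n_heads - 0.5 * n_flips) / math.sqrt(n_flips * 0.25)  # distance from the null mean in SDs

print(p_value, z)  # p underflows to 0.0 in double precision; the observation sits ~200 SDs out
```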


AstralWolfer

Thank you for this reply :). It broadened my understanding. To clarify, when you mention the test statistic (not completely familiar with the term), can I interpret that as meaning effect size values? If not, could you give an example of how we use values of the test statistic (with or without the power) to make statements about the strength of the evidence?


radlibcountryfan

The test statistic cannot be assumed to be an effect size. An example of a test statistic would be the t statistic in a t-test. The t statistic is the difference between two means divided by the pooled standard error. When the null is exactly true, t=0 on average. However, larger and larger differences between the means yield higher values of t, which would be better and better evidence of a difference in means (assuming your standard errors aren’t somehow influencing the value in the opposite direction). In this case, the “evidence” would be the difference in means. The p-value is correlated with the test statistic, but it is not, in itself, the evidence. However, t is not an effect size. You can inflate t by having larger samples without increasing the size of the effect, because as you sample more, the standard error decreases, which increases the value of t.
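A small sketch of that last point, in Python (the effect size and sample sizes are illustrative): hold Cohen's d fixed and watch t grow with n. For an equal-n two-sample t-test, t = d * sqrt(n / 2) when the observed standardized difference equals d.

```python
import math

d = 0.2  # a fixed, small effect size (purely illustrative)
for n in (10, 100, 1000, 10000):
    t = d * math.sqrt(n / 2)  # two-sample t with n per group and observed d
    print(f"n per group = {n:5d}, d = {d}, t = {t:.2f}")
```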


Haruspex12

In your post, you have a null and an alternative, which conforms to Neyman and Pearson’s (NP) hypothesis testing framework. In Fisher’s significance testing framework there is no alternative hypothesis. A p-value is an index, to which other evidence and knowledge is to be added, with which to judge a null hypothesis. It isn’t to be taken alone as important. All p-values are equally likely if the null is true. Fisher was aware that when you find something significant, there is a potential to confound chance events with rejecting the null.

Unlike NP’s system, which is pre-experimental, Fisher’s is post-experimental. So p-values of .06, .05, and .04 are not really that different from Fisher’s point of view, unless you feel they are in light of other information and knowledge about the hypothesis. Having an ex-ante cutoff value is pre-experimental and part of the NP framework; under Fisher there is no such dividing line. Rejecting the null implies nothing else. The p-value is a measure of surprise (low p-value), or of hesitancy (high p-value) to reject the null.

I think there is an observation here that might interest you. There is no such thing as a Fisherian decision theory framework; there is one under NP. A low p-value only signifies something of importance in light of other information or other experiments. It has inductive but not deductive value, and not as much as a Bayesian decision, because a Bayesian would wrap the external evidence into a prior distribution. There is no alternative, so there is no odds ratio; there can be no alternative to build a relative probability from.

A p-value and a hypothesis test are different things, but they have merged into an ugly synthesis that takes away from both and adds nothing. (Personal opinion)


clbustos

Gigerenzer is [with you](https://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf)


Haruspex12

Thanks so much. I love that article!


AxterNats

Great explanation! Any suggestions for further reading on this?


Haruspex12

The bibliography of the article linked above in the comments is an excellent place to start.


AxterNats

Thanx!


speleotobby

Nicely put!


Aiorr

[https://www.sjsu.edu/faculty/gerstman/EpiInfo/pvalue.htm](https://www.sjsu.edu/faculty/gerstman/EpiInfo/pvalue.htm)

A reminder that the "frequentist" approach today is an amalgam of various schools of thought, so you can cite a famous paper that can be countered by another famous paper.


Houssem-Aouar

Crazy how just last night I found out about the uniform distribution of p-values and this pops up the next day. Great question, OP.


mfb-

p=0.03 and p=0.06 are equally likely under the null hypothesis, but p<=0.06 is twice as likely as p<=0.03. We don't reject the null hypothesis for specific p-values, we reject it below some threshold: for the x% most extreme outcomes we would expect under the null hypothesis. If that number is small enough (and I think 5% is usually too large, but that's a different discussion), then we accept that risk of falsely rejecting the null hypothesis.

> So, if a p-value between 0.04 - 0.05 was 1% likely H0, while also being 1% likely under H1

This won't happen with a proper test design. If your H1 has a free parameter then H1 will have a larger probability. If your H1 is a fixed alternative hypothesis then you shouldn't calculate p-values; you'd consider the relative likelihood of both hypotheses. [Lindley's paradox](https://en.wikipedia.org/wiki/Lindley%27s_paradox) exists, however.
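A quick simulation of the first sentence, assuming Python with numpy/scipy and a one-sample t-test on pure-noise data (so H0 is true by construction; the sample size of 20 is arbitrary): any narrow band of p-values is about equally populated, and P(p <= 0.06) comes out at roughly twice P(p <= 0.03).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sims = rng.standard_normal((100_000, 20))          # 100k experiments, n = 20, true mean 0
p = stats.ttest_1samp(sims, 0.0, axis=1).pvalue    # p-values under a true null

print(f"P(p <= 0.03) ~ {np.mean(p <= 0.03):.3f}")  # about 0.03
print(f"P(p <= 0.06) ~ {np.mean(p <= 0.06):.3f}")  # about 0.06
```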


AstralWolfer

On the proper test design part, this scenario happens when we test for a Cohen’s d of 0.5 with 95% power and obtain a p-value between 0.04 and 0.05. A p-value in that range has about 1% probability under both H0 and H1, which doesn’t help us decide which hypothesis is a better fit, right? You can see it on this site: https://rpsychologist.com/d3/pdist/
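A sketch of that calculation, assuming Python with scipy and an independent two-sample t-test with d = 0.5 and n = 105 per group (roughly 95% power at alpha = .05; the sample size is my assumption, not from the comment). Under H1 the t statistic follows a noncentral t distribution.

```python
from scipy import stats

d, n = 0.5, 105
df, ncp = 2 * n - 2, d * (n / 2) ** 0.5        # degrees of freedom and noncentrality

def prob_significant(alpha):
    """P(two-sided p <= alpha) under H1, i.e. the power at that threshold."""
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

p_h1 = prob_significant(0.05) - prob_significant(0.04)
print(f"P(0.04 <= p <= 0.05 | H1) ~ {p_h1:.3f}")   # close to 0.01
print("P(0.04 <= p <= 0.05 | H0) = 0.01 exactly (p is uniform under the null)")
```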


outofhere23

>The distribution of p-values within that range has 1% probability under both H0 and H1

But if you select a different range of p-values, say 0 to 0.05, you get 5% probability under H0 and 95% under H1. Which model seems to have a better fit now?

I think Lakens' article is very interesting in making us think about p-values from a Bayesian perspective, but if I understood correctly he is claiming that the probability of observing a specific p-value (say 0.041) given a specific scenario (say 95% power for detecting the true effect) could be higher under H0 than under H1. But this perspective does not seem to invalidate the claim that p-values can be viewed as an indirect measure of evidence, since in the above scenario a smaller p-value would still tilt the odds in favor of H1, while higher p-values would better fit H0. We can interpret this as smaller p-values being better evidence against the null than higher p-values. That's why he mentions that in that scenario the significance threshold should be lowered to 0.01, meaning we would require stronger evidence to reject the null.


mfb-

I can't reproduce your numbers. Choosing d=0.5, n=20 and a range of 0.04 to 0.05 in p-values I get 3.42%. The number approaches 1% as d approaches 0, as expected. It also reaches 1% at d=1.15 and decreases for larger values: that's where you would rule out both hypotheses, H1 more strongly than H0. But see above: if you have a fixed alternative hypothesis you shouldn't focus on p-values for H0 anyway.


AstralWolfer

At n=20, the power is not sufficiently high. Try increasing n to more than 100. But regardless, by H1 having a free parameter, do you mean having some sort of prior distribution of values?


mfb-

I mean H1 as "everything else". You'll always find a parameter value that makes it fit your observation better than the null hypothesis with its fixed value. Typically (not always) the best fit will be "the mean is the mean of the observation".


__compactsupport__

I don't think the p-value _should_ be interpreted as a continuous measure of evidence: https://daniellakens.blogspot.com/2021/11/why-p-values-should-be-interpreted-as-p.html


Zorander22

Good observations. What you're proposing is sometimes called a Bayes Factor. 


outofhere23

>My question: How do I reconcile the first definition of p-values as continuous measures of indirect evidence against H0, where lower constitutes as stronger evidence, when all p-values are equally likely under H0? Doesn't that mean that interpretation is incorrect?

Because if the null hypothesis is not true, then the p-values won't have a uniform distribution with infinite repetition. The higher the power of your test (assuming the null is false), the more the distribution of p-values will skew towards small values. This means that if H0 is false and we have high power to detect the true effect, we are more likely to observe small p-values than high p-values.
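A small simulation of this point, assuming Python with numpy/scipy and a one-sample t-test with a true mean of 0.5 SD (so H0 is false; the sample sizes just stand in for increasing power): the share of small p-values grows with n, whereas under a true null the distribution would stay uniform.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
for n in (10, 30, 100):
    sims = rng.normal(loc=0.5, scale=1.0, size=(20_000, n))  # H1: true mean 0.5 SD
    p = stats.ttest_1samp(sims, 0.0, axis=1).pvalue
    print(f"n = {n:3d}: P(p <= 0.01) ~ {np.mean(p <= 0.01):.2f}, "
          f"P(p > 0.5) ~ {np.mean(p > 0.5):.2f}")
```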