How significance tests are misused in climate science
Posted on 12 November 2010 by Maarten Ambaum
Guest post by Dr Maarten H. P. Ambaum from the Department of Meteorology, University of Reading, U.K.
Climate science relies heavily on statistics to test hypotheses. For example, we may want to ask whether the global mean temperature has really risen over the past ten years. A standard answer is to calculate a temperature trend from data and then ask whether this temperature trend is “significantly” upward; many scientists would then use a so-called significance test to answer this question. But it turns out that this is precisely the wrong thing to do.
This poor practice appears to be widespread. A new paper in the Journal of Climate reports that three quarters of papers in a randomly selected issue of the same journal used significance tests in this misleading way. It is fair to say, though, that most of the time, significance tests are only one part of the evidence provided.
The post by Alden Griffith on the 11th of August 2010 lucidly points to some of the problems with significance tests. Here we summarize the findings from the Journal of Climate paper, which explores how it is possible that significance tests are so widely misused and misrepresented in the mainstream climate science literature.
Not surprisingly, preprints of the paper have been picked up enthusiastically by those on the sceptic side of the climate change debate. We had better find out what is really happening here.
Consider a scientist who is interested in measuring some effect and who does an experiment in the lab. Now consider the following thought process that the scientist goes through:
1. My measurement stands out from the noise.
2. So my measurement is not likely to be caused by noise.
3. It is therefore unlikely that what I am seeing is noise.
4. The measurement is therefore positive evidence that there is really something happening.
5. This provides evidence for my theory.
To the surprise of most, the logical fallacy occurs between step 2 and step 3. Step 2 says that there is a low probability of finding our specific measurement if our system just produced noise. Step 3 says that there is a low probability that the system just produces noise. These sound the same but they are entirely different.
This can be compactly described using Bayesian statistics: Bayesian statistics relies heavily on conditional probabilities. We use notations such as p(M|N) to mean the probability that M is true if N is known to be true, that is, the probability of M, given N. Now say that M is the statement “I observe this effect” and N is the statement “My system just produces noise”. Step 2 in our thought experiment says that p(M|N) is low. Step 3 says that p(N|M) is low. As you can see, the conditionals are swapped; these probabilities are not the same. We call this the error of the transposed conditional.
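To make the distinction concrete, here is a minimal sketch in Python of Bayes' theorem, p(N|M) = p(M|N) p(N) / p(M). The function name posterior_null and all numerical values are illustrative assumptions, not taken from the paper; in particular p(M|not N), the probability of the observation if the effect is real, and the prior p(N) are simply made up.

```python
def posterior_null(p_m_given_n, p_m_given_not_n, prior_n):
    """Return p(N|M) from Bayes' theorem.

    p_m_given_n     : p(M|N), probability of the observation if the system is just noise
    p_m_given_not_n : p(M|not N), probability of the observation if the effect is real
    prior_n         : p(N), prior probability that the system is just noise
    """
    # Total probability of the observation M, summed over both hypotheses.
    p_m = p_m_given_n * prior_n + p_m_given_not_n * (1.0 - prior_n)
    return p_m_given_n * prior_n / p_m


# Illustrative numbers only: the same p(M|N) = 0.05 is used in every case.
for p_m_given_not_n, prior_n in [(0.8, 0.5), (0.8, 0.99), (0.05, 0.5)]:
    p_n_given_m = posterior_null(0.05, p_m_given_not_n, prior_n)
    print(f"p(M|not N)={p_m_given_not_n}, p(N)={prior_n} -> p(N|M)={p_n_given_m:.3f}")
```

With the same p(M|N) = 0.05, the resulting p(N|M) ranges from about 0.06 to about 0.86 in these examples, depending entirely on quantities that p(M|N) alone does not supply.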
How about a significance test? A significance test in fact returns a value of p(M|N), the so-called p-value. In this context N is called the “null-hypothesis”. It returns the probability of observing an outcome (M: we observe an upward trend in the temperature record) given that the null-hypothesis is true (N: in reality there is no upward trend, there are just natural variations).
The punchline is that we are not at all interested in this probability. We are interested in the probability p(N|M), the probability that the null hypothesis is true (N: there is no upward temperature trend, just natural variability) given that we observe a certain outcome (M: we observe some upward trend in the temperature record).
Climate sceptics want to argue that p(N|M) is high (“Whatever your data show me, I still think there is no real trend; probably this is all just natural variability”), while many climate scientists have tried to argue that p(N|M) is low (“Look at the data: it is very unlikely that this is just natural variability”). Note that low p(N|M) means that the logical opposite of the null-hypothesis (not N: there really is an upward temperature trend) is likely to be true.
Who is right? There are many independent reasons to believe that p(N|M) is low; standard physics, for example. However, many climate scientists have shot themselves in the foot by publishing low values of p(M|N) (in statistical parlance, low p(M|N) means a “statistically significant result”) and claiming that this is positive evidence that p(N|M) is low. Not so.
We can make some progress though. Bayes' theorem shows how the two probabilities are related. The aforementioned paper shows in detail how this works. It also shows how significance tests can be used; typically to debunk false hypotheses. These aspects may be the subject of a further post.
In the meantime, we need to live with the fact that “statistically significant” results are not necessarily in any relevant sense significant. This doesn't mean that those results are false or irrelevant. It just means that the significance test does not provide a way of quantifying the validity of some hypothesis.
So next time someone shows you a “statistically significant” result, do tell them: “I don't care how low your p-value is. Show me the physics and tell me the size of the effect. Then we can discuss whether your hypothesis makes sense.” Stop quibbling about meaningless statistical smoke and mirrors.
Reference: M. H. P. Ambaum, 2010: Significance tests in climate science. J. Climate, 23, 5927-5932. doi:10.1175/2010jcli3746.1

Then this claim below crossed my mind, just like Dr. Ambaum:
In the meantime, we need to live with the fact that “statistically significant” results are not necessarily in any relevant sense significant.
I think you could statistically correlate car sales and global warming, for instance, and it would mean nothing. It's the underlying physics AND the statistics that will give you the evidence - which is the case.
To illustrate your point:
Of course, some will argue that the recent surge in piracy is the cause of the "perceived flattening of the global temperature rise". Sigh.
In life and statistics, some will see only what they expect to see.
The Yooper
Daniel's chart 'proves' that global warming is caused by lack of pirates... but in the past several years piracy has been booming off the coast of Somalia! We should start seeing temperatures turn around now! :]
Oh, wait...this isn't WUWT?
Still good advice over a Century later!
eg. I think it was the Royal Navy that eliminated a lot of piracy.
But they had to chop a lot of trees down to do it, plus the age of ironclads and battleships (coal use) meant pirates needed to be more sophisticated with access to a better income stream to afford a steam boat with heavy guns.
Also piracy became a state sanctioned aim during the world wars with submarines, but the motive wasn't to steal produce. Although maybe that was the Nazis' big mistake. They should have stolen the convoys, rather than sinking them?
Also, paleoclimate data has shown (with statistical significance) that piracy changes lag temperature changes by several hundred years.
At the same time, as someone whose brain hurts whenever I think about probabilities in any sense beyond my chances of finally winning the lottery, I must admit that I find statistics and statisticians as annoying as piracy and pirates... perhaps even more so.
If only global warming had such a negative impact on statisticians! Alas, and alack, I fear that the opposite is the case. I'm far more cognizant of statisticians in this woefully warming world.
I also have no doubt that statisticians keep Bayesian eye patches in their desk drawers, to be worn in complete secrecy in the privacy of their lairs, while performing their heinous acts of statismancy and probabalism. The line between pirate and statistician is, I fear, as blurry as the line between p(M|N) and p(N|M).
Arg.
I'll have to look at this more carefully (I keep telling myself to learn the Bayesian approach, but I still haven't sat down and done it). I had thought that the main misuse of frequentist statistics was in post-hoc analyses of existing data from uncontrolled experiments. That was the other thing I thought you were getting at: that JOC authors were obtaining data, visualizing them, and then deciding to do frequentist tests (after conscious or sub-conscious pre-selection). That's obviously wrong, to me, and I know it happens in my field (biology). I didn't think planned application of frequentist stats in a controlled experimental design was problematic. Time to learn...
General discussion of broad categories of evidence for global warming should go in an appropriate thread, such as this or this.
Also, please note that in a series of visits over the past month, you've left at least five versions of the same comment about ice cores, in five different threads. Most of them have now been deleted or redirected here.
Please try to post your comments in the appropriate thread and then stick with them there, rather than spreading discussions across many different threads. This helps make the site more readable for everyone.
You are fraudulently hiding the incline of the M19CPP (Mid 19th Century Pirate Period) in your graph!
All you Pirate Change Alarmists cannot be trusted!
Dan
The Yo-ho-Yooper
Scientists use multiple criteria to evaluate theories.
See also Tamino's post at Open Mind, on The Power--and Perils--of Statistics.
When do Bayesian statistics matter? When the prior probability is extreme (very likely or very unlikely). So if the chance that a woman your age has breast cancer is 1 in 1000, and mammograms have a 1 in 100 false positive rate, and you had one done as part of a routine checkup and it came back positive, Bayesian statistics tells us that chances are you don't have cancer.
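As a rough check of that arithmetic, here is a short Python sketch. Beyond the two numbers above, it assumes that the mammogram detects every real cancer (a sensitivity of 1); that assumption and the function name are illustrative only.

```python
def p_cancer_given_positive(prevalence, false_positive_rate, sensitivity=1.0):
    """Posterior probability of cancer given a positive test, via Bayes' theorem.

    sensitivity=1.0 is an assumption; the example above does not state it.
    """
    # True positives and false positives as fractions of the whole population.
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


posterior = p_cancer_given_positive(prevalence=1 / 1000, false_positive_rate=1 / 100)
print(f"p(cancer | positive mammogram) = {posterior:.3f}")
```

Under these assumptions the posterior comes out at roughly 9%, so a positive result still leaves cancer unlikely.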
But when your prior probability is something medium, it isn't likely to affect the significance of the result. What's more... just how do you establish the prior probability? By counting planets where the climate sensitivity is above 2 degrees per doubling of CO2 and those where it's below? And if you're already pretty certain that you know what the answer is, what are you adding by doing the experiment? Let's say the existing body of evidence leads you to be 99% certain, and your experiment doesn't cause that figure to budge, do you now show using Bayesian statistics that combining your result with the prior gives you 99% confidence and, presto! a statistically significant publishable result! Of course not.
Another problem with this is that it's that prior (is Global Warming real?) which is precisely what we want to figure out, not the "real" posterior (is it really warming at the moment?). We want P(N), not P(N|M). Asking how to get P(N|M) from P(M|N) is getting a few steps ahead -- you also want to know P(M2|N) and P(M3|N) and P(M4|N) and all the other pieces of evidence before you do that calculation. And if someone else finds further evidence and publishes a paper showing P(M5|N), well now that Bayesian analysis you did in your paper to get P(N|M1..M4) is out of date. But that calculation of P(M4|N) stands, and will forever be useful as a piece of the evidence used to assess P(N).
Bayesian analysis provides a way of thinking about how to combine all the pieces of evidence to form your conclusion, but the proper role of research is to establish those individual pieces of evidence. Establish the symptoms if you will. One experiment is your family history, another the mammogram, another the biopsy. We don't calculate whether the mammogram is positive or negative by considering your family history, rather they stand as separate results which we then combine to make an inference. And in this analogy we can't perfectly do the Bayesian calculation because we don't really know what fraction of the population has cancer, except for what we infer through these tests. But you don't subject patients to tests that tell 1 in 5 healthy people they have cancer, and so likewise we demand statistical significance.
Tamino has posted a comment on this article that might be of interest to people: Tamino on Ambaum and stats
Thank you for your comments and links. I understand this issue better than I did before.
Erm, this paper suggests 75% of climate science papers use statistical significance in a "misleading" way. Does that mean you think these are all written by "deniers"?
Drop the rhetoric and stick to the science.
2. Daniel Bailey
"Entirely scientific graph:
Who worked out there were 17 pirates in 2000?
No, but it does mean that 75% of climate denier posts are misleading -- and that's significant.
So rather than bring up the subject, we should be asking why there is so much literature on the subject if it has little or no merit?
"This thread is narrowly focused on concepts related to assessing statistical significance. "
There is a high probability that only an off topic post by a skeptic will be flagged while more egregiously off topic posts about pirates will go unanswered.
Based on Ambaum's statement that 3/4 of the articles in a recent randomly picked issue of a prestigious climate publication contained this error, is it likely that the papers that the IPCC uses in its publications are tainted? Ambaum further stated that this number was up from an issue ten years earlier, where the error occurred only 1/2 the time.
I have seen what Ambaum alludes to in his paper, an increased use of computer programs to analyze data without understanding the underlying reasoning. You will typically see this on tests when asking students to take the sin(pi/3)/cos(pi/3)/sqrt(3). A calculator dependent student will more often than not get this wrong.
Temperature anomaly is a low signal to noise ratio quantity. I'd sure like to see a study of the proper use of statistics in deriving that quantity. In fact it seems like there was one in a past topic. Can't quite recall the name at the moment.
@muoncounter
"No, but it does mean that 75% of climate denier posts are misleading -- and that's significant."
Guess I'm not seeing the connection to "climate deniers". What is a "climate denier" anyway? Someone who denies that there is such a thing as climate? I wasn't aware that the Journal of Climate was an anti-anthropogenic global warming publication. After all they put out this, "Global Warming is Unequivocal: The Evidence from NOAA" 5/6/2010.
@TonyL
The Ambaum Article
The Pirate Chart was used to illustrate Alexandre's point, that just because things can be correlated doesn't mean that the correlation itself has any meaning.
Just because comments by skeptics get flagged for being off-topic doesn't mean comments by those who believe in climate science do not get flagged for being off-topic. Check out the Deleted Comments bin sometime. I've had comments land there before; I can also guarantee I'll end up there again sometime. Comments that are off-topic get deleted; fact of life here.
Tamino has some insights into the Ambaum piece here.
The Yooper
No it means 75% of all climate research is in part misleading. This is surely not a "denier"/fear-mongerer issue.
In fact given that many on this website believe almost all peer-reviewed literature is in support of AGW then this paper is a critique of the mainstream science, "deniers" should be left out of the discussion because this paper has not researched the space where the audience of this website believe "deniers" predominantly publish. Let's stay within the bounds of the published work.
(I'll drop the name calling when others do)
Lacks in rigor, not "is in part misleading".
I love the way that HR and others latch on to one paper critical of statistical analysis in science, and immediately cast aside all the supposed "skepticism" they show towards published work.
I imagine it's because HR and others believe this shows some gaping problem with climate science that undercuts the fundamental overwhelming scientific consensus that increasing CO2 will warm the planet somewhere between 1.5 and 4.5 C per doubling.
Classic example of confirmation bias. Based on essentially a sample size of 1 issue of 1 climate science publication, author Ambaum demonstrated at least one instance of misuse of significance testing in approximately three-fourths of the articles in the issue. No surveying of other publications in the field, no controls to other publications in other fields. Again, a sample size of 1.
Based on that, HumanityRules conflates that into Sad. There was a time when I thought you had something constructive to offer, HumanityRules. Now I find I can't take you seriously anymore as it seems you aren't even trying, preferring to serve up inflammatory distortions instead.
The Yooper
OK. Look here, where we discussed the gross generalization in a 'published work' stating that many climate scientists are computer illiterate. In this case, the gross generalization was "this paper suggests 75% of climate science papers use statistical significance in a "misleading" way".
My point was and remains: Broad generalizations like these include everyone in the affected class. That includes Watt$, Godd@rd, Mc&tyre and the like. If you want to stick with this nonsense, that requires that 75% of climate change denier posts are misleading.
Better to drop both the name-calling ('fear-mongering'? really?) and the gross generalizing. Then maybe we can have an intelligent conversation.
In the post you seemed to object to, I was referring to making claims based on statistical insignificance, as when many climate deniers misunderstood Phil Jones' remarks about warming since 1995 not being statistically significant as evidence that warming has stopped. A statistically insignificant warming trend isn't evidence either way. This is not the sort of error Dr. Ambaum is talking about. Are you aware of instances where climate scientists have made this error?
TOP,
Be sure you are not misinterpreting the author as saying climate scientists should be making weaker claims, or that they are publishing "statistically significant" results that, if tested the way the author thinks they should be, would be insignificant. Chances are climate scientists would use Bayesian statistics to show that they can make even stronger claims of confidence. For example, because the physics of climate lead you to believe it should be warming with high probability, you can combine this prior probability with your analysis of the temperature data to give an even stronger confidence in the existence of a warming trend than you would have otherwise. If Phil Jones had followed Dr. Ambaum's advice when calculating statistical significance, he would have said something far less useful to those trying to cast doubt on warming.
But I think he was right not to do it that way, as I mentioned above. And I should qualify that by saying I haven't read the paper, only this post, so maybe I don't understand what it is Dr. Ambaum thinks they should be doing when analyzing data.
Thank you all for your reactions to my post. I hope you don't mind if I stick my oar in on some of the topics you raised. If I have overlooked something, please let me know. Sorry for the somewhat rambling response here ...
Re post 1, and the pirate-global mean temperature correlation: Alexandre is of course right to say that we need statistics and physics to make any progress. What I am highlighting, though, is not that specific issue (which is serious and important in itself). I am highlighting that significance tests are used to give certain statistical results higher "credibility" than others, based on a largely spurious test. So it is the selection of statistical results that I am objecting to, not the statistical results per se.
Some posts (specifically Steve L) refer to the frequentist vs Bayesian discussion. This is interesting in itself, but in my paper I am simply applying Bayes' equation, which a frequentist would also accept as indisputable. The difference comes in the interpretation of the meaning of these probabilities. Indeed, significance tests have a clear frequentist flavour, while hypothesis tests have a much more Bayesian flavour. I think it is hard to escape that scientific hypotheses naturally fit a Bayesian framework. Nonetheless, I think the distinction between Bayesian and frequentist interpretations is largely irrelevant to the discussion at hand.
Several posts point out that scientists should know about this and also that climate science should not be singled out. Indeed, in my paper I point to more general references which highlight the misuse of significance tests in a wide spectrum of fields (medicine, economics, sociology, psychology, biology, ...). In fact, I suspect that your average research psychologist knows more about the pitfalls of significance tests than the climate scientist. In those "softer" fields, people have had to rely mainly on statistics from the start and therefore needed to know how to use statistics from day one. In those fields, many people have pointed this problem out (and it still seems to persist).
Climate science has always been a subfield of physics, where significance tests are largely irrelevant. I bet that most physicists (by training, I am a theoretical physicist myself) didn't get a stats course in their curriculum! However, these days more and more geographical thinking seems to enter the field of climate science with the resulting lack of rigour and physical underpinning. Many climate scientists have become geographers of their model worlds!
Also, the point I am making is not new: many people are aware of the problems with significance tests, and many people have pointed it out before (although most practitioners probably believe that climate scientists would know better). It boggles the mind that the error keeps on being propagated - surely an interesting question for a psychologist or sociologist to get their teeth into. I do have an opinion about why this may be, but that would make this post even longer.
Regarding the somewhat rambling posts about 75% of papers being misleading in part: I claim that 75% of papers (in my own paper I clearly state that this is based on just 1 (one) sample and make no claim regarding its statistical significance!) make a technical misuse of significance tests: they use them to select or highlight certain statistical results in favour of others.
Perhaps I should write a post where I discuss what significance tests can be used for (largely for debunking fake hypotheses, but even this is an application with its own pitfalls). However, this is generally not how significance tests are presented in the literature. The latter of course follows from the fact that very few scientists would publish negative results (in fact, they would probably have a hard time getting them past the reviewers).
Some people, including John Cook himself, pointed me to a post by Tamino. Tamino also highlights some further points from my original paper. Let me just add two little comments to Tamino's interesting post: Tamino states that "I’ve certainly struggled to emphasize to colleagues that a highly significant statistical result does not prove that one’s hypothesis is true, it merely negates the null hypothesis." This is again the error of the transposed conditional: a low p-value does not negate the null-hypothesis, it just indicates that our statistical result would be unlikely in case the null-hypothesis were true. It is remarkable how easily we can stray into this error. Tamino also seems to indicate that the p-value does provide useful quantitative information. I cannot find any evidence in his post of this. Yes, the p-value is quantitative, but its usefulness is never really made clear. The p-value is perhaps an indication of the signal-to-noise ratio; a high p-value means that it will be difficult to see any evidence of any claimed effect. A low p-value indicates very little really: we want to study the validity of some hypothesis assuming it is false; some attempt at a reductio ad absurdum proof of your hypothesis - unfortunately it is not quite that ...
I strongly disagree that scientists should not bother with Bayesian statistics, especially in the case of statistical significance tests. There is rather more to Bayesianism than Bayes' rule (which is a fundamental law of probability whether Bayesian or frequentist); the very definition of what a probability actually is, is an argument in favour of the Bayesian framework in this case. The problem with the frequentist approach to statistical significance tests is that it fundamentally cannot assign a probability to the truth of a hypothesis, because a hypothesis is either true or it isn't; its truth is not a random variable and has no long run frequency (the frequentist definition of a probability). Unfortunately the probability of the alternative hypothesis being true is exactly what we want to know! Fortunately the Bayesian definition of probability is based on the state of knowledge regarding the truth of a proposition, so the Bayesian framework can directly assign a probability to the truth of a hypothesis. Generally in science it is best to carefully formulate the question you want to ask, and then choose a method that is capable of giving a direct answer to that question. As such the Bayesian approach is perfectly respectable, if not preferable. The frequentist approach can only give an indirect answer, telling you the likelihood of the observations assuming the null hypothesis is true, and leaving it up to you to decide what to conclude from that. Most of the problems with frequentist statistical tests lie in mistaking the indirect answer to the key question for a direct (Bayesian) one.
The Bayesian approach is more than a means of aggregating evidence; one of the most important benefits of the Bayesian approach is that it gives a mechanism to properly incorporate the fact that you know you don't know something, by assigning a non- or minimally-informative prior on it and marginalising it out of the analysis. For instance, if you want to model the impacts of climate change, it is incorrect to assume we know the exact value of climate sensitivity (for instance by picking the maximum likelihood value); instead we should integrate it out by computing an average of the impacts for each value of climate sensitivity weighted by its plausibility according to what we do know.
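A toy sketch in Python of what "integrating it out" means in practice. Everything here is hypothetical: the discrete set of sensitivity values, their weights and the impact function are invented purely for illustration and come from no study.

```python
# Hypothetical discrete belief about climate sensitivity (deg C per CO2 doubling).
# Values and weights are invented for illustration only.
sensitivity_weights = {1.5: 0.1, 2.0: 0.2, 3.0: 0.4, 4.5: 0.2, 6.0: 0.1}


def impact(sensitivity):
    """Hypothetical convex impact function (arbitrary units)."""
    return sensitivity ** 2


# Plug-in estimate: evaluate the impact at the single most plausible value.
most_plausible = max(sensitivity_weights, key=sensitivity_weights.get)
plug_in_impact = impact(most_plausible)

# Marginalised estimate: average the impact over all values, weighted by plausibility.
total_weight = sum(sensitivity_weights.values())
marginal_impact = (
    sum(w * impact(s) for s, w in sensitivity_weights.items()) / total_weight
)

print(f"plug-in impact:      {plug_in_impact:.3f}")
print(f"marginalised impact: {marginal_impact:.3f}")
```

With these invented numbers the marginalised impact (about 12.3) exceeds the plug-in value (9), because the impact function is convex and the high-sensitivity tail carries weight that the single best guess ignores.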
"When do Bayesian statistics matter? When the prior probability is extreme (very likely or very unlikely). So if the chance of a woman your age has breast cancer is 1 in 1000, and mammograms have a 1 in 100 false positive rate, and you had one done as part of a routine checkup and it came back positive, Bayesian statistics tells us that chances are you don't have cancer."
In this case, the Bayesian result exactly coincides with that from the frequentist approach. The only difference is that the Bayesian approach allows you to formulate the question in terms of an individual patient, rather than a randomly selected member of some population with the same test results.
"Bayesian statistics that combining your result with the prior gives you 99% confidence and, presto! a statistically significant publishable result! Of course not."
Indeed not! Bayesian conclusions are only as strong as the priors used, if you could show the priors were unreasonable then you could reject the result of the test (and the paper). If you can't question the prior, you are logically forced to accept the result of the test. The good thing about the Bayesian approach is that the priors are explicitly stated. If you disagree with the use of priors on the hypothesis, you could always use a "significance test" based on Bayes factors instead, where the priors (on the hypotheses) do not appear in the analysis.
"And if someone else finds further evidence and publishes a paper showing P(M5|N), well now that Bayesian analysis you did in your paper to get P(N|M1..M4) is out of date."
That is equally true of any frequentist analysis - if your information changes, your view on the truth of the hypothesis should also change, whatever form of analysis you choose.
"But that calculation of P(M4|N) stands, and will forever be useful as a piece of the evidence used to assess P(N)."
That is only correct if M4 is independent of M1-M3 & M5 (otherwise it is the so-called Naive Bayes approach), which in the case of climate change is rather unlikely as rising levels of atmospheric carbon dioxide are posited to be a causal factor for a great many phenomena.
"And in this analogy we can't perfectly do the Bayesian calculation because we don't really know what fraction of the population has cancer"
This is incorrect, the whole point of the Bayesian formulation is that it allows you to deal rationally with the fact that you don't know something, or that you have imperfect knowledge of something. You choose a prior distribution that captures what you do and don't know about it and marginalise. The perfect Bayesian calculation reflects the consequences of that uncertainty.
", except for what we infer through these tests."
This is incorrect, the operational priors are estimated from epidemiological studies, not just from diagnostic tests followed by biopsies.
"But you don't subject patients to tests that tell 1 in 5 healthy people they have cancer"
Neither a competent Bayesian nor a competent frequentist statistician would do so.
Eric L. @19: I agree there, however given sufficient data it is similarly virtually always possible to get a statistically significant result even if the effect size is negligible, which is the flip side to the same coin. A common criticism of frequentist statistical tests is that we almost always know from prior knowledge that the null hypothesis is false from the outset. For instance with temperature trends, do we really think the trend is actually exactly zero?
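A small Python sketch of that flip side, under idealised assumptions that are not from the post: a two-sided z-test with known noise standard deviation, independent samples, and a sample mean that happens to equal a fixed, tiny true effect. The function name and the numbers are illustrative.

```python
import math


def two_sided_p_value(effect, sigma, n):
    """Two-sided z-test p-value under a null hypothesis of zero mean.

    Assumes the observed sample mean equals `effect`, the noise standard
    deviation `sigma` is known, and the `n` samples are independent.
    """
    z = abs(effect) * math.sqrt(n) / sigma
    # For a standard normal Z, P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# The same negligible effect (1% of the noise level), with more and more data.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(effect=0.01, sigma=1.0, n=n))
```

In this idealised case the p-value falls from about 0.92 at n = 100 to about 0.32 at n = 10,000 and to far below any conventional threshold at n = 1,000,000, although the effect size never changes.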
Anyway, the difference between the two frameworks is a fascinating topic in its own right; you need a really solid understanding of both frameworks to know which tool to use for which job.
But may I just add that significance tests are perhaps not as innocent as he makes them out to be. Indeed, they are usually only a small part of the evidence, but I have been involved in discussions where an important part of the argument was whether a certain link, as measured by linear correlation, was "significant" (in the statistical meaning). This was very much an instance of explorative data analysis, where some link was posited, with only tenuous indications this link should be there, and where significance tests were an important part of the argument. Interestingly, that claimed link has now become part of mainstream climate literature (I am referring to "annular modes" which appear to indicate a connection between Atlantic and Pacific pressure patterns) and a large number of people have by now stopped worrying whether this implied link is really present. This is a feature of significance tests in general: perhaps many people do not mean to say that a low p-value is evidence for their hypothesis, but publishing the low p-value along with phrases such as "this or that effect is significant at the 95% level" certainly seems to imply that they want to use these statistics as positive evidence at face value.
The two phrases we should use would be something along the lines of "we can reject the null hypothesis" or "we are unable to reject the null hypothesis" - the frequentist test doesn't really give a basis to make any statement about the alternative hypothesis (note the alternative hypothesis doesn't actually appear in the frequentist test - so perhaps that isn't surprising!).
Perhaps I misunderstand you and Tamino, but a low p-value cannot objectively be used to reject a null-hypothesis; it simply does not contain the required information to do so. I formalize this in my paper, if you would like to know more.
On the other hand, a high p-value indicates that the presented evidence is easily consistent with the null hypothesis. This is not evidence that the null-hypothesis is true; the evidence could also be consistent with the alternative hypothesis. A significance test simply contains no information either way. Using Occam's razor we can then conclude that there is no evidence for our hypothesis, so we better stick with the null-hypothesis. It is Occam's razor that makes the argument here, not the significance test.
Maarten
Thank you for taking the time to shed some light on your paper. It's appreciated.
The Yooper
In short - I agree!
BTW, the p-value fallacy doesn't just appear in science, I have seen this error made in statistical methodology papers I have reviewed. It certainly isn't limited to climatology!
We are discussing n=1 papers here but I accept your criticism, I over-stated the point. We're all capable of mistakes, as Maarten's work suggests.
Maarten(n=1) I don't suppose you want to comment on the extent to which you agree or disagree with the statements contained in this link?
We all have bad days; I certainly still have my share. :)
Of course, just how bad is a matter of degree; how often, a matter of conjecture (speaking of my bad days, no-one else's). ;)
The Yooper
Perhaps I just need to see an example of Bayesian significance testing done right to understand the way you and Dr Ambaum think this should be done.
"one of the most important benefits of the Bayesian approach is that it gives mechanism to properly incorporate the fact that you know you don't know something, by assigning a non- or minimally-informative prior on it and marginalising it out of the analysis"
Does a Bayesian analysis with a minimally informative prior often lead to a different result than a frequentist approach?
I must confess that my knowledge of Bayesian statistics comes entirely from studying data mining/machine learning, so there may be a side to this I'm missing from not having studied more stats. In that class one thing we were taught is that if you don't really know the prior the most common thing to do is assume it's 50/50. Is that the sort of thing you mean by minimally informative prior?
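As a hedged illustration of the question above (it is not taken from the thread or from the paper), consider estimating a proportion. With a flat Beta(1, 1) prior, one common choice of minimally informative prior, the posterior mean is (k + 1) / (n + 2), while the usual frequentist estimate is k / n. The function names are made up for this sketch.

```python
def posterior_mean_uniform_prior(successes, trials):
    """Posterior mean of a proportion under a flat Beta(1, 1) prior."""
    return (successes + 1) / (trials + 2)


def frequentist_estimate(successes, trials):
    """Plain sample proportion (the maximum-likelihood estimate)."""
    return successes / trials


if __name__ == "__main__":
    # Same observed proportion (70%), small and large samples.
    for k, n in [(7, 10), (700, 1000)]:
        print(n, frequentist_estimate(k, n), posterior_mean_uniform_prior(k, n))
```

In this toy case the two estimates differ noticeably for 10 observations and hardly at all for 1000, which is the usual pattern when the prior carries little information. Note that a flat prior on a parameter is not the same thing as assigning 50/50 to a yes/no hypothesis.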
"Bayesian conclusions are only as strong as the priors used, if you could show the priors were unreasonable then you could reject the result of the test (and the paper). If you can't question the prior, you are logically forced to accept the result of the test."
It still seems to me to be a question of what the point of the work you're doing is and what you can add to the body of knowledge. Let's assume I am an expert in dendrochronology, and I core a few trees in my backyard. Now I need to calculate a prior probability for observing warming in that data set. One way I might do that is by looking at the evidence from atmospheric physics and other areas outside my expertise and decide how likely this should be, but why would I be the one to do this when that really isn't my field and I'm likely to screw it up, I just know all there is to know about tree rings? Or are you suggesting I use a non-informative prior? Let's say I did the full analysis and found that with 99% confidence given changes in various forcings and our range of sensitivity estimates the data should show an upward trend of .15 degree/decade or more. And then I did some calculations on my little data set that any frequentist would sneer at and calculated a posterior probability of 99% for my hypothesis. Have I used my knowledge as a dendrochronologist to contribute anything to the state of our knowledge about climate? My result comes from my prior calculation, the part of my work I'm least qualified to do, meanwhile the actual data I've collected is superfluous (and I should have collected more of it, as a frequentist statistical significance test would have told me).
I do think a Bayesian analysis by someone who was an expert in such things that combined various lines of evidence from many subfields of climate research and tried to establish probabilities for various climate related hypotheses would be an interesting work, but it's not reasonable or useful to expect every researcher to do this in the process of establishing their result, and indeed Dr. Ambaum's research shows pretty conclusively that most would not be competent to do it. If on the other hand you want most scientists to replace frequentist significance tests with Bayesian tests with non-informative priors to show they've learned at least that much about stats and know what their confidence values mean, I guess I'm okay with that, but I doubt it would change anyone's results much beyond changing the confidence values by a small amount.
I do think scientists should not put their confidence values front and center as if they are the results, better to focus on estimating the magnitudes of effects, but do some kind of confidence calculation just to keep yourself honest and make yourself less likely to publish garbage. But if you think that's the main value that comes from confidence calculations in science (and I do) rather than determining whether we should be 96% certain or 99.3%, then a frequentist approach will generally work okay and if the result is your paper leads people to believe that climate sensitivity is 3.2 when you really do have good reason to believe it is 3.2, then your paper isn't particularly misleading just because there may be a better way you could have done your confidence calculation.
In my opinion you are making too much of the frequentist vs Bayesian discussion. I think it is not that central to whether you think significance tests are useful or not. Also a frequentist would agree with the statement that the p-value does not contain enough information to calculate the probability of the truth of a hypothesis, or the null hypothesis (such statements can be perfectly well framed in frequentist terms).
Regarding the dendrochronologist, this is an example that is very interesting. Equation 6 in my paper states how to view this. It is simply Bayes equation written in terms of prior and posterior odds:
posterior odds = prior odds × p(M | not N) / p(M | N)
where I used the notation as in the post above (note that p(M|N) is the p-value). So whether your confidence in the global warming hypothesis has been increased by your tree work depends on whether the p-value is smaller than the probability of seeing your measurement in situations where we know there is global warming. This statement is independent of the prior odds; the actual posterior odds of course do depend on the prior odds. In other words, every single measurement increases our knowledge (changes our confidence in a hypothesis) in the same way; this is independent of whether you were a "believer" or not to start with.
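A minimal sketch of the odds form of Bayes' equation quoted above, with made-up numbers. The function names and all input values are assumptions for illustration; in particular p(M | not N) has to come from somewhere other than the significance test.

```python
def posterior_odds(prior_odds, p_m_given_not_n, p_m_given_n):
    """Odds form of Bayes' equation: odds of 'not noise' after seeing M.

    p_m_given_n is the p-value; p_m_given_not_n is the probability of the
    measurement when the effect is known to be present.
    """
    if p_m_given_n <= 0:
        raise ValueError("p(M|N) must be positive")
    return prior_odds * p_m_given_not_n / p_m_given_n


def odds_to_probability(odds):
    """Convert odds in favour of a hypothesis to a probability."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    p_value = 0.05  # hypothetical p(M|N)
    # Sceptic and believer start from different prior odds.
    for prior in (0.01, 1.0):
        # Measurement is likely under the effect: odds go up tenfold.
        print(prior, posterior_odds(prior, 0.5, p_value))
        # Measurement is even less likely under the effect: odds go down.
        print(prior, posterior_odds(prior, 0.02, p_value))
```

The same low p-value raises the odds in one case and lowers them in the other, and the multiplying factor is the same for sceptic and believer alike; only the resulting posterior odds differ.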
This discussion is getting quite long now. I will probably write another post with some of this stuff in sometime soon where I can also comment on the suggestion by HumanityRules. I think John Cook agreed that I could send in another guest post about this subject anyway.
Best wishes to all and thank you very much for your interest in this post and for an interesting discussion,
Maarten Ambaum
The cloud ionisation work with Harrison caught my eye. The most recent article on SkepticalScience about GCRs seems to have neglected the insights from Harrison's work.
(Apologies for going OT)
It's either true or false. Of course it is entirely possible we are ignorant about its truth value; in that case one should say I do not know (a perfectly legitimate scientific stance), but it surely has a truth value, even if no one was able to determine it so far (provided of course the hypothesis makes sense in the first place).
The Bayesian method you describe could only serve as a heuristic device, but only if we had a clear (quantifiable!) picture of prior probabilities regarding our own ignorance. That's almost never the case. If we knew how ignorant we were (having a reliable structural model of our own ignorance), most of the job required to overcome this ignorance would already be completed. However, when heuristics is most needed, we are at the edge of utter darkness, just feeling our way around, not even equipped to make educated guesses about Bayesian priors of our own state of mind regarding the subject matter. In cases like that almost any fractional understanding is better than fake formal methods to arrive at a reasonable conclusion regarding the way forward.
It may be different for decision makers (like politicians or business people) who rely on expert advice in certain matters, but are not equipped to actually understand and evaluate the detailed reasoning behind those expert opinions (they only digest the executive summary, anyway). They may well wonder how likely it is the experts have got it right, and in complicated cases it makes perfect sense for them to seek a quantified description of uncertainty. To ask an independent group of experts to give an estimate of prior probabilities and build a Bayesian model to evaluate reliability of expert propositions may be a way forward. However, in practice extra rounds like that are seldom better than honest expert meta-opinion, expressed in plain language.
There is a more restricted domain where statistics can (and do) come into play in natural sciences. That's measurement laden with noise.
However, in this case there is no room for theoretical ambiguity. We should know pretty much everything about how the signal we are looking for is supposed to look, along with the statistical properties of the noise behind which it is hiding. This knowledge should take the form of a bunch of true propositions about the phenomenon under scrutiny, none of which has a dubious truth value expressible in a probabilistic form.
If this knowledge is given, we should be able to build an adequate statistical model which enables us to recover the signal from noise as much as possible.
Of course the first thing to do is not to rely on statistical speculations, but to improve the signal to noise ratio of measurement whenever it is practicable. Unfortunately in climate studies most of the noise is not from the measurement procedure itself, but it is weather noise, that is, an inherent property of the system itself. There is no way to get rid of it during the measurement phase.
Weather is an open thermodynamic system, and as such it works on the edge of chaos, in other words it is always in critical state (by way of SOC - Self Organized Criticality). Systems like this are characterized by system variables with pink noise characteristics (the noise has random phase and the same power in each octave).
Pink noise is scale invariant with no lower cutoff frequency, therefore system variables like this do not make a natural distinction between weather and climate, no matter how long the averaging window is (how low the upper cutoff). Pink noise is never stationary, it has an arbitrarily long autocorrelation scale.
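To make "random phase and the same power in each octave" concrete, here is a rough sketch that builds a series from sinusoids with random phases and amplitude proportional to 1/sqrt(f). The function names are invented for this example, and a finite series necessarily has a lowest frequency of one cycle per record, so it only approximates the idealised noise described above.

```python
import math
import random


def pink_noise(n, seed=0):
    """Sum of harmonics with random phase and amplitude 1/sqrt(k)."""
    rng = random.Random(seed)
    harmonics = range(1, n // 2)
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in harmonics]
    return [
        sum(math.cos(2.0 * math.pi * k * t / n + ph) / math.sqrt(k)
            for k, ph in zip(harmonics, phases))
        for t in range(n)
    ]


def octave_power(k_low):
    """Mean power in harmonics k_low .. 2*k_low - 1.

    A cosine of amplitude a has mean power a**2 / 2, so harmonic k
    contributes 1 / (2k).
    """
    return sum(0.5 / k for k in range(k_low, 2 * k_low))


if __name__ == "__main__":
    series = pink_noise(256)
    print(len(series))
    # Power per octave stays close to ln(2)/2 however low or high the band.
    for k_low in (8, 16, 32, 64):
        print(k_low, octave_power(k_low))
```

Because every octave carries about the same power, the slowest variations are as energetic as the fastest ones, which is the property the comment relies on.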
This is why it is a bit tricky to look for trend (as signal) in a climate variable laden with weather noise. A simple model of a linear trend plus some stationary noise would surely not do (even if mainstream climate science is almost always guilty of using such simplistic models).
Pink noise can have spontaneous excursions on all scales, including extremely low frequency ones (well in the supposed climate range of 30+ years).
You say "A standard answer [to the question if temperatures are rising or not] is to calculate a temperature trend from data and then ask whether this temperature trend is “significantly” upward; many scientists would then use a so-called significance test to answer this question. But it turns out that this is precisely the wrong thing to do."
Yes, but it is wrong not just because the result of an otherwise correctly applied significance test is misused; in most cases people also apply the wrong significance test (one that fails to take into account the very long autocorrelation timescale).
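The point about autocorrelation can be checked with a small simulation. This is only a sketch under stated assumptions: it uses trend-free AR(1) noise as a simple stand-in for long-memory noise (it is not pink noise), an ordinary least-squares trend, and the textbook t-test that assumes independent residuals. The function names and the choices phi = 0.9, n = 100 and a two-sided 5% critical value of about 1.98 are all illustrative.

```python
import math
import random


def ar1_series(n, phi, rng):
    """Trend-free AR(1) noise: x[t] = phi * x[t-1] + white noise, |phi| < 1."""
    x = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - phi * phi))  # stationary start
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def naive_trend_t(y):
    """t-statistic of the OLS slope, assuming independent residuals."""
    n = len(y)
    t_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    rss = sum((v - intercept - slope * t) ** 2 for t, v in enumerate(y))
    return slope / math.sqrt(rss / (n - 2) / sxx)


def false_alarm_rate(phi, n=100, trials=1000, t_crit=1.984, seed=0):
    """Fraction of trend-free series whose trend is declared 'significant'."""
    rng = random.Random(seed)
    hits = sum(abs(naive_trend_t(ar1_series(n, phi, rng))) > t_crit
               for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print("white noise:", false_alarm_rate(0.0))
    print("AR(1), phi=0.9:", false_alarm_rate(0.9))
```

For white noise the naive test raises a false alarm in roughly one case in twenty, as designed. For the autocorrelated series the rate is several times higher, even though no series contains a trend.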
The above statements on weather (or climate) noise, critical state, self-organized criticality, pink noise, etc. are simply true statements with no further qualification whatsoever. It is not likely they are true, not even 100% sure, they are simply adequate descriptions of certain aspects of the behavior of open thermodynamic systems with many degrees of freedom.
Still, they are entirely missing from IPCC reports, prepared by experts for decision makers. Phrases like "pink noise" (or "1/f noise") are not even mentioned under http://ipcc.ch. Funny.
"Also a frequentist would agree with the statement that the p-value does not contain enough information to calculate the probability of the truth of a hypothesis, or the null hypothesis (such statements can be perfectly well framed in frequentist terms)."
The first part is certainly true, however the second is not; the frequentist framework does not allow probabilistic statements to be made concerning particular hypotheses. Frequentist statistics can assign probabilities to the occurrence of errors in repeated application of statistical tests, but that is not the same thing (I checked this with my vastly experienced frequentist colleague and he concurs).
If it were true, frequentists could construct a credible interval, rather than a confidence interval, by considering the hypothesis that the true value of a statistic lies within a particular interval. But as far as I know, frequentists cannot construct a credible interval - however I'd be very interested to hear otherwise.
(1) You state that weather is in a state of Self Organized Criticality - SOC. I have been unable to find any references that indicate this; do you have a paper to link to on this subject? A statistical analysis of unforced noise in the climate? While water vapor, ice, and condensation are critical point transitions, weather doesn't seem to display the same behavior as a whole.
In particular, a pink noise 1/f relationship would indicate the largest variations on low frequencies, where what we observe (glacial cycles, for example) is a fairly direct tracking of climate variables (temperature, ice cover, etc.) to historic forcings.
(2) The universe is what it is - that's the final arbiter of our theories. However, our knowledge is imperfect, and our hypotheses are probabilistic, as per the first definition of probability. We can only state that a particular hypothesis is more probable than others given the evidence, the statistics of our data. And whether using Bayesian or frequentist methods, we can estimate from the statistics the probability (second definition) that our hypothesis is supported by that data. That's how induction works, and how we can learn something new.
We can be pretty sure, but we can only work with the evidence we have - we don't have perfect knowledge of anything.
At a certain point we become certain enough to label a particular hypothesis a fact. Gravity, evolution and, it appears, climate change fall into that category as well. But even the strongest "fact" is supported by our inductive conclusion that the laws of physics are consistent over space and time, and won't change on us - incredibly well supported, but the rules could change tomorrow. Crystalline proofs of the type you describe would be nice, but they don't exist.