The difference is significant

If you have had reports from us before, you may have seen tables which bear some coded markings, like this:

* p < 0.05, ** p < 0.01, *** p < 0.001 (Chi-squared)

Asterisks show where the differences seen between figures are statistically significant, that is, unlikely to be due to chance.  Just how unlikely is shown by the number of asterisks, where *** says that fewer than one in a thousand similar studies would yield a result as different as this by chance alone.
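For readers who like to see the rule spelled out, the key above can be sketched in a few lines of Python. The function name significance_stars is my own invention for illustration, not something taken from our reporting software.

```python
def significance_stars(p):
    """Return the asterisk code for a p-value, following the key above."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""  # not significant at the 0.05 level


# A few made-up p-values, purely to show the coding in action.
for p in (0.2, 0.03, 0.004, 0.0002):
    print(f"p = {p}: {significance_stars(p) or 'not significant'}")
```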

Schools and colleagues in local authorities are a bit more used to seeing these runes now than they were when I first started in research, but just in case that was a bit fast:

Imagine you're watching a friend toss a coin.  The first time, it turns up heads.  The second is also heads.  And the third.  And the fourth...  fifth... sixth... seventh!? So just when do you say, "Hold on a minute, let me look at that, that's not a normal coin!"?

Well, you know that getting a sequence of heads from a normal coin is not impossible, but there comes a point where a sequence seems too unlikely for you to carry on assuming it's normal coin behaviour.  Significance testing is the approach which calculates whether a given result (six heads in a row) is so different from the result we would expect to see (50:50) from a normal two-sided coin behaving randomly that it is too unlikely (less than one chance in twenty) to put down to chance.  Our null hypothesis is that nothing unusual is going on.
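If you would like to check the coin arithmetic for yourself, here is a minimal Python sketch. It assumes a fair coin and independent tosses; the function names and the choice between a one-sided question (a run of heads) and a two-sided one (a run of either face) are mine, added for illustration.

```python
ALPHA = 0.05  # the conventional "one chance in twenty" cut-off


def prob_of_run(n_tosses, two_sided=False):
    """Probability that a fair coin gives n_tosses heads in a row.

    With two_sided=True, count a run of all heads OR all tails.
    """
    p = 0.5 ** n_tosses
    return 2 * p if two_sided else p


def first_suspicious_run(alpha=ALPHA, two_sided=False):
    """Smallest run length whose probability drops below alpha."""
    n = 1
    while prob_of_run(n, two_sided) >= alpha:
        n += 1
    return n


for n in range(1, 8):
    print(f"{n} in a row: heads only {prob_of_run(n):.4f}, "
          f"either face {prob_of_run(n, two_sided=True):.4f}")
```

Six heads in a row has a probability of 1 in 64, which is comfortably below one in twenty whichever way you ask the question.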

A variety of tests may have been used in our reports, including chi-squared (χ²), Student's t and analysis of variance (ANOVA). Chi-squared is a common test which is well-suited to the cross-tabular results that our questionnaires produce.
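To give a flavour of what a chi-squared test does with a cross-tabulation, here is a rough Python sketch for the simplest case, a 2 x 2 table. The counts are invented for illustration and the function name is hypothetical. It computes Pearson's chi-squared without a continuity correction, and the shortcut used for the p-value only holds for one degree of freedom, so please treat it as a teaching aid rather than the routine used in our reports.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2 x 2 table.

    table is ((a, b), (c, d)): two groups (rows) by two answers (columns).
    No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if group and answer were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, the upper-tail probability of chi-squared
    # equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Invented counts: rows are boys and girls, columns are "yes" and "no".
invented_table = ((60, 40), (45, 55))
statistic, p_value = chi_squared_2x2(invented_table)
print(f"chi-squared = {statistic:.2f}, p = {p_value:.3f}")
```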

We have been changing the way we do these tests in some cases.  I was pulled up recently at a meeting for placing in a document only the cryptic:

"Bold type indicates a significant difference."

Hang on, how significant, and using which test?

Well, when comparing two groups from a study (say, boys with girls) there are usually hundreds of items to compare from a long questionnaire. If we take the conventional cut-off probability (alpha) of 0.05, we would expect several items to show differences large enough to reach this criterion just by chance. You may get a clue about 'false positives' because they may be inconsistent with other findings, but a statistical approach can be used to reduce how often we see them.
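The scale of the problem is easy to work out. The sketch below assumes, for the sake of argument, a questionnaire with 200 items, none of which truly differs between the groups; the figure of 200 and the function names are my own. The second calculation also assumes the tests are independent, which survey items often are not.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Average number of tests that reach significance by chance alone."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more false positives, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


n_items = 200  # hypothetical number of questionnaire items
print(f"Expected false positives: {expected_false_positives(n_items):.0f}")
print(f"Chance of at least one: {prob_at_least_one_false_positive(n_items):.5f}")
```

So with 200 comparisons we would expect about ten to be flagged even if boys and girls did not differ at all.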

Until recently, it was usual to seek to control the Familywise Error Rate (FWER), that is, the probability of wrongly rejecting at least one true null hypothesis (the assumption that results are not different) anywhere in the whole set of tests. However, this approach is generally too conservative, so that we will accept many false null hypotheses, that is, assume no differences when they actually exist (in the jargon, this approach has low power).
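One widely used way of controlling the FWER is the Bonferroni correction, which simply divides alpha by the number of tests. I am using it here only as a familiar example of the approach; the p-values are invented and the function name is hypothetical.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Flag tests that pass the Bonferroni-corrected cut-off alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Invented p-values from five comparisons.
invented_p = [0.001, 0.008, 0.012, 0.03, 0.2]
print(bonferroni_flags(invented_p))
```

With five tests the cut-off drops from 0.05 to 0.01, and with hundreds of tests it becomes very small indeed, which is why real differences can be missed.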

The expected proportion of 'false positives' among all the results flagged as significant is called the false discovery rate (FDR). Benjamini and Hochberg (1995) have argued that this is the appropriate statistic to control when doing multiple tests and they also suggested a procedure for so doing.

Benjamini and Yekutieli (2001) went on to show that the suggested procedure was still appropriate under conditions where the variables being tested may be correlated (as is often the case in surveys). Moreover, this procedure can be implemented using readily available spreadsheets or other software (Thissen, 2002; Watson, 2009). So, for the comparisons in this report, we have adopted this procedure to control the FDR at 0.05, the level advised by Benjamini and Gavrilov (2009).
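For the curious, the step-up procedure suggested by Benjamini and Hochberg (1995) can be sketched in Python as follows. This is my own minimal illustration, not a copy of the spreadsheet we use; the function name and the p-values are invented. The idea is to rank the p-values from smallest to largest, find the largest rank k for which the k-th p-value is no bigger than (k / m) x q, where m is the number of tests and q is the chosen FDR, and then flag every test up to that rank.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return True/False flags for each test, controlling the FDR at q."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is within (k / m) * q.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            cutoff_rank = rank

    # Everything ranked at or below the cut-off is flagged, even if an
    # individual p-value along the way missed its own threshold.
    flags = [False] * m
    for index in order[:cutoff_rank]:
        flags[index] = True
    return flags


# Invented p-values from five comparisons.
invented_p = [0.001, 0.008, 0.012, 0.03, 0.2]
print(benjamini_hochberg(invented_p))
```

For these invented values the procedure flags the first four comparisons.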

So, in order to adopt this procedure, I perform my hundreds of tests, comparing the results from boys and girls, but I don't simply use p

The alpha will be consistent for any one subgroup of the sample but this will be different to the alpha used for a different analysis - say, if we are also showing differences between north and south as well as boys and girls on the same table.  This also means that results flagged as statistically significant throughout the report will not share a single alpha, which is why I made only a cryptic reference to differences being "significant".
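A small Python sketch may make this clearer. It applies the Benjamini and Hochberg (1995) step-up rule to two invented sets of p-values, one standing for a boys/girls comparison and one for north/south, and reports the cut-off that each set ends up with. The function name and all the numbers are hypothetical.

```python
def bh_effective_alpha(p_values, q=0.05):
    """Cut-off that the step-up rule ends up applying to this set of tests.

    Returns (k / m) * q, where k is the largest rank whose p-value falls
    within its threshold; returns 0.0 if nothing is flagged.
    """
    m = len(p_values)
    k = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * q:
            k = rank
    return k / m * q


# Invented p-values for two analyses shown on the same table.
boys_vs_girls = [0.001, 0.008, 0.012, 0.03, 0.2]
north_vs_south = [0.004, 0.03, 0.04, 0.3, 0.6]

print(f"boys/girls cut-off:  {bh_effective_alpha(boys_vs_girls):.3f}")
print(f"north/south cut-off: {bh_effective_alpha(north_vs_south):.3f}")
```

Here a p-value of 0.03 is flagged in the first analysis but not in the second, even though both control the FDR at 0.05.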


Benjamini Y & Gavrilov Y (2009). A simple forward selection procedure based on false discovery rate control. Annals of Applied Statistics, 3(1): 179-198.

Benjamini Y & Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1): 289-300.

Benjamini Y & Yekutieli D (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4): 1165-1188.

Thissen D (2002). Quick and Easy Implementation of the Benjamini-Hochberg Procedure for Controlling the False Positive Rate in Multiple Comparisons. Journal of Educational and Behavioral Statistics, 27(1): 77-83.

Watson P (2009). Adjusted p-values in SPSS and R (CBU Statistics Pages). Cambridge: MRC Cognition and Brain Sciences Unit, last accessed 23rd November 2010.

SHEU gratefully acknowledges the use of the spreadsheet hierps.xls made available by Ian Nimmo Smith and Peter Watson of the MRC Cognition and Brain Sciences Unit, version last edited on 23rd June 2009.