Health Education Research, Vol. 14, No. 6, 713-715,
December 1999
© 1999 Oxford University Press
Editorial
Getting past the statistical referee: moving away from P-values and towards interval estimation
Sheffield Children's Hospital, University of Sheffield, Sheffield S10 2TH, UK
The earth is round (P < 0.05) (Cohen, 1994)
The increased involvement of statisticians in the peer review process has led to greater communication between statisticians and journal editors (Marks et al., 1988). Despite this, the quality of statistical reporting in many scientific journals remains below par. Statistical errors are commonplace and the problem appears to be long-lasting [see, e.g. (Schor and Karten, 1966; Feinstein, 1974; Gore et al., 1977; White, 1979; Glantz, 1980; Altman, 1982; Bland and Altman, 1987; Pocock et al., 1987)]. The main problem for the researcher would appear to be a lack of understanding of even the most basic of statistics (Mathews and McPherson, 1987). Part of the blame has been apportioned to poor statistical teaching (Rigby, 1998); others blame lack of interest by the researcher (Mathews and McPherson, 1987). Whatever the reasons, inappropriate use of statistics can lead to rejection of manuscripts. According to Sorenson et al. (Sorenson et al., 1998), two-thirds of peer-reviewed manuscripts submitted to Health Education Research are rejected, usually after one review. A common reason for rejection is flaws in the data analysis.
What potential authors need to know is how to overcome rejection at the first stage. As a statistical referee for many scientific journals, one of my pet hates lies with authors who present their results with a plethora of P-values (e.g. P < 0.05, P < 0.01, P < 0.001, P < 0.0001). What I prefer to see, in any scientific article that merits a statistical analysis, is a statement about effect size rather than an overemphasis on P-values.
But what is it about P-values and their presentation that I find so distracting? Two decades ago the clinical epidemiologist Alvan Feinstein (Feinstein, 1977) commented `...scientific reputations are made or lost on the basis of the magisterial phrase and number: statistical significance at P ≤ 0.05'. To understand Feinstein's concern it is necessary first to define a P-value. Altman (Altman, 1991) defines a P-value as `...the probability of having observed our data when the null hypothesis is true'. This is a useful starting point in trying to understand more about P-values. I think of a P-value as an error probability (Table I). The information presented in Table I represents a cross-classification of the outcome of a statistical test with the `truth'. This results in a 2x2 contingency table with four possible outcomes in terms of statistical errors. If the statistical test concludes that there is a difference between groups of observations when a difference does not exist, this results in a statistical error. This type of error is called a type I error. The type I error is also known as an α error or P-value or statistical significance. Type I errors are usually set at a threshold of 5%. However, in an increasingly evidence-based world the 5% threshold for statistical significance has no evidence base itself; it is an entirely arbitrary threshold. The great and the good in statistics have tried to find out when, where and why `P < 0.05 as statistically significant' came into being. Consensus is that it just emerged and the mud has stuck, so to speak. This conclusion is reinforced by Feinstein (Feinstein, 1975) who commented `The role of this number [0.05] has become so widely accepted and worshipped that one might expect to find a record of time and place when the apotheosis occurred. No such record exists'. P = 0.05 means that it is accepted that a statistically significant difference will occur 5% of the time if the null hypothesis is true (Table I).
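The meaning of the 5% threshold can be checked by simulation. The sketch below is illustrative only and is not taken from the editorial: it repeatedly draws two groups from the same normal population (so the null hypothesis is true by construction), applies a simple two-sided z-test that assumes a known standard deviation of 1, and counts how often P < 0.05. The function names, group size and number of simulated trials are all assumptions chosen for the illustration.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(sample_a, sample_b, sigma=1.0):
    """Two-sided P-value from a z-test for a difference in means.

    Assumes (for illustration) that the population SD `sigma` is known.
    """
    se = sigma * (1 / len(sample_a) + 1 / len(sample_b)) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_trials=2000, n_per_group=30, alpha=0.05, seed=1):
    """Proportion of simulated null trials declared 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Both groups come from the same population: no true difference.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(group_a, group_b) < alpha:
            hits += 1
    return hits / n_trials


rate = false_positive_rate()
print(f"Proportion of null trials with P < 0.05: {rate:.3f}")
```

Under these assumptions the printed proportion should be close to 0.05: about one trial in twenty is `significant' even though no difference exists, which is exactly the type I error that the threshold accepts.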
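[Table I. Cross-classification of the outcome of a statistical test with the `truth'; table not reproduced]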
Consider the statement P = 0.000001 and think of this as an error probability (Table I).
What about non-significant P-values? Perhaps surprisingly they can be as difficult to interpret as significant P-values. Two explanations can account for non-significant results. First, there really could be no difference between the treatment groups (bottom left-hand cell of Table I). Second, a type II error could have been committed (bottom right-hand cell of Table I). To minimize type II errors, the power of the study should be increased. Power is also defined in Table I (top right-hand cell). Thus, power is defined as the chance (or probability) of showing a difference between treatment groups, if a difference actually exists. However, the research evidence suggests that many studies are underpowered. For example, in a much-publicized study, Freiman et al. (Freiman et al., 1978) found that only 30% of trials published in the New England Journal of Medicine were sufficiently large to have a 90% chance (i.e. power) of detecting even large differences in effectiveness of treatments being compared. To coin a phrase `Absence of evidence is not evidence of absence' (Altman and Bland, 1995).
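How power depends on sample size can be illustrated with a short calculation. The sketch below is not part of the editorial: it uses the usual large-sample normal approximation for a two-sided comparison of two means with equal group sizes and a common, known standard deviation. The function name `power_two_means` and the example values (a true difference of half a standard deviation) are assumptions made for the illustration.

```python
from statistics import NormalDist


def power_two_means(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two means.

    delta: true difference between the group means
    sigma: common standard deviation (assumed known)
    n_per_group: number of subjects in each group
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = sigma * (2 / n_per_group) ** 0.5
    shift = abs(delta) / se
    # Probability that the test statistic falls in either rejection region.
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


for n in (10, 20, 64, 100):
    print(f"n = {n:3d} per group: power = {power_two_means(0.5, 1.0, n):.2f}")
```

With these illustrative inputs the power rises steadily with the number of subjects per group. A small study can therefore easily return a non-significant P-value even when a real difference exists, which is the type II error described above.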
Both significant and non-significant P-values require explanation, which is not always as easy as it may seem. In line with others of my profession, I therefore do not recommend reporting results using P-values alone. If not by P-values alone, then by what other method? I advocate presenting the results of statistical analyses using confidence intervals. A confidence interval is a measure of the precision of the sample statistic (e.g. mean, odds ratio, regression coefficient, correlation coefficient). The narrower the confidence interval, the more precise the sample statistic and vice versa. Although confidence intervals give more information about the findings of a study, confidence intervals and P-values are related (Gardner and Altman, 1986; Pocock et al., 1987). In a statistical comparison of two means (for example) a treatment difference that is significant at the 5% level has a 95% confidence interval that does not include a zero difference.
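The link between a 95% confidence interval and significance at the 5% level can be seen in a small worked example. The sketch below is illustrative only: it uses a large-sample normal approximation (a t-distribution would normally be preferred for small groups), and the summary statistics are hypothetical numbers invented for the illustration, not data from any study. The function name `difference_with_ci` is likewise an assumption.

```python
from statistics import NormalDist


def difference_with_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Difference in means with a confidence interval and two-sided P-value.

    Uses a normal approximation; inputs are group summary statistics.
    """
    std_normal = NormalDist()
    diff = mean_a - mean_b
    se = (sd_a ** 2 / n_a + sd_b ** 2 / n_b) ** 0.5
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)
    lower = diff - z_crit * se
    upper = diff + z_crit * se
    p_value = 2 * (1 - std_normal.cdf(abs(diff) / se))
    return diff, lower, upper, p_value


# Hypothetical summary statistics: same observed difference, different sample sizes.
large = difference_with_ci(5.0, 2.0, 50, 4.0, 2.0, 50)
small = difference_with_ci(5.0, 2.0, 10, 4.0, 2.0, 10)

for label, (diff, lower, upper, p) in (("n = 50 per group", large),
                                       ("n = 10 per group", small)):
    print(f"{label}: difference {diff:.2f} "
          f"(95% CI {lower:.2f} to {upper:.2f}), P = {p:.3f}")
```

In this hypothetical example the observed difference is the same in both cases. With 50 subjects per group the interval is narrow and excludes zero, and P is below 0.05; with 10 per group the interval is wide and includes zero, and P is above 0.05. The interval conveys both the size of the effect and its precision, which the P-value alone does not.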
The advocacy of confidence interval estimation for presentation of results in scientific journals is not new [see, e.g. (Yates, 1951; Savage, 1957; Rozeboom, 1960; Gardner and Altman, 1986; Simon, 1986; Bulpitt, 1987) and others]. One of the first to make such a recommendation was Frank Yates (Yates, 1951). However, Yates' ideas did not `catch fire' until Gardner and Altman's seminal paper, which was published in the British Medical Journal (Gardner and Altman, 1986). The widespread readership of the British Medical Journal has been cited as one reason why it has been Gardner and Altman rather than others who have taken the credit for the better reporting of statistics today (Rigby, 1998). Whoever takes (or rather should take) the credit, statistical reporting in scientific journals is improving, albeit slowly. Although many journals still do not have the services of a statistical referee (statistics is after all an uncommon profession), journal editors today are becoming more aware of the issues (Finney and Harper, 1993). If journal editors do not encourage the use of confidence interval estimation, the `P-value culture' identified by John Nelder (Nelder, 1986, 1999) will remain.
Can budding authors to Health Education Research learn anything about getting their work published? Articles which require statistical presentation should quote effect size with 95% confidence intervals; P-values should be used sparingly. With this in mind potential authors should not fail at the first hurdle: getting past the statistical referee.
References
Altman, D. G. (1982) Statistics in medical journals. Statistics in Medicine, 1, 59-71.
Altman, D. G. (1991) Practical Statistics for Medical Research. Chapman & Hall, London.
Altman, D. G. and Bland, J. M. (1995) Statistics notes: absence of evidence is not evidence of absence. British Medical Journal, 311, 485.
Bland, J. M. and Altman, D. G. (1987) Caveat Doctor: a grim tale of medical statistics textbooks. British Medical Journal, 279, 979.
Bulpitt, C. J. (1987) Confidence intervals. Lancet, i, 494-497.
Cohen, J. (1994) The earth is round (P < .05). American Psychologist, 49, 997-1003.
Feinstein, A. R. (1974) A survey of statistical procedures in general medical journals. Clinical Pharmacology and Therapeutics, 15, 97-107.
Feinstein, A. R. (1975) Biological dependency, `hypothesis testing', unilateral probabilities, and other issues in scientific direction vs. statistical duplicity. Clinical Pharmacology and Therapeutics, 17, 499-513.
Feinstein, A. R. (1977) Clinical Biostatistics. Mosby, St Louis, MO.
Finney, D. J. and Harper, J. L. (1993) Editorial code for presentation of statistical analyses. Proceedings of the Royal Society of London Series B, 254, 287-288.
Freiman, J. A., Chalmers, T. C., Smith, H. and Keubler, R. R. (1978) The importance of beta, type II error and sample size in the design and interpretation of the randomized controlled trial. New England Journal of Medicine, 299, 690-694.
Gardner, M. J. and Altman, D. G. (1986) Confidence intervals rather than p-values: estimation rather than hypothesis testing. British Medical Journal, 283, 600-602.
Glantz, S. A. (1980) Biostatistics: how to detect, correct and prevent errors in the medical literature. Circulation, 1, 1-7.
Gore, S. M., Jones, I. G. and Rytter, E. C. (1977) Misuse of statistical methods: critical assessment of articles in the BMJ from January to March 1976. British Medical Journal, 1, 85-87.
Marks, R. G., Dawson-Saunders, E. K., Bailar, J. C., Dan, B. D. and Verran, J. A. (1988) Interactions between statisticians and biomedical journal editors. Statistics in Medicine, 7, 1003-1011.
Mathews, D. R. and McPherson, K. (1987) Doctors' ignorance of statistics. British Medical Journal, 294, 856-857.
Nelder, J. A. (1986) Statistics, science and technology. Journal of the Royal Statistical Society Series A, 149, 109-121.
Nelder, J. A. (1999) From statistics to statistical science (with comment). The Statistician, 48, 257-269.
Pocock, S. J., Hughes, M. D. and Lee, R. J. (1987) Statistical problems in the reporting of clinical trials. New England Journal of Medicine, 317, 426-432.
Rigby, A. S. (1998) Statistical methods in epidemiology. I. Statistical errors in hypothesis testing. Disability and Rehabilitation, 20, 121-126.
Rozeboom, W. W. (1960) The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.
Savage, I. R. (1957) Nonparametric statistics. Journal of the American Statistical Association, 52, 331-344.
Schor, S. and Karten, I. (1966) Statistical evaluation of medical journal manuscripts. Journal of the American Medical Association, 195, 145-150.
Simon, R. (1986) Confidence intervals for reporting of results of clinical trials. Annals of Internal Medicine, 105, 429-435.
Sorenson, J. R., Steckler, A. and Bernhardt, J. (1998) Eighteen months and 100 manuscripts later (Editorial). Health Education Research, 13, iii.
White, S. J. (1979) Statistical errors in papers submitted to the British Journal of Psychiatry. British Journal of Psychiatry, 135, 336-342.
Yates, F. (1951) The influence of statistical methods for research workers on the development of the science of statistics. Journal of the American Statistical Association, 46, 19-34.