# Statistics

#### General information

Statistical evaluation is a vital part of many communications. These guidelines have been written for the benefit of sound scientific work and to help authors prepare their manuscripts in accordance with good statistical standards. The guidelines are applicable to retrospective clinical studies as well as to experimental studies, randomized clinical trials and epidemiological studies. However, not all aspects are equally important for all types of studies. For instance, randomized clinical trials typically include a given number of patients based on calculations of statistical power. In exploratory experimental studies, the number of units studied may be based on other considerations, but should still be justified.

The following general principles should also be followed: The investigator should ensure that the data are of high quality. All data should be stored and be retrievable on request. The use of a statistical method presupposes appropriate knowledge and understanding. Presentation of statistical results should focus on their clinical, not statistical, importance.

#### Introduction

State clearly the aim of the study and the primary hypothesis.

#### Patients and methods

State the number of subjects studied and why this number was chosen. Describe the sources of subjects, how the subjects were selected and the inclusion or exclusion criteria that were employed. Present information on subjects who declined to participate, withdrawals and subjects with incomplete follow-up. Describe in detail how measurements were made and which techniques were used. All statistical methods should be named and, for unusual methods, referenced; for every statistical result, the method used should be clearly described.
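One common way to justify the number of subjects is a power calculation. The sketch below uses the usual normal-approximation formula for a two-sided comparison of two means, n per group = 2 (z₁₋α/₂ + z₁₋β)² σ² / δ². The function name `n_per_group` and the input values (a difference of 0.5 units with a standard deviation of 1.0) are assumptions chosen for illustration only; a real calculation should use values motivated by the study at hand.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return math.ceil(n)                            # round up to whole subjects


# Hypothetical inputs: detect a 0.5-unit difference when the SD is 1.0.
print(n_per_group(delta=0.5, sd=1.0))              # 80% power
print(n_per_group(delta=0.5, sd=1.0, power=0.90))  # 90% power
```

Because the formula relies on the normal distribution, it slightly underestimates the number needed when a t-test is used in small samples.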

All tests should be two-sided, unless the use of one-sided tests is specifically justified. No data should be removed, imputed, weighted, adjusted or trimmed unless this action is specifically described and justified and its consequences are presented. Use non-parametric techniques when data have been measured on an ordinal scale, or on an interval scale when non-normality is suspected and normality cannot be induced by transformation. In addition, for small unbalanced data sets with many ties or a poor distribution, exact methods may be needed to produce reliable results. Matched data should be analyzed using conditional techniques, e.g. paired t-test, Wilcoxon's signed-rank test, McNemar's test or conditional logistic regression.
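As an illustration of a conditional, exact analysis of matched binary data, the sketch below computes a two-sided exact McNemar p-value. Only the discordant pairs enter the calculation, which is what makes the analysis conditional on the matching. The function name `mcnemar_exact`, the argument names `b` and `c`, and the counts are assumptions made for this example.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: pairs where only the first member has the outcome
    c: pairs where only the second member has the outcome
    Concordant pairs carry no information and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical matched study with 8 and 2 discordant pairs.
print(mcnemar_exact(8, 2))
```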

When measurements are repeated on the same subject, they should not be treated as independent observations; use repeated-measures ANOVA or multilevel models. A possible alternative is to summarize all values from each subject into an individual estimate of a clinically relevant entity, e.g. the magnitude of a peak value, the area under the curve or the doubling time, and then use these estimates as input in an analysis with one observation per subject. When multiple hypothesis testing is performed in a study with the aim of confirming a pre-specified hypothesis, care should be taken to avoid spurious significance by using techniques for simultaneous inference.
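The summary-measure alternative can be sketched as follows: each subject's series is reduced to a peak value and an area under the curve (trapezoidal rule), giving one observation per subject for the subsequent analysis. The function name `summarize_subject`, the subject labels and the measurements are hypothetical.

```python
def summarize_subject(times, values):
    """Reduce one subject's repeated measurements to summary measures."""
    peak = max(values)
    # Trapezoidal rule: sum the area of each segment between time points.
    auc = sum(
        (t1 - t0) * (v0 + v1) / 2
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    )
    return {"peak": peak, "auc": auc}


# Hypothetical data: measurement times and values for two subjects.
subjects = {
    "s1": ([0, 1, 2, 4], [2.0, 5.0, 4.0, 1.0]),
    "s2": ([0, 1, 2, 4], [1.0, 3.0, 3.0, 2.0]),
}

# One row per subject; these rows, not the raw series, go into the analysis.
summaries = {sid: summarize_subject(t, v) for sid, (t, v) in subjects.items()}
for sid, row in summaries.items():
    print(sid, row)
```

Which summary is appropriate depends on the clinical question; the choice should be made before the data are analyzed.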

#### Randomized trials

Reports of randomized controlled trials should comply with the CONSORT statement (see www.consort-statement.org).

#### Think twice about using stepwise regression analysis

**Background**

Some statistical software packages include programmes for stepwise multiple regression analysis. In short, this is a technique for building statistical models automatically, by selecting variables from a pre-defined set of candidate variables using a test-related criterion, e.g. an F- or p-value. Two main selection procedures exist: forward and backward. The former selects explanatory variables by consecutive inclusion, the latter by consecutive exclusion. The two procedures often produce different outcomes.

**Statistical tests and parameter estimates**

It is often argued that statistical testing is an important part of a scientific manuscript because p-values represent an objective method for assessing scientifically important differences in data. This is a nice idea, but it is false.

Statistical tests cannot be used to "assess" important differences. Statistical significance is used for checking whether an observed difference or effect can be explained by chance alone. When this is the case, the observation should of course be interpreted with caution. However, a statistically non-significant result never indicates that a difference or effect "does not exist", because absence of evidence is not evidence of absence.

Furthermore, scientific importance is related to two different issues, which should not be confused: clinical and statistical significance. For example, whether a body temperature difference of 0.5 degrees Celsius is clinically significant or not depends on biology. The difference may be significant when predicting ovulation but insignificant when predicting recovery after hip fracture. In contrast, statistical significance depends entirely on statistical issues; the 0.5-degree difference in body temperature may be statistically significant in a study of recovery after hip fracture with 400 subjects but not in one with 12.
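A rough numerical sketch of the sample-size dependence: the same 0.5-degree difference is tested with 400 and with 12 subjects. Several things here are assumptions that do not come from the guidelines: the subjects are split equally into two groups, the standard deviation of body temperature is taken to be 1.0 degree, and a normal (z) approximation is used instead of a t-test. The function name `two_sided_p` is likewise hypothetical.

```python
from math import erfc, sqrt


def two_sided_p(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal groups
    (normal approximation with a known, common standard deviation)."""
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference
    z = diff / se
    return erfc(abs(z) / sqrt(2))    # equals 2 * (1 - Phi(|z|))


# Assumed SD of 1.0 degree; identical observed difference in both studies.
p_large = two_sided_p(diff=0.5, sd=1.0, n_per_group=200)  # 400 subjects
p_small = two_sided_p(diff=0.5, sd=1.0, n_per_group=6)    # 12 subjects
print(f"400 subjects: p = {p_large:.1g}")
print(f" 12 subjects: p = {p_small:.1g}")
```

The clinical meaning of the 0.5-degree difference is the same in both runs; only the p-value changes.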

In addition, statistical test results are not objective. The outcome of statistical tests depends on the characteristics of the sample. A true difference in revision risk between two sorts of prostheses may show up in one sample, but not in another one, because a risk difference can easily be confounded by association with other factors affecting revision risk.

It is therefore, at least in observational studies, always necessary to take possible confounding factors into consideration. Sex and age are two common confounders. If not adjusted for, differences in the distributions of sex and age can bias a risk estimate and either produce an artifactual risk where none exists or mask an existing one.
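The following sketch shows how an unequal age distribution can create an artifactual risk difference. The counts are invented for illustration: within each age stratum the two prostheses have identical revision risks, but prosthesis A is used mostly in the younger, higher-risk stratum. The crude risk ratio is compared with a Mantel-Haenszel risk ratio that adjusts for age stratum. The function names and the data layout are assumptions of this example.

```python
# Hypothetical data: (revisions, patients) per prosthesis within each age stratum.
strata = {
    "younger": {"A": (80, 400), "B": (20, 100)},  # risk 20% for both
    "older":   {"A": (5, 100),  "B": (20, 400)},  # risk 5% for both
}


def crude_risk_ratio(strata):
    """Risk ratio A vs B, ignoring age."""
    a_events = sum(s["A"][0] for s in strata.values())
    a_total = sum(s["A"][1] for s in strata.values())
    b_events = sum(s["B"][0] for s in strata.values())
    b_total = sum(s["B"][1] for s in strata.values())
    return (a_events / a_total) / (b_events / b_total)


def mantel_haenszel_risk_ratio(strata):
    """Risk ratio A vs B, adjusted for age stratum (Mantel-Haenszel)."""
    num = den = 0.0
    for s in strata.values():
        (a, n1), (c, n0) = s["A"], s["B"]
        total = n1 + n0
        num += a * n0 / total
        den += c * n1 / total
    return num / den


print("crude risk ratio:   ", round(crude_risk_ratio(strata), 3))
print("adjusted risk ratio:", round(mantel_haenszel_risk_ratio(strata), 3))
```

In this constructed example the crude analysis suggests that prosthesis A more than doubles the revision risk, while the age-adjusted estimate shows no difference.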

**Building statistical models**

Adjustment for confounding can be performed using statistical (regression) models. Programmes for fitting statistical models are generally available in commercial software packages.

The testing and parameter estimation performed using a statistical model clearly depend on the variables included in the model. It is therefore crucial for confounding adjustment that known clinically significant variables are included in the regression model. The statistical significance of an adjustment variable is, however, irrelevant. A clinically significant variable may well be an important confounder even when it is statistically insignificant.

**Stepwise regression analysis**

The common practice (1) of screening a dataset using simple hypothesis tests and then including the statistically significant variables in a multiple statistical model to find out if they are "really significant" is therefore inappropriate. This technique should be used neither for confounding adjustment nor for prediction purposes.

Stepwise regression analysis also uses p-value related criteria for building a statistical model. It is thus also an inappropriate method (2-3). The technique can perhaps be used for generating hypotheses about completely unknown phenomena, but a sound strategy for selecting variables in clinical and epidemiological studies, where some knowledge does exist, is to use clinical judgment.

In addition, stepwise regression has for many years been criticised by statisticians (4-6) for overestimating precision and producing biased regression coefficients. It seems, however, that little of this criticism has reached the medical community. Inappropriate use of stepwise regression analysis appears to be increasingly common in medical publications (7-8).
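The bias introduced by keeping only statistically significant estimates can be illustrated with a small simulation. This is a deliberately simplified analogue, not stepwise regression itself: many hypothetical studies estimate the same true effect, and the estimates that pass a significance filter are compared with all estimates. The true effect (0.2), the study size (50), the number of simulated studies and the random seed are arbitrary assumptions.

```python
import random
from statistics import mean


def simulate(true_effect=0.2, n=50, studies=5000, seed=1):
    """Compare the average of all estimates with the average of those
    that are 'significant' at the two-sided 5% level."""
    rng = random.Random(seed)
    se = 1 / n ** 0.5                      # standard error when the SD is 1
    all_estimates, kept = [], []
    for _ in range(studies):
        est = rng.gauss(true_effect, se)   # estimate drawn from its sampling distribution
        all_estimates.append(est)
        if abs(est / se) > 1.96:           # keep only 'significant' estimates
            kept.append(est)
    return mean(all_estimates), mean(kept)


overall, selected = simulate()
print(f"true effect:                  0.20")
print(f"mean of all estimates:        {overall:.2f}")
print(f"mean of significant estimates: {selected:.2f}")
```

The unselected estimates average close to the true effect, whereas the estimates that survive the p-value filter are systematically too large. Variable selection based on p-values exposes regression coefficients to the same kind of selection effect.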

Other scientific journals, such as Annals of Internal Medicine, have also recently included statistical guidelines to "avoid stepwise methods of model building" in their Information for Authors (http://www.annals.org/shared/author_info.html#multivariable-analysis, July 9, 2007).

In conclusion, there are good reasons for thinking twice about using this method in medical research. Our recommendation is always to avoid stepwise regression.

**References**

1. Sun SW et al. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epidemiol. 1996;49:907-16.

2. Harrell FE, Jr, et al. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statist Med. 1996;15:361-87.

3. Mickey RM, Greenland S. The impact of confounder selection criteria on effect estimation. Am J Epidemiol. 1989;129:125-37.

4. Mantel N. Why stepdown procedures in variable selection. Technometrics. 1970;12:621-5.

5. Copas JB. Regression, prediction and shrinkage (with discussion). J R Stat Soc Series B. 1983;45:311-54.

6. Derksen S, Keselman HJ. Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables. Br J Math Stat Psychol. 1992;45:265-82.

7. Whittingham MJ, Stephens PA, Bradbury RB, Freckleton RP. Why do we still use stepwise modelling in ecology and behaviour? J Anim Ecol. 2006;75:1182-9.

8. Malek MH, Berger DE, Coburn JW. On the inappropriateness of stepwise regression analysis for model building and testing. Eur J Appl Physiol. 2007 May 23; Epub ahead of print.

#### Results

When summarizing the data, always include measures of variability and the number of subjects. When presenting medians, also give the range in parentheses, e.g. median age was 60 (35–70) years; when presenting means, give the standard deviation, e.g. mean age was 59 (SD 15) years. Present the frequencies for nominal data. Results from matched data should be presented in a relevant form, e.g. the distribution of pairwise differences.
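A minimal sketch of how such summaries might be produced in the formats shown above. The helper names `median_range` and `mean_sd` and the ages are hypothetical; rounding to whole numbers is a choice made here for ages and is not prescribed by the guidelines.

```python
from statistics import mean, median, stdev


def median_range(values):
    """Format as 'median (min–max)'."""
    return f"{median(values):g} ({min(values):g}–{max(values):g})"


def mean_sd(values):
    """Format as 'mean (SD x)', using the sample standard deviation."""
    return f"{mean(values):.0f} (SD {stdev(values):.0f})"


# Hypothetical ages of seven subjects.
ages = [35, 52, 58, 60, 61, 66, 70]

print(f"n = {len(ages)}")
print(f"median age was {median_range(ages)} years")
print(f"mean age was {mean_sd(ages)} years")
```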

Hypothesis tests (p-values) should be used in combination with a defined effect size and when statistical power has been considered. Present p-values as actual numbers, rounded to one significant digit (i.e. one digit after any leading zeros), if they are 0.001 or greater; otherwise use 'p < 0.001'. Do not use 'ns', 'p > 0.05' or asterisks. Use 95% confidence intervals in exploratory analyses and when estimating effects or differences.
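The p-value reporting rule can be sketched as a small formatting helper. It reads "1 digit except zeros" as one significant digit, which is an interpretation rather than something the guidelines spell out; the function name `format_p` is hypothetical.

```python
def format_p(p):
    """Format a p-value: one significant digit, or 'p < 0.001' below that."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "p < 0.001"
    return f"p = {p:.1g}"  # '.1g' keeps one significant digit


for p in (0.0432, 0.26, 0.007, 0.0004):
    print(format_p(p))
```

Note that strict rounding to one significant digit turns values very close to 1 (for example 0.97) into "p = 1"; authors may prefer to handle that case separately.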

#### Discussion

The discussion section should, when relevant, contain a critical discussion of the results. Issues such as the quality of the data (selection and information bias) and the adequacy of the statistical analysis (confounding bias) should then be addressed.
