John Copas - Short Course

Co-sponsoring units: Statistics, Center for American Politics and Public Policy, Comparative Law and Society Studies (CLASS) Center, Biostatistics.


Professor John Copas will visit the Center for Statistics and the Social Sciences in July 2001. During his visit he will present a series of five seminars on aspects of selection bias. The seminars will look at how selection biases can occur in many areas of statistics, present recent work giving a theoretical framework for exploring the extent of such biases, and discuss sensitivity analyses in a number of practical applications, mainly in the social and medical sciences. The series will focus on three areas: missing data, selection bias in observational comparisons, and publication bias in meta analysis. The typical effect of selection bias is to reduce estimated treatment effects, to increase levels of uncertainty and, in several examples, to call into serious question the validity of standard methods of analysis.

  • Seminar 1: Overview of Selection Bias - Thursday 5 July 1.30 pm
  • Seminar 2: Publication Bias - Friday 6 July 1.30 pm
  • Seminar 3: Missing Data - Theoretical Aspects - Tuesday 10 July 1.30 pm
  • Seminar 4: Observational Comparisons - Thursday 12 July 1.30 pm
  • Seminar 5: Assessing Uncertainty - Monday 16 July 12 pm

Although ideas will move in sequence from one seminar to the next, each seminar will briefly summarize previous work where necessary, and so should be fairly self-contained. Since these talks are seminars, registration is not necessary.

Seminar 1: Problems of Selection Bias, Overview and Examples

Thursday 5 July 1.30 pm, Balmer 313

Selection bias occurs in many different guises, and is one of the most common and most difficult problems in observational data analysis. To fix ideas we will look at an observational study comparing the side effects of two medical treatments. The published analysis shows that the rate of side effects under the new treatment decreased over time, and by the end of the study was no worse than that of the standard treatment. However, the trend over time can equally well be explained by the way in which patients are allocated to the treatments, leading to a very different conclusion. The problem is the lack of randomization. Standard statistical methods, as Fisher demonstrated, are ''justified'' by the conventional assumption of randomization, but can be very sensitive to violations of this assumption.

The general set-up involves a response variable y and a subsidiary variable z, which could be the response indicator for missing data, the group identifier for comparative studies, or the event of publication in meta analysis. Standard methods (eg imputation for missing data) assume that y and z are independent after appropriate conditioning. Key questions are: can we develop a useful family of models which introduce dependence? How do we monitor the sensitivity of inference to different members of this family? Do the data allow us to discriminate between these models? Can we interpret a measure of dependence in terms of potentially observable quantities?
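The dependence between y and z can be made concrete with a minimal simulation sketch (the logistic selection model and the parameter rho below are illustrative assumptions, not a model from the seminars):

```python
import math
import random
import statistics

random.seed(1)

# Response y; selection indicator z equals 1 with a probability that
# may depend on y.  rho = 0 corresponds to the standard assumption
# that y and z are independent.
N = 100_000
y = [random.gauss(0.0, 1.0) for _ in range(N)]

def prob_selected(yi, rho):
    """Logistic selection probability (an illustrative choice)."""
    return 1.0 / (1.0 + math.exp(-rho * yi))

means = {}
for rho in (0.0, 1.0):
    observed = [yi for yi in y if random.random() < prob_selected(yi, rho)]
    means[rho] = statistics.mean(observed)
    print(f"rho = {rho}: mean of observed y = {means[rho]:+.3f}")
# With rho = 0 the observed mean is close to the true mean 0; with
# rho = 1 it is biased upward, yet the observed data on their own
# look perfectly ordinary.
```

The point of the sketch is that nothing in the observed sample reveals the value of rho, which is why sensitivity analysis over a family of such models is needed.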

Seminar 2: Passive Smoking and Lung Cancer, A Case of Publication Bias?

Friday 6 July 1.30 pm, Balmer 313

Many studies have suggested that there is an association between passive smoking (exposure to other people's tobacco smoke) and the risk of lung cancer, but have made very different estimates of the size of this risk. Recent meta analyses of these studies have concluded that the relative risk is about 1.25. However, this figure is very sensitive to publication bias (the possibility that the studies accepted for publication are disproportionately those with more ''significant'' results).

A model for selection bias envisages a population of studies (outcome y) and a sampling mechanism to describe the selection (z) of studies for review. The idea is to compare estimates of the population average with goodness-of-fit to the funnel plot (the plot of study effect against study precision). In the passive smoking/lung cancer meta analysis it seems that only a small number of ''unpublished studies'' are needed to cast doubt on the significance of the result.
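A toy simulation illustrates this selection effect (all parameters and the significance-based selection rule are invented for illustration; this is not the selection model developed in the seminar):

```python
import random

random.seed(7)

# A population of studies with true effect 0: each study has a
# standard error se and reports an estimate est.  Only studies whose
# estimate is 'significant' (est / se > 1.96) get published.
true_effect = 0.0
studies = []
for _ in range(2000):
    se = random.uniform(0.05, 0.5)        # study precision varies
    est = random.gauss(true_effect, se)
    studies.append((est, se))

def pooled(sts):
    """Inverse-variance (fixed-effect) pooled estimate."""
    weights = [1.0 / se ** 2 for _, se in sts]
    return sum(w * est for w, (est, _) in zip(weights, sts)) / sum(weights)

published = [(est, se) for est, se in studies if est / se > 1.96]
print(f"pooled, all studies:      {pooled(studies):+.3f}")
print(f"pooled, 'published' only: {pooled(published):+.3f}")
# Selection on significance pulls the pooled estimate away from the
# true value 0, and hits imprecise studies hardest: the asymmetry
# that a funnel plot is designed to reveal.
```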

Seminar 3: Attitude and Sex, A Case of Missing Data Bias?

Tuesday 10 July 1.30 pm, Balmer 313

Sample surveys have been used to compare the social attitudes (y) of men and women. But whether a person agrees or refuses to answer the question (z=1 or 0) is itself a statement of attitude. What should we do about non-response in such surveys?

We look at models for the joint distribution of y and z: f is the model under which the data are ''missing at random'', and g is the ''true'' model. We can explore how the bias in the mean of y depends on the ''distance'' between f and g. Plotting contours of \hat{\mu} for given values of interpretable functions of this ''distance'' provides a useful sensitivity analysis, with the conventional confidence ellipsoid giving a benchmark for comparison.
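The flavour of such a sensitivity analysis can be sketched as follows (a hedged illustration with an invented logistic response model and sensitivity parameter delta, not the exact formulation in the seminar):

```python
import math
import random
import statistics

random.seed(3)

# Under f ('missing at random') the response probability ignores y;
# under g it depends on y through a sensitivity parameter delta, so
# delta indexes the 'distance' between f and g.
N = 50_000
y = [random.gauss(10.0, 2.0) for _ in range(N)]
true_mean = statistics.mean(y)

def prob_respond(yi, delta):
    """Logistic response model; delta = 0 recovers missing at random."""
    return 1.0 / (1.0 + math.exp(-(0.5 + delta * (yi - 10.0))))

bias = {}
for delta in (0.0, 0.2, 0.5):
    observed = [yi for yi in y if random.random() < prob_respond(yi, delta)]
    bias[delta] = statistics.mean(observed) - true_mean
    print(f"delta = {delta:.1f}: complete-case bias = {bias[delta]:+.3f}")
# A sensitivity analysis reports the estimate over a plausible range
# of delta instead of trusting delta = 0.
```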

Seminar 4: Prison Works, A Case of Judicial Selection Bias?

Thursday 12 July 1.30 pm, Balmer 313

A recent study has claimed that offenders sent to prison are less likely to re-offend than those given non-custodial sentences. Hence the slogan ''Prison Works''. But differences between the subjects who are given the different sentences can only be partially explained by observed covariates. Sentences are decided by judges, not by randomization. How do we allow for hidden confounders?

The model in Seminar 3 is extended to a counter-factual model in which z labels the categories being compared. A randomized experiment corresponds to y and z being conditionally independent given the values of covariates. A sensitivity analysis suggests a much more cautious interpretation of the prison study.
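The role of a hidden confounder can be illustrated with a toy simulation (all numbers invented, and not drawn from the prison study; z is the sentence, y is re-offending, and u is a confounder seen by judges but not by the analyst):

```python
import random

random.seed(11)

# Hidden confounder u (say, offence seriousness) raises both the
# chance of a prison sentence z and the chance of re-offending y.
# Prison itself has no effect on y in this toy model.
N = 100_000
rows = []
for _ in range(N):
    u = random.random()
    z = random.random() < 0.2 + 0.6 * u   # serious cases get prison
    y = random.random() < 0.3 + 0.4 * u   # serious cases re-offend more
    rows.append((z, y))

def reoffend_rate(group):
    return sum(y for _, y in group) / len(group)

prison = [r for r in rows if r[0]]
community = [r for r in rows if not r[0]]
print(f"re-offend | prison:    {reoffend_rate(prison):.3f}")
print(f"re-offend | community: {reoffend_rate(community):.3f}")
# The raw comparison is driven entirely by u, so adjusting only for
# observed covariates would still mislead: hence the need to model
# the dependence between y and z directly.
```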

Seminar 5: Double the Variance to be on the Safe Side?

Monday 16 July 12 pm, Balmer 313

It is impossible to ''correct'' for selection bias without making untestable assumptions. Sensitivity analysis helps us understand and think about the problem, but doesn't solve it. In practice we usually end up making strong assumptions like ''missing at random'' for reasons of pragmatism rather than conviction. Is it possible to use such an assumption as a tentative model without assuming it is actually ''true''?

For missing data, we could imagine that the complete data are eventually obtained so that we can later test whether the ''missing at random'' assumption (or any other assumption about selection bias) is reasonable. A weaker interpretation of the assumption is that, if complete data were to be obtained, then we would pass this test. Using the model from Seminar 3, and conditioning on an appropriate test statistic, suggests that the extra uncertainty is captured by increasing the variance of all estimators by a factor which is bounded above by two. A similar conclusion also holds for the ''hidden confounders'' setting of Seminar 4.
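As a purely arithmetic illustration of the headline bound (the derivation itself is the subject of the seminar, and the estimate and standard error below are hypothetical), inflating the variance of an estimator by at most two widens a standard 95% interval by at most a factor of sqrt(2):

```python
import math

# Hypothetical figures: a point estimate with standard error se,
# analysed under 'missing at random'.  The bound says the extra
# uncertainty from treating that assumption as tentative is captured
# by inflating the variance by a factor of at most two.
estimate = 1.25   # hypothetical point estimate
se = 0.10         # hypothetical standard error under the assumption

intervals = {}
for inflation in (1.0, 2.0):
    half_width = 1.96 * se * math.sqrt(inflation)
    intervals[inflation] = (estimate - half_width, estimate + half_width)
    lo, hi = intervals[inflation]
    print(f"variance x {inflation:.0f}: 95% CI = ({lo:.3f}, {hi:.3f})")
# Doubling the variance widens the interval by sqrt(2), i.e. by about
# 41% at most -- the 'safe side' of the seminar title.
```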


Copas, J. B. and Li, H. G. (1997)
Inference for non-random samples (with discussion).
J. Roy. Statist. Soc. B, 59, 55-96.
Copas, J. B. and Shi, J. Q. (2000)
Meta analysis, funnel plots and sensitivity analysis.
Biostatistics, 1, 247-262.
Copas, J. B. and Shi, J. Q. (2000)
Reanalysis of epidemiological evidence on lung cancer and passive smoking.
British Medical Journal, 320(7232), 417-418.
Copas, J. B. and Eguchi, S. (2001)
Local sensitivity approximations for selectivity bias.
(under revision for JRSSB)
Copas, J. B. and Lu, G. (2001)
Double the variance for missing data.


John Copas is currently Professor of Statistics at the University of Warwick, UK. He is a former editor of Applied Statistics, a Vice-President of the Royal Statistical Society, a statistical consultant to the UK Home Office, and the author of many papers in mathematical and applied statistics. His CV includes a number of discussion papers read to the Royal Statistical Society in London. His current interests include methodology for selection bias, meta analysis, risk assessment, semi-parametric inference, and models for discriminant functions.