Co-sponsoring units: Statistics, Center for American Politics and Public Policy, Comparative Law and Society Studies (CLASS) Center, Biostatistics.


Professor John Copas will visit the Center for Statistics and the Social Sciences in July 2001. During his visit he will present a series of five seminars on aspects of selection bias. The seminars will look at how selection biases can occur in many areas of statistics, present recent work giving a theoretical framework for exploring the extent of such biases, and discuss sensitivity analyses in a number of practical applications, mainly in the social and medical sciences. The series will focus on three areas: missing data, selection bias in observational comparisons, and publication bias in meta-analysis. The typical effect of selection bias is to reduce estimated treatment effects, increase levels of uncertainty and, in several examples, call into serious question the validity of standard methods of analysis.

- Seminar 1: Overview of Selection Bias - Thursday 5 July 1.30 pm
- Seminar 2: Publication Bias - Friday 6 July 1.30 pm
- Seminar 3: Missing Data - Theoretical Aspects - Tuesday 10 July 1.30 pm
- Seminar 4: Observational Comparisons - Thursday 12 July 1.30 pm
- Seminar 5: Assessing Uncertainty - Monday 16 July 12 pm

Although ideas will move in sequence from one seminar to the next, each seminar will briefly summarize earlier material where necessary, and so should be fairly self-contained. As these talks are seminars, no registration is necessary.

Seminar 1: Overview of Selection Bias - Thursday 5 July 1.30 pm, Balmer 313

Selection bias occurs in many different guises, and is one of the most common and most difficult problems in observational data analysis. To fix ideas we will look at an observational study comparing the side effects of two medical treatments. The published analysis shows that the side-effect rate of the new treatment fell over time, and by the end of the study was no worse than that of the standard treatment. However, the trend over time can equally well be explained by the way in which patients were allocated to the treatments, leading to a very different conclusion. The problem is the lack of randomization. Standard statistical methods, as Fisher demonstrated, are "justified" by the conventional assumption of randomization, but can be very sensitive to violations of this assumption.

The general set-up involves a response variable *y* and a subsidiary
variable *z*, which could be the response indicator for missing data,
the group identifier for comparative studies, or the event of
publication in meta-analysis. Standard methods (e.g. imputation for
missing data) assume that *y* and *z* are independent after
appropriate conditioning. Key questions are: can we develop a useful
family of models which introduce dependence? How do we monitor the
sensitivity of inference to different members of this family? Do the
data allow us to discriminate between these models? Can we interpret a
measure of dependence in terms of potentially observable quantities?
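As a toy illustration of this set-up (our own sketch in Python, not an example from the seminars; the selection model and parameter values are invented), suppose the chance that *y* is observed depends on *y* itself. Standard methods assume that dependence away, and the complete-case mean is then biased:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: a response y whose chance of being observed
# (z = 1) depends on y itself, violating the independence assumption
# behind standard methods.
n = 100_000
y = rng.normal(loc=0.0, scale=1.0, size=n)

# Assumed selection model: Pr(z = 1 | y) = logistic(alpha + beta * y).
# beta = 0 would recover the "missing completely at random" case.
alpha, beta = 0.0, 1.0
p_obs = 1.0 / (1.0 + np.exp(-(alpha + beta * y)))
z = rng.random(n) < p_obs

true_mean = y.mean()         # close to 0 by construction
observed_mean = y[z].mean()  # biased upward: large y is over-sampled

print(f"mean of all y:      {true_mean:+.3f}")
print(f"mean of observed y: {observed_mean:+.3f}")
```

Because *beta* is about the observation process, not the observed data alone, it cannot be estimated from the observed sample; this is what makes the sensitivity questions above central.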

Seminar 2: Publication Bias - Friday 6 July 1.30 pm, Balmer 313

Many studies have suggested that there is an association between passive smoking (exposure to other people's tobacco smoke) and the risk of lung cancer, but have made very different estimates of the size of this risk. Recent meta-analyses of these studies have concluded that the relative risk is about 1.25. However, this figure is very sensitive to publication bias (the possibility that studies with more "significant" results are more likely to be accepted for publication).

A model for selection bias envisages a population of studies (outcome
*y*) and a sampling mechanism to describe the selection (*z*) of
studies for review. The idea is to compare estimates of the population
average with goodness-of-fit to the funnel plot (the plot
of study effect against study precision). In the passive smoking/lung
cancer meta-analysis it seems that only a small number of
"unpublished studies" are needed to cast doubt on the significance
of the result.
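The selective-publication mechanism can be mimicked in a few lines (a hypothetical simulation of ours; the number of studies, effect sizes, and publication probabilities are invented, not drawn from the passive-smoking data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sketch: a population of studies estimating a true effect
# of zero, with "publication" more likely when the result is significant.
true_effect = 0.0
n_studies = 2000
se = rng.uniform(0.05, 0.5, size=n_studies)  # varying study precision
estimate = rng.normal(true_effect, se)       # each study's estimate

zscore = estimate / se
significant = zscore > 1.96
# Significant studies always published; others published 20% of the time.
published = significant | (rng.random(n_studies) < 0.2)

# Inverse-variance weighted (fixed-effect) pooled estimates.
naive_pooled = np.average(estimate[published],
                          weights=1 / se[published] ** 2)
all_pooled = np.average(estimate, weights=1 / se ** 2)

print(f"pooled estimate, all studies:    {all_pooled:+.4f}")
print(f"pooled estimate, published only: {naive_pooled:+.4f}")
```

The published-only pooled estimate drifts above the truth even though every individual study is unbiased; a funnel plot of `estimate` against `1/se` for the published studies would show the characteristic asymmetry.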

Seminar 3: Missing Data - Theoretical Aspects - Tuesday 10 July 1.30 pm, Balmer 313

Sample surveys have been used to compare the social attitudes (*y*) of
men and women. But whether a person agrees or refuses to answer the
question (*z=1* or *0*) is itself a statement of attitude. What should
we do about non-response in such surveys?

We look at models for the joint distribution of *y* and *z*: *f* is
the model if the data are "missing at random" and *g* is the
"true" model. We can explore how bias in the mean of *y* depends on
the "distance" between *f* and *g*. Plotting contours of the estimated
mean *μ̂* for given values of interpretable functions of this "distance"
can be a useful sensitivity analysis, with the conventional confidence
ellipsoid giving a benchmark for comparison.
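One simple way to carry out such a sensitivity analysis (our own sketch, not necessarily the seminar's exact method; the logistic selection model and the grid of slopes are assumptions) is to posit a non-response model with an unidentifiable slope *b* and trace the corrected mean as *b* varies:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sketch: non-response depends on y through a sensitivity
# parameter b that the observed data cannot identify.
n = 200_000
y = rng.normal(0.0, 1.0, size=n)

def p_respond(y, b, a=0.0):
    """Assumed selection model: Pr(z = 1 | y) = logistic(a + b*y)."""
    return 1.0 / (1.0 + np.exp(-(a + b * y)))

b_true = 0.8
z = rng.random(n) < p_respond(y, b_true)
y_obs = y[z]

# Complete-case ("missing at random") estimate is biased upward:
mar_estimate = y_obs.mean()

# For each assumed b, correct the mean by inverse-probability weighting.
corrected = {}
for b in [0.0, 0.4, 0.8, 1.2]:
    w = 1.0 / p_respond(y_obs, b)
    corrected[b] = np.average(y_obs, weights=w)

print(f"complete-case mean: {mar_estimate:+.3f}")
for b, m in corrected.items():
    print(f"assumed b = {b:.1f}: corrected mean = {m:+.3f}")
```

The data cannot tell us which *b* is right, so the honest summary is the whole curve of corrected estimates over plausible *b*, compared against the conventional confidence interval as a benchmark.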

Seminar 4: Observational Comparisons - Thursday 12 July 1.30 pm, Balmer 313

A recent study has claimed that offenders sent to prison are less likely to re-offend than those given non-custodial sentences. Hence the slogan "Prison Works". But differences between the subjects who are given the different sentences can only be partially explained by observed covariates. Sentences are decided by judges, not by randomization. How do we allow for hidden confounders?

The model in Seminar 3 is extended to a counterfactual model in which
*z* labels the categories being compared. A randomized experiment
corresponds to *y* and *z* being conditionally independent given the
values of covariates. A sensitivity analysis suggests a much more
cautious interpretation of the prison study.
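A minimal simulation (our own, with an invented confounder strength and effect size, not the prison data) shows how a hidden confounder *u* that drives both group membership and outcome inflates the naive comparison:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sketch: a hidden confounder u raises both the chance of
# "treatment" z = 1 and the outcome y, so the naive group difference
# overstates the treatment effect.
n = 200_000
u = rng.normal(0.0, 1.0, size=n)              # unobserved confounder
z = rng.random(n) < 1.0 / (1.0 + np.exp(-u))  # allocation depends on u

true_effect = 0.5
y = true_effect * z + u + rng.normal(0.0, 1.0, size=n)

# Naive comparison of group means, confounded by u.
naive = y[z].mean() - y[~z].mean()
print(f"true effect:  {true_effect:+.3f}")
print(f"naive effect: {naive:+.3f}")
```

Since *u* is unobserved, no adjustment on measured covariates can remove this gap; a sensitivity analysis instead asks how strong *u* would have to be to overturn the conclusion.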

Seminar 5: Assessing Uncertainty - Monday 16 July 12 pm, Balmer 313

It is impossible to "correct" for selection bias without making untestable assumptions. Sensitivity analysis helps us understand and think about the problem, but does not solve it. In practice we usually end up making strong assumptions like "missing at random" for reasons of pragmatism rather than conviction. Is it possible to use such an assumption as a tentative model without assuming it is actually "true"?

For missing data, we could imagine that the complete data are eventually obtained, so that we can later test whether the "missing at random" assumption (or any other assumption about selection bias) is reasonable. A weaker interpretation of the assumption is that, *if* complete data were to be obtained, *then* we would pass this test. Using the model from Seminar 3, and conditioning on an appropriate test statistic, suggests that the extra uncertainty is captured by increasing the variance of all estimators by a factor which is bounded above by two. A similar conclusion also holds for the "hidden confounders" setting of Seminar 4.
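Numerically, the bound means standard errors grow by at most a factor of sqrt(2), so a 95% confidence interval widens by about 41%. A small illustration of ours, with invented numbers:

```python
import numpy as np

# Illustration of the bound: if the variance of an estimator is at most
# doubled, its standard error grows by at most sqrt(2).
estimate = 1.25  # e.g. an estimated relative risk (hypothetical value)
se = 0.10        # standard error under "missing at random" (hypothetical)

z975 = 1.96
ci_mar = (estimate - z975 * se, estimate + z975 * se)

se_inflated = se * np.sqrt(2.0)  # variance doubled => se times sqrt(2)
ci_cautious = (estimate - z975 * se_inflated, estimate + z975 * se_inflated)

print(f"MAR interval:      ({ci_mar[0]:.3f}, {ci_mar[1]:.3f})")
print(f"cautious interval: ({ci_cautious[0]:.3f}, {ci_cautious[1]:.3f})")
```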

- Copas, J. B. and Li, H. G. (1997). Inference for non-random samples (with discussion). J. Roy. Statist. Soc. B, 59, 55-96.
- Copas, J. B. and Shi, J. Q. (2000). Meta-analysis, funnel plots and sensitivity analysis. Biostatistics, 1, 247-262.
- Copas, J. B. and Shi, J. Q. (2000). Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal, 7232, 417-418.
- Copas, J. B. and Eguchi, S. (2001). Local sensitivity approximations for selectivity bias. (Under revision for JRSSB.)
- Copas, J. B. and Lu, G. (2001). Double the variance for missing data. (Manuscript.)

John Copas is currently Professor of Statistics at the University of Warwick, UK. He is a former editor of Applied Statistics, a Vice-President of the Royal Statistical Society, a statistical consultant to the UK Home Office, and the author of many papers in mathematical and applied statistics. His CV includes a number of discussion papers read to the Royal Statistical Society in London. His current interests include methodology for selection bias, meta-analysis, risk assessment, semi-parametric inference, and models for discriminant functions.