# Research

## Grant Awards

### Active

PI: Elena Erosheva
Funding Agent: NIH
Amount: $384,661
Date: June 1, 2013 - May 31, 2015
Title: "Respondent-driven sampling for highly structured populations"
Abstract: A network-based type of sampling technique and the corresponding set of estimates, known as Respondent-Driven Sampling (RDS), is the current method of choice for many researchers studying hard-to-reach or hidden populations. RDS exploits social networks by starting with a small set of individuals and allowing the respondents at each wave to recruit the next wave of the sample from their contacts. However, it is often unclear whether important assumptions of RDS estimators about the population-specific network structure and the chain-referral recruitment process are satisfied. In this project, focusing on population clustering structures, we will (1) infer relational structures from egocentric data that are important for RDS feasibility; (2) develop a comprehensive simulation study framework for assessing RDS feasibility; and (3) extend the model-assisted approach to inference from RDS data to account for population clustering. We will apply these new methods to unique observational data on the size and structure of social networks of older GLBT adults from the study Caring and Aging with Pride to inform computer simulations of both social networks and RDS chain-referral processes in order to systematically study the quality of potential RDS estimators in this hard-to-reach population. We will make these methods available in the R package RDSAnalyst so they can be used by applied RDS researchers to decide whether RDS is warranted in a fashion similar to the sample size computation prior to a funding request for traditional survey research.
PI: Tyler McCormick
Funding Agent: US Army Research Office (ARO)
Amount: $100,000
Date: August 3, 2012 - December 2, 2014
Title: "Taming Twitter: Using social media networks to identify deviant behavior"
Abstract: Our goal is to identify actors in social media networks who are likely to engage in non-normative or deviant behavior (such as being arrested or drunk driving). Our research will be informed by sociological theories on stigma and deviance. More specifically, we hope to use these theoretical paradigms to understand why people choose to disclose deviant behavior and the characteristics of the social networks of these individuals.
Funding Agent: NIH
Amount: $1,842,130
Date: March 1, 2012 – February 28, 2017
Title: "Probabilistic Population Projections for All Countries"
Abstract: The United Nations publishes updated estimates and projections of the populations of all the world's countries, broken down by age and sex. These are widely used by international organizations, governments, the private sector and researchers, for example for climate modeling and for assessing progress towards the Millennium Development Goals. The UN's current projections are deterministic, but assessing uncertainty about population estimates and projections is important for policy-making and other purposes. We propose to develop a fully probabilistic population projection methodology. We will develop methods for probabilistic projection of fertility and mortality, taking account of within-country and between-country correlations. We will develop methods for probabilistic projection of international migration. We will develop methods for probabilistic population projections in countries with generalized sexually transmitted infectious disease epidemics, which require special methods because the demographic impact of such diseases is massive and different from most other diseases, being concentrated among the least vulnerable parts of the population, namely young sexually active adults. We will develop methods for reconstructing past populations with uncertainty from fragmentary data. We will produce publicly available software for implementing the new methods.
PI: Adrian Dobra
Funding Agent: NSF
Amount: $342,591
Date: August 1, 2011 – July 31, 2015
Title: "ATD Collaborative Research: Statistical Ensembles for the Identification of Bacterial Genomes"
Abstract: As defined by the Centers for Disease Control and Prevention, a bioterrorism attack is the deliberate release of viruses, bacteria, or other germs used to cause illness or death in people, animals, or plants. The use of micro-organisms to cause disease is a growing concern for public health officials and national defense agencies, in light of the terrorist attacks of September 11, 2001, and the subsequent releases of anthrax to individuals in Congress and the media. There exist biological agents that, if used effectively as biological weapons, could cause a substantial public health challenge in terms of our ability to limit the damage to both our citizens and our nations. One of the scientific initiatives to reduce the threat of bioterrorism is the development of mathematical and statistical methods for the rapid identification of genome differences and the accurate classification of bacterial genomes as harmless or potentially pathogenic. The main objective of this proposal is the development of high-dimensional classification and clustering tools for this purpose. We consider three statistical approaches to the identification of bacterial genomes in a given bacterial “soup”: (1) classification by overlap enrichment; (2) comparison of empirical clusterings and consensus genomes; and (3) shrinkage estimation and model selection in hierarchical log-linear models.
Funding Agent: NIH
Amount: $1,857,552
Date: January 15, 2012 – December 31, 2016
Title: "Bayesian Estimation of Prevalence and At-Risk Group Size in Sexually Transmitted Infection Epidemics"
Abstract: The goal of this proposal is to develop new statistical methods for estimating prevalence and the size of at-risk groups in sexually transmitted infection epidemics. We also aim to estimate other policy-relevant quantities such as the number of orphans and children impacted, and treatment needs. We consider two types of epidemic: generalized epidemics, in which the disease is spread throughout the general population, and concentrated epidemics, in which the disease is largely confined to at-risk groups such as intravenous drug users, sex workers and men who have sex with men. Our goal is to develop methods appropriate for countries with sparse data, most of which are developing countries. For generalized epidemics, we propose a susceptible-infected model with a stochastic infection rate. We will develop a Bayesian approach to estimating the model from clinic data over time and sparse household surveys. We will extend the model to take account of changes in treatment availability, and to produce provincial as well as national estimates. For concentrated epidemics, we will first develop new integrated Bayesian methods for estimating the sizes of the main at-risk groups from fragmentary data, including mapping or hotspot data, behavioral surveillance data, program enrollment data and the overlaps between them. Much recent data comes from two relatively new network-based data collection methods, respondent-driven sampling (RDS) and the network scale-up method. We will develop methods for estimating unknown population size from multiple data sources, including RDS and network scale-up. We will then develop methods for estimating at-risk group size and prevalence over time, using a dynamic Bayesian model. We will produce publicly available software to implement our new methods and make them available to the research community and policy-makers.
PI: Peter Hoff
Funding Agent: NIH
Amount: $866,196
Date: September 1, 2011 - August 31, 2015
Title: "Analyzing Social Networks and Behavior"
Abstract: The goal of this grant is to develop statistical methods and software for the joint analysis of networks and nodal attribute data. The methods will be based on extensions of well-studied and familiar data analysis methods such as factor analysis, linear regression and probit models. The project will provide:
• statistical tests and descriptions of the relationship between a network and nodal attributes.
• prediction and imputation of network information based on nodal attribute data.
• prediction and imputation of nodal attributes based on network data.
• estimation and inference in the presence of missing network and nodal data.
• a class of dynamic network models that can be extended into the time domain.
• open source statistical software that will be accessible to researchers.

### Complete

Funding Agent: NIH
Amount: $145,639
Date: September 1, 2010 – October 31, 2011
Title: "ARRA – Assessing Uncertainty"
Abstract: This is an application for an Administrative Supplement to the NICHD grant, "Assessing Uncertainty in Population Projections and Linked Demographic-Disease Models via Bayesian Melding," to accelerate the work by hiring three additional research assistants. The parent grant aims to develop Bayesian melding methods for probabilistic population projections in collaboration with the United Nations Population Division (UNPD). Under the parent grant, we have already developed new methods for probabilistic population projections that perform extremely well. The UNPD has assessed the results and decided to move towards producing probabilistic projections using our methods. This will require rapid development of three elements of the parent proposal: parametric modeling of age-specific mortality rates for countries with sparse data, parametric modeling of age-specific fertility rates, and methods for probabilistic population projections for countries with generalized HIV/AIDS epidemics. Under this supplement we will hire three research assistants and each of them will work on one of these three elements.
PI: Thomas S. Richardson (PI-James Robbins)
Funding Agent: NIH, Subcontract with Harvard School of Public Health
Amount: $257,226
Date: April 1, 2010 – March 31, 2015
Title: "Analytical Methods for HIV Treatment and Co-Factor Effects (UW Subcontract)"
Abstract: Under this subcontract, Thomas Richardson will collaborate with PI James Robbins to develop theory and methods for causal models for binary outcomes. More specifically, he will develop and implement parametric and semi-parametric models for partial compliance/instrumental variable settings, and for the analysis of interactions and treatments at multiple time points, such as structural nested mean models.
PI: Thomas S. Richardson (PI-Jeff Bilmes)
Funding Agent: NSF (Sub-budget portion)
Amount: $193,468
Date: August 1, 2009 – July 31, 2012
Title: "CI-ADDO-EN: Software Infrastructure for Temporal Modeling"
Abstract: It is proposed to significantly enhance the availability of software infrastructure for graphical-model-based time series modeling. This will be done by vastly extending the graphical models toolkit (GMTK), currently the most widely used graphical-model-based software system for speech recognition. The extensions that will be performed, however, are optimized not just for speech recognition but for all time-series applications. New infrastructure will be developed that will significantly enhance GMTK’s *abilities*, *speed*, *documentation*, *source code availability*, and *pedagogical structure*. This work will give both the student and the researcher an absolutely enormous number of dynamic graphical model facilities. The primary means by which this will be done is the hiring of a full-time programmer. In addition to this, coding and planning time is requested for the PI himself, as well as 4 years of graduate student time. With its new features, GMTK will be able to perform computationally difficult time-series processing on extremely large and diverse data sets. The infrastructure provided by this proposal will enable the researcher to perform scientific experimentation with statistical time-series models that have never before been feasible.
PI: Adrian Dobra (PI -Mihai Podgoreanu)
Funding Agent: NIH (Sub-budget Portion)
Amount: $112,230
Date: September 1, 2007 - August 31, 2011
Title: "Cross Species Analysis of Myocardial Susceptibility to Perioperative Stress"
Abstract: The research objectives and research design of the overall project are described in the main application from Duke University, Mihai Podgoreanu, PI. The portion of the project to be carried out at the University of Washington relates to the processing and statistical analysis of the data generated at Duke University. The main objectives of this work are:
1. Participate in the design of the micro-array experiments.
2. Perform quality checks, preprocessing and exploratory analysis of the data.
3. Make use of state-of-the-art bioinformatics tools to combine the gene expression data across species and to identify orthologous genes between human and model organisms.
4. Develop predictive statistical models for PMI in order to identify relevant sets of genes for further study.
5. Develop statistical techniques customized to the needs of this project that perform cross-species data mining.
6. Create and visualize genome-wide association networks from micro-array data and assess their biological relevance with respect to corresponding networks from Ingenuity Pathways Analysis and Duke Integrated Genomics.
7. Closely collaborate with the Duke investigators to understand, refine and validate the predictive models for "perioperative myocardial injury" (PMI).
8. Develop a web-based resource for this project summarizing the up-to-date progress of the team.
9. Make public the findings of the project through conference talks and research articles.

PI: Adrian E. Raftery (PI -Ka Yee Yeung)
Funding Agent: NIH (Sub-budget Portion)
Amount: $615,507
Date: September 1, 2008 - June 30, 2013
Title: "Prediction and Network Construction using High-throughput Data"
Abstract: Gene expression microarray data are used extensively to classify tissues into types, including various types of tumor, and to predict survival, time to relapse, and other temporal quantities. Microarrays measure the expression levels of thousands of genes, all of which are potential predictors. This poses difficult statistical problems since the number of genes is far larger than the number of tissue samples typically available. We propose to develop Bayesian model averaging (BMA) methods to deal with this problem, and produce simple, reliable, robust and interpretable predictions of the presence or type of tumor, and probability of and time to relapse. This also provides a probabilistic gene selection method. We also propose to investigate using properties from expression networks (e.g. highly connected hub genes) to identify biologically meaningful predictive genes. We will also apply and extend the BMA methods to determine predictive network modules and known gene categories (e.g. GO categories, KEGG pathways). As part of this we will develop and extend recent methods for social networks, the latent position cluster model, to infer expression networks and to identify gene modules. The main thrusts of the research will be: (1) BMA for multi-class classification and survival analysis using gene expression data; (2) latent position cluster model for inferring expression networks and identifying network modules; (3) prediction using network modules and gene categories; (4) generation of expression perturbation data to test our network construction methods in yeast; and (5) production and distribution of software tools.
PI: Mark S. Handcock (PI -Michael Rendall)
Funding Agent: NIH (Subcontract from Rand Corp.)
Amount: $64,144
Date: August 1, 2007 - July 31, 2009
Title: "Immigration, Emigration, and Age-by-Country Structure of Mexican Cohort Lifetimes"
Abstract: This research is in part motivated by the very large discrepancy between the assumptions of the demographic models addressing the question of the effect of immigration on population aging and the empirical evidence about the migration processes of the US's single largest immigrant-contributing country, Mexico. The nature of the demographic models is that they assume that immigrants settle in the receiving country. The empirical evidence with regard to Mexico is that large numbers of immigrants do not settle in the US, and instead return to Mexico. The study aims to understand the consequences of substantial levels of return migration first on the likelihood of aging in the US among Mexican-born male and female immigrants to the US, and second on the selectivity of those who stay in the US into old age. US and Mexican 1990 and 2000 census microdata are used together first to estimate a two-region cohort migration model. The extent to which later migrant streams from Mexico contain more people who settle in the US and are increasingly balanced by gender is assessed by comparing late 1980s and late 1990s migration estimates.

The study is a first step towards developing a better understanding of the future impact of Mexican immigration on the Mexican-born population's age structure and composition by education, family, and health characteristics. It is also a first step towards a broader understanding of the impact of immigration in general on the age structure of the US population and thus on the US's ability to support an older population with greater health needs.
PI: Adrian E. Raftery (PI -Susan Joslyn)
Funding Agent: NSF (Sub-budget portion)
Amount: $281,053
Date: October 1, 2007 - September 30, 2010
Title: "DRU - Weather Forecast Uncertainty"
Abstract: Information about weather forecast uncertainty, which has been available for some time, is rarely communicated in public forecasts, although it is theoretically beneficial to weather-related decisions with important economic and safety consequences. One concern is the difficulty the general public might have in understanding such information. To date, however, very little research has investigated the psychological processes involved in understanding and using weather forecast uncertainty in realistic contexts among non-expert users. In order to determine how best to communicate forecast uncertainty to the general public, their needs and information processing requirements must be first understood. This project will conduct both naturalistic and experimental research to accomplish these goals. Then we will develop new probabilistic forecasting methods for extreme events and warnings, and new methods for verifying their performance. Finally we will design and create uncertainty products that are compatible with identified user needs and cognitive requirements. These products will be based on output from the University of Washington regional ensemble system. Some of these products, which provide weather warnings for extreme events, will require the development of innovative probabilistic forecasting methods. Weather warnings for extreme events have important safety implications but there has been little attention, from either the psychological or statistical research communities, given to a probabilistic approach to these issues. Targeted research such as this, using state-of-the-art uncertainty products among non-expert end users, is virtually unique and will provide important foundational research for the study of communicating forecast uncertainty.
PI: Elena Erosheva
Funding Agent: NIH
Amount: $121,280
Date: September 1, 2007 - August 31, 2009
Title: "Operational Definition of Chronic Disability in the National Long-Term Care Survey."
Abstract: This application requests two years of funding to investigate the operational definition of chronic disability in the National Long-Term Care Survey (NLTCS). Published studies that use data or refer to results from the NLTCS vary in the amount of detail they provide on the definition of chronic disability employed by the survey. Most of these studies, however, oversimplify the NLTCS's operational definition of chronic disability by ignoring longitudinal features of the survey. This practice may lead not only to erroneous conclusions but also to misspecified policy implications.

The NLTCS began in 1982 and now extends over six waves through 2004. It provides an important source of information on possible changes in disability over time among elderly Americans. The NLTCS data on basic and instrumental activities of daily living have been used to generate some major findings such as showing a decline in chronic disability among elderly Americans. However, complexity of the design, influenced by many decisions made in the early years of the survey, presents conceptual and analytic challenges for secondary users of the NLTCS data. In particular, the operational definition of chronic disability employed by the NLTCS is difficult to track down comprehensively. As a result, it often gets misinterpreted toward an oversimplification. Our preliminary study shows that the NLTCS by design measures some combination of chronic and short-term disability as opposed to chronic disability as commonly stated in the literature. The primary aims of this project are to develop a comprehensive description of the operational definition of chronic disability used in the NLTCS and to investigate the impact of the design choices made by the NLTCS on the measurement of chronic disability. This project will illuminate the interplay between the basic definition of chronic disability, as a disability lasting more than 90 days, and the complex longitudinal design of the NLTCS. It will also investigate whether there are subgroups of the elderly population that are differentially affected by the NLTCS design choices as they relate to the measurement of chronic disability. Finally, it will explore whether additional data from the NLTCS can be used to obtain a valid chronic disability measure. Findings of this study will benefit secondary users of the NLTCS data and future designers of longitudinal surveys that aim to track chronic disability status of the elderly over time.
PI: Adrian E. Raftery
Funding Agent: NIH
Amount: $1,310,400
Date: August 15, 2007 - May 31, 2011
Title: "Assessing Uncertainty in Population Projection Models via Bayesian Melding."
Abstract: The goal of our proposal is to develop a statistical framework for probabilistic population projections and for assessing uncertainty in linked demographic-disease models. The most common approach to communicating uncertainty in population projections is the scenario, or High-Medium-Low, approach, which has no probabilistic basis and leads to inconsistencies. We propose Bayesian melding as an alternative that can take account of all the available evidence and uncertainties about inputs and outputs from population projection models, to yield a predictive distribution of any quantity of policy interest. Uncertainty is even more important for linked demographic-disease models, when the goal is to forecast future population and disease prevalence in the presence of an epidemic. The United Nations Population Division has decided to assess Bayesian melding as a method for assessing uncertainty in its population projections. UNAIDS has decided to use Bayesian melding as the basis for assessing uncertainty in their demographic and prevalence projections. The specific aims of the research will be: (1) Methodological development of Bayesian melding to assess probabilistic forecasts, to deal with measurement and systematic errors, to provide a framework for model improvement, model selection and model uncertainty, and to develop more computationally efficient methods. (2) Develop Bayesian melding methods for probabilistic population projections, including fertility, mortality and migration. (3) Develop Bayesian melding methods for linked demographic-disease models, including the incorporation of multiple data sources, and the assessment of behavior change. (4) Produce and distribute software implementing the new methods produced by our research.
PI: Peter Hoff
Funding Agent: NSF
Amount: $400,000
Date: November 15, 2006 - October 31, 2009
Title: "Longitudinal Network Modeling of International Relations Data."
Abstract: Empirical analyses of international relations data have become one of the principal methods by which researchers evaluate theories of trade, conflict and other interactions between countries. For example, regression modeling has recently been used as a method of evaluating the question of whether or not the community of democratic countries is inherently peaceful. The data used in these analyses are inherently longitudinal, involving measured relations between nations over time. Despite this fact, and that many of the core approaches in scientifically oriented studies of international politics spring from strong policy concerns, very seldom do these statistical modeling efforts account for the temporal nature of the data, or attempt to gauge the validity of the obtained model-fitting results by comparison to unfolding events.

To address these issues, we will develop and implement statistical models for relational data that take into account (a) the complex dependencies inherent to relational, social network data, and (b) the evolution of international relations over time. This will be done by extending regression and latent factor models for network data to the time domain, allowing for the analysis of complicated longitudinal relational data using tools that are familiar to social science researchers. We will leverage the longitudinal nature of the data to evaluate candidate statistical modeling approaches and estimation methods, and will use the methodology to better understand the dynamics of international conflict and trade data.
PI: Sibel Sirakaya
Funding Agent: UW Royalty Research Fund (RRF)
Amount: $23,454
Date: September 16, 2006 - September 15, 2007
Title: "Sovereign Lending Under Limited Enforcement."
Abstract: This paper will develop a model of a two-sector small open economy with limited enforcement of foreign debt. I will study an incentive-constrained self-enforcing lending scheme in which a debtor's repayment utility never falls below its default option. A general-purpose numerical method will be developed to carry out the simulations of the model under a wide range of parameter sets to demonstrate the extent of inefficiencies due to limited enforcement.
PI: Adrian E. Raftery (PI-Alan Borning)
Funding Agent: NSF (Sub-budget portion)
Amount: $86,395
Date: January 1, 2006 - December 31, 2008
Title: "Modeling Uncertainty in Land Use and Transportation Policy Impacts: Statistical Methods, Computational Algorithms, and Stakeholder Interaction."
Abstract: In computational statistics, we are developing, analyzing, and validating techniques for representing and propagating uncertainty through a sophisticated modeling system. Our approach uses promising but preliminary results in Bayesian melding. We propose to develop new statistical methods adapted to the challenges posed by UrbanSim (a sophisticated system to model urban development), which include model stochasticity, large effects of measurement and systematic errors, high dimension of model inputs and outputs, and significant running time for the underlying model. In addition to the statistical challenges, however, undertaking this approach makes extreme computational demands; and achieving acceptable performance will require algorithmic advances, as well as sound software engineering. In human-computer interaction, among the research challenges is supporting meaningful stakeholder access to and interaction with complex situations, including representations of uncertainty. Finally, in the emerging area of science and design, an important question is: how can we design and evaluate the system overall, in a principled way, to support such basic values as accurate presentation of results (including their limitations and uncertainties) and transparency? If we succeed in this work, UrbanSim has the potential to significantly aid in public deliberation over major decisions regarding urban sprawl, economic health, sustainability, and other issues. Our system is Open Source and freely available, and has already attracted considerable interest and use. Further, the results in computational statistics should be applicable to a broad range of simulations of economic or environmental processes to inform public policy development and deliberation. Finally, the interaction techniques and findings should be applicable to a range of other stakeholder interactions with complex models and sources of information.
PI: Ross Matsueda
Funding Agent: NIH
Amount: $993,535
Date: September 1, 2005 - August 31, 2008
Title: "Life Course Trajectories of Substance Use and Crime"
Abstract: This proposal estimates trajectories of substance use and crime through the life course, and builds models to explain those trajectories. It uses three datasets: the Denver Youth Survey, National Youth Survey, and Add Health Survey. It identifies key risk factors from social learning (coercive parenting, delinquent peers, delinquent attitudes, delinquent identity), rational choice (risk of arrest, rewards of drug use), stable trait (impulsivity), and life course theories (high school graduation, employment, marriage). The analysis begins by estimating individual growth curves of marijuana, cigarettes, alcohol, other drugs, and delinquency. It then uses multi-level models to test four hypotheses: (1) A comorbidity hypothesis in which a latent variable underlies one or more trajectories. (2) A stable context and stable trait hypothesis, in which trajectory parameters are predicted by stable traits like impulsivity, and stable contexts, like SES and family functioning. (3) A life course hypothesis, in which life course transitions are treated as time-varying covariates predicting substance use trajectories. (4) A social process hypothesis, in which process variables, like delinquent peers or perceived risk of arrest, influence trajectories. We then examine latent classes of trajectories using Nagin's nonparametric mixed model. We test Moffitt's hypothesis that at least two groups--life course persistent and adolescence limited--underlie trajectories of illegal behavior, and extend the hypothesis to substance use. We revisit the comorbidity hypothesis by examining whether comorbidity varies within and across latent classes. We then test whether contextual variables and stable traits can explain the group classifications, and using twin data, estimate genetic effects. Finally, we will test our process and life course theories by testing whether their effects are moderated by latent classes (e.g., Are life course persistent drug users immune to the threat of arrest? Do adolescence limited learn from life course persistents?). Such results have important health policy implications for prevention and education.
PI: Peter Hoff
Funding Agent: NSF
Amount: $150,000
Date: October 1, 2004 - September 30, 2006
Title: "Network Modeling of International Peace and Trade Data"
Abstract: Despite the desire to focus on the interconnected nature of politics and economics at the global scale, most empirical studies assume that the major actors are not only sovereign countries, but also that their relationships are independent. This means, for example, that trade is often studied without taking into account the interdependence of one country's trade with another. Similarly in international politics, it is often assumed that the policies of one country are entirely independent of the policies in another, even though we may observe consultation between them. Statistical studies have typically assumed that these kinds of dependencies must be ignored. In contrast we employ newly developed statistical methods to reveal these heretofore hidden interdependencies among both trade and international politics. In particular, we develop and estimate statistical models for dependent dyadic data that simultaneously estimate the correlation of actions having the same initiator, the correlation of actions having the same recipient, as well as the reciprocity of actions between a pair of actors and third-order dependencies involving the clustering of three or more actors. In particular, we re-examine some of the claims of the democratic peace hypothesis to see whether they may be explained in part by the dependencies among the actions of countries. In addition, we also re-examine standard models of international trade to gauge whether international commerce can be better understood in the context of dependencies among trading patterns. Preliminary results suggest that there is considerable leverage to be gained by focusing on the dependencies in dyadic data of the kind represented by international trade as well as international conflict and cooperation.

The application of this approach has the promise of transforming empirical studies of dyadic, or transactional, data in political science, geography, and economics. In so doing, it may help to re-energize examination of the impacts of international dependencies upon international cooperation and commerce. Understanding the second- and third-order dependencies among trading countries will help provide a clearer picture of the global opportunities for, and barriers to, increased levels of global trade.

PI: James Kitts
Funding Agent: NSF
Amount: $146,223
Date: October 1 2004 - March 31 2007
Title: "Creating Dynamic Social Network Models from Sensor Data"
Abstract: In most situations our decision making is influenced by the actions of others around us. Informal networks of collaboration coexist within the formal structure of the institution and can enhance the productivity of the organization. The physical structure of an institution can hinder or encourage communication. The dynamics of communication influence the diffusion of information. Existing techniques for capturing the relationship between actions and various environmental and organizational attributes rely heavily on tedious manual techniques or situation-based approaches. We propose a data-driven approach, where we build a computational framework for learning the structure and dynamics of social networks automatically from low-level sensor data. This effort involves: (i) collecting data on human activity and interpersonal interactions using a number of complementary technologies, including machine vision, audio analysis, and wearable GPS (global positioning system) units; (ii) developing probabilistic reasoning algorithms that can robustly infer patterns of interaction even in the face of noisy and incomplete data; and (iii) modeling and analyzing these patterns of interaction using probabilistic graphical models to gain insight into the structure and dynamics of human communities. Because our technology allows us to pinpoint the time and place of individual interactions, our approach allows us to create dynamic models that reveal the evolution of social networks over time. We will be able to explore how, for example, communities are reshaped by stimuli such as the gain or loss of members or changing work assignments.

PI: Katherine Stovel
Funding Agent: NSF
Amount: $148,527
Date: April 1 2004 - March 31 2006
Title: "About a Job: Networks, Information and Segregation in Labor Markets"
Abstract: In this project we study how matching processes and structural conditions interact to produce various levels of segregation in labor markets. Empirical evidence reveals that labor markets are often highly segregated with respect to the ascribed attributes of workers. Most of the traditional explanations that have been proposed to account for segregation in labor markets can be classified as either 'supply-side' (worker qualifications or preferences) or 'demand-side' (job requirements or discrimination by employers) accounts. Neither of these accounts addresses the structure of information that links potential workers and employers, or how these actors evaluate the information they do acquire. However, how potential workers hear about vacant jobs, and how employers view referred employees, are crucial parts of the hiring process and have implications for the level of segregation in a labor market.

Our project has three specific aims: (1) to refine and extend our existing two-sided matching model of a labor market to incorporate key aspects of labor market institutions and the information structures (including networks) that are relevant for recruiting; (2) to calibrate this model with data describing empirical labor markets; and (3) to use this model as an experimental framework to generate testable hypotheses about the relative importance of supply-side, demand-side, and matching-based mechanisms that can influence the level of segregation in a labor market.

PI: Elena Erosheva
Funding Agent: NIH, Subcontract with Harvard Medical School
Amount: $8,784
Date: November 1 2003 - May 31 2004
Title: "Epidemiology: National Comorbidity Survey Replication"
Abstract: This project proposes to study patterns of co-morbidity with the Grade of Membership (GoM) model, a statistical model for discrete data analysis. The data set contains dichotomous responses that indicate the presence or absence of 16 mental disorders from about 10,000 individuals in the U.S., and 250,000 individuals around the world. Assuming the existence of extreme (basis) categories in the data, the GoM model postulates that individuals can have mixed membership in the extreme categories. The GoM model was developed in the 1970s and has been applied to a wide spectrum of health-related studies. This project proposes to use recently developed Bayesian estimation methods for analysis of co-morbidity patterns via the GoM model. Specifically, the goals of the proposal include: (1) to explore whether there is evidence of extreme categories in the data and whether patterns of co-morbidity are likely to exhibit mixed membership structure; (2) to estimate the GoM model parameters using a Bayesian framework; and (3) to determine whether the GoM model provides a reasonable fit for the co-morbidity data.

PI: Elena Erosheva
Funding Agent: NIH, Subcontract with Carnegie Mellon University
Amount: $192,000
Date: September 30 2003 - August 31 2006
Title: "Modeling Longitudinal Disability Survey Data"
Abstract: Survey data on disability among the elderly are available from several sources, most prominently the National Long Term Care Survey (NLTCS). The NLTCS began in 1982 and now extends over five waves through 1999, making it a rich source of information on possible changes in disability over time. But these data pose challenges for both statistical modeling and the protection of confidentiality of the information provided by survey respondents, especially when the data for individuals are linked across waves. Most statistical approaches used to analyze NLTCS data are based on disability scales that cannot account for the complexity of disability manifestations. Attempts to deal with such complexity include traditional multivariate methods for both discrete and continuous data, and approaches based on the grade of membership model. These methods typically require either making heroic simplifying assumptions or need to be adapted. This project aims to develop new statistical models and approaches for the analysis of such survey data. It also proposes to take a fresh look at the risk of inadvertent disclosure of information on NLTCS respondents and to develop new approaches to protect against disclosure while preserving access to the maximal amount of information in the data required for their proper analysis using the new models and methods.

PI: Mark S. Handcock
Funding Agent: NIH/NICHD
Amount: $1,095,133
Date: February 1 2003 - January 31 2007
Project Website: http://www.csss.washington.edu/Research/combining
Title: "Combining Survey and Population Data on Births and Family"
Abstract: This study's overall objective is to develop statistical methods for combining surveys and population data collections (especially of births and marital and non-marital unions) for the improved estimation of these birth and childhood circumstances. The family and socio-economic circumstances of children's parents at birth and during the childrearing years are fundamental determinants of children's health and well-being. Specific aims are to (1) develop and test statistical methods to combine multiple sources of survey data and population data; (2) improve estimates of the parameters of fertility and marital and non-marital union regression equations, and of simulated life-course fertility and union duration measures; and (3) expand and disseminate the statistical capabilities to the demographic community. It will be shown that combining population and survey data in the estimation allows for more modeling detail than when using population data alone, and more precise estimates than when using survey data alone. Further statistical development will allow for survey data to be combined from more than one data set, thereby obtaining some of the same benefits as from combining survey and population data. Methods for incorporating degrees of inaccuracy in the population data, and imperfect matches between the population collection and the survey's sampling frame and collection methods, will also be developed and applied. Comparative applications of the methods between the U.S. and the U.K. will be made to explore their advantages and challenges over a greater range of population data collection types than available in the U.S. alone. Applications across multiple developed countries will demonstrate that methods for combining survey and population data can be used to overcome the otherwise severe restrictions placed on cross-national comparisons.

PI: Elena Erosheva
Funding Agent: The Center for Statistics and the Social Sciences
Amount: $37,424
Date: 2003-2004 Academic Year
Title: "Statistics and Social Work Collaborative Research Initiative."
Abstract: This project initiates the development of collaborative research between the Center for Statistics and the Social Sciences and the School of Social Work. Potential involvement includes collaborative work on two projects led by senior Social Work faculty members, Roger Roffman and David Takeuchi.

Roger Roffman's project is an intervention study that focuses on adults who are both batterers and substance abusers. The primary interest is in developing outcome measures that would allow the assessment of the efficacy of the experimental intervention. The study plan is unique in its focus on a group of people who are both substance abusers and abusive to their intimate partners. Existing measurement and intervention procedures have been developed with a focus on either substance abuse or domestic violence. Studying the co-occurrence of these two behaviors will require new methodological developments in this area.

David Takeuchi's research examines how mental illness and medical care are distributed across race, ethnicity, and socio-economic status. Takeuchi's current research project is the National Latino and Asian American Study (NLAAS), which is intended to investigate the social, cultural, and contextual factors that are associated with mental illness and help-seeking in these large ethnic categories. One facet of NLAAS includes extensive questions on quality of life to better understand how Asian Americans and Latinos cope with stressful conditions of life. Sophisticated statistical analyses have not typically been performed on these quality of life indicators. The goals of this research include determining the psychometric properties of these scale items and assessing the social and cultural factors associated with these different levels of functional status.

PI: Adrian E. Raftery
Funding Agent: Office of Naval Research
Amount: $5,156,827
Date: May 1, 2001
Title: "Integration & visualization of multi-source information for mesoscale meteorology: statistics & cognitive approaches to visualizing uncertainty."
Abstract: Current methods of meteorological forecasting produce predictions with unknown levels of uncertainty, particularly in regions with few observational assets. Forecast errors and uncertainties also arise from shortcomings in model physics. With the ability to estimate the uncertainty in predictions, forecasters would have a powerful tool to make decisions and to judge the likelihood of mission success. The goals of our proposed project are to develop methods for evaluating the uncertainty of mesoscale meteorological model predictions, and to create methods for the integration and visualization of multisource information derived from model output, observations and expert knowledge. We will do this by extending the recently developed Bayesian melding approach. We will also develop statistical methods for combining results from model ensembles, taking account of model uncertainty. This will build on the general idea of Bayesian model averaging. We will also develop tools and methods for visualizing predictions of quantities of interest and the uncertainty about them by (i) choosing appropriate quantities of interest for display based on cognitive factors, and (ii) developing appropriate plots, maps, three-dimensional displays, and video displays for decision support.

PI: Mark S. Handcock
Funding Agent: National Institutes of Health
Amount: $259,352
Date: July 1, 2001
Title: "Modeling HIV and STDs in Drug User and Sexual Networks"
Abstract: Infectious diseases are distinguished from other diseases by being transmissible. Our understanding of disease transmission, and the preventive strategies that arise from such understanding, are therefore rooted in an implicit or explicit theory of population transmission dynamics. For infectious diseases like STDs and BBIs, which are only transmitted through the exchange of bodily fluids, the structure of the transmission network plays a particularly critical role. The epidemiology of these diseases - how quickly they spread and who gets infected - is driven by the network of person-to-person contact. Mathematical models of this process have provided a number of insights that have led to changes in STD control strategies. With the advent of HIV, however, new modeling challenges have emerged. In this research we develop new models for drug user and sexual networks as a means to understand the factors that influence the spread of HIV and other STDs.

PI: Mark S. Handcock
Funding Agent: National Science Foundation
Amount: $23,526
Date: January 1, 2001
Title: "Collaborative Research: Hybrid Population-Average and Individual-Specific Models for Clustered Longitudinal Data"
Abstract: We propose the merger of two good ideas in the context of social and behavioral statistical models for longitudinal data. The first, model-based clustering, allows the researcher to locate subgroups in a population, should they exist, for a larger class of datasets. The second is to model individual-specific variation using data-adaptive proto-splines. We will extend the set of available heterogeneity models for longitudinal data to include several mean and covariance structures via latent classes, and in turn model the covariance structures using the adaptive proto-spline class of Hancock and Scott (1999). We will also develop the theory and practice of inference for these models, develop objective measures for model comparison and model goodness-of-fit, and will also make appropriate software publicly available. As a case study of the use of these methods, we will analyze long-term trends in wage inequality and the dynamics of change of these trends in the period from the mid-1960s to the mid-1990s. The analysis will be based on young workers from the National Longitudinal Survey (NLS). This analysis will be important to help characterize changes in the experiences of workers in the post-industrial economy.

PI: Mark S. Handcock
Funding Agent: National Science Foundation
Amount: $27,957
Date: August 1, 2000
Title: "Collaborative Research: Nonparametric Models for Incomplete Clustered Data with Applications to the Social Sciences"
Abstract: Clustered data are very common in social sciences research and other fields. For example, in a study involving school children, school districts form clusters and schools form sub-clusters within each cluster. In this context, researchers want to explain a certain variable of interest (the response variable) in terms of certain categorical variables (factors) while adjusting for the presence of other incidental variables (covariates) which might influence the response. This project aims at developing statistical methods for analyzing such data. Though the classical statistical methods accommodate the lack of independence which is inherent to data arising from cluster sampling, they are very often unsuitable for data arising from social science research. This is because they require a set of restrictive assumptions (such as normality and homogeneity of the residuals, linearity, scale dependence) which are rarely satisfied in the social sciences. In addition, data in social sciences research are often incomplete (censored or missing), in which case inference based on the classical statistical models cannot be implemented. Alternative approaches developed to deal with these issues also rely on assumptions which may or may not be satisfied for any given application. The research for this project will focus on the development of statistical models and methods that are free of restrictive assumptions. A central component of the project is the application of these methods to questions regarding routine activities and deviant behavior, and to the question of whether there has been a secular rise in job instability among young adults over the past three decades using two cohorts from the National Longitudinal Survey (NLS). Programs for formal hypothesis testing, graphical summaries of effects, and exploratory data analysis plots will be made available on the web for use by the social sciences community.

PI: Elaina Rose
Funding Agent: National Institutes of Health
Amount: $71,054
Date: September, 2002
Title: "Marriage and Assortative Mating"
Abstract: The role of marriage has undergone profound change in recent decades. Changes in the patterns of "assortative mating," i.e., in the types of partners that individuals choose when they do form unions, likely accompany the changes in marriage patterns. The objective of this proposal is to expand the literature on marriage and assortative mating by refining the estimates of marriage and assortative mating patterns, and developing an econometric model of the joint union status and partner choice outcomes. The proposal includes a pilot study of the patterns in marriage and assortative mating with respect to education, which suggests the following specific research questions: (1) How does the relationship between education and the likelihood of marriage differ when cohabitors are treated as married couples? (2) Are assortative mating patterns different for cohabiting and married couples? (3) What are the patterns in assortative mating with respect to characteristics such as parents' education and "unobserved ability"? (4) Can the cohort differences in the marriage patterns be explained by differences in observables such as education, family policy, or marriage market conditions? (5) Can changes in assortative mating be explained by changes in the pattern of selection into marriage? (6) Do women face a tradeoff between partner quality and union "cohesion"?

PI: Adrian E. Raftery
Funding Agent: National Institutes of Health
Amount: $1,090,322
Date: September, 2002
Title: "Model-Based Clustering Methods for Medical Images"
Abstract: Many problems in the health and medical sciences have at their core the task of finding cohesive groups of observations in data. Examples include a group of voxels in an MRI image that correspond to a tumor, genes whose mRNA expression levels track one another, and tissues whose gene expression patterns are similar. The statistical method for solving this problem is cluster analysis. Most cluster analysis methods used in practice have been ad hoc, but recently the development of more formal model-based clustering methods has provided a principled framework for answering central questions such as: How many clusters are there? Which clustering method should be used? How should one deal with outliers?

Our main goal is to develop new methods for problems in model-based clustering that arise in medical image segmentation and gene expression data. The three major thrusts will be the development of: (A) model-based clustering methods for large numbers of variables; (B) automated medical image segmentation methods appropriate for dynamic MRI breast images; and (C) model-based clustering methods for microarray gene expression data aimed at finding groups of genes that function together, and groups of tissues or tissue types that have similar gene expression patterns.

PI: Kevin Quinn
Funding Agent: National Science Foundation
Amount: $51,133
Date: September, 2002
Title: "Collaborative Research: The Dimensions of Supreme Court Decision-Making, 1946-2000"
Abstract: We propose new statistical models that can be used to gain a better understanding of the dynamics of decision making on the U.S. Supreme Court. Substantively, we hope to obtain better answers to the following questions: To what extent do the policy preferences of justices outweigh purely formal, legal concerns when deciding cases? Have the decisions of lower courts become more liberal over time? In what manner have the policy outputs of the Court changed over time? The small number of justices on the Court creates a number of potential inferential problems. To alleviate these problems we adopt a Bayesian inferential approach. An additional benefit of such an approach is that it allows us to include previous qualitative work on the Court into our statistical models in a fairly direct fashion. The models we propose allow us to simultaneously estimate the unobserved ideal policy positions of the justices, the unobserved policy content of the lower courts' rulings, and the effects of measured covariates on the decision calculus of individual justices.

PI: Peter Hoff
Funding Agent: Office of Naval Research
Amount: $275,000
Date: September, 2002
Title: "Statistical Modeling of Dependent Network Data"
Abstract: Network data summarize relational information among interacting units, and are common in many areas of research. Applications include international conflict, international trade, telephone calling patterns, chain-of-command networks in businesses and other organizations, the behavior of epidemics and the interconnectedness of the World Wide Web. Such data differ from standard data in that they consist of observations on pairs of experimental units, and that the observations among pairs are typically not independent, but dependent in complicated ways. Past efforts at modeling dependencies in networks have focused on exponentially parameterized random-graph models (often referred to as the p* class of models), which have been difficult to estimate and often give a poor fit to actual network data. Additionally, such models have focused on the case of binary responses, and have difficulty modeling common types of network data such as continuous, count, time-series, and multivariate data. In contrast, the proposed project will develop a flexible modeling strategy for dependent network data using a novel random effects approach, which can easily be incorporated within well-known statistical methods such as linear regression, generalized linear models, semiparametric regression, and others. Preliminary results suggest such an approach has several advantages over current practice. The proposed approach allows for prediction and hypothesis testing; lends itself to a model-based method of network visualization; is highly extendible and interpretable in terms of well-known statistical procedures; and has a feasible means of exact parameter estimation.