Prospective cohort studies, or longitudinal studies, are generally regarded as more definitive than case-control studies because they avoid several biases that can affect case-control studies. In particular, the cohort study design entails enrolling a disease-free population at baseline, assessing exposures at baseline and at subsequent time points, and then comparing the ultimate occurrence of disease among the exposed versus the unexposed [1]. Since exposure is assessed prior to the occurrence of disease, cohort studies are not subject to temporal ambiguity or recall bias.
While widely used in epidemiologic research, cohort studies have rarely been used in linkage studies. The preferred study designs for linkage analysis have been large pedigrees heavily loaded with affected individuals, or affected sibling pairs. However, the incorporation of family information into large cohort studies with continued recruitment, such as the Framingham Heart Study, has provided a valuable opportunity to undertake linkage analyses in a population-based cohort. Such studies will allow for temporal linkage analyses and will provide information about genetic risks that is directly applicable to the general population.
One difficulty with using repeated measures from cohort studies in linkage analyses is the high likelihood of missing data. Missing data is common in longitudinal studies and may produce spurious or weakened results, complicating their interpretation [2]. In cohort studies, missing data can arise from subject attrition at individual follow-up points or from complete withdrawal from the study [3].
The effect of missing data on one's results depends on the process underlying the incomplete data collection. This process can be classified as follows: 1) missing completely at random (MCAR), wherein the missingness is independent of both the observed and unobserved data; 2) missing at random (MAR), wherein the missingness depends only on the observed data; and 3) not missing at random (MNAR), wherein the missingness depends on the unobserved values themselves, even after accounting for the observed data [4]. The presence of the latter two situations may introduce follow-up bias into a study. MAR is a less restrictive assumption than MCAR because it allows the probability that a value is missing to depend on the observed data [5].
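The practical difference between the three mechanisms can be illustrated with a small simulation. The sketch below is not drawn from the Framingham data; it assumes a hypothetical cohort with a standardized baseline trait that is always observed and a correlated follow-up trait that may be missing, and the function name `simulate`, the effect sizes, and the missingness probabilities are all illustrative choices. It compares the mean of the follow-up trait in the full cohort with the mean among subjects whose follow-up value was actually recorded.

```python
import math
import random
import statistics


def simulate(mechanism, n=20000, seed=13):
    """Return (full_cohort_mean, observed_mean) of a simulated follow-up trait.

    All parameters are illustrative assumptions, not estimates from real data.
    """
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        baseline = rng.gauss(0.0, 1.0)                    # always observed
        followup = 0.7 * baseline + rng.gauss(0.0, 0.7)   # may be missing
        if mechanism == "MCAR":
            # Missingness is unrelated to any trait value.
            p_missing = 0.3
        elif mechanism == "MAR":
            # Missingness depends only on the observed baseline value.
            p_missing = 1.0 / (1.0 + math.exp(-2.0 * baseline))
        elif mechanism == "MNAR":
            # Missingness depends on the follow-up value that goes unrecorded.
            p_missing = 1.0 / (1.0 + math.exp(-2.0 * followup))
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        full.append(followup)
        if rng.random() >= p_missing:
            observed.append(followup)
    return statistics.fmean(full), statistics.fmean(observed)


if __name__ == "__main__":
    for mechanism in ("MCAR", "MAR", "MNAR"):
        full_mean, observed_mean = simulate(mechanism)
        print(f"{mechanism}: full={full_mean:+.3f} observed={observed_mean:+.3f}")
```

Under these assumptions, the observed mean should track the full-cohort mean under MCAR and be shifted downward under MAR and MNAR, because subjects with higher trait values are more likely to be lost. The MAR shift could in principle be corrected using the baseline values, whereas the MNAR shift could not be corrected from the observed data alone.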
Methods for handling missing data can be grouped into four types of procedures: 1) complete subject; 2) weighting; 3) imputation-based; and 4) model-based [4]. The complete-subject approach, the simplest of these methods, removes all individuals with missing data. If the missingness is not random among the exposed and unexposed groups, complete-subject analysis may introduce a bias. In addition, complete-subject analysis may be less efficient than other approaches because it discards the observed data of incomplete subjects [6]. In the weighting approach, individuals with and without missing data are grouped on variables recorded for both. The nonrespondents receive a weight of zero, while the matching respondents are assigned a proportionately inflated weight to compensate for the missing values. The imputation-based procedures estimate and fill in the missing values, commonly using mean- or regression-based values, allowing one to use standard analysis methods on a completed data set. Finally, model-based procedures define a model for the missing data and base inferences on the likelihood or posterior distribution under that model [4]. The impact of such methods on linkage analysis of longitudinal data is unclear. Therefore, we investigate here the effect of using six different techniques for handling missing data on linkage analyses in the Genetic Analysis Workshop 13 (GAW13) Framingham data.
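To make the first three categories concrete, the following sketch applies a complete-subject estimate, mean imputation, regression imputation, and a weighting-class adjustment to a simulated cohort. It is a generic illustration only and should not be read as the six techniques evaluated in this study. The cohort is hypothetical, with follow-up values missing at random given an always-observed baseline trait, and every function name, class boundary, and parameter value is an assumption made for the example.

```python
import math
import random
import statistics


def make_cohort(n=20000, seed=13):
    """Simulate (baseline, followup, is_missing) with MAR missingness (assumed)."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        baseline = rng.gauss(0.0, 1.0)
        followup = 0.7 * baseline + rng.gauss(0.0, 0.7)
        p_missing = 1.0 / (1.0 + math.exp(-2.0 * baseline))
        cohort.append((baseline, followup, rng.random() < p_missing))
    return cohort


def complete_subject_mean(cohort):
    """Drop every subject whose follow-up value is missing."""
    return statistics.fmean(f for _, f, miss in cohort if not miss)


def mean_imputation_mean(cohort):
    """Fill each missing value with the mean of the observed values."""
    fill = complete_subject_mean(cohort)
    return statistics.fmean(fill if miss else f for _, f, miss in cohort)


def regression_imputation_mean(cohort):
    """Fill each missing value with its prediction from the baseline trait."""
    obs = [(b, f) for b, f, miss in cohort if not miss]
    mean_b = statistics.fmean(b for b, _ in obs)
    mean_f = statistics.fmean(f for _, f in obs)
    slope = (sum((b - mean_b) * (f - mean_f) for b, f in obs)
             / sum((b - mean_b) ** 2 for b, _ in obs))
    intercept = mean_f - slope * mean_b
    return statistics.fmean(
        intercept + slope * b if miss else f for b, f, miss in cohort
    )


def weighting_class_mean(cohort, width=0.5, lowest=-3, highest=2):
    """Weight respondents up within classes defined by the baseline trait."""
    classes = {}
    for b, f, miss in cohort:
        # Fixed-width baseline classes, with the tails pooled into end classes.
        key = min(max(math.floor(b / width), lowest), highest)
        classes.setdefault(key, []).append((f, miss))
    weighted_sum = total_weight = 0.0
    for members in classes.values():
        respondents = [f for f, miss in members if not miss]
        if not respondents:
            continue  # a class without respondents cannot be reweighted
        # Nonrespondents get weight zero; each respondent stands in for
        # len(members) / len(respondents) subjects of the class.
        weight = len(members) / len(respondents)
        weighted_sum += weight * sum(respondents)
        total_weight += weight * len(respondents)
    return weighted_sum / total_weight


if __name__ == "__main__":
    cohort = make_cohort()
    print("full cohort     ", statistics.fmean(f for _, f, _ in cohort))
    print("complete subject", complete_subject_mean(cohort))
    print("mean imputation ", mean_imputation_mean(cohort))
    print("regression imp. ", regression_imputation_mean(cohort))
    print("weighting class ", weighting_class_mean(cohort))
```

In this setting the complete-subject mean is biased because subjects with high baseline values are preferentially lost, and mean imputation reproduces the same biased mean while also understating the variance. The regression and weighting estimates use the baseline information and should therefore fall closer to the full-cohort mean, although the weighting estimate retains some residual bias within each class.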