Exploratory factor analysis allows the researcher to investigate the structure of complex data by looking for commonalities between variables. These commonalities are expressed as a function of an unobserved latent variable that manifests in the system in question through the measured variables with which it is correlated. This concept fits well with classical genetics, which interprets the latent variables as underlying genetic factors exhibiting pleiotropic effects on the system of measured traits.
In testing genetic factor models, a key issue is identifiability. Appropriately applied rotations can modify the loadings and the resulting factor scores, so factor solutions are not unique. Thus, the factor loadings to be used in genetic modeling must correspond to what is known about the biology of the system under investigation. In our study, the first and third factors can be identified as components of lipid homeostasis. The factor loadings for the first factor (see Table 1) have the highest degree of identifiability, and seem to describe a latent variable that is highly correlated with high-density lipoprotein (HDL) levels and inversely correlated with triglycerides, smoking, and body mass index (BMI). These loadings are in line with the current understanding of the biology of high-density lipoproteins. Thus, in some way, this factor can be thought of as having a significant effect on HDL levels (perhaps coding for a component of HDLs). Looking across time points, this definition appears to remain fairly stable, differing only at later time points with the addition of blood pressure and fasting glucose as predictive factors and the removal of smoking (perhaps as older individuals cease, or at the very least reduce, the smoking habit of their earlier years). It is interesting to note that, although apparently similar in factor loadings, the factors from the first and subsequent time points differ significantly, as exemplified by the correlations between factor scores: the correlation between time1 scores and those of the other time points for this factor is low (on the order of 0.1–0.2), whereas the correlations among the other time points are fairly high (on the order of 0.8–0.9). This also explains the difference observed later in peak LOD scores and positions in time1.
The exact source of this difference is unknown, but may be due to numerical instability in the factor solution for the first time point (the loading of 1.0 on HDL levels, and the concomitant communality of 1.0 for HDL as a variable, indicates the presence of a Heywood instability, or a problem with the factor solution).
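The two diagnostic ideas in the preceding discussion, rotational non-uniqueness and the Heywood condition, can be illustrated with a minimal sketch. It assumes standardized variables and an orthogonal two-factor model; the loading values and the function names (`communalities`, `heywood_flags`, `rotate`) are hypothetical and are not taken from Table 1 or from the software used in this study.

```python
import numpy as np

def communalities(loadings):
    """Row sums of squared loadings (orthogonal factor model)."""
    loadings = np.asarray(loadings, dtype=float)
    return np.sum(loadings ** 2, axis=1)

def heywood_flags(loadings, tol=1e-8):
    """True where a standardized variable's communality reaches or exceeds 1."""
    return communalities(loadings) >= 1.0 - tol

def rotate(loadings, theta):
    """Apply a planar orthogonal rotation to a two-factor loading matrix."""
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s], [s, c]])
    return np.asarray(loadings, dtype=float) @ rotation

# Hypothetical loadings for illustration only (not the values in Table 1).
example = np.array([
    [1.0, 0.0],   # an HDL-like variable loading 1.0 on the first factor
    [-0.5, 0.3],
    [-0.4, 0.2],
])

flags = heywood_flags(example)        # flags the first variable only
rotated = rotate(example, np.pi / 6)  # different loadings, same fit

# Rotation changes individual loadings but not the communalities or the
# common-factor covariance implied by the model.
same_communalities = np.allclose(communalities(rotated), communalities(example))
same_covariance = np.allclose(rotated @ rotated.T, example @ example.T)
```

Because communalities are unchanged by orthogonal rotation, a communality of 1.0 such as the one noted for HDL cannot be removed by choosing a different rotation; under these assumptions it reflects the fitted solution itself.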
Variance component-linkage analysis using the predicted factor scores highlights several regions of the genome. Most prominent in these analyses are the regions on chromosome 6 (at roughly 127 cM) and chromosome 7 (at 160 cM), which correspond with previous studies of HDL variability [5, 6]. The stability of the location estimates for these peaks (and of their LOD scores) is impressive, and lends considerable confirmatory evidence to support our multivariate modeling.
Factor 3 may also be related to some component of the cholesterol homeostasis mechanism, as it seems to be correlated with triglyceride levels, fasting glucose levels, HDL levels, BMI, and systolic blood pressure. As with Factor 1, correlations between the different time point factor models are high. This component appears to be distinct from that described by Factor 1 based on the location of linkage signals produced by the Factor 3 structures.
For the other two factors, identifiability is an issue. Factor 2 resembles Factor 1 to some degree, in that it is composed of contributions from triglyceride levels, total cholesterol levels, and blood pressure measures. However, this factor does not provide any unique (or significant) linkage results on its own, in spite of good correlations between factor scores derived from each time point. These findings suggest that there is a minimum correlation between factor scores below which significant, reproducible findings are not obtained. Based on the correlations in Table 2, correlations of 0.7 or more appear to be required to obtain reproducible linkage findings from factor models. In this way, the table of factor score correlations could be useful in predicting the strength of subsequent linkage analyses.
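The screening step suggested above can be sketched as follows: correlate the scores for one factor across time points and list the pairs that meet the cutoff. This is a minimal illustration, not the procedure used to build Table 2. The scores are simulated, the function names are hypothetical, and the 0.7 cutoff is the empirical value suggested by our data rather than a general rule.

```python
import numpy as np

def score_correlations(scores):
    """Correlation matrix of factor scores; columns are time points."""
    scores = np.asarray(scores, dtype=float)
    return np.corrcoef(scores, rowvar=False)

def reproducible_pairs(corr, threshold=0.7):
    """Time-point pairs (0-based indices) whose correlation meets the cutoff."""
    n = corr.shape[0]
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if corr[i, j] >= threshold
    ]

# Simulated scores for one factor (illustrative only, not study data):
# the first time point is weakly related to a shared signal, the later
# two time points are strongly related to it.
rng = np.random.default_rng(0)
n_individuals = 500
shared = rng.standard_normal(n_individuals)
time1 = 0.15 * shared + rng.standard_normal(n_individuals)
time2 = shared + 0.3 * rng.standard_normal(n_individuals)
time3 = shared + 0.3 * rng.standard_normal(n_individuals)

scores = np.column_stack([time1, time2, time3])
corr = score_correlations(scores)
pairs = reproducible_pairs(corr)
```

In this simulation only the pair formed by the second and third time points meets the cutoff, mirroring the pattern described for Factor 1, where the first time point correlates weakly with all later ones.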
Factor 4 also defies our attempts at identification, as no clear pattern is observable in the factor loadings. These factor models may actually demonstrate the effect of extraneous variables in a factor model, as extraneous variables should appear in factor structures independently of the other variables. Because this happens in at least two of the time points for this factor, it is possible that these structures simply represent the residual variation left in the system after the effects of other latent variables have been taken into account. Additionally, no clear, reproducible linkage signals appear.