Description of animals and genotyping data
The Kinsella beef composite population has been described in our recent studies on QTL mapping, candidate gene identification and genomic selection [e.g., 14-17]. Here we recapitulate the essential details of this population. It was produced by crossing Angus, Charolais, or University of Alberta hybrid bulls with a hybrid dam line. The hybrid dam line was obtained by crossing among three composite cattle lines, namely beef synthetic 1 (SY1), beef synthetic 2 (SY2) and dairy × beef synthetic (SD), for more than 10 years after 30 years (1960-1990) of single-sire crossbreeding. SY1 was composed of approximately 33% each of Angus and Charolais, 20% Galloway, 5% Brown Swiss, and small amounts of other breeds. SY2 was composed of approximately 60% Hereford and 40% other beef breeds, mainly Angus, Charolais and Galloway. SD was composed of approximately 60% dairy cattle (Holstein, Brown Swiss, or Simmental) and approximately 40% other breeds, mainly Angus and Charolais [22]. Blood samples of 1023 beef steers were collected and genotyped using the Illumina Infinium genotyping system with the BovineSNP50 Beadchip. All steers were produced from multi-sire breeding group natural service on pasture. The sire genotype of each calf was determined in a parentage test using the BovineSNP50 Beadchip, but the parentage of about 100 animals remained unknown because these animals were either sires at initial crossing or sires without progeny. There were 116 sire families, with family sizes ranging from one to 54 progeny. It is estimated that there have been about 4-5 generations since initial crossing.
A total of 51,828 SNP markers were originally obtained in the genotyped data. These markers were distributed across the 29 autosomes and one sex chromosome of the bovine genome. For our analyses, we used only 43,124 SNPs after removing markers that (i) were monomorphic, (ii) had an unknown genomic position, (iii) were on the sex chromosome, (iv) had a minor allele frequency (MAF) of ≤ 2% [1], or (v) had a chi-square value > 600 for the HWD test.
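The five exclusion criteria above can be sketched as a single filter. This is only an illustration under stated assumptions: the `Snp` record layout, the `"X"` label for the sex chromosome, and the one-degree-of-freedom form of the HWD chi-square, n·D²/(pq)², are choices made here and are not taken from the original analysis pipeline.

```python
from typing import NamedTuple, Optional


class Snp(NamedTuple):
    """Genotype counts for one SNP (hypothetical record layout)."""
    name: str
    chrom: Optional[str]  # None when the genomic position is unknown
    n_AA: int
    n_Aa: int
    n_aa: int


def allele_freq(s: Snp) -> float:
    """Frequency of allele A from genotype counts."""
    n = s.n_AA + s.n_Aa + s.n_aa
    return (2 * s.n_AA + s.n_Aa) / (2 * n)


def hwd_chi_square(s: Snp) -> float:
    """Assumed HWD test statistic: n * D^2 / (p * q)^2 with D = P_AA - p^2."""
    n = s.n_AA + s.n_Aa + s.n_aa
    p = allele_freq(s)
    q = 1.0 - p
    d = s.n_AA / n - p * p
    return n * d * d / (p * q) ** 2


def passes_qc(s: Snp, sex_chrom: str = "X",
              maf_min: float = 0.02, hwd_max: float = 600.0) -> bool:
    """Return True if the SNP survives criteria (i)-(v)."""
    if s.chrom is None or s.chrom == sex_chrom:   # (ii) and (iii)
        return False
    p = allele_freq(s)
    maf = min(p, 1.0 - p)
    if maf <= maf_min:                            # (i) MAF = 0, and (iv)
        return False
    return hwd_chi_square(s) <= hwd_max           # (v)
```

Checking MAF before the HWD statistic keeps monomorphic markers from reaching the division by (pq)².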
Components of zygotic linkage disequilibrium
For two loci, each with two alleles, A and a at locus A and B and b at locus B, there are nine possible genotypes (ten if the coupling and repulsion double heterozygotes are distinguishable). Following Yang [23], we wrote the frequencies of these genotypes as $P^{uy}_{vz}$, which result from the union of gametes uy and vz with u, v = A or a, and y, z = B or b. The genotypic frequencies at individual loci are the marginal totals of the appropriate two-locus genotypic frequencies. For example, the frequency of genotype AA is
$$P^{A}_{A} = P^{AB}_{AB} + P^{AB}_{Ab} + P^{Ab}_{Ab} \tag{1}$$
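With the genotypic frequencies at locus A, the frequency of allele A is $p_A = P^{A}_{A} + \tfrac{1}{2}P^{A}_{a}$, where $P^{A}_{A}$ and $P^{A}_{a}$ are the frequencies of genotypes AA and Aa, and that of allele a is $p_a = 1 - p_A$.
Departures from HWE at locus A are measured by $D_A = P^{A}_{A} - p_A^2$, and those at locus B by $D_B = P^{B}_{B} - p_B^2$.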
In a random mating population, HWD disappears (i.e., $D_A = D_B = 0$). In a non-random mating population, nonzero HWD is measured by the fixation index, which can be either positive when there is inbreeding or negative when inbreeding is avoided. For example, the HWD at locus A can be written as $D_A = f_A p_A p_a$, where $f_A = D_A / (p_A p_a)$ is the fixation index at locus A, with $-1 \le f_A \le +1$.
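As a small illustration of the single-locus quantities just defined, the following sketch computes the allele frequency, the HWD measure and the fixation index from genotype counts. The function name and argument layout are hypothetical and are used here only for illustration.

```python
def single_locus_stats(n_AA: int, n_Aa: int, n_aa: int):
    """Return (p_A, D_A, f_A) from genotype counts at one biallelic locus."""
    n = n_AA + n_Aa + n_aa
    P_AA = n_AA / n
    P_Aa = n_Aa / n
    p_A = P_AA + 0.5 * P_Aa          # allele frequency from genotype frequencies
    p_a = 1.0 - p_A
    D_A = P_AA - p_A ** 2            # departure from Hardy-Weinberg proportions
    f_A = D_A / (p_A * p_a)          # fixation index; undefined if monomorphic
    return p_A, D_A, f_A
```

With counts (40, 20, 40) the heterozygote deficit gives a positive fixation index (0.6), whereas counts (0, 100, 0) give the lower bound of -1.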
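It was established [11, 12] that the total zygotic LD between loci A and B could be defined in terms of zygotic LDs for individual genotypes, with each zygotic LD being a complex function of digenic, trigenic and quadrigenic disequilibria. For example, the zygotic LD for double homozygote AABB ($\omega^{AB}_{AB}$) would simply be the deviation of the frequency of the double homozygote from the product of the frequencies of the corresponding homozygotes at loci A and B,
$$\omega^{AB}_{AB} = P^{AB}_{AB} - P^{A}_{A}P^{B}_{B} = 2p_Ap_B\left(D^{AB} + D^{A}_{B}\right) + 2p_AD^{AB}_{B} + 2p_BD^{AB}_{A} + \left(D^{AB}\right)^2 + \left(D^{A}_{B}\right)^2 + D^{AB}_{AB} \tag{2}$$
where each genic disequilibrium (D) is the deviation of a frequency from that based on random association of genes and accounting for any lower order disequilibria: $D^{AB}$ and $D^{A}_{B}$ are the gametic and non-gametic digenic disequilibria, $D^{AB}_{A}$ and $D^{AB}_{B}$ are the trigenic disequilibria, and $D^{AB}_{AB}$ is the quadrigenic disequilibrium. The usual gametic LD ($D^{AB}$) would be the deviation of the frequency of gamete AB from the product of the frequencies of allele A at locus A and allele B at locus B, with $D^{AB} = P^{AB} - p_Ap_B$.
When zygotes arise from random union of gametes, as often assumed in most LD studies, all non-gametic disequilibria including HWD would disappear (e.g., $D_A = D_B = D^{A}_{B} = 0$). In this case, the zygotic LD for genotype AABB ($\omega^{AB}_{AB}$) would reduce to $\omega^{AB}_{AB} = 2p_Ap_BD^{AB} + \left(D^{AB}\right)^2$.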
This formula is the basis for possible use of double homozygosity to measure gametic LD in a random mating population [24, 25].
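Since the two types of double heterozygote (AB/ab and Ab/aB) in our unphased SNP data could not be distinguished, we used the composite LD ($\Delta_{AB}$) and a composite quadrigenic component ($\Delta_{AABB}$) in place of gametic and quadrigenic disequilibria. Thus, the zygotic LD for genotype AABB ($\omega^{AB}_{AB}$) in equation (2) was rewritten as
$$\omega^{AB}_{AB} = 2p_Ap_B\Delta_{AB} + 2p_AD_{ABB} + 2p_BD_{AAB} + \Delta_{AB}^2 + \Delta_{AABB} \tag{3}$$
where
$$\Delta_{AB} = D^{AB} + D^{A}_{B} \tag{4}$$
and
$$\Delta_{AABB} = D^{AB}_{AB} - 2D^{AB}D^{A}_{B} \tag{5}$$
Here $D^{AB}$ and $D^{A}_{B}$ are the gametic and non-gametic digenic disequilibria and $D^{AB}_{AB}$ is the quadrigenic disequilibrium. It should be noted from equations (2) and (3) that the two trigenic disequilibria in (2), $D^{AB}_{A}$ and $D^{AB}_{B}$, were rewritten as $D_{AAB}$ and $D_{ABB}$ without superscripts for notational simplicity.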
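To make the quantities in this section concrete, the sketch that follows computes the zygotic LD for the double homozygote AABB and the composite digenic LD from a 3 × 3 table of unphased genotype counts. The table layout (`counts[i][j]` = number of animals carrying i copies of allele A and j copies of allele B) and the function name are assumptions made for illustration, and the composite LD is computed with the standard count-based estimator described by Weir [20], which may differ in presentation from the authors' SAS implementation.

```python
def zygotic_and_composite_ld(counts):
    """Return (omega_AABB, delta_AB) from a 3x3 table of genotype counts.

    counts[i][j]: animals with i copies of allele A and j copies of allele B.
    """
    n = sum(sum(row) for row in counts)
    P = [[c / n for c in row] for row in counts]

    # Single-locus genotype and allele frequencies (marginal totals).
    P_AA, P_Aa = sum(P[2]), sum(P[1])
    P_BB, P_Bb = sum(r[2] for r in P), sum(r[1] for r in P)
    p_A = P_AA + 0.5 * P_Aa
    p_B = P_BB + 0.5 * P_Bb

    # Zygotic LD: double-homozygote frequency minus product of homozygotes.
    omega = P[2][2] - P_AA * P_BB

    # Composite LD: pools gametic and non-gametic digenic associations,
    # so the two double heterozygotes need not be distinguished.
    delta_AB = (2 * P[2][2] + P[2][1] + P[1][2] + 0.5 * P[1][1]
                - 2 * p_A * p_B)
    return omega, delta_AB
```

For a table in which genotypes at the two loci are independent and each locus is in Hardy-Weinberg proportions, for example `[[1, 2, 1], [2, 4, 2], [1, 2, 1]]`, both measures are zero; for a population consisting only of aabb and AABB animals in equal numbers they are 0.25 and 0.5.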
Maximum likelihood estimation
Following Weir and Cockerham [19] and Weir [20], we based statistical inference on the assumption of multinomial sampling of individual diploids from a population. The observed frequencies and disequilibria with tildes (~) were maximum likelihood (ML) estimates of the corresponding parametric values. Since the additive models described earlier define the same number of parameters as there are degrees of freedom, the ML estimates were obtained simply by replacing all parametric values of frequencies and disequilibria with the corresponding observed values. For example, the ML estimate of composite LD was simply given by
$$\tilde{\Delta}_{AB} = 2\tilde{P}^{AB}_{AB} + \tilde{P}^{AB}_{Ab} + \tilde{P}^{AB}_{aB} + \tfrac{1}{2}\tilde{P}_{AaBb} - 2\tilde{p}_A\tilde{p}_B$$
where $\tilde{P}_{AaBb}$ is the observed frequency of all double heterozygotes (AB/ab and Ab/aB pooled).
However, the ML estimates might be biased because they would involve quadratic terms of multinomial variables. For example, the expectation of the squared gene frequency of allele A over replicate samples of size n would be
$$E\left(\tilde{p}_A^2\right) = p_A^2 + \frac{1}{2n}\left(p_Ap_a + D_A\right)$$
where $D_A$ is the HWD measure at locus A [20]. With the sufficiently large sample (n = 1023 animals) in our data set, we invoked large-sample theory for statistical inference about genic disequilibria. Thus, we ignored the possible biases of order 1/n.
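The order-1/n bias in the squared allele frequency can be checked numerically. The sketch below is not part of the original analysis; it enumerates every possible sample of n diploids under multinomial sampling and returns the exact expectation of the squared sample allele frequency, which can be compared with the value p_A² + (p_A p_a + D_A)/(2n) that follows from the sampling variance of an allele frequency given by Weir [20]. The function name and arguments are hypothetical.

```python
from math import comb


def expected_sq_allele_freq(n: int, P_AA: float, P_Aa: float) -> float:
    """Exact E[p_hat^2] under multinomial sampling of n diploid genotypes."""
    P_aa = 1.0 - P_AA - P_Aa
    total = 0.0
    for n_AA in range(n + 1):
        for n_Aa in range(n - n_AA + 1):
            n_aa = n - n_AA - n_Aa
            # Multinomial probability of this genotype-count configuration.
            prob = (comb(n, n_AA) * comb(n - n_AA, n_Aa)
                    * P_AA ** n_AA * P_Aa ** n_Aa * P_aa ** n_aa)
            p_hat = (2 * n_AA + n_Aa) / (2 * n)
            total += prob * p_hat ** 2
    return total
```

With P_AA = 0.4 and P_Aa = 0.2 (so p_A = 0.5 and D_A = 0.15) and n = 10, the bias term is (0.25 + 0.15)/20 = 0.02; at n = 1023 the same term is about 0.0002, which illustrates why it was ignored.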
Hypothesis testing and power
With an ML estimate ($\tilde{D}$ or $\tilde{\Delta}$) of a given genic disequilibrium D or Δ, along with its sampling variance [Var($\tilde{D}$) or Var($\tilde{\Delta}$)] given in the Appendix (Additional file 1), we constructed a test statistic,
$$X^2_D = \frac{\tilde{D}^2}{\mathrm{Var}(\tilde{D})} \quad \text{or} \quad X^2_\Delta = \frac{\tilde{\Delta}^2}{\mathrm{Var}(\tilde{\Delta})}$$
to test the hypothesis of zero disequilibrium (i.e., H0: D = 0 or H0: Δ = 0). Assuming the asymptotic normality of the ML estimate, $X^2$ under the hypothesis of zero disequilibrium would be distributed as chi-square with one degree of freedom.
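A minimal sketch of this one-degree-of-freedom test is given here, assuming the estimate and its sampling variance have already been obtained (the variance formulas themselves are in the Appendix and are not reproduced). The function name is hypothetical. The p-value uses the identity that, for one degree of freedom, the upper-tail chi-square probability equals erfc(sqrt(x/2)).

```python
from math import erfc, sqrt


def disequilibrium_test(estimate: float, variance: float):
    """Return (X2, p_value) for H0: disequilibrium = 0, one degree of freedom."""
    x2 = estimate ** 2 / variance
    # Upper tail of chi-square(1) equals the two-sided normal tail at sqrt(x2).
    p_value = erfc(sqrt(x2 / 2.0))
    return x2, p_value
```

An estimate that is 1.96 standard errors from zero gives a statistic of about 3.84 and a p-value of about 0.05.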
As usual, each chi-square test can commit two kinds of error: a true hypothesis may be rejected (Type I error) or a false hypothesis may not be rejected (Type II error). The probability of Type I error is measured by the significance level, whereas the probability of Type II error is often related to the power of the test. Generally, as the power is the probability of rejecting a false hypothesis, it equals one minus the probability of Type II error. In the present study, however, we adopted a different use of the power as proposed by Weir and Cockerham ([19], p. 100) and Weir ([20], p. 110): we calculated the power when the hypothesis being tested is true. In this particular case, the power equals the significance level.
Chi-square statistic and correlation
In the past, the squared correlation ($r^2$) has been routinely used as a measure of gametic LD ($r^2_{GLD}$), composite LD ($r^2_{CLD}$), or zygotic LD ($r^2_{ZLD}$). We used a chi-square statistic ($X^2_i$, i = GLD, CLD or ZLD) to test for the significance of the LD estimate. It is known from the literature [21] that the relationship $X^2_i = Nr^2_i$, with N the number of observations in the table, would hold exactly only for a 2 × 2 contingency table. This was the case for GLD and ZLD, but not for CLD. When dropping out three- and four-gene disequilibria in testing for zero composite LD ($H_0: \Delta_{AB} = 0$), we would obtain an approximate chi-square statistic,
$$X^2_{CLD} = \frac{n\tilde{\Delta}_{AB}^2}{\left(\tilde{p}_A\tilde{p}_a + \tilde{D}_A\right)\left(\tilde{p}_B\tilde{p}_b + \tilde{D}_B\right)}$$
which would be equal to $nr^2_{CLD}$ as given in Weir [26]. Similar approximations or restrictions would be needed if the relationship $X^2 = Nr^2$ were desired for three- and four-gene disequilibria. Thus, to avoid such approximations or restrictions, we used a generalized measure of squared correlation [21] in place of $r^2$ as a standardized measure of genic disequilibria. As pointed out above, the relationship $X^2 = Nr^2$ would hold only for a 2 × 2 contingency table.
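The exact link between the chi-square statistic and the squared correlation in a 2 × 2 table can be verified directly. The sketch below is an illustration only, with hypothetical function names: it computes the Pearson chi-square from observed and expected cell counts and, separately, the squared correlation (phi squared) between the two binary classifications.

```python
def pearson_chi_square_2x2(a: float, b: float, c: float, d: float) -> float:
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    cells = ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1))
    x2 = 0.0
    for obs, i, j in cells:
        expected = rows[i] * cols[j] / n
        x2 += (obs - expected) ** 2 / expected
    return x2


def squared_correlation_2x2(a: float, b: float, c: float, d: float) -> float:
    """Squared correlation (phi^2) between the row and column indicators."""
    return (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

For the table [[30, 10], [20, 40]] with 100 observations, the squared correlation is 1/6 and the chi-square is 100/6, so the chi-square equals the number of observations times the squared correlation.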
Data analysis
All data analyses and required computations were done using SAS 9.3 [27]. The calculation of gametic and composite LD was carried out using PROC ALLELE of SAS/Genetics 9.3. In this calculation, the SNP marker data were read in as columns of genotypes using the GENOCOL and DELIMITER= options in the PROC ALLELE statement; gametic LD was calculated when the HAPLO=EST option in the PROC ALLELE statement was invoked, whereas composite LD was calculated when the HAPLO=NONEHWD option was specified. Zygotic LD and its components, as well as the hypothesis tests, were computed using the SAS macro language and the SAS/IML procedure.