### Appendix A: Offset choice when Y is binary

The following considers the offset choice for the coded trait T when Y is binary. Assume that the phenotype of interest is binary and that the genotype of interest follows an additive model. Let *r*_{0}, *r*_{1}, and *r*_{2} denote the number of cases with 0, 1, and 2 disease alleles, respectively. Let *R* denote the total number of cases and *S* the total number of controls. Let *n*_{0}, *n*_{1}, and *n*_{2} denote the number of subjects (cases and controls combined) with 0, 1, and 2 disease alleles, respectively. Let *N*=*S*+*R* denote the total number of cases and controls. In this scenario, the standard statistical method is the Cochran-Armitage trend test, which can be written as follows:

z_{\text{Cochran}}=\frac{N\left(r_{1}+2r_{2}\right)-R\left(n_{1}+2n_{2}\right)}{\sqrt{\left(\frac{SR}{N}\right)\left(N\left(n_{1}+4n_{2}\right)-\left(n_{1}+2n_{2}\right)^{2}\right)}}

(6)
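As a minimal sketch, equation (6) can be evaluated directly from the genotype counts. The function name `cochran_armitage_z` and the tuple layout of its arguments are illustrative choices, not part of the NPBAT software.

```python
import math


def cochran_armitage_z(r, n):
    """Trend-test z statistic of equation (6).

    r = (r0, r1, r2): case counts by number of disease alleles.
    n = (n0, n1, n2): case + control counts by number of disease alleles.
    """
    R = sum(r)          # total number of cases
    N = sum(n)          # total number of subjects
    S = N - R           # total number of controls
    numerator = N * (r[1] + 2 * r[2]) - R * (n[1] + 2 * n[2])
    spread = N * (n[1] + 4 * n[2]) - (n[1] + 2 * n[2]) ** 2
    return numerator / math.sqrt((S * R / N) * spread)
```

For instance, `cochran_armitage_z((10, 20, 20), (40, 40, 20))` gives roughly 5.35, while counts in which the cases mirror the overall genotype distribution give 0.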

In this scenario, let the coded phenotype *T*_{i} = *Y*_{i} - *μ*_{y}, where *μ*_{y} is the offset. The NPBAT statistic has the following form:

\mathit{Stat}_{\text{NPBAT}}=\frac{N\left(r_{1}+2r_{2}\right)-R\left(n_{1}+2n_{2}\right)}{\sqrt{\left(\frac{N\mu_{y}^{2}}{R}+\frac{N\left(1-\mu_{y}\right)^{2}}{S}\right)\left(\frac{SR\left(N\left(n_{1}+4n_{2}\right)-\left(n_{1}+2n_{2}\right)^{2}\right)}{N-1}\right)}}

(7)
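The sketch below evaluates equation (7) from genotype counts and compares it with a per-subject computation. The per-subject form X^{t}T / (\widehat{\sigma_{x}} ||T||), with an (N - 1)-denominator variance estimate, is an assumption inferred from the appendices rather than a quotation of the NPBAT source code; both function names are hypothetical.

```python
import math


def npbat_binary_closed_form(r, n, mu_y):
    """NPBAT statistic of equation (7) from genotype counts.

    r = (r0, r1, r2): case counts; n = (n0, n1, n2): case + control counts;
    mu_y: offset subtracted from the 0/1 phenotype.
    """
    R = sum(r)
    N = sum(n)
    S = N - R
    numerator = N * (r[1] + 2 * r[2]) - R * (n[1] + 2 * n[2])
    spread = N * (n[1] + 4 * n[2]) - (n[1] + 2 * n[2]) ** 2
    offset_term = N * mu_y ** 2 / R + N * (1 - mu_y) ** 2 / S
    return numerator / math.sqrt(offset_term * S * R * spread / (N - 1))


def npbat_binary_individual(x, y, mu_y):
    """Same statistic from per-subject data, ASSUMING the form
    X^t T / (sigma_hat_x * ||T||) with an (N - 1)-denominator variance."""
    N = len(x)
    x_bar = sum(x) / N
    t = [yi - mu_y for yi in y]                      # coded phenotype T_i
    cross = sum((xi - x_bar) * ti for xi, ti in zip(x, t))
    sigma_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (N - 1))
    norm_t = math.sqrt(sum(ti ** 2 for ti in t))
    return cross / (sigma_x * norm_t)
```

Under that assumption the two functions agree for any offset strictly between 0 and 1, which is one way to check a transcription of equation (7).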

Note that the numerators of both statistics are the same. The ratio of the test statistics can be written as follows:

\frac{\mathit{Stat}_{\text{Cochran}}}{\mathit{Stat}_{\text{NPBAT}}}=\sqrt{\frac{N}{N-1}}\sqrt{\left(1+\frac{1}{\gamma}\right)\mu_{y}^{2}+\left(1+\gamma\right)\left(1-\mu_{y}\right)^{2}}

(8)

where \gamma =\frac{\#\text{cases}}{\#\text{controls}}=\frac{R}{S}. Given this ratio, the power of the NPBAT statistic relative to the Cochran-Armitage trend test is maximized by the offset that minimizes the ratio, \mu_{y}^{\text{optimal}}=\frac{\gamma}{1+\gamma}=\frac{\#\text{cases}}{N}. For example, if the ratio of cases to controls is 1, the optimal offset *μ*_{y} is \frac{1}{2}. This corresponds to weighting the cases and controls equally in the conditional test statistic. For large sample size *N*, such that \sqrt{\frac{N}{N-1}}\approx 1, the ratio of the test statistics is approximately one when the offset is set to \mu_{y}^{\text{optimal}}=\frac{\#\text{cases}}{N}. Consequently, for the optimal offset choice, the two test statistics are approximately the same.
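The optimal offset can be checked numerically. The following sketch evaluates the ratio in equation (8) on a grid of offsets and returns the grid point with the smallest ratio; the grid search and the function names are illustrative only.

```python
import math


def statistic_ratio(mu_y, R, S):
    """Ratio Stat_Cochran / Stat_NPBAT of equation (8) for R cases, S controls."""
    N = R + S
    gamma = R / S
    inner = (1 + 1 / gamma) * mu_y ** 2 + (1 + gamma) * (1 - mu_y) ** 2
    return math.sqrt(N / (N - 1)) * math.sqrt(inner)


def best_offset_on_grid(R, S, steps=1000):
    """Offset in [0, 1] (on an evenly spaced grid) that minimizes the ratio."""
    grid = [k / steps for k in range(steps + 1)]
    return min(grid, key=lambda mu_y: statistic_ratio(mu_y, R, S))
```

With 300 cases and 700 controls, the grid minimum falls at 0.3 = 300/1000, and the ratio there equals \sqrt{N/(N-1)}, which is close to one.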

### Appendix B: Asymptotic distribution when the secondary phenotype is available for both the cases and controls

To derive the asymptotic distribution of the NPBAT statistics for various phenotypic offset choices, let \sigma_{X}^{2} denote the variance of X and \sigma_{Y}^{2} denote the variance of Y. Let ||*a*|| denote the Euclidean norm of a vector *a*. Let *T*_{offset} = ((*Y*_{1} - *Y*_{offset}), ..., (*Y*_{n} - *Y*_{offset}))^{t} and let T_{\mu}=\left(T_{\mu_{1}},\ldots,T_{\mu_{n}}\right)^{t}=\left(\left(Y_{1}-\bar{Y}\right),\ldots,\left(Y_{n}-\bar{Y}\right)\right)^{t}, where T_{\mu_{i}}=Y_{i}-\bar{Y}. Let X^{t}=\left(X_{1}-\bar{X},\ldots,X_{n}-\bar{X}\right). Define Z_{i}=\frac{\left(X_{i}-\bar{X}\right)T_{\mu_{i}}}{\|T_{\mu}\|\,\widehat{\sigma_{x}}}. Then \sum_{i=1}^{n}Z_{i}=\frac{X^{t}T_{\mu}}{\|T_{\mu}\|\,\widehat{\sigma_{x}}}. Treating X as random and Y as fixed, it can be shown that the *Z*_{i} are independent, *E*(*Z*_{i}) = 0 and \text{Var}\left(\sum_{i=1}^{n}Z_{i}\right)=1. The Lindeberg condition [25] for the *Z*_{i}, which ensures asymptotic normality of \sum Z_{i}, is then given by

\forall \epsilon>0:\ \lim_{n\to\infty}\left\{\sum_{i=1}^{n}\int_{\left|Z_{i}\right|\ge\epsilon}Z_{i}^{2}\,dP\right\}=0

(9)

Since *Z*_{i} has a discrete distribution, the Lindeberg condition can only be fulfilled when the integration set {|*Z*_{i}| ≥ *ϵ*} is empty for *n* → *∞*. Since X is the coded genotype and Y is a biological quantity, assume \widehat{\sigma_{x}}\ne 0, \widehat{\sigma_{y}}\ne 0 and both are finite. Then there exists some constant K such that \frac{\left|X_{i}-\bar{X}\right|\left|T_{\mu_{i}}\right|}{\widehat{\sigma_{x}}\widehat{\sigma_{y}}}\le K. Because \|T_{\mu}\|=\sqrt{n-1}\,\widehat{\sigma_{y}} when \widehat{\sigma_{y}} is the sample standard deviation of Y, the Lindeberg condition can be rewritten as

\forall \epsilon>0:\ \epsilon\le\left|Z_{i}\right|=\frac{\left|X_{i}-\bar{X}\right|\left|T_{\mu_{i}}\right|}{\widehat{\sigma_{x}}\|T_{\mu}\|}\le\frac{K}{\sqrt{n-1}}\to 0\quad\text{as}\quad n\to\infty

(10)

Hence the integral in the Lindeberg condition is always computed over a set that is empty for *n* → *∞*, and the Lindeberg condition is fulfilled whenever the regularity condition holds. The Lindeberg theorem [26] then implies convergence to normality, so that

\left(\frac{\|T\|}{\|T_{\mu}\|}\right)\mathit{Stat}_{\text{NPBAT}}=\sum_{i=1}^{n}Z_{i}\overset{d}{\to}N\left(0,1\right)

(11)

Note that the statistic is maximized and has a standard normal distribution when *Y*_{offset} = *E*[*Y*].
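The rescaling in equation (11) is an exact algebraic identity for any offset, which the sketch below verifies on per-subject data. It assumes that the NPBAT statistic has the form X^{t}T_{offset} / (\widehat{\sigma_{x}} ||T_{offset}||) with an (n - 1)-denominator estimate of \widehat{\sigma_{x}}; the helper names are hypothetical.

```python
import math


def centered(values, center):
    """Subtract a constant from every entry."""
    return [v - center for v in values]


def norm(values):
    """Euclidean norm."""
    return math.sqrt(sum(v * v for v in values))


def npbat_stat(x, y, y_offset):
    """ASSUMED form: X^t T_offset / (sigma_hat_x * ||T_offset||)."""
    n = len(x)
    xc = centered(x, sum(x) / n)
    t = centered(y, y_offset)
    sigma_x = norm(xc) / math.sqrt(n - 1)
    return sum(a * b for a, b in zip(xc, t)) / (sigma_x * norm(t))


def sum_of_z(x, y):
    """Sum of the Z_i, i.e. the statistic with the offset set to the sample mean of Y."""
    return npbat_stat(x, y, sum(y) / len(y))


def rescaled_stat(x, y, y_offset):
    """Left-hand side of equation (11): (||T|| / ||T_mu||) * Stat_NPBAT."""
    t = centered(y, y_offset)
    t_mu = centered(y, sum(y) / len(y))
    return norm(t) / norm(t_mu) * npbat_stat(x, y, y_offset)
```

Because ||*T*|| is never smaller than ||*T*_{μ}||, the same identity shows that the absolute value of the statistic is largest when the offset equals the sample mean of Y, the sample analogue of *E*[*Y*].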

### Appendix C: Asymptotic distribution when the secondary phenotype is only available for the cases

Here, we derive the asymptotic distribution of the NPBAT statistic for secondary phenotypes in case/control studies. Consider a case-control study in which genetic information is available for both the cases and the controls, but phenotypic information is available only for the cases. Here *n* is the number of cases, and all summations run over the cases only, since the phenotypic information is not available for the controls; in Appendix B, by contrast, *n* is the number of cases and controls and the summations run over both groups. Let \bar{X}_{\text{cases}} denote the sample mean of the genotypes of the cases and \sigma_{X}^{2} be the true variance of the genotypes. Let E_{x}=\bar{X}_{\text{controls}} be the sample mean of the genotypes of the controls. Under the null hypothesis and assuming no population stratification, the sample mean of the genotypes of the cases and the sample mean of the genotypes of the controls both converge to *E*[*X*] since X is not associated with Y. Let X_{\text{text}}=\left(X_{1}-\bar{X}_{\text{text}},\ldots,X_{n}-\bar{X}_{\text{text}}\right)^{t} where *text* = cases or controls; that is, *X*_{1}, ..., *X*_{n} are the coded genotypes of the cases, while \bar{X} can be computed from the cases, the controls, or both. Define

Z_{i}=\frac{\left(X_{i}-\bar{X}_{\text{control}}\right)\left(Y_{i}-Y_{\text{offset}}\right)}{\widehat{\sigma_{x}}\sqrt{\|T_{\mu}\|^{2}+2\left(\bar{Y}-Y_{\text{offset}}\right)^{2}}}

(12)

then

\begin{array}{ll}\sum_{i=1}^{n}Z_{i}& =\frac{X_{\text{control}}^{t}T}{\widehat{\sigma_{x}}\sqrt{\|T_{\mu}\|^{2}+2\left(\bar{Y}-Y_{\text{offset}}\right)^{2}}}\\ & =\frac{X_{\text{case}}^{t}T_{\mu}+n\left(\bar{X}_{\text{case}}-\bar{X}_{\text{control}}\right)\left(\bar{Y}-Y_{\text{offset}}\right)}{\widehat{\sigma_{x}}\sqrt{\|T_{\mu}\|^{2}+2\left(\bar{Y}-Y_{\text{offset}}\right)^{2}}}\end{array}

(13)
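The step between the two numerators in equation (13) follows from writing *X*_{i} - \bar{X}_{\text{control}} as (*X*_{i} - \bar{X}_{\text{case}}) + (\bar{X}_{\text{case}} - \bar{X}_{\text{control}}) and *Y*_{i} - *Y*_{offset} as (*Y*_{i} - \bar{Y}) + (\bar{Y} - *Y*_{offset}); the cross terms sum to zero over the cases. A small sketch, with hypothetical function names, checks this decomposition numerically:

```python
def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)


def numerator_control_centered(x_cases, y_cases, x_bar_control, y_offset):
    """First numerator of (13): sum over cases of (X_i - Xbar_control)(Y_i - Y_offset)."""
    return sum((x - x_bar_control) * (y - y_offset)
               for x, y in zip(x_cases, y_cases))


def numerator_decomposed(x_cases, y_cases, x_bar_control, y_offset):
    """Second numerator of (13):
    X_case^t T_mu + n (Xbar_case - Xbar_control)(Ybar - Y_offset)."""
    n = len(x_cases)
    x_bar_case = mean(x_cases)
    y_bar = mean(y_cases)
    case_centered = sum((x - x_bar_case) * (y - y_bar)
                        for x, y in zip(x_cases, y_cases))
    shift = n * (x_bar_case - x_bar_control) * (y_bar - y_offset)
    return case_centered + shift
```

The second term vanishes when *Y*_{offset} = \bar{Y} or when the case and control genotype means coincide, in which case the numerator reduces to the one used in Appendix B.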

It is important to note that the *Z*_{i} are independent, *E*(*Z*_{i}) = 0 and \text{Var}\left(\sum_{i=1}^{n}Z_{i}\right)=1, which is obtained by first taking the conditional expectation treating X as random and Y as fixed. The Lindeberg condition [25] for the *Z*_{i}, which ensures asymptotic normality of \sum Z_{i}, is then given by

\forall \epsilon>0:\ \lim_{n\to\infty}\left\{\sum_{i=1}^{n}\int_{\left|Z_{i}\right|\ge\epsilon}Z_{i}^{2}\,dP\right\}=0

(14)

Since *Z*_{i} has a discrete distribution, the Lindeberg condition can only be fulfilled when the integration set {|*Z*_{i}| ≥ *ϵ*} is empty for *n* → *∞*. Since X is the coded genotype and Y is a biological quantity, assume \widehat{\sigma_{x}}\ne 0, \widehat{\sigma_{y}}\ne 0 and both are finite. Then there exists some constant K such that \frac{\left|X_{i}-\bar{X}_{\text{control}}\right|\left|T_{i}\right|}{\widehat{\sigma_{x}}\widehat{\sigma_{y}}}\le K. Because \|T_{\mu}\|=\sqrt{n-1}\,\widehat{\sigma_{y}} when \widehat{\sigma_{y}} is the sample standard deviation of Y, the Lindeberg condition can be rewritten as

\begin{array}{ll}\forall \epsilon>0:\ \epsilon\le\left|Z_{i}\right|& =\frac{\left|X_{i}-\bar{X}_{\text{control}}\right|\left|T_{i}\right|}{\widehat{\sigma_{x}}\sqrt{\|T_{\mu}\|^{2}+2\left(\bar{Y}-Y_{\text{offset}}\right)^{2}}}\\ & \le\frac{\left|X_{i}-\bar{X}_{\text{control}}\right|\left|T_{i}\right|}{\widehat{\sigma_{x}}\|T_{\mu}\|}\le\frac{K}{\sqrt{n-1}}\to 0\ \text{as}\ n\to\infty\end{array}

(15)

Hence the integral in the Lindeberg condition is always computed over a set that is empty for *n* → *∞*, and the Lindeberg condition is fulfilled whenever the regularity condition holds. The Lindeberg theorem [26] then implies convergence to normality, so that

\frac{\|T\|}{\sqrt{\|T_{\mu}\|^{2}+2\left(\bar{Y}-Y_{\text{offset}}\right)^{2}}}\mathit{Stat}_{\text{NPBAT}}=\sum_{i=1}^{n}Z_{i}\overset{d}{\to}N(0,1)

(16)

Then the NPBAT statistic is normally distributed with mean zero and variance given above. Note that the variance is always greater than or equal to one and equals one when *Y*_{offset} = *E*[*Y*]. Note that if Y_{\text{offset}}=\bar{Y} and E_{x}=\bar{X}_{\text{controls}}, then NPBAT has a standard normal distribution. As seen in the Simulations section and Figure 1, when *E*_{x} is based on the controls and the phenotype information is only available for the cases, the power is maximized when Y_{\text{offset}}\ne\bar{Y} because the variance equals the minimum when *Y*_{offset} ≈ *E*[*Y*].

### Appendix D: NPBAT software

A software package implemented in C++ to compute both single-phenotype and multiple-phenotype NPBAT statistics is available for download at the following website: https://sites.google.com/site/genenpbat/. In addition to the NPBAT statistics, other population-based statistics such as the Armitage trend test and Fisher's exact test are also available. Currently, only two platforms are supported: linux64 and windows64. The NPBAT software package reads in genetic data through PLINK-style pedigree (ped), map (map) and phenotype (phe) files. The website provides detailed information on how to use the software package.
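For orientation, the sketch below shows how one line of a standard PLINK ped file could be turned into additively coded genotypes. It illustrates the file layout only (six leading fields followed by two alleles per marker, with `0` marking a missing allele) and is not taken from the NPBAT package; the function name `parse_ped_line` and the `counted_alleles` argument are hypothetical.

```python
def parse_ped_line(line, counted_alleles):
    """Parse one whitespace-delimited PLINK ped line.

    counted_alleles: for each marker, the allele whose copies are counted
    (an ASSUMED input; the caller decides which allele is the disease allele).
    Returns the six leading fields plus 0/1/2 genotype codes (None if missing).
    """
    fields = line.split()
    family_id, individual_id, father_id, mother_id, sex, phenotype = fields[:6]
    alleles = fields[6:]
    genotypes = []
    for marker, counted in enumerate(counted_alleles):
        a1, a2 = alleles[2 * marker], alleles[2 * marker + 1]
        if a1 == "0" or a2 == "0":
            genotypes.append(None)               # missing genotype
        else:
            genotypes.append(int(a1 == counted) + int(a2 == counted))
    return {
        "family_id": family_id,
        "individual_id": individual_id,
        "father_id": father_id,
        "mother_id": mother_id,
        "sex": sex,
        "phenotype": phenotype,
        "genotypes": genotypes,
    }
```

The phenotype (phe) file layout expected by the package is described on the website and is not reproduced here.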