### Entropy Model

First we give some definitions and introduce the basic notation.

Let *P* be the population to be studied. Denote by *C* the set of cases with a particular disease in *P* and by *C*^{c} its complement, that is, the set of controls. Let *N*_{ca} and *N*_{co} be the cardinalities of the sets *C* and *C*^{c} respectively, and let *N* = *N*_{ca} + *N*_{co} be the total number of individuals in the population. Each *SNP*_{i} in each individual *e* ∈ *P* can take only one of three possible values, *AA*_{i}, *Aa*_{i} or *aa*_{i}. Let *S*_{i} = {*AA*_{i}, *Aa*_{i}, *aa*_{i}}. Moreover, each individual *e* ∈ *P* belongs to either *C* or *C*^{c}; therefore we can say that *SNP*_{i} takes the value (*X*_{i}, *ca*) if *e* ∈ *C* or (*X*_{i}, *co*) if *e* ∈ *C*^{c}, for *X*_{i} ∈ *S*_{i}. We will call an element of *S*_{i} × {*ca*, *co*} a *symbol*. Therefore we can define the following map

$$f_i : P \longrightarrow S_i \times \{ca, co\}$$

defined by *f*_{i}(*e*) = (*X*_{i}, *t*) for *X*_{i} ∈ *S*_{i} and *t* ∈ {*ca*, *co*}; that is, the map *f*_{i} associates to each individual *e* ∈ *P* the value of its *SNP*_{i} and whether *e* is a control or a case. We will call *f*_{i} a *symbolization map*. In this case we will say that individual *e* is of (*X*_{i}, *t*)-type. In other words, each individual is labelled with its genotype, distinguishing whether the individual is a control or a case.
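As a minimal sketch (the individuals and the tuple encoding are hypothetical), the symbolization map simply pairs each individual's genotype with its case/control label:

```python
# Hypothetical individuals encoded as (genotype, status), with genotype in
# {"AA", "Aa", "aa"} and status in {"ca", "co"}.
population = [
    ("AA", "ca"), ("Aa", "ca"), ("aa", "co"),
    ("Aa", "co"), ("AA", "co"), ("aa", "ca"),
]

def f_i(individual):
    """Symbolization map: individual -> symbol (X_i, t) in S_i x {ca, co}."""
    genotype, status = individual
    return (genotype, status)

symbols = [f_i(e) for e in population]
print(symbols[0])  # ('AA', 'ca')
```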

Denote by

$$n_{X_i}^{ca} = \#\{ e \in P : f_i(e) = (X_i, ca) \}$$

and

$$n_{X_i}^{co} = \#\{ e \in P : f_i(e) = (X_i, co) \}$$

that is, the cardinalities of the subsets of *P* formed by all the individuals of (*X*_{i}, *ca*)-type and of (*X*_{i}, *co*)-type respectively. Therefore $n_{X_i} = n_{X_i}^{ca} + n_{X_i}^{co}$ is the number of individuals of *X*_{i}-type.

Also, under the conditions above, one can easily compute the relative frequency of a symbol (*X*_{i}, *t*) ∈ *S*_{i} × {*ca*, *co*} by:

$$p_{X_i}^{ca} = \frac{n_{X_i}^{ca}}{N}$$

and

$$p_{X_i}^{co} = \frac{n_{X_i}^{co}}{N}.$$

Hence the total frequency of a symbol *X*_{i} is $p_{X_i} = p_{X_i}^{ca} + p_{X_i}^{co}$.
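To make the bookkeeping concrete, a small sketch (the genotype counts are hypothetical) tabulating symbol counts and their relative frequencies with respect to the total *N*:

```python
from collections import Counter

# Hypothetical counts of each symbol (X_i, t); any numbers summing to N work.
counts = Counter({
    ("AA", "ca"): 30, ("Aa", "ca"): 50, ("aa", "ca"): 20,   # N_ca = 100
    ("AA", "co"): 60, ("Aa", "co"): 80, ("aa", "co"): 60,   # N_co = 200
})
N = sum(counts.values())  # 300

# Relative frequency of each symbol: p^t_{X_i} = n^t_{X_i} / N.
p = {symbol: n / N for symbol, n in counts.items()}

# Total frequency of a genotype X_i: p_{X_i} = p^ca_{X_i} + p^co_{X_i}.
p_total = {g: p[(g, "ca")] + p[(g, "co")] for g in ("AA", "Aa", "aa")}
print(p_total["AA"])  # ~0.3
```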

Now under this setting we can define the *symbolic entropy* of a *SNP*_{i}. This entropy is defined as Shannon's entropy of the 3 distinct symbols as follows:

$$h(S_i) = -\sum_{X_i \in S_i} p_{X_i} \ln(p_{X_i}) \qquad (5)$$

Symbolic entropy, *h*(*S*_{i}), is the information contained in comparing the 3 symbols (i.e., the 3 possible values of the genotype) in *S*_{i} among all the individuals in *P*.

Similarly, we have the symbolic entropies for cases and for controls, and the case-control entropy, given by

$$h(S_i, ca) = -\sum_{X_i \in S_i} p_{X_i}^{ca} \ln\left(p_{X_i}^{ca}\right) \qquad (6)$$

$$h(S_i, co) = -\sum_{X_i \in S_i} p_{X_i}^{co} \ln\left(p_{X_i}^{co}\right) \qquad (7)$$

and

$$h(C, C^c) = -\frac{N_{ca}}{N} \ln\left(\frac{N_{ca}}{N}\right) - \frac{N_{co}}{N} \ln\left(\frac{N_{co}}{N}\right)$$

respectively.
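Continuing the sketch, the four entropies are plain Shannon entropies (natural logarithm) of the corresponding frequency vectors, with all frequencies taken relative to the total *N* as above (the counts are hypothetical):

```python
import math

def shannon(freqs):
    """Shannon entropy (natural log) of a list of nonnegative frequencies."""
    return -sum(f * math.log(f) for f in freqs if f > 0)

# Hypothetical symbol counts n^t_{X_i}; N_ca = 100, N_co = 200, N = 300.
n = {("AA", "ca"): 30, ("Aa", "ca"): 50, ("aa", "ca"): 20,
     ("AA", "co"): 60, ("Aa", "co"): 80, ("aa", "co"): 60}
N = sum(n.values())
N_ca = sum(v for (g, t), v in n.items() if t == "ca")
N_co = N - N_ca

genotypes = ("AA", "Aa", "aa")
h_Si = shannon([(n[(g, "ca")] + n[(g, "co")]) / N for g in genotypes])
h_Si_ca = shannon([n[(g, "ca")] / N for g in genotypes])  # cases, over N
h_Si_co = shannon([n[(g, "co")] / N for g in genotypes])  # controls, over N
h_CCc = shannon([N_ca / N, N_co / N])                     # case-control entropy
print(h_Si, h_CCc)
```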

### Construction of the entropy test

In this section we construct a test to detect gene effects in the set *C* of cases with all the machinery defined in the previous section. In order to construct the test, which is the aim of this paper, we consider the following null hypothesis:

$$H_0 : SNP_i \text{ distributes equally in cases and in controls,}$$

that is,

$$H_0 : \frac{n_{X_i}^{ca}}{N_{ca}} = \frac{n_{X_i}^{co}}{N_{co}} \quad \text{for all } X_i \in S_i,$$

against any other alternative.

Now for a symbol (*X*_{i}, *t*) ∈ *S*_{i} × {*ca*, *co*} and an individual *e* ∈ *P* we define the random variable $Z_e^{(X_i, t)}$ as follows:

$$Z_e^{(X_i, t)} = \begin{cases} 1 & \text{if } e \text{ is of } (X_i, t)\text{-type,} \\ 0 & \text{otherwise,} \end{cases}$$

that is, $Z_e^{(X_i, t)} = 1$ if and only if *e* is of (*X*_{i}, *t*)-type, and $Z_e^{(X_i, t)} = 0$ otherwise. Therefore, given that an individual *e* is a case, *t* = *ca* (respectively, *e* is a control, *t* = *co*), the variable $Z_e^{(X_i, t)}$ indicates whether individual *e* has genotype *X*_{i} (taking value 1) or not (taking value 0).

Then $Z_e^{(X_i, t)}$ is a Bernoulli variable with probability of "success" either $p_{X_i}^{ca}$ if *t* = *ca* or $p_{X_i}^{co}$ if *t* = *co*, where "success" means that *e* is of (*X*_{i}, *t*)-type. We are then interested in knowing how many *e*'s are of (*X*_{i}, *t*)-type for each symbol (*X*_{i}, *t*) ∈ *S*_{i} × {*ca*, *co*}. In order to answer this question we construct the following variable

$$Y_{X_i}^{t} = \sum_{e \in P} Z_e^{(X_i, t)}.$$

The variable $Y_{X_i}^{t}$ can take the values {0, 1, 2, ..., *N*}; therefore, it follows that $Y_{X_i}^{t}$ is the binomial random variable

$$Y_{X_i}^{t} \sim B\left(N, p_{X_i}^{t}\right).$$

Then the joint probability density function of the 6 variables

$$\left( Y_{AA_i}^{ca}, Y_{Aa_i}^{ca}, Y_{aa_i}^{ca}, Y_{AA_i}^{co}, Y_{Aa_i}^{co}, Y_{aa_i}^{co} \right)$$

is:

$$P\left( Y_{AA_i}^{ca} = a_1, \ldots, Y_{aa_i}^{co} = a_6 \right) = \frac{N!}{a_1!\, a_2!\, a_3!\, a_4!\, a_5!\, a_6!} \left(p_{AA_i}^{ca}\right)^{a_1} \left(p_{Aa_i}^{ca}\right)^{a_2} \left(p_{aa_i}^{ca}\right)^{a_3} \left(p_{AA_i}^{co}\right)^{a_4} \left(p_{Aa_i}^{co}\right)^{a_5} \left(p_{aa_i}^{co}\right)^{a_6} \qquad (15)$$

where *a*_{1} + *a*_{2} + *a*_{3} + *a*_{4} + *a*_{5} + *a*_{6} = *N*. Consequently the joint distribution of the 6 variables is a multinomial distribution.
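As a quick numerical sanity check of this multinomial density (with hypothetical symbol probabilities and a deliberately tiny *N* so the full enumeration stays small), the pmf sums to 1 over all admissible count vectors:

```python
import math
from itertools import product

def multinomial_pmf(a, p):
    """Joint pmf of the 6 cell counts: N!/(a1!...a6!) * prod(p_j ** a_j)."""
    N = sum(a)
    coef = math.factorial(N)
    for aj in a:
        coef //= math.factorial(aj)
    return coef * math.prod(pj ** aj for pj, aj in zip(p, a))

# Hypothetical probabilities for the 6 symbols in S_i x {ca, co}; they sum to 1.
p = [0.10, 0.20, 0.05, 0.15, 0.30, 0.20]
N = 3

# Sum over all vectors (a1, ..., a6) with a1 + ... + a6 = N.
total = sum(multinomial_pmf(a, p)
            for a in product(range(N + 1), repeat=6) if sum(a) == N)
print(round(total, 10))  # 1.0
```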

The likelihood function of the distribution (15) is:

$$L\left(p_{X_i}^{ca}, p_{X_i}^{co} ; X_i \in S_i\right) = \frac{N!}{\prod_{X_i \in S_i} n_{X_i}^{ca}!\, n_{X_i}^{co}!} \prod_{X_i \in S_i} \left(p_{X_i}^{ca}\right)^{n_{X_i}^{ca}} \left(p_{X_i}^{co}\right)^{n_{X_i}^{co}}$$

where $\sum_{X_i \in S_i} \left(p_{X_i}^{ca} + p_{X_i}^{co}\right) = 1$. Also, since

$$p_{aa_i}^{co} = 1 - p_{AA_i}^{ca} - p_{Aa_i}^{ca} - p_{aa_i}^{ca} - p_{AA_i}^{co} - p_{Aa_i}^{co},$$

it follows that the logarithm of this likelihood function remains as

$$\ln L = \ln\left(\frac{N!}{\prod_{X_i \in S_i} n_{X_i}^{ca}!\, n_{X_i}^{co}!}\right) + \sum_{X_i \in S_i} n_{X_i}^{ca} \ln\left(p_{X_i}^{ca}\right) + \sum_{X_i \in S_i} n_{X_i}^{co} \ln\left(p_{X_i}^{co}\right).$$

In order to obtain the maximum likelihood estimators $\hat{p}_{X_i}^{ca}$ and $\hat{p}_{X_i}^{co}$ of $p_{X_i}^{ca}$ and $p_{X_i}^{co}$ respectively, for all $X_i \in S_i$, we solve the following equations

$$\frac{\partial \ln L}{\partial p_{X_i}^{ca}} = 0, \qquad \frac{\partial \ln L}{\partial p_{X_i}^{co}} = 0,$$

to get that

$$\hat{p}_{X_i}^{ca} = \frac{n_{X_i}^{ca}}{N}, \qquad \hat{p}_{X_i}^{co} = \frac{n_{X_i}^{co}}{N}.$$

Then, under the null *H*_{0}, we have that $p_{X_i}^{ca} = \frac{N_{ca}}{N} p_{X_i}$ and $p_{X_i}^{co} = \frac{N_{co}}{N} p_{X_i}$, and thus,

$$\hat{p}_{X_i}^{ca} = \frac{N_{ca}}{N} \frac{n_{X_i}}{N}, \qquad \hat{p}_{X_i}^{co} = \frac{N_{co}}{N} \frac{n_{X_i}}{N}.$$

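The closed-form estimators can also be checked numerically: by concavity of the log-likelihood, p̂ = n/N should dominate any probability vector obtained by shifting mass between cells (the counts are hypothetical and the perturbation scheme is purely illustrative):

```python
import math

# Hypothetical symbol counts for the 6 cells; N = 300.
n = [30, 50, 20, 60, 80, 60]
N = sum(n)

def log_lik(p):
    """Multinomial log-likelihood up to the constant combinatorial term."""
    return sum(nj * math.log(pj) for nj, pj in zip(n, p))

p_hat = [nj / N for nj in n]  # maximum likelihood estimators n_j / N

# Shift mass eps between every pair of cells; p_hat should never be beaten.
best_is_mle = True
eps = 0.01
for i in range(6):
    for j in range(6):
        if i == j:
            continue
        q = p_hat[:]
        q[i] += eps
        q[j] -= eps
        if q[j] > 0 and log_lik(q) > log_lik(p_hat):
            best_is_mle = False
print(best_is_mle)  # True
```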
Therefore the likelihood ratio statistic is (see for example [15]):

$$\lambda_i(Y) = \frac{\sup_{H_0} L}{\sup L},$$

and thus, under the null *H*_{0}, we get that *λ*_{i}(*Y*) remains as:

$$\lambda_i(Y) = \prod_{X_i \in S_i} \left( \frac{N_{ca}\, n_{X_i}}{N\, n_{X_i}^{ca}} \right)^{n_{X_i}^{ca}} \left( \frac{N_{co}\, n_{X_i}}{N\, n_{X_i}^{co}} \right)^{n_{X_i}^{co}}.$$

On the other hand, *GE*_{i} = −2 ln(*λ*_{i}(*Y*)) asymptotically follows a Chi-squared distribution with 2 degrees of freedom (see for instance [15]). Hence, we obtain that the estimator of *GE*_{i} is:

$$GE_i = 2N \left[ h(S_i) + h(C, C^c) - h(S_i, ca) - h(S_i, co) \right].$$

Therefore we have proved the following theorem.

**Theorem 1**. Let *SNP*_{i} be a single nucleotide polymorphism. For a particular disease denote by *N* the number of individuals in the population, by *N*_{ca} the number of cases and by *N*_{co} the number of controls. Denote by *h*(*C*, *C*^{c}) the case-control entropy and by *h*(*S*_{i}), *h*(*S*_{i}, *ca*) and *h*(*S*_{i}, *co*) the symbolic entropy in the population, in cases and in controls respectively, as defined in (5, 6 and 7). If the *SNP*_{i} distributes equally in cases and in controls, then

$$GE_i = 2N \left[ h(S_i) + h(C, C^c) - h(S_i, ca) - h(S_i, co) \right]$$

is asymptotically $\chi_2^2$ distributed.

Let *α* be a real number with 0 ≤ *α* ≤ 1. Let $\chi_{2, \alpha}^2$ be such that

$$P\left( \chi_2^2 > \chi_{2, \alpha}^2 \right) = \alpha.$$

Then the decision rule in the application of the *GE*_{i} test at a 100(1 − *α*)% confidence level is:

$$\begin{cases} \text{reject } H_0 & \text{if } GE_i > \chi_{2, \alpha}^2, \\ \text{do not reject } H_0 & \text{otherwise.} \end{cases}$$

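Numerically, the entropy form of *GE*_{i} agrees with the classical G statistic 2 Σ *O* ln(*O*/*E*) computed on the 2 × 3 case/control-by-genotype table, which makes the test easy to apply in practice. A sketch with hypothetical counts (the 5% critical value of a χ² with 2 degrees of freedom is about 5.991):

```python
import math

def shannon(freqs):
    return -sum(f * math.log(f) for f in freqs if f > 0)

# Hypothetical 2x3 case/control-by-genotype counts.
ca = {"AA": 30, "Aa": 50, "aa": 20}   # N_ca = 100
co = {"AA": 60, "Aa": 80, "aa": 60}   # N_co = 200
N_ca, N_co = sum(ca.values()), sum(co.values())
N = N_ca + N_co
genotypes = ("AA", "Aa", "aa")

# Entropy form: GE_i = 2N [h(S_i) + h(C,C^c) - h(S_i,ca) - h(S_i,co)].
h_Si = shannon([(ca[g] + co[g]) / N for g in genotypes])
h_ca = shannon([ca[g] / N for g in genotypes])
h_co = shannon([co[g] / N for g in genotypes])
h_CCc = shannon([N_ca / N, N_co / N])
GE = 2 * N * (h_Si + h_CCc - h_ca - h_co)

# Equivalent G-statistic form: 2 * sum O * ln(O / E).
G = 2 * sum(obs[g] * math.log(obs[g] / ((ca[g] + co[g]) * Nt / N))
            for obs, Nt in ((ca, N_ca), (co, N_co)) for g in genotypes)

print(abs(GE - G) < 1e-9, round(GE, 3))
```

For these particular hypothetical counts *GE*_{i} ≈ 4.13 < 5.991, so at the 95% confidence level *H*_{0} would not be rejected.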
Furthermore, an entropy allelic test can be developed in a similar manner. More concretely, let us now define the set *A*_{i} = {*A*_{i}, *a*_{i}} formed by the two possible alleles of the *SNP*_{i}.

Let

$$n_{A_i}^{t} = 2\, n_{AA_i}^{t} + n_{Aa_i}^{t}, \qquad n_{a_i}^{t} = 2\, n_{aa_i}^{t} + n_{Aa_i}^{t}, \qquad p_{A_i}^{t} = \frac{n_{A_i}^{t}}{2N}, \qquad p_{a_i}^{t} = \frac{n_{a_i}^{t}}{2N} \qquad \text{for } t \in \{ca, co\}.$$

Denote by $p_{A_i} = p_{A_i}^{ca} + p_{A_i}^{co}$ and $p_{a_i} = p_{a_i}^{ca} + p_{a_i}^{co}$ the total allele frequencies. Then we can easily define the allele entropies of a *SNP*_{i} by

$$h(A_i) = -p_{A_i} \ln(p_{A_i}) - p_{a_i} \ln(p_{a_i})$$

and

$$h(A_i, t) = -p_{A_i}^{t} \ln\left(p_{A_i}^{t}\right) - p_{a_i}^{t} \ln\left(p_{a_i}^{t}\right) \qquad \text{for } t \in \{ca, co\}.$$

Now, with this notation and following all the steps of the proof of Theorem 1, we get the following result.

**Theorem 2**. Let *A*_{i} = {*A*_{i}, *a*_{i}} be the alleles forming a single nucleotide polymorphism *SNP*_{i}. For a particular disease denote by *N* the number of individuals in the population, by *N*_{ca} the number of cases and by *N*_{co} the number of controls. Denote by *h*(*C*, *C*^{c}) the case-control entropy and by *h*(*A*_{i}), *h*(*A*_{i}, *ca*) and *h*(*A*_{i}, *co*) the allele entropy in the population, in cases and in controls respectively. If the allele *A*_{i} distributes equally in cases and in controls, then

$$2(2N) \left[ h(A_i) + h(C, C^c) - h(A_i, ca) - h(A_i, co) \right]$$

is asymptotically $\chi_1^2$ distributed.
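The allelic statistic works on the 2 × 2 allele-by-status table, with each individual contributing two alleles (an *AA* individual contributes two *A*'s, an *Aa* individual one of each). A sketch under the same hypothetical genotype counts, with 2*N* total alleles:

```python
import math

def shannon(freqs):
    return -sum(f * math.log(f) for f in freqs if f > 0)

# Hypothetical genotype counts for cases and controls.
ca = {"AA": 30, "Aa": 50, "aa": 20}
co = {"AA": 60, "Aa": 80, "aa": 60}
N = sum(ca.values()) + sum(co.values())   # 300 individuals
alleles = 2 * N                           # 2N alleles in total

# Allele counts: n^t_A = 2*AA + Aa and n^t_a = 2*aa + Aa.
n_ca = {"A": 2 * ca["AA"] + ca["Aa"], "a": 2 * ca["aa"] + ca["Aa"]}
n_co = {"A": 2 * co["AA"] + co["Aa"], "a": 2 * co["aa"] + co["Aa"]}
N_ca_al = sum(n_ca.values())              # 2 * N_ca
N_co_al = sum(n_co.values())              # 2 * N_co

h_Ai = shannon([(n_ca[x] + n_co[x]) / alleles for x in ("A", "a")])
h_Ai_ca = shannon([n_ca[x] / alleles for x in ("A", "a")])
h_Ai_co = shannon([n_co[x] / alleles for x in ("A", "a")])
h_CCc = shannon([N_ca_al / alleles, N_co_al / alleles])

# Allelic statistic 2(2N)[...], asymptotically chi-squared with 1 df.
GEA = 2 * alleles * (h_Ai + h_CCc - h_Ai_ca - h_Ai_co)
print(round(GEA, 3))
```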

### Consistency of the entropy test

Next we prove that the *GE*_{i} test is consistent against a wide variety of alternatives to the null. This is a valuable property, since the test will asymptotically reject the hypothesis that the *SNP*_{i} distributes equally between cases and controls whenever this assumption fails. The proof of the following theorem can be found in the Appendix section. Since the proof is similar for both statistics, we only prove it for *GE*_{i}.

**Theorem 3**. Let *SNP*_{i} be a single nucleotide polymorphism. If the *SNP*_{i} does not distribute equally in cases and in controls, then

$$\lim_{N \to \infty} P\left( GE_i > C \right) = 1$$

for every real number 0 < *C* < ∞.

Since Theorem 3 implies that *GE*_{i} → +∞ with probability approaching 1 whenever the *SNP*_{i} does not distribute equally in cases and in controls, upper-tailed critical values are appropriate.
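Consistency can be illustrated numerically: holding the (unequal) case and control genotype distributions fixed while *N* grows, the statistic grows linearly in *N*, so it eventually exceeds any fixed critical value *C* (the frequencies below are hypothetical):

```python
import math

def shannon(freqs):
    return -sum(f * math.log(f) for f in freqs if f > 0)

def GE(N):
    """GE_i when cases are 1/3 of N with genotype frequencies (0.3, 0.5, 0.2)
    and controls are 2/3 of N with (0.3, 0.4, 0.3) -- an unequal alternative."""
    ca = [0.3 * N / 3, 0.5 * N / 3, 0.2 * N / 3]            # expected case counts
    co = [0.3 * 2 * N / 3, 0.4 * 2 * N / 3, 0.3 * 2 * N / 3]
    h_S = shannon([(a + b) / N for a, b in zip(ca, co)])
    h_ca = shannon([a / N for a in ca])
    h_co = shannon([b / N for b in co])
    h_CC = shannon([1 / 3, 2 / 3])
    return 2 * N * (h_S + h_CC - h_ca - h_co)

values = [GE(N) for N in (300, 3000, 30000)]
print(values)  # grows roughly tenfold at each step, i.e. linearly in N
```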