### Entropy Model

First we give some definitions and introduce the basic notation.

Let *P* be the population to be studied. Denote by *C* the set of cases with a particular disease in *P* and by *C*^{c} its complement, that is, the set of controls. Let *N*_{ca} and *N*_{co} be the cardinalities of the sets *C* and *C*^{c} respectively, and let *N* = *N*_{ca} + *N*_{co} be the total number of individuals in the population. Each *SNP*_{i} in each individual *e* ∈ *P* can take only one of three possible values, *AA*_{i}, *Aa*_{i} or *aa*_{i}. Let *S*_{i} = {*AA*_{i}, *Aa*_{i}, *aa*_{i}}. Moreover, each individual *e* ∈ *P* belongs to either *C* or *C*^{c}; therefore we can say that *SNP*_{i} takes the value (*X*_{i}, *ca*) if *e* ∈ *C* or (*X*_{i}, *co*) if *e* ∈ *C*^{c}, for *X*_{i} ∈ *S*_{i}. We will call an element of *S*_{i} × {*ca*, *co*} a *symbol*. Therefore we can define the following map

$$f_i : P \longrightarrow S_i \times \{ca, co\}$$

defined by *f*_{i}(*e*) = (*X*_{i}, *t*) for *X*_{i} ∈ *S*_{i} and *t* ∈ {*ca*, *co*}; that is, the map *f*_{i} associates to each individual *e* ∈ *P* the value of its *SNP*_{i} and whether *e* is a control or a case. We will call *f*_{i} a *symbolization map*. In this case we will say that individual *e* is of (*X*_{i}, *t*)-type. In other words, each individual is labelled with its genotype, distinguishing whether the individual is a control or a case.
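As a minimal sketch (the individuals and the tuple encoding are hypothetical), the symbolization map simply pairs each individual's genotype with its case/control label:

```python
# Hypothetical individuals encoded as (genotype, status), with genotype in
# {"AA", "Aa", "aa"} and status in {"ca", "co"}.
population = [
    ("AA", "ca"), ("Aa", "ca"), ("aa", "co"),
    ("Aa", "co"), ("AA", "co"), ("aa", "ca"),
]

def f_i(individual):
    """Symbolization map: individual -> symbol (X_i, t) in S_i x {ca, co}."""
    genotype, status = individual
    return (genotype, status)

symbols = [f_i(e) for e in population]
print(symbols[0])  # ('AA', 'ca')
```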

Denote by

$$n_{X_i}^{ca} = \#\{ e \in P : f_i(e) = (X_i, ca) \}$$

and

$$n_{X_i}^{co} = \#\{ e \in P : f_i(e) = (X_i, co) \}$$

that is, the cardinalities of the subsets of *P* formed by all the individuals of (*X*_{i}, *ca*)-type and of (*X*_{i}, *co*)-type respectively. Therefore $n_{X_i} = n_{X_i}^{ca} + n_{X_i}^{co}$ is the number of individuals of *X*_{i}-type.

Also, under the conditions above, one can easily compute the relative frequency of a symbol (*X*_{i}, *t*) ∈ *S*_{i} × {*ca*, *co*} by:

$$p_{X_i}^{ca} = \frac{n_{X_i}^{ca}}{N}$$

and

$$p_{X_i}^{co} = \frac{n_{X_i}^{co}}{N}.$$

Hence the total frequency of a symbol *X*_{i} is $p_{X_i} = p_{X_i}^{ca} + p_{X_i}^{co}$.
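To make the bookkeeping concrete, a small sketch (the genotype counts are hypothetical) tabulating symbol counts and their relative frequencies with respect to the total *N*:

```python
from collections import Counter

# Hypothetical counts of each symbol (X_i, t); any numbers summing to N work.
counts = Counter({
    ("AA", "ca"): 30, ("Aa", "ca"): 50, ("aa", "ca"): 20,   # N_ca = 100
    ("AA", "co"): 60, ("Aa", "co"): 80, ("aa", "co"): 60,   # N_co = 200
})
N = sum(counts.values())  # 300

# Relative frequency of each symbol: p^t_{X_i} = n^t_{X_i} / N.
p = {symbol: n / N for symbol, n in counts.items()}

# Total frequency of a genotype X_i: p_{X_i} = p^ca_{X_i} + p^co_{X_i}.
p_total = {g: p[(g, "ca")] + p[(g, "co")] for g in ("AA", "Aa", "aa")}
print(p_total["AA"])  # ~0.3
```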

Now under this setting we can define the *symbolic entropy* of a *SNP*_{i}. This entropy is defined as Shannon's entropy of the 3 distinct symbols as follows:

$$h(S_i) = -\sum_{X_i \in S_i} p_{X_i} \ln(p_{X_i}) \qquad (5)$$

Symbolic entropy, *h*(*S*_{i}), is the information contained in comparing the 3 symbols (i.e., the 3 possible values of the genotype) in *S*_{i} among all the individuals in *P*.

Similarly, we have the symbolic entropies for cases and for controls, and the case-control entropy, given by

$$h(S_i, ca) = -\sum_{X_i \in S_i} p_{X_i}^{ca} \ln\left(p_{X_i}^{ca}\right) \qquad (6)$$

$$h(S_i, co) = -\sum_{X_i \in S_i} p_{X_i}^{co} \ln\left(p_{X_i}^{co}\right) \qquad (7)$$

and

$$h(C, C^c) = -\frac{N_{ca}}{N} \ln\left(\frac{N_{ca}}{N}\right) - \frac{N_{co}}{N} \ln\left(\frac{N_{co}}{N}\right)$$

respectively.
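Continuing the sketch, the four entropies are plain Shannon entropies (natural logarithm) of the corresponding frequency vectors, with all frequencies taken relative to the total *N* as above (the counts are hypothetical):

```python
import math

def shannon(freqs):
    """Shannon entropy (natural log) of a list of nonnegative frequencies."""
    return -sum(f * math.log(f) for f in freqs if f > 0)

# Hypothetical symbol counts n^t_{X_i}; N_ca = 100, N_co = 200, N = 300.
n = {("AA", "ca"): 30, ("Aa", "ca"): 50, ("aa", "ca"): 20,
     ("AA", "co"): 60, ("Aa", "co"): 80, ("aa", "co"): 60}
N = sum(n.values())
N_ca = sum(v for (g, t), v in n.items() if t == "ca")
N_co = N - N_ca

genotypes = ("AA", "Aa", "aa")
h_Si = shannon([(n[(g, "ca")] + n[(g, "co")]) / N for g in genotypes])
h_Si_ca = shannon([n[(g, "ca")] / N for g in genotypes])  # cases, over N
h_Si_co = shannon([n[(g, "co")] / N for g in genotypes])  # controls, over N
h_CCc = shannon([N_ca / N, N_co / N])                     # case-control entropy
print(h_Si, h_CCc)
```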

### Construction of the entropy test

In this section we construct a test to detect gene effects in the set *C* of cases with all the machinery defined in the previous section. In order to construct the test, which is the aim of this paper, we consider the following null hypothesis:

$$H_0 : SNP_i \text{ distributes equally in cases and in controls,}$$

that is,

$$H_0 : \frac{n_{X_i}^{ca}}{N_{ca}} = \frac{n_{X_i}^{co}}{N_{co}} \quad \text{for all } X_i \in S_i,$$

against any other alternative.

Now for a symbol (*X*_{i}, *t*) ∈ *S*_{i} × {*ca*, *co*} and an individual *e* ∈ *P* we define the random variable $Z_e^{(X_i, t)}$ as follows:

$$Z_e^{(X_i, t)} = \begin{cases} 1 & \text{if } e \text{ is of } (X_i, t)\text{-type,} \\ 0 & \text{otherwise,} \end{cases}$$

that is, $Z_e^{(X_i, t)} = 1$ if and only if *e* is of (*X*_{i}, *t*)-type, and $Z_e^{(X_i, t)} = 0$ otherwise. Therefore, given that an individual *e* is a case, *t* = *ca* (respectively, *e* is a control, *t* = *co*), the variable $Z_e^{(X_i, t)}$ indicates whether individual *e* has genotype *X*_{i} (taking value 1) or not (taking value 0).

Then $Z_e^{(X_i, t)}$ is a Bernoulli variable with probability of "success" either $p_{X_i}^{ca}$ if *t* = *ca* or $p_{X_i}^{co}$ if *t* = *co*, where "success" means that *e* is of (*X*_{i}, *t*)-type. We are then interested in knowing how many *e*'s are of (*X*_{i}, *t*)-type for each symbol (*X*_{i}, *t*) ∈ *S*_{i} × {*ca*, *co*}. In order to answer this question we construct the following variable

$$Y_{X_i}^{t} = \sum_{e \in P} Z_e^{(X_i, t)}.$$

The variable $Y_{X_i}^{t}$ can take the values {0, 1, 2, ..., *N*}; therefore, it follows that $Y_{X_i}^{t}$ is the binomial random variable

$$Y_{X_i}^{t} \sim B\left(N, p_{X_i}^{t}\right).$$

Then the joint probability density function of the 6 variables

$$\left( Y_{AA_i}^{ca}, Y_{Aa_i}^{ca}, Y_{aa_i}^{ca}, Y_{AA_i}^{co}, Y_{Aa_i}^{co}, Y_{aa_i}^{co} \right)$$

is:

$$P\left( Y_{AA_i}^{ca} = a_1, \ldots, Y_{aa_i}^{co} = a_6 \right) = \frac{N!}{a_1!\, a_2!\, a_3!\, a_4!\, a_5!\, a_6!} \left(p_{AA_i}^{ca}\right)^{a_1} \left(p_{Aa_i}^{ca}\right)^{a_2} \left(p_{aa_i}^{ca}\right)^{a_3} \left(p_{AA_i}^{co}\right)^{a_4} \left(p_{Aa_i}^{co}\right)^{a_5} \left(p_{aa_i}^{co}\right)^{a_6} \qquad (15)$$

where *a*_{1} + *a*_{2} + *a*_{3} + *a*_{4} + *a*_{5} + *a*_{6} = *N*. Consequently the joint distribution of the 6 variables is a multinomial distribution.
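As a quick numerical sanity check of this multinomial density (with hypothetical symbol probabilities and a deliberately tiny *N* so the full enumeration stays small), the pmf sums to 1 over all admissible count vectors:

```python
import math
from itertools import product

def multinomial_pmf(a, p):
    """Joint pmf of the 6 cell counts: N!/(a1!...a6!) * prod(p_j ** a_j)."""
    N = sum(a)
    coef = math.factorial(N)
    for aj in a:
        coef //= math.factorial(aj)
    return coef * math.prod(pj ** aj for pj, aj in zip(p, a))

# Hypothetical probabilities for the 6 symbols in S_i x {ca, co}; they sum to 1.
p = [0.10, 0.20, 0.05, 0.15, 0.30, 0.20]
N = 3

# Sum over all vectors (a1, ..., a6) with a1 + ... + a6 = N.
total = sum(multinomial_pmf(a, p)
            for a in product(range(N + 1), repeat=6) if sum(a) == N)
print(round(total, 10))  # 1.0
```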

The likelihood function of the distribution (15) is:

$$L\left(p_{X_i}^{ca}, p_{X_i}^{co} ; X_i \in S_i\right) = \frac{N!}{\prod_{X_i \in S_i} n_{X_i}^{ca}!\, n_{X_i}^{co}!} \prod_{X_i \in S_i} \left(p_{X_i}^{ca}\right)^{n_{X_i}^{ca}} \left(p_{X_i}^{co}\right)^{n_{X_i}^{co}}$$

where $\sum_{X_i \in S_i} \left(p_{X_i}^{ca} + p_{X_i}^{co}\right) = 1$. Also, since

$$p_{aa_i}^{co} = 1 - p_{AA_i}^{ca} - p_{Aa_i}^{ca} - p_{aa_i}^{ca} - p_{AA_i}^{co} - p_{Aa_i}^{co},$$

it follows that the logarithm of this likelihood function remains as

$$\ln L = \ln\left(\frac{N!}{\prod_{X_i \in S_i} n_{X_i}^{ca}!\, n_{X_i}^{co}!}\right) + \sum_{X_i \in S_i} n_{X_i}^{ca} \ln\left(p_{X_i}^{ca}\right) + \sum_{X_i \in S_i} n_{X_i}^{co} \ln\left(p_{X_i}^{co}\right).$$

In order to obtain the maximum likelihood estimators $\hat{p}_{X_i}^{ca}$ and $\hat{p}_{X_i}^{co}$ of $p_{X_i}^{ca}$ and $p_{X_i}^{co}$ respectively, for all $X_i \in S_i$, we solve the following equations

$$\frac{\partial \ln L}{\partial p_{X_i}^{ca}} = 0, \qquad \frac{\partial \ln L}{\partial p_{X_i}^{co}} = 0,$$

to get that

$$\hat{p}_{X_i}^{ca} = \frac{n_{X_i}^{ca}}{N}, \qquad \hat{p}_{X_i}^{co} = \frac{n_{X_i}^{co}}{N}.$$

Then, under the null *H*_{0}, we have that $p_{X_i}^{ca} = \frac{N_{ca}}{N} p_{X_i}$ and $p_{X_i}^{co} = \frac{N_{co}}{N} p_{X_i}$, and thus,

$$\hat{p}_{X_i}^{ca} = \frac{N_{ca}}{N} \frac{n_{X_i}}{N}, \qquad \hat{p}_{X_i}^{co} = \frac{N_{co}}{N} \frac{n_{X_i}}{N}.$$

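The closed-form estimators can also be checked numerically: by concavity of the log-likelihood, p̂ = n/N should dominate any probability vector obtained by shifting mass between cells (the counts are hypothetical and the perturbation scheme is purely illustrative):

```python
import math

# Hypothetical symbol counts for the 6 cells; N = 300.
n = [30, 50, 20, 60, 80, 60]
N = sum(n)

def log_lik(p):
    """Multinomial log-likelihood up to the constant combinatorial term."""
    return sum(nj * math.log(pj) for nj, pj in zip(n, p))

p_hat = [nj / N for nj in n]  # maximum likelihood estimators n_j / N

# Shift mass eps between every pair of cells; p_hat should never be beaten.
best_is_mle = True
eps = 0.01
for i in range(6):
    for j in range(6):
        if i == j:
            continue
        q = p_hat[:]
        q[i] += eps
        q[j] -= eps
        if q[j] > 0 and log_lik(q) > log_lik(p_hat):
            best_is_mle = False
print(best_is_mle)  # True
```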
Therefore the likelihood ratio statistic is (see for example [15]):

$$\lambda_i(Y) = \frac{\sup_{H_0} L}{\sup L},$$

and thus, under the null *H*_{0}, we get that *λ*_{i}(*Y*) remains as:

$$\lambda_i(Y) = \prod_{X_i \in S_i} \left( \frac{N_{ca}\, n_{X_i}}{N\, n_{X_i}^{ca}} \right)^{n_{X_i}^{ca}} \left( \frac{N_{co}\, n_{X_i}}{N\, n_{X_i}^{co}} \right)^{n_{X_i}^{co}}.$$

On the other hand, *GE*_{i} = −2 ln(*λ*_{i}(*Y*)) asymptotically follows a Chi-squared distribution with 2 degrees of freedom (see for instance [15]). Hence, we obtain that the estimator of *GE*_{i} is:

$$GE_i = 2N \left[ h(S_i) + h(C, C^c) - h(S_i, ca) - h(S_i, co) \right].$$

Therefore we have proved the following theorem.

**Theorem 1**. Let *SNP*_{i} be a single nucleotide polymorphism. For a particular disease denote by *N* the number of individuals in the population, by *N*_{ca} the number of cases and by *N*_{co} the number of controls. Denote by *h*(*C*, *C*^{c}) the case-control entropy and by *h*(*S*_{i}), *h*(*S*_{i}, *ca*) and *h*(*S*_{i}, *co*) the symbolic entropy in the population, in cases and in controls respectively, as defined in (5, 6 and 7). If the *SNP*_{i} distributes equally in cases and in controls, then

$$GE_i = 2N \left[ h(S_i) + h(C, C^c) - h(S_i, ca) - h(S_i, co) \right]$$

is asymptotically $\chi_2^2$ distributed.

Let *α* be a real number with 0 ≤ *α* ≤ 1. Let $\chi_{2, \alpha}^2$ be such that

$$P\left( \chi_2^2 > \chi_{2, \alpha}^2 \right) = \alpha.$$

Then the decision rule in the application of the *GE*_{i} test at a 100(1 − *α*)% confidence level is:

$$\begin{cases} \text{reject } H_0 & \text{if } GE_i > \chi_{2, \alpha}^2, \\ \text{do not reject } H_0 & \text{otherwise.} \end{cases}$$

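Numerically, the entropy form of *GE*_{i} agrees with the classical G statistic 2 Σ *O* ln(*O*/*E*) computed on the 2 × 3 case/control-by-genotype table, which makes the test easy to apply in practice. A sketch with hypothetical counts (the 5% critical value of a χ² with 2 degrees of freedom is about 5.991):

```python
import math

def shannon(freqs):
    return -sum(f * math.log(f) for f in freqs if f > 0)

# Hypothetical 2x3 case/control-by-genotype counts.
ca = {"AA": 30, "Aa": 50, "aa": 20}   # N_ca = 100
co = {"AA": 60, "Aa": 80, "aa": 60}   # N_co = 200
N_ca, N_co = sum(ca.values()), sum(co.values())
N = N_ca + N_co
genotypes = ("AA", "Aa", "aa")

# Entropy form: GE_i = 2N [h(S_i) + h(C,C^c) - h(S_i,ca) - h(S_i,co)].
h_Si = shannon([(ca[g] + co[g]) / N for g in genotypes])
h_ca = shannon([ca[g] / N for g in genotypes])
h_co = shannon([co[g] / N for g in genotypes])
h_CCc = shannon([N_ca / N, N_co / N])
GE = 2 * N * (h_Si + h_CCc - h_ca - h_co)

# Equivalent G-statistic form: 2 * sum O * ln(O / E).
G = 2 * sum(obs[g] * math.log(obs[g] / ((ca[g] + co[g]) * Nt / N))
            for obs, Nt in ((ca, N_ca), (co, N_co)) for g in genotypes)

print(abs(GE - G) < 1e-9, round(GE, 3))
```

For these particular hypothetical counts *GE*_{i} ≈ 4.13 < 5.991, so at the 95% confidence level *H*_{0} would not be rejected.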
Furthermore, an entropy allelic test can be developed in a similar manner. More concretely, let us now define the set *A*_{i} = {*A*_{i}, *a*_{i}} formed by the two possible alleles of the *SNP*_{i}.

Let

$$n_{A_i}^{t} = 2\, n_{AA_i}^{t} + n_{Aa_i}^{t}, \qquad n_{a_i}^{t} = 2\, n_{aa_i}^{t} + n_{Aa_i}^{t}, \qquad p_{A_i}^{t} = \frac{n_{A_i}^{t}}{2N}, \qquad p_{a_i}^{t} = \frac{n_{a_i}^{t}}{2N} \qquad \text{for } t \in \{ca, co\}.$$

Denote by $p_{A_i} = p_{A_i}^{ca} + p_{A_i}^{co}$ and $p_{a_i} = p_{a_i}^{ca} + p_{a_i}^{co}$ the total allele frequencies. Then we can easily define the allele entropies of a *SNP*_{i} by

$$h(A_i) = -p_{A_i} \ln(p_{A_i}) - p_{a_i} \ln(p_{a_i})$$

and

$$h(A_i, t) = -p_{A_i}^{t} \ln\left(p_{A_i}^{t}\right) - p_{a_i}^{t} \ln\left(p_{a_i}^{t}\right) \qquad \text{for } t \in \{ca, co\}.$$

Now, with this notation and following all the steps of the proof of Theorem 1, we get the following result.

**Theorem 2**. Let *A*_{i} = {*A*_{i}, *a*_{i}} be the alleles forming a single nucleotide polymorphism *SNP*_{i}. For a particular disease denote by *N* the number of individuals in the population, by *N*_{ca} the number of cases and by *N*_{co} the number of controls. Denote by *h*(*C*, *C*^{c}) the case-control entropy and by *h*(*A*_{i}), *h*(*A*_{i}, *ca*) and *h*(*A*_{i}, *co*) the allele entropy in the population, in cases and in controls respectively. If the allele *A*_{i} distributes equally in cases and in controls, then

$$2(2N) \left[ h(A_i) + h(C, C^c) - h(A_i, ca) - h(A_i, co) \right]$$

is asymptotically $\chi_1^2$ distributed.
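The allelic statistic works on the 2 × 2 allele-by-status table, with each individual contributing two alleles (an *AA* individual contributes two *A*'s, an *Aa* individual one of each). A sketch under the same hypothetical genotype counts, with 2*N* total alleles:

```python
import math

def shannon(freqs):
    return -sum(f * math.log(f) for f in freqs if f > 0)

# Hypothetical genotype counts for cases and controls.
ca = {"AA": 30, "Aa": 50, "aa": 20}
co = {"AA": 60, "Aa": 80, "aa": 60}
N = sum(ca.values()) + sum(co.values())   # 300 individuals
alleles = 2 * N                           # 2N alleles in total

# Allele counts: n^t_A = 2*AA + Aa and n^t_a = 2*aa + Aa.
n_ca = {"A": 2 * ca["AA"] + ca["Aa"], "a": 2 * ca["aa"] + ca["Aa"]}
n_co = {"A": 2 * co["AA"] + co["Aa"], "a": 2 * co["aa"] + co["Aa"]}
N_ca_al = sum(n_ca.values())              # 2 * N_ca
N_co_al = sum(n_co.values())              # 2 * N_co

h_Ai = shannon([(n_ca[x] + n_co[x]) / alleles for x in ("A", "a")])
h_Ai_ca = shannon([n_ca[x] / alleles for x in ("A", "a")])
h_Ai_co = shannon([n_co[x] / alleles for x in ("A", "a")])
h_CCc = shannon([N_ca_al / alleles, N_co_al / alleles])

# Allelic statistic 2(2N)[...], asymptotically chi-squared with 1 df.
GEA = 2 * alleles * (h_Ai + h_CCc - h_Ai_ca - h_Ai_co)
print(round(GEA, 3))
```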

### Consistency of the entropy test

Next we prove that the *GE*_{i} test is consistent against a wide variety of alternatives to the null. This is a valuable property, since the test will asymptotically reject the hypothesis that the *SNP*_{i} distributes equally between cases and controls whenever this assumption fails. The proof of the following theorem can be found in the Appendix section. Since the proof is similar for both statistics, we only prove it for *GE*_{i}.

**Theorem 3**. Let *SNP*_{i} be a single nucleotide polymorphism. If the *SNP*_{i} does not distribute equally in cases and in controls, then

$$\lim_{N \to \infty} P\left( GE_i > C \right) = 1$$

for every real number 0 < *C* < ∞.

Since Theorem 3 implies that *GE*_{i} → +∞ with probability approaching 1 whenever the *SNP*_{i} does not distribute equally in cases and in controls, upper-tailed critical values are appropriate.
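Consistency can be illustrated numerically: holding the (unequal) case and control genotype distributions fixed while *N* grows, the statistic grows linearly in *N*, so it eventually exceeds any fixed critical value *C* (the frequencies below are hypothetical):

```python
import math

def shannon(freqs):
    return -sum(f * math.log(f) for f in freqs if f > 0)

def GE(N):
    """GE_i when cases are 1/3 of N with genotype frequencies (0.3, 0.5, 0.2)
    and controls are 2/3 of N with (0.3, 0.4, 0.3) -- an unequal alternative."""
    ca = [0.3 * N / 3, 0.5 * N / 3, 0.2 * N / 3]            # expected case counts
    co = [0.3 * 2 * N / 3, 0.4 * 2 * N / 3, 0.3 * 2 * N / 3]
    h_S = shannon([(a + b) / N for a, b in zip(ca, co)])
    h_ca = shannon([a / N for a in ca])
    h_co = shannon([b / N for b in co])
    h_CC = shannon([1 / 3, 2 / 3])
    return 2 * N * (h_S + h_CC - h_ca - h_co)

values = [GE(N) for N in (300, 3000, 30000)]
print(values)  # grows roughly tenfold at each step, i.e. linearly in N
```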