Data collection
All sequences, species, and gene datasets collected in this study were based on Ensembl 102 (http://nov2020.archive.ensembl.org/index.html), scheme of which is depicted in Fig. 6 .
84 species were selected, which encompassed orders of vertebrates and one non-vertebrate species (D. melanogaster) (Fig. 2). Throughout the study, all species were compared with the human sequence, as reference. The list of species was extracted via RESTful API, in Java language. In parallel, a list of available gene datasets of the selected species was collected by using the “biomaRt” package [44, 45] in R language. In the next step, in each selected species, all protein-coding transcripts of protein-coding genes were extracted. To that end, identical gene names were used across the selected species to group orthologous/paralogous genes in those species.
Subsequently, the 120 bp upstream flanking sequence of all annotated protein coding TISs were retrieved and analyzed. All steps of data collection were performed by querying on the Biomart Ensembl tool via RESTful API, which was implemented in the Java language, except fetching the primary list of available species and gene datasets. For each species, its name, common name and display name were retrieved. For each gene in each species, its gene name, Ensembl ID and the annotated transcript IDs were retrieved, and finally, for each transcript, the coding sequence, the TIS, the upstream flanking sequence of the TIS, and the protein sequence were retrieved.
All collected data was stored in a MySQL database which is accessible at https://figshare.com/search?q=10.6084%2Fm9.figshare.15405267 .
A candidate sequence was considered a TR if it complied with the following four rules: (1) for mononucleotide cores, the number of repeats should be ≥6. (2) for 2–9 bp cores, the number of repeats should be ≥3. (3) for other core lengths, the number of repeats should be ≥2. (4) TRs of the same core sequence should not overlap if they were in the same upstream flanking sequence.
We categorized the TRs based on the core lengths as follows: Category 1: 1–6 bp, Category 2: 7–9 bp, Category 3: 10–15 bp, and Category 4: ≥16 bp. This was an arbitrary classification to allow for possible differential effect of various core length ranges in evolutionary and biological terms.
Retrieval of data across species
Using the enhanced query (Additional Table 5) form on the Biomart Ensembl tool along with the RESTful API tools, a Java package was developed to retrieve, store, and analyze the data and information. The source codes and the Java package are available at: https://github.com/Yasilis/STRsMiner-JavaPackage_PaperSubmission/tree/develop .
Identification of human-specific TRs
The 120 bp upstream flanking sequence of TISs of all annotated protein-coding transcripts of protein-coding genes were screened in 84 species for the presence of TRs in four categories based on the TR core length. The data obtained on the human TRs was compared to those of other species, and the TRs which were specific to human were identified.
To identify human-specific TRs, in the first step, the selected genes of all species were grouped based on gene name. Therefore, all homologous genes, consisting of orthologous and paralogous genes, were placed in one group. In each group, all the TRs located in the upstream flanking sequence of every transcript were extracted. In the next step, the extracted TRs were grouped and specified according to the species. All the TRs that were detected in more than one species were removed. The remaining TRs belonged to only one species and were specific to that species. Subsequently, we identified the human-specific TRs for a specific gene name by selecting the human species. This process was repeated for each group of genes and the results were aggregated together to identify all the TRs which were specific and non-specific in reference to human.
Evaluation of TIS homology
Identifying the degree of homology between two transcripts requires assigning a weight value to each position of the sequence. Weighted homology scoring was performed in two different weight settings, as weighing vectors W1 (originally used by our group for studying a link between STRs and TIS selection) [33] and W2, which can be distinguished by k = {1, 2}. These two weighing vectors are defined as follow (Eq. 1, 2):
$${W}_1=\left\{0,25,25,25,12.5,12.5\Big\}\right.$$
(1)
$${W}_2=\left\{0,20,20,20,20,20\Big\}\right.$$
(2)
If M is the first methionine amino acid of the two peptide sequences (position of 0 in the two weighing vectors), for all next five successive positions represented by i in the formula (Eq. 9), we defined five weight coefficients wk, 1 to wk, 5, observed in the Wk vector.
Homology of the first five amino acids (excluding the initial methionine), and, therefore the TIS, was inferred based on the value of pair-wise similarity scoring between human, as reference, and other species. A similarity of ≥50% was considered “homology”. This threshold was achieved following BLASTing three thousand random pair-wise similarity checks of the initial five amino acids of randomly selected proteins as previously described [33].
Scoring human-specific and non-specific TR co-occurrences with homologous and non-homologous TISs
In both weighing methods, the initial five amino acid sequence (excluding the initial methionine) of the human TISs that were flanked by human-specific and non-specific TRs were BLASTed against all the initial five amino acids (excluding the initial methionine) of the orthologous/paralogous genes in the remaining 83 species. The above was aimed at comparing the number of events in which human-specific and non-specific TRs co-occurred with homologous and non-homologous (TISs) in reference to human. For computing the number of homologous and non-homologous TISs, we needed to consider a number of assumptions. We defined G as the set of all human protein coding genes. Therefore, g denoted a gene that belonged to the G set (Eq. 3).
$$G=\left\{g|g\ is\ a\ human\ protein\ coding\ gene\right\}$$
(3)
We also defined TH(g) and \({T}_{\overline{H}}(g)\) as the set of all annotated transcripts in a gene g, which belonged to human and other species, respectively (Eqs. 4 and 5).
$${T}_H(g)=\left\{t\ |\begin{array}{c}\ t\ \mathrm{was}\ \mathrm{a}\ \mathrm{human}\ \mathrm{protein}\ \mathrm{coding}\ \mathrm{transcript}\ \mathrm{which}\ \\ {}\mathrm{belonged}\ \mathrm{to}\ \mathrm{the}\ \mathrm{gene},g\end{array}\right\}$$
(4)
$${T}_{\overline{H}}(g)=\left\{t\ |\begin{array}{c}\ t\ \mathrm{was}\ \mathrm{a}\ \mathrm{protein}\ \mathrm{coding}\ \mathrm{transcript}\ \mathrm{which}\ \mathrm{belonged}\ \\ {}\ \mathrm{to}\ \mathrm{the}\ \mathrm{gene},g\ \mathrm{but},\mathrm{did}\ \mathrm{not}\ \mathrm{exist}\ \mathrm{in}\ \mathrm{human}\end{array}\right\}$$
(5)
Moreover, T∗ denoted all filtered transcripts of T which had at least one human-.
specific TR at the 120 bp interval upstream of the TIS, while, T+ denoted all filtered transcripts of T, which had at least one TR at the 120 bp interval upstream of the TIS.
The following formula was developed to measure the degree of similarity of two peptides in the two weighing settings (Eq. 6).
$${H}_k=\sum_{g\epsilon G\ }\sum_{t_a\epsilon {T}_H^{\ast }(g)\ }\sum_{t_b\epsilon {T}_{\overline{H}}^{+}(g)\ }{\Theta}_k\left({t}_a,{t}_b\right)$$
(6)
In this formula, Θ is a binary function that decides whether the transcripts are homologous or not, and k = {1, 2} refer to each weight setting. If S function measures the similarity score, Θ can be defined as follow (Eq. 7):
$${\Theta}_k\left({t}_a,{t}_b\right)=\left\{\ \begin{array}{c}1, if\ {S}_k\left({t}_a,{t}_b\right)\ge 50\\ {}0,o.w.\end{array}\right.$$
(7)
For calculating the similarity score, we used another binary function. We defined Φ as follows: (Eq. 8):
$$\Phi \left(x,y\right)=\left\{\begin{array}{c}1, if\ x=y\\ {}0,o.w.\end{array}\right.$$
(8)
This function takes two amino acids as argument and returns 1 as output if they are the same, and zero if they are not the same. Therefore, S(ta, tb) is defined by the following formula (Eq. 9):
$${S}_k\left({t}_a,{t}_b\right)=\sum_{i=2}^6{w}_{k,i}\Phi \left({P}_i\left({t}_a\right),{P}_i\left({t}_b\right)\ \right)$$
(9)
In this function, the ith amino acid in the sequence of the transcript t, is denoted by Pi(t).
We replicated the comparisons in 10-fold cross-validation. In each-fold, genes with human non-specific TRs were randomly selected according to the number of genes in the group with human-specific TRs. This process was repeated for the two methods (two different weight vectors) and for each of the four categories of TRs. For each category and weighing method, the mean of the result of each round was calculated as a final result. Finally, the Fisher’s exact test was run for each-fold (Additional Table 2).