In this study, we aimed to identify the tandem repeats present within the protein-coding regions of the mouse genome, and to suggest potential functional features of PTR alleles. Our findings suggest that (i) mouse proteins contain tandem repeats, (ii) PTR alleles can also be present within evolutionarily conserved domains, (iii) protein folding properties can diverge from their wild-type state in the presence of PTR alleles, and (iv) disease-associated genes can also retain PTR alleles. Together, the novel mouse PTR datasets generated in this study suggest that these repeats could impact protein function by modulating protein stability and folding.
We have previously shown that SNPs, indels and SVs can play a major role in mouse phenotypic variation [15, 24]. However, these and other studies that focused on associating genetic variations with mouse phenotypes lack the power to fully explain phenotypic variation. This limitation could be reduced by analysing additional types of genetic variation, such as PTRs. Here, we documented PTR alleles in 562 proteins from 71 mouse genomes, along with their potential to affect protein folding. Previous studies have established that the presence of even one additional amino acid can impact the function and stability of a protein [25]. Our results indicate that mouse proteins carry extensive variation due to PTR alleles, which could alter wild-type protein folding. We also observed a set of 165 proteins that contain PTR alleles but no SNP or indel alleles. This set included several crucial proteins such as homeobox factors, for example Hoxa11, Hoxb3 and Hoxd13. This observation shows that a large group of repeat alleles previously went unnoticed and could contribute to the limited predictability of phenotypic variation.
Additionally, we have shown several crucial features of PTR alleles (as mentioned above). Homo-, small and micro-repeats were recently reported to be located at both the N- and C-termini [26]; consistent with this, we observed that mouse PTRs were present in almost equal numbers at both termini. Previous findings suggested that the protein domains that most frequently contain PTRs in eukaryotes include WD40, zf-C2H2, LRR_8 and RRM [26]. Our results suggest that RRM is the most frequent domain type in the strains we studied (Fig S1). RRM domains are typically 90 amino acids long and are considered multifunctional regulators of development, cell differentiation, signalling, and gene expression [4]. In addition, we identified PTRs within homeobox domains. Homeobox domains regulate gene expression during cell differentiation at early stages of embryogenesis. Unsurprisingly, genetic anomalies in these regions cause developmental defects with severe consequences, such as loss or deformation of body segments [27].
Perhaps the most interesting PTR feature is the detection of these alleles in disease-associated proteins. Previous understanding of these disease-related proteins was based on non-PTR variations. Our observation shows that a disease-associated protein might carry PTR allele(s) rather than a disease-causing SNP, indel or SV. For instance, the rare extension PTR alleles present within the Gigyf2 and Hectd4 proteins could have gone undetected if a study seeking to explain phenotypic variation had focused only on SNP or indel variations. Including PTR alleles alongside other types of alternative alleles can help provide a comprehensive map of mouse genomic variation. Future studies should take advantage of such datasets to perform more effective mouse genotype-to-phenotype association analyses. Together, the datasets produced in this study may enable deeper analyses in future studies that aim to identify phenotype regulatory factors more broadly.
The availability of highly accurate protein models from novel algorithms such as AlphaFold made it feasible to perform these analyses and produce reliable results. Moreover, new sequencing technologies such as long-read sequencing can further enhance analyses of genomic variation. We relied on short-read data, which traditionally are limited in identifying variations where the length of an allele is a consideration; in this regard, our study might have limitations. Nevertheless, we hope that future studies will use the above-mentioned technologies to identify additional PTR alleles and to help fill in the remaining missing links between phenotype and genotype.
In conclusion, we have shown that PTR alleles from mouse genomes have several functional features, and that a better understanding of these alleles could help improve the interpretation of outcomes from mouse phenotype-based experiments. We showed that (i) PTR alleles are present within functional protein regions and domains, (ii) they can potentially impact protein folding, and (iii) disease-associated genes also carry PTR alleles. With this study, we contribute to further establishing the importance of protein repeat regions in the mouse genome and stress the need to include repeat alleles in future studies.