Revised eutherian gene collections

Abstract

Objectives

Recent research projects in the field of eutherian comparative genomics included plans to sequence the genome of every extant eutherian species in the foreseeable future, so that further revisions and updates of eutherian gene data sets were expected.

Data description

Using 35 public eutherian reference genomic sequence assemblies and freely available software, the eutherian comparative genomic analysis protocol RRID:SCR_014401 was published as guidance against potential genomic sequence errors. The protocol curated 14 eutherian third-party data gene data sets that included, in aggregate, 2615 complete coding sequences deposited in the European Nucleotide Archive. The published eutherian gene collections were used in revisions and updates of eutherian gene data set classifications and nomenclatures that included gene annotations, phylogenetic analyses and protein molecular evolution analyses.

Objective

Recent research projects in the field of eutherian comparative genomics included plans to sequence the genome of every extant eutherian species in the foreseeable future, so that further revisions and updates of eutherian gene data sets were expected [1,2,3,4,5,6,7,8,9,10,11,12,13]. For example, the census of human protein coding genes remained unfinished: contemporary estimates included about 20,000–21,000 protein coding genes in the human genome [14,15,16,17,18,19,20,21,22,23,24,25,26,27]. In addition, the proven utility of public eutherian reference genomic sequences could be compromised by potential genomic sequence errors, including analytical and bioinformatical errors, as well as errors of the Sanger DNA sequencing method [28,29,30,31,32,33].

Data description

Using public eutherian reference genomic sequence assemblies and freely available software, the eutherian comparative genomic analysis protocol was published as guidance against potential genomic sequence errors [34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49]. The protocol included 3 major processing steps that were integrated into one framework of eutherian gene data set descriptions: gene annotations, phylogenetic analysis and protein molecular evolution analysis. The protocol also published 3 original genomics and protein molecular evolution tests. First, the test of reliability of public eutherian genomic sequences used the genomic sequence redundancies of public eutherian reference genomic sequence assemblies. Second, the test of contiguity of public eutherian genomic sequences used multiple pairwise genomic sequence alignments. Third, the test of protein molecular evolution used relative synonymous codon usage statistics. The protocol was made available on Protocol Exchange [44].
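The relative synonymous codon usage statistic mentioned above can be illustrated with a minimal Python sketch. It assumes the conventional definition (the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid) and the standard genetic code. The function name rscu and the input sequence are hypothetical, and the sketch does not reproduce the protocol's own test of protein molecular evolution, only the underlying statistic.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(coding_sequence):
    """Return {codon: RSCU} for one in-frame coding sequence.

    Assumed definition: RSCU = count(codon) * n / total, where n is the
    number of synonymous codons for the amino acid and total is the summed
    count of those codons. Amino acids absent from the sequence and stop
    codons are skipped.
    """
    seq = coding_sequence.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families keyed by amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if aa == "*" or total == 0:
            continue
        for codon in family:
            values[codon] = counts[codon] * len(family) / total
    return values
```

A value of 1 indicates that a codon is used as often as expected under uniform synonymous usage; values above or below 1 indicate over- or under-representation.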

In aggregate, the eutherian comparative genomic analysis protocol curated 14 eutherian gene data sets implicated in major physiological and pathological processes, including 2615 published complete coding sequences that were made available in public biological databases as third-party data gene data sets [50,51,52,53,54,55,56,57,58,59,60,61,62,63] (Table 1). The curated gene data sets were deposited in the European Nucleotide Archive [7,8,9,12,13] in FASTA nucleotide sequence format. The published eutherian gene collections were used in revisions and updates of eutherian gene data set classifications and nomenclatures.

Table 1 Overview of eutherian third-party data gene data sets

Limitations

The revisions and updates of eutherian gene data sets were contingent on the primary Sanger DNA sequencing information deposited in the National Center for Biotechnology Information (NCBI) Trace Archive [12,13,46,64,65,66]. For example, a positive correlation was calculated between the genomic sequence redundancies of the 35 public eutherian reference genomic sequence assemblies and the numbers of curated complete coding sequences.
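The kind of correlation described above could be computed as in the following sketch. The text does not state which correlation coefficient was used, so the Pearson coefficient is an assumption here, and the function name pearson_r is hypothetical. The inputs would be two equally long lists with one value per assembly: the genomic sequence redundancy and the number of curated complete coding sequences.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)
```

A result close to +1 would correspond to the positive correlation reported in the text.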

Availability of data and materials

The data described in the present Data note can be freely and openly accessed in the European Nucleotide Archive under accessions: FR734011-FR734074, HF564658-HF564785, HF564786-HF564815, HG328835-HG329089, HG426065-HG426183, HG931734-HG931849, LM644135-LM644234, LN874312-LN874522, LT548096-LT548244, LT631550-LT631670, LT962964-LT963174, LT990249-LT990597, LR130242-LR130508 and LR760818-LR761312. Please see Table 1 and references [50,51,52,53,54,55,56,57,58,59,60,61,62,63] for details and URLs.
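As a consistency check, the accession ranges listed above can be expanded into individual accession numbers and counted. The sketch below assumes that every range is contiguous and that each accession consists of a letter prefix followed by a fixed-width number; the helper name expand_range is hypothetical.

```python
import re

# Accession ranges as listed in the availability statement above.
ACCESSION_RANGES = [
    "FR734011-FR734074", "HF564658-HF564785", "HF564786-HF564815",
    "HG328835-HG329089", "HG426065-HG426183", "HG931734-HG931849",
    "LM644135-LM644234", "LN874312-LN874522", "LT548096-LT548244",
    "LT631550-LT631670", "LT962964-LT963174", "LT990249-LT990597",
    "LR130242-LR130508", "LR760818-LR761312",
]


def expand_range(accession_range):
    """Expand 'AB000001-AB000003' into a list of individual accessions.

    Assumes a contiguous range whose two ends share the same letter prefix.
    """
    start, end = accession_range.split("-")
    m_start = re.fullmatch(r"([A-Z]+)(\d+)", start)
    m_end = re.fullmatch(r"([A-Z]+)(\d+)", end)
    if not m_start or not m_end or m_start.group(1) != m_end.group(1):
        raise ValueError(f"unexpected accession range: {accession_range}")
    prefix = m_start.group(1)
    width = len(m_start.group(2))  # keep zero padding of the numeric part
    first, last = int(m_start.group(2)), int(m_end.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


total = sum(len(expand_range(r)) for r in ACCESSION_RANGES)
print(total)  # 2615
```

The 14 ranges expand to 2615 accessions, which agrees with the number of complete coding sequences reported in the data description.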

References

  1. Genome 10K Community of Scientists. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. J Hered. 2009;100:659–74.

  2. Koepfli KP, Paten B, Genome 10K Community of Scientists, O'Brien SJ. The genome 10K project: a way forward. Annu Rev Anim Biosci. 2015;3:57–111.

  3. Lewin HA, et al. Earth BioGenome project: sequencing life for the future of life. Proc Natl Acad Sci U S A. 2018;115:4325–33.

  4. Gibbs RA. The human genome project changed everything. Nat Rev Genet. 2020;21:575–6.

  5. Green ED, et al. Strategic vision for improving human health at the forefront of genomics. Nature. 2020;586:683–92.

  6. Zoonomia Consortium. A comparative genomics multitool for scientific discovery and conservation. Nature. 2020;587:240–5.

  7. Arita M, Karsch-Mizrachi I, Cochrane G. The international nucleotide sequence database collaboration. Nucleic Acids Res. 2021;49:D121–4.

  8. Cantelli G, et al. The European bioinformatics institute: empowering cooperation in response to a global health crisis. Nucleic Acids Res. 2021;49:D29–37.

  9. Harrison PW, et al. The European nucleotide archive in 2020. Nucleic Acids Res. 2021;49:D82–5.

  10. Howe KL, et al. Ensembl 2021. Nucleic Acids Res. 2021;49:D884–91.

  11. Murphy WJ, Foley NM, Bredemeyer KR, Gatesy J, Springer MS. Phylogenomics and the genetic architecture of the placental mammal radiation. Annu Rev Anim Biosci. 2021;9:29–53.

  12. Sayers EW, et al. Database resources of the National Center for biotechnology information. Nucleic Acids Res. 2021;49:D10–7.

  13. Sayers EW, et al. GenBank. Nucleic Acids Res. 2021;49:D92–6.

  14. Clamp M, et al. Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci U S A. 2007;104:19428–33.

  15. Temple G, et al. The completion of the mammalian gene collection (MGC). Genome Res. 2009;19:2324–33.

  16. Pertea M, et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 2018;19:208.

  17. Pujar S, et al. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation. Nucleic Acids Res. 2018;46:D221–8.

  18. Salzberg SL. Open questions: how many genes do we have? BMC Biol. 2018;16:94.

  19. Mudge JM, et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Res. 2019;29:2073–87.

  20. Zerbino DR, Frankish A, Flicek P. Progress, challenges, and surprises in annotating the human genome. Annu Rev Genomics Hum Genet. 2020;21:55–79.

  21. Zhang D, et al. Incomplete annotation has a disproportionate impact on our understanding of Mendelian and complex neurogenetic disorders. Sci Adv. 2020;6:eaay8299.

  22. Blake JA, et al. Mouse genome database (MGD): knowledgebase for mouse-human comparative biology. Nucleic Acids Res. 2021;49:D981–7.

  23. Blum M, et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 2021;49:D344–54.

  24. Frankish A, et al. GENCODE 2021. Nucleic Acids Res. 2021;49:D916–23.

  25. Gene Ontology Consortium. The gene ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021;49:D325–34.

  26. Tweedie S, et al. Genenames.org: the HGNC and VGNC resources in 2021. Nucleic Acids Res. 2021;49:D939–46.

  27. UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49:D480–9.

  28. Hubisz MJ, Lin MF, Kellis M, Siepel A. Error and error mitigation in low-coverage genome assemblies. PLoS One. 2011;6:e17034.

  29. Prosdocimi F, Linard B, Pontarotti P, Poch O, Thompson JD. Controversies in modern evolutionary biology: the imperative for error detection and quality control. BMC Genomics. 2012;13:5.

  30. Norgren RB Jr. Improving genome assemblies and annotations for nonhuman primates. ILAR J. 2013;54:144–53.

  31. Denton JF, et al. Extensive error in the number of genes inferred from draft genome assemblies. PLoS Comput Biol. 2014;10:e1003998.

  32. Nagy A, Patthy L. FixPred: a resource for correction of erroneous protein sequences. Database (Oxford). 2014;2014:bau032.

  33. Meyer C, et al. Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes. BMC Bioinformatics. 2020;21:513.

  34. Premzl M. Comparative genomic analysis of eutherian interferon-γ-inducible GTPases. Funct Integr Genomics. 2012;12:599–607.

  35. Premzl M. Comparative genomic analysis of eutherian ribonuclease A genes. Mol Gen Genomics. 2014;289:161–7.

  36. Premzl M. Comparative genomic analysis of eutherian mas-related G protein-coupled receptor genes. Gene. 2014;540:16–9.

  37. Premzl M. Third party annotation gene data set of eutherian lysozyme genes. Genom Data. 2014;2:258–60.

  38. Premzl M. Initial description of primate-specific cystine-knot Prometheus genes and differential gene expansions of D-dopachrome tautomerase genes. Meta Gene. 2015;4:118–28.

  39. Premzl M. Third party data gene data set of eutherian growth hormone genes. Genom Data. 2015;6:166–9.

  40. Premzl M. Curated eutherian third party data gene data sets. Data Brief. 2016;6:208–13.

  41. Premzl M. Comparative genomic analysis of eutherian tumor necrosis factor ligand genes. Immunogenetics. 2016;68:125–32.

  42. Premzl M. Comparative genomic analysis of eutherian globin genes. Gene Rep. 2016;5:163–6.

  43. Premzl M. Comparative genomic analysis of eutherian kallikrein genes. Mol Genet Metab Rep. 2017;10:96–9.

  44. Premzl M. Eutherian comparative genomic analysis protocol. Protoc Exch. 2018. https://doi.org/10.1038/protex.2018.028.

  45. Premzl M. Comparative genomic analysis of eutherian adiponectin genes. Heliyon. 2018;4:e00647.

  46. Premzl M. Eutherian third-party data gene collections. Gene Rep. 2019;16:100414.

  47. Premzl M. Comparative genomic analysis of eutherian connexin genes. Sci Rep. 2019;9:16938.

  48. Premzl M. Comparative genomic analysis of eutherian fibroblast growth factor genes. BMC Genomics. 2020;21:542.

  49. Premzl M. Comparative genomic analysis of eutherian interferon genes. Genomics. 2020;112:4749–59.

  50. Premzl M. Accession numbers: FR734011-FR734074. Europ Nucleotide Arch. 2012; https://identifiers.org/ena.embl:FR734011.

  51. Premzl M. Accession numbers: HF564658-HF564785. Europ Nucleotide Arch. 2015; https://identifiers.org/ena.embl:HF564658.

  52. Premzl M. Accession numbers: HF564786-HF564815. Europ Nucleotide Arch. 2015; https://identifiers.org/ena.embl:HF564786.

  53. Premzl M. Accession numbers: HG328835-HG329089. Europ Nucleotide Arch. 2014; https://identifiers.org/ena.embl:HG328835.

  54. Premzl M. Accession numbers: HG426065-HG426183. Europ Nucleotide Arch. 2014; https://identifiers.org/ena.embl:HG426065.

  55. Premzl M. Accession numbers: HG931734-HG931849. Europ Nucleotide Arch. 2014; https://identifiers.org/ena.embl:HG931734.

  56. Premzl M. Accession numbers: LM644135-LM644234. Europ Nucleotide Arch. 2015; https://identifiers.org/ena.embl:LM644135.

  57. Premzl M. Accession numbers: LN874312-LN874522. Europ Nucleotide Arch. 2016; https://identifiers.org/ena.embl:LN874312.

  58. Premzl M. Accession numbers: LT548096-LT548244. Europ Nucleotide Arch. 2016; https://identifiers.org/ena.embl:LT548096.

  59. Premzl M. Accession numbers: LT631550-LT631670. Europ Nucleotide Arch. 2017; https://identifiers.org/ena.embl:LT631550.

  60. Premzl M. Accession numbers: LT962964-LT963174. Europ Nucleotide Arch. 2018; https://identifiers.org/ena.embl:LT962964.

  61. Premzl M. Accession numbers: LT990249-LT990597. Europ Nucleotide Arch. 2019; https://identifiers.org/ena.embl:LT990249.

  62. Premzl M. Accession numbers: LR130242-LR130508. Europ Nucleotide Arch. 2020; https://identifiers.org/ena.embl:LR130242.

  63. Premzl M. Accession numbers: LR760818-LR761312. Europ Nucleotide Arch. 2020; https://identifiers.org/ena.embl:LR760818.

  64. Blakesley RW, et al. An intermediate grade of finished genomic sequence suitable for comparative analyses. Genome Res. 2004;14:2235–44.

  65. Margulies EH, et al. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc Natl Acad Sci U S A. 2005;102:4795–800.

  66. Lindblad-Toh K, et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature. 2011;478:476–82.

Acknowledgements

MP would like to thank the manuscript reviewers for their reviews.

MP would like to express his gratitude to the data analysts, producers and providers of public eutherian reference genomic sequence data sets and freely available software.

Funding

Not applicable.

Author information

Contributions

MP conceived and prepared the manuscript. The author read and approved the final manuscript.

Corresponding author

Correspondence to Marko Premzl.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

No competing interests were declared.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Premzl, M. Revised eutherian gene collections. BMC Genom Data 23, 56 (2022). https://doi.org/10.1186/s12863-022-01071-9

  • DOI: https://doi.org/10.1186/s12863-022-01071-9

Keywords

  • Gene data set
  • Comparative genomics
  • Eutheria
  • RRID:SCR_014401