General workflow to develop new reference databases
In this study, we present DB4Q2, a set of detailed procedures to develop reference databases directly usable in the QIIME2 bioinformatics platform. To our knowledge, it is the first time that a detailed protocol explains from start to finish how to use a NCBI sequence dataset to develop such a database. Interestingly, this procedure can be applied for any dataset imported from NCBI, given their data structure uniformity. This means that the methodology presented can be applied to develop reference databases for any domain of life and not only for plants. In addition, it has been shown that some inconsistencies may be encountered while working with reference sequences directly imported from public databases [40]. In such cases, curating a subset of these sequences to develop a custom database is necessary and the workflow presented here should be of great help.
Newly developed plant ITS2 and rbcL reference databases
After having collected and formatted all necessary sequence and taxonomic information for the sequence barcode of interest, several filtering steps are applied in order to curate the database. Two of them, dereplication and amplicon restriction, are optional and lead to major drops in the sequence count (Table 1). When carrying out dereplication, only strictly identical sequences were clustered together. Several tens of thousands of sequences were thus discarded but the amount of represented species remained the same. This reflects the ‘uniq’ mode used during dereplication, which allowed keeping identical sequences with different taxonomic labels. This step enabled discarding of redundant information and to propose more computationally efficient databases. The second optional step involved the restriction of reference sequences to the portion amplified by commonly used PCR primers. This had a strong impact on the count of sequences and represented species in databases. The utility and relevance of these two optional steps are discussed below.
After having applied all filtering steps, a little more than 60,000 species were represented in the rbcL reference databases, reflecting a marked increase compared to the 38,409 plant species reported by Bell et al. in 2017 [18] and to the 49,936 species in the database developed in 2020 by Richardson et al. [24]. For the ITS2 sequence barcode, the almost 70,000 species represented in the databases also illustrated an increase in species count compared to the 54,164 plant species reported by Richardson et al. [24]. It is, however, a little less than the 72,325 species reported by Sickel et al. [17], which reflects the effect of the different filters applied in DB4Q2 and not present in the workflow of Sickel and colleagues.
Addressing existing bottlenecks when developing a new reference database
As previously mentioned, some ITS2 and rbcL reference databases have already been published but, for some of them, without precise explanations detailing how reference datasets were generated. Such a ‘black box’ system should be avoided in order to have a clear visibility on each step of the workflow. That is the reason why DB4Q2 has been extensively detailed and commented, so that the user can understand which operation is carried out at each step and evaluate the relevance according to its study specifications. Furthermore, with the advent of HTS technologies, many laboratories are launching new research activities using DNA metabarcoding but sometimes without extensive bioinformatics knowledge. In this kind of situation, it is not rare to see the use of existing tools in a rather blind manner or the outsourcing of analyses, which lowers the control and understanding that the user has on every database-processing step. The detailed procedures presented in DB4Q2 should also help those teams avoid this kind of problems.
When evaluating how current bottlenecks are addressed with DB4Q2, it is interesting to compare it with RESCRIPt since they are both intended to generate QIIME2-formatted databases. RESCRIPt is a remarkable tool built by the QIIME2 developer team with many useful applications. However, we noticed that the command used to import directly from the NCBI a reference dataset and format it in an automated way into a functional database could not handle large datasets, probably due to NCBI download limitations. This issue was faced when trying to retrieve the rbcL reference dataset and should probably occur often when dealing with other plant chloroplastic barcodes (or with mitochondrial barcodes commonly used e.g. in animal metabarcoding). Indeed, a part of the entries downloaded from the NCBI is actually complete chloroplast/mitochondrion genomes, which significantly increases the volume of data. The DB4Q2 provides an answer to this bottleneck since it allowed downloading both ITS2 and rbcL datasets without any issue. In addition, our workflow also proposes an almost completely offline procedure to skip this downloading step and associated difficulties.
Another bottleneck the user may face when developing a reference database is the inaccuracies of taxonomic identifications in NCBI records [41,42,43]. This sequence mislabeling can of course hinder accurate taxonomic assignment of sequencing reads but also lead to perpetuation of errors when using these data [44]. Given that fungi are often co-occurring in surface or inside plant tissues, this issue is particularly true in plant metabarcoding studies. Indeed, there is an additional risk of amplifying fungi DNA instead of, or together with, that of the target plant species [45], especially when working with ITS primers, even if they are plant-specific [46]. Beside this problem related to fungi sequences, a reference sequence may simply have been assigned to a plant taxa instead of another one. To remove these entries, blasting all database sequences against themselves allowed discarding those for which the expected taxonomy at the family rank was observed only once in the five best matches. This strategy should allow filtering out many misidentified entries but probably not all. Indeed, the comparison of expected and predicted taxonomies could not be carried out at a lower taxonomic rank since the exact same sequence can be shared by several species and even sometimes several genera when a barcode marker does not display enough sequence divergence. Hence, if expected and predicted taxonomies were compared at the genus or species level, the risk would be to discard sequences for which the identification was actually correct. This is the reason why we chose the family rank as an appropriate trade-off between filtering out enough mislabeled sequences while avoiding as much as possible the removal of sequences correctly identified. To evaluate the impact of these filters, it is interesting to note that the first parts of the DB4Q2 and RESCRIPt workflows are almost identical but there is no filter to remove fungi sequences nor more generally mislabeled plant sequences in RESCRIPt. The increase in prediction accuracy observed between RESCRIPt and DB4Q2 databases (Fig. 5) thus provides a good illustration of the beneficial effect of these filtering steps.
Among previously published reference databases and pipelines, several strategies are observed like the use of trimmed reference sequences provided by the user to build an amplicon-restricted dataset [24], querying public repositories using user-defined primers [23], the sequence dereplication taking their taxonomy into account [24] or even their clustering at 99% identity [25]. Despite being very interesting, these strategies may not be relevant for every research context. For example, it has been shown that the amplicon restriction of reference sequences can have a positive impact on taxonomic predictions for some barcode sequences [47], whereas it is not recommended for others [48]. In consequence, DB4Q2 has been written with some optional sections so that the user can decide whether or not to include these critical steps in the workflow.
The importance of (not) dereplicating database
Dereplication is a sequence-processing step commonly used to build a curated reference database [24, 25, 39]. It often allows a significant reduction of the database size, thus increasing its computational efficiency. When analyzing metabarcoding data, some widely used taxonomic classifiers are based on a consensus strategy by considering the taxonomic labels of e.g. the five or ten best matches from the database to assess the taxonomy of sequencing reads. Considering that, the dereplication step presents the additional advantage to give more weight to under-represented taxa in the database. On the counterpart, more frequent taxa are thus disadvantaged in such an approach by setting them on equal footing with under-represented ones, which is probably not the best strategy when working in deeply studied areas.
The most relevant dereplication approaches take taxonomic labels into account to discard identical sequences. In this work, the influence of this step was tested according to two dereplication settings. The first one is the ‘uniq’ mode, where two identical sequences with different taxonomies are both kept and their taxonomic labels remain unchanged. In the second mode (‘majority’), when identical sequences have different taxonomies, only one is retained together with the most common taxonomic label associated with these sequences. In k-fold CV tests, sequence dereplication did not have a significant impact for the rbcL barcode while it tended to decrease the prediction accuracy for ITS2 (Fig. 4). Conversely, in leaked CV tests, it did not have a major effect for the ITS2 databases while rbcL accuracy values seemed to be positively affected, especially in majority mode. This other example illustrates the effect that some parameter choices can have on metabarcoding analysis outcome and thus supports the flexibility of DB4Q2 with its optional sections, including at the dereplication step.
It must however be noted that the dereplication in ‘majority’ mode has been tested but is not advised nor proposed in the DB4Q2 workflow, at least for rbcL, as it can lead to a higher proportion of mislabeled sequences after dereplication. Despite the fact that relabeling of identical sequences with the most frequent taxonomic lineage can be seen as a convenient way to correct identification mistakes, it must be avoided when working with barcodes with insufficient sequence divergence (like rbcL). Indeed, it is not rare to face several species that have the exact same rbcL sequence and relabeling all them with the most frequent taxonomy would erroneously increase computed prediction accuracies (as observed for rbcL in Fig. 4) while not being representative of the taxonomic diversity anymore.
Comparison with other published databases
The ITS2 and rbcL reference databases developed in this work were compared to the published ones presented above. These comparisons allowed investigation of the sequence and taxonomic information held in each database, as well as evaluating the accuracy of their taxonomic assignments (Figs. 2, 3 and 5).
For both barcode sequences, the DB4Q2 databases showed the highest unique sequence count compared to other databases (Fig. 2A and C). This can be explained by their recentness compared to others. In addition, they did not undergo an amplicon extraction – which unavoidably provokes a sequence loss – while the databases developed in Curd et al. [23] and Richardson et al. [24] did. The only exception is the ITS2 and rbcL datasets built with BCdatabaser [26], which displayed a surprisingly high number of sequences, particularly for the ITS2 barcode. A deeper analysis showed that a part of the sequences in the database did not cover the ITS2 region (e.g. more than 27,000 sequences displayed the string “external transcribed spacer” in their definition line). This means that the query string inserted in the pipeline probably matched with more than only ITS2 sequences. A similar observation was made with the rbcL database for which almost 13,000 sequences did not exhibit the keywords “rbcL” or “ribulose” in their definition line (but they did in the article title section of their Genbank record for example, which could explain the confusion). These observations explain the higher peaks observed for BCdatabaser datasets in Fig. 2A and C. The databases developed by Banchi et al. for ITS2 [25] and Bell et al. in 2021 for rbcL [22] are dereplicated and thus showed identical counts for total and unique sequences.
The comparison of sequence length distribution showed that ITS2 sequences were on average shorter than rbcL ones (Fig. 2B and D). This is consistent with the fact that the ITS2 fragment is in the 200–250 bp range [49, 50] while the rbcL gene is about 1400 bp long [51]. This comparison also revealed a close relationship between the strategy used to generate these databases and their length distribution profile. For ITS2, Sickel et al. [17] and Banchi et al. [25] used hidden Markov models to extract only the ITS2 portion from downloaded sequences, which explains the shorter average length for these datasets. The databases developed by Richardson et al. [24] and Curd et al. [23] exhibited sequence length distribution centered around 300–400 bp and thus reflected the sequence amplicon extraction carried out in both workflows. The last group of ITS2 reference libraries included the ones developed by Keller et al. [26], Bell et al. [22] and those built using RESCRIPt [39] and DB4Q2. No sequence extraction step was performed in any of these studies, which explains why their length distribution profiles are more spread out. The peaks observed for these databases around 700 bp reflect mostly cases where the amplicon spanned the ITS1–5.8S-ITS2 region. On the rbcL side, besides the individual peak observed for the amplicon-restricted database from MetaCurator, the peaks visible around 600 bp and 1400 bp correspond respectively to the typical length of barcode markers used in Sanger sequencing on the one hand, and to the complete sequence of the rbcL-coding gene on the other hand.
Interestingly, when investigating sequence entropy, the databases developed using DB4Q2 compared well to published databases for both barcodes, despite having discarded several thousand sequences that did not meet quality requirements (Fig. 3A and C). This high sequence entropy can be attributed to several factors like the database recentness, the absence of an amplicon extraction step and the taxonomic coverage of downloaded sequences: most databases studied here cover the whole kingdom of plants whereas Bell et al. developed rbcL and ITS2 databases dedicated to the Spermatophyta clade (seed plants) and the Magnoliopsida class (flowering plants), respectively. The higher entropic values observed for BCdatabaser reference libraries must be analyzed with caution given that a fraction of their records are actually not ITS2 nor rbcL sequences, as previously mentioned.
To evaluate the amount of information at each taxonomic rank in each database, the taxonomic entropy was measured (Fig. 3B and D). The greater variability observed at the class level reflects two distinct phenomena. On the one hand, the databases from Bell et al. with restrained taxonomic breadth (see above) explain the lower class-level taxonomic entropies observed for these databases. On the other hand, it was noticed that several databases included in this comparison did not display any class label in their taxonomic lineages, and this problem occurred mostly for the Manoliopsida class. Instead, the labels showed annotations related to lower taxonomic ranks like ‘c__urs_o__Brassicales’ or ‘c__sub__asterids’. This significantly increased the amount of information in class-level labels, which led to an overestimation of the taxonomic entropy at this rank. The trends were much more similar between databases for the other taxonomic levels, BCdatabaser slightly differing from the other workflows at the species rank for the reasons mentioned earlier.
Finally, the classification accuracy of ITS2 and rbcL reference databases was evaluated (Fig. 5). In general, the rbcL databases showed lower accuracy scores compared to ITS2 datasets. This reflects the higher degree of conservation of the rbcL barcode compared to ITS2, as rbcL may not display enough sequence divergence across taxa, even sometimes at the genus level.
Classifying all sequences against themselves (leaked CV) revealed that the reference datasets generated in this work were among the ones with the highest scores at the species rank. Notably, the rbcL reference library developed by Bell et al. in 2021 [22] exhibited a significantly better classification accuracy. To build this database, the authors appended new reference sequences from Australian plant species to the one built in 2017. Despite the addition of new sequences, this updated database displayed fewer sequences (with identical counts of total and unique sequences) and an identical taxonomic entropy compared to the one built in 2017. Given that the authors mentioned further formatting work, the most probable hypothesis is that a sequence dereplication in majority mode has been carried out, which is known to increase prediction accuracy in leaked CV (Fig. 4). In addition, the lower amount of sequences in this dataset also favors better accuracy scores. Indeed, the fewer sequences are included in the comparison, the lower is the risk that the classification accuracy is confounded by other similar hits.
When assessing taxonomic accuracy in a k-fold CV scheme, the dereplicated databases should theoretically be disadvantaged. Indeed, in this case, it is less likely that a second sequence belonging to a particular taxon is present in the training dataset if a first one has already been extracted to be included in the test set. The comparison of ITS2 reference datasets could indeed highlight this phenomenon, the dereplicated databases exhibiting F-measures slightly (MetaCurator) or strongly (PLANiTS2) decreased compared to the best accuracy levels measured (Fig. 5A).
Interestingly, the comparison of results obtained in best possible (leaked CV) vs pseudo-realistic conditions (k-fold CV) highlighted very contrasting performances for databases developed by Banchi et al. (ITS2) and Richardson et al. (rbcL). This indicates that, according to the kind of samples analyzed and their expected level of challenge (i.e. composed of taxa either with or without existing reference sequences), these databases will probably show very different classification accuracies. In contrast, other reference datasets including the ITS2 and rbcL databases developed using DB4Q2 displayed consistent accuracy levels in both situations, making them versatile reference datasets to be used in DNA metabarcording analyses.