Building a reference transcriptome for Juniperus squamata (Cupressaceae) based on single-molecule real-time sequencing

Objectives: Cupressaceae is the second largest family of coniferous trees (Coniferopsida) and has important economic and ecological value. However, like other conifers, members of Cupressaceae have extremely large genomes (> 8 gigabases), which has limited research on these taxa. A high-quality transcriptome is an important resource for gene discovery and annotation in non-model organisms. Data description: Juniperus squamata, a tetraploid species that is widely distributed in Asian mountains, represents the largest genus, Juniperus, in Cupressaceae. Single-molecule real-time sequencing was used to obtain the full-length transcriptome of Juniperus squamata. The full-length transcriptome was corrected with Illumina RNA-seq data from the same individual. A total of 47,860 non-redundant full-length transcripts, with an N50 of 2839, were obtained. A total of 57,393 simple sequence repeats were identified and 268,854 open reading frames were predicted for Juniperus squamata. A BLAST alignment against the non-redundant protein database was conducted, and 10,818 sequences were annotated in the Gene Ontology database. InterPro analysis shows that 30,403 sequences have been functionally characterized against its member databases. These data present the first comprehensive transcriptome characterization of a Juniperus species and provide an important reference for future research on the genomics and evolutionary history of Cupressaceae and conifers.

The objective of this work is to generate full-length transcriptome sequences for Juniperus squamata. Considering the importance of simple sequence repeats (SSRs) to plant population genetic analysis, we also developed SSRs for this species [8,9]. To functionally characterize the full-length transcriptome, open reading frame (ORF) prediction and Gene Ontology (GO) annotation were performed [10]. To functionally analyze the proteins, the final isoforms were searched against InterPro's predictive models [11]. The full-length transcriptome data set of Juniperus squamata provides an important reference for downstream analyses, such as studies of the genomic basis of environmental adaptation and of genome evolution in Cupressaceae and conifers more broadly.
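As a rough illustration of what SSR identification involves, the following minimal Python sketch scans a sequence for perfect tandem repeats with a regular expression. The function name `find_ssrs` and the minimum repeat counts in `MIN_REPEATS` are hypothetical and chosen only for illustration; they are not the tool or the thresholds used to produce the SSR counts reported here.

```python
import re

# Hypothetical thresholds (motif length -> minimum number of copies).
# These values are illustrative assumptions, not taken from this work.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, copies) tuples for perfect SSRs (0-based start)."""
    seq = seq.upper()
    hits = []
    for unit, min_copies in sorted(min_repeats.items()):
        # A motif of `unit` bases followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA"),
            # so a homopolymer run is not also reported as a dinucleotide SSR.
            if any(
                motif == motif[:k] * (unit // k)
                for k in range(1, unit)
                if unit % k == 0
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


if __name__ == "__main__":
    example = "GGC" + "AT" * 7 + "CCGT" + "A" * 12 + "TTG"
    for start, motif, copies in find_ssrs(example):
        print(start, motif, copies)
```

The sketch only reports perfect repeats; dedicated SSR tools typically also handle compound and interrupted repeats, which are out of scope here.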

Data description
Fresh leaves, stems, and strobiles of one Juniperus squamata individual were collected from Kangding, Sichuan Province, China. For each tissue, short paired-end reads were sequenced on the Illumina platform. We also pooled the samples from all tissues and generated long reads on the PacBio Sequel platform. Total RNA was isolated using the Plant RNA Kit (Omega Bio-tek, USA) and then treated with RNase-free DNase I (NEB) to remove DNA. RNA degradation and contamination were monitored on 1% agarose gels, and RNA purity was checked using the NanoPhotometer® spectrophotometer (IMPLEN, CA, USA). RNA concentration was measured using the Qubit® RNA Assay Kit in a Qubit® 2.0 Fluorometer (Life Technologies, CA, USA). RNA integrity was assessed using the Bioanalyzer 2100 system (Agilent Technologies, CA, USA). The single-molecule real-time (SMRT) bell library was constructed with the Pacific Biosciences DNA Template Prep Kit 2.0, and SMRT sequencing was then performed on the Pacific Biosciences Sequel System. The sample used for Illumina sequencing was harvested using the same methods, and its library was constructed for the Illumina HiSeq X Ten platform. Adapter clipping and quality filtering of the Illumina raw reads were done using Trimmomatic version 0.36 [12]. Based on the quality check, the last two base pairs of each read were removed to minimize the overall sequencing error.
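The removal of the last two bases from each read can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `trim_fastq_tail` is hypothetical, the input is assumed to be plain four-line FASTQ records, and the text does not say which tool or option was actually used for this step.

```python
def trim_fastq_tail(lines, n=2):
    """Yield FASTQ lines with the last n bases and qualities removed per record.

    Assumes standard four-line FASTQ records (header, sequence, '+', quality).
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            if len(seq) != len(qual):
                raise ValueError("sequence/quality length mismatch in " + header)
            keep = max(len(seq) - n, 0)  # never use a negative slice bound
            yield from (header, seq[:keep], plus, qual[:keep])
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


if __name__ == "__main__":
    demo = ["@read1\n", "ACGTAC\n", "+\n", "IIIIHH\n"]
    print("\n".join(trim_fastq_tail(demo)))
```

Trimming the sequence and the quality string together keeps each record valid for downstream tools that check that both have the same length.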
To obtain a high-quality corrected consensus sequence, additional nucleotide errors in the polished consensus sequence were corrected with LoRDEC version 0.7 [13] (parameters: -k 23 -s 3), using the Illumina RNA-seq data obtained from the same individual. Redundancy in the corrected consensus sequence was removed with CD-HIT version 4.6.1 [14] (parameters: -c 0.95 -T 6 -G 0 -aL 0.00 -aS 0.99 -AS 30) to obtain a final set of unique transcript isoforms. Benchmarking Universal Single-Copy Orthologs (BUSCO) version 3 was used to assess the quality of the final transcript isoforms [15]. The summary statistics and length distributions of the PacBio SMRT sequencing are shown in Data file 1 (Table S1 and Fig. S1). The BUSCO results are shown in Data file 1 (Table S2). All three data sets and their NCBI GenBank accession numbers are listed in Table 1 (Data set 1, Data set 2, and Data set 3).
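Summary statistics such as the N50 of the final isoform set can be recomputed from a FASTA file with a short script. The sketch below is a minimal illustration, not the code used for this data set; the helper names `fasta_lengths` and `n50` are hypothetical. N50 is taken here in its usual sense: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total bases.

```python
def fasta_lengths(lines):
    """Return the length of every sequence in FASTA-formatted lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)  # start counting a new record
        elif line and lengths:
            lengths[-1] += len(line)  # sequence may span several lines
    return lengths


def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    demo = [">t1", "ACGTAC", ">t2", "ACG", "T", ">t3", "AC"]
    print(n50(fasta_lengths(demo)))
```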
DIAMOND version 2.0.9.147 was used to align the final unique transcript isoforms against the non-redundant protein database with a significance threshold of E ≤ 10⁻⁵ [17]. A custom Python (https://www.python.org/) script was used to carry out GO annotation (available at https://github.com/shanzha09/GO-annotation.git). InterProScan version 5.52-86.0 was used to search the final isoforms against the InterPro database [18]. The results of the BLASTX alignment, GO annotation, and InterPro analysis are shown in Data file 4, Data file 5, and Data file 6, respectively.
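To make the significance threshold concrete, the following Python sketch filters alignment results by E-value and keeps the best-scoring hit per transcript. It assumes the alignments were written in the standard 12-column BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score); the output format actually used is not stated in the text, and the function name `best_hits` is hypothetical.

```python
import csv


def best_hits(tsv_lines, max_evalue=1e-5):
    """Map each query ID to its best (subject, evalue, bitscore) hit.

    Hits with an E-value above max_evalue are discarded; among the rest,
    the hit with the highest bit score is kept for each query.
    """
    best = {}
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    demo = [
        "t1\tP1\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-30\t200.0",
        "t1\tP2\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-40\t250.0",
        "t2\tP3\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.01\t30.0",
    ]
    for query, hit in best_hits(demo).items():
        print(query, *hit)
```

Selecting one hit per query by bit score is a design choice made for this sketch; an annotation pipeline may instead keep several hits per transcript.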

Limitations
A limitation of this work is that only one sample was collected for single-molecule real-time sequencing of the transcriptome.