A de novo assembly of the sweet cherry (Prunus avium cv. Tieton) genome using linked-read sequencing technology

The sweet cherry (Prunus avium) is one of the most economically important fruit species in the world. However, the limited genetic information available for this species hinders breeding efforts at the molecular level. Here we describe a high-quality reference genome assembly and annotation of the diploid sweet cherry (2n = 2x = 16) cv. Tieton, generated using linked-read sequencing technology. We generated over 750 million clean reads, representing 112.63 GB of sequencing data. The Supernova assembler produced a more highly ordered and continuous genome sequence than the current P. avium draft genome, with a contig N50 of 63.65 KB and a scaffold N50 of 2.48 MB. The final scaffold assembly was 280.33 MB in length, representing 82.12% of the estimated Tieton genome. Eight chromosome-scale pseudomolecules were constructed, comprising 214 MB of the final scaffold assembly. De novo, homology-based, and RNA-seq methods were combined to predict 30,975 protein-coding loci. In total, 98.39% of core eukaryotic genes and 97.43% of the single-copy orthologues of the embryophyte (land plant) set were identified, indicating the completeness of the assembly. Linked-read sequencing technology was effective in constructing a high-quality reference genome of the sweet cherry, which will benefit molecular breeding and cultivar identification in this species.


INTRODUCTION
The sweet cherry (Prunus avium) originated in Asia Minor near the Black Sea and the Caspian Sea. It is one of the most economically significant fruit species in the world (Quero-García et al., 2017), and its production in China has increased dramatically over the last three decades with the expansion of acreage dedicated to its cultivation.

150-nt Chromium-linked paired-end reads on an Illumina HiSeq X Ten sequencer (Illumina, http://www.illumina.com/). We filtered out raw reads with >5% undetermined bases (Ns), with >30% of nucleotides having a quality score lower than 20, or with an adapter sequence overlap of >5 bp.
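The read-filtering thresholds stated above can be sketched as a small predicate. This is a minimal illustration, not the pipeline actually used: the function name `passes_filter`, its parameter names, and the assumption that the adapter overlap has already been measured by an upstream tool are all hypothetical.

```python
def passes_filter(seq, quals, adapter_overlap=0,
                  max_n_frac=0.05, max_lowq_frac=0.30,
                  min_q=20, max_adapter_overlap=5):
    """Return True if a read survives the three stated thresholds.

    seq             -- read bases as a string
    quals           -- per-base Phred quality scores (list of ints)
    adapter_overlap -- adapter overlap in bp, assumed to be computed elsewhere
    """
    n = len(seq)
    if n == 0:
        return False
    # Fraction of undetermined bases (Ns); reads with >5% are discarded.
    n_frac = seq.upper().count("N") / n
    # Fraction of bases with quality below 20; reads with >30% are discarded.
    lowq_frac = sum(q < min_q for q in quals) / n
    return (n_frac <= max_n_frac
            and lowq_frac <= max_lowq_frac
            and adapter_overlap <= max_adapter_overlap)
```

For paired-end data, one plausible convention (an assumption here, not stated in the text) is to discard a pair when either mate fails the predicate.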

De novo assembly and evaluation
We estimated the size of the sweet cherry genome from the k-mer frequency of the sequence data using the k-mer counting program Jellyfish (v.2.0.8) (Marcais & Kingsford, 2011) and GenomeScope (v1.0.0) (Vurture et al., 2017). The genome was assembled and scaffolded using the Supernova assembler (v2.0, https://www.10xgenomics.com/). This program uses barcode information to link sequencing reads to their originating high-molecular-weight (HMW) DNA molecule and constructs phased, whole-genome de novo assemblies from the Chromium-prepared library (Weisenfeld et al., 2017). Chromium-linked reads at different coverage depths (40×, 50×, 60×, 65×, 68×, 70× and 75×) were used as input data. The assembly built from 70× read coverage was selected for analysis because of its superior quality, including higher contig N50 and scaffold N50. Default parameters were used and two pseudohap assemblies were generated; pseudohap1 was used for further analysis. To evaluate the quality of the sweet cherry cv. Tieton genome assembly, a total of 150 million reads were sampled and aligned to the assembled genome sequence using the Burrows-Wheeler Aligner (BWA, 0.7.17-r1188) (Li & Durbin, 2009). Core Eukaryotic Genes Mapping Approach (CEGMA, v2.5) (Parra, Bradnam & Korf, 2007) and Benchmarking Universal Single-Copy Orthologs (BUSCO, v3.0, embryophyta_odb10) (Simao et al., 2015) were used to assess the completeness of the assembly.
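Because the candidate assemblies are compared by contig N50 and scaffold N50, it may help to recall how N50 is defined: the length of the sequence at which the cumulative length, counted from the longest sequence downward, first reaches half of the total assembly length. The sketch below uses the hypothetical function name `n50` and takes a plain list of sequence lengths; it is an illustration of the statistic, not the code used by Supernova.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum to half of the
        # total assembly length defines the N50.
        if running >= half_total:
            return length
    return 0  # empty input
```

Applied to the scaffold lengths of each coverage subsample, such a function would allow the assemblies to be ranked in the way the selection of the 70× assembly implies.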

Chromosome-scale pseudomolecule construction
Scaffolds assembled by Supernova were ordered and oriented using seven previously published sweet cherry genetic maps to construct the chromosome-scale pseudomolecules. Five of the seven maps were built by Shirasawa et al. (2017), Peace et al. (2012), Klagges et al. (2013), Calle et al. (2018) and Guajardo et al. (2015). The initials of the first author were used to name their respective maps, which are referred to as KS, CP, CK, AC and VG, respectively. The other two maps, named JWF (the framework map of the WxL map) and JWF1 (the second round of the WxL map), were both reported by Wang et al. (2015). Genetic markers and/or flanking sequences for these maps were aligned to the current scaffolds using GMAP (v2018-07-04) (Wu & Watanabe, 2005) as described by Hulse-Kemp et al. (2018). Markers were manually filtered out if they aligned to more than one scaffold or to the same scaffold in different linkage groups. The GMAP alignment results were used as input to ALLMAPS (v0.8.4) (Tang et al., 2015) to generate the final consensus map and chromosome-scale pseudomolecules. Different weight parameters were tried for the seven linkage maps, and the optimal weight settings, which gave the largest number of anchored and oriented scaffolds, were: KS = 2, CP = 3, CK = 1, AC = 1, VG = 1, JWF = 1 and JWF1 = 1.
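The marker filtering described above was done manually, but the two criteria can be sketched programmatically. The following is one possible reading under stated assumptions: hits are given as (marker, linkage group, scaffold) tuples from a single genetic map, markers hitting several scaffolds are dropped, and markers on a scaffold that receives hits from more than one linkage group are set aside for inspection rather than silently kept. The function name `filter_markers` and the tuple layout are hypothetical.

```python
from collections import defaultdict

def filter_markers(hits):
    """Split marker alignments into (kept, flagged) lists.

    hits -- list of (marker, linkage_group, scaffold) tuples for one map
    """
    # Criterion 1: a marker must align to exactly one scaffold.
    scaffolds_per_marker = defaultdict(set)
    for marker, _, scaffold in hits:
        scaffolds_per_marker[marker].add(scaffold)
    unique = [h for h in hits if len(scaffolds_per_marker[h[0]]) == 1]

    # Criterion 2: a scaffold should not be claimed by several linkage groups.
    groups_per_scaffold = defaultdict(set)
    for _, group, scaffold in unique:
        groups_per_scaffold[scaffold].add(group)
    conflicting = {s for s, g in groups_per_scaffold.items() if len(g) > 1}

    kept = [h for h in unique if h[2] not in conflicting]
    flagged = [h for h in unique if h[2] in conflicting]
    return kept, flagged
```

Flagging instead of deleting the second class keeps the manual decision in the curator's hands, since a conflict between linkage groups can indicate either a misplaced marker or a chimeric scaffold.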

Comparison between sweet cherry cv. Tieton genome and cv. Satonishiki genome
D-GENIES (v1.2.0) was used to compare the sweet cherry cv. Tieton genome with the cv. Satonishiki genome (Cabanettes & Klopp, 2018; Shirasawa et al., 2017). Whole-sequence synteny between the two assemblies was compared at both the scaffold level and the pseudochromosome level.
To compare gene content between the two genome assemblies, we used three annotation versions: the sweet cherry cv. Tieton genome annotation, the cv. Satonishiki genome annotation (Shirasawa et al., 2017), and an improved, re-annotated version of the cv. Satonishiki genome released by the NCBI Eukaryotic Genome Annotation Pipeline (NCBI Prunus avium Annotation Release 100, https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Prunus_avium/100/). OrthoFinder was used to compare gene content among the three annotations (Emms & Kelly, 2015).

Sequencing summary
For sweet cherry cv. Tieton, a total of 121.61 GB of raw sequencing data was generated, comprising more than 810 million Chromium-linked paired-end reads. Table 1 shows the sequencing statistics for the linked-read library. Low-quality reads were filtered out, and 750,890,534 clean reads were used for de novo assembly. The average Q20 was 97.52% and the GC content was 40.8%. A cDNA library was constructed and sequenced to improve the precision of the genome annotation. As shown in Table S2, over 78 million 150-nt paired-end reads were generated and assembled.

Determination of genome size and heterozygosity
The genome size of sweet cherry cv. Tieton was estimated to be 341.38 MB based on 37-nt k-mers, which is very close to the genome size of 338 MB estimated by flow cytometry (Arumuganathan & Earle, 1991). The k-mer distribution generated by GenomeScope is shown in Fig. S1. The size of the sweet cherry cv. Satonishiki genome estimated by the k-mer method was 352.9 MB (Shirasawa et al., 2017), larger than that of the cv. Tieton genome. This difference in genome size is probably due to varietal differences, but may also be caused by differences in library construction and sequencing methods. The heterozygosity of the sweet cherry cv. Tieton genome was estimated to be 0.45%, and the repeat content was estimated to be 48.5%, as shown in Fig. S1.
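The basic idea behind a k-mer based genome size estimate can be sketched as follows: the total number of k-mer occurrences divided by the depth of the main (homozygous) peak of the k-mer histogram approximates the genome size. This is a deliberately simplified estimator, not the mixture model that GenomeScope fits; the function name `estimate_genome_size` and the `min_depth` cutoff used to discard low-depth error k-mers are assumptions made for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram -- dict mapping k-mer depth to the number of distinct k-mers
                 observed at that depth (e.g. parsed from a Jellyfish
                 histogram file)
    min_depth -- depths below this are treated as sequencing errors
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers; in a heterozygous genome this can
    # land on the heterozygous peak, which this simple sketch does not handle.
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(d * c for d, c in solid.items())
    return total_kmers / peak_depth
```

With a reported heterozygosity of 0.45%, a separate heterozygous peak is plausible, which is one reason a model-based tool is preferable to this naive calculation.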

Genome assembly and quality-assessment
The Supernova assembler (version 2.0) was used for de novo assembly, and different coverage depths (40×, 50×, 60×, 65×, 68×, 70×, and 75×) of the Chromium-linked reads were tested (Weisenfeld et al., 2017). Table S3 lists these assembly results and shows that the assembly using 70× read coverage had the best quality; it was therefore selected for the following analyses. Gaps were then filled with GapCloser using the raw sequencing data, resulting in a draft genome assembly of sweet cherry cv. Tieton of 280.33 MB with contig N50 and scaffold N50 sizes of 63.65 KB and 2.48 MB, respectively. Our sweet cherry cv. Tieton genome assembly had more than tenfold better contiguity than the cv. Satonishiki genome assembly (Shirasawa et al., 2017): relative to that assembly, the total assembly size increased from 272.36 to 280.33 MB, whereas the scaffold N50 increased from 219 KB to 2.48 MB (Table 2). A total of 150 million reads were sampled, and 99.02% of the sampled reads were aligned to the sweet cherry cv. Tieton genome sequence using BWA (Li & Durbin, 2009), as shown in Table S4. CEGMA (Parra, Bradnam & Korf, 2007) and BUSCO (Simao et al., 2015) were used to evaluate the completeness of the sweet cherry cv. Tieton genome, and the results are summarized in Table S5. Out of 248 core eukaryotic genes, 231 were found to be complete and 13 partial in the CEGMA assessment. BUSCO analysis showed that our assembly captured 1,403 (97.43%) of the 1,440 single-copy orthologous genes of the embryophyte set, of which 1,381 (95.9%) were complete (1,345 single-copy and 36 duplicated), showing that the sweet cherry cv. Tieton genome assembly covers the gene space of the sweet cherry genome well.

Chromosome-scale pseudomolecule construction
A consensus map was constructed from previously reported sweet cherry genetic maps for the chromosome-scale pseudomolecule construction (Calle et al., 2018; Guajardo et al., 2015; Klagges et al., 2013; Peace et al., 2012; Shirasawa et al., 2017; Wang et al., 2015). GMAP (Wu & Watanabe, 2005) and ALLMAPS (Tang et al., 2015) were used to organize scaffolds onto eight chromosome-scale pseudomolecules (Hulse-Kemp et al., 2018). A total of 494 scaffolds, representing more than 214 MB of sequence, were anchored to eight chromosome-scale pseudomolecules of the sweet cherry cv. Tieton genome using 7,838 genetic markers (36.6 markers per MB). Of the 214 MB of anchored sequence, 202.6 MB were oriented; the anchor rate and synteny of the maps are shown in Table S6 and Fig. 1. The resulting pseudomolecules have higher contiguity than those of the sweet cherry cv. Satonishiki genome, which consist of 905 scaffolds spanning 191.7 MB (Shirasawa et al., 2017).
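The percentages and densities reported for the assembly follow directly from the counts given in the text, and the arithmetic can be reproduced in a few lines. The helper name `percent` and the variable names are illustrative only; every number is taken from the figures quoted in this article.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

# CEGMA: 231 complete + 13 partial out of 248 core eukaryotic genes.
cegma_found = percent(231 + 13, 248)

# BUSCO: 1,403 of 1,440 orthologues captured; 1,345 single-copy and
# 36 duplicated orthologues were complete.
busco_found = percent(1403, 1440)
busco_complete = percent(1345 + 36, 1440)

# Assembly length relative to the k-mer based genome size estimate (MB).
assembled_fraction = percent(280.33, 341.38)

# Genetic markers per MB of anchored sequence.
marker_density = round(7838 / 214, 1)
```

These values agree with the reported 98.39% (CEGMA), 97.43% and 95.9% (BUSCO), 82.12% of the estimated genome, and 36.6 markers per MB.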

Annotation of repetitive sequences
The Repbase library and repetitive motifs were searched, and 32.71% (over 91 MB) of the sweet cherry cv. Tieton genome assembly was found to be repetitive. Different repetitive elements were annotated in the sweet cherry cv. Tieton genome, and their distribution is shown in Table 3. Long-terminal-repeat retrotransposons (6.39%) were predominant among the repetitive elements. The annotated repeat sequence of the sweet cherry cv. Tieton genome was 28.4 MB shorter than that of the sweet cherry cv. Satonishiki genome (Shirasawa et al., 2017), which may explain why the k-mer method estimated a smaller genome size for cv. Tieton than for cv. Satonishiki (341.38 vs. 352.9 MB).

cDNA assembly and noncoding RNA (ncRNA) annotation
Trinity was used to assemble the high-quality cDNA reads (Grabherr et al., 2011). A total of 33,401 transcripts with a total length of 42.6 MB were generated. The length of the assembled transcripts ranged from 201 to 15,591 nt, with a mean length of 1,276 nt. These assembled contigs were considered to be unigenes, and the distribution of their lengths is shown in Table S7. The noncoding RNAs annotated include miRNA, rRNA, snoRNA, tRNA, and tRNA pseudogenes. A total of 109,277 ncRNAs were identified, with a total length of 7.35 MB, representing 2.63% of the sweet cherry cv. Tieton genome. As summarized in Table 4, our annotation predicted fewer tRNAs and rRNAs than the annotation of the sweet cherry cv. Satonishiki genome (Shirasawa et al., 2017).

Protein-coding gene prediction and functional annotation
In total, 30,439 genes coding for 30,975 proteins were predicted in the sweet cherry cv. Tieton genome assembly. A summary of the results predicted by the different methods is shown in Table 5. The de novo methods predicted 47,866 gene models, but their average gene length was shorter than that obtained with the other methods. After correction with the transcript evidence, more than 16,000 genes were filtered out.

Gene family analysis compared with other plant species
OrthoFinder (Emms & Kelly, 2015) was used to identify potential orthologous genes between the sweet cherry cv. Tieton genome and 12 other plant genomes. The results of the orthology analysis are shown in Table S8. Gene family clustering identified 23,129 common orthogroups consisting of 375,493 genes (81.1% of the total genes) in these genomes. A total of 8,465 orthogroups were present in all species, and 246 of these consisted of single-copy genes. In the sweet cherry cv. Tieton genome, 46 orthogroups (124 genes) were unique, and 2,062 orphan genes were identified that could not be clustered with any genes in the thirteen genomes. A species tree was constructed using STRIDE (Emms & Kelly, 2017), as part of OrthoFinder. As shown in Fig. 2, sweet cherry (Prunus avium) is more closely related to flowering cherry (Prunus yedoensis) than to peach (Prunus persica) and Chinese plum (Prunus mume). The expansion and contraction of these gene families were evaluated using CAFE (version 4.2) (De Bie et al., 2006). A total of 1,012 gene families expanded and 3,642 gene families contracted in the sweet cherry cv. Tieton genome compared to the other 12 plant genomes (Fig. 2).
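The orthogroup categories reported above (present in all species, single-copy in all species, and unique to one species) can be derived from a table of per-species gene counts for each orthogroup. The sketch below assumes such a table has already been loaded into a nested dictionary; the function name `classify_orthogroups` and the data layout are hypothetical and do not reproduce OrthoFinder's own output handling.

```python
def classify_orthogroups(counts, species):
    """Classify orthogroups by their species distribution.

    counts  -- dict: orthogroup id -> dict of species name -> gene count
    species -- list of all species names in the comparison
    Returns (shared, single_copy, specific), where specific maps each
    species to the orthogroups found only in that species.
    """
    # Orthogroups with at least one gene in every species.
    shared = [og for og, c in counts.items()
              if all(c.get(s, 0) > 0 for s in species)]
    # Shared orthogroups with exactly one gene per species.
    single_copy = [og for og in shared
                   if all(counts[og].get(s, 0) == 1 for s in species)]
    # Orthogroups whose genes all come from a single species.
    specific = {
        s: [og for og, c in counts.items()
            if c.get(s, 0) > 0
            and all(c.get(t, 0) == 0 for t in species if t != s)]
        for s in species
    }
    return shared, single_copy, specific
```

Orphan genes, which are not assigned to any orthogroup, would have to be counted separately from the list of unassigned genes.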

Comparison between sweet cherry cv. Tieton genome and cv. Satonishiki genome
According to Fig. 3A, genomic analysis using D-GENIES showed high scaffold-level synteny between the sweet cherry cv. Tieton genome and the sweet cherry cv. Satonishiki genome. High chromosome-level synteny was also detected between the two sets of pseudomolecules, except at the ends of chromosomes 1, 4, 5, and 6 (Fig. 3B). Based on Fig. 3A, the sweet cherry cv. Tieton genome assembly had better contiguity, whereas the sweet cherry cv. Satonishiki genome was more fragmented. The original annotation of the sweet cherry cv. Satonishiki genome (Shirasawa et al., 2017) and the re-annotated version of the cv. Satonishiki genome released by the NCBI Eukaryotic Genome Annotation Pipeline were used to compare gene content with our annotation of the sweet cherry cv. Tieton genome. OrthoFinder analysis showed that the original annotation of cv. Satonishiki had 48 annotation-specific orthogroups, representing 349 genes, that were not shared with our cv. Tieton genome annotation or the NCBI annotation of the cv. Satonishiki genome (Table 7). The original annotation of the sweet cherry cv. Satonishiki assembly contained 41% more genes than our cv. Tieton genome annotation; however, the re-annotated version of the cv. Satonishiki genome contained a number of genes similar to that of our cv. Tieton genome. The higher gene number in the original annotation of the sweet cherry cv. Satonishiki genome can be attributed to the fragmentation of genes onto multiple individual contigs. The re-annotated version of the sweet cherry cv. Satonishiki genome used RNA-seq to improve the quality of the gene annotation by connecting genes fragmented in the assembly process (Denton et al., 2014). This method was also used in our sweet cherry cv. Tieton genome annotation process.

Table 7 notes: NCBI version is the improved assembly annotation of sweet cherry cv. Satonishiki released by the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Prunus_avium/100/). Original version is the assembly annotation of the sweet cherry cv. Satonishiki genome documented in Shirasawa et al. (2017).

CONCLUSION
We successfully assembled a high-quality reference genome of sweet cherry cv. Tieton using linked-read sequencing technology. The assembly will be a valuable resource for future breeding efforts, gene function characterization and cultivar identification in the sweet cherry, as well as for comparative genomic analysis with other Prunus species.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This study was supported by the Shandong Provincial Key Laboratory for Fruit Biotechnology Breeding, the Special Fund for Innovation Teams of Fruit Trees in Agricultural Technology System of Shandong Province (SDAIT-06-04), and the Agricultural scientific and technological innovation project of Shandong Academy of Agricultural Science (CXGC2018F03). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors: Special Fund for Innovation Teams of Fruit Trees in Agricultural Technology System of Shandong Province: SDAIT-06-04. Agricultural scientific and technological innovation project of Shandong Academy of Agricultural Science: CXGC2018F03.

Author Contributions
Weizhen Liu conceived and designed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft. Dongzi Zhu conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft. Xiang Zhou analyzed the data, authored or reviewed drafts of the paper, and approved the final draft. Po Hong performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft. Hongjun Zhao performed the experiments, authored or reviewed drafts of the paper, and approved the final draft. Yue Tan analyzed the data, authored or reviewed drafts of the paper, and approved the final draft. Xin Chen performed the experiments, authored or reviewed drafts of the paper, and approved the final draft. Xiaojuan Zong performed the experiments, authored or reviewed drafts of the paper, and approved the final draft. Li Xu analyzed the data, authored or reviewed drafts of the paper, and approved the final draft. Lisi Zhang performed the experiments, authored or reviewed drafts of the paper, and approved the final draft. Hairong Wei analyzed the data, authored or reviewed drafts of the paper, and approved the final draft. Qingzhong Liu conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj.9114#supplemental-information.