Sequencing and analyses on chloroplast genomes of Tetrataenium candicans and two allies give new insights on structural variants, DNA barcoding and phylogeny in Apiaceae subfamily Apioideae

Background Tetrataenium candicans is a traditional Chinese folk herbal medicine used in the treatment of asthma and rheumatic arthritis. Alongside several Tordyliinae species with fleshy roots, it is also regarded as a substitute for a Chinese material medicine called ‘Danggui’. However, a lack of sufficient sampling and genomic information has impeded species identification and the protection of wild resources. Methods The complete chloroplast genomes of T. candicans from two populations, Tetrataenium yunnanense and Semenovia transilliensis, were assembled from two pipelines using data generated from next generation sequencing (NGS). Pseudogenes, inverted repeats (IRs) and hyper-variable regions were located by Geneious 11.1.5. Repeat motifs were searched using MISA and REPuter. DNA polymorphism and segment screening were processed by DNAsp5, and PCR product was sequenced with Sanger’s sequencing method. Phylogeny was inferred by MEGA 7.0 and PhyML 3.0. Results The complete chloroplast genomes of T. candicans from two populations, T. yunnanense and S. transilliensis, were 142,261 bp, 141,985 bp, 142,714 bp and 142,145 bp in length, respectively, indicating conservative genome structures and gene categories. We observed duplications of trnH and psbA caused by exceptional contractions and expansions of the IR regions when comparing the four chloroplast genomes with previously published data. Analyses on DNA polymorphism located 29 candidate cp DNA barcodes for the authentication of ‘Danggui’ counterfeits. Meanwhile, 34 hyper-variable markers were also located by the five Tordyliinae chloroplast genomes, and 11 of them were screened for population genetics of T. candicans based on plastome information from two individuals. The screening results indicated that populations of T.candicans may have expanded. Phylogeny inference on Apiaceae species by CDS sequences showed most lineages were well clustered, but the five Tordyliinae species failed to recover as a monophyletic group, and the phylogenetic relationship between tribe Coriandreae, tribe Selineae, subtribe Tordyliinae and Sinodielsia clade remains unclear. Discussion The four chloroplast genomes offer valuable information for further research on species identification, cp genome structure, population demography and phylogeny in Apiaceae subfamily Apioideae.


INTRODUCTION
Subtribe Tordyliinae is an Apiaceae lineage in the tribe Tordylieae whose species are characterized by enlarged radiant outer petals and dorsally compressed fruits, with more than 15 incorporated genera (Fu, 1981;Xiao, 2017). Tetrataenium candicans var. candicans (Wall. ex DC.) Manden. is a perennial Tordyliinae herb endemic to the Qinghai-Tibet Plateau (QTP) and adjacent regions. They typically grow on sunny slopes and forest edges at elevations of about 2,000 to 4,200 m, with dense white pubescence on the back side of leaves and prominent fruit ridges distinguishing them from other allied species (Fu, 1981). Our survey of the Chinese Materia Medica resource inventory in Sichuan, Yunnan, Tibet and Qinghai province showed that T. candicans was a common herbal medicine used to treat asthma and rheumatic arthritis. Moreover, alongside some Tordyliinae species with fleshy roots, it has been also widely used as a substitute for 'Danggui,' a traditional Chinese medicine that was first recorded in ''Shen Nong's Herbal Classic'' in about AD 200 to be used in blood enrichment and the treatment of pulmonary diseases. However, since the misuse of the active ingredients in this species might cause severe side effects (Sondhiaa et al., 2017;Zhang, Wang & Hu, 2001), the Chinese Pharmacopoeia Commission (2015) has restricted 'Danggui's' origin plant to Angelica sinensis (Oliv.) Diels only. Additionally, the long growth cycle and over-exploitation of wild individuals used for medicinal and economic purposes has menaced the existence of T. candicans (Joshi & Dhar, 2003). Less attention was given to discriminate among the counterfeit 'Danggui' plants (especially from the subtribe Tordyliinae), which also delayed research on the drug's safety and discovery.
DNA barcoding is a common technique that aims to standardize DNA segments for species-level discrimination, and has been widely used in identification, taxonomy and biodiversity studies (Hodgetts et al., 2016;Chen et al., 2009;Devloo-Delva et al., 2016). The general procedure in developing DNA barcodes involves building public libraries of DNA segments from known species. The target sequences from an unknown species are then matched up with barcode libraries to search for best-matching species or close relatives (Hajibabaei et al., 2007). With the decline of sequencing cost in the past decades, a vast amount of DNA barcodes has been uploaded and retrieved. For instance, the CO1 gene is applied in the species recognition of animals and fungus (Seifert et al., 2007), while ITS, ITS2, psbA-trnH, matK and rbcL are effective in identifying Apiaceae species (Liu et al., 2014). Nevertheless, due to different substitution rates, hybridization, multiple copies and mutations, DNA barcodes are not applicable to all species (Kress & Erickson, 2007). For example, a universal barcode ycf15 (Gao, Zhao & Ni, 2017) is absent in Coriandrum sativum (Peery, 2015), and species discrimination by ITS sequences are often flanked by hybridization (Liu et al., 2011). For these reasons, it is essential to develop specialized available DNA markers to protect and fully utilize local herbal resources.
The chloroplast (cp) is an important component in the plant cell as it is where optical energy can be restored in the form of carbohydrates through photosynthesis. As a semiautonomous organelle, the chloroplast possesses uni-parental inherited cricoid DNA with a length ranging from 20.98 kb (Sciaphila densiflora) to 1320.6 kb (Haematococcus lacustris), presenting conservative gene locations and categories in flowering plants (Molina et al., 2014). Because of these features, chloroplast genomes have expanded our knowledge of genetic engineering (Daniell et al., 2016), phylogeny (Jansen et al., 2006 and population demography. With the development of next generation sequencing (NGS) and assembly technologies, the past two decades have witnessed more than 100 cp genomes reportedly covering over 40 genera in Apiaceae. These cp genomes are all characterized by a quadripartite structure similar to those of most angiosperms, and their discrepancies are mainly reflected in rare changes in gene order and shifting IR-LSC boundaries (Peery, 2015;Spooner et al., 2017). Although it is a monophyletic group (Logacheva et al., 2010) that is of important economic value, few studies have been devoted to the chloroplast genomes of Tordyliinae species.
In the present study, we reported cp genomes of two T. candicans individuals, Tetrataenium yunnanense (Xiao et al., 2017) and Semenovia transiliensis. By comparing them with published cp genomes, we explored the structural variation among Apiaceae species and located hypervariable cp DNA markers for Tordyliinae species and candidate cpDNA barcodes for the species identification of 'Danggui' counterfeits. Additionally, based on two cp genomes of T. candicans, 11 cpDNA markers were screened to survey population diversity. Finally, phylogeny on 37 Apiaceae and Araliaceae species using CDS sequences of cp genomes was also inferred. This research will offer genomic and genetic information for future phylogenetic and phylogeographic studies on Tordyliinae, and will shed light on the protection and utilization of wild herbal medicine resources. , respectively. All voucher specimens were deposited in the Sichuan University Herbarium (SZ). The total genomic DNA was isolated from dry leaves using an improved CTAB method (Doyle, 1987) and sequenced at Novogene (Novogene BioTech, Inc. Beijing, China) by Illumina Hiseq 2500 platform (Illumina, San Diego, CA). A genome skimming sequencing strategy (Steven, 2015) was performed to obtain deep coverage of organelle genomes regardless of the shallow sequencing of total genomic DNA. The coverage map was generated by Geneious 11.1.5.

Quality control, genome assembly and genome annotation
The quality of raw data was assessed by FastQC (v0.11.7 for windows) (Andrew, 2014). To accommodate the demands for data preparation of assembly software, we merely removed primers and adapters using Cutadapt v1.1.8 (Martin, 2011), and the qualified data was then assembled in different ways. Initially, a pipeline combining bowtie2-build (Langmead & Salzberg, 2012), SAMtools (Li et al., 2009), BEDtools (Quinlan, 2014 and SOAPdenovo 2 (Luo et al., 2012) was used to pick up reads that mapped to the best reference cp sequence. The consensus sequences were generated by Geneious 11.1.5 (Kearse et al., 2012) and gaps were filled by Sanger sequencing. For comparison, a seed-based assembler named NOVOPlasty 2.7.2 (Dierckxsens, Mardulyn & Smits, 2017) was employed which eventually cyclized cp genomes.
New rules for annotating cp genomes of the four assembled cp genomes were also followed: (1) Geneious 11.1.5 was used for batch annotation of consensus sequences referring to annotations of Daucus carota (NC_008325) and Pastinaca pimpinellifolia (NC_027450); (2) Sequences were re-annotated on the DOGMA (Wyman, Jansen & Boore, 2004) website to check for any omissions and the identity was set above 70% in protein-coding genes, 80% in rRNA genes and 90% in tRNA genes; (3) Pseudogenes were not marked, and incomplete homologous segments of a real gene (such as psbA, ycf1) were also regarded as pseudogenes; (4) tRNAscan-SE v2.0 (Lowe & Chan, 2016) was used for verifying tRNA genes. All annotation errors were manually corrected. Finally, the circular map was generated by OGDRAW (Lohse et al., 2013). MISA (Thiel, 2003) was used for identifying simple sequence repeats (SSRs) of T. candicans, T. yunnanense, Heracleum moellendorffii, S. transiliensis and P. pimpinellifolia. The threshold was set as follows: the minimum repetitions of SSRs for mono-nucleotide, di-nucleotides, tri-nucleotides, tetra-nucleotides, penta-nucleotide and hexa-nucleotides should be 10, 5, 4, 3, 3 and 3, respectively. Simple dispersed repeats (SDRs) were identified by Reputer (Kurtz et al., 2001). The minimal size of SDRs was 30 bp, and the similarity among SDRs should be no less than 90%. IRA region and double-counting SDRs were removed.

Segment screening for T. candicans
To investigate the population diversity of T. candicans in order to protect wild resources, we screened 11 cpDNA segments from noncoding regions and designed primers using Geneious 11.1.5 based on hyper-variable regions of the two cp genomes of T. candicans. More than 30 populations were sampled covering most distributions on record, and each population sampled one or two individuals. Total genomic DNA was extracted by an improved CTAB method (Doyle, 1987). PCR amplification was carried out in a 30 µL volume, and included15 µL ddH 2 O, 9 µL mix (Tiangen, Beijing, China), 3 µL DNA solution, 1.5 µL Forward primer and 1.5 µL reverse primer solution (10 µmol/L −1 ). The PCR procedure was as follows: pre-denaturation for 4 min at 94 • C, followed by 35 cycles of denaturation for 45 s at 94 • C, annealing at 50-55 • C for 1 min and extension at 72 • C for 1 min, and finally, extension at 72 • C for 7 min. The PCR products were sequenced (BGI, Beijing, China) by ABI 310 Genetic Analyzer (Applied Biosystems, Waltham, MA, USA). The paired-end sequences were assembled by Geneious 11.1.5. DnaSP5 was available for nucleotide diversity, haplotype diversity, and mismatching analyses.

Phylogeny reconstruction
Phylogenetic analysis was performed using concatenated alignments of 80 protein coding sequences from 37 plastomes:

Genome features of T. candicans, S. transiliensis and T. yunnanense
The sequences of two T. candicans, S. transiliensis and T. yunnanense, were 142,261 bp, 141,948 bp, 142,145 bp and 142,714 bp in length, respectively, presenting typical quadripartite structures containing a large single copy region (LSC) and a small single copy region (SSC) jointed by two inverted repeats (IRA and IRB) (Fig. 1, coverage map in Fig. S1). The overall GC content of the four cp genomes was 37.4% with an exception of 37.3% in T. yunnanense. The GC content of the IR region was much higher than that of the other components, mainly because tRNA and rRNA genes had aggregated at this region. The SSC region had the lowest GC content in comparison with the remaining two components ( Table 1).

Gene loss and pseudogenization
Pseudogenes (marked with ' ') were DNA segments that were homologous to real genes but had lost functionality to some extent (Vanin, 1985). Of the four assembled chloroplast genomes, ψ ycf15 in T. candicans, T. yunnanense and S. transiliensis was formed through several small InDels (KU951523 was used as a reference sequence), while ψycf1 and ψpsbA were generated through LSC-IR boundary shifting.

LSC-IR/IR-SSC boundaries
Taking the IRA-LSC boundary of tobacco as reference, junction type I', defined by Downie (Plunkett & Downie, 2000), was found in cp genomes of T. yunnanense, T. candicans and S. transiliensis. Meanwhile, a rare LSC-IR junction named F' was discovered in chloroplast genomes of P. pimpinellifolia (NC_027450) and H. moellendorffii (MK210561). The two junction types experienced extra expansion to psbA, and the duplicated segment trnH-psbA inserted right behind old contracted LSC-IRB boundaries (Fig. 2).
While boundaries between the IR and LSC region were quite changeable, IRB-SSC and SSC-IRA junctions were highly conserved. All IRB-SSC boundaries of the four chloroplast genomes reported here were placed adjoined to ndhF, and the SSC-IRA boundaries located within the scope of ycf1. The Junction type definitions will contribute to further research on structural variations of cp genomes of Apiaceae species.

SSRs and SDRs
SSRs (Simple sequence repeats) correlated with multiple replications of motifs that contained one to six base pairs. Herein, we reported SSRs in cp genomes of the five Tordyllinae species that were longer than 10 bp (Fig. 3, Supplemental Information 1) 3E) in each cp genome were substantially in agreement and larger repeat units meant fewer replications. These SSRs can be potential markers for species discrimination, phylogeny and population studies for the five Tordyliinae species. SDRs (Short Dispersed Repeats) were more complex repeats longer than 30bp. In this study, 24, 42, 27, 25, and 20 pairs of SDRs in cp genomes of T. candicans (MK333395), T. yunnanense, H. moellendorffii, P. pimpinellifolia and S. transiliensis were surveyed, respectively (Fig. S2, Supplemental Information 3). Forward (direct) SDRs were the most frequent repeats, followed by palindromic SDRs and reverse SDRs. Most SDRs were located at the intergenic regions; trnH-rrn16 specifically was a hotspot area for SDRs in T. yunnanense with 15 pairs of SDRs dispersed in this region. There was no SDR longer than 60 bp in H. moellendorffii, nor P. pimpinellifolia, while only two, two, and three SDRs longer than 60 bp were observed in T. candicans, S. transiliensis and T. yunnanense, respectively. SDRs longer than 100 bp only existed in T. yunnanense.
Additionally, SDRs accounting for the expansion of cp genomes through multiple replications were found in S. transiliensis and T. yunnanense. The 39 bp SDRs that began at the 30419th genome in S. transiliensis repeated three times in the LSC region. In T. yunnanense, SDRs that repeated more than twice occurred in the IR and LSC regions, with   Particularly, the 125 bp SDRs were considered novel insertions as they aligned any species with high similarities. This information on SDRs will be advantageous in exploring the structure evolution of chloroplast genomes.

Segments screening for genetic diversity of T. candicans
All target DNA segments were successfully amplified and sequenced, except for sequencing failures in ndhF-rpl32 and rpl16 intron caused by repeat sequences. These segments from different populations presented low DNA polymorphism below 0.01 (Table S2), but significantly high haplotype diversity from 0.248 (trnS-trnG) to 0.844 (rpl16 intron). Only four haplotypes were found in trnS-trnG and rpl16 intron alignments, while trnQ-rps16 had the most haplotypes with 16. More than half of the haplotypes were private, indicating high haplotype diversity. Regarding mismatch analyses, all nine segments presented unimodal or smooth curves, indicating that the sampled populations (Table S3) had probably experienced recent expansion. However, the results were only confirmed by Tajima's test and Fu and Li's test against rps16 intron, trnQ-rps16, trnL-trnT, rps16-trnK (weak statistical significance), and trnS-trnG (weak statistical significance), while test result from rpl16 intron, psbA-trnH, trnL-F did not show statistical significance of deviation from zero.

Phylogenetic inference
Maximum Likelihood (ML) trees (Fig. 6) by 80 concatenated CDS sequences of 37 species from Apiaceae subfamily Apioideae were constructed, and most lineages were well supported. Bupleureae species experienced the earliest differentiation in the 12 clades, followed by species from tribe Pleurospermeae, Komarovia clade, tribe Oenantheae and tribe Scandiaceae. Seven clades or tribes from Apioid superclade clustered together with 100% bootstrap support value, and Pyramidoptereae was sister to the remaining six clades. Within the subtribe Tordyllinae, lineage including P. pimpinellifolia and H. morllendorffii were closely clustered with 100% bootstrap support value. This lineage, together with species from tribe Selineae and Sinodielsia clade, constituted three parallel branches. Lineage composed of T. candicans, S. transiliensis, T. yunnanense and C. sativum was clustered with species from Sinodielsia clade with 100% bootstrap support value. Phylogeny inference failed to recover the five Tordyliinae species as a monophyletic group.

Assembly methods and gene content
In this study, four chloroplast genomes were assembled using genome skimming data based on the principle that organelle DNA have more copies than nrDNA in plant cells. The cp genomes of T. candicans, S. transiliensis and T. yunnanense were quadripartite with conservative gene organization and an almost halved length of IR regions that determined the selection of assembly methods. From experience, seed-based assemblers (such as NOVOPlasty v 2.7.1) were more competent in assembling chloroplast genomes with contracted IR regions. The four cp genomes contained 123 genes: 80 protein-coding genes, 35 tRNA genes and eight rRNA genes. Apart from truncated psbA, only one pseudogene, ψ ycf15, was observed, suggesting that gene content was highly conserved. 'ATG' was the start codon of most plastid genes, and illegal start codons occurred in ndhD, psbC and rps19 in the four reported cp genomes. The translational initiation sites of ndhD was 'ACG', which might be associated with a post-transcriptional 'C' to 'U' edit to improve the translating efficiency (Hirose et al., 1999). The start codon of psbC was annotated as two distinct codons, namely 'GTG' in tobacco and 'ATG' in Apiaceae species, resulting from annotation preference as this gene was highly conservative in both tobacco and the four cp genomes, and 'ATG' was 33 bp upstream of 'GTG'. 'ATG' was reported as a putative start codon of rps19 in algae, while in angiosperms, the start codon turned into 'GTG', indicating an ancient mutation that may take place in a common ancestor.

Molecular markers and segment screening for population genetics
Based on the number of haplotypes and insertions, the 29 candidate cpDNA barcodes were able to distinguish 'Danggui' from the counterfeits, which gave new insight on discovering new candidate DNA barcodes using cp genomes. The 34 hypervariable DNA segments and SSRs in the five Tordyliinae species were also located. Our findings were consistent with previous conclusions that most SSRs were abundant in 'A/T' format, but the number of SSRs in the five cp genomes was much less than what had been reported in other species (Zhao et al., 2018;Liu et al., 2018;Zong et al., 2019). Additionally, we made attempts on segments screening for the genetic diversity of T. candicans through two cp genomes. All nine DNA segments exhibited high haplotype diversity with abundant private haplotypes, which might be explained by habitat fragmentation (Fahrig, 2003) of T. candicans in the Qinghai-Tibet plateau and adjacent regions, while low DNA polymorphism might be correlated to low substitution rate of cpDNA segment or rapid adaptive evolution (Zhao et al., 2013;Shi & Zhang, 2015). What's more, recent expansion shown by mismatch analyses might give insight into why T. candicans are more likely to survive and be prosperous in its habitats. Those segments of high haplotype diversity could be genetic resources for population genetics and complement nrDNA barcodes in identifying producing areas of T. candicans.

Dynamic IRA-LSC boundaries and repeat sequences
Most angiosperms were highly conserved in the cp genome structure. As described in Nicotiana tabacum, a typical LSC-IRB (J LB ) boundary occurred near rps19. However, due to contraction or expansion at the IR regions, LSC-IRB or IRA-LSC boundaries could be really dynamic in Apiaceae. So far, 12 kinds of IRA-LSC boundaries were defined in the family Apiaceae (Plunkett & Downie, 2000;Peery, 2015), among which type I' and F' were two rare junction types among land plants which existed only in cp genomes of tribe Coriandreae and Tordyliinae species in family Apiaceae.
There are several reported explanations for changes in the IR-LSC boundaries, including homologous recombination (Guo et al., 2014), double-strand break repair and the existence of SDRs (Odom et al., 2008). A previous study on cp genomes of C. sativum supported double-strand break repair (DBSR) by partial duplication of trnV-GAC and SDRs located within IR regions that caused change in the IR boundary (Peery, 2015). However, the duplication of trnV-GAC was not observed in any Tordyliinae species. More interestingly, we observed some direct SDRs located between trnH-GUG and trnL-CAA, and between trnH-GUG and rrn16 in cp genomes of the five Tordyliinae species. These results indicated that changes in the LSC-IR boundary were far more complicated to be explained by a single mechanism. Additionally, some of the aforementioned SDRs (i.e., 125 bp SDRs in T. yunnanense) are new insertions that are highly similar to nrDNA, mtDNA or mRNA segments. The novel insertions existed in the form of direct SDRs that might be preliminary evidence for intra-molecular recombination and foreign DNA transformation mediated by short direct-repeat sequences (Ogihara, Terachi & Sasakuma, 1988;Cai et al., 2008).

Phylogeny inference
In the past decades, phylogeny on the whole range of Apiaceae or subfamily Apioideae has been widely explored (Downie, Downie & Watson, 2000;Downie et al., 2004;Downie et al., 2010;Zhou et al., 2008;Zhou et al., 2009), but few attempts have been made on utilizing plastid genomes for phylogeny inference. In such cases, protein coding sequences from 37 plastomes from 12 tribes in Apiaceae were employed for phylogeny inference as they were maternal inheritance, free of hybridization and had more informative loci (Daniell et al., 2016).
However, the five species from subtribe Tordylinae were not recovered as a monophyletic group on ML tree as studies using ITS and ETS sequences suggested (Downie et al., 2010;Zhou et al., 2008;Xiao, 2017;Logacheva et al., 2010). The possible reasons for phylogenetic incongruency were sampling errors, systematic errors and biological factors (Zou & Ge, 2008). Sampling errors (stochastic errors) and systematic errors were excluded first as causes for topological discordances as abundant informative loci existed in chloroplast genomes and no conspicuous long branch attraction (Bergsten, 2005) was observed (Philippe et al., 2005;Zhang, Zeng & Li, 2012). Biological factors referred to many aspects, such as horizontal gene transfer, ancient hybridization, incomplete lineage sorting, homoplasy and concert evolution (Koch, Dobes & Thomas, 2003). Even though very few intergeneric hybrids had been reported on Apiaceae species (Desjardins et al., 2015;Yu et al., 2011), we suggested ancient hybridization cannot be excluded as most Tordyliinae and Selineae species possessed the same base number of chromosomes (n = 11) (He, Pu & Wang, 1994). Moreover, as shown in Fig. 6, short branches suggested that the disputed phylogenetic relationship may also be explained by incomplete lineage sorting in Apiaceae superclade Apioid that may had experienced rapid evolutionary radiation and species formation (Calviño, Martínez & Downie, 2008;Tamura et al., 2012) related to climate fluctuations or reproductive isolation during Miocene Wu et al., 2014;Banasiak et al., 2013). Nevertheless, further analyses with more nuclear markers and samples from Tordylieae should be performed in order to clarify this issue. Our research also indicated that chloroplast genomes were effective in phylogeny inference, but may not be competent in dealing with conflicting phylogeny among rapidly radiated species.

CONCLUSIONS
In this study, four cp genomes of two T. candicans individuals, T. yunnanense and S. transiliensis, were first reported. Analyses on genome structure revealed two kinds of rare simultaneous contractions and expansions of LSC-IR boundaries in assembled Tordyliinae cp genomes. The candidate cpDNA barcodes for the authentication of 'Danggui' and 34 hypervariable DNA segments in the five Tordyliinae cp genomes were also identified. Segment screening for population genetics of T. candicans suggested that populations had probably experienced recent expansion. Phylogeny inferences based on protein coding sequences from 37 plastomes of Apiaceae and Araliaceae species suggested that subtribe Tordyliinae was closely related with subtribe Selineae, tribe Coriandreae and Sinodielsias clade. However, the ML tree failed to recover the five Tordyliinae species as a monophyletic group. On that basis, forthcoming research might focus on the structure variation of plastomes of Apiaceae species, cpDNA barcodes for 'Danggui' and phylogeny reconstruction at the genomic level by exploring efficient methods and increasing samples.