Dairy farming has been a vital part of agriculture for several thousands of years, and have a high impact in social, health and economic conditions of several Asian countries (http://www.fao.org). Notably, India is the largest producer of milk in the world, and buffalo’s milk account for about half of India’s milk production (http://www.fao.org). Milk or milk products form the primary source of fat and protein in an average Indian diet, and buffalo dairy farming is a life sustaining business for millions of small-scale dairy farmers in the country. Economically important Indian buffalo breeds include: Mehsani, Jafarabadi, Nili-Ravi, Surti, Murrah, Pandharpuri, Banni etc. Despite their multi-faceted significance in countries like India, a few effort has been made so far to improve buffalo breeding programs (Singh et al., 2009). It is well understood that milk yield in dairy animals have a major genetic component (Cole et al., 2009). In the recent years, genomic selection in cattle using molecular markers such as SNPs has been proved to be highly efficient (Berglund, 2008). A multi-breed analysis will facilitate precise dissection of milk-yield makers and their molecular framework (Goddard & Hayes, 2009; Melka & Schenkel, 2012; Olson, VanRaden & Tooker, 2012). Recently, taking advantage of next-generation sequencing with targeted DNA capture technologies are proved to be an efficient and cost-effective approach for high-throughput SNP discovery (Hirano et al., 2013).
Here, we designed a whole exome capture custom array that target exons and untranslated regions (UTRs) of the cattle genome to perform whole-exome/targeted sequencing and conducted exome sequencing to identify variants in three recognized milch breeds of river buffalo native to Gujarat state of India viz. Banni, Mehsani, and Jafrabadi breeds. This study design allows us to identify the genetic variants specific for buffalo breeds.
Material and Methods
Sample collection and genomic DNA extraction
We recruited 12 water buffaloes, which were biological replicates from three breeds namely Banni, Mehsani and Jafrabadi, belonging to either high or low-milk yield category from the Gujarat state of India (Table 1). The gDNA was isolated from blood samples using a Qiagen DNeasy Blood and Tissue kit (Qiagen Corp., CA, USA) and the resultant DNA was quantified using Qubit® dsDNA BR Assay (Invitrogen Corp., CA, USA) and integrity was confirmed by agarose gels. Blood drawings were conducted in accordance with regulations and prior approval by the institutional animal ethics committee of Anand Agricultural University, Gujarat, India.
|Group||Sample ID||Breed||Total milk yield (liters)|
|Low milk yield||BLP1||Banni||1,680|
|High milk yield||BHP1||Banni||6,440|
Probe design, target enrichment and exome sequencing
We obtained the intended targets (all coding exons, 3′UTR and 5′UTR exons) of cattle (Btau_4.6.1/bos Tau7) from RefGene tables of UCSC genome browser. Further, the custom probe design was performed by NimbleGen (Roche, Germany) which are compatible with Roche GS-FLX Titanium sequencer.
Rapid library for each sample was prepared from ∼1 μg of gDNA separately and multiplexed according to manufacturer’s protocol (Roche, Germany) using high quality DNA. Final libraries were used to set up hybridization reaction at isothermal temperature of 47 °C for 68–72 h in thermal cycler, with custom designed probes as per manufacturer’s protocol (NimbleGen). Captured DNA libraries were quantified spectrophotometrically, and evaluated electrophoretically with high sensitivity DNA assay on Agilent Bioanalyzer 2100 (Agilent). Finally, the libraries were sequenced on Roche 454 GS-FLX Titanium instrument according to manufacturer’s protocol (Roche, Germany). The raw sequence data will be deposited in a public repository.
Raw reads filtering, alignment and variant detection
Raw reads were filtered based on phred quality score ≥ 20 and length ≥ 50 bp using QTrim tool (Schmieder & Edwards, 2011). The resultant sequence reads were mapped against Bos taurus genome build 4.6.1 using ‘bwa-mem’ module of BWA v 0.7.5a (Li & Durbin, 2010). Potential PCR duplicates were removed using ‘MarkDuplicate’ module of Picard Tools. SNPs passing the criteria depth ≥ 5 and quality score ≥ 30 were identified using SAMtools, FreeBayes and VarScan tools, through pooled variant calling strategy (Li et al., 2009; Garrison & Marth, 2012; Koboldt et al., 2012). To deduce the shared (MultiCom) and specific SNPs from each tool, we used VCFtools package v0.1.11 and basic Linux commands (Danecek et al., 2011). The SNPs belonging to high-yield and low-yield were compared using VCFtools package v0.1.11 (Danecek et al., 2011). The specific SNPs (high or low yield) refers to the SNPs that were present in any of the sample in the given group which are not present or did not pass the SNP calling threshold in the other group.
Annotation and QTL dataset
The resulting sequence ontology (SO) and candidate gene annotation of SNPs specific for high and low yield groups was performed using SnpEff v 3.4 (Cingolani et al., 2012). The candidate gene enrichment analysis was performed in the GeneCodis tool and the enriched pathways and gene ontology (GO) terms were identified using Hypergeometric test followed by Benjamini-Hochberg’s correction (p-value < 0.01). Cattle QTL candidate genes belonging to milk-yield and milk-fat categories were obtained from the AnimalQTLdb online resource (Hu et al., 2013).
Mammary gland transcriptome analysis
From NCBI-GEO database, we obtained the whole gene expression data of 8 samples derived from mammary gland of Holstein Friesen cattle belonging to high or low milk yield (Accession: GSE33680). The gene expression was measured using Agilent-023647 v2 microarray, comprising of 45,220 oligo-nucleotide probes representing cattle genes. In the dataset, there were four samples in each for high-milk yield and low-milk yield groups. GEO2R implementation of LIMMA ebayes test was used to identify differentially regulated genes between high and low milk yield groups (p-value < 0.01).
Samples and overview of sequence data
In order to identify SNPs specific for high and low milk-yield in Indian buffalo breeds, we designed two balanced sample groups of healthy Indian buffaloes with prominent difference in milk yields. Animals with fertility issues, inconsistent diets or other known disorders were excluded from the study. The custom designed capture probes based on UCSC RefTable data of Bos taurus genome version 4.6.1/bosTau7 covered 125,679 exons, 14,084 3′UTRs and 16,574 5′UTRs. The sequencing experiment of 12 samples generated about 2.49 gigabases encompassed in ∼6.5 million sequence reads with average length of 380 bases (Table 2). We obtained ∼68% average on-target capture efficiency across samples and approximately 98% of filtered reads were mapped to Bos taurus 4.6.1 reference sequence (NCBI), with the average per sample depth ≥ 5× (Table S1).
|Sample ID||Raw dataset||Filtered dataset|
|Total reads||Base count (Mb)||Mean read length||No. of reads||Base count (Mb)||Mean read length||% passed||Target covered (%)|
|6,561,664 (total)||2,498 (total)||381 (average)||6,076,154 (total)||2,102 (total)||346 (average)||92.6 (average)||67.8 (average)|
Identification of SNPs specific for high and low milk-yield using MultiCom approach
We grouped the samples based on milk-yield (high and low milk-yield groups), and performed the SNP analysis separately for each group, using cattle as reference. Initially, we used three variant calling tools namely: SAMtools, FreeBayes and VarScan with the consistent thresholds of base depth ≥ 5 and quality score ≥ 30 for all the tools. In the high milk-yield group, SAMtools detected 1.097 million SNPs. FreeBayes and VarScan reported 1.101 million and 1.090 million SNPs respectively (Fig. 1A–1C). Even though high concordances were observed between the outcomes from the three tools, several thousands of SNPs (6.5–9.3% of output from each tool) were tool-specific or non-reproducible SNP calls (Fig. 1A). On the other hand, in the low milk-yield group, SAMtools identified 1.389 million SNPs, followed by 1.103 and 1.098 million SNPs respectively by VarScan and FreeBayes. In the low milk-yield group, the tool-specific SNP calls for SAMtools was the same as that of high-production dataset (6.5%), but a considerable reduction of tool–specific calls was observed for VarScan and FreeBayes (4.6 and 3.8%, respectively) outputs, even though the total outcome by these two tools were comparable with the high milk-yield group (Fig. 1A). We hypothesized that the non-reproducible SNPs were potential false positive calls. To test this, we compared the transition/transversion (ts/tv) ratios of SNPs which were tool-specific and shared, separately for the two groups. A highly consistent ts/tv range was observed for the shared SNPs, ranging from 2.61 to 2.65. On contrary, the tool-specific or non-reproducible SNPs showed a dynamic range of lower ts/tv ratios, ranging from 1.60 to 2.38 (Fig. 1D). Strikingly, the difference in the ts/tv ratios were highly significant (p-value < 0.005) (Fig. 1D). Next, using SO annotation by snpEff tool, we compared the regions in the reference genome in which the SNPs are located. Notably, it was observed that the inter-genic region SNPs were almost doubled in case of tool-specific SNP calls, consistently in three tools and the two groups (Fig. 1E). Therefore, we considered only those SNPs which were reproduced by at least two tools (hereon called MultiCom approach). There were a total of 1.203 million SNPs in high milk-yield group and 1.315 million SNPs for low milk-yield group (Fig. 1A–1C). Interestingly, in almost all cases, SNP selection through MultiCom approach facilitated an increased discovery rate (up to 20%) compared to the outcome from each tool. Finally, we compared MultiCom outcomes in high milk-yield and low milk-yield groups, and have found that 255,741 SNPs (21.3%) were specific for high yield group, and 367,674 SNPs (28.0%) were specific for low yield group. On the other hand, the 947,646 SNPs were common to the two groups which may be generic SNPs in buffalo, as the reference genome was cattle.
Candidate gene prediction and comparison with milk-yield and milk-fat QTL dataset
Using snpEff prediction candidate genes were determined for high and low milk-yield specific SNPs (Table S2). The snpEff annotation mapped the 255,741 high yield specific SNPs in 7,212 genes, and the 367,674 low yield specific SNPs in 8,284 coding genes. Even though the SNPs were specific for the two groups, we have found that the 5,037 genes were common to high and low milk-yield groups. On the other hand, there were 2,175 genes specific for high yield and 3,247 for low milk-yield. In order to determine whether these genes were present among the reported milk-yield and milk-fat related genes in QTL region, we obtained the list of candidate genes in milk-yield and milk-fat categories of cattle resources of the animal genome data repository. Remarkably, of 74 milk-yield candidates, 51 genes were present in our results, among which 26 were common to high and low milk-yield group and 17 genes specific for low yield specific genes, and 8 were high specific genes (Fig. 2A). On the other hand, of 91 milk-fat candidates, 30 genes were common to high and low-milk yield group, 24 and 15 candidate genes were specific for low and high milk-yield groups respectively (Fig. 2B). On the whole, about 70% of the QTL candidate genes in respective categories belonging to cattle genome annotation were present among the genes the genes identified by our study in buffalo breeds.
Enrichment analysis of yield specific candidate genes
Next, we performed the GO enrichment analysis of the candidate genes, specific for high and low milk-yield groups with the threshold of FDR corrected p-value < 0.01. In the GO molecular functions category, eight GO terms were enriched in high milk-yield group. On the other hand, 21 terms were enriched in low-milk yield group, among which four terms were common to GO terms belonging to high milk-yield group (Fig. 3). The common terms were metal ion binding, nucleotide binding, zinc ion binding and cytokine activity. Interestingly, several GO terms in the low milk-yield category which were related to transcriptional regulation, such as sequence-specific DNA binding transcription factor activity (GO:0003700), sequence-specific DNA binding (GO:0043565) and transcription regulatory region DNA binding (GO:0044212) (Fig. 3A). Collectively, 140 genes were found to be contributing to the statistical enrichment of these three GO terms, which were specific for low-milk yield group (Fig. 3C).
In order to obtain the clues about the functional consequences of the candidate genes specific for high and low-yield specific groups, we performed the KEGG pathway enrichment analysis using Genecodis tool. There were 15 pathways enriched for the high yield specific group and 32 pathways for the low-yield specific group (Table S3). However, we have noticed that several of the pathways were related to human diseases. Further, after excluding human disease specific pathways, there were 8 pathways present in high-yield group and 19 pathways for the low milk yield group (Table 3). The most significant pathways in the low milk-yield group were Oxidative phosphorylation, Toll-like receptor signalling, cytokine-cytokine receptor interaction etc. On the other hand, the most significant among the high-yield group were Jak-STAT signalling pathway, Wnt signalling pathway, ErbB signalling pathway etc. Surprisingly, Toll-like receptor signalling and MAPK signalling pathway were enriched in both high and low milk-yield groups by distinct set of genes falling in the same pathway (Figs. S1–S3).
|KEGG pathways||Group||Genes||FDR p-value|
|Toll-like receptor signalling pathway*||L||20||3.39E-05|
|Cytokine-cytokine receptor interaction||L||26||1.08E-04|
|Jak-STAT signalling pathway||H||18||1.96E-04|
|Wnt signaling pathway||H||17||1.91E-03|
|RIG-I-like receptor signaling pathway||L||12||2.99E-03|
|Adipocytokine signaling pathway||L||12||3.40E-03|
|Protein digestion and absorption||L||10||3.91E-03|
|ErbB signaling pathway||H||10||4.14E-03|
|Natural killer cell mediated cytotoxicity||H||13||4.36E-03|
|Leukocyte transendothelial migration||H||14||4.43E-03|
|MAPK signaling pathway*||L||27||5.01E-03|
|Vasopressin-regulated water reabsorption||L||9||6.02E-03|
|VEGF signaling pathway||H||10||6.79E-03|
|Collecting duct acid secretion||L||7||7.38E-03|
|Cell adhesion molecules (CAMs)||L||15||8.50E-03|
|Protein processing in endoplasmic reticulum||L||20||9.14E-03|
L, Low milk-yield; H, High milk-yield.
Comparison of buffalo milk-yield specific candidate genes with milk-yield DEG (Differentially Expressed Genes) in cattle
Next, we hypothesized that the SNPs in the candidate gene may exert changes in the gene expression that could have an effect on milk yield. In order to verify this, we obtained the mammary cell transcriptome dataset belonging to high and low milk-yielding cattle (see Materials and Methods). LIMMA analysis between the high milk-yield and low milk-yield group detected 1,056 annotated differentially expressed genes, in which 579 gene were up-regulated and 477 genes were down-regulated among the low milk-yield samples (Fig. 4; Table S4). Quite interestingly, it was found that 582 out of 1,056 DEG were buffalo SNP candidate genes in low yield specific, high yield specific or common to both groups (Fig. 4). Among these 582 overlapping DEG, 196 genes (118 up-regulated and 78 down-regulated) belonged to low yield specific group, 139 genes (68 up-regulated and 71 down regulated) were high yield specific group. On the other hand, 247 DEG (147 up-regulated and 100 down-regulated in low milk yield) were common to high and low milk-yield groups. However, we didn’t observe any trend in the direction of fold change with respect to milk-yield group. Next, in order to assess whether the overlap between the DEG and the SNP candidates are not due to random chance, we performed Hypergeometric tests on each category viz. low yield specific, high yield specific and common. Interestingly, in all categories p-value < 0.0001. Thus, considering that SNP candidate genes accounted more than half of the DEG in a cross-species comparison, we propose that buffalo SNP candidates potentially influence downstream gene expression changes to a large extend.
In India, milk or milk products form one of the primary source of nutrition in an average diet, and water buffalo’s milk contribute to about 55% country’s total milk production (Michelizzi et al., 2010). In this study, using Roche 454 GS-FLX Titanium sequencer we performed exome sequencing of replicated samples of three Indian buffalo breeds belonging to high or low milk-yield groups. The contrast between high and low-milk yield across breeds were about 3.7 fold, which was not influenced by disease, fertility or diet factors. The sequencing experiment generated about 2.49 Gb of data, and the reads were mapped to cattle genome, as the complete assembly of water buffalo whole genome sequence is not available yet. Initially, this resulted in high discovery rate of SNPs compared to the previously reports (Jansen et al., 2013; Georges, 2014). However, the number of SNPs have markedly reduced when we filtered out those which are specific for low and high milk-yielding group. It has been recently reported that there are indigenous bias for each SNP calling algorithm which causes disagreement of results across software tools (Pabinger et al., 2014; Yi et al., 2014). However, only few efforts have been taken to assess this aspect systematically (Yi et al., 2014). In this study, we have used three algorithms for SNP calling and have selected those SNP which were detected by at least two tools. Initially, in order to assess potential true positive calls, we have applied the widely accepted ts/tv metrics (DePristo et al., 2011). According to the 1000 genomes project, the expected ts/tv ratio in whole-genome sequencing is about 2.10, and for exome target regions ranges from 2.6 to 3.5 (http://www.1000genomes.org/). We have not only observed a lower ts/tv ratio in the non-reproducible SNP calls, but also a varying range of ts/tv values. In addition, the SNP calls from intergenic region were almost doubled in case of non-reproducible SNPs. Overall, we propose the MultiCom approach as a simple but effective strategy to identify reliable SNPs, without decreasing overall discovery rate.
After determining the SNPs specific for high and low milk-yield groups, it is important to understand its possible consequences by extending the analysis at gene level, which is a very useful strategy to obtain the possible molecular framework hosted by the SNPs (Lehne, Lewis & Schlitt, 2011; Menon & Farina, 2011; Shastry, 2009). Towards this, we focused on the genes in which SNPs fall in their exonic region. Initially, we have seen that majority of the milk-yield and milk-fat QTL candidate genes are found among the discovered group. The SNP candidate genes specifically found in the low-yield specific group includes ABCG2, CSN2, CSN3, IL12RB2, SRC, STAT5B etc. The classic milk proteins such as CSN2, CSN3 (members of casein family), STAT5B etc. are prominent indicators of milk yield parameters (Lee et al., 2014). Interestingly, in the low-yield specific group, there were several genes related to immunological properties of milk. For example, ABCG2 gene is responsible for the active secretion of clinically and toxicologically relevant substrates into the milk and soluble CD14 in colostrum and milk acts as a sentinel molecule and an immune modulator, which provide innate responses against bacterial infections in the calf (Olsen et al., 2007; Ibeagha-Awemu et al., 2008). IL12RB2 gene is another example in this category. Thus the low-yield specific group not only contained genes directly implicated in milk yield, but also comprised of genes that contribute to major immunological properties of milk. On the other hand, high-yield specific candidate genes included ARFGEF1, GH1, DGAT1, IRF9, PRL etc., which are hallmark genes for milk yield in cattle. Interestingly, Shi et al. (2012) have recently shown that the polymorphism of several of these genes affects the milk composition in different water buffalo breeds.
GO enrichment analysis showed several transcriptional regulation related terms in the low-milk yield group, and we highlighted the 140 genes responsible for this observation. Recently, a gene network analysis of human mammary transcriptome data highlighted crucial nodes such as NR1H3 and PPARA genes (Mohammad & Haymond, 2013). Notably, these genes are present in our results, and it was reported that PPARA and its target genes has been shown to be involved in fatty acid uptake/oxidation, and promotes energy balance during early lactation in cattle (Schlegel et al., 2012).
The pathway enrichment analysis pointed towards several known pathways implicated in mammary biology such as JAK-STAT signalling, Wnt signalling, epidermal growth factor related pathways (ErbB signalling, VEGF signalling) etc. (Watson & Burdon, 1996; Rossiter et al., 2007; Hardy et al., 2010; Turashvili et al., 2006). In the low milk-yield specific group, Oxidative phosphorylation pathway was the most enriched pathway, which may contribute to the energy balance during different functional states of mammary gland (Nelson, Butow & Ciaccio, 1962). Interestingly, there were two pathways common to high and low milk-yield groups. Considering that the input genes for the pathway analysis were different, Toll-like receptor signalling and MAP kinase signalling pathways were enriched by diverse set of genes belonging to the two groups. Very interestingly, O’Neill, Golenbock & Bowie, (2013) have extensively reviewed Toll-like receptor signalling pathway, and suggested the possible dual outcomes of this pathway. Two important families of transcription factors that are activated downstream of TLR signalling are nuclear factor-κB (NF-κB) and the interferon-regulatory factors (IRF3/7), and the major consequence of NFKB mediated signalling is the production of interleukins, while the IRFs induces type I interferon (O’Neill, Golenbock & Bowie, 2013). Notably, RELA (a member of NFKB complex) was present among low-specific candidates and IRF3 was present in the high milk-yield specific group. Of note, TLR-4 mediated excess interleukin production has shown to cause lactation insufficiency in mice (Glynn, Hutchinson & Ingman, 2014). On the other hand, increased interferon level was observed in breastfeeding infants compared to non-breast feeding group, which confers enhanced protection against viral infection for the infant (Melendi et al., 2010). However, more studies need to be performed in this direction, as evidenced by limited number of studies.
Comparison of gene expression and SNP candidate genes offers a great extent of biological insights (Quan et al., 2014). Recently, Cui et al. (2014) observed differential gene expression in various susceptibility genes related to milk protein and fat level in Holstein friesen cattle. In our case, the buffalo SNP data were identified based on cattle as the reference genome, and the gene expression data were from cattle. However, we excluded the putative generic SNPs in buffalo breeds (found common to both high and low-yield groups) at the initial stage of analysis which makes the comparison relevant, even though the genome and transcriptome data are from buffalo and cattle species respectively. In the comparison, we found a strong presence of high or low-yield specific SNP containing genes among the DEG. This observation not only underlines the strategy of comparison but also provides positive indication for the reliability of the identified SNPs which need to be investigated in additional datasets.
In this study, we have highlighted issues related to SNPs detection using a single tool and propose selection of reproducible SNPs using Multicom approach, which facilitates detection of reproducible SNPs with a higher discovery rate. Focussed analysis on the high and low milk-yield SNPs in buffalo breeds, uncovered SNPs specific for high or low milk yield. A substantial number of reported genes in QTL region of cattle genome is present in results. Further, the enrichment analysis uncovered several transcriptional regulator candidates in low-milk yield group. Surprisingly, pathway analysis shared pathways like TLR receptor signalling, along with several known and specific pathways in high and low-milk yield groups. Finally, SNP candidate genes in buffalo breeds account for were splendidly found in differentially expressed genes from the cattle species. On the whole, these results not only highlighted SNPs related to high and low-yield buffalo, but also sheds light to methodological improvement, functional insights and cross species genome-phenome comparison. Of course, outcomes of this study need to be validated in independent samples which in turn contribute for the better genomic selection in buffalo breeds.