Animals and sampling
Blood samples were collected for genotyping from each of the four goat breeds, Nubian (NU), Desert (D), Taggar (T), and Nilotic (NI), 24 representative female animals were sampled from different regions of Sudan (Additional file 1: Table S1).
Nubian goats were collected at six locations in four different states along the river Nile, the Dongola area (Northern State), Abu Hamad, Aldamir, Shendi (River Nile State), Khartoum (Khartoum State) and Aljazirah (Aljazirah State). In total Nubian samples were obtained from 10 villages, three districts, one university farm, and three research stations.
Desert goat samples were collected from Bara and Abu Zabad area in the North Kordofan state (eight villages, one university farm).
Taggar goat samples were obtained from the Nuba Mountains and Dalang area in the South Kordofan state (five villages, one research station).
Nilotic goat samples were collected in the Kosti and Rabak areas in the White Nile state (eight villages).
Livestock experts, herdsmen and owners of the animals were consulted to ensure representative sampling of unrelated animals.
The characterization of the different Sudanese goat breeds are the following:
Nubian goat
Nubian is a highly productive dairy goat compared to other Sudanese goat breeds. Nubian goats are widely distributed in arid and extreme arid areas [4]. Besides Sudan, Nubian goats are found widespread in North Eastern Africa and the Mediterranean coastal belt. They have likely their origin in Sudan [5]. These goats are commonly black, but pure brown and multi-colorations of black and white also exist [6, 7].
Desert goat
Sudanese Desert goats are dual purpose goats. They are mainly found in semi-arid areas of the West of Sudan. They also move to extreme arid areas during nomadic migration. The efficient transformation of low quality feed into body mass makes the Desert goat valuable for the production of meat [8]. Desert goats are phenotypically similar to West African long-legged goats, and are possibly related to the Nubian goats [4]. Coat color is variable and mixed colors exist [9, 10].
Taggar goat
Taggar is a meat-type goat that has adapted to survive under harsh environmental conditions [11,12,13]. It is kept in many parts of Sudan with the highest density in the Nuba Mountain area close to the border to South Sudan. Taggar is a dwarf goat with disproportionally short legs, plump body and short head. The short stature is thought to result from achondroplastic dwarfism with lack of ossification at the cartilage joints [13, 14]. It is assumed that natural selection for the recessive dwarfism gene was favorable in response to the humid and hot climate conditions [4]. The most common coat colors for Taggar goats are dark or grey brown [7, 12].
Nilotic goat
Nilotic goat is another meat-type goat that produces high muscle mass under good feeding conditions [15]. Nilotic goats have a high reproductive potential since they reach sexual maturity at an early age. Nilotic goats live in the border region between Sudan and South Sudan. They distinguish from other breeds through their resistance against Trypanosomiasis [16]. These goats are small in size; the body is compact, but has normal proportions [6, 7, 17, 18]. Though, different from Taggar goats, achondroplasia does not occur in this breed. Almost all colors occur, but the predominant color is a mixture of black and white.
Genotyping
Blood samples were collected from the jugular vein using vacutainer tubes containing EDTA as anticoagulant. Blood was stored at −20 °C until DNA was extracted using the Puregen core kit A (Qiagen Sciences, Maryland, USA). All animals were genotyped with the Goat SNP52 BeadChip (Illumina, San Diego, CA), developed by the International Goat Genome Consortium (IGGC) [19]. The raw signal intensities of the 53,347 SNPs on the chip were imaged using the IlluminaScan Reader and converted into genotype calls with GenomeStudio software suite (version 2011.1) by using the SNP genomic locations and cluster file made available by IGGC (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/goat_9925/viewBatch/snpBatch_IGGC_1057128.gz). Locations of the SNP probes were remapped using BLASTN towards the CHIR 1.0, CHIR 2.0 and LWT01 genome version of Capra hircus. This was done to provide future researchers with an easy way to compare our results with their own work, since other researchers can find the corresponding locations on the genome version of their choice. Locations reported in this paper are based on the CHIR1.0 genome build. Locations of protein coding genes on the CHIR1.0 genome were obtained from The National Center for Biotechnology Information (NCBI) [http://www.ncbi.nlm.nih.gov/]; the version used is available as supplemental file (Additional file 2: locations of Protein coding genes).
SNP quality control
The R language for statistical computing v3.2.3 was used for quality control of the called SNP data [20]. From the entire set of 53,347 SNPs, only those with an Illumina GenTrain score ≥ 0.6 were kept (1441 SNPs failed). Furthermore, all SNPs with a minor allele frequency (MAF) lower than 5% (2273 SNPs) as well as SNPs which showed more than 5% missing data (150 SNPs) were excluded from further analysis. The deviation from Hardy-Weinberg equilibrium (HWE) was calculated and we found only 42 SNPs not in HWE. Since HWE deviations only concerns a very limited amount of SNPs <0.1% of the total, we decided to not remove these SNPs from further analysis. SNP probe sequences were mapped against the CHIR1.0 genome assembly using the Basic Local Alignment Search Tool (BLAST) [21, 22]. Probe sequences that were not mapping in the CHIR1.0 genome (41 SNPs) or probes which showed multiple hits (937 SNPs) against the reference genome were dropped from further analysis. 48,505 SNPs passed the quality control. Based on quality control of the data, one sample (Nubian goat) was excluded from further analysis.
Statistical analyses
Genetic diversity assessment
In order to assess the genetic diversity, the minor allele frequency, distribution and proportion of polymorphic SNPs were computed using the R language for statistical computing v3.2.3 [20]. To measure the genetic variation within a population, mean observed heterozygosity (Ho) and expected heterozygosity (HE) for each breed of goats and the total population were analyzed using the diveRsity R package v.1.9.89 [23]. The average population inbreeding coefficient (FIS) using Sewall Wright’s method [24] and the pairwise genetic differentiation between populations (Fst) were also calculated using the diveRsity package. To detect the level of genetic variation among samples within populations and among populations at different hierarchical levels, the Analysis of MOlecular VAriance (AMOVA) [25] was performed using StAMPP R package [26]. StAMPP calculates an AMOVA based on the Nei’s genetic distance matrix using the amova() function from the package PEGAS for exploring within and between population variation. StAMPP uses the formula: distance = populations, to calculate a hierarchical AMOVA as described in Excoffier et al. [25] to explore population differentiation and within/between population variation.
Hierarchical clustering
To measure the genetic distance between the four goat populations, hierarchical clustering of SNP data was performed. Pairwise Nei’s genetic distances [27] between all individuals were calculated from the SNP data by using the StAMPP R package [26]. Additionally, we used Reynolds [28] and Manhattan [29] distances between individuals (Additional file 3: Figure S1). Distances were clustered using the hclust function in R. After clustering of genetic distances between individuals, the ape package [30] was used to convert the resulting dendrogram into a phylogentic tree. For visualization of the phylogenetic tree a standard ape plotting functions was used.
STRUCTURE analysis
Population structure was determined by using a model-based clustering for assigning individuals from multi locus genotypes to a population using the STRUCTURE 2.3.3 software suite [31,32,33]. STRUCTURE uses a maximum likelihood method to infer the genetic ancestry of each individual from a mixture of K pre-defined ancestral groups. STRUCTURE analysis was carried out using an admixture model and correlated allele frequencies. Under the hypothesis of two to five sub-populations K was set from 2 to 5 and the length of the burn-in period was set to 100,000 iterations, followed by 100,000 Markov Chain Monte Carlo (MCMC) iterations. The whole analysis was then repeated five times in STRUCTURE to prevent the model from producing local optimal results. In our analysis, all five runs gave similar results. Results from STRUCTURE were loaded back into the R environment to visualize resulting groups. The most likely hierarchical level of genetic structure was determined by log probability of the data (Ln Pr (X|K) [33] and observed variance of this probability.
Principal component analysis (PCA)
As a different approach to characterize divergence, the genetic relationship between animals was analyzed by principal component analysis (PCA) using the ‘prcomp’ function in the R language. PCA is a statistical method for exploring and interpretation of big data by reducing the dimension of the data to the few principal components which capture the majority of the genetic variation observed in the genotypes. We visualized the contribution of individuals to the first 10 principal components; from our data we observe that a combination of PC1 and PC2 shows a good separation between different goat breeds. As such we investigate which SNPs allow PCA to separate between the four populations of Sudanese goats; we investigated which SNPs contribute highly to PC1.
To select high impact SNPs we calculated the variable correlations with PC1 by multiplying the loading factors with the component standard deviations. Quality of representation for all variables on the factor map (cos2) was calculated as the squared variable correlation. Afterwards, we summed up all the cos2 values for PC1 and expressed each SNP contribution as percentage of total variation. We then looked for SNPs which have more than 10 times the expected contribution to PC1 (contribution ≥0.02%, expected 0.0021%). These SNPs will capture most of the differentiation among breeds separated by PC1. We extracted genes within a region of 500 kb cis-window around these high contributing SNPs according to SNP Annotation and Proxy Search (SNAP) [34].