Skip to main content

Principal components analysis - K-means transposon element based foxtail millet core collection selection method



Core collections are important tools in genetic resources research and administration. At present, most core collection selection criteria are based on one of the following item characteristics: passport data, genetic markers, or morphological traits, which may lead to inadequate representations of variability in the complete collection. The development of a comprehensive methodology that includes as much element data as possible has been explored poorly. Using a collection of (Setaria italica sbsp. italica (L.) P. Beauv.) as a model, we developed a method for core collection construction based on genotype data and numerical representations of agromorphological traits, thereby improving the selection process.


Principal component analysis allows the selection of the most informative discriminators among the various elements evaluated, regardless of whether they are genetic or morphological, thereby providing an adequate criterion for further K-mean clustering. Overall, the core collections of S. italica constructed using only genotype data demonstrated overall better validation scores than other core collections that we generated. However, core collection based on both genotype and agromorphological characteristics represented the overall diversity adequately.


The inclusion of both genotype and agromorphological characteristics as a comprehensive dataset in this methodology ensures that agricultural traits are considered in the core collection construction. This approach will be beneficial for genetic resources management and research activities for S. italica as well as other genetic resources.


The exploitation of genetic resources has been a primary concern for several governmental and nongovernmental agricultural institutions around the world [1], where the interest may vary from economically exploitable variant crops [2], to sociocultural [3], health-related [4], and biological-related studies (phylogenetic relationships, phenotype-genotype relationships, and physiological-environmental behaviors [1]). However, most researchers must address the problem of data mining to obtain collections of an appropriate size [5].

Due to the size of some collections, complete collection (MC) data mining may sometimes be too expensive (both operative and monetary); therefore, core collections (CC) [6] and mini-core collections have emerged in recent decades [7].

Methods for obtaining an optimal CC have been explored widely [811], and several algorithms and informatics tools have been developed [1215], but CCs still have many different objectives and various evaluation criteria [10].

Most CC-related studies are based on one or more of three principal characteristics: a) passport data, b) genotypic analysis, and c) morphological traits ([16]). As new genetic information becomes available, CC selection has increasingly used genotypic analysis as a good criterion, but the efficiency of specific molecular markers needs to be demonstrated for phenotypic traits of interest because both types of data are fundamental requirements of genetic breeding programs [17]. Several studies have utilized molecular markers in different collections, including the development of CCs based on widely used simple sequence repeats [11, 17, 18] and restriction fragment length polymorphisms [19], which have demonstrated the great potential of using genetic data for CC selection.

Foxtail millet (Setaria italica subsp. italica (L.) P. Beauv.) is one of the oldest cereals consumed by people in Eurasia, America, Africa, and Australia. Foxtail millet has a relatively small genome size (515 M) and it is has been adopted as a model organism [20, 21] because of its potential use in studies that involve grass species evolution, C3 and C4 photosynthesis, stress biology and biofuel [2224].

Three recently active transposons (TE) have proved to be suitable genome-wide markers for evolutionary studies of S. italica [25]. We hypothesize that these markers may also be useful for CC selection in this species.

In this study, we combined principal components analysis (PCA) and the K-means method for CC selection [18] based on evaluations of traditional and newly described CC evaluation parameters [10]. This methodology allowed to include both genotypic and agromorphological traits (AT) in CC selection. Thus, we present a proof of concept for the potential use of TE and AT combined as selection criteria for CC construction in S. italica.


Core collection selection

Dataset used

The accessions used in this study originated from 38 different countries, which encompassed the major traditional geographical distribution (Asia, Eurasia, and Africa) of the study species. In order to obtain genomic information, transposon display (TD), a modified form of amplified fragment length polymorphism (AFLP) [26], was performed with some modifications using three TEs: TSI-1 [tourist miniature interspersed nuclear elements (Tourist MITE)], TSI-7 [long terminal repeats (LTR) retrotransposons], and TSI-10 [short interpersed nuclear elements(SINE)], with different classes and characteristics [27]. These TEs were identified in the mutant alleles of Waxy (GBSS1), which controls the amylose content in the starch endosperm [27]. The genomic dataset obtained (data 0) comprised a total of 423 S. italica accessions, which were genotyped by TD [25]. AT data was downloaded and categorized from the National Institute of Agrobiological Sciences (NIAS) for 141 of the original 423 accessions. Eight ATs were categorized and mapped to binary data, which were represented as 28 “m” characteristics (data II) for discrete variables, and any possible phenotypic traits were treated as present/absent. Continuous variables were categorized arbitrarily into three groups and then treated as discrete variables using the same present/absent criteria. The original phenotypic values and their numerical representations are summarized in Additional file 1 (Online Resource 1). To facilitate comparisons of data II behavior, we created data I, which comprised the same 141 accessions used in data II, but with the genotypic information for data 0. In order to determine the feasibility of analyzing phenotypic traits with genotypic markers in a single step, we merged the data I and data II sets to obtain (data III), where each m element was treated as equal regardless of its TD or AT origin.

Principal component analysis - K-means analysis

Because the informativeness is different for each m element of data, PCA was performed in order to rearrange data into a new matrix. This procedure decreases the informativeness of subsequent elements and it discards elements with a variance that is equal to 0. This process generated two new matrices: one containing the original m characteristics mapped vectors (x) and the rearranged variance value matrix (X). Thus, matrix X contained n samples, which were formed of a numerical vector with m=m-(non-informative m). m can also be determined arbitrarily in order to work with only the most informative elements of data. To select the CCs, we performed PCA to arrange the data from the most significant to the least significant elements in terms of the difference information discriminator, but without affecting the element associations [28]. After rearranging the data, the score that represented each value was subjected to K-means clustering according to [29], which is an implementation that enhances the K-means algorithm in order to avoid empty clusters. For each K cluster, the sample with the lowest Euclidean distance from the cluster centromere was selected as a representative. The newly generated CC was evaluated according to several validation parameters, which have been used widely [8, 9] and reviewed in recent studies [10].

Evaluation of the selected core collections

The selected CCs were analyzed based on their distribution according to a phylogenetic reconstruction. A genetic distance matrix and a neighbor-joining dendrogram were obtained using AFLP-SURV 1.0 [30] and the Phylogeny Inference Package (PHYLIP) 3.69 [31], respectively, for the 141 accessions present in data I. The data I dendrogram and the visualization of the CCs were obtained using MEGA 5.2 [32]. The geographical distributions of the CCs were digitalized and visualized using DIVA GIS

According to [10], the best method for evaluating a CC depends on the purpose of the CC and ideally different datasets should be used in the evaluation, although it can be performed with the same data. Thus, they established three criteria based on the CC data dispersion: a) average distance between each MC sample and the nearest CC sample (ANE), b) average distance between each CC sample and the nearest CC sample (ENE), and c) average distance between CC samples (E), which are calculated as:

$$ ANE_{tot}=\frac{1}{L}\sum\limits^{K}_{k=1}\sum\limits^{J}_{j=1}D(k-cMC_{j}), $$

where K is the total of CC elements, k is each CC element, and D is the alignment-free genomic distance (GAFD) [33] between k and each jth cMC element, for which the closest CC element is k, including itself, thereby yielding L comparisons in total.

$$ ENE_{tot}=\frac{1}{L}\sum\limits^{K}_{k=1}D(k-cCC), $$

where K is the total of CC elements, k is each CC element, and D is the GAFD distance between k and its closest CC element cCC, excluding itself, thereby yielding L comparisons in total.

$$ E_{tot}=\frac{1}{L}\sum\limits^{K}_{k=1}\sum\limits^{J}_{j=1}D(k-cCC_{j}), $$

where K is the total of CC elements, k is each CC element, and D is the GAFD distance between k and all other jth CC elements, cCC, excluding itself, thereby yielding L comparisons in total.

The ideal value for ANE is 0, where each sample of CC represents itself and others exactly like it. It is useful to evaluate CCs where the objective is a homogeneous representation of the diversity in the MC. In addition, ENE and E are used to evaluate the data dispersion for the CC, where higher values indicate the better representation of extreme values.

Evaluation criteria based on statistical parameter comparisons between the CC and the MC are used mainly to determine whether the CC adequately represents the identity of the MC as well as its distribution. Widely used evaluation parameters that meet these criteria were applied as follows.

A homogeneity test was performed on each trait for CC and MC based on the means and variances. For each comparison, a global value was represented as the percentage of traits that were statistically different (α=0.05) according to a tt e s t for means (MD) and the Ft e s t for variances (VD) [8].

The coincidence rate (CR) and variable rate (VR) were used to evaluate the properties of the CCs in terms of the MC, which are defined by:

$$ CR=\frac{1}{M}\sum\limits^{M}_{m=1}\frac{R_{CC}}{R_{MC}}*100 $$


$$ VR=\frac{1}{M}\sum\limits^{M}_{m=1}\frac{CV_{CC}}{CV_{MC}}*100, $$

respectively, where R is the range and CV is the coefficient of variation for each m trait in the CC and MC, and M is the number of traits. According to ([9]), a valid CC has C R>80 and M D<20, which are the limits for the ideal representation of the MC identity and distribution. The coverage of alleles (CA) in a CC measures the percentage of alleles from the MC that are present in the CC, which is given by:

$$ CA=\left[|1-(|1-ACC|/AMC)|\right]* \,100, $$

where ACC is the set of alleles in the CC and AMC is the set of alleles present in the MC [12].

Excluding the phylogenetic reconstruction and geographical distribution, all of the methodological procedures were performed using FREEMAT v4.2

The FREEMAT codes are available in Additional file 2 (Online Resource 2).

Results and discussion

Usefulness of transposon display markers for CC selection

Locus-specific molecular marker systems, such as SNPs [21, 34], microsatellites [35] and other indel events [34] are available for foxtail millet. These markers may provide useful information for CC selection, but the full coverage of the complete genome with these markers has some conceptual and methodological limitations. SNPs and indels provide relatively less information per locus due to their bi-allelic nature and over 10,000 SNPs may be required to discriminate a closely related populations [36]. Microsatellites may overcome these limitations, but testing microsatellites that cover the complete genome distribution also incur high laboratory expenses and time-consuming procedures [1].

The use of TEs as an alternative to locus-specific molecular marker systems is based on the assumption that a significant fraction of plant genomes comprise TEs [37], i.e., recently active display higher polymorphisms [38]. A considerably large number of alleles can be detected using TEs as genetic markers with a small number of primer sets. CC selection using TEs combined with the recently released foxtail millet genome sequence [21] will considerably increase the number of polymorphic markers. Thus, we proposed a method that does not require genomic information, or a large number of locus-specific genetic markers, which is based on an AFLP-like technique that could easily be transferred to other biological systems. This method will enhance the reliability of CC selection considerably, thereby refining the exploitation of genetic resources.

To demonstrate the efficiency of ATs and TEs as CC selection criteria, we used K-means as a practical approach to clustering based on Kai et al. [11], who stated that the use of the principal coordinates instead of raw data (i.e., microsatellite genotype data) before K-means clustering makes the clustering step less sensitive to changes in the noisiness of the raw data. We agree that dimensionality reduction can enhance clustering process and it is possible to reduce the number of dimensions analyzed during this methodological step. However, to avoid more variables in the ATs and TEs evaluation, we used all of the informative data and we will explore the significance of dimension reduction in future implementations.

Validation of the CCs selected by different datasets

The validation scores (VS) for different K values are presented in Table 1. As expected, the scores obtained by the CCs improved as their K values increased, which strongly suggests that the VSs are consistent with those reported previously [9, 10]. Interestingly, the VSs agreed with the data I, data II, and data III distributions (Fig. 1). When the CCs were constructed and evaluated using the same data (Figs. 2 (left), 3 (center) and 4 (right)), data II obtained better ANE and ENE results because these values should be affected considerably by the relationship between the data distribution and K value. This effect was supported when the CCs were constructed and evaluated using different data (Figs. 2 (center&right), 3 (left & right) and 4 (left & center)). Thus, the CCs constructed using data I and evaluated with data II obtained better results in terms of most of the VSs, but not vice versa. Initially, this may suggest that genotypic data are better for CC construction, but a genotype-based CC cannot ensure the inclusion of interesting agricultural traits. In general, the data III VS values were as expected between data I and data II, but there were some interesting exceptions. When they were compared using the same data, the ANE and ENE values with data III were lower than those obtained with the other datasets. This may be explained by the data distribution pattern (Figs. 2 (left), 3 (center) and 4 (right)). The data distribution of data III was wider, which would lead to poorer ANE values with the same k than when the data distribution is more compact. The same distribution effect obtained the opposite result when compared with different data, where in some cases data III obtained even better ANE values than data I and data II. The ENE values were also affected by the data distribution because wider distributions generated extreme value representations, which were more difficult to handle under the k-mere representations implemented in this study (i.e., the closest element to the centromere). A better ENE score may be obtained using different selection criteria, which will be addressed in future implementations of this concept.

Fig. 1

Principal component distributions for data I (blue), data II (black), and data III (red) in the first three (left) and two (right) principal components, respectively

Fig. 2

Principal component distributions of data I (left), data II (center), and data III (right) in data I for the first two principal components

Fig. 3

Principal component distributions of data I (left), data II (center), and data III (right) in data II for the first two principal components

Fig. 4

Principal component distributions of data I (left), data II (center), and data III (right) in data III for the first two principal components

Table 1 Core Collection evaluation scores for different K selected elements

The discreteness of the 141 accessions used in the CC selection procedures was confirmed by displaying their distribution on the phylogenic dendrogram based on data 0 presented in Additional file 3 (Online Resource 3). In order to evaluate whether the CC was representative, a phylogenetic dendrogram was constructed based on the genotypic distances among the MCs data I. The phylogenetic reconstruction obtained eight groups, which agreed with previously reported groupings [25]. Thus, the selected CCs were identified according to this dendrogram.

The distribution pattern of the dendrogram demonstrated that data I CC covered the largest number of branches, followed by data III and data II (Fig. 5). This may be because the tree itself was constructed using complete data, which differed from data I only in terms of the number of accessions included in each dataset. However, data II CC also covered over half of the branches when K>12. Data III CCs covered as many different branches as data I CC (except K=48). This suggests that the data III-based CCs successfully integrated phenotypic information into the genotypic information, but without altering the distribution in the dendrogram. The geographical distributions of the selected CCs were also displayed on a world map and the results are shown in Fig. 6 Data II CCs represented the widest geographical distribution range. The CCs include accessions from both the longitudinal and latitudinal range edges, even small K CCs (Fig. 6). This clearly indicates that the data II CCs represent accessions that are adapted to different environmental conditions. As the number of K increased, the distribution range became wider for all the CCs in terms of both the longitude and latitude. Interestingly, several accessions were selected from different datasets. Among these accessions, two were included in 100 % of the CCs irrespective of their original dataset (12 times in 12 CCs), and 5 accessions were present in 66.7 % (8 times out of 12 CCs) to 91.7 % (11 times out of 12 CCs) of the CCs. These accessions may be distantly related to other accessions in terms of both their genetic and phenotypic traits, although the establishment of a phenotype/genotype correlation would require a different approach. Thus, we demonstrated that it is possible to generate adequate CCs using both phenotypic and genotypic information, and it is important to remember that the phenotypic traits employed in this study were selected and mapped arbitrarily only to establish a proof-of-concept with respect to the feasibility of constructing a comprehensive CC based on both genotypic and AT information. Further studies based on the optimization of phenotypic numerical representations are needed to enhance the accurate representation of the available information. We believe that the use of adequate AT mappings and the inclusion of different molecular markers will improve the CC selection process. This methodology could be used to infer ancestry, particularly with low K when the algorithm is expected to favor the selection of polyphyletic taxons that would represent unique ancestry for each element in the CC. However, it needs to be taken into consideration that phenotypic traits may affect this expected outcome, and that the algorithm was not designed nor tested for ancestries establishment.

Fig. 5

Distribution of the selected CCs (k = 12) from data I (solid circles-left), data II (solid triangles-center), and data III (solid squares-right) based on the dendrogram obtained using 141 foxtail millet individuals. The dashed lines represent groups of clusters

Fig. 6

Geographical distribution of k = 12 CCs from data I (top), data II (center), and data III (bottom). The colored dots represent the geographical origin of each CC member and the crosses represent the geographical origin of each accession included in the analysis. Maps were generated with Diva-GIS 7.5 based on GADM v.1.0

To the best of our knowledge, the present study is the first attempt to combine genotypic and morphological information during CC construction with this approach. It was possible to construct CCs based on both information types using the proposed methodology. As demonstrated by the VS values, the PCA distribution (Figs. 2, 3, and 4), phylogenetic representations (Fig. 5), and geographic distributions (Fig. 6), the phenotypic data provided useful and potentially important information. We believe that genotypic information alone should not be used to generate CCs. In general, morphological information is used to include variation in the CC [11, 18]. Our evaluation of the PCA distribution suggests that both phenotypic and genotypic information have important effects on the selected CCs.


Our approach was successful in capturing most of the genotypic, phenotypic, and geographical diversity in a small set of individuals. Data III CCs were highly representative in terms of both genetic and phenotypic variations. The use of this approach for CC selection may provide beneficial materials in terms of biochemical, morphological, agronomic, and phylogenetic traits, which can be combined with genomic information. The precise definition of phenotypic numerical representations requires further attention, but we believe that combined information CCs will be highly beneficial for breeding improvement, domestication description processes, evolutionary studies, and phenotype/genotype correlation research given the advantages of using adequate CCs for S. italica as well as other crops.

Availability of data and materials

Supporting data and codes are available as additional files.


  1. 1

    McCouch SR, McNally KL, Wang W, Sackville Hamilton R. Genomics of gene banks: a case study in rice. Am J Bot. 2012; 99(2):407–23. doi:10.3732/ajb.1100385.

    Article  PubMed  Google Scholar 

  2. 2

    Studnicki M, MADRY W, Schmidt J. Efficiency of sampling strategies to establish a representative in the phenotypic-based genetic diversity core collection of orchardgrass (Dactylis glomerata. Czech J Genet Plant Breed. 2013; 2013(1):36–47. [Accessed 8 August 2014].

    Google Scholar 

  3. 3

    Bellon M, Smale M, Aguirre A, Taba S. Identifying appropriate germplasm for participatory breeding: An example from the Central Valleys of Oaxaca, Mexico. 2000. [Accessed 8 August 2014].

  4. 4

    Santra M, Matthews SB, Thompson HJ. Development of a core collection of Triticum and Aegilops species for improvement of wheat for activity against chronic diseases. Agric Food Secur. 2013; 2(1):4. doi:10.1186/2048-7010-2-4.

    Article  Google Scholar 

  5. 5

    Reeves Pa, Panella LW, Richards CM. Retention of agronomically important variation in germplasm core collections: implications for allele mining. TAG. Theoretical Appl Genet Theoretische und angewandte Genetik. 2012; 124(6):1155–71. doi:10.1007/s00122-011-1776-4.

    Article  Google Scholar 

  6. 6

    Brown AHD. Core collections: a practical approach to genetic resources management. Genome. 1989; 31(2):818–24. doi:10.1139/g89-144.

    Article  Google Scholar 

  7. 7

    Guo Y, Li Y, Hong H, Qiu LJ. Establishment of the integrated applied core collection and its comparison with mini core collection in soybean (Glycine max). Crop J. 2014; 2(1):38–45. doi:10.1016/j.cj.2013.11.001.

    Article  Google Scholar 

  8. 8

    Franco J, Crossa J, Warburton ML, Taba S. Sampling strategies for conserving maize diversity when forming core subsets using genetic markers. Crop Sci. 2006; 46(2):854. doi:10.2135/cropsci2005.07-0201.

    Article  Google Scholar 

  9. 9

    Hu J, Zhu J, Xu HM. Methods of constructing core collections by stepwise clustering with three sampling strategies based on the genotypic values of crops. TAG Theor Appl Genet. 2000; 101(1-2):264–8. doi:10.1007/s001220051478.

    Article  CAS  Google Scholar 

  10. 10

    Odong TL, Jansen J, van Eeuwijk FA, van Hintum TJL. Quality of core collections for effective utilisation of genetic resources review, discussion and interpretation. TAG Theor Appl Genet Theoretische und angewandte Genetik. 2013; 126(2):289–305. doi:10.1007/s00122-012-1971-y.

    Article  CAS  PubMed  Google Scholar 

  11. 11

    Kai S, Tanaka H, Hashiguchi M, Iwata H, Akashi R. Analysis of genetic diversity and morphological traits of Japanese Lotus japonicus for establishment of a core collection. Breed Sci. 2010; 60(4):436–46. doi:10.1270/jsbbs.60.436.

    Article  Google Scholar 

  12. 12

    Thachuk C, Crossa J, Franco J, Dreisigacker S, Warburton M, Davenport GF. Core Hunter: an algorithm for sampling genetic resources based on multiple genetic measures. BMC Bioinformatics. 2009; 10:243. doi:10.1186/1471-2105-10-243.

    PubMed Central  Article  PubMed  Google Scholar 

  13. 13

    De Beukelaer H, Smýkal P, Davenport GF, Fack V. Core Hunter II: fast core subset selection based on multiple genetic diversity measures using Mixed Replica search. BMC Bioinformatics. 2012; 13:312. doi:10.1186/1471-2105-13-312.

    Article  PubMed  Google Scholar 

  14. 14

    Jansen J, van Hintum T. Genetic distance sampling: a novel sampling method for obtaining core collections using genetic distances with an application to cultivated lettuce. TAG Theor Appl Genet Theoretische und angewandte Genetik. 2007; 114(3):421–8. doi:10.1007/s00122-006-0433-9.

    Article  CAS  PubMed  Google Scholar 

  15. 15

    Gouesnard B, Bataillon T. MSTRAT: An algorithm for building germ plasm core collections by maximizing allelic or phenotypic richness. J Hered. 2001; 92(1):93–4. doi:10.1093/jhered/92.1.93.

    Article  CAS  PubMed  Google Scholar 

  16. 16

    Parra-Quijano M, Iriondo JM, Cruz MDL, Torres E. Strategies for the development of core collections based on ecogeographical data. Crop Sci. 2011; 51(2):656. doi:10.2135/cropsci2010.04.0191.

    Article  Google Scholar 

  17. 17

    Santos-Garcia MO, de Toledo-Silva G, Sassaki RP, Ferreira TH, Resende RMSA, Chiari L, et al.Using genetic diversity information to establish core collections of Stylosanthes capitata and Stylosanthes macrocephala. Genet Mol Biol. 2012; 35(4):847–61. doi:10.1590/S1415-47572012005000076.

    PubMed Central  Article  PubMed  Google Scholar 

  18. 18

    Ebana K, Kojima Y, Fukuoka S, Nagamine T, Kawase M. Development of mini core collection of Japanese rice landrace. Breed Sci. 2008; 58(3):281–91. doi:10.1270/jsbbs.58.281.

    Article  Google Scholar 

  19. 19

    Kojima Y, Ebana K, Fukuoka S, Nagamine T, Kawase M. Development of an RFLP-based Rice Diversity Research Set of Germplasm. Breed Sci. 2005; 55(4):431–40. doi:10.1270/jsbbs.55.431.

    Article  CAS  Google Scholar 

  20. 20

    Zhang Y, Zhang X, Che Z, Wang L, Wei W, Li D. Genetic diversity assessment of sesame core collection in China by phenotype and molecular markers and extraction of a mini-core collection. BMC Genet. 2012; 13(1):102. doi:10.1186/1471-2156-13-102.

    PubMed Central  Article  PubMed  Google Scholar 

  21. 21

    Bennetzen JL, Schmutz J, Wang H, Percifield R, Hawkins J, Pontaroli AC, et al.Reference genome sequence of the model plant Setaria. Nat Biotechnol. 2012; 30(6):555–61. doi:10.1038/nbt.2196.

    Article  CAS  PubMed  Google Scholar 

  22. 22

    Doust AN, Kellogg Ea, Devos KM, Bennetzen JL. Foxtail millet: a sequence-driven grass model system. Plant Physiol. 2009; 149(1):137–41. doi:10.1104/pp.108.129627.

    PubMed Central  Article  CAS  PubMed  Google Scholar 

  23. 23

    Brutnell TP, Wang L, Swartwood K, Goldschmidt A, Jackson D, Zhu XG, et al.Setaria viridis: a model for C4 photosynthesis. Plant Cell. 2010; 22(8):2537–44. doi:10.1105/tpc.110.075309.

    PubMed Central  Article  CAS  PubMed  Google Scholar 

  24. 24

    Muthamilarasan M, Prasad M. Advances in Setaria genomics for genetic improvement of cereals and bioenergy grasses. Theor Appl Genet. 2015; 128(1):1–14. doi:10.1007/s00122-014-2399-3.

    Article  CAS  PubMed  Google Scholar 

  25. 25

    Hirano R, Naito K, Fukunaga K. Genetic structure of landraces in foxtail millet (Setaria italica (L.) P. Beauv.) revealed with transposon display and interpretation to crop evolution of foxtail millet. Genome. 2011; 54(6):506:498–506. doi:10.1139/G11-015.

    Article  Google Scholar 

  26. 26

    Casa AM, Nagel A, Wessler SR. MITE display. Methods Mol Biol (Clifton, N.J.). 2004; 260(1):175–88. doi:10.1385/1-59259-755-6:175.

    CAS  Google Scholar 

  27. 27

    Kawase M, Fukunaga K, Kato K. Diverse origins of waxy foxtail millet crops in East and Southeast Asia mediated by multiple transposable element insertions. Mol Genet Genomics:MGG. 2005; 274(2):131–40. doi:10.1007/s00438-005-0013-8.

    Article  CAS  PubMed  Google Scholar 

  28. 28

    Bamberg J, del Rio A. Selection and validation of an AFLP marker core collection for the wild potato solanum microdontum. Am J Potato Res. 2013. doi:10.1007/s12230-013-9357-5.

  29. 29

    Pakhira MK. A modified k -means algorithm to avoid empty clusters. Int J Recent Trends Eng. 2009; 1(1):220–6.

    Google Scholar 

  30. 30

    Vekemans X. AFLP-SURV. Laboratorie de Génétique et Ecologie Végétale. Bruxelles, Belgium: Université Libre de Bruxelles; 2002.

    Google Scholar 

  31. 31

    Felsenstein J. PHYLIP. Seattle, Washington, USA: Department of Genome Sciences, University of Washington; 2004.

    Google Scholar 

  32. 32

    Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol. 2011; 113(6):1530–4. doi:10.1093/molbev/msr121.

    Google Scholar 

  33. 33

    Borrayo E, Mendizabal-Ruiz EG, Vélez-Pérez H, Romo-Vázquez R, Mendizabal AP, Morales JA. Genomic signal processing methods for computation of alignment-free distances from DNA sequences. PLoS ONE. 2014; 9(11):110954. doi:10.1371/journal.pone.0110954.

    Article  Google Scholar 

  34. 34

    Jia G, Huang X, Zhi H, Zhao Y, Zhao Q, Li W, et al.A haplotype map of genomic variations and genome-wide association studies of agronomic traits in foxtail millet (Setaria italica). Nat Genet. 2013; 45(8):957–61. doi:10.1038/ng.2673.

    Article  CAS  PubMed  Google Scholar 

  35. 35

    Zhang S, Tang C, Zhao Q, Li J, Yang L, Qie L, et al.Development of highly polymorphic simple sequence repeat markers using genome-wide microsatellite variant analysis in Foxtail millet [Setaria italica (L.) P. Beauv]. BMC Genomics. 2014; 15(1):78. doi:10.1186/1471-2164-15-78.

    PubMed Central  Article  CAS  PubMed  Google Scholar 

  36. 36

    Yamamoto T, Nagasaki H, Yonemaru J-I, Ebana K, Nakajima M, Shibaya T, et al.Fine definition of the pedigree haplotypes of closely related rice cultivars by means of genome-wide discovery of single-nucleotide polymorphisms. BMC Genomics. 2010; 11:267. doi:10.1186/1471-2164-11-267.

    PubMed Central  Article  PubMed  Google Scholar 

  37. 37

    Feschotte C, Jiang N, Wessler SR. Plant transposable elements: where genetics meets genomics. Nat Rev Genet. 2002; 3(5):329–41. doi:10.1038/nrg793.

    Article  CAS  PubMed  Google Scholar 

  38. 38

    Monden Y, Naito K, Okumoto Y, Saito H, Oki N, Tsukiyama T, et al.High potential of a transposon mPing as a marker system in japonica x japonica cross in rice. DNA Res. 2009; 16(2):131–40. doi:10.1093/dnares/dsp004.

    PubMed Central  Article  CAS  PubMed  Google Scholar 

Download references


This research was supported partly by the SATREPS project entitled ’Diversity Assessment and Development of Sustainable Use of Mexican Genetic Resources’ funded by the Japan Science and Technology Agency and by Japan International Cooperation Agency, and partly by a Grant-in-Aid (No. 25257416) from the Japan Society for the Promotion of Science.

Author information



Corresponding author

Correspondence to Kazuo Watanabe.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

EB and RMH conceived the idea, carried out coding, performed the experiments, analized the data and wrote the manuscript. MT, MK and KW substantialy contributed with conceptual advice, participated in the discussion of the results and commented on the manuscript at all stages. KW administered the experiment. All authors have read and approved the final version of the manuscript.

Additional files

Additional file 1

Online Resource 1. Tables with Genotype/Phenotype information (OR1.rar).

∙FoxtailATMap.cvs : Mapped Foxtail AT where 1 = ’ presence’; 0 = ’abscence’. (corresponds to data II).

∙FoxtailGenMap.cvs : Mapped Foxtail Genotype where 1 = ’presence’; 0 = ’abscence’.(corresponds to data I). (RAR 18.2 kb)

Additional file 2

∙ alleleFrecNRet.m

∙ CCvalidation.m

∙ corrI.m

∙ ChooseInitialCenters.m

∙ dcKMeansB.m

∙ DirectkMeans.m

∙ eigdec.m

∙ FFTCompare.m

∙ FFTForSignals.m

∙ findCCinFC.m

∙ randCC.m

∙ pca2.m


∙ triu.m (RAR 12.4 kb)

Additional file 3

Online Resource 3. Phylogenetic Dendrogram of data 0 and the selected elements of data II (dots) in OR3.png. (PNG 56.5 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Borrayo, E., Machida-Hirano, R., Takeya, M. et al. Principal components analysis - K-means transposon element based foxtail millet core collection selection method. BMC Genet 17, 42 (2016).

Download citation


  • Core collection
  • Foxtail millet
  • K-means
  • Principal component analysis