### Core collection selection

#### Dataset used

The accessions used in this study originated from 38 countries, which cover the major traditional geographical distribution of the study species (Asia, Eurasia, and Africa). To obtain genomic information, transposon display (TD), a modified form of amplified fragment length polymorphism (AFLP) [26], was performed with some modifications using three TEs of different classes and characteristics [27]: *TSI-1* [Tourist miniature inverted-repeat transposable element (Tourist MITE)], *TSI-7* [long terminal repeat (LTR) retrotransposon], and *TSI-10* [short interspersed nuclear element (SINE)]. These TEs were identified in the mutant alleles of *Waxy* (*GBSS1*), which controls the amylose content of the endosperm starch [27]. The resulting genomic dataset (*data 0*) comprised 423 *S. italica* accessions genotyped by TD [25].

AT data were downloaded from the National Institute of Agrobiological Sciences (NIAS) database (http://www.gene.affrc.go.jp/databases-plant_search_char_en.php?type=9) and categorized for 141 of the original 423 accessions. Eight ATs were categorized and mapped to binary data, yielding 28 *m* characteristics (*data II*). For discrete variables, each possible phenotypic state was scored as present/absent; continuous variables were categorized arbitrarily into three groups and then treated as discrete variables using the same present/absent criterion. The original phenotypic values and their numerical representations are summarized in Additional file 1 (Online Resource 1).

To facilitate comparisons of *data II* behavior, we created *data I*, which comprised the same 141 accessions used in *data II* but with the genotypic information from *data 0*. To determine the feasibility of analyzing phenotypic traits together with genotypic markers in a single step, we merged *data I* and *data II* to obtain *data III*, in which each *m* element was treated equally regardless of its TD or AT origin.
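The present/absent encoding described above can be illustrated with a minimal Python sketch. It is not the FREEMAT code used in the study: the trait names and values are hypothetical, and equal-width binning is only one assumed way of cutting a continuous trait into three groups (the grouping in the study was arbitrary).

```python
def encode_discrete(values):
    """One binary column per observed category (present = 1, absent = 0)."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def bin_continuous(values, n_bins=3):
    """Cut a continuous trait into n_bins equal-width classes (assumed rule)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant trait
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical values for four accessions
grain_colour = ["yellow", "red", "yellow", "black"]
plant_height_cm = [90.0, 120.0, 150.0, 105.0]

colour_cats, colour_bits = encode_discrete(grain_colour)
height_cats, height_bits = encode_discrete(bin_continuous(plant_height_cm))

# Concatenate the per-trait blocks into one binary row per accession;
# marker columns could be appended in the same way to build a merged matrix.
binary_rows = [c + h for c, h in zip(colour_bits, height_bits)]
```

Each trait contributes as many binary columns as it has observed states, so the number of *m* characteristics is the sum of the states over all traits.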

#### Principal component analysis - K-means analysis

Because the informativeness differs among the *m* elements of *data*, PCA was performed to rearrange *data* into a new matrix. This procedure orders the elements by decreasing informativeness and discards elements whose variance equals 0. It generated two new matrices: one containing the vectors onto which the original *m* characteristics were mapped (*x*) and the rearranged variance value matrix (*X*). Thus, matrix *X* contained *n* samples, each represented by a numerical vector of length *m* = *m* − (non-informative *m*). *m* can also be set arbitrarily in order to work with only the most informative elements of *data*. To select the CCs, PCA was therefore used to arrange the data from the most significant to the least significant elements in terms of their discriminating information, but without affecting the element associations [28]. After rearranging the data, the scores were subjected to K-means clustering according to [29], an implementation that enhances the K-means algorithm in order to avoid empty clusters. For each of the K clusters, the sample with the lowest Euclidean distance from the cluster centroid was selected as a representative. The newly generated CC was evaluated according to several validation parameters that have been used widely [8, 9] and reviewed in recent studies [10].
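The clustering and selection step can be sketched in Python as follows. This is an illustration only, under several assumptions: the PCA scores are taken as given (the values below are hypothetical), plain Lloyd's K-means seeded with the first K samples stands in for the enhanced algorithm of [29], and the function names are ours rather than those of the FREEMAT code.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's K-means; the first k points seed the centers (assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster empties
                centers[c] = centroid(members)
    return labels, centers


def select_core(points, k):
    """Index of the sample closest (Euclidean) to each cluster center."""
    labels, centers = kmeans(points, k)
    core = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            core.append(min(members, key=lambda i: math.dist(points[i], centers[c])))
    return sorted(core)


# Hypothetical PCA scores (two components) for six accessions
scores = [[0.0, 0.0], [10.0, 10.0], [1.0, 0.0], [0.0, 1.0], [11.0, 10.0], [10.0, 11.0]]
core_indices = select_core(scores, k=2)
```

Because one representative is taken per cluster, K directly sets the size of the CC.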

### Evaluation of the selected core collections

The selected CCs were analyzed based on their distribution according to a phylogenetic reconstruction. A genetic distance matrix and a neighbor-joining dendrogram were obtained using AFLP-SURV 1.0 [30] and the Phylogeny Inference Package (PHYLIP) 3.69 [31], respectively, for the 141 accessions present in *data I*. The *data I* dendrogram and the visualization of the CCs were obtained using MEGA 5.2 [32]. The geographical distributions of the CCs were digitized and visualized using DIVA GIS (http://www.diva-gis.org/).

According to [10], the best method for evaluating a CC depends on the purpose of the CC and ideally different datasets should be used in the evaluation, although it can be performed with the same data. Thus, they established three criteria based on the CC data dispersion: a) average distance between each MC sample and the nearest CC sample (ANE), b) average distance between each CC sample and the nearest CC sample (ENE), and c) average distance between CC samples (E), which are calculated as:

$$ ANE_{tot}=\frac{1}{L}\sum\limits^{K}_{k=1}\sum\limits^{J}_{j=1}D(k-cMC_{j}), \tag{1} $$

where *K* is the total of CC elements, *k* is each CC element, and *D* is the alignment-free genomic distance (GAFD) [33] between *k* and each *j*th *cMC* element, for which the closest CC element is *k*, including itself, thereby yielding *L* comparisons in total.

$$ ENE_{tot}=\frac{1}{L}\sum\limits^{K}_{k=1}D(k-cCC), \tag{2} $$

where *K* is the total of CC elements, *k* is each CC element, and *D* is the GAFD distance between *k* and its closest CC element *cCC*, excluding itself, thereby yielding *L* comparisons in total.

$$ E_{tot}=\frac{1}{L}\sum\limits^{K}_{k=1}\sum\limits^{J}_{j=1}D(k-cCC_{j}), \tag{3} $$

where *K* is the total of CC elements, *k* is each CC element, and *D* is the GAFD distance between *k* and all other *jth* CC elements, *cCC*, excluding itself, thereby yielding *L* comparisons in total.

The ideal value for ANE is 0, where each sample of the CC represents itself and others exactly like it. ANE is useful for evaluating CCs where the objective is a homogeneous representation of the diversity in the MC. In addition, ENE and E are used to evaluate the data dispersion of the CC, where higher values indicate a better representation of extreme values.
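The three dispersion criteria in Eqs. (1) to (3) can be sketched in Python as below. This is a minimal illustration, not the FREEMAT implementation: Euclidean distance is used as a stand-in for the GAFD distance [33], the function names are hypothetical, and the toy coordinates are invented.

```python
import math


def ane(mc, core_idx, dist=math.dist):
    """Average distance from each MC sample to its nearest CC sample (Eq. 1)."""
    return sum(min(dist(x, mc[c]) for c in core_idx) for x in mc) / len(mc)


def ene(mc, core_idx, dist=math.dist):
    """Average distance from each CC sample to its nearest other CC sample (Eq. 2)."""
    nearest = [
        min(dist(mc[a], mc[b]) for b in core_idx if b != a) for a in core_idx
    ]
    return sum(nearest) / len(nearest)


def e_mean(mc, core_idx, dist=math.dist):
    """Average distance over all ordered pairs of distinct CC samples (Eq. 3)."""
    pairs = [dist(mc[a], mc[b]) for a in core_idx for b in core_idx if a != b]
    return sum(pairs) / len(pairs)


# Toy MC of four samples; samples 0, 2 and 3 form the CC
mc = [[0.0, 0.0], [0.0, 1.0], [3.0, 4.0], [6.0, 8.0]]
core = [0, 2, 3]

ane_value = ane(mc, core)   # only sample 1 is outside the CC, at distance 1
ene_value = ene(mc, core)
e_value = e_mean(mc, core)
```

In this toy case only one MC sample lies outside the CC, so ANE is small, whereas ENE and E reflect how far apart the CC members are from one another. Any other distance function taking two vectors can be passed through the `dist` argument.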

Evaluation criteria based on statistical parameter comparisons between the CC and the MC are used mainly to determine whether the CC adequately represents the identity of the MC as well as its distribution. Widely used evaluation parameters that meet these criteria were applied as follows.

A homogeneity test was performed on each trait for CC and MC based on the means and variances. For each comparison, a global value was represented as the percentage of traits that were statistically different (*α* = 0.05) according to a *t*-test for means (MD) and an *F*-test for variances (VD) [8].

The coincidence rate (CR) and variable rate (VR) were used to evaluate the properties of the CCs in terms of the MC, which are defined by:

$$ CR=\frac{1}{M}\sum\limits^{M}_{m=1}\frac{R_{CC}}{R_{MC}}\times 100 \tag{4} $$

and

$$ VR=\frac{1}{M}\sum\limits^{M}_{m=1}\frac{CV_{CC}}{CV_{MC}}\times 100, \tag{5} $$

respectively, where *R* is the range and *CV* is the coefficient of variation for each trait *m* in the CC and MC, and *M* is the number of traits. According to [9], a valid CC has CR > 80 and MD < 20, which are the limits for the ideal representation of the MC identity and distribution. The coverage of alleles (CA) in a CC measures the percentage of alleles from the MC that are present in the CC, which is given by:

$$ CA=\left[|1-(|1-ACC|/AMC)|\right]\times 100, \tag{6} $$

where ACC is the set of alleles in the CC and AMC is the set of alleles present in the MC [12].
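As an illustration of CR, VR, and CA, the following Python sketch computes the three percentages on toy data. It rests on stated assumptions: the sample standard deviation is used for the coefficient of variation, every trait has a non-zero mean and range in the MC, and CA is computed from the verbal definition (percentage of MC alleles also present in the CC) rather than as a literal transcription of Eq. (6). All names and values are hypothetical.

```python
import statistics


def value_range(values):
    """Range of a trait: maximum minus minimum."""
    return max(values) - min(values)


def coeff_var(values):
    """Coefficient of variation: sample standard deviation over the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def rate(mc_traits, cc_traits, stat):
    """Mean over traits of stat(CC) / stat(MC), as a percentage (Eqs. 4 and 5)."""
    ratios = [stat(cc) / stat(mc) for mc, cc in zip(mc_traits, cc_traits)]
    return 100.0 * sum(ratios) / len(ratios)


def allele_coverage(mc_rows, cc_rows):
    """Percentage of markers present in the MC that are also present in the CC."""
    n_markers = len(mc_rows[0])
    in_mc = {j for j in range(n_markers) if any(r[j] for r in mc_rows)}
    in_cc = {j for j in range(n_markers) if any(r[j] for r in cc_rows)}
    return 100.0 * len(in_mc & in_cc) / len(in_mc)


# Two toy traits, each listed for the MC and for the CC subset
mc_traits = [[1, 2, 3, 4, 5], [10, 20, 30, 40]]
cc_traits = [[1, 3, 5], [10, 20, 30]]
cr = rate(mc_traits, cc_traits, value_range)
vr = rate(mc_traits, cc_traits, coeff_var)

# Toy binary marker matrix: rows are accessions, columns are markers
mc_rows = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
cc_rows = [mc_rows[0], mc_rows[2]]
ca = allele_coverage(mc_rows, cc_rows)
```

In this example the CC keeps the full range of the first trait but only two thirds of the range of the second, so CR falls between the two ratios, and one of the three markers present in the MC is lost from the CC.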

Excluding the phylogenetic reconstruction and geographical distribution, all of the methodological procedures were performed using FREEMAT v4.2 (www.freemat.sourceforge.net).

The FREEMAT codes are available in Additional file 2 (Online Resource 2).