This paper presents a simple method for comparing whole microbial populations (Figure 1). First, samples are collected from the microbiome of interest, in this case the rumen and faeces of cattle. Variation in the treatment of samples should be avoided, as this may affect the microbial populations. Next DNA is extracted. The use of the same DNA extraction method on all samples is extremely important, as it has been shown that the chosen DNA extraction method has a large effect on the observed population . The whole DNA extract from each sample is then sequenced on a MPS platform.
To generate the matrix used for hierarchical clustering, quality control is first performed on the sequence reads. Quality control typically involves removing low quality reads, and trimming the ends of reads based on Phred quality scores. The number of reads in each library is then normalised. This removes the effect of library size on the matrix, while retaining information about what does not align to the database. Reads are then aligned to a contig reference. This reference does not have to be well annotated as the method does not rely on identifying specific species or genes. We have found that the database should ideally be from the same type of environment as the samples; although a database generated from a similar environment type (e.g. from a different species, in this case aligning bovine rumen reads to a human faecal reference) can be used if required. This method does not rely on assignments to taxonomy; therefore it can make use of assemblies from whole metagenome sequencing projects as the reference database. Because there is no requirement for 16S genes to be pulled out, the proportion of the metagenome used in the between sample comparison is limited only by the database quality and coverage. The resulting alignments are then used to generate a contig by sample matrix that contains counts of the number of reads from each library that aligned to each contig in the database.
The sample by contig matrix is then used to perform hierarchical clustering. This can be done using freely available software such as the R statistical package , specifically the ‘dist’ command to generate a distance matrix. We used the Canberra and binary distance matrix methods. The command ‘hclust’ is then used to generate a dendrogram (we used the method ‘Ward’). To find the support for the generated dendrogram the package pvclust  can be used. Low support for the clustering pattern may suggest the need for deeper sequencing of the metagenomes. The dendrogram can be used to interpret the relationships between samples on a whole population level. The distance matrix used in this method could be considered the metagenome equivalent of a genomic relationship matrix, which can be used to infer relationships between individuals based on variation across the genome. Here we infer relationships between samples based on variation across the metagenome.
Rumen samples were collected from lactating cannulated cows located at the Victorian Department of Primary Industries Ellinbank Centre near Warragul, Victoria Australia (latitude 38 14`S, longitude 145 56`E). A diet of 6.0 kg dry matter of concentrates and ad libitum lucerne hay was fed to all cows for two weeks prior to sampling. The concentrate mix contained 4.1 kg dry matter of crushed wheat, 1.5 kg dry matter of cold-pressed canola meal, 0.12 kg dry matter of mineral mix and 0.28 kg dry matter of palabind molasses powder.
For the three cows used in the sample repeatability experiment; after two weeks on the diet, the entire rumen contents were removed through the fistulae, sampled and replaced. The samples referred to as ‘TOP’ were collected from the first third of the rumen contents; these samples generally contained mostly feed material and were quite solid. The samples referred to as ‘BOTTOM’ were collected from approximately the last sixth of rumen contents; these samples were mostly liquid and are possibly representative of a reticulum sample. The samples referred to as ‘MIX’ were collected by mixing together every 20th handful of rumen material; this sample is a representation of the entire rumen. Rumen fluid was squeezed from rumen samples, frozen on dry ice, and then stored at −20°C.
For the seven cows used in the rumen/faeces comparison; rumen fluid was collected as per the MIX sample from the repeatability experiment. Faecal samples were collected on the same morning as rumen samples.
DNA extraction and sequencing
Thawed rumen fluid was strained through four layers of UV sterilized tulle. 80 mL of filtered rumen fluid was centrifuged at 5000 g for one hour at 4°C. DNA was then extracted from the pellet using PowerMax Soil DNA Isolation Kit (MO-BIO). Faecal samples were prepared in the same manor, omitting the initial centrifugation step.
Library preparation for sequencing was performed using an in-house indexing protocol. The indexes are a short third read of the sequencing run. Briefly, DNA was sheared to 300 bp, adapters were added by ligation, and then indexes were added using PCR. The libraries were then quantified and pooled. Paired-end sequencing of rumen fluid genomic DNA was performed on the GAIIx (sample repeatability experiment) and HiSeq2000 (rumen/faeces comparison experiment) sequencers (Illumina, San Diego, CA). Sequence reads were trimmed so that the average Phred quality score for each read was above 20. If the read length was below 50 after trimming, the read was discarded.
Hierarchical clustering databases
The DPI_rumen database contained bovine and ovine rumen metagenome sequences that were obtained from whole metagenome sequencing of several samples. Bovine samples were collected from a number of cannulated Holstein-Friesian cows in Victoria, at different times, which were fed ryegrass pasture supplemented with cereal grain. The ovine sample was collected from one sheep, after slaughter at a Victorian abattoir, which had been on a diet of pellets (23.1% crude protein, 4.5% fat). Sequencing of the whole genomic DNA prepared from filtered and unfiltered rumen fluid was performed using the 454 GS-FLX (Roche Diagnostics, Indianapolis, IN), according to manufacturer’s instructions and on 454 FLX Standard (Roche Diagnostics, Indianapolis, IN) and 454 Titanium (Roche Diagnostics, Indianapolis, IN) sequencing platforms at JCVI . See Additional file 1 for a more detailed description.
The SFF files from Roche 454 MPS of the bovine and ovine rumens were assembled into contigs. A combination of Newbler v2.5.3 (release 20101207_1124) and Newbler v2.3 (release 091027_1459) was used. Assemblies were run with default settings, and also stringent parameters: 98% minimum identity, 100 bp minimum length of overlap. The flag ‘-urt’ was used to extend contig length.
The database JGI_rumen consists of contigs previously published by Hess et al. . The GreenGenes database  consisted of rRNA sequences. The NCBI prokaryote database  contained all sequenced prokaryotes available on the NCBI’s genomes database on the 9th of March 2011. The soil database contained contigs derived from sequencing of top soil . The human faeces database contained contigs from sequencing of human faeces . Additional file 1 contains detailed database characteristics.
For the sample repeatability experiment; 6 million reads from each library were analyzed by hierarchical clustering. Sequencing reads were aligned to each of the databases using the default settings of BWA . Alignments were then used to make a sample-by-contig matrix of the number of reads mapping to each contig in the database. Hierarchical clustering was then performed using this matrix in the R statistical package ; the package Pvclust  was used to compute bootstrap (BP) and approximately unbias (AU) values. For hierarchical clustering two distance matrix methods were used: Binary for presence/absence data and Canberra for counts data.
To test the minimum sequencing depth required for separation of sample by cow, alignments were subsampled by extracting the first 1000, 10 000, 100 000, 1 million, 2 million, 3 million, 4 million, 5 million and 6 million reads from each alignment file. Hierarchical clustering was then performed on the subsamples as described above, using the Canberra distance matrix method.
For the rumen/faeces comparison experiment sequence reads were hierarchically clustered using the same method as the sample repeatability experiment. The R statistical package  then used the sample-by-contig matrix to assess the correlations between each faeces sample and each rumen sample. A t-test was then performed on these correlation values to assess if the correlation between a cows rumen and faeces profile was higher than the correlation between samples from two different cows.
Linear mixed models
Statistical analysis was performed using ASReml . The number of reads was modelled using a linear mixed model. For the rumen position experiment, 3 million pairs of reads were aligned to the reference. For the faeces/rumen comparison 4 million pairs of reads were aligned to the reference. A contig by sample matrix (rumen metagenome profile) was generated. Contigs with less than 10 reads aligning were removed from the matrix. The models fitted to the data were Yijk = μ + sample positioni + contigj + animalk + eijk for the repeated sampling experiment and Yijk = μ + sample typei + contigj + animalk + eijk for the rumen faeces comparison. Y was the number of reads mapping, animal was the host animal that the samples were taken from, contig was the contig in the database used, sample type (faeces or rumen fluid) specified the section of the digestive system that the sample was from and sample position (top, bottom or mix) specified the poison within the rumen that the same was taken from. Metagenome profiles generated from alignments to both DPI_rumen and JGI_rumen were used.
For the rumen/faeces comparison experiment 100 000 reads with a minimum Phred quality of 20 for every base, a minimum average quality of 30 across the whole read and a minimum length of 90 basepairs were aligned to nr (last update September 22nd 2011) database on an internal system using BLASTx (minimum e-value 0.2, 5 alignments per read). The resulting output file was then loaded into MEGAN, and the level 2 KEGG assignments were extracted.
Animal ethics statement
Samples collection was approved by the Victorian Department of Primary Industries (DPI) Agricultural Research & Extension Animal Ethics Committee (Number: AEC 2010_16) and the DPI Metropolitan Animal Ethics Committee (Number: 07/004/DH), and conformed with all relevant regulatory standards.
Sequence data availability
All raw Illumina sequence data can be obtained freely by contacting the Department of Primary Industries Victoria Biosciences Research Division. The DPI_rumen database has been uploaded to MG-RAST (metagenomics.anl.gov) [ID: 4491686.3].