### Array Design

The Nsp CN array contains 12,339,139 oligonucleotide probes tiled onto two arrays. Probes were selected to represent each of the 1,330,354 fragments between 200–1100 bp predicted to arise after digestion of human genomic DNA with the restriction enzyme NspI. All data presented is based on the human reference genome build 35 (May 2004 build). For all chromosomes, 8–10 PM (perfect match) probes were identified per fragment using a probe selection algorithm previously developed for high density 25-mer arrays [66]. Simple repeats and SNP sequences were avoided.

For background estimation, a pooled set of "antigenomic" probes were used which has been matched to each perfect match feature based on its GC content and which are not present elsewhere in the genome [42].

### Data Analysis

#### I. Preprocessing

##### 1. Probe Filtering

In order to extract the highest quality data from the Nsp-CN arrays, several filtering steps were implemented to remove adversely performing probes.

###### Probe filtering based on probe GC content, fragment length and GC content, and NspI restriction site characteristics

Several previous studies have suggested that the restriction fragment length and GC content as well as probe GC content have a strong effect on feature intensity [37, 52, 67]. Analysis of the relationship between Nsp-CN array probe intensity and its associated probe and fragment characteristics (data not shown) have led to the first set of filtering criteria: probes with less than 30% or greater than 60% GC content were removed as well as probes within restriction fragments greater than 1000 bp in length, <25% GC content, or > 60% GC content. In addition, probes residing in fragments in which the enzyme recognition site contains a SNP [68] were also filtered out.

###### Probe filtering based on number of genome hits

The xMAN (extreme Mapping of OligoNucleotides) algorithm was used to map all Nsp CN probes to the human genome [69]. Probes with more than two genomic hits were discarded due to reduced ability to respond to changes in target dosage.

After the above two filtering steps, the number of probes was reduced from 12,339,139 to 10,379,759 (84.12%), and the number of fragments were reduced from 1,330,354 to 1,245,607 (93.6%). The remaining set of filters was applied independently for each data set.

###### Filtering of high-intensity probes

Exploratory data analysis discovered that probes with the highest intensity on the arrays had very low dose response (Additional File 3), in part due to cross hybridization with multiple sites in the genome. For each set of samples being analyzed together, probes that were consistently in the top 10% intensity categories were filtered out.

###### Filtering of low-intensity probes: estimation of background effects

In order to identify probes that consistently failed to produce a signal above the background level, a sequence specific model was used to estimate the contribution of systematic noise to the probe signal intensity. Although overall probe GC content plays a crucial role in the estimation of background, recent studies have pointed out that position dependent sequence effects are also important [70–72]. Motivated by the sequence-specific model, the following multiple linear regression model was used to describe the background effect on the Nsp CN arrays:

\begin{array}{c}\mathrm{log}\phantom{\rule{0.1em}{0ex}}(Intensit{y}_{i})=\alpha +{\displaystyle \sum _{k\in \{A,C,G\}}{\displaystyle \sum _{l=1}^{3}{\beta}_{k,l}}}{P}_{i,k}^{l}+{\displaystyle \sum _{j=1}^{25}{\displaystyle \sum _{k\in \{A,C,G\}}{\displaystyle \sum _{l=1}^{3}{\gamma}_{k,l}{j}^{l}{I}_{ijk}}}}\\ +{\displaystyle \sum _{m=1}^{24}{\displaystyle \sum _{n\in \left\{A\right\{A,C,G,T\},C\{A,C,G,T),G\{A,C,G,T),T\{A,C,G\left\}\right\}}{\displaystyle \sum _{l=1}^{3}{\delta}_{n,l}{m}^{l}{I}_{imn}}}}+{\epsilon}_{i}\end{array}

(1)

where

• *Intensity*_{
i
}is the probe intensity of probe *i*;

• *α* is the intercept of the regression;

• *j* = 1,...,25, representing the position along the probe *i*;

• *k* represents the base at position *j*;

• *P*_{i,k}is the percentage of nucleotides A, C, G in the probe *i*;

• *β*_{k,l}is the effect of nucleotide percentage (A, C, or G) in the probe, for a fixed base nucleotide *k*, the effect is modeled as a polynomial of degree 3;

• *I*_{
ijk
}is an indicator function such that it is 1 when the *j* th position is base *k* in probe *i*, and it is 0 otherwise;

• *γ*_{k,l}is the effect of base *k* in position *j*, the effect is modeled as a polynomial of degree 3;

• *m* = 1,2,...,24, representing the di-nucleotide position along the probe *i*;

• *n* is the set of di-nucleotide nearest neighbor compositions such as 'AA', 'AC', 'GT' etc;

• *I*_{
imn
}is an indicator function such that it is 1 when the *m* th position is di-nucleotide *n* in probe *i*, and it is 0 otherwise;

• *δ*_{n,l}is the effect of di-nucleotide in position *m*, the effect is modeled as a polynomial of degree 3;

• *ε*_{
i
}is the error-term.

Log intensities of all 33,886 anti-genomic probes were fitted to estimate parameters using least squares. Each array was fitted separately and a total of 64 parameters were estimated for each array. These parameters were used to calculate the background-adjusted intensities for all interrogation probes on the array, and the value of zero was set as the threshold to determine whether signal was greater than background. For each set of samples being analyzed together, probes that exhibited a consistent signal lower than background were filtered out.

###### Probe filtering based on number of probes within a fragment (probe set)

The last probe filtering step removed probes where only a single probe remained for a given fragment (due to filtering from previous steps). Thus, every fragment is represented by at least two probes that have passed all filtering criteria.

##### 2. Probe Standardization

Inspired by previous studies demonstrating that probe intensities are affected by fragment length, fragment GC content, probe GC content, nucleotide locations on the probe, and recognition site sequence of restriction enzyme, optical background adjusted probe intensities were fitted to a multiple linear regression model [37, 70–72]. The AIC stepwise auto-selection procedure was used to identify the best model. The starting model has a 10 degree polynomial for each variable. A cubic term was used with most of the variables and the subset of selected variables can be slightly different from sample to sample. The following multiple linear regression model was used to fit the data:

\begin{array}{c}\mathrm{log}\phantom{\rule{0.1em}{0ex}}(\text{adjusted}P{M}_{i})=\alpha +{\displaystyle \sum _{k\in \{A,C,G\}}{\displaystyle \sum _{l=1}^{3}{\beta}_{k,l}}}{P}_{i,k}^{l}+{\displaystyle \sum _{j=1}^{25}{\displaystyle \sum _{k\in \{A,C,G\}}{\displaystyle \sum _{l=1}^{3}{\gamma}_{k,l}{j}^{l}{I}_{ijk}}}}\\ +{\displaystyle \sum _{m=1}^{24}{\displaystyle \sum _{n\in \left\{A\right\{A,C,G,T\},C\{A,C,G,T),G\{A,C,G,T),T\{A,C,G\left\}\right\}}{\displaystyle \sum _{l=1}^{3}{\delta}_{n,l}{m}^{l}{I}_{imn}}}}\\ +{\displaystyle \sum _{o\in \{A,C,G\}}{\displaystyle \sum _{l=1}^{3}{\eta}_{o,l}}}{F}_{i,o}^{l}+{\displaystyle \sum _{l=1}^{3}{\lambda}_{i,l}{L}^{l}}+{\zeta}_{i}{I}_{ic=C}+{\epsilon}_{i}\end{array}

(2)

where

• adjusted *PM*_{
i
}is the optical background adjusted probe intensity of probe *i*; for each array, the minimum intensity from all interrogation probes is first identified and this number minus 1 is regarded as the optical background intensity and it is subtracted from all probe intensities;

• *α*, *j*, *i, k, P*_{i,k}, *β*_{k,l}, *I*_{
ijk
}, *γ*_{k,l}, *m*, *n*, *I*_{
imn
}, *δ*_{n,l}, *ε*_{
i
}have the same meaning as in formula (1);

• *F*_{i,o}is the percentage of nucleotide A, C, or G in the fragment on which probe *i* resides;

• *η*_{o,l}is the effect of A, C, or G percentage in the fragment, for a fixed base nucleotide *o*, the effect is modeled as a polynomial of degree 3;

• *L* is the length of the fragment which corresponds to probe *i*;

• *λ*_{i,l}is the effect of fragment length, the effect is modeled as a polynomial of degree 3;

• *I*_{
ic
}is an indicator function such that it is 1 when the nucleotide at the 3' restriction cutting site is C and it is 0 otherwise;

• *ζ*_{
i
}is the effect of nucleotide at the 3' restriction cutting site for the fragment on which probe *i* resides;

There are total of 77 parameters in this model consisting of 1 α, 9 β, 45 γ, 9 δ, 9 η, 3 λ and 1 ζ. 100,000 autosomal probes were randomly selected from probes which were kept after filtering steps for each array. Optical background adjusted intensities from these 100,000 probes were used to fit the model to estimate the model parameters for each array. Using these estimated parameters, residual intensities for all probes were predicted and these standardized intensities were used in subsequent steps.

##### 3. Probe Set Summarization

After filtering and standardization, probes residing on the same Nsp I restriction fragment (i.e. the probe set) were summarized to a single value using RMA, a median polish based method developed previously for RNA expression studies to account for feature effects due to probe composition [73]. The effect of RMA was evaluated using the 1X–5X DNA samples, where the linear correlation coefficient and the regression slope improved significantly (Additional File 5).

###### Pair-wise CNV Detection

CNV detection was implemented on a pair-wise basis by comparing a single test sample with a single reference sample. In this study, we only concentrate on discovering CNVs from autosomal chromosomes. Immuglobulin genes (Ig) were removed from the analysis. These regions include IgK at 2p11, IgL at 22q11, and IgH at 14q32[74].

#### II. Genome Segmentation

##### 1. Calculating log of intensity ratio

After the RMA summarization step, each probe set is represented by one single value. Subsequently, log intensities of the reference sample were subtracted from the test sample to obtain the log intensity ratio.

##### 2. Local correction

A local correction step was used to remove outlier fragments based on the premise that a typical CNV region should span more than one *NspI* fragment, and neighboring fragments within a CNV should have a similar log intensity ratio. First, all significant fragments from each chromosome were identified as fragments whose log intensity ratio is 3 times higher than the chromosome specific standard deviation of the log intensity ratio. A single non-significant fragment located between two significant fragments was ignored for subsequent analysis as long as the significant fragments were in the same direction (either positive log intensity ratios or both negative log intensity ratios). Furthermore, if the two significant fragments were very close in distance (<1 kb), all non-significant fragments located between them were removed. In addition, a single significant point was removed if neighboring points, defined as the nearest upstream or downstream fragment within 100 kb, or any fragment within 1 kb, did not show a log intensity ratio greater than 2 times the standard deviation of log intensity ratios. For a typical pair-wise comparison, 0.8%–0.9% of the fragments were filtered out in this step.

##### 3. Kernel smoothing and regression tree partitioning to identify CNV regions

To make array data more comparable across different data sets, the local-corrected intensities were first scaled to a mean of zero by subtracting the mean log intensity ratio for all autosomal fragments. Next, to improve the signal to noise of the adjusted log intensity ratio data, kernel smoothing was applied with a Gaussain kernel and a 10 kb bandwidth. Finally, in order to identify putative CNV regions, the smoothed log intensity ratios were fitted to a regression tree model as described previously [51, 75]. The end result is the partitioning of the genome into consecutive genomic regions. A single measurement is derived from each region which is the mean log intensity ratio based on all fragments that are within the region.

The optimal value for the threshold complexity parameter (cp), was empirically determined using a test sample, NA15510 and a reference sample, NA10851. This parameter controls the complexity of the partitioning of the regression tree. We tested a range of cp values, from 0.0001 (used in our previous study [51]) to 0.001 in a step of 0.0001. The two major metrics used to evaluate this parameter were 1) how well the final CNV list overlapped with validated/reported known CNV regions in sample NA15510 [7, 15], and 2) whether regions either known to undergo somatic rearrangement (such as the Ig loci) or harbor previously identified CNVs are split into several smaller regions. This cp parameter was finally set to 0.0004, indicating that splits which do not increase the overall R-squared value by 0.04% were not tested. In the process of building the regression tree, the "minsplit" parameter was set to 3. When a genomic region contains 3 or less fragments, the tree building procedure was halted. In the tree-pruning phase of the algorithm, 10 fold cross-validation and the 1-SE (standard deviation) rule were used to decide the size and the complexity of the final tree model [51, 75].

#### III. CNV Identification

##### 1. Permutation approach to define CNV significance thresholds

To determine significance thresholds for defining CNV regions in pair-wise comparisons, we used a permutation test after the local correction and mean ratio adjustment. The physical locations of the fragments were randomly permuted 500 times and the permuted data was subjected to the same kernel smoothing and regression tree partitioning procedures with the same algorithmic parameters as described above. A unique threshold was defined for each size group based on the false discovery rate (FDR).

Genome partitioning results from the permutation runs were parsed into 19 size groups containing 2, 3, 4, 5, 6, 7, 8, 9, 10, 11–15, 16–20, 21–30, 31–40, 41–50, 51–60, 61–70, 71–80, 81–90, and 91–100 fragments to get size specific null distributions of the log intensity ratios. A unique threshold was defined for each size group based on the false discovery rate (FDR) [76, 77] with even partitioning of the FDR among all the size groups. The following formula was used to determine the significance threshold for each size group:

{I}_{ij}=\frac{fdr}{{N}_{g}\ast {N}_{a}}\ast \frac{{N}_{pij}}{{N}_{cij}}

(3)

where

• *I*_{
ij
}is the index for retrieving the significance threshold for size group *i* on array *j* of the Nsp-CN array set, *j* = 1,2;

• *fdr* is the pre-specified maximal false discovery rate for the whole Nsp-CN array set;

• *N*_{
g
}is the total number of size groups;

• *N*_{
a
}is the number of arrays in the array set, *N*_{
a
}= 2 for Nsp CN array set;

• *N*_{
pij
}is the number of genomic regions in the size group *i*, based on results summarized from all the permutation runs of array *j*;

• *N*_{
cij
}is the number of genomic regions in the size group *i*, based on results from tree partitioning of the test sample's genome on array *j*;

Once the *I*_{
ij
}was computed, *I*_{
ij
}+ 1 was the index used to retrieve significance thresholds for size group *i* on array *j*. Thresholds for amplifications and deletions were computed separately. Significant regions from the partitioned test sample were identified using these log intensity ratio thresholds. For putative CNV regions containing more than 100 fragments, which were not considered directly in the permutation test, we used the threshold derived from the 91–100 fragment group and required the log intensity ratio to be greater than 3 times the standard deviation of raw, unsmoothed autosomal log intensity ratios.

The optimal number of falsely-detected CNVs for our test sample was identified as eight (after testing values between 1 and 10) using the following criteria: 1) overlap of generated CNV regions with reported CNVs in the literature and 2) consistency with QPCR validation. This number corresponds to a FDR (False Detection Rate) of ~5% since there are ~160 CNVs detected in each pair-wise comparison (8/160).

##### 2. Additional criteria for determining the final CNV regions

To generate the final list of CNV regions, the following additional steps were taken:

1) Only putative CNV regions with average log intensity ratios greater than 4 times the standard deviation of kernel smoothed, autosomal log intensity ratios were retained.

2) Adjacent significant regions were merged to form one larger CNV region and the log intensity ratio of the newly merged region was averaged.

3) Only CNVs containing more than one significant fragment were retained. Significance was based on having a raw log intensity ratio at least 3 times more significant than the standard deviation of raw, un-smoothed autosomal log intensity ratios.

### Target preparation and hybridization to arrays

DNA from cell lines was purchased from the Coriell Institute for Medical Research (Camden, NJ). The DNA samples containing different numbers of X chromosomes (1X to 5X sample set) are NA10851, NA15510, NA04626, NA01416 and NA06061. The sample used for much of the parameter tuning and CNV identification was the test sample, NA15510. Additional samples include two HapMap trios (NA10831, NA12155, NA12156, NA10846, NA12144, NA12145). In all cases a normal male reference sample, NA10851, was used for comparison.

For target preparation of the DNA, we used the whole genome sampling assay (WGSA) as described by the manufacturer for the Nsp250K SNP genotyping array [63]. Briefly, 250 ng of DNA is digested with NspI, adapter-ligated, and PCR amplified using a single primer homologous to the adapter. After purification, 90 ug of fragmented and labeled target is hybridized onto the array.

For data quality assessment, genotype calls were generated from 250 SNPs using the DM (Dynamic Modeling) calling algorithm with cutoff p-value 0.26 [78]. Any arrays giving rise to a call rate of less than 85% were redone.

### QPCR validation of CNV regions

Quantitative PCR using the ABI 7500 Sequence Detection System was used to independently validate CNVs detected by our algorithm as described previously [37]. At least four replicate reactions for novel CNVs were run for each primer pair and the comparative ΔΔC_{T} method (User Bulletin #2; Applied Biosystems) was used to calculate the fold change at each locus between the test and reference samples. In addition, a t-test p-value based on the ΔCt values was used to determine the statistical significance of the result. The thresholds for determining whether an amplicon was validated or not were set using results from seven independent X chromosome amplicons that were each analyzed using the 1X to 5X DNA samples (Additional File 7). The 1X, 3X, 4X and 5X DNA samples were compared to the normal female 2X sample for each of the seven amplicons for a total of 28 measurements (4 comparative measurements per amplicon × 7 amplicons). All results that showed a fold change less than 0.8 or greater than 1.25 as well as a p-value < 0.01 were considered to be significant. Using these thresholds, there were 24 of the 28 comparisons that reached significance. Of the four measurements that did not meet significance, one (Chr X_Amplicon 2) is a known copy number variant between NA15510 (2X sample) and NA18501 (1X sample) and thus this did not pass the fold change threshold. The remaining three measurements all passed the fold change threshold but did not pass the p-value cut-off. For ambiguous results, the QPCR was repeated and often new primer pairs were designed as shown in the Additional File 6. Of the 96 QPCR-validated CNVs, 18 were tested with a single amplicon and 76 were tested with at least two independent amplicons. Also, for CNVs that failed QPCR validation, 23 out of 25 were tested with two or more amplicons. Any one primer pair displaying significance was considered evidence of CNV validation. Some novel CNVs reside in regions of segmental duplication that preclude the identification of QPCR primer pairs that generate a single unique amplicon. Thus independent validation of these CNVs is technically challenging, leading to possible false negative results.

### Data Release

The raw data from this study are posted at the Gene Expression Omnibus with accession number GSE9053 [79].