Copy-number variants in clinical genome sequencing: deployment and interpretation for rare and undiagnosed disease

Purpose: Current diagnostic testing for genetic disorders involves serial use of specialized assays spanning multiple technologies. In principle, genome sequencing (GS) can detect all genomic pathogenic variant types on a single platform. Here we evaluate copy-number variant (CNV) calling as part of a clinically accredited GS test.
Methods: We performed analytical validation of CNV calling on 17 reference samples, compared the sensitivity of GS-based variants with those from a clinical microarray, and set a bound on precision using orthogonal technologies. We developed a protocol for family-based analysis of GS-based CNV calls, and deployed this across a clinical cohort of 79 rare and undiagnosed cases.
Results: We found that CNV calls from GS are at least as sensitive as those from microarrays, while only creating a modest increase in the number of variants interpreted (~10 CNVs per case). We identified clinically significant CNVs in 15% of the first 79 cases analyzed, all of which were confirmed by an orthogonal approach. The pipeline also enabled discovery of a uniparental disomy (UPD) and a 50% mosaic trisomy 14. Directed analysis of select CNVs enabled breakpoint-level resolution of genomic rearrangements and phasing of de novo CNVs.
Conclusion: Robust identification of CNVs by GS is possible within a clinical testing environment.


Supplemental Figures
Inspection of Coriell GS CNVs in microarray data. Shown is the median microarray probe depth for CNVs called by the GS-Canvas data processing pipeline across a cohort of 17 Coriell cell lines. The background density on the right of the figure represents the distribution of a four-probe rolling median across the chip. * indicates putative mosaic copy-number states (1.5X-1.75X for deletions and 2.5X-2.75X for copy-number gains).

In these cases, calls from our GS call-set had partial overlap with the PacBio/BioNano-derived calls. We inspected these manually to better understand the discrepancies and assess false-positive or true-positive status.

Supplemental Tables
Supplemental Note Figure 1: This is a homozygous deletion flanking a mosaic 22q11 deletion, a likely cell-line artifact. Our GS pipeline called this event as a single CNV, whereas the PacBio/BioNano-based call-set only contained the homozygous deletion.
Figure 2: This CNV is a homozygous deletion followed by a mosaic loss leading up to the centromere of chromosome 2. The homozygous deletion is contained in the PacBio/BioNano call-set, but the mosaic loss is missed or filtered.
Figure 3: This is a very common deletion supported by both population data and discordant sequencing reads. The PacBio/BioNano call-set only partially called this deletion, but the data strongly support the GS depth-based call. We suspect that this was missed by the alternative technologies due to the presence of a more complex structural rearrangement in the region.

Supplemental Note
We investigated false negative (FN) calls to determine whether any systematic issues could be identified. To search for error modes, FN calls were analyzed via manual inspection of microarray depth, sequencing depth, and discordant reads. We found that nearly all of the discrepant calls either occurred in low-complexity regions not covered by microarray, or had ambiguous annotation on the Coriell website and/or in the copy-number calling publication 3. Although we cannot definitively conclude that certain calls from Coriell are erroneous, data from NGS and multiple genotyping arrays do not support a majority of these calls. Accordingly, while the initial recall was calculated at 86% (31/36 events), this in-depth review of the data leads us to speculate that the true sensitivity is considerably higher.

Manual Inspection of Coriell CNV Calls
Prior to validation, a 75% reciprocal overlap threshold was set for classifying calls as concordant. In Table 1 we note 4 CNV calls with reciprocal overlaps in the range of 50-75%. A post-hoc analysis of these data generally supports the boundaries of the Canvas CNV. The Coriell-provided coordinates for all four CNVs are given in Table S1.
NA02767: trisomy 21. The Coriell website records the CNV as extending across the centromere, whereas Canvas calls the trisomy as the entirety of 21q, resulting in a 70% overlap. We note that we cannot call CNVs in 21p due to low sequence complexity.
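The reciprocal overlap criterion can be illustrated with a minimal sketch. The function names and the toy coordinates below are hypothetical and are not taken from the validation pipeline. Intervals are assumed to be half-open and on the same chromosome.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions for two intervals.

    Assumes both intervals lie on the same chromosome and use half-open
    coordinates (start inclusive, end exclusive).
    """
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return 0.0  # the intervals do not overlap
    frac_of_a = shared / (a_end - a_start)
    frac_of_b = shared / (b_end - b_start)
    return min(frac_of_a, frac_of_b)


def is_concordant(call, ref, threshold=0.75):
    """Apply a reciprocal overlap threshold (75% as stated in the text)."""
    return reciprocal_overlap(*call, *ref) >= threshold


# Toy example (made-up coordinates): the call covers only part of the
# reference interval, so the reciprocal overlap is limited by the reference.
ref = (0, 100)
call = (30, 100)
print(round(reciprocal_overlap(*call, *ref), 2))  # 0.7
print(is_concordant(call, ref))                   # False
```

In this toy case the call lies entirely within the reference interval, so the overlap fraction is 100% of the call but only 70% of the reference. This is the same pattern as a whole-arm call compared against a reference annotation that extends across the centromere.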

Manual Inspection of Calls with Less Than 75% Reciprocal Overlap
For assessment of reference call recovery, we chose a one-sided overlap to measure the fraction of a given reference call recovered. This was because many of the reference CNVs in Coriell are reported from exon- or probe-based measurements, and/or because compatibility issues between reference assemblies result in imprecise CNV boundaries. In addition, a bi-directional metric is complicated by reference CNVs being represented by multiple CNVs from our calling pipeline. This is unavoidable for larger CNVs, as benign variation often breaks up large CNVs into multiple calls: for example, in the case of trisomy 21, there were multiple benign deletions causing a deviation from copy number 3 within much smaller regions.
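A minimal sketch of a one-sided recovery metric follows, assuming that the fraction recovered is the portion of the reference interval covered by the union of all pipeline calls of the matching type. The function name and coordinates are hypothetical, and the source does not state exactly how fragmented calls were combined.

```python
def recovered_fraction(ref, calls):
    """Fraction of a reference interval covered by the union of calls.

    `ref` is a (start, end) tuple and `calls` is a list of (start, end)
    tuples on the same chromosome, all half-open. Calls may overlap each
    other or extend past the reference boundaries.
    """
    ref_start, ref_end = ref
    # Clip each call to the reference interval and drop non-overlapping calls.
    clipped = sorted(
        (max(start, ref_start), min(end, ref_end))
        for start, end in calls
        if min(end, ref_end) > max(start, ref_start)
    )
    covered = 0
    cursor = ref_start  # right edge of the region already counted
    for start, end in clipped:
        start = max(start, cursor)  # avoid double-counting overlapping calls
        if end > start:
            covered += end - start
            cursor = end
    return covered / (ref_end - ref_start)


# Toy example: one large reference gain reported as two fragments, with a
# small gap where a benign deletion changes the copy number.
print(recovered_fraction((0, 100), [(0, 40), (45, 100)]))  # 0.95
```

Because the metric is one-sided, a call that extends beyond the reference boundaries is not penalized here. This is why the complementary check, the fraction of each call that overlaps the reference, is still needed to catch large artifactual calls.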
During our validation, calls were manually curated to check for edge cases in which a large artifactual CNV may spuriously validate a reference call. To formally assess this, each CNV overlapping a reference-set call was assessed for the fraction of the call overlapping the reference call. Nine calls with less than 75% overlap were curated and are represented below. These calls are clearly supported by the sequencing depth data and are unlikely to be artifacts. See Supplemental Note Figures 5 and 7- and was split into two calls by our CNV caller. The smaller call extends past the boundary of the reference call, but qualitatively it is clear that the correct call should extend to the centromere.

Manual Inspection of False Negative Coriell CNV Calls
Independent investigation of false negative (FN) calls was performed to determine whether any systematic issues could be identified (Supplemental Note Table 1). To search for error modes, FN calls were analyzed via manual inspection of microarray depth, sequencing depth, and discordant reads. Figure 15). In contrast, the CNVs on NA09834 and NA20304 are likely mapping or array artifacts. Take, for example, the 418 kb deletion on chromosome 15 in NA20304 (Supplemental Note Figure 16). This CNV is reported in hg18 coordinates, and in our arrays there appear to be few probes within the region. Comparing the mappability (UCSC track 'Duke unique 35mers') between hg18 and hg19 in this region, it becomes clear that the updated reference represents the sequence of this region multiple times, as the 'uniqueness' drops from ~1 to ~0.5 in most of the region (Supplemental Note Figure 16). Thus we conclude that, relative to hg18, this is likely a copy-number 4 region, and any copy-number variant called there would reflect this collapsed representation. Taken together, we hypothesize that this CNV could be an artifact of the Affymetrix array from which it was derived, but we have insufficient evidence to definitively rule this call out as a false negative.
Figure 15: Example of CNV annotated in the Coriell sample NA21886 that has little to no support from two commonly used clinical microarrays.
Figure 16: Example of CNV annotated in the Coriell sample NA20304 that has little to no support from two commonly used clinical microarrays.

Microarray confirmation of CNVs
All sequenced clinical samples were run in parallel with Illumina Infinium Omni 2.5 genotyping chips.
Array-based CNV analysis was conducted post-hoc on all samples with a positive CNV result that was not validated externally. Samples were processed in a single batch through GenomeStudio, and median centered across the cohort. All listed p-values are assessed via a permutation of the probe logR values for a given sample across all autosomes except the one on which the CNV resides. Note that the CNVs reported in subjects P6 and P7 were confirmed by an external clinical microarray lab, and that in P16 was confirmed via karyotype.
Figure 18: Subject P7. chr2: 11314-3033976. Empirical P < 10^-10.
Figure 27: Subject P13. chr15: 22696624-23301066. Empirical P < 0.01. Note that this was a low-quality array sample (LogRDev = 0.33), which was deemed suitable for sample tracking but would not normally be used in CNV analysis. In addition to the significant depth change, we also see an absence of heterozygous variants, which lends further support to the deletion call.
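One way an empirical p-value of this kind could be computed is sketched below. The test statistic (the absolute median logR of the probes inside the CNV), the random resampling of equally sized probe sets from the background autosomes, and all names are assumptions made for illustration. The text does not specify the exact permutation scheme used.

```python
import random
from statistics import median


def empirical_p(cnv_logr, background_logr, n_perm=10000, seed=0):
    """Empirical two-sided p-value for the median logR of a candidate CNV.

    cnv_logr:        logR values of the probes inside the CNV.
    background_logr: logR values of probes on all autosomes except the one
                     carrying the CNV (same sample).
    Assumption: the null distribution is built by drawing random probe sets
    of the same size from the background and taking their median.
    """
    rng = random.Random(seed)
    observed = abs(median(cnv_logr))
    k = len(cnv_logr)
    hits = 0
    for _ in range(n_perm):
        draw = rng.sample(background_logr, k)
        if abs(median(draw)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: a background centered near zero and a clear single-copy loss.
background = [((i % 21) - 10) / 100 for i in range(1000)]
deletion = [-0.5] * 20
print(empirical_p(deletion, background, n_perm=200))  # 1/201, about 0.005
```

With this estimator the smallest reportable value is 1/(n_perm + 1), so the resolution of the p-value is set by the number of permutations.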
De novo CNV phasing models
For de novo CNVs, we use the inheritance patterns of small variants to determine the parental haplotype on which a CNV resides.

Deletion phasing
Here we simply compare the inheritance of variants under the assumption that the deletion is on either the maternal or the paternal allele.
Table 2: Model assuming deletion on paternal allele (all variants inherited from mother):
Model log-likelihoods: father -2472.071702; mother -6295.203394

Supplemental Note
Prediction: de novo deletion on paternal allele.
Figure 8: Transition frequencies for example deletion.
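The comparison of the two deletion models can be sketched as a pair of log-likelihoods. This is an illustrative simplification and not the transition-frequency model used in the analysis: it assumes that the proband carries a single allele at each site within the deletion, that sites are independent, and that a fixed genotyping error rate applies. The function names and the `error` parameter are hypothetical.

```python
import math


def log_lik_retained_from(parent_genotypes, proband_alleles, error=0.01):
    """Log-likelihood that the proband's remaining allele came from this parent.

    parent_genotypes: list of (allele, allele) tuples for one parent.
    proband_alleles:  list of the single allele seen in the proband at each
                      site inside the deletion (hemizygous call).
    """
    total = 0.0
    for genotype, allele in zip(parent_genotypes, proband_alleles):
        # Fraction of this parent's alleles that match the proband allele.
        match = sum(1 for a in genotype if a == allele) / len(genotype)
        # Mix in a small error rate so impossible transmissions are not -inf.
        total += math.log((1 - error) * match + error * (1 - match))
    return total


def phase_deletion(mother, father, proband, error=0.01):
    """Return the parental allele most likely to carry the deletion."""
    scores = {
        # Deletion on the paternal allele: the remaining allele is maternal.
        "father": log_lik_retained_from(mother, proband, error),
        # Deletion on the maternal allele: the remaining allele is paternal.
        "mother": log_lik_retained_from(father, proband, error),
    }
    return max(scores, key=scores.get), scores


# Toy example with three sites (made-up genotypes).
mother = [("A", "A"), ("C", "T"), ("G", "G")]
father = [("G", "G"), ("C", "C"), ("T", "T")]
proband = ["A", "T", "G"]
origin, scores = phase_deletion(mother, father, proband)
print(origin)  # father
```

As in the log-likelihoods reported above, the model with the higher (less negative) value is taken as the prediction. In the toy example all proband alleles are consistent with maternal transmission, so the deletion is assigned to the paternal allele.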

Gain phasing
For gains, there are four possible scenarios. A gain may be of maternal or paternal origin, and may be either simple or complex. By simple we refer to a duplication of a single allele, while a complex gain refers to the scenario where a proband inherits material from both of a parent's copies of the DNA segment (for example, in an unbalanced translocation). Additionally, rather than having two copy states as in the case of deletions, gains have four possible variant copy states.
Table 3: Model assuming a simple duplication of a maternal allele: