
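Attribution-NonCommercial-NoDerivs 2.0 Korea

Users are free to copy, distribute, transmit, display, perform, and broadcast this work, provided that they follow the conditions below.

Attribution. You must credit the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.

For any reuse or distribution, you must make clear the license terms that apply to this work. These conditions do not apply if you obtain separate permission from the copyright holder. Your rights as a user under copyright law are not affected by the above. This is an easy-to-understand summary of the Legal Code.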

A thesis of the Degree of Doctor of Philosophy Proof-of-concept studies for clinical application of next-generation sequencing in diagnosis of rare genetic diseases February 2019 The Department of Biomedical Sciences Seoul National University College of Medicine Se Song Jang


Proof-of-concept studies for clinical application of next-generation sequencing in diagnosis of rare genetic diseases By Se Song Jang A thesis submitted to the Department of Biomedical Sciences in partial fulfillment of the requirement of the Degree of Doctor of Philosophy in Medical Sciences at Seoul National University College of Medicine February 2019 Approved by Thesis Committee: Professor Professor Professor Professor Professor Chairman Vice Chairman

ABSTRACT

Proof-of-concept studies for clinical application of next-generation sequencing in diagnosis of rare genetic diseases

Se Song Jang
Major in Biomedical Sciences
Department of Biomedical Sciences
Seoul National University Graduate School

Over the past several years, next-generation sequencing (NGS) technology has begun to replace old genetic testing technologies due to its accuracy and large target range. With the help of NGS, many disease-causing or disease-associated variants of disorders suspected to be genetic have been discovered, especially in the field of rare diseases. Nonetheless, further research is still necessary to establish the most realistic and reliable diagnostic methods for various diseases so that NGS will be more feasible in clinical practice.

Clinical decision making about which molecular diagnostic tests to perform is a field of active research; the studies described in this dissertation provide options for clinicians in several different clinical circumstances. The field of noninvasive prenatal diagnosis (NIPD) of rare genetic diseases has embraced the incorporation of NGS into clinical use. Discovery of cell-free fetal DNA (cffDNA) in maternal plasma has led to the rapid development of NIPD. In previous studies, parental haplotypes were resolved using indirect approaches, by inference from the genotype of the proband or other affected family members. In the work detailed in this dissertation, I improved on previous algorithms in two different ways. In the first chapter, I describe using microfluidics-based linked-read sequencing technology to directly resolve maternal haplotypes without the proband's DNA in families affected with Duchenne muscular dystrophy. In the study described in the second chapter, I use traditional indirect haplotype phasing approaches but with a common platform that can perform genotype prediction, diagnosis of the proband and carrier, and recombination event detection simultaneously for multiple X-linked diseases, reducing the time and cost of the analysis. Both studies show that NIPD of X-linked diseases using cffDNA and NGS is an accurate and viable method that could replace invasive testing currently in clinical use.

Chapters 3 and 4 describe how I then investigated genetic testing for rare genetic diseases that are very heterogeneous, specifically early onset epilepsy (EOE) and maturity-onset diabetes of the young (MODY). For heterogeneous diseases, the ultimate goals are genome-wide testing methods such as whole-exome or whole-genome sequencing, but in clinical application, selective gene panel tests have multiple advantages over genome-wide approaches; if appropriate genes and regions are chosen, selective tests can be more efficient, more affordable, and more easily implemented in clinical practice. One of the major difficulties in implementing NGS clinically is variant interpretation. I tried to interpret the pathogenicity of variants more accurately by combining clinical information with publicly available archives and strictly adhering to the guidelines of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG-AMP). Using targeted gene panel sequencing followed by thorough evaluation of the putative variants, I was able to achieve diagnostic yields similar to those of the largest and most comprehensive studies of EOE and MODY.

Different factors needed to be taken into account for each disease. For example, in EOE, both the phenotypic spectrum and the type of genomic variations were diverse. Copy number variations (CNVs) and mosaic variants were detected quite frequently, increasing the diagnostic rate, whereas in MODY, no CNVs or mosaic variations were detected. Hence, designing a panel that can effectively detect low-frequency variations and CNVs is critical for EOE. For MODY, selecting the relevant genes and categorizing the clinical criteria, such as family history, symptoms, age, body-mass index, MODY probability, and C-peptide levels, were important for increasing the diagnostic rate. From both studies, I determined that strictly applying the ACMG-AMP guidelines had considerable significance, resulting in more objective and reproducible interpretations of variant pathogenicity.

NGS is a versatile diagnostic tool applicable to various diseases and in different clinical settings. In this dissertation, I propose multiple alternative NGS approaches that can be applied to the diagnosis of various rare diseases. Designating the most appropriate diagnostic method according to the characteristics of the diseases and choosing the proper target candidate genes are fundamental factors for successful clinical application.

* The first chapter was published in Scientific Reports (1).
* The second chapter was published in Prenatal Diagnosis (2).

Keywords: rare disease; noninvasive prenatal diagnostics; next-generation sequencing; monogenic diseases; pediatric disorders; targeted panel sequencing.

Student number: 2014-21994

CONTENTS

ABSTRACT
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS AND ACRONYMS

General Introduction
  Clinical applications of next-generation sequencing in rare diseases

Chapter 1. Noninvasive prenatal diagnosis of Duchenne muscular dystrophy by direct haplotype phasing using targeted linked-read sequencing
  Abstract
  Introduction
  Materials and methods
  Results
  Discussion

Chapter 2. Development of a common platform for the noninvasive prenatal diagnosis of multiple X-linked diseases
  Abstract
  Introduction
  Materials and methods
  Results
  Discussion

Chapter 3. Epilepsy panel testing for patients with early seizure onset
  Abstract
  Introduction
  Materials and methods
  Results
  Discussion
  Acknowledgments

Chapter 4. Pathogenic variants of maturity onset diabetes of the young using targeted gene panel sequencing in Korea
  Abstract
  Introduction
  Materials and methods
  Results
  Discussion
  Acknowledgments

General Discussion
References

LIST OF TABLES

Chapter 1
Table 1-1. Mutation status in 5 DMD families
Table 1-2. Targeted sequencing summary of genomic and maternal plasma DNA sequencing
Table 1-3. Phasing results concurrent with the fetal genotype
Table 1-4. Comparison of estimated cost between proband-based and direct phasing methods

Chapter 2
Table 2-1. Disease, mutation, and maternal plasma profiles of the six study cohorts
Table 2-2. Sequencing coverage depth of 33 genes and recombination hotspots
Table 2-3. Sequencing summary
Table 2-4. Fractional fetal DNA concentration
Table 2-5. Informative SNVs used for analysis

Chapter 3
Table 3-1. Clinical features related to seizures in 112 study cohorts
Table 3-2. Gene list included in each epilepsy gene panel
Table 3-3. Sequencing summary of 112 study cohorts
Table 3-4. Profile of 49 pathogenic or likely pathogenic sequence variants
Table 3-5. Profile of five pathogenic microdeletions
Table 3-6. Validation results of the mosaic variants found in patients and parents with amplicon sequencing
Table 3-7. Diagnostic yield and clinical features between two groups with classified or unclassified epilepsy syndromes
Table 3-8. Phenotypic spectrum of patients with KCNQ2, SCN2A, or PRRT2 pathogenic variants

Chapter 4
Table 4-1. Comparison of characteristics in pathogenic and nonpathogenic groups
Table 4-2. Percentage coverage of target genes at each frequency
Table 4-3. Characteristics of pathogenic variants
Table 4-4. ACMG-AMP guideline analysis
Table 4-5. Characteristics of dyslipidemia pathogenic variants
Table 4-6. Logistic regression models of BMI 22.5 kg/m² for predicting monogenic diabetes

LIST OF FIGURES

Chapter 1
Figure 1-1. The workflow comparison of direct haplotype phasing and indirect haplotype phasing.
Figure 1-2. Illustration of the principle of linked-read sequencing and haplotype phasing.
Figure 1-3. Haplotype phasing of copy number variations.
Figure 1-4. Coverage and allele frequency plots of maternal genomic DNA.
Figure 1-5. Visualization of linked-reads in each haplotype according to different types of variations.
Figure 1-6. Phasing results of deletion region.
Figure 1-7. Allele fraction distribution of plasma samples.
Figure 1-8. Recombination event estimation results from proband-based indirect phasing vs. direct phasing.
Figure 1-9. Fetal genotype prediction.
Figure 1-10. Recombination detection and haplotype reconstruction in DMD-02.

Chapter 2
Figure 2-1. Genes and recombination hotspots included in the probe for the noninvasive prenatal diagnosis of 33 monogenic X-linked diseases.
Figure 2-2. The average RPKM values of the PLP1 region.
Figure 2-3. Distribution of the allele fraction of the SNVs used for phasing and haplotype dosage analysis.
Figure 2-4. Estimated recombination events throughout the X chromosome in the earliest gestational week.
Figure 2-5. Fetal genotype prediction using the haplotype dosage imbalance of the maternal plasma data in all six families.

Chapter 3
Figure 3-1. Frequency (y-axis) of genes or copy number variations with pathogenic or likely pathogenic variants.
Figure 3-2. Family pedigree of seven patients with pathogenic or likely pathogenic variants inherited from asymptomatic parents.
Figure 3-3. Diagnostic yields and frequencies of genes according to seizure onset: neonatal ( 1 month), early infantile (1-6 months), and late infantile (6-12 months).
Figure 3-4. Diagnostic yields according to the different electroclinical syndromes.

Chapter 4
Figure 4-1. Flowchart of participants.
Figure 4-2. Pedigrees of monogenic diabetes.

LIST OF ABBREVIATIONS AND ACRONYMS

aCGH: array comparative genomic hybridization
ACMG: American College of Medical Genetics and Genomics
ACMG-AMP: American College of Medical Genetics and Genomics and the Association for Molecular Pathology
BMI: body mass index
bp: base pair
cffDNA: cell-free fetal DNA
CGH: comparative genomic hybridization
CHROM: chromosome
CNV: copy number variation
CVS: chorionic villus sampling
DMD: Duchenne muscular dystrophy
Dx: diagnosis
EOE: early onset epilepsy
ExAC: Exome Aggregation Consortium
FHx: family history
GATK: Genome Analysis Toolkit
gDNA: genomic DNA
GVCF: genomic variant call format
HbA1c: hemoglobin A1c
HGMD: Human Gene Mutation Database
HGVS: Human Genome Variation Society
Indel: short insertion/deletion
IPEX: immunodysregulation polyendocrinopathy enteropathy X-linked
IQRs: interquartile ranges
MELAS: mitochondrial myopathy, encephalopathy, lactic acidosis, and stroke
MIDD: maternally inherited diabetes with deafness
MODY: maturity-onset diabetes of the young
NGS: next-generation sequencing
NIPD: noninvasive prenatal diagnosis
PCR: polymerase chain reaction
POS: position
PPV: positive predictive value
RPKM: reads per kilobase per million mapped reads
SBP: systolic blood pressure
SNV: single nucleotide variation
T1DM: type 1 diabetes
T2DM: type 2 diabetes
TLA: targeted locus amplification
WES: whole-exome sequencing
WGS: whole-genome sequencing

General Introduction

Clinical applications of next-generation sequencing in rare diseases

Rare diseases affect a small portion of a given population, defined in the United States as fewer than 200,000 people and in Europe as fewer than 1 in 2,000 people (3). Although some rare diseases are compatible with a good quality of life if diagnosed early and optimally managed, systems that support effective and affordable diagnosis and treatment of such diseases are still very scarce. In the case of rare diseases present at birth, prenatal diagnosis of fetal anomalies is crucial to parental decision making and to assessing the risk of recurrence.

In recent years, a significant amount of research has been directed at applying next-generation sequencing (NGS) to clinical genetic testing (4-9). Diverse molecular tests are currently available, and selecting the best diagnostic method for patients with specific genetic conditions is very important. NGS is rapidly being incorporated into clinical genetic testing to lower costs and reduce the time of analysis. NGS can be applied to the diagnosis of rare diseases, complex diseases, and cancer. This dissertation focuses on the clinical application of NGS in congenital rare diseases. Although each rare disease alone affects a small population, a large number of individuals are affected by these conditions altogether (10). Developing an optimal method for diagnosis, treatment, therapy and long-term management for rare diseases would have extensive population repercussions.

Various molecular tests use NGS, including single-gene testing, gene panel testing, whole-exome sequencing (WES), and whole-genome sequencing (WGS) (11). Single-gene testing is often suitable for disorders with distinctive and typical clinical features and minimal locus heterogeneity. Gene panel testing is frequently the most comprehensive and feasible approach for heterogeneous disorders with phenotypic variability and multiple candidate genes (12). One needs to be strategic when determining which genes will be included in the gene panel. Genes strongly correlated with the diseases should definitely be included. It is also important to include genes that are related to disorders with similar or overlapping phenotypes. Selecting the appropriate genes creates an efficient gene panel, increasing the diagnostic rate. WES and WGS are usually used to diagnose extremely heterogeneous diseases and have the advantage of detecting novel variants previously undetected in other studies. WGS also detects structural variations such as copy number variations (CNVs), inversions, and translocations. However, since WGS covers the entire genome, it produces large amounts of data, even with lower coverage depth, making analysis more costly and time-consuming.

The most commonly used platform in the abovementioned sequencing approaches is Illumina short-read sequencing technology (13), but several other sequencing technologies exist, including long-read sequencing and linked-read sequencing (14-17). In the following chapters, I explore targeted sequencing using the Illumina platform and targeted sequencing combined with microfluidics-based linked-read sequencing technology using the 10X Genomics platform. It is critical to choose the proper sequencing platform and approach in light of a disease's phenotype and genetics as well as the cost, turnaround time, and purpose of the test.

Another clinical field in which NGS is expanding is prenatal diagnosis. The presence of cell-free fetal DNA (cffDNA) in a pregnant mother's plasma (18) has opened the floodgates for the application of NGS to prenatal diagnosis. Invasive methods such as chorionic villus sampling (CVS) and amniocentesis are currently the definitive and the most commonly recommended methods for prenatal diagnosis. However, many women feel uncomfortable with invasive testing because of the physical discomfort and the inevitable 1-2% risk of miscarriage (19). Recently, noninvasive prenatal diagnosis (NIPD) has become possible in clinical settings and is changing the prenatal testing paradigm. NIPD brings no risk of miscarriage and produces fewer false positives than serum screening. However, ambiguity remains regarding the accuracy of NIPD. Since cffDNA makes up only a small fraction ( 10%) of the DNA in maternal plasma at early gestational weeks, numerous variants near the pathogenic variant must be sequenced multiple times to accurately quantify the small allelic differences (20). My reasoning is that if the allele frequencies of most of the nearby variants estimated to be in the same haplotype as the pathogenic variant are increased, this indicates that the fetus has inherited the pathogenic variant. Although noninvasive prenatal testing is still in the proof-of-concept stage, and many professional societies still suggest that it be used as a screening method, not for diagnosis, its high sensitivity and specificity make it a plausible and appealing alternative that could possibly replace the invasive methods.

One of the biggest challenges in the clinical application of NGS is the interpretation of immense numbers of variations. Public and commercially available variant databases are important tools for assessing variant pathogenicity. However, these sources and the published literature can contain ambiguous and insufficient information, so careful evaluation is needed to prevent overassessment of pathogenicity and an incorrect diagnosis. The 2015 variant classification criteria published by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG-AMP) identify functional studies as one of the most important components of variant classification (21), but functional studies must also be carefully evaluated. Hence, it is important to recognize that, like computational predictions, experimental predictions are informative but often not definitive. All of these points underscore the complexity of variant interpretation and the need to improve the annotation of disease alleles in mutation databases and the primary literature. In this thesis, I discuss diverse sequencing approaches applied to several different congenital rare diseases and describe in detail how the variants were interpreted in each study.

In the first chapter, I present the NIPD of DMD using targeted linked-read sequencing. Direct phasing using barcode-linked reads is a fairly new phasing technology; the most commonly used method in NIPD is proband-based phasing. Direct phasing cannot completely replace proband-based phasing, and it is very important to implement the most efficient and appropriate method for the specific disease and genomic variants being investigated.

In the second chapter, I describe the indirect haplotype phasing method using the proband's genotype for the NIPD of more than 30 monogenic X-linked diseases. This platform has advantages over other approaches in that fetal genotype prediction as well as diagnosis of the proband and carrier for multiple diseases can be achieved using a single platform at a reasonable price.

Chapters 3 and 4 examine the use of customized gene panel testing for heterogeneous rare diseases such as EOE and MODY. Although gene panel testing cannot detect pathogenic mutations in novel genes, it is an accurate, convenient, and cost-effective method. I also describe the extended applicability of gene panel testing by investigating not only germline single nucleotide variations (SNVs) but also low-frequency variants, possible indications of mosaicism, and structural variations such as copy number variations (CNVs). Diagnostic yields are investigated, depending on the clinical variables. My intention is to maximize diagnostic yields by designing an adequate gene panel and selecting appropriate test candidates. For heterogeneous diseases such as EOE and MODY, there is no single universal set of clinical criteria. Therefore, classifying the diseases according to specific characteristics such as family history or symptoms, and metrics such as age or BMI, improves the diagnostic rates. Establishing comprehensive criteria before genetic testing enables the creation of a more efficient gene panel and offers the potential for precision diagnosis and treatment of epilepsy and diabetes.

Chapter 1. Noninvasive prenatal diagnosis of Duchenne muscular dystrophy by direct haplotype phasing using targeted linked-read sequencing

Abstract

Correctly resolving the haplotype of the carrier mother is a crucial step for the noninvasive prenatal diagnosis (NIPD) of X-linked recessive diseases, including Duchenne muscular dystrophy (DMD). The most prominent method of haplotyping in NIPD using next-generation sequencing has been proband-based indirect haplotyping, despite its need for the genotypes of other family members. In this chapter, a method for directly resolving the maternal haplotype, without requiring other family members, is presented, thereby providing an alternative NIPD method for DMD. Targeted linked-read sequencing (mean coverage of 692×) was performed on the genomic DNA (gDNA) of five carrier mothers to determine maternal haplotypes. The haplotype of DMD alleles in the carrier mother was successfully phased using a targeted linked-read sequencing platform. For differentiating whether the recombination events occurred in the proband or in the fetus, linked-read sequencing was more accurate than the proband-based phasing method. Moreover, the deletions, duplications, point mutations, and recombination events in the DMD gene were reliably detected from maternal plasma DNA by a single targeted linked-read sequencing platform. The fetal genotype, obtained using amniocentesis or chorionic villus sampling (CVS), confirmed that the predicted DMD mutations were correctly diagnosed in all five families. Direct haplotyping with a targeted linked-read sequencing approach could be used as an alternative or even enhanced phasing method for the NIPD of DMD.

* An earlier version of this chapter was published in Scientific Reports (1).

Keywords: cell-free fetal DNA; Duchenne muscular dystrophy; linked-read sequencing; targeted sequencing; noninvasive prenatal diagnosis.

Introduction

The discovery of cell-free fetal DNA (cffDNA) in a pregnant mother's plasma has made noninvasive prenatal diagnosis (NIPD) attainable (18). With the rapid emergence of next-generation sequencing (NGS) technology, NIPD has become more prevalent than invasive methods such as amniocentesis or chorionic villus sampling (CVS) due to its safety. Several studies that applied next-generation sequencing to NIPD have proven its accuracy and convenience (22-26). In addition to the current use of NIPD to detect aneuploidies in clinical settings (27, 28), the application of this method to monogenic diseases is being actively investigated (29-31).

Prior to this study, my colleagues and I demonstrated that NIPD for DMD is achievable by combining sequencing technology and the targeted capture of the DMD gene (32). Since the majority of DMD cases are caused by copy number variations (CNVs), reliable phasing of the maternal haplotype with pathogenic CNVs was crucial for executing NIPD on DMD families. The proband-based method was applied to resolve the two haplotypes, and the haplotype phasing results were used to determine the haplotype dosage imbalance in the carrier mother's plasma DNA (32). In proband-based phasing, the affected male proband's genotype is designated HapA (mutant haplotype), and the alternative to the proband's genotype is HapB (opposite haplotype). The proband's genotype is easily deduced due to the hemizygosity of his X chromosome. Although genetic testing in the order proband, carrier, fetus is the most frequently used diagnostic flow in DMD clinics, this method cannot be executed in the absence of the proband or other affected family members (32, 33).

Recently, two studies have overcome these impediments by using microfluidics-based linked-read sequencing technology (34) and targeted locus amplification (TLA)-based phasing (35) to directly phase the parental DNA and accurately predict the mutation inheritance pattern in the fetus. These two direct phasing methods offer alternative approaches to NIPD of monogenic diseases, if applied appropriately in clinical settings. Though whole-genome linked-read sequencing has the benefit of being simultaneously applicable to multiple monogenic diseases and not requiring a capture probe (34), the combination of high-coverage (70X) whole-genome sequencing and barcoding technology may be too costly for common application in clinical practice. Although the TLA-based phasing method is much more cost-efficient, the need for a new, customized target capture kit for each NIPD may be problematic. Our earlier method using a single platform for targeted sequencing was a feasible and cost-effective means of proband diagnosis, carrier detection, and NIPD (32). I speculated that if linked-read sequencing could be combined with the targeted approach, requiring only one capture probe, the process of NIPD could be simplified.

As a proof of principle and to examine the suitability of targeted linked-read sequencing technology for clinical practice, I analyzed samples from five families known to carry DMD mutations. The deletions, duplications, point mutations, and recombination events in the DMD gene were detected reliably from maternal plasma DNA by a single targeted linked-read sequencing platform. In this chapter, I demonstrate that direct haplotyping of parental DNA can be readily achieved using targeted linked-read sequencing of the DMD region, allowing the correct prediction of the fetus's mutation status. This targeted approach may provide a practical and cost-effective method for the NIPD of DMD that can realistically be implemented in clinical settings.
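For contrast with the direct approach, the proband-based phasing rule described in this Introduction (the proband's hemizygous allele defines HapA, and the mother's other allele defines HapB) can be illustrated with a minimal Python sketch. This is an illustration only, not the pipeline used in the study: the function name `phase_by_proband`, the dictionary-based inputs, and the example positions are all hypothetical, and the sketch assumes that no recombination occurred between the mother and the proband.

```python
def phase_by_proband(maternal_genotypes, proband_alleles):
    """Assign maternal alleles to HapA (shared with the proband) or HapB.

    maternal_genotypes: dict mapping position -> (allele_1, allele_2)
    proband_alleles:    dict mapping position -> single hemizygous allele
    Returns a dict mapping position -> {"HapA": allele, "HapB": allele}
    for informative (maternal heterozygous) sites only.
    Hypothetical helper for illustration; assumes no recombination.
    """
    phased = {}
    for pos, (a1, a2) in maternal_genotypes.items():
        if a1 == a2:
            continue  # homozygous in the mother: not informative for phasing
        proband = proband_alleles.get(pos)
        if proband == a1:
            phased[pos] = {"HapA": a1, "HapB": a2}
        elif proband == a2:
            phased[pos] = {"HapA": a2, "HapB": a1}
        # otherwise the proband call is missing or inconsistent: skip the site
    return phased


# Made-up example: three maternal sites, two of them heterozygous.
maternal = {100: ("A", "G"), 200: ("C", "C"), 300: ("T", "C")}
proband = {100: "G", 200: "C", 300: "T"}
print(phase_by_proband(maternal, proband))
```

The sketch makes the limitation of the indirect approach concrete: without the `proband_alleles` input, no site can be phased, which is the gap that direct phasing is intended to fill.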

Materials and methods

Participants

The sequencing data of probands, fetuses, and carrier mothers from families DMD-01 to DMD-04 were taken from a previous study (32). One additional family, DMD-05, was sequenced as reported in the previous study (32). Linked-read sequencing of the DMD region was performed on the genomic DNA (gDNA) of the carrier mothers (DMD-01 to DMD-05). Each family had mutations in different regions of DMD (Table 1-1).

Fetal DNA concentration estimation

In addition to the DMD gene, capturing the ZFX and ZFY genes provided a measurement of the fractional fetal DNA concentration. Using the mean read depth of the two zinc finger genes (ZFX and ZFY), with a minimum mapping quality score of 20 and a base quality score of 20, we calculated the fractional fetal DNA concentration as:

\[ \text{Fractional fetal DNA concentration} = \frac{2 \times \mathrm{ZFY}}{\mathrm{ZFX} + \mathrm{ZFY}} \times 100\% \]

Linked-read sequencing

The principles of direct phasing and linked-read sequencing are shown in Figure 1-1 and Figure 1-2. In all five carrier mothers, high-molecular-weight gDNA (average 52.7 kb) collected from blood cells was used to generate barcoded DNA molecules with the 10X Genomics Chromium™ library (Pleasanton, CA) (Figure 1-2). The 10X Genomics Chromium™ technology uses a microfluidic device to partition the genomic DNA molecules into individual oil-enclosed gel bead droplets called GEMs. All fragments within the same GEM are tagged with the same barcode, which is unique to that GEM, creating a library environment within a single GEM (36). Targeted linked-read sequencing was then performed on samples from all five families. The barcoded reads described above were captured using the same customized probe kit as in the earlier study (32). Finally, the barcoded and enriched reads were sequenced with an Illumina HiSeq 2500 sequencing system (San Diego, CA).

Direct phasing and variant calling

Long Ranger (v.2.1.2) software was used to directly resolve the two haplotypes in maternal gDNA by linking the barcoded reads linearly. The wgs option was used in Long Ranger since the design of the DMD capture probe includes both the exonic and the intronic regions of DMD. The barcoded reads were aligned to the human genome (GRCh37/hg19) using 10X Genomics Lariat™. Reads that share the same barcode came from the same original long input DNA molecule, which allows large haplotype blocks to be constructed (Figure 1-2) (36). Variant calling was performed using the FreeBayes method offered in Long Ranger. Heterozygous single nucleotide variations (SNVs), linked to either the haplotype with the mutant allele (HapA) or the wild-type allele (HapB), were used in subsequent analyses for fetal genotype prediction and recombination detection. I used LUMPY to detect exact breakpoints of structural variations (37). I confirmed that the breakpoints matched our previous results obtained from targeted sequencing (32).

Since large deletion/duplication mutations in the DMD gene may obstruct haplotype phasing in linked-read sequencing, I added a confirmation step to inspect the linkage between large deletion/duplication mutations and the phased haplotypes (Figure 1-3). For large duplications, I selected the reads within a region in which the allele frequencies of the reference and alternate alleles were markedly disparate. Then, I examined the heterozygous SNVs within that region to confirm that distinguishing the two haplotypes was possible (Figure 1-3A). For large deletions, I first collected the reads with the same barcode as the wild-type haplotype; reads with barcodes different from those of the wild-type haplotype were considered the mutant haplotype. Next, I realigned the HapA reads to a customized deletion reference (Figure 1-3B). By doing so, I could confirm that the reads aligned properly to the deletion reference and that the heterozygous SNPs at the 5' end and the 3' end of the deletion belonged to the same haplotype.

Fetal genotype prediction

Since DMD is known to have high recombination rates, accurate detection of recombination events is essential before predicting the fetal genotype. I used the R package qcc to remove outliers and to prevent errors in predicting the recombination point caused by outlier values from duplicate or repetitive sequences (38). Then, the R package changepoint was used to predict the statistically supported change point in the read fraction values (39). The fetal genotype prediction was calculated after the recombination event adjustment (Figure 1-1). The fractional fetal DNA concentrations and fetal genotype predictions were measured using ZFX and ZFY, following the method described in the previous studies (32, 40).

The institutional review board at the Seoul National University Hospital approved the study protocol (IRB no. 1606-017-768).
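The fractional fetal DNA concentration formula given under "Fetal DNA concentration estimation" can be checked with a short Python sketch. The reasoning behind the formula, for a male fetus with fetal fraction f, is that ZFY depth is proportional to f (one fetal copy) while ZFX depth is proportional to 2(1 - f) + f = 2 - f (two maternal copies plus one fetal copy), so 2 × ZFY / (ZFX + ZFY) = 2f / 2 = f. The function name `fractional_fetal_dna` and the example depths below are hypothetical and are not taken from the study.

```python
def fractional_fetal_dna(zfx_depth: float, zfy_depth: float) -> float:
    """Fractional fetal DNA concentration (%) from mean ZFX and ZFY read depth.

    Implements 2 * ZFY / (ZFX + ZFY) * 100, as stated in the text.
    Only meaningful for a male fetus, since ZFY is Y-linked.
    Hypothetical helper for illustration.
    """
    total = zfx_depth + zfy_depth
    if total <= 0:
        raise ValueError("combined ZFX and ZFY depth must be positive")
    return 2.0 * zfy_depth / total * 100.0


# Made-up depths: ZFX = 195 reads, ZFY = 5 reads.
print(fractional_fetal_dna(195.0, 5.0))  # 2 * 5 / 200 * 100 = 5.0
```

In this sketch the depths are assumed to have already been filtered by the mapping-quality and base-quality thresholds described in the text.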
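The idea behind the change-point step in "Fetal genotype prediction" can be sketched in Python. The study itself used the R packages qcc and changepoint; the code below is a simplified stand-in, not a reimplementation of either package. It assumes at most one recombination event and locates it as the split that minimises the within-segment sum of squared deviations of the read fraction values. The function name `single_change_point` and the example values are hypothetical.

```python
from statistics import fmean


def single_change_point(values):
    """Return the index k (1 <= k < n) that best splits `values` into two
    segments with different means, chosen by minimising the total
    within-segment sum of squared deviations.

    Simplified illustration: it always returns a split, so real use would
    also need a test or penalty to decide whether any change is supported.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    best_k, best_cost = None, float("inf")
    for k in range(1, n):
        left, right = values[:k], values[k:]
        mean_left, mean_right = fmean(left), fmean(right)
        cost = (sum((v - mean_left) ** 2 for v in left)
                + sum((v - mean_right) ** 2 for v in right))
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


# Made-up HapA read fractions along the gene: the shift after the fourth
# site mimics a recombination event.
fractions = [0.53, 0.52, 0.54, 0.53, 0.47, 0.46, 0.48, 0.47]
print(single_change_point(fractions))  # index of the first site after the shift
```

Outlier removal, which the text performs before this step, is omitted here; outliers from repetitive sequence would pull the segment means and could move the estimated split.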

Results

Sequencing

Throughout the DMD gene, the targeted linked-read sequencing of the five carrier mothers' gDNA showed fairly consistent depth of coverage, with a mean coverage of 676× (Figure 1-4). Table 1-2 provides a detailed sequencing summary of maternal gDNA, maternal plasma DNA, and fetal DNA. N50 phase-block length represents the contiguity accomplished from direct haplotyping (15, 35). In this study, the average N50 phase-block length was 42.7 kb (range 34.6-51.8 kb), which is smaller than in other whole-genome linked-read sequencing studies (15, 35). However, considering that our capture probe only included the DMD region, the N50 values do not accurately depict haplotyping performance. The phasing results were more than adequate for subsequent analysis. Without referring to the previous study's results (32), all carrying mutations were detected from the targeted linked-read sequencing of the carrier mothers' gDNA and were confirmed to be consistent with those from the sequencing data (Figure 1-4 and Table 1-2). The number of informative heterozygous SNPs in the DMD region used for analysis ranged from 700 to 1,000 (Table 1-2).
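For readers unfamiliar with the N50 metric quoted above, the following Python sketch shows how an N50 phase-block length is conventionally computed: it is the block length such that blocks of at least that length together contain at least half of the total phased length. The function name `n50` and the block lengths in the example are hypothetical and are not the study's data.

```python
def n50(block_lengths):
    """Return the N50 of a collection of phase-block lengths.

    Blocks are sorted from longest to shortest and accumulated until the
    running total reaches half of the summed length; the length of the
    block that crosses that threshold is the N50.
    """
    if not block_lengths:
        raise ValueError("block_lengths must not be empty")
    ordered = sorted(block_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Made-up phase-block lengths in kb.
print(n50([10, 20, 30, 40]))  # 40 + 30 = 70 >= 50, so N50 = 30
```

Because N50 depends on the total phased length, a capture design restricted to a single gene region caps the achievable block length, which is consistent with the caveat in the text about comparing these values with whole-genome studies.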

Direct haplotype phasing of mutant (HapA) and wild-type (HapB) alleles from linked-read sequencing

Long-range information was obtained by linking the barcoded short sequencing reads produced by the 10X Genomics Chromium™ technology (Figure 1-2). Reads with the same barcode or the same allele at heterozygous SNP positions as the mutation-supporting reads were labeled HapA. Reads with the same barcode as the opposite allele at heterozygous SNP positions were designated HapB. The two haplotypes of all five carrier mothers' gDNA were directly resolved by linking the haplotype blocks assembled from the barcoded reads. Illustrations of directly phased mutation-linked haplotypes and wild-type-linked haplotypes for different types of variations are shown in Figure 1-5.

Direct haplotype phasing of structural variation

Structural variation can impede haplotyping in linked-read sequencing. To detect and phase structural variations accurately, I reinforced Long Ranger's structural variation detection method by adding a confirmation step (Figure 1-3). Since determining CNVs from coverage depth alone is not always reliable, especially in long-range sequencing, I examined the allele frequencies within the DMD region for duplications. The haplotype with higher coverage

and distinctly higher allele frequency of the heterozygous SNPs within the duplication region was considered the mutant allele, HapA (Figure 1-3A). Since heterozygous SNPs do not exist within the deletion region, only the coverage information could determine to which allele the mutation belonged. I confirmed the haplotype of the deletion by isolating the reads that shared barcodes with the wild-type allele, then aligning the remaining reads to a custom-made deletion reference and determining whether the heterozygous SNPs at the deletion's 5' end and 3' end belonged to the same haplotype (Figure 1-3B). The deletion of DMD-05 called by Long Ranger was more definitive in targeted linked-read sequencing than in whole-genome linked-read sequencing, apparently because of the difference in sequencing depth (Figure 1-6), although the confirmation step was required in both sequencing methods.

Direct vs. indirect phasing and recombination event detection

Nine plasma DNA samples from five pregnant carriers at diverse gestational weeks were target sequenced in the DMD region. The fractional cffDNA concentrations ranged from 4.1% to 9.25%, as shown in Table 1-1. Before calculating the haplotype imbalance between the two resolved haplotypes in maternal plasma DNA, I inspected whether there were any recombination events within

the DMD region. This revealed a critical change point in the read fraction of DMD-05 at eight weeks (DMD-05-8wk) and at 12 weeks (DMD-05-12wk), representing a recombination event in the fetal DNA (Figure 1-7A and Figure 1-8). With the recombination point information, I could reconstruct the haplotypes of DMD-05-8wk and DMD-05-12wk (Figure 1-9E). After the recombination event adjustment, the concordance between the phasing results and the fetal genotype increased in both the indirect and direct phasing methods (Figure 1-9E and Table 1-3). Interestingly, the previous study (32) predicted that the fetus of the DMD-02 family would have a recombination event within the DMD region; this required correction before the dosage imbalance could be estimated (Figure 1-8 and Supplementary Figure 1-10B) (29, 32). However, the direct phasing method using linked-read sequencing showed that this recombination event had actually occurred in the proband rather than in the new fetus (Figure 1-9B and Figure 1-10A). This shows that direct haplotype phasing using linked-read sequencing is simpler and more reliable for determining the individual in which recombination occurred. The direct phasing results in all five samples were >90% in agreement with the fetal genotype (Table 1-3). Recombination events were not detected in families other than DMD-02 (Figure 1-7 and Figure 1-8A, C, and D). I succeeded in correctly predicting the fetal genotype by directly resolving the

allele fraction imbalance between the two haplotypes in the maternal plasma. The mutation status of the fetus was validated using amniocentesis or chorionic villus sampling (CVS). Detailed results are shown in Figure 1-9.
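The haplotype dosage comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the statistical procedure used in the study: the fetus is male, the fetal fraction f is the proportion of plasma DNA that is fetal, read counts at informative SNPs are pooled per haplotype and treated as independent, and a two-sided normal-approximation test against 0.5 stands in for whatever test produced the reported P values. All function names are hypothetical.

```python
import math

def expected_inherited_fraction(fetal_fraction):
    """Expected read fraction of the inherited maternal X haplotype (male fetus).

    Maternal DNA contributes each haplotype in proportion (1 - f); the male
    fetus adds f of the one haplotype it inherited, so the inherited haplotype
    accounts for 1 / (2 - f) of the reads on chromosome X.
    """
    return 1.0 / (2.0 - fetal_fraction)

def predict_inherited_haplotype(hap_a_reads, hap_b_reads, alpha=0.001):
    """Call the over-represented haplotype with a two-sided z-test against 0.5."""
    n = hap_a_reads + hap_b_reads
    if n == 0:
        raise ValueError("no reads at informative SNPs")
    fraction = hap_a_reads / n
    z = (fraction - 0.5) / math.sqrt(0.25 / n)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    if p_value >= alpha:
        call = "inconclusive"
    else:
        call = "HapA" if fraction > 0.5 else "HapB"
    return call, fraction, p_value

# Made-up read counts pooled over informative SNPs
call, fraction, p_value = predict_inherited_haplotype(52000, 48000)
```

Under these assumptions, a fetal fraction of 10% shifts the expected read fraction of the inherited haplotype only from 0.5 to about 0.526, which is why many informative SNPs and deep coverage are needed.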

Table 1-1. Mutation status in 5 DMD families

| Study number | Genotype of mother | Predicted genotype of fetus (a) | Gestational age | Fetal DNA concentration (%) |
|---|---|---|---|---|
| DMD-01 | Exons 49–52 deletion/normal | Normal | 6 weeks, 5 days; 17 weeks, 1 day | 5.66, 7.74 |
| DMD-02 | Exon 2 duplication/normal | Exon 2 duplication | 9 weeks, 3 days; 12 weeks, 1 day | 9.25, 6.85 |
| DMD-03 | Exons 3–7 deletion/normal | Exons 3–7 deletion | 8 weeks, 5 days; 11 weeks, 3 days | 6.34, 8.80 |
| DMD-04 | c.649+2T>C/normal | c.649+2T>C | 7 weeks, 1 day | 6.24 |
| DMD-05 | Exons 52–62 deletion/normal | Exons 52–62 deletion | 8 weeks, 2 days; 12 weeks | 4.10, 5.07 |

(a) All fetuses were male.

Table 1-2. Targeted sequencing summary of genomic and maternal plasma DNA sequencing

| Sample | Mean molecular length (bp) | Total reads | Total reads mapped to hg19 | Total reads mapped to target | Total reads mapped to target (%) | Covered bait bases, 30× (%) | Mean depth | N50 phase block (bp) | Number of informative SNPs |
|---|---|---|---|---|---|---|---|---|---|
| DMD-01 | 38,135 | 71,808,868 | 71,355,903 | 32,197,734 | 44.84 | 99.1 | 686.77 | 41,696 | 740 |
| DMD-02 | 34,933 | 79,473,835 | 78,592,012 | 33,880,770 | 42.63 | 99.1 | 610.28 | 38,361 | 958 |
| DMD-03 | 44,252 | 72,921,736 | 72,398,702 | 32,158,928 | 44.1 | 99.1 | 751.15 | 46,931 | 705 |
| DMD-04 | 27,074 | 75,238,466 | 74,692,925 | 34,483,239 | 45.83 | 99.1 | 885.97 | 34,619 | 881 |
| DMD-05 | 42,809 | 89,246,738 | 88,388,188 | 36,186,926 | 40.55 | 99.1 | 527.74 | 51,769 | 730 |
| DMD-01-fetus (a) | NA | 62,140,714 | 61,254,461 | 14,857,845 | 23.91 | 99 | 691.7 | NA | NA |
| DMD-02-fetus (a) | NA | 53,482,735 | 52,826,547 | 11,252,767 | 21.04 | 98.9 | 525.8 | NA | NA |
| DMD-03-fetus (a) | NA | 33,642,168 | 33,334,631 | 7,798,255 | 23.18 | 91.5 | 336.13 | NA | NA |
| DMD-04-fetus (a) | NA | 29,799,572 | 29,478,604 | 7,881,987 | 26.45 | 98.4 | 248.04 | NA | NA |
| DMD-05-fetus (a) | NA | 29,473,959 | 29,411,841 | 6,877,801 | 23.38 | 80.3 | 298.49 | NA | NA |
| DMD-01-6wk | NA | 54,828,936 | 53,912,045 | 10,094,007 | 18.41 | 97.7 | 465.61 | NA | NA |
| DMD-01-17wk | NA | 53,776,131 | 52,978,358 | 11,454,316 | 21.30 | 98 | 529.38 | NA | NA |
| DMD-02-9wk | NA | 50,420,886 | 49,643,304 | 9,534,590 | 18.91 | 98 | 440.4 | NA | NA |
| DMD-02-12wk | NA | 65,047,893 | 64,180,117 | 17,172,644 | 26.40 | 98.6 | 792.31 | NA | NA |
| DMD-03-8wk | NA | 68,397,568 | 67,646,003 | 25,847,441 | 37.79 | 98.3 | 698.06 | NA | NA |
| DMD-03-11wk | NA | 68,874,610 | 68,144,635 | 26,096,590 | 37.89 | 98.9 | 724.04 | NA | NA |
| DMD-04-7wk | NA | 65,317,530 | 64,610,871 | 25,277,884 | 38.70 | 98.1 | 687.24 | NA | NA |
| DMD-05-8wk | NA | 69,119,036 | 68,947,407 | 24,767,584 | 35.83 | 98.9 | 764.24 | NA | NA |
| DMD-05-12wk | NA | 72,320,288 | 72,141,574 | 25,463,245 | 35.21 | 99.0 | 841.95 | NA | NA |

(a) Genomic DNA samples of fetuses were collected by chorionic villus sampling and amniocentesis.

Table 1-3. Phasing results concurrent with the fetal genotype

| Sample | Before recombination adjustment: direct phasing method (%) | Before recombination adjustment: proband-based phasing method (%) | After recombination adjustment: direct phasing method (%) | After recombination adjustment: proband-based phasing method (%) | Heterozygous SNVs |
|---|---|---|---|---|---|
| DMD-01 | 99.46 | 98.92 | 99.46 | 98.92 | 740 |
| DMD-02 | 91.65 | 63.05 | 91.65 | 63.05 | 958 |
| DMD-03 | 94.61 | 98.01 | 94.61 | 98.01 | 705 |
| DMD-04 | 92.62 | 94.67 | 92.62 | 94.67 | 881 |
| DMD-05 | 46.30 | 48.49 | 90.82 | 90.27 | 730 |

Table 1-4. Comparison of estimated cost between proband-based and direct phasing methods

| Item | Estimated cost per sample (USD) | Proband-based method: number of samples | Proband-based method: total (USD) | Direct phasing method: number of samples | Direct phasing method: total (USD) |
|---|---|---|---|---|---|
| gDNA QC | 30 | 3 | 90 | 2 | 60 |
| Sequencing library | 150 | 3 | 450 | 2 | 300 |
| 10X library | 800 | 0 | 0 | 1 | 800 |
| Probe capture | 500 | 3 | 1500 | 2 | 1000 |
| Targeted sequencing | 100 | 3 | 300 | 2 | 200 |
| Total | | | $2,340 | | $2,360 |
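The totals in Table 1-4 follow directly from the per-sample costs and sample counts. The short sketch below reproduces the arithmetic; the dictionary layout and the function name `total_cost` are illustrative only, and the figures are those listed in the table.

```python
# Per-sample cost (USD) for each step, as listed in Table 1-4
unit_cost = {
    "gDNA QC": 30,
    "Sequencing library": 150,
    "10X library": 800,
    "Probe capture": 500,
    "Targeted sequencing": 100,
}

# Number of samples processed at each step for the two methods
samples = {
    "proband-based": {"gDNA QC": 3, "Sequencing library": 3, "10X library": 0,
                      "Probe capture": 3, "Targeted sequencing": 3},
    "direct": {"gDNA QC": 2, "Sequencing library": 2, "10X library": 1,
               "Probe capture": 2, "Targeted sequencing": 2},
}

def total_cost(method):
    """Sum of (cost per sample x number of samples) over all steps."""
    return sum(unit_cost[step] * samples[method][step] for step in unit_cost)

proband_total = total_cost("proband-based")
direct_total = total_cost("direct")
```

The two totals differ by only USD 20, because the one additional 10X library in the direct method is offset by the proband sample that no longer needs to be processed.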

Figure 1-1. Comparison of direct haplotype phasing and indirect haplotype phasing. Haplotype phasing can be significantly simplified by using targeted linked-read sequencing rather than proband-based indirect phasing.

Figure 1-2. Illustration of the principle of linked-read sequencing and haplotype phasing.

Figure 1-3. Haplotype phasing of copy number variations. Each oval represents a read, and the colors depict reads from the same GEM. Asterisks (*) indicate the positions of SNPs used for analysis, and the red bars mark the non-reference (alternative) alleles. (A) Confirmation step of mutant allele with duplication. The haplotype with higher allele frequency within the duplication region was considered the mutant allele. (B) Confirmation step of mutant allele with deletion. Reads that did not have the same barcodes as the wild-type haplotype (HapB) were labeled the mutant haplotype (HapA). Such reads were realigned to a custom deletion reference to verify whether the SNPs before and after the deletion shared the same barcode.

Figure 1-4. Coverage and allele frequency plots of maternal genomic DNA. The red vertical bars in the graph at the top represent the 79 exons in the DMD gene. The blue bars indicate the coverage depths of DMD. The dark red dots represent the allele frequencies of heterozygous SNPs in the maternal genomic DNA samples. The pathogenic mutation region in the DMD gene is highlighted in gray. The deletion regions in DMD-01, DMD-03, and DMD-05 do not contain any heterozygous SNPs. In the duplication region of DMD-02, with the increase in copy number, the allele frequency values cluster around 1/3 and 2/3.

Figure 1-5. Visualization of linked reads in each haplotype according to different types of variations. (A) Large duplication of exon 2 (DMD-02); (B) large deletion of exons 49–52 (DMD-01); (C) single nucleotide variation, c.649+2T>C (DMD-04). The circles displayed in the enlarged view represent paired-end reads. Reads that come from a single GEM and thus share the same barcode are depicted in the same color and connected by a line of that color.

Figure 1-6. Phasing results of deletion region. (A) Whole-genome linked-read sequencing; (B) targeted linked-read sequencing.

Figure 1-7. Allele fraction distribution of plasma samples. (A) DMD-05, (B) DMD-01, (C) DMD-03, (D) DMD-04.

Figure 1-8. Recombination event estimation results from proband-based indirect phasing vs. direct phasing. Each line graph represents the read fraction of the mutant allele (HapA) obtained from maternal sequencing data from the whole DMD gene. The red horizontal line represents the mean read fraction of the mutant allele (HapA). A value greater than 0.5 indicates that the mutant allele is inherited, and an arrow at the change point indicates the possibility of a recombination event. DMD-05 was the only family with a recombination event predicted by direct haplotype phasing. Only the data from the earliest gestational weeks are displayed above (DMD-01 at six weeks; DMD-02 at nine weeks; DMD-03 at eight weeks; DMD-04 at seven weeks; DMD-05 at eight weeks). Blue asterisks indicate the position of the pathogenic mutation; black arrows indicate the position of the estimated recombination event.

Figure 1-9. Fetal genotype prediction. HapA represents the mutant-linked allele and HapB represents the wild-type-linked allele. After detecting the recombination events, I reconstructed the haplotypes and designated them as HapA* and HapB*, where HapA* represents the recombination-adjusted mutant-linked allele and HapB* the wild-type allele. HapB is overrepresented in the DMD-01 maternal plasma samples. HapA and HapA* are overrepresented in the rest of the maternal plasma samples. Allele fraction differences in all five samples were significant (P < 0.001). (A) DMD-01, (B) DMD-02, (C) DMD-03, (D) DMD-04, (E) DMD-05.

Figure 1-10. Recombination detection and haplotype reconstruction in DMD-02. (A) Read fraction distribution of two haplotypes using direct phasing. The top diagram represents plasma DNA at nine weeks and the bottom at 12 weeks. (B) Read fraction distribution of two haplotypes using proband-based phasing. The black dotted line represents the recombination point predicted by the changepoint algorithm. The top diagram shows plasma DNA at nine weeks and the bottom at 12 weeks.

Discussion

This study enhanced the previously reported approach of NIPD for DMD (32) by using linked-read sequencing to directly phase the maternal haplotype. Indirect haplotype phasing using a proband's genotype not only involves complex computational steps but also requires the DNA of the affected male proband or other male family members (32, 41). NIPD is impossible if the proband is absent or unsuitable for sequencing. An advantage of NIPD using targeted linked-read sequencing is that it does not require the genomic data of a proband or other family members, allowing the mother to be tested during her first pregnancy. This method is more practical in that it can be easily incorporated into genetic counseling and diagnosis and is more cost-efficient than other available NIPD methods. Because of the high background of maternal DNA, the inheritance of mutant-linked maternal alleles can be estimated only by comparing the dosage between the mutant-linked and wild-type-linked alleles. Hence, recombination event adjustment for dosage imbalance detection is critical. The proband-based haplotyping method cannot distinguish whether a recombination event occurred in the proband or in the fetus, which increases the number of recombination adjustments needed. For example, in the previous study (32), DMD-02 was predicted as having a recombination event, but the direct haplotyping method did not show any recombination point, suggesting that this recombination in

fact occurred in the proband. Even though the fetal genotype could be predicted correctly regardless of the timing of the recombination event, any increase in the number of recombination adjustments may increase the number of computational errors. Although more data are needed, the direct haplotyping method from linked-read sequencing has a clear advantage in recombination analysis. Incorrect analysis of a recombination event may lead to a false diagnosis of DMD in the fetus, but this new method reduces that risk. This study illustrates a new method that has additional advantages compared with the methods used in the two recent studies that introduced direct haplotyping for the NIPD of monogenic diseases. Though Hui et al.'s whole-genome-based haplotyping method (34) can be applied to numerous monogenic diseases simultaneously and has advantages for correctly predicting recombination events, it is too costly for clinical practice. Whole-genome linked-read sequencing is particularly unnecessary for monogenic diseases, since it has lower sequencing depth in the region of interest and is more time-consuming and expensive. With introns, exons, and UTRs included in the custom-designed target capture, whole-genome-like results can be achieved for the gene of interest, with a greater depth of coverage. This method could easily be applied to screen for other monogenic disorders by adjusting the target capture region. The TLA approach of Vermeulen et al. (42) is less expensive than the whole-genome method, but customization of the target region is more complex because of ethnic differences in the population frequency of SNPs. Moreover, recombination adjustment is challenging with this method, so in the

event of recombination, the result will be either inconclusive or falsely predictive. Additionally, these two methods require separate capture probes and sequencing platforms for proband diagnosis, carrier detection, and maternal plasma DNA sequencing. In comparison, the proposed targeted linked-read sequencing-based haplotyping method has advantages in terms of recombination prediction and cost-effectiveness. Because linked-read sequencing can accurately detect large deletion and duplication mutations in DMD, this method could also enable carrier diagnosis. Haplotype information obtained from the same sequencing data could be used for future NIPD. Notwithstanding these advantages, the cost-effectiveness of this approach in an actual clinical setting should be noted. The method's main cost advantage is that the proband DNA does not need to be sequenced, whereas in practice the current NIPD of DMD requires three samples, including that of the proband (43). This reduction in cost offsets the cost of the expensive library preparation step in linked-read sequencing. The estimated laboratory cost of NIPD for one DMD family with my custom capture probe is about USD 2,300 for either the proband-based or the direct phasing method (Table 1-4). It is also feasible to multiplex a barcoded library from linked-read sequencing, which will further decrease the cost. Because linked-read sequencing requires the additional library construction step, the turnaround time is three weeks, which is longer than for the proband-based method but still acceptable.

This approach is well suited for the NIPD of DMD, but its application to other monogenic diseases should be validated separately. Proper design of the target region and capture probe is essential for successful application. Though there is currently no officially approved guideline, Lam et al. (30) recommended, based on computational simulation, that 1,000 SNPs and a 200-fold sequencing depth can be used confidently for relative haplotype dosage analysis, even with a low concentration of fetal DNA. It is imperative to check for recombination hotspots around the target region and include these in the recombination adjustment. Since only a few studies have reported the clinical applicability of linked-read sequencing technology to NIPD, more research is required to validate this technique's efficiency and reliability. Nevertheless, the direct haplotyping approach using a targeted linked-read sequencing platform has a clear advantage over proband-based indirect haplotyping and could extend opportunities for the NIPD of DMD.

Chapter 2. Development of a common platform for the noninvasive prenatal diagnosis of multiple X-linked diseases

Abstract

The noninvasive prenatal diagnosis (NIPD) of an X-linked disease can be achieved by resolving the two maternal haplotypes and then estimating the haplotype dosage imbalance between the two alleles. The aim of the study described in this chapter was to develop a single sequencing platform that can target multiple monogenic X-linked diseases. A capture probe designed to target multiple single-gene disorders simultaneously enables the efficient NIPD of these disorders. The capture probe I designed targeted the known causal genes of 33 monogenic diseases and regions with high recombination rates on chromosome X. Six families affected by X-linked monogenic diseases were recruited. Six male proband and carrier mother pairs were target sequenced using the capture probe described above. The pregnant carrier mothers' plasma DNA was obtained at varying gestational weeks and target sequenced. Each family's fetal genotype was determined by evaluating the haplotype dosage imbalance between the two estimated maternal haplotypes. The targeted sequencing data yielded even coverage across the targeted regions. Since the capture probe targeted genes throughout chromosome X, approximately three to five recombination events were detected in each sample, but these did not influence the haplotype dosage analysis for fetal genotype prediction. Accordingly, I succeeded in correctly predicting the fetal genotypes in all six families. A single platform that covers multiple diseases could eliminate the labor of ordering disease-specific probes for the NIPD of each disorder, so this method may offer practical advantages for the clinical application of NIPD for X-linked diseases.

* An earlier version of this chapter was published in Prenatal Diagnosis (2).

Keywords: noninvasive prenatal diagnostics; targeted sequencing; X-linked diseases; monogenic diseases

Student number: 2014-21994

Introduction

After circulating cell-free fetal DNA (cffDNA) was identified in maternal plasma (18), X-linked diseases were the first targets of noninvasive prenatal diagnosis (44). Since only male fetuses undergo subsequent prenatal testing for X-linked diseases, determining the sex of the unborn fetus using cffDNA protects female fetuses from unnecessary invasive procedures. However, while the noninvasive prenatal screening of aneuploidy was quickly incorporated into routine prenatal care (45), the NIPD of monogenic diseases remains in the proof-of-concept stage, mainly because of a lack of validation studies that address sensitivity and specificity, since each monogenic disease is individually rare. Other contributing factors include technical limitations and cost. As the inheritance of maternal mutations can be derived only indirectly by estimating haplotype dosage imbalances, the dosage analysis typically uses tens or hundreds of heterozygous SNV sites near the gene of interest (46). Inevitably, this requires a separate customized capture kit for the sequencing of the maternal genomic DNA and plasma DNA, as exon-based disease panels do not provide sufficient SNV information near the relevant genes. Whole-genome sequencing (WGS) may be used to address this problem, but deep WGS of plasma DNA remains unaffordable in clinical practice. Prior to this study, colleagues and I designed a capture probe explicitly for the

NIPD of Duchenne muscular dystrophy (DMD) (1, 32). One of this approach's advantages is that a single platform can be used for the NIPD of DMD as well as for proband and carrier diagnosis. I speculated that this method could be extended to target not only DMD but multiple X-linked diseases by including other genes along with DMD in one capture probe. The proposed custom-designed capture probe covers the entire exonic and intronic regions of 33 genes related to monogenic X-linked diseases. Adding regions with high recombination rates to the capture probe also allows more accurate detection of possible recombination events.

Materials and methods

Study design and participants

The institutional review board of Seoul National University Hospital approved the study protocol (IRB no. 1606-017-768). Six families affected by various monogenic X-linked diseases were prospectively recruited. Genomic DNA (gDNA) of the proband and his carrier mother was collected from blood samples. The carrier mothers' plasma DNA samples were collected at various gestational weeks (Table 2-1) and were sequenced with the custom-designed capture probe to evaluate whether the fetus had inherited the disease-causing mutation. The prenatal diagnosis results were validated using fetal gDNA obtained via either chorionic villus sampling (CVS) or amniocentesis.

Capture probe design and targeted sequencing

The custom-designed capture probe included all exons and introns of 33 genes related to various monogenic X-linked diseases, as displayed in Figure 2-1. The probe also included all exons and introns of the ZFX and ZFY genes for fractional fetal DNA calculation. I referred to the recombination rates from the HapMap project (47) to select recombination hotspots (60 cM/Mb), and added the regions located 50 kb upstream and downstream of these hotspots to the capture probe (Figure 2-1). The total size of the completed custom-designed capture probe was approximately 10 Mbp. Agilent SureDesign software was used to design the probe, and the targeted DNA was sequenced on an Illumina HiSeq 2500 system.

Sequencing data analysis

Samples were paired-end sequenced with a read length of 2 × 101 bp. The sequenced reads were aligned to the human reference genome (GRCh37) using BWA (v.0.7.15) (48). Picard software (v.2.1.1) (http://picard.sourceforge.net), SAMtools (v.1.3.1) (49), and the Genome Analysis Toolkit (GATK, v.3.8) (50) best-practices pipelines were used in subsequent data analyses. GATK HaplotypeCaller and VariantFiltration were used for variant calling and the filtering of low-quality variants.

Structural variation detection

To detect copy number variation (CNV), I first calculated reads per kilobase per million mapped reads (RPKM) using CoNIFER (51). Only reads with a mapping quality >15 were counted. To correct for coverage depth fluctuations caused by targeted sequencing, I used an outlier-based CNV detection method, calculating the interquartile range (IQR) for each sample and defining the thresholds by the following equations:

Deletion: $\mathrm{RPKM} < q_{25} - 2 \times \mathrm{IQR}$

Duplication: $\mathrm{RPKM} > q_{75} + 2 \times \mathrm{IQR}$

where $q_{25}$ and $q_{75}$ are the 25th and 75th percentile RPKM values of each sample. A CNV was called only when two or more consecutive probes were both amplified or both deleted and the region between the breakpoints was > 5 kb.

Fetal genotype prediction

Since chromosome X is hemizygous in male probands, heterozygous SNVs in maternal gDNA were listed by comparing the proband's and the mother's genotypes. Those heterozygous SNVs were then used to resolve the maternal haplotypes. Chromosome X is known to have high recombination rates, so I tested for any recombination events located 1 Mb upstream and downstream of the gene of interest. Since the PLP1 gene did not contain enough informative SNVs, I tested the SNVs located 5 Mb upstream and downstream of the PLP1 gene in family X1. Families X5 and X6 had a causal variation near the end of the F8 gene, the last gene at the 3' end of the capture probe, so recombination events were tested in the region from 1 Mb upstream of the mutation to the end of the F8 gene. I used the R packages qcc (38) for outlier removal and changepoint (39) for recombination event prediction. The fetal genotype was predicted after adjusting for recombination events.
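The IQR-based CNV rule described in this section can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names, the choice of the "inclusive" quantile method, and the representation of probes as (start, end) tuples are assumptions. The thresholds (2 × IQR, at least two consecutive probes, span > 5 kb) are taken from the text.

```python
from statistics import quantiles

def flag_probes(rpkm, k=2.0):
    """Label each probe 'del', 'dup', or None using q25 - k*IQR and q75 + k*IQR."""
    q25, _, q75 = quantiles(rpkm, n=4, method="inclusive")  # quantile method is an assumption
    iqr = q75 - q25
    low, high = q25 - k * iqr, q75 + k * iqr
    return ["del" if v < low else "dup" if v > high else None for v in rpkm]

def call_cnvs(probes, rpkm, min_probes=2, min_span=5000):
    """Merge consecutive probes with the same label into CNV calls.

    probes: list of (start, end) tuples sorted by position on one chromosome.
    A call is kept only if it covers at least `min_probes` probes and spans
    more than `min_span` bp.
    """
    labels = flag_probes(rpkm)
    calls, i = [], 0
    while i < len(labels):
        if labels[i] is None:
            i += 1
            continue
        j = i
        while j + 1 < len(labels) and labels[j + 1] == labels[i]:
            j += 1
        start, end = probes[i][0], probes[j][1]
        if j - i + 1 >= min_probes and end - start > min_span:
            calls.append((labels[i], start, end))
        i = j + 1
    return calls

# Made-up RPKM values for 20 probes of 3 kb each; probes 10 and 11 are amplified
rpkm = [98, 102, 100, 101, 99, 100, 103, 97, 100, 101,
        250, 260, 99, 100, 102, 98, 100, 101, 99, 100]
probes = [(i * 3000, i * 3000 + 3000) for i in range(len(rpkm))]
calls = call_cnvs(probes, rpkm)
```

Requiring two or more consecutive probes filters out isolated coverage spikes, which are common in capture-based sequencing.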

Results

Sequencing

Maternal plasma DNA was acquired using the same method described in an earlier study (52). Genomic DNA samples acquired from blood cells of all six families were target sequenced using the single custom-designed target probe (Figure 2-1 and Table 2-2) to an average coverage depth of 200×. Table 2-3 provides the detailed sequencing summary of the families. The sequencing depths were evenly distributed among targeted regions (Table 2-2). I used the coverage depths of ZFX and ZFY and the method described by Yoo et al. (32) to calculate the fractional fetal DNA concentrations in all six families (Table 2-4).

Pathogenic variation detection in the proband and the carrier mother

The causal mutations of the four monogenic diseases were diverse and included SNVs, small insertions and deletions, and one large duplication (Table 2-1). Pathogenic CNVs were detected using the IQR-based method described earlier in this chapter. Figure 2-2 presents the average RPKM values of the PLP1 region. The whole-gene duplication of PLP1 in family X1 was accurately detected.
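The idea behind estimating the fetal fraction from ZFX and ZFY coverage can be illustrated with a minimal sketch. The exact calculation of Yoo et al. (32) is not reproduced in this section, so the formula below is an assumption: it supposes a male fetus, equal capture efficiency for ZFX and ZFY, and coverage proportional to copy number. The function name is hypothetical.

```python
def fetal_fraction_from_zfx_zfy(zfx_depth, zfy_depth):
    """Estimate the fetal fraction f from ZFX and ZFY coverage depths.

    With a male fetus, maternal DNA contributes 2 * (1 - f) copies of ZFX and
    the fetus contributes f copies each of ZFX and ZFY, so
    ZFY / (ZFX + ZFY) = f / 2 and therefore f = 2 * ZFY / (ZFX + ZFY).
    """
    total = zfx_depth + zfy_depth
    if total <= 0:
        raise ValueError("coverage depths must sum to a positive value")
    return 2.0 * zfy_depth / total

# Made-up depths: 190x on ZFX and 10x on ZFY correspond to f = 0.10
estimated_f = fetal_fraction_from_zfx_zfy(190, 10)
```

For a female fetus the ZFY depth would be close to zero, so this estimate is informative only once fetal sex has been established.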

Recombination event detection and fetal genotype prediction

The carrier mothers' haplotypes were phased by comparing the hemizygous sequences of the proband with the heterozygous SNVs from the maternal gDNA. Details of the number and distribution of the informative SNVs used for resolving the haplotypes in each family are presented in Table 2-5 and Figure 2-3. The maternal haplotype information enabled statistical testing of the read fraction change in the maternal plasma DNA to detect recombination events before fetal genotype prediction. A few samples had evidence of recombination events in recombination hotspot regions (Figure 2-4). However, no recombination points were detected within 1 Mb upstream or downstream of the putative causal variants in any of the families. Consequently, the haplotype imbalance in the maternal plasma was calculated within 1 Mb upstream and downstream of the inherited variation. The allele fraction of the mutant-linked haplotype was significantly higher in the maternal plasma sequencing data of X1–X5, indicating that the fetus inherited the haplotype with the disease-causing variation (Figure 2-5). The wild-type allele was overrepresented only in family X6, indicating that the fetus did not inherit the disease-causing variation (Figure 2-5). The predicted fetal genotype was compared with the actual fetal genotype obtained using routine invasive procedures such as amniocentesis or chorionic villus sampling.
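Proband-based phasing and the subsequent read-fraction calculation can be sketched as follows. This is a simplified illustration with hypothetical function names and data structures, not the pipeline used in the study. It assumes that the affected proband carries the mutant-linked maternal haplotype throughout the tested region, so the allele seen in the proband at each heterozygous maternal site is assigned to the mutant-linked haplotype.

```python
def phase_with_proband(maternal_genotypes, proband_alleles):
    """Assign alleles at heterozygous maternal SNVs to mutant- and wild-type-linked haplotypes.

    maternal_genotypes: {position: (allele1, allele2)}
    proband_alleles:    {position: allele}  (hemizygous male proband)
    Returns {position: (mutant_linked_allele, wild_type_linked_allele)}.
    """
    phased = {}
    for pos, (a1, a2) in maternal_genotypes.items():
        if a1 == a2:
            continue  # homozygous in the mother: not informative
        proband = proband_alleles.get(pos)
        if proband == a1:
            phased[pos] = (a1, a2)
        elif proband == a2:
            phased[pos] = (a2, a1)
        # missing or discordant proband genotypes are skipped
    return phased

def mutant_haplotype_fraction(phased, plasma_counts):
    """Pooled read fraction of the mutant-linked haplotype in maternal plasma.

    plasma_counts: {position: {allele: read_count}}
    Returns None if no reads cover the informative sites.
    """
    mutant = wild = 0
    for pos, (mut_allele, wt_allele) in phased.items():
        counts = plasma_counts.get(pos, {})
        mutant += counts.get(mut_allele, 0)
        wild += counts.get(wt_allele, 0)
    return mutant / (mutant + wild) if mutant + wild else None

# Made-up genotypes and plasma read counts at three positions
mother = {100: ("A", "G"), 200: ("C", "C"), 300: ("T", "C")}
proband = {100: "G", 200: "C", 300: "T"}
plasma = {100: {"A": 95, "G": 105}, 300: {"T": 104, "C": 96}}
phased = phase_with_proband(mother, proband)
fraction = mutant_haplotype_fraction(phased, plasma)
```

A pooled fraction above 0.5 suggests that the fetus inherited the mutant-linked haplotype. In practice the fraction would be evaluated after recombination adjustment and with a statistical test rather than by inspection.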

Table 2-1. Disease, mutation, and maternal plasma profiles of the six study cohorts

| Study number | Disease | Affected gene | Proband | Mother | Gestational age (weeks) | Fetal fraction (%) |
|---|---|---|---|---|---|---|
| X1 | Pelizaeus-Merzbacher disease | PLP1 | Whole-gene duplication | Carrier | 13 | 11.84 |
| X2 | Duchenne muscular dystrophy | DMD | p.L914X | Carrier | 8, 11 | 8.4, 12.65 |
| X3 | Duchenne muscular dystrophy | DMD | p.G185X | Carrier | 21 | 10.69 |
| X4 | Alpha-thalassemia myelodysplasia syndrome | ATRX | p.R1627T | Carrier | 11 | 10.15 |
| X5 | Hemophilia A | F8 | p.R602X | Carrier | 11 | 8.62 |
| X6 | Hemophilia A | F8 | p.N982fs | Carrier | 11 | 16.93 |

Table 2-2. Sequencing coverage depth of 33 genes and recombination hotspots

| REGION | X1_M | X1_13w | X2_M | X2_8w | X2_11w | X3_M | X3_21w | X4_M | X4_11w | X5_M | X5_11w | X6_M | X6_11w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ABCD1 | 291.1 | 147.64 | 276.34 | 150.67 | 170.41 | 171.53 | 124.94 | 241.92 | 165.05 | 109.06 | 136.94 | 142.49 | 66.87 |
| ARHGEF9 | 296.89 | 206.08 | 380.69 | 215.98 | 257.59 | 215.01 | 148.3 | 341.96 | 253.49 | 226.46 | 181.35 | 251.1 | 72.86 |
| ARX | 272.38 | 166.27 | 285.83 | 169.31 | 199.96 | 175.32 | 135.12 | 255.25 | 194.33 | 156.09 | 148.55 | 185.93 | 69.57 |
| ATP7A | 154.15 | 154.91 | 280.62 | 168.09 | 196.92 | 142.36 | 105.93 | 250.14 | 198.73 | 206.31 | 148.12 | 198.01 | 63.79 |
| ATRX | 116.72 | 140.04 | 237.42 | 153.24 | 178.58 | 119.26 | 93.93 | 211.75 | 175.99 | 200.09 | 136.76 | 187.66 | 61.1 |
| BTK | 297.2 | 188.33 | 358.34 | 201.47 | 239.15 | 197 | 137.04 | 327.15 | 235.63 | 211.86 | 173.83 | 222.98 | 68.24 |
| CASK | 191.56 | 179.69 | 316.16 | 190.75 | 225.49 | 165.8 | 123.98 | 281.45 | 222.45 | 222.97 | 165.53 | 222.99 | 68.96 |
| CDKL5 | 226.59 | 184.94 | 334.23 | 195.8 | 232.01 | 175.55 | 130.73 | 297.76 | 230.35 | 226.08 | 171.98 | 230.32 | 69.26 |
| CLCN5 | 264.57 | 209.03 | 387.31 | 220.74 | 263.53 | 208.82 | 148.51 | 341.11 | 259.12 | 238.68 | 190.82 | 256.56 | 73.76 |
| COL4A5 | 132.49 | 160.39 | 288.36 | 174.92 | 206.48 | 143.08 | 107.69 | 249.54 | 207.77 | 226.1 | 152.51 | 219.35 | 65.32 |
| CYBB | 237.08 | 194.11 | 387.06 | 203.07 | 243.42 | 205.57 | 137.21 | 346.29 | 244.33 | 256.77 | 181.47 | 263.49 | 68.36 |
| DCX | 274.69 | 209.06 | 392.19 | 220.12 | 266.8 | 216.04 | 149.08 | 347.91 | 262.3 | 237.48 | 181.95 | 260.93 | 73.15 |
| DKC1 | 363.56 | 186.56 | 377.32 | 194.95 | 225.56 | 222.65 | 143.18 | 343.91 | 226.41 | 209.72 | 171.72 | 240.44 | 68.67 |
| DMD | 176.04 | 175.35 | 335.82 | 190.87 | 227.35 | 173.95 | 119.9 | 296.34 | 227.04 | 247.78 | 163.91 | 248.49 | 67.02 |
| EMD | 234.57 | 102.75 | 222.89 | 111.79 | 121.25 | 161.48 | 93.53 | 202.42 | 120.8 | 85.05 | 99.35 | 111.56 | 50.99 |
| F8 | 217.01 | 175.92 | 327.36 | 191.18 | 223.44 | 172.54 | 123.92 | 285.58 | 220.23 | 215.86 | 161.73 | 223.66 | 71.37 |
| F9 | 293.2 | 209.58 | 388.13 | 219.56 | 262.86 | 217.52 | 150.36 | 348.52 | 258.54 | 233.38 | 185.34 | 258.32 | 73.11 |
| FLNA | 241.1 | 118.37 | 223.77 | 117.52 | 132.99 | 151.1 | 103.74 | 200.72 | 126.07 | 80.31 | 109.32 | 102.97 | 61.64 |
| GLA | 249.37 | 169.17 | 327.46 | 177.56 | 209.07 | 185.83 | 123.64 | 307.93 | 212.81 | 208.76 | 159.09 | 209.84 | 66.43 |
| HPRT1 | 158.87 | 165.12 | 297.56 | 175.7 | 210.09 | 147.85 | 113.64 | 260.26 | 208.24 | 208.87 | 159.12 | 202.28 | 65.2 |
| IDS | 367.58 | 201.41 | 390 | 207.43 | 242.26 | 225.83 | 153.15 | 338.59 | 235.68 | 196.41 | 175.75 | 225.82 | 74.51 |
| IKBKG | 311.77 | 166.65 | 301.72 | 169.03 | 189.58 | 187.73 | 142.32 | 270.85 | 189.09 | 129.94 | 154.47 | 159.83 | 90.97 |
| L1CAM | 261.07 | 158.35 | 247.21 | 153.76 | 176.6 | 165.12 | 132.72 | 228.59 | 162.85 | 93.3 | 134.33 | 114.19 | 71.45 |
| LAMP2 | 216.79 | 187.76 | 343.27 | 197.54 | 234.1 | 182.36 | 130.22 | 309.9 | 232.82 | 240.75 | 175.08 | 240.85 | 68.94 |
| MECP2 | 295.05 | 182.89 | 319.98 | 186.15 | 217.46 | 189.19 | 139.56 | 295.5 | 210.24 | 170.97 | 159.56 | 187.95 | 70.18 |
| MTM1 | 298.38 | 205.37 | 401.62 | 218.25 | 260.9 | 214.14 | 149.5 | 350.68 | 256.86 | 246.65 | 190.44 | 269.96 | 71.47 |
| NDUFA1 | 127.83 | 120.33 | 204.35 | 128.68 | 149.75 | 112.22 | 87.97 | 178.68 | 142.82 | 157.83 | 117.81 | 154.23 | 57.7 |
| OCRL | 268.21 | 216.62 | 410.99 | 224.78 | 269.92 | 222.05 | 155.41 | 363.4 | 271.14 | 256.16 | 196.97 | 265.84 | 72 |
| OPHN1 | 206.12 | 178.68 | 314.09 | 192.47 | 226.91 | 167.45 | 125.04 | 281.08 | 223.22 | 210.45 | 163.87 | 216.53 | 68.23 |
| OTC | 254.66 | 210.43 | 367.25 | 223.02 | 267 | 192.83 | 148.64 | 321.66 | 258.92 | 242.91 | 186.96 | 259.92 | 73.8 |
| PDHA1 | 321.93 | 195.44 | 371.73 | 200.87 | 236.27 | 212.04 | 144.34 | 336.92 | 234.83 | 201.93 | 172.48 | 222.2 | 69.54 |
| PLP1 | 578.68 | 360.84 | 432.43 | 245.96 | 296.73 | 253.74 | 178.62 | 393.29 | 288.7 | 217.96 | 207.79 | 250.39 | 78.43 |

| REGION | X1_M | X1_13w | X2_M | X2_8w | X2_11w | X3_M | X3_21w | X4_M | X4_11w | X5_M | X5_11w | X6_M | X6_11w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WAS | 324.88 | 173.23 | 324.61 | 163.82 | 191.48 | 204.13 | 135.2 | 296.68 | 188.59 | 158.22 | 131.66 | 170.7 | 62.8 |
| chrX:111807967-112125561 | 271.9 | 206.23 | 375.67 | 215.66 | 257.91 | 206.41 | 146.42 | 333.58 | 252.16 | 228.17 | 181.04 | 249.21 | 72.97 |
| chrX:113297013-113397012 | 153.56 | 162.85 | 287.58 | 178.1 | 208.63 | 148.03 | 110.9 | 255.8 | 207.6 | 214.26 | 153.03 | 215.58 | 66.06 |
| chrX:113550848-113651156 | 156.76 | 176.18 | 318.94 | 191.9 | 227.19 | 158.36 | 118.57 | 277.73 | 225.36 | 244.01 | 164.41 | 241.98 | 66.51 |
| chrX:115426192-115527203 | 172.8 | 174.43 | 323.89 | 188.98 | 225.15 | 168.11 | 120.2 | 284.11 | 223.29 | 231.99 | 161.89 | 240.98 | 68.03 |
| chrX:11655622-11755621 | 258.93 | 204.98 | 362.42 | 213.62 | 254.3 | 195.33 | 143.1 | 321.01 | 247.35 | 231.76 | 179.69 | 248.41 | 73.57 |
| chrX:116837155-116937588 | 216.73 | 197.06 | 355.75 | 209.29 | 250.96 | 186.83 | 136.27 | 310.46 | 247.58 | 227.13 | 174.71 | 244.17 | 70.3 |
| chrX:119835390-119935389 | 281.49 | 206.65 | 370.17 | 216.99 | 258.96 | 205.88 | 148.91 | 335.59 | 252.85 | 226.23 | 180.7 | 241.26 | 72.77 |
| chrX:12230748-12330747 | 273.08 | 206.37 | 395.66 | 217.66 | 262.64 | 213.36 | 147.21 | 348.61 | 261.24 | 249.35 | 184.24 | 270.41 | 72.57 |
| chrX:123503986-123604726 | 285.72 | 214.25 | 402.5 | 226.26 | 271.27 | 217.36 | 151.79 | 353.56 | 265.77 | 251.09 | 191.57 | 271.43 | 73.47 |
| chrX:12492801-12592800 | 223.59 | 185.07 | 338.9 | 198.64 | 234.19 | 181.96 | 128.77 | 298.86 | 234.5 | 229.63 | 166.49 | 238.66 | 70.72 |
| chrX:12814570-12916463 | 289.82 | 203.59 | 362.05 | 207.88 | 248.3 | 198.44 | 145.72 | 317.85 | 239.91 | 216.82 | 177.59 | 236.79 | 70.94 |
| chrX:13172373-13515476 | 278.43 | 199.62 | 369.01 | 210.98 | 248.85 | 201.97 | 143.72 | 329.86 | 248.5 | 227.81 | 182.94 | 246.55 | 72.68 |
| chrX:134553975-134654031 | 246.23 | 182.83 | 322.72 | 192.19 | 226.83 | 174.07 | 129.8 | 279.82 | 219.5 | 192.2 | 163.28 | 210.54 | 70.49 |
| chrX:13658179-13758178 | 279.01 | 192.84 | 371.4 | 203.61 | 240.31 | 201.2 | 140.32 | 332.83 | 238.32 | 234.79 | 180.29 | 253.03 | 69.59 |
| chrX:137024125-137124124 | 229.85 | 184.68 | 338.85 | 197.88 | 235.3 | 183.89 | 130.26 | 300.93 | 232.43 | 213.44 | 166.57 | 230.69 | 70.45 |
| chrX:138398991-138551237 | 203.62 | 183.01 | 330.08 | 195.89 | 231.79 | 171.42 | 125.65 | 287.43 | 228.33 | 218.7 | 165.63 | 233.87 | 69.59 |
| chrX:140324230-140426867 | 208.28 | 173.8 | 304.11 | 185.2 | 216.68 | 155.46 | 117.22 | 389.67 | 311.58 | 212.35 | 159.79 | 219.57 | 72.49 |

| REGION | X1_M | X1_13w | X2_M | X2_8w | X2_11w | X3_M | X3_21w | X4_M | X4_11w | X5_M | X5_11w | X6_M | X6_11w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chrx:14060907-14161662 | 252.11 | 201.96 | 376.87 | 214.22 | 256.53 | 197.19 | 143.27 | 328.37 | 258.13 | 244.83 | 182.07 | 260.55 | 73.28 |
| chrx:141145555-141246561 | 217.78 | 176.32 | 333.85 | 190.6 | 226.31 | 178.38 | 123.82 | 293.81 | 225.06 | 223.46 | 161.63 | 236.24 | 67.66 |
| chrx:141893411-141994193 | 148.87 | 150.96 | 293.33 | 168.46 | 198.77 | 146.15 | 103.78 | 252.7 | 200.81 | 215.3 | 146.6 | 211.89 | 62.78 |
| chrx:142513136-142797880 | 214.06 | 174.11 | 329.17 | 187.48 | 222.79 | 174.83 | 122.55 | 288.31 | 219.87 | 220.32 | 159.08 | 233.04 | 67.86 |
| chrx:143655416-143757697 | 153.52 | 153.71 | 292.75 | 170.36 | 199.17 | 146.96 | 103.88 | 253.06 | 201.22 | 226.14 | 146.9 | 225.73 | 62.64 |
| chrx:144230053-144331194 | 203.78 | 180.07 | 343.1 | 192.51 | 229.5 | 181.34 | 123.69 | 302.49 | 227.53 | 239.76 | 163.78 | 249.07 | 67.45 |
| chrx:145537951-145639334 | 150.74 | 149.83 | 283.2 | 164.17 | 192.7 | 143.44 | 101.46 | 246.5 | 193.24 | 215.96 | 142.76 | 213.82 | 64.15 |
| chrx:148114128-148214127 | 273.39 | 204.2 | 373.52 | 214 | 254.95 | 200.98 | 143.23 | 324.56 | 250.71 | 226.1 | 182.14 | 247.74 | 72.67 |
| chrx:149607001-149707000 | 367.78 | 213.27 | 394.62 | 221.19 | 260.74 | 226.48 | 160.89 | 345.3 | 252.76 | 191.32 | 186.32 | 229.31 | 76.66 |
| chrx:151365232-151465231 | 185.1 | 171.67 | 318.07 | 186.28 | 220.24 | 166.06 | 117.57 | 277 | 216.3 | 217.23 | 156.86 | 225 | 66.39 |
| chrx:19454274-19554990 | 266.08 | 184 | 323.01 | 190.03 | 225.51 | 190.04 | 133.49 | 306.53 | 220.61 | 198.83 | 159.03 | 211.42 | 68.66 |
| chrx:22710692-22811547 | 158.89 | 167.23 | 310.57 | 183.1 | 216.56 | 156.52 | 113.63 | 273.07 | 215.6 | 235.05 | 158.51 | 233.91 | 66.44 |
| chrx:23338226-23439665 | 294.64 | 209.86 | 388.97 | 222.02 | 265.09 | 214.31 | 151.3 | 347.24 | 260.97 | 234.83 | 185.31 | 259.21 | 73.6 |
| chrx:25721887-25822965 | 157.93 | 167.73 | 313.94 | 182.49 | 216.25 | 157.13 | 113.89 | 270.32 | 217.84 | 230.45 | 157.89 | 229.31 | 67.44 |
| chrx:36551296-36651295 | 163.71 | 167.16 | 316.52 | 183.05 | 216.7 | 160.79 | 113.78 | 273.95 | 215.45 | 229.61 | 156.71 | 231.75 | 67.23 |
| chrx:38036807-38137709 | 267.78 | 211.95 | 375.75 | 221.42 | 264.35 | 205.72 | 150.77 | 332.85 | 258.63 | 232.38 | 188.35 | 249.55 | 74.7 |
| chrx:42688271-43873451 | 221.2 | 185.62 | 337.88 | 199.4 | 235.04 | 180.15 | 129.98 | 298.36 | 232.27 | 226.43 | 169.44 | 237.99 | 70.78 |

| REGION | X1_M | X1_13w | X2_M | X2_8w | X2_11w | X3_M | X3_21w | X4_M | X4_11w | X5_M | X5_11w | X6_M | X6_11w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chrx:45119035-45220202 | 214.06 | 180.46 | 313.65 | 194.26 | 227.25 | 163.38 | 125.51 | 271.81 | 221.98 | 215.51 | 166.73 | 228.38 | 70.59 |
| chrx:50558830-50658829 | 256.82 | 194.8 | 351.41 | 207.48 | 245.37 | 191 | 138.33 | 310.07 | 241.4 | 205.33 | 176 | 225.2 | 72.41 |
| chrx:5312523-5416800 | 195.06 | 170.21 | 312.91 | 182.03 | 215.84 | 161.06 | 117.13 | 273.31 | 214.83 | 221.56 | 155.06 | 222.93 | 64.89 |
| chrx:53857999-53957998 | 206.27 | 161.18 | 271.63 | 170.85 | 200.61 | 146.94 | 113.39 | 245.74 | 195.43 | 174.1 | 143.46 | 183.33 | 65.35 |
| chrx:6617558-6720097 | 270.16 | 186.17 | 347.17 | 200.09 | 237.31 | 187.76 | 135 | 305.98 | 235.95 | 219.61 | 166.01 | 235.31 | 68.66 |
| chrx:8436829-8538498 | 228.71 | 185.8 | 345.72 | 196.96 | 235.53 | 185.51 | 127.8 | 313.82 | 231.01 | 236.92 | 166.3 | 246.86 | 68.18 |
| chrx:92875020-92975019 | 132.55 | 156.22 | 298.17 | 172.46 | 203.55 | 151.35 | 106.84 | 261.68 | 203.78 | 231.57 | 152.25 | 227.97 | 65.15 |
| chrx:94187579-94287578 | 114.98 | 127.17 | 232.97 | 143.5 | 164.74 | 115.14 | 84.52 | 199.5 | 164.93 | 196.16 | 124.36 | 189.02 | 59.49 |
| chrx:97058438-97158881 | 243.56 | 211.55 | 391.36 | 221.65 | 266.97 | 208.78 | 147.5 | 341.71 | 263.17 | 261.13 | 188.79 | 276.73 | 73.1 |

Table 2-3. Sequencing summary

| Sample ID | Mean read depth (X) | Total # of reads | Duplicates | Duplicates (%) | Mapped | Mapped (%) | Total # of reads mapped to target | Total # of reads mapped to target (%) | Properly paired | Properly paired (%) | Mate mapped |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X1_M | 239.798 | 103536216 | 3282222 | 3.17 | 103427100 | 99.89 | 26451075 | 25.57 | 100749691 | 97.31 | 103344530 |
| X1_P | 221.256 | 136388363 | 5216423 | 3.82 | 136258163 | 99.9 | 24512695 | 17.99 | 134489284 | 98.61 | 136157265 |
| X1_F | 221.024 | 126530219 | 8324919 | 6.58 | 126394609 | 99.89 | 25391967 | 20.09 | 124399897 | 98.32 | 126283874 |
| X1_13w | 212.733 | 106301299 | 14134580 | 13.3 | 106128142 | 99.84 | 28608550 | 26.96 | 95729078 | 90.05 | 105999973 |
| X2_M | 383.3 | 125073469 | 5926661 | 4.74 | 124952015 | 99.9 | 42838235 | 34.28 | 123726541 | 98.92 | 124860391 |
| X2_P | 196.204 | 106409826 | 6593587 | 6.2 | 106285202 | 99.88 | 22272179 | 20.96 | 104851044 | 98.54 | 106197104 |
| X2_F | 111.281 | 91775489 | 5391828 | 5.88 | 91666631 | 99.88 | 12950483 | 14.13 | 90759505 | 98.89 | 91589370 |
| X2_11w | 272.597 | 124107416 | 13051518 | 10.52 | 123906429 | 99.84 | 35127328 | 28.35 | 114528587 | 92.28 | 123764384 |
| X2_8w | 227.458 | 134524931 | 15307717 | 11.38 | 134272319 | 99.81 | 34205526 | 25.47 | 120587142 | 89.64 | 134112955 |
| X3_M | 215.452 | 76168306 | 5154521 | 6.77 | 76032299 | 99.82 | 24242034 | 31.88 | 73560931 | 96.58 | 75909477 |
| X3_P | 145.28 | 81493681 | 7446427 | 9.14 | 81316838 | 99.78 | 16984661 | 20.89 | 77955601 | 95.66 | 81165221 |
| X3_F | 80.3806 | 54873280 | 8889927 | 16.2 | 54777897 | 99.83 | 11175550 | 20.4 | 53412508 | 97.34 | 54692290 |
| X3_21w | 148.481 | 81764510 | 16280916 | 19.91 | 81528947 | 99.71 | 20715088 | 25.41 | 70178611 | 85.83 | 81326724 |
| X4_M | 357.744 | 134107828 | 11571617 | 8.63 | 133856683 | 99.81 | 44443010 | 33.2 | 130042081 | 96.97 | 133654238 |
| X4_P | 176.883 | 116095673 | 6346126 | 5.47 | 115864913 | 99.8 | 21255006 | 18.34 | 112671358 | 97.05 | 115688166 |
| X4_F | 239.764 | 149303552 | 16599710 | 11.12 | 148977662 | 99.78 | 31079297 | 20.86 | 144496670 | 96.78 | 148718301 |
| X4_11w | 274.641 | 136088946 | 21713548 | 15.96 | 135449774 | 99.53 | 41515381 | 30.65 | 124705727 | 91.64 | 134966350 |
| X5_M | 385.4 | 89815376 | 15970293 | 17.78 | 89644721 | 99.81 | 42556669 | 47.47 | 87809113 | 97.77 | 89486670 |
| X5_P | 98.29 | 95895799 | 22717302 | 23.69 | 95653959 | 99.75 | 21294216 | 22.26 | 92248489 | 96.2 | 95430118 |
| X5_F | 46.32 | 102476456 | 44800744 | 43.72 | 102199295 | 99.73 | 24724910 | 24.19 | 97391914 | 95.04 | 101945749 |
| X-05_11w | 278.22 | 90214732 | 12408715 | 13.75 | 89954188 | 99.71 | 29405676 | 32.69 | 81705334 | 90.57 | 89738757 |
| X6_M | 382.87 | 84027795 | 9126189 | 10.86 | 83864455 | 99.81 | 35497951 | 42.33 | 82329131 | 97.98 | 83718354 |
| X6_P | 86.32 | 94312288 | 24566422 | 26.05 | 94070537 | 99.74 | 20605831 | 21.9 | 90636721 | 96.1 | 93850448 |
| X6_F | 128.9 | 112039150 | 29234563 | 26.09 | 111765315 | 99.76 | 27397610 | 24.51 | 108172897 | 96.55 | 111520181 |
| X-06_11w | 103.23 | 85049725 | 24264193 | 28.53 | 84739494 | 99.64 | 23691866 | 27.96 | 74485074 | 87.58 | 84476120 |

Table 2-4. Fractional fetal DNA concentration

| Sample ID | ZFX | ZFY | Fetal fraction |
|---|---|---|---|
| X_01_13wks | 400.109 | 25.1825 | 11.84% |
| X_02_8wks | 376.144 | 16.4906 | 8.40% |
| X_02_11wks | 526.692 | 35.5729 | 12.65% |
| X_03_21wks | 290.574 | 16.4035 | 10.69% |
| X_04_11wks | 491.867 | 26.2877 | 10.15% |
| X_05_11wks | 484.661 | 21.817 | 8.62% |
| X_06_11wks | 116.62 | 10.7828 | 16.93% |

Table 2-5. Informative SNVs used for analysis

| Family ID | Number of SNVs in the affected gene | Number of informative SNVs | Number of SNVs in targeted genes of whole chrx |
|---|---|---|---|
| X1 | 5 | 69 | 3866 |
| X2 | 1036 | 1036 | 3451 |
| X3 | 1028 | 1032 | 3339 |
| X4 | 88 | 88 | 3703 |
| X5 | 50 | 72 | 3890 |
| X6 | 50 | 80 | 4134 |
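The fetal fractions reported in Table 2-4 are numerically consistent with the relation fetal fraction = 2 × ZFY / (ZFX + ZFY). This relation is inferred here from the tabulated values rather than quoted from the text, so it should be read as an assumption. It follows if the fetus is male: the fetus contributes one copy each of ZFX and ZFY, and the mother contributes two copies of ZFX, so ZFY accounts for half of the fetal share of the combined signal. The function name and the dictionary layout below are hypothetical.

```python
def fetal_fraction(zfx: float, zfy: float) -> float:
    """Estimate the fetal DNA fraction for a male fetus.

    Assumption (inferred from Table 2-4, not stated in the text):
    fraction = 2 * ZFY / (ZFX + ZFY).
    """
    total = zfx + zfy
    if total <= 0:
        raise ValueError("ZFX + ZFY must be positive")
    return 2.0 * zfy / total


# Values copied from Table 2-4: sample -> (ZFX, ZFY, reported fetal fraction in %)
table_2_4 = {
    "X_01_13wks": (400.109, 25.1825, 11.84),
    "X_02_8wks": (376.144, 16.4906, 8.40),
    "X_02_11wks": (526.692, 35.5729, 12.65),
    "X_03_21wks": (290.574, 16.4035, 10.69),
    "X_04_11wks": (491.867, 26.2877, 10.15),
    "X_05_11wks": (484.661, 21.817, 8.62),
    "X_06_11wks": (116.62, 10.7828, 16.93),
}

for sample, (zfx, zfy, reported) in table_2_4.items():
    estimate = 100.0 * fetal_fraction(zfx, zfy)
    print(f"{sample}: estimated {estimate:.2f}%, reported {reported:.2f}%")
```

Under this assumption, each estimate agrees with the reported value to within rounding.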

Figure 2-1. Genes and recombination hotspots included in the probe for the noninvasive prenatal diagnosis of 33 monogenic X-linked diseases. The diseases and their causal genes are listed. The present study targeted the four diseases marked in red. The approximate locations of the probes are provided. The blue lines indicate the regions of the 33 X-linked genes. The red lines indicate the recombination hotspot regions.

Figure 2-2. The average RPKM values of the PLP1 region. P represents the proband and M represents the mother of each family.

Figure 2-3. Distribution of the allele fraction of the SNVs used for phasing and haplotype dosage analysis. Only the data from the earliest gestational week measured are displayed.

Figure 2-4. Estimated recombination events throughout the X chromosome in the earliest gestational week. The blue asterisks indicate the positions of the putative pathogenic variations. The horizontal red lines indicate the read fraction of the mutation-linked allele (the male proband's haplotype) across the target region. The breakpoints between the segmented read fraction lines indicate possible recombination events, which occurred three to five times per family.