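Editing order 1: Front cover
Academic Research Service Project Final Report
Project title:
Notice (state the notice text)
2011 (14-point Gothic typeface)
Lead research institution:
Korea Centers for Disease Control and Prevention (질병관리본부)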
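Notice
1. This report is the final report of an academic research service project conducted by the Korea Centers for Disease Control and Prevention.
2. When the contents of this report are presented, it must be stated that they are the results of an academic research service project conducted by the Korea Centers for Disease Control and Prevention.
3. Content that must be kept confidential to protect national science and technology secrets must not be presented or disclosed externally.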
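Editing order 2: Submission statement
Korea Centers for Disease Control and Prevention Academic Research Service Project Final Report
Project title:
Performing institution / Principal investigator:
Commissioning department: Division of Malaria and Parasitic Diseases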
Editing order 3: Table of contents
Table of Contents
Ⅰ. Summary of research and development results
  Summary (Korean)
  Summary (English)
Ⅱ. Results of the academic research service project
  Chapter 1. Final research and development objectives
  Chapter 2. Final research and development content and methods
  Chapter 3. Final research and development results
  Chapter 4. Discussion of results and conclusions
  Chapter 5. Research outcomes and utilization plan
  Chapter 6. Other significant changes
  Chapter 7. Use of research funds
  Chapter 8. Attachments
Editing order 4: Summary
Project title:
Key words:
Lead research institution:
Principal investigator:
Research period:
2. 17-mer analysis and genome size estimation
4. GC-content and Distribution analysis
5. Conclusions and recommendations
Editing order 5: Summary (English)
Title of Project:
Key Words: NGS, sequencing
Institute: Theragen Institute
Project Leader:
Project Period: 2011. 2. 16 - 2011. 12. 15
2. 17-mer analysis and genome size estimation
4. GC-content and Distribution analysis
5. Conclusions and recommendations
Editing order 6: Research results of the overall research project
Results of the academic research service project
Lib ID               Read length (bp)  Insert size (bp)  Data (Mb)   Sequence depth (x)
FLAjmxDEJDIAAPEI-12  100               500               16,610.00   59
Total                                                    16,610.00   59

After low-quality reads were filtered out, a total of 16.6 Gb of data was used for further analysis. If the genome size is taken to be 280 Mb, as estimated in a previous experiment, the expected sequencing depth is 59-fold (16,610 Mb / 280 Mb).

2. 17-mer analysis and genome size estimation

A K-mer is a subsequence of K consecutive nucleotides obtained by artificially dividing a read. A raw sequence read of L bp contains (L-K+1) K-mers. The frequency of each K-mer can be calculated from the raw genome sequence reads. In a given data set, the K-mer frequencies along the sequencing depth gradient follow a Poisson distribution. The genome size is then deduced as G = K_num / K_depth, where K_num is the total number of K-mers and K_depth is the K-mer depth that occurs most frequently (the peak of the depth distribution). Typically, K = 17.
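The K-mer definitions above can be illustrated with a minimal Python sketch. The function names and the `min_depth` parameter are hypothetical and are not taken from the pipeline used in this report. The sketch assumes the depth histogram counts distinct K-mers at each depth, which may differ from how the report's frequency (%) axis was computed. It uses K = 4 on a toy read only to keep the example small.

```python
from collections import Counter


def count_kmers(reads, k=17):
    """Count every K-mer occurrence across a collection of reads."""
    counts = Counter()
    for read in reads:
        # A read of length L contributes L - k + 1 K-mers (none if L < k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(counts, min_depth=1):
    """Estimate G = K_num / K_depth from K-mer counts.

    min_depth is a hypothetical knob for ignoring very low depths, which in
    real data are often dominated by sequencing errors.
    """
    k_num = sum(counts.values())              # total number of K-mers
    hist = Counter(counts.values())           # depth -> number of distinct K-mers
    candidates = {d: n for d, n in hist.items() if d >= min_depth}
    k_depth = max(candidates, key=candidates.get)  # peak of the distribution
    return k_num / k_depth


toy_counts = count_kmers(["ACGTACGTAC"], k=4)
print(sum(toy_counts.values()))          # 7 K-mers from one 10 bp read (10 - 4 + 1)
print(estimate_genome_size(toy_counts))  # 3.5 (7 K-mers / peak depth 2)
```

For real data a dedicated K-mer counter is normally used, because an in-memory dictionary does not scale to billions of 17-mers.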
[Figure: 17-mer depth distribution (Fig. 1). x-axis: depth; y-axis: frequency (%)]

17-mer data statistics

K-mer  K-mer number    K-mer depth  Genome size  Used bases  Used reads  X
17     11,688,326,310  18           649M         14.3G       163M        22

A total of 14.3 Gb of data was retained for the 17-mer analysis. The 17-mer frequency distribution derived from the sequenced reads is plotted in Fig. 1. The peak of the 17-mer distribution is at about 18 and the total K-mer count is 11,688,326,310, so the genome size can be estimated with the formula: genome size = K-mer count / peak of the K-mer distribution. If the heterozygous rate is high, a small additional peak appears at 1/2 of the K-mer depth, so this K-mer analysis can be used to roughly determine the heterozygous rate of a given genome. The distribution can also be used to gauge the repeat content of the genome: if the genome contains a high proportion of repeats, the distribution displays a fat tail, which indicates that a larger than expected proportion of the genome has a high sequencing depth, possibly due to sequence similarity.

Conclusion: The genome size is 649 Mb, and the heterozygous rate of this genome is not high, so whole genome shotgun sequencing and assembly can be performed.
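As a quick check of the arithmetic, the reported figures can be plugged into the formula. The sketch below uses only numbers from the statistics above. Reading the X column as used bases divided by estimated genome size is an assumption, although it reproduces the reported value of 22.

```python
# Values reported in the 17-mer data statistics.
K_NUM = 11_688_326_310   # total 17-mer count
K_DEPTH = 18             # peak of the 17-mer depth distribution
USED_BASES = 14.3e9      # bases retained for the 17-mer analysis

genome_size = K_NUM / K_DEPTH          # G = K_num / K_depth
base_depth = USED_BASES / genome_size  # assumed meaning of the "X" column

print(f"genome size ~ {genome_size / 1e6:.0f} Mb")  # genome size ~ 649 Mb
print(f"base depth ~ {base_depth:.0f}x")            # base depth ~ 22x
```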
                        Contig                      Scaffold
                        Size (bp)     Number        Size (bp)     Number
N90                     151           732,482       533           133,844
N80                     255           478,310       1,551         58,567
N70                     399           322,854       3,810         32,694
N60                     583           220,179       6,876         20,417
N50                     813           148,193       10,720        13,096
Longest                 18,736                      245,648
Total Size              494,555,333                 629,557,979
Total Number (>100 bp)                1,145,180                   463,690
Total Number (>2 kb)                  33,691                      49,999

Conclusion: This is an initial version of the assembly. Because of the low coverage depth, the contig N50 is shorter than expected. Although we cannot guarantee a very large N50, WGS is suitable for assembling this genome.

4. GC-content and Distribution analysis

[Figure: Distribution of GC depth]
The x-axis represents the GC content and the y-axis represents the average depth. We used non-overlapping 10 kb windows and calculated the GC content and average depth within each window. The distribution of GC content versus sequencing depth gives an indication of sequencing bias or contamination. Usually, genomic regions with high or low GC content have a lower sequencing depth than regions with median GC content. If the distribution for a given genome project differs from this expected pattern, it may indicate sequencing bias or contamination. If contamination is predicted, the contaminating reads can be eliminated by aligning the reads against bacterial, viral and fungal databases.

Conclusion: According to the distribution of GC depth, we can infer that the sample has no obvious sequencing bias or contamination.

5. Conclusions and recommendations

Rough estimate of the heterozygous rate of this genome: not high (<0.5%)
Rough estimate of the repeat content of this genome: not high
Rough estimate of the genome size of the species: about 649 Mb
Whether the whole genome shotgun strategy is suitable for this genome: YES
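The windowed GC-depth calculation described in section 4 can be sketched as follows. This is a minimal illustration, not the tool used for this report. The function name is hypothetical, and the per-base depth list is an assumed input (for example, obtained by aligning the reads back to the assembly).

```python
def gc_depth_windows(sequence, depth, window=10_000):
    """Return (GC fraction, mean depth) for each non-overlapping window.

    sequence: assembled sequence as a string (assumed input)
    depth:    per-base sequencing depth, same length as sequence (assumed input)
    A trailing partial window is dropped.
    """
    points = []
    for start in range(0, len(sequence) - window + 1, window):
        seq_win = sequence[start:start + window].upper()
        dep_win = depth[start:start + window]
        gc = (seq_win.count("G") + seq_win.count("C")) / window
        points.append((gc, sum(dep_win) / window))
    return points


# Toy example with 4 bp windows instead of 10 kb.
toy = gc_depth_windows("GGCCATATGC", [10] * 4 + [20] * 4 + [5, 5], window=4)
print(toy)  # [(1.0, 10.0), (0.0, 20.0)]
```

Plotting the returned pairs with GC content on the x-axis and mean depth on the y-axis gives the kind of scatter described above; a separate cluster away from the main cloud would be the pattern that suggests contamination.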
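Chapter 5. Research outcomes and utilization plan
5.1 Outcomes
Project title:
Project leader: 박종화

Papers (No. / Paper title / Authors / Journal / Volume / Pages / Impact factor / Domestic or international / SCI indexed): Not applicable.
Presentations (No. / Presentation title / Presentation type / Presenter / Conference / Date / Venue / Domestic or international): Not applicable.
Patents (No. / Application or registration / Patent title / Applicant (registrant) / Country of application (registration) / Application (registration) number / IPC classification): Not applicable.
Other: Not applicable.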
5.2 Utilization plan
None.
Chapter 6. Other significant changes
Not applicable.
Chapter 7. Use of research funds
Chapter 8. References
Not applicable.
Chapter 9. Attachments
Not applicable.