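1 NGS Analysis using Galaxy. 2013 Korea Genome Organization Winter Symposium, Bioinformatics Analysis Training Workshop. Hyungyong Kim (김형용), Kyuyeol Lee (이규열), Sungchan Lee (이성찬). R&D Center, Insilicogen, Inc.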

2 Index
NGS Analysis using Galaxy
01 Galaxy introduction
02 Galaxy examples 1, 2
03 Galaxy installation
04 Galaxy function details
05 Galaxy examples 3, 4
06 Galaxy tools
07 Galaxy on Grid
08 Galaxy on Cloud

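3 Agenda
(Columns: section, time, lecture content, notes)
Part 1: Introduction and Application
15:00 ~ 15:20 Galaxy introduction (presenter: Hyungyong Kim)
15:20 ~ 15:50 Galaxy analysis example demo
  1. Finding the human exons with the most SNPs
  2. NGS QC and assembly example
16:00 ~ 16:20 Galaxy installation (presenter: Sungchan Lee)
16:20 ~ 17:10 Galaxy installation and analysis example hands-on
  1. Galaxy installation hands-on
  2. Finding the human exons with the most SNPs, hands-on
  3. NGS QC and assembly example, hands-on
Part 2: Custom operation
17:20 ~ 17:50 Galaxy detailed functions (presenter: Hyungyong Kim)
09:00 ~ 09:20 Galaxy analysis example demo (presenter: Hyungyong Kim)
  1. RNA-seq analysis example
  2. NGS analysis example 2
09:20 ~ 09:50 Galaxy analysis example hands-on
  1. RNA-seq analysis example
  2. NGS analysis example 2
10:00 ~ 10:20 Understanding Galaxy tools (presenter: Hyungyong Kim)
10:20 ~ 11:00 Galaxy tool writing hands-on
  1. Primer design
11:10 ~ 11:30 Galaxy on Grid (presenter: Kyuyeol Lee)
  1. Understanding the grid
  2. Distributed job demo
11:30 ~ 11:50 Galaxy on Cloud (presenter: Hyungyong Kim)
  1. Understanding the cloud
  2. Galaxy on Amazon EC2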

4 NGS Technologies

5 Sequencer Comparison
Illumina (HiSeq 2000, HiSeq 1000, HiScan SQ, GAIIx)
  Read length: 2X100 bp (GAIIx: 2X150 bp)
  Yield: 600 Gb, 300 Gb, 150 Gb, 95 Gb
  Required input: 50 ng with Nextera; 100 ng to 1 μg with TruSeq
  Accuracy: 85% (2X50 bp, >Q30), 80% (2X100 bp, >Q30)
454 (GS FLX)
  Read length: 400 bp
  Yield: 35 Mb
  Accuracy: 99% (>Q20)
SOLiD (5500 microbeads, 5500xl microbeads, 5500xl nanobeads)
  Read length: mate pair 60 bp X 60 bp; paired-end 75 bp X 35 bp; fragment 75 bp
  Yield: 90 Gb, 180 Gb, 300 Gb
  Accuracy: 99.99%
Notes: Illumina Gb/day figures are from 2X100 bp runs. Illumina read lengths: 1X35, 2X50, 2X100. GA: 1X35, 2X50, 2X100, 2X150.
Copyright © Insilicogen, Inc. All rights reserved.

6 Applications
Application of NGS Technique: Personal Genomics, Microbiology, Environmentology, Toxicology, Chemical Biology, Mutation Detection, Structure Variation, Transcriptional Control, Interaction of DNA and Protein

7 Issue of New Genomic Era. Many researchers, having invested in next-generation sequencing instruments, now face a computational bottleneck in their research workflow. (Image: BGI)

8 Most Significant Improvement to Your Next Generation Sequencing Workflow (Source: The Global Outlook for Next Generation Sequencing: Usage, Platform Drivers & Workflow, October 31, BioInformatics, LLC)

9 Issue of New Genomic Era.
[Figure: sequencing workflow, plotted as cost against process]
Stages: Library construction, Template purification, Sequence delineation, Finishing & Assembly, Sequence annotation, Secondary annotation, Data delivery, Bioinformatics
Labels: DNA shearing; insert into high and/or low copy number vectors; PCR amplicons; BACs; cosmids/fosmids; Big Dye; ABI 3730; primer walking; transposon insertion methods; proprietary & commercial assembly; data compilation; gene prediction; BLAST search; SNP; comparative genomics; expression analysis; FTP; web browser; commercial software

10 Application of Next Genomic Data

11 Practical Software Platforms for NGS data analysis

12 What kind of? Biological Features Framework (Enterprise/Informatics) Features Service Price

13 List of NGS Frameworks

14 HugeSeq: a pipeline specialized for genetic variant detection

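15 CLC Genomics Server, providing a user-friendly GUI environment
1 CLC Genomics Server - enterprise solution with a three-tier architecture for analyzing, sharing, and managing data
2 CLC Bioinformatics Database - database for centralized storage, sharing, and management of data
3 CLC Assembly Cell - ultra-fast assembly solution for NGS data (command-line based)
4 CLC Genomics Workbench - solution for a wide range of bioinformatics analyses of NGS data (GUI based)
5 CLC Developer Kit - solution for customizing bioinformatics tools and workflows to the user's needs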

17 30x human genome, 1 sample (150G): 5 million KRW (1 year of storage)

18 Funded by Google. Integrated with the NCBI SRA service. Analysis can start online right away, without running any experiments.


22 What is Galaxy
Galaxy, a web-based genome analysis platform
An open-source framework for integrating various computational tools and databases into a cohesive workspace
A web-based service we provide, integrating many popular tools and resources for comparative genomics
A completely self-contained application for building your own Galaxy-style sites

23 Galaxy Usage
One of the fastest growing open source bioinformatics projects; a highly successful high-throughput data analysis platform for the life sciences with over 15,000 users worldwide
Annual Galaxy Community Conference

24 Galaxy visualization
External genome browsers: UCSC, Ensembl, GBrowse
Trackster: track/data viewer in the web browser
  HTML5 Canvas, jQuery
  Renders in the browser, not on the server

25 Galaxy visualization

26 Trackster

27 Trackster

28 Trackster

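29 Galaxy components
Main components of Galaxy
Datasources: specify the input data. Data from a separate local system or from external web sites can be registered.
Tool: the smallest unit of analysis. In a local installation you can write and add the tools you want.
History: the list of intermediate results produced as input data passes through a combination of Tools.
Workflow: a History can produce new results just by changing the input data and parameters; a Workflow registers it as a separate process.
Visualization: connects analysis results to visualization tools.
Page: a reporting feature that brings the elements above together.
Example: an eprimer3 tool written separately and registered.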

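30 A Galaxy tool
Input format > Tool > Output format
A tool takes input data (in the expected format), processes it, and produces output data (in the expected format).
Combining tools gives a Workflow.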

31 Galaxy formats
Auto-detect
Ab1
Axt
Bam
Bed
Fasta
FastqSolexa
Gff
Gff3
Interval (Genomic Intervals)
Lav
MAF
Scf
Sff
Tabular (tab delimited)
Wig
Other text type

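32 Galaxy features, once more
Recent trends in Galaxy usage
  NGS analysis functions built in
  Use of the Amazon Cloud
  Galaxy URLs provided in papers
  Transparent analysis (Biologist, Bioinformatician)
Galaxy features
  Galaxy is written in Python, but extensions do not have to be in Python.
  Transparent analysis flows can be built, shared, and extended.
  Almost any bioinformatics analysis can be done with Galaxy.
  "Being good with Galaxy alone would get you hired" (NCBI)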

33 GALAXY Examples 1

34 Example 1. Finding Human Exons with the highest number of SNPs
1. Download all Human Exons from NCBI or Ensembl BioMart or UCSC TableBrowser
2. Download all Human SNPs from
3. Scripting
   Join 1, 2 according to position
   Group by Exon id
   Sort by SNP count
   Filter Exon which has more than 10 SNPs
Have to do programming! (Python, Perl, ...)
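The scripting step can be sketched in a few lines of Python. This is only an illustration of the join, group, sort, and filter logic under stated assumptions: the tuple layout (chrom, start, end, name), the 0-based half-open coordinates, the function names, and the toy data are all hypothetical, and the nested loop is far too slow for real genome-scale inputs.

```python
from collections import Counter


def count_snps_per_exon(exons, snps):
    """Join exons and SNPs by position and count SNPs per exon id.

    exons: list of (chrom, start, end, exon_id), 0-based half-open (assumed).
    snps:  list of (chrom, start, end, snp_id), same coordinate system.
    """
    counts = Counter()
    for chrom, e_start, e_end, exon_id in exons:
        for s_chrom, s_start, s_end, _snp_id in snps:
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == s_chrom and s_start < e_end and e_start < s_end:
                counts[exon_id] += 1
    return counts


def exons_with_many_snps(exons, snps, min_snps=10):
    """Sort exons by SNP count (descending) and keep those with more than min_snps."""
    counts = count_snps_per_exon(exons, snps)
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(exon_id, n) for exon_id, n in ranked if n > min_snps]


# Toy data, made up for illustration only.
exons = [("chr1", 100, 200, "exonA"), ("chr1", 300, 400, "exonB")]
snps = [("chr1", 100 + i, 101 + i, "rs%d" % i) for i in range(12)]  # 12 SNPs in exonA
snps.append(("chr1", 350, 351, "rsB"))  # 1 SNP in exonB
result = exons_with_many_snps(exons, snps)
```

With this toy input only exonA passes the filter, since exonB overlaps a single SNP.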

35 On Galaxy

36 On Galaxy
Get data > UCSC Main: fetch the exon data
Get data > UCSC Main: fetch the SNP data
Operate on Genomic Intervals > Join: extract the exons whose regions overlap SNPs
Join, Subtract and Group > Group: group by exon name and count the SNPs
Filter and Sort > Sort: sort the exons by SNP count
Text Manipulation > Select first: extract the top 5 exons by SNP count
Join, Subtract and Group > Compare two Datasets: recover the exon information that was lost along the way
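The last two steps are easy to misread, so here is a small Python sketch of what they do to the data: after grouping, only the exon name and its SNP count remain, and the final comparison looks the full exon rows up again by name. The function name and the row layout are hypothetical; this is not how Galaxy implements these tools.

```python
def top_exons_with_details(exon_rows, snp_counts, n=5):
    """Pick the n exons with the most SNPs and restore their full rows.

    exon_rows:  list of BED-like tuples (chrom, start, end, name) (assumed layout).
    snp_counts: dict mapping exon name to SNP count (output of the Group step).
    """
    # Sort + Select first: the n (name, count) pairs with the highest counts.
    top = sorted(snp_counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
    # Compare two Datasets: match on the name column to recover coordinates.
    by_name = {row[3]: row for row in exon_rows}
    return [by_name[name] + (count,) for name, count in top if name in by_name]


# Toy data, made up for illustration only.
exon_rows = [("chr1", i * 100, i * 100 + 50, "exon%d" % i) for i in range(6)]
snp_counts = {"exon%d" % i: i for i in range(6)}
top5 = top_exons_with_details(exon_rows, snp_counts)
```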

37 GALAXY Examples 2

38 Example 2. Human NGS data QC and assembly
1. NGS Quality Control
2. NGS Single End Mapping
3. SNP Calling
4. Compare with dbSNP
Have to do in Unix and need programming! (Python, Perl, ...)

39 On Galaxy

40 On Galaxy
Additional programs must be installed for NGS analysis.
Program          Used for             Installation
Fastx-toolkit    NGS QC               Ubuntu apt-get
Gnuplot          NGS QC boxplot       Ubuntu apt-get
Bowtie2          Reference assembly   Copy, then set PATH
SAMTools         SNP calling          Ubuntu apt-get

41 On Galaxy
Get data > Upload File: upload the human Illumina fastq file
NGS: QC and manipulation > FASTQ Groomer: convert to fastqsanger format
NGS: QC and manipulation > Compute quality statistics: view fastq quality statistics
NGS: QC and manipulation > Draw quality score boxplot: draw a boxplot from the fastq quality statistics
NGS: QC and manipulation > FASTQ Trimmer, Quality Trimmer, Masker: cut off or mask the uninformative parts
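To make the QC steps concrete, the sketch below computes per-position quality statistics (the numbers behind the boxplot) and trims low-quality bases from the 3' end of a read. It assumes Sanger-style quality strings (Phred offset 33, which is what the groomed fastqsanger format uses); the function names and the threshold of 20 are hypothetical, and the trimming rule is a simplification rather than the exact algorithm of the Galaxy tools.

```python
from statistics import mean, median


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def per_position_stats(quality_lines, offset=33):
    """Return (position, min, median, mean, max) per read position, 1-based."""
    columns = {}
    for line in quality_lines:
        for pos, score in enumerate(phred_scores(line, offset), start=1):
            columns.setdefault(pos, []).append(score)
    return [(pos, min(s), median(s), mean(s), max(s))
            for pos, s in sorted(columns.items())]


def trim_3prime(sequence, quality_line, min_quality=20, offset=33):
    """Drop bases from the 3' end until one reaches min_quality."""
    scores = phred_scores(quality_line, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_line[:end]


stats = per_position_stats(["I!", "II"])
trimmed = trim_3prime("ACGT", "II!!")
```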

42 On Galaxy
Get data > Upload File: upload the reference sequence for reference assembly
NGS: Mapping > Bowtie2: assembly using Bowtie2
NGS: SAM Tools > MPileup: extract SNP and indel information from the BAM file
NGS: SAM Tools > Filter pileup: keep the high-scoring SNPs and indels
NGS: SAM Tools > Pileup-to-interval: convert to Genomic interval format
Get data > UCSC Main: fetch the dbSNP data
Operate on Genomic Intervals > Join: extract the SNPs whose regions overlap
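The coordinate conversion behind the last steps is a common source of off-by-one errors, so a short sketch may help. It assumes that pileup positions are 1-based and that the interval data use 0-based, half-open coordinates; the function names are hypothetical, and an exact-match comparison stands in for the more general overlap join.

```python
def pileup_to_interval(chrom, pos_1based):
    """Convert a 1-based position to a 0-based, half-open (chrom, start, end)."""
    return (chrom, pos_1based - 1, pos_1based)


def split_by_known(called, known_intervals):
    """Split called positions into (known, novel) by exact interval match.

    called:          list of (chrom, 1-based position) from the filtered pileup.
    known_intervals: iterable of (chrom, start, end) for known SNPs.
    """
    known = set(known_intervals)
    intervals = [pileup_to_interval(chrom, pos) for chrom, pos in called]
    return ([iv for iv in intervals if iv in known],
            [iv for iv in intervals if iv not in known])


# Toy data, made up for illustration only.
called = [("chr19", 100), ("chr19", 250)]
dbsnp = [("chr19", 99, 100)]
known, novel = split_by_known(called, dbsnp)
```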

43 Galaxy Installation

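44 Install VirtualBox - Ubuntu
1. Copy the VirtualBox and Galaxy folders from the USB drive.
2. Install VirtualBox.
3. Start VirtualBox and import the Galaxy image.
4. In the settings, change the network to bridged (Bridge) mode.
5. After booting Ubuntu, delete the network settings file.
   sudo rm /etc/udev/rules.d/70-persistent-net.rules
6. Restart Linux (Ubuntu). The command below halts the machine; start it again from VirtualBox.
   sudo shutdown -h now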

45 Creating your own Galaxy

46 Running Galaxy in a production environment
By default, Galaxy uses
  SQLite database
  Built-in HTTP server for all tasks
  Local job runner
  Single process
  Simplest error-proof configuration
Change configuration for service
  Disable the developer settings: use_interactive = False, use_debug = False
  Get a real database: PostgreSQL
  Offload the menial tasks: proxy (nginx, Apache)
  Let your tools free: cluster. Move intensive processing to other hosts (TORQUE, GRID, DRMAA)
  Other advanced settings
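As a rough aid for the checklist above, the following sketch reads an INI-style configuration and reports settings that still look like development defaults. The option names use_interactive and use_debug are taken from the slide; the section name [app:main], the database_connection option, and the sample connection string are assumptions made for illustration and should be checked against the configuration file shipped with your Galaxy version.

```python
import configparser

# Hypothetical sample configuration; the credentials are made up.
SAMPLE_INI = """
[app:main]
use_interactive = False
use_debug = False
database_connection = postgresql://galaxy:secret@localhost/galaxy
"""


def production_warnings(ini_text, section="app:main"):
    """Return a list of warnings for settings that look like dev defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    settings = parser[section]
    warnings = []
    # A missing option is treated as enabled, i.e. as a developer setting.
    for key in ("use_interactive", "use_debug"):
        if settings.getboolean(key, fallback=True):
            warnings.append("%s should be False" % key)
    # Assumption: no connection string means the default SQLite database.
    if settings.get("database_connection", "sqlite://").startswith("sqlite"):
        warnings.append("still using SQLite")
    return warnings


ok = production_warnings(SAMPLE_INI)
dev = production_warnings("[app:main]\nuse_debug = True\n")
```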

47 Galaxy on Cluster
Intensive processes to other hosts: TORQUE, GRID, DRMAA
Working with Galaxy on the Cloud

48 Virtualization

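49 Virtualization
Virtualization is the abstraction of computer resources.
It creates virtual versions of physical resources.
It is a technology for logically dividing the hardware resources of one physical machine into several, or for logically combining the hardware resources of several machines into one.
It has recently drawn attention as a way to solve problems such as hardware management and system recovery after a disaster.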

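50 Virtualization
Advantages of virtualization
Cost reduction
  One server can be partitioned into several servers.
  Saves on server purchases, electricity, floor space, and server management.
Efficient use of resources
  Virtual machines are built from a server's idle resources, so resources are used efficiently.
Stable operation
  Servers can be backed up as images and migrated easily, so failures can be handled quickly.
Continuous operation of software
  When server hardware reaches the end of its life cycle, the OS vendor stops supporting its device drivers, which causes migration problems.
  Because the existing system is moved onto a virtual machine, device driver problems do not arise.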

51 Used as a basic building block of cloud services

52 Public Galaxy environment

53 Example of Cloud (Source: ISC 2012, Amazon HPC session)

54 Running Galaxy Web server
1. Check your computer's IP address.
   ifconfig
2. Move to the Galaxy folder.
   cd galaxy-dist
3. Start the Galaxy web server.
   sh run.sh
4. In a web browser on your host OS (Windows), enter the following in the address bar.
   IP Address:8080 (e.g., :8080)

55 Galaxy Detail functions

56 Get Data

57 Get Data / Send Data

58 Text Manipulation

59 Convert Format

60 FASTA manipulation

61 Filter and Sort

62 Join, Subtract and Group

63 Operate on Genomic Intervals

64 NGS Toolbox

65 Galaxy Examples 3

66 Example 3. Human RNA-seq
1. RNA-seq result: adrenal_1,2.fastq, brain_1,2.fastq
2. Reference: igenome UCSC hg19, chr19 gene annotation (GTF format)
Have to do in Unix and need programming! (Python, Perl, ...)

67 On Galaxy

68 On Galaxy
Additional programs must be installed for RNA-seq analysis.
Program      Used for           Installation
java         FastQC             Ubuntu apt-get install openjdk-7-jre
FastQC       NGS QC             Copy to tool-data/shared/jars/
Tophat       RNA-seq mapping    (see the next page)
Cufflinks    RNA-seq assembly   Ubuntu apt-get install cufflinks

69 Tophat install in Ubuntu
$ cp samtools tar.gz2 ~/work
$ bzip2 -d samtools tar.gz2
$ tar xvf samtools tar
$ cd samtools
$ make
$ cd ..
$ cp tophat tar.gz ~/work
$ tar zxvf tophat tar.gz
$ cd tophat
$ sudo apt-get install libboost libbam libboost-thread-dev
$ sudo cp ../samtools /libbam.a /usr/local/lib
$ sudo mkdir /usr/local/include/bam
$ sudo cp ../samtools /*.h /usr/local/include/bam
$ ./configure
$ make
$ sudo make install

70 On Galaxy
Get data > Upload File: upload the fastq, chr19.fa, and gtf files
NGS: QC and manipulation > FASTQ Groomer: convert to fastqsanger format
NGS: QC and manipulation > FastQC:Read QC: view fastq quality statistics
NGS: RNA Analysis > Tophat for Illumina: find splice junctions in the RNA-seq fastq data, using chr19.fa as the reference
NGS: RNA Analysis > Cufflinks: transcript assembly and FPKM estimation
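For readers new to the unit, FPKM means fragments per kilobase of transcript per million mapped fragments. The sketch below only evaluates that definition; it is not the Cufflinks estimator, which assigns fragments to isoforms with a statistical model, and the function name and numbers are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million).
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# 100 fragments on a 2 kb transcript, out of 1 million mapped fragments.
example = fpkm(100, 2000, 1000000)
```

In this example the transcript receives 50 fragments per kilobase in a library of exactly one million fragments, so its FPKM is 50.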

71 On Galaxy
NGS: RNA Analysis > Cuffmerge: merge the brain and adrenal data against the reference
NGS: RNA Analysis > Cuffdiff: find significant expression changes

72 Galaxy Tools

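73 A Galaxy tool
Input format > Tool > Output format
A tool takes input data (in the expected format), processes it, and produces output data (in the expected format).
Combining tools gives a Workflow.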

74 Galaxy formats
Auto-detect: Automatically recognizes which format the data is in.
Ab1: A binary sequence file in 'ab1' format with a '.ab1' file extension. You must manually select this 'File Format' when uploading the file.
Axt: blastz pairwise alignment format. Each alignment block in an axt file contains three lines: a summary line and 2 sequence lines. Blocks are separated from one another by blank lines. The summary line contains chromosomal position and size information about the alignment. It consists of 9 required fields.
Bam: A binary file compressed in the BGZF format with a '.bam' file extension.
Bed: Tab delimited format (tabular). Does not require header line.
Fasta: A sequence in FASTA format consists of a single-line description, followed by lines of sequence data. The first character of the description line is a greater-than (">") symbol in the first column. All lines should be shorter than 80 characters.
FastqSolexa: Illumina (Solexa) variant of the Fastq format, which stores sequences and quality scores in a single file.
Gff: GFF lines have nine required fields that must be tab-separated.
Gff3: The GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats.
Interval (Genomic Intervals): Tab delimited format (tabular).
Lav: Lav is the primary output format for BLASTZ. The first line of a .lav file begins with #:lav.
MAF: TBA and multiz multiple alignment format. The first line of a .maf file begins with ##maf. This word is followed by white-space-separated "variable=value" pairs. There should be no white space surrounding the "=".
Scf: A binary sequence file in 'scf' format with a '.scf' file extension. You must manually select this 'File Format' when uploading the file.
Sff: A binary file in 'Standard Flowgram Format' with a '.sff' file extension.
Tabular (tab delimited): Any data in tab delimited format (tabular).
Wig: The wiggle format is line-oriented. Wiggle data is preceded by a track definition line, which adds a number of options for controlling the default display of this track.
Other text type: Any text file.
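The descriptions above already hint at how auto-detection can work: several text formats are recognizable from their first line or their column count. The sketch below applies only those hints (">" for FASTA, "##maf" for MAF, "#:lav" for Lav, nine tab-separated fields for GFF). The function name and the returned labels are hypothetical, and this is a simplification, not Galaxy's actual detection code.

```python
def guess_format(text):
    """Guess a text format from the hints given in the format descriptions."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "txt"
    first = lines[0]
    if first.startswith(">"):
        return "fasta"
    if first.startswith("##maf"):
        return "maf"
    if first.startswith("#:lav"):
        return "lav"
    # Ignore comment lines before looking at the column structure.
    data = [ln for ln in lines if not ln.startswith("#")]
    if data and all(len(ln.split("\t")) == 9 for ln in data):
        return "gff"
    if data and all("\t" in ln for ln in data):
        return "tabular"
    return "txt"


fasta_guess = guess_format(">seq1 example\nACGTACGT\n")
gff_guess = guess_format("chr1\tsrc\texon\t1\t100\t.\t+\t.\tID=x\n")
```

Binary formats such as Ab1, Scf, and Sff cannot be recognized this way, which is consistent with the note that they must be selected manually at upload.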

76 Primer design tool

77 Primer3
Primer3: primer design program
  Download from er tar.gz
  make & copy to PATH
eprimer3: wrapper for Primer3, used in the EMBOSS package
  Easy command line interface
  eprimer3.html
  apt-get install emboss


79 eprimer3.xml

80 eprimer3.py
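The contents of the workshop's eprimer3.py are not reproduced in this transcript, so the following is only a guess at the shape such a wrapper could take, not the authors' script. It builds an eprimer3 command line and runs it; the function names are hypothetical, and the options used (-sequence, -outfile, -numreturn, -auto) should be confirmed with `eprimer3 -help` for your EMBOSS version.

```python
import subprocess


def build_eprimer3_command(input_fasta, output_path, num_return=5):
    """Build the argument list for an eprimer3 run (options are assumptions)."""
    return [
        "eprimer3",
        "-sequence", input_fasta,       # input sequence file chosen in Galaxy
        "-outfile", output_path,        # output dataset path provided by Galaxy
        "-numreturn", str(num_return),  # number of primer pairs to report
        "-auto",                        # EMBOSS: do not prompt interactively
    ]


def run_eprimer3(input_fasta, output_path, num_return=5):
    """Run eprimer3 and return its exit code. Requires EMBOSS on the PATH."""
    return subprocess.call(build_eprimer3_command(input_fasta, output_path, num_return))


command = build_eprimer3_command("input.fa", "output.eprimer3")
```

A tool XML file would then call such a script with the input and output dataset paths as arguments.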

81 tool_conf.xml
<section name="vcf Tools" id="vcf_tools">
  <tool file="vcf_tools/intersect.xml" />
  <tool file="vcf_tools/annotate.xml" />
  <tool file="vcf_tools/filter.xml" />
  <tool file="vcf_tools/extract.xml" />
</section>
<section name="MyTools" id="mytools">
  <tool file="mytools/eprimer3.xml" />
</section>
</toolbox>

82 EMBOSS eprimer3 tool added

83 Hands-on
Install Primer3: compile with make, then add primer3_core to PATH
Install EMBOSS: sudo apt-get install emboss
Install Biopython: sudo apt-get install python-biopython
Copy eprimer3.py, eprimer3.xml to galaxy-dist/tools/mytools/: create the mytools directory yourself
Edit tool_conf.xml: register mytools/eprimer3.xml
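The tool_conf.xml edit in the last step is normally done by hand in a text editor. For those who prefer to script it, here is a minimal sketch using Python's standard xml.etree.ElementTree. The function name is hypothetical, the sample XML is shortened, and note that ElementTree drops XML comments when parsing, so keep a backup of the original file.

```python
import xml.etree.ElementTree as ET


def add_tool_section(toolbox_xml, section_id, section_name, tool_file):
    """Return toolbox XML with the tool registered under the given section."""
    root = ET.fromstring(toolbox_xml)
    section = root.find("./section[@id='%s']" % section_id)
    if section is None:
        # Create the section at the end of the toolbox when it is missing.
        section = ET.SubElement(root, "section",
                                {"name": section_name, "id": section_id})
    if section.find("./tool[@file='%s']" % tool_file) is None:
        ET.SubElement(section, "tool", {"file": tool_file})
    return ET.tostring(root, encoding="unicode")


# Shortened sample input, for illustration only.
original = ('<toolbox><section name="vcf Tools" id="vcf_tools">'
            '<tool file="vcf_tools/filter.xml" /></section></toolbox>')
updated = add_tool_section(original, "mytools", "MyTools", "mytools/eprimer3.xml")
```

Running the function a second time on its own output changes nothing, so it is safe to re-run.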

84 Galaxy on Grid

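85 Grid vs Cluster
In common
  Both split a computation over large data into many small computations and distribute them across many small computers.
Difference (what a grid adds)
  Connects machines of different types over a WAN
  Connects diverse platforms to each other
  No limit on the number of connected machines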

86 Grid

87 Globus Toolkit
A representative computational grid middleware
Open source toolkit for building computing grids, developed and provided by the Globus Alliance
Standards implementation
  Open Grid Service Architecture (OGSA)
  Open Grid Service Infrastructure (OGSI)
  Web Services Resource Framework (WSRF)
  Job Submission Description Language (JSDL)
  Distributed Resource Management Application API (DRMAA)
  SOAP
  WSDL
  Grid Security Infrastructure

88 DRMAA (Distributed Resource Management Application API)
High-level Open Grid Forum API specification for the submission and control of jobs to a Distributed Resource Management (DRM, job scheduler) system, such as a cluster or grid computing infrastructure

89 PBS (Portable Batch System)
Computer software that performs job scheduling in a Unix cluster environment
A component of the Globus Toolkit
Originally developed by NASA
Following versions
  OpenPBS
  TORQUE: a fork of OpenPBS
  PBS Professional (PBS Pro): commercial

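90 TORQUE
Distributed resource manager providing control over batch jobs and distributed compute nodes
It stands for Terascale Open Source Resource and QUEue Manager
It holds the configuration of each slave node (number of CPUs, number of cores, RAM size, temporary storage, and so on) and allocates cluster resources when the scheduler requests them.
[Diagram: Master, Slave 1, Slave 2, Slave 3, NFS]
> qsub a.sh
The a.sh command is handed to a slave node according to the scheduler.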
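To illustrate the idea of handing jobs to slave nodes based on their recorded resources, here is a toy placement function. It is a deliberately simple greedy rule (send each job to the node with the most free cores) invented for this illustration; it is not TORQUE's scheduling policy, and the node names, core counts, and function name are hypothetical.

```python
def assign_jobs(jobs, nodes):
    """Place each job on the node with the most free cores.

    jobs:  list of (job_name, cores_needed), handled in submission order.
    nodes: dict mapping node name to its number of free cores.
    Returns a dict mapping job name to node name, or None if the job must wait.
    """
    free = dict(nodes)  # work on a copy so the caller's dict is untouched
    placement = {}
    for name, cores in jobs:
        candidates = [node for node, count in free.items() if count >= cores]
        if not candidates:
            placement[name] = None  # no node fits; the job stays queued
            continue
        node = max(candidates, key=lambda n: free[n])
        free[node] -= cores
        placement[name] = node
    return placement


# Toy cluster, made up for illustration only.
nodes = {"slave1": 4, "slave2": 8, "slave3": 2}
jobs = [("a.sh", 4), ("b.sh", 4), ("c.sh", 8)]
placement = assign_jobs(jobs, nodes)
```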

91 Virtualized Galaxy (Test-bed)

92 Galaxy on Cloud

93 Cloud computing
Delivery of computing and storage capacity as a service to a heterogeneous community of end-recipients.


95 VPS (Virtual Private Server)
A term used by Internet hosting services to refer to a virtual machine in a cloud

96 Amazon EC2 (Amazon Elastic Compute Cloud)
Virtualization + Grid (Cluster) computing in a Cloud

97 Amazon EC2 (Amazon Elastic Compute Cloud)

98 Amazon EC2 (Amazon Elastic Compute Cloud)

99 Amazon EC2 (Amazon Elastic Compute Cloud)

100 Amazon S3 (Amazon Simple Storage Service)

101 Galaxy on Cloud
Using Amazon EC2 + S3
Select AMIs in Community AMIs

102 Galaxy on Cloud

103 Galaxy on Cloud

104 Galaxy on Cloud

105 Galaxy on Cloud

106 Galaxy on Cloud

107 Galaxy on Insilicogen
Galaxy localization on cluster
Tool development
Workflow development

