NGS Analysis using Galaxy
2013 Korea Genome Organization (한국유전체학회) Winter Symposium, Bioinformatics Analysis Training Workshop
김형용, 이규열, 이성찬 | 2013.02.05 ~ 2013.02.06
R&D Center, Insilicogen, Inc.

Index: NGS Analysis using Galaxy
01 Galaxy introduction
02 Galaxy examples 1, 2
03 Galaxy installation
04 Galaxy function details
05 Galaxy examples 3, 4
06 Galaxy tools
07 Galaxy on Grid
08 Galaxy on Cloud

Agenda
Part 1: Introduction and Application
15:00 ~ 15:20  Galaxy introduction (presenter: 김형용)
15:20 ~ 15:50  Galaxy analysis example demo
  1. Finding the human exons with the most SNPs
  2. NGS QC and assembly example
16:00 ~ 16:20  Galaxy installation (presenter: 이성찬)
16:20 ~ 17:10  Galaxy installation and analysis example practice
  1. Galaxy installation
  2. Finding the human exons with the most SNPs
  3. NGS QC and assembly example
Part 2: Custom operation
17:20 ~ 17:50  Detailed Galaxy functions (presenter: 김형용)
09:00 ~ 09:20  Galaxy analysis example demo (presenter: 김형용)
  1. RNA-seq analysis example
  2. NGS analysis example 2
09:20 ~ 09:50  Galaxy analysis example practice
  1. RNA-seq analysis example
  2. NGS analysis example 2
10:00 ~ 10:20  Understanding Galaxy tools (presenter: 김형용)
10:20 ~ 11:00  Galaxy tool writing practice
  1. Primer design
11:10 ~ 11:30  Galaxy on Grid (presenter: 이규열)
  1. Understanding the grid
  2. Distributed job demo
11:30 ~ 11:50  Galaxy on Cloud (presenter: 김형용)
  1. Understanding the cloud
  2. Galaxy on Amazon EC2

NGS Technologies

Sequencer Comparison

Platform   Instrument          Gb/day   Yield
Illumina   HiSeq 2000          55       600 Gb
Illumina   HiSeq 1000          35       300 Gb
Illumina   HiScan SQ           17.5     150 Gb
Illumina   GAIIx               6.5      95 Gb
454        GS FLX              10h      35 Mb
SOLiD      5500 microbeads     10-15    90 Gb
SOLiD      5500xl microbeads   20-30    180 Gb
SOLiD      5500xl nanobeads    30-45    300 Gb

Read length
  Illumina: 2X100 bp, 2X150 bp
  454: 400 bp
  SOLiD: Mate pair 60 bp X 60 bp, Paired-end 75 bp X 35 bp, Fragment 75 bp
Required input: 50 ng with Nextera; 100 ng, 1 μg with TruSeq
Accuracy
  Illumina: 85% (2X50 bp, >Q30), 80% (2X100 bp, >Q30)
  454: 99% (>Q20)
  SOLiD: 99.99%
Notes
  Illumina Gb/day figures are for a 2X100 bp run.
  Illumina read lengths: 1X35, 2X50, 2X100; GA: 1X35, 2X50, 2X100, 2X150
Copyright © Insilicogen, Inc. 2011. All rights reserved.

Applications: Application of NGS Technique
Personal Genomics, Microbiology, Environmentology, Toxicology, Chemical Biology
Mutation Detection, Structure Variation, Transcriptional Control, Interaction of DNA and Protein

Issue of New Genomic Era
"Many researchers, having invested in next generation sequencing instruments, now face a computational bottleneck in their research work-flow." (BGI)

Most Significant Improvement to Your Next Generation Sequencing Workflow
(Source: The Global Outlook for Next Generation Sequencing: Usage, Platform Drivers & Workflow, October 31, 2011. BioInformatics, LLC)

Issue of New Genomic Era
[Figure: sequencing workflow, Cost against Process.
 Stages: Library construction, Template purification, Sequence delineation, Finishing & Assembly, Sequence annotation, Secondary annotation, Data delivery.
 Items shown: DNA shearing; insert into high and/or low copy number vectors; PCR amplicons; BACs; cosmids/fosmids; Big Dye; ABI 3730; primer walking; transposon insertion methods; proprietary & commercial assembly; data compilation; gene prediction; BLAST search; SNP; comparative genomics; expression analysis; FTP; web browser; commercial software; bioinformatics.]

Application of Next Genomic Data

Practical Software Platforms for NGS data analysis

What kind of?
Biological Features
Framework (Enterprise/Informatics) Features
Service
Price

List of NGS Frameworks

HugeSeq: a pipeline specialized for extracting genetic variants

CLC Genomics Server: a user-friendly GUI environment
1. CLC Genomics Server: an enterprise solution with a three-tier architecture for analyzing, sharing, and managing data
2. CLC Bioinformatics Database: a database for centralized storage, sharing, and management of data
3. CLC Assembly Cell: an ultra-fast assembly solution for NGS data (command-line based)
4. CLC Genomics Workbench: a solution for a wide range of bioinformatics analyses of NGS data (GUI based)
5. CLC Developer Kit: a solution for customizing bioinformatics tools and workflows to the user's needs

30x human genome, 1 sample (150G): 5 million KRW (1 year of storage)

Funded by Google and integrated with the NCBI SRA service; analysis can start online right away, without running experiments

GALAXY

What is Galaxy
Galaxy, a web-based genome analysis platform: http://usegalaxy.org
An open-source framework for integrating various computational tools and databases into a cohesive workspace
A web-based service we provide, integrating many popular tools and resources for comparative genomics
A completely self-contained application for building your own Galaxy-style sites

Galaxy Usage
One of the fastest growing open source bioinformatics projects: a highly successful high-throughput data analysis platform for the life sciences, with over 15,000 users worldwide
Annual Galaxy Community Conference

Galaxy visualization
External genome browsers: UCSC, Ensembl, GBrowse
Trackster: track/data viewer in the web browser
  HTML5 Canvas, jQuery
  Renders in the browser, not on the server

Trackster

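Galaxy components
Main components of Galaxy
Datasources: specify the input data. Data on a separate local system or on external websites can be registered.
Tool: the smallest unit of analysis. In a local installation you can build and add your own tools.
History: the list of intermediate results produced as input data passes through a combination of tools.
Workflow: a History can yield new results just by changing the input data and parameters, so it can be registered as a separate process.
Visualization: connects analysis results to visualization tools.
Page: a reporting feature that brings the elements above together.
Example: the eprimer3 tool, built and registered separately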

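A Galaxy tool
Input format > Tool > Output format
A tool takes input data (in the expected format), processes it, and produces output data (in the expected format).
Combining tools yields a Workflow.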

Galaxy formats
Auto-detect: Automatically recognizes which format the data is in.
Ab1: A binary sequence file in 'ab1' format with a '.ab1' file extension. You must manually select this 'File Format' when uploading the file.
Axt: blastz pairwise alignment format. Each alignment block in an axt file contains three lines: a summary line and 2 sequence lines. Blocks are separated from one another by blank lines. The summary line contains chromosomal position and size information about the alignment. It consists of 9 required fields.
Bam: A binary file compressed in the BGZF format with a '.bam' file extension.
Bed: Tab delimited format (tabular). Does not require a header line.
Fasta: A sequence in FASTA format consists of a single-line description, followed by lines of sequence data. The first character of the description line is a greater-than (">") symbol in the first column. All lines should be shorter than 80 characters.
FastqSolexa: Illumina (Solexa) variant of the Fastq format, which stores sequences and quality scores in a single file.
Gff: GFF lines have nine required fields that must be tab-separated.
Gff3: The GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats.
Interval (Genomic Intervals): Tab delimited format (tabular).
Lav: Lav is the primary output format for BLASTZ. The first line of a .lav file begins with #:lav.
MAF: TBA and multiz multiple alignment format. The first line of a .maf file begins with ##maf. This word is followed by white-space-separated "variable=value" pairs. There should be no white space surrounding the "=".
Scf: A binary sequence file in 'scf' format with a '.scf' file extension. You must manually select this 'File Format' when uploading the file.
Sff: A binary file in 'Standard Flowgram Format' with a '.sff' file extension.
Tabular (tab delimited): Any data in tab delimited format (tabular).
Wig: The wiggle format is line-oriented. Wiggle data is preceded by a track definition line, which adds a number of options for controlling the default display of this track.
Other text type: Any text file.
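The format descriptions above suggest how auto-detection of text formats can work: look at the first line or the first data line for a telltale marker. The following is a minimal sketch of that idea only. It is not Galaxy's own sniffing code; the function name guess_format and the returned labels are assumptions made for illustration, and the rules cover just a few of the text formats listed.

```python
def guess_format(text):
    """Guess a format label from the leading lines of a text file (illustrative rules only)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "txt"
    first = lines[0]
    # Marker lines described in the format list above.
    if first.startswith("##maf"):
        return "maf"
    if first.startswith("#:lav"):
        return "lav"
    if first.startswith(">"):
        return "fasta"
    # For tabular formats, inspect the first non-comment line.
    data = [ln for ln in lines if not ln.startswith("#")]
    if not data:
        return "txt"
    fields = data[0].split("\t")
    # GFF: nine tab-separated fields, with start and end in columns 4 and 5.
    if len(fields) == 9 and fields[3].isdigit() and fields[4].isdigit():
        return "gff"
    # BED-like: chrom, start, end in the first three columns.
    if len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit():
        return "bed"
    if len(fields) > 1:
        return "tabular"
    return "txt"


if __name__ == "__main__":
    print(guess_format(">seq1 sample\nACGTACGT\n"))
    print(guess_format("chr1\t100\t200\n"))
```

Binary formats such as ab1, scf, and sff cannot be recognized this way, which is consistent with the note that some of them must be selected manually at upload.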

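Galaxy features, once more
Recent trends in Galaxy usage
  NGS analysis features built in
  Use of the Amazon Cloud
  Galaxy URLs provided in papers
  Transparent analysis (Biologist, Bioinformatician)
Galaxy features, once more
  Galaxy is written in Python, but extensions do not have to be written in Python.
  Transparent analysis flows can be built, shared, and extended.
  Almost every kind of bioinformatics analysis can be done in Galaxy.
  "We would hire someone just for being good at Galaxy" (NCBI)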

GALAXY Examples 1

Example 1. Finding Human Exons with the highest number of SNPs
1. Download all Human Exons from NCBI, Ensembl BioMart, or UCSC Table Browser
2. Download all Human SNPs from
3. Scripting
  Join 1 and 2 according to position
  Group by Exon id
  Sort by SNP count
  Filter Exons that have more than 10 SNPs
Have to do programming! (Python, Perl, ...)
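To give a feel for the scripting that step 3 implies, here is a minimal sketch in Python. It assumes exons arrive as (chrom, start, end, exon_id) tuples and SNPs as (chrom, pos) tuples with half-open intervals; these layouts, and the function names count_snps_per_exon and top_exons, are hypothetical and not taken from the workshop material. The nested loop is deliberately naive and would be too slow for genome-scale data.

```python
from collections import Counter


def count_snps_per_exon(exons, snps):
    """Join SNPs to exons by position and count SNPs per exon id."""
    by_chrom = {}
    for chrom, start, end, exon_id in exons:
        by_chrom.setdefault(chrom, []).append((start, end, exon_id))
    counts = Counter()
    for chrom, pos in snps:
        for start, end, exon_id in by_chrom.get(chrom, []):
            # Half-open interval [start, end), an assumption of this sketch.
            if start <= pos < end:
                counts[exon_id] += 1
    return counts


def top_exons(counts, min_snps=10):
    """Keep exons with more than min_snps SNPs, sorted by descending SNP count."""
    kept = [(exon_id, n) for exon_id, n in counts.items() if n > min_snps]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    exons = [("chr1", 100, 200, "exonA"), ("chr1", 300, 400, "exonB")]
    snps = [("chr1", 150), ("chr1", 160), ("chr1", 350)]
    print(top_exons(count_snps_per_exon(exons, snps), min_snps=0))
```

Even this toy version needs input parsing, interval logic, grouping, and sorting, which is the programming burden the Galaxy version of the example avoids.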

On Galaxy
http://usegalaxy.org

On Galaxy
Get data > UCSC Main: fetch the exon data
Get data > UCSC Main: fetch the SNP data
Operate on Genomic Intervals > Join: extract the exons whose regions overlap SNPs
Join, Subtract and Group > Group: group by exon name and count the SNPs
Filter and Sort > Sort: sort the exons by SNP count
Text Manipulation > Select first: extract the top 5 exons by SNP count
Join, Subtract and Group > Compare two Datasets: recover the exon information that was lost

GALAXY Examples 2

Example 2. Human NGS data QC and assembly
1. NGS Quality Control
2. NGS Single End Mapping
3. SNP Calling
4. Compare with dbSNP
Have to do this in Unix, and it needs programming! (Python, Perl, ...)

On Galaxy
http://usegalaxy.org

On Galaxy
NGS analysis requires additional programs to be installed (http://wiki.galaxyproject.org/admin/ngs%20local%20setup)

Program         Used for             Installation
Fastx-toolkit   NGS QC               Ubuntu apt-get
Gnuplot         NGS QC boxplot       Ubuntu apt-get
Bowtie2         Reference assembly   Copy, then set PATH
SAMTools        SNP calling          Ubuntu apt-get

On Galaxy
Get data > Upload File: upload the human Illumina fastq file
NGS: QC and manipulation > FASTQ Groomer: convert to fastqsanger format
NGS: QC and manipulation > Compute quality statistics: view the fastq quality statistics
NGS: QC and manipulation > Draw quality score boxplot: draw a boxplot from the fastq quality statistics
NGS: QC and manipulation > FASTQ Trimmer, Quality Trimmer, Masker: trim or mask uninformative regions

On Galaxy
Get data > Upload File: upload the reference sequence for reference assembly
NGS: Mapping > Bowtie2: assembly using Bowtie2
NGS: SAM Tools > MPileup: extract SNP and indel information from the BAM file
NGS: SAM Tools > Filter pileup: keep the high-scoring SNPs and indels
NGS: SAM Tools > Pileup-to-interval: convert to Genomic interval format
Get data > UCSC Main: fetch the dbSNP information
Operate on Genomic Intervals > Join: extract the SNPs whose regions overlap

Galaxy Installation

Install VirtualBox - Ubuntu
1. Copy the VirtualBox and Galaxy folders from the USB drive.
2. Install VirtualBox.
3. Start VirtualBox and import the Galaxy image.
4. In the settings, change the network to Bridge mode.
5. After booting Ubuntu, delete the network configuration file.
   rm /etc/udev/rules.d/70-persistent-net.rules
6. Restart Linux (Ubuntu).
   sudo shutdown -h now

Creating your own Galaxy

Running Galaxy in a production environment
By default, Galaxy uses
  SQLite database
  Built-in HTTP server for all tasks
  Local job runner
  Single process
  Simplest error-proof configuration
Change configuration for service
  Disable the developer settings: use_interactive = False, use_debug = False
  Get a real database: PostgreSQL
  Offload the menial tasks: Proxy (nginx, Apache)
  Let your tools free: Cluster. Move intensive processing to other hosts (TORQUE, GRID, DRMAA)
  Other advanced settings

Galaxy on Cluster
Move intensive processes to other hosts: TORQUE, GRID, DRMAA
Working with Galaxy on the Cloud

Virtualization

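Virtualization
Virtualization is the abstraction of computer resources; it creates virtual versions of physical resources.
It is a technique for logically dividing the hardware resources of one physical machine into several, or for logically combining the hardware resources of several machines into one.
It has recently drawn attention as a way to solve problems such as hardware management and system recovery after a disaster.

Virtualization: advantages
Cost savings: one server can be partitioned into several servers, which reduces server purchase, electricity, rack space, and server management costs.
Efficient use of resources: virtual machines can be created from a server's unused resources.
Stable operation: servers can be backed up as images and moved easily, which allows a quick response to failures.
Continued operation of software: when server hardware reaches the end of its life cycle, the OS vendor stops supporting its device drivers, which causes migration problems. Because the existing system is moved onto a virtual machine, device driver problems do not arise.

Used as a foundation of cloud services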

Public Galaxy environment

Example of Cloud
Source: ISC 2012 Amazon HPC session

Running Galaxy Web server
1. Check your machine's IP address.
   ifconfig
2. Move to the Galaxy folder.
   cd galaxy-dist
3. Start the Galaxy web server.
   sh run.sh
4. In a web browser on your host OS (Windows), enter the following in the address bar.
   IP Address:8080 (e.g., 172.20.8.162:8080)

Galaxy Detail functions

Get Data

Get Data / Send Data

Text Manipulation

Convert Format

FASTA manipulation

Filter and Sort

Join, Subtract and Group

Operate on Genomic Intervals

NGS Toolbox

Galaxy Examples 3

Example 3. Human RNA-seq
1. RNA-seq result: adrenal_1,2.fastq, brain_1,2.fastq
2. Reference: iGenome UCSC hg19, chr19 gene annotation (GTF format)
Have to do this in Unix, and it needs programming! (Python, Perl, ...)

On Galaxy
http://usegalaxy.org

On Galaxy
RNA-seq analysis requires additional programs to be installed (http://wiki.galaxyproject.org/admin/ngs%20local%20setup)

Program     Used for           Installation
java        FastQC             Ubuntu apt-get install openjdk-7-jre
FastQC      NGS QC             Copy to tool-data/shared/jars/
Tophat      RNA-seq mapping    (see the Tophat install steps that follow)
Cufflinks   RNA-seq assembly   Ubuntu apt-get install cufflinks

Tophat install in Ubuntu
$ cp samtools-0.1.18.tar.bz2 ~/work
$ bzip2 -d samtools-0.1.18.tar.bz2
$ tar xvf samtools-0.1.18.tar
$ cd samtools-0.1.18
$ make
$ cd ..
$ cp tophat-1.4.1.tar.gz ~/work
$ tar zxvf tophat-1.4.1.tar.gz
$ cd tophat-1.4.1
$ apt-get install libboost libbam libboost-thread-dev
$ cp ../samtools-0.1.18/libbam.a /usr/local/lib
$ sudo mkdir /usr/local/include/bam
$ cp ../samtools-0.1.18/*.h /usr/local/include/bam
$ ./configure
$ make
$ make install

On Galaxy
Get data > Upload File: upload the fastq, chr19.fa, and gtf files
NGS: QC and manipulation > FASTQ Groomer: convert to fastqsanger format
NGS: QC and manipulation > FastQC:Read QC: view the fastq quality statistics
NGS: RNA Analysis > Tophat for Illumina: find splice junctions in the RNA-seq fastq data, using chr19.fa as the reference
NGS: RNA Analysis > Cufflinks: transcript assembly and FPKM estimation

On Galaxy
NGS: RNA Analysis > Cuffmerge: merge the brain and adrenal data against the reference
NGS: RNA Analysis > Cuffdiff: find significant changes in expression

Galaxy Tools

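A Galaxy tool
Input format > Tool > Output format
A tool takes input data (in the expected format), processes it, and produces output data (in the expected format).
Combining tools yields a Workflow.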

Creating your own Galaxy

Primer design tool

Primer3
Primer3
  Primer design program
  http://primer3.sourceforge.net/releases.php
  Download from http://sourceforge.net/projects/primer3/files/primer3/1.1.4/primer3-1.1.4.tar.gz
  make & copy to PATH
eprimer3
  Wrapper for Primer3; it's used in the EMBOSS package
  Easy command line interface
  http://emboss.sourceforge.net/apps/release/6.4/emboss/apps/eprimer3.html
  apt-get install emboss

eprimer3
$ eprimer3 -sequence INPUT_FASTA_FILE -outfile PRIMER_DESIGN_RESULT -osize OSIZE -gcclamp GCCLAMP

# EPRIMER3 RESULTS FOR GL020027.1
#                      Start  Len   Tm     GC%   Sequence
   1 PRODUCT SIZE: 199
     FORWARD PRIMER  571071   20  60.06  45.00  CTTGCCAATAGCGAATGGAT
     REVERSE PRIMER  571250   20  59.99  55.00  GACGGCGTAGATCTTCAAGC
   2 PRODUCT SIZE: 199
     FORWARD PRIMER   55074   20  60.05  55.00  TAACACCACTGCTCCTGCTG
     REVERSE PRIMER   55253   20  59.97  50.00  CATTGCATGGTCAGAACCAC
   3 PRODUCT SIZE: 200
     FORWARD PRIMER   71990   20  60.03  45.00  GGGGTTGATTTTCATTGTGG
     REVERSE PRIMER   72170   20  59.88  45.00  GTTTGCACCAACCTGGTTTT
   4 PRODUCT SIZE: 200
     FORWARD PRIMER  427182   20  59.83  50.00  CTGATGTGCTCTGTGGGAAA
     REVERSE PRIMER  427362   20  60.01  55.00  CCGTGTATGTAGCCCGAGTT
   5 PRODUCT SIZE: 197
     FORWARD PRIMER  427185   20  59.97  50.00  ATGTGCTCTGTGGGAAAACC
     REVERSE PRIMER  427362   20  60.01  55.00  CCGTGTATGTAGCCCGAGTT

We want to modify this output format so that it can be used as the input of other Galaxy tools.
Goal: build our own primer design Galaxy tool.
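One way to make the eprimer3 report usable by other Galaxy tools is to flatten it into tab-delimited rows. The following is a minimal sketch of such a conversion, assuming the report layout shown above. It is not the eprimer3.py used in the workshop, whose contents are not reproduced in these slides; the function names eprimer3_to_rows and rows_to_tabular and the column order are assumptions made for illustration.

```python
import re

# "   1 PRODUCT SIZE: 199"
PRODUCT_RE = re.compile(r"^\s*(\d+)\s+PRODUCT SIZE:\s*(\d+)")
# "     FORWARD PRIMER  571071   20  60.06  45.00  CTTGCCAATAGCGAATGGAT"
PRIMER_RE = re.compile(
    r"^\s*(FORWARD|REVERSE) PRIMER\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([ACGTN]+)\s*$"
)


def eprimer3_to_rows(report):
    """Return one tuple per primer: (pair, product_size, direction, start, length, tm, gc, sequence)."""
    rows = []
    pair_id = None
    product_size = None
    for line in report.splitlines():
        match = PRODUCT_RE.match(line)
        if match:
            pair_id = int(match.group(1))
            product_size = int(match.group(2))
            continue
        match = PRIMER_RE.match(line)
        if match and pair_id is not None:
            direction, start, length, tm, gc, seq = match.groups()
            rows.append((pair_id, product_size, direction,
                         int(start), int(length), float(tm), float(gc), seq))
    return rows


def rows_to_tabular(rows):
    """Serialize rows as tab-delimited text, one primer per line."""
    return "\n".join("\t".join(str(value) for value in row) for row in rows)


if __name__ == "__main__":
    sample = (
        "# EPRIMER3 RESULTS FOR GL020027.1\n"
        "   1 PRODUCT SIZE: 199\n"
        "     FORWARD PRIMER  571071   20  60.06  45.00  CTTGCCAATAGCGAATGGAT\n"
        "     REVERSE PRIMER  571250   20  59.99  55.00  GACGGCGTAGATCTTCAAGC\n"
    )
    print(rows_to_tabular(eprimer3_to_rows(sample)))
```

Tab-delimited output matches the Tabular format in the Galaxy format list, so generic tools such as Filter and Sort could then operate on the primer table.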

eprimer3.xml

eprimer3.py

tool_conf.xml
  <section name="vcf Tools" id="vcf_tools">
    <tool file="vcf_tools/intersect.xml" />
    <tool file="vcf_tools/annotate.xml" />
    <tool file="vcf_tools/filter.xml" />
    <tool file="vcf_tools/extract.xml" />
  </section>
  <section name="MyTools" id="mytools">
    <tool file="mytools/eprimer3.xml" />
  </section>
</toolbox>

EMBOSS eprimer3 tool added

Practice
Install Primer3: compile with make, then add primer3_core to PATH
Install EMBOSS: sudo apt-get install emboss
Install Biopython: sudo apt-get install python-biopython
Copy eprimer3.py and eprimer3.xml to galaxy-dist/tools/mytools/ (create the mytools directory yourself)
Edit tool_conf.xml: register mytools/eprimer3.xml

Galaxy on Grid

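Grid vs Cluster
In common: both split a computation over a large volume of data into many small computations and distribute them across a number of small computers.
Difference: a grid connects heterogeneous machines over a WAN, links diverse platforms to one another, and has no limit on the number of connected machines.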
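The shared idea, splitting a large computation into small ones and combining the partial results, can be sketched on a single machine. In the following minimal example, threads stand in for worker nodes, which is an assumption made purely for illustration: a real grid or cluster would hand each chunk to a job scheduler instead. The function names and the GC-counting task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_count(chunk):
    """Small computation: count G and C bases in one chunk."""
    return sum(1 for base in chunk if base in "GCgc")


def split_into_chunks(sequence, chunk_size):
    """Split the large input into fixed-size pieces."""
    return [sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)]


def distributed_gc_count(sequence, chunk_size=4, workers=3):
    """Distribute the chunks over workers, then merge the partial counts."""
    chunks = split_into_chunks(sequence, chunk_size)
    # Threads play the role of worker nodes in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(gc_count, chunks))
    return sum(partial_counts)


if __name__ == "__main__":
    print(distributed_gc_count("ACGTGGCCAATT"))
```

The task splits cleanly because each chunk can be processed independently; computations that need information across chunk boundaries require more careful partitioning.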

Grid

Globus Toolkit
A representative computational grid middleware
Open source toolkit for building computing grids, developed and provided by the Globus Alliance
Standards implementation
  Open Grid Service Architecture (OGSA)
  Open Grid Service Infrastructure (OGSI)
  Web Services Resource Framework (WSRF)
  Job Submission Description Language (JSDL)
  Distributed Resource Management Application API (DRMAA)
  SOAP
  WSDL
  Grid Security Infrastructure

DRMAA
High-level Open Grid Forum API specification for the submission and control of jobs to a Distributed Resource Management (DRM, job scheduler) system, such as a Cluster or Grid computing infrastructure

PBS (Portable Batch System)
Computer software that performs job scheduling in a Unix cluster environment
A component of the Globus Toolkit
Originally developed by NASA
Following versions
  OpenPBS
  TORQUE: a fork of OpenPBS
  PBS Professional (PBS Pro): commercial

TORQUE
Distributed resource manager providing control over batch jobs and distributed compute nodes
It stands for Terascale Open Source Resource and QUEue Manager
It holds configuration information for each slave node, such as the number of CPUs, the number of cores, the RAM size, and temporary storage, and allocates cluster resources when the scheduler requests them.
[Figure: one Master node and Slave 1, Slave 2, Slave 3, sharing NFS]
> qsub a.sh
The a.sh command is handed to a slave node according to the scheduler.

Virtualized Galaxy (Test-bed)

Galaxy on Cloud

Cloud computing
Delivery of computing and storage capacity as a service to a heterogeneous community of end-recipients.

VPS (Virtual Private Server)
A term used by Internet hosting services to refer to a virtual machine in a cloud

Amazon EC2 (Amazon Elastic Compute Cloud)
Virtualization + Grid (Cluster) computing in a Cloud

Amazon S3 (Amazon Simple Storage Service)

Galaxy on Cloud
Using Amazon EC2 + S3
Select AMIs in Community AMIs

Galaxy on Insilicogen
Galaxy localization on cluster
Tool development
Workflow development

www.insilicogen.com E-mail codes@insilicogen.com Tel 031-278-0061 Fax 031-278-0062

www.insilicogen.com E-mail km@insilicogen.com Tel 031-548-1008,1009 Fax 031-278-0062