Introduction to Bioinformatics
Chapter 5. DNA Microarray Data Analysis
Sung Hee Park (shpark@ssu.ac.kr)
Department of Bioinformatics, Soongsil University
8-5-8 (c) Sung Hee Park, Introduction to Bioinformatics

Contents
- Principles of the DNA microarray experiment
- Microarray data preprocessing
  - Image processing (image preprocessing)
  - Microarray data normalization
- Bioinformatic analysis of microarray data
  - Clustering
    - Hierarchical clustering
    - K-means clustering
  - Classification

What is a DNA chip?
- A device in which thousands to tens of thousands of kinds of DNA are attached at high density to a very small metal or glass surface, so that the genes hybridizing to these DNAs can be analyzed at very high speed.

One High-Throughput Method: Microarrays
- A DNA microarray is an experimental method for large-scale gene expression analysis: it measures the expression of a large number of genes at once.
- What is varied: individuals, strains, cell types, environmental conditions, disease states, etc.
- What is measured: RNA quantities for thousands of genes, exons, or other transcribed sequences.
Microarray data
- Represented as a two-dimensional matrix.
- Rows: genes, proteins, etc.
- Columns: individuals, strains, cell types, etc.

Purpose of microarrays
- How active are various genes in different cell or tissue types?
- How does the activity level of various genes change under different conditions?
  - stages of the cell cycle
  - environmental conditions
  - disease states
- Which genes seem to be regulated together?
- Find the genes that change expression between experimental and control samples.
- Classify samples based on a gene expression profile.
- Find patterns: groups of biologically related genes that change expression together across samples or treatments.

[Figure labels: Pool of Cell Lines; Tumor]
Sources of variation between channels and slides:
- Different amounts of starting material.
- Differential labeling efficiency of the dyes.
- Different amounts of RNA in each channel.
- Differential efficiency of scanning in each channel.
- Differential efficiency of hybridization over the slide surface.

Microarray (also called DNA chips, gene chips, DNA arrays)
Classification by what is placed on each spot:
- cDNA microarray chip (Pat Brown, Stanford Univ.)
  - Already-identified ORFs (open reading frames) are deposited on the chip.
  - mRNA from the organism is converted to cDNA with reverse transcriptase; when this cDNA hybridizes with the ORFs on the chip, the signal strength varies with the expression level.
  - Used to analyze the expression level of specific genes.
- Oligonucleotide chip (Affymetrix Inc.)
  - DNA probes of ~25 nucleotides are deposited on the chip.
  - When probes built from combinations of the 4 nucleotides (A, G, C, T) hybridize with sample DNA, the signal strength varies with how well the two sequences match.
  - Used to analyze DNA sequences and mutated sequences.
- Both types rely on anchoring pieces of DNA to glass/silicon slides and on complementary hybridization.
Principle of the microarray experiment: pin microarray
- Based on complementary hybridization.
- Developed in 1995 at Stanford University (Dr. Pat Brown).
- Pre-made oligonucleotides or cDNAs are spotted onto the chip with pins.
- About 3 thousand genes can be placed on a 1 cm chip.

[Figure: Probe extraction and labeling, then hybridization and scanning. Total RNA (15 ug) is extracted from wild-type and mutant cells or tissues and quality-checked on the 28S and 18S rRNA bands. mRNA is converted to cDNA by reverse transcriptase; wild-type cDNA is labeled with Cy3 and mutant cDNA with Cy5. Both are hybridized to a 7.5k cDNA chip, which is read with a laser (Cy3: 533 nm) and a detector.]
[Figure: "Intensity Dependence Comparison". Log(R/G) plotted against 0.5*(Log(G) + Log(R)) for Slide3 and Slide7, each with a polynomial trend line.]

Processing and data mining pipeline
Image Processing -> Data Normalization -> Differential Gene Expression -> Cluster -> Pathway

Image processing
- Gridding: locating the position of each spot. During the experiment, spots are sometimes pushed out of position by the cover glass.
- Segmentation: determining the brightness of each located spot by separating foreground from background.

[Figures: Gridding; Segmentation]
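The segmentation step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of any particular image-analysis package: the function names `threshold_mask` and `spot_intensity`, the fixed threshold, the toy pixel values, and the use of the median as the summary statistic are all chosen for the example.

```python
from statistics import median


def threshold_mask(pixels, threshold):
    """Mark a pixel as foreground (True) when it is brighter than `threshold`.

    Assumption: a fixed threshold is the simplest possible segmentation rule.
    """
    return [[p > threshold for p in row] for row in pixels]


def spot_intensity(pixels, mask):
    """Return (foreground, background, corrected) for one grid cell.

    pixels: 2D list of intensities for the cell that gridding assigned to a spot.
    mask:   2D list of booleans, True where the pixel belongs to the spot.
    """
    fg = [p for row, mrow in zip(pixels, mask) for p, m in zip(row, mrow) if m]
    bg = [p for row, mrow in zip(pixels, mask) for p, m in zip(row, mrow) if not m]
    foreground = median(fg)
    background = median(bg)
    # Background-corrected signal, clipped at zero so it never goes negative.
    return foreground, background, max(foreground - background, 0)


# Toy 4x4 grid cell: a bright 2x2 spot on a dim background (made-up values).
cell = [
    [10, 12, 11, 10],
    [11, 90, 95, 12],
    [10, 92, 98, 11],
    [12, 10, 11, 10],
]
mask = threshold_mask(cell, threshold=50)
fg, bg, corrected = spot_intensity(cell, mask)
print(fg, bg, corrected)
```

In a two-color experiment the same calculation would be run once per channel (Cy3 and Cy5) for every spot.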
Processing of array data
- Each spot is a pixel image, and each pixel has a Cy5/Cy3 ratio.
- The mean and median of the pixels of a given spot are computed for both the Cy5 and the Cy3 channel.
- The intensity of a given spot is expressed as a Cy5/Cy3 ratio.
- Log-transformed intensities approach a normal distribution.
- Question to answer: which genes are of interest?

[Figure: scatter plot of Log Cy5 against Log Cy3, annotated "G = log cy5/cy3"]

Computational analysis
- Identifying differential expression: which genes have different expression levels across two groups?
- Clustering genes: which genes seem to be regulated together?
- Clustering samples: which treatments or individuals have similar profiles?
- Classifying genes: to which functional class does a given gene belong?
- Classifying samples: to which class does a given sample belong?
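The spot-level quantities described under "Processing of array data" (a per-channel median over pixels, the Cy5/Cy3 ratio, and its log transform) can be sketched as follows. The function names, the pixel values, and the use of base-2 logarithms are assumptions for illustration; the slides do not state a base. The `median_center` helper shows global median centering, which is the simplest normalization and not necessarily the one used in the lecture.

```python
import math
from statistics import median


def spot_log_ratio(cy5_pixels, cy3_pixels):
    """Return (log_ratio, mean_log_intensity) for one spot.

    log_ratio          = log2(R / G), with R, G the channel medians
    mean_log_intensity = 0.5 * (log2(G) + log2(R))
    """
    r = median(cy5_pixels)  # red channel (Cy5)
    g = median(cy3_pixels)  # green channel (Cy3)
    log_ratio = math.log2(r / g)
    mean_log_intensity = 0.5 * (math.log2(g) + math.log2(r))
    return log_ratio, mean_log_intensity


def median_center(log_ratios):
    """Shift all log ratios so that their median becomes zero."""
    center = median(log_ratios)
    return [value - center for value in log_ratios]


# Made-up pixel intensities for a single spot.
m, a = spot_log_ratio([400, 420, 380], [100, 105, 95])
print(m, a)  # m is 2.0: Cy5 is four times brighter than Cy3
print(median_center([1.0, 0.5, 0.0]))
```

Plotting the log ratio against the mean log intensity for every spot gives the kind of intensity-dependence plot used to judge whether normalization is needed.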
Cluster
- Clustering: the organization of a collection of unlabeled patterns into clusters based on similarity.
- Patterns within the same cluster are more similar to each other than they are to a pattern belonging to a different cluster.

Clustering gene expression data
- Group together the genes that share a similar expression pattern across a data set (gene expression across several treatments): genes involved in the same biological process are likely co-regulated.
- Group arrays showing similar gene expression profiles in order to discover sample groups.
- Hypothesis: genes with similar function have similar expression profiles.

[Figure: example expression profiles labeled "Putative mitochondrial carrier" and "Chlorophyll binding protein"]

Clustering methods
- Hierarchical clustering
  - Agglomerative: single, complete, average linkage
- k-means or k-medoids
- SOMs (self-organizing maps)
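Of the methods listed above, k-means is the shortest to write down. The following is a minimal sketch under stated assumptions: the function name `kmeans`, the naive initialization (the first k profiles become the starting centroids), and the toy expression profiles are all illustrative, and a real analysis would normally rely on a tested library implementation.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain k-means on a list of equal-length numeric vectors.

    Returns (labels, centroids). Initialization is deliberately naive:
    the first k points are used as the starting centroids.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Made-up profiles: two up-regulated and two down-regulated genes.
profiles = [
    [1.0, 1.1, 0.9],
    [0.9, 1.0, 1.0],
    [-1.0, -0.9, -1.1],
    [-1.1, -1.0, -1.0],
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

Unlike hierarchical clustering, the number of clusters k has to be chosen before the algorithm runs.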
Hierarchical clustering
- Every gene (or array) is placed at a specific node in a hierarchy (tree-like structure), so that the distance between any two points can be assessed.
- The result is a dendrogram, or hierarchical tree.
- The number of clusters is determined by a distance cutoff.
- By contrast, k-means or SOM partitions the data into a pre-defined number of nodes without a hierarchy between data points.
- The hierarchy can be constructed either top-down (divisive) or bottom-up (agglomerative).

Hierarchical approach
- Agglomerative
  - Start with the points as individual clusters.
  - At each step, merge the closest pair of clusters.
- Divisive
  - Start with one, all-inclusive cluster.
  - At each step, split a cluster until only singleton clusters remain.

[Figure: dendrogram and nested cluster diagram for points p1 to p5, with the proximity matrix]

How to measure similarity between two individual patterns
A gene expression pattern a is represented by a vector of measurements [a_1, a_2, ..., a_N]. Measures of dissimilarity between two such gene vectors:
- Euclidean distance dissimilarity
- Scalar product dissimilarity
- Correlation coefficient

How to measure the distance between two clusters
- MIN (single linkage)
- MAX (complete linkage)
- Group average (average linkage)
- Distance between centroids
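Two of the pattern-level measures listed above can be compared on a small example. This is a sketch with assumed names and made-up data; in particular, turning the correlation coefficient into a dissimilarity as 1 - r is a common convention assumed here, not something the slides specify.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation coefficient between two expression vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_dissimilarity(a, b):
    """Assumed convention: 0 for perfectly correlated profiles, 2 for opposite."""
    return 1.0 - pearson(a, b)


# Two made-up genes with the same shape but different magnitude.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
print(euclidean(gene_a, gene_b))                  # large: magnitudes differ
print(correlation_dissimilarity(gene_a, gene_b))  # zero: shapes are identical
```

The choice matters for expression data: Euclidean distance separates genes that differ in magnitude, while a correlation-based measure groups genes that rise and fall together regardless of magnitude.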
Single, Complete, Average Linkage Algorithms: distance between clusters
- method="single": the smallest dissimilarity between a point in the first cluster and a point in the second cluster (nearest neighbor method).
- method="complete": the largest dissimilarity between a point in the first cluster and a point in the second cluster (furthest neighbor method).
- method="average": the average of the dissimilarities between the points in one cluster and the points in the other cluster.

Sample data: set of 6 two-dimensional points

xy coordinates of 6 points
Point   x      y
p1      0.40   0.53
p2      0.22   0.38
p3      0.35   0.32
p4      0.26   0.19
p5      0.08   0.41
p6      0.45   0.30

Euclidean distance matrix for 6 points
      p1     p2     p3     p4     p5     p6
p1    0.00   0.24   0.22   0.37   0.34   0.23
p2    0.24   0.00   0.15   0.20   0.14   0.25
p3    0.22   0.15   0.00   0.15   0.28   0.11
p4    0.37   0.20   0.15   0.00   0.29   0.22
p5    0.34   0.14   0.28   0.29   0.00   0.39
p6    0.23   0.25   0.11   0.22   0.39   0.00

MIN (single linkage)
[Figure: nested clusters and dendrogram for the six points]
Dist({3,6}, {2,5}) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5)) = min(0.15, 0.25, 0.28, 0.39) = 0.15
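The three linkage rules and the worked MIN example can be checked with a short script. This is a sketch, not the lecture's own code: the names `DIST`, `pair_distances`, `single_linkage`, `complete_linkage`, `average_linkage`, and `agglomerate` are hypothetical, and the distance values are those of the six-point example (taken as given rather than recomputed from the coordinates).

```python
from itertools import product

# Pairwise distances for points p1..p6 of the six-point example.
DIST = [
    [0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
    [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
    [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
    [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
    [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
    [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],
]


def pair_distances(c1, c2):
    """All point-to-point distances between two clusters (1-based point ids)."""
    return [DIST[i - 1][j - 1] for i, j in product(c1, c2)]


def single_linkage(c1, c2):    # MIN, nearest neighbor
    return min(pair_distances(c1, c2))


def complete_linkage(c1, c2):  # MAX, furthest neighbor
    return max(pair_distances(c1, c2))


def average_linkage(c1, c2):   # group average
    d = pair_distances(c1, c2)
    return sum(d) / len(d)


def agglomerate(n_points, linkage):
    """Bottom-up clustering: repeatedly merge the closest pair of clusters.

    Returns the merges as (cluster_a, cluster_b, distance) tuples, in order.
    """
    clusters = [(i,) for i in range(1, n_points + 1)]
    merges = []
    while len(clusters) > 1:
        d, a, b = min(
            (linkage(a, b), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


print(single_linkage((3, 6), (2, 5)))  # 0.15, as in the worked example
for a, b, d in agglomerate(6, single_linkage):
    print(a, b, d)
```

The merge distances returned by `agglomerate` correspond to the heights at which branches join in the dendrogram; passing `complete_linkage` or `average_linkage` instead generally produces a different tree from the same matrix.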