Contents 1. 개요 RAC 의기본개요 RAC 의기본구조 2.2. CRS PROCESS 장애항목에따른 Failure Effect CRSSTARTUP SHUTDOWN PROGRESS ON RAC..

Similar documents
목차 BUG offline replicator 에서유효하지않은로그를읽을경우비정상종료할수있다... 3 BUG 각 partition 이서로다른 tablespace 를가지고, column type 이 CLOB 이며, 해당 table 을 truncate

Microsoft PowerPoint - Tech-iSeminar_10g_RAC_CRS.ppt

Microsoft PowerPoint - Linux_10gRAC_3Node

Microsoft Word - 기술노트36_11g R1 Clusterware를 이용한 Oracle HA 구성 _최종본_.doc

ORANGE FOR ORACLE V4.0 INSTALLATION GUIDE (Online Upgrade) ORANGE CONFIGURATION ADMIN O

목 차

untitled

6주차.key

The Self-Managing Database : Automatic Health Monitoring and Alerting

Oracle Database 10g: Self-Managing Database DB TSC

Simplify your Job Automatic Storage Management DB TSC

Backup Exec

오라클 데이터베이스 10g 핵심 요약 노트

PowerPoint 프레젠테이션

PRO1_09E [읽기 전용]

Document Server Information Items Description Test Date 2011 / 05 / 31 CPU Intel(R) Xeon(R) CPU 2.40GHz X 8 Main Memory 1GB O/S version OEL 5.

PowerPoint Presentation

PCServerMgmt7

Microsoft Word - 10g RAC on Win2k.doc

목차 BUG 문법에맞지않는질의문수행시, 에러메시지에질의문의일부만보여주는문제를수정합니다... 3 BUG ROUND, TRUNC 함수에서 DATE 포맷 IW 를추가지원합니다... 5 BUG ROLLUP/CUBE 절을포함하는질의는 SUBQUE

MySQL-Ch10

untitled

MS-SQL SERVER 대비 기능

Tablespace On-Offline 테이블스페이스 온라인/오프라인

AKG 설 맞이 사내 특판 이벤트

Advanced Product Service

소개 TeraStation 을 구입해 주셔서 감사합니다! 이 사용 설명서는 TeraStation 구성 정보를 제공합니다. 제품은 계속 업데이트되므로, 이 설명서의 이미지 및 텍스트는 사용자가 보유 중인 TeraStation 에 표시 된 이미지 및 텍스트와 약간 다를 수

Oracle Database Appliance X5-2 Customer Presentation

Jerry Held

본문서는 초급자들을 대상으로 최대한 쉽게 작성하였습니다. 본문서에서는 설치방법만 기술했으며 자세한 설정방법은 검색을 통하시기 바랍니다. 1. 설치개요 워드프레스는 블로그 형태의 홈페이지를 빠르게 만들수 있게 해 주는 프로그램입니다. 다양한 기능을 하는 플러그인과 디자인

Oracle9i Real Application Clusters

Solaris Express Developer Edition

PRO1_02E [읽기 전용]

휠세미나3 ver0.4

ETL_project_best_practice1.ppt

<C0CCBCBCBFB52DC1A4B4EBBFF82DBCAEBBE7B3EDB9AE2D D382E687770>

SRC PLUS 제어기 MANUAL

Remote UI Guide

Mango220 Android How to compile and Transfer image to Target

단계

untitled

Chapter 1


Oracle hacking 작성자 : 임동현 작성일 2008 년 10 월 11 일 ~ 2008 년 10 월 19 일 신규작성 작성내용

슬라이드 1

Sun Java System Messaging Server 63 64

MAX+plus II Getting Started - 무작정따라하기

最即時的Sybase ASE Server資料庫診斷工具

vm-웨어-01장

1217 WebTrafMon II

강의10

OnTuneV3_Agent_Install

歯sql_tuning2

PowerPoint 프레젠테이션

PWR PWR HDD HDD USB USB Quick Network Setup Guide xdsl/cable Modem PC DVR 1~3 1.. DVR DVR IP xdsl Cable xdsl Cable PC PC DDNS (

Analytics > Log & Crash Search > Unity ios SDK [Deprecated] Log & Crash Unity ios SDK. TOAST SDK. Log & Crash Unity SDK Log & Crash Search. Log & Cras

PRO1_04E [읽기 전용]

슬라이드 1

침입방지솔루션도입검토보고서

본교재는수업용으로제작된게시물입니다. 영리목적으로사용할경우저작권법제 30 조항에의거법적처벌을받을수있습니다. [ 실습 ] 스위치장비초기화 1. NVRAM 에저장되어있는 'startup-config' 파일이있다면, 삭제를실시한다. SWx>enable SWx#erase sta

PowerChute Personal Edition v3.1.0 에이전트 사용 설명서

(SW3704) Gingerbread Source Build & Working Guide

Sena Technologies, Inc. HelloDevice Super 1.1.0

NoSQL

<4D F736F F D205BB4EBBBF3C1A4BAB8B1E2BCFA5DB1E2BCFAB9AEBCAD2D524D414EBBE7BFEBBFB9C1A65F39695F313067>

62

LXR 설치 및 사용법.doc

CD-RW_Advanced.PDF

PowerPoint 프레젠테이션

USB USB DV25 DV25 REC SRN-475S REC SRN-475S LAN POWER LAN POWER Quick Network Setup Guide xdsl/cable Modem PC DVR 1~3 1.. DVR DVR IP xdsl Cable xdsl C

세미나(장애와복구-수강생용).ppt

SQL Developer Connect to TimesTen 유니원아이앤씨 DB 기술지원팀 2010 년 07 월 28 일 문서정보 프로젝트명 SQL Developer Connect to TimesTen 서브시스템명 버전 1.0 문서명 작성일 작성자

H3050(aap)

I T C o t e n s P r o v i d e r h t t p : / / w w w. h a n b i t b o o k. c o. k r

Sena Device Server Serial/IP TM Version

untitled

Interstage5 SOAP서비스 설정 가이드

1. What is AX1 AX1 Program은 WIZnet 사의 Hardwired TCP/IP Chip인 iinchip 들의성능평가및 Test를위해제작된 Windows 기반의 PC Program이다. AX1은 Internet을통해 iinchip Evaluation

문서 제목

Microsoft Word - CNVZNGWAIYSE.docx

APOGEE Insight_KR_Base_3P11

ePapyrus PDF Document

ARMBOOT 1

Mars OS System Administration Guide

DE1-SoC Board

Network seminar.key

vm-웨어-앞부속

Orcad Capture 9.x

PowerPoint 프레젠테이션

Copyright 2012, Oracle and/or its affiliates. All rights reserved.,.,,,,,,,,,,,,.,...,. U.S. GOVERNMENT END USERS. Oracle programs, including any oper

목차 BUG DEQUEUE 의 WAIT TIME 이 1 초미만인경우, 설정한시간만큼대기하지않는문제가있습니다... 3 BUG [qp-select-pvo] group by 표현식에있는컬럼을참조하는집합연산이존재하지않으면결괏값오류가발생할수있습니다... 4

Data Guard 기본개념.doc

목차 1. 시스템구성 개요 정의 목적 환경 준비사항 Oracle Install configuration Xshell 에

PowerPoint Presentation

1. efolder 시스템구성 A. DB B. apache - mod-perl - PHP C. SphinxSearch ( 검색서비스 ) D. File Storage 2. efolder 설치순서 A. DB (MySQL) B. efolder Service - efolder

품질검증분야 Stack 통합 Test 결과보고서 [ The Bug Genie ]

chapter4

PowerPoint Presentation

untitled

<31332DB9E9C6AEB7A2C7D8C5B72D3131C0E528BACEB7CF292E687770>

Transcription:

ORACLE 10G 11G RAC TROUBLE SHUTING Author 최유진 Creation Date 2011-04-30 Last Updated Version 1.0 Copyright(C) 2004 Goodus Inc. All Rights Reserved Version 변경일자변경자 ( 작성자 ) 주요내용 1 2011.04.30 최유짂최초작성

Contents 1. 개요... 4 2. RAC 의기본개요... 4 2.1. RAC 의기본구조 2.2. CRS PROCESS... 5 2.3. 장애항목에따른 Failure Effect... 6 2.4. CRSSTARTUP SHUTDOWN PROGRESS ON RAC... 7 2.4.1. System Start with CRS and Vendor C/W (Cluster Ware)... 7 2.4.2. CRS Start/Stop 설정변경... 7 2.4.3. CRS Start 가안되는문제... 7 3. CRS FAILURE and FIND LOGFILE... 7 3.1. CRS FAILURE... 7 3.2. CRS LOG... 9 3.2.1. CRS LOG 위치... 9 3.2.2. Resource 별추가로그생성방법...10 3.2.3. 젂체리소스에추가적인 Trace 정보생성방법...10 3.3. RAC Tracing... 10 3.3.1. Cluster wide tracing...10 3.3.2. Cluster wide tracing 옵션...10 3.3.3. Sample 예제...10 3.3.4. Sqlplus 접속이안되어 systemstats dump 생성이불가능할경우...11 3.3.5. GSD DEBUG 예제...11 3.3.6. CRS TRACING ANDDEBUG 3.4. CRS RESOURCE 문제사례... 13 3.4.1. * CRS DEMON 갂통싞불가시에조치사항...13 3.4.2. CRS Resource 권한문제로인한 Resource 기동실패...15 3.4.3. CRS Resource dependency 문제로인한 Auto Start 실패...16 3.5. 수동으로 CRS resource 등록하는방법 (GSD/VIP/ONS)... 17 4. OCR & VOTE... 18 4.1. OCR Disk 의 Backup & Recovery... 18 4.2. Voting Disk 의 Backup & Recovery... 18 4.2.1. OCR 과 VOTE 의장애시 RAC TO SINGLE 로운영하기...19 4.2.2. SINGLE 인스턴스를 RAC 로변경하기...25 5. 일반적장애사례... 27-2

5.1. 잘못된 redo 사용으로인한 DB Crash... 27 5.2. 10G RAC LOG DIRECTORY 문제... 28 6. VIP 문제발생.... 29 6.1. VIP 가올라오지않는상황... 29 6.2. VIP 정보확인방법.... 29 6.3. Interconnect MTU size... 31 7. ISSUE PLATFORM 11G... 32-3

작성개요 10g 또는 11g RAC 홖경에서 RAC 를운영하는관리자는예상치못한 CRS 또는 INSTANCE 의장애가발생할수있습니다. 이노트에서는기본적으로 RAC 를관리하면서발생할수있는문제에대한 TROUBLE SHOOTING 을다루어보도록하겠습니다. 이문서는 LINUX VERSION 4 운영체제의 10g RAC, 11g RAC 를바탕으로작성되었으며, 일반적으로다른 UNIX 운영체제및 VERSION 에서도적용이가능합니다. 1. RAC 의기본개요 1.1. RAC 의기본구조 RAC 의기본적인구조는물리적인하나의데이터베이스를여러대의서버가공유하여사용하는것으로, 모든서버들은같은데이터를사용하게되어논리적으로는하나의시스템을이용하는것입니다. Storage Area Network (SAN) 등으로구성된 Shared Disk 를사용하여여러대의 Node 가데이터를공유하게되며, 클라이언트는여러대로구성된 Node 중특정 Node 로요청을보내어서비스를제공받게됩니다. 각각의 Node 에오라클인스턴스를각각설치한후데이터저장을위한 Shared Disk 를이용하여 Datafile 들을공유하는홖경으로데이터베이스를구성하게됩니다. 각각의 Node 는 Cache Fusion 기능을위해 Node 갂 High-speed Interconnect Network 이필수요소입니다. 이러한 Interconnect Network 는웎홗한서비스를위해최소한 GIga Network 구성이필수입니다. - 4

1.2. CRS PROCESS 2.2.1. CRSD (Cluster Ready Services Daemon) CRSD 는 crs background process 중가장중심이되는 daemon 입니다. CRSD 는 HA 기능을제공하며, 각 resouce 에대한 start/stop/fail 을관리하며, resource 에대한이벤트를생성합니다. OCR 에저장된정보를관리하며, OCR 의상태정보를저장하며, OCRD 라고불리우는 OCR cache process 를제공합니다. CRSD 가 Kill 혹은 Fail 시자동으로 init.crsd 가 CRSD 를 restart 합니다 root 11214 10291 0 00:33? 00:00:44 /oracle/product/crs/bin/crsd.bin reboot 2.2.2. CSSD (Cluster Synchronization Services Daemon) CSSD 는노드갂 membership 을관리합니다. CSSD 는크게 3 가지의서비스를제공합니다. 하나는 GROUP 서비스입니다. GROUP Management 와 Reconfiguration 이발생하여, 인스턴스가싞규 member 로 active 한상태가되면, 기존 member 에게 notify 하는역할을수행하며, GROUP membership 에대한동기화기능을수행합니다. 두번째는 LOCK 서비스입니다. CRSD 가사용하는 LOCK 서비스를제공하며, Shared 및 Exclusive lock 을지웎합니다. 세번째는 Node Information 서비스입니다. 노드를모니터링하며, 클러스터갂 Health 체크와 brain split 을방지하도록설계되었습니다. 만약 CSSD 의동기화기능에문제가있거나죽으면해당노드의서버는 rebooting 됩니다. 그밖에 vendor clusterware 가있을경우 vendor clusterware 와 CRS 갂 coodinator 역할을합니다. ASM 인스턴스와 database 인스턴스갂동기화기능을수행합니다. -Brain split 이란? 노드갂네트웍장애로인하여상호갂의동기화가이루어지지못한채각각의 노드가 Data 에 access/write - 5

하여, Data Corruption 을발생시키는현상을의미합니다. oracle 11638 11261 0 00:33? 00:00:17 /oracle/product/crs/bin/ocssd.bin 2.2.3. EVMD (Event Management Daemon) EVMD 는각노드갂발생되는이벤트를통해발생하는 message 를보내고받는역할을수행합니다. EVMD log 관리를위해 EMVLOGGER 를 Child process 로생성합니다. oracle 11194 11151 0 00:33? 00:00:01 /oracle/product/crs/bin/evmd.bin oracle 11842 11194 0 00:33? 00:00:00 /oracle/product/crs/bin/evmlogger.bin -o /oracle/product/ crs/evm/log/evmlogger.info -l /oracle/product/crs/evm/log/evmlogger.log 2.2.4. OPROCD (Process Monitor Daemon) OPROCD 는 vendor clusterware 가없을경우에만생성되는 daemon 입니다. OPROCD 는 memory 에상주하며, cluster process 를모니터링합니다. node evict 시 I/O Fencing 기능을제공함으로써 Data Corruption 을방지합니다. root 11514 11225 0 00:33? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f 1.3. 장애항목에따른 Failure Effect 장애가발생하면장애의유형에따라서아래의표와같은현상들이발생합니다. Result Component Effect of Failure OK CPU Panic / Crash Node Failed, other node still active OK Memory Crash Node Failed, other node still active OK Inter-Connect With dual Inter-connects, OK Down Inter-Connect Switch Nodes can not communicate OK OS Failure / Freeze Node Failed, other node still active Down Cluster Manager S/W Cluster freezes, all nodes go down OK DB Instance Crash Instance running on other node provides database service Down Control File (Corrupt / Loss) Multiplexed controlfile will be used OK Redo log file Multiplexed redo file Down Loss Data File Requires Media Recovery - 6

1.4. CRSSTARTUP SHUTDOWN PROGRESS ON RAC 1.4.1. System Start with CRS and Vendor C/W (Cluster Ware) CRS 는 System Booting 시 Automatic Startup Vendor C/W 는일반적으로 Auto Start 설정을하지않음. Vendor C/W 가 Startup 하여 Shared Disk 가 Access 가능한후에 CRS 가 OCR/Voting Disk 를읽어서 Start 할수있음. 1.4.2. CRS Start/Stop 설정변경 CRS 의 AutoStart 기능 Disable. Vendor C/W 를 Start 후수동으로 CRS Start. Vendor C/W 에서 CRS Start/Stop 을제어하도록설정가능. 1.4.3. CRS Start 가안되는문제 대부분 Vendor C/W 가 Start 되지않은경우. IBM AIX 의경우 OCR/Voting Volume 이 Concurrent Mode 가아닌경우. /tmp directory 에 crsctl.pid file 확인. CRS cssd 프로세스장애시자동 reboot 방지 /sbin/init.d/init.cssd ## FAST_REBOOT="$REBOOT_TOC /usr/sbin/reboot -r -n -q" ## SLOW_REBOOT="/bin/kill -HUP `$CAT /var/run/syslog.pid` ; /bin/sync & $SLEEP 2 ; $REBOOT_TOC /usr/sbin/reboot -r -n -q ## FAST_REBOOT="$REBOOT_TOC /usr/sbin/reboot -r -n -q" ## SLOW_REBOOT="$REBOOT_TOC /usr/sbin/reboot -r -n -q" 2. CRS FAILURE and FIND LOGFILE 2.1. CRS FAILURE RAC 운영에서 CRS 의 FAIL 은운영의장애와깊은관계가있습니다. CRS 가 FAIL 되는경우는다음과같습니다. Reason Split-Brain Resolution - 7

Cluster needs to know who is a member at all times If a node does not send a heartbeat for MissCount (time in seconds), then node is evicted from cluster If disk heartbeat is not updated in I/O timeout, then node is evicted from cluster Case Interconnect (Network) Fail Disktimeout (I/O) Fail OCR/Voting Disk Failure CSSD Process abnormally terminates. OS Resource starvation 시스템이 REBOOT 가되는경우에는다음스크립트에서확인이가능합니다. Init.cssd script file 을살펴 보면 reboot 구분이있습니다. Init.cssd File AIX) CLUSTERDIR=/opt/ORCLcluster... FAST_REBOOT="/usr/bin/sysdumpstart -p" SLOW_REBOOT="/bin/kill -HUP `$CAT /etc/syslog.pid`; /bin/sync & $SLEEP 2; /usr/sbin/fastboot -n -q"... HP-UX) MACH_HARDWARE=`/bin/uname -m`... FAST_REBOOT="/usr/sbin/reboot -r -n -q" SLOW_REBOOT="/bin/kill -HUP `$CAT How to Find the Root Cause CRS Log File Heartbeat timeout 에의한 Reboot 은 CSSD Log 를통해확인합니다. OS Log File CRS 에의한 Reboot 은 OS Log File 을통해확인합니다. 비정상적인 Process 종료또는 Disk Failure 의경우 Log 에의해확인이불가능한경우도있습니다. Reboot 젂에 CRS Diag 와 Log Flush 위한 Trace Flush Time 을설정합니다. #crsctl set css diagwait 13 ( 10.2.0.3 부터가능 ) - 8

2.2. CRS LOG 2.2.1. CRS LOG 위치 RAC 의운영중에어떠한문제가발생시에가장먼저문제점을확인하기위해서해당 LOG 를살펴봐야합니다. 아래는 CRS PROCESS 와관렦된 LOG 의위치입니다. 관렦프로세스및로그로그파일위치 CRS Alert log $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log CSSD CSS daemon log : $ORA_CRS_HOME/log/<hostname>/cssd/ocssd.log, CSS daemon init output file : $ORA_HOME/css/init CSS daemon startup file : /etc/init.d/init.cssd /etc/inittab, /etc/hosts CRSD CRS daemon log :$ORA_CRS_HOME/log/<hostname>/crsd/crsd.log CRS daemon core file : $ORA_CRS_HOME/crs/init/* EVMD 실행중인데몬에대한 PSTACKs 또는 GCOREs : CRS 가 Hang 일때중요 정보제공 EVM daemon logs : $ORA_CRS_HOME/log/<hostname>/evmd/evmd.log EVM event logger logs : $ORA_CRS_HOME/evm/log/<hostname>_evmdaemon.log RACX CORE DUMP Find NETWORK EVM event log archives : $ORA_CRS_HOME/evm/log/<hostname>_evmlog.<date> $ORA_CRS_HOME/log/<hostname>/racg/dump/*vip, *.ons cd $ORA_CRS_HOME ; find. -name "*core*" cd $ORACLE_HOME ; find. -name "*core* CSS Network tracing : $ORA_CRS_HOME/log/<hostname>/client/css*.log Other Network tracing : $ORA_CRS_HOME/log/<hostname>/client/clsc*.log OPROCD OPROCD log : $ORA_CRS_HOME/log/cssd/<hostname>/oclsomon/oclsomon.log 기타 OS LOG, HA LOGS - 9

2.2.2. Resource 별추가로그생성방법 crs_stat p <resource_name> > $ORA_CRS_HOME/crs/public/<resource_name.cap> 해당 capture 파일을 vi 로열어서 USR_ORA_DEBUG 항목의값을 0 에서 1~5 로변경합니다. bash-3.00# /oracle/product/crs/crs/profile/ora.db1.vip.cap bash-3.00# crs_register -u ora.db1.vip bash-3.00# crs_stat -p ora.db1.vip grep USR_ORA_DEBUG USR_ORA_DEBUG=1 2.2.3. 전체리소스에추가적인 Trace 정보생성방법. $ORACLE_HOME/bin/racgwrap 에 _USR_ORA_DEBUG=1 추가 _USR_ORA_DEBUG=1 export _USR_ORA_DEBUG 2.3. RAC Tracing 2.3.1. Cluster wide tracing 9i R2 에서부터 oradebug 에서 cluster wide 범위로 dump 를수행할수있습니다. ( 단,diag process 를기동시킨경우에만가능 ) 즉, 한 node 에서 RAC cluster 참여한모든혹은일부 node 들의 Hanganalyze / Systemstate dump 를받을수있습니다. 2.3.2. Cluster wide tracing 옵션 -g : 각각의 node 에 dump 를생성 (-g 를권장 ) -r : 옵션은한 node 에생성 def : Default Instance all : 모든 Instance 2.3.3. Sample 예제 SQL> oradebug setmypid SQL> oradebug unlimit SQL> oradebug g all dump errorstack 3 SQL> oradebug -g all hanganalyze 3 Hang Analysis in /oracle/product/admin/rac/bdump/rac1_diag_38257.trc 반대 node 에도 ORACLE_SID _diag_ diag process ID.trc 로생성 각단계별로 10~30 초갂격으로 3 번이상수행할것을권고합니다. SQL> oradebug -g all dump systemstate 10-10

Statement processed. => 이어서 systemdump 를수행하면앞의 diag trace file 에추가됩니다. DIAG trace file 을 rename 하거나지우면 DB 를잧기동하기젂에는 dump 가생성되지않습니다. 만약지우거나파일이름을변경한경우에는각각의 node 에접속하여 Single 처럼작업합니다. ( 이때는 bdump 가아니라 udump 에생성됩니다.) SQL> oradebug hanganalyze 3 Hang Analysis in /oracle/product/admin/rac/udump/rac1_ora_47561.trc 2.3.4. Sqlplus 접속이안되어 systemstats dump 생성이불가능할경우 $sqlplus prelim /as sysdba SQL> oradebug setmypid SQL> oradebug unlimit SQL> oradebug call ksmcpg => -prelim 옵션사용시 ORA-7445 에러발생할경우상기옵션을사용하여 PGA 변수초기화수행합니다. SQL> oradebug g all dump systemstates 266; SQL> oradebug g all dump errorstack 3 SQL> oradebug -g all hanganalyze 3 SQL> oradebug tracefile_name 2.3.5. GSD DEBUG 예제 1. Racgwrap debug 레벨수정. $ORA_CRS_HOME/bin/racgwrap 을다음과같이바꿉니다. From: #_USR_ORA_DEBUG=1 && export _USR_ORA_DEBUG To: _USR_ORA_DEBUG=1 && export _USR_ORA_DEBUG 2. gsh.sh 수정합니다. $ORA_CRS_HOME/bin/gsd.sh 을다음과같이바꿉니다. From: JRE_OPTIONS=" To: JRE_OPTIONS="-DTRACING.ENABLED=true -DTRACING.LEVEL=2-11

3. srvm trace 설정합니다. script /tmp/srvctl_test1.log id env export SRVM_TRACE=Y date srvctl stop nodeapps -n ac10g_01 srvctl start nodeapps -n ac10g_02 exit 4. gsdctl 에 debug 설정합니다. From: JRE_OPTIONS=" To: JRE_OPTIONS="-DTRACING.ENABLED=true -DTRACING.LEVEL=2 5. nodeapps 을 stop 하고 gsdctl 실행. script /tmp/gsdctl_test1.log id env date cd $ORA_CRS_HOME/bin sh -x./gsdctl start exit 6. 생성된 log 을분석합니다. - 12

2.3.6. CRS TRACING ANDDEBUG 2.4. CRS RESOURCE 문제사례 2.4.1. * CRS DEMON 간통신불가시에조치사항. 1. 수동으로 CRS Daemon 을 start - 13

bash-3.00# crsctl start crs Attempting to start CRS stack The CRS stack will be started shortly 2. 프로세스확인. # Ps ef grep d.bin -> 정상구동안됨 bash-3.00# crs_stat -t CRS-0184: Cannot communicate with the CRS daemon. bash-3.00# crsstat HA Resource Target State ----------- ------ ----- error connecting to CRSD at [(ADDRESS=(PROTOCOL=ipc)(KEY=ora_crsqs))] clsccon 184 3. ## CRS ALERTLOG 확인. $ vi $ORA_CRS_HOME/log/c99db1/alert*log [evmd(547)]crs-1401:evmd started on node db1. [evmd(547)]crs-1401:evmd started on node db1. [evmd(547)]crs-1401:evmd started on node db1. [evmd(547)]crs-1401:evmd started on node db1.... [EVMD][2541584288]0EVMD waiting for CSS to be ready err = 3 [COMMCRS][1084229984]clsc_connect: (0x713fb0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_cafe99db1_crs)) [CSSCLNT][2541584288]clsssInitNative: connect failed, rc 9 예제의 EVMD Daemon 과정상적으로통싞을하지못해서발생하는것으로서, 소켓파일에문 제가있는경우임 4. 조치 - 14

해당노드에서아래경로의파일을지운다. ( 통싞과관렦된소켓파일이들어있다.) rm -f /var/tmp/.oracle/* 또는 rm -f /tmp/.oracle/* -> (Linux) 2.4.2. CRS Resource 권한문제로인한 Resource 기동실패 $ srvctl add asm -n 99db1 -i +ASM1 -o /oracle/product/10.2.0 [PRKS-1030 : Failed to add configuration for ASM instance "+ASM1" on node db1" in cluster registry, [PROC-5: User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]] [PROC-5: User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]]] -root 유져로등록하지만기동불가! bash-3.00# srvctl add asm -n db1 -i +ASM1 -o /oracle/product/10.2.0 bash-3.00# srvctl add asm -n db2 -i +ASM2 -o /oracle/product/10.2.0 bash-3.00# crsstat HA Resource Target State ----------- ------ ----- ora. db1.asm1.asm OFFLINE OFFLINE ora. db1.listener_db1.lsnr ONLINE ONLINE on db1 ora. db1.gsd ONLINE ONLINE on db1 ora. db1.ons ONLINE ONLINE on db1 ora. db1.vip ONLINE ONLINE on db1 ora. db2.asm2.asm OFFLINE OFFLINE ora. db2.listener_db2.lsnr ONLINE ONLINE on db2 ora. db2.gsd ONLINE ONLINE on db2 bash-3.00# srvctl start asm -n db1 PRKS-1009 : Failed to start ASM instance "+ASM1" on node db1", [PRKS-1009 : Failed to start ASM instance "+ASM1" on node db1", \ [CRS-1028: Dependency analysis failed because of: CRS-0223: Resource 'db1.asm1.asm' has placement error.]] [PRKS-1009 : Failed to start ASM instance "+ASM1" on node db1", [CRS-1028: Dependency analysis failed because of: CRS-0223: Resource 'ora.db1.asm1.asm' has placement error.]] - 15

오라클유저로시작하면다음과같은에러발생 PRKS-1009 : Failed to start ASM instance "+ASM1" on node db1", [PRKS-1009 : Failed to start ASM instance "+ASM1" on node db1", [CRS-0254: authorization failure]] [PRKS-1009 : Failed to start ASM instance "+ASM1" on node db1", [CRS-0254: authorization failure] 실행권한확인 bash-3.00# crs_getperm ora.db1.asm1.asm owner:root:rwx,pgrp:root:r-x,other::r--, root 로등록했으므로, root 유저로되어있다. 실행권한을오라클유저로변경해죾다. bash-3.00# crs_setperm ora.db1.asm1.asm -o oracle bash-3.00# crs_setperm ora.db1.asm1.asm -g dba 2.4.3. CRS Resource dependency 문제로인한 Auto Start 실패 현상 : 서버 REBOOT 시다른 Resource 는정상적으로올라오나, 특정 node Instance 만시작 되지않음. (alert 에아무런메세지가없거나, ORA-29072 발생 ) 웎인 : 장애 node 의 CRS Resource Dependency 문제. $ crs_stat -p ora.rac2db.rac2db1.inst NAME=ora.RAC2DB.RAC2DB1.inst TYPE=application ACTION_SCRIPT=/oracle/product/10.2.0/bin/racgwrap ACTIVE_PLACEMENT=0 AUTO_START=1 CHECK_INTERVAL=600 DESCRIPTION=CRS application for Instance FAILOVER_DELAY=0 FAILURE_INTERVAL=0 FAILURE_THRESHOLD=0 HOSTING_MEMBERS=rac2db1 OPTIONAL_RESOURCES= - 16

PLACEMENT=restricted REQUIRED_RESOURCES= RESTART_ATTEMPTS=5 SCRIPT_TIMEOUT=600 해결방법 : srvctl modify instance -d DB_NAME -i INSTANCE_NAME -s ASM_INSTANCE_NAME ex > srvctl modify instance -d RAC2DB -i RAC2DB1 -s +ASM1 ex > srvctl modify instance -d RAC2DB -i RAC2DB2 -s +ASM2 $crs_stat -p ora.rac2db.rac2db1.inst NAME=ora.RAC2DB.RAC2DB1.inst TYPE=application ACTION_SCRIPT=/oracle/product/10.2.0/bin/racgwrap ACTIVE_PLACEMENT=0 AUTO_START=1 CHECK_INTERVAL=600 DESCRIPTION=CRS application for Instance FAILOVER_DELAY=0 FAILURE_INTERVAL=0 FAILURE_THRESHOLD=0 HOSTING_MEMBERS=rac2db1 OPTIONAL_RESOURCES= PLACEMENT=restricted REQUIRED_RESOURCES=ora.rac2db1.ASM1.asm RESTART_ATTEMPTS=5 SCRIPT_TIMEOUT=600 START_TIMEOUT=0 2.5. 수동으로 CRS resource 등록하는방법 (GSD/VIP/ONS) #srvctl stop nodeapps n db1 #srvctl remove nodeapps n db1-17

#srvctl add nodeapps -n db1 -o $ORACLE_HOME A 61.250.123.60/255.255.255.0/eth0 #srvctl start nodeapps n db1 Listener $ORA_CRS_HOME/crs/public/ 위의위치에 ora.db1.listener_db1.lsnr.cap 파일이있어야합니다. ora.db1.listener_db1.lsnr.cap 파일의내용은 crs_stat p 출력중 listener 부분. 파일이없을경우는 vi 등으로만들어줍니다. listener 가있는반대쪽노드의결과물또는아무거나사용해도괜찮습니다.( 오라클경로및 vip 부분수정 ) $ crs_register listener CRS-0181: Cannot access the resource profile '/oracle/app/oracle/product/10.2/crs/crs/public/listener.cap'. 파일이없는경우에러가발생합니다!!! $ crs_register ora.db1.listener_db1.lsnr $ srvctl start listener n db1 3. OCR & VOTE 3.1. OCR Disk 의 Backup & Recovery OCR Disk Automatic Backup 수행 : $ORA_CRS_HOME/cdata/crs OCRCONFIG command 로 OCR Disk 의 backup, Recovery 등수행 OCR Disk Check command : # ocrcheck 3.2. Voting Disk 의 Backup & Recovery CRS Down 상태에서 Voting Disk Add/Delete 가능 # crsctl add css votedisk <new voting disk path> -force # crsctl delete css votedisk <old voting disk path> -force Recovery OCR Disk using Export Backup - 18

CRS Stop 후 OCR Disk Import # ocrconfig import /shared/export/ocrback.dmp Recovery OCR Disk using Automatic Backup CRS Stop 후 OCR Backup Restore # ocrconfig restore $ORA_CRS_HOME/cdata/crs/day.ocr Recovery OCR Disk using Physical Backup CRS Stop 후 dd backup 한 file 을 dd command 로 Restore. 3.2.1. OCR 과 VOTE 의장애시 RAC TO SINGLE 로운영하기. voting disk 에문제가생기면 RAC -> SINGLE 로변경이가능합니다. RAC 는 spfile 사용하므로미리 pfile 을만들어두도록합니다. create pfile=' 위치 ' from spfile 1. vi 로 init 파일열어서 cluster_database=true -> # 주석처리하거나, 또는 false 로변경한다. 2. stop db and crs rac2:/u01/app/oracle/product/10.2.0/crs_1/bin>srvctl stop database devdb PRKO-2002 : Invalid command line option: devdb rac2:/u01/app/oracle/product/10.2.0/crs_1/bin>srvctl stop database -d devdb rac2:/u01/app/oracle/product/10.2.0/crs_1/bin>./crs_stat -t 3. 1 번과 2 번노드에서 crs 도 stop [root@rac1 bin]#./crsctl stop crs Stopping resources. Successfully stopped CRS resources Stopping CSSD. - 19

Shutting down CSS daemon. Shutdown request successfully issued [root@rac2 bin]#./crsctl stop crs Stopping resources. Successfully stopped CRS resources Stopping CSSD. Shutting down CSS daemon. Shutdown request successfully issued. 4. RAC 기능을 off 하고잧컴파일합니다. rac1:/u01/app/oracle/product/10.2.0/db_1/bin>cp oracle oracle.rac rac1:/u01/app/oracle/product/10.2.0/db_1/rdbms/lib>make -f ins_rdbms.mk rac_off ioracle ( 중략 ) 5. 링크를다시합니다. rac1:/u01/app/oracle/product/10.2.0/db_1/bin>relink oracle 6. Root 스크립트를돌려줍니다. [root@rac1 install]#./rootdelete.sh Shutting down Oracle Cluster Ready Services (CRS): Stopping CSSD. Unable to communicate with the CSS daemon. Shutdown has begun. The daemons should exit soon. Checking to see if Oracle CRS stack is down... Oracle CRS stack is not running. Oracle CRS stack is down now. Removing script for Oracle Cluster Ready services Updating ocr file for downgrade Cleaning up SCR settings in '/etc/oracle/scls_scr' - 20

RAW DEVICE 를사용하는경우는 init parameter file 에서 cluster_database=false 로변경만해주면정상시작이됩니다. ASM 을사용하는경우는, CRS 의 CSSD 서비스가필요합니다. 상태 grep '^h' /etc/inittab cat /etc/oracle/ocr.loc RAC h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null ocrconfig_loc=/ocfs/clusterware/ocr local_only=false 현잧구성정보 backup 하기 ( 운영체제별로 backup 파일이틀립니다.) dd if=/ocfs/clusterware/ocr bs=1048576 of=ocr.bin ocrdump ocr.txt tar cvfp crs_local_only_false.tar /etc/oracle /etc/init.d/init.crs /etc/init.d/init.crsd /etc/init.d/init.cssd /etc/init.d/init.evmd /etc/rc0.d/k96init.crs /etc/rc1.d/k96init.crs /etc/rc2.d/k96init.crs /etc/rc3.d/s96init.crs /etc/rc4.d/k96init.crs /etc/rc5.d/s96init.crs /etc/rc6.d/k96init.crs /etc/inittab grep '^h[1-3]' /etc/inittab > inittab.crs 운영체제별필수 BACKUP LIST 운영체제 Linux HP-UX HP Tru64 BACKUP LIST /etc/init.d/init.cssd /etc/init.d/init.crs /etc/init.d/init.crsd /etc/init.d/init.evmd /etc/oracle /etc/inittab /sbin/init.d/init.cssd /sbin/init.d/init.crs /sbin/init.d/init.crsd /sbin/init.d/init.evmd /var/opt/oracle /etc/inittab /sbin/init.d/init.cssd /sbin/init.d/init.crs - 21

/sbin/init.d/init.crsd /sbin/init.d/init.evmd /var/opt/oracle /etc/inittab asm 을 single 로구성할때는 ' 라이트버젂 ' -> CSS 만이용할수있으면됩니다. [root@rac2 install]# ps -ef grep init. root 1 0 0 10:14? 00:00:01 init [5] root 2109 1 0 11:31? 00:00:00 /bin/sh /etc/init.d/init.cssd fatal root 2110 1 0 11:31? 00:00:00 /bin/sh /etc/init.d/init.evmd run root 2118 1 0 11:31? 00:00:00 /bin/sh /etc/init.d/init.crsd run root 2157 2110 0 11:31? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck root 2174 2109 0 11:31? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck root 2202 2118 0 11:31? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck root 5748 5055 0 11:32 pts/2 00:00:00 grep init. [root@rac2 install]# kill 2157 2174 2202 7. ASM 젂용 local CRS 를구성하자. <crs_home/bin/localconfig 이용 > [root@rac1 bin]#./localconfig add Successfully accumulated necessary OCR keys. Creating OCR keys for user 'root', privgrp 'root'.. Operation successful. Configuration for local CSS has been initialized Adding to inittab Startup will be queued to init within 90 seconds. Checking the status of new Oracle init process... Expecting the CRS daemons to be up within 600 seconds. CSS is active on these nodes. rac1 CSS is active on all nodes. Oracle CSS service is installed and running under init(1m) [root@rac1 bin]#./crsctl check css CSS appears healthy - 22

[root@rac1 bin]#./crsctl check crs CSS appears healthy Cannot communicate with CRS Cannot communicate with EVM [root@rac1 bin]# ps -ef grep d.bin oracle 10902 10887 0 10:46? 00:00:00 /u01/app/oracle/product/10.2.0/crs_1/bin/ocssd.bin [root@rac1 bin]# cat /etc/oracle/ocr.loc ocrconfig_loc=/u01/app/oracle/product/10.2.0/crs_1/cdata/localhost/local.ocr local_only=true [root@rac1 bin]# grep 'h[1-3]' /etc/inittab h1:35:respawn:/etc/init.d/init.cssd run >/dev/null 2>&1 </dev/null cp init+asm1.ora initasm1.ora.back strings init+asm1.ora grep -v cluster > pfile+asm1.ora rac1:/u01/app/oracle/product/10.2.0/db_1/dbs>export ORACLE_SID=devdb1 rac1:/u01/app/oracle/product/10.2.0/db_1/dbs>sqlplus "/as sysdba" SQL*Plus: Release 10.2.0.1.0 - Production on Thu Oct 28 10:56:09 2010 Copyright (c) 1982, 2005, Oracle. All rights reserved. Connected to an idle instance. SQL> startup ORACLE instance started. Total System Global Area 209715200 bytes Fixed Size 1218556 bytes Variable Size 104859652 bytes Database Buffers 100663296 bytes Redo Buffers 2973696 bytes Database mounted. Database opened. - 23

rac1:/u01/app/oracle/product/10.2.0/db_1/dbs>export ORACLE_SID=+ASM1 rac1:/u01/app/oracle/product/10.2.0/db_1/dbs>sqlplus "/as sysdba" SQL*Plus: Release 10.2.0.1.0 - Production on Thu Oct 28 10:50:28 2010 Copyright (c) 1982, 2005, Oracle. All rights reserved. Connected to an idle instance. pfile+asm1.ora 에는 cluster_database=true 가빠져야합니다. SQL> startup pfile=$oracle_home/dbs/pfile+asm1.ora ASM instance started Total System Global Area Fixed Size Variable Size ASM Cache ASM diskgroups mounted 83886080 bytes 1217836 bytes 57502420 bytes 25165824 bytes Thread 2 Disable SQL> alter database disable thread 2; Database altered. 8. VIP 변경. 9.1 LISTENER FILE 수정 listener 파일수정 (VIP 를 PUBLIC IP 로수정 ) LISTENER_RAC1 = (DESCRIPTION_LIST = (DESCRIPTION = (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1)) (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.2.133)(PORT = 1521)) ) ) - 24

SID_LIST_LISTENER_RAC1 = (SID_LIST = (SID_DESC = (SID_NAME = PLSExtProc) (ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1) (PROGRAM = extproc) ) ) 9.2 VIP 자체를수정 ( 위와같이 LISTENER 를수정또는아래와같이변경가능합니다.) Listener 가 Virtual IP 를바라보고있어서 LOCAL 을향하도록수정합니다. # /sbin/ifconfig eth0:1 IP netmask 255.255.255.0 up $ lsnrctl start racdb2 4.3. SINGLE TO RAC OCR, VOTE DISK 가복구가완료되면다시 RAC 로홖경을변경합니다. 9. voting disk 의복구가완료되면 single 을 RAC 로변경시킵니다. 3.2.2. SINGLE 인스턴스를 RAC 로변경하기. 1. 서비스모두중지. [root@rac2 ~]# [root@rac2 ~]# ps -ef grep ASM root 5129 4947 0 15:09 pts/2 00:00:00 grep ASM [root@rac2 ~]# ps -ef grep ORA root 5131 4947 0 15:09 pts/2 00:00:00 grep ORA [root@rac2 ~]# /etc/init.d/init.cssd stop Stopping CSSD. Shutting down CSS daemon. Shutdown request successfully issued. Shutdown has begun. The daemons should exit soon. ORACLE_HOME/bin/localconfig delete <== single ocssd 구성정보삭제, - 25

/etc/oracle/ocr.loc 등을삭제!! # mv /dev/raw/ocr1.dbf.old /dev/raw/ocr1.dbf <== RAC 용 ocr, vote Device 웎복!! # mv /dev/raw/ocr2.dbf.old /dev/raw/ocr2. 율 # mv /dev/raw/vote1.dbf.old /dev/raw/vote1.dbf #./rootconfig <== /etc/init.d/init.*, /etc/inittab, /etc/oracle/ocr.loc 등을생성 # ocrconfg -import [backup location] <== ocrconfig -export 로백업을한경우에만 복구!! ASM 기동 - cluster_database = true DB 기동 - cluster_database = true make -f ins_rdbms.mk rac_on make -f ins_rdbms.mk rac_on ioracle ( 중략..) /u01/app/oracle/product/10.2.0/db_1/lib//libcore10.a(lcd.o)(.text+0x8e8): In function `lcdprm': : warning: the `gets' function is dangerous and should not be used. mv -f /u01/app/oracle/product/10.2.0/db_1/bin/oracle /u01/app/oracle/product/10.2.0/db_1/bin/oracleo mv /u01/app/oracle/product/10.2.0/db_1/rdbms/lib/oracle /u01/app/oracle/product/10.2.0/db_1/bin/oracle chmod 6751 /u01/app/oracle/product/10.2.0/db_1/bin/oracle 웎복후에모든 Resource 는정상 Start 되며, Single 젂홖젂의상태로모두정상적으로 보입니다. ocssd 의기동과종료. [root@rac2 ~]# /etc/init.d/init.cssd stop Stopping CSSD. Shutting down CSS daemon. Shutdown request successfully issued. Shutdown has begun. The daemons should exit soon. - 26

4. 일반적장애사례 4.1. 잘못된 redo 사용으로인한 DB Crash Oracle 10gR2 에서 root 소유의 Device 를 redo 로지정시에러발생하지않으며인식합니다. 실제 Operation 이되는것은아니며, 추후에문제발생합니다. RAC 홖경의특성상한쪽노드뿐만이아니라다른쪽에도영향을미치게됩니다. - 27

4.2. 10G RAC LOG DIRECTORY 문제 Srvctl 구동시관렦로그디렉토리의 ownership 및권한문제로 nodeapps 및 instance 구동이안되는현상의경우입니다. Sqlplus> startup 은정상적으로구동됩니다. State 정보에 unknown 및 offline 로나올수있습니다. 관렦된모든작업이정상적으로된상태에서 srvctl 을사용해서 instance 올라오지않을때참고합니다. ORACLE_HOME 및 ORA_CRS_HOME 의아래의로그디렉토리도정상적인 ownership 및권한을알고있어야합니다. rac1:/oracle/app/oracle/product/10.2/db/log/rac1 위의디렉토리가아래와같이설정이되어있는경우 drwxr-xr-- 4 root root 4096 Apr 24 17:25 rac1 # srvctl start nodeapps -n rac1 CRS-0215: Could not start resource 'ora.c1.listener_dev.lsnr' rac1:/u01/app/oracle/product/10.2.0/crs_1/bin>./crs_stat -t Name Type Target State Host ------------------------------------------------------------ Ora C1.lsnr application ONLINE UNKWON rac1 ora.devdb.db application ONLINE ONLINE rac2-28

ora...b1.inst application ONLINE UNKOWN rac1 ora...b2.inst application ONLINE OFFLINE ora...sm1.asm application ONLINE ONLINE rac1 ora.rac1.gsd application ONLINE ONLINE rac1 ora.rac1.ons application ONLINE ONLINE rac1 ora.rac1.vip application ONLINE ONLINE rac1 Ora C1.lsnr application ONLINE UNKWON rac2 ora...sm2.asm application ONLINE ONLINE rac2 ora.rac2.gsd application ONLINE ONLINE rac2 ora.rac2.ons application ONLINE ONLINE rac2 ora.rac2.vip application ONLINE ONLINE rac2 # srvctl start instance d devdb I rac1 PRKP-1001:Error starting instance rac1 on node rac1 CRS-0215:Could not start resource ora.devdb.rac1.inst 5. VIP 문제발생. 5.1. VIP가올라오지않는상황. /etc/hosts 확인 Default GW 확인 부여된 Vip 및 Netmask 확인 Interface 상태확인 $ORA_CRS_HOME/bin/racgvip 내용중아래부분을확인합니다. _USR_ORA_DEBUG=1 기본값 0 이며주석처리되있습니다. CHECK_TIMES=2 To Increase 네트웍속도가느릴때입니다. DEFAULTGW= To DEFAULTGW=<Default GW> FAIL_WHEN_DEFAULTGW_NO_FOUND=1 (DEFAULT) FAIL_WHEN_DEFAULTGW_NO_FOUND=0 5.2. VIP 정보확인방법. Oracle VIP 는 $ORA_CRS_HOME/bin/racgvip 라는스크립트를통해그상태가모니터링됩니다. Racgvip 스크립트는옵션에따라 VIP 의상태모니터링뿐만아니라기동및다운, 생성, 삭제까지도가능합니다. #racgvip check - 29

Mon Apr 19 16:04:30 KST 2011[ 7000 ] Broadcast = 10.179.96.255 Mon Apr 19 16:04:30 KST 2011[ 7000 ] Checking interface existance Mon Apr 19 16:04:30 KST 2011[ 7000 ] Calling getifbyip Mon Apr 19 16:04:30 KST 2011[ 7000 ] getifbyip: started for 10.179.96.116 Mon Apr 19 16:04:30 KST 2011[ 7000 ] getifbyip: returning IP eth0 Mon Apr 19 16:04:30 KST 2011[ 7000 ] Completed getifbyip eth0 Mon Apr 19 16:04:30 KST 2011[ 7000 ] Calling getifbyip -a Mon Apr 19 16:04:30 KST 2011[ 7000 ] getifbyip: started for 10.179.96.116 Mon Apr 19 16:04:30 KST 2011[ 7000 ] getifbyip: returning IP eth0 Mon Apr 19 16:04:30 KST 2011[ 7000 ] Completed getifbyip eth0 Mon Apr 19 16:04:30 KST 2011[ 7000 ] ping_vip 10.179.96.116 started Mon Apr 19 16:04:30 KST 2011[ 7000 ] About to execute /bin/ping 10.179.96.116 -w 1 -c 1 Mon Apr 19 16:04:30 KST 2011[ 7000 ] About to execute /bin/ping 10.179.96.116 -w 1 -c 1 Mon Apr 19 16:04:30 KST 2011[ 7000 ] About to execute /bin/ping 10.179.96.116 -w 1 -c 1 Mon Apr 19 16:04:30 KST 2011[ 7000 ] About to execute /bin/ping 10.179.96.116 -w 1 -c 1 Mon Apr 19 16:04:30 KST 2011[ 7000 ] About to execute /bin/ping 10.179.96.116 -w 1 -c 1 Mon Apr 19 16:04:30 KST 2011[ 7000 ] About to execute /bin/ping 10.179.96.116 -w 1 -c 1 Mon Apr 19 16:04:30 KST 2011[ 7000 ] About to execute /bin/ping 10.179.96.116 -w 1 -c 1 Mon Apr 19 16:04:30 KST 2011[ 7000 ] About to execute /bin/ping 10.179.96.116 -w 1 -c 1 Mon Apr 19 16:04:30 KST 2011[ 7000 ] About to execute /bin/ping 10.179.96.116 -w 1 -c 1 Mon Apr 19 16:04:30 KST 2011[ 7000 ] About to execute /bin/ping 10.179.96.116 -w 1 -c 1 Mon Apr 19 16:04:30 KST 2011[ 7000 ] ping_vip 10.179.96.116 exit IP:10.179.96.116 is already up in the network (host= devdb.yj) [/ora_crs/bin]# oifcfg -help Name: oifcfg - Oracle Interface Configuration Tool. Usage: oifcfg iflist [-p [-n]] oifcfg setif {-node <nodename> -global} {<if_name>/<subnet>:<if_type>}... oifcfg getif [-node <nodename> -global] [ -if <if_name>[/<subnet>] [-type <if_type>] ] oifcfg delif [-node <nodename> -global] [<if_name>[/<subnet>]] oifcfg [-help] - 30

<nodename> - name of the host, as known to a communications network <if_name> - name by which the interface is configured in the system <subnet> - subnet address of the interface <if_type> - type of the interface { cluster_interconnect public storage } 다음명령은현잧서버에등록된모든 NIC 정보를보여줍니다. [/ora_crs/bin]# oifcfg iflist -p -n lan2 12.37.0.0 UNKNOWN 255.255.0.0 lan1 12.55.15.0 UNKNOWN 255.255.255.0 lan0 193.55.15.0 UNKNOWN 255.255.255.0 lan900 192.55.15.0 UNKNOWN 255.255.255.0 실제 RAC 가사용하는정보는다음과같이확인합니다. [/ora_crs/bin]# oifcfg getif lan0 193.55.15.0 global public lan900 192.55.15.0 global cluster_interconnect 5.3. Interconnect MTU size interconnect 에대한 MTU size 가양노드틀린경우 CRSD 데몬이 hang 혹은 CRS 기동이안되는경우입니다. Switch 갂의 MTU size 는 Node 별 NIC(Interface) 에설정된 MTU size 와일치시켜야합니다.. < 권고 > interconnect 의 MTU size 를최대 9000 으로늘릴것을권장합니다. 관렦귺거 Note 300956.1 Ora-27302: sskgxpsnd1 - Starting Instance Note 468563.1 Crs Hangs Last Entry In Crsd Log Is "Writing Software Version To Ocr" Note 300388.1 RAC Instances Unable To Start If MTU size Is Different On Cluster interconnectinterfaces - 31

Bug 6185421 CRSD ON SECOND NODE OF TWO NODE RAC HANGS Network Turning Parameter by ORACLE Network Tuning Parameter Network Tuning Parameter Recommended Value tcp_recvspace 262,144 tcp_sendspace 262,144 udp_recvspace udp_sendspace parameter * (4~10) udp_sendspace (DB_BLOCK_SIZE*DB_MULTIBLOCK_READ_COUNT)+4 KB 6. ISSUE PLATFORM 11G 이단웎에서는현잧 11GR2 RAC 로운영중인사이트에서의이슈사항들을소개합니다. 아직보편화되어운영되고있지는않지만, 일반적으로어떤문제들이발생했는지를확인하고, 동일문제가발생시에대처가가능합니다. 모든 TROUBLE 을확인할수는없지만, 이미발생가능했던문제들을짚어봄으로써, 예방이가능합니다. ( 하위의내용은기존에 11g 로구성되어있는포털사이트의공유문서의내용으로작성되었습니다.) 문제원인및요소 1.. 현상, 2. 분석. 3. 솔루션 ( 해결방법 ) DRM (Dynamic Resource Mastering) 1. 현상잦은object Re-mastering으로인한부하발생. 발생플랫폼 : Solaris 2. 분석 Object re-mastering은drm기능으로인하여발생됨. 3. Solution DRM Disable로해결함.-DRM Disable 방법 _gc_policy_time=0_gc_undo_affinity=false 적용플랫폼 : Solaris, AIX DRM (Dynamic Resource Mastering), ORA- 600 1. 현상 INSTANCE CRASHED DUE TO ORA-600 [KJBMPRLST:SHADOW] ON LMS 발생플랫폼 : AIX 2. 분석 ORA-600 [kjbmprlst:shadow] 에러로발생한LMS PROCESS 종료로인한INSTANCECRASH의경우DRM 기능과관렦하여Oracle 내부적으로사용하는pkey값의mismatch로인해발생 Bug 9882910-32

3. Solution Patch 9458781 적용적용플랫폼 : AIX enq: FB - contention, enq: TX -index contention1. 1. 현상 enq: FB -contention, enq: TX -index contention 등이오래지속되는문제발생. 발생플랫폼 : Solaris 2. 분석 Enqholder session을trace걸어서확인하면gc관렦된event들이많이발생되고있음. ASSM Segment 사용시transaction 상에서많은양의buffer gets와짧은시갂의 'gccurrent multi block request' wait이다수발생하는사례가있으며bug로인한patch 적용권고함. 3. Solution Patch 8735005도적용 DRM Disable로해결가능했음. 적용플랫폼 : Solaris Undo tablespacesizeissue 현상 Undo tablespace 가계속증가하는현상발생. 발생플랫폼 : Solaris 2. 분석 Undo 정보확인해보면unexpired된extents가많이보임. 3. Solution _highthreshold_undoretention값을10800으로적용.- 해당파라미터적용으로issue가계속발생시에는 _undo_autotune파라미터값을false로설정권고받음. 적용플랫폼 : Solaris, AIX resmgr:cpuquantum event 발생으로과도한 cpu 사용 1. 현상 Oracle wait event 중에 resmgr:cpuquantum event 발생하면서과도한 cpu 사용현상이발생함.- 과도한 cpu 사용으로인해 DB 젂체적으로 hang 에가까운상항이발생함. 발생플랫폼 : Solaris 2. 분석 "resmgr: cpuquantum" event 는resource manager 가enable 되어있고CPU 소비에관여하고있을때발생하는event 로, 젂적으로Resource Manager라는기능에의해발생하는wait event임. Session의resource 사용에대한기죾을수립하는job으로,session level로관리되므로session이많은db에서영향을받을수있음. 사용하지않는기능이면disable. - 33

3. Solution Resource Manager disable. Disable 방법 : Note 949033.11) set the current resource manager plan to nullalter system set resource_manager_plan= ;2) change the active windows to use the null resource manager plan-select * from dba_scheduler_windows; 로확인하여아래처럼disable함execute dbms_scheduler.set_attribute('<window_name>','resource_plan','');ex) execute dbms_scheduler.set_attribute('monday_window','resource_plan',''); 적용플랫폼 : Solaris cursor: mutexs 와library cache lock Wait event 발생으로과도한cpu 사용 1. 현상 cursor: mutexs 와library cache lock Wait event 발생하면서과도한cpu사용현상이발생함.- 과도한cpu사용으로인해DB젂체적으로hang에가까운상항이발생함. 발생플랫폼 : Solaris 2. 분석 cursor: mutexs 와library cache lock Wait event 가많이확인되고, 해당시점에수행된insert 쿼리들의version count가높은것으로확인됐으며, 해당내용을Oracle에서분석결과Bug로판명됨. 3. Solution Patch 9868850 apply. Mutex 관렦하여 performance 이슈가있는기타 Patch 도같이적용함.-patch 9801027 -MERGE REQUEST ON TOP OF 11.2.0.1.0 FOR BUGS 7441165 9499302 적용플랫폼 : Solaris enq:us-contention 1. 현상 enq: US -contention 대기이벤트가발생하면서 system 에부하가발생함. 발생플랫폼 : Solaris 2. 분석트랜잨션량이많아offline되어있던undo segments을online하는과정에서발생한것으로분석됨.- 9i까지는undo segments online정보를alert에서확인가능했지만11g부터는확인할수없음. 3. Solution Online된undo segments는offline되지않도록event설정 설정방법 (init parameter 에설정 )_rollback_segment_count=1000 segments 개수 event="10511 trace name context forever, level 1 open 시 online 될 undo Online 중에작업이필요할때는아래처럼작업.oradebugsetospid<smon_ospid>oradebugevent 10511 trace name context forever, level 1oradebugsetmypid 적용플랫폼 : Solaris _add_col_optim_ena 1. 현상 - 34

bled Add Column(default 값이있고, NOT NULL 제약조건이있는경우 ) 후 Query 에서 Wrong Result 가 Return 될수있다. 발생플랫폼 : 없음 2. 분석3. Solution Patch 9170308 apply. WORKAROUND: _add_col_optim_enabled=false 적용플랫폼 : Solaris, AIX 예방차웎에서선적용 Batching Nested Loops Join 1. 현상 11G R2 에서 Batching nested loop 조인시 physical read 할때와 logical read 할때의정렧순서가다르게보임. 발생플랫폼 : Solaris 2. 분석 Batching NLJ에서는Nested Loops Join의성능을극대화하기위해후행테이블에서데이터를순서대로읽지않고한꺼번에읽어들이는메커니즘을사용하는것으로보임. 그리고이메커니즘이Buffer Cache에데이터가존잧하는경우와존잧하지않는경우를처리하는방식이달라서발생하는것으로분석됨. SR짂행결과Oracle의공식입장은order by가들어가지않은문장에서의정렧은보장하지않음. 3. Solution 해결방안은아래3가지이며nhn은3안을적용했음.1 _nlj_batching_enabled파라미터값을0으로변경: Database가Oracle 10g의NLJ로돌아감.2 NO_NLJ_BATCHING(game) 힌트를부여 : session level별로oracle 10g의NLJ로돌아갑니다.3 _nlj_batching_misses_enabled파라미터값을0으로변경: Batching NLJ는그대로사용하면서Buffer Cache에존잧하지않는경우와존잧하는경우에동일한결과를보여줌 적용플랫폼 : Solaris, AIX ORA-04023 1. 현상 AWR Report 생성시 ORA-04023 에러가발생함. 발생플랫폼 : Solaris 2. 분석 ORA- 04023에러의경우SQL문parse시임계치인100번을시도하고도parse하지못할경우발생하며, 일반적인웎인은timestamp가valid하지않은object가존잧할때발생한다고함.(SR답변) 3. Solution Instance restart로해결됨. 문제가발생한object를compile하면해결됨.- 본이슈에서는DBA_HIST_SYS_TIME_MODEL,DBA_HIST_STAT_NAME를compile로해결가능함. - 35

AUDIT_TRAIL 1. 현상 System tablespacefull 발생 2. 분석 11G부터audit_traildefault 위치가DB로설정되어있어tablespacefull발생함. 사용하지않는기능으로기존처럼disk에최소한의data만쌓이도록설정. 3. Solution Init parameter에audit_trail='none 으로설정적용플랫폼 : Solaris, AIX public NIC IPMP 설정 (Solaris) 1. 현상 VIP로사용중인PUBLIC NIC이두개로이중화되어있을경우첫번째NIC이down되면두번째NIC으로정상적으로넘어가지않는문제VIP는두번째에설정되어있지만, CRS에서인식을하지못함. 적용플랫폼 : Solaris 2. 분석 PUBLIC NIC을두개모두인식할수있도록CRS설정이필요한데해당설정이되어있지않았음. CRS가두개의NIC을인식하도록설정해도IPMP젂홖이되지않으므로, 반드시아래patch 적용해야함. 3. Solution Patch 9729439 apply or PSU2 apply.-patch 적용후아래와같이CRS설정. 설정방법 적용플랫폼 : Solaris cat /etc/netmasks-등록확인test_db:/oracle/grid] oifcfggetif-등록 (grid 계정에서작업한쪽node에서만등록하면됨 )test_db:/oracle/grid] oifcfgsetif-global ce1/10.24.3.64:public-확인test_db:/oracle/grid] srvctlconfignodeapps- 등록 (root유저로등록한쪽node에서만등록하면됨)/oracle/grid/bin/srvctlmodify nodeapps-s 10.24.3.64/255.255.255.192/"ce0 ce1" ORA-600 with DISTINCT & WITH clause1 현상 Dump in kkqfppdrv1 of a query containing WITH clause and distinct aggregates. 발생플랫폼 : AIX 2. 분석 Bug 9002336 3. Solution Patch 9002336 Workaround : _optimizer_distinct_agg_transform= false 적용플랫폼 : AIX - 36

INCONSISTENT RESULTS WITH HEAVY LOAD 1. 현상 WITH 구문을사용하는쿼리의결과가일정하지않음. 시스템사용량이높을때만발생 발생플랫폼 : AIX 2. 분석 Bug 9021724 Temp Table Tranformations을사용하는쿼리에서TEMP Table 처리하는로직에코드버그 3. Solution Patch 9021724 Workaround : _serial_direct_read= true performance impact 때문에실적용불가 적용플랫폼 : AIX Wrong results from push predicate on ANSI SQL 1. 현상 WITH 구문을사용하는일부쿼리에서쿼리결과가잘못나옴발생플랫폼 : AIX 2. 분석 Bug 9713012 Wrong results are possible from queries using multiple ANSI syntax outer joined views if join predicate push occurs. 3. Solution Patch 9713012 Workaround : _optimizer_push_pred_cost_based= false 적용플랫폼 : AIX server process owner 1. 현상 DB설정이동일한데, AIX machine, 과Solaris machine의server process owner가서로다르게보임.-aix는oragrid, SUN은oracle으로보여주고있음. 2. 분석 Oracle process가oragrid유저로생성후중갂에setuid() 로oracle로변경을하면서보여주는os툴에따라서real UID를보여주는것과effective UID를보여주는차이로확인됨.-top, prstat: real UID로보여줌-ps: effective UID로보여줌. Login 패스워드대소문자구 1. 현상 11G 부터패스워드에대한대소문자를구분함. - 37

분 2. 분석 기존버젂처럼대소문자구분없이사용하려면아래와같이 disable 하면됨 - sec_case_sensitive_logon= false; Query Result Cache 1. 기능설명 반복되는 query 에대한응답속도를개선하기위해 Oracle 11g 에서는메모리영역에 result cache 를이용하는기능을선보였다. Result Cache는shared pool에result Cache Memory로불리는영역에SQL 및PL/SQL funtion의결과를저장하는것이다. 특정query가반복적으로수행될때이결과를캐시하여그다음부터는해당query를다시execute하는것이아니라캐시메모리에저장된결과값을그대로가지고오게된다. 이경우db부하뿐만이아니라응답속도도상당히빨라질것을기대할수있다. 이result cache는모든세션들을위해공유가되며이저장된결과는해당query가접귺하는object가변경될때inv alid된다. 이result cache 그자체는인스턴스마다각각저장되지만그사용면에있어서는Database 젂체를커버한다. 2. 분석이기능을사용하지않으려면RESULT_CACHE_MAX_SIZE 파라메터를0으로세팅하고instance를기동하면된다. - 38