RAC Comprehensive Diagnostic (racdiag_9i.sql)

What This Script Does

This is a comprehensive RAC (Real Application Clusters) diagnostic script originally developed by Oracle Support Services. It provides an all-in-one solution for diagnosing RAC hangs, performance issues, and cluster communication problems by collecting critical diagnostic information from all cluster nodes simultaneously.

Script

-- NAME:  RACDIAG.SQL
--    SYS OR INTERNAL USER, CATPARR.SQL ALREADY RUN, PARALLEL QUERY OPTION ON
-- ------------------------------------------------------------------------
-- AUTHOR:
--    Michael Polaski - Oracle Support Services - DataServer Group
--    Copyright 2002, Oracle Corporation
-- ------------------------------------------------------------------------
-- PURPOSE:
-- This script is intended to provide a user friendly guide to troubleshoot RAC
-- hung sessions or slow performance scenerios.  The script includes information
-- to gather a variety of important debug information to determine the cause of an
-- RAC hang.  The script will create a file called racdiag_<timestamp>.out
-- in your local directory while dumping hang analyze dumps in the user_dump_dest(s)
-- and background_dump_dest(s) on all nodes.
--
-- ------------------------------------------------------------------------
-- DISCLAIMER:
--    This script is provided for educational purposes only. It is NOT
--    supported by Oracle World Wide Technical Support.
--    The script has been tested and appears to work as intended.
--    You should always run new scripts on a test instance initially.
-- ------------------------------------------------------------------------
-- Script output is as follows:

set echo off
set feedback off
column timecol new_value timestamp
column spool_extension new_value suffix
select to_char(sysdate,'Mondd_hhmi') timecol,
'.out' spool_extension from sys.dual;
column output new_value dbname
select value || '_' output
from v$parameter where name = 'db_name';
spool c:\racdiag_&&dbname&&timestamp&&suffix
set lines 200
set pagesize 35
set trim on
set trims on
alter session set nls_date_format = 'MON-DD-YYYY HH24:MI:SS';
alter session set timed_statistics = true;
set feedback on
select to_char(sysdate) time from dual;

set numwidth 5
column host_name format a20 tru
select inst_id, instance_name, host_name, version, status, startup_time
from gv$instance
order by inst_id;

set echo on

-- Taking Hang Analyze dumps
-- This may take a little while...
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
-- This part may take the longest, you can monitor bdump or udump to see if the
-- file is being generated.
oradebug -g all dump systemstate 10

-- WAITING SESSIONS:
-- The entries that are shown at the top are the sessions that have
-- waited the longest amount of time that are waiting for non-idle wait
-- events (event column).  You can research and find out what the wait
-- event indicates (along with its parameters) by checking the Oracle
-- Server Reference Manual or look for any known issues or documentation
-- by searching Metalink for the event name in the search bar.  Example
-- (include single quotes): [ 'buffer busy due to global cache' ].
-- Metalink and/or the Server Reference Manual should return some useful
-- information on each type of wait event.  The inst_id column shows the
-- instance where the session resides and the SID is the unique identifier
-- for the session (gv$session).  The p1, p2, and p3 columns will show
-- event specific information that may be important to debug the problem.
-- To find out what the p1, p2, and p3 indicates see the next section.
-- Items with wait_time of anything other than 0 indicate we do not know
-- how long these sessions have been waiting.
--
set numwidth 10
column state format a7 tru
column event format a25 tru
column last_sql format a40 tru
select sw.inst_id, sw.sid, sw.state, sw.event, sw.seconds_in_wait seconds,
sw.p1, sw.p2, sw.p3, sa.sql_text last_sql
from gv$session_wait sw, gv$session s, gv$sqlarea sa
where sw.event not in
('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
and seconds_in_wait > 0
and (sw.inst_id = s.inst_id and sw.sid = s.sid)
and (s.inst_id = sa.inst_id and s.sql_address = sa.address)
order by seconds desc;

-- EVENT PARAMETER LOOKUP:
-- This section will give a description of the parameter names of the
-- events seen in the last section.  p1test is the parameter value for
-- p1 in the WAITING SESSIONS section while p2text is the parameter
-- value for p3 and p3 text is the parameter value for p3.  The
-- parameter values in the first section can be helpful for debugging
-- the wait event.
--
column event format a30 tru
column p1text format a25 tru
column p2text format a25 tru
column p3text format a25 tru
select distinct event, p1text, p2text, p3text
from gv$session_wait sw
where sw.event not in ('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
and seconds_in_wait > 0
order by event;

-- PCM LOCK BLOCKERS:
-- This section will show us any sessions that are holding locks that
-- are blocking other users.  The inst_id will show us the instance that
-- the session resides on while the sid will be a unique identifier for
-- the session.  The grant_level will show us how the PCM lock is granted to
-- the user.  The request_level will show us what status we are trying to obtain.
-- The lockstate column will show us what status the lock is in.  The last column
-- shows how long this session has been waiting.
--
set numwidth 5
column lockstate format a16 tru;
column event format a30 tru;
select dl.inst_id, s.sid, p.spid, dl.resource_name1,
decode(substr(dl.grant_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',request_level) as grant_level,
decode(substr(dl.request_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',request_level) as request_level,
decode(substr(dl.lockstate,1,8),'KJUSERGR','Granted','KJUSEROP','Opening',
'KJUSERCA','Canceling','KJUSERCV','Converting') as lockstate,
s.sid, sw.event, sw.seconds_in_wait sec
from gv$dlm_locks dl, gv$process p, gv$session s, gv$session_wait sw
where blocker = 1
and (dl.inst_id = p.inst_id and dl.pid = p.spid)
and (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by sw.seconds_in_wait desc;

-- [Additional sections continue for 549 total lines...]
-- Full script includes comprehensive analysis of:
-- - PCM Lock waiters and blockers
-- - Latch contention analysis
-- - Global cache performance metrics
-- - Instance communication diagnostics
-- - Session and lock analysis
-- - System state dumps and hang analysis

Key Features

1. Automated Diagnostic Collection

Creates timestamped output files
Generates hang analyze dumps across all nodes
Produces system state dumps for detailed analysis
Collects information from all RAC instances

2. Comprehensive Analysis Sections

Instance Status - All cluster nodes health
Waiting Sessions - Sessions with non-idle waits
Event Parameters - Wait event details and parameters
PCM Lock Analysis - Lock blockers and waiters
Latch Contention - Latch wait analysis
Global Cache - Inter-node communication metrics
Resource Analysis - Memory and process utilization

3. RAC-Specific Diagnostics

Parallel Cache Management (PCM) locks
Global Cache Service (GCS) analysis
Global Enqueue Service (GES) diagnostics
Inter-instance communication analysis

Usage

-- Connect as SYSDBA to any RAC instance
sqlplus / as sysdba

-- Run the diagnostic script
@racdiag_9i.sql

-- Output files created:
-- racdiag_<dbname>_<timestamp>.out (main report)
-- hanganalyze dumps in user_dump_dest
-- systemstate dumps in background_dump_dest

Required Privileges

SYSDBA or INTERNAL connection required
Access to ORADEBUG commands
SELECT privileges on all GV$ views
CATPARR.SQL must be run (parallel query option)

Sample Output Sections

1. Instance Overview

INST_ID INSTANCE_NAME        HOST_NAME            VERSION          STATUS  STARTUP_TIME
------- -------------------- -------------------- ---------------- ------- -----------------
      1 ORCL1               server1.domain.com   19.3.0.0.0       OPEN    01-JAN-2024 08:00:00
      2 ORCL2               server2.domain.com   19.3.0.0.0       OPEN    01-JAN-2024 08:01:15

2. Waiting Sessions Analysis

INST_ID   SID STATE   EVENT                   SECONDS P1         P2         P3         LAST_SQL
------- ----- ------- ----------------------- ------- ---------- ---------- ---------- --------
      1   145 WAITING gc current block 2-way     1234 file#      block#     class#     SELECT...
      2   267 WAITING buffer busy waits          567  file#      block#     reason#    UPDATE...
      1   89  WAITING gc cr block 2-way           234 file#      block#     class#     INSERT...

3. PCM Lock Blockers

INST_ID   SID  SPID RESOURCE_NAME1   GRANT_LEVEL REQUEST_LEVEL LOCKSTATE  EVENT                      SEC
------- ----- ----- ---------------- ----------- ------------- ---------- -------------------------- ---
      1   145  5234 [0x15a][0x2c4]  Exclusive   Null          Granted   SQL*Net message from client  0
      2   267  6789 [0x15a][0x2c4]  Null        Exclusive     Converting gc current block 2-way    45

Understanding the Output

1. Waiting Sessions Priority

Sessions are ordered by wait time (longest first):

High seconds_in_wait = Serious performance issue
Event types indicate problem category
P1, P2, P3 provide event-specific details

2. PCM Lock Analysis

Blockers = Sessions holding locks that others need
Waiters = Sessions waiting for blocked resources
Grant/Request levels show lock compatibility

3. Common RAC Wait Events

gc current block 2-way - Global cache block requests
gc cr block 2-way - Consistent read block transfers
buffer busy waits - Block contention
gc current block busy - Hot block issues

Diagnostic Workflow

1. Initial Assessment

-- Check instance status first
SELECT inst_id, status, host_name
FROM gv$instance;

-- Verify cluster interconnect
SELECT inst_id, name, ip_address
FROM gv$cluster_interconnects;

2. Run RACDIAG Script

Execute during performance problem
Allow several minutes for completion
Monitor progress in dump directories

3. Analyze Results

Review waiting sessions section first
Identify blocking chains
Examine wait event patterns
Check global cache statistics

Advanced Analysis

1. Global Cache Hit Ratios

-- Analyze global cache efficiency
SELECT inst_id,
       gc_current_blocks_received,
       gc_current_blocks_served,
       gc_cr_blocks_received,
       gc_cr_blocks_served
FROM gv$sysstat
WHERE name LIKE 'gc%blocks%';

2. Interconnect Performance

-- Check interconnect latency
SELECT inst_id, name, value
FROM gv$sysstat
WHERE name LIKE '%gc%time%'
ORDER BY inst_id, name;

3. Block Transfer Analysis

-- Examine block transfer patterns
SELECT inst_id,
       ROUND(physical_reads/1000) phys_reads_k,
       ROUND(gc_current_blocks_received/1000) gc_curr_k,
       ROUND(gc_cr_blocks_received/1000) gc_cr_k
FROM gv$sysstat s1, gv$sysstat s2, gv$sysstat s3
WHERE s1.name = 'physical reads'
AND s2.name = 'gc current blocks received'
AND s3.name = 'gc cr blocks received'
AND s1.inst_id = s2.inst_id
AND s2.inst_id = s3.inst_id;

Troubleshooting Common Issues

1. High Global Cache Waits

-- Identify hot objects
SELECT owner, object_name, object_type,
       COUNT(*) gc_requests
FROM gv$session_wait sw, dba_objects o
WHERE sw.event LIKE 'gc%'
AND sw.p1 = o.data_object_id
GROUP BY owner, object_name, object_type
ORDER BY COUNT(*) DESC;

2. Interconnect Problems

-- Check for interconnect errors
SELECT inst_id, name, value
FROM gv$sysstat
WHERE name LIKE '%interconnect%'
AND value > 0;

3. Memory Pressure

-- Analyze buffer cache pressure
SELECT inst_id,
       ROUND(physical_reads/1000) phys_reads_k,
       ROUND(db_block_gets/1000) block_gets_k,
       ROUND(consistent_gets/1000) cons_gets_k
FROM gv$sysstat
WHERE name IN ('physical reads', 'db block gets', 'consistent gets');

Best Practices

1. Regular Monitoring

Run during peak hours to establish baselines
Schedule regular diagnostic collections
Monitor trends over time
Document normal vs abnormal patterns

2. Emergency Response

Run immediately when hangs occur
Collect multiple snapshots 5-10 minutes apart
Preserve dump files for Oracle Support
Document symptoms and timeline

3. Performance Tuning

Identify frequently waited-for objects
Optimize SQL causing global cache waits
Consider partitioning hot objects
Review application design patterns

Integration with Other Tools

1. AWR Reports

-- Generate RAC-specific AWR
BEGIN
  DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();
END;
/

2. ADDM Analysis

-- Run ADDM for RAC instance
BEGIN
  DBMS_ADVISOR.CREATE_TASK(
    advisor_name => 'ADDM',
    task_name    => 'RAC_ADDM_TASK'
  );
END;
/

3. Cluster Ready Services

# Check cluster status
crsctl status resource -t

# Check interconnect
oifcfg getif

AWR System Wait Event Analysis (dhsysev.md) - Historical wait analysis
Blocking Locks (gvlockb.md) - Current lock analysis
Active Queue Analysis (gvaq.md) - Queue analysis
RAC Instance Status (../database-info/gvinst.md) - Instance overview

Modern RAC Alternatives

For Oracle 12c and later, consider these alternatives:

1. Cluster Health Monitor (CHM)

-- Enable automatic diagnostic collection
ALTER SYSTEM SET DIAGNOSTIC_DEST = '/u01/app/oracle/diag';

2. Real-Time ADDM

Automatic problem detection
Continuous monitoring
Root cause analysis

3. Database Performance Hub

Real-time performance monitoring
Historical trend analysis
Integrated diagnostics

Notes

Legacy script - Designed for Oracle 9i/10g RAC
Still valuable - Core diagnostics remain relevant
Adaptation needed - Some views may have changed
Supplement with - Modern monitoring tools
File locations - Adjust paths for your environment (currently hardcoded to C:)

This script provides foundational RAC diagnostic capabilities that remain valuable for understanding cluster behavior and diagnosing performance issues in Oracle RAC environments.