distributed transactions troubleshooting – complete reference for the new oracle dba





If you are encountering distributed transaction errors on your production database then use this complete reference to troubleshoot the errors.

How to Troubleshoot Distributed Transactions [ID 100664.1]  

 
  Modified 23-FEB-2010     Type TROUBLESHOOTING     Status PUBLISHED  
       

In this Document
  Purpose
  Last Review Date
  Instructions for the Reader
  Troubleshooting Details
     Steps to Troubleshoot Distributed Transactions
     Syntax Notes
     SCN Recovery Steps
  References


Applies to:

Oracle Server – Enterprise Edition – Version: 8.1.5.0 to 10.2.0.1
Information in this document applies to any platform.
Checked for relevance on 13-June-2008 by Faye Todd.
Appearance updated, no changes to technical content 19-Feb-2007

Purpose

You need to resolve a distributed transaction error because of errors received.
But before you try to implement any other steps, please diagnose whether it is
or it is not a REAL distributed transaction.

Last Review Date

June 13, 2008

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Steps to Troubleshoot Distributed Transactions

1. Check the alert.log for errors, like:

ORA-1591 "lock held by in-doubt distributed transaction %s"
ORA-2062 "distributed recovery received dbid x, expected y"
ORA-2068 "following severe error from %s%s"

2. Make sure that the network is up and all db_links are valid and operational.
V$DBLINK AND GV$DBLINK show the db_links currently used by distributed
transactions:

 
SVRMGR> desc v$dblink 

Column Name Null? Type 
------------------------------ -------- ---- 
DB_LINK VARCHAR2(128) 
OWNER_ID NUMBER 
LOGGED_ON VARCHAR2(3) 
HETEROGENEOUS VARCHAR2(3) 
PROTOCOL VARCHAR2(6) 
OPEN_CURSORS NUMBER 
IN_TRANSACTION VARCHAR2(3) 
UPDATE_SENT VARCHAR2(3) 
COMMIT_POINT_STRENGTH NUMBER 

SVRMGR> desc gv$dblink 

Column Name Null? Type 
------------------------------ -------- ---- 
INST_ID NUMBER 
DB_LINK VARCHAR2(128) 
OWNER_ID NUMBER 
LOGGED_ON VARCHAR2(3) 
HETEROGENEOUS VARCHAR2(3) 
PROTOCOL VARCHAR2(6) 
OPEN_CURSORS NUMBER 
IN_TRANSACTION VARCHAR2(3) 
UPDATE_SENT VARCHAR2(3) 
COMMIT_POINT_STRENGTH NUMBER

3. Run the following query against DBA_2PC_PENDING:

SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID, STATE, MIXED, HOST, COMMIT#
FROM DBA_2PC_PENDING
WHERE LOCAL_TRAN_ID = '??.'; [Insert code here

LOCAL_TRAN_ID is the Local transactionID (number that is given when error
is reported). If LOCAL_TRAN_ID = GLOBAL_TRAN_ID, it means that this site
is the Global Coordinator, i.e. where the transaction originated from.

You can also get LOCAL_TRAN_ID from the alert.log.

4. Run the following query against DBA_2PC_NEIGHBORS view on all Nodes:

SELECT LOCAL_TRAN_ID, IN_OUT, DATABASE, INTERFACE
FROM DBA_2PC_NEIGHBORS;

5. Get the values for COMMIT_POINT_STRENGTH init.ora parameter on all Nodes.
If you are not aware of what other Nodes are participating in the distributed transaction, query the DBA_2PC_NEIGHBORS view. The Node with COMMIT_POINT_STRENGTH that has highest value in init.ora has to commit first.
6. Check the STATE column from DBA_2PC_PENDING.
If STATE column states COMMIT, it means the local database has successfully committed.

This also means you most likely do not have to force anything on that Node.
Or if other Nodes have not committed and this particular Node happens to be
a Global Coordinator – you need to note COMMIT# i.e. SCN number from
DBA_2PC_PENDING view, as this is THE SCN# you will need to use if you need
to perform COMMIT FORCE.
7. Take the GLOBAL_TRAN_ID and COMMIT# (SCN number from DBA_2PC_PENDING) and
compare it to the value on the other nodes.
On Nodes where there are no entries for this COMMIT# and transactionID in DBA_2PC_PENDING, RECO process has resolved the problem and we don’t have to do anything.

Note: RECO is always automatically started whenever Distributed Transactions are enabled.

If you issue a UNIX command : ps – ef | grep -i reco and see RECO running, distributed transactions are correctly enabled.

If you combine this output with values of the init.ora parameter DISTRIBUTED_TRANSACTONS (which should be higher than 0 for distributed environment), you may safely say that the customer is running in a distributed environment even if this particular customer has not set it up.

8. If the STATE (from DBA_2PC_PENDING view) is PREPARED, then this Node has not completed the transaction.

Take GLOBAL_TRAN_ID (from DBA_2PC_PENDING) and COMMIT# (SCN number) and compare it to the value on the other nodes.

You need to also check DBA_2PC_NEIGHBORS and see whether there are any other
Nodes that reference the same transactionID in their DBA_2PC_PENDING. If you
see it referenced, it means there are children and they need to be resolved
(either committed or rollback) as well. Please take ‘children’ i.e. Nodes
that other Nodes (not Global Coordinator) into account.

Note: for clarification on terminology please see Note 13229.1
9. If there are no children, it is safe to manually interfere and either commit force or rollback force.

For global integrity, use the same COMMIT# (SCN) when you force the transaction. To force the transaction please see SYNTAX NOTES below.
10. After the transaction is forced, make sure that both DBA_2PC_PENDING and DBA_2PC_NEIGHBORS data dictionary views are empty.

If for some reason it is necessary to clean those views, use packaged procedure DBMS_TRANSACTION.purge_lost_db_entry.

Note 1012842.102 ORA-2019 ORA-2058 ORA-2068 ORA-2050:
FAILED DISTRIBUTED TRANSACTIONS

_$#$_

Note: Be very careful with this procedure as it is always more reliable and more consistent to let RECO resolve distributed transaction conflicts. Please attempt to use DBMS_TRANSACTION.purge_lost_db_entry only as a
last resort after every other measure was implemented including restarting the database to resolve the locking conflict.
11. If even then there are entries in the views and errors still appear, please check for Bug 684157. This is fixed in 8.0.6 and higher.

Syntax Notes

COMMIT FORCE command 

Example of syntax where highest committed SCN is 88123887 (from dba_2pc_pending) and local transaction ID is 1.13.5197 (from either dba_2pc_pending or alert.log):

 
SVRMGR> COMMIT FORCE 'your local transactionID on this node', 'highest SCN from already committed site'; 

SVRMGR> COMMIT FORCE '1.13.5197', '88123887';

ROLLBACK FORCE command

Example of syntax where local transactionID is 1.13.5197 (from either dba_2pc_pending or alert.log):

SVRMGR> ROLLBACK FORCE 'your local transactionID on this node';
SVRMGR> ROLLBACK FORCE '1.13.5197';
 

PURGING VIEWS:

  • Example of syntax for lost local transactionID 1.13.5197 (from either dba_2pc_pending or alert.log). 
    NOTE: Run as SYS where 1.13.5197 is the local_tran_id
SVRMGR> Execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY ('1.13.5197'); 
COMMIT;
 
  • Example of syntax for mixed local transaction ID 1.13.5197 (from either dba_2pc_pending or alert.log).  Note, Run as SYS: 
SVRMGR> Execute DBMS_TRANSACTION.PURGE_MIXED ('1.13.5197'); 
COMMIT;
 

 Run PURGE_MIXED Procedure only if significant reconfiguration has occurred so that automatic recover (RECO) cannot resolve the transaction.

 
Example of appropriate cases when it can be used:

– Total loss of the remote database, or reconfiguration of software resulting in loss of two-phase commit capability.

NOTE: A ‘commit;’ should be issued after each DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY
and DBMS_TRANSACTION.PURGE_MIXED otherwise errors are given.

SCN Recovery Steps

Complete Recovery:

1. At the down site recover completely if possible (treat as regular recovery).

2. SCN will appear in the alert.log after recover is done on the crashed Node.

Incomplete Recovery:

1. If time-based or cancel-based recovery was used on the crashed node, other sites must be placed back to the same point in time for global consistency. Get last SCN from the alert.log of the crashed node.

2. At each node perform a shutdown normal or immediate.

3. Take a cold backup.

4. Restore the control file if necessary.

5. Restore the last backup of all datafiles along with archived redo logs.

6. Choose which tool to use to perform SCN recovery – either Server Manager line mode or RMAN.

7. Connect internal, startup mount, check status from v$datafile to make sure all datafiles are online. Issue ‘Alter database datafile ‘?/?/?’ online;’, for each datafile with status offline, to bring it online.

8. Issue the following command using the latest SCN from alert.log on the Node that had to be recovered:

RECOVER DATABASE UNTIL CHANGE '1.13.5197';

NOTE: If for some reason (e.g.when issuing commit force command) automatic recovery (RECO process) needs to be disabled, use the following command to put RECO to sleep:

 
ALTER SYSTEM DISABLE DISTRIBUTED RECOVERY;

Use this command to wake RECO up after that:

 
ALTER SYSTEM ENABLE DISTRIBUTED RECOVERY;

References

BUG:684157 – ORA-24756 RUNNING PLSQL PROGRAM FROM 8.0.4.1–>7.3.4.
NOTE:1012842.102 – ORA-2019 ORA-2058 ORA-2068 ORA-2050: Failed Distributed Transactions
NOTE:126069.1 – Manually Resolving In-Doubt Transactions: Different Scenarios
NOTE:13229.1 – Distributed Database, Transactions and Two Phase Commit

Related


Products


  • Oracle Database Products > Oracle Database > Oracle Database > Oracle Server – Enterprise Edition

Keywords


DISTRIBUTED DATABASE; PROPAGATION; STREAMS

Errors


ORA-2058; ORA-2019; ORA-2062; ORA-1591; ORA-2068; ORA-2015; ORA-2050

 

Back to top

Rate this document 

Article Rating

Rate this document
Excellent
Good
Poor
 
Did this document help you?
Yes
No
Just browsing
 
How easy was it to find this document?
Very easy
Somewhat easy
Not easy

 

  Comments

 
Cancel    

 

Author: admin