Sunday, November 29, 2009

UNDO Brain Damage

I "met" Mrs. Pedersen on the oracle-l mailing list. I can't remember what the topic was, but she supplied this detailed message full of stats and opinion. I immediately went out and looked for her blog, but couldn't find it. She didn't have one.

So I dropped her a line and offered her some space here.

I won't take (all) the credit, but she has since started her own blog, DBA Kevlar.


Automatic Undo can be one of those areas that make you go "hmmmm..." I am fully aware that the documentation that came with Oracle 9i new features for DBA's when Automatic Undo was introduced clearly claimed, "You won't receive another ORA-01555 again!" yet Oracle hasn't quite lived up to that promise in either 9i or 10g and I haven't had the chance to prove otherwise in 11g.

I had the opportunity to revisit my old friend, "ORA-01555" when one of our environments began to page a fellow DBA with the issue the other day. I became intrigued, as I often am by any database oddity and when I queried the amount of undo the process was consuming, it didn't add up with the amount of undo that was being consumed. In our environment, it is not uncommon for 800GB or more of undo to be utilized, but it became quickly apparent something was amiss when looking at the query below vs. the amount of undo allocation of 1.2TB that had been recently allocated by the DBA on call:
select distinct
vs.sid, vs.osuser,
vs.process,
vs.sql_id,
round((vu.undoblks*32768) / (1024*1024*1024)) GB,
vu.tuned_undoretention
from v$undostat vu, v$session vs
where vu.maxqueryid in vs.sql_id
group by
vs.sid,
vs.osuser,
vs.process,
vs.sql_id,
vu.undoblks,
vu.tuned_undoretention
order by vs.sid
TOTAL GB used from the query above: 45GB

The first aspect I chose to take into consideration is the "tuned undo retention", as many folks forget that the Oracle parameter UNDO_RETENTION can be over-ridden by the SMON if a process requires blocks from the undo to stay unexpired over the set retention period. I have found in my experience, that if there is a rollback issue with an existing process, inspecting the TUNED_UNDO_RETENTION is an important area to start as a DBA.

After verifying that there were over 7000 different maxquerid's in v$undostat, but that only 20% of them were returning when we were looking at active sessions, I queried to find out what queries have undo blocks set with expirations greater than the set time, (currently 43200 seconds) that were also not in active sessions.
select distinct 
vu.maxqueryid,
vu.maxquerylen,
vu.tuned_undoretention
from
v$undostat vu,
v$session vs
where vu.tuned_undoretention > 43200
and vu.maxqueryid not in vs.sql_id
group by
vu.maxqueryid,
vu.maxquerylen,
vu.tuned_undoretention;
One SQL_ID came up as the offender with the majority of undo blocks retained over the retention period and not in active sessions: 1zuumhmqc93zj

The maxqueryid we are looking for isn't active, but if we searched historical SQL, then we found our statement:
select sql_text
from DBA_HIST_SQLTEXT
where sql_id = '1zuumhmqc93aj';

select col1
from prod.Tbl1
where cnt = 1
Why would a select statement create undo blocks and how could it be using undo if it isn't a current running process and yet showing active undo blocks???

If the query is not active - what could it be?

Answer: PREVIOUS
select vs.sid
from v$undostat vu, v$session vs
where vu.max_queryid=’1zuumhmqc93aj’
and vu.maxqueryid = vs.prev_sql_id;

SID
---
817
We now have the SID currently running that is the cause, why worry about the SQL anymore? It's not the current SQL, but a previous sql_id in an active undo process in v$undostat, so at this point, we need to focus on the SID.

So what transpired here?

Upon inspecting SID 817, it was found to be a background session for Oracle two weeks back where someone had issued a kill session command on a poorly coded process! Oracle decided that the select, which was a full scan on a 59GB table, used a cursor into a for loop, was going to take just over two weeks to rollback the killed session! The process to rollback was so sublime, that it hadn't come up as a red flag on any monitoring script or tool, including OEM!

Since no actual data block changes had been performed, it was perfectly safe to stop the rollback and execute a session kill. To free the undo vs. allowing the kill command to continue, we selected the OS Session from V$SESSION for SID 817, proceeded to kill the process at the OS level and issued a reset on the undo tablespace to clear the remaining undo blocks that were active and/or unexpired:
begin
reset_undo_stat();
end;
By resetting the undo, we were able to free up the undo that had been tuned to be retained and were able to disregard the rollback on a select statement that had been killed.

Once we reset the undo, we promptly informed the developer that he/she was never to attempt to use a 59GB non-partitioned table without indexes for a cursor ever, ever again!! :)

Moral of this story
If you have significant undo retained by non-active, non-transactional processes, you should inspect them carefully to verify that they are valid utilization of your undo tablespace. If not, you could receive ORA-01555, snapshot too old errors on transactions that would normally complete successfully, causing incomplete transactions, more rollback occurrence and users asking you why their processes didn't finish.

BIO
Kellyn Pedersen is a Performance Tuning and Operations Database Administration Specialist with 10 years of DBA experience in Oracle, SQL Server and MySQL. She is a Database Administrator and Developer at I-Behavior in Louisville, CO, is busy writing articles, sharing tips on her own blog, DBA Kevlar, and will be presenting at RMOUG 2010 in Denver, CO.

She lives in Westminster, CO with her husband and children - who are wonderfully willing to put up with her strange fascination with Oracle's Cost Based Optimizer.


[ kellyn dot pedersen at gmail | Twitter | LinkedIn ]

No comments: