Saturday, May 26, 2012

A Buggy Release Leads to Lazy Debugging and ORA-15041s


The longer you support something, the more likely you'll run into it - a buggy release on which all weird issues are blamed.  I'm currently supporting a number of installs on Oracle 10.2.0.2, and yes, this release has a lot of bugs.  And yes, it's gotten to the point where anything that isn't easily explained is just blamed on 10.2.0.2.

One of the problems with this situation is that it leads DBAs to get a bit lazy when investigating an error.  Just because you can't immediately explain a problem doesn't mean the release is to blame.  Take a recent example.  We hit a spurious "ORA-15041: diskgroup space exhausted" error during weekend processing.  A quick check showed the DBFLASH diskgroup (the one that received the error) was at 10% free, leaving nearly 700 GB available for archivelogs.  All main processing was done and the error didn't occur again, so the conclusion was that 10.2.0.2 had thrown a bogus error and the database was fine.
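For context, the "quick check" was nothing more than a look at diskgroup-level free space - something along these lines (a sketch only; the diskgroup name DBFLASH is from our environment):

    -- Diskgroup-level free space: the view that made everything look fine.
    -- Run against the ASM instance.
    SELECT name,
           total_mb,
           free_mb,
           ROUND(free_mb / total_mb * 100, 1) AS pct_free
    FROM   v$asm_diskgroup
    WHERE  name = 'DBFLASH';

And at that level, everything did look fine - which is exactly why this check alone wasn't enough.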

But ORA-15041 isn't typically a bogus error.  You either have enough space or you don't, and if you're getting this error yet you still show free space, then it may be time to panic.  In this case further investigation into the ASM alert logs revealed the underlying issue - inconsistent LUN sizes within the diskgroup.  One of the ASM alert logs had generated messages similar to "WARNING: allocation failure on disk ASMxxx for file 3660 xnum 24" at the same time as the ORA-15041 errors.  It turns out the LUN ASMxxx was half the expected size compared to all the other LUNs in the diskgroup.  Because ASM spreads file extents across all the disks in a diskgroup, one undersized (and therefore full) disk can fail an allocation even while the diskgroup as a whole reports plenty of free space.  This issue snuck through our regular safeguards because a mistake was made creating its partition: only half the cylinders were assigned, and the system administration output from presenting the LUNs lists the LUN, not the actual partition, so the size on that report was what we expected.
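A per-disk view would have caught this much sooner.  Something like the following sketch (using the standard GROUP_NUMBER linkage between V$ASM_DISK and V$ASM_DISKGROUP) lists each disk's size and fullness, where an undersized or completely full disk stands out immediately even when diskgroup-level free space looks healthy:

    -- Per-disk sizes and fullness within each diskgroup.  A disk that is
    -- half the size of its peers, or sitting at ~100% used, is the kind
    -- of imbalance that produces ORA-15041 with "free" space remaining.
    SELECT dg.name   AS diskgroup,
           d.name    AS disk,
           d.path,
           d.total_mb,
           d.free_mb,
           ROUND((d.total_mb - d.free_mb) / d.total_mb * 100, 1) AS pct_used
    FROM   v$asm_disk d
           JOIN v$asm_diskgroup dg ON dg.group_number = d.group_number
    ORDER  BY dg.name, d.total_mb;

In our case, a query like this would have shown ASMxxx at half the TOTAL_MB of every other disk in DBFLASH, long before the weekend processing filled it up.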

To me this is a good example of the importance of understanding what your database is telling you.
