The longer you support something, the more likely you'll run
into it - a buggy release on which all weird issues are blamed. I'm currently supporting a number of installs
on 10.2.0.2 and yes, this release has a lot of bugs. And yes, it's gotten to the point where
anything that isn't easily explained is just blamed on 10.2.0.2.
One of the problems with this situation is that it leads DBAs to be a bit lazy when investigating an error or problem. Just because you can't immediately explain the situation doesn't mean the release is to blame. Take a recent example. We had a spurious "ORA-15041: diskgroup space exhausted" error during weekend processing. A quick check showed DBFLASH (the diskgroup that received the error) was at 10% free, leaving nearly 700 GB available for archivelogs. All main processing was done and the error didn't occur again, so the conclusion was that 10.2.0.2 had thrown out a bogus error and the database was fine.
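For reference, that quick free-space check is nothing more than a query against the ASM instance. Something along these lines (a minimal sketch using the standard V$ASM_DISKGROUP view; the exact query we ran may have differed):

-- Per-diskgroup space summary, run from the ASM instance.
-- FREE_MB is raw free space; USABLE_FILE_MB allows for redundancy.
SELECT name,
       total_mb,
       free_mb,
       ROUND(free_mb / total_mb * 100, 1) AS pct_free,
       usable_file_mb
FROM   v$asm_diskgroup
ORDER  BY name;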
But ORA-15041 isn't typically a bogus error. You either have enough space or you don't, and if you're getting this error while you appear to have free space, then it may be time to panic. In this case, further investigation into the ASM alert logs revealed the underlying issue: inconsistent LUN sizes within the diskgroup. One of the ASM alert logs had generated messages similar to "WARNING: allocation failure on disk ASMxxx for file 3660 xnum 24" at the same time as the ORA-15041 errors. It turned out the LUN ASMxxx was half the expected size compared to all the other LUNs in the diskgroup. The issue snuck through our regular safeguards because a mistake was made creating its partition: only half the cylinders were assigned, and the system administration output from presenting the LUNs lists the LUN, not the actual partition, so the size on that report was what we expected.
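Inconsistent member sizes are easy to spot once you go looking for them. A query against V$ASM_DISK (again a sketch, not necessarily what we ran) lists each disk in each diskgroup with its size, and an undersized LUN stands out immediately:

-- Compare member disk sizes within each diskgroup; an outlier in
-- TOTAL_MB points to a LUN or partition that isn't the size you expect.
SELECT dg.name AS diskgroup,
       d.name  AS disk,
       d.path,
       d.total_mb,
       d.free_mb
FROM   v$asm_disk d
       JOIN v$asm_diskgroup dg ON dg.group_number = d.group_number
ORDER  BY dg.name, d.total_mb;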
To me, this is a good example of the importance of understanding what your database is telling you.