Saturday, May 26, 2012

A Buggy Release Leads to Lazy Debugging and ORA-15041s


The longer you support something, the more likely you are to run into it - a buggy release that gets blamed for every weird issue.  I'm currently supporting a number of installs on 10.2.0.2 and yes, this release has a lot of bugs.  And yes, it's gotten to the point where anything that isn't easily explained is just blamed on 10.2.0.2.

One of the problems with this situation is that it's leading DBAs to be a bit lazy when investigating an error or problem.  Just because you can't immediately explain the situation doesn't mean the release is to blame.  Take a recent situation.  We had a seemingly spurious "ORA-15041: diskgroup space exhausted" error during weekend processing.  A quick check showed that DBFLASH, the diskgroup that received the error, was at 10% free, leaving nearly 700 GB available for archivelogs.  All main processing was done and the error didn't occur again, so the conclusion was that 10.2.0.2 threw out a bogus error and the database was fine.

But ORA-15041 isn't typically a bogus error.  You either have enough space or you don't, and if you're getting this error while the diskgroup still shows free space, then it may be time to panic.  In this case further investigation into the ASM alert logs revealed the underlying issue - inconsistent LUN sizes within the diskgroup.  One of the ASM alert logs had generated messages similar to "WARNING: allocation failure on disk ASMxxx for file 3660 xnum 24" at the same time as the ORA-15041 errors.  It turns out the LUN ASMxxx was ½ the expected size, compared to all other LUNs in the diskgroup.  This issue snuck through our regular safeguards because a mistake was made creating its partition.  Only ½ the cylinders were assigned, and the system administration output from presenting the LUNs lists the LUN, not the actual partition, so the size on this report was what we expected.
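
One way to catch this kind of mismatch earlier is to compare the size of every disk in the diskgroup against its peers.  The following is only a sketch, not what we actually ran: it assumes the disk names and sizes in MB have already been written to a two-column text file (the NAME and TOTAL_MB columns of V$ASM_DISK would be one possible source), and the disk names and sizes shown are made up for illustration.

```shell
# check_disk_sizes FILE
# FILE holds one "disk_name size_mb" pair per line (hypothetical format).
# Prints a line for every disk whose size differs from the largest size seen.
check_disk_sizes() {
    awk '{ name[NR] = $1; size[NR] = $2; if ($2 > max) max = $2 }
         END {
             for (i = 1; i <= NR; i++)
                 if (size[i] != max)
                     printf "%s is %d MB, expected %d MB\n", name[i], size[i], max
         }' "$1"
}

# Made-up sample data: four disks, one of them half the size of the others.
disk_list=$(mktemp)
cat > "$disk_list" <<'EOF'
ASM001 102400
ASM002 102400
ASM003 51200
ASM004 102400
EOF

check_disk_sizes "$disk_list"
```

With the sample data above, only ASM003 is reported.  Comparing against the largest size is a simplification that assumes all disks in the diskgroup are meant to be identical.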

To me this is a good example of the importance of understanding what your database is telling you.

Friday, May 25, 2012

tar + find + "." = Too Many Files

I know this blog post has a somewhat strange title, especially for a DBA, but it's all related to an issue I ran into recently that was more difficult to resolve than it should have been, mostly due to time constraints.

I was working on an issue under Oracle 10gR2.  I had to upload all trace and log files under /bdump and /udump to MOS for help in analyzing the problem.  The system was a 4-node RAC and even with regular file cleanup jobs running nightly, those directories had hundreds of files each.  But I knew the specific time range and wanted all files and directories from it, so I just used "tar" with the file list generated from a "find" command using the "-mmin" argument.

Looking more closely after the first few tarballs were created, I found that ALL files were getting tar'ed each time, as if the "find" command's argument was being ignored.  I ran the "find" separately, which worked as expected, but when used as input for "tar" it gave me all files.

It turns out my problem was the "." directory.  I don't think twice about seeing the current directory (".") or the parent directory ("..") in listings, but they can obviously affect the output of commands.  As a simple example of what I ran into, let's say we have 5 files of 1KB, 2KB, ... 5KB in size and need to create a tarball of any over 2KB.

First, create files for the simple test:

for KB in 1 2 3 4 5
do
   dd if=/dev/zero of=${KB}kb_file.txt bs=1024 count=$KB
done

% ls -ltr
total 24
-rw-r--r--  1 oracle oinstall 5120 May 25 16:18 5kb_file.txt
-rw-r--r--  1 oracle oinstall 4096 May 25 16:18 4kb_file.txt
-rw-r--r--  1 oracle oinstall 3072 May 25 16:18 3kb_file.txt
-rw-r--r--  1 oracle oinstall 2048 May 25 16:18 2kb_file.txt
-rw-r--r--  1 oracle oinstall 1024 May 25 16:18 1kb_file.txt

Next, show that the "find" command gets what I want:

% find . -size +2049c -ls
 98960    4 drwxr-xr-x   2 oracle   oinstall     4096 May 25 16:18 .
 98971    4 -rw-r--r--   1 oracle   oinstall     3072 May 25 16:18 ./3kb_file.txt
 98972    4 -rw-r--r--   1 oracle   oinstall     4096 May 25 16:18 ./4kb_file.txt
 98973    8 -rw-r--r--   1 oracle   oinstall     5120 May 25 16:18 ./5kb_file.txt

And last, see how this works with "tar":

% tar -cvf 3kb_or_bigger.tar `find . -size +2049c -print`
./
./1kb_file.txt
./2kb_file.txt
./3kb_file.txt
./4kb_file.txt
./5kb_file.txt
tar: ./3kb_or_bigger.tar: file is the archive; not dumped
./3kb_file.txt
./4kb_file.txt
./5kb_file.txt

As can be seen, all files are in the tarball, along with a second set of just those that I really wanted.  What's happening is that "." itself matches the "find" test (the directory entry is 4096 bytes in the listing above) and is passed to "tar", which tells "tar" to pull all files from that directory.  Filtering on filename and/or file type in the "find" command would resolve this, but at the time I was taking every shortcut possible.  You can bet that I'll respect the "." directory more from now on!
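
For completeness, here is a minimal sketch of the file type filter mentioned above, rerunning the same test in a scratch directory.  It assumes a "find" that supports "-type f" and a "tar" like the ones used in this post; the scratch directory and the "workdir" variable are just for the demo.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the five test files (1KB through 5KB).
for KB in 1 2 3 4 5
do
   dd if=/dev/zero of=${KB}kb_file.txt bs=1024 count=$KB 2>/dev/null
done

# -type f restricts matches to regular files, so "." is never handed to tar.
tar -cvf 3kb_or_bigger.tar $(find . -type f -size +2049c -print)

# List the archive contents to confirm what actually went in.
tar -tf 3kb_or_bigger.tar
```

Only the 3KB, 4KB and 5KB files end up in the tarball this time.  The same "-type f" filter should apply to the original "-mmin" case, since there too it was the directory entry itself that dragged everything else in.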