Thursday, October 4, 2012

Finally! A use for Enterprise Manager Target Properties with Jobs!


Okay, perhaps I'm overreacting, as I haven't been thinking hard about using Target Properties ("%%") within EM non-stop.  But periodically when I'd see them during the Job creation process, I'd wonder what good are they and will I ever find a use for them.

To explain how I've found a use for them, let me first share a few details about how we configure our environments within EM and how that relates to Job definitions.  We assign all environments to 1 of 2 Groups - Production or Non-Production.  From there all target types are lumped within the appropriate Group.  When we create EM Jobs we try our best to always have the Target refer to a Group and let the Target Type define what resources within the Group will be accessed.

This works great for most situations, except that there's duplication in some cases.  For example, OS Jobs running against a Target Type of "host" will pick up both hosts AND clusters, which means on a given cluster your Job will run once on each node plus once on the cluster's primary node.

In comes Target Properties.  Target Properties are listed on the right side of the screen under the "Parameters" tab when creating a Job in EM, with the list varying based on the Target Type of the Job.

The following is a capture of these Target Properties for a Target Type of "host":


Back to the problem at hand, to have clusters skipped we simply check the "%TargetType%" in the Command:

if [ "%TargetType%" = "host" ]; then . ~/.bash_profile; ; else exit 0; fi

Notice the "else exit 0".  This is needed to avoid clusters returning a non-zero status because they weren't picked by the if-test.

Friday, August 17, 2012

Speeding up "fuser"Activity


"fuser" is a great utility for Unix systems for determining who has a file opened or just if a file is opened by anyone.  Related to Oracle maintenance, its handy when performing any kind of cleanup of tracefiles or audit files.  Oracle dumps plenty of files under [a|b|u]dump yet as part of regular cleanup it's not wise to blindly remove these files.  If a file is still open removing it doesn't release the space.

The problem is "fuser" can also be very slow.  The point of the utility is to find all users who have a given file open, but that literally means check every process on the system to see if they have the associated file descriptor reserved, in DBA terms "do a full scan" of all processes.  On a system with thousands of processes and thousands of files to check, the impact can be rather large.

The good news is Oracle is nice enough to frequently provide the PID who owns a file within the filename itself.  Since these audit and tracefiles are exclusive to a PID this means you could grab the PID off the filename and check one spot to see if that process still has the file open, instead of checking all processes on the system.

As an example, the following shows an "fuser" method and an lookup-by-PID method for checking all audit files to see if any are open:

"fuser" Method
ls -1 $ORACLE_BASE/admin//adump/ora*.aud | while read AUD_FILE
do
   [[ -n "`fuser $AUD_FILE | cut -d':' -f1`" ]] && echo "File $AUD_FILE is open"
done

Lookup-by-PID Method
ls -1 $ORACLE_BASE/admin//adump/ora*.aud | while read AUD_FILE
do
   AUD_FILE_BASE=`basename $AUD_FILE .aud`
   AUD_FILE_PID=`echo $AUD_FILE_BASE | cut -d'_' -f2`
   if [ `ls -l /proc/$AUD_FILE_PID/fd 2>/dev/null | grep -c "$AUD_FILE"` -eq 1 ]; then
      echo "File $AUD_FILE is open"
   fi
done

The "Lookup" method takes a few extra lines of code but the performance is dramatic.  I tested this on a system with 2500+ active processes with a few thousand audit files, using "time" to compare the 2 methods. 

For "fuser":
real    8m3.663s
user    0m11.197s
sys     0m50.258s

For "Lookup -by-PID"
real    0m0.951s
user    0m0.272s
sys     0m0.895s

Not every system has as many files to check so I ran both tests against just 1 file, still with 2500+ active processes.  The difference was still 1000x - "fuser" = 0m3.144s, "Lookup-by-PID" = 0m0.003s.

For me the bottom line is that you can have a huge, positive impact on performance if you're willing to add a smidge of intelligence to your code.

Saturday, May 26, 2012

A Buggy Release Leads to Lazy Debugging and ORA-15041's


The longer you support something, the more likely you'll run into it - a buggy release on which all weird issues are blamed.  I'm currently supporting a number of installs on 10.2.0.2 and yes, this release has a lot of bugs.  And yes, it's gotten to the point where anything that isn't easily explained is just blamed on 10.2.0.2.

One of the problems with this situation is it's leading DBAs to be a bit lazy when investigating an error/problem.  Just because you can't immediately explain the situation doesn't mean the release is to blame.  Take a recent situation.  We had a spurious "ORA-15041: diskgroup space exhausted" error during weekend processing.  A quick check showed DBFLASH was at 10% free (which was the diskgroup that received the error), leaving nearly 700 GB available for archivelogs.  All main processing was done and the error didn't occur again, so the conclusion was 10.2.0.2 threw out a bogus error and the database is fine.

But, ORA-15041 aren't typically errors that are bogus.  You either have enough space or you don't and if you're getting this error yet you have free space, then it may be time to panic.  In this case further investigation into ASM alert logs revealed the underlying issue - inconsistent LUN sizes within the diskgroup.  One of the ASM alert logs had generated messages similar to "WARNING: allocation failure on disk ASMxxx for file 3660 xnum 24" at the same time of the ORA-15041 errors.  It turns out the LUN ASMxxx was ½ the expected size, compared to all other LUNs in the diskgroup.  This issue snuck through our regular safeguards because a mistake was made creating it's partition.  Only ½ the cylinders were assigned and system administration output from presenting the LUNs lists the LUN, not the actual partition, so the size on this report was what we expected.

To me this is a good example of the importance of understanding what your database is telling you.

Friday, May 25, 2012

tar + find + "." = too Many Files

I know this blog post has a somewhat strange title, especially for a DBA, but it's all related to an issue I ran into recently that was more difficult to resolve than it should have been, mostly due to time constraints.

I was working on an issue under Oracle 10gR2.  I had to load all trace and log files under /bdump and /udump to MOS for help in analyzing the problem.  The system was a 4-node RAC and even with regular file cleanup jobs running nightly those directories had hundreds of files each.  But, I knew a specific time range and wanted all files and directories, so I just used "tar" with the file list generated from a "find" command using the "-mmin" argument.

After looking more closely after the first few tarballs were created I found that ALL files were getting tar'ed each time, as if the "find" command's argument was being ignored.  I ran the "find" separately, which worked as expected, but when used as input for "tar" it gave me all files.

It turns out my problem was the "." directory.  I don't think twice about seeing the current directory (".") or the parent directory ("..") in listings, but they obviously can affect output of commands.  As a simple example of what I ran into, let's say we have 5 files of 1KB, 2KB, ... 5KB in size and need to create a tarball of any over 2KB.

First, create files for the simple test:

for KB in 1 2 3 4 5
do
   dd if=/dev/zero of=${KB}kb_file.txt bs=1024 count=$KB
done

% ls -ltr
total 24
-rw-r--r--  1 oracle oinstall 5120 May 25 16:18 5kb_file.txt
-rw-r--r--  1 oracle oinstall 4096 May 25 16:18 4kb_file.txt
-rw-r--r--  1 oracle oinstall 3072 May 25 16:18 3kb_file.txt
-rw-r--r--  1 oracle oinstall 2048 May 25 16:18 2kb_file.txt
-rw-r--r--  1 oracle oinstall 1024 May 25 16:18 1kb_file.txt

Next, show that the "find" command gets what I want:

% find . -size +2049c -ls
 98960    4 drwxr-xr-x   2 oracle   oinstall     4096 May 25 16:18 .
 98971    4 -rw-r--r--   1 oracle   oinstall     3072 May 25 16:18 ./3kb_file.txt
 98972    4 -rw-r--r--   1 oracle   oinstall     4096 May 25 16:18 ./4kb_file.txt
 98973    8 -rw-r--r--   1 oracle   oinstall     5120 May 25 16:18 ./5kb_file.txt

And last, see how this works with "tar":

% tar -cvf 3kb_or_bigger.tar `find . -size +2049c -print`
./
./1kb_file.txt
./2kb_file.txt
./3kb_file.txt
./4kb_file.txt
./5kb_file.txt
tar: ./3kb_or_bigger.tar: file is the archive; not dumped
./3kb_file.txt
./4kb_file.txt
./5kb_file.txt

As can be seen, all files are in the tarball, along with a second set of just those that I really wanted.  What's happening is "." is passed to "tar", which tells "tar" to pull all files from that directory.  Filtering on filename and/or file type in the "find" command would resolve this, but at the time I was taking every shortcut possible.  You can bet that I'll respect the "." directory more from now on!