Thursday, November 3, 2011

"ulimit", ASM, and Thinking Things Through


The other day while reviewing an instance's alert log I found an interesting message:
WARNING:Oracle instance running on a system with low open file descriptor
        limit. Tune your system to increase this limit to avoid
        severe performance degradation.

A quick search of MOS pointed me to bug 5862719.  It turns out that on our RAC system (10.2.0.2) on RHEL 4, the /etc/init.d/init.crsd file contains the command "ulimit -n unlimited", which fails on Linux and leaves the file descriptor limit at its default, 1024 in our case.
% ps -ef | grep smon_dchdb
oracle   20720     1  0 Oct31 ?        00:00:16 ora_smon_dchdb1

% grep "open files" /proc/20720/limits
Max open files            1024                 1024                 files
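
The failing command itself is easy to reproduce from any shell.  Linux won't accept "unlimited" for the open files limit, so the call errors out and the limit stays at whatever it already was (the error text below is typical of bash and may vary):
#-- Reproduce the command init.crsd attempts; it fails, leaving the limit alone
% ulimit -n unlimited
-bash: ulimit: open files: cannot modify limit: Operation not permitted

#-- The soft limit is untouched -- still the 1024 default on this system
% ulimit -n
1024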

The system had recently crashed (which was why I was checking the alert log), so all auto-started Oracle processes inherited their ulimits from the init scripts rather than from a login shell.
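That inheritance is the crux: the per-user values in /etc/security/limits.conf are applied by PAM at login time, so they never reach processes launched from init scripts at boot.  A side-by-side look makes it obvious (the 65536 is only an illustrative limits.conf value, not necessarily ours):
#-- Interactive "oracle" login shell -- limits.conf applied via PAM
#-- (65536 shown purely as an example value)
% ulimit -n
65536

#-- smon, auto-started by init at boot -- no PAM, so it kept whatever
#-- the failed ulimit call in init.crsd left behind
% grep "open files" /proc/20720/limits
Max open files            1024                 1024                 files
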
My point here isn't to re-document an Oracle bug, but to emphasize the importance of periodically taking a step back and thinking things through.  Is an FD limit of 1024 potentially hurting our system, performance-wise?  If so, can I duplicate the problem elsewhere to prove the impact and justify an emergency fix?
On Linux it's pretty easy to check on open file counts for processes running under the same account.  As "oracle" I issued the following:
#------------------------------------------------------------------------------------|
# Check all "oracle" processes associated with database "dchdb" and ASM, grab the PID
# and current command, then count how many files are opened and display the results.
#------------------------------------------------------------------------------------|
ps -f -u oracle | grep -E '(dchdb1|ASM)' | awk '{print $2" "$8}' | \
   while read IN_PID IN_CMD
   do
      echo "Files opened for PID $IN_PID $IN_CMD: `ls -1 /proc/${IN_PID}/fd 2>/dev/null | wc -l`"
   done | sort -nr -k7,7 | head -6

#-- output --#
Files opened for PID 21810 ora_rbal_dchdb1: 28
Files opened for PID 21790 ora_asmb_dchdb1: 28
Files opened for PID 22080 ora_arc1_dchdb1: 27
Files opened for PID 22174 ora_qmnc_dchdb1: 26
Files opened for PID 22078 ora_arc0_dchdb1: 26
Files opened for PID 20859 ora_lck0_dchdb1: 24

No process has more than 28 files open, so a limit of 1024 seems pretty safe.  Yet the database has a few hundred datafiles, so why isn't the number higher?  This is where ASM comes into play.  Because ASM bypasses the file system, database "files" aren't opened through the OS as individual file descriptors, so FD limits carry a different meaning when ASM is involved.
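
You can see this directly by checking what those descriptors actually point to.  For any of the PIDs in the listing above, the symlink targets are a mix of sockets, trace files, and (for the I/O-related processes) the ASM disk devices themselves, rather than one entry per datafile:
#-- Spot-check what one process really has open; the targets are devices,
#-- sockets, and trace/log files rather than hundreds of datafiles
% ls -l /proc/21790/fd

#-- Count just the descriptors that resolve to something under /dev
% ls -l /proc/21790/fd | grep -c '/dev/'
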
Result?  While I don't think it's wise to leave code in place when one or more of its commands fails every time it runs, in this case spending a bit of time to think things through allowed the change to be scheduled at a lower priority instead of being pushed through as an emergency.
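
For the record, the gist of the eventual fix is simply to stop asking for "unlimited" and give the init script an explicit, appropriately sized value, per whatever the MOS note for the bug prescribes.  Sketching the idea only -- the 65536 below is an example, not a recommendation:
#-- In /etc/init.d/init.crsd, the failing call
#--    ulimit -n unlimited
#-- gets replaced with a concrete limit, e.g.
ulimit -n 65536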