The other day, while reviewing an instance's alert log, I found an interesting message:

WARNING:Oracle instance running on a system with low open file descriptor limit.
Tune your system to increase this limit to avoid severe performance degradation.
A quick search of MOS pointed me to bug 5862719. It turns out that on our RAC system (10.2.0.2) on RHEL 4, the /etc/init.d/init.crsd file contains the command "ulimit -n unlimited", which fails and leaves the file descriptor limit at its default, which in our case is 1024.
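The failure itself is easy to reproduce by hand; a minimal sketch, assuming a shell on a similar Linux box:

#-- sketch: why "ulimit -n unlimited" leaves the limit at the default --#
ulimit -n unlimited   # fails: Linux won't accept "unlimited" for the open-files limit
ulimit -n             # soft limit is unchanged, e.g. the distribution default of 1024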
% ps -ef | grep smon_dchdb
oracle   20720     1  0 Oct31 ?        00:00:16 ora_smon_dchdb1
% grep "open files" /proc/20720/limits
Max open files            1024                 1024                 files
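The same check can be extended to every background process of the instance in one pass; a quick sketch, assuming /proc/<pid>/limits is available and pgrep is installed:

#-- sketch: report the open-files limits for each dchdb1 background process --#
for pid in $(pgrep -u oracle -f dchdb1)
do
  printf "%-8s %s\n" "$pid" "$(awk '/Max open files/ {print $4, $5}' /proc/$pid/limits)"
done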
The system recently crashed (which was why I was checking the alert log), so all of the auto-started Oracle processes inherited their ulimits from the init scripts.
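A quick way to see that inheritance is to compare the limit an auto-started process carries against what a fresh login shell gets; a hedged sketch, reusing the smon PID from above:

#-- sketch: compare the inherited limit with a fresh "oracle" login shell --#
awk '/Max open files/' /proc/20720/limits   # limit the auto-started process inherited
ulimit -n                                   # limit of the current login shell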
My point here isn't to re-document an Oracle bug but to emphasize the importance of periodically taking a step back and thinking things through. Is an FD limit of 1024 potentially hurting our system, performance-wise? If so, can I reproduce the problem elsewhere to prove the impact and justify an emergency fix?
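One hedged way to approach that second question on a non-production box is to lower the limit for a test session and drive the workload from it, so every child process inherits the reduced value (the 256 below is purely illustrative):

#-- sketch: run a test workload under an artificially low FD limit --#
ulimit -Sn 256   # shrink the soft limit for this shell only
ulimit -n        # confirm the new value before starting the test workload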
On Linux it's pretty easy to check open file counts for processes running under the same account. As "oracle" I issued the following:
#------------------------------------------------------------------------------------|
# Check all "oracle" processes associated with database "dchdb" and ASM, grab the PID
# and current command, then count how many files are opened and display the results.
#------------------------------------------------------------------------------------|
ps -f -u oracle | grep -E '(dchdb1|ASM)' | awk '{print $2" "$8}' | \
while read IN_PID IN_CMD
do
  echo "Files opened for PID $IN_PID $IN_CMD: `ls -1 /proc/${IN_PID}/fd 2>/dev/null | wc -l`"
done | sort -nr -k7,7 | head -6
#-- output --#
Files opened for PID 21810 ora_rbal_dchdb1: 28
Files opened for PID 21790 ora_asmb_dchdb1: 28
Files opened for PID 22080 ora_arc1_dchdb1: 27
Files opened for PID 22174 ora_qmnc_dchdb1: 26
Files opened for PID 22078 ora_arc0_dchdb1: 26
Files opened for PID 20859 ora_lck0_dchdb1: 24
No process has more than 28 files open, so a limit of 1024 seems pretty safe. Yet the database has a few hundred datafiles, so why isn't the number higher? This is where ASM comes into play: ASM bypasses the file system, so database "files" aren't managed through OS file descriptors, and for the most part FD limits have a different context when ASM is involved.
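A quick way to sanity-check that is to look at what the open descriptors actually point to. For the rbal process from the listing above (PID 21810), something along these lines should show mostly devices, sockets, and trace/log files rather than one entry per datafile:

#-- sketch: summarize the targets of rbal's open file descriptors --#
ls -l /proc/21810/fd 2>/dev/null | awk '/->/ {print $NF}' | sort | uniq -c | sort -nr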
The result? While it's generally not wise to leave code as-is when one or more of its commands fails on every run, in this case spending a bit of time to think things through allowed this change to be scheduled at a lower priority.