Wednesday, March 27, 2019

OS-Level Stalls, NAS and LD_LIBRARY_PATH


Although there are many posts about the evils of using LD_LIBRARY_PATH, at times it's required when working with Oracle, specifically with Oracle GoldenGate (OGG).  Unfortunately, this variable's definition came back to bite us because of an oversight during NAS changes.

The situation was on a 2-node RAC, running 11.2.0.4 on RHEL 6.x.  The symptom was poor interactive performance, where commands would take 3 minutes or so to complete on a seemingly inconsistent basis.  Even logging onto the servers involved long waits.  We reviewed all sorts of metrics / statistics and nothing yielded evidence of the problem.  We involved our SysAdmin team and they noted that no other OS account was experiencing this behavior, only "oracle".

Since our team was out of ideas, we started one session that looped continually, running the "date" command every 5 seconds.  In a 2nd session we ran "strace -fp" against the 1st session.  What we found was that every 10 to 12 loops (50 to 60 secs) the 1st session would hang on:

open("/mnt/auto/ogg/GRIFFIN/tls/x86_64/librt.so.1", O_RDONLY
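
The two-session setup described above can be sketched roughly as follows. This is only an illustration, not the exact commands we ran: the function name probe_loop and its two arguments are made up for this example, and the -tt and -T options in the strace comment are additions that print timestamps and per-call durations.

```shell
# Session 1: run "date" repeatedly so that a fresh process is started
# (and its shared libraries are located and loaded) on every pass.
# probe_loop is a hypothetical helper name.
probe_loop() {
    count=${1:-0}        # number of passes; 0 means loop until interrupted
    interval=${2:-5}     # seconds to sleep between passes
    i=0
    echo "probe shell PID: $$"
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        date
        sleep "$interval"
        i=$((i + 1))
    done
}

# Session 2: attach to the PID printed above and follow child processes.
#   strace -f -tt -T -p <PID>
# A call that never returns shows up as an unfinished line, such as the
# open() shown above.
```

Calling probe_loop with no arguments loops until interrupted, which matches what we did. Passing a count, for example "probe_loop 3 1", is handy for a quick dry run.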

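Checking the "oracle" environment I found that LD_LIBRARY_PATH was set to include the path "/mnt/auto/ogg/GRIFFIN/tls/x86_64", which had previously been used for OGG.  That path, which pointed to a NAS, was no longer being used and the NAS mount had gone stale, which accounts for the hangs: the dynamic loader searches the LD_LIBRARY_PATH directories whenever a dynamically linked command starts, so any command could block on the dead mount.  It also explains why only the "oracle" account was affected, since the variable was set in that account's profile.  After removing the invalid NAS path from LD_LIBRARY_PATH in ~oracle/.bash_profile, all interactive performance issues disappeared.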
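
A check along these lines might have caught the problem sooner. It is a minimal sketch rather than something we ran at the time: check_lib_path is a hypothetical helper, the 5-second limit is an arbitrary choice, and it assumes the coreutils "timeout" command is available and that a stat blocked on a stale mount can still be killed.

```shell
# Hypothetical helper: report every LD_LIBRARY_PATH entry that is missing
# or does not answer within a time limit (for example, a stale NAS mount).
# The body runs in a subshell, so the IFS and "set -f" changes stay local.
check_lib_path() (
    path_list=${1-${LD_LIBRARY_PATH-}}   # default: the current environment
    wait_secs=${2:-5}                    # assumed per-directory time limit
    IFS=:                                # split the list on colons
    set -f                               # no glob expansion while splitting
    for dir in $path_list; do
        [ -n "$dir" ] || continue        # skip empty entries
        # stat has to reach the filesystem; SIGKILL is sent on timeout
        if timeout -s KILL "$wait_secs" stat -- "$dir" >/dev/null 2>&1; then
            printf 'OK   %s\n' "$dir"
        else
            printf 'BAD  %s\n' "$dir"
        fi
    done
)
```

Run with no arguments as the affected account, it checks that account's own LD_LIBRARY_PATH. Any line marked BAD is a candidate for removal from the profile.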
