Wednesday, March 27, 2019

OS-Level Stalls, NAS and LD_LIBRARY_PATH


Although there are many posts about the evils of using LD_LIBRARY_PATH, at times it's required when working with Oracle, specifically with Oracle GoldenGate (OGG).  Unfortunately, this variable's definition came back to bite us due to an oversight during some NAS changes.

The situation was on a 2-node RAC running 11.2.0.4 on RHEL 6.x.  The symptom was poor interactive performance, where commands would take 3 minutes or so to complete on a seemingly inconsistent basis.  Even logging onto the servers involved long waits.  We reviewed all sorts of metrics and statistics and nothing yielded evidence of the problem.  We involved our SysAdmin team and they noted that no other OS account was experiencing this behavior, only "oracle".

Since our team was out of ideas, we started one session that looped continuously, running a "date" command every 5 seconds.  In a second session we ran "strace -fp" against the first session.  What we found was that every 10 to 12 loops (50 to 60 seconds) the first session would hang on:

open("/mnt/auto/ogg/GRIFFIN/tls/x86_64/librt.so.1", O_RDONLY

Checking the "oracle" environment I found that LD_LIBRARY_PATH was set to include the path "/mnt/auto/ogg/GRIFFIN/tls/x86_64", which had previously been used for OGG.  That path, which was to a NAS, was no longer being used and the NAS mount had gone stale, which accounts for the hangs.  After removing the invalid NAS path from LD_LIBRARY_PATH in ~oracle/.bash_profile, all interactive performance issue disappeared.

Friday, February 22, 2019

OEM Target Discovery and One Potential Source of Duplicates

With OEM 12c came a feature that I've come to appreciate: the ability to discover all targets on a given environment.  Part of my appreciation is the removal of the mundane; why should I tell and define for OEM what exists on an environment when I can make it discover this information for itself?  But of course, automation also has a way of highlighting our own mistakes, or our misunderstanding of our own assumptions.

We recently needed to discover a 2-node database, which was a copy of a separate database.  Simple enough.  The new database was "orcldb" with instances "orcldb1" and "orcldb2" on nodes 1 and 2, respectively.  Yet the discovery results in OEM showed "orcldb" and "orcldb2" on node2, but "ORCLDB.us.dnb.com" and "ORCLDB.us.dnb.com_orcldb1" on node1.  So not only two separate databases, but also a difference in character case, along with the domain being included on node1.

A bit of knowledge about what's involved in the discovery process helped identify the problem.  When monitoring the discovery process I noticed that a process was forked with the following command:

$AGENT_HOME/perl/bin/perl $AGENT_BASE/plugins/oracle.sysman.db.discovery.plugin_12.1.0.7.0/discover/oracledb.pl
  $AGENT_HOME <host being discovered>

Also, not only is some information written to $AGENT_INSTANCE_HOME/sysman/log/emagent_perl.trc, but much more detail is captured in $AGENT_INSTANCE_HOME/sysman/emd/state/fetchlet_state/Discovery/oracle.sysman.db_discover_oracledb.pl.trc.  A search on this trace file led me to the following:

oracledb.pl: Fri Feb 15 17:57:48 2019:  DB_LISTENER_DISCOVERY:  SID=orcldb excluded from discovery as it does not have a
running pmon process, "/u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/lkORCLDB file or initialization file.
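
For reference, a quick way to pull lines like that out of the discovery trace (adjust the trace path to your agent release):

grep -i "excluded from discovery" \
  $AGENT_INSTANCE_HOME/sysman/emd/state/fetchlet_state/Discovery/oracle.sysman.db_discover_oracledb.pl.trc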

While I knew the database had both instances open, I noticed on node1, under $ORACLE_HOME/dbs, there was a leftover init.ora file from configuring the new database - initorcldb1.ora.  This file was a full "pfile" and included "CLUSTER_DATABASE=FALSE".  Bingo!

I removed the init.ora file (and made sure nothing similar was on node2 for this database), then validated that "Spfile" was set properly in the OCR.  I then removed the discovered targets in OEM and ran discovery again, and this time the discovered targets were exactly what I expected.
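
A rough sketch of that cleanup and validation on node1 (paths and names as referenced above):

# remove the leftover full pfile from the database build
rm /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/initorcldb1.ora

# confirm the OCR still points at the real spfile in ASM
srvctl config database -d orcldb | grep -i spfile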

One more item has been added to my notes on sources for the discovery process - $ORACLE_HOME/dbs/init*!

Friday, November 9, 2018

ORA-01565, ORA-17503 and ORA-15173, with a Little Investigation

Although the errors seem like typical ORA- errors, this situation was rather weird and took a bit of investigation to track down all parties involved.

Every day at 11:44 and 16:16 we would receive errors from ASM on server03 like the following:

ORA-15173: entry 'dbprd3' does not exist in directory '/'

Besides the error not giving a lot of detail, we found it weird that database "dbprd3" was involved, as it hadn't existed on this server in a long time.

Since the problem would happen at specific times daily, it seemed likely that the cause of this error was related to a scheduled job of some kind.  But the "oracle" account is locked down on this server, which means "oracle" can't use cron.  No OEM jobs were listed as having run during the times the errors were generated, plus the target "dbprd3" doesn't exist in OEM.

After performing a few searches on server03 for files with "dbprd3" in the name (sketched below), trace files were found under $ORACLE_HOME/rdbms/log, named dbprd3_ora_[PID], each listing only one error line:

KGFCHECK kgfnStmtExecute01c: ret == OCI_SUCCESS: FAILED at kgfn.c:1563
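
The search itself amounted to something like this (a sketch; widen the starting directory as needed):

# look for anything still carrying the old database name
find $ORACLE_HOME -name '*dbprd3*' 2>/dev/null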

Going back to OEM, it was found that "dbprd3" was still listed as a discovered yet unpromoted target on server03.  Deleting this target (since it doesn't exist) and running an immediate discovery process generated the errors in question.  The easy fix at this point was to disable auto-discovery, and in fact it should be disabled for all our servers, since we never need it unless something new is added.

Next came the question of why the discovery process would generate the errors.  Since the discovery process is handled through the management agent, emagent_perl.trc under $AGENT_INSTANCE_HOME/sysman/log was checked for details, which revealed the following:

oracledb.pl: Wed Nov  7 16:16:55 2018: ERROR:  initParameterFileUtl::convertSPFileToPFile: Failed to convert spfile +DATA/dbprd3/spfiledbprd3.ora to pfile using sqlplus: Inappropriate ioctl for device 0 The output of sqlplus is "CREATE PFILE='/tmp/r3VRTW7BT5' FROM SPFILE='+DATA/dbprd3/spfiledbprd3.ora'
*
ERROR at line 1:
ORA-01565: error in identifying file '+DATA/dbprd3/spfiledbprd3.ora'
ORA-17503: ksfdopn:10 Failed to open file +DATA/dbprd3/spfiledbprd3.ora
ORA-15056: additional error message
ORA-15173: entry 'dbprd3' does not exist in directory '/'
ORA-06512: at line 4

The discovery process was finding database "dbprd3" and then trying to gather more information about it, which failed since the database had been removed previously.  /etc/oratab was checked, as that's one of the files the discovery process reads, and sure enough "dbprd3" was still listed.

The final clue came from the detail in emagent_perl.trc - "oracledb.pl: Wed Nov  7 16:16:55 2018: ERROR:  initParameterFileUtl::convertSPFileToPFile".  The discovery process runs several Perl scripts under $AGENT_BASE/plugins/oracle.sysman.db.discovery.plugin_[RELEASE]/discover, oracledb.pl being one of the main ones.  This in turn calls ./utl/initParameterFileUtl.pl, which generated the error in subroutine "convertSPFileToPFile".  It seems the discovery process tries to gather additional details about each database it finds from that database's init.ora file.  The subroutine tries to generate a text pfile from the spfile, and the spfile location, in our case, is determined by the init.ora file under $ORACLE_HOME/dbs.  Sure enough, that file was still present for "dbprd3" and had one line:

SPFILE='+DATA/dbprd3/spfiledbprd3.ora'
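
To confirm the chain, the same conversion the subroutine attempts can be reproduced by hand from the same environment, and with the ASM directory gone it fails with exactly the errors shown above (a sketch; the temporary pfile name is arbitrary):

sqlplus -s "/ as sysdba" <<'EOF'
CREATE PFILE='/tmp/pfile_test.ora' FROM SPFILE='+DATA/dbprd3/spfiledbprd3.ora';
EOF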

When the database "dbprd3" was removed in the past, so was all its ASM files, including the spfile.  By disabling auto-discovery, then removing the entry in /etc/oratab and removing the init file, the problem was eliminated for good, even if someone manually executes the discovery process later.