Friday, November 9, 2018

ORA-01565, ORA-17503 and ORA-15173, with a Little Investigation

Although these look like typical ORA- errors, the situation was rather odd and took a bit of investigation to track down all the parties involved.
Every day at 11:44 and 16:16 we would receive errors from ASM on server03 like the following:

ORA-15173: entry 'dbprd3' does not exist in directory '/'

Besides the error not giving a lot of detail, we found it odd that database "dbprd3" was involved at all, since it hadn't existed on this server in a long time.
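That part is easy to confirm straight from ASM; a quick check along these lines (a sketch, assuming the usual grid environment set via oraenv and the +DATA disk group from the error; the ASM instance name is also an assumption) shows the directory really is gone:

export ORACLE_SID=+ASM      # ASM instance name is an assumption (e.g. +ASM1 on RAC)
export ORAENV_ASK=NO
. oraenv
asmcmd ls +DATA/dbprd3      # returns something like: ASMCMD-8002: entry 'dbprd3' does not exist in directory '+DATA/'
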
Since the problem happened at the same times every day, it seemed likely that the cause was a scheduled job of some kind.  But the "oracle" account is locked down on this server, which means "oracle" can't use cron.  No OEM jobs were listed as having run at the times the errors were generated, and the target "dbprd3" doesn't exist in OEM.
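The cron angle is quick to rule out from the OS side as well (a sketch, run as root; the account name and cron locations are the usual defaults and are assumptions here):

crontab -l -u oracle                                 # locked-down account, so no crontab expected
grep -r dbprd3 /etc/crontab /etc/cron.d 2>/dev/null  # nothing system-wide either
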
After performing a few searches on server03 for files with "dbprd3" in the name, trace files were found under $ORACLE_HOME/rdbms/log, named dbprd3_ora_[PID], each containing only one error line:

KGFCHECK kgfnStmtExecute01c: ret == OCI_SUCCESS: FAILED at kgfn.c:1563
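
The search itself was nothing fancy; something along these lines turns the leftovers up (a sketch, limited to the locations that mattered here):

find $ORACLE_HOME /etc -name '*dbprd3*' 2>/dev/null   # any files still carrying the old database name
ls -ltr $ORACLE_HOME/rdbms/log/dbprd3_ora_*           # the trace files mentioned above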

Going back to OEM, it was found that "dbprd3" was still listed as a discovered but unpromoted target on server03.  Deleting this target (since it no longer exists) and running an immediate discovery process reproduced the errors in question.  The easy fix at this point was to disable auto-discovery; in fact, it should be disabled on all servers, since we never need it unless something new is added.
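If the target had been a promoted (managed) one, the same cleanup could have been scripted with emcli instead of done in the console; a rough sketch, assuming emcli is configured and the standard single-instance database target type, would be:

emcli login -username=sysman
emcli delete_target -name="dbprd3" -type="oracle_database"

Since this was only a discovered, unpromoted target, removing it from the console's discovery results was the simpler route.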
Next came the question as to why the discovery process would generate the errors.  Since the discovery process is handled through the management agent, emagent_perl.trc under $AGENT_INSTANCE_HOME/sysman/log was checked for details, which revealed the following:

oracledb.pl: Wed Nov  7 16:16:55 2018: ERROR:  initParameterFileUtl::convertSPFileToPFile: Failed to convert spfile +DATA/dbprd3/spfiledbprd3.ora to pfile using sqlplus: Inappropriate ioctl for device 0 The output of sqlplus is "CREATE PFILE='/tmp/r3VRTW7BT5' FROM SPFILE='+DATA/dbprd3/spfiledbprd3.ora'
*
ERROR at line 1:
ORA-01565: error in identifying file '+DATA/dbprd3/spfiledbprd3.ora'
ORA-17503: ksfdopn:10 Failed to open file +DATA/dbprd3/spfiledbprd3.ora
ORA-15056: additional error message
ORA-15173: entry 'dbprd3' does not exist in directory '/'
ORA-06512: at line 4
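
Pulling the relevant lines out of the agent trace is a one-liner (a sketch; the agent paths are whatever the install used):

cd $AGENT_INSTANCE_HOME/sysman/log
grep -i dbprd3 emagent_perl.trc | tail -20   # most recent discovery errors mentioning the old database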

The discovery process was finding database "dbprd3" and then trying to gather more information about it, which failed since the database had been removed previously.  /etc/oratab was checked, as that's one of the files the discovery process reads, and sure enough "dbprd3" was still listed.
The final clue came from the detail in emagent_perl.trc - "oracledb.pl: Wed Nov  7 16:16:55 2018: ERROR:  initParameterFileUtl::convertSPFileToPFile".  The discovery process runs several Perl scripts under $AGENT_BASE/plugins/oracle.sysman.db.discovery.plugin_[RELEASE]/discover, oracledb.pl being one of the main ones.  This calls ./utl/initParameterFileUtl.pl, which generated the error in the subroutine "convertSPFileToPFile".  It seems the discovery process tries to gather additional details about each database it finds from that database's init.ora file.  The subroutine tries to generate a text parameter file from the spfile, and the spfile location, in our case, is determined by the init.ora file under $ORACLE_HOME/dbs.  Sure enough, that file was still present for "dbprd3" and contained a single line:

SPFILE='+DATA/dbprd3/spfiledbprd3.ora'
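
Both leftovers are easy to confirm with a couple of quick checks (a sketch; the init<SID>.ora file name follows the usual naming convention and is an assumption here):

grep dbprd3 /etc/oratab                  # stale oratab entry
ls -l $ORACLE_HOME/dbs/*dbprd3*          # leftover parameter file(s)
cat $ORACLE_HOME/dbs/initdbprd3.ora      # the one-line SPFILE pointer shown above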

When the database "dbprd3" was removed in the past, so were all of its ASM files, including the spfile.  By disabling auto-discovery, then removing the entry in /etc/oratab and removing the leftover init file, the problem was eliminated for good, even if someone manually runs the discovery process later.
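The cleanup itself amounted to a few lines (a sketch; take backups before removing anything, and the init file name follows the same assumed convention):

cp -p /etc/oratab /etc/oratab.$(date +%Y%m%d)    # backup first
sed -i '/^dbprd3:/d' /etc/oratab                 # drop the stale entry
mv $ORACLE_HOME/dbs/initdbprd3.ora /tmp/         # park the leftover init file rather than delete it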

Friday, October 26, 2018

DG Switchover, EMON Termination and kponpst: ORA-1033, during ksrpublish0


As part of my company's policy, we regularly perform DR-related activities, which include switchover tests of databases configured to use Data Guard (physical standby).  During one of the tests in a lower environment, the switchover command (run in "dgmgrl") hung.  Normally this process takes a few minutes, but after waiting longer than usual we checked the database for wait events and saw that the switchover was waiting on "EMON Termination" (a quick way to check this is sketched after the log excerpts below).  We also checked the primary's alert log and saw the following:


Started @ 11:11:40
2018-10-22 11:11:56.745000 -04:00
minact-scn: Master considers inst:4 dead
kponpst: ORA-1033, during ksrpublish0
kponpst: ORA-1033, during ksrpublish0
kponpst: ORA-1033, during ksrpublish0
2018-10-22 11:11:58.862000 -04:00
Archived Log entry 438383 added for thread 1 sequence 31328 ID 0xe8c2ea16 dest 1:
ARC4: Archive log rejected (thread 2 sequence 31330) at host '****'
FAL[server, ARC4]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance **** - Archival Error. Archiver continuing.
2018-10-22 11:12:48.377000 -04:00
kponpst: ORA-1033, during ksrpublish0
2018-10-22 11:12:55.734000 -04:00
kponpst: ORA-1033, during ksrpublish0

In ARC[4] tracefile:
*** 2018-10-22 11:11:58.429 4645 krsu.c
Logged on to standby successfully
Client logon and security negotiation successful!
Error 16401 creating standby archive log file at host '****'
kcrrwkx: unknown error:16401
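
As mentioned above, the stuck wait itself was visible straight from the session views; a query along these lines is what showed the "EMON Termination" wait (a sketch; on RAC, gv$session would be the view to use):

sqlplus -S / as sysdba <<'EOF'
-- which sessions are waiting, and on what
SELECT sid, program, event, seconds_in_wait
  FROM v$session
 WHERE wait_class <> 'Idle'
 ORDER BY seconds_in_wait DESC;
EOF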


MOS document "Database Shutdown Hangs on "EMON termination" / "Waiting for Job queue slaves to complete" (Doc ID 1915888.1)" describes this event as being caused by "a few orphan subscribers for the ALERT QUEUE which caused emon to spin during shutdown".  This implies that one or more EMON background processes could be hung.  Checking the view V$EMON, I found:


SID LAST_UPDATE_TIME
------- -----------------------------
    336 04-OCT-18 23.42.02.881 -04:00
    396 22-OCT-18 11.11.51.986 -04:00
    427 22-OCT-18 11.11.51.986 -04:00
    456 22-OCT-18 11.11.51.986 -04:00
      7 22-OCT-18 11.11.51.986 -04:00
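
For reference, the listing above is just a straight query against the view (sketch):

sqlplus -S / as sysdba <<'EOF'
SELECT sid, last_update_time FROM v$emon;
EOF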

 
From the above it looks as though EMON SID 336 (the one with the stale LAST_UPDATE_TIME) doesn't belong, or at least is having issues.  I tried to kill this database session, but unfortunately it couldn't be killed (it just sat "marked for kill").  Fortunately, experience has taught me to always grab the SPID associated with an Oracle session, just in case it can't be killed.  After killing the OS process, the switchover picked up immediately and completed without issue.
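Grabbing the SPID ahead of time looks something like this (a sketch; SID 336 is the suspect one from the V$EMON output above, the rest is the standard v$session/v$process join):

sqlplus -S / as sysdba <<'EOF'
-- OS process id (SPID) of the suspect EMON session
SELECT s.sid, s.serial#, p.spid
  FROM v$session s
  JOIN v$process p ON p.addr = s.paddr
 WHERE s.sid = 336;
EOF
# ALTER SYSTEM KILL SESSION left the session "marked for kill", so the fallback
# was to terminate the OS process directly:  kill -9 <spid from the query above>
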
Normally it's not a good idea to kill Oracle background processes, but from a bit of searching BEFORE anything was killed, I determined that EMON has several bugs associated with it across Oracle releases and that killing EMON is sometimes advised in emergency situations.