Monday, January 29, 2018

Physical Standby Wait Event Alerting

I recently had a request to send alerts to our DBA team whenever a particular read-only standby database has high waits on "library cache*" events.  I could see the need as we just had an issue where a number of different "library cache*" events were causing a performance problem on the standby database.  I figured this would be pretty easy to do in Enterprise Manager 12c - set thresholds on metric(s) for this event, update any relevant Incident Rules and we're all set.

Unfortunately there are no metrics available for specific "library cache*" wait events in EM12c.  There is a metric for a higher level, on the wait class, so in this case that'd be "Concurrency".  But thinking this through I realized alerting on "Concurrency" wouldn't work for this case.  The idea is to perform regular checks for "library cache*" waits and if repeated checks (at least 2) find x number of sessions waiting on this event then an alert should be raised.  If we check for "Concurrency", we could run into a situation where at timestamp#1 there were 10 sessions waiting on "library cache lock" and at timestamp#2 there were 10 sessions waiting on "enq: HV - contention".  2 consecutive checks found 10 or more sessions waiting on the metric so an alert would be raised, even though in my contrived example it's not what we want.

Another option would be either to use a temporary table to store snapshots of current session waits or to query AWR data.  A temporary table wouldn't work because this is a read-only standby.  AWR data wouldn't work because 1, monitored would be in the past, the earliest being 1 snapshot ago.  Plus, again this is a standby database so any AWR present would be a copy of the primary.

Our solution was to go with a Metric Extension.  The ME was based on the cursor:




SELECT COUNT(*) total_sessions
  FROM v$session
 WHERE state = 'WAITING'
   AND event LIKE 'library cache%';


 ... set at the instance-level.  Thresholds were set for 2 (or more) consecutive checks at 10 (warning) or 20 (critical).  After updating the appropriate Incident Rule everything worked as expected.