Monday, January 29, 2018

Physical Standby Wait Event Alerting

I recently had a request to send alerts to our DBA team whenever a particular read-only standby database has high waits on "library cache*" events.  I could see the need as we just had an issue where a number of different "library cache*" events were causing a performance problem on the standby database.  I figured this would be pretty easy to do in Enterprise Manager 12c - set thresholds on metric(s) for this event, update any relevant Incident Rules and we're all set.

Unfortunately there are no metrics available for specific "library cache*" wait events in EM12c.  There is a metric for a higher level, on the wait class, so in this case that'd be "Concurrency".  But thinking this through I realized alerting on "Concurrency" wouldn't work for this case.  The idea is to perform regular checks for "library cache*" waits and if repeated checks (at least 2) find x number of sessions waiting on this event then an alert should be raised.  If we check for "Concurrency", we could run into a situation where at timestamp#1 there were 10 sessions waiting on "library cache lock" and at timestamp#2 there were 10 sessions waiting on "enq: HV - contention".  2 consecutive checks found 10 or more sessions waiting on the metric so an alert would be raised, even though in my contrived example it's not what we want.

Another option would be either to use a temporary table to store snapshots of current session waits or to query AWR data.  A temporary table wouldn't work because this is a read-only standby.  AWR data wouldn't work because the data monitored would be in the past, the earliest being 1 snapshot ago.  Plus, again, this is a standby database, so any AWR data present would be a copy of the primary's.

Our solution was to go with a Metric Extension.  The ME was based on the cursor:

SELECT COUNT(*) total_sessions
  FROM v$session
 WHERE state = 'WAITING'
   AND event LIKE 'library cache%';

... set at the instance level.  Thresholds were set for 2 (or more) consecutive checks at 10 (warning) or 20 (critical).  After updating the appropriate Incident Rule, everything worked as expected.
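The "at least 2 consecutive checks" behavior is handled by EM's threshold settings, but the underlying logic is simple enough to sketch.  The following is a minimal, hypothetical illustration (the function name and the sample counts are mine, not EM's internals); a real Metric Extension would feed it the counts returned by the v$session query above:

```shell
#!/bin/sh
# Sketch of an "N consecutive occurrences" threshold check. Alert only
# once the threshold has been breached on consecutive samples; a single
# clean sample resets the streak.
check_samples() {
    threshold=10
    required=2
    consecutive=0
    for waiting_sessions in "$@"; do
        if [ "$waiting_sessions" -ge "$threshold" ]; then
            consecutive=$((consecutive + 1))
            # Only alert once the breach has been seen on consecutive checks.
            [ "$consecutive" -ge "$required" ] && \
                echo "ALERT: $waiting_sessions sessions waiting on library cache events"
        else
            consecutive=0   # a clean sample resets the streak
        fi
    done
}

check_samples 3 12 11 4   # alerts only on the second consecutive breach (11)
```

With the sample counts 3, 12, 11, 4 only the third sample triggers an alert: 12 starts the streak, 11 makes it two in a row, and 4 resets it.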

Friday, January 17, 2014


I figure it's "a must" to blog any time I learn something new.  I've set events on databases many times over the years but for the most part I've never paid that much attention to the syntax.  I knew about SESSION vs. SYSTEM, that CONTEXT lets you define how long the event should be active, etc., but I never paid attention to the impact of EVENT vs. EVENTS.

Using "ALTER SYSTEM SET EVENT '…';" changes the parameter EVENT and overrides previous settings (more on that later), while "ALTER SYSTEM SET EVENTS '…';" makes the listed event active immediately but not persistent across database restarts.  I guess that should be obvious, because without specifying SCOPE the change would only be in memory.  What makes this different, though, is that using EVENT gives you the ability to define SCOPE (and SID) while EVENTS does not.  Using EVENTS actually changes a memory structure (see Tanel Poder's blog post "Why doesn't ALTER SYSTEM SET EVENTS set the events or tracing immediately?").

Back to EVENT, setting it to a value overrides previous setting(s), just as it would for any other instance/database parameter.  This means it's important to first check for any existing values for EVENT before setting new one(s).

Tuesday, April 16, 2013

An Unexpected Impact of LANG

In the process of trying to sort a delimited character string for a bash script, I came across a situation where the setting of LANG at the OS level impacted the results.  In my situation I was working on a script for monitoring Oracle services by comparing "srvctl config" output with "srvctl status" output.  Since services that cross multiple instances may list those instances in various orders, I wanted to make sure the listing was always sorted for a proper comparison.

While a simple stream of "echo | tr ' ' '\n' | sort | tr '\n' ' ' | sed 's/,$/\n/'" works just fine, I found that when testing the monitor script across all our servers I was getting different results.  Looking closer I found that the difference was related to running the script locally vs. via SSH.

After a bit of debugging I came up with a simple example to show the issue:

Create a file with 4 rows of numbers, with varying leading spaces.

% echo "1
   4" >x.dat

Next, sort the file locally.

% sort x.dat

Last, sort the file on the same server (I'm connected to lnx261) but pretend it's a remote operation via ssh.

% ssh oracle@`hostname -s` 'sort x.dat'

For both sorts the same file was used and run on the same server.  Why would there be a difference?

LANGUAGE!  Your OS LANG setting affects sorts, just like NLS* settings do in Oracle.  Most session-level settings are not set when you SSH in.  So either sourcing the appropriate profile (/etc/profile, ~/.bash_profile, etc.) or setting LANG (or LC_ALL) within your SSH session will give you the expected results:

% ssh oracle@`hostname -s` 'export LANG=en_US.UTF-8; sort x.dat'

Although I knew of issues related to NLS_DATE_FORMAT not being set, I had never previously run into issues with LANG.  Fortunately Enterprise Manager's agent (as part of $ORACLE_HOME/bin/nmo) runs through a full process creation, so your various profile scripts get sourced when you execute EM Jobs or simply run "Execute Host Command".
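The collation difference is easy to reproduce without SSH at all.  The sketch below uses the C locale, the only locale guaranteed to exist on a POSIX system, to stand in for whatever minimal environment the remote session ends up with:

```shell
#!/bin/sh
# Demonstrate locale-dependent collation with a file of numbers that
# have varying leading spaces, similar to the x.dat example.
printf '1\n   4\n2\n  3\n' > x.dat

# In the C locale, comparison is by raw byte value, so the space
# character (0x20) collates before every digit and the indented
# lines sort first.
LC_ALL=C sort x.dat

rm -f x.dat
```

Under LC_ALL=C the indented lines sort ahead of the others; under a language-aware locale such as en_US.UTF-8, leading blanks are typically ignored during collation and the numeric order 1, 2, 3, 4 comes back instead.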

Friday, March 22, 2013

Mining EM Metadata - Password Lookup

As I've posted before, Enterprise Manager's (EM) metadata presents a wealth of information that can come in handy for your day-to-day DBA work.  In this article I'd like to share how stored passwords in EM (Preferred Credentials) can be used to your advantage.

For an EM agent to interact with a server without human interaction, it needs to store that server's password.  It can also store passwords for other targets, such as databases.  The passwords are obviously stored in an encrypted format, but with the right SQL and function combinations you can programmatically retrieve any stored password, in clear text, for any username.  It all starts with MGMT_CREDENTIALS2.

Each set of credentials is stored in MGMT_CREDENTIALS2, where a "credential set" is made up of the username and password.  Each credential set is identified by a CREDENTIAL_GUID, which maps to MGMT_TARGET_CREDENTIALS.  This table is used to map credential sets to their targets for each administrator within EM, as you can save credentials for multiple targets for multiple EM administrators.

The following is an example of putting this together in a script (pass_lookup.sql):


SELECT DISTINCT DECRYPT(c.credential_value)
  FROM mgmt_target_credentials tc, mgmt_credentials2 c, mgmt_targets t
 WHERE UPPER(c.key_value) = UPPER('&1')   -- Username whose password we're retrieving
   AND c.credential_set_column = 'password'
   AND t.target_guid = tc.target_guid
   AND tc.credential_guid = c.credential_guid
   AND t.target_name = '&2'
   AND t.target_type = '&3';


The reason C.CREDENTIAL_SET_COLUMN is filtered on "password" is that in this case we only want the password value, not the value of the username.

Taking this a step further, you can see how retrieval of the password can now be used within code, such as:

PASS=`sqlplus -S sysman/sysman_pass @pass_lookup.sql "$USER" "dbname" "rac_database" | grep -v "^$"`

That line sets the variable PASS to the password value for a given target and username.  Obviously you first need to know the SYSMAN password for the repository, but that's a relatively small detail to code for, and the above should give you a good start if you want to use the EM repository as a place to store and retrieve passwords.

Tuesday, February 19, 2013

Mining EM Metadata - Job Health Check

This is the first of (hopefully) a number of posts about how to take advantage of information stored in Enterprise Manager's (EM) metadata.  EM has to capture a large quantity and variety of details related to its functionality, which it stores in a highly normalized format.  A lot of this information includes key details about environments that you, as a DBA, support.  Knowing what's stored and how to access it correctly can actually make your job a bit easier.

As a DBA you more than likely use a variety of monitoring tools and require assurance that those tools are working properly.  Ideally EM works without issue, but obviously bugs will always be there.  The situation discussed in this post deals with a bug in an older EM release where targets for an EM Job will disappear, for whatever reason, while other targets stay defined.

For example, let's say you have EM Job "MONITOR - ORACLE SERVICES" and it's defined to run against all cluster databases.  At some point 1 or more of those intended targets drop from the full list of the scheduled Job, so at a later point in time you may find that the EM Job is now running against only 5 cluster databases when you originally submitted it to run on 8.  That's a pretty scary situation for a DBA to find himself in, especially since this particular bug doesn't appear to give any warning or messages.

The good news is EM metadata includes everything you need to catch this situation when it happens.  The key objects are views MGMT$JOBS, MGMT$JOB_TARGETS, and MGMT$JOB_EXECUTION_HISTORY, with a little help from MGMT$TARGET_FLAT_MEMBERS.

View MGMT$JOBS holds basic information about every EM Job, both for what resides in the library (if anything) and for what has been submitted.  Differentiation is done using IS_LIBRARY, a flag where a value of "1" means the EM Job for that row of data is saved in the library whereas a value of "0" refers to a submitted EM Job.

COL job_name FORMAT A30
COL target_type FORMAT A15

SELECT job_name, job_type, target_type, is_library
  FROM mgmt$jobs;

JOB_NAME                       JOB_TYPE                         TARGET_TYPE             IS_LIBRARY
------------------------------ -------------------------------- --------------- ------------------
EXAMPLE JOB                    OSCommand                        host                             1

Once you submit the EM Job, an additional row is returned from MGMT$JOBS where IS_LIBRARY = 0:

JOB_NAME                       JOB_TYPE                         TARGET_TYPE             IS_LIBRARY
------------------------------ -------------------------------- --------------- ------------------
EXAMPLE JOB                    OSCommand                        host                             1
EXAMPLE JOB.1                  OSCommand                        host                             0

Also note that until you define a schedule for the EM Job (either "One Time" or "Repeating"), the column START_TIME will be NULL.

To see which targets have been assigned to a given EM Job, use the view MGMT$JOB_TARGETS.  Be careful, though.  As I mentioned before, EM's metadata is highly normalized, and you can define your own groupings.  If you create an EM Job to run against 5 specific hosts you'll get 5 rows returned for this EM Job from MGMT$JOB_TARGETS.

SELECT j.job_name, j.job_type, jt.target_name, j.target_type
  FROM mgmt$jobs j, mgmt$job_targets jt
 WHERE j.is_library = 1
   AND (j.job_id = jt.job_id);

JOB_NAME                       JOB_TYPE                         TARGET_NAME TARGET_TYPE
------------------------------ -------------------------------- ----------- ---------------
EXAMPLE JOB                    OSCommand                        linux1      host
EXAMPLE JOB                    OSCommand                        linux2      host
EXAMPLE JOB                    OSCommand                        linux5      host

But if you define the EM Job as type "host" and assign it to an EM Group, you'll get 1 row with a target type of "composite".

SELECT j.job_name, j.job_type, jt.target_name, jt.target_type
  FROM mgmt$jobs j, mgmt$job_targets jt
 WHERE j.is_library = 1
   AND (j.job_id = jt.job_id);

JOB_NAME                       JOB_TYPE                         TARGET_NAME TARGET_TYPE
------------------------------ -------------------------------- ----------- ---------------
EXAMPLE JOB                    OSCommand                        PROD_GROUP  composite

In the example above the "Target Type" defined for the EM Job is still "host" but when I selected individual Targets for the EM Job I picked a Group, as I wanted to execute the EM Job against EVERY "host" within the selected Group.  To break out (or "flatten") type "composite" you'll need to join to view MGMT$TARGET_FLAT_MEMBERS.  More on that in just a bit.

Now that we've seen how to list all targets defined for an EM Job, we need to compare that to what's actually scheduled.  In comes the view MGMT$JOB_EXECUTION_HISTORY.  The query I used for this is rather complex, so let me first break it into pieces.  First, I union EM Jobs of non-composite target types with those of composite target types.  The composite target types are broken out into individual targets via view MGMT$TARGET_FLAT_MEMBERS.  Second, I take this list of defined EM Jobs and join it to MGMT$JOB_EXECUTION_HISTORY, which represents those EM Jobs that are either "Scheduled" or currently "Running".  (Note: while an EM Job is "Running" it doesn't have an entry for "Scheduled".  That only happens after the execution completes.)

WITH c_jobs_defined AS (SELECT j.job_name, j.job_id, j.job_owner, j.job_type, jt.target_name
                             , jt.target_type
                          FROM mgmt$jobs j, mgmt$job_targets jt
                         WHERE (j.job_id = jt.job_id)
                           AND j.is_library = 0
                           AND j.start_time IS NOT NULL
                           AND jt.target_type NOT IN ('composite')
                               UNION ALL
                        -- Join in Jobs with a composite target type separately as these need to be
                        -- flattened out to get a full list of all targets for that composite type.
                        SELECT j.job_name, j.job_id, j.job_owner, j.job_type, tfm.member_target_name
                             , tfm.member_target_type
                          FROM mgmt$jobs j, mgmt$job_targets jt, mgmt$target_flat_members tfm
                         WHERE (j.job_id = jt.job_id)
                           AND (    jt.target_name = tfm.aggregate_target_name
                                AND jt.target_type = tfm.aggregate_target_type
                                AND j.target_type = tfm.member_target_type)
                           AND j.is_library = 0
                           AND j.start_time IS NOT NULL
                           AND jt.target_type IN ('composite'))
SELECT c.job_name, c.job_owner, c.job_type, c.target_name missing_on_target, c.target_type
  FROM c_jobs_defined c LEFT OUTER JOIN mgmt$job_execution_history jeh
       ON (c.job_id = jeh.job_id AND c.target_name = jeh.target_name)
                        INNER JOIN (SELECT member_target_name, member_target_type
                                      FROM mgmt$target_flat_members
                                     WHERE aggregate_target_name = 'PROD_GROUP') m
       ON (c.target_name = m.member_target_name AND c.target_type = m.member_target_type)
 WHERE (jeh.status IN ('Scheduled', 'Running') OR jeh.status IS NULL)
   AND jeh.job_name IS NULL
 ORDER BY c.job_name, c.target_name;

The INNER JOIN to MGMT$TARGET_FLAT_MEMBERS with the UNION-ed results is to limit the search to those targets within a particular EM Group, in this case our production Group named "PROD_GROUP".

Although this was rather lengthy to explain, hopefully it's clear now how to differentiate EM Jobs as defined in the Job Library versus those listed in Job Activity within the EM Console.

Thursday, October 4, 2012

Finally! A use for Enterprise Manager Target Properties with Jobs!

Okay, perhaps I'm overreacting, as I haven't been thinking non-stop about using Target Properties ("%%") within EM.  But periodically, when I'd see them during the Job creation process, I'd wonder what good they are and whether I'd ever find a use for them.

To explain how I've found a use for them, let me first share a few details about how we configure our environments within EM and how that relates to Job definitions.  We assign all environments to 1 of 2 Groups - Production or Non-Production.  From there all target types are lumped within the appropriate Group.  When we create EM Jobs we try our best to always have the Target refer to a Group and let the Target Type define what resources within the Group will be accessed.

This works great for most situations, except that there's duplication in some cases.  For example, OS Jobs running against a Target Type of "host" will pick up both hosts AND clusters, which means on a given cluster your Job will run once on each node plus once on the cluster's primary node.

In come Target Properties.  Target Properties are listed on the right side of the screen under the "Parameters" tab when creating a Job in EM, with the list varying based on the Target Type of the Job.

The following is a capture of these Target Properties for a Target Type of "host":

Back to the problem at hand: to have clusters skipped we simply check "%TargetType%" in the Command:

if [ "%TargetType%" = "host" ]; then . ~/.bash_profile; else exit 0; fi

Notice the "else exit 0".  This is needed to avoid clusters returning a non-zero status because they weren't picked up by the if-test.
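As a standalone sketch of the same guard (here a plain shell function argument stands in for EM's %TargetType% substitution, which is replaced in the command text before the script ever runs):

```shell
#!/bin/sh
# Sketch of the %TargetType% guard. The function argument simulates the
# value EM would have substituted into the command.
run_job() {
    if [ "$1" = "host" ]; then
        echo "running job payload on $1"
    else
        # Skipped (cluster) targets must still return success, otherwise
        # EM flags every cluster execution of the Job as failed.
        return 0
    fi
}

run_job host      # prints the payload message
run_job cluster   # prints nothing and returns 0, so EM records success
```

The design point is simply that "not selected by the if-test" and "failed" are different outcomes, and the explicit success return keeps EM from conflating them.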

Friday, August 17, 2012

Speeding up "fuser" Activity

"fuser" is a great Unix utility for determining who has a file open, or just whether a file is open at all.  Related to Oracle maintenance, it's handy when performing any kind of cleanup of tracefiles or audit files.  Oracle dumps plenty of files under [a|b|u]dump, yet as part of regular cleanup it's not wise to blindly remove these files.  If a file is still open, removing it doesn't release the space.

The problem is "fuser" can also be very slow.  The point of the utility is to find all users who have a given file open, but that literally means checking every process on the system to see if it has the associated file descriptor reserved; in DBA terms, "do a full scan" of all processes.  On a system with thousands of processes and thousands of files to check, the impact can be rather large.

The good news is Oracle is nice enough to frequently include the owning PID within the filename itself.  Since these audit and tracefiles are exclusive to a PID, this means you can grab the PID from the filename and check one spot to see if that process still has the file open, instead of checking all processes on the system.

As an example, the following shows an "fuser" method and a lookup-by-PID method for checking all audit files to see if any are open:

"fuser" Method
ls -1 $ORACLE_BASE/admin//adump/ora*.aud | while read AUD_FILE
do
   [[ -n "`fuser $AUD_FILE | cut -d':' -f1`" ]] && echo "File $AUD_FILE is open"
done

Lookup-by-PID Method
ls -1 $ORACLE_BASE/admin//adump/ora*.aud | while read AUD_FILE
do
   AUD_FILE_BASE=`basename $AUD_FILE .aud`
   AUD_FILE_PID=`echo $AUD_FILE_BASE | cut -d'_' -f2`
   if [ `ls -l /proc/$AUD_FILE_PID/fd 2>/dev/null | grep -c "$AUD_FILE"` -eq 1 ]; then
      echo "File $AUD_FILE is open"
   fi
done

The "Lookup" method takes a few extra lines of code but the performance difference is dramatic.  I tested this on a system with 2500+ active processes and a few thousand audit files, using "time" to compare the 2 methods.

For "fuser":
real    8m3.663s
user    0m11.197s
sys     0m50.258s

For "Lookup-by-PID":
real    0m0.951s
user    0m0.272s
sys     0m0.895s

Not every system has as many files to check so I ran both tests against just 1 file, still with 2500+ active processes.  The difference was still 1000x - "fuser" = 0m3.144s, "Lookup-by-PID" = 0m0.003s.
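The core of the lookup-by-PID method can be demonstrated standalone on any Linux box, using the current shell's own PID as the process to check (the temp file here stands in for an audit file):

```shell
#!/bin/sh
# Linux-only sketch of the lookup-by-PID idea: instead of scanning every
# process on the system the way fuser does, list one known process's open
# descriptors under /proc. The "known process" here is this shell ($$).
TMPFILE=$(mktemp)

exec 3<"$TMPFILE"   # hold the file open on descriptor 3

# Each entry in /proc/<pid>/fd is a symlink to the open file, so a simple
# grep of the long listing answers "does this PID have this file open?".
if ls -l /proc/$$/fd 2>/dev/null | grep -q "$TMPFILE"; then
    echo "File $TMPFILE is open"
fi

exec 3<&-           # release the descriptor
rm -f "$TMPFILE"
```

Reading one small /proc directory is effectively free, which is where the 1000x difference against the full "fuser" scan comes from.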

For me the bottom line is that you can have a huge, positive impact on performance if you're willing to add a smidge of intelligence to your code.