Sunday, May 13, 2018

"grep" with Colors can't find its Results

I ran into an interesting situation recently where one of our standard scripts for cleaning up Oracle trace files, logs, etc. could not find an entry in /etc/oratab even though it was there.  The problem was that "grep" didn't know what color $ORACLE_SID was.

Obviously $ORACLE_SID isn't set to a specific color, but when you use color options with "grep", the special characters that display those colors can come back to bite you.  For example, let's say you are using ASM for an Oracle database and want to switch your environment to point to that ASM instance, and do so programmatically.  The following shows that the instance exists and is also defined in "/etc/oratab":

% ps -fu oracle | grep asm_smon_ | grep -v grep
oracle     9234      1  0  2017 ?        00:14:39 asm_smon_+ASM
% grep "^+ASM" /etc/oratab
+ASM:/grid/app/12.1.0/oracle:N          # line added by Agent

To switch the environment in a script you can simply tell "oraenv" to not prompt for input after you set $ORACLE_SID:

% cat x.sh
export ORAENV_ASK="NO"
export ORACLE_SID=`grep "^+ASM" /etc/oratab | cut -d\: -f1`
. oraenv

And when running the simple script:

% chmod +x x.sh
% x.sh
ORACLE_HOME = [/dnbusr1/oracle] ?

Wait!  I validated that the instance I was going to use, "+ASM", existed in "/etc/oratab".  Why couldn't "oraenv" find the value?  At first I tried debugging by making a copy of "/usr/local/bin/oraenv", adding "set -x" to that local copy, and re-running my script to see what was going on, but I didn't see the problem:

% x.sh
+++ SILENT=
+++ '[' 0 -gt 0 ']'
+++ case ${ORACLE_TRACE:-""} in
+++ N=
+++ C=
+++ echo '\c'
+++ grep c
+++ N=-n
+++ '[' /ora01/app/oracle/product/12.1.0/db_1 = 0 ']'
+++ OLDHOME=/ora01/app/oracle/product/12.1.0/db_1
+++ case ${ORAENV_ASK:-""} in
+++ NEWSID="^+ASM"
+++ export ORACLE_SID
++++ dbhome "^+ASM"
+++ ORAHOME=/dnbusr1/oracle
+++ case $? in
+++ echo -n 'ORACLE_HOME = [/dnbusr1/oracle] ? '
ORACLE_HOME = [/dnbusr1/oracle] ? +++ read NEWHOME

Just for yucks I decided to send the output to a logfile and review the results in "vi" and sure enough, that helped identify the problem:

% x.sh 1>x.log 2>&1

% vi x.log
+++ SILENT=
+++ case ${ORACLE_TRACE:-""} in
+++ N=
+++ C=
+++ echo '\c'
+++ grep c
+++ N=-n
+++ '[' /grid/app/11.2.0.4/oracle = 0 ']'
+++ OLDHOME=/grid/app/11.2.0.4/oracle
+++ case ${ORAENV_ASK:-""} in
+++ NEWSID='^[[01;31m^[[K+ASM^[[m^[[K1'
+++ export ORACLE_SID
++++ dbhome '^[[01;31m^[[K+ASM^[[m^[[K1'

The highlighted escape sequences are adding color to the string.  And why would that be done?  Because someone had added color options to "grep" by setting export GREP_OPTIONS="--color=always".  In my script I used "grep" to find the Oracle instance (in case the environment I was working on had a different value, such as on a cluster), and the color escape sequences were stored in the variable along with the string.
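If you want to see the effect directly, a quick reproduction (assuming a GNU "grep" old enough to still honor the now-deprecated GREP_OPTIONS variable) looks something like the following, with "cat -v" making the otherwise invisible escape characters visible:

% export GREP_OPTIONS="--color=always"
% export ORACLE_SID=`grep "^+ASM" /etc/oratab | cut -d\: -f1`
% echo "$ORACLE_SID" | cat -v
^[[01;31m^[[K+ASM^[[m^[[K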

So is there a happy medium where you can use colors to help identify found strings yet not have those colors interfere in cases like the one described above?  Yes, for the most part.  The best option is to use "--color" or "--color=auto".  In either case ("auto" is the default), color is added only when standard output is connected to a terminal, to partially quote the "man" page for "grep".  In other words, a command like "grep "^+ASM" /etc/oratab" would include color, whereas "grep "^+ASM" /etc/oratab | cut -d\: -f1" would not, because the output is piped.
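If you can't control what a calling profile exports, another safeguard is to neutralize the color settings inside the script itself.  A minimal sketch of the earlier script with that added ("--color=never" is a standard GNU "grep" option; nothing here is specific to this environment):

% cat x.sh
unset GREP_OPTIONS                       # drop any inherited color settings
export ORAENV_ASK="NO"
export ORACLE_SID=`grep --color=never "^+ASM" /etc/oratab | cut -d\: -f1`
. oraenv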

Wednesday, March 21, 2018

Wrapping "repvfy" to Make Your OEM Life Easier

For those of you who manage Oracle Enterprise Manager environments, you know how important the metadata is.  Fortunately, Oracle provides a utility called "repvfy" that, as the abbreviation implies, helps you verify the repository.  In this blog I'd like to share how you can wrap a bit of code around the execution of "repvfy" to make your life a bit easier.
Before proceeding, note that:
  • I'll be using the variable $EMDIAG_HOME to refer to where the "repvfy" utility has been extracted.
  • Our OEM environments that "repvfy" and its associated wrapper code run against are 11.1.0.1, 12.1.0.4 and 12.1.0.5.
  • The wrapper itself could be written in a variety of languages, but since it doesn't require complex coding, has no strict elapsed-time requirements, and all members of our DBA team are proficient in bash, we decided to use bash.
We created crontab entries to run "repvfy" on a daily basis against each of our OEM environments, including the "-fix" argument so that if a found issue is simple to resolve, "repvfy" can take care of it (a sample crontab entry is sketched after the output below).  But periodically issues are found that require more effort, so we decided it would be helpful to receive an email with the output from each "repvfy" execution.  Reviewing these emails isn't too difficult, as the output is similar to the following:

verifyAGENTS
1008. NMO not setuid-root (Unix-only): 2
6002. Blocked Agents: 2
8005. Broken Agents: 1
verifyASLM
verifyAVAILABILITY
8001. Composite availability errors: 2
verifyBLACKOUTS
7001. Expired blackout windows: 9
      Fix: 9
verifyCA
verifyCAT
 ...
verifyTARGETS
1017. Platform ID mismatch between host and ORACLE_HOME: 25
2006. Targets with missing ORACLE_HOME target: 4
2013. CRS clusters with nodes not discovered: 8
3002. DB Systems linked to multiple databases: 2
6004. Targets not uploading: 33
7008. Orphan source target associations: 57
7010. Systems without members: 12
7013. Composite targets without metric dependency details: 4
8002. Broken targets: 3
verifyTEMPLATES
verifyUPGRADE
verifyUSERS
1003. Custom super user admins.: 11
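
A minimal sketch of the kind of crontab entry that drives this (the schedule, wrapper script name and paths are purely illustrative; in our case the wrapper itself handles the email):

% crontab -l | grep repvfy
00 06 * * * /home/oracle/scripts/repvfy_wrapper.sh >/tmp/repvfy_wrapper.log 2>&1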

But some of the alerts in that output we know can be skipped; for example, the BLACKOUTS test 7001 alert can be ignored since all of its issues were fixed.  So reviewing output like this involves some repetitive work.

Here's where the "make your life a bit easier" part comes in.  We can wrap a bit of code around "repvfy" to avoid some of the repetitive work.  This starts by parsing output from "repvfy".

By default, "repvfy" creates a logfile of each execution under $EMDIAG_HOME/log.  If you run a "verify" then the logfile will have a format similar to "verify_<yyyy>_<mm>_<dd>_<hh24miss>.log".  To obtain the most recent logfile, code like the following can be used:

% ls -lt $EMDIAG_HOME/log/verify_* | head -1 | awk '{print $NF}'
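
In the wrapper, that value is simply captured into the $LOGFILE variable that the parsing loop below refers to, for example:

LOGFILE=`ls -lt $EMDIAG_HOME/log/verify_* | head -1 | awk '{print $NF}'`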

Parsing the logfile is relatively straightforward, as each module begins with "verify" followed by zero or more tests, each of which is prefixed by a test number and a period.  An example of a loop for parsing out the MODULE sections is:

for REPVFY_LINE in `grep -E '(^verify|^[0-9]*\. |^ *Fix)' $LOGFILE |
                    awk '{if ($1 ~ /verify/) PREFIX=$1; else print PREFIX" "$0}' |
                    grep -vE '(^verifyEMDIAG 6001\.|^verifyEXADATA 6006\.|^verifyREPOSITORY 6039\.|^verifyREPOSITORY 7001\.|verifyJOBS 707\.|verifyTARGETS 605\.)' |
                    sed "s/ /~/g"`

The above pipeline does the following:
1. Start a loop over all lines that begin with "verify" (the module header), a set of digits followed by a period and a space (test output lines), or whose first non-blank characters are "Fix" -> grep -E '(^verify|^[0-9]*\. |^ *Fix)'
2. Use "awk" to prefix each module's test or "Fix" lines with the module header -> awk '{if ($1 ~ /verify/) PREFIX=$1; else print PREFIX" "$0}'
3. Skip several test results that we aren't concerned with -> grep -vE '(^verifyEMDIAG 6001 ... verifyTARGETS 605\.)'.  We've chosen to skip these alerts because we know of each situation and don't see it as an issue.  For example, test 707 in module JOBS concerns active executions reaching job purge time; this is typical archive activity and a non-issue.
4. Last, each line's spaces are changed to "~" to make the fields easier to delimit.

The previously listed "repvfy" output now looks like:

verifyAGENTS~1008.~NMO~not~setuid-root~(Unix-only):~2
verifyAGENTS~6002.~Blocked~Agents:~2
verifyAGENTS~6006.~Deployed~Agent~plugins~lower~than~OMS~plugin:~8
verifyAGENTS~8005.~Broken~Agents:~1
verifyAVAILABILITY~8001.~Composite~availability~errors:~2
verifyBLACKOUTS~7001.~Expired~blackout~windows:~9
verifyBLACKOUTS~~~~~~~Fix:~9
verifyTARGETS~1017.~Platform~ID~mismatch~between~host~and~ORACLE_HOME:~25
verifyTARGETS~2006.~Targets~with~missing~ORACLE_HOME~target:~4
verifyTARGETS~2013.~CRS~clusters~with~nodes~not~discovered:~8
verifyTARGETS~3002.~DB~Systems~linked~to~multiple~databases:~2
verifyTARGETS~6004.~Targets~not~uploading:~33
verifyTARGETS~7008.~Orphan~source~target~associations:~57
verifyTARGETS~7010.~Systems~without~members:~12
verifyTARGETS~7013.~Composite~targets~without~metric~dependency~details:~4
verifyTARGETS~8002.~Broken~targets:~3
verifyUSERS~1003.~Custom~super~user~admins.:~11
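
With the "~" delimiter in place, the pieces of each line can be pulled back apart with standard tools.  A minimal sketch of the start of the loop body (the exact variable names here are illustrative):

MODULE=`echo $REPVFY_LINE | cut -d'~' -f1 | sed 's/^verify//'`    # e.g. TARGETS
TEST_NUM=`echo $REPVFY_LINE | cut -d'~' -f2 | tr -d '.'`          # e.g. 6004 (empty on "Fix" lines)
ERROR_TOTAL=`echo $REPVFY_LINE | awk -F'~' '{print $NF}'`         # trailing count, e.g. 33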

At this point further logic is applied, based on the module and test, to try to fix some issues.  One example is test 6004 in the TARGETS module, which concerns targets that aren't uploading data.  What we found, though, is that clusters/scans only upload data every 24 hours, so we rerun this test *manually* by querying the metadata directly and filtering out object types of 'cluster'.

How did we determine the best cursor to use for this test?  Another great feature of "repvfy" is that the cursors on which tests are based are saved in a .sql file when the "-details" argument is passed.  In the example of test 6004 in the TARGETS module we ran:

% $EMDIAG_HOME/bin/repvfy verify -module TARGETS -test 6004 -details

… and then reviewed the generated .sql file under $EMDIAG_HOME/log to see the cursor that the tool used.
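
Since those .sql files land under $EMDIAG_HOME/log as well, the most recent one can be located the same way as the verify logfile, for example:

% ls -lt $EMDIAG_HOME/log/*.sql | head -1 | awk '{print $NF}'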

Before emailing the finalized output we still need to deal with test alerts that are automatically fixed.  Those are the lines whose first non-blank characters are "Fix".  For these we compare the total alerts reported with the total reported as fixed.  If the totals match, we skip the alert; otherwise we report it.  Here's a code snippet showing this logic:

if [ -n "$REPVFY_PREVLINE" ]; then
   # A "Fix:" line whose count matches the previous alert's count means the
   # alert was fully fixed, so both lines can be dropped
   if [ `echo $REPVFY_LINE | grep -c "Fix:"` -eq 1 -a $ERROR_TOTAL -eq $ERROR_PREVTOTAL ]; then
      ERROR_PREVTOTAL=
      ERROR_TOTAL=
      REPVFY_PREVLINE=
      REPVFY_LINE=
   # Otherwise the previous alert still stands: report it and carry the
   # current line forward for the next comparison
   elif [ `echo $REPVFY_LINE | grep -c "Fix:"` -eq 0 -o $ERROR_TOTAL -ne $ERROR_PREVTOTAL ]; then
      echo "$REPVFY_PREVLINE" >>$PARSE_LOGFILE
      ERROR_PREVTOTAL=$ERROR_TOTAL
      REPVFY_PREVLINE=$REPVFY_LINE
   fi
else
   # First line of a new alert: remember it and its count for the next pass
   ERROR_PREVTOTAL=$ERROR_TOTAL
   REPVFY_PREVLINE=$REPVFY_LINE
fi


The last step is to email the lines of output that, after the wrapper's logic has been applied, pertain only to what our team should be concerned with.  In the example above the final list is:

verifyAGENTS 1008. NMO not setuid-root (Unix-only): 2
verifyAGENTS 6002. Blocked Agents: 2
verifyAGENTS 6006. Deployed Agent plugins lower than OMS plugin: 8
verifyAGENTS 8005. Broken Agents: 1
verifyTARGETS 1017. Platform ID mismatch between host and ORACLE_HOME: 25
verifyTARGETS 2006. Targets with missing ORACLE_HOME target: 4
verifyTARGETS 2013. CRS clusters with nodes not discovered: 8
verifyTARGETS 3002. DB Systems linked to multiple databases: 2
verifyTARGETS 6004. Targets not uploading: 32
verifyTARGETS 7008. Orphan source target associations: 57
verifyTARGETS 7010. Systems without members: 12
verifyTARGETS 7013. Composite targets without metric dependency details: 4
verifyTARGETS 8002. Broken targets: 3
verifyUSERS 1003. Custom super user admins.: 11


With this output we know that any issues that could have been automatically fixed have been addressed and that the ones we don't need to be concerned with aren't listed, making the alert more meaningful for us.
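
As a rough illustration of that final email step, assuming the surviving lines were appended to $PARSE_LOGFILE as in the snippet above (the recipient address here is purely illustrative), the mail itself can be as simple as:

if [ -s "$PARSE_LOGFILE" ]; then                # only send mail when something is left to report
   mailx -s "repvfy verify results: `hostname`" dba-team@example.com <$PARSE_LOGFILE
fi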

Hopefully this showed how you can easily wrap a bit of additional functionality around the "repvfy" utility to remove alerts you don't care about and add a bit of code to try to automatically address other issues.

Additional Notes

Although a full explanation of how to use the "repvfy" utility is beyond the scope of this post, a good article in MOS for further research is "EMDIAG Repvfy 12c/13c Kit - How to Use the Repvfy 12c/13c kit (Doc ID 1427365.1)".

It can be helpful to know what all the possible tests are for a given module.  This can be achieved using the "list" parameter.  For example, to see all tests for the AGENTS module use:

% $EMDIAG_HOME/bin/repvfy -module AGENTS list



Monday, January 29, 2018

Physical Standby Wait Event Alerting

I recently had a request to send alerts to our DBA team whenever a particular read-only standby database has high waits on "library cache*" events.  I could see the need as we just had an issue where a number of different "library cache*" events were causing a performance problem on the standby database.  I figured this would be pretty easy to do in Enterprise Manager 12c - set thresholds on metric(s) for this event, update any relevant Incident Rules and we're all set.

Unfortunately there are no metrics available for specific "library cache*" wait events in EM12c.  There is a metric at a higher level, on the wait class, which in this case would be "Concurrency".  But thinking this through I realized alerting on "Concurrency" wouldn't work for this case.  The idea is to perform regular checks for "library cache*" waits, and if repeated checks (at least 2) find x number of sessions waiting on these events, then an alert should be raised.  If we check for "Concurrency", we could run into a situation where at timestamp #1 there were 10 sessions waiting on "library cache lock" and at timestamp #2 there were 10 sessions waiting on "enq: HV - contention".  Two consecutive checks found 10 or more sessions waiting on the metric, so an alert would be raised, even though in my contrived example it's not what we want.

Another option would be to either use a temporary table to store snapshots of current session waits or to query AWR data.  A temporary table wouldn't work because this is a read-only standby.  AWR data wouldn't work because the monitored activity would be in the past, the earliest being one snapshot ago.  Plus, again, this is a standby database, so any AWR data present would be a copy of the primary's.

Our solution was to go with a Metric Extension.  The ME was based on the cursor:

SELECT COUNT(*) total_sessions
  FROM v$session
 WHERE state = 'WAITING'
   AND event LIKE 'library cache%';


 ... set at the instance level.  Thresholds were set for 2 (or more) consecutive checks finding 10 (warning) or 20 (critical) waiting sessions.  After updating the appropriate Incident Rule, everything worked as expected.