Monday, April 18, 2011

Connection Total Mismatch between Listener and Database


Normally I don't spend a lot of time reviewing connection details in my client's listener logs, but I came across a situation recently where failover and connection distribution were being questioned, so I figured it'd be best to review listener log data just to be safe.  What I ended up finding was that connection counts as recorded in the listener logs was nearly twice that of what "logons cumulative" statistic was showing, which in my mind wasn't right.
To perform any type of listener log analysis, I first needed some sort of tool, as the system involved is a 4-node OLTP RAC with 100,000's of connections happening per day.  Fortunately Arup Nanda did most of this work for me, based on his DBAzine article on 9/16/08 entitled "Mining Information from the Listener Log".  I used his notes to create an External Table for pointing to a given listener log and an Oracle function for properly parsing out details within lines of the listener log.  (See the end of this article for SQL queries involved to help analyze the data.)
As for the hardware and database involved, the following uses simplified names of targets to help illustrate the problem:

  • 4-node RAC running RHEL 4.8 and Oracle 10.2.0.2.
  • Nodes are named RAC1, RAC2, RAC3, and RAC4 with one 1 Instance of the database on each, named inst1, inst2, inst3, and inst4
  • 2 Oracle Services are defined to use Inst1 and Inst2 as a strictly OLTP set of balanced Instances (service "inst1and2") and use Inst3 and Inst4 as a strictly Batch set of balanced Instances (service "batch1and2").
Back to the problem, I found that Service "inst1and2" was getting 1,000's of connections per hour against Instances Inst3 and Inst4, which from my understanding couldn't happen.  I used "srvctl" as one way to validate this Service's definition, just to be sure:
% srvctl config service -d $db_unique_name -a

inst1and2 PREF: inst1 inst2 AVAIL:  TAF: NONE
batch1and2 PREF: inst3 inst4 AVAIL:  TAF: NONE

As a last check, I asked the Application team to send me the exact connection string they were using.  Sure enough, this pointed out the problem:
(DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = RACVIP3)(PORT = 1521)) (ADDRESS = (PROTOCOL = TCP)(HOST = RACVIP4)(PORT = 1521)) (LOAD_BALANCE = yes)(FAILOVER = ON) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = inst1and2)(FAILOVER_MODE = (TYPE = SELECT) (METHOD = BASIC))))

Testing this from my PC using "sqlplus user/pass@'' ", I reproduced the problem.  I found that RAC3 / 4's listener log recorded a connection:
18-APR-2011 09:52:03 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=inst1and2)(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC))(CID=(PROGRAM=C:\oracle\product\10.2.0\db_1\bin\sqlplus.exe)(HOST=myserver123)(USER=ME))) * (ADDRESS=(PROTOCOL=tcp)(HOST=...)(PORT=1234)) * establish * inst1and2 * 0

… and also RAC1 / 2's listener log recorded a connection:
18-APR-2011 09:52:03 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=inst1and2)(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC))(CID=(PROGRAM=C:\oracle\product\10.2.0\db_1\bin\sqlplus.exe)(HOST=myserver123)(USER=ME))(INSTANCE_NAME=inst1)) * (ADDRESS=(PROTOCOL=tcp)(HOST=...)(PORT=2345)) * establish * inst1and2 * 0

Notice the string "…INSTANCE_NAME…" in the listener log entry for RAC1 / 2.  According to MOS document 263599.1, when this is present it represents a redirected connection, which is true in the test case above.  Simple enough, except that this string DOESN'T occur in any of listener log entries in question.  Further down in the same MOS document, there's a short blurb about thin jdbc connections and that if these connections are redirected, there won't be any mention of this in the listener log, as in the string INSTANCE_NAME won't be present.
This matches my case, as all application servers involved are connecting via thin jdbc.  MOS document 263599.1 also points out that "lsnrctl services" can also point out connect redirects.  Checking RAC4 I see the following:


Service "inst1and2" has 2 instance(s).
  Instance "inst1", status READY, has 1 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:658909 refused:0 state:ready
         REMOTE SERVER
         (ADDRESS=(PROTOCOL=TCP)(HOST=RAC1)(PORT=1521))
  Instance "inst2", status READY, has 1 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:737610 refused:0 state:ready
         REMOTE SERVER
         (ADDRESS=(PROTOCOL=TCP)(HOST=RAC2)(PORT=1521))

For Service "inst1and2" on RAC4, 659,909 connections have been redirected to "inst1" and 737,610 connections have been redirected to "inst2", all since the listener was last restarted.
Although all of this helps account for any connection count discrepancies between the listener log and database statistic "logons cumulative", I'm left with having to check "lsnrctl services" and connect strings from application servers to help tie it all together.  It'd sure be a lot easier, at least to me, for Oracle to have included a message like "redirect" within the listener log entry to point out what happened.

SQL Code

The following is an example of a query I used to analyze listener log data via an external table.  I'm including this as I found a few cases where the data didn't line up as expected and I had to use a bit that knowledge to handle most of those cases.
SELECT min_log_date, max_log_date,
       CASE WHEN parse_listener_log_line_fnc(connect_string, 'HOST') IS NOT NULL AND
                 parse_listener_log_line_fnc(connect_string, 'HOST') NOT IN ('__jdbc__')
               THEN parse_listener_log_line_fnc(connect_string, 'HOST')
            ELSE REGEXP_REPLACE(REGEXP_REPLACE(protocol_info,
                                '(^.*HOST=)([0-9]+.[0-9]+.[0-9]+.[0-9]+).*\$', '\2'),
                                '(^.*HOST=)(lnx|racvip){1}([0-9]+.us.dnb.com).*\$', '\2\3')
       END source_machine
     , NVL(parse_listener_log_line_fnc(connect_string, 'PROGRAM'),
           parse_listener_log_line_fnc(connect_string, 'HOST'))
     , CASE WHEN connect_string IN ('ping', 'service_update', 'status') THEN connect_string
            WHEN protocol_info IN ('services', 'status') THEN protocol_info
            ELSE action
       END operation
     , NVL(service_name, protocol_info) service, return_code, COUNT(*)
  FROM listener_log_tb
     , (SELECT MIN(log_date) min_log_date, MAX(log_date) max_log_date
          FROM listener_log_tb)
 GROUP BY min_log_date, max_log_date
        , CASE WHEN parse_listener_log_line_fnc(connect_string, 'HOST') IS NOT NULL AND
                    parse_listener_log_line_fnc(connect_string, 'HOST') NOT IN ('__jdbc__')
                  THEN parse_listener_log_line_fnc(connect_string, 'HOST')
               ELSE REGEXP_REPLACE(REGEXP_REPLACE(protocol_info,
                                   '(^.*HOST=)([0-9]+.[0-9]+.[0-9]+.[0-9]+).*\$', '\2'),
                                   '(^.*HOST=)(lnx|racvip){1}([0-9]+.us.dnb.com).*\$', '\2\3')
          END
        , NVL(parse_listener_log_line_fnc(connect_string, 'PROGRAM'),
              parse_listener_log_line_fnc(connect_string, 'HOST'))
        , CASE WHEN connect_string IN ('ping', 'service_update', 'status') THEN connect_string
               WHEN protocol_info IN ('services', 'status') THEN protocol_info
               ELSE action
          END
        , NVL(service_name, protocol_info), return_code;