Tuesday, November 26, 2013

Troubleshooting 12c Cloud Control Performance Issues

Oracle 12c Cloud Control certainly has some great features that we are now taking advantage of. Specifically, the Oracle Fusion Middleware Plugins have been of great service to our project. Like many others, however, we experienced a huge drop in performance between 11g Grid Control and 12c Cloud Control after our upgrade. The issues we had were specific to navigating the Database Instance and Performance pages. It was not uncommon to see below database and performance pages hang for 20 minutes. To make matters worse, the entire application would hang if a couple of the DBA's were on at the same time.

What's the problem? I logged into the WebLogic Server Administration Console where our 12c Cloud Control is deployed and looked at the threads (Servers > EMGC_OMS1 > Monitoring > Threads). I noticed there were many stuck threads when we were accessing the DB Instance and Performance pages. I did a thread dump and saw a lot of the following:
"[STUCK] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'" waiting for lock oracle.jdbc.driver.T4CConnection@6e72e5f2 BLOCKED
If the other threads weren't stuck, they would be soon. The only other threads running were executing SQL queries on the database. The next obvious step was to check the database. I ran an AWR report for the time the application was hanging and found this:

Physical ReadsExecutionsReads per Exec %TotalElapsed Time (s)%CPU%IOSQL IdSQL ModuleSQL Text






























4,217,94414,217,944.004.79453.9973.9325.85f58fb5n0yvr7c EM Realtime Connection select 'uptime' stat_type, rou...

1 execution running for over 7 minutes from EM? That is not good!

Here is the full SQL (formatted)

SELECT 'uptime' stat_type,
       ROUND ( (SYSDATE - startup_time) * 24) v1,
       NULL v2,
       version v3,
       NULL v4
  FROM v$instance
UNION ALL
SELECT 'total_sga',
       SUM (VALUE) / 1024 / 1024 v1,
       NULL v2,
       NULL v3,
       NULL v4
  FROM v$sga
UNION ALL
SELECT 'storage',
       SUM (NVL (f.total_gb, 0) - NVL (s.used_gb, 0)) v1,
       NULL v2,
       NULL v3,
       NULL v4
  FROM dba_tablespaces t,
       (  SELECT tablespace_name,
                 SUM (NVL (bytes, 0)) / (1024 * 1024 * 1024) total_gb
            FROM dba_data_files
        GROUP BY tablespace_name) f,
       (  SELECT tablespace_name,
                 SUM (NVL (bytes, 0)) / (1024 * 1024 * 1024) used_gb
            FROM dba_segments
        GROUP BY tablespace_name) s
 WHERE     t.tablespace_name = f.tablespace_name(+)
       AND t.tablespace_name = s.tablespace_name(+)
       AND t.contents != 'UNDO'
       AND NOT (t.extent_management = 'LOCAL' AND t.contents = 'TEMPORARY')
UNION ALL
SELECT 'sysmetric',
       SUM (
          CASE
             WHEN metric_name = 'Average Active Sessions' THEN VALUE
             ELSE 0
          END)
          v1,
       SUM (CASE WHEN metric_name = 'Session Count' THEN VALUE ELSE 0 END) v2,
       NULL v3,
       NULL v4
  FROM v$sysmetric
 WHERE     GROUP_ID = 2
       AND metric_name IN ('Average Active Sessions', 'Session Count')
UNION ALL
  SELECT 'addm_findings',
         COUNT (*) v1,
         f.task_id v2,
         NULL v3,
         NULL v4
    FROM dba_advisor_findings f
   WHERE     f.task_id =
                (WITH snaps
                      AS (SELECT /*+ NO_MERGE */
                                MAX (s.snap_id) AS end_snap,
                                 MAX (v.dbid) AS dbid
                            FROM DBA_HIST_SNAPSHOT s, V$DATABASE v
                           WHERE s.dbid = v.dbid)
                 SELECT MAX (t.task_id) AS task_id
                   FROM dba_addm_tasks t, snaps s, dba_addm_instances i
                  WHERE     t.dbid = s.dbid
                        AND t.begin_snap_id = s.end_snap - 1
                        AND t.end_snap_id = s.end_snap
                        AND t.how_created = 'AUTO'
                        AND t.requested_analysis = 'INSTANCE'
                        AND t.task_id = i.task_id
                        AND i.instance_number =
                               SYS_CONTEXT ('USERENV', 'INSTANCE'))
         AND f.TYPE NOT IN ('INFORMATION', 'WARNING')
         AND f.parent = 0
         AND (f.filtered IS NULL OR f.filtered <> 'Y')
GROUP BY f.task_id


I ran the query using TOAD and found that it took between 9 and 18 minutes to complete depending on the database instance. After looking at the execution plan, I saw why. There was a NESTED LOOP over a 136,000 row result-set. The nested loop was okay, processing each row at about 8ms...but 136,000 times? Something wasn't right. After some searching, I stumbled on the following note from Oracle Support (Doc ID 1528334.1). It seemed to describe our scenario perfectly, however, it did not resolve our issue and the query performance did not improve. 

Interestingly, the problem was not happening in our OMS repository database. When I took a look at the same query's execution plan, I noticed that instead of the nested loop there was a hash join of two full table scans. This seemed more efficient. Why aren't they using the same plan? As a test, I decided to copy the existing plan from our repository and propagate the changes to some of our other database instances using the SQLT Diagnostic Tool. You can download this from Oracle Support (SQLT Diagnostic Tool (Doc ID 215187.1)). 

To create the sql profile from the repository instance, use the following script after downloading SQLT: coe_xfr_sql_profile.sql 

START coe_xfr_sql_profile.sql [SQL_ID] [PLAN_HASH_VALUE];

This will create a script similar to this one (coe_xfr_sql_profile_&&sql_id._&&plan_hash_value..sql) which you can then run on your database instances. In our case, the following SQL_ID and PLAN_HASH_VALUE were used (f58fb5n0yvr7c 2336039161). Once generated, I ran the sql script on some of the other instances and tested the results. 

old plan: 13 minutes
new plan: 9 seconds

HUGE difference. Now, to test out the application...



Wow, the DB Instance and Performance pages actually came up (about 10 seconds or so). The concurrency issues were gone as well! All is well again! I logged in to our Admin Console and noticed there were no more stuck threads or warnings. Everything was working great... but I could do better. I like things to go FAST and the few seconds of waiting just bother me. I'm not going to recommend the next steps. If this solved your issue, wonderful! I'm glad it helped.

From the explain plan, I could see that the bulk of the work was parsing the query that populates the dba_segments view. The offending SQL?

SELECT tablespace_name, SUM (NVL (bytes, 0)) / (1024 * 1024 * 1024) used_gb
FROM dba_segments
GROUP BY tablespace_name


What if we create a local version of dba_segments as a Materialized View? Why not? The DBSNMP schema is the one running the query so lets give it a try. As the SYS user, run the following:

GRANT SELECT ON sys_dba_segs TO DBSNMP;

GRANT EXECUTE ON DBMS_SPACE_ADMIN TO DBSNMP;

CREATE MATERIALIZED VIEW DBSNMP.DBA_SEGMENTS
(
   OWNER,
   SEGMENT_NAME,
   PARTITION_NAME,
   SEGMENT_TYPE,
   SEGMENT_SUBTYPE,
   TABLESPACE_NAME,
   HEADER_FILE,
   HEADER_BLOCK,
   BYTES,
   BLOCKS,
   EXTENTS,
   INITIAL_EXTENT,
   NEXT_EXTENT,
   MIN_EXTENTS,
   MAX_EXTENTS,
   MAX_SIZE,
   RETENTION,
   MINRETENTION,
   PCT_INCREASE,
   FREELISTS,
   FREELIST_GROUPS,
   RELATIVE_FNO,
   BUFFER_POOL,
   FLASH_CACHE,
   CELL_FLASH_CACHE
)
AS
   SELECT owner,
          segment_name,
          partition_name,
          segment_type,
          segment_subtype,
          tablespace_name,
          header_file,
          header_block,
            DECODE (BITAND (segment_flags, 131072),
                    131072, blocks,
                    (DECODE (BITAND (segment_flags, 1),
                             1, DBMS_SPACE_ADMIN.segment_number_blocks (
                                   tablespace_id,
                                   relative_fno,
                                   header_block,
                                   segment_type_id,
                                   buffer_pool_id,
                                   segment_flags,
                                   segment_objd,
                                   blocks),
                             blocks)))
          * blocksize,
          DECODE (BITAND (segment_flags, 131072),
                  131072, blocks,
                  (DECODE (BITAND (segment_flags, 1),
                           1, DBMS_SPACE_ADMIN.segment_number_blocks (
                                 tablespace_id,
                                 relative_fno,
                                 header_block,
                                 segment_type_id,
                                 buffer_pool_id,
                                 segment_flags,
                                 segment_objd,
                                 blocks),
                           blocks))),
          DECODE (BITAND (segment_flags, 131072),
                  131072, extents,
                  (DECODE (BITAND (segment_flags, 1),
                           1, DBMS_SPACE_ADMIN.segment_number_extents (
                                 tablespace_id,
                                 relative_fno,
                                 header_block,
                                 segment_type_id,
                                 buffer_pool_id,
                                 segment_flags,
                                 segment_objd,
                                 extents),
                           extents))),
          initial_extent,
          next_extent,
          min_extents,
          max_extents,
          max_size,
          retention,
          minretention,
          pct_increase,
          freelists,
          freelist_groups,
          relative_fno,
          DECODE (buffer_pool_id,  1, 'KEEP',  2, 'RECYCLE',  'DEFAULT'),
          DECODE (flash_cache,  1, 'KEEP',  2, 'NONE',  'DEFAULT'),
          DECODE (cell_flash_cache,  1, 'KEEP',  2, 'NONE',  'DEFAULT')
     FROM sys.sys_dba_segs;


Want to know how fast it runs now?

Old plan: 9 seconds
New plan: 284msecs

And the application? Faster than ever! Don't let your MV get stale. Make sure you create a job to refresh the MV at the frequency you see fit. I hope this helps. 






No comments: