What's the problem? I logged into the WebLogic Server Administration Console where our 12c Cloud Control is deployed and looked at the threads (Servers > EMGC_OMS1 > Monitoring > Threads). I noticed there were many stuck threads when we were accessing the DB Instance and Performance pages. I did a thread dump and saw a lot of the following:
"[STUCK] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'" waiting for lock oracle.jdbc.driver.T4CConnection@6e72e5f2 BLOCKED
If the other threads weren't stuck, they would be soon. The only other threads running were executing SQL queries against the database. The next obvious step was to check the database. I ran an AWR report for the time the application was hanging and found this:
Physical Reads | Executions | Reads per Exec | %Total | Elapsed Time (s) | %CPU | %IO | SQL Id | SQL Module | SQL Text |
---|---|---|---|---|---|---|---|---|---|
4,217,944 | 1 | 4,217,944.00 | 4.79 | 453.99 | 73.93 | 25.85 | f58fb5n0yvr7c | EM Realtime Connection | select 'uptime' stat_type, rou... |
A single execution running for over 7 minutes from EM? That is not good!
Here is the full SQL (formatted):
SELECT 'uptime' stat_type,
ROUND ( (SYSDATE - startup_time) * 24) v1,
NULL v2,
version v3,
NULL v4
FROM v$instance
UNION ALL
SELECT 'total_sga',
SUM (VALUE) / 1024 / 1024 v1,
NULL v2,
NULL v3,
NULL v4
FROM v$sga
UNION ALL
SELECT 'storage',
SUM (NVL (f.total_gb, 0) - NVL (s.used_gb, 0)) v1,
NULL v2,
NULL v3,
NULL v4
FROM dba_tablespaces t,
( SELECT tablespace_name,
SUM (NVL (bytes, 0)) / (1024 * 1024 * 1024) total_gb
FROM dba_data_files
GROUP BY tablespace_name) f,
( SELECT tablespace_name,
SUM (NVL (bytes, 0)) / (1024 * 1024 * 1024) used_gb
FROM dba_segments
GROUP BY tablespace_name) s
WHERE t.tablespace_name = f.tablespace_name(+)
AND t.tablespace_name = s.tablespace_name(+)
AND t.contents != 'UNDO'
AND NOT (t.extent_management = 'LOCAL' AND t.contents = 'TEMPORARY')
UNION ALL
SELECT 'sysmetric',
SUM (
CASE
WHEN metric_name = 'Average Active Sessions' THEN VALUE
ELSE 0
END)
v1,
SUM (CASE WHEN metric_name = 'Session Count' THEN VALUE ELSE 0 END) v2,
NULL v3,
NULL v4
FROM v$sysmetric
WHERE GROUP_ID = 2
AND metric_name IN ('Average Active Sessions', 'Session Count')
UNION ALL
SELECT 'addm_findings',
COUNT (*) v1,
f.task_id v2,
NULL v3,
NULL v4
FROM dba_advisor_findings f
WHERE f.task_id =
(WITH snaps
AS (SELECT /*+ NO_MERGE */
MAX (s.snap_id) AS end_snap,
MAX (v.dbid) AS dbid
FROM DBA_HIST_SNAPSHOT s, V$DATABASE v
WHERE s.dbid = v.dbid)
SELECT MAX (t.task_id) AS task_id
FROM dba_addm_tasks t, snaps s, dba_addm_instances i
WHERE t.dbid = s.dbid
AND t.begin_snap_id = s.end_snap - 1
AND t.end_snap_id = s.end_snap
AND t.how_created = 'AUTO'
AND t.requested_analysis = 'INSTANCE'
AND t.task_id = i.task_id
AND i.instance_number =
SYS_CONTEXT ('USERENV', 'INSTANCE'))
AND f.TYPE NOT IN ('INFORMATION', 'WARNING')
AND f.parent = 0
AND (f.filtered IS NULL OR f.filtered <> 'Y')
GROUP BY f.task_id
I ran the query using TOAD and found that it took between 9 and 18 minutes to complete, depending on the database instance. After looking at the execution plan, I saw why: there was a NESTED LOOP over a 136,000-row result set. The nested loop itself was fine, processing each row in about 8 ms... but 136,000 times? Something wasn't right. After some searching, I stumbled on a note from Oracle Support (Doc ID 1528334.1). It seemed to describe our scenario perfectly; however, it did not resolve our issue, and the query performance did not improve.
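For reference, one way to pull the actual runtime plan for the statement straight from the shared pool (a sketch; the sql_id is the one from the AWR report above, and the session needs SELECT_CATALOG_ROLE or equivalent privileges):

```sql
-- Show the last executed plan for the EM query, including runtime statistics.
-- Only works while the cursor is still cached in the shared pool.
SELECT *
  FROM TABLE (DBMS_XPLAN.display_cursor (sql_id => 'f58fb5n0yvr7c',
                                         format => 'ALLSTATS LAST'));
```

With ALLSTATS LAST you get actual rows and elapsed time per plan step, which makes the expensive nested-loop step easy to spot.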
Interestingly, the problem was not happening in our OMS repository database. When I looked at the same query's execution plan there, I saw a hash join of two full table scans instead of the nested loop, which seemed far more efficient. Why weren't the other instances using the same plan? As a test, I decided to copy the existing plan from our repository and propagate it to some of our other database instances using the SQLT Diagnostic Tool, which you can download from Oracle Support (SQLT Diagnostic Tool (Doc ID 215187.1)).
To create the SQL profile from the repository instance, use the coe_xfr_sql_profile.sql script that ships with SQLT:
START coe_xfr_sql_profile.sql [SQL_ID] [PLAN_HASH_VALUE];
This will create a script named along the lines of coe_xfr_sql_profile_&&sql_id._&&plan_hash_value..sql, which you can then run on your database instances. In our case, the SQL_ID and PLAN_HASH_VALUE were f58fb5n0yvr7c and 2336039161. Once the script was generated, I ran it on some of the other instances and tested the results.
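To confirm a target instance actually picked the profile up, you can check both the profile's status and whether the cursor is using it (a sketch; the coe_ name prefix is what the SQLT script generates by default):

```sql
-- Is the profile present and enabled on this instance?
SELECT name, status, created
  FROM dba_sql_profiles
 WHERE name LIKE 'coe_f58fb5n0yvr7c%';

-- Is the cached cursor actually using it? SQL_PROFILE is non-null when it is.
SELECT sql_id, plan_hash_value, sql_profile
  FROM v$sql
 WHERE sql_id = 'f58fb5n0yvr7c';
```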
Old plan: 13 minutes
New plan: 9 seconds
HUGE difference. Now, to test out the application...
Wow, the DB Instance and Performance pages actually came up (in about 10 seconds). The concurrency issues were gone as well! All was well again. I logged in to our Admin Console and saw no more stuck threads or warnings. Everything was working great... but I could do better. I like things to go FAST, and those few seconds of waiting still bothered me. Fair warning: I'm not going to officially recommend the next steps. If the SQL profile alone solved your issue, wonderful! I'm glad it helped.
From the explain plan, I could see that the bulk of the work came from the query behind the dba_segments view. The offending SQL?
SELECT tablespace_name, SUM (NVL (bytes, 0)) / (1024 * 1024 * 1024) used_gb
FROM dba_segments
GROUP BY tablespace_name
What if we created a local version of dba_segments as a materialized view? Why not? The DBSNMP schema is the one running the query, so let's give it a try. As the SYS user, run the following:
GRANT SELECT ON sys_dba_segs TO DBSNMP;
GRANT EXECUTE ON DBMS_SPACE_ADMIN TO DBSNMP;
CREATE MATERIALIZED VIEW DBSNMP.DBA_SEGMENTS
(
OWNER,
SEGMENT_NAME,
PARTITION_NAME,
SEGMENT_TYPE,
SEGMENT_SUBTYPE,
TABLESPACE_NAME,
HEADER_FILE,
HEADER_BLOCK,
BYTES,
BLOCKS,
EXTENTS,
INITIAL_EXTENT,
NEXT_EXTENT,
MIN_EXTENTS,
MAX_EXTENTS,
MAX_SIZE,
RETENTION,
MINRETENTION,
PCT_INCREASE,
FREELISTS,
FREELIST_GROUPS,
RELATIVE_FNO,
BUFFER_POOL,
FLASH_CACHE,
CELL_FLASH_CACHE
)
AS
SELECT owner,
segment_name,
partition_name,
segment_type,
segment_subtype,
tablespace_name,
header_file,
header_block,
DECODE (BITAND (segment_flags, 131072),
131072, blocks,
(DECODE (BITAND (segment_flags, 1),
1, DBMS_SPACE_ADMIN.segment_number_blocks (
tablespace_id,
relative_fno,
header_block,
segment_type_id,
buffer_pool_id,
segment_flags,
segment_objd,
blocks),
blocks)))
* blocksize,
DECODE (BITAND (segment_flags, 131072),
131072, blocks,
(DECODE (BITAND (segment_flags, 1),
1, DBMS_SPACE_ADMIN.segment_number_blocks (
tablespace_id,
relative_fno,
header_block,
segment_type_id,
buffer_pool_id,
segment_flags,
segment_objd,
blocks),
blocks))),
DECODE (BITAND (segment_flags, 131072),
131072, extents,
(DECODE (BITAND (segment_flags, 1),
1, DBMS_SPACE_ADMIN.segment_number_extents (
tablespace_id,
relative_fno,
header_block,
segment_type_id,
buffer_pool_id,
segment_flags,
segment_objd,
extents),
extents))),
initial_extent,
next_extent,
min_extents,
max_extents,
max_size,
retention,
minretention,
pct_increase,
freelists,
freelist_groups,
relative_fno,
DECODE (buffer_pool_id, 1, 'KEEP', 2, 'RECYCLE', 'DEFAULT'),
DECODE (flash_cache, 1, 'KEEP', 2, 'NONE', 'DEFAULT'),
DECODE (cell_flash_cache, 1, 'KEEP', 2, 'NONE', 'DEFAULT')
FROM sys.sys_dba_segs;
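This works because Oracle resolves an unqualified name to an object in the current schema before falling back to the public synonym: when the DBSNMP session references dba_segments, it now hits the local materialized view instead of SYS.DBA_SEGMENTS. A quick sanity check that the MV returns the same per-tablespace totals the EM query expects (a sketch, run as DBSNMP):

```sql
-- Run as DBSNMP: the unqualified name now resolves to the local MV,
-- so this is the same aggregation the EM query performs.
SELECT tablespace_name,
       SUM (NVL (bytes, 0)) / (1024 * 1024 * 1024) used_gb
  FROM dba_segments
 GROUP BY tablespace_name
 ORDER BY tablespace_name;
```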
Want to know how fast it runs now?
Old plan: 9 seconds
New plan: 284 ms
And the application? Faster than ever! One caveat: don't let the MV go stale. Make sure you create a job to refresh it at whatever frequency fits your environment. I hope this helps.
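For that refresh job, here is one possible DBMS_SCHEDULER sketch (the job name and the nightly 02:00 schedule are assumptions; pick an interval that matches how quickly your segment sizes drift):

```sql
-- Run as a DBA: complete-refresh the MV once a night at 02:00.
BEGIN
   DBMS_SCHEDULER.create_job (
      job_name        => 'DBSNMP.REFRESH_DBA_SEGMENTS_MV',   -- assumed name
      job_type        => 'PLSQL_BLOCK',
      job_action      => 'BEGIN DBMS_MVIEW.refresh (''DBSNMP.DBA_SEGMENTS'', ''C''); END;',
      start_date      => SYSTIMESTAMP,
      repeat_interval => 'FREQ=DAILY;BYHOUR=2',
      enabled         => TRUE);
END;
/
```

The 'C' method forces a complete refresh, which is appropriate here since the MV was created without a materialized view log to support fast refresh.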