you are a new oracle dba and you already know that sql queries can be parallelized. your production application has some important sql queries that are running in parallel but still the performance is not as expected. your manager asked you to investigate why the sql query is slow even though it is running in parallel.
to perform your investigation you should first understand the different terminologies used in the explain plan for a sql query that is running in parallel. The different terminologies could be producer slaves, consumer slaves , query coordinator, type of distribution like broadcast , random , parallel to serial operations, serial to parallel operations, degree of parallelism of the tables and indexes, different parameters that can influence the parallel behaviour of a query, how the load on a system could impact the parallel behaviour of a query etc.
Below article will explain you in detail about all the above terminologies and guides you with a proper approach if you want to tune the parallel query.
Checklist for Performance Problems with Parallel Execution [ID 280939.1] | |||
|
|||
Modified 29-APR-2010 Type BULLETIN Status PUBLISHED | |||
***Checked for relevance on 29-April-2010***
PURPOSE
-------
This note gives an overview of tuning parallel operations at
session and database level.
SCOPE & APPLICATION
-------------------
DBA's and Support analysts.
Checklist for parallel execution
---------------------------------
Tune a parallel operation at session level.
--------------------------------------------
To tune a statements that run in parallel, you have to understand
the basics of parallel execution.
We split SQL statement in smaller operations. Each of this operations
can done in parallel.
As example
select c.unit_price
from costs c, products p
where c.prod_id = p.prod_id;
We can scan the tables (costs,products) in parallel and we can
do the join in parallel. This means we have 3 smaller operations
that run in parallel.
We have slaves that are grouped in slave sets. 1 slave set ( called
Producer) only works at a certain time and the other slave set
( called Consumer ) waits to get data from the first slave set.
After it got all data it starts to work, it is now the Producer
slave set.
The slaves are controlled by the user background process (called
query coordinator QC ). In RAC environment the slaves for one
query maybe be are spawned on different Nodes.
We can use different methods to send data from one slave set
to another. we can send all rows from one slave to all other slaves
in the other slave set. This we call a 'BROADCAST' distribution.
Another example is a join on a partitioned table. Here every slave
in the slave set that does the join works only on range of partitions.
Then you see a 'PART' distribution method.
A slave set can do 1 or more operation in parallel. As example
a slave set that scans the table COSTS can also do the join.
Here is the execution plan of the example.
-----------------------------------------------------------------
|Id| Operation |Name | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------
| 0| SELECT STATEMENT | | | | |
| 1| PX COORDINATOR | | | | |
| 2| PX SEND QC (RANDOM) |:TQ10001|Q1,01| P->S | QC (RAND) |
| 3| HASH JOIN | |Q1,01| PCWP | |
| 4| PX RECEIVE | |Q1,01| PCWP | |
| 5| PX SEND BROADCAST |:TQ10000|Q1,00| P->P | BROADCAST |
| 6| PX BLOCK ITERATOR | |Q1,00| PCWC | |
| 7| TABLE ACCESS FULL|PRODUCTS|Q1,00| PCWP | |
| 8| PX BLOCK ITERATOR | |Q1,01| PCWC | |
| 9| TABLE ACCESS FULL |COSTS |Q1,01| PCWP | |
-----------------------------------------------------------------
We scan both tables COSTS,PRODUCTS in parallel. The slave set Q1,01
that scans table COSTS does also the hash join. The PRODUCTS table
is scaned by a other slave set Q1,00 and send via a BROADCAST distribution
method to the other slave set.
Some operations are running in serial faster than in parallel. Please verify
if you query is faster with parallel execution.
It can happen that the degree of parallelism was changed on tables or indexes
implicit by some operations. As example a import of data or a
"ALTER INDEX REBUILD PARALLEL" can change the degree of parallelism.
The basic information about parallel execution you can find
in the Data Warehousing Guide.
See also
Note.184417.1 Where you find Information about Parallel Execution
in the Oracle Documentation
In 9.x you see the distribution method in the "PX distributed" column
when you run the utlxplp script to format the explain plan output.
From 10g onwards you will see this in the "Operation" column, so it will
also appear in the output of the utlxpls.sql script.
See Note
Note.235530.1 Recommended Method for Obtaining a Formatted Explain Plan
Note.236026.1 Display Execution Plans with dbms_xplan.display
1.) Check the execution plan.
a.) Check if the query is well tuned.
b.) Are steps ( e.q join, read the tables ,..) running on the same slave set?
(Hint: You can see this in the IN-OUT column. If you have often a
PARALLEL_COMBINED_WITH_PARENT PCWP you are on the same slave set.)
This is good for the performance because we do not have to sent data from
one slave set to the other slave set.
E.g We want to join a 2 tables. For the scan of the 2 table we need 2
slave sets. It can happened that the slave set that scanned one table
can also do the join operation. We do not have to send the data of the
2 tables to a new slave set. You will see this in the IN-OUT column of
formatted execution plan output. See also the example above.
Features like full partition wise joins and partial partitions wise joins
can help to achieve that we have PCWP.
c.) Check ff some parts of the operation are running in serial
(e.g full table scan,index fast full scan )
(Hint: You can see this in the IN-OUT column. If you see a P->S or a
S->P you do some operations in serial.)
To scan small tables in serial is maybe faster than to do it in parallel.
Please check if the objects have a suitable degree setup.
2.) Check if any data are skewed when we send them between slave sets.
Run the following query after the parallel SQL statement is finished.
select DFO_NUMBER, tq_id, SERVER_TYPE, NUM_ROWS, PROCESS
from V$PQ_TQSTAT
order by 1,2, 3;
Please have in mind that you can query V$PQ_TQSTAT after the query
is finished not during the run of the query.
An example.
The same SQL statement as before. We force only a HASH distribution.
That all three steps ( scan both tables and the join) done on
different slave sets.
The prod_id columns has only 2 values in the costs tables.
PROD_ID COUNT(*)
---------- ----------
148 720
147 81392
select /*+ pq_distribute(c hash,hash) parallel(p,4) parallel(c,4) */
c.unit_price
from costs c, products p
where c.prod_id = p.prod_id;
------------------------------------------------------------------
|Id| Operation | Name | TQ |IN-OUT| PQ Distrib |
------------------------------------------------------------------
| 0| SELECT STATEMENT | | | | |
| 1| PX COORDINATOR | | | | |
| 2| PX SEND QC (RANDOM) |:TQ10002 |Q1,02| P->S | QC (RAND) |
| 3| HASH JOIN | |Q1,02| PCWP | |
| 4| PX RECEIVE | |Q1,02| PCWP | |
| 5| PX SEND HASH |:TQ10000 |Q1,00| P->P | HASH |
| 6| PX BLOCK ITERATOR | |Q1,00| PCWC | |
| 7| TABLE ACCESS FULL|PRODUCTS |Q1,00| PCWP | |
| 8| PX RECEIVE | |Q1,02| PCWP | |
| 9| PX SEND HASH |:TQ10001 |Q1,01| P->P | HASH |
|10| PX BLOCK ITERATOR | |Q1,01| PCWC | |
|11| TABLE ACCESS FULL|COSTS |Q1,01| PCWP | |
------------------------------------------------------------------
select DFO_NUMBER, TQ_ID, SERVER_TYPE, NUM_ROWS ,BYTES
from v$pq_tqstat
order by dfo_number , tq_id , server_type;
DFO_NUMBER TQ_ID SERVER_TYP NUM_ROWS BYTES
---------- ---------- ---------- ---------- ----------
............
1 2 Producer 81392 P002
1 2 Producer 0 P003
1 2 Producer 720 P001
1 2 Producer 0 P000
You see only 2 slaves have work to do.
Tune a parallel operation at database level.
---------------------------------------------
It is not so easy to tune parallel execution on database level. You can
tune some parameters like
- tune PARALLEL_MAX_SERVERS
- tune PARALLEL_EXECUTION_MESSAGE_SIZE
- set PARALLEL_ADAPTIVE_MULTI_USER
on a RAC System you can limit the number of instances with
- PARALLEL_INSTANCE_GROUP/INSTANCE_GROUPS
It is important that the CPU utilization is monitored. If the CPU utilization
is always high then you can try reducing the DOP or PARALLEL_MAX_SERVERS.
If your CPU utilization is high you get no benefit from parallel execution.
In same cases the usage of parallel execution under a high CPU utilization
can hurt the performance.
On Platform where OSWatcher is available it should be install and started
to monitor the system. For more information please have a look into
Note 301137.1 OS Watcher User Guide
This can help to check if there was a problem with high CPU utilization.
A good start value for parallel_max_servers is 10*(#CPU's in the system).
1.) Degree of parallelism (DOP)
Tuning the DOP is a balance between query performance and overall
database performance. If the query runs fine standalone it maybe
has a problem during the day when a lot of queries running in parallel
and we have not enough slaves that the query get the request degree.
You can be check via v$px_session for a single query or via
v$sysstat you can monitor how many parallel operations are downgraded.
The conclusion is that with a high DOP on a table/index only a few
queries get the requested DOP, this depends also of the setting of
PARALLEL_MAX_SERVERS. You can use Resource manager to limit
the degree for a group of users. As example your OTLP users get a low
DOP and your power users ( batch jobs ,..) get higher degrees.
The parameter PARALLEL_ADAPTIVE_MULTI_USER can also help.
A high DOP and a high value for PARALLEL_MAX_SERVERS can hurt your system
and cause a 100% CPU utilization.
Consider the use of different number for the DOP on your tables.
On large tables and their indexes use high degree like #CPU.
For smaller tables use DOP (#CPU)/2 as start value.
a.) Check if you get the request degree for your query in v$px_session
select distinct qcsid, degree , req_degree from v$px_session;
QCSID DEGREE REQ_DEGREE
---------- ---------- ----------
141 3 4
As you can see, we wanted a degree of 4 and we got a degree of 3.
b.) You can check in statspack if some of your queries have a
reduced the DOP
See columns
DFO trees parallelized
...........
Parallel operations downgraded 1
Parallel operations downgraded 25
Parallel operations downgraded 50
Parallel operations downgraded 75
Parallel operations downgraded to
Parallel operations not downgrade
or use
select name , value
from v$sysstat
where name like '%arallel%';
If a lot Parallel operations downgraded, please consider to
reduce the DOP.
2.) Settings of parallel execution parameters
We already discussed PARALLEL_MAX_SERVERS and PARALLEL_ADAPTIVE_MULTI_USER.
PARALLEL_EXECUTION_MESSAGE_SIZE is the size of the buffer that we need
to exchange data/messages between the slaves or between the slaves
and the QC. High values can improve the overall performance, but we maybe
need much more memory.
See Note:238680.1 Parallel Execution the Large/Shared Pool and ORA-4031
You can check the setting with the commands below.
a.) Use the following statements to check the setting of the important init.ora parameters.
select substr(name,1,30) name , substr(value,1,15) value
from v$parameter
where name like '%arallel%';
b.) With the command
select * from V$PX_PROCESS_SYSSTAT;
we could see the high water mark of parallel execution server.
3.) Checkpoint tuning
Parallel queries read large objects with a direct path read. Before we can do
the read we have to do a tablespace checkpoint to write the dirty buffers
back to the datafiles. So a checkpoint tuning and help to improve the
performance of parallel operations.
See also
Note.237328.1 Parallel Execution and Direct Reads
Parallel Execution with RAC systems
------------------------------------
PX is also automatically controlled on RAC systems.
Parallel execution does not allocate slaves randomly across
the available instances, but rather will start by allocating on the
least loaded instance. The goal is to both minimize inter-node traffic
and at the same time try to minimize any imbalance of work across
the instances.
The INSTANCE_GROUPS/PARALLEL_INSTANCE_GROUP parameter can be used to
restrict allocation of query slaves to specific instances in a RAC
configuration and over-ride the automatic allocation across instances.
This can improve the performance when there are problem with the
inter-connect.
Note: 181489.1 Tuning Inter-Instance Performance in RAC and OPS
Related
Products
Errors
|