Saturday, May 28, 2011

WAIT activity performs differently in a SOA Suite 11g cluster

Let's say you have a 2-node cluster of Oracle SOA Suite 11g and are running an asynchronous BPEL process. If the first node crashes, typically the instance will resume on the other node. This is normal cluster behavior.

However, this did not happen in one of the tests that I performed, and this was due to a WAIT activity in the BPEL process.


Test Performed

1. A BPEL process was instantiated, and we confirmed that it was running on "host2".

This was confirmed by running the following SQL statement while the instance was running:
SELECT create_cluster_node_id, cikey, creation_date, modify_date, state, composite_name
FROM cube_instance
WHERE creation_date >= TO_TIMESTAMP('2011-04-25 15:40', 'YYYY-MM-DD HH24:MI');

CREATE_CLUSTER_NODE_ID       CIKEY   CREATION_DATE                    MODIFY_DATE                      STATE  COMPOSITE_NAME
host2-soa_domain_soacluster  410003  25-APR-11 03.43.55.685000000 PM  25-APR-11 03.43.55.702000000 PM  1      TestLongRunningProcess

2. While this long-running BPEL process was executing, we killed the Unix process for "soa_server2" on "host2":
kill -9 21409
4. The "soa_server2" managed server on "host2" began to restart automatically.

5. While "soa_server2" was starting up, the BPEL process was still in the "Running" state.

6. After 16 minutes, once "soa_server2" was back online, the process completed successfully.
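
The kill in step 2 needs the PID of the managed server's JVM. Here is a minimal sketch of one way to look it up. It assumes the server was launched by the standard WebLogic start scripts, which pass -Dweblogic.Name=<server name> on the Java command line. The helper name pid_for_server is hypothetical and is not part of any Oracle tooling.

```shell
#!/bin/sh
# Hypothetical helper: print the PID of the JVM whose command line carries
# -Dweblogic.Name=<server>. It reads "pid args..." lines from stdin, so it
# can be fed live output from ps or sample text.
pid_for_server() {
    awk -v flag="-Dweblogic.Name=$1" '
        {
            # Field 1 is the PID; the remaining fields are the command line.
            for (i = 2; i <= NF; i++) {
                # Exact match, so "soa_server2" does not also hit "soa_server20".
                if ($i == flag) { print $1; break }
            }
        }'
}

# Example against live processes:
#   ps -eo pid,args | pid_for_server soa_server2
```

The printed PID is the one to pass to kill -9. Matching the whole argument rather than a substring avoids killing a similarly named server by mistake.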


Behavior of a SOA Suite 11g Cluster
  • If an asynchronous transaction is running, and the SOA server crashes, the instance resumes execution on the active node. This behavior is consistent for both BPEL and Mediator.
  • If the BPEL process uses a WAIT activity or the ONALARM branch of a PICK activity, the instance will not fail over to the active node. It resumes only when the server on which it crashed is restarted.

References:

  • http://download.oracle.com/docs/cd/E17904_01/relnotes.1111/e10133.pdf#page=243

Applicable Versions:
  • Oracle SOA Suite 11g (11.1.1.4)


Ahmed Aboulnaga
