Database Deadlock in a Distributed Application Architecture

Abstract

Deadlock conditions can be caused by poor programming practices, design deficiencies or database session collisions. Deadlocks are difficult to trace, especially in a distributed environment, due to the lack of tools and their sporadic nature. This paper presents an analysis of such a problem specific to the kind of application servers we were developing. Issues with respect to transaction atomicity and isolation level are also discussed. A synopsis of synchronous and asynchronous calls is presented, followed by a detailed discussion of how a transaction deadlock happens, along with solutions.

1. Introduction

In the C++ framework, one way for a server to communicate with another is through messaging. When two server instances run in separate process spaces, they communicate through a proprietary messaging system called MBX. MBX serves a similar purpose to IBM’s MQ, Microsoft’s MSMQ or TIBCO’s Rendezvous. In a typical case, a server can execute methods in another server by sending a specific API message to the mailbox (or 'subject', in TIBCO’s terminology) that the second server listens on. Such calls fall into two categories: synchronous and asynchronous. Which to use depends on the nature of the business requirement.
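The following is a minimal sketch of what such an inter-server call could look like from the caller's side. MBX is proprietary and its API is not shown in this paper, so the names used here (ApiMessage, MbxClient, sendSync, sendAsync, the mailbox string) are illustrative assumptions only.

    #include <iostream>
    #include <string>

    // Hypothetical stand-in for the proprietary MBX messaging API.
    struct ApiMessage {
        std::string api;      // name of the remote method to execute
        std::string payload;  // serialized arguments
    };

    class MbxClient {
    public:
        // Synchronous call: block until the server listening on 'mailbox' replies.
        // Returns the remote status code (0 = success, non-zero = failure).
        int sendSync(const std::string& mailbox, const ApiMessage& msg) {
            std::cout << "sync  -> " << mailbox << " : " << msg.api << "\n";
            return 0;  // stub: pretend the remote server succeeded
        }

        // Asynchronous call: enqueue the message and return immediately,
        // without waiting for the listening server to process it.
        void sendAsync(const std::string& mailbox, const ApiMessage& msg) {
            std::cout << "async -> " << mailbox << " : " << msg.api << "\n";
        }
    };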

2. Synchronous Call

When server A wants to execute a function in server B (for a service that is not available in A itself), and wants to wait for B to perform its task before continuing, a synchronous call is used. Server A can perform its own database transactions, then call a method in server B via synchronous messaging; server B in turn performs additional database transactions (while server A blocks, waiting) before returning to server A. However, if server B fails to execute the desired transaction, it returns an error (via messaging) and server A is required to take the necessary clean-up actions. The transaction class is designed such that an error upon exit of a function causes the framework to roll back all transactions that took place under the same context. So in this case, server B rolls back its transaction and returns an error to A, and A is responsible for checking the return code and rolling back accordingly. This kind of inter-server interaction is useful and safe only when server B performs non-critical database work, such as returning the result of a database query. It also works correctly only if both servers successfully commit their transactions in their own contexts; the atomicity of the transactions across both servers cannot be guaranteed, as described in the following scenario.
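The calling pattern above can be sketched as follows. The framework's real class names are not given in the paper, so Transaction, callServerB and the SQL text are assumptions; in the actual framework the rollback on error exit happens automatically, which is shown here as an explicit call for clarity.

    #include <iostream>
    #include <string>

    // Hypothetical stand-in for the framework's transaction handling.
    struct Transaction {
        void update(const std::string& sql) { std::cout << "A executes: " << sql << "\n"; }
        void commit()   { std::cout << "A commits its own context\n"; }
        void rollback() { std::cout << "A rolls back its own context\n"; }
    };

    // Synchronous call to server B; blocks until B replies with a status code.
    // Stubbed here: 0 means B committed its work, non-zero means B failed
    // and has already rolled back on its side.
    int callServerB(const std::string& api) {
        std::cout << "sync call to server B: " << api << "\n";
        return 0;
    }

    int doWork(Transaction& txn) {
        txn.update("UPDATE Table1 SET status = 'STEP1' WHERE id = 1");

        // Server A blocks here while server B runs its own transaction
        // in a completely separate database context.
        if (callServerB("performServiceInB") != 0) {
            // B has already rolled back its side; A cleans up its side.
            txn.rollback();
            return -1;
        }

        txn.commit();  // commits only A's work; B's work was committed separately
        return 0;
    }

    int main() {
        Transaction txn;
        return doWork(txn);
    }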

2.1 Scenario

Server A performs a database transaction, then calls server B for a service that performs another database transaction. Upon return from the call to server B, server A continues with a further database transaction to complete the operation. Assume the sequence of transactions matters to the business process. If server B fails to commit its transaction due to any form of error (database error, business logic validation error, etc.), it returns a failure code to server A and rolls back its own transactions. Server A, upon detecting the returned error, can also abort the transactions it performed before it made the call to server B. In this case, data consistency is still upheld. However, if server B returns success, its transaction is committed upon function exit (because server B runs in a different context). If a problem then occurs in server A while it carries on with its subsequent transactions, server A will roll back its own transaction but cannot roll back the transaction already committed by server B, because the database transactions performed by server A were executed in a totally separate context. The framework does not support two-phase commit in this scenario, hence the data inconsistency.

The following UML sequence diagram describes a scenario that could cause data inconsistency.


Figure 1: Fail Scenario

3. Asynchronous Call

In this form of call, server A can call a function in server B without having to wait for server B to return. A notification use case best describes the pattern. Say server A has done some work and wants to inform a person about the outcome, and the notification service is provided by server B. Server A can send an API message to server B and return immediately to perform its remaining tasks. Server B will take its time to notify the person according to the information specified in the API message sent from server A. If server B fails to perform its job, the failure does not affect the operation of server A; the application administrator can still find out what happened from server B’s execution log or through some form of server monitoring. Asynchronous calls are useful in many circumstances where the service requested by a server cannot be fulfilled immediately, or where the time it takes to fulfill the request is long enough to warrant an asynchronous call. One example in a wafer fabrication environment is when a server finishes performing some tasks on a wafer lot and wants that lot transported to another location. Server A can request the material handling server to move the material, but the physical transportation takes a fixed amount of time, say 20 minutes. In this case, server A can send the request to server B and execute its subsequent commands without waiting for the material to arrive at the destination.
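A sketch of the material-transport use case follows; the mailbox name, the sendAsync helper and the message contents are assumptions, since the real MBX API is not shown in the paper.

    #include <iostream>
    #include <string>

    // Hypothetical fire-and-forget send to another server's mailbox.
    void sendAsync(const std::string& mailbox, const std::string& api,
                   const std::string& payload) {
        std::cout << "enqueue " << api << " (" << payload << ") for " << mailbox << "\n";
        // Returns immediately; the receiving server processes the message
        // in its own time and its own context.
    }

    void finishLotProcessing(const std::string& lotId) {
        // ... server A completes its own work on the wafer lot ...

        // Ask the material handling server (server B) to move the lot.
        // The physical transport takes a fixed amount of time (say 20 minutes),
        // so server A does not wait for it.
        sendAsync("MATERIAL_HANDLING", "transportLot", lotId + ",DEST=NEXT_STOCKER");

        // Server A carries on immediately. If server B fails, the failure is
        // visible only in B's execution log or through server monitoring.
    }

    int main() {
        finishLotProcessing("LOT1234");
        return 0;
    }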
The following UML sequence diagram shows an asynchronous call from server A to server B.


Figure 2: Asynchronous Call

4. The Deadlock

Both of the interaction styles illustrated above, synchronous and asynchronous, are not without their problems.
An anti-pattern was observed in much of the source code written using the C++ framework. This particular way of coding has consistently caused transaction deadlock issues, some of which are hard to track down due to conditional branches during execution.
The following sequence diagram illustrates how a deadlock takes place.


Figure 3: Database deadlock

Before server A calls a service in server B, it performs an update to a row in Table1. By default, the Oracle server grants server A an exclusive row lock. The lock is not released until server A issues a commit or rollback command. Server A then calls server B. For business reasons, server B is required to update the same row in Table1, and the framework issues an UPDATE SQL statement against that row. However, at this point an exclusive lock has already been issued to server A, so the database server waits until server A issues a commit or rollback. But this cannot happen, because server A is still waiting for server B to complete its operation before it can issue a commit or rollback. This situation is what we call a ‘database deadlock’. The framework has no timeout mechanism for this kind of synchronous call, because a timeout is not desired.
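The sequence in Figure 3 reduces to the following timeline, written as comments because the two statements run in different processes; Table1 and the column values are placeholders, while the row-locking behaviour described is standard Oracle behaviour.

    // Timeline of the deadlock in Figure 3 (placeholder table and column names).
    //
    // Server A (database context 1):
    //   UPDATE Table1 SET status = 'BUSY' WHERE id = 1;
    //       -- Oracle grants A an exclusive row lock, held until commit/rollback.
    //   sendSync("SERVER_B", "updateTable1Row1");
    //       -- A now blocks, waiting for B's reply; it cannot commit or roll back.
    //
    // Server B (database context 2), while A is blocked:
    //   UPDATE Table1 SET status = 'DONE' WHERE id = 1;
    //       -- queues behind A's row lock, waiting for A to commit or roll back.
    //
    // A waits for B's reply, B waits for A's lock, and the synchronous call has
    // no timeout, so both servers hang indefinitely.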

Besides the deadlock issue, this kind of inter-server (inter-process) communication can also cause a data visibility problem. Because the two servers run in separate contexts, the database isolation level (the default is READ COMMITTED) prevents one session from ‘seeing’ what was updated by the other until the transaction is committed. If the updates performed by server A are required for server B to function correctly, server B will not work correctly because it is unaware of server A’s uncommitted changes.
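A companion sketch of the visibility problem under the default READ COMMITTED isolation level; again the table, column and values are placeholders.

    // Visibility under READ COMMITTED (placeholder names and values).
    //
    // Server A (uncommitted):
    //   UPDATE Table1 SET qty = 25 WHERE id = 1;   -- A's session sees qty = 25
    //
    // Server B, called by A before A commits:
    //   SELECT qty FROM Table1 WHERE id = 1;       -- B still sees the old qty
    //
    // If B's business logic depends on A's new value, B silently works with
    // stale data even though no error is raised.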

5. Detecting the Deadlock

In a complex program, a deadlock does not always happen. In a simple program without any conditional branches, a deadlock can be detected immediately during unit testing. For a complex program containing multiple possible execution branches, a failure to achieve complete test coverage allows the ‘bug’ to slip into the production environment. When a deadlock occurs, it is hard to track down. The symptom of a database deadlock is that the server hangs and does not return from its execution. Other calls to that server are blocked and are usually timed out by the transport infrastructure; during this time, however, the server is still ‘alive’. Multiple tools are required to track down the problem: reviewing the time-stamped execution log file, and running queries against the database system-level tables and views. The latter is not straightforward, but it is essential. How to identify a deadlock by querying the Oracle system-level tables and views is not covered in this paper.

6. Solving the Deadlock problem

There are two ways to solve the database deadlock problem. One is to issue a ‘commit’ statement before requesting a service from another server. This patch, however, fails the first item of the ACID test, namely atomicity. Let us look at this quick fix before delving into the proper solution.

6.1 The Quick-Fix

The following sequence diagram illustrates the first solution, which is to issue a ‘commit’ command against the database before making a call to server B.


Figure 4: A temporary solution

The above solution introduces another problem. Since server A issues a ‘commit’ command before making the call to server B, the transaction can no longer be rolled back should server B fail during the furtherUpdatetable1Row1 call. This leads to a serious data inconsistency problem, and further complications arise if all transactions succeed except updateTable_N in server A.
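The quick fix and the failure window it opens can be sketched as follows, reusing the hypothetical Transaction and callServerB helpers from the sketch in Section 2; updateTable_N is represented by a final placeholder UPDATE.

    // Quick-fix sketch following Figure 4 (placeholder names).
    int doWorkQuickFix(Transaction& txn) {
        txn.update("UPDATE Table1 SET status = 'STEP1' WHERE id = 1");
        txn.commit();   // releases A's row lock, so server B no longer blocks...
                        // ...but STEP1 is now permanent in the database

        if (callServerB("furtherUpdatetable1Row1") != 0) {
            // Too late: A's first update is already committed and cannot be
            // rolled back together with B's failed work.
            return -1;
        }

        // Equivalent of updateTable_N in Figure 4. If this step fails, B's work
        // and A's first update both stay committed while this final update is
        // lost, leaving the data inconsistent.
        txn.update("UPDATE Table_N SET status = 'DONE' WHERE id = 1");
        txn.commit();
        return 0;
    }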

6.2 The Proper Solution

Based on the analysis, and given that the persistence framework does not support two-phase commit in a distributed environment, a different approach is warranted.
We will first explain the server architecture created using the framework.


Figure 5: Server Architecture

Each server can load different modules grouped under the same family of functionality. A module here is a UNIX loadable shared library (shared object). A server is usually started with certain modules loaded, as specified in the server configuration file. Dependencies exist between the modules, and the linkages between them are dynamic. Inter-module calls are in-process invocations.

In a distributed environment, when Module 1 needs a service from Module 5, it sends an API message via the MBX messaging system. To solve the potential deadlock issue, Module 5 needs to be dynamically linked into Module 1. This means the following:

  • Link the Module 5 shared library into Module 1 (a UNIX makefile modification is required)
  • Include the header files of the required services in the Module 1 caller program
  • Ensure Module 5 is in the library load path of the server running Module 1
  • Replace all messaging calls to Module 5 with the in-process function calls declared in the related header files (a sketch of this step follows the list)
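The before/after sketch below illustrates the last step. The module, header and function names (module5::updateTable1Row1, module5/Table1Service.h) and the makefile line are assumptions, and MbxClient/ApiMessage are the hypothetical messaging types from the sketch in Section 1.

    #include <string>
    #include "module5/Table1Service.h"   // hypothetical header shipped with Module 5

    // Before: Module 1 asks Module 5 (hosted in server B) for the service over
    // MBX messaging -- a separate process with a separate database context.
    int processRequestViaMessaging(MbxClient& mbx, const std::string& lotId) {
        ApiMessage msg{"Module5::updateTable1Row1", lotId};
        return mbx.sendSync("SERVER_B", msg);
    }

    // After: with Module 5's shared library linked into server A's binary
    // (e.g. LIBS += -lmodule5 in the makefile), the same service becomes an
    // in-process call that runs inside server A's own transaction context,
    // so no row lock is ever held across a blocking inter-server call.
    int processRequestInProcess(const std::string& lotId) {
        return module5::updateTable1Row1(lotId);
    }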


After the modification, the architecture will look like the following.


Figure 6: Server Architecture (Modified)

Assumptions:

  • Binaries of Module 1 and Module 5 are platform compatible.
  • Header files of Module 5 are available.


It is not necessary to include Module 5 in server A’s configuration file. The loader is smart enough to detect the dependency and load the linked library into the process space.

7. Conclusions

It is important to understand the underlying database behavior when programming a database application. To solve this specific problem, the concepts of database transaction context, inter-process communication, database isolation level and the sequence of program execution must be well understood.