Dec-2005

Abstract

A C++ program has caused a severe performance impact. The core subject of this paper centered around the usage of C-style array versus a dynamic container. This paper describes the process and technical details of solving an automatic memory allocation performance issue. It outlines the challenges to implement a totally non-intrusive and garbage collected dynamic container solution.

1. Introduction

It is a common practice to define local variables to store information needed during execution of a program function. Local variables are automatic variables because they are created on the stack and are destroyed automatically when the function they reside goes out of scope. Stack pointer is wind back and memory is available again for next execution. While automatics are good, it is not always a good choice to use local variables when the size of the data structure is unknown or the data structure is too large to fit into the stack.

In many operating system, stack size are usually fixed. Operating systems like UNIX uses global setting to dictate a process stack size, while on Windows NT, the stack size can be specified during program compilation.

Large automatic object or deep recursive function calls could potentially cause a process to run out of stack space, causing a stack overflow exception. This paper discusses specifically problems caused by large array of local objects and a solution to it.

2. Problem Analysis

The first approach used in tracking down the problem was to perform a code review on the program. During the review process, we noticed an unusual practice of using a C++ data structure. The structure contained multiple members of different types, both primitive and structure. One of the members was array of structures, with the array size fixed at 25 elements. The structure in turn contains multiple members and within it another array of another structures. The nesting was up to three levels. The following pseudo structure represents the data structure.

#define ARR_SIZE 25

typedef struct
{
   CString      aString1;
   CString      aString2;
   COrdered     anOrdered;
   CDictionary  aDict;
   CString      aString3;
} STRUCT0;

typedef struct
{
   CString      aString1;
   int          aString2;
   CString      aString3;
   STRUCT0      aStruct0Array[ARR_SIZE];
} STRUCT1;

typedef struct
{
   CString      aString1;
   int          anInt;
   CString      aString2;
   STRUCT1      aStruct1Array[ARR_SIZE];
} STRUCT2;

typedef struct
{
  CString      aString1;
  CString      aString2;
  int          anInt;
  CString      aString3;
  CBoolean     aBool1;
  STRUCT2      aStruct2Array[ARR_SIZE];
  CString      aString4;
  CString      aString5;
  CString      aString6;
  CString      aString7;
  CString      aString8;
  CString      aString9;
  CBoolean     aBool2;
  CBoolean     aBool3;
} STRUCT3;

The structure above was instantiated as a local variable in the program being examined. This took up considerably large amount of stack space of the running process. A test showed that the size of (sizeof(STRUCT3)) the structure was 2148560 bytes (approximately 2MB). Instantiation of STRUCT3 type caused the Operating System to allocate a big chunk of continuous block of memory. An instance of STRUCT3 on the heap (by the new operator) caused the same amount of contiguous memory to be allocated. It was observed that the system performance degraded even more when there was not enough physical memory to serve the required continuous memory. A page out is required to ensure that the current running process have enough contiguous memory to for the STRUCT3 object.

The server that hosted the program in production mode was running on UNIX Tru64 version 5.0, which has a default thread stack size of 5MB. We were lucky the program did not cause a stack overflow due to sizeof(STRUCT3) was about 2MB. However even a slightest increase of ARR_SIZE will increase the sizeof(STRUCT3) exponentially, and stack overflow is guaranteed. When tested on a Windows NT platform, the program caused a stack overflow somewhere within the entry point of main(). A dumpbin on the test program shows that the default stack size reserved for this program was 1MB. Using editbin.exe to increase the stack size to 3MB for a ARR_SIZE of 25 did prevent the program from running into stack overflow.

int main()
{
STRUCT3 onTheStack; // take a long time to complete this line of code, caused stack overflow on Windows platform.
STRUCT3* onTheHeap = new STRUCT3; // took equally long to complete this statement.
return 0;
}

Let’s look at what happen under the hood, why the statement took such a long time to complete.
A C-style array implementation requires an allocation of contiguous memory block^[1]. This is by design. C-style arrays are accessed via its subscript operator []. The location of the next element in the array is obtained by adding the address of the current element with the size of the element itself. The design also allows the C-style arrays to be manipulated by functions such as memcpy(). When there is insufficient contiguous memory, the OS suspends the current thread and perform a page-out for the least used applications to free up more physical memories.
Each time an instance of STRUCT3 is created, all elements, including the elements in the arrays are default-initialized^[2]. This consumed a great deal of CPU cycles before the next statement could be executed.

3. Implementing the Solution

There were two ways of solving the problem. The first way was to convert the array into an ordered collection. The C++ framework being used did have an implementation of ordered collection called COrdered. Taking this approach means rewriting a big chuck of the existing code that sets/gets the values of the STRUCT3 structure.

The major gap between the existing implementation and using COrdered as the solution is described below.
The program under investigation contained codes that read and write to the data structure through array subscript notation, []. This is the traditional way of manipulating a C-style array. The following code snippet illustrates both reads and writes to the structure.

STRUCT3 aStruct3;
aStruct3.aStruct2Vec[1].anInt = 1; //sets value to an element
CString aString = aStruct3.aStruct2Vec[1].aString2; //gets value of an element

However, if COrdered was to replace array, the above codes will no longer compile/function due to the following limitations.
COrdered cannot contain a structure like STRUCT2. The insert method of COrdered only takes arguments in the forms of CObject. To make this work, the STRUCT2 structure needs to be converted to CObject as well. This was way too much work and potentially could cause the entire code based to be altered.

COrdered does not support a simple subscript operator. There are 3 overloads of [] operators defined in the header file
AutoPtr(CObject)& operator[]( int32 n ) { return refAt(n); }
The object returned by this version of [] operator returns a reference to an CObject instance wrapped in a smart pointer envelop, AutoPtr. The returned type cannot be dereferenced using “.” (dot) notation as follow

aStruct3.aStruct2Vec[1].anInt

due to the fact that AutoPtr(CObject) does not have a member called ‘anInt’. Access to anInt in AutoPtr(CObject) requires resolutions to a few levels of indirection. AutoPtr supports the -> notation to assess the CObject contained within it, but the CObject has to be casted into STRUCT2 object before its member ‘anInt’ could be reached.

const CObject* operator[]( int32 n ) const { return at(n); }

While this overload returns a pointer to CObject, it is a constant. As a result, assigning value to it or its members is not permitted.

const CObject*  operator()( int32 n ) const { return at(n); }

Using () operator in place of [] would not help either due to the const-ness of the returned value.
When a COrdered is created, it does not contain any element by default. Unlike array, all elements are pre-constructed (default-initialized) and are ready to be accessed via a [] operator. Reading/Setting (if at all possible) the value at index 1 when there are no entries in COrdered causes an index out-of-bound error to be thrown. Operator [] does not perform bounds-checking on the array.
For COrdered to work, the above snippet will look like this (pseudo).

STRUCT3 aStruct3;
//1) create an instance of STRUCT2 (assuming STRUCT2 inherits CObject)
//2) assign appropriate value to its members
//3) insert that element into COrdered

//To obtain the value from a member of one of the elements
//4) Use [], () or at() to access the element by position
//5) Cast (dereference if necessary) the return object appropriately into STRUCT2 object
//6) Access the desired member element

Having done step 1 to 6 above virtually mean rewriting the entire program, which consisted of about 5000 lines of codes.
The solution of using COrdered didn’t seem to work quite well.
Due the fact that the program under investigation was already running in production environment and the required response time to the issue was short, we looked into another alternative, which we considered as ‘daring’ and unconventional.

3.1 A Non-Intrusive Approach

We needed a data structure implementation that behaves like a C-style array, with dynamic memory allocation, and that does not require in any way, changes to be made to the existing codes that deal with it. This data structure will also help to amortize the cost of objects creation, destruction as well as to avoid unnecessary object creation to fill the entire c-style array. Most importantly, it was a non-intrusive solution.
In summary, we needed a new data structure that meets the following requirements:

Mutable [] operator
Dynamic memory allocation for each array elements
Automatic memory management (as a result of preceding requirement)
Does not require changes to existing program codes
Framework compatible

3.1.1 Defining a container of structures

There are three structures defined above, where all of them needs some non-intrusive wrapping. Of STRUCT0, STRUCT1 and STRUCT2 structures, we will use STRUCT1 structure in subsequent discussions.

There is a class in the framework called CPtrVector. CPtrVector is a Brooks implementation of a pointer vector. It is similar to STL’s vector of pointer type. CPtrVector is used extensively by all CCollection classes, where it serves as internal^[3] data structures. The flexibility and risk of using CPtrVector directly are avoided in CCollection implementation.

We chose CPtrVector because it has all characteristics that we needed. It is a macro^[4] that generates a pointer vector class for a specific object type. we need not worry about what object it points to, thus keeping the STRUCT1 and other form of structure as originally defined.

3.1.2 First attempt: A pointer vector of POD-structures

As mentioned earlier, the framework provides some handy macros that implement an CPtrVector. They are CDECLARE_PTR_VECTOR(type) and CDEFINE_PTR_VECTOR(type).

Without making any changes to the existing structures, we created a CPtrVector(T) for each of the structures.

//Declare a pointer vector class that holds pointers to structure, that is STRUCT0
CDECLARE_PTR_VECTOR(STRUCT0);
//Define the class
CDEFINE_PTR_VECTOR(STRUCT0);
typedef CPtrVector(STRUCT0) STRUCT0Vector;
//the same goes here for other structs
CDECLARE_PTR_VECTOR(STRUCT1);
CDEFINE_PTR_VECTOR(STRUCT1);
typedef CPtrVector(STRUCT1) STRUCT1Vector;
CDECLARE_PTR_VECTOR(STRUCT2);
CDEFINE_PTR_VECTOR(STRUCT2);
typedef CPtrVector(STRUCT2) STRUCT2Vector;

Now we have three new pointer vectors that will replace the c-style arrays in STRUCT1 STRUCT2 and STRUCT3 structures.

typedef struct
{
   CString      aString1;
   int           aString2;
   CString      aString3;
   STRUCT0Vector aStruct0Vec;
} STRUCT1;
typedef struct
{
   CString      aString1;
   int           anInt;
   CString      aString2;
   STRUCT1Vector    aStruct1Vec;
} STRUCT2;
typedef struct
{
   CString      aString1;
   CString      aString2;
   int           anInt;
   CString      aString3;
   CBoolean     aBool1;
   STRUCT2Vector aStruct2Vec;
   CString      aString4;
   CString      aString5;
   CString      aString6;
   CString      aString7;
   CString      aString8;
   CString      aString9;
   CBoolean     aBool2;
   CBoolean     aBool3;
} STRUCT3;

It is time to test the new data structure that got rid of the arrays of POD-sturcts. However, compiling the following program caused the compiler to generate a series of errors.

int main()
{
STRUCT3 aStruct3;
aStruct3.aStruct2Vec[1].aString1 << "aString1";
return 0;
}

The compiler generated this error

: error C2228: left of '.aString1' must have class/struct/union type

It was obvious that aStruct3.aStruct2Vec[1] does not resolved to a valid l-value object. A quick look up to the vector.h file revealed that CPtrVector does not support return-by-reference [] operator. Existing implementation returned a reference to a pointer of type instead.

T*& operator[]( int32 ndx ) { return at(ndx); }

Our main goal is to preserve the existing program codes. Modifying the code to allow access to the location pointed to by T* was not an option. Furthermore, element 1 of CPtrVector is not guaranteed to exist due to the nature of dynamic allocation, given the fact that current program code already assumed that all 25 elements in the array are pre-constructed. We have to address the problem at the root. We need to enhance the CPtrVector class. But how are we going to do this? The answer is ‘inheritance’. But before we jump into coding the enhance CPtrVector class, let’s consider also the other two more pointer vectors that will bump into the same problem; STRUCT1Vector and STRUCT0Vector. The best solution to this is to create a template class that can be reused.

In the template class, we will solve 2 problems; Make the [] operator to return a non-const reference to the element at any given position, and insert a new element whenever the element at certain position has yet to exist.
The resultant template looked like following.

template //T is the vector, U is the element contained in T
class WrapVec: public T
{
public:
    WrapVec() {}
    virtual ~WrapVec() {}
    U& operator[] (int32 ndx)
    {
        if (!(ndx >= 0 && ndx < T::entries()))
        {
            T::insertAt(ndx, (new U));
        }
        return *(T::at(ndx));
    }
};

This template fulfilled our requirements. The operator [] returns a reference to U. Before it does that, it checks for existence of element specified at ndx. The algorithm used here was crude as it assumed nth are always assigned before (n+1)th. The program will crash the rule is not observed. This solution was good enough to address current problem, though opportunity for improvement was there. Now the test program compiles without any problem. We will see the next problem regarding to dynamic memory allocation.
It is obvious that the following statement

T::insertAt(ndx, (new U));

will cause a memory leak. A new instance of object type U is created on the heap every time an element is first accessed. The following test demonstrates the severity of the leakage when each vector is populated with 25 elements.

Figure 1: Memory Leak Detected Under Windows Environment

3.1.3 Second Attempt: Addressing the memory leak problem

The vector severely caused a memory leak. Fortunately, CPtrVector comes with a method called cleanUp. This function basically iterates through each pointer of the element and performs a delete on it. However, this function was not called automatically when CPtrVector goes out of scope.

void CPtrVector(T)::cleanUp()          \
{ if (_entries > 0)                             \
  { for (--_entries; _entries >= 0; _entries--) \
    { T* p = _array[_entries]; _array[_entries] = nil; DESTROY(p); } \
    _entries = 0;                               \
  }                                             \
}

Noticed also this function calls another function named DESTROY(p) to delete the pointer p, where it is defined as

inline void DESTROY( CRefCounter* p ) { if (p) p->destroy(); }

It seemed that the framework will destroy an object of type CRefCounter.
CRefCounter is a framework implementation of a reference counter, used mainly in AutoPtr class (macro). It was ‘cannibalized’ here just to provide the basic need of reference counting. A full blown GC was not necessary.

template //T is the structure
class WrappStruct : public T, public CRefCounter
{
};

It is important to note that multiple inheritances was used to preserve the behavior of the structure, while introducing reference counting capability to the structure in a non-intrusive manner.
The following snippet shows modification to the vector wrapper template that incorporates a call to cleanUp function in the destructor.

template //T is the vector, U is the element contained in T
class WrapVec: public T
{
public:
    WrapVec() {}
    virtual ~WrapVec() { T::cleanUp(); }
    U& operator[] (int32 ndx)
    {
        if (!(ndx >= 0 && ndx < T::entries()))
        {
            T::insertAt(ndx, (new U));
        }
        return *(T::at(ndx));
    }
};

Putting it together, the entire modification to the structure is ready for a stress test.

//Declare a pointer vector class that holds pointers to structure, that is WrappedStruct0
typedef WrappStruct WrappedStruct0;
CDECLARE_PTR_VECTOR(WrappedStruct0);
//Define the class
CDEFINE_PTR_VECTOR(WrappedStruct0);
typedef CPtrVector(WrappedStruct0) STRUCT0Vector;
//the same goes here for other structs
typedef WrappStruct WrappedStruct1;
CDECLARE_PTR_VECTOR(WrappedStruct1);
CDEFINE_PTR_VECTOR(WrappedStruct1);
typedef CPtrVector(WrappedStruct1) STRUCT1Vector;
typedef WrappStruct WrappedStruct2;
CDECLARE_PTR_VECTOR(WrappedStruct2);
CDEFINE_PTR_VECTOR(WrappedStruct2);
typedef CPtrVector(WrappedStruct2) STRUCT2Vector;

3.2 Testing it

Testing the solution was another challenge. Since the approach was not conventional, we had to put it under a rigorous test. It was hard to predict what kind of data is sent by the caller. The test was designed to generate randomly huge combinations of possible values for each level of arrays in the structure. After the data structure was populated with certain data, its contents were dumped to the console for sanity check. The complete source code of the test program is available in the Appendix.

Full stress test was carried out under Tru64 UNIX platform with memory leak monitoring script running.

4. A Final Look at the Memory

The following diagram depicts the simplified view of the program stack before and after the modification. Note the stack size.

Figure 2: Before

Figure 3: After

5. Conclusions

The modification of existing data structure from using C-style array to a dynamic data structure was successful and had met the requirements defined earlier in this paper. To recap, the requirements were:

Has a mutable [] operator
Dynamic memory allocation for each array elements
Automatic memory management (as a result of preceding requirement)
Does not require changes to existing program codes, i.e. non-intrusive
Framework compatible

What we saw from C++ data structure literature^[5] has enabled us to derive the solution this specific problem discussed in this paper. It is necessary for developers to understand the design of different data structure, such as array, stack, linked list, que, deque, sorted, ordered containers, their strength and weakness, and to apply them appropriately. The data structure (CPtrVector) we introduced in the solution did improve the program performance by amortizing the cost of object creations, while maintaining the same efficiency O(1) for inserting a single new element. Looping of the container remained at O(n) efficiency.

We also noticed the design of CPtrVector uses a default value of 8 as its initial capacity and contained a simple algorithm to calculate the next reallocation size to grow the capacity of the container. Like the vector implementation in Silicon Graphic STL (and perhaps many other implementations), capacity of CPtrVector is always increased at a factor of 2.

Our knowledge in Memory Layout implementation of an Operating System has enabled us to understand the difference between stack and heap memory. Understand why and how a stack memory became corrupted and overflowed had direct impact on decision to employ a heap based solution to the problem.
Our understanding in the reference counting algorithms used by the framework to perform garbage collection had enable us to strip the GC subsystem into bare minimum, where only CReference component was used in conjunction with cleanUp.

Appendix D

Unit Test Program Listing

#define IMAX_BASE 25
#define JMAX_BASE 25
#define KMAX_BASE 25
void testVector()
{
  CDateTime dtNow;
  srand(dtNow.time().seconds());
  for (int32 x=0; x < 1000; x++)
  {
  STRUCT3 aStruct3;
  aStruct3.aString1 = "TID";
  aStruct3.aString2 = "USERID";
  int iseed = rand();
  srand((uint) iseed);
  int jseed = rand();
  srand((uint) jseed);
  int kseed = rand();
  srand((uint) kseed);
  int32 imax = (iseed%IMAX_BASE < 0 ? (int32) (IMAX_BASE + iseed%IMAX_BASE) : (int32) (iseed%IMAX_BASE));
  int32 jmax = (jseed%JMAX_BASE < 0 ? (int32) (JMAX_BASE + jseed%JMAX_BASE) : (int32) (jseed%JMAX_BASE));
  int32 kmax = (kseed%KMAX_BASE < 0 ? (int32) (KMAX_BASE + kseed%KMAX_BASE) : (int32) (kseed%KMAX_BASE));
  for (int32 i=0; i < imax; i++)
  {
    aStruct3.aStruct2Vec[i].aString1 << "aString1" << i;
    for (int32 j=0; j < jmax; j++)
    {
      aStruct3.aStruct2Vec[i].aStruct1Vec[j].aString1 << "aString1" << j;
      for (int32 k=0; k < kmax; k++)
      {
        CString tmpStr;
        tmpStr << "testxxx" << k;
        aStruct3.aStruct2Vec[i].aStruct1Vec[j].aStruct0Vec[k].aDict.insert(tmpStr, aStruct3.aStruct2Vec[i].aStruct1Vec[j].aString1 + tmpStr);
aStruct3.aStruct2Vec[i].aStruct1Vec[j].aStruct0Vec[k].aString1 << "aString2" << k;
       aStruct3.aStruct2Vec[i].aStruct1Vec[j].aStruct0Vec[0].aDict.insert(aStruct3.aStruct2Vec[i].aStruct1Vec[j].aStruct0Vec[k].aString1, tmpStr);
        aStruct3.aStruct2Vec[i].aStruct1Vec[j].aStruct0Vec[k].aString1 << "aString2" << k;
        cout << "aStruct3.aStruct2Vec[" << i << "].aStruct1Vec[" << j << "].aStruct0Vec[" << k << "].aString1=" << aStruct3.aStruct2Vec[i].aStruct1Vec[j].aStruct0Vec[k].aString1 << endl;
      }
      CSerializableOrdered colTemp(aStruct3.aStruct2Vec[i].aStruct1Vec[j].aStruct0Vec[0].aDict);
      CString msgStr;
      msgStr = colTemp.asString(msgStr);
      cout << msgStr << endl;
    }
}
  cout << "=========================================================================================" << endl;
  }
}

Complete Listing of modified data structure

template
class WrappStruct : public T , public CRefCounter
{
public:
    WrappStruct() {}
    virtual ~WrappStruct() {}
};
template
class WrapVec: public T
{
public:
    WrapVec() {}
    virtual ~WrapVec() { T::cleanUp(); }
    U& operator[] (int32 ndx)
    {
        if (!(ndx >= 0 && ndx < T::entries()))
        {
            T::insertAt(ndx, (new U));
        }
        return *(T::at(ndx));
    }
};

typedef WrappStruct WrappedStruct0;
CDECLARE_PTR_VECTOR(WrappedStruct0);
CDEFINE_PTR_VECTOR(WrappedStruct0);
typedef CPtrVector(WrappedStruct0) STRUCT0Vector;
CDECLARE_DESTRUCTOR(WrappedStruct0)
typedef WrapVec STRUCT0_;
typedef struct
{
   CString    aString1;
   int        aString2;
   CString    aString3;
   STRUCT0_   aStruct0Vec;
}STRUCT1;

typedef WrappStruct WrappedStruct1;
CDECLARE_PTR_VECTOR(WrappedStruct1);
CDEFINE_PTR_VECTOR(WrappedStruct1);
typedef CPtrVector(WrappedStruct1) STRUCT1Vector;
CDECLARE_DESTRUCTOR(WrappedStruct1)
typedef WrapVec STRUCT1_;

typedef struct
{
   CString      aString1;
   int          anInt;
   CString      aString2;
   STRUCT1_     aStruct1Vec;
} STRUCT2;

typedef WrappStruct WrappedStruct2;
CDECLARE_PTR_VECTOR(WrappedStruct2);
CDEFINE_PTR_VECTOR(WrappedStruct2);
typedef CPtrVector(WrappedStruct2) STRUCT2Vector;
typedef WrapVec STRUCT2_;
typedef struct
{
   CString      aString1;
   CString      aString2;
   int          anInt;
   CString      aString3;
   CBoolean     aBool1;
   STRUCT2_     aStruct2Vec;
   CString      aString4;
   CString      aString5;
   CString      aString6;
   CString      aString7;
   CString      aString8;
   CString      aString9;
   CBoolean     aBool2;
   CBoolean     aBool3;
} STRUCT3;

[1] ISO/IEC 14882:1998 C++ Standard (8.3.4/1)
[2] ISO/IEC 14882:1998 C++ Standard (8.5)
[3] It was noticed that CCollection uses composition instead of inheritance from CPtrVector object. Doing so hides the danger of letting user access members of CPtrVector directly.
[4] CPtrVector was created in the early 90’s, in the form of macro. At that time, C++ template was not a widely adopted approach and not many compilers support this feature, even until it was standardized in ISO/IEC 14882:1998. C++ Template was first presented by Bjarne Stroustrup at the 1988 USENIX C++ conference in Denver.
[5] The C++ Standard Library: A Tutorial And Reference, Nicolai Josuttis, Addison-Wesley 1999 Fundamentals of Data Structures, Ellis Horowitz and Sartaj Sahni, Chapter 2: Arrays

The Laboratory

Beware of large struct(ure) on the stack