Data-Distribution Cache

The optimizer uses distribution statistics generated by the UPDATE STATISTICS statement in the MEDIUM or HIGH mode to determine the query plan with the lowest cost. The first time that the optimizer accesses the distribution statistics for a column, the database server retrieves the statistics from the sysdistrib system catalog table on disk. Once the database server has accessed the distribution statistics, it places that information in the data-distribution cache in memory.

Figure 8 shows how the database server accesses the data-distribution cache for multiple users. When the optimizer accesses the column distribution statistics for User 1 for the first time, the database server puts the distribution statistics in the data-distribution cache. When the optimizer determines the query plan for User 2, User 3, and User 4, who access the same column, the database server does not have to read from disk to access the data-distribution information for that column. Instead, it reads the distribution statistics from the data-distribution cache in shared memory.

Figure 8. Data-Distribution Cache


The database server initially places pages for the sysdistrib system catalog table in the buffer pool as it does all other data and index pages. However, the data-distribution cache offers additional performance advantages. It:

- Is organized in a more efficient format
- Is organized to allow fast retrieval
- Bypasses the overhead of the buffer pool management
- Frees more pages in the buffer pool for actual data pages rather than system catalog pages
- Reduces I/O operations to the system catalog table
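
The lookup behavior described above (one read from sysdistrib on first access, then in-memory reads for later users) can be illustrated with a minimal read-through cache. This is only a sketch: the class name, the dictionary that stands in for the sysdistrib table, and the stored strings are hypothetical and say nothing about how the database server implements the cache internally.

```python
# Hypothetical stand-in for the sysdistrib system catalog table on disk.
SYSDISTRIB_ON_DISK = {
    ("orders", "order_num"): "distribution for orders.order_num",
    ("customer", "lname"): "distribution for customer.lname",
}


class DistributionCache:
    """Toy read-through cache keyed by (table, column)."""

    def __init__(self):
        self._entries = {}
        self.disk_reads = 0  # counts simulated reads of sysdistrib

    def get(self, table, column):
        key = (table, column)
        if key not in self._entries:
            # First access: read from the catalog and keep a copy in memory.
            self.disk_reads += 1
            self._entries[key] = SYSDISTRIB_ON_DISK[key]
        return self._entries[key]


cache = DistributionCache()
for user in ("User 1", "User 2", "User 3", "User 4"):
    cache.get("orders", "order_num")

print(cache.disk_reads)  # 1
```

Only the lookup for User 1 touches the simulated catalog. The other three lookups are served from memory, which is the saving that Figure 8 depicts.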

Data-Distribution Configuration

The database server uses a hashing algorithm to store and locate information within the data-distribution cache. The DS_POOLSIZE configuration parameter controls the size of the data-distribution cache and specifies the total number of column distributions that can be stored in it. To modify the number of buckets in the data-distribution cache, use the DS_HASHSIZE configuration parameter. The following formula determines the average number of column distributions that can be stored in one bucket:

Distributions_per_bucket = DS_POOLSIZE / DS_HASHSIZE

To modify the number of distributions per bucket, change either the DS_POOLSIZE or DS_HASHSIZE configuration parameter.

For example, with the default values of 127 for DS_POOLSIZE and 31 for DS_HASHSIZE, you can potentially store distributions for about 127 columns in the data-distribution cache. The cache has 31 hash buckets, and each hash bucket holds an average of about 4 entries (127 / 31).
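
As a quick check of the formula, the following sketch computes the average number of distributions per bucket for the default values and for the larger starting values suggested later in this section. The function name is an illustrative choice, not part of any product interface.

```python
def distributions_per_bucket(ds_poolsize, ds_hashsize):
    """Average number of column distributions per hash bucket."""
    return ds_poolsize / ds_hashsize


default = distributions_per_bucket(127, 31)
larger = distributions_per_bucket(2000, 503)

print(round(default, 2))  # 4.1
print(round(larger, 2))   # 3.98
```

Both settings keep the average at about 4 entries per bucket, so the larger configuration scales the cache without lengthening the lists that must be searched.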

The values that you set for DS_HASHSIZE and DS_POOLSIZE depend on the following factors:

- The number of columns for which you run UPDATE STATISTICS in HIGH or MEDIUM mode and that you expect to be used most often in frequently executed queries

  If you do not specify columns when you run UPDATE STATISTICS for a table, the database server generates distributions for all columns in the table.

  You can use the values of DD_HASHSIZE and DD_HASHMAX as guidelines for DS_HASHSIZE and DS_POOLSIZE. The DD_HASHSIZE and DD_HASHMAX configuration parameters specify the size of the data-dictionary cache, which stores information and statistics about tables that queries access.

  For medium to large systems, you can start with the following values:
  - DD_HASHSIZE 503
  - DD_HASHMAX 4
  - DS_HASHSIZE 503
  - DS_POOLSIZE 2000

  Monitor these caches to see the actual usage, and adjust these parameters accordingly. For monitoring information, see Monitoring the Data-Distribution Cache.

- The amount of memory available

  The amount of memory required to store distributions for a column depends on the level at which you run UPDATE STATISTICS. Distributions for a single column might require between 1 kilobyte and 2 megabytes, depending on whether you specify MEDIUM or HIGH mode or enter a finer resolution percentage when you run UPDATE STATISTICS.
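
To get a feel for the memory range involved, the sketch below multiplies a DS_POOLSIZE value by the per-column bounds quoted above (1 kilobyte to 2 megabytes). It assumes 1 kilobyte = 1024 bytes and that every cached column sits at one extreme. Real usage falls somewhere between the two figures, so treat the result as a rough envelope rather than a sizing rule.

```python
KB = 1024
MB = 1024 * KB


def cache_memory_bounds(ds_poolsize, min_per_column=1 * KB, max_per_column=2 * MB):
    """Return (lowest, highest) bytes needed if the cache holds ds_poolsize columns."""
    return ds_poolsize * min_per_column, ds_poolsize * max_per_column


low, high = cache_memory_bounds(2000)
print(low // KB, "KB to", high // MB, "MB")  # 2000 KB to 4000 MB
```

The wide spread shows why the resolution that you choose for UPDATE STATISTICS matters as much as the DS_POOLSIZE value when you budget memory for this cache.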

If the size of the data-distribution cache is too small, the following performance problems can occur:

- The database server uses the DS_POOLSIZE value to determine when to remove entries from the data-distribution cache. However, if the optimizer needs the dropped distributions for another query, the database server must reaccess them from the sysdistrib system catalog table on disk. The additional I/O and buffer pool operations to access sysdistrib on disk add to the total response time of the query.

  The database server tries to maintain the number of entries in the data-distribution cache at the DS_POOLSIZE value. If the total number of entries comes within an internal threshold of DS_POOLSIZE, the database server uses a least recently used mechanism to remove entries from the data-distribution cache. The number of entries in a hash bucket can go past this DS_POOLSIZE value, but the database server eventually reduces the number of entries when memory requirements drop.

- If DS_HASHSIZE is small and DS_POOLSIZE is large, overflow lists can be long and require more search time in the cache.

  Overflow occurs when a hash bucket already contains an entry. When multiple distributions hash to the same bucket, the database server maintains an overflow list to store and retrieve the distributions after the first one.

If DS_HASHSIZE and DS_POOLSIZE are approximately the same size, the overflow lists might be smaller or even nonexistent, which might waste memory. However, the amount of unused memory is insignificant overall.
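
The two effects described above (extra disk reads when the pool is too small, and long overflow lists when there are few buckets) can be reproduced with a toy model. Everything in this sketch is an assumption made for illustration: the class, the CRC-32 hash, the column names, and the strict eviction at exactly `poolsize` entries. The database server uses its own hashing algorithm and an internal threshold that this model does not attempt to copy.

```python
import zlib
from collections import OrderedDict


class ToyDistributionCache:
    """Hash buckets with overflow lists plus least-recently-used eviction."""

    def __init__(self, hashsize, poolsize):
        self.hashsize = hashsize
        self.poolsize = poolsize
        self.buckets = [[] for _ in range(hashsize)]
        self.lru = OrderedDict()  # name -> bucket index, oldest entry first
        self.disk_reads = 0       # simulated reads of sysdistrib

    def get(self, name):
        idx = zlib.crc32(name.encode()) % self.hashsize
        chain = self.buckets[idx]
        if name in chain:  # linear scan of the bucket's overflow list
            self.lru.move_to_end(name)
            return "hit"
        self.disk_reads += 1
        if len(self.lru) >= self.poolsize:
            # Pool is full: drop the least recently used distribution.
            victim, victim_idx = self.lru.popitem(last=False)
            self.buckets[victim_idx].remove(victim)
        chain.append(name)
        self.lru[name] = idx
        return "miss"

    def longest_chain(self):
        return max(len(bucket) for bucket in self.buckets)


columns = [f"tab.col{i}" for i in range(200)]  # hypothetical column names

small_pool = ToyDistributionCache(hashsize=31, poolsize=100)
big_pool = ToyDistributionCache(hashsize=31, poolsize=200)
for _ in range(2):  # two passes over the same 200 columns
    for column in columns:
        small_pool.get(column)
        big_pool.get(column)

print(small_pool.disk_reads)  # 400
print(big_pool.disk_reads)    # 200
print(big_pool.longest_chain())
```

With room for only 100 of the 200 columns, the small pool evicts each entry before it is needed again, so every lookup in both passes goes to disk. The larger pool reads each distribution once, but with only 31 buckets at least one overflow list must hold 7 or more entries (200 / 31 rounded up), which is the search cost that a larger DS_HASHSIZE reduces.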

Monitoring the Data-Distribution Cache

To monitor the size and use of the data-distribution cache, run onstat -g dsc or use the ISA Performance -> Cache menu options. You might want to change the values of DS_HASHSIZE and DS_POOLSIZE if you see the following situations:

- If the data-distribution cache is full most of the time and commonly used columns are not listed in the distribution name field, try increasing the values of DS_HASHSIZE and DS_POOLSIZE.
- If the total number of entries is much lower than DS_POOLSIZE, you can reduce the values of DS_HASHSIZE and DS_POOLSIZE.

Figure 9 shows sample output for onstat -g dsc.

Figure 9. onstat -g dsc Output

onstat -g dsc

Distribution Cache:
    Number of lists             : 31
    DS_POOLSIZE                 : 127

Distribution Cache Entries:

list#  id  ref_cnt  dropped?  heap_ptr  distribution name
-----------------------------------------------------------------
5      0   0        0         aa8f820   vjp_stores@gilroy:virginia.orders.order_num
12     0   0        0         aa90820   vjp_stores@gilroy:virginia.items.order_num
15     0   0        0         a7e9a38   vjp_stores@gilroy:virginia.customer.customer_num
19     0   0        0         aa3bc20   vjp_stores@gilroy:virginia.customer.lname
21     0   0        0         aa3cc20   vjp_stores@gilroy:virginia.orders.customer_num
28     0   0        0         aa91820   vjp_stores@gilroy:virginia.customer.company

Total number of distribution entries: 6.
    Number of entries in use    : 0

The onstat -g dsc output has the following fields.

Field: Description
Number of Lists: Number of buckets or lists that DS_HASHSIZE specifies
DS_POOLSIZE: Number of column distributions allowed in data-distribution cache
Number of entries: Number of column distributions currently in the data-distribution cache
Number of entries in use: Number of column distributions currently in use
List #: Hash bucket number
Id: Not used
Ref cnt: Number of SQL statements currently referencing the data-distribution information for this column from the cache
Dropped?: Whether the column distribution has been dropped with the DROP DISTRIBUTIONS keywords of the UPDATE STATISTICS statement
Heap ptr: Heap pointer
Distribution name: Name of the table and column that the data-distribution information describes
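
The monitoring guidance above compares the total number of entries with DS_POOLSIZE. The following sketch pulls those summary fields out of onstat -g dsc text and reports how full the cache is. The regular expressions assume the layout of the sample output in Figure 9, and the `summarize` function is a hypothetical helper. Output from other server versions might be laid out differently.

```python
import re

# Sample text copied from the Figure 9 output (spacing normalized).
SAMPLE = """\
Distribution Cache:
    Number of lists             : 31
    DS_POOLSIZE                 : 127

Distribution Cache Entries:

list#  id  ref_cnt  dropped?  heap_ptr  distribution name
-----------------------------------------------------------------
5      0   0        0         aa8f820   vjp_stores@gilroy:virginia.orders.order_num
12     0   0        0         aa90820   vjp_stores@gilroy:virginia.items.order_num
15     0   0        0         a7e9a38   vjp_stores@gilroy:virginia.customer.customer_num
19     0   0        0         aa3bc20   vjp_stores@gilroy:virginia.customer.lname
21     0   0        0         aa3cc20   vjp_stores@gilroy:virginia.orders.customer_num
28     0   0        0         aa91820   vjp_stores@gilroy:virginia.customer.company

Total number of distribution entries: 6.
    Number of entries in use    : 0
"""


def summarize(text):
    """Extract the summary counters; a field is None if it is not found."""

    def field(pattern):
        match = re.search(pattern, text)
        return int(match.group(1)) if match else None

    return {
        "lists": field(r"Number of lists\s*:\s*(\d+)"),
        "poolsize": field(r"DS_POOLSIZE\s*:\s*(\d+)"),
        "entries": field(r"Total number of distribution entries\s*:\s*(\d+)"),
        "in_use": field(r"Number of entries in use\s*:\s*(\d+)"),
    }


stats = summarize(SAMPLE)
fill = stats["entries"] / stats["poolsize"]
print(stats)
print(f"{fill:.1%} of DS_POOLSIZE used")  # 4.7% of DS_POOLSIZE used
```

In this sample only 6 of 127 slots are occupied. If that stayed true under normal load, it would match the second situation listed above, where the values of DS_HASHSIZE and DS_POOLSIZE can be reduced.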