The optimizer uses distribution statistics generated by the UPDATE STATISTICS statement in the MEDIUM or HIGH mode to determine the query plan with the lowest cost. The first time that the optimizer accesses the distribution statistics for a column, the database server retrieves the statistics from the sysdistrib system catalog table on disk. Once the database server has accessed the distribution statistics, it places that information in the data-distribution cache in memory.
Figure 8 shows how the database server accesses the data-distribution cache for multiple users. When the optimizer accesses the column distribution statistics for User 1 for the first time, the database server puts the distribution statistics in the data-distribution cache. When the optimizer determines the query plan for user 2, user 3 and user 4 who access the same column, the database server does not have to read from disk to access the data-distribution information for the table. Instead, it reads the distribution statistics from the data-distribution cache in shared memory.
The database server initially places pages for the sysdistrib system catalog table in the buffer pool as it does all other data and index pages. However, the data-distribution cache offers additional performance advantages. It:
The database server uses a hashing algorithm to store and locate information within the data-distribution cache. The DS_POOLSIZE controls the size of the data-distribution cache and specifies the total number of column distributions that can be stored in the data-distribution cache. To modify the number of buckets in the data-distribution cache, use the DS_HASHSIZE configuration parameter. The following formula determines the number of column distributions that can be stored in one bucket.
Distributions_per_bucket = DS_POOLSIZE / DS_HASHSIZE
To modify the number of distributions per bucket, change either the DS_POOLSIZE or DS_HASHSIZE configuration parameter.
For example, with the default values of 127 for DS_POOLSIZE and 31for DS_HASHSIZE, you can potentially store distributions for about 127 columns in the data-distribution cache. The cache has 31 hash buckets, and each hash bucket can have an average of 4 entries.
The values that you set for DS_HASHSIZE and DS_POOLSIZE, depend on the following factors:
If you do not specify columns when you run UPDATE STATISTICS for a table, the database server generates distributions for all columns in the table.
You can use the values of DD_HASHSIZE and DD_HASHMAX as guidelines for DS_HASHSIZE and DS_POOLSIZE. The DD_HASHSIZE and DD_HASHMAX specify the size for the data-dictionary cache, which stores information and statistics about tables that queries access.
For medium to large systems, you can start with the following values:
Monitor these caches to see the actual usage, and you can adjust these parameters accordingly. For monitoring information, see Monitoring the Data-Distribution Cache.
The amount of memory required to store distributions for a column depends on the level at which you run UPDATE STATISTICS. Distributions for a single column might require between 1 kilobyte and 2 megabytes, depending on whether you specify medium or high mode or enter a finer resolution percentage when you run UPDATE STATISTICS.
If the size of the data-distribution cache is too small, the following performance problems can occur:
The database server tries to maintain the number of entries in data-distribution cache at the DS_POOLSIZE value. If the total number of entries reaches within an internal threshold of DS_POOLSIZE, the database server uses a least recently used mechanism to remove entries from the data-distribution cache. The number of entries in a hash bucket can go past this DS_POOLSIZE value, but the database server eventually reduces the number of entries when memory requirements drop.
Overflow occurs when a hash bucket already contains an entry. When multiple distributions hash to the same bucket, the database server maintains an overflow list to store and retrieve the distributions after the first one.
If DS_HASHSIZE and DS_POOLSIZE are approximately the same size, the overflow lists might be smaller or even nonexistent, which might waste memory. However, the amount of unused memory is insignificant overall.
To monitor the size and use of the data-distribution cache, run onstat -g dsc or use the ISA Performance -> Cache menu options. You might want to change the values of DS_HASHSIZE and DS_POOLSIZE if you see the following situations:
Figure 7 shows sample output for onstat -g dsc.
onstat -g dsc Distribution Cache: Number of lists : 31 DS_POOLSIZE : 127 Distribution Cache Entries: list#id ref_cnt dropped? heap_ptr distribution name ----------------------------------------------------------------- 5 0 0 0 aa8f820 vjp_stores@gilroy:virginia.orders.order_num 12 0 0 0 aa90820 vjp_stores@gilroy:virginia.items.order_num 15 0 0 0 a7e9a38 vjp_stores@gilroy:virginia.customer.customer_num 19 0 0 0 aa3bc20 vjp_stores@gilroy:virginia.customer.lname 21 0 0 0 aa3cc20 vjp_stores@gilroy:virginia.orders.customer_num 28 0 0 0 aa91820 vjp_stores@gilroy:virginia.customer.company Total number of distribution entries: 6. Number of entries in use : 0
The onstat -g dsc output has the following fields.