Create and Maintain Data-Distribution Statistics

The optimizer uses the data-distribution statistics that are stored in the sysdistrib system catalog table to determine the lowest-cost query plan. Keep these statistics up to date so that the optimizer can choose the best query plan.

Generating Statistics on a Table

Identify the set of all columns that appear in any single-column or multi-column index on the table.
Identify the subset that includes all columns that are not the leading column of any index.
Run UPDATE STATISTICS LOW on each column in that subset.
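
The last step might look like the following minimal sketch. It assumes a hypothetical table named orders with one index on (customer_num, order_date) and another on (order_num); these names are illustrative and do not come from this guide. In that layout, order_date is the only indexed column that never leads an index, so it is the only column in the subset.

```sql
-- Hypothetical schema (assumption, not from this guide):
--   index ix_cust on orders (customer_num, order_date)
--   index ix_ord  on orders (order_num)
-- order_date is indexed but is never the leading column of an index.
UPDATE STATISTICS LOW FOR TABLE orders (order_date);
```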

To keep statistics current as table data changes, run UPDATE STATISTICS as follows:

If you update tables by adding many rows in batch processes or by attaching table fragments, run UPDATE STATISTICS as part of the update process.
If you update tables by inserting, deleting, and modifying data in individual rows, run UPDATE STATISTICS on a regular schedule.
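
As a rough sketch of the batch case, the statements below load rows and refresh statistics in the same script. The tables orders and orders_staging and the column names are hypothetical, and the choice of MEDIUM mode here is only an example; substitute the mode that the guidelines in this section suggest for your columns.

```sql
-- Hypothetical batch job: load the staged rows first...
INSERT INTO orders SELECT * FROM orders_staging;

-- ...then refresh statistics in the same script, so that the optimizer
-- sees the new data distribution before the next queries run.
UPDATE STATISTICS MEDIUM FOR TABLE orders (customer_num, order_date);
```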

Tip:

For multiprocessor coserver systems, such as those that have four CPUs on each node and as few as eight nodes, UPDATE STATISTICS can execute much faster if you set BUFFERS as high as 25,000 (100 megabytes) on each node. For uniprocessor coserver systems, you can set BUFFERS lower if there are more coserver nodes to do the work.

For detailed information about the syntax and options of the UPDATE STATISTICS statement, refer to the IBM Informix: Guide to SQL Syntax.

For unindexed columns in tables, follow these general guidelines when you run UPDATE STATISTICS:

Run UPDATE STATISTICS in MEDIUM mode for reasonably significant results.
UPDATE STATISTICS MEDIUM uses a statistical sampling method to create distributions. In MEDIUM mode, UPDATE STATISTICS samples the columns with a default resolution of 1 and a default confidence level of 0.95. In HIGH mode, the default resolution is 0.5 and the confidence level is 0.99.

To bring the MEDIUM mode output closer to the HIGH mode output, use the RESOLUTION clause in the UPDATE STATISTICS statement. HIGH mode statistics are more significant than MEDIUM mode statistics, but in HIGH mode the database server scans the table and records data for each row, which might take a very long time on a large table. The statistical sampling method used by MEDIUM mode is faster. Experiment, and see if UPDATE STATISTICS MEDIUM output data is adequate to produce good query plans.

Occasionally, you might find that you need to run UPDATE STATISTICS in HIGH mode, but in general adjusting the RESOLUTION clause for MEDIUM mode produces good statistics faster.
For columns that contain duplicate or skewed data, increase the precision of statistics by specifying the RESOLUTION clause.
If you know that a column is likely to contain many duplicate values or skewed data, run UPDATE STATISTICS MEDIUM with a lower resolution. You can specify a resolution as low as 0.005. The RESOLUTION you specify depends on the number of rows in the table and the amount of skewed or duplicated values you expect in the data.
Specify table and column names to collect statistics only on the columns used in filters and joins.
You do not need to collect statistics on columns that are not used in joins and filters. In an environment in which users enter ad hoc queries, you cannot be completely sure which columns are used as filters. However, common sense and an understanding of the kind of information that users probably want to derive from the table data will help you identify such columns.
Run UPDATE STATISTICS whenever the data distribution changes significantly. For example, run UPDATE STATISTICS after you alter a table by attaching or detaching a fragment or after you use batch loading to insert rows. However, you might not need to run UPDATE STATISTICS after you change data for a dozen customers.
Make the UPDATE STATISTICS statement part of the script that you run to attach and detach fragments or to execute the batch job.
Run UPDATE STATISTICS on a regular schedule for tables that are updated by sporadic inserts, updates, and deletes.
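
The guidelines for unindexed columns might translate into statements like the following sketch. The table orders and the columns ship_instruct, paid_date, and ship_charge are hypothetical, and the resolution and confidence values are examples chosen from the ranges mentioned above, not recommendations.

```sql
-- MEDIUM mode with the default resolution and confidence level,
-- limited to the columns that are used in filters and joins.
UPDATE STATISTICS MEDIUM FOR TABLE orders (ship_instruct, paid_date);

-- MEDIUM mode brought closer to HIGH mode output:
-- resolution 0.5 percent, confidence level 0.99.
UPDATE STATISTICS MEDIUM FOR TABLE orders (ship_instruct)
    RESOLUTION 0.5 0.99;

-- Column with many duplicate values or skewed data:
-- a lower resolution increases the precision of the distribution.
UPDATE STATISTICS MEDIUM FOR TABLE orders (ship_charge)
    RESOLUTION 0.1 0.95;
```

Start with the defaults and lower the resolution only if the resulting query plans are not adequate, because a lower resolution takes more time to build.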

If columns are indexed, follow these guidelines:

Run UPDATE STATISTICS HIGH or UPDATE STATISTICS MEDIUM RESOLUTION 0.05 98 for all columns that are indexed.
If indexes begin with the same subset of columns, run UPDATE STATISTICS HIGH for the first column that differs in each index.
For example, if index ix_1 is defined on columns a, b, c, and d, and index ix_2 is defined on columns a, b, e, and f, run UPDATE STATISTICS HIGH on column a by itself. Then run UPDATE STATISTICS HIGH on columns c and e. In addition, you can run UPDATE STATISTICS HIGH on column b, but this step is usually not necessary.
For each column in a multicolumn index on which you have not run UPDATE STATISTICS in a higher mode, run UPDATE STATISTICS in at least LOW mode.
Because the statement constructs the index information statistics only once for each index, these steps ensure that UPDATE STATISTICS executes rapidly.
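
The ix_1 and ix_2 example above can be sketched as follows. The table name tab1 is a placeholder; the column names come from the example. Running HIGH on columns c and e in a single statement is one reading of the guideline, and separate statements are equally valid.

```sql
-- Assumed: ix_1 on tab1 (a, b, c, d) and ix_2 on tab1 (a, b, e, f).

-- Shared leading column, by itself.
UPDATE STATISTICS HIGH FOR TABLE tab1 (a);

-- First column that differs in each index.
UPDATE STATISTICS HIGH FOR TABLE tab1 (c, e);

-- Remaining index columns that have no higher-mode statistics.
UPDATE STATISTICS LOW FOR TABLE tab1 (b, d, f);
```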

Tip:

To examine the data-distribution information for each column in a table, use the dbschema utility with the following syntax: dbschema -d database -hd tablename, where database is the database name and tablename is the table name.
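
To check only which columns have distributions and how they were built, you can also query the system catalog directly. The following query is a sketch: the table name orders is hypothetical, and the sysdistrib column names used here (tabid, colno, seqno, mode, resolution, confidence, constructed) are assumptions that you should verify against the system catalog documentation for your server version.

```sql
-- Assumption: sysdistrib has the columns tabid, colno, seqno, mode,
-- resolution, confidence, and constructed. Verify before use.
SELECT t.tabname, c.colname, d.mode, d.resolution,
       d.confidence, d.constructed
  FROM sysdistrib d, systables t, syscolumns c
 WHERE d.tabid = t.tabid
   AND c.tabid = d.tabid
   AND c.colno = d.colno
   AND d.seqno = 1          -- one row per column distribution
   AND t.tabname = 'orders';
```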