Creating Data Distributions

You can use the LOW keyword with the UPDATE STATISTICS statement for all non-index leading columns that are part of an index.

You can also use the MEDIUM or HIGH keywords with the UPDATE STATISTICS statement to specify the mode for data distributions on specific columns. These keywords indicate that the database server is to generate statistics about the distribution of data values for each specified column and place that information in a system catalog table called sysdistrib. If a distribution has been generated for a column, the optimizer uses that information to estimate the number of rows that match a query against a column. Data distributions in sysdistrib supersede values in the colmin and colmax column of the syscolumns system catalog table when the optimizer estimates the number of rows returned.

When you use data-distribution statistics for the first time, try to update statistics in MEDIUM mode for all your tables and then update statistics in HIGH mode for all columns that head indexes. This strategy produces statistical query estimates for the columns that you specify. These estimates, on average, have a margin of error less than percent of the total number of rows in the table, where percent is the value that you specify in the RESOLUTION clause in the MEDIUM mode. The default percent value for MEDIUM mode is 2.5 percent. (For columns with HIGH mode distributions, the default resolution is 0.5 percent.) For each table that your query accesses, build data distributions according to the following guidelines.

To build data distributions for each table that your query accesses

Run UPDATE STATISTICS MEDIUM for all columns in a table that do not head an index.
This step is a single UPDATE STATISTICS statement. The default parameters are sufficient unless the table is very large, in which case you should use a resolution of 1.0 and confidence of 0.99.

With the DISTRIBUTIONS ONLY option, you can execute UPDATE STATISTICS MEDIUM at the table level or for the entire system because the overhead of the extra columns is not large. Run the following UPDATE STATISTICS statement to create distributions for non-index join columns and non-index filter columns:
```
UPDATE STATISTICS MEDIUM DISTRIBUTIONS ONLY;
```
Run UPDATE STATISTICS HIGH for all columns that head an index. For the fastest execution time of the UPDATE STATISTICS statement, you must execute one UPDATE STATISTICS HIGH statement for each column.
For example, suppose you have a table t1 with columns a, b, c, d, e, and f with the following indexes:
```
CREATE INDEX ix_1 ON t1 (a, b, c, d) ...
CREATE INDEX ix_3 ON t1 (f) ...
```
Run the following UPDATE STATISTICS statements for the columns that head an index:
```
UPDATE STATISTICS HIGH FOR TABLE t1(a);
UPDATE STATISTICS HIGH FOR TABLE t1(f);
```
These UPDATE STATISTICS HIGH statements replace the medium distributions that the previous step creates with high distributions for index columns.

Important:

Always execute the MEDIUM mode of UPDATE STATISTICS before the HIGH mode on a table. If you execute UPDATE STATISTICS in HIGH mode followed by MEDIUM mode, you lose the high distributions.
If you have indexes that begin with the same subset of columns, run UPDATE STATISTICS HIGH for the first column in each index that differs.
For example, suppose you have the following indexes on table t1:
```
CREATE INDEX ix_1 ON t1 (a, b, c, d) ...
CREATE INDEX ix_2 ON t1 (a, b, e, f) ...
CREATE INDEX ix_3 ON t1 (f) ...
```
Step 2 executes UPDATE STATISTICS HIGH on column a and column f by. Then run UPDATE STATISTICS HIGH on columns c and e.
```
UPDATE STATISTICS HIGH FOR TABLE t1(c);
UPDATE STATISTICS HIGH FOR TABLE t1(e);
```
In addition, you can run UPDATE STATISTICS HIGH on column b, but this step is usually not necessary.
For each single-column or multi-column index on the table:
1. Identify the set of all columns that appear in the index.
2. Identify the subset that includes all columns that are not the leading column of any index.
3. Run UPDATE STATISTICS LOW on each column in that subset.
For the sample indexes in Step 3, run the following UPDATE STATISTICS statement to update the sysindexes and syscolumns system catalog tables:
```
UPDATE STATISTICS FOR TABLE t1(a,b,c,d);
UPDATE STATISTICS FOR TABLE t1(a,b,e,f);
```
For small tables, run UPDATE STATISTICS HIGH.
```
UPDATE STATISTICS HIGH FOR TABLE t2;
```

Because the statement constructs the statistics only once for each index, these steps ensure that UPDATE STATISTICS executes rapidly.

For additional information about data distributions and the UPDATE STATISTICS statement, see the IBM Informix: Guide to SQL Syntax.