Sampled Queries: The SAMPLES OF Keywords (XPS)

In Extended Parallel Server, sampled queries are supported. Sampled queries are queries that are based on sampled tables. A sampled table is the result of randomly selecting a specified number of rows from the table, rather than all rows that match the selection criteria.

You can use a sampled query to gather quickly an approximate profile of data within a large table. If you use a sufficiently large sample size, you can examine trends in the data by sampling the data instead of scanning all the data. In such cases, sampled queries can provide better performance than scanning the data. If the value specified for the sample size is greater than the number of rows in the table, the whole table is scanned.

To indicate that a table is to be sampled, specify the number of samples to return before the SAMPLES OF keywords of the FROM clause. You can run sampled queries against tables and synonyms, but not against views.

Sampled queries are not supported in the INSERT, DELETE, or UPDATE statements, nor in other SQL statements that can include a query.

A sampled query has at least one sampled table. You do not need to sample all tables in a sampled query. You can specify the SAMPLES OF option for some tables in the FROM clause but not specify it for other tables.

The sampling method is known as sampling without replacement. This means that a sampled row is not sampled again. The database server applies selection criteria after samples are selected. Thus, the selection criteria restrict the sample set, rather than the rows from which the sample is taken.

If a table is fragmented, the database server divides the specified number of samples among the fragments. The number of samples from a fragment is proportional to the ratio of the size of a fragment to the size of the table. In other words, the database server takes more samples from larger fragments.

You must run UPDATE STATISTICS MEDIUM before you run the query with the SAMPLES OF option. If you do not run UPDATE STATISTICS, the SAMPLES OF clause is ignored, and all data values are returned. For better results, run UPDATE STATISTICS MEDIUM before you run the query with the SAMPLES OF option.

The results of a sampled query contain a certain amount of deviation from a complete scan of all rows. You can reduce this expected error to an acceptable level by increasing the proportion of sampled rows to actual rows. When you use sampled queries in joins, the expected error increases dramatically; you must use larger samples in each table to retain an acceptable level of accuracy.

For example, you might want to generate a list of how many of each part is sold from the parts_sold table, which is known to contain approximately 100,000,000 rows.

The following query provides a sampling ratio of one percent and returns an approximate result: