Physical Database Design: The Database Professional's Guide to Exploiting Indexes, Views, Storage, and More

Get your facts first, and then you can distort them as much as you please. Facts are stubborn, but statistics are more pliable.
Mark Twain (1835–1910)
There are three kinds of lies: lies, damned lies, and statistics.
Benjamin Disraeli (1804–1881)
Counting and sampling are critical to effective database design. There is perhaps nothing more natural than wanting to explore something, to see what it is really like, before plunging in and committing. Most of us use counting and sampling strategies every day. We nibble on food we are cooking to see how it tastes, or we read a few lines of a book before we buy it. Sampling is one of the oldest and most common strategies for getting a good sense of a large system quickly. It has marvelous applications to physical database design, particularly when combined with the SQL capabilities for counting and sampling that database vendors are adding.
Some of the most important design problems that sampling can help with include materialized view size estimation, index size and key duplication (or cardinality) projections, multidimensional clustering storage, and shared-nothing partitioning skew. In all of these cases, understanding the data is the goal, and sampling is a tool that can speed the analysis. In some ways, data sampling is so important that it can be hard to develop a truly top-notch database design without it.
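As a rough illustration of one of these problems, shared-nothing partitioning skew, the following Python sketch compares the skew measured on a full key column with the skew estimated from a small random sample. Everything in it is an assumption made for illustration: the function names (`make_keys`, `partition_skew`, `sampled_skew`), the synthetic data in which 30 percent of rows share one key, and the use of a simple modulo as the partitioning function. It is not a method described in this text, only a minimal sketch of the idea that a sample can reveal the same imbalance as a full scan.

```python
import random
from collections import Counter


def make_keys(num_rows):
    """Build a hypothetical, deliberately skewed partitioning-key column.

    Three rows in every ten share the key 7; the remaining rows are
    spread over many other key values.
    """
    return [7 if i % 10 < 3 else i % 1000 for i in range(num_rows)]


def partition_skew(keys, num_partitions):
    """Return the largest partition's row count divided by the mean count.

    A value of 1.0 means perfectly even partitions; larger values mean
    one partition holds more than its share. Modulo partitioning is an
    assumption standing in for a real hash function.
    """
    counts = Counter(key % num_partitions for key in keys)
    mean_rows = len(keys) / num_partitions
    return max(counts.values()) / mean_rows


def sampled_skew(keys, num_partitions, sample_size, seed=None):
    """Estimate partition skew from a simple random sample of the keys."""
    rng = random.Random(seed)
    sample = rng.sample(keys, sample_size)  # sampling without replacement
    return partition_skew(sample, num_partitions)


if __name__ == "__main__":
    keys = make_keys(100_000)
    print("skew from full scan:", partition_skew(keys, 4))
    print("skew from 5% sample:", sampled_skew(keys, 4, 5_000, seed=42))
```

The sample-based estimate reads only a fraction of the rows, which is the point of sampling here: the imbalance caused by one heavily duplicated key shows up in the sample in roughly the same proportion as in the full data. Quantities that are proportions, such as a partition's share of rows, scale naturally from a sample; distinct-value counts do not, and need more careful estimators.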
The need for sampling in database design is actually growing. Why? Because data volumes are growing and...