TPC-H (1999)
A popular benchmark which models the internal data warehouse of a wholesale supplier. The data is stored into a 3rd normal form representation, requiring lots of joins at query runtime. Despite its age and its unrealistic assumption that the data is uniformly and independently distributed, TPC-H remains the most popular OLAP benchmark to date.
References
- TPC-H
- New TPC Benchmarks for Decision Support and Web Commerce (Poess et. al., 2000)
- TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark (Boncz et. al.), 2013
- Quantifying TPC-H Choke Points and Their Optimizations (Dresseler et. al.), 2020
Data Generation and Import
First, checkout the TPC-H repository and compile the data generator:
Then, generate the data. Parameter -s specifies the scale factor. For example, with -s 100, 600 million rows are generated for table 'lineitem'.
To speed things up, you can use "chunked" generation (in multiple processes):
Detailed table sizes with scale factor 100:
| Table | size (in rows) | size (compressed in ClickHouse) |
|---|---|---|
| nation | 25 | 2 kB |
| region | 5 | 1 kB |
| part | 20.000.000 | 895 MB |
| supplier | 1.000.000 | 75 MB |
| partsupp | 80.000.000 | 4.37 GB |
| customer | 15.000.000 | 1.19 GB |
| orders | 150.000.000 | 6.15 GB |
| lineitem | 600.000.000 | 26.69 GB |
(Compressed sizes in ClickHouse are taken from system.tables.total_bytes and based on below table definitions.)
Now create tables in ClickHouse. The table definitions are available in init.sql in the ClickHouse repository.
The data can be imported as follows:
Instead of using tpch-kit and generating the tables by yourself, you can alternatively import the data from a public S3 bucket. Make sure
to create empty tables first using init.sql.
Queries
The 22 TPC-H queries can be found here in the ClickHouse repository.
To get SQL standard compatible behavior and expected results, apply the settings from settings.json.
See the README for known issues and notes on specific queries.
Correctness
The result of the queries agrees with the official results unless mentioned otherwise. To verify, generate a TPC-H database with scale
factor = 1 (dbgen, see above) and compare with the expected results in tpch-kit.