BigQuery Table Clustering

Justin Taras
4 min readJun 16, 2020

TL;DR You should be using BigQuery Table Clustering

BigQuery has recently introduced the ability to independently cluster tables [GA ]. Up until now, the clustering feature was only usable when configured with table partitioning. Top level table clustering will provide additional performance enhancements for non-partitioned tables with autonomous table maintenance.

What is Table Clustering?

Clustering is one of several ways to physically design the layout of data in a BigQuery table. When data is written to a clustered table, the data is sorted based upon pre-selected cluster keys. This co-locates similar data together, potentially reducing the search area for a query. The data is stored in capacitor files with the sort ranges stored as file metadata. BigQuery can quickly check this metadata at query-time and determine if the files need to be scanned. This is an IO elimination technique that directs BigQuery to the relevant data, skipping the unnecessary data files.

Table Clustering in Action

In the following example, I have a 1.8TB orders table from the TPCH dataset. The table is 5 columns wide and contains 29 billion records.

21 billion row orders table

A standard query for a table like this would look up a specific order and its associated revenue.

--

--

Justin Taras
Justin Taras

Written by Justin Taras

I’m a Google Customer Engineer interested in all things data. I love helping customers leverage their data to build new and powerful data driven applications!

No responses yet