Cassandra keys

Partition key, clustering key, primary key

Cassandra uses two kinds of keys:

the partition key is responsible for data distribution across nodes
the clustering key is responsible for data sorting within a partition
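
The two roles can be pictured with a small in-memory model. This is only a sketch: the node names, the ToyTable class and the hash-modulo placement are assumptions made for illustration, whereas a real cluster uses a partitioner and a token ring. It shows rows with the same partition key landing on the same node and staying sorted by clustering key.

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key):
    """Pick a node from a hash of the partition key (toy stand-in for a partitioner)."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


class ToyTable:
    """Hypothetical table: node -> partition key -> rows sorted by clustering key."""

    def __init__(self):
        self.storage = defaultdict(lambda: defaultdict(list))

    def insert(self, partition_key, clustering_key, value):
        rows = self.storage[node_for(partition_key)][partition_key]
        rows.append((clustering_key, value))
        rows.sort(key=lambda row: row[0])  # keep the partition ordered by clustering key

    def read_partition(self, partition_key):
        return self.storage[node_for(partition_key)][partition_key]


table = ToyTable()
table.insert(("sensor-1",), (3,), "c")
table.insert(("sensor-1",), (1,), "a")
table.insert(("sensor-1",), (2,), "b")
rows = table.read_partition(("sensor-1",))  # ordered by clustering key, not by insertion
```

Reading a whole partition touches a single node in this model, which is why queries are expected to name the partition key.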

A primary key is a combination of these two kinds of keys. The vocabulary depends on the combination:

simple primary key: only the partition key, composed of one column
composite partition key: only the partition key, composed of multiple columns
compound primary key: one partition key with one or more clustering keys.
composite and compound primary key: a partition key composed of multiple columns and multiple clustering keys.
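
As a quick way to check the vocabulary, the sketch below maps the two column lists to the matching term. The function name describe_primary_key is hypothetical, and it assumes that a single clustering key is enough to call a key compound.

```python
def describe_primary_key(partition_columns, clustering_columns=()):
    """Return the vocabulary term for a primary key, given its column lists."""
    if not partition_columns:
        raise ValueError("a primary key needs at least one partition key column")
    composite = len(partition_columns) > 1   # several partition key columns
    compound = len(clustering_columns) > 0   # at least one clustering key
    if composite and compound:
        return "composite and compound primary key"
    if composite:
        return "composite partition key"
    if compound:
        return "compound primary key"
    return "simple primary key"
```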

The PRIMARY KEY syntax

Declaring a key

The table creation statement must contain a PRIMARY KEY expression, and the way you declare it is very important. In a nutshell:

PRIMARY KEY(partition key)
PRIMARY KEY(partition key, clustering key)

Additional parentheses around the first columns group them into a composite partition key, which can in turn be followed by clustering keys to declare a composite and compound primary key.

Examples

Simple primary key:

PRIMARY KEY (key)

key is called the partition key.

(For a simple primary key, the PRIMARY KEY expression can also be placed directly after the field, for example key int PRIMARY KEY.)

Compound primary key:

PRIMARY KEY (key_part_1, key_part_2)

Contrary to SQL, this does not exactly create a composite primary key. Instead, it declares key_part_1 as the partition key and key_part_2 as the clustering key. Any further field listed after key_part_2 is also considered part of the clustering key.

Composite+Compound primary keys:

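PRIMARY KEY ((part_key_1, ..., part_key_n), clust_key_1, ..., clust_key_m)

The inner parentheses define the composite partition key; the remaining columns are the clustering keys. Only the partition key is grouped in parentheses: the clustering keys are listed directly after it, without parentheses of their own.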

Syntax summary

(part_key)
(part_key, clust_key)
(part_key, clust_key_1, clust_key_2)
((part_key_1, part_key_2), clust_key)
((part_key_1, part_key_2), clust_key_1, clust_key_2)
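
To make the role of the parentheses concrete, here is a toy reader for such declarations. It is a sketch, not the CQL grammar: the name split_primary_key is hypothetical, and it assumes that only the partition key may be grouped in parentheses.

```python
def split_primary_key(declaration):
    """Split '(...)' from a PRIMARY KEY clause into (partition columns, clustering columns)."""
    text = declaration.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("expected the declaration to be wrapped in parentheses")
    inner = text[1:-1].strip()
    if inner.startswith("("):
        # A leading parenthesised group is a composite partition key.
        close = inner.index(")")
        partition = [name.strip() for name in inner[1:close].split(",")]
        rest = inner[close + 1:]
    else:
        # Otherwise the first column alone is the partition key.
        first, _, rest = inner.partition(",")
        partition = [first.strip()]
    clustering = [name.strip() for name in rest.split(",") if name.strip()]
    if any("(" in name or ")" in name for name in clustering):
        # Assumption of this sketch: only the partition key may be grouped.
        raise ValueError("clustering columns are listed without parentheses")
    return partition, clustering
```

For instance, split_primary_key("((part_key_1, part_key_2), clust_key)") yields the two partition key columns in the first list and clust_key in the second.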

Key ordering and allowed queries

The partition key is the minimum specifier needed to perform a query using a where clause (without resorting to a secondary index or ALLOW FILTERING).

If you declare a composite clustering key, the order matters.

Say you have the following primary key:

PRIMARY KEY ((part_key_1, part_key_2), clust_key_1, clust_key_2, clust_key_3)

Then, the only valid queries use the following fields in the where clause:

part_key_1, part_key_2
part_key_1, part_key_2, clust_key_1
part_key_1, part_key_2, clust_key_1, clust_key_2
part_key_1, part_key_2, clust_key_1, clust_key_2, clust_key_3

Examples of invalid queries are:

part_key_1, part_key_2, clust_key_2
Anything that does not contain both part_key_1 and part_key_2
…

If you want to use clust_key_2, you have to also specify clust_key_1, and so on.

So the order in which you declare your clustering keys has an impact on the type of queries you can do. In contrast, the order of the partition key fields is not important for queries, since you always have to specify all of them.
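
The rule can be written down as a small check. This is a sketch under assumptions: the function name is_valid_where is hypothetical, and it only models which primary key fields appear in the where clause, ignoring secondary indexes, ALLOW FILTERING and the kind of comparison used.

```python
def is_valid_where(partition_columns, clustering_columns, where_columns):
    """Require all partition key fields, then a gap-free prefix of the clustering keys."""
    used = set(where_columns)
    known = set(partition_columns) | set(clustering_columns)
    if not used <= known:
        return False  # a field that is not part of the primary key
    if not set(partition_columns) <= used:
        return False  # every partition key field is required
    used_clustering = [name for name in clustering_columns if name in used]
    # The clustering keys used must be the first ones declared, with no gap.
    return used_clustering == list(clustering_columns[:len(used_clustering)])


partition = ["part_key_1", "part_key_2"]
clustering = ["clust_key_1", "clust_key_2", "clust_key_3"]

ok = is_valid_where(partition, clustering, partition + ["clust_key_1", "clust_key_2"])
skipped = is_valid_where(partition, clustering, partition + ["clust_key_2"])
```

Here ok is True because the clustering keys are used in declaration order, while skipped is False because clust_key_2 appears without clust_key_1.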