Hey
For context:
- We track users, they each have an id: user_id.
- For each user we track when they use our app; each instance of app usage is referred to as a session. A user can have many sessions, and each session has an id: session_id.
Currently, our primary use cases focus on analyzing sessions, so we have sharded the majority of our tables on session_id. Performance is great and all is good.
Now we’ve been tasked with also analyzing users. This means our sharding on session_id is no longer optimal, as we can’t be sure that all sessions related to a user are on the same partition. We could change the sharding so that we shard on user_id, but then I doubt the query planner would know that all session_ids within a partition are actually unique to that partition, which would make all of our current queries less optimal. Using (user_id, session_id) as the shard key probably wouldn’t be of much help either, as a given user_id would still be spread across different partitions, right?
I’ll try to give a very simplified example of the issue…
Using session_id as shard key, I assume data would be distributed sort of like this:
+-------------------------------------------+
| user_id | session_id | log_id | partition |
+-------------------------------------------+
| 1       | 1          | 11     | 1         |
| 1       | 1          | 12     | 1         |
| 1       | 1          | 13     | 1         |
| 1       | 2          | 32     | 2         |
| 2       | 3          | 65     | 3         |
| 3       | 4          | 68     | 4         |
+-------------------------------------------+
With this I can easily do count(distinct session_id) locally, but count(distinct user_id) requires data to be redistributed between nodes.
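To be concrete, these are the two aggregates I keep referring to (assuming a log table shaped like the DDL further down):

-- local per partition when log is sharded on session_id
select count(distinct session_id) from log;

-- needs a repartition/broadcast step when log is sharded on session_id,
-- since the same user_id can live on several partitions
select count(distinct user_id) from log;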
Using user_id as shard key, I assume data would be distributed sort of like this:
+-------------------------------------------+
| user_id | session_id | log_id | partition |
+-------------------------------------------+
| 1       | 1          | 11     | 1         |
| 1       | 1          | 12     | 1         |
| 1       | 1          | 13     | 1         |
| 1       | 2          | 32     | 1         |
| 2       | 3          | 65     | 2         |
| 3       | 4          | 68     | 3         |
+-------------------------------------------+
With this I can easily do count(distinct user_id) locally, but count(distinct session_id) requires data to be redistributed between nodes.
Using (user_id, session_id) as shard key, I assume data would be distributed sort of like this:
+-------------------------------------------+
| user_id | session_id | log_id | partition |
+-------------------------------------------+
| 1       | 1          | 11     | 2         |
| 1       | 1          | 12     | 2         |
| 1       | 1          | 13     | 2         |
| 1       | 2          | 32     | 3         |
| 2       | 3          | 65     | 1         |
| 3       | 4          | 68     | 3         |
+-------------------------------------------+
With this, both count(distinct session_id) and count(distinct user_id) require data to be redistributed between nodes, because the query planner can’t know for sure that a session_id is always tied to the same user_id.
So I guess my question is: how do I define my shard key such that both count(distinct session_id) and count(distinct user_id) can be computed locally on each partition? Is it possible to state some relation between two columns, like:
create table log (
  id bigint not null,
  user_id bigint not null,
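  -- wishful syntax below: trying to declare that each session_id belongs to exactly one user_id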
  session_id bigint not null unique by (user_id),
  timestamp datetime not null series timestamp,
  shard key (user_id),
  sort key (timestamp),
  primary key (id, session_id, user_id)
);
Also, just to note: creating a projection table with user_id as the shard key instead of session_id is currently not an option, as that would double the size of the data and therefore also the storage cost (right?).
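For clarity, the projection I’m ruling out would be a second copy of the table, just sharded on user_id (table name here is hypothetical):

-- hypothetical second copy of log, sharded on user_id instead of session_id;
-- this is what would double the stored data
create table log_by_user (
  id bigint not null,
  user_id bigint not null,
  session_id bigint not null,
  timestamp datetime not null series timestamp,
  shard key (user_id),
  sort key (timestamp),
  primary key (id, session_id, user_id)
);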
Sadly, projections are the only solution to this problem.