How rowstore and columnstore work

Hi guys, I would like to understand more about the rowstore and columnstore techniques in MemSQL.
Let’s say I have a database that is already populated and I’m going to insert a new tuple.

With a rowstore table (see the sketch after this list):

  • Search for the fragmentation key (shard-key or PK).
  • If the table has the key, distribute the tuple using the hash technique.
  • If the table does not have the key, distribute the tuple using the round robin technique.
  • On the node, the tuple is stored in RAM.
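To make that concrete, here is a minimal Python sketch of the distribution step as I understand it. It is only an illustration, not MemSQL’s actual implementation: the partition count, the SHA-1 hash, and the round-robin counter are assumptions made for the example.

    import hashlib
    from itertools import count

    NUM_PARTITIONS = 8      # assumed partition count for the example
    _round_robin = count()  # simple counter for the keyless case

    def pick_partition(shard_key_value=None):
        """Return the partition that should receive the new tuple."""
        if shard_key_value is not None:
            # Hash distribution: equal key values always map to the same partition.
            digest = hashlib.sha1(repr(shard_key_value).encode()).hexdigest()
            return int(digest, 16) % NUM_PARTITIONS
        # Round-robin distribution: spread tuples evenly, ignoring their contents.
        return next(_round_robin) % NUM_PARTITIONS

    # A rowstore partition would then keep the inserted row in memory.
    print(pick_partition(("customer", 42)))  # deterministic for this key value
    print(pick_partition())                  # keyless: next partition in rotation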

With a columnstore table (see the sketch after this list):

  • Search for the fragmentation key (shard-key or PK).
  • If the table has the key, distribute the tuple using the hash technique.
  • If the table does not have the key, distribute the tuple using the round robin technique.
  • On the node, identify the row segment that can receive the tuple, respecting the value range of the key column, or create a new row segment.
  • Place the tuple in the column segments of that row segment.
  • Update the in-memory metadata of all the column segments involved, update the log, and re-write the compressed data of the column segments to disk (HDD/SSD).
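Here is a toy model of the columnstore steps above, again only to illustrate my understanding rather than MemSQL internals. The segment capacity and metadata layout are invented for the example, and real compression and disk writes are reduced to comments.

    from dataclasses import dataclass, field

    SEGMENT_CAPACITY = 4  # tiny on purpose; real row segments hold many rows

    @dataclass
    class RowSegment:
        columns: dict = field(default_factory=dict)  # one list per column ("column segments")
        row_count: int = 0

    @dataclass
    class ColumnstorePartition:
        segments: list = field(default_factory=list)
        metadata: list = field(default_factory=list)  # in-memory segment metadata
        log: list = field(default_factory=list)       # append-only log records

        def insert(self, row):
            # 1. Find a row segment with room, or create a new one.
            seg = next((s for s in self.segments if s.row_count < SEGMENT_CAPACITY), None)
            if seg is None:
                seg = RowSegment()
                self.segments.append(seg)
            # 2. Append each value to that row segment's column segments.
            for col, value in row.items():
                seg.columns.setdefault(col, []).append(value)
            seg.row_count += 1
            # 3. Update in-memory metadata and the log; a real engine would
            #    also re-encode/compress the segment and persist it to disk.
            self.metadata = [
                {"rows": s.row_count,
                 "min_max": {c: (min(v), max(v)) for c, v in s.columns.items()}}
                for s in self.segments
            ]
            self.log.append(("insert", dict(row)))

    p = ColumnstorePartition()
    p.insert({"id": 1, "amount": 10})
    p.insert({"id": 2, "amount": 5})
    print(len(p.segments), p.metadata)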

The steps above are what I was able to understand about how it works.
Could you correct or confirm this?

For columnstore, I also considered the possibility that the master node already consults the aggregator’s metadata to know at which exact point in the cluster the tuple should be inserted. That would be great for keeping the data sorted, but if it worked that way, what would be the point of setting a shard-key on the table?

I have read a lot of the documentation, but it was not clear to me (I am writing about MemSQL).

I really don’t think I understand your question. For example, what do you mean by "Search for the fragmentation key (shard-key or PK)" in your list of bullet points?

Please see these documentation topics:

And what are you writing?

In fact, I was not clear on that first point.
When inserting a tuple into a table, the master node checks whether the table has a shard-key. I mention this because there may be tables with or without a shard-key. With a key, tuples with the same key value go to the same node. Without a key, tuples are distributed in a circular fashion, balancing the number of tuples on each node.
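For example, in a sketch like the one below (a toy hash and an assumed partition count, not MemSQL code), tuples with equal shard-key values always map to the same partition, while keyless tuples are simply rotated across partitions:

    import hashlib

    NUM_PARTITIONS = 4  # assumed for the example

    def hash_partition(key):
        return int(hashlib.sha1(repr(key).encode()).hexdigest(), 16) % NUM_PARTITIONS

    # Same key value -> same partition every time, so equal keys are colocated.
    assert hash_partition(7) == hash_partition(7)

    # Keyless ("circular") placement ignores the row contents entirely.
    keyless = [i % NUM_PARTITIONS for i in range(8)]  # 0,1,2,3,0,1,2,3
    print(hash_partition(7), keyless)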

I am doing a performance analysis of a type of application, and it would be important to understand these steps, at least in general terms.

For keyless sharded tables, where you either declare the shard key as SHARD() or else you don’t declare a shard key at all and there is no primary key, row placement is determined as follows:

a) for INSERT, placement is random
b) for LOAD DATA, placement is batched round robin
c) for INSERT ... SELECT, rows are inserted into whichever partition generated the data as part of the SELECT

All of these will tend to spread the data evenly, independent of the actual contents of the row.
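As a rough illustration of those three rules (only a sketch; the partition count and the LOAD DATA batch size are made up for the example, and this is not MemSQL code):

    import random

    NUM_PARTITIONS = 4
    LOAD_BATCH_SIZE = 3  # assumed batch size for the LOAD DATA example

    def place_insert(_row):
        # a) single-row INSERT: random partition, independent of row contents
        return random.randrange(NUM_PARTITIONS)

    def place_load_data(row_index):
        # b) LOAD DATA: rows are grouped into batches, batches rotate round robin
        return (row_index // LOAD_BATCH_SIZE) % NUM_PARTITIONS

    def place_insert_select(source_partition, _row):
        # c) INSERT ... SELECT: the row stays on the partition that produced it
        return source_partition

    print([place_load_data(i) for i in range(9)])  # 0,0,0,1,1,1,2,2,2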

Hi Hanson, is there any documentation on the website that confirms this round-robin behavior for LOAD DATA? I would need the reference for scientific writing.

Not that I know of. What I wrote back on April 30th came from our engineering team.