Evict data from local disk/cache

Hi

I’m reading your documentation and it’s clear that I can choose to prewarm data (Local and Unlimited Database Storage Concepts · SingleStore Documentation) or pre-execute my query to make sure that the data is fetched from unlimited storage and placed on the local disk.
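For context, this is roughly how we prewarm today; the table and column names below are just placeholders, and I’m going from memory on the exact OPTIMIZE TABLE syntax, so the documentation page above is the authority:

```sql
-- Pre-execute a representative query so the blobs it touches are fetched
-- from unlimited storage into the local disk cache (placeholder schema).
SELECT COUNT(*)
FROM events
WHERE customer_id = 42 AND event_date >= '2023-01-01';

-- Or warm the blob cache explicitly for specific columns; syntax from memory,
-- please check the linked documentation before relying on it.
OPTIMIZE TABLE events WARM BLOB CACHE FOR COLUMN customer_id, event_date;
```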

However, I’m looking for the opposite. How do I evict data from the local disk/cache if I know I won’t be needing the data that’s been pulled in the near future?

I acknowledge it might be a weird request, but we’re seeing performance hits due to fetching data on some occasions when we’ve had to ingest a large amount of historical data.


Hi @ahkjeldsen. That is an interesting use case. To get the best out of it, it seems like you would want to declare, before you load the data, that it should not be cached, so you don’t evict anything more valuable.

Can you elaborate a bit on your use case by saying more precisely how you were ingesting data, how much (total and as a fraction of your total available disk cache), and exactly what data you don’t want to cache, or want to evict?

Hi @hanson

You’re absolutely right. We would prefer to be able to configure a pipeline to consume directly into blob storage. I guess it’s not possible at the moment, but perhaps you can share whether this is something that has already been discussed?

So the use case is that we’re providing web analytics to our customers. We therefore have a continuous stream of data being inserted; this is done using regular INSERT statements. If a customer changes their ingestion configuration, we want to reprocess all of their historical data. The customer’s data is then mutated outside of SingleStore in accordance with the new ingestion configuration, and we use pipelines to insert it into our cluster. Once the new version of the data is ingested, we remove the old version.
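A stripped-down sketch of the reprocessing pipeline we create per customer (bucket path, region, credentials, and table name are placeholders for our real configuration):

```sql
-- Sketch only: names and credentials are placeholders.
CREATE PIPELINE reprocess_customer_42 AS
LOAD DATA S3 'our-bucket/reprocessed/customer_42/'
CONFIG '{"region": "eu-west-1"}'
CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
INTO TABLE events
FIELDS TERMINATED BY ',';

START PIPELINE reprocess_customer_42;
```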

I don’t have those exact numbers available, sorry. We would essentially just want to evict everything and then have the cluster behave as if all the data were already in the blob store, i.e. fetched only when needed. We do not need the “prewarming” of the pipeline-ingested data.

@ahkjeldsen okay, I understand. I made an internal feature request for this. We will consider it for the future. You’re the first to ask for this.

It would help if you could explain the performance issue you are having in more detail. It sounds like you think that the pipeline ingest (new blobs) is flushing your cache and causing a poor hit rate for other data. Do you have any more details to verify this?

Thanks for making the internal feature request @hanson :smile:

You are right, that is my assumption. It is, however, based on discussions with some of your colleagues during our onboarding process, and later on some debugging sessions where we looked at this issue as well. It was during these meetings that we got introduced to how “Unlimited Storage” works. At that time, our cluster was configured to use up to 75% of the disk as local cache. We were told that the decrease in performance was due to the fact that many blobs needed to be pushed to S3 while we also needed to fetch a lot of blobs. We were therefore getting longer query times due to the network I/O, and sometimes also because S3 was telling the cluster to slow down. Your colleagues suggested changing a configuration so that we could use 85% of the local disk. That change had an immediate effect.
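If I remember the debugging session correctly, the change was essentially just raising the blob cache limit. The variable name below is from memory and the value is an example sized to roughly 85% of a hypothetical 1 TB local disk, so treat it as a sketch rather than our exact settings:

```sql
-- Variable name from memory (may differ by version); value is ~85% of a
-- hypothetical 1 TB local disk, expressed in MB.
SET GLOBAL maximum_blob_cache_size_mb = 850000;

-- Check the current setting.
SHOW GLOBAL VARIABLES LIKE 'maximum_blob_cache_size_mb';
```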

We were still seeing issues if we tried to ingest multiple customers’ full historical data at the same time, so we made a change in our application such that only one pipeline can run at a time. This kept us “above water” for a long time, but it has a negative impact on our customers, as they sometimes need to wait days before their changed ingest configuration is applied. We sometimes allow two pipelines to run at the same time, but often see a negative impact on our cluster soon after.
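Concretely, the application just checks information_schema before kicking off the next reprocess, roughly like this (the pipeline naming convention is ours, and the column names are from memory):

```sql
-- Only start the next historical reprocess when no other reprocessing
-- pipeline is still running.
SELECT PIPELINE_NAME, STATE
FROM information_schema.PIPELINES
WHERE PIPELINE_NAME LIKE 'reprocess_%' AND STATE = 'Running';

-- If the result set is empty, the application issues:
START PIPELINE reprocess_customer_43;
```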

In cases where our continuous ingest of data can’t keep up, we’ve found that making sure no pipelines are running lets the continuous ingest catch up within a short period of time. We’ve tried to test whether it’s just a coincidence by letting the pipelines keep running, but in those cases we just fell further and further behind in our continuous ingest process.

We’re also noticing that when we run a lot of these full historical ingests, the automated daily backup process often takes more than 8 hours. During these 8 hours we also see a significant decrease in performance. We’ve had to contact your colleagues in Support, asking them to stop the backup, as it made it nearly impossible to serve our customers in a timely manner. So I’m also assuming that having pipelines ingest directly into “Unlimited Storage” would mean the daily backup doesn’t need to consider these blobs, as they would already have been pushed to S3. This is pure guesswork and not confirmed by anyone :sweat_smile:

@ahkjeldsen Thanks for these details. This will give us helpful context when we plan bottomless performance improvements. We’re working on several things to make it better, including, in the near term, GP3 on AWS and optimizations to improve I/O concurrency. This is an ongoing area of investment for us.

We also now give you the ability on our Helios cloud service to have a 2x or 4x cache at an additional charge that is less than the cost of doubling or quadrupling the size of your workspace. That can let you fit your working set into the local cache, which is recommended.

What cloud are you running on?

@hanson we’re not on Helios unfortunately. We’re not even in Workspaces, so it’s probably the “old old” cloud we’re on. I’ll forward the suggestion about doubling the cache size when/if we get onto Helios.

Got it. Some of the engine improvements we’re planning will help you if you are using bottomless self-hosted. The Helios 2x/4x cache won’t help, but you could provision cloud servers with bigger local disks to limit cache misses, of course.