Can I have a continuous S3/GCS stream for logging?

I am thinking of pushing JSON files containing application logs to S3 or Google Cloud Storage. These files will be pushed to the buckets every x seconds.
I read that I can create a pipeline to ingest data from S3/GCS as a stream, or at least to monitor the buckets for changes.

Is this possible (I am not going to use Kafka)?
Is it a good idea to keep application logging data inside SingleStore? It will contain millions of records, but then I would have all the (aggregated) data inside my database for easy dashboarding and log analysis.

Any suggestions welcome!

This should work fine. Pipelines are designed for exactly this case: they detect new files as they arrive in a folder or bucket and load them.
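
Something like this minimal sketch should get you started. The table name, bucket, region, credentials, and JSON field names here are placeholders, so adjust them to your setup:

```sql
-- Placeholder target table for the application logs
CREATE TABLE app_logs (
  ts DATETIME(6),
  level VARCHAR(16),
  message TEXT
);

-- Pipeline that watches the bucket prefix and loads each new JSON file
CREATE PIPELINE app_logs_pipeline AS
LOAD DATA S3 'my-log-bucket/app-logs/'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
INTO TABLE app_logs
FORMAT JSON
(ts <- ts, level <- level, message <- message);
```

A GCS pipeline looks essentially the same, with `LOAD DATA GCS '...'` and HMAC-style credentials instead of the AWS keys.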

Many people use pipelines as a bulk loader without relying on this streaming load, but streaming ingest was the motivating use case and it works well.
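
To make the two patterns concrete, using the placeholder pipeline name from the sketch above (FOREGROUND runs the pipeline synchronously; check the START PIPELINE docs for your version):

```sql
-- Continuous ingestion: runs in the background and keeps watching the bucket
START PIPELINE app_logs_pipeline;

-- "Bulk loader" style: process the currently visible files in this session, then return
START PIPELINE app_logs_pipeline FOREGROUND;
```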

Ok, thanks. I read something about pipelines loading directly to the leaves. Is the data still distributed in that case, for a single file? Meaning, if it processes one file (say 100 records), will it place all 100 records on the same leaf and the next file on another leaf, or are the 100 records distributed across the leaves?

Each file will go to one leaf. That leaf is aware of the partitioning and sharding of the target table, and it will forward rows to other leaf nodes as necessary, so the records from one file still end up distributed according to the table's shard key.
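
If you want to see that distribution yourself, something like this rough check should work on the placeholder table from the earlier sketch (partition_id() is my assumption for the skew-check built-in, so verify it against your SingleStore version):

```sql
-- If the target table declares a shard key, e.g.
--   CREATE TABLE app_logs (..., SHARD KEY (ts));
-- rows are hashed on that key, so the 100 records from one file spread across partitions.

-- Count how many rows landed on each partition after a file has been loaded
SELECT partition_id() AS partition, COUNT(*) AS row_count
FROM app_logs
GROUP BY partition_id();
```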


Ok thanks! Good to know!
