in Engineering


Painlessly Analyze Your S3 Access Logs With SingleStore

Joseph Victor

Engineer

Painlessly Analyze Your S3 Access Logs With SingleStore

So you have some complex app that uses AWS S3, and the visibility you built into it isn’t quite enough.

Maybe your bill is higher than expected, or maybe a component is misbehaving and you want to see why.  Whatever the reason, streaming your access logs from S3 is trivial with SingleStore pipelines. Here’s how you do it.

creating-the-data-pipelineCreating the data pipeline

First you create a table —  the columns can be found here, but it’s converted to SQL below.  I went ahead and added a sort key on request_time, so queries which filter by time are optimized, although you may prefer a different sort key.

create database aws_access_logs;
use aws_access_logs;
CREATE TABLE `aws_access_logs` (
`bucket_owner` blob not null,
`bucket` blob not null,
`request_time` datetime not null,
`remote_ip` blob not null,
`requester` blob not null,
`request_id` blob not null,
`operation` blob not null,
`key` blob not null,
`request_uri` blob not null,
`http_status` bigint not null,
`error_code` bigint not null,
`bytes_sent` bigint not null,
`object_size` bigint not null,
`total_time` bigint not null,
`turn_around_time` bigint not null,
`referer` blob not null,
`user_agent` blob not null,
`version_id` blob not null,
`host_id` blob not null,
`signature_version` blob not null,
`cipher_suite` blob not null,
`auth_type` blob not null,
`host_head` blob not null,
`tls_version` blob not null,
`acess_arn` blob not null,
`acl_required` blob,
SORT KEY (request_time)
);

Next we need to load the data. SingleStore pipelines make this easy — and I figured out the annoying date mangling for you — so you can copy and paste the pipeline below.  Note that we’re forced to use the IGNORE clause because acl_required is sometimes omitted by Amazon.

CREATE PIPELINE `aws_access_logs_pipeline`
AS LOAD DATA S3 '<your-audit-logging-bucket>'
CONFIG '{\"region\":\"us-east-1\"}'
CREDENTIALS '{
"aws_access_key_id":<youraccessid>,
"aws_secret_access_key":<yoursecretkey>
}'
IGNORE
INTO TABLE `aws_access_logs`
FIELDS TERMINATED BY ' ' ENCLOSED BY '"'
(
`aws_logs`.`bucket_owner`,
`aws_logs`.`bucket`,
@tp1,
@tp2,
`aws_logs`.`remote_ip`,
`aws_logs`.`requester`,
`aws_logs`.`request_id`,
`aws_logs`.`operation`,
`aws_logs`.`key`,
`aws_logs`.`request_uri`,
`aws_logs`.`http_status`,
`aws_logs`.`error_code`,
@bs,
`aws_logs`.`object_size`,
`aws_logs`.`total_time`,
`aws_logs`.`turn_around_time`,
`aws_logs`.`referer`,
`aws_logs`.`user_agent`,
`aws_logs`.`version_id`,
`aws_logs`.`host_id`,
`aws_logs`.`signature_version`,
`aws_logs`.`cipher_suite`,
`aws_logs`.`auth_type`,
`aws_logs`.`host_head`,
`aws_logs`.`tls_version`,
`aws_logs`.`acess_arn`,
`aws_logs`.`acl_required`
)
SET
request_time = str_to_date(concat(@tp1, ' ', @tp2), '[%d/%b/%Y:%H:%i:%s +0000]'),
bytes_sent = if(@bs = '-', 0, @bs);

Pipelines are lovely, as they continuously load data as new data appears. Let’s start by sanity checking the list of files. If at any point your results deviate from mine, you can check for debugging strategies here.

select * from information_schema.pipelines_files;

You should see a list of the first few thousand files to be loaded, in an unloaded state. Now let’s go ahead and run a batch in the foreground.

start pipeline aws_access_logs_pipeline foreground limit 1 batches;

This will run one batch of the pipeline in the foreground, or return an error. When it succeeds, take a look at your data.

select * from aws_access_logs limit 10;

Everything look about right? Good. Finally, we can go ahead and start the pipeline.

start pipeline aws_access_logs_pipeline;

Congrats, data is now flowing!  You can check the status of your pipeline’s recent batches at any time by running

select * from information_schema.pipelines_batches_summary;

extra-tip-use-persisted-columnsExtra tip: Use persisted columns

Perhaps you care primarily about keys matching a specified regex. Not only is it annoying to re-type the regex every time you need it, but it also burns valuable cycles recomputing it every query. Furthermore, if SingleStore knows you’re going to be checking this regex often, it can optimize your query execution and avoid even opening data segments which don’t match your regex. The solution: Persisted columns!

Suppose we want to check for keys containing substring foo, as one does when writing a blog post.  To create one, simply type:

alter table aws_access_logs add column is_foo as `key` like '%foo%' persisted tinyint;

Now is_foo is available to be used in SQL as any other column, and optimizations like segment skipping and optimizer statistics automatically respect it. Nice!

conclusionConclusion

SingleStore pipelines offer a way to continuously load any data sitting in an S3 bucket, access logs among them.  Tired of downloading individual access log files and analyzing them with bash, I set up a pipeline and was immediately able to ask interesting questions of my own data. I hope someone finds this useful, or at least interesting. Thanks for reading!


Share