
IT Threat Detection, Part 1


Note

This tutorial is meant for Standard and Premium workspaces. You can't run it with a Free Starter Workspace due to storage restrictions. Create a workspace using +group in the left nav and select Standard for this notebook. Gallery notebooks tagged with "Starter" are suitable to run on a Free Starter Workspace.

This notebook demonstrates the application of SingleStoreDB's similarity search to create a system for identifying infrequent occurrences, a common requirement in fields such as cybersecurity and fraud detection where only a small percentage of events are potentially malicious.

In this instance, we aim to construct a network intrusion detection system. These systems continuously monitor incoming and outgoing network traffic, generating alerts when potential threats are detected. We'll utilize a combination of a deep learning model and similarity search to identify and classify network intrusion traffic.

Our initial step involves taking a dataset of labeled traffic events, distinguishing between benign and malicious events, and transforming them into vector embeddings. These vector embeddings serve as comprehensive mathematical representations of network traffic events. SingleStoreDB's built-in similarity-search algorithms allow us to measure the similarity between different network events. To generate these embeddings, we'll leverage a deep learning model based on recent academic research.

Subsequently, we'll apply this dataset to search for the most similar matches when presented with new, unseen network events. We'll retrieve these matches along with their corresponding labels. This process enables us to classify the unseen events as either benign or malicious by propagating the labels of the matched events. It's essential to note that intrusion detection is a complex classification task, primarily because malicious events occur infrequently. The similarity search service plays a crucial role in identifying relevant historical labeled events, thus enabling the identification of these rare events while maintaining a low rate of false alarms.

Install Dependencies

In [1]:

!pip install tensorflow keras==2.15.0 --quiet

In [2]:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import pandas as pd
import tensorflow.keras.backend as K
from tensorflow import keras
from tensorflow.keras.models import Model

We'll define a Python context manager called clear_memory() using the contextlib module. This context manager will be used to clear memory by running Python's garbage collector (gc.collect()) after a block of code is executed.

In [3]:

import contextlib
import gc

@contextlib.contextmanager
def clear_memory():
    try:
        yield
    finally:
        gc.collect()
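For example, wrapping a block in clear_memory() ensures a garbage-collection pass runs as soon as the block finishes. A minimal, illustrative usage sketch:

# Illustrative usage: temporary objects created inside the block become
# eligible for collection, and gc.collect() runs when the block exits.
with clear_memory():
    tmp = list(range(1_000_000))  # large temporary allocation
    total = sum(tmp)
    del tmp

print(total)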

We'll incorporate portions of code from this research work. To begin, we'll clone the repository required for data preparation.

In [4]:

!git clone -q https://github.com/Colorado-Mesa-University-Cybersecurity/DeepLearning-IDS.git

Data Preparation

The datasets we'll utilize comprise two types of network traffic:

  1. Benign (normal)

  2. Malicious (attack)

with the malicious traffic stemming from various network attacks. Our focus will be solely on web-based attacks. These web attacks fall into three common categories:

  1. Cross-site scripting (BruteForce-XSS)

  2. SQL injection (SQL-Injection)

  3. Brute force attempts on administrative and user passwords (BruteForce-Web)

The original data was collected over a span of two days.

Download Data

We'll proceed by downloading data for two specific dates:

  1. February 22, 2018

  2. February 23, 2018

These files will be retrieved and saved to the current directory. Our intention is to use one of these dates for training and generating vectors, while the other will be reserved for testing purposes.

In [5]:

1!wget "https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv" -q --show-progress2!wget "https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Friday-23-02-2018_TrafficForML_CICFlowMeter.csv" -q --show-progress

Review Data

In [6]:

with clear_memory():
    data = pd.read_csv('Friday-23-02-2018_TrafficForML_CICFlowMeter.csv')

data.Label.value_counts()

Clean Data

We'll run a cleanup script from the previously downloaded GitHub repo.

In [7]:

!python DeepLearning-IDS/data_cleanup.py "Friday-23-02-2018_TrafficForML_CICFlowMeter.csv" "result23022018"

We'll now review the cleaned data from the previous step.

In [8]:

with clear_memory():
    data_23_cleaned = pd.read_csv('result23022018.csv')

data_23_cleaned.head()

In [9]:

data_23_cleaned.Label.value_counts()

Load Model

In this section, we'll load a pre-trained model that has been trained on data collected from the same date.

There are slight modifications to the original model; specifically, we've altered the number of classes. Initially, the model was designed to classify into four categories:

  1. Benign

  2. BruteForce-Web

  3. BruteForce-XSS

  4. SQL-Injection

Our modified model has been adjusted to classify into just two categories (an illustrative Keras sketch follows the list below):

  1. Benign

  2. Attack
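The actual architecture is defined by the pre-trained model we download next; purely as an illustration of what such a two-category classifier looks like in Keras, here is a minimal sketch in which the feature count and layer widths are placeholder assumptions:

from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical sketch only; the real model (it_threat_model, loaded below)
# defines the actual layer sizes and weights.
n_features = 78  # placeholder for the number of flow features in the cleaned data

sketch_model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(64, activation='relu', name='dense'),  # embeddings are later taken from a layer named 'dense'
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid'),              # single output unit: Benign vs. Attack
])
sketch_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])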

Action Required

The ZIP file is hosted on Google Drive.

Using the Edit Firewall button in the top right, add the following to the SingleStoreDB Cloud notebook firewall, one-by-one:

  • drive.google.com

  • *.googleapis.com

  • *.googleusercontent.com

In [10]:

!wget -q -O it_threat_model.zip "https://drive.google.com/uc?export=download&id=1ahr5dYlhuxS56M6helUFI0yIxxIoFk9o"
!unzip -q it_threat_model.zip

In [11]:

with clear_memory():
    model = keras.models.load_model('it_threat_model')

model.summary()

In [12]:

with clear_memory():
    # Use the first layer
    layer_name = 'dense'
    intermediate_layer_model = Model(
        inputs = model.input,
        outputs = model.get_layer(layer_name).output
    )

Upload Data to SingleStoreDB

Prepare Data

We'll define item IDs in a way that encodes each event's label: the first three characters of the label (for example, Ben for benign or Bru for brute force) followed by the row index.

In [13]:

from tqdm import tqdm
items_to_upload = []

with clear_memory():
    model_res = intermediate_layer_model.predict(K.constant(data_23_cleaned.iloc[:,:-1]))

    for i, res in tqdm(zip(data_23_cleaned.iterrows(), model_res), total = len(model_res)):
        benign_or_attack = i[1]['Label'][:3]
        items_to_upload.append((benign_or_attack + '_' + str(i[0]), res.tolist()))

We'll store the data in a Pandas DataFrame.

In [14]:

with clear_memory():
    df = pd.DataFrame(items_to_upload, columns=['ID', 'Model_Results'])

df.head()

Now we'll convert the vectors to a binary format, ready to store in SingleStoreDB.

In [15]:

import struct

def data_to_binary(data: list[float]):
    format_string = 'f' * len(data)
    return struct.pack(format_string, *data)

with clear_memory():
    df['Model_Results'] = df['Model_Results'].apply(data_to_binary)
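As an optional sanity check, the packed bytes can be unpacked back into 32-bit floats with struct.unpack; a minimal sketch using the first row of the DataFrame:

packed = df['Model_Results'].iloc[0]
# Each float32 occupies 4 bytes, so the element count is len(packed) // 4.
unpacked = struct.unpack('f' * (len(packed) // 4), packed)
print(len(unpacked), unpacked[:5])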

We'll check the DataFrame.

In [16]:

df.head()

Create Database and Table

In [17]:

%%sql
DROP DATABASE IF EXISTS siem_log_kafka_demo;

CREATE DATABASE IF NOT EXISTS siem_log_kafka_demo;

USE siem_log_kafka_demo;

DROP TABLE IF EXISTS model_results_demo;

CREATE TABLE IF NOT EXISTS model_results (
    id TEXT,
    Model_Results BLOB
);

Get Connection Details

Action Required

Select the database from the drop-down menu at the top of this notebook. This updates the connection_url, which SQLAlchemy uses to connect to the selected database.

In [18]:

from sqlalchemy import *

db_connection = create_engine(connection_url)
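Optionally, you can confirm that the engine can reach the selected database before writing any data; a minimal check, assuming connection_url has been populated by the drop-down above:

from sqlalchemy import text

# Simple round-trip query to verify connectivity.
with db_connection.connect() as conn:
    print(conn.execute(text('SELECT 1')).scalar())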

Store DataFrame

In [19]:

with clear_memory():
    df.to_sql(
        'model_results',
        con = db_connection,
        if_exists = 'append',
        index = False,
        chunksize = 1000
    )

Check Stored Data

In [20]:

%%sql
USE siem_log_kafka_demo;

SELECT ID, JSON_ARRAY_UNPACK(Model_Results) AS Model_Results
FROM model_results
LIMIT 1;
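As a preview of the similarity search described in the introduction (the full matching workflow is the subject of Part 2), SingleStoreDB's DOT_PRODUCT can rank the stored embeddings against a query vector. The sketch below simply reuses one of the embeddings generated above as a stand-in query; in practice the query vector would come from running a new, unseen event through intermediate_layer_model:

import json
from sqlalchemy import text

# Stand-in query vector, reused from the uploaded items purely for illustration.
query_vector = items_to_upload[0][1]

similarity_query = text("""
    SELECT ID AS event_id,
           DOT_PRODUCT(Model_Results, JSON_ARRAY_PACK(:vec)) AS similarity
    FROM model_results
    ORDER BY similarity DESC
    LIMIT 5
""")

with db_connection.connect() as conn:
    for row in conn.execute(similarity_query, {'vec': json.dumps(query_vector)}):
        print(row.event_id, row.similarity)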

Details


About this Template

Part 1 of Real-Time Threat Detection. This notebook demonstrates the application of SingleStoreDB's similarity search to create a system for identifying infrequent occurrences, a common requirement in fields such as cybersecurity and fraud detection where only a small percentage of events are potentially malicious.

This Notebook can be run in Standard and Enterprise deployments.

Tags

advanced, cybersecurity, vectordb, iot, ai

License

This Notebook has been released under the Apache 2.0 open source license.
