IT Threat Detection, Part 1

Note

This tutorial is meant for Standard and Premium Workspaces. It can't run on a Free Starter Workspace because of storage restrictions. Create a Workspace using +group in the left nav and select Standard for this notebook. Gallery notebooks tagged with "Starter" are suitable to run on a Free Starter Workspace.

This notebook demonstrates the application of SingleStoreDB's similarity search to create a system for identifying infrequent occurrences, a common requirement in fields such as cybersecurity and fraud detection where only a small percentage of events are potentially malicious.

In this instance, we aim to construct a network intrusion detection system. These systems continuously monitor incoming and outgoing network traffic, generating alerts when potential threats are detected. We'll utilize a combination of a deep learning model and similarity search to identify and classify network intrusion traffic.

Our first step is to take a dataset of labeled traffic events, each marked as benign or malicious, and transform those events into vector embeddings. These vector embeddings serve as compact mathematical representations of network traffic events. SingleStoreDB's built-in similarity-search functions allow us to measure the similarity between different network events. To generate these embeddings, we'll leverage a deep learning model based on recent academic research.

Next, we'll use this dataset of embeddings to search for the most similar matches when presented with new, unseen network events. We'll retrieve these matches along with their corresponding labels. This lets us classify the unseen events as either benign or malicious by propagating the labels of the matched events. Intrusion detection is a difficult classification task, primarily because malicious events occur infrequently. Similarity search helps by surfacing relevant historical labeled events, which makes it possible to identify these rare events while keeping the rate of false alarms low.
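The label-propagation idea described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the notebook's implementation: the toy two-dimensional vectors, the `classify` helper, the majority vote, and the value of `k` are all assumptions, and the dot product stands in for the similarity score.

```python
from collections import Counter

# Hypothetical labeled embeddings (toy 2-D vectors, not real model output).
labeled = [
    ([1.0, 0.0], "Benign"),
    ([0.9, 0.1], "Benign"),
    ([0.8, 0.2], "Benign"),
    ([0.0, 1.0], "Attack"),
    ([0.1, 0.9], "Attack"),
]


def dot(a, b):
    """Dot product of two equal-length vectors, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def classify(query, labeled, k=3):
    """Return the majority label among the k labeled vectors most similar to query."""
    ranked = sorted(labeled, key=lambda item: dot(query, item[0]), reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


print(classify([0.95, 0.05], labeled))  # Benign
print(classify([0.05, 0.95], labeled))  # Attack
```

In the real system the labeled vectors live in a database table and the ranking is done by the database, but the classification step follows the same pattern: rank by similarity, take the top matches, and reuse their labels.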

Install Dependencies

In [1]:

!pip install tensorflow keras==2.15.0 --quiet

In [2]:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import pandas as pd
import tensorflow.keras.backend as K
from tensorflow import keras
from tensorflow.keras.models import Model

We'll define a Python context manager called clear_memory() using the contextlib module. This context manager will be used to clear memory by running Python's garbage collector (gc.collect()) after a block of code is executed.

In [3]:

import contextlib
import gc
@contextlib.contextmanager
def clear_memory():
    try:
        yield
    finally:
        gc.collect()
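Because the collection happens in a `finally` clause, it also runs when the wrapped block raises. The following sketch checks that behavior. The `record` callback and the simulated `ValueError` are assumptions added purely for illustration; `gc.callbacks` is the standard-library hook used to observe collections.

```python
import contextlib
import gc


@contextlib.contextmanager
def clear_memory():
    try:
        yield
    finally:
        gc.collect()


collections_seen = []


def record(phase, info):
    # Hypothetical observer: note which generation each collection targets.
    if phase == "start":
        collections_seen.append(info["generation"])


gc.callbacks.append(record)
try:
    with clear_memory():
        raise ValueError("simulated failure inside the block")
except ValueError:
    pass  # the error still propagates out of the context manager
finally:
    gc.callbacks.remove(record)

# gc.collect() with no arguments runs a full (generation 2) collection.
print(2 in collections_seen)
```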

We'll incorporate portions of code from published research work. To begin, we'll clone the repository required for data preparation.

In [4]:

!git clone -q https://github.com/Colorado-Mesa-University-Cybersecurity/DeepLearning-IDS.git

Data Preparation

The datasets we'll use comprise two types of network traffic:

  1. Benign (normal)

  2. Malicious (attack), stemming from various network attacks

Our focus will be solely on web-based attacks. These web attacks fall into three common categories:

  1. Cross-site scripting (BruteForce-XSS)

  2. SQL-Injection (SQL-Injection)

  3. Brute force attempts on administrative and user passwords (BruteForce-Web)

The original data was collected over a span of two days.

Download Data

We'll proceed by downloading data for two specific dates:

  1. February 22, 2018

  2. February 23, 2018

These files will be retrieved and saved to the current directory. Our intention is to use one of these dates for training and generating vectors, while the other will be reserved for testing purposes.

In [5]:

!wget "https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv" -q --show-progress
!wget "https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Friday-23-02-2018_TrafficForML_CICFlowMeter.csv" -q --show-progress

Review Data

In [6]:

with clear_memory():
    data = pd.read_csv('Friday-23-02-2018_TrafficForML_CICFlowMeter.csv')
data.Label.value_counts()

Clean Data

We'll run a cleanup script from the previously downloaded GitHub repo.

In [7]:

!python DeepLearning-IDS/data_cleanup.py "Friday-23-02-2018_TrafficForML_CICFlowMeter.csv" "result23022018"

We'll now review the cleaned data from the previous step.

In [8]:

with clear_memory():
    data_23_cleaned = pd.read_csv('result23022018.csv')
data_23_cleaned.head()

In [9]:

data_23_cleaned.Label.value_counts()

Load Model

In this section, we'll load a model that was pre-trained on data collected on the same date.

We've made one slight modification to the original model: the number of classes. Initially, the model was designed to classify into four categories:

  1. Benign

  2. BruteForce-Web

  3. BruteForce-XSS

  4. SQL-Injection

Our modified model has been adjusted to classify into just two categories:

  1. Benign

  2. Attack
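The collapse from four classes to two can be pictured as a simple label mapping. The sketch below only illustrates that mapping, using the category names listed above. The `to_binary_label` helper is hypothetical, and the source does not show how the modified model was actually retrained.

```python
# The four original categories, as named in this notebook.
ORIGINAL_CLASSES = ["Benign", "BruteForce-Web", "BruteForce-XSS", "SQL-Injection"]


def to_binary_label(label: str) -> str:
    """Map any non-benign category to the single 'Attack' class."""
    return "Benign" if label == "Benign" else "Attack"


binary_labels = {label: to_binary_label(label) for label in ORIGINAL_CLASSES}
for original, binary in binary_labels.items():
    print(f"{original} -> {binary}")
```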

Action Required

The ZIP file containing the model is hosted on Google Drive.

Using the Edit Firewall button in the top right, add the following to the SingleStoreDB Cloud notebook firewall, one-by-one:

  • drive.google.com

  • *.googleapis.com

  • *.googleusercontent.com

In [10]:

!wget -q -O it_threat_model.zip "https://drive.google.com/uc?export=download&id=1ahr5dYlhuxS56M6helUFI0yIxxIoFk9o"
!unzip -q it_threat_model.zip

In [11]:

with clear_memory():
    model = keras.models.load_model('it_threat_model')
model.summary()

In [12]:

with clear_memory():
    # Use the first layer
    layer_name = 'dense'
    intermediate_layer_model = Model(
        inputs = model.input,
        outputs = model.get_layer(layer_name).output
    )

Upload Data to SingleStoreDB

Prepare Data

We'll use a method for defining item IDs that aligns with the event's label.

In [13]:

from tqdm import tqdm
items_to_upload = []
with clear_memory():
    # Embed every row, using all columns except the final Label column
    model_res = intermediate_layer_model.predict(K.constant(data_23_cleaned.iloc[:,:-1]))
    for i, res in tqdm(zip(data_23_cleaned.iterrows(), model_res), total = len(model_res)):
        # ID = first three characters of the label + '_' + row index
        benign_or_attack = i[1]['Label'][:3]
        items_to_upload.append((benign_or_attack + '_' + str(i[0]), res.tolist()))
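The ID scheme used in the cell above (the first three characters of the label, an underscore, then the row index) can be tried in isolation. In this sketch, `make_item_id` and `is_benign_id` are hypothetical helper names, and the example labels are the category names used in this notebook rather than rows from the dataset.

```python
def make_item_id(label: str, row_index: int) -> str:
    """Build an ID from the first three characters of the label and the row index."""
    return f"{label[:3]}_{row_index}"


def is_benign_id(item_id: str) -> bool:
    """Recover the benign/attack distinction from an ID built by make_item_id."""
    return item_id.startswith("Ben_")


examples = [("Benign", 0), ("BruteForce-Web", 17), ("SQL-Injection", 42)]
ids = [make_item_id(label, index) for label, index in examples]
print(ids)  # ['Ben_0', 'Bru_17', 'SQL_42']
print([is_benign_id(item_id) for item_id in ids])  # [True, False, False]
```

Encoding the label in the ID means a similarity match can be classified from its ID alone. One consequence of the three-character prefix is that different categories starting with the same three letters share the same prefix.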

We'll store the data in a Pandas DataFrame.

In [14]:

with clear_memory():
    df = pd.DataFrame(items_to_upload, columns=['ID', 'Model_Results'])
df.head()

Now we'll convert the vectors to a binary format, ready to store in SingleStoreDB.

In [15]:

import struct
def data_to_binary(data: list[float]):
    format_string = 'f' * len(data)
    return struct.pack(format_string, *data)
with clear_memory():
    df['Model_Results'] = df['Model_Results'].apply(data_to_binary)
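To see what the packed format looks like, the sketch below round-trips a small vector through `struct`. Two things here are assumptions rather than part of the notebook: the inverse helper `binary_to_data`, and the explicit `<` (little-endian) prefix. The notebook's version uses native byte order, which gives the same bytes on little-endian machines such as x86. The sample values were chosen because they are exactly representable as 32-bit floats.

```python
import struct


def data_to_binary(data):
    """Pack a list of floats as consecutive little-endian 32-bit floats."""
    return struct.pack("<" + "f" * len(data), *data)


def binary_to_data(blob):
    """Hypothetical inverse: unpack a blob of 32-bit floats back into a list."""
    count = len(blob) // 4  # each float32 occupies 4 bytes
    return list(struct.unpack("<" + "f" * count, blob))


vector = [0.5, -1.25, 3.0]
blob = data_to_binary(vector)
print(len(blob))             # 12
print(binary_to_data(blob))  # [0.5, -1.25, 3.0]
```

Values that are not exactly representable in 32 bits lose some precision on the way in, so a round trip generally returns a close approximation rather than the identical Python float.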

We'll check the DataFrame.

In [16]:

df.head()

Create Database and Table

In [17]:

%%sql
DROP DATABASE IF EXISTS siem_log_kafka_demo;
CREATE DATABASE IF NOT EXISTS siem_log_kafka_demo;
USE siem_log_kafka_demo;
DROP TABLE IF EXISTS model_results;
CREATE TABLE IF NOT EXISTS model_results (
    id TEXT,
    Model_Results BLOB
);

Get Connection Details

Action Required

Select the database from the drop-down menu at the top of this notebook. This updates the connection_url variable, which SQLAlchemy uses to connect to the selected database.

In [18]:

from sqlalchemy import create_engine
db_connection = create_engine(connection_url)

Store DataFrame

In [19]:

with clear_memory():
    df.to_sql(
        'model_results',
        con = db_connection,
        if_exists = 'append',
        index = False,
        chunksize = 1000
    )

Check Stored Data

In [20]:

%%sql
USE siem_log_kafka_demo;
SELECT ID, JSON_ARRAY_UNPACK(Model_Results) AS Model_Results
FROM model_results
LIMIT 1;

Details

Tags

#advanced #cybersecurity #vectordb #iot #ai

License

This Notebook has been released under the Apache 2.0 open source license.