IT Threat Detection, Part 1
Notebook
Note
This tutorial is meant for Standard and Premium Workspaces. You can't run this with a Free Starter Workspace due to restrictions on storage. Create a Workspace using +group in the left nav and select Standard for this notebook. Gallery notebooks tagged with "Starter" are suitable to run on a Free Starter Workspace.
This notebook demonstrates the application of SingleStoreDB's similarity search to create a system for identifying infrequent occurrences, a common requirement in fields such as cybersecurity and fraud detection where only a small percentage of events are potentially malicious.
In this instance, we aim to construct a network intrusion detection system. These systems continuously monitor incoming and outgoing network traffic, generating alerts when potential threats are detected. We'll utilize a combination of a deep learning model and similarity search to identify and classify network intrusion traffic.
Our initial step involves taking a dataset of traffic events labeled as benign or malicious and transforming them into vector embeddings. These vector embeddings serve as comprehensive mathematical representations of network traffic events. SingleStoreDB's built-in similarity-search algorithms allow us to measure the similarity between different network events. To generate these embeddings, we'll leverage a deep learning model based on recent academic research.
Subsequently, we'll apply this dataset to search for the most similar matches when presented with new, unseen network events. We'll retrieve these matches along with their corresponding labels. This process enables us to classify the unseen events as either benign or malicious by propagating the labels of the matched events. It's essential to note that intrusion detection is a complex classification task, primarily because malicious events occur infrequently. The similarity search service plays a crucial role in identifying relevant historical labeled events, thus enabling the identification of these rare events while maintaining a low rate of false alarms.
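To make the label-propagation step concrete, here is a minimal sketch of the kind of similarity query this approach relies on. It assumes the model_results table and binary vector format created later in this notebook; new_event_vector is a hypothetical embedding produced by the same model, and DOT_PRODUCT is SingleStoreDB's built-in vector similarity function. The full matching step is outside the scope of this notebook.
import json

# Hypothetical embedding for a new, unseen network event (placeholder values)
new_event_vector = [0.12, -0.03, 0.88]

# Rank stored labeled events by similarity to the new event; the label of each
# match is recoverable from its ID prefix ('Ben' for benign, otherwise an attack)
query = f"""
    SELECT ID,
           DOT_PRODUCT(Model_Results, JSON_ARRAY_PACK('{json.dumps(new_event_vector)}')) AS similarity
    FROM model_results
    ORDER BY similarity DESC
    LIMIT 5;
"""
# matches = pd.read_sql(query, con = db_connection)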
Install Dependencies
In [1]:
!pip install tensorflow keras==2.15.0 --quiet
In [2]:
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import pandas as pd
import tensorflow.keras.backend as K

from tensorflow import keras
from tensorflow.keras.models import Model
We'll define a Python context manager called clear_memory() using the contextlib module. This context manager will be used to clear memory by running Python's garbage collector (gc.collect()) after a block of code is executed.
In [3]:
import contextlib
import gc

@contextlib.contextmanager
def clear_memory():
    try:
        yield
    finally:
        gc.collect()
We'll incorporate portions of code from research work. To begin, we'll clone the repository required for data preparation.
In [4]:
!git clone -q https://github.com/Colorado-Mesa-University-Cybersecurity/DeepLearning-IDS.git
Data Preparation
The datasets we'll utilize comprise two types of network traffic:
Benign (normal)
Malicious (attack)
The malicious traffic stems from various network attacks, but our focus will be solely on web-based attacks. These web attacks fall into three common categories:
Cross-site scripting (BruteForce-XSS)
SQL-Injection (SQL-Injection)
Brute force attempts on administrative and user passwords (BruteForce-Web)
The original data was collected over a span of two days.
Download Data
We'll proceed by downloading data for two specific dates:
February 22, 2018
February 23, 2018
These files will be retrieved and saved to the current directory. Our intention is to use one of these dates for training and generating vectors, while the other will be reserved for testing purposes.
In [5]:
!wget "https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv" -q --show-progress!wget "https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Friday-23-02-2018_TrafficForML_CICFlowMeter.csv" -q --show-progress
Review Data
In [6]:
with clear_memory():
    data = pd.read_csv('Friday-23-02-2018_TrafficForML_CICFlowMeter.csv')

data.Label.value_counts()
Clean Data
We'll run a cleanup script from the previously downloaded GitHub repo.
In [7]:
!python DeepLearning-IDS/data_cleanup.py "Friday-23-02-2018_TrafficForML_CICFlowMeter.csv" "result23022018"
We'll now review the cleaned data from the previous step.
In [8]:
with clear_memory():
    data_23_cleaned = pd.read_csv('result23022018.csv')

data_23_cleaned.head()
In [9]:
data_23_cleaned.Label.value_counts()
Load Model
In this section, we'll load a pre-trained model that has been trained on data collected from the same date.
There are slight modifications to the original model; specifically, the number of classes has been altered. Initially, the model was designed to classify into four categories:
Benign
BruteForce-Web
BruteForce-XSS
SQL-Injection
Our modified model has been adjusted to classify into just two categories (a hypothetical sketch of this label mapping follows the list):
Benign
Attack
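For illustration only, here is a minimal, hypothetical sketch of how the original four labels could be collapsed into this binary scheme; the mapping function is not part of the original notebook, and the exact label strings come from the dataset's Label column.
# Hypothetical mapping from the original four classes to the two-class scheme:
# anything that is not benign is treated as an attack.
def to_binary_label(label: str) -> str:
    return 'Benign' if label == 'Benign' else 'Attack'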
Action Required
The ZIP file is hosted on Google Drive.
Using the Edit Firewall button in the top right, add the following to the SingleStoreDB Cloud notebook firewall, one-by-one:
drive.google.com
*.googleapis.com
*.googleusercontent.com
In [10]:
!wget -q -O it_threat_model.zip "https://drive.google.com/uc?export=download&id=1ahr5dYlhuxS56M6helUFI0yIxxIoFk9o"
!unzip -q it_threat_model.zip
In [11]:
with clear_memory():
    model = keras.models.load_model('it_threat_model')

model.summary()
In [12]:
with clear_memory():
    # Use the output of the first dense layer as the vector embedding for each event
    layer_name = 'dense'
    intermediate_layer_model = Model(
        inputs = model.input,
        outputs = model.get_layer(layer_name).output
    )
Upload Data to SingleStoreDB
Prepare Data
We'll define each item's ID so that it encodes the event's label (a short label prefix followed by the row index), allowing the label to be recovered from the ID alone.
In [13]:
from tqdm import tqdm

items_to_upload = []

with clear_memory():
    # Generate an embedding for every event using the intermediate layer
    model_res = intermediate_layer_model.predict(K.constant(data_23_cleaned.iloc[:, :-1]))

    for i, res in tqdm(zip(data_23_cleaned.iterrows(), model_res), total = len(model_res)):
        # Prefix the ID with the first three characters of the event's label
        benign_or_attack = i[1]['Label'][:3]
        items_to_upload.append((benign_or_attack + '_' + str(i[0]), res.tolist()))
We'll store the data in a Pandas DataFrame.
In [14]:
with clear_memory():
    df = pd.DataFrame(items_to_upload, columns=['ID', 'Model_Results'])

df.head()
Now we'll convert the vectors to a binary format, ready to store in SingleStoreDB.
In [15]:
import struct

def data_to_binary(data: list[float]):
    # Pack the vector as consecutive 32-bit floats
    format_string = 'f' * len(data)
    return struct.pack(format_string, *data)

with clear_memory():
    df['Model_Results'] = df['Model_Results'].apply(data_to_binary)
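As an optional sanity check (not part of the original notebook), you could unpack one of the packed vectors back into floats and confirm the round-trip:
def binary_to_data(blob: bytes) -> list:
    # Inverse of data_to_binary: unpack consecutive 32-bit floats
    return list(struct.unpack('f' * (len(blob) // 4), blob))

# Should reproduce the first vector's values (up to float32 precision)
binary_to_data(df['Model_Results'].iloc[0])[:5]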
We'll check the DataFrame.
In [16]:
df.head()
Create Database and Table
In [17]:
%%sql
DROP DATABASE IF EXISTS siem_log_kafka_demo;

CREATE DATABASE IF NOT EXISTS siem_log_kafka_demo;

USE siem_log_kafka_demo;

DROP TABLE IF EXISTS model_results;

CREATE TABLE IF NOT EXISTS model_results (
    id TEXT,
    Model_Results BLOB
);
Get Connection Details
Action Required
Select the database from the drop-down menu at the top of this notebook. It updates the connection_url which is used by SQLAlchemy to make connections to the selected database.
In [18]:
from sqlalchemy import *

db_connection = create_engine(connection_url)
Store DataFrame
In [19]:
with clear_memory():
    df.to_sql(
        'model_results',
        con = db_connection,
        if_exists = 'append',
        index = False,
        chunksize = 1000
    )
Check Stored Data
In [20]:
%%sql
USE siem_log_kafka_demo;

SELECT ID, JSON_ARRAY_UNPACK(Model_Results) AS Model_Results
FROM model_results
LIMIT 1;
Details
About this Template
Part 1 of Real-Time Threat Detection - This notebook demonstrates the application of SingleStoreDB's similarity search to create a system for identifying infrequent occurrences, a common requirement in fields such as cybersecurity and fraud detection where only a small percentage of events are potentially malicious.
This Notebook can be run in Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.