
IT Threat Detection, Part 2
Notebook

Note
This tutorial is meant for Standard & Premium Workspaces. You can't run this with a Free Starter Workspace due to restrictions on Storage. Create a Workspace using +group in the left nav & select Standard for this notebook. Gallery notebooks tagged with "Starter" are suitable to run on a Free Starter Workspace
Install Dependencies
In [1]:
1!pip3 install tensorflow keras==2.15.0 scikit-learn --quiet
In [2]:
1import os2os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'3 4import pandas as pd5import tensorflow.keras.backend as K6from collections import Counter7from sklearn.metrics import accuracy_score, precision_score, recall_score8from sklearn.metrics import confusion_matrix9from tensorflow import keras10from tensorflow.keras.models import Model
We'll define a Python context manager called clear_memory()
using the contextlib module. This context manager will be used to clear memory by running Python's garbage collector (gc.collect()
) after a block of code is executed.
In [3]:
1import contextlib2import gc3 4@contextlib.contextmanager5def clear_memory():6 try:7 yield8 finally:9 gc.collect()
Load Model
In [4]:
1with clear_memory():2 model = keras.models.load_model('it_threat_model')3 4model.summary()
In [5]:
1with clear_memory():2 # Select the first layer3 layer_name = 'dense'4 intermediate_layer_model = Model(5 inputs = model.input,6 outputs = model.get_layer(layer_name).output7 )
Data Preparation
We'll use the second file we downloaded earlier for testing purposes.
Review Data
In [6]:
1with clear_memory():2 data = pd.read_csv('Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv')3 4data.Label.value_counts()
Clean Data
We'll run a cleanup script from the previously downloaded GitHub repo.
In [7]:
1!python DeepLearning-IDS/data_cleanup.py "Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv" "result22022018"
We'll now review the cleaned data from the previous step.
In [8]:
1with clear_memory():2 data_22_cleaned = pd.read_csv('result22022018.csv')3 4data_22_cleaned.head()
In [9]:
1data_22_cleaned.Label.value_counts()
We'll create a sample that encompasses all the distinct types of web attacks observed on this particular date.
In [10]:
1with clear_memory():2 data_sample = data_22_cleaned[-2000:]3 4data_sample.Label.value_counts()
Get Connection Details
Action Required
Select the database from the drop-down menu at the top of this notebook. It updates the connection_url which is used by SQLAlchemy to make connections to the selected database.
In [11]:
1from sqlalchemy import *2 3db_connection = create_engine(connection_url)
Queries
Next, we'll perform queries on the test dataset and store the predicted and expected results, enabling us to construct a confusion matrix.
In [12]:
1from tqdm import tqdm2import numpy as np3 4y_true = []5y_pred = []6 7BATCH_SIZE = 1008 9for i in tqdm(range(0, len(data_sample), BATCH_SIZE)):10 test_data = data_sample.iloc[i:i+BATCH_SIZE, :]11 12 # Create vector embedding using the model13 test_vector = intermediate_layer_model.predict(K.constant(test_data.iloc[:, :-1]))14 query_results = []15 16 for xq in test_vector.tolist():17 # SQL query here, make sure it returns 'id' column18 query_res = %sql SELECT id, EUCLIDEAN_DISTANCE(Model_Results, JSON_ARRAY_PACK('{{xq}}')) AS score FROM model_results WHERE score IS NOT NULL ORDER BY score ASC LIMIT 50;19 query_results.append(pd.DataFrame(query_res))20 21 for label, res in zip(test_data.Label.values, query_results):22 23 if 'id' not in res.columns:24 print("Column 'id' not found in res.")25 continue26 27 if label == 'Benign':28 y_true.append(0)29 else:30 y_true.append(1)31 32 ids_to_count = [id.split('_')[0] for id in res['id']]33 counter = Counter(ids_to_count)34 # print(counter)35 36 if counter.get('Bru') or counter.get('SQL'):37 y_pred.append(1)38 else:39 y_pred.append(0)
Visualize Results
Confusion Matrix
In [13]:
1import plotly.graph_objs as go2 3# Calculate the confusion matrix4conf_matrix = confusion_matrix(y_true, y_pred)5 6# Create a DataFrame from the confusion matrix7conf_matrix_df = pd.DataFrame(8 conf_matrix,9 columns = ['Benign', 'Attack'],10 index = ['Benign', 'Attack']11)12 13# Create an empty list to store annotations14annotations = []15 16# Define a threshold for text color17thresh = conf_matrix_df.values.max() / 218 19# Loop through the confusion matrix and add annotations with text color based on the threshold20for i in range(len(conf_matrix_df)):21 for j in range(len(conf_matrix_df)):22 value = conf_matrix_df.iloc[i, j]23 text_color = "white" if value > thresh else "black"24 annotations.append(25 go.layout.Annotation(26 x = j,27 y = i,28 text = str(value),29 font = dict(color = text_color),30 showarrow = False,31 )32 )33 34# Create a heatmap trace with showscale set to False35trace = go.Heatmap(36 z = conf_matrix_df.values,37 x = ['Benign', 'Attack'],38 y = ['Benign', 'Attack'],39 colorscale = 'Reds',40 showscale = False41)42 43# Create the figure with heatmap and annotations44fig = go.Figure(45 data = [trace],46 layout = {47 "title": "Confusion Matrix",48 "xaxis": {"title": "Predicted", "scaleanchor": "y", "scaleratio": 1},49 "yaxis": {"title": "Actual"},50 "annotations": annotations,51 "height": 400,52 "width": 40053 }54)55 56fig.show()
In [14]:
1# Create confusion matrix2conf_matrix = confusion_matrix(y_true, y_pred)3 4# Define class labels5class_labels = ['Benign', 'Attack']6 7# Print confusion matrix with labels8print("Confusion Matrix:")9for i in range(len(class_labels)):10 for j in range(len(class_labels)):11 print(f"{class_labels[i]} (Actual) -> {class_labels[j]} (Predicted): {conf_matrix[i][j]}")
Accuracy
In [15]:
1# Calculate accuracy2acc = accuracy_score(y_true, y_pred, normalize = True, sample_weight = None)3precision = precision_score(y_true, y_pred)4recall = recall_score(y_true, y_pred)5 6print(f"Accuracy: {acc:.3f}")7print(f"Precision: {precision:.3f}")8print(f"Recall: {recall:.3f}")
Per Class Accuracy
In [16]:
1# Calculate per class accuracy2cmd = confusion_matrix(y_true, y_pred, normalize = "true").diagonal()3per_class_accuracy_df = pd.DataFrame([(index, round(value,4)) for index, value in zip(['Benign', 'Attack'], cmd)], columns = ['type', 'accuracy'])4per_class_accuracy_df = per_class_accuracy_df.round(2)5display(per_class_accuracy_df)
Predict Values Directly from Model
We achieved excellent results with SingleStoreDB. Now, let's explore what happens when we bypass the similarity search step and make predictions directly from the model. In other words, we'll utilize the model responsible for generating the embeddings as a classifier. We can then compare the accuracy of this approach with that of the similarity search method.
In [17]:
1from tensorflow.keras.utils import normalize2import numpy as np3 4data_sample = normalize(data_22_cleaned.iloc[:, :-1])[-2000:]5y_pred_model = model.predict(normalize(data_sample)).flatten()6y_pred_model = np.round(y_pred_model)
Visualize Results
Confusion Matrix
In [18]:
1# Create confusion matrix2conf_matrix = confusion_matrix(y_true, y_pred_model)3 4# Create a DataFrame from the confusion matrix5conf_matrix_df = pd.DataFrame(6 conf_matrix,7 columns = ['Benign', 'Attack'],8 index = ['Benign', 'Attack']9)10 11# Create an empty list to store annotations12annotations = []13 14# Define a threshold for text color15thresh = conf_matrix_df.values.max() / 216 17# Loop through the confusion matrix and add annotations with text color based on the threshold18for i in range(len(conf_matrix_df)):19 for j in range(len(conf_matrix_df)):20 value = conf_matrix_df.iloc[i, j]21 text_color = "white" if value > thresh else "black"22 annotations.append(23 go.layout.Annotation(24 x = j,25 y = i,26 text = str(value),27 font = dict(color=text_color),28 showarrow = False,29 )30 )31 32# Create a heatmap trace with showscale set to False33trace = go.Heatmap(34 z = conf_matrix_df.values,35 x = ['Benign', 'Attack'],36 y = ['Benign', 'Attack'],37 colorscale = 'Reds',38 showscale = False39)40 41# Create the figure with heatmap and annotations42fig = go.Figure(43 data = [trace],44 layout = {45 "title": "Confusion Matrix",46 "xaxis": {"title": "Predicted", "scaleanchor": "y", "scaleratio": 1},47 "yaxis": {"title": "Actual"},48 "annotations": annotations,49 "height": 400,50 "width": 40051 }52)53 54fig.show()
In [19]:
1# Create confusion matrix2conf_matrix = confusion_matrix(y_true, y_pred_model)3 4# Define class labels5class_labels = ['Benign', 'Attack']6 7# Print confusion matrix with labels8print("Confusion Matrix:")9for i in range(len(class_labels)):10 for j in range(len(class_labels)):11 print(f"{class_labels[i]} (Actual) -> {class_labels[j]} (Predicted): {conf_matrix[i][j]}")
Accuracy
In [20]:
1# Calculate accuracy2acc = accuracy_score(y_true, y_pred_model, normalize = True, sample_weight = None)3precision = precision_score(y_true, y_pred_model)4recall = recall_score(y_true, y_pred_model)5 6print(f"Accuracy: {acc:.3f}")7print(f"Precision: {precision:.3f}")8print(f"Recall: {recall:.3f}")
Per Class Accuracy
In [21]:
1# Calculate per class accuracy2cmd = confusion_matrix(y_true, y_pred_model, normalize = "true").diagonal()3per_class_accuracy_df = pd.DataFrame([(index, round(value,4)) for index, value in zip(['Benign', 'Attack'], cmd)], columns = ['type', 'accuracy'])4per_class_accuracy_df = per_class_accuracy_df.round(2)5display(per_class_accuracy_df)
Conclusions
Utilizing SingleStoreDB's vector embeddings, we achieved an extremely high detection rate for attacks while maintaining a very small false-positive rate. Furthermore, our example showed that our similarity search methodology surpassed the direct classification approach that relies on the classifier's embedding model.

Details
About this Template
Part 2 or Real-time threat Detection - Validate the accuracy of the threat detection model with a test dataset
This Notebook can be run in Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.
See Notebook in action
Launch this notebook in SingleStore and start executing queries instantly.