IT Threat Detection, Part 2
Note
This tutorial is meant for Standard and Premium Workspaces. You can't run this with a Free Starter Workspace due to restrictions on storage. Create a Workspace using +group in the left nav and select Standard for this notebook. Gallery notebooks tagged with "Starter" are suitable to run on a Free Starter Workspace.
Install Dependencies
In [1]:
!pip3 install tensorflow keras==2.15.0 scikit-learn --quiet
In [2]:
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import pandas as pd
import tensorflow.keras.backend as K

from collections import Counter
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix
from tensorflow import keras
from tensorflow.keras.models import Model
We'll define a Python context manager called clear_memory() using the contextlib module. This context manager will be used to clear memory by running Python's garbage collector (gc.collect()) after a block of code is executed.
In [3]:
import contextlib
import gc

@contextlib.contextmanager
def clear_memory():
    try:
        yield
    finally:
        gc.collect()
Load Model
In [4]:
with clear_memory():
    model = keras.models.load_model('it_threat_model')

model.summary()
In [5]:
with clear_memory():
    # Select the first layer
    layer_name = 'dense'
    intermediate_layer_model = Model(
        inputs = model.input,
        outputs = model.get_layer(layer_name).output
    )
Data Preparation
We'll use the second file we downloaded earlier for testing purposes.
Review Data
In [6]:
with clear_memory():
    data = pd.read_csv('Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv')

data.Label.value_counts()
Clean Data
We'll run a cleanup script from the previously downloaded GitHub repo.
In [7]:
!python DeepLearning-IDS/data_cleanup.py "Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv" "result22022018"
We'll now review the cleaned data from the previous step.
In [8]:
with clear_memory():
    data_22_cleaned = pd.read_csv('result22022018.csv')

data_22_cleaned.head()
In [9]:
data_22_cleaned.Label.value_counts()
We'll create a sample that encompasses all the distinct types of web attacks observed on this particular date.
In [10]:
with clear_memory():
    data_sample = data_22_cleaned[-2000:]

data_sample.Label.value_counts()
Get Connection Details
Action Required
Select the database from the drop-down menu at the top of this notebook. Doing so updates the connection_url variable, which SQLAlchemy uses to connect to the selected database.
In [11]:
from sqlalchemy import *

db_connection = create_engine(connection_url)
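The notebook environment injects the real connection_url for you, but for reference, a minimal sketch of the kind of URL it typically holds (hypothetical host and credentials):

# Hypothetical example only -- the drop-down above supplies the real value
# connection_url = "singlestoredb://admin:MyPassword@svc-example.svc.singlestore.com:3306/my_database"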
Queries
Next, we'll perform queries on the test dataset and store the predicted and expected results, enabling us to construct a confusion matrix.
In [12]:
from tqdm import tqdm
import numpy as np

y_true = []
y_pred = []

BATCH_SIZE = 100

for i in tqdm(range(0, len(data_sample), BATCH_SIZE)):
    test_data = data_sample.iloc[i:i+BATCH_SIZE, :]

    # Create vector embedding using the model
    test_vector = intermediate_layer_model.predict(K.constant(test_data.iloc[:, :-1]))

    query_results = []

    for xq in test_vector.tolist():
        # SQL query here, make sure it returns 'id' column
        query_res = %sql SELECT id, EUCLIDEAN_DISTANCE(Model_Results, JSON_ARRAY_PACK('{{xq}}')) AS score FROM model_results WHERE score IS NOT NULL ORDER BY score ASC LIMIT 50;
        query_results.append(pd.DataFrame(query_res))

    for label, res in zip(test_data.Label.values, query_results):
        if 'id' not in res.columns:
            print("Column 'id' not found in res.")
            continue

        if label == 'Benign':
            y_true.append(0)
        else:
            y_true.append(1)

        ids_to_count = [id.split('_')[0] for id in res['id']]
        counter = Counter(ids_to_count)
        # print(counter)

        if counter.get('Bru') or counter.get('SQL'):
            y_pred.append(1)
        else:
            y_pred.append(0)
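To make the voting rule above concrete, here's a tiny illustration with made-up neighbor ids. It assumes the ids stored in Part 1 carry a label prefix such as Bru (Brute Force) or SQL (SQL Injection); any such neighbor among the 50 nearest flags the row as an attack.

# Toy illustration of the labeling rule (hypothetical ids)
sample_ids = ['Benign_101', 'Bru_17', 'Bru_23', 'Benign_42']
prefixes = [s.split('_')[0] for s in sample_ids]
counter = Counter(prefixes)
print(counter)                                         # Counter({'Benign': 2, 'Bru': 2})
print(bool(counter.get('Bru') or counter.get('SQL')))  # True -> classified as an attack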
Visualize Results
Confusion Matrix
In [13]:
import plotly.graph_objs as go

# Calculate the confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred)

# Create a DataFrame from the confusion matrix
conf_matrix_df = pd.DataFrame(
    conf_matrix,
    columns = ['Benign', 'Attack'],
    index = ['Benign', 'Attack']
)

# Create an empty list to store annotations
annotations = []

# Define a threshold for text color
thresh = conf_matrix_df.values.max() / 2

# Loop through the confusion matrix and add annotations with text color based on the threshold
for i in range(len(conf_matrix_df)):
    for j in range(len(conf_matrix_df)):
        value = conf_matrix_df.iloc[i, j]
        text_color = "white" if value > thresh else "black"
        annotations.append(go.layout.Annotation(
            x = j,
            y = i,
            text = str(value),
            font = dict(color = text_color),
            showarrow = False,
        ))

# Create a heatmap trace with showscale set to False
trace = go.Heatmap(
    z = conf_matrix_df.values,
    x = ['Benign', 'Attack'],
    y = ['Benign', 'Attack'],
    colorscale = 'Reds',
    showscale = False
)

# Create the figure with heatmap and annotations
fig = go.Figure(
    data = [trace],
    layout = {
        "title": "Confusion Matrix",
        "xaxis": {"title": "Predicted", "scaleanchor": "y", "scaleratio": 1},
        "yaxis": {"title": "Actual"},
        "annotations": annotations,
        "height": 400,
        "width": 400
    }
)

fig.show()
In [14]:
# Create confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred)

# Define class labels
class_labels = ['Benign', 'Attack']

# Print confusion matrix with labels
print("Confusion Matrix:")
for i in range(len(class_labels)):
    for j in range(len(class_labels)):
        print(f"{class_labels[i]} (Actual) -> {class_labels[j]} (Predicted): {conf_matrix[i][j]}")
Accuracy
In [15]:
# Calculate accuracy, precision and recall
acc = accuracy_score(y_true, y_pred, normalize = True, sample_weight = None)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Accuracy: {acc:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
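As a quick sanity check, the same precision and recall can be recomputed directly from the confusion-matrix cells (for binary labels, scikit-learn's confusion_matrix ravels to tn, fp, fn, tp):

# Sanity check: derive the same metrics from raw confusion-matrix counts
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"Precision = tp / (tp + fp) = {tp / (tp + fp):.3f}")
print(f"Recall    = tp / (tp + fn) = {tp / (tp + fn):.3f}")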
Per Class Accuracy
In [16]:
# Calculate per class accuracy
cmd = confusion_matrix(y_true, y_pred, normalize = "true").diagonal()
per_class_accuracy_df = pd.DataFrame(
    [(index, round(value, 4)) for index, value in zip(['Benign', 'Attack'], cmd)],
    columns = ['type', 'accuracy']
)
per_class_accuracy_df = per_class_accuracy_df.round(2)
display(per_class_accuracy_df)
Predict Values Directly from Model
We achieved excellent results with SingleStoreDB. Now, let's explore what happens when we bypass the similarity search step and make predictions directly from the model. In other words, we'll utilize the model responsible for generating the embeddings as a classifier. We can then compare the accuracy of this approach with that of the similarity search method.
In [17]:
from tensorflow.keras.utils import normalize
import numpy as np

data_sample = normalize(data_22_cleaned.iloc[:, :-1])[-2000:]

y_pred_model = model.predict(normalize(data_sample)).flatten()
y_pred_model = np.round(y_pred_model)
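The model emits one probability per row, and np.round effectively thresholds it at 0.5. If you want to trade precision against recall, a minimal sketch with the threshold made explicit (THRESHOLD is a hypothetical tuning knob, not part of the original notebook):

# Effectively the same decision as np.round, but with a tunable cutoff
THRESHOLD = 0.5  # hypothetical; raise to favor precision, lower to favor recall
y_pred_thresholded = (model.predict(normalize(data_sample)).flatten() > THRESHOLD).astype(int)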
Visualize Results
Confusion Matrix
In [18]:
# Create confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred_model)

# Create a DataFrame from the confusion matrix
conf_matrix_df = pd.DataFrame(
    conf_matrix,
    columns = ['Benign', 'Attack'],
    index = ['Benign', 'Attack']
)

# Create an empty list to store annotations
annotations = []

# Define a threshold for text color
thresh = conf_matrix_df.values.max() / 2

# Loop through the confusion matrix and add annotations with text color based on the threshold
for i in range(len(conf_matrix_df)):
    for j in range(len(conf_matrix_df)):
        value = conf_matrix_df.iloc[i, j]
        text_color = "white" if value > thresh else "black"
        annotations.append(go.layout.Annotation(
            x = j,
            y = i,
            text = str(value),
            font = dict(color = text_color),
            showarrow = False,
        ))

# Create a heatmap trace with showscale set to False
trace = go.Heatmap(
    z = conf_matrix_df.values,
    x = ['Benign', 'Attack'],
    y = ['Benign', 'Attack'],
    colorscale = 'Reds',
    showscale = False
)

# Create the figure with heatmap and annotations
fig = go.Figure(
    data = [trace],
    layout = {
        "title": "Confusion Matrix",
        "xaxis": {"title": "Predicted", "scaleanchor": "y", "scaleratio": 1},
        "yaxis": {"title": "Actual"},
        "annotations": annotations,
        "height": 400,
        "width": 400
    }
)

fig.show()
In [19]:
# Create confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred_model)

# Define class labels
class_labels = ['Benign', 'Attack']

# Print confusion matrix with labels
print("Confusion Matrix:")
for i in range(len(class_labels)):
    for j in range(len(class_labels)):
        print(f"{class_labels[i]} (Actual) -> {class_labels[j]} (Predicted): {conf_matrix[i][j]}")
Accuracy
In [20]:
# Calculate accuracy, precision and recall
acc = accuracy_score(y_true, y_pred_model, normalize = True, sample_weight = None)
precision = precision_score(y_true, y_pred_model)
recall = recall_score(y_true, y_pred_model)

print(f"Accuracy: {acc:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
Per Class Accuracy
In [21]:
# Calculate per class accuracy
cmd = confusion_matrix(y_true, y_pred_model, normalize = "true").diagonal()
per_class_accuracy_df = pd.DataFrame(
    [(index, round(value, 4)) for index, value in zip(['Benign', 'Attack'], cmd)],
    columns = ['type', 'accuracy']
)
per_class_accuracy_df = per_class_accuracy_df.round(2)
display(per_class_accuracy_df)
Conclusions
Using vector embeddings stored in SingleStoreDB, we achieved a very high detection rate for attacks while maintaining a small false-positive rate. Furthermore, our example showed that the similarity-search approach outperformed direct classification using the same model that produced the embeddings.
Details
About this Template
Part 2 of Real-Time Threat Detection: validate the accuracy of the threat detection model with a test dataset.
This Notebook can be run in Standard and Enterprise deployments.
License
This Notebook has been released under the Apache 2.0 open source license.