Telecom Fraud Detection Through CDR Using SingleStore Vector Functions

AI offers a dynamic and sophisticated means to detect and prevent telecom anomalies, which often involves identifying unusual patterns in data. Vectorization makes the database detecting these anomalies more efficient, executing the complex queries needed for pattern recognition in some fraud cases. SingleStore can play a vital role in setting pattern recognition using its vector capabilities.

What is SingleStore?

SingleStore is a real-time, distributed SQL database built for AI. With familiar SQL tooling and MySQL wire protocol compatibility, SingleStore eliminates the need for specialized databases and simplifies data architectures.

SingleStore is also built to handle multiple data types — including JSON, time-series, geospatial, full-text search and vectors — delivering high-speed data ingestion on a unified transactional and analytical foundation.

Vector capabilities of SingleStore

SingleStore supports vector database processing, allowing you to store and search vector data. A typical vector search locates the set of vectors that most closely match a query vector. Vectors usually come from objects: text, images, video, audio, etc., and vector database searches find data based on content or meaning — even without exact matches. For example, vector search allows for semantic search of text, where a query about "meals" could return information about "lunch" and "dinner" without using those words because they are similar in meaning.

Some benefits of using SingleStore for vector database processing — as opposed to a specialized vector database system — are:

A broad array of standard modern database capabilities are available in SingleStore. These include SQL, fast distributed and parallel query processing, full-text search, extensibility, ACID transactions, high availability, disaster recovery, point-in-time recovery, broad connectivity support and more.
Less data movement is needed between different data subsystems (e.g., caches, text search systems and SQL databases) when all t data, including vector data, is stored in SingleStore.
Operational costs may be reduced since fewer data management tools and copies of data are needed.
Less specialized skills and reduced labor are needed to run an application environment.

Anomalies in telecom CDR

Anomalies in telecom CDR (Call Detail Record) refer to irregular or unusual patterns in the data that records details of phone calls, SMS messages or data usage in a telecommunications network. A report from the Communication Fraud Control Association (CFCA) reveals that telecommunications fraud increased 12% in 2023, equating to an estimated $38.95 billion lost to fraud.

**Source:** *Communication Fraud Control Association, Nov. 2023*

Anomalies in CDR records are a good source to detect frauds in telecommunication.
We will address the following frauds through vector capabilities of SingleStore in telecom CDR records:

A. International revenue share fraud

B. Wangiri fraud

C. Subscription fraud

D. PBX hacking

Lets understand these frauds through sample CDR records

Explanation:

International revenue share fraud (IRSF):

Example: Call ID 1
Signs: The call is unusually long (1 hour) to an international number and incurs a high cost. IRSF typically involves making long calls to premium-rate international numbers owned by the fraudster.
Detection: Look for long-duration calls to known high-cost international destinations.

Wangiri fraud:

Example: Call ID 2
Signs: The call has a very short duration (10 seconds) to an international number, which is a common sign of Wangiri fraud where the fraudster expects a callback.
Detection: Track short missed or very short duration calls to international numbers, followed by callbacks.

Subscription fraud:

Example: Call ID 3
Signs: This is harder to detect directly from a CDR. However, if this number is new, recently activated and immediately starts making expensive international calls, it could be a sign
Detection: Monitor new accounts for unusual calling patterns, especially to high-cost destinations.

PBX hacking:

Example: Call ID 5
Signs: The caller number is an extension (e.g., 1234), indicating a call made from a PBX system, and it's a long-duration international call. Hackers often break into PBX systems to make long or numerous international calls.
Detection: Look for unusual call patterns from PBX extensions, like calls at odd hours, to international destinations or of unusually long duration.

Detecting anomalies for these frauds using SingleStore vector capabilities

Step 1: Setting up the SingleStore database environment

1.1. Create the database and table to store Telecom CDR

1drop database IF EXISTS cdranomalies;2create database cdranomalies;3use cdranomalies;4drop table if EXISTS telecom_cdr;5create table telecom_cdr(Call_ID bigint auto_increment not null primary key,6Caller_Number varchar(50),7Callee_Number varchar(50),8Start_Time timestamp,9Duration bigint,10Cost float,11Call_Type varchar(50),12Usage_Type varchar(50),13Embeddings blob,14Anomaly varchar(100));

1.2. Create and start the pipeline to ingest records into the table

1CREATE OR REPLACE  AGGREGATOR PIPELINE telecomcdr_pipeline AS2LOAD DATA S3 's3://gpsteam/telecomcdr/combined_cdr_data.csv'3CONFIG '{"region":"us-east-1"}'4CREDENTIALS.                     '{"aws_access_key_id":                  "","aws_secret_access_key": 5"","aws_session_token": ""}'6SKIP DUPLICATE KEY ERRORS7INTO TABLE telecom_cdr 8FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'9LINES TERMINATED BY '\n'10IGNORE                                                                                                                                    111LINES(Caller_Number,Callee_Number,Start_Time,Duration,Cost,Call_Type,Usage_Type;12START PIPELINE telecomcdr_pipeline;13

1.3. Stored procedure to detect and mark anomalies for CDR records in the table

1CREATE OR REPLACE PROCEDURE detectanomaly() AS2BEGIN3    UPDATE telecom_cdr SET Anomaly = CASE    4WHEN 5dot_product(Embeddings,JSON_ARRAY_PACK("[0.006190149579197168,........,-0.0149634113249182701]")) >= 0.981 THEN 'IRSF'7WHEN 8dot_product(Embeddings,JSON_ARRAY_PACK("[-0.006254644598811865,.........,-0.0269534441858530045]")) >= 0.981 THEN 'Wangiri Fraud'10WHEN 11dot_product(Embeddings,JSON_ARRAY_PACK("[0.0009138236637227237,.........,-0.01812494699150323868]")) >= 0.981 THEN 'PBX Hacking'13        ELSE NULL14    END;15END;

Step 2: Setting up the environment and writing the Python program to generate vector embeddings, marking anomalies for different frauds and its visual representation over the time

2.1. Environment setup for Python program

1!pip install pymysql2!pip install panda3!pip install openai4!pip install singlestoredb5!pip install sqlalchemy6!pip install mysql-connector-python7!pip install matplotlib8!pip install plotly

2.2. Actual Python program

1import pandas as pd2import sqlalchemy3import openai4from singlestoredb import connect5from sqlalchemy import create_engine,text6import matplotlib.pyplot as plt7from matplotlib.dates import DateFormatter8import matplotlib.dates as mdates9import numpy as np10import datetime11import plotly.express as px12import plotly.graph_objs as go13# OpenAI API key14openai.api_key = 15'sk-DXwarIh88HeIexe7VnwzX3BlbkFJOZim8Mot9Y2YHupF2n2x'16
17# Create a connection to the database18engine = create_engine(connection_url)19
20# Query the database to get the data from 'telecom_cdr' table21#df = pd.read_sql('SELECT `Call_ID`, `Duration`, `Cost`, `Call_Type` FROM 22telecom_cdr', engine)23df = pd.read_sql('SELECT `Call_ID`, `Duration`, `Cost`, `Call_Type`, 24`Start_Time`, `Anomaly` FROM telecom_cdr', engine)25# Function to generate embeddings26def generate_embeddings(row):27    text = f"Duration: {row['Duration']}, Cost: {row['Cost']}, Call Type: 28{row['Call_Type']}"29    response = openai.embeddings.create(input=[text],30model="text-embedding-ada-002")31    return response.data[0].embedding32
33# Apply the function to each row34df['Embeddings'] = df.apply(generate_embeddings, axis=1)35with engine.connect() as connection:36# Update the database with the new embeddings37    for index, row in df.iterrows():38        update_query = f"""UPDATE telecom_cdr SET `Embeddings` = 39JSON_ARRAY_PACK(:embeddings) WHERE Call_ID = :call_id"""40        connection.execute(text(update_query), {"embeddings": 41str(row['Embeddings']), "call_id": row['Call_ID']})42with engine.connect() as connection1:43    update_query1 = f"""CALL detectanomaly();"""44    connection1.execute(text(update_query1))45
46anomaly_query = """47SELECT 48    DATE_FORMAT(Start_Time, '%Y-%m') AS Month, 49    Anomaly, 50    COUNT(*) AS Anomaly_Count 51FROM 52    telecom_cdr 53GROUP BY 54    Month, Anomaly55"""56anomaly_df = pd.read_sql(anomaly_query, engine)57
58# Pivot the dataframe to get a column for each anomaly type59pivot_df = anomaly_df.pivot(index='Month', columns='Anomaly', 60values='Anomaly_Count').fillna(0)61
62# Reset index to make 'Month' a column again63pivot_df.reset_index(inplace=True)64
65# Convert the pivoted dataframe into a format suitable for Plotly66plotly_data = pivot_df.melt(id_vars=['Month'], var_name='Anomaly', 67value_name='Anomaly_Count')68
69# Ensure the 'Month' column is in datetime format for proper plotting70plotly_data['Month'] = pd.to_datetime(plotly_data['Month'])71
72# Create the line plot using Plotly73fig = px.line(plotly_data, 74              x='Month', 75              y='Anomaly_Count', 76              color='Anomaly',77              title='Anomaly Counts Over Time',78              labels={'Anomaly_Count': 'Anomaly Count', 'Month': 'Month'})79
80# Update layout for better readability81fig.update_layout(82    xaxis_title='Month',83    yaxis_title='Anomaly Count',84    legend_title='Anomaly Types',85    xaxis_tickangle=-4586)87
88# Show the plot89fig.show()# Close the database connection90engine.dispose()91

2.3. Visual graph for anomaly detection.

3. Here are the steps demonstrating how how SingleStore works in the preceding Python program to detect anomalies for fraud detection

3.1. Importing required packages and defining OpenAI key

3.2. Create the database connection and query the database to get data from the table to mark anomalies.

3.3. Defining a function to generate embeddings using the OpenAI method text-embedding-ada-002 model.

3.4. Apply the function “generate_embeddings” to generate embeddings for each row. In this section I am also running one database stored procedure “detectanomaly” to mark an anomaly to each record if the row satisfies conditions mentioned in the preceding Explanation¹ section.

How to mark anomalies for CDR

First, generate a base embedding for each condition mentioned in the Explanation¹section. These base embeddings help mark anomalies for each record near the base embeddings using the SingleStore DOT_PRODUCT vector function. In my case, I have considered the duration, cost and call_type column to generate base embeddings. You can also include other factors to generate this base embedding.

In the second step, get the embeddings of each CDR using the function “generate_embeddings”. This function is used in the previous step to generate embeddings for each CDR.

Finally, it's time to mark anomalies for each CDR matching or near the base embeddings. For this I have used one SingleStore stored procedure, “detectanomaly” which updates each the CDR with particular anomalies including 'IRSF,' 'Wangiri fraud' and 'PBX hacking'.

4. Criteria to mark anomalies

The stored procedure “detectanomaly” calculates whether each record is anomalous based on the SingleStore DOT_PRODUCT function of its embeddings with predefined vectors, each representing a specific type of fraud. If a certain threshold is met or exceeded, the record is flagged with the corresponding type of fraud.

DOT_PRODUCT() can be used to compute a cosine similarity metric of the two input vectors, if the input vectors are normalized to length 1.

In the stored procedure I have set the threshold 0.981 to mark anomalies of similar type based on a criteria of duration, cost and call_type — but you can set your own threshold to mark anomalies based on specific requirements.