Telecom Fraud Detection Through CDR Using SingleStore Vector Functions

JR

Jeetendra Ranjan

Senior Enterprise Solutions Engineer

Telecom Fraud Detection Through CDR Using SingleStore Vector Functions

AI offers a dynamic and sophisticated means to detect and prevent telecom anomalies, which often involves identifying unusual patterns in data. Vectorization makes the database detecting these anomalies more efficient, executing the complex queries needed for pattern recognition in some fraud cases. SingleStore can play a vital role in setting pattern recognition using its vector capabilities.

what-is-single-storeWhat is SingleStore?

SingleStore is a real-time, distributed SQL database built for AI. With familiar SQL tooling and MySQL wire protocol compatibility, SingleStore eliminates the need for specialized databases and simplifies data architectures.

SingleStore is also built to handle multiple data types — including JSON, time-series, geospatial, full-text search and vectors — delivering high-speed data ingestion on a unified transactional and analytical foundation.

vector-capabilities-of-single-storeVector capabilities of SingleStore

SingleStore supports vector database processing, allowing you to store and search vector data. A typical vector search locates the set of vectors that most closely match a query vector. Vectors usually come from objects: text, images, video, audio, etc., and vector database searches find data based on content or meaning — even without exact matches. For example, vector search allows for semantic search of text, where a query about "meals" could return information about "lunch" and "dinner" without using those words because they are similar in meaning.

Some benefits of using SingleStore for vector database processing — as opposed to a specialized vector database system — are:

  • A broad array of standard modern database capabilities are available in SingleStore. These include SQL, fast distributed and parallel query processing, full-text search, extensibility, ACID transactions, high availability, disaster recovery, point-in-time recovery, broad connectivity support and more.
  • Less data movement is needed between different data subsystems (e.g., caches, text search systems and SQL databases) when all t data, including vector data, is stored in SingleStore.
  • Operational costs may be reduced since fewer data management tools and copies of data are needed.
  • Less specialized skills and reduced labor are needed to run an application environment.

anomalies-in-telecom-cdrAnomalies in telecom CDR

Anomalies in telecom CDR (Call Detail Record) refer to irregular or unusual patterns in the data that records details of phone calls, SMS messages or data usage in a telecommunications network. A report from the  Communication Fraud Control Association (CFCA) reveals that telecommunications fraud increased 12% in 2023, equating to an estimated $38.95 billion lost to fraud.

Source: Communication Fraud Control Association, Nov. 2023

Anomalies in CDR records are a good source to detect frauds in telecommunication.
We will address the following frauds through vector capabilities of SingleStore in telecom CDR records:

A. International revenue share fraud

B. Wangiri fraud

C. Subscription fraud

D. PBX hacking

Lets understand these frauds through sample CDR records

Explanation:

International revenue share fraud (IRSF):

  • Example: Call ID 1
  • Signs: The call is unusually long (1 hour) to an international number and incurs a high cost. IRSF typically involves making long calls to premium-rate international numbers owned by the fraudster.
  • Detection: Look for long-duration calls to known high-cost international destinations.

Wangiri fraud:

  • Example: Call ID 2
  • Signs: The call has a very short duration (10 seconds) to an international number, which is a common sign of Wangiri fraud where the fraudster expects a callback.
  • Detection: Track short missed or very short duration calls to international numbers, followed by callbacks.

Subscription fraud:

  • Example: Call ID 3
  • Signs: This is harder to detect directly from a CDR. However, if this number is new, recently activated and immediately starts making expensive international calls, it could be a sign
  • Detection: Monitor new accounts for unusual calling patterns, especially to high-cost destinations.

PBX hacking:

  • Example: Call ID 5
  • Signs: The caller number is an extension (e.g., 1234), indicating a call made from a PBX system, and it's a long-duration international call. Hackers often break into PBX systems to make long or numerous international calls.
  • Detection: Look for unusual call patterns from PBX extensions, like calls at odd hours, to international destinations or of unusually long duration.

detecting-anomalies-for-these-frauds-using-single-store-vector-capabilitiesDetecting anomalies for these frauds using SingleStore vector capabilities

Step 1: Setting up the SingleStore database environment

1.1. Create the database and table to store Telecom CDR

drop database IF EXISTS cdranomalies;
create database cdranomalies;
use cdranomalies;
drop table if EXISTS telecom_cdr;
create table telecom_cdr(Call_ID bigint auto_increment not null primary key,
Caller_Number varchar(50),
Callee_Number varchar(50),
Start_Time timestamp,
Duration bigint,
Cost float,
Call_Type varchar(50),
Usage_Type varchar(50),
Embeddings blob,
Anomaly varchar(100));

1.2. Create and start the pipeline to ingest records into the table

CREATE OR REPLACE AGGREGATOR PIPELINE telecomcdr_pipeline AS
LOAD DATA S3 's3://gpsteam/telecomcdr/combined_cdr_data.csv'
CONFIG '{"region":"us-east-1"}'
CREDENTIALS. '{"aws_access_key_id": "","aws_secret_access_key":
"","aws_session_token": ""}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE telecom_cdr
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1
LINES(Caller_Number,Callee_Number,Start_Time,Duration,Cost,Call_Type,Usage_Type;
START PIPELINE telecomcdr_pipeline;

1.3. Stored procedure to detect and mark anomalies for CDR records in the table

CREATE OR REPLACE PROCEDURE detectanomaly() AS
BEGIN
UPDATE telecom_cdr SET Anomaly = CASE
WHEN
dot_product(Embeddings,JSON_ARRAY_PACK("[0.006190149579197168,........,-0.0149
34113249182701]")) >= 0.981 THEN 'IRSF'
WHEN
dot_product(Embeddings,JSON_ARRAY_PACK("[-0.006254644598811865,.........,-0.026
534441858530045]")) >= 0.981 THEN 'Wangiri Fraud'
WHEN
dot_product(Embeddings,JSON_ARRAY_PACK("[0.0009138236637227237,.........,-0.018
494699150323868]")) >= 0.981 THEN 'PBX Hacking'
ELSE NULL
END;
END;

Step 2: Setting up the environment and writing the Python program to generate vector embeddings, marking anomalies for different frauds and its visual representation over the time

2.1. Environment setup for Python program

!pip install pymysql
!pip install panda
!pip install openai
!pip install singlestoredb
!pip install sqlalchemy
!pip install mysql-connector-python
!pip install matplotlib
!pip install plotly

2.2. Actual Python program

import pandas as pd
import sqlalchemy
import openai
from singlestoredb import connect
from sqlalchemy import create_engine,text
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
import numpy as np
import datetime
import plotly.express as px
import plotly.graph_objs as go
# OpenAI API key
openai.api_key =
'sk-DXwarIh88HeIexe7VnwzX3BlbkFJOZim8Mot9Y2YHupF2n2x'
# Create a connection to the database
engine = create_engine(connection_url)
# Query the database to get the data from 'telecom_cdr' table
#df = pd.read_sql('SELECT `Call_ID`, `Duration`, `Cost`, `Call_Type` FROM
telecom_cdr', engine)
df = pd.read_sql('SELECT `Call_ID`, `Duration`, `Cost`, `Call_Type`,
`Start_Time`, `Anomaly` FROM telecom_cdr', engine)
# Function to generate embeddings
def generate_embeddings(row):
text = f"Duration: {row['Duration']}, Cost: {row['Cost']}, Call Type:
{row['Call_Type']}"
response = openai.embeddings.create(input=[text],
model="text-embedding-ada-002")
return response.data[0].embedding
# Apply the function to each row
df['Embeddings'] = df.apply(generate_embeddings, axis=1)
with engine.connect() as connection:
# Update the database with the new embeddings
for index, row in df.iterrows():
update_query = f"""UPDATE telecom_cdr SET `Embeddings` =
JSON_ARRAY_PACK(:embeddings) WHERE Call_ID = :call_id"""
connection.execute(text(update_query), {"embeddings":
str(row['Embeddings']), "call_id": row['Call_ID']})
with engine.connect() as connection1:
update_query1 = f"""CALL detectanomaly();"""
connection1.execute(text(update_query1))
anomaly_query = """
SELECT
DATE_FORMAT(Start_Time, '%Y-%m') AS Month,
Anomaly,
COUNT(*) AS Anomaly_Count
FROM
telecom_cdr
GROUP BY
Month, Anomaly
"""
anomaly_df = pd.read_sql(anomaly_query, engine)
# Pivot the dataframe to get a column for each anomaly type
pivot_df = anomaly_df.pivot(index='Month', columns='Anomaly',
values='Anomaly_Count').fillna(0)
# Reset index to make 'Month' a column again
pivot_df.reset_index(inplace=True)
# Convert the pivoted dataframe into a format suitable for Plotly
plotly_data = pivot_df.melt(id_vars=['Month'], var_name='Anomaly',
value_name='Anomaly_Count')
# Ensure the 'Month' column is in datetime format for proper plotting
plotly_data['Month'] = pd.to_datetime(plotly_data['Month'])
# Create the line plot using Plotly
fig = px.line(plotly_data,
x='Month',
y='Anomaly_Count',
color='Anomaly',
title='Anomaly Counts Over Time',
labels={'Anomaly_Count': 'Anomaly Count', 'Month': 'Month'})
# Update layout for better readability
fig.update_layout(
xaxis_title='Month',
yaxis_title='Anomaly Count',
legend_title='Anomaly Types',
xaxis_tickangle=-45
)
# Show the plot
fig.show()# Close the database connection
engine.dispose()

2.3. Visual graph for anomaly detection.

3. Here are the steps demonstrating how how SingleStore works in the preceding Python program to detect anomalies for fraud detection

3.1. Importing required packages and defining OpenAI key

3.2. Create the database connection and query the database to get data from the table to mark anomalies.

3.3. Defining a function to generate embeddings using the OpenAI method text-embedding-ada-002 model.

3.4. Apply the function “generate_embeddings” to generate embeddings for each row. In this section I am also running one database stored procedure “detectanomaly” to mark an anomaly to each record if the row satisfies conditions mentioned in the preceding Explanation1 section.

How to mark anomalies for CDR

First, generate a base embedding for each condition mentioned in the Explanation1 section. These base embeddings help mark anomalies for each record near the base embeddings using the SingleStore DOT_PRODUCT vector function. In my case, I have considered the duration, cost and call_type column to generate base embeddings. You can also include other factors to generate this base embedding.

In the second step, get the embeddings of each CDR using the function “generate_embeddings”. This function is used in the previous step to generate embeddings for each CDR.

Finally, it's  time to mark anomalies for each CDR matching or near the base embeddings. For this I have used one SingleStore stored procedure, “detectanomaly” which updates each the CDR with particular anomalies including  'IRSF,'  'Wangiri fraud' and 'PBX hacking'.

4. Criteria to mark anomalies

The stored procedure “detectanomaly” calculates whether each record is anomalous based on the SingleStore DOT_PRODUCT function of its embeddings with predefined vectors, each representing a specific type of fraud. If a certain threshold is met or exceeded, the record is flagged with the corresponding type of fraud.

DOT_PRODUCT() can be used to compute a cosine similarity metric of the two input vectors, if the input vectors are normalized to length 1.

In the stored procedure I have set the threshold 0.981 to mark anomalies of similar type based on a criteria of duration, cost and call_type — but you can set your  own threshold to mark anomalies based on specific requirements.

Example anomaly, "PBX hacking"

Interested in trying out AI-driven fraud detection? Get started with SingleStore today.


Share