Day 1: QSVM
Imagine your computer network as a bustling city and the data that flows through it as the cars on the roads. Now, some of those cars might be driven by robbers (botnets) trying to cause trouble. Traditional ways of spotting these robbers can be slow and might miss some of them. But what if we used something really futuristic to catch them? Quantum Machine Learning algorithms, such as the Quantum Support Vector Machine (QSVM), are like super-powered police cars that use the principles of quantum physics, promising to detect these robbers faster and more accurately. In this article, we’ll explore how QSVM works to protect your computer network from botnets, all without needing a degree in quantum physics!
After picturing how Quantum Support Vector Machines act like super-powered police cars in our digital city, let’s turn our attention to the roads and vehicles themselves. For our series of case studies, we’ll be using the Botnet DGA Dataset, available here: https://ieee-dataport.org/open-access/botnet-dga-dataset. This dataset simplifies our task considerably. Normally, we would have to connect to various sources, gather domain names, and engage in some intricate feature engineering to prepare clean data. But with this dataset, we have everything we need for a supervised binary classification task that can quickly detect Domain Generation Algorithm botnets.
DGA botnets are on the rise because botmasters can mask their DNS records and IP addresses behind throwaway domains generated with simple code like the snippet below:
# What is a DGA? A basic Domain Generation Algorithm example
def generate_domain(year: int, month: int, day: int, seed: str) -> str:
    """Generate a domain name for the given date using a basic DGA algorithm and a seed name."""
    domain = seed
    # Iterate to create a domain name of the desired length
    for i in range(16 - len(seed)):
        year = ((year ^ (8 * year)) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
        month = ((month ^ (4 * month)) >> 25) ^ (16 * (month & 0xFFFFFFF8))
        day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
        domain += chr(((year ^ month ^ day) % 25) + 97)
    return domain + ".com"

year = 2023
month = 8
day = 1
legit_domains = ["microsoft", "teams", "zoom", "google", "bmw", "apple", "amazon", "twitter", "netflix", "facebook"]
# Iterate through the legitimate domains and generate two examples for each
for seed in legit_domains:
    for example in range(2):  # Generate two examples
        generated_domain = generate_domain(year, month, day, seed)
        print(f"The generated domain for the date {year}-{month}-{day} with seed '{seed}' is: {generated_domain}")
        # Modify the date (or other variables) if you want each example to be unique
        day += 1
With our generated domains in hand, let’s venture into the shadowy world of botnets and their masters. In our bustling digital city, there are hidden alleyways and secret tunnels known only to those with malicious intent. The botmaster, a skilled manipulator of this concealed infrastructure, sees an opportunity in the domains we’ve just created.
Hiding the DNS and IP: Think of the Domain Name System (DNS) and IP addresses as the signs and house numbers in our city. Normally, they help us find our way, but the botmaster uses the generated domains to confuse and mislead. By constantly changing these signs (domains), he can hide the real location of his malicious servers. It’s like a network of secret hideouts, always moving, always just out of reach.
Command and Control Servers: In this hidden network, the botmaster has set up command and control servers, orchestrating the movements of his minions, known as “zombies.” These are compromised computers, infected and controlled without their owners’ knowledge. The servers are the puppet masters, and the zombies are their unwitting puppets, dancing to a tune only the botmaster can hear.
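To make this concrete, here is a hypothetical sketch of how a bot might use the generate_domain function from earlier to locate its command and control server. The find_c2 helper and the candidate count of 50 are illustrative inventions, not code from any real botnet:

import socket
from datetime import date, timedelta

# Hypothetical: walk through DGA candidates for the coming days and treat
# the first domain that resolves as the command and control server.
def find_c2(seed, candidates=50):
    today = date.today()
    for offset in range(candidates):
        d = today + timedelta(days=offset)
        domain = generate_domain(d.year, d.month, d.day, seed)
        try:
            socket.gethostbyname(domain)  # does this candidate resolve?
            return domain                 # first live domain wins
        except socket.gaierror:
            continue
    return None  # no live candidate found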
Using Zombies and Domains for a DDoS Attack: Now, the botmaster has a plan: a Distributed Denial-of-Service (DDoS) attack. He wants to flood a target’s defenses with so much traffic that it becomes overwhelmed, like a city street clogged with cars during rush hour. The zombies, under the command of the generated domains, begin to send wave after wave of fake requests to the target. Because the domains are constantly changing, tracking down the source of the attack is like trying to catch smoke with your bare hands.
The target’s defenses struggle under the onslaught. Firewalls and security measures buckle and bend. The digital city’s traffic grinds to a halt, and the target is brought to its knees. All the while, the botmaster watches from the shadows, hidden behind the cloak of domains, DNS, and IP obfuscation. His network of zombies and the stealthy adversarial infrastructure remain concealed, ready to strike again.
Our task is to detect these domains using the Quantum Support Vector Machine (QSVM) algorithm. Without going into the technicalities of hybrid quantum algorithms, we show here how QSVM works in practice and will elaborate on the details in the next note.
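Under the hood, a QSVM is simply a classical SVM whose kernel is computed quantum-mechanically: a feature map encodes each sample x into a quantum state, and the kernel entry for two samples is the squared overlap of their states, K(x, y) = |⟨φ(x)|φ(y)⟩|². Here is a minimal sketch of that idea with a two-feature ZZFeatureMap and exact statevectors (the toy inputs x and y are arbitrary values in [0, π], not dataset samples):

import numpy as np
from qiskit.circuit.library import ZZFeatureMap
from qiskit.quantum_info import Statevector

# A single quantum-kernel entry: K(x, y) = |<phi(x)|phi(y)>|^2
fm = ZZFeatureMap(feature_dimension=2, reps=1)
x = [0.3, 1.1]  # toy sample, already in [0, pi]
y = [0.7, 2.0]  # toy sample, already in [0, pi]
psi_x = Statevector(fm.assign_parameters(dict(zip(fm.parameters, x))))
psi_y = Statevector(fm.assign_parameters(dict(zip(fm.parameters, y))))
k_xy = np.abs(np.vdot(psi_x.data, psi_y.data)) ** 2  # squared overlap
print(f"K(x, y) = {k_xy:.4f}")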
Let's first install the required libraries, download 1,000 samples of the dataset mentioned above, and prepare them for QSVM:
!pip install qiskit-machine-learning
!pip install qiskit-aer
!wget https://aq5efd7d2644dd406cb3ec2d.blob.core.windows.net/dga/BotnetDgaDataset_1000.csv
import csv
import os
import numpy as np

datafilename = "BotnetDgaDataset_1000.csv"
resultname = "result_BotnetDgaDataset_pegasos_1000.txt"
cwd = os.getcwd()
mycsv = cwd + "/" + datafilename
print(mycsv)

def load_data(filepath):
    with open(filepath) as csv_file:
        data_file = csv.reader(csv_file)
        header = next(data_file)  # skip the header row
        n_samples = 1000
        n_features = 7
        data = np.empty((n_samples, n_features))
        target = np.empty((n_samples,), dtype=int)
        for i, ir in enumerate(data_file):
            data[i] = np.asarray(ir[:-1], dtype=np.float64)  # seven numeric features
            target[i] = np.asarray(ir[-1], dtype=int)        # last column is the label
        return data, target

features, labels = load_data(mycsv)
print(len(features))
print(len(labels))
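As a quick, illustrative sanity check (the dataset's column names are not shown here, so we only look at the first sample and the class balance):

# Peek at the loaded arrays before any preprocessing
print("First sample:", features[0])
unique, counts = np.unique(labels, return_counts=True)
print("Label counts:", dict(zip(unique, counts)))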
Now we must prepare the data for training and testing:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Scale every feature into [0, pi] and hold out 300 samples for testing
features = MinMaxScaler(feature_range=(0, np.pi)).fit_transform(features)
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, train_size=700, shuffle=False)
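The [0, π] range is deliberate: the ZZFeatureMap used below encodes each feature as a rotation angle, so we bound every feature to half a turn. A quick check that the scaling behaved as expected:

# Each column of the scaled matrix should span [0, pi]
print(features.min(axis=0))  # 0.0 for every (non-constant) column
print(features.max(axis=0))  # ~3.14159 for every (non-constant) column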
Now we will use the QSVM functionality; we will elaborate on QSVM details in the next note:
from qiskit import Aer
from qiskit.circuit.library import ZZFeatureMap
from qiskit.utils import QuantumInstance
from qiskit_machine_learning.kernels import QuantumKernel
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Define the feature map
feature_map = ZZFeatureMap(feature_dimension=7, reps=1, entanglement="linear")
# Define the quantum kernel using the feature map
kernel = QuantumKernel(feature_map=feature_map, quantum_instance=QuantumInstance(Aer.get_backend('statevector_simulator')))
# Define the SVM with the quantum kernel
qsvm = SVC(kernel=kernel.evaluate)
qsvm.fit(train_features, train_labels)
# Evaluate on the training set (the same data the model was fitted on)
predicted = qsvm.predict(train_features)
# Calculate training-set accuracy
accuracy = accuracy_score(train_labels, predicted)
print("Training accuracy:", accuracy)
# Confusion Matrix
conf_matrix = confusion_matrix(train_labels, predicted)
print("Confusion Matrix:")
print(conf_matrix)
# Visualize Confusion Matrix
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
Results are promising on the Aer statevector simulator, with 97.85% accuracy on the training set.
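That said, the score above is measured on the training data itself. Reusing the objects already defined, a held-out evaluation on the test split gives a better sense of generalization:

# Evaluate the fitted QSVM on the held-out test split
test_predicted = qsvm.predict(test_features)
print("Test accuracy:", accuracy_score(test_labels, test_predicted))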
Could we claim that QSVM is better than a classical SVM? Let's do a hypothesis test.
import warnings
from sklearn.svm import SVC as ClassicalSVC
from scipy.stats import ttest_rel

warnings.filterwarnings("ignore")

feature_map = ZZFeatureMap(feature_dimension=7, reps=1, entanglement="linear")
kernel = QuantumKernel(feature_map=feature_map, quantum_instance=QuantumInstance(Aer.get_backend('statevector_simulator')))

classical_accuracies = []
quantum_accuracies = []

for _ in range(10):
    # Split the data (shuffled differently on each run)
    train_features, test_features, train_labels, test_labels = train_test_split(
        features, labels, train_size=700, shuffle=True
    )
    # Train a classical SVM (default RBF kernel)
    classical_svm = ClassicalSVC()
    classical_svm.fit(train_features, train_labels)
    classical_predicted = classical_svm.predict(test_features)
    classical_accuracy = accuracy_score(test_labels, classical_predicted) * 100
    classical_accuracies.append(classical_accuracy)
    # Train the quantum SVM (QSVM) with the quantum kernel
    qsvm = SVC(kernel=kernel.evaluate)
    qsvm.fit(train_features, train_labels)
    quantum_predicted = qsvm.predict(test_features)
    quantum_accuracy = accuracy_score(test_labels, quantum_predicted) * 100
    quantum_accuracies.append(quantum_accuracy)

print("Classical SVM Accuracies (%):")
for acc in classical_accuracies:
    print(f"{acc:.5f}")

print("\nQuantum SVM Accuracies (%):")
for acc in quantum_accuracies:
    print(f"{acc:.5f}")

# Perform a paired t-test on the ten matched runs
t_stat, p_value = ttest_rel(quantum_accuracies, classical_accuracies)
print("\nT-Test Results: t-statistic =", t_stat, "p-value =", p_value)

# Interpretation (two-sided test: a rejection means the accuracies differ,
# in whichever direction the t-statistic's sign indicates)
if p_value < 0.05:
    print("Reject null hypothesis: significant difference between QSVM and classical SVM.")
else:
    print("Fail to reject null hypothesis: no significant difference between QSVM and classical SVM.")
Here is the result of the t-test: there is no significant difference between QSVM and classical SVM. But why? We will share in the next note why QSVM alone is not sufficient and what the best next steps could be to unleash the power of quantum computers. Stay with us, and don’t forget to follow our LinkedIn and Twitter, where we will show how hybrid quantum machine learning will change the realm of cyber defense.