Day 5: VQC - Detecting Phishing and DGA Botnets
Almost every cyber attack begins with gaining initial access, often through sophisticated phishing attacks that exploit unpatched vulnerabilities or even zero-days. Let’s delve into the use of the variational quantum classifier (VQC) for detecting such phishing attempts. We’ll then evaluate VQC’s performance against DGA botnets to understand how the algorithm responds to diverse datasets. But first, let’s get hold of our phishing dataset. The dataset can be found [here], and the following snippets will help you download and preprocess it:
!wget https://dgadata.blob.core.windows.net/dga/test_Phishing_VQC.csv
!wget https://dgadata.blob.core.windows.net/dga/train_Phishing_VQC.csv
import numpy as np
import os
train_CSV = os.path.join("train_Phishing_VQC.csv")
test_CSV = os.path.join("test_Phishing_VQC.csv")
data_train = np.genfromtxt(train_CSV, delimiter=',', skip_header=1)
data_test = np.genfromtxt(test_CSV, delimiter=',', skip_header=1)
# Separate target and features
y_train = data_train[:, -1]
y_test = data_test[:, -1]
X_train = np.delete(data_train, -1, axis=1)
X_test = np.delete(data_test, -1, axis=1)
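One more preprocessing step worth knowing about: many quantum feature maps encode each feature as a rotation angle, so it often helps to squeeze the features into a bounded range before handing them to the classifier. Below is a minimal sketch using scikit-learn’s MinMaxScaler (shown here on the raw features; in the pipeline that follows you would apply it to the PCA-reduced features instead):
from sklearn.preprocessing import MinMaxScaler
# Scale every feature into [0, pi] so that each value maps cleanly onto a
# rotation angle once it is encoded into a qubit. Fit on the training data
# only, then reuse the same scaling for the test data.
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)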
One current limitation is the number of qubits that NISQ devices and simulators can handle. This necessitates dimensionality reduction, since each feature corresponds to a qubit: with all 30 features we would need 30 qubits, which quickly becomes unmanageable on the Aer simulator. Below is code that uses PCA to cut the feature count down to six. We then compare the accuracy of SVC and RandomForest before and after the reduction to show that it doesn’t significantly compromise data quality:
import numpy as np
import os
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
def train_predict_evaluate(classifier, X_train, y_train, X_test, y_test):
    """Train the classifier, predict on the test set and return the accuracy."""
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    return accuracy

train_CSV = os.path.join("train_Phishing_VQC.csv")
test_CSV = os.path.join("test_Phishing_VQC.csv")
data_train = np.genfromtxt(train_CSV, delimiter=',', skip_header=1)
data_test = np.genfromtxt(test_CSV, delimiter=',', skip_header=1)
# Separate target and features
y_train = data_train[:, -1]
y_test = data_test[:, -1]
X_train = np.delete(data_train, -1, axis=1)
X_test = np.delete(data_test, -1, axis=1)
# Apply PCA
pca = PCA(n_components=6)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# Use the classifiers on the original features and then on the reduced features
results = {}
classifiers = [RandomForestClassifier(), SVC()]
for clf in classifiers:
    clf_name = clf.__class__.__name__
    results[clf_name] = {"Before PCA": None, "After PCA": None}
    results[clf_name]["Before PCA"] = train_predict_evaluate(clf, X_train, y_train, X_test, y_test)
    results[clf_name]["After PCA"] = train_predict_evaluate(clf, X_train_pca, y_train, X_test_pca, y_test)
# Present the results in a table
print("Classifier".ljust(25) + "Before PCA".ljust(15) + "After PCA".ljust(15))print('-' * 55)
for clf_name, result in results.items():
print(clf_name.ljust(25) + f"{result['Before PCA']:.3f}".ljust(15) + f"{result['After PCA']:.3f}".ljust(15))
Though we discarded 24 of the 30 original dimensions, the six principal components retain most of the predictive signal, so the accuracy remains commendable.
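If you want to quantify exactly how much signal survives the reduction, the fitted PCA object exposes the explained variance ratio of each component; a quick check (reusing the pca object from the snippet above) could look like this:
# How much of the original variance do the six principal components keep?
explained = pca.explained_variance_ratio_
print("Per-component variance ratio:", np.round(explained, 3))
print(f"Total variance retained: {explained.sum():.1%}")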
It’s time to set up. Let’s install the necessary libraries and execute VQC on our data:
!pip install -U azure-quantum
!pip install -U azure-quantum[qiskit]
!pip install -U qiskit_machine_learning
from qiskit import BasicAer, QuantumCircuit
from qiskit.circuit.library import ZZFeatureMap, EfficientSU2
backend = BasicAer.get_backend("qasm_simulator")
num_qubits = 6
feature_map = ZZFeatureMap(num_qubits, reps=1)
model = EfficientSU2(num_qubits, reps=2, entanglement="pairwise")
circuit = feature_map.compose(model)
circuit.measure_all()
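Before training, it is worth inspecting the composed circuit: its depth and trainable-parameter count give a rough feel for how expensive the optimization will be. A small sketch using standard Qiskit circuit properties:
# Inspect the composed circuit; depth and parameter count hint at training cost.
print("Qubits:", circuit.num_qubits)
print("Depth (decomposed):", circuit.decompose().depth())
print("Trainable parameters:", model.num_parameters)
print(circuit.decompose().draw(output="text"))  # textual drawing of the circuit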
Bear in mind this phase can be quite time-intensive and may feel like a marathon, primarily because the API offers little visibility into training progress. We’ll delve deeper into why this step takes so long in the next note:
from qiskit_machine_learning.algorithms import VQC
vqc = VQC(num_qubits, feature_map, model, quantum_instance=backend)
# One-hot encode the labels, since VQC expects a two-column target
one_hot = np.array([[1, 0] if label == 1 else [0, 1] for label in y_train])
one_hot_test = np.array([[1, 0] if label == 1 else [0, 1] for label in y_test])
vqc.fit(X_train_pca, one_hot)
vqc_predictions = vqc.predict(X_train_pca)
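To get at least some feedback while the optimizer grinds away, and to score the run once it finishes, you can pass a callback to VQC and use the classifier’s score method. A sketch along the same lines as above (the COBYLA maxiter value here is an arbitrary choice for illustration, not the setting behind the results below):
from qiskit.algorithms.optimizers import COBYLA
# Optional: log the objective value on every optimizer iteration so the
# long-running fit is not a complete black box.
def log_progress(weights, objective_value):
    log_progress.iteration += 1
    print(f"Iteration {log_progress.iteration}: objective = {objective_value:.4f}")
log_progress.iteration = 0
vqc = VQC(num_qubits, feature_map, model,
          optimizer=COBYLA(maxiter=100),
          quantum_instance=backend,
          callback=log_progress)
vqc.fit(X_train_pca, one_hot)
# Accuracy on the train and test sets (labels stay one-hot encoded)
print("Train accuracy:", vqc.score(X_train_pca, one_hot))
print("Test accuracy:", vqc.score(X_test_pca, one_hot_test))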
After hours of processing, the outcome isn’t as promising as one might hope:
Interestingly, VQC fares noticeably better when detecting DGA botnets. Using just seven qubits on Aer, we achieved an accuracy of 69.8% with COBYLA and a ZFeatureMap. However, expect a sizable circuit, as depicted below:
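For those who want to try reproducing that configuration, here is a minimal sketch of the wiring. The ansatz settings simply mirror the phishing example and are an assumption, and the variables X_dga_train and y_dga_onehot are hypothetical placeholders for your own preprocessed, seven-feature DGA data:
from qiskit.circuit.library import ZFeatureMap
from qiskit.algorithms.optimizers import COBYLA
# Same recipe as the phishing run, but with seven qubits, a ZFeatureMap and COBYLA.
num_qubits_dga = 7
dga_feature_map = ZFeatureMap(num_qubits_dga, reps=1)
dga_ansatz = EfficientSU2(num_qubits_dga, reps=2, entanglement="pairwise")
vqc_dga = VQC(num_qubits_dga, dga_feature_map, dga_ansatz,
              optimizer=COBYLA(maxiter=100),
              quantum_instance=backend)
# vqc_dga.fit(X_dga_train, y_dga_onehot)  # placeholders: bring your own DGA features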
While many studies praise the efficacy of variational quantum algorithms across various domains, we’ll dissect their intricacies in our next piece. We aim to uncover why VQC is so time-consuming and underperforms on cyber data, and to explore potential avenues for improvement. Stay tuned and join us on LinkedIn and Twitter to dissect the anatomy of VQC. Together, let’s push the boundaries of Quantum Machine Learning on NISQ devices and pave the way for Quantum Cybersecurity Analytics.