Laboratory 2 – AI and Cybersecurity

This repository collects the material produced for Laboratory 2 of the AI and Cybersecurity course.

Overview

The goal of this project is to classify software behavioral reports as either Malware or Benign based on the sequence of Windows API calls each program executes. The dataset originates from dynamic malware analysis with the Cuckoo Sandbox, which monitors program execution and records up to 100 consecutive API calls per sample. By learning these call patterns, we aim to distinguish malicious software from legitimate applications. To this end, we compare three families of neural networks: feed-forward (FFNN), recurrent (RNN) and graph (GNN) neural networks.

Repository Layout

Laboratory2/
├── lab/
│   ├── data/           # Raw CSV and preprocessed JSON splits
│   ├── notebooks/      # Task-specific Jupyter notebooks (Task1–Task4)
│   └── notes.txt       # Quick answers captured during the lab sessions
├── report/             # LaTeX report sources (compile via report/Makefile)
├── resources/          # Logos and reference PDFs
└── README.md           # This file

The detailed lab report, including all experimental results and analysis, can be found here. For a runnable summary of the experiments and step-by-step code, open lab/notebooks/.

Lab Objectives & Requirements

The main learning objectives for the lab were:

Learn how to preprocess the data based on the chosen Neural Network architecture.
Experiment with different architectures (FFNN, RNN, and GNN), varying hyper-parameters, optimizers and layer configurations.
Engineer a simple baseline solution to systematically evaluate the effects of increasingly complex architectures and choices.
Understand whether a model has converged successfully.

Requirements

We used a standard Python data-science stack. The notebooks are compatible with recent Python 3.8+ environments. Recommended packages (install with pip):

# create and activate virtual environment (zsh)
python3 -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install jupyterlab notebook pandas numpy scikit-learn matplotlib seaborn torch torchvision tqdm

Notes:

If you plan to train large models and have an NVIDIA GPU, install the CUDA-enabled PyTorch build for faster training.
For reproducibility, set the same random seeds for numpy, torch and sklearn; the notebooks include seed-setting cells.
Task 4 requires the PyTorch Geometric ecosystem; consult their installation matrix if CUDA wheels are needed.
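
The seed-setting step mentioned in the notes could look like the following minimal sketch. The helper name set_seed and the value 42 are assumptions chosen for illustration, not taken from the notebooks. NumPy and PyTorch are seeded only if they are installed. scikit-learn has no global seed of its own: it draws from NumPy's global state unless a random_state argument is passed to the estimator or splitter.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators a typical notebook run relies on."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
print(first == second)  # True
```

Calling the helper once at the top of each notebook, before any data split or model initialisation, is usually enough for repeatable CPU runs; GPU kernels may still introduce small nondeterminism.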

Data Preparation

lab/data/raw/dynamic_api_call_sequence_per_malware_100_0_306.csv contains the original traces.
Pre-split JSONs (train.json, test.json) live in lab/data/jsons/ and already match the splits used in the notebooks.
All experiments reuse the same train/validation split created inside the notebooks (70/30 stratified).
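
As an illustration of what a 70/30 stratified split means, here is a dependency-free sketch. The function name stratified_split, the seed and the toy labels are assumptions; the notebooks' actual call is not shown in this README. In practice, scikit-learn's train_test_split with test_size=0.3 and stratify=labels performs the same job.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.3, seed=42):
    """Return (train_idx, val_idx) with class proportions kept roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for validation.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels (not the real class distribution): 90 samples of class 1, 10 of class 0.
labels = [1] * 90 + [0] * 10
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))          # 70 30
print(sum(labels[i] == 0 for i in val_idx))  # 3
```

Splitting per class keeps the rarer class represented in the validation set, which matters when macro F1 is one of the reported metrics.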

Summary of the results

Task 1: Frequency baseline

Vocabulary: 258 unique training API calls (232 in test, 3 unseen → routed to unknown).
Average number of non-zero features per sample: 22.9 (9%) for train, 25.2 (10%) for test.
Despite the sparse representation, the feed-forward network already reaches 97.6% accuracy and a 78% macro F1 score.
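
The frequency representation summarised above can be sketched as follows: the vocabulary is built from the training sequences only, and any call not seen in training is counted in a single unknown column. The names UNK, build_vocab and to_count_vector, as well as the toy sequences, are hypothetical and only serve the example.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the unknown bucket


def build_vocab(train_sequences):
    """Map every API call seen in training to a column index; UNK gets the last column."""
    calls = sorted({call for seq in train_sequences for call in seq})
    vocab = {call: i for i, call in enumerate(calls)}
    vocab[UNK] = len(vocab)
    return vocab


def to_count_vector(sequence, vocab):
    """Count how often each vocabulary entry occurs in one sequence."""
    vector = [0] * len(vocab)
    for call, count in Counter(sequence).items():
        vector[vocab.get(call, vocab[UNK])] += count
    return vector


train = [["NtOpenFile", "NtReadFile", "NtClose"],
         ["NtOpenFile", "NtClose", "NtClose"]]
test_seq = ["NtOpenFile", "RegOpenKeyExW", "NtClose", "NtClose"]

vocab = build_vocab(train)
print(vocab)  # {'NtClose': 0, 'NtOpenFile': 1, 'NtReadFile': 2, '<unk>': 3}
vec = to_count_vector(test_seq, vocab)
print(vec)    # [2, 1, 0, 1]

# Share of non-zero features in this sample.
nonzero_ratio = sum(v > 0 for v in vec) / len(vec)
print(nonzero_ratio)  # 0.75
```

Because the vocabulary never looks at the test set, unseen calls cannot leak information into training; they simply accumulate in the unknown column.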

Task 2: FFNN on fixed-length sequences

Sequences span 60–90 calls (train) and 70–100 (test); padding to the train-set max keeps validation realistic.
Raw sequential IDs impose an arbitrary numeric ordering and spacing on the API calls, which limits this variant to 95.9% accuracy and a 55% macro F1 score.
Learnable embeddings plus dropout and class weighting raise the macro F1 score to 58%, at a slightly lower accuracy of 94.4%; early stopping avoids overfitting.
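
Padding to the train-set maximum implies that longer test sequences have to be truncated. A minimal sketch of that step is shown below; the helper name pad_or_truncate, the choice of 0 as the padding ID and the toy IDs are assumptions, not details taken from the notebooks.

```python
PAD_ID = 0  # assumed: index 0 is reserved for padding


def pad_or_truncate(ids, max_len, pad_id=PAD_ID):
    """Return a list of exactly max_len token IDs."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


train_ids = [[5, 3, 9], [7, 2, 2, 4]]
# The target length comes from the training set only.
max_len = max(len(seq) for seq in train_ids)

print(pad_or_truncate([5, 3, 9], max_len))           # [5, 3, 9, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len))  # [1, 2, 3, 4]
```

When the padded IDs feed an embedding layer, PyTorch's nn.Embedding accepts a padding_idx argument so that the padding position maps to a vector that is not updated during training.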

Task 3: Recurrent Neural Networks

Custom collate_fn sorts sequences by length and leverages pack_padded_sequence for memory efficiency.
Vanilla RNN reaches 95.8% accuracy and 75% macro F1 score.
LSTM reaches 97.3% accuracy and 82% macro F1 score.
Bidirectional LSTM reaches 96.9% accuracy and 81% macro F1 score.
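
The collate step described for this task could be sketched as below. This is not the notebooks' code: the batch layout (pairs of token-ID tensor and integer label), the toy values and the layer sizes are assumptions. Only pad_sequence and pack_padded_sequence from torch.nn.utils.rnn are real PyTorch functions.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence


def collate_fn(batch):
    """batch: list of (LongTensor of token IDs, int label) pairs."""
    # Sort longest first, which pack_padded_sequence expects by default.
    batch = sorted(batch, key=lambda item: len(item[0]), reverse=True)
    sequences = [seq for seq, _ in batch]
    labels = torch.tensor([label for _, label in batch])
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(sequences, batch_first=True, padding_value=0)
    return padded, lengths, labels


# Toy batch of three sequences with different lengths.
batch = [(torch.tensor([4, 2]), 0),
         (torch.tensor([7, 1, 3, 5]), 1),
         (torch.tensor([6, 6, 2]), 1)]
padded, lengths, labels = collate_fn(batch)

embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=8, padding_idx=0)
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Packing lets the LSTM skip the padded positions.
packed = pack_padded_sequence(embedding(padded), lengths, batch_first=True)
_, (h_n, _) = lstm(packed)

print(padded.shape)      # torch.Size([3, 4])
print(lengths.tolist())  # [4, 3, 2]
print(h_n.shape)         # torch.Size([1, 3, 16])
```

Such a function would be passed to a DataLoader through its collate_fn argument. The final hidden state h_n then reflects the last real token of each sequence rather than trailing padding.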

Task 4: Graph Neural Networks

API sequences are mapped to call-transition graphs (nodes = APIs, edges = successive calls).
GCN reaches 96.8% accuracy and 80% macro F1 score.
GraphSAGE reaches 96.7% accuracy and 81% macro F1 score.
GraphSAGE reaches 97.2% accuracy and 82% macro F1 score.
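
The mapping from a call sequence to a call-transition graph can be illustrated with the sketch below. It assumes directed edges with duplicates removed; whether the notebooks deduplicate, weight or symmetrise edges is not stated here. The function name sequence_to_graph and the toy sequence are hypothetical. The two-row [sources, targets] layout matches the edge_index convention used by PyTorch Geometric.

```python
def sequence_to_graph(sequence):
    """Return (nodes, edge_index) for one API-call sequence.

    nodes:      unique API names, in order of first appearance
    edge_index: [sources, targets], one directed edge per distinct
                transition between successive calls
    """
    node_id = {}
    for call in sequence:
        node_id.setdefault(call, len(node_id))

    edges = []
    seen = set()
    for src, dst in zip(sequence, sequence[1:]):
        edge = (node_id[src], node_id[dst])
        if edge not in seen:
            seen.add(edge)
            edges.append(edge)

    sources = [s for s, _ in edges]
    targets = [t for _, t in edges]
    return list(node_id), [sources, targets]


seq = ["NtOpenFile", "NtReadFile", "NtReadFile", "NtClose", "NtOpenFile", "NtReadFile"]
nodes, edge_index = sequence_to_graph(seq)
print(nodes)       # ['NtOpenFile', 'NtReadFile', 'NtClose']
print(edge_index)  # [[0, 1, 1, 2], [1, 1, 2, 0]]
```

Repeated calls become self-loops (NtReadFile followed by NtReadFile above), and a transition that occurs twice contributes a single edge under the deduplication assumption.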

Authors

Renato Mignone
Claudia Sanna
Chiara Iorio
