Laboratory 2 – AI and Cybersecurity


This repository collects the material produced for Laboratory 2 of the AI and Cybersecurity course.

Overview

The goal of this project is to classify software behavioral reports as either Malware or Benign based on the sequence of Windows API calls they execute. The dataset originates from dynamic malware analysis using the Cuckoo Sandbox, which monitors program execution and records up to 100 consecutive API calls per sample. By learning these patterns, we aim to distinguish malicious software from legitimate applications. To achieve this goal, we use different neural network architectures: FFNN, RNN and GNN.

Repository Layout

Laboratory2/
├── lab/
│   ├── data/           # Raw CSV and preprocessed JSON splits
│   ├── notebooks/      # Task-specific Jupyter notebooks (Task1–Task4)
│   └── notes.txt       # Quick answers captured during the lab sessions
├── report/             # LaTeX report sources (compile via report/Makefile)
├── resources/          # Logos and reference PDFs
└── README.md           # This file

The detailed lab report, including all experimental results and analysis, is built from the LaTeX sources in report/. For a runnable, step-by-step view of the experiments, open the notebooks in lab/notebooks/.

Lab Objectives & Requirements

The main learning objectives for the lab were:

  • Learn how to preprocess the data based on the chosen Neural Network architecture.
  • Experiment with different architectures (FFNN, RNN, and GNN), hyperparameters and optimizers.
  • Engineer a simple baseline solution to systematically evaluate the effects of increasingly complex architectures and choices.
  • Understand whether a model has converged successfully.

Requirements

We used a standard Python data-science stack. The notebooks are compatible with recent Python 3.8+ environments. Recommended packages (install with pip):

# create and activate virtual environment (zsh)
python3 -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install jupyterlab notebook pandas numpy scikit-learn matplotlib seaborn torch torchvision tqdm

Notes:

  • If you plan to train large models and have an NVIDIA GPU, install the CUDA-enabled PyTorch build for faster training.
  • For reproducibility, fix the random seeds for NumPy and PyTorch and pass a fixed random_state to scikit-learn utilities; the notebooks include seed-setting cells.
  • Task 4 requires the PyTorch Geometric ecosystem; consult their installation matrix if CUDA wheels are needed.
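
As a rough illustration of the seed-setting note above, here is a minimal sketch of what such a cell could look like. The function name set_seed and the value 42 are assumptions, not taken from the notebooks; NumPy and PyTorch are seeded only if they are installed.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed the Python, NumPy and PyTorch generators that are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)           # CPU generator
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding with the same value reproduces the same random draws.
set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
```

scikit-learn has no global seed of its own, so splitters and estimators would additionally receive an explicit random_state argument.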

Data Preparation

  1. lab/data/raw/dynamic_api_call_sequence_per_malware_100_0_306.csv contains the original traces.
  2. Pre-split JSONs (train.json, test.json) live in lab/data/jsons/ and already match the splits used in the notebooks.
  3. All experiments reuse the same train/validation split created inside the notebooks (70/30 stratified).
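
To make the 70/30 stratified split concrete, the following dependency-free sketch splits indices class by class so that both parts keep the original class proportions. The helper stratified_split, the seed and the toy labels (1 = Malware, 0 = Benign) are assumptions for illustration; the notebooks may well rely on scikit-learn's train_test_split with test_size=0.3 and stratify set to the labels instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.3, seed=42):
    """Return (train_idx, val_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local generator, does not touch global state
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy, imbalanced label list (hypothetical): 90 malware, 10 benign samples.
labels = [1] * 90 + [0] * 10
train_idx, val_idx = stratified_split(labels)
benign_in_val = sum(1 for i in val_idx if labels[i] == 0)
```

With these toy labels the validation part receives 30 samples, 3 of them benign, which matches the 10% benign share of the full list.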

Summary of the results

Task 1: Frequency baseline

  • Vocabulary: 258 unique training API calls (232 in test, 3 unseen → routed to unknown).
  • Average non-zero features per sample: 22.9 (9% of the vocabulary) for train, 25.2 (10%) for test.
  • Despite the sparse representation, the feed-forward network already hits 97.6% accuracy, 78% macro F1 score.
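
The frequency representation described above can be sketched as follows. This is an illustrative reconstruction, not the notebook code: the token name "<unk>", the helper names and the example API names are assumptions. The vocabulary is built from training sequences only, and any call that was never seen in training is counted in the unknown column.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the unknown bucket


def build_vocab(train_sequences):
    """Map each API name seen in training to a column index; UNK is last."""
    names = sorted({api for seq in train_sequences for api in seq})
    vocab = {name: i for i, name in enumerate(names)}
    vocab[UNK] = len(vocab)
    return vocab


def frequency_vector(sequence, vocab):
    """Count how often each vocabulary entry occurs in one sequence."""
    vector = [0] * len(vocab)
    for api, count in Counter(sequence).items():
        vector[vocab.get(api, vocab[UNK])] += count
    return vector


# Toy training set (hypothetical API names).
train = [
    ["NtOpenFile", "NtReadFile", "NtClose"],
    ["NtOpenFile", "NtClose", "NtClose"],
]
vocab = build_vocab(train)

# "RegOpenKeyExW" is not in the training vocabulary, so it lands in UNK.
vec = frequency_vector(["NtClose", "NtClose", "RegOpenKeyExW"], vocab)
non_zero = sum(1 for value in vec if value > 0)
```

Here vec has one column per training API plus the unknown column, and only two of its four entries are non-zero, which is the kind of sparsity the non-zero figures above describe.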

Task 2: FFNN on fixed-length sequences

  • Sequences span 60–90 calls (train) and 70–100 (test); padding to the train-set max keeps validation realistic.
  • Raw sequential IDs suffer from treating APIs as equidistant tokens (95.9% accuracy, 55% macro F1 score).
  • Learnable embeddings plus dropout and class weighting raise the macro F1 score to 58% (at 94.4% accuracy); early stopping limits overfitting.
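
A minimal sketch of the fixed-length preparation, assuming that index 0 is reserved for padding and that sequences longer than the training maximum are truncated. The source only states that padding uses the train-set maximum, so the truncation step and all names below are assumptions.

```python
PAD_ID = 0  # assumption: index 0 is reserved for padding


def pad_or_truncate(ids, max_len, pad_id=PAD_ID):
    """Return a list of exactly max_len token IDs."""
    ids = list(ids)[:max_len]                     # drop anything past max_len
    return ids + [pad_id] * (max_len - len(ids))  # right-pad the remainder


# Toy ID sequences (hypothetical values).
train_ids = [[5, 3, 9], [7, 7, 2, 4, 1]]

# The target length comes from the training set only, so nothing about the
# test set influences the model's input size.
max_len = max(len(seq) for seq in train_ids)

padded_train = [pad_or_truncate(seq, max_len) for seq in train_ids]
padded_test = pad_or_truncate([8, 8, 8, 8, 8, 8, 8], max_len)
```

Every row then has the same width, so the rows can be stacked into one matrix and fed either directly to a feed-forward network or through an embedding layer first.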

Task 3: Recurrent Neural Networks

  • Custom collate_fn sorts sequences by length and leverages pack_padded_sequence for memory efficiency.
  • Vanilla RNN reaches 95.8% accuracy and 75% macro F1 score.
  • LSTM reaches 97.3% accuracy and 82% macro F1 score.
  • Bidirectional LSTM reaches 96.9% accuracy and 81% macro F1 score.
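
The batching logic behind the custom collate function can be illustrated without PyTorch. The sketch below is an assumption about its general shape, not the notebook implementation: it sorts a batch by descending length, pads each sequence to the longest one and keeps the true lengths.

```python
def collate_sorted(batch, pad_id=0):
    """Sort a batch by descending length and pad it.

    batch: list of (token_ids, label) pairs.
    Returns (padded_ids, lengths, labels) as plain Python lists.
    """
    batch = sorted(batch, key=lambda item: len(item[0]), reverse=True)
    lengths = [len(ids) for ids, _ in batch]
    max_len = lengths[0]  # longest sequence comes first after sorting
    padded = [list(ids) + [pad_id] * (max_len - len(ids)) for ids, _ in batch]
    labels = [label for _, label in batch]
    return padded, lengths, labels


# Toy batch (hypothetical IDs and labels).
batch = [([4, 2], 0), ([1, 3, 5, 7], 1), ([6, 6, 6], 1)]
padded, lengths, labels = collate_sorted(batch)
```

In a PyTorch pipeline these lists would be converted to tensors, and the embedded batch together with lengths would be passed to torch.nn.utils.rnn.pack_padded_sequence, which by default expects the sequences in descending length order. The packed form lets the recurrent layer skip the padded positions.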

Task 4: Graph Neural Networks

  • API sequences are mapped to call-transition graphs (nodes = APIs, edges = successive calls).
  • GCN reaches 96.8% accuracy and 80% macro F1 score.
  • GraphSAGE reaches 96.7% accuracy and 81% macro F1 score.
  • GraphSAGE reaches 97.2% accuracy and 82% macro F1 score.
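
The sequence-to-graph mapping can be sketched as below. The source only specifies that nodes are APIs and edges connect successive calls; treating edges as directed and keeping each distinct transition once are assumptions, as are the function name and the toy sequence.

```python
def sequence_to_graph(sequence):
    """Return (nodes, edge_index) for one API call sequence.

    nodes: unique APIs in order of first appearance.
    edge_index: [sources, targets], one directed edge per distinct transition.
    """
    node_id = {}
    for api in sequence:
        node_id.setdefault(api, len(node_id))  # assign IDs on first sight

    edges, seen = [], set()
    for src, dst in zip(sequence, sequence[1:]):  # successive call pairs
        edge = (node_id[src], node_id[dst])
        if edge not in seen:
            seen.add(edge)
            edges.append(edge)

    sources = [s for s, _ in edges]
    targets = [t for _, t in edges]
    return list(node_id), [sources, targets]


# Toy sequence (hypothetical API names).
nodes, edge_index = sequence_to_graph(["A", "B", "A", "B", "C"])
```

The two-row edge_index mirrors the [2, num_edges] layout that PyTorch Geometric uses for graph connectivity, so it could be turned into a tensor and combined with per-node features to form one graph per sample.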

Authors

  • Renato Mignone
  • Claudia Sanna
  • Chiara Iorio
