Python web scraper by FilippoMarletta · Pull Request #4 · UNICT-DMI/opis-manager-scraper

FilippoMarletta · 2026-03-25T16:15:04Z

Relates to UNICT-Quality-Development/qd-projects#46

Implements the Python web scraper that extracts OPIS data for the academic years 2021/2022 through 2024/2025. The scraper queries the public SmartEdu UniCT APIs (https://public.smartedu.unict.it/EnqaDataViewer) and populates the database following the structure defined in opis-manager-core.

Features

DevContainers configuration
Automatic retry on HTTP 429 and 5xx errors (3 attempts, exponential backoff)
Parallel fetching of OPIS forms via ThreadPoolExecutor (3 workers by default), with per-thread exception handling so that a single failure does not block processing
Logging system
Unit tests
Configurable debug mode
CI pipeline: linting with Black, type checking with Pyright (standard mode), tests with pytest (minimum coverage 80%)
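
As an illustration of the retry behaviour listed above, here is a minimal standard-library sketch. It is not the code from this PR: the function name `fetch_with_retry`, the `send` callable (assumed to return a status code and a body), and the 1-second base delay are assumptions made for the example.

```python
import time
from typing import Callable, Optional, Tuple

# Status codes treated as transient: 429 plus every 5xx.
RETRYABLE_STATUS = {429} | set(range(500, 600))


def fetch_with_retry(
    send: Callable[[], Tuple[int, str]],  # hypothetical: performs one HTTP call
    max_attempts: int = 3,
    base_delay: float = 1.0,  # assumed value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send() up to max_attempts times, doubling the wait after each failure."""
    last_status: Optional[int] = None
    for attempt in range(max_attempts):
        status, body = send()
        if status in RETRYABLE_STATUS:
            last_status = status
            # Wait only if another attempt will follow: 1s, 2s, 4s, ...
            if attempt < max_attempts - 1:
                sleep(base_delay * 2**attempt)
            continue
        if status >= 400:
            # Other client errors are not transient, so fail immediately.
            raise RuntimeError(f"request failed with status {status}")
        return body
    raise RuntimeError(
        f"giving up after {max_attempts} attempts (last status {last_status})"
    )
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting.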

Notes

OPIS form data extracted from the API (when present)

totale_schede: Total number of questionnaires completed by attending students.
totale_schede_nf: Total number of questionnaires completed by non-attending students.
fc: Number of students beyond the standard course duration (fuori corso).
inatt_nf: Number of inactive non-attending students.
domande: Array of questionnaire answers from attending students.
domande_nf: Array of questionnaire answers from non-attending students.
motivo_nf: Reasons for non-attendance.
sugg: Suggestions from attending students.
sugg_nf: Suggestions from non-attending students.
femmine: Number of female attending students.
femmine_nf: Number of female non-attending students.
inatt: Number of inactive attending students.
eta: Statistics on age brackets.
anno_iscr: Year of enrollment.
num_studenti: Statistic on the average number of students.
ragg_uni: Statistics on place of residence or on travel time to campus.
studio_gg: Statistics on daily study hours.
studio_tot: Statistics on total study hours.

Problem: alphanumeric GOMP codes
Some activities have an alphanumeric activityCode (e.g. "A3688"). The codice_gomp column in the database is of type INT, so these records cannot be inserted. Such activities are currently skipped with a warning during processing. A permanent fix would require a migration that changes the column type to VARCHAR.
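
The skip-with-warning behaviour could look roughly like the sketch below. The helper name `parse_codice_gomp` and the logger name are hypothetical and chosen for illustration. Only the rule itself (non-numeric codes are skipped with a warning) comes from the description above.

```python
import logging
from typing import Optional

logger = logging.getLogger("opis_scraper")  # hypothetical logger name


def parse_codice_gomp(activity_code: object) -> Optional[int]:
    """Return the code as an int, or None when it cannot fit an INT column."""
    text = str(activity_code).strip()
    # Accept plain ASCII digits only; anything else (e.g. "A3688") is skipped.
    if text.isascii() and text.isdigit():
        return int(text)
    logger.warning(
        "Skipping activity with non-numeric activityCode %r", activity_code
    )
    return None
```

A caller would check for `None` and move on to the next activity instead of attempting the insert.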

Problem: splitting teachings into channels
The APIs do not provide explicit data about channels. Specifically:

channel -> always null.
part_code -> always null.
part_name (nome_modulo in the DB) -> contains a string if the teaching is split into modules (e.g. "PROGRAMMAZIONE 2" or "LABORATORIO"), otherwise it is null.

Workaround
I handled the assignment of channels to duplicate Insegnamento records as follows:
1. Grouping into unique sets: To determine the number of channels, the algorithm groups the Insegnamento records into unique "sets" based on nome_modulo (or on a single placeholder module if the field is null). Each complete set makes up one channel.
2. Progressive numbering: The channels formed this way are assigned a progressive numeric value ("1", "2", ..., "n").
3. Single-channel handling: If only one channel (a single set) exists once grouping is finished, the canale field is set to "no".
Limitations: This logical split structures the data correctly and fixes the display problems in the frontend. However, because the API provides no identifiers at all, there is no guarantee that the module-to-channel associations match the actual academic setup (we have no data to tell whether a specific module physically belongs to channel 1 or channel 2).
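
The three steps above can be sketched as follows. This is one possible reading of the workaround, not the code in this PR: the function name `assign_channels`, the placeholder key, and the greedy "first set that does not yet contain this module" strategy are assumptions.

```python
from typing import List, Optional, Set

SINGLE_MODULE = "__single_module__"  # hypothetical placeholder for null nome_modulo


def assign_channels(modules: List[Optional[str]]) -> List[str]:
    """Given the nome_modulo of each duplicate record, return its canale value."""
    channels: List[Set[str]] = []  # one set of module names per channel
    assigned: List[int] = []       # channel index chosen for each record
    for nome_modulo in modules:
        key = nome_modulo if nome_modulo is not None else SINGLE_MODULE
        # Step 1: put the record in the first set that lacks this module.
        for index, seen in enumerate(channels):
            if key not in seen:
                seen.add(key)
                assigned.append(index)
                break
        else:
            # Every existing set already has this module: open a new channel.
            channels.append({key})
            assigned.append(len(channels) - 1)
    # Step 3: a single channel is stored as "no".
    if len(channels) <= 1:
        return ["no"] * len(modules)
    # Step 2: progressive numbering starting from "1".
    return [str(index + 1) for index in assigned]


# Two modules, each appearing twice, yield two channels.
print(assign_channels(["PROGRAMMAZIONE 2", "LABORATORIO", "PROGRAMMAZIONE 2", "LABORATORIO"]))
# → ['1', '1', '2', '2']
```

As the limitations note says, which "LABORATORIO" record ends up in which channel depends only on record order, not on any information from the API.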

Performance problem: slow source APIs
The APIs have extremely slow response times. One run of the scraper takes about 24 hours to retrieve the OPIS questionnaires of a single academic year (2021/2022).

Mitigations
To make execution more efficient, the following optimizations were introduced:

Parallel execution: The concurrent.futures.ThreadPoolExecutor module is used to process the individual subjects in multiple threads.
Load management (throttling): A controlled delay was introduced between the main requests so as not to overload the source server and to avoid timeouts or IP blocks caused by the parallel requests.

Even so, run times remain very long.
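
A minimal sketch of how the two mitigations can be combined is shown below. The names `fetch_all` and `fetch_one` are hypothetical, and placing the delay between task submissions is an assumption about where the throttling happens. The default of 3 workers matches the feature list above.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

logger = logging.getLogger("opis_scraper")  # hypothetical logger name


def fetch_all(
    items: Iterable[Hashable],
    fetch_one: Callable[[Hashable], object],  # hypothetical per-subject fetcher
    max_workers: int = 3,
    delay: float = 1.0,
) -> Tuple[Dict[Hashable, object], List[Hashable]]:
    """Fetch every item in a thread pool; return (results, failed items)."""
    results: Dict[Hashable, object] = {}
    failed: List[Hashable] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for item in items:
            futures[pool.submit(fetch_one, item)] = item
            time.sleep(delay)  # throttle: space out submissions to the server
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                # One failing subject is logged and skipped, not fatal.
                logger.warning("Fetch failed for %r: %s", item, exc)
                failed.append(item)
    return results, failed
```

Collecting the failed items separately makes it possible to retry only those subjects in a later run instead of repeating the whole scrape.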

python_scraper/requirements.txt

.github/workflows/test.yml

.github/workflows/ci.yml

python_scraper/src/models.py

python_scraper/src/scraper.py

Helias · 2026-03-25T22:11:55Z

Did you test it with the current OPIS-Manager app?

python_scraper/src/transformers.py

FilippoMarletta · 2026-03-26T06:30:08Z

I tested it using a local database initialized with the migrations from opis-manager-core.

…ments.txt

Helias · 2026-03-26T12:31:51Z

I added the same lint.yml file in the main branch to see if the pipeline starts here
please solve merge conflicts

…arse_scheda_opis_data

Helias · 2026-03-26T22:02:38Z

please solve merge conflict

… int

FilippoMarletta · 2026-03-26T22:19:58Z

How should I proceed with the CD pipeline?

Helias · 2026-03-26T22:20:48Z

The pipeline is fine. I am testing the PR and I am also getting some errors for MATEMATICA E INFORMATICA

Helias · 2026-03-26T23:01:53Z

python_scraper/src/scraper.py

+ACCADEMIC_YEARS = [2021, 2022, 2023, 2024]
+DELAY = 1.0
+
+MAX_WORKERS = 3


I noticed that up to 10 is fine, I would put 10 as the default

FilippoMarletta added 30 commits January 19, 2026 21:06

chore: setup devcontainer and project structure

19d1b5c

feat: implement get_departments and models

4088a4b

chore: add .gitattributes to enforce LF line endings

f783e35

build: add pytest-mock and pytest-cov to requirements.txt

171de79

feat: add get_courses function and update models for course data handling

a4d2683

feat: add tests for get_departments and get_courses functions

3a8b6b3

feat: implement parse_course_name function and add corresponding tests

288d224

chore: update .gitignore to include coverage and cache files

54abfe1

build: update vscode extensions

519220d

fix: prevents pylance from crushing

db67ba1

feat: add tests for get_activities and parse_insegnamento_data functions

47f19f9

feat: add get_activities function, insegnamento dataclass and parsing function

a5ea68f

feat: add SchedaOpis dataclass

50068a3

feat: add function parse_scheda_opis and correspondig tests

f7d7893

feat: add get_questions function and update related models and transformers

1c85d9c

feat: implement scraper functionality with logging and data processing

de2e526

refactor: use requests.Session to improve scraping speed

20696ad

feat: enhance API client with logging and timeout management

d73c5e1

feat: extend SchedaOpis model with additional and previously fields

b08d43f

build: add mysql-connector and python-dotenv dependecies and update container config

99e4c04

fix: update parse_course_name regex and rewrite parse_scheda_opis to adhere to the updated model

29f402a

feat: implement database connection and CRUD operations for departments, courses, and teaching records

de010a4

feat: enhance scraper functionality with concurrent processing and improved error handling

5afaabc

fixt: update parse_course_name regex for improved matching and add comprehensive test cases

b6290fb

fix: ensure professor names default to empty string if not present in data

47502a5

feat: add random sampling of activities and departments in debug mode for enhanced testing

0092fd2

fix: update parse_course_name regex to support 'c.u.' and 'cu' formats with additional test cases

4778821

fix: add previously missing nome_modulo field to Insegnamento model and update parsing logic

3a85d1c

refactor: streamline parse_scheda_opis_data function by removing unused code and enhancing data aggregation logic

6c833c6

fix: update mock API calls to use session.post and adjust test data for accurate submissions count

31b9759

FilippoMarletta added 4 commits March 25, 2026 14:54

fix: more general regex for parse_course_name

c85fefe

tests: add 2 test cases for test_parse_course_name

cb1a547

feat: enriches debug mode with more customization

8d8c150

style: black linting

d2e0669