Commit ba587dd

add bertopic script

1 parent 5177556
10 files changed: +1270 −33 lines

.DS_Store

0 Bytes; binary file not shown.

.gitignore

2 additions & 1 deletion

```diff
@@ -189,4 +189,5 @@ notebooks/*.json
 notebooks/*.md
 notebooks/results/
 .DS_Store
-data/.DS_Store
+data/.DS_Store
+results/
```

baseline_experiments/QUICKSTART.md

192 additions & 0 deletions (new file)
# Baseline Experiments Quick Start

## Prerequisites

Make sure you have Poetry installed and the environment set up:

```bash
# Install dependencies (if not already done)
cd /Users/ngc436/Documents/projects/AutoTM
poetry install

# Install BERTopic dependencies
poetry add "bertopic>=0.16.0" "sentence-transformers>=3.0.0" "umap-learn>=0.5.6" "hdbscan>=0.8.38"

# For LLM evaluation
poetry add openai python-dotenv
```

## LLM Evaluation Setup

Create a `.env` file in the project root:

```bash
# For vLLM with Qwen (or similar)
AUTOTM_LLM_API_KEY=your-api-key-here
AUTOTM_LLM_BASE_URL=http://your-server:8041/v1
AUTOTM_LLM_MODEL_NAME=/model
```
## Running Experiments

### Quick Test (Single Dataset, Few Seeds)

```bash
# Test BERTopic on the hotel reviews dataset
poetry run python3 baseline_experiments/run_experiments.py \
    --model bertopic \
    --datasets "hotel:data/hotel_reviews/Datafiniti_Hotel_Reviews.csv:reviews.text" \
    --language-map "hotel:en" \
    --embedding-model "sentence-transformers/all-MiniLM-L6-v2" \
    --seeds 0-2 \
    --grid preset_tiny \
    --output-dir results/bertopic_test \
    --cache-dir cache/bertopic \
    --n-jobs 2
```
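The `--seeds 0-2` argument expands to seeds 0, 1, and 2. As a minimal sketch, such a range spec could be parsed like this (hypothetical helper; the runner's actual CLI parsing may differ):

```python
def parse_seed_range(spec: str) -> list[int]:
    """Expand a spec like "0-4" (inclusive range) or "3" into a list of seeds."""
    if "-" in spec:
        lo, hi = spec.split("-", 1)
        return list(range(int(lo), int(hi) + 1))
    return [int(spec)]

print(parse_seed_range("0-2"))  # [0, 1, 2]
```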
### Full Baseline Experiments (All Datasets)

```bash
# Run BERTopic on all three datasets with a comprehensive grid
poetry run python3 baseline_experiments/run_experiments.py \
    --model bertopic \
    --datasets "hotel:data/hotel_reviews/Datafiniti_Hotel_Reviews.csv:reviews.text,amazon:data/amazon_food/Reviews.csv:Text,lenta:data/lenta_ru/lenta-ru-news.csv:text" \
    --language-map "hotel:en,amazon:en,lenta:ru" \
    --embedding-model "sentence-transformers/all-MiniLM-L6-v2" \
    --seeds 0-4 \
    --grid preset_small \
    --output-dir results/bertopic_all_datasets \
    --cache-dir cache/bertopic \
    --n-jobs 4 \
    --name bertopic_baseline_all
```

**Note:** This will run ~180 experiments (3 datasets × 5 seeds × 12 configs) and may take several hours.
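Each `--datasets` entry follows a `name:path:text_column` pattern, with multiple datasets separated by commas. A sketch of how such a spec could be split (hypothetical helper, not the runner's actual code; it assumes paths contain no colons):

```python
def parse_datasets(spec: str) -> dict[str, tuple[str, str]]:
    """Map dataset name -> (csv_path, text_column) from a comma-separated spec."""
    datasets = {}
    for entry in spec.split(","):
        # Split on the first two colons only, so dotted column names survive
        name, path, text_col = entry.split(":", 2)
        datasets[name] = (path, text_col)
    return datasets

spec = "hotel:data/hotel.csv:reviews.text,lenta:data/lenta_ru/lenta-ru-news.csv:text"
print(parse_datasets(spec)["lenta"])  # ('data/lenta_ru/lenta-ru-news.csv', 'text')
```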
### Monitoring Progress

While experiments are running, you can monitor progress:

```bash
# Option 1: Use the monitoring script
poetry run python3 baseline_experiments/monitor_experiments.py \
    --results-dir results/bertopic_all_datasets \
    --interval 30

# Option 2: Follow the log file
tail -f bertopic_experiments.log

# Option 3: Check partial results
ls -lh results/bertopic_all_datasets/
wc -l < results/bertopic_all_datasets/results_partial.csv
```
### Gensim LDA Experiments

For comparison, you can also run Gensim LDA baselines:

```bash
# Single dataset
poetry run python3 baseline_experiments/run_experiments.py \
    --model gensim_lda \
    --dataset hotel_reviews \
    --data-path data/hotel_reviews/Datafiniti_Hotel_Reviews.csv \
    --text-col reviews.text \
    --topics 20 \
    --budget 300 \
    --seeds 0-4 \
    --output-dir results/gensim_hotel \
    --preproc auto
```
## Understanding Results

### BERTopic Output Structure

```
results/bertopic_all_datasets/
├── results.csv      # All runs with full metrics
├── summary.csv      # Aggregated stats (mean±std per config)
└── runs/            # Per-run artifacts
    ├── hotel__cfg000__seed0.topics.json
    ├── hotel__cfg000__seed0.config.json
    ├── hotel__cfg000__seed0.assignments.csv
    └── ...
```
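`summary.csv` aggregates `results.csv` over seeds. The kind of mean±std aggregation involved can be sketched with toy data (column names follow the examples in this guide; the actual schema may differ):

```python
import pandas as pd

# Toy stand-in for results.csv: one row per (dataset, config, seed) run
df = pd.DataFrame({
    "dataset": ["hotel"] * 4,
    "config": ["cfg000", "cfg000", "cfg001", "cfg001"],
    "coherence_c_v": [0.50, 0.54, 0.61, 0.59],
})

# Aggregate over seeds: mean and std per (dataset, config)
summary = (
    df.groupby(["dataset", "config"])["coherence_c_v"]
      .agg(["mean", "std"])
      .reset_index()
)
print(summary)
```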
### Key Metrics

- **Coherence (C_V)**: Semantic coherence of topics (higher is better)
- **Coherence (C_NPMI)**: Normalized pointwise mutual information (higher is better)
- **Topic Diversity**: Uniqueness of top words across topics (higher is better)
- **Runtime**: Execution time in seconds
- **N Topics**: Number of discovered topics
- **Outlier Rate**: % of documents not assigned to any topic (lower is better)
129+
```python
130+
import pandas as pd
131+
132+
# Load results
133+
df = pd.read_csv('results/bertopic_all_datasets/results.csv')
134+
135+
# Best configuration per dataset
136+
for dataset in df['dataset'].unique():
137+
best = df[df['dataset'] == dataset].nlargest(1, 'coherence_c_v')
138+
print(f"\n{dataset} - Best Config:")
139+
print(f" Coherence: {best['coherence_c_v'].values[0]:.4f}")
140+
print(f" Diversity: {best['topic_diversity'].values[0]:.4f}")
141+
print(f" Topics: {best['n_topics'].values[0]}")
142+
143+
# Summary statistics
144+
summary = pd.read_csv('results/bertopic_all_datasets/summary.csv')
145+
print("\n" + "="*80)
146+
print("SUMMARY STATISTICS")
147+
print("="*80)
148+
print(summary)
149+
```
150+
151+
## Troubleshooting

### Import Errors

If you see `ModuleNotFoundError`, make sure you're using `poetry run`:

```bash
# ✗ Wrong
python3 baseline_experiments/run_experiments.py ...

# ✓ Correct
poetry run python3 baseline_experiments/run_experiments.py ...
```

### Memory Issues

Large datasets may require more memory. Reduce the batch size or use fewer parallel jobs:

```bash
--n-jobs 1  # More deterministic, less memory
```

### Slow Embeddings

Embeddings are cached after the first computation, so the first run per dataset will be slower than subsequent runs.
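The caching behavior mentioned above can be pictured as a content-addressed store: embeddings keyed by a hash of the corpus plus the embedding model name. This is only a sketch of the idea; the actual layout under `--cache-dir` may differ:

```python
import hashlib
from pathlib import Path

import numpy as np

def cache_path(cache_dir: str, model_name: str, docs: list[str]) -> Path:
    """Derive a cache file name from the embedding model and corpus content."""
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    for doc in docs:
        digest.update(doc.encode("utf-8"))
    return Path(cache_dir) / f"{digest.hexdigest()[:16]}.npy"

def load_or_compute(cache_dir, model_name, docs, compute):
    """Return cached embeddings if present; otherwise compute and store them."""
    path = cache_path(cache_dir, model_name, docs)
    if path.exists():
        return np.load(path)
    embeddings = compute(docs)  # the slow part, e.g. a sentence-transformers encode
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, embeddings)
    return embeddings
```

A second run with the same corpus and model name hits the `.npy` file and skips the encoder entirely.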
## Next Steps

After running baselines:

1. **Compare with AutoTM**: Use `analyze_results.py` to compare baseline results with AutoTM
2. **Visualize**: Create plots showing coherence/diversity trade-offs
3. **Analyze Topics**: Examine per-topic output to understand discovered themes
4. **Write Up**: Document findings in your research paper/report

## Tips

- Start with `preset_tiny` for quick tests
- Use `preset_small` for balanced exploration
- Reserve `preset_medium` for final comprehensive experiments
- Always use the same seeds across frameworks for fair comparison
- Cache embeddings to save time on repeated runs

baseline_experiments/README.md

68 additions & 0 deletions
````diff
@@ -9,8 +9,13 @@ This folder contains baseline topic modeling experiments with various frameworks
 - **`run_experiments.py`** - Unified experiment runner that supports multiple topic modeling frameworks
   - Supports Gensim LDA and BERTopic baselines
   - Handles multiple seeds, result aggregation, and summary statistics
+  - Optional LLM-based topic evaluation
   - See usage examples below
 
+- **`llm_evaluate.py`** - Standalone LLM topic evaluation script
+  - Can evaluate saved topic model results using OpenAI-compatible API
+  - Works with vLLM, Qwen, GPT-4, and other compatible models
+
 ### Framework-Specific Scripts
 
 - **`gensim_lda.py`** - Gensim LDA baseline implementation
````
````diff
@@ -144,6 +149,69 @@ Both frameworks compute:
 - **Runtime** - execution time
 - **Number of Topics** - discovered/specified topics
 - **Outlier Rate** (BERTopic) - percentage of documents not assigned to any topic
+- **LLM Score** (optional) - LLM-based topic quality rating (1-4 scale)
+
+## LLM Evaluation
+
+Evaluate topic quality using an LLM (OpenAI-compatible API, including vLLM with Qwen).
+
+### Setup
+
+Create a `.env` file with your API credentials:
+
+```bash
+# For vLLM with Qwen
+AUTOTM_LLM_API_KEY=your-api-key-here
+AUTOTM_LLM_BASE_URL=http://your-server:8041/v1
+AUTOTM_LLM_MODEL_NAME=/model
+
+# For OpenAI
+AUTOTM_LLM_API_KEY=sk-your-openai-key
+# AUTOTM_LLM_BASE_URL=  # Not needed for OpenAI
+AUTOTM_LLM_MODEL_NAME=gpt-4o
+```
+
+### Run with Experiments
+
+```bash
+# BERTopic with LLM evaluation
+python baseline_experiments/run_experiments.py \
+    --model bertopic \
+    --datasets "hotel:data/hotel.csv:text" \
+    --language-map "hotel:en" \
+    --seeds 0-4 \
+    --grid preset_fast \
+    --output-dir results/bertopic_hotel \
+    --llm-evaluate \
+    --llm-max-topics 10 \
+    --llm-estimations 3
+```
+
+### Post-Processing Evaluation
+
+Evaluate already completed experiments:
+
+```bash
+# BERTopic results
+python baseline_experiments/llm_evaluate.py \
+    --results_dir results/bertopic_hotel_full \
+    --framework bertopic \
+    --max_topics 10 \
+    --estimations 3
+
+# Gensim LDA results
+python baseline_experiments/llm_evaluate.py \
+    --results_dir results/gensim_hotel \
+    --framework gensim \
+    --results_file results/gensim_hotel/hotel_all_results.jsonl
+```
+
+### LLM Score Interpretation
+
+- **1** = Unrelated words (poor topic)
+- **2** = Weakly related (marginal topic)
+- **3** = Related (good topic)
+- **4** = Strongly related (excellent topic)
 
 ## Dependencies
````
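With `--estimations 3`, each topic is rated several times on the 1-4 scale; the per-topic ratings are then typically averaged into a single model-level score. A sketch under that assumption, with hypothetical ratings (the script's actual aggregation may differ):

```python
from statistics import mean

# Hypothetical output of 3 estimations per topic on the 1-4 scale
ratings = {
    "topic_0": [4, 3, 4],
    "topic_1": [2, 2, 3],
}

# Average over estimations first, then over topics
per_topic = {topic: mean(scores) for topic, scores in ratings.items()}
model_score = mean(per_topic.values())
print(f"{model_score:.2f}")  # averages 11/3 and 7/3 -> 3.00
```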