This module gathers trending news hot topics. It fetches articles and their associated data, serving as the initial data-acquisition stage of the project's pipeline.
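For a rough sense of what this stage produces, one collected item might look like the sketch below (the field names are illustrative assumptions, not the module's actual schema):

```python
# Hypothetical shape of one crawled item; every field name here is illustrative only.
news_item = {
    "topic": "Example trending topic",
    "title": "Example article headline",
    "url": "https://example.com/article",
    "published_at": "2024-05-01",
    "images": ["https://example.com/figure1.jpg"],  # candidate images for the ranking step
    "content": "Full article text ...",
}
```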
The ranking module acts as an intelligent image filter. It processes the collected news topics, evaluating the relevance of associated images to the article's core content. This step helps ensure that only high-quality, contextually relevant images are passed down the pipeline.
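A minimal sketch of what such a relevance check could look like, assuming an OpenAI vision-capable model is used; the model name, prompt, and scoring scale here are illustrative, not the module's actual implementation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def image_relevance(article_text: str, image_url: str) -> str:
    """Ask a vision-capable model to rate how well an image matches an article (illustrative sketch)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice, for illustration only
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "On a scale of 1-10, how relevant is this image to the article below?\n\n" + article_text},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Images scoring below some threshold would then be dropped before question generation.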
The qa_makers component specializes in creating "Level 1" questions. These are foundational questions generated directly from the filtered images and their corresponding news content.
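For illustration, a Level 1 QA pair might tie an image to a question answerable directly from its article; the structure below is an assumption, not the module's actual output format:

```python
# Hypothetical Level 1 QA pair; field names and wording are illustrative only.
level1_qa = {
    "image": "https://example.com/figure1.jpg",
    "question": "Which company is announcing the product shown in this image?",
    "answer": "Example Corp",
    "source_url": "https://example.com/article",
}
```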
The test_set module is crucial for maintaining question quality. It evaluates the Level 1 questions generated by qa_makers against a strict set of criteria, filtering out any questions that are ambiguous, too simple, lack temporal context, or violate other quality standards.
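As a toy illustration of how those criteria apply (the questions and verdicts below are made up, not taken from the dataset):

```python
# Hypothetical screening outcomes; the rejection reasons mirror the criteria listed above.
screened = [
    {"question": "What product was unveiled at the event shown in this image?", "verdict": "keep"},
    {"question": "What color is the sky in this photo?", "verdict": "reject: too simple"},
    {"question": "What happened?", "verdict": "reject: ambiguous, lacks temporal context"},
]
```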
Building on the filtered Level 1 questions, the qa_makers_mh (multi-hop) module generates "Level 2" questions. These questions are designed to be more complex, often requiring multi-step reasoning and leveraging the answers from Level 1 questions to create challenging, multi-hop queries.
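For example, a Level 2 question might fold the answer of a Level 1 question into a further reasoning step; the example below is illustrative, not drawn from the dataset:

```python
# Hypothetical multi-hop chain; the composition pattern follows the description above.
level1 = {
    "question": "Which company is holding the launch event shown in this image?",
    "answer": "Example Corp",
}
level2 = {
    # The Level 1 answer ("Example Corp") becomes the hidden intermediate hop.
    "question": "Who is the CEO of the company holding the launch event shown in this image?",
    "hops": [level1["question"], "Who is the CEO of Example Corp?"],
}
```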
To execute the entire news pipeline, install the dependencies and run the start.py script from your project's root directory:

```bash
pip install -r requirements.txt
python start.py
```

The QA pipeline for video is the same as the news pipeline.
However, you should first install the required GitHub repositories, then run:

```bash
bash video_pipeline.sh
```

The arXiv paper pipeline consists of the following scripts:

- direct_download.py: Downloads arXiv papers by year-month and ID range, with support for multi-processing and multi-threading.
- cate.py: Categorizes papers based on their date.
- get_article.py: Extracts images, paragraphs, and association information from HTML files of arXiv papers.
- select_best_images.py: Uses GPT-4 to analyze and rank figures from papers, selecting the most representative images.
- streamlit_viewer.py: Provides a web interface for annotating selected images.
- construct_level1.py: Generates level 1 QA pairs focused on basic paper identification from images.
- construct_level2.py: Generates level 2 QA pairs that require deeper understanding of the paper content.
```bash
python direct_download.py --yearmonth 2405 --start-id 1 --end-id 100 --concurrent 5 --processes 4
```

Parameters:

- --yearmonth: Year and month in format YYMM (e.g., 2405 for May 2024)
- --start-id, --end-id: Range of paper IDs to download
- --concurrent: Number of concurrent downloads per process
- --processes: Number of processes to use
```bash
python cate.py
```

```bash
python get_article.py --dir /path/to/html/files --output /path/to/output --workers 4
```

Parameters:

- --dir: Directory containing HTML files
- --output: Output directory for processed data
- --workers: Number of parallel processes
- --force: Force reprocessing of already processed files
- --limit: Limit number of files to process
```bash
python select_best_images.py --input_dir /path/to/json/files --output_dir /path/to/output --workers 4
```

Parameters:

- --input_dir: Directory containing JSON files
- --output_dir: Directory to output results
- --workers: Number of worker processes
- --start_index, --end_index: Range of folders to process
- --force: Force reprocessing files with existing results
```bash
streamlit run streamlit_viewer.py
```

For Level 1 QA pairs (basic paper identification):

```bash
python construct_level1.py --input-path /path/to/json/files --output-file output.jsonl --workers 4
```

For Level 2 QA pairs (deeper paper understanding):
```bash
python construct_level2.py --input-file output.jsonl --output-file output_with_level2.jsonl --processes 4
```

Required environment variable:

- OPENAI_API_KEY: Your OpenAI API key for GPT-4 interaction