
NOISSUE - Add OpenAI embedding dims and OCR preprocessing #235

Open
fbugarski wants to merge 1 commit into main from openai

@fbugarski
Contributor

What type of PR is this?

This is primarily a feature PR (adds OCR preprocessing and runtime configuration), with a small bug fix/improvement (OpenAI embedding dimensions support/validation), and a documentation update.

What does this do?

This PR improves the Embedder ingestion and model-integration flow by adding:

  • Support for requesting a specific number of dimensions from OpenAI embeddings, plus validation of the dimension count in the response, to stay compatible with existing vector(768) usage.
  • OCR preprocessing pipeline for ingestion:
    • OCR for image/* files using tesseract.
    • OCR fallback for PDFs (pdftoppm + tesseract) when pdftotext output is too small.
  • OCR configuration via environment variables (EMBEDDER_OCR_*) and startup preflight checks for OCR binaries.
  • Docker/runtime updates for OCR dependencies (tesseract-ocr, tesseract-ocr-eng).
  • Compose and .env wiring for OCR-related settings.
  • Documentation updates in Embedder README and workflow docs.
  • New and updated tests for OpenAI embedding behavior and OCR extraction behavior.
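
For reviewers who want a concrete picture of the dimensions handling described above, here is a minimal sketch, not the code in this PR. It assumes the request uses the `dimensions` field of the public OpenAI embeddings API; the struct, the `validateDims` helper, and the model name are illustrative assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// embeddingRequest mirrors the JSON body of an OpenAI embeddings call.
// The type name is hypothetical; the field names follow the public API.
type embeddingRequest struct {
	Model      string `json:"model"`
	Input      string `json:"input"`
	Dimensions int    `json:"dimensions,omitempty"`
}

// validateDims returns an error when the returned vector does not have
// the length that the storage column (for example vector(768)) expects.
func validateDims(vec []float32, want int) error {
	if len(vec) != want {
		return fmt.Errorf("embedding has %d dimensions, want %d", len(vec), want)
	}
	return nil
}

func main() {
	// The model name is an assumption; the PR does not say which model is used.
	body, err := json.Marshal(embeddingRequest{
		Model:      "text-embedding-3-small",
		Input:      "hello",
		Dimensions: 768,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))

	// A vector of the wrong length is rejected before it reaches storage.
	fmt.Println(validateDims(make([]float32, 1536), 768))
}
```

Validating the response length separately from the request means a provider or model change that silently ignores the requested size fails loudly instead of producing rows that do not fit the column.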

Behavior notes:

  • OCR is enabled through configuration (the EMBEDDER_OCR_* variables); the defaults are chosen to be safe and remain configurable.
  • No breaking changes to the API contract were introduced.
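
The PDF fallback mentioned in the description (pdftotext first, then pdftoppm + tesseract when the output is too small) could look roughly like the sketch below. This is an illustration under assumptions, not the PR's implementation: the function names, the `minChars` threshold, the 300 DPI setting, and the choice to also fall back when pdftotext itself fails are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"sort"
	"strings"
)

// needsOCR reports whether extracted text is too short to be trusted.
// minChars is a hypothetical threshold; the PR does not state the real one.
func needsOCR(text string, minChars int) bool {
	return len(strings.TrimSpace(text)) < minChars
}

// extractPDF tries pdftotext and falls back to pdftoppm + tesseract.
func extractPDF(path string, minChars int) (string, error) {
	// "-" makes pdftotext write the text to stdout.
	out, err := exec.Command("pdftotext", path, "-").Output()
	if err == nil && !needsOCR(string(out), minChars) {
		return string(out), nil
	}

	dir, err := os.MkdirTemp("", "ocr-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	// Render every page to a PNG named <dir>/page-<n>.png.
	prefix := filepath.Join(dir, "page")
	if err := exec.Command("pdftoppm", "-png", "-r", "300", path, prefix).Run(); err != nil {
		return "", fmt.Errorf("pdftoppm: %w", err)
	}
	pages, err := filepath.Glob(prefix + "*.png")
	if err != nil {
		return "", err
	}
	sort.Strings(pages) // page numbers are zero-padded, so this keeps page order

	var sb strings.Builder
	for _, p := range pages {
		// The output base "stdout" makes tesseract print the recognized text.
		text, err := exec.Command("tesseract", p, "stdout", "-l", "eng").Output()
		if err != nil {
			return "", fmt.Errorf("tesseract %s: %w", p, err)
		}
		sb.Write(text)
	}
	return sb.String(), nil
}

func main() {
	fmt.Println(needsOCR("   \n", 32)) // whitespace-only output triggers the fallback
	if len(os.Args) > 1 {
		text, err := extractPDF(os.Args[1], 32)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Print(text)
	}
}
```

Keeping the size check in its own function makes the fallback decision testable without any OCR binaries installed.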

Which issue(s) does this PR fix/relate to?

  • Related Issue: NOISSUE

Have you included tests for your changes?

Yes.

Added tests:

  • internal/embedder/embedding/openai/client_test.go
  • internal/embedder/ingest/ocr_test.go

Executed locally:

  • go test ./internal/embedder/embedding/... -v
  • go test ./internal/embedder/ingest/... -v
  • go test ./cmd/embedder/...

Did you document any new/modified features?

Yes.

Documentation updated:

  • internal/embedder/README.md (new OCR env vars and behavior)
  • internal/embedder/workflows/ollama-vs-openai-eval.md (evaluation flow and OCR notes)

Notes

  • OpenAI API keys are intentionally not committed in repo config.
  • OpenAI live API validation may require active API quota/billing on the target project.
  • OCR dependencies are now present in docker/Dockerfile.embedder, so enabling OCR in deployment only requires env configuration.
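
As a rough illustration of the startup preflight check for OCR binaries, the sketch below looks each tool up on PATH and only fails when OCR is switched on. The variable name `EMBEDDER_OCR_ENABLED` and both helper functions are hypothetical; the PR only states that settings use the EMBEDDER_OCR_* prefix.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// ocrEnabledVar is a hypothetical name used only for this example.
const ocrEnabledVar = "EMBEDDER_OCR_ENABLED"

// missingBinaries returns the names that cannot be found on PATH.
func missingBinaries(names ...string) []string {
	var missing []string
	for _, n := range names {
		if _, err := exec.LookPath(n); err != nil {
			missing = append(missing, n)
		}
	}
	return missing
}

// preflight fails only when OCR is enabled and a required tool is absent.
func preflight(enabled bool) error {
	if !enabled {
		return nil
	}
	if m := missingBinaries("pdftotext", "pdftoppm", "tesseract"); len(m) > 0 {
		return fmt.Errorf("OCR is enabled but these binaries are missing from PATH: %v", m)
	}
	return nil
}

func main() {
	// An unset or unparsable value is treated as false in this sketch.
	enabled, _ := strconv.ParseBool(os.Getenv(ocrEnabledVar))
	if err := preflight(enabled); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("OCR preflight passed")
}
```

Failing at startup rather than on the first ingested file surfaces a missing dependency in the deployment logs instead of in a later ingestion error.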
