Add support for other text formats in the file content filter #464

@dpomykala

Is your feature request related to a problem? Please describe.
The file content filter currently supports the following file formats: .md, .txt, .log, .pdf, and .docx. Of these, only PDF and DOCX files require specialized extraction logic. The other three are treated as plain text: the filter simply reads the content of the file.

The problem is that any other text file with a different extension cannot be processed, because its extension is not listed in the EXTRACTORS mapping:

EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".md": extract_txt,
    ".txt": extract_txt,
    ".log": extract_txt,
    ".pdf": extract_pdf,
    ".docx": extract_docx,
}

Many text files have unusual extensions, and some have no extension at all. There is no reason to exclude those files from processing.
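To make the limitation concrete, here is a minimal sketch of an extension-based check. It assumes the filter matches on the lowercased file suffix; `SUPPORTED_SUFFIXES` and `is_supported` are hypothetical names used only for illustration, not the project's actual code.

```python
from pathlib import Path

# Hypothetical stand-in for the keys of the EXTRACTORS mapping quoted above.
SUPPORTED_SUFFIXES = {".md", ".txt", ".log", ".pdf", ".docx"}


def is_supported(path: Path) -> bool:
    # Assumption: the filter decides purely on the lowercased file suffix.
    return path.suffix.lower() in SUPPORTED_SUFFIXES


for name in ["notes.md", "config.xml", "Makefile", ".gitignore"]:
    # Path("Makefile").suffix and Path(".gitignore").suffix are both "",
    # so extensionless files and dotfiles never match a registered suffix.
    print(name, is_supported(Path(name)))
```

Under this assumption, only `notes.md` passes the check, even though all four files are plain text.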

For example, I tried to filter XML files by searching their content for a specific pattern, but this is currently not possible.

Describe the solution you'd like
We could use a specialized extractor when one is registered for a given file format (currently PDF and DOCX) and fall back to the simple text extractor (extract_txt) for all other files:

def extract_txt(path: Path) -> str:
    return path.read_text(encoding="utf-8")
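The fallback described above could look roughly like the following sketch. The names `SPECIALIZED_EXTRACTORS` and `extract_content` are hypothetical, and the PDF and DOCX extractors are placeholders, since their real implementations are not shown in this issue.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_txt(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def extract_pdf(path: Path) -> str:
    # Placeholder: the project's real PDF extractor is not shown here.
    raise NotImplementedError("PDF extraction is outside this sketch")


def extract_docx(path: Path) -> str:
    # Placeholder: the project's real DOCX extractor is not shown here.
    raise NotImplementedError("DOCX extraction is outside this sketch")


# Hypothetical registry: only formats that need specialized logic are listed.
SPECIALIZED_EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
}


def extract_content(path: Path) -> str:
    # Use the registered extractor if there is one; otherwise fall back
    # to plain-text extraction, including for files with no extension.
    extractor = SPECIALIZED_EXTRACTORS.get(path.suffix.lower(), extract_txt)
    return extractor(path)
```

With this shape, .md, .txt, and .log no longer need their own entries. One open design question is that `read_text` raises `UnicodeDecodeError` for files that are not valid UTF-8 (binary files, for instance), so the caller would probably need to decide whether to skip such files or report them.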

I can provide a PR if this solution is accepted.
