Is your feature request related to a problem? Please describe.
The file content filter currently supports the following file formats: .md, .txt, .log, .pdf, and .docx. Of these, only PDF and DOCX files require specialized extraction logic. The other three are treated as plain text (the file content is simply read).
The problem is that any other text file with a different extension cannot be processed, because its extension is not listed here (organize/organize/filters/filecontent.py, lines 82 to 88 in ac52034):
    EXTRACTORS: Dict[str, Callable[[Path], str]] = {
        ".md": extract_txt,
        ".txt": extract_txt,
        ".log": extract_txt,
        ".pdf": extract_pdf,
        ".docx": extract_docx,
    }
Many text files have unusual extensions or no extension at all, and there is no reason to exclude them from processing.
For example, I tried to search an XML file for a specific pattern, which is currently not possible.
Describe the solution you'd like
We could use a specialized extractor when one is registered for a given file format (currently PDF and DOCX) and fall back to the simple text extractor (extract_txt, lines 37 to 38 of organize/organize/filters/filecontent.py in ac52034) for all other files:
    def extract_txt(path: Path) -> str:
        return path.read_text(encoding="utf-8")
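To make the proposal concrete, here is a minimal sketch of how the fallback could look. It is only an illustration: the dispatch function name `extract_content` is hypothetical, `extract_pdf` and `extract_docx` are stubbed out because their real implementations are not shown in this issue, and lowercasing the suffix is my own assumption rather than existing behavior.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_txt(path: Path) -> str:
    # Plain-text extraction, as in the current code.
    return path.read_text(encoding="utf-8")


def extract_pdf(path: Path) -> str:
    # Stub standing in for the existing PDF extractor.
    raise NotImplementedError


def extract_docx(path: Path) -> str:
    # Stub standing in for the existing DOCX extractor.
    raise NotImplementedError


# Only formats that need specialized logic stay registered.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
}


def extract_content(path: Path) -> str:
    # Hypothetical dispatch: use a registered extractor if there is one,
    # otherwise fall back to reading the file as plain text.
    # path.suffix is "" for files without an extension, so those also
    # end up in the fallback.
    extractor = EXTRACTORS.get(path.suffix.lower(), extract_txt)
    return extractor(path)
```

With this in place, calling `extract_content(Path("config.xml"))` or `extract_content(Path("README"))` would read the file as UTF-8 text. One open point: a binary file with an unregistered extension would raise `UnicodeDecodeError` in `extract_txt`, so the filter would probably need to catch that and treat the file as a non-match.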
I can provide a PR if this solution is accepted.