Add support for other text formats in the file content filter #464

**Is your feature request related to a problem? Please describe.**
The file content filter currently supports the following file formats: .md, .txt, .log, .pdf, and .docx. Of these, only PDF and DOCX files require specialized extraction logic. The other three are treated as plain text (the file content is simply read as-is).

The problem is that a text file with any other extension cannot be processed, because its extension is not listed here: https://github.com/tfeldmann/organize/blob/ac520341a639a0bed6c55fd0c13604fcf927b666/organize/filters/filecontent.py#L82-L88

Many text files have unusual extensions, and some have no extension at all. There is no reason to exclude those files from processing.

For example, I tried to search an XML file for a specific pattern, but this is currently not possible.

**Describe the solution you'd like**
We could use a specialized extractor if it is registered for a given file format (currently PDF and DOCX) and use a simple text extractor (`extract_txt`) as a fallback for all other files:
https://github.com/tfeldmann/organize/blob/ac520341a639a0bed6c55fd0c13604fcf927b666/organize/filters/filecontent.py#L37-L38
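A minimal sketch of the fallback I have in mind. The lookup function name `extract_content` is hypothetical (it is not taken from the code base), and `extract_pdf` / `extract_docx` are stubbed here only to keep the example self-contained:

```python
from pathlib import Path
from typing import Callable, Dict


def extract_txt(path: Path) -> str:
    # Plain-text reader, same as the existing helper.
    return path.read_text(encoding="utf-8")


def extract_pdf(path: Path) -> str:
    # Stub for illustration; the real extractor lives in filecontent.py.
    raise NotImplementedError


def extract_docx(path: Path) -> str:
    # Stub for illustration; the real extractor lives in filecontent.py.
    raise NotImplementedError


# Only formats that need specialized handling remain registered.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
}


def extract_content(path: Path) -> str:
    # Hypothetical lookup: use the registered extractor if there is one,
    # otherwise fall back to reading the file as plain text.
    extractor = EXTRACTORS.get(path.suffix.lower(), extract_txt)
    return extractor(path)
```

With this lookup, `data.xml` or an extensionless `README` would go through `extract_txt`. One open point: `extract_txt` raises `UnicodeDecodeError` on files that are not valid UTF-8 (e.g. binaries), so the filter would probably need to decide how to treat those.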

I can provide a PR if this solution is accepted.

	EXTRACTORS: Dict[str, Callable[[Path], str]] = {
	    ".md": extract_txt,
	    ".txt": extract_txt,
	    ".log": extract_txt,
	    ".pdf": extract_pdf,
	    ".docx": extract_docx,
	}

	def extract_txt(path: Path) -> str:
	    return path.read_text(encoding="utf-8")
