@@ -192,7 +192,9 @@ def __init__(remove_empty_lines: bool = True,
              remove_regex: str | None = None,
              unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"]
              | None = None,
-             ascii_only: bool = False)
+             ascii_only: bool = False,
+             strip_whitespaces: bool = False,
+             replace_regexes: dict[str, str] | None = None)
 ```

 Initialize DocumentCleaner.
@@ -213,6 +215,12 @@ Note: This will run before any other steps.
 Will remove accents from characters and replace them with ASCII characters.
 Other non-ASCII characters will be removed.
 Note: This will run before any pattern matching or removal.
+- `strip_whitespaces`: If `True`, removes leading and trailing whitespace from the document content
+  using Python's `str.strip()`. Unlike `remove_extra_whitespaces`, this only affects the beginning
+  and end of the text, preserving internal whitespace (useful for markdown formatting).
+- `replace_regexes`: A dictionary mapping regex patterns to their replacement strings.
+  For example, `{r'\n\n+': '\n'}` replaces multiple consecutive newlines with a single newline.
+  This is applied after `remove_regex` and allows custom replacements instead of just removal.
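
The behaviour described for the two new options can be approximated with the standard library alone. The sketch below is not the DocumentCleaner implementation: `apply_cleaning_steps` is a hypothetical helper, and running the replacements before the final strip is an assumption, since the docstring only states that `replace_regexes` runs after `remove_regex`.

```python
import re


def apply_cleaning_steps(text, strip_whitespaces=False, replace_regexes=None):
    """Hypothetical stand-in for the two new options; not the library's code."""
    if replace_regexes:
        # Apply each pattern -> replacement pair in dictionary order.
        for pattern, replacement in replace_regexes.items():
            text = re.sub(pattern, replacement, text)
    if strip_whitespaces:
        # Only the two ends of the text are touched; internal spacing is kept.
        text = text.strip()
    return text


raw = "  # Title\n\n\n\nSome   spaced   text.\n\n"
cleaned = apply_cleaning_steps(
    raw,
    strip_whitespaces=True,
    replace_regexes={r"\n\n+": "\n"},
)
print(repr(cleaned))  # '# Title\nSome   spaced   text.'
```

The runs of spaces inside the second line survive, which is the difference from `remove_extra_whitespaces` that the docstring points out.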

 <a id="document_cleaner.DocumentCleaner.run"></a>
