Need to pass encoding to partition_csv() in langchain_community/document_loaders/csv_loader.py --> _get_elements #505

@mqslllloveddyy

Checked other resources

  • This is a bug, not a usage question.
  • I added a clear and descriptive title that summarizes this issue.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain Community rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain Community.
  • I read what a minimal reproducible example is (https://stackoverflow.com/help/minimal-reproducible-example).
  • I posted a self-contained, minimal, reproducible example. A maintainer can copy it and run it AS IS.

Example Code

from langchain_community.document_loaders import UnstructuredCSVLoader
import chardet

file_path = "docs/test.csv"
with open(file_path, 'rb') as f:
    raw_data = f.read()
encoding = chardet.detect(raw_data)['encoding']
print("Detected encoding:", encoding)

loader = UnstructuredCSVLoader(file_path=file_path, unstructured_kwargs={"encoding": encoding})
data = loader.load()
print(data)

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "E:\mycodes\ai_agent_practice\rag\data_load\demo-14.py", line 46, in <module>
    data = loader.load()
           ^^^^^^^^^^^^^
  File "E:\mycodes\ai_agent_practice\rag.venv\Lib\site-packages\langchain_core\document_loaders\base.py", line 43, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "E:\mycodes\ai_agent_practice\rag.venv\Lib\site-packages\langchain_community\document_loaders\unstructured.py", line 107, in lazy_load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "E:\mycodes\ai_agent_practice\rag.venv\Lib\site-packages\langchain_community\document_loaders\csv_loader.py", line 226, in _get_elements
    return partition_csv(filename=self.file_path, **self.unstructured_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\mycodes\ai_agent_practice\rag.venv\Lib\site-packages\unstructured\partition\common\metadata.py", line 161, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "E:\mycodes\ai_agent_practice\rag.venv\Lib\site-packages\unstructured\chunking\dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "E:\mycodes\ai_agent_practice\rag.venv\Lib\site-packages\unstructured\partition\csv.py", line 60, in partition_csv
    dataframe = pd.read_csv(file, header=ctx.header, sep=ctx.delimiter, encoding=ctx.encoding)
                                                         ^^^^^^^^^^^^^
  File "E:\mycodes\ai_agent_practice\rag.venv\Lib\site-packages\unstructured\utils.py", line 154, in get
    value = self._fget(obj)
            ^^^^^^^^^^^^^^^
  File "E:\mycodes\ai_agent_practice\rag.venv\Lib\site-packages\unstructured\partition\csv.py", line 125, in delimiter
    data = "\n".join(
           ^^^^^^^^^^
  File "E:\mycodes\ai_agent_practice\rag.venv\Lib\site-packages\unstructured\partition\csv.py", line 126, in <genexpr>
    ln.decode(self._encoding or "utf-8") for ln in file.readlines(num_bytes)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 3: invalid continuation byte
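
For reference, the final UnicodeDecodeError can be reproduced without either library. This is only a sketch with made-up bytes (the real contents of docs/test.csv are not shown here): 0xd0 sits at position 3 and is followed by a byte that is not a valid UTF-8 continuation byte. GBK is just an assumed example of a non-UTF-8 encoding that accepts the same bytes.

```python
# Illustrative bytes only: "id," followed by 0xd0 0xc2, which is not valid UTF-8.
sample = b"id,\xd0\xc2"

try:
    sample.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # Keep a reference, since `exc` is cleared when the except block ends.
    utf8_error = exc
    print(exc)

# An assumed legacy encoding (GBK here) decodes the same bytes without error.
decoded = sample.decode("gbk")
print(len(decoded))
```

So the traceback is consistent with the file being decoded as UTF-8 even though another encoding was requested.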

Description

When I checked the source code of both langchain_community and unstructured, I found that partition_csv() in unstructured does not read "encoding" from its kwargs, so the encoding I passed is ignored and the default "utf-8" is used. I then checked _get_elements() in langchain_community/document_loaders/csv_loader.py and found that it only passes self.unstructured_kwargs through unchanged, which appears to be the cause.
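
One thing worth ruling out first: if the loader's constructor collects extra options through a `**unstructured_kwargs` parameter (an assumption about the installed version, to be checked against its actual signature), then a keyword literally named `unstructured_kwargs` ends up nested one level down, and no top-level "encoding" key is forwarded. The sketch below uses `FakeLoader`, a hypothetical stand-in and not the real UnstructuredCSVLoader, to show only the Python keyword-argument behaviour.

```python
from typing import Any, Dict


class FakeLoader:
    """Hypothetical stand-in; NOT the real UnstructuredCSVLoader."""

    def __init__(self, file_path: str, **unstructured_kwargs: Any) -> None:
        self.file_path = file_path
        # Every extra keyword argument is collected into this dict.
        self.unstructured_kwargs: Dict[str, Any] = unstructured_kwargs


# Passing a dict under the name "unstructured_kwargs" nests it one level down.
nested = FakeLoader("docs/test.csv", unstructured_kwargs={"encoding": "gbk"})

# Passing encoding as a plain keyword keeps it at the top level.
flat = FakeLoader("docs/test.csv", encoding="gbk")

print(nested.unstructured_kwargs)
print(flat.unstructured_kwargs)
```

If the real constructor behaves like this stand-in, passing `encoding=encoding` directly to the loader may be enough to get the encoding forwarded.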

I hope the code can be updated as follows:

def _get_elements(self) -> List:
    from unstructured.partition.csv import partition_csv

    # Work on a copy and pop "encoding" so it is not passed twice
    # (once explicitly and once through **kwargs).
    kwargs = dict(self.unstructured_kwargs)
    input_encoding = kwargs.pop("encoding", None)
    return partition_csv(filename=self.file_path, encoding=input_encoding, **kwargs)
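
A caveat when forwarding encoding explicitly next to `**kwargs`: if the dict still contains an "encoding" key, Python raises a TypeError for the duplicated keyword. The sketch below shows this with `fake_partition_csv`, a hypothetical stand-in that only echoes its arguments (it is not the unstructured API). The `include_header` option is likewise just an illustrative extra key.

```python
from typing import Any, Dict, Optional


def fake_partition_csv(
    filename: str, encoding: Optional[str] = None, **kwargs: Any
) -> Dict[str, Any]:
    """Hypothetical stand-in for a partition function; it only echoes its arguments."""
    return {"filename": filename, "encoding": encoding, "extra": kwargs}


# Illustrative options; "include_header" is a placeholder extra key.
unstructured_kwargs = {"encoding": "gbk", "include_header": True}

# Passing encoding explicitly while the dict still holds it is rejected by Python.
try:
    fake_partition_csv(
        filename="docs/test.csv",
        encoding=unstructured_kwargs.get("encoding"),
        **unstructured_kwargs,
    )
    duplicate_error = None
except TypeError as exc:
    duplicate_error = exc

# Popping the key from a copy avoids the clash and leaves the original dict intact.
forwarded = dict(unstructured_kwargs)
input_encoding = forwarded.pop("encoding", None)
result = fake_partition_csv(
    filename="docs/test.csv", encoding=input_encoding, **forwarded
)
print(result)
```

Copying before popping keeps the loader's stored options unchanged, so repeated load() calls would see the same kwargs.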

Labels: bug
