Open
Description
Checked other resources
- This is a feature request, not a bug report or usage question.
- I added a clear and descriptive title that summarizes the feature request.
- I used the GitHub search to find a similar feature request and didn't find it.
- I checked the LangChain documentation and API reference to see if this feature already exists.
Feature Description
When I tried to load a GitBook page using GitbookLoader, I found that it could not select the main element with a CSS selector such as div#main. Although content_selector is documented as "The CSS selector for the content to load.", it appears to accept only an HTML element name.
Use Case
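from langchain_community.document_loaders import GitbookLoader

loader = GitbookLoader(
    web_page="https://www.gitbook.com/blog/improve-product-documentation-tips",
    content_selector="div#main",  # an id selector, not a bare tag name
    load_all_paths=False,
)
docs = loader.load()
print(len(docs))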
Proposed Solution
Use soup.select_one(self.content_selector) instead of soup.find(self.content_selector).
langchain-community/libs/community/langchain_community/document_loaders/gitbook.py
Line 369 in befdf57
page_content_raw = soup.find(self.content_selector)
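To illustrate the difference outside the loader, here is a minimal sketch using BeautifulSoup directly. The HTML snippet and variable names are made up for illustration and are not taken from gitbook.py; only the find and select_one calls mirror the proposal.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two <div> elements, only one of which is the content.
HTML = (
    '<html><body>'
    '<div id="sidebar">Navigation</div>'
    '<div id="main"><h1>Title</h1><p>Body text.</p></div>'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

# find() treats its first argument as a tag name, so it looks for an
# element literally named "div#main" and finds nothing.
found_by_name = soup.find("div#main")

# select_one() parses its argument as a CSS selector and returns the
# first matching element.
found_by_css = soup.select_one("div#main")

# A bare tag name is also a valid CSS selector, so plain element names
# behave the same with either call.
first_div_find = soup.find("div")
first_div_css = soup.select_one("div")

print(found_by_name)                               # None
print(found_by_css.get_text(" ", strip=True))      # Title Body text.
print(first_div_find["id"], first_div_css["id"])   # sidebar sidebar
```

Because a plain element name is itself a valid CSS selector, switching to select_one should keep existing content_selector values that are bare tag names working, while also allowing id, class, and combinator selectors.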
Alternatives Considered
No response
Additional Context
No response