The content_selector does not work well when configured as a CSS selector #515

@wForget

Checked other resources

  • This is a feature request, not a bug report or usage question.
  • I added a clear and descriptive title that summarizes the feature request.
  • I used the GitHub search to find a similar feature request and didn't find it.
  • I checked the LangChain documentation and API reference to see if this feature already exists.

Feature Description

When I tried to load a GitBook page with GitbookLoader, I found that it could not select the main element with a CSS selector such as div#main. The current description of content_selector is "The CSS selector for the content to load.", but in practice it only seems to accept an HTML element name.
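The behaviour can be reproduced with BeautifulSoup alone, without the loader. This is a minimal sketch that assumes bs4 is installed; the HTML string is made up for illustration and is not taken from a real GitBook page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a GitBook page.
html = (
    "<html><body>"
    '<div id="main"><p>Hello</p></div>'
    "<main><p>Other</p></main>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# find() treats its first positional argument as a tag name, so this
# searches for an element literally named "div#main".
by_find = soup.find("div#main")

# select_one() parses the same string as a CSS selector.
by_select = soup.select_one("div#main")

print(by_find)
print(by_select.get_text())
```

A bare element name such as main works with both calls, which is probably why the default configuration does not show the problem.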

Use Case

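    from langchain_community.document_loaders import GitbookLoader

    # content_selector is given as a CSS selector (tag plus id), not a bare tag name
    loader = GitbookLoader(
        web_page="https://www.gitbook.com/blog/improve-product-documentation-tips",
        content_selector="div#main",
        load_all_paths=False,
    )

    docs = loader.load()
    print(len(docs))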

Proposed Solution

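Use soup.select_one(self.content_selector) instead of soup.find(self.content_selector). find() treats its first argument as a tag name, so a value such as div#main matches nothing, whereas select_one() parses it as a CSS selector. The line in question is:

    page_content_raw = soup.find(self.content_selector)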
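As a rough sketch of what the proposed change could look like in isolation: extract_content below is a hypothetical stand-alone helper, not the loader's actual method, and the get_text handling is an assumption about how the selected node would be turned into text.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_content(soup: BeautifulSoup, content_selector: str) -> Optional[str]:
    """Hypothetical helper: return the text of the first element matching a CSS selector."""
    # select_one() accepts any CSS selector, including a bare tag name.
    node = soup.select_one(content_selector)
    if node is None:
        return None
    return node.get_text(separator="\n").strip()


# Made-up markup for illustration.
page = BeautifulSoup(
    '<main><h1>Title</h1><div id="main"><p>Body</p></div></main>',
    "html.parser",
)
print(extract_content(page, "main"))
print(extract_content(page, "div#main"))
```

Because a bare tag name is itself a valid CSS type selector, existing configurations that pass an element name should keep selecting the same element after the switch.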

Alternatives Considered

No response

Additional Context

No response
