
Commit c80336c

Upgrade to v1.6.0 and copy docs folder (#2764)
* Upgrade to v1.6.0 and copy docs folder
* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent 2a7c013 commit c80336c

File tree: 93 files changed (+32379 −2 lines)


VERSION.txt

Lines changed: 1 addition & 1 deletion

```diff
@@ -1 +1 @@
-1.5.1rc0
+1.6.0
```

docs/_src/api/openapi/openapi-1.6.0.json

Lines changed: 893 additions & 0 deletions
Large diffs are not rendered by default.

docs/_src/api/openapi/openapi.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -2,7 +2,7 @@
   "openapi": "3.0.2",
   "info": {
     "title": "Haystack REST API",
-    "version": "1.5.1rc0"
+    "version": "1.6.0"
   },
   "paths": {
     "/initialized": {
```
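The only substantive change here is the `info.version` field of the OpenAPI document; a minimal sketch (plain Python, no Haystack needed, with the spec reduced to just the fields shown in the diff) of reading it:

```python
import json

# Stand-in for docs/_src/api/openapi/openapi.json after this commit;
# only the fields touched by the diff are reproduced here.
spec = json.loads("""
{
  "openapi": "3.0.2",
  "info": {
    "title": "Haystack REST API",
    "version": "1.6.0"
  },
  "paths": {}
}
""")

# The field the diff bumps from 1.5.1rc0 to 1.6.0:
print(spec["info"]["version"])
```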

docs/v1.6.0/Makefile

Lines changed: 25 additions & 0 deletions

```diff
@@ -0,0 +1,25 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+
+SPHINXBUILD := sphinx-build
+MAKEINFO := makeinfo
+
+BUILDDIR := build
+SOURCE := _src/
+# SPHINXFLAGS := -a -W -n -A local=1 -d $(BUILDDIR)/doctree
+SPHINXFLAGS := -A local=1 -d $(BUILDDIR)/doctree
+SPHINXOPTS := $(SPHINXFLAGS) $(SOURCE)
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	$(SPHINXBUILD) -M $@ $(SPHINXOPTS) $(BUILDDIR)/$@
```

docs/v1.6.0/_src/api/Makefile

Lines changed: 20 additions & 0 deletions

```diff
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS ?=
+SPHINXBUILD ?= sphinx-build
+SOURCEDIR = .
+BUILDDIR = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
```
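Both Makefiles delegate every target to Sphinx's make mode through the catch-all `%: Makefile` rule; a small sketch of the command line a target such as `html` expands to (the helper function is hypothetical, but `sphinx-build -M <target> <sourcedir> <builddir>` is the invocation the second Makefile's rule performs):

```python
# Hypothetical helper mirroring the catch-all rule `%: Makefile` above:
# any make target becomes `sphinx-build -M <target> <sourcedir> <builddir>`.
def sphinx_make_mode_cmd(target, sourcedir=".", builddir="_build"):
    return ["sphinx-build", "-M", target, sourcedir, builddir]

print(" ".join(sphinx_make_mode_cmd("html")))  # sphinx-build -M html . _build
```

So `make html` in this directory would run `sphinx-build -M html . _build`.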
Lines changed: 29 additions & 0 deletions

```diff
@@ -0,0 +1,29 @@
+div.sphinxsidebarwrapper {
+  position: relative;
+  top: 0px;
+  padding: 0;
+}
+
+div.sphinxsidebar {
+  margin: 0;
+  padding: 0 15px 0 15px;
+  width: 210px;
+  float: left;
+  font-size: 1em;
+  text-align: left;
+}
+
+div.sphinxsidebar .logo {
+  font-size: 1.8em;
+  color: #0A507A;
+  font-weight: 300;
+  text-align: center;
+}
+
+div.sphinxsidebar .logo img {
+  vertical-align: middle;
+}
+
+div.sphinxsidebar .download a img {
+  vertical-align: middle;
+}
```
Lines changed: 46 additions & 0 deletions

```diff
@@ -0,0 +1,46 @@
+{# put the sidebar before the body #}
+{% block sidebar1 %}{{ sidebar() }}{% endblock %}
+{% block sidebar2 %}{% endblock %}
+
+{% block extrahead %}
+  <link href='https://fonts.googleapis.com/css?family=Open+Sans:300,400,700'
+        rel='stylesheet' type='text/css' />
+  {{ super() }}
+{#- if not embedded #}
+  <style type="text/css">
+    table.right { float: left; margin-left: 20px; }
+    table.right td { border: 1px solid #ccc; }
+    {% if pagename == 'index' %}
+    .related { display: none; }
+    {% endif %}
+  </style>
+  <script>
+    // intelligent scrolling of the sidebar content
+    $(window).scroll(function() {
+      var sb = $('.sphinxsidebarwrapper');
+      var win = $(window);
+      var sbh = sb.height();
+      var offset = $('.sphinxsidebar').position()['top'];
+      var wintop = win.scrollTop();
+      var winbot = wintop + win.innerHeight();
+      var curtop = sb.position()['top'];
+      var curbot = curtop + sbh;
+      // does sidebar fit in window?
+      if (sbh < win.innerHeight()) {
+        // yes: easy case -- always keep at the top
+        sb.css('top', $u.min([$u.max([0, wintop - offset - 10]),
+                              $(document).height() - sbh - 200]));
+      } else {
+        // no: only scroll if top/bottom edge of sidebar is at
+        // top/bottom edge of window
+        if (curtop > wintop && curbot > winbot) {
+          sb.css('top', $u.max([wintop - offset - 10, 0]));
+        } else if (curtop < wintop && curbot < winbot) {
+          sb.css('top', $u.min([winbot - sbh - offset - 20,
+                                $(document).height() - sbh - 200]));
+        }
+      }
+    });
+  </script>
+{#- endif #}
+{% endblock %}
```
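The scroll handler above pins the sidebar to the top of the viewport whenever it fits inside the window; a Python transcription of that branch, just to make the arithmetic explicit (the function and argument names are mine; the margin constants 10 and 200 are copied from the script):

```python
def pinned_sidebar_top(wintop, offset, doc_height, sidebar_height):
    # Mirrors the "yes: easy case -- always keep at the top" branch:
    # clamp the sidebar's top between 0 and the bottom page margin.
    return min(max(0, wintop - offset - 10), doc_height - sidebar_height - 200)
```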
Lines changed: 126 additions & 0 deletions

<a id="crawler"></a>

# Module crawler

<a id="crawler.Crawler"></a>

## Crawler

```python
class Crawler(BaseComponent)
```

Crawl texts from a website so that we can use them later in Haystack as a corpus for search, question answering, etc.

**Example:**
```python
|    from haystack.nodes.connector import Crawler
|
|    crawler = Crawler(output_dir="crawled_files")
|    # crawl Haystack docs, i.e. all pages that include haystack.deepset.ai/overview/
|    docs = crawler.crawl(urls=["https://haystack.deepset.ai/overview/get-started"],
|                         filter_urls= ["haystack\.deepset\.ai\/overview\/"])
```

<a id="crawler.Crawler.__init__"></a>

#### Crawler.\_\_init\_\_

```python
def __init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True, id_hash_keys: Optional[List[str]] = None, extract_hidden_text=True, loading_wait_time: Optional[int] = None)
```

Init the object with basic params for crawling (can be overwritten later).

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http(s) address(es) (can also be supplied later when calling crawl())
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
  0: Only the initial list of URLs;
  1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
  All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
  attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
  not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
  In this case the id will be generated by using the content and the defined metadata.
- `extract_hidden_text`: Whether to extract hidden text contained in the page,
  e.g. text inside a span with style="display: none"
- `loading_wait_time`: Seconds to wait for the page to load before scraping. Recommended when a page relies on
  dynamic DOM manipulations. Use carefully and only when needed, as it slows down crawling.
  E.g. 2: the Crawler waits 2 seconds before scraping each page

<a id="crawler.Crawler.crawl"></a>

#### Crawler.crawl

```python
def crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, id_hash_keys: Optional[List[str]] = None, extract_hidden_text: Optional[bool] = None, loading_wait_time: Optional[int] = None) -> List[Path]
```

Crawl URL(s), extract the text from the HTML, create a Haystack Document object out of it and save it (one JSON file per URL, including text and basic meta data).

You can optionally specify `filter_urls` to only crawl URLs that match a certain pattern.
All parameters are optional here and only meant to overwrite instance attributes at runtime.
If no parameters are provided to this method, the instance attributes that were passed during `__init__` will be used.

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http addresses or a single http address
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
  0: Only the initial list of URLs;
  1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
  All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
  attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
  not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
  In this case the id will be generated by using the content and the defined metadata.
- `loading_wait_time`: Seconds to wait for the page to load before scraping. Recommended when a page relies on
  dynamic DOM manipulations. Use carefully and only when needed, as it slows down crawling.
  E.g. 2: the Crawler waits 2 seconds before scraping each page

**Returns**:

List of paths where the crawled webpages got stored

<a id="crawler.Crawler.run"></a>

#### Crawler.run

```python
def run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, return_documents: Optional[bool] = False, id_hash_keys: Optional[List[str]] = None, extract_hidden_text: Optional[bool] = True, loading_wait_time: Optional[int] = None) -> Tuple[Dict[str, Union[List[Document], List[Path]]], str]
```

Method to be executed when the Crawler is used as a Node within a Haystack pipeline.

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http addresses or a single http address
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
  0: Only the initial list of URLs;
  1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
  All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content
- `return_documents`: Whether to return the JSON files' content as Documents instead of file paths
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
  attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
  not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
  In this case the id will be generated by using the content and the defined metadata.
- `extract_hidden_text`: Whether to extract hidden text contained in the page,
  e.g. text inside a span with style="display: none"
- `loading_wait_time`: Seconds to wait for the page to load before scraping. Recommended when a page relies on
  dynamic DOM manipulations. Use carefully and only when needed, as it slows down crawling.
  E.g. 2: the Crawler waits 2 seconds before scraping each page

**Returns**:

Tuple({"paths": List of filepaths, ...}, Name of output edge)
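As described for `filter_urls` above, a crawled URL is kept only if it matches at least one of the supplied regular expressions; a standalone sketch of that rule (pure `re`, no Haystack required; the pattern is the one from the Crawler example, and the real Crawler's matching details may differ):

```python
import re

# Keep a URL only if it matches at least one filter pattern, mirroring the
# documented behavior of the `filter_urls` argument above.
def keep_url(url, filter_urls):
    return any(re.search(pattern, url) for pattern in filter_urls)

filters = [r"haystack\.deepset\.ai\/overview\/"]
print(keep_url("https://haystack.deepset.ai/overview/get-started", filters))  # True
print(keep_url("https://example.com/", filters))  # False
```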
