Cleans/normalizes URLs and resolves relative links against a base URL. For example, if we are currently on https://github.com/ and we have the href /nishoof, cleaning it would give https://github.com/nishoof.
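The resolution step can be sketched with Go's standard net/url package. This is only an illustration: the helper name cleanHref is hypothetical, and the repository's own cleaning logic may normalize more than this does.

```go
package main

import (
	"fmt"
	"net/url"
)

// cleanHref is a hypothetical helper: it resolves href against base
// and returns the resulting absolute URL as a string.
func cleanHref(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference applies the RFC 3986 rules for relative references.
	return b.ResolveReference(h).String(), nil
}

func main() {
	u, err := cleanHref("https://github.com/", "/nishoof")
	if err != nil {
		panic(err)
	}
	fmt.Println(u) // https://github.com/nishoof
}
```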
crawl.go
Crawls the provided href: downloads the page, extracts its words and hrefs, cleans the hrefs, and then repeats the process on the new hrefs. Uses a queue and tracks visited URLs to avoid cycles. Returns a map from URLs to their extracted words. If an Index is provided, crawl also builds the index by calling the index's increment method.
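The queue-plus-visited-set loop described here can be sketched as a breadth-first traversal. The page type and the fetch callback below are assumptions that stand in for the real download, extract, and clean steps, so the example runs without a network.

```go
package main

import "fmt"

// page is a hypothetical stand-in for a downloaded and extracted page.
type page struct {
	words []string
	hrefs []string
}

// crawl visits pages breadth-first starting from seed. fetch is an assumed
// callback standing in for download + extract + clean; it reports false
// when a page cannot be retrieved.
func crawl(seed string, fetch func(url string) (page, bool)) map[string][]string {
	result := make(map[string][]string)
	visited := map[string]bool{seed: true}
	queue := []string{seed}
	for len(queue) > 0 {
		current := queue[0]
		queue = queue[1:]
		p, ok := fetch(current)
		if !ok {
			continue
		}
		result[current] = p.words
		for _, next := range p.hrefs {
			// Marking URLs as visited when they are enqueued prevents cycles.
			if !visited[next] {
				visited[next] = true
				queue = append(queue, next)
			}
		}
	}
	return result
}

func main() {
	site := map[string]page{
		"/a": {words: []string{"hello"}, hrefs: []string{"/b", "/a"}},
		"/b": {words: []string{"world"}, hrefs: []string{"/a"}},
	}
	fetch := func(url string) (page, bool) {
		p, ok := site[url]
		return p, ok
	}
	fmt.Println(len(crawl("/a", fetch))) // 2
}
```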
download.go
Downloads the contents of a web page using HTTP and returns a readable stream for further processing.
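A minimal sketch of such a download step, assuming a helper named download (hypothetical) built on net/http. The local test server exists only so the example runs without touching the real network.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// download is a hypothetical helper: it issues a GET request and returns
// the response body as a stream. The caller must close it.
func download(url string) (io.ReadCloser, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		resp.Body.Close()
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return resp.Body, nil
}

// fetchFromLocalServer serves one page from an in-process test server
// and reads it back through download.
func fetchFromLocalServer() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<p>hello</p>")
	}))
	defer srv.Close()

	body, err := download(srv.URL)
	if err != nil {
		return "", err
	}
	defer body.Close()
	data, err := io.ReadAll(body)
	return string(data), err
}

func main() {
	text, err := fetchFromLocalServer()
	if err != nil {
		panic(err)
	}
	fmt.Println(text) // <p>hello</p>
}
```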
extract.go
Extracts relevant unique words and hrefs, skipping unwanted elements like <style> and <script>.
index_in_memory.go
An in-memory inverted index (implementing the Index interface). It maps each word to a second map, which maps each document (URL) containing the word to the word's frequency in that document.
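The nested-map layout can be sketched as follows. The type name memIndex, the constructor, and the Increment signature are assumptions for illustration; only GetFrequency and an increment method are named in this document.

```go
package main

import "fmt"

// memIndex maps word -> document URL -> frequency.
type memIndex struct {
	freq map[string]map[string]int
}

func newMemIndex() *memIndex {
	return &memIndex{freq: make(map[string]map[string]int)}
}

// Increment records one occurrence of word in doc.
func (ix *memIndex) Increment(word, doc string) {
	docs, ok := ix.freq[word]
	if !ok {
		docs = make(map[string]int)
		ix.freq[word] = docs
	}
	docs[doc]++
}

// GetFrequency returns how often word occurs in doc (0 if never seen).
func (ix *memIndex) GetFrequency(word, doc string) int {
	// Reading from a missing inner map yields the zero value, so no check is needed.
	return ix.freq[word][doc]
}

func main() {
	ix := newMemIndex()
	ix.Increment("crawler", "https://example.com/a")
	ix.Increment("crawler", "https://example.com/a")
	ix.Increment("crawler", "https://example.com/b")
	fmt.Println(ix.GetFrequency("crawler", "https://example.com/a")) // 2
	fmt.Println(ix.GetFrequency("missing", "https://example.com/a")) // 0
}
```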
index_interface.go
Defines the interface for an index to search for documents using keywords. Provides methods including GetFrequency() which gets the frequency of a given word in a given document. The increment method should be called for every occurrence of every word.
index_sqlite.go
An SQLite-based inverted index (implementing the Index interface). Uses an SQLite database to store the index persistently on disk. Uses 3 tables: documents, words, and frequencies.
robots.go
Used by crawl to parse the robots.txt file of a website to make sure we're following its rules (including crawl delays and disallowed paths).
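A simplified sketch of robots.txt handling, under stated assumptions: it reads only the "User-agent: *" group, treats Disallow values as plain path prefixes, and ignores Allow lines and wildcards. The names rules, parseRobots, and allowed are hypothetical, and the repository's parser may handle more cases.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// rules holds the directives this sketch understands.
type rules struct {
	disallow []string
	delay    time.Duration
}

// parseRobots collects Disallow and Crawl-delay from the "User-agent: *" group.
func parseRobots(body string) rules {
	var r rules
	inGroup := false
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		key, value, found := strings.Cut(strings.TrimSpace(sc.Text()), ":")
		if !found {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		value = strings.TrimSpace(value)
		switch key {
		case "user-agent":
			inGroup = value == "*"
		case "disallow":
			// An empty Disallow value means nothing is disallowed.
			if inGroup && value != "" {
				r.disallow = append(r.disallow, value)
			}
		case "crawl-delay":
			if secs, err := strconv.ParseFloat(value, 64); inGroup && err == nil {
				r.delay = time.Duration(secs * float64(time.Second))
			}
		}
	}
	return r
}

// allowed reports whether path matches none of the disallowed prefixes.
func (r rules) allowed(path string) bool {
	for _, prefix := range r.disallow {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

func main() {
	r := parseRobots("User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")
	fmt.Println(r.allowed("/private/notes")) // false
	fmt.Println(r.allowed("/public"))        // true
	fmt.Println(r.delay)                     // 2s
}
```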
search.go
Searches the index for documents matching the provided words. Calculates a TF-IDF score for each document with at least one occurrence of the search word and returns a list of results (each containing the document URL, score, and number of occurrences).
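A sketch of a single-word search over a nested frequency map. The exact TF-IDF variant is not stated here, so the code assumes tf = occurrences / words in document and idf = ln(total documents / documents containing the word). The result type, the docLen map, and sorting by descending score are also assumptions.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// result mirrors the fields named in the description above.
type result struct {
	URL         string
	Score       float64
	Occurrences int
}

// search scores every document that contains word.
// freq maps word -> URL -> count; docLen maps every known URL -> total words.
func search(word string, freq map[string]map[string]int, docLen map[string]int) []result {
	docs := freq[word]
	if len(docs) == 0 {
		return nil
	}
	idf := math.Log(float64(len(docLen)) / float64(len(docs)))
	results := make([]result, 0, len(docs))
	for url, count := range docs {
		tf := float64(count) / float64(docLen[url])
		results = append(results, result{URL: url, Score: tf * idf, Occurrences: count})
	}
	// Highest score first; ties are broken by URL for a stable order.
	sort.Slice(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].URL < results[j].URL
	})
	return results
}

func main() {
	freq := map[string]map[string]int{"go": {"/a": 3, "/b": 1}}
	docLen := map[string]int{"/a": 10, "/b": 10, "/c": 10}
	for _, r := range search("go", freq, docLen) {
		fmt.Printf("%s %.3f %d\n", r.URL, r.Score, r.Occurrences)
	}
}
```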
stop.go
Checks if a word is a stop word. Stop words are common words that we should filter out.
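One common way to implement this check is a set lookup. The word list below is a tiny illustrative sample rather than the repository's list, and the case-insensitive comparison is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// stopWords is a small sample set; a real list would be much longer.
var stopWords = map[string]struct{}{
	"a": {}, "an": {}, "and": {}, "of": {}, "the": {}, "to": {},
}

// isStopWord reports whether word is in the set, ignoring case.
func isStopWord(word string) bool {
	_, ok := stopWords[strings.ToLower(word)]
	return ok
}

func main() {
	fmt.Println(isStopWord("The"))     // true
	fmt.Println(isStopWord("crawler")) // false
}
```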
tfidf.go
Implements the TF-IDF ranking to determine how relevant a document is to a search word.
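For reference, one common formulation of TF-IDF is shown below. The exact variant used in tfidf.go is not stated here, so treat the choice of logarithm and the absence of smoothing as assumptions. Here f_{w,d} is the number of times word w occurs in document d, N is the total number of documents, and n_w is the number of documents containing w.

```latex
\mathrm{tf}(w, d) = \frac{f_{w,d}}{\sum_{w'} f_{w',d}}, \qquad
\mathrm{idf}(w) = \ln\frac{N}{n_w}, \qquad
\mathrm{score}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
```

As a worked example under this variant: a word that occurs 3 times in a 10-word document, and appears in 2 of 3 documents overall, has tf = 0.3 and idf = ln(1.5) ≈ 0.405, for a score of about 0.122.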
About
Concurrent search engine with crawling, indexing, and ranking. Uses 1k+ worker threads, transactional batch flushes, journal-mode tuning, and other optimizations for performance.
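The worker-based concurrency mentioned here can be sketched as a fixed pool of goroutines reading from a shared channel. The function processAll, its work callback, and the mutex-guarded result map are assumptions for illustration; the project's actual worker design is not described in this document.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll fans urls out to n workers and collects one result per URL.
func processAll(urls []string, n int, work func(string) int) map[string]int {
	jobs := make(chan string)
	out := make(map[string]int, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				v := work(u)
				mu.Lock() // the map is shared, so writes are serialized
				out[u] = v
				mu.Unlock()
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	return out
}

func main() {
	urls := []string{"/a", "/bb", "/ccc"}
	lengths := processAll(urls, 2, func(u string) int { return len(u) })
	fmt.Println(lengths["/a"], lengths["/bb"], lengths["/ccc"]) // 2 3 4
}
```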