search-engine-nishoof

| File | Description |
| --- | --- |
| `clean.go` | Cleans and normalizes URLs, and resolves relative links against a base URL. For example, on the page `https://github.com/`, the href `/nishoof` is cleaned to `https://github.com/nishoof`. |
| `crawl.go` | Crawls starting from the provided href: downloads the page, extracts its words and hrefs, cleans the hrefs, then repeats the process for each new href. Uses a queue and tracks visited URLs to avoid cycles. Returns a map from URLs to their extracted words. If an Index is provided, crawl also builds the index by calling the index's increment method. |
| `download.go` | Downloads the contents of a web page over HTTP and returns a readable stream for further processing. |
| `extract.go` | Extracts the relevant unique words and hrefs from a page, skipping unwanted elements such as `<style>` and `<script>`. |
| `index_in_memory.go` | An in-memory inverted index (implementing the Index interface). Maps each word to a second map, which maps each document (URL) containing the word to the word's frequency in that document. |
| `index_interface.go` | Defines the interface for an index that can be searched for documents by keyword. Provides methods including `GetFrequency()`, which returns the frequency of a given word in a given document. The increment method should be called once for every occurrence of every word. |
| `index_sqlite.go` | An SQLite-based inverted index (implementing the Index interface). Stores the index persistently on disk in an SQLite database with 3 tables: `documents`, `words`, and `frequencies`. |
| `robots.go` | Used by crawl to parse a website's `robots.txt` file so that the crawler follows its rules (including crawl delays and disallowed paths). |
| `search.go` | Searches the index for documents matching the provided words. Calculates a TF-IDF score for each document with at least one occurrence of the search word and returns a list of results (each containing the document URL, score, and number of occurrences). |
| `stop.go` | Checks whether a word is a stop word. Stop words are common words that should be filtered out. |
| `tfidf.go` | Implements TF-IDF ranking to determine how relevant a document is to a search word. |
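
The relative-link resolution described for `clean.go` can be sketched with Go's standard `net/url` package. This is a minimal illustration, not the project's actual code: the function name `resolveHref` and its signature are assumptions, and the real `clean.go` may apply further normalization.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveHref is a hypothetical helper (not taken from clean.go).
// It resolves href against base and returns the absolute URL as a string.
func resolveHref(base, href string) (string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base %q: %w", base, err)
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", fmt.Errorf("parse href %q: %w", href, err)
	}
	// ResolveReference follows the RFC 3986 rules for relative references.
	return baseURL.ResolveReference(ref).String(), nil
}

func main() {
	cleaned, err := resolveHref("https://github.com/", "/nishoof")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cleaned) // https://github.com/nishoof
}
```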
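
The nested-map layout described for `index_in_memory.go` (word, then document URL, then frequency) might look roughly like the sketch below. The type name `InMemoryIndex`, the constructor, and the exact method signatures are assumptions made for illustration; only the existence of an increment method and `GetFrequency()` comes from the table above.

```go
package main

import "fmt"

// InMemoryIndex is a hypothetical stand-in for the in-memory inverted index.
// freqs maps word -> document URL -> number of occurrences.
type InMemoryIndex struct {
	freqs map[string]map[string]int
}

// NewInMemoryIndex returns an empty index (assumed constructor name).
func NewInMemoryIndex() *InMemoryIndex {
	return &InMemoryIndex{freqs: make(map[string]map[string]int)}
}

// Increment records one occurrence of word in doc.
// It is meant to be called once per occurrence of every word.
func (idx *InMemoryIndex) Increment(word, doc string) {
	docs, ok := idx.freqs[word]
	if !ok {
		docs = make(map[string]int)
		idx.freqs[word] = docs
	}
	docs[doc]++
}

// GetFrequency returns how often word occurs in doc, or 0 if unseen.
// Reading from a missing (nil) inner map is safe in Go and yields 0.
func (idx *InMemoryIndex) GetFrequency(word, doc string) int {
	return idx.freqs[word][doc]
}

func main() {
	idx := NewInMemoryIndex()
	doc := "https://example.com/a" // placeholder URL
	for _, w := range []string{"go", "search", "go"} {
		idx.Increment(w, doc)
	}
	fmt.Println(idx.GetFrequency("go", doc))   // 2
	fmt.Println(idx.GetFrequency("rust", doc)) // 0
}
```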
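
To make the TF-IDF scoring mentioned for `search.go` and `tfidf.go` concrete, here is a sketch of one common textbook variant: term frequency is the word's count divided by the document length, and inverse document frequency is the natural log of total documents over documents containing the word. The exact formula used in `tfidf.go` is not stated above, so the function name `tfidf`, its parameters, and the lack of smoothing are all assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// tfidf is a hypothetical scoring function (not taken from tfidf.go).
//   termCount    - occurrences of the word in the document
//   docLength    - total number of words in the document
//   docsWithTerm - number of documents containing the word
//   totalDocs    - number of documents in the index
func tfidf(termCount, docLength, docsWithTerm, totalDocs int) float64 {
	// Guard against division by zero; a word that never occurs scores 0.
	if termCount == 0 || docLength == 0 || docsWithTerm == 0 || totalDocs == 0 {
		return 0
	}
	tf := float64(termCount) / float64(docLength)
	idf := math.Log(float64(totalDocs) / float64(docsWithTerm))
	return tf * idf
}

func main() {
	// A word that appears 3 times in a 100-word document
	// and occurs in 10 of 1000 indexed documents.
	fmt.Printf("%.4f\n", tfidf(3, 100, 10, 1000))

	// A word present in every document carries no signal: log(1) = 0.
	fmt.Printf("%.4f\n", tfidf(3, 100, 1000, 1000))
}
```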

