Commit 0073587

Improves detection for generic bots (matomo-org#8236)
* Add TprAdsTxtCrawler to generic bots
* Adds detection for CopyvioDetector
* Improves detection for generic bots
* Improves detection for generic bots
* Improves regex
* Improves detection for generic bots
* Improves detection for generic bots
* Move Secweb-Sectxt
1 parent e0fff23 commit 0073587

2 files changed: +54 -31 lines changed

Tests/fixtures/bots.yml

Lines changed: 44 additions & 1 deletion
@@ -9059,7 +9059,8 @@
 -
   user_agent: Secweb-Sectxt/1.0
   bot:
-    name: Generic Bot
+    name: Secweb-Sectxt
+    category: Crawler
 -
   user_agent: parser
   bot:
@@ -9763,3 +9764,45 @@
     producer:
       name: DEED Baltic, UAB
       url: http://www.deed.lt/
+-
+  user_agent: TprAdsTxtCrawler/1.1
+  bot:
+    name: TprAdsTxtCrawler
+    category: Crawler
+-
+  user_agent: 'Mozilla/5.0 (compatible; CopyvioDetector/0.4.dev0; +wikipedia.earwig[at]gmail.com)'
+  bot:
+    name: CopyvioDetector
+    category: Crawler
+    url: https://copyvios.toolforge.org/
+    producer:
+      name: Ben Kurtovic
+      url: https://github.com/earwig
+-
+  user_agent: Mozilla/5.0 (compatible; Dormouse/1.0)
+  bot:
+    name: Dormouse
+    category: Crawler
+-
+  user_agent: Mozilla/5.0 (compatible; FormFinder/1.0)
+  bot:
+    name: FormFinder
+    category: Crawler
+-
+  user_agent: Mozilla/5.0 (compatible; ImportDomains/1.0)
+  bot:
+    name: ImportDomains
+    category: Crawler
+-
+  user_agent: YourUserAgentHere
+  bot:
+    name: Generic Bot
+-
+  user_agent: desktop
+  bot:
+    name: Generic Bot
+-
+  user_agent: StudyBot/1.0 (Research Purposes)
+  bot:
+    name: StudyBot
+    category: Crawler
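
Each fixture entry pairs a user agent with the bot name and category the parser is expected to report. The sketch below is one way to picture how the consolidated rule with `name: '$1'` in regexes/bots.yml could produce those names. It is an illustration only, not the library's implementation: the `RULES` list, the `detect` function, the first-match-wins ordering, and the case-insensitive search are all assumptions made for this example, and the CopyvioDetector rule is reduced to its name and category.

```python
import re

# Two rules taken from the updated regexes/bots.yml in this commit.
# Assumption for this sketch: rules are tried in order and the first match wins.
RULES = [
    {"regex": r"CopyvioDetector", "name": "CopyvioDetector", "category": "Crawler"},
    {
        "regex": (
            r"(ABEvalBot|AdsTxtCrawlerTP|DNSResearchBot|Dormouse|FormFinder"
            r"|GitCrawlerBot|HanaleiBot|ImportDomains|PrivacyPolicyBot|Robozilla"
            r"|Secweb-Sectxt|SeoCherryBot|StudyBot|The Knowledge AI|Thinkbot"
            r"|ThinkChaos|TprAdsTxtCrawler)"
        ),
        "name": "$1",
        "category": "Crawler",
    },
]


def detect(user_agent):
    """Return the name and category of the first matching rule, or None."""
    for rule in RULES:
        # Assumption: matching is an unanchored, case-insensitive search.
        match = re.search(rule["regex"], user_agent, re.IGNORECASE)
        if match is None:
            continue
        name = rule["name"]
        if "$1" in name:
            # Substitute the first capture group for the "$1" placeholder.
            name = name.replace("$1", match.group(1))
        return {"name": name, "category": rule["category"]}
    return None


# A few (user agent, expected name) pairs from Tests/fixtures/bots.yml.
FIXTURES = [
    ("Secweb-Sectxt/1.0", "Secweb-Sectxt"),
    ("TprAdsTxtCrawler/1.1", "TprAdsTxtCrawler"),
    ("Mozilla/5.0 (compatible; Dormouse/1.0)", "Dormouse"),
    ("StudyBot/1.0 (Research Purposes)", "StudyBot"),
]

for user_agent, expected_name in FIXTURES:
    result = detect(user_agent)
    assert result is not None and result["name"] == expected_name
```

Because the name comes from the capture group, adding a bot to the alternation is enough to give it its own name, which is why the separate single-bot rules could be dropped.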

regexes/bots.yml

Lines changed: 10 additions & 30 deletions
@@ -2507,10 +2507,6 @@
     name: 'Spotify'
     url: 'https://www.spotify.com'
 
-- regex: 'The Knowledge AI'
-  name: 'The Knowledge AI'
-  category: 'Crawler'
-
 - regex: 'Embedly'
   name: 'Embedly'
   category: 'Crawler'
@@ -2734,10 +2730,6 @@
   category: 'Service bot'
   url: 'https://www.grammarly.com'
 
-- regex: 'Robozilla'
-  name: 'Robozilla'
-  category: 'Crawler'
-
 - regex: 'Domains Project'
   name: 'Domains Project'
   category: 'Crawler'
@@ -3221,14 +3213,6 @@
   category: 'Site Monitor'
   url: 'https://www.dotcom-monitor.com'
 
-- regex: 'ThinkChaos/'
-  name: 'ThinkChaos'
-  category: 'Crawler'
-
-- regex: 'Thinkbot/'
-  name: 'Thinkbot'
-  category: 'Crawler'
-
 - regex: 'DataForSeoBot'
   name: 'DataForSeoBot'
   category: 'Crawler'
@@ -3454,14 +3438,6 @@
   category: 'Crawler'
   url: 'https://domaincrawler.com/about-us/'
 
-- regex: 'DNSResearchBot'
-  name: 'DNSResearchBot'
-  category: 'Crawler'
-
-- regex: 'GitCrawlerBot'
-  name: 'GitCrawlerBot'
-  category: 'Crawler'
-
 - regex: 'AdAuth'
   name: 'AdAuth'
   category: 'Crawler'
@@ -4675,10 +4651,6 @@
   name: 'WebMon'
   category: 'Site Monitor'
 
-- regex: 'AdsTxtCrawlerTP'
-  name: 'AdsTxtCrawlerTP'
-  category: 'Crawler'
-
 - regex: 'fragFINN'
   name: 'fragFINN'
   category: 'Crawler'
@@ -5836,12 +5808,20 @@
     name: 'Semantic Visions, s.r.o.'
     url: 'https://www.semantic-visions.com/'
 
+- regex: 'CopyvioDetector'
+  name: 'CopyvioDetector'
+  category: 'Crawler'
+  url: 'https://copyvios.toolforge.org/'
+  producer:
+    name: 'Ben Kurtovic'
+    url: 'https://github.com/earwig'
+
 # Generic bots
-- regex: '(ABEvalBot|HanaleiBot|PrivacyPolicyBot|SeoCherryBot)'
+- regex: '(ABEvalBot|AdsTxtCrawlerTP|DNSResearchBot|Dormouse|FormFinder|GitCrawlerBot|HanaleiBot|ImportDomains|PrivacyPolicyBot|Robozilla|Secweb-Sectxt|SeoCherryBot|StudyBot|The Knowledge AI|Thinkbot|ThinkChaos|TprAdsTxtCrawler)'
   name: '$1'
   category: 'Crawler'
 
-- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherx?web|kirkland-signature|LinkChain|survey-security-dot-txt|infrawatch|Time/|r00ts3c-owned-you|nvdorz|Root Slut|NiggaBalls|BotPoke|GlobalWebSearch|xx032_bo9vs83_2a|sslshed|geckotrail|Wordup|Keydrop|\(compatible\)|John Recon|SPARK COMMIT|masjesu|Komaru_The_Cat|Jesus Christ of Nazareth is LORD|Kowai|Hakai|LoliSec|LMAO|^xenu|^(?:chrome|firefox|Abcd|Dark|KvshClient|node|Node.js|Report Runner|url|Zeus|ZmEu)$|OnlyScans|TheInternetSearchx|Laravel Reaver|bang2013|libredtail|Mozilliqa|Tiberius|Secweb-Sectxt|honeygain|AW-WB-Filter|SaferSoftwashLeadGen'
+- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherx?web|kirkland-signature|LinkChain|survey-security-dot-txt|infrawatch|Time/|r00ts3c-owned-you|nvdorz|Root Slut|NiggaBalls|BotPoke|GlobalWebSearch|xx032_bo9vs83_2a|sslshed|geckotrail|Wordup|Keydrop|\(compatible\)|John Recon|SPARK COMMIT|masjesu|Komaru_The_Cat|Jesus Christ of Nazareth is LORD|Kowai|Hakai|LoliSec|LMAO|^xenu|^(?:chrome|desktop|firefox|Abcd|Dark|KvshClient|node|Node.js|Report Runner|url|Zeus|ZmEu)$|OnlyScans|TheInternetSearchx|Laravel Reaver|bang2013|libredtail|Mozilliqa|Tiberius|honeygain|AW-WB-Filter|SaferSoftwashLeadGen|YourUserAgentHere'
   name: 'Generic Bot'
 
 # Generic detections
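
The two tokens added to the "Generic Bot" pattern behave differently. `desktop` sits inside the anchored group `^(?:...)$`, so it only matches when it is the entire user agent, while `YourUserAgentHere` is an unanchored alternative and matches anywhere in the string. The sketch below shows that difference on a shortened excerpt of the pattern. The `GENERIC_BOT` and `is_generic_bot` names are made up for this example, and the plain case-insensitive search is an assumption; the real parser may apply extra boundary handling around each rule.

```python
import re

# Shortened excerpt of the "Generic Bot" pattern from this commit: one
# unanchored token plus part of the anchored whole-string group.
# Assumption: matching is a case-insensitive search over the user agent.
GENERIC_BOT = re.compile(
    r"YourUserAgentHere|^(?:chrome|desktop|firefox|node|url)$",
    re.IGNORECASE,
)


def is_generic_bot(user_agent):
    """Return True when the excerpted pattern matches the user agent."""
    return GENERIC_BOT.search(user_agent) is not None


# "desktop" must be the whole user agent to match.
assert is_generic_bot("desktop")
assert not is_generic_bot("Mozilla/5.0 (X11; Linux x86_64) desktop")

# "YourUserAgentHere" matches wherever it appears.
assert is_generic_bot("YourUserAgentHere")
assert is_generic_bot("python-requests/2.31 YourUserAgentHere")
```

Anchoring a common word such as `desktop` keeps it from flagging ordinary browser user agents that merely contain it.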
