Commit Graph

19 Commits

Author SHA1 Message Date
David Taylor
6417173082
DEV: Apply syntax_tree formatting to lib/* 2023-01-09 12:10:19 +00:00
Dan Ungureanu
4e46732346
FEATURE: Implement browser update in crawler view (#12448)
browser-update script does not work correctly in some very old browsers
because the contents of <noscript> is not accessible in JavaScript.
For these browsers, the server can display the crawler page and add the
browser update notice.

Simply loading the browser-update script in the crawler view is not a
solution because that means all crawlers will also see it.
2021-03-22 19:41:42 +02:00
Krzysztof Kotlarek
e0d9232259
FIX: use allowlist and blocklist terminology (#10209)
This is a PR of the renaming whitelist to allowlist and blacklist to the blocklist.
2020-07-27 10:23:54 +10:00
Dan Ungureanu
3ed6a0e904
FIX: Detect Wayback Machine using user agent (#9777) 2020-05-14 21:10:07 +10:00
Maja Komel
42809f4d69 FIX: use crawler layout when saving url in Wayback Machine (#7667) 2019-06-03 12:13:32 +10:00
Sam Saffron
30990006a9 DEV: enable frozen string literal on all files
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.

Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
2019-05-13 09:31:32 +08:00
Sam
f66efc601d FIX: cubot android devices were detected as crawlers 2018-06-21 10:56:46 +10:00
Neil Lalonde
ced7e9a691 FEATURE: control which web crawlers can access using a whitelist or blacklist 2018-03-22 15:41:02 -04:00
Sam
d7657d8e47 correct specs, ensure crawler layout only applies to html 2018-01-16 16:28:11 +11:00
Sam
7b562d2f46 FEATURE: much improved and simplified crawler detection
- phase one does it match 'trident|webkit|gecko|chrome|safari|msie|opera'
    yes- well it is possibly a browser

- phase two does it match 'rss|bot|spider|crawler|facebook|archive|wayback|ping|monitor'
    probably a crawler then

Based off: https://gist.github.com/SamSaffron/6cfad7ea3e6df321ffb7a84f93720a53
2018-01-16 15:41:45 +11:00
Sam
f6fdc1ebe8 FEATURE: flexible crawler detection
You can use the crawler user agents site setting to amend what user agents
are considered crawlers based on a string match in the user agent

Also improves performance of crawler detection slightly
2017-09-29 12:31:50 +10:00
mcmcclur
a307ad6517 Update crawler_detection.rb
Add HTTrack to the list of detected crawlers so that Discourse will serve vanilla HTML per https://meta.discourse.org/t/a-basic-discourse-archival-tool/62614/25
2017-05-16 11:17:05 -04:00
Robin Ward
2a4006fe0c Add YandexBot to our list of crawlers 2016-07-26 13:21:37 -04:00
Jeff Atwood
bbb1348118 add Swiftbot to crawler regex 2015-05-02 03:18:58 -07:00
Erick Guan
026cdd8fc3 FEATURE: add 360Spider UA to allow 360 crawl Discourse sites 2015-03-16 22:58:33 +08:00
Jeff Atwood
ceef06e771 add support for "Save Page Now" archive.org/web 2015-01-06 01:05:45 -08:00
riking
37dbc4b5e6 Add archive.org to crawler list to serve no-js to 2014-11-02 16:51:23 -08:00
Vikhyat Korrapati
e3702ecb30 Improved crawler detection: add Twitterbot, Facebook, curl, Bing, Baidu. 2014-03-16 19:30:20 +05:30
Robin Ward
c4b5455c21 REFACTOR: Rename GooglebotDetection to CrawlerDetection because we
will likely whitelist more crawlers in the future.
2014-02-20 16:07:02 -05:00