discourse/lib/crawler_detection.rb

module CrawlerDetection

  # added 'ia_archiver' based on https://meta.discourse.org/t/unable-to-archive-discourse-pages-with-the-internet-archive/21232
  # added 'Wayback Save Page' based on https://meta.discourse.org/t/unable-to-archive-discourse-with-the-internet-archive-save-page-now-button/22875
  # added 'Swiftbot' based on https://meta.discourse.org/t/how-to-add-html-markup-or-meta-tags-for-external-search-engine/28220
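  #
  # Builds one Regexp from a pipe-delimited list of user agent fragments.
  # Each fragment is escaped, so it is matched as a literal substring.
  def self.to_matcher(string)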
    escaped = string.split('|').map { |agent| Regexp.escape(agent) }.join('|')
    Regexp.new(escaped)
  end

  def self.crawler?(user_agent)
    # Cache one compiled matcher per setting value to avoid regenerating
    # the regex on every call.
    @matchers ||= {}
    matcher = (@matchers[SiteSetting.crawler_user_agents] ||= to_matcher(SiteSetting.crawler_user_agents))
    matcher.match?(user_agent)
  end
end
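
The matching and caching logic above can be tried outside Discourse with a minimal sketch like the one below. It assumes a hypothetical `SiteSettingStub` in place of the real `SiteSetting`, and the pipe-delimited agent list is an illustrative value only, not the actual default of the `crawler_user_agents` setting. `CrawlerDetectionSketch` is likewise a stand-in name so it does not clash with the real module.

```ruby
# Hypothetical stand-in for Discourse's SiteSetting (assumption for this sketch).
module SiteSettingStub
  class << self
    attr_accessor :crawler_user_agents
  end
  # Illustrative value only; the real default list may differ.
  self.crawler_user_agents = 'Googlebot|bingbot|ia_archiver|Wayback Save Page|360Spider|Swiftbot'
end

module CrawlerDetectionSketch
  # Escape each fragment so characters such as '.' or ' ' match literally,
  # then join the fragments back into a single alternation.
  def self.to_matcher(string)
    escaped = string.split('|').map { |agent| Regexp.escape(agent) }.join('|')
    Regexp.new(escaped)
  end

  def self.crawler?(user_agent)
    # One compiled Regexp is cached per distinct setting value.
    @matchers ||= {}
    setting = SiteSettingStub.crawler_user_agents
    matcher = (@matchers[setting] ||= to_matcher(setting))
    matcher.match?(user_agent)
  end
end

googlebot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
firefox   = 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'

puts CrawlerDetectionSketch.crawler?(googlebot) # true
puts CrawlerDetectionSketch.crawler?(firefox)   # false
```

Because the cache is keyed by the setting string, changing the setting produces a new matcher on the next call without any explicit invalidation. The match is a case-sensitive substring test, so a fragment such as `Googlebot` would not match a user agent that only contains `googlebot`.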

| Commit | Date | Line |
| --- | --- | --- |
| REFACTOR: Rename `GooglebotDetection` to `CrawlerDetection` because we will likely whitelist more crawlers in the future. | 2014-02-21 05:07:02 +08:00 | `module CrawlerDetection` |
| FEATURE: flexible crawler detection. You can use the crawler user agents site setting to amend what user agents are considered crawlers based on a string match in the user agent. Also improves performance of crawler detection slightly. | 2017-09-29 10:31:50 +08:00 | |
| add support for "Save Page Now" archive.org/web | 2015-01-06 17:05:45 +08:00 | `# added 'ia_archiver' based on https://meta.discourse.org/t/unable-to-archive-discourse-pages-with-the-internet-archive/21232` |
| | | `# added 'Wayback Save Page' based on https://meta.discourse.org/t/unable-to-archive-discourse-with-the-internet-archive-save-page-now-button/22875` |
| add Swiftbot to crawler regex | 2015-05-02 18:18:58 +08:00 | `# added 'Swiftbot' based on https://meta.discourse.org/t/how-to-add-html-markup-or-meta-tags-for-external-search-engine/28220` |
| FEATURE: flexible crawler detection (same commit as above) | 2017-09-29 10:31:50 +08:00 | `def self.to_matcher(string)` |
| | | `escaped = string.split('\|').map { \|agent\| Regexp.escape(agent) }.join('\|')` |
| | | `Regexp.new(escaped)` |
| | | `end` |
| FEATURE: add 360Spider UA to allow 360 crawl Discourse sites | 2015-02-14 22:24:51 +08:00 | |
| REFACTOR: Rename `GooglebotDetection` to `CrawlerDetection` (same commit as above) | 2014-02-21 05:07:02 +08:00 | `def self.crawler?(user_agent)` |
| FEATURE: flexible crawler detection (same commit as above) | 2017-09-29 10:31:50 +08:00 | `# this is done to avoid regenerating regexes` |
| | | `@matchers \|\|= {}` |
| | | `matcher = (@matchers[SiteSetting.crawler_user_agents] \|\|= to_matcher(SiteSetting.crawler_user_agents))` |
| | | `matcher.match?(user_agent)` |
| Detect Googlebot from user agent and use a different layout that doesn't load javascript | 2014-02-15 06:10:08 +08:00 | `end` |
| | | `end` |