discourse/lib/crawler_detection.rb

# frozen_string_literal: true

module CrawlerDetection
  WAYBACK_MACHINE_URL = "archive.org"

  def self.to_matcher(string, type: nil)
    escaped = string.split('|').map { |agent| Regexp.escape(agent) }.join('|')

    if type == :real && Rails.env.test?
      # In the test environment, treat the "Rails Testing" agent as a real
      # browser so that views render normally.
      escaped << "|Rails Testing"
    end

    Regexp.new(escaped, Regexp::IGNORECASE)
  end

  def self.crawler?(user_agent, via_header = nil)
    # A missing user agent, or any mention of the Wayback Machine, means crawler.
    return true if user_agent.nil?
    return true if user_agent.include?(WAYBACK_MACHINE_URL) || via_header&.include?(WAYBACK_MACHINE_URL)

    # Cache compiled regexes, keyed by the setting value, so they are not
    # rebuilt on every call.
    @non_crawler_matchers ||= {}
    @matchers ||= {}

    # Phase one: anything that does not look like a browser is a crawler.
    possibly_real = (@non_crawler_matchers[SiteSetting.non_crawler_user_agents] ||= to_matcher(SiteSetting.non_crawler_user_agents, type: :real))
    return true unless user_agent.match?(possibly_real)

    # Phase two: a browser-like agent is a crawler only if it also matches a
    # known bot pattern.
    known_bots = (@matchers[SiteSetting.crawler_user_agents] ||= to_matcher(SiteSetting.crawler_user_agents))
    return false unless user_agent.match?(known_bots)

    # Agents on the bypass list are exempt from the known bot check.
    bypass = (@matchers[SiteSetting.crawler_check_bypass_agents] ||= to_matcher(SiteSetting.crawler_check_bypass_agents))
    !user_agent.match?(bypass)
  end

  # Given a user_agent that returns true from crawler?, should its request be allowed?
  def self.allow_crawler?(user_agent)
    return true if SiteSetting.whitelisted_crawler_user_agents.blank? &&
      SiteSetting.blacklisted_crawler_user_agents.blank?

    @whitelisted_matchers ||= {}
    @blacklisted_matchers ||= {}

    if SiteSetting.whitelisted_crawler_user_agents.present?
      # A whitelist takes precedence: only matching agents are allowed, and a
      # missing user agent is rejected.
      whitelisted = @whitelisted_matchers[SiteSetting.whitelisted_crawler_user_agents] ||= to_matcher(SiteSetting.whitelisted_crawler_user_agents)
      !user_agent.nil? && user_agent.match?(whitelisted)
    else
      # With only a blacklist, everything is allowed except matching agents.
      blacklisted = @blacklisted_matchers[SiteSetting.blacklisted_crawler_user_agents] ||= to_matcher(SiteSetting.blacklisted_crawler_user_agents)
      user_agent.nil? || !user_agent.match?(blacklisted)
    end
  end

  def self.is_blocked_crawler?(user_agent)
    crawler?(user_agent) && !allow_crawler?(user_agent)
  end
end
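
The allow/deny decision in `allow_crawler?` can be tried outside Rails with a small stand-in. This is a minimal sketch, not the Discourse implementation: the module name `AllowListSketch`, the helper `build_matcher`, and the `allowed:` / `blocked:` keyword arguments are hypothetical replacements for the site settings. It uses `empty?` where the original uses ActiveSupport's `blank?` / `present?`, and it skips the regex caching.

```ruby
# Hypothetical stand-in for allow_crawler?; lists are passed in directly
# instead of being read from SiteSetting.
module AllowListSketch
  # Same idea as to_matcher: escape each '|'-separated entry and match
  # case-insensitively.
  def self.build_matcher(list)
    escaped = list.split("|").map { |agent| Regexp.escape(agent) }.join("|")
    Regexp.new(escaped, Regexp::IGNORECASE)
  end

  def self.allowed?(user_agent, allowed: "", blocked: "")
    # Nothing configured: every crawler is allowed.
    return true if allowed.empty? && blocked.empty?

    if !allowed.empty?
      # An allow list takes precedence; a missing user agent is rejected.
      !user_agent.nil? && user_agent.match?(build_matcher(allowed))
    else
      # With only a block list, a missing user agent is let through.
      user_agent.nil? || !user_agent.match?(build_matcher(blocked))
    end
  end
end

googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

puts AllowListSketch.allowed?(googlebot)                               # no lists configured
puts AllowListSketch.allowed?(googlebot, allowed: "Googlebot|Bingbot") # on the allow list
puts AllowListSketch.allowed?(googlebot, blocked: "googlebot")         # on the block list
puts AllowListSketch.allowed?(nil, allowed: "Googlebot")               # no user agent
```

Note the asymmetry for a `nil` user agent: it is rejected when an allow list is set but accepted when only a block list is set, which mirrors the two branches of `allow_crawler?`.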
Blame history for this file, oldest first (timestamps as shown in the blame view):

- 2014-02-15 06:10:08 +08:00: Detect Googlebot from user agent and use a different layout that doesn't load javascript. Attributed lines: the closing `end` of `crawler?` and the closing `end` of the module.
- 2014-02-21 05:07:02 +08:00: REFACTOR: Rename `GooglebotDetection` to `CrawlerDetection` because we will likely whitelist more crawlers in the future. Attributed lines: `module CrawlerDetection`.
- 2015-02-14 22:24:51 +08:00: FEATURE: add 360Spider UA to allow 360 crawl Discourse sites. Attributed lines: the blank line before `def self.crawler?`.
- 2017-09-29 10:31:50 +08:00: FEATURE: flexible crawler detection. You can use the crawler user agents site setting to amend what user agents are considered crawlers based on a string match in the user agent. Also improves performance of crawler detection slightly. Attributed lines: the `escaped = string.split('|')...` line and closing `end` of `to_matcher`, the comment about avoiding regex regeneration, and `@matchers ||= {}`.
- 2018-01-16 12:41:13 +08:00: FEATURE: much improved and simplified crawler detection. Phase one: does it match 'trident|webkit|gecko|chrome|safari|msie|opera'? If yes, it is possibly a browser. Phase two: does it match 'rss|bot|spider|crawler|facebook|archive|wayback|ping|monitor'? Then it is probably a crawler. Based off: https://gist.github.com/SamSaffron/6cfad7ea3e6df321ffb7a84f93720a53. Attributed lines: `Regexp.new(escaped, Regexp::IGNORECASE)`, `@non_crawler_matchers ||= {}`, the `possibly_real` test, the `known_bots` lookup, and the fallback that treats non-browser agents as crawlers.
- 2018-01-16 13:28:11 +08:00: correct specs, ensure crawler layout only applies to html. Attributed lines: `def self.to_matcher(string, type: nil)`, the "Rails Testing" test-environment bypass, and the `possibly_real` assignment.
- 2018-03-16 05:10:45 +08:00: FEATURE: control which web crawlers can access using a whitelist or blacklist. Attributed lines: all of `allow_crawler?` and `is_blocked_crawler?`.
- 2018-06-21 08:56:46 +08:00: FIX: cubot android devices were detected as crawlers. Attributed lines: the `known_bots` test and the `crawler_check_bypass_agents` check in `crawler?`.
- 2019-05-03 06:17:27 +08:00: DEV: enable frozen string literal on all files. This reduces chances of errors where consumers of strings mutate inputs and reduces memory usage of the app. Test suite passes now, but there may be some stuff left, so we will run a few sites on a branch prior to merging. Attributed lines: `# frozen_string_literal: true`.
- 2019-06-03 10:13:32 +08:00: FIX: use crawler layout when saving url in Wayback Machine (#7667). Attributed lines: `def self.crawler?(user_agent, via_header = nil)`.
- 2020-05-14 19:10:07 +08:00: FIX: Detect Wayback Machine using user agent (#9777). Attributed lines: the `WAYBACK_MACHINE_URL` constant and the early `return true` guard in `crawler?`.
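
The two-phase check described in the 2018-01-16 commit message can be sketched without Rails as follows. Treat this as an illustration under stated assumptions: the two pattern strings are copied from that commit message and may not match the current site setting defaults, the `cubot` bypass value is assumed from the title of the 2018-06-21 fix, and the names `TwoPhaseSketch` and `matcher` are hypothetical. Wayback Machine detection and regex caching are left out.

```ruby
# Hypothetical, Rails-free sketch of the two-phase crawler check.
module TwoPhaseSketch
  # Patterns quoted in the 2018-01-16 commit message (assumed, not read
  # from SiteSetting).
  BROWSER_HINTS = "trident|webkit|gecko|chrome|safari|msie|opera"
  CRAWLER_HINTS = "rss|bot|spider|crawler|facebook|archive|wayback|ping|monitor"
  # Assumed bypass list, based on the "cubot" fix in the history.
  BYPASS_HINTS = "cubot"

  # Escape each '|'-separated entry and build a case-insensitive regex.
  def self.matcher(list)
    Regexp.new(list.split("|").map { |s| Regexp.escape(s) }.join("|"), Regexp::IGNORECASE)
  end

  BROWSER = matcher(BROWSER_HINTS)
  CRAWLER = matcher(CRAWLER_HINTS)
  BYPASS = matcher(BYPASS_HINTS)

  def self.crawler?(user_agent)
    return true if user_agent.nil?
    # Phase one: no browser hint at all, so assume a crawler.
    return true unless user_agent.match?(BROWSER)
    # Phase two: browser-like, so only a crawler if a bot hint also matches.
    return false unless user_agent.match?(CRAWLER)
    # Bot hint matched; bypassed agents (e.g. Cubot phones) are still browsers.
    !user_agent.match?(BYPASS)
  end
end

chrome = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
cubot = "Mozilla/5.0 (Linux; Android 8.1.0; CUBOT_P20) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.126 Mobile Safari/537.36"
mobile_googlebot = "Mozilla/5.0 AppleWebKit/537.36 (compatible; Googlebot/2.1) Safari/537.36"

puts TwoPhaseSketch.crawler?(chrome)           # browser hints only
puts TwoPhaseSketch.crawler?(cubot)            # "bot" matches, but bypassed
puts TwoPhaseSketch.crawler?(mobile_googlebot) # browser hints plus "bot"
puts TwoPhaseSketch.crawler?("curl/8.4.0")     # no browser hints
```

Checking the browser hints first means an unknown or minimal client is treated as a crawler by default, while the bot patterns only need to distinguish crawlers that imitate browser user agents.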