discourse/app/controllers/robots_txt_controller.rb
Osama Sayegh 7bd3986b21
FEATURE: Replace Crawl-delay directive with proper rate limiting (#15131)
We have a couple of site settings, `slow_down_crawler_user_agents` and `slow_down_crawler_rate`, that are meant to allow site owners to signal to specific crawlers that they're crawling the site too aggressively and that they should slow down.

When a crawler is added to the `slow_down_crawler_user_agents` setting, Discourse currently adds a `Crawl-delay` directive for that crawler in `/robots.txt`. Unfortunately, many crawlers don't support the `Crawl-delay` directive in `/robots.txt`, which leaves site owners with no options if a crawler is crawling the site too aggressively.

This PR replaces the `Crawl-delay` directive with proper rate limiting for crawlers added to the `slow_down_crawler_user_agents` list. On every request made by a non-logged-in user, Discourse checks the User Agent string; if it contains one of the values in the `slow_down_crawler_user_agents` list, Discourse allows only 1 request every N seconds for that User Agent (N is the value of the `slow_down_crawler_rate` setting) and the rest of the requests made within the same interval get a 429 response.
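
The actual enforcement lives in Discourse's request handling and uses its Redis-backed rate limiter; the following is only a minimal, self-contained sketch of the matching and throttling behaviour described above (the class name, in-memory store, and return values are all illustrative, not the PR's code):

```ruby
# Illustrative sketch only: allow 1 request per rate_seconds for any request
# whose User Agent contains one of the configured crawler tokens.
class CrawlerRateLimiterSketch
  def initialize(slowed_agents:, rate_seconds:)
    @slowed_agents = slowed_agents.map(&:downcase)
    @rate_seconds = rate_seconds
    @last_allowed = {} # matched agent token => time of last allowed request
  end

  # Returns the HTTP status the request should get: 200 to proceed, 429 to reject.
  def check(user_agent)
    token = @slowed_agents.find { |agent| user_agent.to_s.downcase.include?(agent) }
    return 200 unless token # not a slowed-down crawler

    now = Time.now
    if @last_allowed[token].nil? || now - @last_allowed[token] >= @rate_seconds
      @last_allowed[token] = now
      200
    else
      429
    end
  end
end

limiter = CrawlerRateLimiterSketch.new(slowed_agents: ["badbot"], rate_seconds: 60)
limiter.check("BadBot/1.0 (+http://example.com/bot)") # => 200
limiter.check("BadBot/1.0 (+http://example.com/bot)") # => 429 within the same 60s window
```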

The `slow_down_crawler_user_agents` setting becomes quite dangerous with this PR since it could rate limit much, if not all, of the anonymous traffic if it's not used appropriately. To protect against this scenario, we've added a couple of new validations that run when the setting is changed (see the sketch after the list):

1) each value added to the setting must be 3 characters or longer
2) each value cannot be a substring of tokens found in popular browser User Agents. The current list of prohibited values is: apple, windows, linux, ubuntu, gecko, firefox, chrome, safari, applewebkit, webkit, mozilla, macintosh, khtml, intel, osx, os x, iphone, ipad and mac.
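
A hedged sketch of how these rules might be expressed (the token list is the one given above; the method name and error strings are hypothetical, not Discourse's site-setting validator):

```ruby
# Hypothetical validator sketch for the two rules above.
POPULAR_BROWSER_TOKENS = %w[
  apple windows linux ubuntu gecko firefox chrome safari applewebkit
  webkit mozilla macintosh khtml intel osx iphone ipad mac
] + ["os x"]

# Returns an error message for an invalid value, or nil if the value is acceptable.
def crawler_agent_error(value)
  value = value.downcase.strip
  return "must be at least 3 characters" if value.length < 3
  if POPULAR_BROWSER_TOKENS.any? { |token| token.include?(value) }
    return "would also match regular browser User Agents"
  end
  nil
end

crawler_agent_error("fox")     # => "would also match regular browser User Agents" ("fox" is inside "firefox")
crawler_agent_error("bingbot") # => nil (accepted)
```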
2021-11-30 12:55:25 +03:00


# frozen_string_literal: true

class RobotsTxtController < ApplicationController
  layout false

  skip_before_action :preload_json, :check_xhr, :redirect_to_login_if_required

  OVERRIDDEN_HEADER = "# This robots.txt file has been customized at /admin/customize/robots\n"

  # NOTE: order is important!
  DISALLOWED_PATHS ||= %w{
    /admin/
    /auth/
    /assets/browser-update*.js
    /email/
    /session
    /user-api-key
    /*?api_key*
    /*?*api_key*
  }

  DISALLOWED_WITH_HEADER_PATHS ||= %w{
    /badges
    /u/
    /my
    /search
    /tag/*/l
    /g
    /t/*/*.rss
    /c/*.rss
  }

  def index
    if (overridden = SiteSetting.overridden_robots_txt.dup).present?
      overridden.prepend(OVERRIDDEN_HEADER) if guardian.is_admin? && !is_api?
      render plain: overridden
      return
    end

    if SiteSetting.allow_index_in_robots_txt?
      @robots_info = self.class.fetch_default_robots_info
      render :index, content_type: 'text/plain'
    else
      render :no_index, content_type: 'text/plain'
    end
  end

  # If you are hosting Discourse in a subfolder, you will need to create your robots.txt
  # in the root of your web server with the appropriate paths. This method will return
  # JSON that can be used by a script to create a robots.txt that works well with your
  # existing site.
  def builder
    result = self.class.fetch_default_robots_info
    overridden = SiteSetting.overridden_robots_txt
    result[:overridden] = overridden if overridden.present?
    render json: result
  end

  def self.fetch_default_robots_info
    deny_paths_googlebot = DISALLOWED_PATHS.map { |p| Discourse.base_path + p }
    deny_paths = deny_paths_googlebot + DISALLOWED_WITH_HEADER_PATHS.map { |p| Discourse.base_path + p }
    deny_all = [ "#{Discourse.base_path}/" ]

    result = {
      header: "# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file",
      agents: []
    }

    if SiteSetting.allowed_crawler_user_agents.present?
      SiteSetting.allowed_crawler_user_agents.split('|').each do |agent|
        paths = agent == "Googlebot" ? deny_paths_googlebot : deny_paths
        result[:agents] << { name: agent, disallow: paths }
      end

      result[:agents] << { name: '*', disallow: deny_all }
    else
      if SiteSetting.blocked_crawler_user_agents.present?
        SiteSetting.blocked_crawler_user_agents.split('|').each do |agent|
          result[:agents] << { name: agent, disallow: deny_all }
        end
      end

      result[:agents] << { name: '*', disallow: deny_paths }
      result[:agents] << { name: 'Googlebot', disallow: deny_paths_googlebot }
    end

    result
  end
end
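
The `builder` action above returns the disallow rules as JSON so that subfolder installs can generate a robots.txt at the web-server root. A minimal consumer script might look like the sketch below; the `/robots-builder.json` path and the forum URL are assumptions to be adjusted to your routes and host. The JSON keys follow the hash built by `fetch_default_robots_info`.

```ruby
# Sketch of a script that turns the builder JSON into a robots.txt for a
# subfolder install. The /robots-builder.json path is an assumption here;
# confirm the route exposed by your Discourse version.
require 'net/http'
require 'json'
require 'uri'

info = JSON.parse(Net::HTTP.get(URI("https://forum.example.com/robots-builder.json")))

lines = [info["header"]]
info["agents"].each do |agent|
  lines << ""
  lines << "User-agent: #{agent["name"]}"
  agent["disallow"].each { |path| lines << "Disallow: #{path}" }
end

# Write the assembled file to the web server's document root as appropriate.
File.write("robots.txt", lines.join("\n") + "\n")
```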