Commit Graph

26 Commits

Author SHA1 Message Date
Sam Saffron
bb4e8899c4
FEATURE: let Google index pages so it can remove them
Google insists on indexing pages so it can figure out if they
can be removed from the index.

see: https://support.google.com/webmasters/answer/6332384?hl=en

This change ensures the we have special behavior for Googlebot
where we allow indexing, but block the actual indexing via
X-Robots-Tag
2020-05-11 12:15:18 +10:00
Sam Saffron
5feb342914 Revert "FEATURE: add Noindex to robots.txt for disallowed routes"
This reverts commit d84256a876.

This is not supported by Google and causes robots.txt to be flagged as
invalid

Removing Noindex
2019-07-30 11:33:38 +10:00
Sam
d84256a876 FEATURE: add Noindex to robots.txt for disallowed routes
This strips pages out of indexes that should not exist see:

https://meta.discourse.org/t/pages-listed-in-the-robots-txt-are-crawled-and-indexed-by-google/100309/11?u=sam
2018-11-02 16:39:47 +11:00
Robin Ward
3d7dbdedc0 FEATURE: An API to help sites build robots.txt files programatically
This is mainly useful for subfolder sites, who need to expose their
robots.txt contents to a parent site.
2018-04-16 15:43:20 -04:00
Sam
223379e21a per spec we need to repeat disallow paths per agent 2018-04-16 15:38:10 +10:00
Régis Hanol
1a9271dd2f add a warning in robots.txt when using subfolder 2018-04-12 00:00:15 +02:00
Régis Hanol
df7970a6f6 prefix the robots.txt rules with the directory when using subfolder 2018-04-11 22:05:02 +02:00
Sam
489c22d93c FEATURE: Disallow tags and categories rss feeds
This stops crawlers from hitting tags and category rss feeds to discover
new content, instead they should focus on latest/posts if they need to
consume something regular
2018-04-11 14:36:10 +10:00
Sam
f40f10240c FEATURE: remove topic rss from robots
Crawlers love hitting the rss feeds (confirmed that both Google and Bing do)

Experimenting with the impact of blocking these feeds and forcing Crawlers to hit
the content direct. It is better if they hit the actual page to start with as opposed to

1. Hit RSS feed
2. Find new content
3. Hit post link
4. Get canonical
5. Hit canonical

Lots of pointless work.

We do not know for sure what impact this will have on newsreader apps,
we will listen for feedback.
2018-04-11 11:57:52 +10:00
Sam
3a7b696703 FEATURE: allow for setting crawl delay per user agent
Also moved to default crawl delay bing so no more than a req every 5 seconds is allowed

New site settings:

"slow_down_crawler_user_agents" - list of crawlers that will be slowed down
"slow_down_crawler_rate" - how many seconds to wait between requests

Not enforced server side yet
2018-04-06 10:15:23 +10:00
Arpit Jalan
5e4dd20795 Revert "Prevent robots from indexing uploads"
This reverts commit 0fd622e5d1.
2018-04-02 21:29:29 +05:30
Neil Lalonde
ced7e9a691 FEATURE: control which web crawlers can access using a whitelist or blacklist 2018-03-22 15:41:02 -04:00
Dan Nicholson
0fd622e5d1 Prevent robots from indexing uploads
Although most user uploads are probably harmless, it's possible someone
has (either maliciously or not) uploaded sensitive information. Prevent
robots from indexing the uploads route.
2018-03-09 05:51:55 -06:00
Sam
e19ae6c55e FEATURE: disallow groups from being indexed 2018-03-02 13:38:30 +11:00
Robin Ward
0776340b29 SECURITY: Prevent robots from indexing more routes
These routes could contain sensitive material and should never be
indexed for content.
2018-02-04 13:24:36 -05:00
Guo Xiang Tan
77d4c4d8dc Fix all the errors to get our tests green on Rails 5.1. 2017-09-25 13:48:58 +08:00
Robin Ward
14410b71fb Convert server side paths to use /u/ 2017-03-30 10:23:24 -04:00
Vinoth Kannan
08c14dd689 new: server plugin outlet for indexable robots.txt 2017-02-13 17:31:10 +05:30
Neil Lalonde
ae671355da FIX: add /tags routes to robots.txt 2017-02-03 11:57:00 -05:00
Sam
54645261aa better disallow search ... this could get ugly 2015-04-02 17:08:00 +11:00
Robin Ward
e66c53a4a7 Add /badges to robots.txt for now, we don't have a crawlable view so
it's better to exclude it.
2014-10-30 14:32:42 -04:00
Neil Lalonde
8267a451b2 Disallow /users/ in robots.txt 2014-05-23 10:28:26 -04:00
Neil Lalonde
9c4dc9a966 Block browser-update.js in robots.txt. Move noscript block above everything else in application layout. 2014-02-14 15:33:00 -05:00
Sam
7ad00f426c FEATURE REMOVAL: persona login
see: https://meta.discourse.org/t/pulling-persona-out-of-discourse-core/12613
2014-02-11 16:56:48 +11:00
Neil Lalonde
88d9f3a786 Disallow auth callbacks in robots.txt 2014-01-14 10:42:22 -05:00
Sam Saffron
c50a9e4d01 added support for disabling indexing by google using SiteSetting.allow_index_in_robots_txt = false 2013-02-11 11:02:57 +11:00