When crawlers visit a post-specific URL like `/t/-/{topic-id}/{post-number}`, we use the canonical to direct them to the appropriate crawler-optimised paginated view (e.g. `?page=3`).
However, analysis of google results shows that the post-specific URLs are still being included in the index. Google doesn't tell us exactly why this is happening. However, as a general rule, 'A large portion of the duplicate page's content should be present on the canonical version'.
In our previous implementation, this wasn't 100% true all the time. That's because a request for a post-specific URL would include posts 'surrounding' that post, and won't exactly conform to the page boundaries which are used in the canonical version of the page. Essentially: in some cases, the content of the post-specific pages would include many posts which were not present on the canonical paginated version.
This commit aims to resolve that problem by simplifying the implementation. Instead of rendering posts surrounding the target post_number, we will only render the target post, and include a link to 'show post in topic'. With this new implementation, 100% of the post-specific page content will be present on the canonical paginated version, which will hopefully mean google reduces their indexing of the non-canonical post-specific pages.
To improve performance, we omit the basic-HTML version of pages when users are logged in, or when they are using a modern mobile device. This can be confusing when analysing the SEO of sites, so this commit adds a short static message when content is omitted.
Some attributes of the microdata schema `DiscussionForumPosting` are rendered in the context of the first post.
Ensure these attributes are also set if the first post is not part of the current view.
* Ensure consistent `datePublished` and remove `text` on second page in topic microdata schema
Always use `datePublished` from topic and never from `first_psot`. This ensures `datePublished` to be consistent on `first page` and `page=2+`.
No need to repeat `text` on `page=2+`. Especially do not set `text` on `page=2+` if it is only an abstract and thereby not 100% consistent with `text` on `first page`.
* Keep `text`attribute on follow-up pages
Previously, we used the schema type "DiscussionForumPosting" for all the posts including replies. This is not recommended as per Google search experts. This commit changes the schema type to "Comment" for replies.
This simplifies the crawler-linkback-list to only be a point of reference to the actual DiscussionForumPosting objects.
See "Summary page": https://developers.google.com/search/docs/advanced/structured-data/carousel?hl=en#summary-page
> [It] defines an ItemList, where each ListItem has only three properties: @type (set to ListItem), position (the position in the list), and url (the URL of a page with full details about that item).
This replaces the position declared as `#123` with the more simple version `123`.
The property position may be of type Integer or Text. A value of type Integer, or more precise of type Text which simply casts to integer, is sufficient here.
See: https://schema.org/position
In category-view the topic-list already uses this notation for the position of topics:
`<meta itemprop="position" content="123">`
Currently when generating a onebox for Discourse topics, some important
context is missing such as categories and tags.
This patch addresses this issue by introducing a new onebox engine
dedicated to display this information when available. Indeed to get this
new information, categories and tags are exposed in the topic metadata
as opengraph tags.
* FEATURE: use canonical links in posts.rss feed
Previously we used non canonical links in posts.rss
These links get crawled frequently by crawlers when discovering new
content forcing crawlers to hop to non canonical pages just to end up
visiting canonical pages
This uses up expensive crawl time and adds load on Discourse sites
Old links were of the form:
`https://DOMAIN/t/SLUG/43/21`
New links are of the form
`https://DOMAIN/t/SLUG/43?page=2#post_21`
This also adds a post_id identified element to crawler view that was
missing.
Note, to avoid very expensive N+1 queries required to figure out the
page a post is on during rss generation, we cache that information.
There is a smart "cache breaker" which ensures worst case scenario is
a "page drift" - meaning we would publicize a post is on page 11 when
it is actually on page 10 due to post deletions. Cache holds for up to
12 hours.
Change only impacts public post RSS feeds (`/posts.rss`)
The modification date should always be a meta tag to make this less confusing. Especially for imported posts.
That's more in line with how the rest of Discourse presents post dates.
* Do not show "Uncategorized" category in topics list.
* Use "BreadcrumbList" only if topic is in a category.
* Add tags list as keywords to the first post.
* Add "dateModified" even if it is the same with "datePublished".
* Show "crawler-linkback-list" only if there are links to be shown.
The data-vocabulary.org schema is being deprecated.
We're now using the BreadcrumList data from the latest and greatest schema.org.
FIX: categories_breadcrumb helper to support more than 2 levels of categories.
* Cleaning up crawler styles, improving some schema.org markup
* Cleaning up crawler styles, improving some schema.org markup
* additional styling
* add space for pagination
* show likes value in crawler view if count is > 0
* remove <hr> since horizontal line is already provided by css - this removes one of 2 horizontal lines in post crawler view