discourse/lib/retrieve_title.rb

# frozen_string_literal: true

module RetrieveTitle
  CRAWL_TIMEOUT = 1

  def self.crawl(url, max_redirects: nil, initial_https_redirect_ignore_limit: false)
    fetch_title(
      url,
      max_redirects: max_redirects,
      initial_https_redirect_ignore_limit: initial_https_redirect_ignore_limit,
    )
  rescue Net::ReadTimeout, FinalDestination::SSRFDetector::LookupFailedError
    # Read timeouts and failed DNS lookups from the SSRF check are expected
    # here; swallow them and return nil instead of raising.
    nil
  end

  def self.extract_title(html, encoding = nil)
    title = nil
    return nil if html =~ /<title>/ && html !~ %r{</title>}

    doc = nil
    begin
      doc = Nokogiri.HTML5(html, nil, encoding)
    rescue ArgumentError
      # invalid HTML (Eg: too many attributes, status tree too deep) - ignore
      # Error in nokogumbo is not specialized, uses generic ArgumentError
      # see: https://www.rubydoc.info/gems/nokogiri/Nokogiri/HTML5#label-Error+reporting
    end

    if doc
      title = doc.at("title")&.inner_text

      # A horrible hack - YouTube uses `document.title` to populate the title
      # for some reason. For any other site than YouTube this wouldn't be worth it.
      if title == "YouTube" && html =~ /document\.title *= *"(.*)";/
        title = Regexp.last_match[1].sub(/ - YouTube\z/, "")
      end

      if !title && node = doc.at('meta[property="og:title"]')
        title = node["content"]
      end
    end

    if title.present?
      title.gsub!(/\n/, " ")
      title.gsub!(/ +/, " ")
      title.strip!
      return title
    end
    nil
  end

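  # NOTE: a bare `private` does not change the visibility of `def self.`
  # methods, so the helpers below remain callable from outside the module.
  private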

  def self.max_chunk_size(uri)
    # Sites that leave the title until very late in the document get a larger
    # budget. All return values are in kilobytes.
    if uri.host =~
         /(^|\.)amazon\.(com|ca|co\.uk|es|fr|de|it|com\.au|com\.br|cn|in|co\.jp|com\.mx)\z/
      return 500
    end
    return 300 if uri.host =~ /(^|\.)youtube\.com\z/ || uri.host =~ /(^|\.)youtu\.be\z/
    return 50 if uri.host =~ /(^|\.)github\.com\z/

    # Default budget: 20 KB
    20
  end

  # Fetch the beginning of an HTML document at a URL and return its title, if any
  def self.fetch_title(url, max_redirects: nil, initial_https_redirect_ignore_limit: false)
    fd =
      FinalDestination.new(
        url,
        timeout: CRAWL_TIMEOUT,
        stop_at_blocked_pages: true,
        max_redirects: max_redirects,
        initial_https_redirect_ignore_limit: initial_https_redirect_ignore_limit,
      )

    current = nil
    title = nil
    encoding = nil

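    fd.get do |response, chunk, uri|
      unless Net::HTTPRedirection === response
        # A blank uri means the request failed; stop without a title.
        throw :done if uri.blank?

        # Accumulate the body so a title split across chunks can still be parsed.
        if current
          current << chunk
        else
          current = chunk
        end

        # Use the charset from the Content-Type header, but only if Ruby knows it.
        if !encoding && content_type = response["content-type"]&.strip&.downcase
          if content_type =~ /charset="?([a-z0-9_-]+)"?/
            encoding = Regexp.last_match(1)
            encoding = nil if !Encoding.list.map(&:name).map(&:downcase).include?(encoding)
          end
        end

        # Stop as soon as a title is found or the per-host size budget is exceeded.
        max_size = max_chunk_size(uri) * 1024
        title = extract_title(current, encoding)
        throw :done if title || max_size < current.length
      end
    end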
    title
  end
end
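
The charset check in `fetch_title` and the whitespace clean-up in `extract_title` can be tried in isolation. The sketch below is a standalone approximation that uses only the Ruby standard library. `charset_from_content_type` and `normalize_title` are hypothetical names that do not exist in this module, and `normalize_title` approximates ActiveSupport's `present?` with a plain `strip.empty?` check.

```ruby
# Hypothetical helper (not part of RetrieveTitle): extract a charset name from
# a Content-Type header value, returning nil unless Ruby lists an encoding
# with that primary name.
def charset_from_content_type(header)
  return nil if header.nil?

  match = header.strip.downcase.match(/charset="?([a-z0-9_-]+)"?/)
  return nil unless match

  name = match[1]
  known = Encoding.list.map { |enc| enc.name.downcase }
  known.include?(name) ? name : nil
end

# Hypothetical helper: the same clean-up steps extract_title applies to a
# non-blank title (newlines to spaces, collapse runs of spaces, strip).
def normalize_title(raw)
  return nil if raw.nil? || raw.strip.empty?

  raw.gsub(/\n/, " ").gsub(/ +/, " ").strip
end
```

For instance, `charset_from_content_type('text/html; charset="ISO-8859-1"')` yields `"iso-8859-1"`, while a header with an unrecognized charset yields `nil`, so `nil` is what would be passed on as the encoding.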

Change history (from the blame view, oldest first):

2017-07-22 03:29:04 +08:00  FEATURE: Whitelists for inline oneboxing
2017-08-03 02:27:21 +08:00  FEATURE: option to enable inline oneboxes for all domains Also, change to prefer title over open graph which is often way too sparse
2017-09-28 21:29:50 +08:00  FIX: Hack our title retriever so that it parses YouTube URLs
2018-01-29 12:36:52 +08:00  PERF: ability to crawl for titles without extra HEAD req Also, introduces a much more aggressive timeout for title crawling and introduces gzip to body that is crawled
2018-06-07 13:28:18 +08:00  Make rubocop happy again.
2019-05-03 06:17:27 +08:00  DEV: enable frozen string literal on all files This reduces chances of errors where consumers of strings mutate inputs and reduces memory usage of the app. Test suite passes now, but there may be some stuff left, so we will run a few sites on a branch prior to merging
2019-11-15 04:10:51 +08:00  DEV: Apply Rubocop redundant return style
2021-01-05 03:32:08 +08:00  FIX: Inline Onebox should use encoding from Content-Type header when present (#11625) * FIX: Inline onebox should use encoding from Content-Type header when present * Use Regexp.last_match(1) Signed-off-by: OsamaSayegh <asooomaasoooma90@gmail.com>
2021-06-24 22:23:39 +08:00  FIX: follow redirects for inline/mini onebox (#13512)
2021-09-03 15:45:58 +08:00  FIX: increase chunk size to fetch title tag correctly (#14144)
2022-02-10 05:53:27 +08:00  FIX: inline onebox for github (#15859) Increase size of downloaded HTML for Github when getting title for inline Onebox.
2022-03-23 02:13:27 +08:00  FIX: Do not raise if title cannot be crawled (#16247) If the crawled page returned an error, `FinalDestination#safe_get` yielded `nil` for `uri` and `chunk` arguments. Another problem is that `get` did not handle the case when `safe_get` failed and did not return the `location` and `set_cookie` headers.
2022-05-23 18:52:06 +08:00  FEATURE: Site setting for blocking onebox of URLs that redirect (#16881) Meta topic: https://meta.discourse.org/t/prevent-to-linkify-when-there-is-a-redirect/226964/2?u=osama. This commit adds a new site setting `block_onebox_on_redirect` (default off) for blocking oneboxes (full and inline) of URLs that redirect. Note that an initial http → https redirect is still allowed if the redirect location is identical to the source (minus the scheme of course). For example, if a user includes a link to `http://example.com/page` and the link resolves to `https://example.com/page`, then the link will onebox (assuming it can be oneboxed) even if the setting is enabled. The reason for this is a user may type out a URL (i.e. the URL is short and memorizable) with http and since a lot of sites support TLS with http traffic automatically redirected to https, so we should still allow the URL to onebox.
2022-06-10 03:30:22 +08:00  DEV: Supress logs when RetrieveTitle.crawl fails with Net::ReadTimeout errors (#16971) This PR changes the rescue block to rescue only Net::TimeoutError exceptions and removes the log line to prevent clutter the logs with errors that are ignored. Other errors can bubble up because they're errors we probably want to know about
2022-08-23 13:03:57 +08:00  FIX: ignore malformed HTML for title extraction (#18040) Certain HTML can be rejected by nokogumbo, specifically cases where there are enormous amounts of attributes This ensures that malformed HTML is simply skipped instead of leaking out an exception and terminating downstream processes.
2022-08-23 13:14:24 +08:00  DEV: improve comment (#18041) improves comment about caught exception to match implementation
2022-12-28 10:30:20 +08:00  FIX: Gracefully handle DNS issued from SSRF lookup when inline oneboxing (#19631) There is an issue where chat message processing breaks due to unhandles `SocketError` exceptions originating in the SSRF check, specifically in `FinalDestination::Resolver`. This change gives `FinalDestination::SSRFDetector` a new error class to wrap the `SocketError` in, and haves the `RetrieveTitle` class handle that error gracefully.
2023-01-09 20:10:19 +08:00  DEV: Apply syntax_tree formatting to `lib/*`
2023-01-21 02:52:49 +08:00  DEV: Prefer \A and \z over ^ and $ in regexes (#19936)