discourse/app/jobs/regular/crawl_topic_link.rb

# frozen_string_literal: true

require 'open-uri'
require 'nokogiri'
require 'excon'

module Jobs
  class CrawlTopicLink < ::Jobs::Base

    sidekiq_options queue: 'low'

    def execute(args)
      raise Discourse::InvalidParameters.new(:topic_link_id) unless args[:topic_link_id].present?

      topic_link = TopicLink.find_by(id: args[:topic_link_id], internal: false, crawled_at: nil)
      return if topic_link.blank?

      # Look for a topic embed for the URL. If it exists, use its title and don't crawl
      topic_embed = TopicEmbed.where(embed_url: topic_link.url).includes(:topic).references(:topic).first
      # The embed's topic may have been deleted; only reuse its title if it still exists
      if topic_embed&.topic
        TopicLink.where(id: topic_link.id).update_all(['title = ?, crawled_at = CURRENT_TIMESTAMP', topic_embed.topic.title[0..255]])
        return
      end

      begin
        crawled = false

        # Special case: Images
        # If the link is to an image, put the filename as the title
        if FileHelper.is_supported_image?(topic_link.url)
          uri = URI(topic_link.url)
          filename = File.basename(uri.path)
          crawled = (TopicLink.where(id: topic_link.id).update_all(["title = ?, crawled_at = CURRENT_TIMESTAMP", filename]) == 1)
        end

        unless crawled
          # Fetch the beginning of the document to find the title
          title = RetrieveTitle.crawl(topic_link.url)
          if title.present?
            crawled = (TopicLink.where(id: topic_link.id).update_all(['title = ?, crawled_at = CURRENT_TIMESTAMP', title[0..254]]) == 1)
          end
        end
      rescue StandardError
        # Connection or parsing errors leave the title unset; the ensure block still marks the link as crawled
      ensure
        # Stamp crawled_at even on failure so the find_by(crawled_at: nil) guard skips this link next time
        TopicLink.where(id: topic_link.id).update_all('crawled_at = CURRENT_TIMESTAMP') unless crawled
      end
    end

  end
end
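
The two title-handling details in `execute` (using the filename as the title for image links, and slicing titles with an inclusive range) can be sketched in isolation. This is a minimal sketch, not part of the job: `filename_title` and `truncate_title` are hypothetical helper names, the URL is made up, and the 255-character cap is an assumption, since the column limit is not shown in the source. It also shows that `[0..255]` keeps 256 characters while `[0..254]` keeps 255.

```ruby
require 'uri'

# Hypothetical helper (not in the job): derive a title from an image URL
# the same way the image special case does, using only the URL path.
def filename_title(url)
  File.basename(URI(url).path)
end

# Hypothetical helper: cap a title at `max` characters.
# Ranges with `..` are inclusive, so the upper bound is max - 1.
def truncate_title(title, max = 255)
  title[0..(max - 1)]
end

long_title = 'a' * 300

puts filename_title('https://example.com/uploads/cat.png?size=large') # query string is ignored
puts truncate_title(long_title).length # length after slicing with 0..254
puts long_title[0..255].length         # length after slicing with 0..255
```

If the stored title must never exceed a fixed length, the inclusive upper bound has to be one less than that length.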

Commit history (from blame, oldest first):

2014-04-06 02:47:25 +08:00  Support for crawling topic links
2014-04-08 02:38:18 +08:00  FIX: Don't crawl in test mode, raise correct exception when parameters are missing
2014-04-08 04:03:47 +08:00  FIX: Change crawl size to 10k. Youtube for example doesn't work with the first 1k
2014-04-11 01:19:38 +08:00  Use `update_all` to prevent `after_commit` from executing again.
2014-04-11 01:45:13 +08:00  Special case: When crawling a link to an image, just put the filename as the title.
2014-04-18 02:00:22 +08:00  If there's a `TopicEmbed` record for a url, we don't have to crawl it. This should help sites like Boing Boing where sometimes links are crawled before saved in WordPress.
2014-05-06 21:41:59 +08:00  Perform the where(...).first to find_by(...) refactoring. This refactoring was automated using the command: bundle exec "ruby refactorings/where_dot_first_to_find_by/app.rb"
2015-05-06 14:53:28 +08:00  Don't try loading embeds on deleted topics
2017-07-22 03:29:04 +08:00  FEATURE: Whitelists for inline oneboxing
2018-09-10 10:22:45 +08:00  Rename `FileHelper.is_image?` -> `FileHelper.is_supported_image?`.
2019-05-03 06:17:27 +08:00  DEV: enable frozen string literal on all files. This reduces chances of errors where consumers of strings mutate inputs and reduces memory usage of the app. Test suite passes now, but there may be some stuff left, so we will run a few sites on a branch prior to merging
2019-05-22 08:18:49 +08:00  PERF: move crawl_topic_links to the low queue. Crawling topic links can be somewhat delayed no need to run it in the default queue.
2019-10-02 12:01:53 +08:00  DEV: Upgrading Discourse to Zeitwerk (#8098). Zeitwerk simplifies working with dependencies in dev and makes it easier reloading class chains. We no longer need to use Rails "require_dependency" anywhere and instead can just use standard Ruby patterns to require files. This is a far reaching change and we expect some followups here.