discourse/app/jobs/scheduled/clean_up_uploads.rb

# frozen_string_literal: true

module Jobs
  class CleanUpUploads < ::Jobs::Scheduled
    every 1.hour

    def execute(args)
      grace_period = [SiteSetting.clean_orphan_uploads_grace_period_hours, 1].max

      # always remove invalid upload records
      Upload
        .by_users
        .where("retain_hours IS NULL OR created_at < current_timestamp - interval '1 hour' * retain_hours")
        .where("created_at < ?", grace_period.hour.ago)
        .where(url: "")
        .find_each(&:destroy!)

      return unless SiteSetting.clean_up_uploads?

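      # Throttle the expensive scan: skip it if the last full cleanup ran
      # less than half a grace period ago.
      if c = last_cleanup
        return if (Time.zone.now.to_i - c) < (grace_period / 2).hours
      end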

      # NOTE: these hostnames are computed but not referenced later in this method.
      base_url = Discourse.store.internal? ? Discourse.store.relative_base_url : Discourse.store.absolute_base_url
      s3_hostname = URI.parse(base_url).hostname
      s3_cdn_hostname = URI.parse(SiteSetting.Upload.s3_cdn_url || "").hostname

      result = Upload.by_users
        .where("uploads.retain_hours IS NULL OR uploads.created_at < current_timestamp - interval '1 hour' * uploads.retain_hours")
        .where("uploads.created_at < ?", grace_period.hour.ago)
        .where("uploads.access_control_post_id IS NULL")
        .joins("LEFT JOIN post_uploads pu ON pu.upload_id = uploads.id")
        .where("pu.upload_id IS NULL")
        .with_no_non_post_relations

      result.find_each do |upload|
        if upload.sha1.present?
          encoded_sha = Base62.encode(upload.sha1.hex)
          # Skip uploads still referenced (by sha1 or its Base62 form) in raw text.
          # Bind parameters are used instead of interpolating into the SQL string.
          next if ReviewableQueuedPost.pending.where("payload->>'raw' LIKE ? OR payload->>'raw' LIKE ?", "%#{upload.sha1}%", "%#{encoded_sha}%").exists?
          next if Draft.where("data LIKE ? OR data LIKE ?", "%#{upload.sha1}%", "%#{encoded_sha}%").exists?
          next if UserProfile.where("bio_raw LIKE ? OR bio_raw LIKE ?", "%#{upload.sha1}%", "%#{encoded_sha}%").exists?
          if defined?(ChatMessage)
            # TODO after May 2022 - remove this. No longer needed as chat uploads are in a table
            next if ChatMessage.where("message LIKE ? OR message LIKE ?", "%#{upload.sha1}%", "%#{encoded_sha}%").exists?
          end

          if defined?(ChatUpload)
            next if ChatUpload.where(upload: upload).exists?
          end
          upload.destroy
        else
          # No sha1: remove the row directly; `delete` skips callbacks.
          upload.delete
        end
      end

      ExternalUploadStub.cleanup!

      self.last_cleanup = Time.zone.now.to_i
    end

    def last_cleanup=(v)
      Discourse.redis.setex(last_cleanup_key, 7.days.to_i, v.to_s)
    end

    def last_cleanup
      v = Discourse.redis.get(last_cleanup_key)
      v ? v.to_i : v
    end

    def reset_last_cleanup!
      Discourse.redis.del(last_cleanup_key)
    end

    protected

    def last_cleanup_key
      "LAST_UPLOAD_CLEANUP"
    end

  end
end
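
The reference check inside `result.find_each` can be sketched outside of Rails. This is a minimal illustration, not the job's implementation: `base62_encode` and `referenced?` are hypothetical helpers, and the alphabet ordering below is an assumption that may differ from the `Base62` library the job calls. It only shows the idea that an upload counts as still in use when either its hex sha1 or the Base62 form of that sha1 appears in some raw text.

```ruby
# Assumed alphabet (digits, lowercase, uppercase); the real Base62 library
# used by the job may order its characters differently.
BASE62_ALPHABET = [*"0".."9", *"a".."z", *"A".."Z"].join.freeze

# Hypothetical helper: encode a non-negative integer in base 62.
def base62_encode(number)
  return BASE62_ALPHABET[0] if number.zero?

  digits = +""
  while number > 0
    number, remainder = number.divmod(62)
    digits.prepend(BASE62_ALPHABET[remainder])
  end
  digits
end

# Hypothetical helper: true if the text mentions the upload by hex sha1
# or by the Base62 encoding of that sha1.
def referenced?(text, sha1)
  text.include?(sha1) || text.include?(base62_encode(sha1.hex))
end

sha1 = "a" * 40 # placeholder digest, for illustration only
short = base62_encode(sha1.hex)
puts referenced?("draft with upload://#{short}.png", sha1)
puts referenced?("draft with no attachments", sha1)
```

In the job itself the same test is pushed into SQL with `LIKE` patterns, so the database does the substring search rather than Ruby.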
DEV: enable frozen string literal on all files This reduces chances of errors where consumers of strings mutate inputs and reduces memory usage of the app. Test suite passes now, but there may be some stuff left, so we will run a few sites on a branch prior to merging 2019-05-03 06:17:27 +08:00			`# frozen_string_literal: true`

added a job to clean up orphan uploads 2013-10-14 20:27:41 +08:00			`module Jobs`
DEV: Upgrading Discourse to Zeitwerk (#8098) Zeitwerk simplifies working with dependencies in dev and makes it easier reloading class chains. We no longer need to use Rails "require_dependency" anywhere and instead can just use standard Ruby patterns to require files. This is a far reaching change and we expect some followups here. 2019-10-02 12:01:53 +08:00			`class CleanUpUploads < ::Jobs::Scheduled`
FEATURE: new scheduler Removed sidetiq, introduced new scheduler - add basic UI - add schedule discover - add scheduling in initializer 2014-02-06 07:14:41 +08:00			`every 1.hour`
added a job to clean up orphan uploads 2013-10-14 20:27:41 +08:00
			`def execute(args)`
FIX: always delete invalid upload records 2018-06-05 00:40:57 +08:00			`grace_period = [SiteSetting.clean_orphan_uploads_grace_period_hours, 1].max`
make rubocop happy 2018-06-05 01:06:52 +08:00
FIX: always delete invalid upload records 2018-06-05 00:40:57 +08:00			`# always remove invalid upload records`
			`Upload`
FIX: Properly support defaults for upload site settings. 2019-01-02 15:29:17 +08:00			`.by_users`
FIX: always delete invalid upload records 2018-06-05 00:40:57 +08:00			`.where("retain_hours IS NULL OR created_at < current_timestamp - interval '1 hour' * retain_hours")`
			`.where("created_at < ?", grace_period.hour.ago)`
Upload.url can't be NULL 2018-06-05 00:43:00 +08:00			`.where(url: "")`
FIX: Avoid `destroy_all` in `Jobs::CleanUpUploads`. `destroy_all` loads the entire relation into memory at once. See https://github.com/rails/rails/issues/22510 2018-07-02 12:41:53 +08:00			`.find_each(&:destroy!)`
make rubocop happy 2018-06-05 01:06:52 +08:00
add a sitesetting to enable the CleanUpUploads job 2013-10-16 16:55:42 +08:00			`return unless SiteSetting.clean_up_uploads?`
added a job to clean up orphan uploads 2013-10-14 20:27:41 +08:00
PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 runs a day of this ultra expensive query. 2019-10-28 08:14:52 +08:00			`if c = last_cleanup`
			`return if (Time.zone.now.to_i - c) < (grace_period / 2).hours`
			`end`

FIX: the 'clean_up_uploads' job would delete images used in site settings when they were entered using absolute URLs, with the CDN, or simply in a different format than the one used in the database 2017-06-08 04:53:15 +08:00			`base_url = Discourse.store.internal? ? Discourse.store.relative_base_url : Discourse.store.absolute_base_url`
			`s3_hostname = URI.parse(base_url).hostname`
FEATURE: allow specifying s3 config via globals This refactors handling of s3 so it can be specified via GlobalSetting This means that in a multisite environment you can configure s3 uploads without actual sites knowing credentials in s3 It is a critical setting for situations where assets are mirrored to s3. 2017-10-06 13:20:01 +08:00			`s3_cdn_hostname = URI.parse(SiteSetting.Upload.s3_cdn_url \|\| "").hostname`
FIX: the 'clean_up_uploads' job would delete images used in site settings when they were entered using absolute URLs, with the CDN, or simply in a different format than the one used in the database 2017-06-08 04:53:15 +08:00
FIX: Properly support defaults for upload site settings. 2019-01-02 15:29:17 +08:00			`result = Upload.by_users`
			`.where("uploads.retain_hours IS NULL OR uploads.created_at < current_timestamp - interval '1 hour' * uploads.retain_hours")`
PERF: `NOT IN` query is really inefficient for large tables. 2016-11-02 11:14:02 +08:00			`.where("uploads.created_at < ?", grace_period.hour.ago)`
FEATURE: Secure media allowing duplicated uploads with category-level privacy and post-based access rules (#8664) ### General Changes and Duplication * We now consider a post `with_secure_media?` if it is in a read-restricted category. * When uploading we now set an upload's secure status straight away. * When uploading if `SiteSetting.secure_media` is enabled, we do not check to see if the upload already exists using the `sha1` digest of the upload. The `sha1` column of the upload is filled with a `SecureRandom.hex(20)` value which is the same length as `Upload::SHA1_LENGTH`. The `original_sha1` column is filled with the _real_ sha1 digest of the file. * Whether an upload `should_be_secure?` is now determined by whether the `access_control_post` is `with_secure_media?` (if there is no access control post then we leave the secure status as is). * When serializing the upload, we now cook the URL if the upload is secure. This is so it shows up correctly in the composer preview, because we set secure status on upload. ### Viewing Secure Media * The secure-media-upload URL will take the post that the upload is attached to into account via `Guardian.can_see?` for access permissions * If there is no `access_control_post` then we just deliver the media. This should be a rare occurrance and shouldn't cause issues as the `access_control_post` is set when `link_post_uploads` is called via `CookedPostProcessor` ### Removed We no longer do any of these because we do not reuse uploads by sha1 if secure media is enabled. * We no longer have a way to prevent cross-posting of a secure upload from a private context to a public context. * We no longer have to set `secure: false` for uploads when uploading for a theme component. 2020-01-16 11:50:27 +08:00			`.where("uploads.access_control_post_id IS NULL")`
PERF: `NOT IN` query is really inefficient for large tables. 2016-11-02 11:14:02 +08:00			`.joins("LEFT JOIN post_uploads pu ON pu.upload_id = uploads.id")`
			`.where("pu.upload_id IS NULL")`
DEV: Improve `script/downsize_uploads.rb` (#13508) * Only shrink images that are used in Posts and no other models * Don't save the upload if the size is the same 2021-06-24 06:09:40 +08:00			`.with_no_non_post_relations`
added a job to clean up orphan uploads 2013-10-14 20:27:41 +08:00
FIX: don't destroy uploads in queued posts and drafts 2016-08-02 00:35:57 +08:00			`result.find_each do \|upload\|`
FIX: always clean up uploads with no sha1 2017-11-14 17:56:10 +08:00			`if upload.sha1.present?`
			`encoded_sha = Base62.encode(upload.sha1.hex)`
FIX: Only consider pending queued posts for cleaning up uploads 2019-04-13 02:39:32 +08:00			`next if ReviewableQueuedPost.pending.where("payload->>'raw' LIKE '%#{upload.sha1}%' OR payload->>'raw' LIKE '%#{encoded_sha}%'").exists?`
FIX: always clean up uploads with no sha1 2017-11-14 17:56:10 +08:00			`next if Draft.where("data LIKE '%#{upload.sha1}%' OR data LIKE '%#{encoded_sha}%'").exists?`
FEATURE: Pull hotlinked images in user bios (#14726) 2021-10-29 22:58:05 +08:00			`next if UserProfile.where("bio_raw LIKE '%#{upload.sha1}%' OR bio_raw LIKE '%#{encoded_sha}%'").exists?`
DEV: Changes to support chat uploads (#15153) 2021-12-02 03:24:16 +08:00			`if defined?(ChatMessage)`
			`# TODO after May 2022 - remove this. No longer needed as chat uploads are in a table`
			`next if ChatMessage.where("message LIKE ? OR message LIKE ?", "%#{upload.sha1}%", "%#{encoded_sha}%").exists?`
			`end`

			`if defined?(ChatUpload)`
			`next if ChatUpload.where(upload: upload).exists?`
DEV: Do not clean up chat message uploads (#14267) 2021-09-08 02:33:48 +08:00			`end`
FIX: delete upload records when sha is null 2017-11-21 17:20:42 +08:00			`upload.destroy`
			`else`
			`upload.delete`
FIX: always clean up uploads with no sha1 2017-11-14 17:56:10 +08:00			`end`
PERF: Split queries when cleaning uploads. This reduces the number of scans that the db has to do in the query to fetch orphan uploads. Furthermore, we were not batching our records, which bloats memory. 2016-07-01 15:22:30 +08:00			`end`
PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 runs a day of this ultra expensive query. 2019-10-28 08:14:52 +08:00
FEATURE: Initial implementation of direct S3 uploads with uppy and stubs (#13787) This adds a few different things to allow for direct S3 uploads using uppy. These changes are still not the default. There are hidden `enable_experimental_image_uploader` and `enable_direct_s3_uploads` settings that must be turned on for any of this code to be used, and even if they are turned on only the User Card Background for the user profile actually uses uppy-image-uploader. A new `ExternalUploadStub` model and database table is introduced in this pull request. This is used to keep track of uploads that are uploaded to a temporary location in S3 with the direct to S3 code, and they are eventually deleted a) when the direct upload is completed and b) after a certain time period of not being used. ### Starting a direct S3 upload When an S3 direct upload is initiated with uppy, we first request a presigned PUT URL from the new `generate-presigned-put` endpoint in `UploadsController`. This generates an S3 key in the `temp` folder inside the correct bucket path, along with any metadata from the clientside (e.g. the SHA1 checksum described below). This will also create an `ExternalUploadStub` and store the details of the temp object key and the file being uploaded. Once the clientside has this URL, uppy will upload the file direct to S3 using the presigned URL. Once the upload is complete we go to the next stage. ### Completing a direct S3 upload Once the upload to S3 is done we call the new `complete-external-upload` route with the unique identifier of the `ExternalUploadStub` created earlier. Only the user who made the stub can complete the external upload. One of two paths is followed via the `ExternalUploadManager`. 1. If the object in S3 is too large (currently 100mb defined by `ExternalUploadManager::DOWNLOAD_LIMIT`) we do not download and generate the SHA1 for that file. 
Instead we create the `Upload` record via `UploadCreator` and simply copy it to its final destination on S3 then delete the initial temp file. Several modifications to `UploadCreator` have been made to accommodate this. 2. If the object in S3 is small enough, we download it. When the temporary S3 file is downloaded, we compare the SHA1 checksum generated by the browser with the actual SHA1 checksum of the file generated by ruby. The browser SHA1 checksum is stored on the object in S3 with metadata, and is generated via the `UppyChecksum` plugin. Keep in mind that some browsers will not generate this due to compatibility or other issues. We then follow the normal `UploadCreator` path with one exception. To cut down on having to re-upload the file again, if there are no changes (such as resizing etc) to the file in `UploadCreator` we follow the same copy + delete temp path that we do for files that are too large. 3. Finally we return the serialized upload record back to the client There are several errors that could happen that are handled by `UploadsController` as well. Also in this PR is some refactoring of `displayErrorForUpload` to handle both uppy and jquery file uploader errors. 2021-07-28 06:42:25 +08:00			`ExternalUploadStub.cleanup!`

PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 runs a day of this ultra expensive query. 2019-10-28 08:14:52 +08:00			`self.last_cleanup = Time.zone.now.to_i`
			`end`

			`def last_cleanup=(v)`
DEV: s/\$redis/Discourse\.redis (#8431) This commit also adds a rubocop rule to prevent global variables. 2019-12-03 17:05:53 +08:00			`Discourse.redis.setex(last_cleanup_key, 7.days.to_i, v.to_s)`
PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 runs a day of this ultra expensive query. 2019-10-28 08:14:52 +08:00			`end`

			`def last_cleanup`
DEV: s/\$redis/Discourse\.redis (#8431) This commit also adds a rubocop rule to prevent global variables. 2019-12-03 17:05:53 +08:00			`v = Discourse.redis.get(last_cleanup_key)`
PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 runs a day of this ultra expensive query. 2019-10-28 08:14:52 +08:00			`v ? v.to_i : v`
PERF: Split queries when cleaning uploads. This reduces the number of scans that the db has to do in the query to fetch orphan uploads. Furthermore, we were not batching our records, which bloats memory. 2016-07-01 15:22:30 +08:00			`end`
PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 runs a day of this ultra expensive query. 2019-10-28 08:14:52 +08:00
			`def reset_last_cleanup!`
DEV: s/\$redis/Discourse\.redis (#8431) This commit also adds a rubocop rule to prevent global variables. 2019-12-03 17:05:53 +08:00			`Discourse.redis.del(last_cleanup_key)`
PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 runs a day of this ultra expensive query. 2019-10-28 08:14:52 +08:00			`end`

			`protected`

			`def last_cleanup_key`
			`"LAST_UPLOAD_CLEANUP"`
			`end`

PERF: Split queries when cleaning uploads. This reduces the number of scans that the db has to do in the query to fetch orphan uploads. Furthermore, we were not batching our records, which bloats memory. 2016-07-01 15:22:30 +08:00			`end`
added a job to clean up orphan uploads 2013-10-14 20:27:41 +08:00			`end`
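
The throttling described in the "run expensive clean up uploads less frequently" commit can be worked through with plain integers. The sketch below is an illustration under stated assumptions, not code from the repository: `skip_cleanup?` is a hypothetical helper, timestamps are Unix seconds, and a `nil` last run stands in for a Redis key that was never set or has passed its 7-day expiry.

```ruby
SECONDS_PER_HOUR = 3600

# Hypothetical helper mirroring the guard in `execute`: skip the scan when
# the last run is more recent than half the grace period.
def skip_cleanup?(now:, last_cleanup:, grace_period_hours:)
  return false if last_cleanup.nil? # never ran, or the stored value expired

  grace = [grace_period_hours, 1].max
  min_gap = (grace / 2) * SECONDS_PER_HOUR # integer division, as in the job
  (now - last_cleanup) < min_gap
end

now = 1_700_000_000

# 48-hour grace period: the minimum gap between full scans is 24 hours.
puts skip_cleanup?(now: now, last_cleanup: now - 23 * SECONDS_PER_HOUR, grace_period_hours: 48) # true
puts skip_cleanup?(now: now, last_cleanup: now - 24 * SECONDS_PER_HOUR, grace_period_hours: 48) # false

# No stored timestamp: the scan always runs.
puts skip_cleanup?(now: now, last_cleanup: nil, grace_period_hours: 48) # false
```

With a very large grace period the computed gap exceeds 7 days, but because the stored timestamp expires after 7 days the `nil` branch applies and the scan still runs roughly weekly. Note also that a 1-hour grace period gives a gap of 0 because of the integer division, so the scan is never skipped in that case.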