The generate_presigned_put endpoint for direct external uploads
(such as the one for the uppy-image-uploader) records allowed
S3 metadata values on the uploaded object. We use this to store
the sha1-checksum generated by the UppyChecksum plugin, for later
comparison in ExternalUploadManager.
However, we were not doing this for the create_multipart endpoint,
so the checksum was never captured and compared correctly.
Also includes a fix to make sure UppyChecksum is the last preprocessor to run.
It is important that the UppyChecksum preprocessor is the last one to
be added; the preprocessors are run in order and since other preprocessors
may modify the file (e.g. the UppyMediaOptimization one), we need to
checksum once we are sure the file data has "settled".
Previously we had temp/ in the middle of the S3 key path like so
* /uploads/default/temp/randomstring/test.png (normal site)
* /sitename/uploads/default/temp/randomstring/test.png (s3 folder path site)
* /standard10/uploads/sitename/temp/randomstring/test.png (multisite site)
However this necessitates making a lifecycle rule to clean up incomplete
S3 multipart uploads for every site, something which we cannot do. It makes
much more sense to have a structure with /temp at the start of the key,
which is what this commit does:
* /temp/uploads/default/randomstring/test.png (normal site)
* /temp/sitename/uploads/default/randomstring/test.png (s3 folder path site)
* /temp/standard10/uploads/sitename/randomstring/test.png (multisite site)
This pull request introduces the endpoints required, and the JavaScript functionality in the `ComposerUppyUpload` mixin, for direct S3 multipart uploads. There are four new endpoints in the uploads controller:
* `create-multipart.json` - Creates the multipart upload in S3 along with an `ExternalUploadStub` record, storing information about the file in the same way as `generate-presigned-put.json` does for regular direct S3 uploads
* `batch-presign-multipart-parts.json` - Takes a list of part numbers and the unique identifier for an `ExternalUploadStub` record, and generates the presigned URLs for those parts if the multipart upload still exists and if the user has permission to access that upload
* `complete-multipart.json` - Completes the multipart upload in S3. Needs the full list of part numbers and their associated ETags which are returned when the part is uploaded to the presigned URL above. Only works if the user has permission to access the associated `ExternalUploadStub` record and the multipart upload still exists.
After we confirm the upload is complete in S3, we go through the regular `UploadCreator` flow, the same as `complete-external-upload.json`, and promote the temporary upload S3 into a full `Upload` record, moving it to its final destination.
* `abort-multipart.json` - Aborts the multipart upload on S3 and destroys the `ExternalUploadStub` record if the user has permission to access that upload.
Also added are a few new columns to `ExternalUploadStub`:
* multipart - Whether or not this is a multipart upload
* external_upload_identifier - The "upload ID" for an S3 multipart upload
* filesize - The size of the file when the `create-multipart.json` or `generate-presigned-put.json` is called. This is used for validation.
When the user completes a direct S3 upload, either regular or multipart, we take the `filesize` that was captured when the `ExternalUploadStub` was first created and compare it with the final `Content-Length` size of the file where it is stored in S3. Then, if the two do not match, we throw an error, delete the file on S3, and ban the user from uploading files for N (default 5) minutes. This would only happen if the user uploads a different file than what they first specified, or in the case of multipart uploads uploaded larger chunks than needed. This is done to prevent abuse of S3 storage by bad actors.
Also included in this PR is an update to vendor/uppy.js. This has been built locally from the latest uppy source at d613b849a6. This must be done so that I can get my multipart upload changes into Discourse. When the Uppy team cuts a proper release, we can bump the package.json versions instead.
This adds a few different things to allow for direct S3 uploads using uppy. **These changes are still not the default.** There are hidden `enable_experimental_image_uploader` and `enable_direct_s3_uploads` settings that must be turned on for any of this code to be used, and even if they are turned on only the User Card Background for the user profile actually uses uppy-image-uploader.
A new `ExternalUploadStub` model and database table is introduced in this pull request. This is used to keep track of uploads that are uploaded to a temporary location in S3 with the direct to S3 code, and they are eventually deleted a) when the direct upload is completed and b) after a certain time period of not being used.
### Starting a direct S3 upload
When an S3 direct upload is initiated with uppy, we first request a presigned PUT URL from the new `generate-presigned-put` endpoint in `UploadsController`. This generates an S3 key in the `temp` folder inside the correct bucket path, along with any metadata from the clientside (e.g. the SHA1 checksum described below). This will also create an `ExternalUploadStub` and store the details of the temp object key and the file being uploaded.
Once the clientside has this URL, uppy will upload the file direct to S3 using the presigned URL. Once the upload is complete we go to the next stage.
### Completing a direct S3 upload
Once the upload to S3 is done we call the new `complete-external-upload` route with the unique identifier of the `ExternalUploadStub` created earlier. Only the user who made the stub can complete the external upload. One of two paths is followed via the `ExternalUploadManager`.
1. If the object in S3 is too large (currently 100mb defined by `ExternalUploadManager::DOWNLOAD_LIMIT`) we do not download and generate the SHA1 for that file. Instead we create the `Upload` record via `UploadCreator` and simply copy it to its final destination on S3 then delete the initial temp file. Several modifications to `UploadCreator` have been made to accommodate this.
2. If the object in S3 is small enough, we download it. When the temporary S3 file is downloaded, we compare the SHA1 checksum generated by the browser with the actual SHA1 checksum of the file generated by ruby. The browser SHA1 checksum is stored on the object in S3 with metadata, and is generated via the `UppyChecksum` plugin. Keep in mind that some browsers will not generate this due to compatibility or other issues.
We then follow the normal `UploadCreator` path with one exception. To cut down on having to re-upload the file again, if there are no changes (such as resizing etc) to the file in `UploadCreator` we follow the same copy + delete temp path that we do for files that are too large.
3. Finally we return the serialized upload record back to the client
There are several errors that could happen that are handled by `UploadsController` as well.
Also in this PR is some refactoring of `displayErrorForUpload` to handle both uppy and jquery file uploader errors.
Uploading lots of small files can be made significantly faster by parallelizing the `s3.put_object` calls. In testing, an UPLOAD_CONCURRENCY of 10 made a large restore 10x faster. An UPLOAD_CONCURRENCY of 20 made the same restore 18x faster.
This commit is careful to parallelize as little as possible, to reduce the chance of concurrency issues. In the worker threads, no database transactions are performed. All modification of shared objects is controlled with a mutex.
Unfortunately we do not have any existing tests for the `ToS3Migration` class. This change has been tested with a large site backup (120k uploads totalling 45GB)
Discourse shouldn't dynamically calculate the path of uploads and optimized images after a file has been stored on disk or S3. Otherwise it might calculate the wrong path if the SHA1 or extension stored in the database doesn't match the actual file path.
Over the years we accrued many spelling mistakes in the code base.
This PR attempts to fix spelling mistakes and typos in all areas of the code that are extremely safe to change
- comments
- test descriptions
- other low risk areas
`GlobalSetting.relative_url_root` comes from the destination site. We
can't be sure whether it was the same on the original site. It's safer
to use a wildcard here, so we can backup/restore sites with different
relative_url_root values.
- Only initialize the S3Helper when needed
- Skip initializing the S3Helper for S3Store#cdn_url
- Allow cook_url to be passed a `local` hint to skip unnecessary checks
DEV: Replace instances of Discourse.base_uri with Discourse.base_path
This is clearer because the base_uri is actually just a path prefix. This continues the work started in 555f467.
The download link on the lightbox for images was not downloading the image if the upload was marked secure, because the code in the upload controller route was not respecting the dl=1 param for force download.
This PR fixes this so the download link works for secure images as well as regular ligthboxed images.
* strip out the href and xlink:href attributes from use element that
are _not_ anchors in svgs which can be used for XSS
* adding the content-disposition: attachment ensures that
uploaded SVGs cannot be opened and executed using the XSS exploit.
svgs embedded using an img tag do not suffer from the same exploit
`OptimizedImage#filesize` calls `Discourse.store.download` with an OptimizedImage as an argument. It would in turn attempt to call `#original_filename` and `#secure?` on that object. Both would fail as these methods do not exist on OptimizedImage, only on Upload. We didn't know about these issues because:
1. `#calculate_filesize` is not called often, because the filesize is saved on OptimizedImage creation, so it's used mostly for manual filesize recalculation
2. we were using `rescue nil` which swallows all errors
* Change S3Helper::DOWNLOAD_URL_EXPIRES_AFTER_SECONDS to 5 minutes, which controls presigned URL expiry and secure-media route cache time.
* This is done because of the composer preview refreshing while typing causes a lot of requests sent to our server because of the short URL expiry. If this ends up being not enough we can always increase the time or explore other avenues (e.g. GitHub has a 7 day validity for secure URLs)
Fixed bugs, added specs, extracted the upload downsizing code to a class, added support for non-S3 setups, changed it so that images aren't downloaded twice.
This code has been tested on production and successfully resized ~180k uploads.
Includes:
* DEV: Extract upload downsizing logic
* DEV: Add support for non-S3 uploads
* DEV: Process only images uploaded by users
* FIX: Incorrect usage of `count` and `exist?` typo
* DEV: Spec S3 image downsizing
* DEV: Avoid downloading images twice
* DEV: Update filesizes earlier in the process
* DEV: Return false on invalid upload
* FIX: Download images that currently above the limit (If the image size limit is decreased, then there was no way to resize those images that now fall outside the allowed size range)
* Update script/downsize_uploads.rb (Co-authored-by: Régis Hanol <regis@hanol.fr>)
This reverts commit 20780a1eee.
* SECURITY: re-adds accidentally reverted commit:
03d26cd6: ensure embed_url contains valid http(s) uri
* when the merge commit e62a85cf was reverted, git chose the 2660c2e2 parent to land on
instead of the 03d26cd6 parent (which contains security fixes)
In some cases, between Discourse forums the hostname of a URL could match if they are hosting S3 files on the same bucket but the S3 bucket path might not. So e.g. https://testbucket.somesite.com/testpath/some/file/url.png vs https://testbucket.somesite.com/prodpath/some/file/url.png. So has_been_uploaded? was returning true for the second URL, even though it may have been uploaded on a different Discourse forum.
This is a very rare case but must be accounted for, because this impacts UrlHelper.is_local which mistakenly thinks the file has already been downloaded and thus allows the URL to be cooked, where we want to return the full URL to be downloaded using PullHotlinkedImages.
Previously we would raise a warning in the logs if downloading
a file (from s3) takes longer than 60 seconds.
At scale this happens reasonably frequently.
1. Raised the duration to 3 minutes
2. Pulled the resizing mutex out of the downloading mutex
so we have less and clearer error logs
We have the `# frozen_string_literal: true` comment on all our
files. This means all string literals are frozen. There is no need
to call #freeze on any literals.
For files with `# frozen_string_literal: true`
```
puts %w{a b}[0].frozen?
=> true
puts "hi".frozen?
=> true
puts "a #{1} b".frozen?
=> true
puts ("a " + "b").frozen?
=> false
puts (-("a " + "b")).frozen?
=> true
```
For more details see: https://samsaffron.com/archive/2018/02/16/reducing-string-duplication-in-ruby
The `uplaods:migrate_to_s3` rake task should always use the environment variables, because you usually don't want to break your site's uploads during the migration. But restoring a backup should work with site settings as well as environment variables, otherwise you can't restore uploads to S3 from the web interface.
The rake task aborted the migration with "Already migrated" when all upload URLs linked to the correct S3 bucket even though the files didn't exist on S3. By removing the first check we force the rake task to check for the existance of uploads on S3.
A race condition issue is possible when multiple thread/processes are calling this method.
`ls` prints out to stderr "cannot access '...': No such file or directory" if any of the files it's currently trying to list are being removed by the `xargs rm -rf` in an another process. That doesn't affect the result, but it did raise an error before this change.
Tested on a production instance where the original issue was observed.
Co-Authored-By: Régis Hanol <regis@hanol.fr>
### General Changes and Duplication
* We now consider a post `with_secure_media?` if it is in a read-restricted category.
* When uploading we now set an upload's secure status straight away.
* When uploading if `SiteSetting.secure_media` is enabled, we do not check to see if the upload already exists using the `sha1` digest of the upload. The `sha1` column of the upload is filled with a `SecureRandom.hex(20)` value which is the same length as `Upload::SHA1_LENGTH`. The `original_sha1` column is filled with the _real_ sha1 digest of the file.
* Whether an upload `should_be_secure?` is now determined by whether the `access_control_post` is `with_secure_media?` (if there is no access control post then we leave the secure status as is).
* When serializing the upload, we now cook the URL if the upload is secure. This is so it shows up correctly in the composer preview, because we set secure status on upload.
### Viewing Secure Media
* The secure-media-upload URL will take the post that the upload is attached to into account via `Guardian.can_see?` for access permissions
* If there is no `access_control_post` then we just deliver the media. This should be a rare occurrance and shouldn't cause issues as the `access_control_post` is set when `link_post_uploads` is called via `CookedPostProcessor`
### Removed
We no longer do any of these because we do not reuse uploads by sha1 if secure media is enabled.
* We no longer have a way to prevent cross-posting of a secure upload from a private context to a public context.
* We no longer have to set `secure: false` for uploads when uploading for a theme component.
This PR introduces a new secure media setting. When enabled, it prevent unathorized access to media uploads (files of type image, video and audio). When the `login_required` setting is enabled, then all media uploads will be protected from unauthorized (anonymous) access. When `login_required`is disabled, only media in private messages will be protected from unauthorized access.
A few notes:
- the `prevent_anons_from_downloading_files` setting no longer applies to audio and video uploads
- the `secure_media` setting can only be enabled if S3 uploads are already enabled and configured
- upload records have a new column, `secure`, which is a boolean `true/false` of the upload's secure status
- when creating a public post with an upload that has already been uploaded and is marked as secure, the post creator will raise an error
- when enabling or disabling the setting on a site with existing uploads, the rake task `uploads:ensure_correct_acl` should be used to update all uploads' secure status and their ACL on S3
POSIX's `head` specification states: "The application shall ensure that the number option-argument is a positive decimal integer"
Negative values are supported on GNU `head`, so this works in the discourse docker image. However, in some environments (e.g. macOS), the system `head` version fails with a negative `n` parameter.
This commit does two things:
Checks the status at each stage of the pipe, so it cannot fail silently
Flip the `ls` command to list in descending time order, and use `tail -n +501` instead of `head -n -500`.
The visible result is that macOS users no longer see head: illegal line count -- -500 printed throughout the test suite.
Doing .pluck(:column).first is a very common pattern in Discourse and in
most cases, a limit cause isn't being added. Instead of adding a limit
clause to all these callsites, this commit adds two new methods to
ActiveRecord::Relation:
pluck_first, equivalent to limit(1).pluck(*columns).first
and pluck_first! which, like other finder methods, raises an exception
when no record is found