s3: add docs on data integrity

See: https://forum.rclone.org/t/help-me-figure-out-how-to-verify-backup-accuracy-and-completeness-on-s3/37632/5
Nick Craig-Wood 2023-04-18 16:01:09 +01:00
parent 965bf19065
commit 91c8f92ccb

@@ -435,6 +435,83 @@ If you are doing a server-side copy, you can also increase the number of transfe
You will need to experiment with these values to find the optimal settings for your setup.
### Data integrity
Rclone does its best to verify every part of an upload or download to
the S3 provider using various hashes.

Every HTTP transaction to/from the provider has an
`X-Amz-Content-Sha256` or a `Content-Md5` header to guard against
corruption of the HTTP body. The HTTP headers are protected by the
signature passed in the `Authorization` header.

All communication with the provider is done over HTTPS for encryption
and additional error protection.
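
For illustration, the values these headers carry can be reproduced with
standard tools (the file name below is just a placeholder): `Content-Md5`
is the base64 encoding of the binary MD5 of the body, and
`X-Amz-Content-Sha256` is the hex encoded SHA-256 of the body.

```
# Content-Md5: base64 encoding of the binary MD5 of the request body
openssl dgst -md5 -binary file.bin | base64

# X-Amz-Content-Sha256: hex encoded SHA-256 of the request body
sha256sum file.bin | cut -d' ' -f1
```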
#### Single part uploads
- Rclone sends single part uploads with a `Content-Md5` header
containing the MD5 hash read from the source. The provider checks this
is correct on receipt of the data.
- Rclone then does a HEAD request (disable with `--s3-no-head`) to
read back the `ETag`, which is the MD5 of the file, and checks it
against what it sent.
Note that if the source does not have an MD5 then single part
uploads will not have hash protection. In this case it is recommended
to use `--s3-upload-cutoff 0` so that all files are uploaded as multipart
uploads.
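
As a sketch of that recommendation (`remote:bucket` and the paths are
placeholders), you could force multipart uploads and then spot check a
file's MD5 afterwards:

```
# Force every file to be uploaded as a multipart upload so it gets
# hash protection even when the source has no MD5
rclone copy --s3-upload-cutoff 0 /path/to/data remote:bucket

# Spot check: compare the local MD5 with the one the provider stored
md5sum /path/to/data/file.bin
rclone md5sum remote:bucket/file.bin
```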
#### Multipart uploads
For files above `--s3-upload-cutoff` rclone splits the file into
multiple parts for upload.
- Each part is protected with both an `X-Amz-Content-Sha256` and a
`Content-Md5`
When rclone has finished uploading all the parts it completes
the upload by sending:

- The MD5 hash of each part
- The number of parts
- This info is all protected with an `X-Amz-Content-Sha256`

The provider checks the MD5s of all the parts it has received against
what rclone sent and, if they match, returns OK.
Rclone then does a HEAD request (disable with `--s3-no-head`) and
checks that the ETag is what it expects (in this case it should be the
MD5 sum of the MD5 sums of all the parts, with the number of parts
appended).
If the source has an MD5 sum then rclone will attach it as
`X-Amz-Meta-Md5chksum` metadata, since the `ETag` of a multipart upload
can't easily be checked against the file: the chunk size must be
known in order to calculate it.
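
As a rough illustration of that `ETag` construction (assuming GNU
coreutils and a part size matching the `--s3-chunk-size` rclone used,
5 MiB here), the multipart `ETag` can be recreated locally:

```
# Split the file into the same sized parts rclone uploaded (5 MiB here)
split -b 5M file.bin part_

# MD5 each part, concatenate the binary digests, then MD5 the result
for p in part_*; do md5sum "$p" | cut -d' ' -f1; done | xxd -r -p | md5sum

# The multipart ETag is the hash printed above followed by "-" and the
# number of parts, e.g. <hash>-3 for a file uploaded in 3 parts
```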
#### Downloads
Rclone checks the MD5 hash of the downloaded data against either the
`ETag` or the `X-Amz-Meta-Md5chksum` metadata (if present) which rclone
sets on multipart uploads.
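
A minimal manual check along the same lines (paths and remote names are
placeholders) is to download an object and compare its MD5 with the one
rclone reports for the remote:

```
# Download a single object, then compare hashes
rclone copyto remote:bucket/file.bin /tmp/file.bin
rclone md5sum remote:bucket/file.bin
md5sum /tmp/file.bin
```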
#### Further checking
At each stage rclone and the provider are sending and checking hashes of
**everything**. For extra security rclone deliberately HEADs each object
after upload to check it arrived safely (you can disable this with
`--s3-no-head`).
If you require further assurance that your data is intact you can use
`rclone check` to compare the local hashes against the remote ones, as
shown below.

If you are feeling ultimately paranoid use `rclone check --download`
which will download the files and check them against the local copies.
(Note that this doesn't use any disk to do so - the files are streamed
and compared in memory.)
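
For example (with `/path/to/data` and `remote:bucket` as placeholders):

```
# Compare hashes of the local files against the remote objects
rclone check /path/to/data remote:bucket

# Download every object and compare its contents against the local copy
# (streamed and compared in memory, not written to disk)
rclone check --download /path/to/data remote:bucket
```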
### Versions

When bucket versioning is enabled (this can be done with rclone with