From c68c919cea82f6183c6bf93f28961a10c814f022 Mon Sep 17 00:00:00 2001
From: Ivan Andreev
Date: Fri, 30 Aug 2019 17:18:40 +0300
Subject: [PATCH] docs: chunker documentation

---
 README.md                       |   1 +
 bin/make_manual.py              |   1 +
 docs/content/about.md           |   1 +
 docs/content/chunker.md         | 367 ++++++++++++++++++++++++++++++++
 docs/content/docs.md            |   1 +
 docs/layouts/chrome/navbar.html |   1 +
 6 files changed, 372 insertions(+)
 create mode 100644 docs/content/chunker.md

diff --git a/README.md b/README.md
index ca2617c78..1e5b30cff 100644
--- a/README.md
+++ b/README.md
@@ -76,6 +76,7 @@ Please see [the full list of all storage providers and their features](https://r
  * [Sync](https://rclone.org/commands/rclone_sync/) (one way) mode to make a directory identical
  * [Check](https://rclone.org/commands/rclone_check/) mode to check for file hash equality
  * Can sync to and from network, e.g. two different cloud accounts
+ * Optional large file chunking ([Chunker](https://rclone.org/chunker/))
  * Optional encryption ([Crypt](https://rclone.org/crypt/))
  * Optional cache ([Cache](https://rclone.org/cache/))
  * Optional FUSE mount ([rclone mount](https://rclone.org/commands/rclone_mount/))
diff --git a/bin/make_manual.py b/bin/make_manual.py
index 14bd63393..8b35b9f17 100755
--- a/bin/make_manual.py
+++ b/bin/make_manual.py
@@ -31,6 +31,7 @@ docs = [
  "b2.md",
  "box.md",
  "cache.md",
+ "chunker.md",
  "crypt.md",
  "dropbox.md",
  "ftp.md",
diff --git a/docs/content/about.md b/docs/content/about.md
index 93a799734..be083fb56 100644
--- a/docs/content/about.md
+++ b/docs/content/about.md
@@ -66,6 +66,7 @@ Features
  * Can sync to and from network, eg two different cloud accounts
  * [Encryption](/crypt/) backend
  * [Cache](/cache/) backend
+ * [Chunking](/chunker/) backend
  * [Union](/union/) backend
  * Optional FUSE mount ([rclone mount](/commands/rclone_mount/))
  * Multi-threaded downloads to local disk
diff --git a/docs/content/chunker.md b/docs/content/chunker.md
new file mode 100644
index 000000000..daf9fbf06
---
/dev/null
+++ b/docs/content/chunker.md
@@ -0,0 +1,367 @@
+---
+title: "Chunker"
+description: "Split-chunking overlay remote"
+date: "2019-08-30"
+---
+
+Chunker
+----------------------------------------
+
+The `chunker` overlay transparently splits large files into smaller chunks
+during upload to the wrapped remote and transparently assembles them back
+when the file is downloaded. This makes it possible to overcome size limits
+imposed by storage providers.
+
+To use it, first set up the underlying remote following the configuration
+instructions for that remote. You can also use a local pathname instead of
+a remote.
+
+First check your chosen remote is working - we'll call it `remote:path` here.
+Note that anything inside `remote:path` will be chunked and anything outside
+won't. This means that if you are using a bucket based remote (eg S3, B2, swift)
+then you should probably put the bucket in the remote `s3:bucket`.
+
+Now configure `chunker` using `rclone config`. We will call this one `overlay`
+to separate it from the `remote`.
+
+```
+No remotes found - make a new one
+n) New remote
+s) Set configuration password
+q) Quit config
+n/s/q> n
+name> overlay
+Type of storage to configure.
+Choose a number from below, or type in your own value
+[snip]
+XX / Transparently chunk/split large files
+   \ "chunker"
+[snip]
+Storage> chunker
+Remote to chunk/unchunk.
+Normally should contain a ':' and a path, eg "myremote:path/to/dir",
+"myremote:bucket" or maybe "myremote:" (not recommended).
+remote> remote:path
+Files larger than chunk_size will be split in chunks. By default 2 Gb.
+Enter a size with suffix k,M,G,T. Press Enter for the default ("2G").
+chunk_size> 1G
+Edit advanced config?
(y/n)
+y) Yes
+n) No
+y/n> n
+Remote config
+--------------------
+[overlay]
+type = chunker
+remote = remote:path
+chunk_size = 1G
+--------------------
+y) Yes this is OK
+e) Edit this remote
+d) Delete this remote
+y/e/d> y
+```
+
+### Specifying the remote
+
+In normal use, make sure the remote has a `:` in it. If you specify the remote
+without a `:` then rclone will use a local directory of that name.
+So if you use a remote of `/path/to/secret/files` then rclone will
+chunk stuff in that directory. If you use a remote of `name` then rclone
+will put files in a directory called `name` in the current directory.
+
+
+### Chunking
+
+When rclone starts a file upload, chunker checks the file size.
+If it doesn't exceed the configured chunk size, chunker will just pass it
+to the wrapped remote. If a file is larger, chunker will transparently cut
+the data into pieces with temporary names and stream them one by one, on the fly.
+Each chunk will contain the specified number of data bytes, except for the
+last one which may have less data. If the file size is unknown in advance
+(this is called a streaming upload), chunker will internally create
+a temporary copy, record its size and repeat the above process.
+When the upload completes, the temporary chunk files are finally renamed.
+This scheme guarantees that operations appear atomic from the outside.
+A similar method with hidden temporary chunks is used for other operations
+(copy/move/rename etc). If an operation fails, hidden chunks are normally
+destroyed, and the destination composite file stays intact.
+
+#### Chunk names
+
+By default chunk names are `BIG_FILE_NAME.rclone_chunk.001`,
+`BIG_FILE_NAME.rclone_chunk.002` etc, because the default chunk name
+format is `*.rclone_chunk.###`. You can configure another name format
+using the `--chunker-name-format` option. The format uses an asterisk
+`*` as a placeholder for the base file name and one or more consecutive
+hash characters `#` as a placeholder for the chunk number.
There must be
+one and only one asterisk. The number of consecutive hashes defines the
+minimum length of a string representing a chunk number. If a chunk number
+has fewer digits than the number of hashes, it is left-padded with zeros.
+If there are more digits in the number, they are left as is.
+By default numbering starts from 1 but the `--chunker-start-from` option
+allows it to start from 0, eg. for compatibility with legacy software.
+
+For example, if the name format is `big_*-##.part`, the original file is
+named `data.txt` and numbering starts from 0, then the first chunk will be
+named `big_data.txt-00.part`, the 99th chunk will be `big_data.txt-98.part`
+and the 302nd chunk will be `big_data.txt-301.part`.
+
+Would-be chunk files are ignored if their name does not match the given format.
+The list command might encounter composite files with missing or invalid
+chunks. By default, if chunker detects a missing chunk it will silently
+ignore the whole group. Use the `--chunker-fail-on-bad-chunks` flag
+to make it fail with an error message.
+
+
+### Metadata
+
+By default when a file is large enough, chunker will create a metadata
+object besides data chunks. The object is named after the original file.
+Chunker lets you choose between a few metadata formats. Please note that
+currently metadata is not created for files smaller than the configured
+chunk size. This may change in future as new formats are developed.
+
+#### Simple JSON metadata format
+
+This is the default format. It supports hash sums and chunk validation
+for composite files. Meta objects carry the following fields:
+
+- `size` - total size of chunks
+- `nchunks` - number of chunks
+- `md5` - MD5 hashsum (if present)
+- `sha1` - SHA1 hashsum (if present)
+
+There is no field for the composite file name as it's simply equal to the name
+of the meta object on the wrapped remote. Please refer to the respective sections
+for details on hashsums and modified time handling.
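
As an illustration of the naming and metadata rules above, the following Python sketch lists the chunk names and the metadata fields one would expect for a composite file. It is not taken from rclone: the helper names `chunk_name` and `simple_meta` are hypothetical, treating the format as having exactly one run of `#` characters is an assumption, and the real serialization of the meta object is not specified here, so `json.dumps` only stands in for it.

```python
import json
import re


def chunk_name(name_format, base_name, index, start_from=1):
    """Expand a chunk name format for the chunk at 0-based position `index`.

    Hypothetical helper that mirrors the documented rules, not rclone's code.
    """
    if name_format.count("*") != 1:
        raise ValueError("name format needs exactly one asterisk")
    runs = re.findall(r"#+", name_format)
    if len(runs) != 1:
        # Assumption: only a single run of consecutive hash characters is valid.
        raise ValueError("name format needs exactly one run of '#'")
    # The run length is the minimum width; longer numbers are left as is.
    number = str(start_from + index).zfill(len(runs[0]))
    # Substitute the number first so a '#' in the base name is never touched.
    return name_format.replace(runs[0], number, 1).replace("*", base_name, 1)


def simple_meta(size, chunk_size, md5=None, sha1=None):
    """Collect the documented `simplejson` fields for a composite file."""
    meta = {"size": size, "nchunks": -(-size // chunk_size)}  # ceiling division
    if md5 is not None:
        meta["md5"] = md5
    if sha1 is not None:
        meta["sha1"] = sha1
    return meta


if __name__ == "__main__":
    fmt = "big_*-##.part"
    for position in (1, 99, 302):  # 1-based position of the chunk in the file
        print(chunk_name(fmt, "data.txt", position - 1, start_from=0))
    # A 5 GiB file with 2 GiB chunks needs three chunks (2 + 2 + 1).
    print(json.dumps(simple_meta(5 * 1024**3, 2 * 1024**3)))
```

With the format `big_*-##.part` and numbering from 0, the loop reproduces the three names from the example under "Chunk names".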
+
+#### WebDavMailRu compatible metadata format
+
+The `wdmrcompat` metadata format is only useful to support historical files
+created by [WebDavMailRuCloud](https://github.com/yar229/WebDavMailRuCloud).
+It keeps the following fields (most are ignored, though):
+
+- `Name` - name of the composite file (always equal to the meta file name)
+- `Size` - total size of chunks
+- `PublicKey` - ignored, always "null"
+- `CreationDate` - last modification (sic!) time, ignored.
+
+#### No metadata
+
+You can disable meta objects by setting the meta format option to `none`.
+In this mode chunker will scan the directory for all files that follow the
+configured chunk name format, group them by detecting chunks with the same
+base name and show group names as virtual composite files.
+When a download is requested, chunker will transparently assemble compound
+files by merging chunks in order. This method is more prone to missing chunk
+errors (especially missing last chunk) than metadata-enabled formats.
+
+
+### Hashsums
+
+Chunker supports hashsums only when compatible metadata is present.
+Thus, if you choose a metadata format of `none` or `wdmrcompat`, chunker
+will return `UNSUPPORTED` as the hashsum.
+
+Please note that metadata is stored only for composite files. If a file
+is small (smaller than configured chunk size), chunker will transparently
+redirect hash requests to the wrapped remote, so support depends on that.
+You will see the empty string as a hashsum of the requested type for small
+files if the wrapped remote doesn't support it.
+
+Many storage backends support MD5 and SHA1 hash types, and so does chunker.
+Currently you can choose one or the other but not both.
+MD5 is set by default as the most supported type.
+Since chunker keeps hashes for composite files and falls back to the
+wrapped remote hash for small ones, we advise you to choose the same
+hash type as the wrapped remote, so your file listings look coherent.
+
+Normally, when a file is copied to a chunker controlled remote, chunker
+will ask the file source for a compatible file hash and revert to on-the-fly
+calculation if none is found. This involves some CPU overhead but provides
+a guarantee that the given hashsum is available. Also, chunker will reject
+a server-side copy or move operation if source and destination hashsum
+types are different, which results in extra network bandwidth use, too.
+In some rare cases this may be undesired, so chunker provides two optional
+choices: `sha1quick` and `md5quick`. If the source does not have the primary
+hash type and the quick mode is enabled, chunker will try to fall back to
+the secondary type. This will save CPU and bandwidth but can result in empty
+hashsums at the destination. Beware of consequences: the `sync` command will
+revert (sometimes silently) to time/size comparison if compatible hashsums
+between source and target are not found.
+
+
+### Modified time
+
+Chunker stores modification times using the wrapped remote so support
+depends on that. For a small non-chunked file the chunker overlay simply
+manipulates the modification time of the wrapped remote file.
+If a file is large and metadata is present, then chunker will get and set
+the modification time of the metadata object on the wrapped remote.
+If a file is chunked but the metadata format is `none` then chunker will
+use the modification time of the first chunk.
+
+
+### Migrations
+
+The idiomatic way to migrate to a different chunk size, hash type or
+chunk naming scheme is to:
+
+- Collect all your chunked files under a directory and have your
+  chunker remote point to it.
+- Create another directory (possibly on the same cloud storage)
+  and configure a new remote with the desired metadata format,
+  hash type, chunk naming etc.
+- Now run `rclone sync oldchunks: newchunks:` and all your data
+  will be transparently converted during the transfer.
+  This may take some time.
+- After checking data integrity you may remove the configuration section
+  of the old remote.
+
+If rclone gets killed during a long operation on a big composite file,
+hidden temporary chunks may stay in the directory. They will not be
+shown by the list command but will eat up your account quota.
+Please note that the `deletefile` rclone command deletes only active
+chunks of a file. As a workaround, you can use a remote of the wrapped
+file system to see them.
+An easy way to get rid of hidden garbage is to copy the littered directory
+somewhere using the chunker remote and purge the original directory.
+The `copy` command will copy only active chunks while the `purge` will
+remove everything including garbage.
+
+
+### Caveats and Limitations
+
+Chunker requires the wrapped remote to support server side `move` (or `copy` +
+delete) operations, otherwise it will explicitly refuse to start.
+This is because it internally renames temporary chunk files to their final
+names when an operation completes successfully.
+
+Note that moves done using the copy-and-delete method may incur double
+charging with some cloud storage providers.
+
+Chunker will not automatically rename existing chunks when you change the
+chunk name format. Beware that as a result some files which have been
+treated as chunks before the change can pop up in directory listings as
+normal files and vice versa. The same warning holds for the chunk size.
+If you desperately need to change critical chunking settings, you should
+run a data migration as described in the Migrations section.
+
+If the wrapped remote is case insensitive, the chunker overlay will inherit
+that property (so you can't have a file called "Hello.doc" and "hello.doc"
+in the same directory).
+
+
+
+### Standard Options
+
+Here are the standard options specific to chunker.
+
+#### --chunker-remote
+
+Remote to chunk/unchunk.
+Normally should contain a ':' and a path, eg "myremote:path/to/dir",
+"myremote:bucket" or maybe "myremote:" (not recommended).
+
+- Config: remote
+- Env Var: RCLONE_CHUNKER_REMOTE
+- Type: string
+- Default: ""
+
+#### --chunker-chunk-size
+
+Files larger than chunk size will be split in chunks.
+
+- Config: chunk_size
+- Env Var: RCLONE_CHUNKER_CHUNK_SIZE
+- Type: SizeSuffix
+- Default: 2G
+
+### Advanced Options
+
+Here are the advanced options specific to chunker (Transparently chunk/split large files).
+
+#### --chunker-name-format
+
+String format of chunk file names.
+The two placeholders are: base file name (*) and chunk number (#...).
+There must be one and only one asterisk and one or more consecutive hash characters.
+If chunk number has fewer digits than the number of hashes, it is left-padded with zeros.
+If there are more digits in the number, they are left as is.
+Possible chunk files are ignored if their name does not match the given format.
+
+- Config: name_format
+- Env Var: RCLONE_CHUNKER_NAME_FORMAT
+- Type: string
+- Default: "*.rclone_chunk.###"
+
+#### --chunker-start-from
+
+Minimum valid chunk number. Usually 0 or 1.
+By default chunk numbers start from 1.
+
+- Config: start_from
+- Env Var: RCLONE_CHUNKER_START_FROM
+- Type: int
+- Default: 1
+
+#### --chunker-meta-format
+
+Format of the metadata object or "none". By default "simplejson".
+Metadata is a small JSON file named after the composite file.
+
+- Config: meta_format
+- Env Var: RCLONE_CHUNKER_META_FORMAT
+- Type: string
+- Default: "simplejson"
+- Examples:
+    - "none"
+        - Do not use metadata files at all. Requires hash type "none".
+    - "simplejson"
+        - Simple JSON supports hash sums and chunk validation.
+        - It has the following fields: size, nchunks, md5, sha1.
+    - "wdmrcompat"
+        - This format brings compatibility with WebDavMailRuCloud.
+        - It does not support hash sums or validation, most fields are ignored.
+        - It has the following fields: Name, Size, PublicKey, CreationDate.
+        - Requires hash type "none".
+
+#### --chunker-hash-type
+
+Choose how chunker handles hash sums.
+
+- Config: hash_type
+- Env Var: RCLONE_CHUNKER_HASH_TYPE
+- Type: string
+- Default: "md5"
+- Examples:
+    - "none"
+        - Chunker can pass any hash supported by wrapped remote
+        - for a single-chunk file but returns nothing otherwise.
+    - "md5"
+        - MD5 for multi-chunk files. Requires "simplejson".
+    - "sha1"
+        - SHA1 for multi-chunk files. Requires "simplejson".
+    - "md5quick"
+        - When a file is copied on to chunker, MD5 is taken from its source
+        - falling back to SHA1 if the source doesn't support it. Requires "simplejson".
+    - "sha1quick"
+        - Similar to "md5quick" but prefers SHA1 over MD5. Requires "simplejson".
+
+#### --chunker-fail-on-bad-chunks
+
+The list command might encounter files with missing or invalid chunks.
+This boolean flag tells what rclone should do in such cases.
+
+- Config: fail_on_bad_chunks
+- Env Var: RCLONE_CHUNKER_FAIL_ON_BAD_CHUNKS
+- Type: bool
+- Default: false
+- Examples:
+    - "true"
+        - Fail with error.
+    - "false"
+        - Silently ignore invalid object.
+
+
diff --git a/docs/content/docs.md b/docs/content/docs.md
index 953b1d03b..61ea727cc 100644
--- a/docs/content/docs.md
+++ b/docs/content/docs.md
@@ -26,6 +26,7 @@ See the following for detailed instructions for
  * [Backblaze B2](/b2/)
  * [Box](/box/)
  * [Cache](/cache/)
+ * [Chunker](/chunker/) - transparently splits large files for other remotes
  * [Crypt](/crypt/) - to encrypt other remotes
  * [DigitalOcean Spaces](/s3/#digitalocean-spaces)
  * [Dropbox](/dropbox/)
diff --git a/docs/layouts/chrome/navbar.html b/docs/layouts/chrome/navbar.html
index ea0d527af..4f5e4bd99 100644
--- a/docs/layouts/chrome/navbar.html
+++ b/docs/layouts/chrome/navbar.html
@@ -63,6 +63,7 @@
  • Backblaze B2
  • Box
  • Cache
+ • Chunker (splits large files)
  • Crypt (encrypts the others)
  • Dropbox
  • FTP