Forcing distributed muted to raise when a notify reviewable job is running
leads to excessive errors in the logs under many conditions.
The new pattern
1. Optimises the counting of reviewables so it is a lot faster
2. Holds the distributed lock for 2 minutes (max)
The downside is the job queue can get blocked up when tons of notify
reviewables are running at the same time. However this should be very
rare in the real world, as we only notify when stuff is flagged which
is fairly infrequent.
This also give a fair bit more time for the notifications which may be
a little slow on large sites with tons of mods.
Under scenarios of extremely high load where large numbers of `Reviewable*` items are being created, it has been observed that multiple instances of the `NotifyReviewable` job may run simultaneously.
These jobs will work satisfactorily if the concurrency is limited to 1, and the different types of jobs (items reviewable by admins, vs moderators, vs particular groups, etc.) are run eventually.
This change introduces a new option to `DistributedMutex` which allows the `max_get_lock_attempts` to be specified. If the number is exceeded an error will be raised, which will cause Sidekiq to requeue the job. Sidekiq has existing logic to back-off on retry times for jobs that have failed multiple times.
When originally introduced, `attempts` was only used in the read-only check
context.
With the introduction of the exponential backoff in cda370db, `attempts` was
also used to count loop iterations, but was left inside the if block instead of
being incremented every loop, meaning the exponential backoff was only
happening when the site was recently readonly.
Co-authored-by: jbrw <jamie.wilson@discourse.org>
Polling every 0.001s can cause extreme load on the redis instance, especially in scenarios where multiple app instances are waiting on the same lock. This commit introduces an exponential backoff starting from 0.001s and reaching a maximum interval of 1s.
Previously `CHECK_READONLY_ATTEMPTS` was 10, and resulted in a block for 0.01s. Under the new logic, 10 attempts take more than 1s. Therefore CHECK_READONLY_ATTEMPTS is reduced to 5, bringing its total time to around 0.031s
Latest redis interoduces a block form of multi / pipelined, this was incorrectly
passed through and not namespaced.
Fix also updates logster, we held off on upgrading it due to missing functions
Accounting for fractional seconds, a distributed mutex can be held for
almost a full second longer than its validity.
For example: if we grab the lock at 10.5 seconds passed the epoch with a
validity of 5 seconds, the lock would be released at 16 seconds passed
the epoch. However, in this case assuming that all other processing
takes a negligible amount of time, the key would be expired at 15.5
seconds passed the epoch.
Using expireat, the key is now expired exactly when the lock is released.
Threadsafety
Since we use the same redis connection in multiple threads, a rogue
transaction in another thread can trample the connection state
(watched keys) that we need to acquire and release the lock properly.
This is fixed by preventing other threads from using the connection
when we are performing these actions.
Off-by-one error
A distributed mutex is now consistently determined to be expired if
the current time is strictly greater than the expire time.
Unwatch before transaction
Since the redis connection is used by so much of the code, it is
difficult to ensure that any watched keys have been cleared. In order
to defend against this rogue connection state, an unwatch has been
added before locking and unlocking.
Logging
Hopefully this log message is more clear.
Previously we would only hold the post process mutex for 1 minute, that is
not enough when processing a post with lots of images. This raises the bar
to 10 minutes.
It also cleans up error reporting around distributed mutexes expiring. We
used to double report.
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.
Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
If we detect redis is in readonly we can not correctly get a mutex
raise an exception to notify caller
When getting optimized images avoid the distributed mutex unless
for some reason it is the first call and we need to generate a thumb
In redis readonly no thumbnails will be generated