78 Commits

Author SHA1 Message Date
Alan Guo Xiang Tan
9812407f76
FIX: Redo Sidekiq monitoring to restart stuck sidekiq processes (#30198)
This commit reimplements how we monitor Sidekiq processes that are
forked from the Unicorn master process. Prior to this change, we relied on
`Jobs::Heartbeat` to enqueue a `Jobs::RunHeartbeat` job every 3 minutes.
The `Jobs::RunHeartbeat` job then set a Redis key with a timestamp. In
the Unicorn master process, we then fetched the timestamp set
by the job from Redis every 30 minutes. If the timestamp had not been
updated for more than 30 minutes, we restarted the Sidekiq process. The
fundamental flaw with this approach is that it fails to consider
deployments with multiple hosts and multiple Sidekiq processes. A
Sidekiq process on a host may be in a bad state but the heartbeat check
will not restart the process because the `Jobs::RunHeartbeat` job is
still being executed by the working Sidekiq processes on other hosts.

In order to properly ensure that stuck Sidekiq processes are restarted,
we now rely on the [Sidekiq::ProcessSet](https://github.com/sidekiq/sidekiq/wiki/API#processes)
API that is supported by Sidekiq. The API provides us with "near real-time (updated every 5 sec)
info about the current set of Sidekiq processes running". The API
provides useful information like the hostname, pid and also when Sidekiq
last did its own heartbeat check. With that information, we can easily
determine if a Sidekiq process needs to be restarted from the Unicorn
master process.
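
As a rough sketch of the check described above, the snippet below decides whether one Sidekiq process looks stuck based on its last heartbeat. It works on plain hashes shaped like the entries that `Sidekiq::ProcessSet` yields (`"hostname"`, `"pid"`, `"beat"`). The `stuck_sidekiq?` helper name, the 60 second threshold and the choice to treat a missing entry as stuck are assumptions for illustration, not the values used in this commit.

```ruby
# Hypothetical helper: true when the process identified by hostname + pid
# has not reported a heartbeat within max_age seconds, or is missing from
# the process set entirely (treating "missing" as stuck is an assumption).
def stuck_sidekiq?(processes, hostname:, pid:, now: Time.now.to_f, max_age: 60)
  entry = processes.find { |p| p["hostname"] == hostname && p["pid"] == pid }
  return true if entry.nil?

  (now - entry["beat"]) > max_age
end

# Stand-in data; a real check would iterate Sidekiq::ProcessSet.new instead.
processes = [
  { "hostname" => "web-1", "pid" => 101, "beat" => 1_000.0 },
  { "hostname" => "web-2", "pid" => 202, "beat" => 1_290.0 },
]

stale = stuck_sidekiq?(processes, hostname: "web-1", pid: 101, now: 1_300.0)
fresh = stuck_sidekiq?(processes, hostname: "web-2", pid: 202, now: 1_300.0)
```

Matching on both hostname and pid is what lets each Unicorn master look only at the Sidekiq process it forked, instead of at a cluster-wide signal.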
2024-12-18 12:48:50 +08:00
Alan Guo Xiang Tan
f35128c6ed
DEV: Fix broken sidekiq logging due to eeb01ea0de (#30199) 2024-12-10 17:01:25 +08:00
Alan Guo Xiang Tan
eeb01ea0de
DEV: Remove unnecessary thread in Jobs::Base::JobInstrumenter take 2 (#30195)
This reverts commit 766ff723f8.

Ensure that we create the sidekiq log file first before opening it for
logging. This avoids any issue of the log file not being present when we
initialize an instance of the `Logger`.
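
A minimal sketch of the "create the file, then open the logger" ordering, assuming a hypothetical `open_log` helper and a temporary directory in place of the real log path:

```ruby
require "fileutils"
require "logger"
require "tmpdir"

# Hypothetical helper: make sure the log file exists before a Logger is
# initialized on it.
def open_log(path)
  FileUtils.mkdir_p(File.dirname(path))
  FileUtils.touch(path)
  Logger.new(path)
end

dir = Dir.mktmpdir
path = File.join(dir, "log", "sidekiq.log")

logger = open_log(path)
logger.info("started")
logger.close
```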
2024-12-10 12:44:56 +08:00
Alan Guo Xiang Tan
766ff723f8
Revert "DEV: Remove unnecessary thread in Jobs::Base::JobInstrumenter (#30179)" (#30193)
This reverts commit 1670ffe82d.
2024-12-10 09:24:40 +08:00
Alan Guo Xiang Tan
1670ffe82d
DEV: Remove unnecessary thread in Jobs::Base::JobInstrumenter (#30179)
In `Jobs::Base::JobInstrumenter.raw_log`, we were creating an instance
of `Queue` and then pushing messages to the queue before popping it off
the queue in a thread. However, this complexity is not necessary when
we can just write directly to the logger without much overhead. This is
how all logging is done in other parts of the app as well.
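
To illustrate the simpler approach, here is a sketch that writes straight to a `Logger` with no intermediate `Queue` or consumer thread. The top-level `raw_log` function and the `StringIO` target are stand-ins chosen for this example, not the actual `Jobs::Base::JobInstrumenter` code.

```ruby
require "logger"
require "stringio"

io = StringIO.new
logger = Logger.new(io)

# Hypothetical stand-in for raw_log: Logger#<< writes the message as-is,
# without the usual severity/timestamp formatting.
def raw_log(logger, message)
  logger << "#{message}\n"
end

raw_log(logger, "job finished")
```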
2024-12-10 06:29:46 +08:00
Bianca Nenciu
5e734516db
DEV: Drop DISCOURSE_LIVE_SLOTS_SIDEKIQ_LIMIT (#29920)
This was used to track jobs that may leak memory, but proved to be too
noisy and not very useful.
2024-11-26 07:21:14 +11:00
Alan Guo Xiang Tan
c1f25cdf5b
FIX: Unicorn master and Sidekiq reopening logs at the same time (#29137)
In our production environment, we have been seeing Sidekiq processes
getting stuck randomly when a USR1 signal is sent to the Unicorn master
process. We have not been able to identify the root cause of why the
Sidekiq process gets stuck. We however noticed that when the Unicorn
master process receives a USR1 signal, it will reopen the logs for the
Unicorn master process first before sending a USR1 signal for the
Unicorn worker processes to reopen the logs. We figured that we should
do the same for the Sidekiq process as well when a USR1 signal is received.

In this commit, we introduce an arbitrary delay of 1 second before
the Sidekiq process reopens its log files so as to allow enough time for the Unicorn
master to finish reopening its logs first.

We also do not ask the Sidekiq process to reopen its logs if the `DISCOURSE_LOG_SIDEKIQ`
env is not present because there is no need to reopen any logs.
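
The two rules above (wait before reopening, and skip entirely when `DISCOURSE_LOG_SIDEKIQ` is absent) can be sketched roughly as follows. The `reopen_sidekiq_logs` function and its injectable `reopen` and `sleeper` callables are hypothetical, used here so the behaviour can be shown without real signals.

```ruby
# Hypothetical sketch: skip when Sidekiq logging is not configured,
# otherwise wait `delay` seconds so the Unicorn master can finish
# reopening its own logs, then reopen the Sidekiq logs.
def reopen_sidekiq_logs(env:, reopen:, delay: 1, sleeper: method(:sleep))
  return :skipped unless env.key?("DISCOURSE_LOG_SIDEKIQ")

  sleeper.call(delay)
  reopen.call
  :reopened
end

events = []
result = reopen_sidekiq_logs(
  env: { "DISCOURSE_LOG_SIDEKIQ" => "1" },
  reopen: -> { events << :reopen },
  sleeper: ->(seconds) { events << [:sleep, seconds] }, # record instead of sleeping
)
```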
2024-10-10 08:01:40 +08:00
Renato Atilio
54d6e52607
FIX: chat mailer log noise (#28616)
Fixes the log noise caused by a deprecation notice
2024-08-29 11:39:08 -03:00
Bianca Nenciu
95b09dd777
DEV: Log live slots of Sidekiq jobs (#28600)
Introduce a new log line for Sidekiq jobs that consume more than
`DISCOURSE_LIVE_SLOTS_SIDEKIQ_LIMIT` live slots. This is useful to
track down jobs that may leak memory.

This is enabled only when Sidekiq's job instrumenter is enabled (set
`DISCOURSE_LOG_SIDEKIQ` to `1`).
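
For readers unfamiliar with "live slots", the figure comes from Ruby's `GC.stat`. The sketch below samples it before and after a block of work and compares the difference against a limit. The `measure_live_slots` helper and the default limit of 10000 are assumptions; only the environment variable name comes from this commit.

```ruby
# Hypothetical helper: run the block and report the number of live heap
# slots before and after it.
def measure_live_slots
  start = GC.stat(:heap_live_slots)
  yield
  finish = GC.stat(:heap_live_slots)
  { live_slots_start: start, live_slots_finish: finish, delta: finish - start }
end

# The fallback value is an assumption made for this example.
limit = ENV.fetch("DISCOURSE_LIVE_SLOTS_SIDEKIQ_LIMIT", "10000").to_i

stats = measure_live_slots { Array.new(1_000) { |i| "item #{i}" } }
warn "job exceeded live slot limit: #{stats}" if stats[:delta] > limit
```

The delta can be negative if a garbage collection runs while the block executes, so it is a hint rather than an exact measure of allocations.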
2024-08-29 12:23:27 +03:00
Alan Guo Xiang Tan
1a09d6b246
FEATURE: Add live_slots_(start|finish) for Sidekiq perf logging (#28260)
This information is helpful in debugging memory spikes when Sidekiq
processes jobs.
2024-08-07 15:48:24 +08:00
Daniel Waterworth
13083d03ae
DEV: Async category search for sidebar modal (#25686) 2024-02-20 11:24:30 -06:00
Alan Guo Xiang Tan
043ba1d179
DEV: Fix job cluster concurrency spec timing out (#25035)
Why this change?

We have been seeing the "handles job concurrency" job timing out
on CI after 45 seconds. Upon closer inspection of `Jobs::Base#perform`
when cluster concurrency has been set, we see that a thread is spun up
to extend the expiry of a Redis key by 120 seconds every 60 seconds
while the job is still being executed. The thread looks like this before
the fix:

```
keepalive_thread =
  Thread.new do
    while parent_thread.alive? && !finished
      Discourse.redis.without_namespace.expire(cluster_concurrency_redis_key, 120)
      sleep 60
    end
  end
```

In an ensure block of `Jobs::Base#perform`, the thread is stopped by doing
something like this:

```
finished = true
keepalive_thread.wakeup
keepalive_thread.join
```

If the thread is sleeping, `keepalive_thread.wakeup` will stop the
`sleep` method and run the next iteration causing the thread to
complete. However, there is a timing issue at play here. If
`keepalive_thread.wakeup` is called at a time when the thread is not
sleeping, it will have no effect and the thread may end up sleeping for
60 seconds which is longer than our timeout on CI of 45 seconds.

What does this change do?

1. Change `sleep 60` to sleep in intervals of 1 second checking if the
   job has been finished each time.

2. Add `use_redis_snapshotting` to `Jobs::Base` spec since Redis is
   involved in scheduling and we want to ensure we don't leak Redis
   keys.

3. Add `ConcurrentJob.stop!` and `thread.join` to `ensure` block in "handles job concurrency"
   test since a failing expectation will cause us to not clean up the
   thread we created in the test.
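
The first change can be sketched as a keepalive loop that sleeps in short ticks and checks the finished flag on every tick, so it no longer depends on `wakeup` arriving while the thread is asleep. This is an illustration under assumptions: `start_keepalive`, its `finished` callable and the `tick` parameter are hypothetical names, and the block stands in for the Redis `expire` call.

```ruby
# Hypothetical sketch: call the block once per `interval` seconds, but
# sleep in `tick`-sized steps so the thread notices `finished` quickly.
def start_keepalive(finished:, interval: 60, tick: 1)
  Thread.new do
    until finished.call
      yield # e.g. extend the expiry of the Redis key
      waited = 0
      while waited < interval && !finished.call
        sleep tick
        waited += tick
      end
    end
  end
end

done = false
extensions = 0
thread = start_keepalive(finished: -> { done }, interval: 60, tick: 0.01) { extensions += 1 }

sleep 0.01 until extensions > 0 # wait for the first keepalive
done = true                     # what the ensure block would do
joined = thread.join(1)         # returns the thread if it exited within 1 second
```

With a 60 second interval the thread still exits within roughly one tick of `done` being set, which is the property the CI timeout needed.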
2023-12-26 14:47:03 +08:00
Sam
eb603b246b
PERF: limit anonymization to 1 per cluster (#21992)
Anonymization is among the most expensive operations we can perform with
extreme potential to impact the database. To mitigate risk we only allow a
single anonymization across the entire cluster concurrently.

This commit introduces support for `cluster_concurrency 1`. When you set that on a Job it will only allow 1 concurrent execution per cluster.
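
The idea behind a cluster-wide limit of 1 can be sketched with a lock key that is set only if absent. The example below uses an in-memory `FakeRedis` so it runs on its own. `FakeRedis`, `with_cluster_lock` and the `:skipped` return value are all hypothetical, and key expiry is left out for brevity.

```ruby
# In-memory stand-in for the small part of Redis used here.
class FakeRedis
  def initialize
    @data = {}
    @mutex = Mutex.new
  end

  # Behaves like SET key value NX: succeeds only when the key is absent.
  def set_nx(key, value)
    @mutex.synchronize do
      next false if @data.key?(key)

      @data[key] = value
      true
    end
  end

  def del(key)
    @mutex.synchronize { @data.delete(key) }
  end
end

# Hypothetical helper: run the block only if no one else holds the lock.
def with_cluster_lock(redis, key)
  return :skipped unless redis.set_nx(key, "1")

  begin
    yield
  ensure
    redis.del(key)
  end
end

redis = FakeRedis.new
inner = nil
outer =
  with_cluster_lock(redis, "anonymize") do
    # A second attempt while the first is still running is refused.
    inner = with_cluster_lock(redis, "anonymize") { :ran }
    :ran
  end
```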
2023-06-14 08:30:23 +10:00
Daniel Waterworth
666536cbd1
DEV: Prefer \A and \z over ^ and $ in regexes (#19936) 2023-01-20 12:52:49 -06:00
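
A short illustration of why the anchors matter: in Ruby, `^` and `$` match at line boundaries, while `\A` and `\z` match only at the start and end of the whole string. The input value is made up for this example.

```ruby
input = "admin\n<script>"

line_anchored = input.match?(/^admin$/)     # true: matches the first line only
string_anchored = input.match?(/\Aadmin\z/) # false: the whole string must match
exact = "admin".match?(/\Aadmin\z/)         # true
```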
Alan Guo Xiang Tan
8a7b62b126
DEV: Fix threading error when running jobs immediately in system tests (#19811)
```
class Jobs::DummyDelayedJob < Jobs::Base
  def execute(args = {})
  end
end

RSpec.describe "Jobs.run_immediately!" do
  before { Jobs.run_immediately! }

  it "explodes" do
    current_user = Fabricate(:user)
    Jobs.enqueue_in(1.seconds, :dummy_delayed_job)
    sign_in(current_user)
  end
end
```

The test above will fail with the following error if `ActiveRecord::Base.connection_handler.clear_active_connections!` is called before the configured Capybara server checks out a connection from the connection pool.

```
     ActiveRecord::ActiveRecordError:
       Cannot expire connection, it is owned by a different thread: #<Thread:0x00007f437391df58@puma srv tp 001 /home/tgxworld/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/puma-6.0.2/lib/puma/thread_pool.rb:106 sleep_forever>. Current thread: #<Thread:0x00007f437d6cfc60 run>.
```

We're not exactly sure if this is an ActiveRecord bug or not but we've
invested too much time into investigating this problem. Fundamentally,
we also no longer understand why `ActiveRecord::Base.connection_handler.clear_active_connections!` is being called in an ensure block
within `Jobs::Base#perform` which was added in
ceddb6e0da 10 years ago. This
commit moves the logic for running jobs immediately out of the
`Jobs::Base#perform` method into another `Jobs::Base#perform_immediately` method such that
`ActiveRecord::Base.connection_handler.clear_active_connections!` is not
called. This change will only impact the test environment.
2023-01-10 13:41:25 +08:00
David Taylor
5a003715d3
DEV: Apply syntax_tree formatting to app/* 2023-01-09 14:14:59 +00:00
Alan Guo Xiang Tan
7d41e980c9
FIX: Uninitialized class variable error in sidekiq (#17227)
Follow-up to 4199ada1ce
2022-06-24 14:17:39 +10:00
Martin Brennan
3f5e19c62a
FIX: Typo in log_thread (#17226)
Follow up to 4199ada1ce
2022-06-24 12:12:30 +08:00
Alan Guo Xiang Tan
4199ada1ce
DEV: Ensure Sidekiq logging thread is always running (#17211) 2022-06-24 10:28:18 +08:00
Jarek Radosz
2fc70c5572
DEV: Correctly tag heredocs (#16061)
This allows text editors to use correct syntax coloring for the heredoc sections.

Heredoc tag names we use:

languages: SQL, JS, RUBY, LUA, HTML, CSS, SCSS, SH, HBS, XML, YAML/YML, MF, ICS
other: MD, TEXT/TXT, RAW, EMAIL
2022-02-28 20:50:55 +01:00
Alan Guo Xiang Tan
6f03b2694d
DEV: Fix typo. (#15857) 2022-02-08 09:04:53 +08:00
David Taylor
15cff27bfe
DEV: Stringify keys of nested hashes in job arguments (#15850)
This provides symmetry with the `.with_indifferent_access` usage in `Jobs#perform`, which is also recursive.
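
A plain-Ruby sketch of recursive key stringification, for illustration only. The `deep_stringify` function is a hypothetical stand-in and not the code used in this commit.

```ruby
# Hypothetical helper: convert hash keys to strings at every nesting level,
# including hashes inside arrays.
def deep_stringify(value)
  case value
  when Hash
    value.each_with_object({}) { |(k, v), out| out[k.to_s] = deep_stringify(v) }
  when Array
    value.map { |v| deep_stringify(v) }
  else
    value
  end
end

args = deep_stringify({ post_id: 1, opts: { notify: [{ user_id: 2 }] } })
```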
2022-02-07 20:28:45 +00:00
David Taylor
c8c23ba557
DEV: Introduce deprecation warning for non-json Job arguments (#15842)
This commit introduces our own handling and warning for Sidekiq's new 'non-json-serializable' warning. This decouples us from Sidekiq's own deprecation cycle, and allows us to use our own deprecation system. It also means that the dump/parse happens in test mode, which will help us to catch occurrences before they reach production.
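
The dump/parse check can be sketched as a JSON round trip: arguments are safe when they come back unchanged. The `json_safe?` name is hypothetical, and this is only an approximation of the idea rather than the warning logic added here.

```ruby
require "json"

# Hypothetical helper: true when the arguments survive a JSON round trip
# unchanged, which is what a job will actually receive.
def json_safe?(args)
  JSON.parse(JSON.dump(args)) == args
end

string_keys = json_safe?({ "post_id" => 1, "tags" => ["a", "b"] })
symbol_keys = json_safe?({ post_id: 1 })            # keys come back as strings
time_value = json_safe?({ "at" => Time.at(0) })     # Time comes back as a String
```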
2022-02-07 17:59:55 +00:00
David Taylor
f53d70ac63
DEV: Ensure delay_for and queue are not passed as job arguments (#15824)
This regressed in 3a85c4d680 because deep_stringify_keys makes a copy of the `opts` hash
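
One way to avoid this class of bug is to pull the scheduling options out before any copy is made, so they can never end up among the job arguments. The sketch below assumes a hypothetical `split_opts` function and uses a shallow `transform_keys` in place of `deep_stringify_keys`.

```ruby
# Hypothetical helper: separate scheduling options from job arguments
# before the arguments are copied and stringified.
def split_opts(opts)
  opts = opts.dup
  delay_for = opts.delete(:delay_for)
  queue = opts.delete(:queue)
  args = opts.transform_keys(&:to_s) # shallow stand-in for deep_stringify_keys
  [args, delay_for, queue]
end

args, delay_for, queue = split_opts({ post_id: 1, delay_for: 60, queue: "low" })
```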
2022-02-04 20:11:03 +00:00
David Taylor
3a85c4d680 DEV: Ensure Sidekiq job arguments have stringified keys
The latest version of Sidekiq introduced a warning when jobs are queued with arguments which 'do not stringify to JSON safely'. In the vast majority of cases, this is because a hash is passed with symbols as keys. When those args are passed to the job, the keys will be stringified.

Our job wrapper already takes care of this issue by calling '.with_indifferent_access' on the args before passing them to `#execute`, so we don't need to change anything about our use. All we need to do is satisfy Sidekiq's warning system by 'stringifying' all the keys before enqueuing the job.
2022-02-04 18:28:18 +00:00
Osama Sayegh
228264d17c
Revert "DEV: add routes_lazy_route to boost boot-up time (#14545)" (#14581)
This reverts commit f5cf647e57.

The gem breaks usage of Rails URL helpers when used outside views and
controllers, for example in
88ecb83382/app/models/upload.rb (L239-L242)
the `upload_short_path` method call fails with an undefined method
exception when this gem is enabled.
2021-10-12 17:30:38 +03:00
Sam
f5cf647e57
DEV: add routes_lazy_route to boost boot-up time (#14545)
The lazy route initialization cuts down boot time of rails.

On my local system it cuts out 200ms of boot time taking me from 3.2 to 3 seconds.

This is not a radically enormous amount of time, but paper cuts add up, and a faster boot in dev will make everyone happy.

TBD if we want to also include this in production.

Gem is heavily maintained by @amatsuda, last commit 3 days ago.
2021-10-11 13:22:13 +11:00
David Taylor
c69bb5d5be
DEV: Always enqueue sidekiq jobs after database transaction commit (#11293)
When jobs are enqueued inside a transaction, it's possible that they will be executed before the necessary data is available in the database. This commit ensures all jobs are enqueued in an ActiveRecord after_commit hook.

One potential downside here is if the job fails to enqueue, the transaction will no longer be aborted. However, the chance of that happening is reasonably low, and the impact is significantly lower than the current issue where jobs are scheduled before their data is ready.
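
The ordering guarantee can be illustrated with a toy transaction object that defers callbacks until commit. `FakeTransaction` is purely a stand-in written for this example and is not the ActiveRecord API.

```ruby
# Toy stand-in for a database transaction with after-commit callbacks.
class FakeTransaction
  def initialize
    @callbacks = []
  end

  def after_commit(&block)
    @callbacks << block
  end

  def commit!
    @callbacks.each(&:call)
    @callbacks.clear
  end
end

log = []
tx = FakeTransaction.new

log << :insert_post
tx.after_commit { log << :enqueue_job } # deferred, not run yet
log << :update_topic
tx.commit!
```

The job is enqueued only after both writes, so a worker that picks it up immediately can already see the data.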
2020-12-08 11:05:01 +11:00
Guo Xiang Tan
c6202af005
Update rubocop to 2.3.1. 2020-07-24 17:19:21 +08:00
David Taylor
8a3d9d7036
DEV: Run jobs sequentially in test mode (#9897)
When running jobs in tests, we use `Jobs.run_immediately!`. This means that jobs are run synchronously when they are enqueued. Jobs sometimes enqueue other jobs, which are also executed synchronously. This means that the outermost job will block until the inner jobs have finished executing. In some cases (e.g. process_post with hotlinked images) this can lead to a deadlock.

This commit changes the behavior slightly. Now we will never run jobs inside other jobs. Instead, we will queue them up and run them sequentially in the order they were enqueued. As a whole, they are still executed synchronously. Consider the example

```ruby
class Jobs::InnerJob < Jobs::Base
  def execute(args)
    puts "Running inner job"
  end
end

class Jobs::OuterJob < Jobs::Base
  def execute(args)
    puts "Starting outer job"
    Jobs.enqueue(:inner_job)
    puts "Finished outer job"
  end
end

Jobs.enqueue(:outer_job)
puts "All jobs complete"
```

The old behavior would result in:

```
Starting outer job
Running inner job
Finished outer job
All jobs complete
```

The new behavior will result in:
```
Starting outer job
Finished outer job
Running inner job
All jobs complete
```
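
A minimal sketch of how such sequential execution could work: jobs enqueued while another job is running are appended to a queue and drained by the outermost call. The `SequentialRunner` class is hypothetical and takes blocks instead of job names to stay self-contained.

```ruby
# Hypothetical runner: never runs a job inside another job.
class SequentialRunner
  def initialize
    @queue = []
    @running = false
  end

  def enqueue(&job)
    @queue << job
    return if @running # the outermost call is already draining the queue

    @running = true
    begin
      @queue.shift.call until @queue.empty?
    ensure
      @running = false
    end
  end
end

output = []
runner = SequentialRunner.new

runner.enqueue do
  output << "Starting outer job"
  runner.enqueue { output << "Running inner job" }
  output << "Finished outer job"
end
output << "All jobs complete"
```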
2020-05-28 12:52:27 +01:00
Sam Saffron
28292d2759
PERF: avoid shelling to get hostname aggressively
Previously we had many places in the app that called `hostname` to get
the hostname of a server. This commit replaces the pattern in 2 ways:

1. We cache the result in `Discourse.os_hostname` so it is only ever computed once

2. We prefer to use `Socket.gethostname`, which avoids spawning a shell command

This improves performance as we are not spawning `hostname` processes throughout
the app lifetime
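
A minimal sketch of the two points above, using a hypothetical `HostInfo` module in place of `Discourse`:

```ruby
require "socket"

module HostInfo
  # Memoized so the lookup happens once per process; Socket.gethostname
  # asks the OS directly instead of spawning a `hostname` process.
  def self.os_hostname
    @os_hostname ||= Socket.gethostname
  end
end

first = HostInfo.os_hostname
second = HostInfo.os_hostname
```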
2020-02-18 15:13:19 +11:00
Dan Ungureanu
086b46051c
FIX: Zeitwerk-related fixes for jobs. (#8187) 2019-10-14 13:03:22 +03:00
Krzysztof Kotlarek
427d54b2b0 DEV: Upgrading Discourse to Zeitwerk (#8098)
Zeitwerk simplifies working with dependencies in dev and makes reloading class chains easier.

We no longer need to use Rails "require_dependency" anywhere and instead can just use standard
Ruby patterns to require files.

This is a far reaching change and we expect some followups here.
2019-10-02 14:01:53 +10:00
Gerhard Schlager
b788948985 FEATURE: English locale with international date formats
Makes en_US the new default locale
2019-05-20 13:47:20 +02:00
Sam Saffron
30990006a9 DEV: enable frozen string literal on all files
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.

Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
2019-05-13 09:31:32 +08:00
Robin Ward
b58867b6e9 FEATURE: New 'Reviewable' model to make reviewable items generic
Includes support for flags, reviewable users and queued posts, with REST API
backwards compatibility.

Co-Authored-By: romanrizzi <romanalejandro@gmail.com>
Co-Authored-By: jjaffeux <j.jaffeux@gmail.com>
2019-03-28 12:45:10 -04:00
Robin Ward
fa5a158683 REFACTOR: Move queue_jobs out of SiteSetting
It is not a setting, and only relevant in specs. The new API is:

```
Jobs.run_later!        # jobs will be thrown on the queue
Jobs.run_immediately!  # jobs will run right away, avoid the queue
```
2019-03-14 10:47:38 -04:00
David Taylor
0a4562253e DEV: Add 'starting' event to sidekiq log when interval logging enabled 2019-03-08 10:56:36 +00:00
David Taylor
e2510d79cc DEV: Improve thread-safety of sidekiq logging 2019-03-08 10:31:49 +00:00
David Taylor
9db05a895a DEV: Add job_id to sidekiq log
This makes it easier to correlate 'pending' logs from the same job
2019-03-08 09:16:13 +00:00
David Taylor
df474bceee DEV: Further sidekiq logging stability improvements
- Open the log file in "append" mode. This avoids issues if the file does not exist (and matches standard rails log behavior)
- Correctly parse the interval logging environment variable
2019-03-06 12:50:15 +00:00
David Taylor
fe62de68dd DEV: Correct sidekiq logging to avoid thread leak 2019-03-06 10:11:31 +00:00
David Taylor
8b30ed5b7a DEV: Serialize the job parameters in sidekiq logs
Otherwise this can lead to some very large data structures when processing the logs later
2019-03-05 17:44:49 +00:00
David Taylor
8963f1af30
FEATURE: Optional detailed performance logging for Sidekiq jobs (#7091)
By default, this does nothing. Two environment variables are available:

- `DISCOURSE_LOG_SIDEKIQ`

  Set to `"1"` to enable logging. This will log all completed jobs to `log/rails/sidekiq.log`, along with various db/redis/network statistics. This is useful to track down poorly performing jobs.

- `DISCOURSE_LOG_SIDEKIQ_INTERVAL`

  (seconds) Check running jobs periodically, and log their current duration. They will appear in the logs with `status:pending`. This is useful to track down jobs which take a long time, then crash sidekiq before completing.
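
As an illustration of how the two variables might be read, the sketch below parses them from a hash that stands in for `ENV`. The `sidekiq_log_config` function is hypothetical, and treating a missing interval as 0 (interval logging off) is an assumption.

```ruby
# Hypothetical helper: turn the two environment variables into settings.
def sidekiq_log_config(env)
  {
    enabled: env["DISCOURSE_LOG_SIDEKIQ"] == "1",
    # nil.to_i is 0, assumed here to mean "no interval logging".
    interval: env["DISCOURSE_LOG_SIDEKIQ_INTERVAL"].to_i,
  }
end

defaults = sidekiq_log_config({})
configured = sidekiq_log_config(
  { "DISCOURSE_LOG_SIDEKIQ" => "1", "DISCOURSE_LOG_SIDEKIQ_INTERVAL" => "5" },
)
```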
2019-03-05 11:19:11 +00:00
Sam
e08a3f719c FEATURE: push post rebake regular task to low priority queue
This allows us to run regular rebakes without starving the normal queue.

It additionally adds the ability to specify queue with `Jobs.enqueue` so
we can specifically queue a job with lower priority using the `queue` arg.
2019-01-09 08:57:20 +11:00
Sam
e4498d2a8a FIX: keep db and job correctly in multisite logs
This ensures we report job and db correctly, previously we were
only reporting this on default
2018-09-04 16:05:44 +10:00
Sam
44cf3cf975 FIX: queue heartbeats in readonly modes
If Sidekiq is paused or Discourse is in readonly mode, continue to queue
heartbeats

If we do not do that then a master process can end up reaping sidekiq
workers and causing various badness

This also impacts restore which can do weird stuff TM in cases like this
2018-08-29 12:36:59 +10:00
Sam
1f17b84b63 FEATURE: more context for error reporting on jobs fails 2018-08-16 12:38:49 +10:00
Neil Lalonde
4ad7ce70ce REFACTOR: extract scheduler to the mini_scheduler gem 2018-07-31 17:12:55 -04:00
Sam
7c5a71e929 DEV: allow queue_jobs = false in dev
your mileage may vary
2017-10-31 13:50:58 +11:00