- Reduce duplication of terms in post index from unlimited to 6. This will
result in reduced index size and reduced weighting for posts containing
a huge amount of duplicate terms. (Eg: a post containing "sam sam sam sam
sam sam sam sam", will index as "sam sam sam sam sam sam", only including
the word up to 6 times.) This corrects a flaw where title weighting could
be ignored.
- Prioritize exact matches of words in titles. Our search always performs
a prefix match. However we want to give special weight to exact title matches
meaning that a search for "sum" will find topics such as "the sum of us" vs
"summer in spring".
- Pick up fixes to our search algorithm which are missing from old indexes.
Specifically pick up the fix that indexes URLs properly. (`https://happy.com`
was stemmed to `happi` in keywords and then was not searchable)
see also:
https://meta.discourse.org/t/refinements-to-search-being-tested-on-meta/254158
Indexing will take a while and work in batches, in the background.
It's very easy to forget to add `require 'rails_helper'` at the top of every core/plugin spec file, and omissions can cause some very confusing/sporadic errors.
By setting this flag in `.rspec`, we can remove the need for `require 'rails_helper'` entirely.
Lots of changes but it's mostly a refactoring.
The interesting part that was fix are the 'load_problem_<model>_ids' methods.
They will now return records with no search data associated so they can be properly indexed for the search.
This "bad" state usually happens after a migration.
This does two things
1. Our "index grace period" has been wound down to 1 day, there is no point
keeping a bloated index for a week, usually when people delete stuff they
mean for it to be removed
2. We were never dropping deleted posts from the index, only posts from
deleted topics
These changes speed up search a tiny bit and reduce background work.
This change both speeds up specs (less strings to allocate) and helps catch
cases where methods in Discourse are mutating inputs.
Overall we will be migrating everything to use #frozen_string_literal: true
it will take a while, but this is the first and safest move in this direction
If the post ids keep loading, we might end up in a situations where
we're always loading the same post ids over and over again without
indexing anything new.
Follow up to daeda80ada.