During search indexing we "stuff" the index with additional keywords for
entities that look like domain names.
This allows searches for `cnn` to find URLs for `www.cnn.com`
The search stuffing attempted to keep indexes aligned at the correct positions
by remapping the indexed terms. However under certain edge cases a single
word can stem into 2 different lexemes. If this happened we had an off by
one which caused the entire indexing to fail.
We work around this edge case (and carry incorrect index positions) for cases
like this. It is unlikely to impact search quality at all given index position
makes almost no difference in the search algorithm.
If a post contains domain with a word that stems to a non prefix single
words will not match it.
For example: in happy.com, `happy` stems to `happi`. Thus searches for happy
will not find URLs with it included.
This bloats the index a tiny bit, but impact is limited.
Will require a full reindex of search to take effect.
When we are done refining search we can consider a full version bump.
Previous regex did not allow for cases where a lexeme contains a : (colon)
This can happen when parsing URLs. New algorithm allows for this.
Test was amended to more clearly call out index problems
* FEATURE: allow restricting duplication in search index
This introduces the site setting `max_duplicate_search_index_terms`.
Using this number we limit the amount of duplication in our search index.
This allows us to more correctly weight title searches, so bloated posts
don't unfairly bump to the top of search results.
This feature is completely disabled by default and behind a site setting
We will experiment with it first. Note entire search index must be rebuilt
for it to take effect.
---------
Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>
This makes it easier to find PMs involving a particular user, for
example by searching for `in:messages thisUser` (previously, that query
would only return results in posts where `thisUser` was in the post body).
The search_ignore_accents site setting can be used to make the search
indexer remove the accents before indexing the content. The unaccent
function from PostgreSQL is better than Ruby's unicode_normalize(:nfkd).
It's very easy to forget to add `require 'rails_helper'` at the top of every core/plugin spec file, and omissions can cause some very confusing/sporadic errors.
By setting this flag in `.rspec`, we can remove the need for `require 'rails_helper'` entirely.
This commit fixes a bug where we our `HTMLScrubber` was only searching
for emoji img tags which contains only the "emoji" class. However, our emoji image tags
may contain more than just the "emoji" class like "only-emoji" when an
emoji exists by itself on a single line.
Over the years we accrued many spelling mistakes in the code base.
This PR attempts to fix spelling mistakes and typos in all areas of the code that are extremely safe to change
- comments
- test descriptions
- other low risk areas
When the admin creates a new custom field they can specify if that field should be searchable or not.
That setting is taken into consideration for quick search results.
* PERF: avoid lookbehinds when indexing search
Previously we used a `EmailCook.url_regexp` this regex used lookbehinds
Unfortunately certain strings could lead to pathological behavior causing
CPU to skyrocket and regex replace to take a very very long time.
EmailCook still needs a fix, but it is less urgent cause it already splits
to single lines. That said we will correct that as well in a seperate PR.
New implementation is far more naive and relies on the extra spaces search
indexer inserts.
In the near future, we will be swtiching to PG headlines to generate the
search blurb. As such, we need to replace audio and video links in the
raw data used for headline generation. This also means that we avoid
replacing links each time we need to generate the blurb.
We do prefix matching in search so there is no need to inject the extra
terms.
Before:
```
"'discourse':10,11 'discourse.org':10,11 'org':10,11 'test':8A,10,11 'test.discourse.org':10,11 'titl':4A 'uncategor':9B"
```
After:
```
"'discourse.org':10,11 'org':10,11 'test':8A 'test.discourse.org':10,11 'titl':4A 'uncategor':9B"
```
Indexing query strings in URLS produces inconsistent results in PG and
pollutes the search data for really little gain.
The following seems to work as expected...
```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
to_tsvector
------------------------------------------------------
'2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```
However, once a path is present
```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
to_tsvector
----------------------------------------------------------------------------------------------
'/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```
The lexeme contains both the path and the query string.
```
discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org');
alias | lexemes
-------+---------------------
host | {www.discourse.org}
discourse_development=# SELECT TO_TSVECTOR('www.discourse.org');
to_tsvector
-----------------------
'www.discourse.org':1
```
Given the above lexeme, we will inject additional lexeme by splitting
the host on `.`. The actual tsvector stored will look something like
```
tsvector
---------------------------------------
'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```
Previous to this fix is a post had the test www.test.com/abc it would fail
to index.
This also simplifies the rules to avoid full url parsing which can be
expensive
This change both speeds up specs (less strings to allocate) and helps catch
cases where methods in Discourse are mutating inputs.
Overall we will be migrating everything to use #frozen_string_literal: true
it will take a while, but this is the first and safest move in this direction
This commit fixes the follow quality issue with `PostSearchData#raw_data`:
1. URLs are being tokenized and links with similar href and characters
are being duplicated in the raw data.
`Post#cooked`:
```
<p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p>
```
`PostSearchData#raw_data` Before:
```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```
`PostSearchData#raw_data` After:
```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```
2. Ligthbox being included in search pollutes the
`PostSearchData#raw_data` unncessarily.
From 28 March 2018 to 28 March 2019, searches for the term `image` on
`meta.discourse.org` had a click through rate of 2.1%. Non-lightboxed images are not included in indexing for search yet we were indexing content within a lightbox. Also, search for terms like `image` was affected we were using `Pasted image` as the filename for
uploads that were pasted.
`Post#cooked`
```
<p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p>
```
`PostSearchData#raw_data` Before:
```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```
`PostSearchData#raw_data` After:
```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```
In terms of indexing performance, we now have to parse the given HTML
through nokogiri twice. However performance is not a huge worry here since a string length of 194170 takes only 30ms
to scrub plus the indexing takes place in a background job.
* This is causing certain posts to appear in searches incorrectly as `PostSearchData#raw_data` contains the outdated title, category name and tag names.