Commit Graph

34 Commits

Author SHA1 Message Date
Bianca Nenciu
c10df4b58d
FIX: Make HTML scrubber work with deep HTML (#12619)
SearchIndexer and ReindexSearch used to explode for posts with very
deep or invalid HTML content.
2021-04-07 17:02:00 +10:00
Sam
3c678df942
PERF: avoid lookbehinds when indexing search (#10862)
* PERF: avoid lookbehinds when indexing search

Previously we used a `EmailCook.url_regexp` this regex used lookbehinds

Unfortunately certain strings could lead to pathological behavior causing
CPU to skyrocket and regex replace to take a very very long time.

EmailCook still needs a fix, but it is less urgent cause it already splits
to single lines. That said we will correct that as well in a seperate PR.

New implementation is far more naive and relies on the extra spaces search
indexer inserts.
2020-10-08 11:40:13 +11:00
Guo Xiang Tan
92b7fe4c62
PERF: Add partial index for non-pm search. 2020-08-18 15:55:08 +08:00
Guo Xiang Tan
255b0e9f14
PERF: Replace video and audio links in search blurb while indexing.
In the near future, we will be swtiching to PG headlines to generate the
search blurb. As such, we need to replace audio and video links in the
raw data used for headline generation. This also means that we avoid
replacing links each time we need to generate the blurb.
2020-08-06 12:25:03 +08:00
Guo Xiang Tan
15e9057ec5
FIX: Reduce number of terms injected for host lexeme.
We do prefix matching in search so there is no need to inject the extra
terms.

Before:
```
"'discourse':10,11 'discourse.org':10,11 'org':10,11 'test':8A,10,11 'test.discourse.org':10,11 'titl':4A 'uncategor':9B"
```

After:
```
"'discourse.org':10,11 'org':10,11 'test':8A 'test.discourse.org':10,11 'titl':4A 'uncategor':9B"
```
2020-07-27 15:29:59 +08:00
Guo Xiang Tan
0f53ad58c2
FIX: Improve regexp for matching version lexeme.
Follow up to b70f1084f7
2020-07-27 15:18:27 +08:00
Guo Xiang Tan
b70f1084f7
FIX: Don't inject extra terms for version lexeme. 2020-07-27 14:46:44 +08:00
Guo Xiang Tan
181c4eb760 PERF: Avoid parsing Post#cooked with Nokogiri for every search. 2020-07-24 10:43:09 +08:00
Guo Xiang Tan
609ba50fe8
DEV: Add more granularity to SearchIndexer versions.
Sometimes, we just want to reindex a specific model and not all the
things.
2020-07-23 14:24:06 +08:00
Guo Xiang Tan
2196d0b9ae
FIX: Strip query from URLs when indexing for search.
Indexing query strings in URLS produces inconsistent results in PG and
pollutes the search data for really little gain.

The following seems to work as expected...

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
                     to_tsvector
------------------------------------------------------
 '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```

However, once a path is present

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
                                         to_tsvector
----------------------------------------------------------------------------------------------
 '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```

The lexeme contains both the path and the query string.
2020-07-14 15:32:40 +08:00
Guo Xiang Tan
5c230266d3
FIX: Inject extra lexemes for host lexeme.
```
discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org');
 alias |       lexemes
-------+---------------------
 host  | {www.discourse.org}

discourse_development=# SELECT TO_TSVECTOR('www.discourse.org');
      to_tsvector
-----------------------
 'www.discourse.org':1
```

Given the above lexeme, we will inject additional lexeme by splitting
the host on `.`. The actual tsvector stored will look something like

```
               tsvector
---------------------------------------
 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```
2020-07-14 15:32:40 +08:00
Sam Saffron
6428aa5b1f FIX: search indexer had various cases where it could fail
Previous to this fix is a post had the test www.test.com/abc it would fail
to index.

This also simplifies the rules to avoid full url parsing which can be
expensive
2019-06-04 16:21:03 +10:00
Gerhard Schlager
b788948985 FEATURE: English locale with international date formats
Makes en_US the new default locale
2019-05-20 13:47:20 +02:00
Sam Saffron
4ea21fa2d0 DEV: use #frozen_string_literal: true on all spec
This change both speeds up specs (less strings to allocate) and helps catch
cases where methods in Discourse are mutating inputs.

Overall we will be migrating everything to use #frozen_string_literal: true
it will take a while, but this is the first and safest move in this direction
2019-04-30 10:27:42 +10:00
Gerhard Schlager
876c4f20b3 FIX: Remove duplicate Emoji names from blurb
The blurb contained the value of the alt and title attribute of Emojis. Both values are always the same.
2019-04-29 17:26:39 +02:00
Gerhard Schlager
71d19f6e1f FIX: Reduce mentions in blurbs to @username or @groupname
The link to the user profile or group is useless and the URL encoded username or group name looks awful for Unicode names
2019-04-29 17:26:39 +02:00
Sam Saffron
45285f1477 DEV: remove update_attributes which is deprecated in Rails 6
See: https://github.com/rails/rails/pull/31998

update_attributes is a relic of the past, it should no longer be used.
2019-04-29 17:32:25 +10:00
Guo Xiang Tan
d8704c11ca PERF: Better use of index when queueing a topci for search reindex.
Also move `Search::INDEX_VERSION` to `SearchIndexer` which is where the
version is actually being used.
2019-04-02 09:53:37 +08:00
Guo Xiang Tan
2a69ab4a4c FIX: Keep alt and title in lightbox when indexing for search.
Follow up to cfd507822f
2019-04-01 16:20:33 +08:00
Guo Xiang Tan
16215f9d3b DEV: Correct spec added in cfd507822f.
Remove stub.
2019-04-01 10:32:25 +08:00
Guo Xiang Tan
cfd507822f
PERF: Improve quality of PostSearchData#raw_data. (#7275)
This commit fixes the follow quality issue with `PostSearchData#raw_data`:

1. URLs are being tokenized and links with similar href and characters
are being duplicated in the raw data.

`Post#cooked`:

```
<p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```

2. Ligthbox being included in search pollutes the
`PostSearchData#raw_data` unncessarily.

From 28 March 2018 to 28 March 2019, searches for the term `image` on
`meta.discourse.org` had a click through rate of 2.1%. Non-lightboxed images are not included in indexing for search yet we were indexing content within a lightbox. Also, search for terms like `image` was affected we were using `Pasted image` as the filename for
uploads that were pasted.

`Post#cooked`

```
<p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```

In terms of indexing performance, we now have to parse the given HTML
through nokogiri twice. However performance is not a huge worry here since a string length of 194170 takes only 30ms
to scrub plus the indexing takes place in a background job.
2019-04-01 10:14:29 +08:00
Guo Xiang Tan
daeda80ada
FIX: Don't index posts with empty Post#raw for search. (#7263)
* DEV: Remove unnecessary join in `Jobs::ReindexSearch`.

* FIX: Don't index posts with empty `Post#raw` for search.
2019-04-01 10:06:27 +08:00
Penar Musaraj
51e08feb7e DEV: Refactor icons used in lightbox HTML
Uses <svg> elements instead of hacky CSS pseudoelements

Adds a migration to mark posts with lightboxes as needing a rebake
2019-03-22 11:52:06 -04:00
Guo Xiang Tan
d808f36fc4 FIX: Reindex post for search when post is moved to a different topic.
* This is causing certain posts to appear in searches incorrectly as `PostSearchData#raw_data` contains the outdated title, category name and tag names.
2019-03-19 17:19:14 +08:00
Régis Hanol
4481836de2 FEATURE: new 'search_ignore_accents' site setting 2018-09-17 10:42:30 +02:00
Régis Hanol
30619c244c FIX: don't index urls to local files 2018-09-13 18:53:53 +02:00
Sam
9b7cab589a FIX: revert diacritic stripping
See more details in test case and at: https://meta.discourse.org/t/discourse-should-ignore-if-a-character-is-accented-when-doing-a-search/90198/16?u=sam
2018-08-31 11:46:55 +10:00
Régis Hanol
bc7b530b0a FIX: remove diacritics instead of transliterating 2018-08-24 00:38:44 +02:00
Régis Hanol
2fcf2b899e FIX: remove diacritics when tokenizing html for search 2018-08-23 17:13:52 +02:00
Bianca Nenciu
975a72ab7a FEATURE: Make links indexable. (#6285) 2018-08-20 10:39:19 +10:00
Sam
6676bbd38b FEATURE: index YouTube titles in search
Previously we omitted the titles for videos that YouTube provided
2018-04-26 15:46:52 +10:00
Sam
86d12bd44b FEATURE: search within title using in:title
Also

- Significantly improved search ranking, title is treated most strongly
- Adds tag names to the index
- Run search re-indexer more aggressively
- Re-index topic and all posts on category change
2018-02-20 14:41:21 +11:00
Erick Guan
6e59149a77 FIX: rebuild index when engine replaced (#5021) 2017-08-16 07:38:34 -04:00
Sam
0a78ae739d Remove SearchObserver, aim is to remove all observers
rails-observers gem is mostly unmaintained and is a pain to carry forward
new implementation contains significantly less magic as a bonus
2016-12-22 13:13:14 +11:00