discourse

mirror of https://github.com/discourse/discourse.git synced 2024-11-26 12:43:39 +08:00

Author	SHA1	Message	Date
Sam	3c678df942	PERF: avoid lookbehinds when indexing search (#10862) Previously we used `EmailCook.url_regexp`, a regex that uses lookbehinds. Unfortunately, certain strings could lead to pathological behavior, causing CPU usage to skyrocket and the regex replace to take a very long time. EmailCook still needs a fix, but it is less urgent because it already splits its input into single lines. That said, we will correct that as well in a separate PR. The new implementation is far more naive and relies on the extra spaces the search indexer inserts.	2020-10-08 11:40:13 +11:00
Guo Xiang Tan	92b7fe4c62	PERF: Add partial index for non-pm search.	2020-08-18 15:55:08 +08:00
Guo Xiang Tan	255b0e9f14	PERF: Replace video and audio links in search blurb while indexing. In the near future, we will be switching to PG headlines to generate the search blurb. As such, we need to replace audio and video links in the raw data used for headline generation. This also means that we avoid replacing links each time we need to generate the blurb.	2020-08-06 12:25:03 +08:00
Guo Xiang Tan	15e9057ec5	FIX: Reduce number of terms injected for host lexeme. We do prefix matching in search so there is no need to inject the extra terms. Before: ``` "'discourse':10,11 'discourse.org':10,11 'org':10,11 'test':8A,10,11 'test.discourse.org':10,11 'titl':4A 'uncategor':9B" ``` After: ``` "'discourse.org':10,11 'org':10,11 'test':8A 'test.discourse.org':10,11 'titl':4A 'uncategor':9B" ```	2020-07-27 15:29:59 +08:00
Guo Xiang Tan	0f53ad58c2	FIX: Improve regexp for matching version lexeme. Follow up to `b70f1084f7`	2020-07-27 15:18:27 +08:00
Guo Xiang Tan	b70f1084f7	FIX: Don't inject extra terms for version lexeme.	2020-07-27 14:46:44 +08:00
Guo Xiang Tan	181c4eb760	PERF: Avoid parsing `Post#cooked` with Nokogiri for every search.	2020-07-24 10:43:09 +08:00
Guo Xiang Tan	609ba50fe8	DEV: Add more granularity to `SearchIndexer` versions. Sometimes, we just want to reindex a specific model and not all the things.	2020-07-23 14:24:06 +08:00
Guo Xiang Tan	2196d0b9ae	FIX: Strip query from URLs when indexing for search. Indexing query strings in URLs produces inconsistent results in PG and pollutes the search data for very little gain. The following seems to work as expected... ``` discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3'); to_tsvector ------------------------------------------------------ '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1 ``` However, once a path is present ``` discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3'); to_tsvector ---------------------------------------------------------------------------------------------- '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1 ``` The lexeme contains both the path and the query string.	2020-07-14 15:32:40 +08:00
Guo Xiang Tan	5c230266d3	FIX: Inject extra lexemes for host lexeme. ``` discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org'); alias | lexemes -------+--------------------- host | {www.discourse.org} discourse_development=# SELECT TO_TSVECTOR('www.discourse.org'); to_tsvector ----------------------- 'www.discourse.org':1 ``` Given the above lexeme, we will inject additional lexemes by splitting the host on `.`. The actual tsvector stored will look something like ``` tsvector --------------------------------------- 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1 ```	2020-07-14 15:32:40 +08:00
Sam Saffron	6428aa5b1f	FIX: search indexer had various cases where it could fail. Prior to this fix, if a post contained www.test.com/abc it would fail to index. This also simplifies the rules to avoid full URL parsing, which can be expensive.	2019-06-04 16:21:03 +10:00
Gerhard Schlager	b788948985	FEATURE: English locale with international date formats Makes en_US the new default locale	2019-05-20 13:47:20 +02:00
Sam Saffron	4ea21fa2d0	DEV: use #frozen_string_literal: true on all specs. This change both speeds up specs (fewer strings to allocate) and helps catch cases where methods in Discourse are mutating inputs. Overall, we will be migrating everything to use #frozen_string_literal: true. It will take a while, but this is the first and safest move in this direction.	2019-04-30 10:27:42 +10:00
Gerhard Schlager	876c4f20b3	FIX: Remove duplicate Emoji names from blurb The blurb contained the value of the alt and title attribute of Emojis. Both values are always the same.	2019-04-29 17:26:39 +02:00
Gerhard Schlager	71d19f6e1f	FIX: Reduce mentions in blurbs to @username or @groupname The link to the user profile or group is useless and the URL encoded username or group name looks awful for Unicode names	2019-04-29 17:26:39 +02:00
Sam Saffron	45285f1477	DEV: remove update_attributes which is deprecated in Rails 6 See: https://github.com/rails/rails/pull/31998 update_attributes is a relic of the past, it should no longer be used.	2019-04-29 17:32:25 +10:00
Guo Xiang Tan	d8704c11ca	PERF: Better use of index when queueing a topic for search reindex. Also move `Search::INDEX_VERSION` to `SearchIndexer` which is where the version is actually being used.	2019-04-02 09:53:37 +08:00
Guo Xiang Tan	2a69ab4a4c	FIX: Keep `alt` and `title` in lightbox when indexing for search. Follow up to `cfd507822f`	2019-04-01 16:20:33 +08:00
Guo Xiang Tan	16215f9d3b	DEV: Correct spec added in `cfd507822f`. Remove stub.	2019-04-01 10:32:25 +08:00
Guo Xiang Tan	cfd507822f	PERF: Improve quality of `PostSearchData#raw_data`. (#7275) This commit fixes the following quality issues with `PostSearchData#raw_data`: 1. URLs are being tokenized and links with similar href and characters are being duplicated in the raw data. `Post#cooked`: ``` <p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p> ``` `PostSearchData#raw_data` Before: ``` This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png ``` `PostSearchData#raw_data` After: ``` This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org ``` 2. Lightbox content being included in the search index pollutes the `PostSearchData#raw_data` unnecessarily. From 28 March 2018 to 28 March 2019, searches for the term `image` on `meta.discourse.org` had a click-through rate of 2.1%. Non-lightboxed images are not included in indexing for search, yet we were indexing content within a lightbox. Also, search for terms like `image` was affected because we were using `Pasted image` as the filename for uploads that were pasted.
`Post#cooked` ``` <p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p> ``` `PostSearchData#raw_data` Before: ``` This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000 ``` `PostSearchData#raw_data` After: ``` This is a test topic 0 Uncategorized Let me see how I can fix this image ``` In terms of indexing performance, we now have to parse the given HTML through nokogiri twice. However performance is not a huge worry here since a string length of 194170 takes only 30ms to scrub plus the indexing takes place in a background job.	2019-04-01 10:14:29 +08:00
Guo Xiang Tan	daeda80ada	FIX: Don't index posts with empty `Post#raw` for search. (#7263) * DEV: Remove unnecessary join in `Jobs::ReindexSearch`. * FIX: Don't index posts with empty `Post#raw` for search.	2019-04-01 10:06:27 +08:00
Penar Musaraj	51e08feb7e	DEV: Refactor icons used in lightbox HTML Uses <svg> elements instead of hacky CSS pseudoelements Adds a migration to mark posts with lightboxes as needing a rebake	2019-03-22 11:52:06 -04:00
Guo Xiang Tan	d808f36fc4	FIX: Reindex post for search when post is moved to a different topic. * This is causing certain posts to appear in searches incorrectly as `PostSearchData#raw_data` contains the outdated title, category name and tag names.	2019-03-19 17:19:14 +08:00
Régis Hanol	4481836de2	FEATURE: new 'search_ignore_accents' site setting	2018-09-17 10:42:30 +02:00
Régis Hanol	30619c244c	FIX: don't index urls to local files	2018-09-13 18:53:53 +02:00
Sam	9b7cab589a	FIX: revert diacritic stripping See more details in test case and at: https://meta.discourse.org/t/discourse-should-ignore-if-a-character-is-accented-when-doing-a-search/90198/16?u=sam	2018-08-31 11:46:55 +10:00
Régis Hanol	bc7b530b0a	FIX: remove diacritics instead of transliterating	2018-08-24 00:38:44 +02:00
Régis Hanol	2fcf2b899e	FIX: remove diacritics when tokenizing html for search	2018-08-23 17:13:52 +02:00
Bianca Nenciu	975a72ab7a	FEATURE: Make links indexable. (#6285)	2018-08-20 10:39:19 +10:00
Sam	6676bbd38b	FEATURE: index YouTube titles in search Previously we omitted the titles for videos that YouTube provided	2018-04-26 15:46:52 +10:00
Sam	86d12bd44b	FEATURE: search within title using in:title Also - Significantly improved search ranking, title is treated most strongly - Adds tag names to the index - Run search re-indexer more aggressively - Re-index topic and all posts on category change	2018-02-20 14:41:21 +11:00
Erick Guan	6e59149a77	FIX: rebuild index when engine replaced (#5021)	2017-08-16 07:38:34 -04:00
Sam	0a78ae739d	Remove SearchObserver; the aim is to remove all observers. The rails-observers gem is mostly unmaintained and is a pain to carry forward. The new implementation contains significantly less magic as a bonus.	2016-12-22 13:13:14 +11:00
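
Two of the indexing rules described in the commit messages above (`2196d0b9ae`, which strips query strings from URLs, and `15e9057ec5`, which reduces the terms injected for a host lexeme) can be illustrated with a minimal Ruby sketch. The helper names `strip_query` and `host_suffix_terms` are hypothetical and are not taken from `SearchIndexer`; the sketch assumes the reduced scheme injects only the dot-separated suffixes of a host, which is what the before/after tsvectors in `15e9057ec5` suggest.

```ruby
# Hypothetical helpers for illustration only; not the actual SearchIndexer code.

# Drops everything from the first "?" onwards, so that PG does not fold the
# query string into a single path lexeme (see commit 2196d0b9ae).
def strip_query(url)
  url.sub(/\?.*\z/m, "")
end

# Returns the proper dot-separated suffixes of a host. Because search uses
# prefix matching, the leading labels do not need to be injected separately
# (assumption based on the before/after output in commit 15e9057ec5).
def host_suffix_terms(host)
  parts = host.split(".")
  (1...parts.length).map { |i| parts.drop(i).join(".") }
end

puts strip_query("https://www.discourse.org/latest?test=2&test2=3")
puts host_suffix_terms("test.discourse.org").inspect
```

With these definitions, the host `test.discourse.org` yields the extra terms `discourse.org` and `org`, matching the additional lexemes shown in the "After" tsvector of `15e9057ec5`.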

33 Commits