Commit Graph

304 Commits

Author SHA1 Message Date
Bianca Nenciu
6eb3d658ca
FIX: Do not wrap unaccent around tsqueries (#16284)
tsqueries use quotes and having other characters that when unaccented
become quotes results in invalid tsqueries.
2022-03-25 19:10:05 +02:00
Bianca Nenciu
34b4b53bac
FEATURE: Use Postgres unaccent to ignore accents (#16100)
The search_ignore_accents site setting can be used to make the search
indexer remove the accents before indexing the content. The unaccent
function from PostgreSQL is better than Ruby's unicode_normalize(:nfkd).
2022-03-07 23:03:10 +02:00
Alan Guo Xiang Tan
930f51e175 FEATURE: Split up text segmentation for Chinese and Japanese.
* Chinese segmenetation will continue to rely on cppjieba
* Japanese segmentation will use our port of TinySegmenter
* Korean currently does not rely on segmentation which was dropped in c677877e4f
* SiteSetting.search_tokenize_chinese_japanese_korean has been split
into SiteSetting.search_tokenize_chinese and
SiteSetting.search_tokenize_japanese respectively
2022-02-07 09:21:14 +08:00
Alan Guo Xiang Tan
fff8b98485 SECURITY: Advanced group search did not respect visiblity of groups. 2022-01-10 13:49:26 +08:00
Penar Musaraj
d99deaf1ab
FEATURE: show recent searches in quick search panel (#15024) 2021-11-25 15:44:15 -05:00
Penar Musaraj
20f5474be9
FEATURE: Log only topic/post search queries in search log (#14994) 2021-11-18 09:21:12 +08:00
Alan Guo Xiang Tan
a03c48b720
FIX: Use the same mode for chinese search when indexing and querying. (#14780)
The `白名单` term becomes `名单 白名单` after it is processed by
cppjieba in :query mode. However, `白名单` is not tokenized as such by cppjieba when it
appears in a string of text. Therefore, this may lead to failed matches as
the search data generated while indexing may not contain all of the
terms generated by :query mode. We've decided to maintain parity for now
such that both indexing and querying uses the same :mix mode. This may
lead to less accurate search but our plan is to properly support CJK
search in the future.
2021-11-01 10:14:47 +08:00
Dan Ungureanu
f003e31e2f
PERF: Optimize search in private messages query (#14660)
* PERF: Remove JOIN on categories for PM search

JOIN on categories is not needed when searchin in private messages as
PMs are not categorized.

* DEV: Use == for string comparison

* PERF: Optimize query for allowed topic groups

There was a query that checked for all topics a user or their groups
were allowed to see. This used UNION between topic_allowed_users and
topic_allowed_groups which was very inefficient. That was replaced with
a OR condition that checks in either tables more efficiently.
2021-10-26 10:16:38 +03:00
Alan Guo Xiang Tan
6544e3b02a
DEV: Remove useless ordering when searching within a topic. (#14676)
Searching within a topic currently does not make use of PG search and
we're simply doing an `ilike` against the post raw. Furthermore,
`Post#post_number` is already unique within a topic so the other
ordering will never ever be used. This change simply makes the query
cleaner to read.
2021-10-22 10:38:21 +08:00
Penar Musaraj
e9b1d29d8b
UX: Revamp quick search (#14499)
Co-authored-by: Robin Ward <robin.ward@gmail.com>
Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>
2021-10-06 11:42:52 -04:00
Jean
34ff7bfeeb
FEATURE: Hide suspended users from site-wide search to regular users (#14245) 2021-09-06 09:59:35 -04:00
Bianca Nenciu
fbf7627c8e
FIX: Make search work with sub-sub-categories (#13901)
Searching in a category looked only one level down, ignoring the site
setting max_category_nesting. The user interface did not support the
third level of categories and did not display them in the "Categorized"
input of the advanced search options.
2021-08-02 14:04:13 +03:00
Penar Musaraj
8a470e508e
UX: Improve quick search suggestions (#13813) 2021-07-21 14:00:27 -04:00
Josh Soref
59097b207f
DEV: Correct typos and spelling mistakes (#12812)
Over the years we accrued many spelling mistakes in the code base. 

This PR attempts to fix spelling mistakes and typos in all areas of the code that are extremely safe to change 

- comments
- test descriptions
- other low risk areas
2021-05-21 11:43:47 +10:00
Krzysztof Kotlarek
e29605b79f
FEATURE: the ability to search users by custom fields (#12762)
When the admin creates a new custom field they can specify if that field should be searchable or not.

That setting is taken into consideration for quick search results.
2021-04-27 15:52:45 +10:00
Sam
5b342ae505
FIX: remove superfluous spaces from CJK blurbs (#12629)
Previously we used the raw data indexed to generate blurbs even for cases
when Chinese/Korean/Japanese text was used.

This caused superfluous spaces to show up in excerpts.
2021-04-12 12:46:42 +10:00
Alan Guo Xiang Tan
ebe4896e48 FEATURE: Change very high/low search priority to rank at absolute ends.
Prior to this change, we had weights for very_high, high, low and
very_low. This means there were 4 weights to tweak and what weights to
use for `very_high/high` and `very_low/low` pair was hard to explain.
This change makes it such that `very_high` search priority will always
ensure that the posts are ranked at the top while `very_low` search
priority will ensure that the posts are ranked at the very bottom.
2021-03-09 09:20:37 +08:00
Alan Guo Xiang Tan
4b3f65bb26 FIX: Select earliest post when aggregating posts in a topic for search.
This is a revert of
d8c796bc44
and
5bf0a0893b.

Linking to the post within a topic that has the highest rank was
confusing users and hard to explain because ranking is determined via
the PG ranking function. See the following meta topics for the
complaints after we switch to the new ordering:

1. https://meta.discourse.org/t/title-search-not-working-as-expected/157737
2. https://meta.discourse.org/t/search-results-should-prioritize-first-post-in-topic-when-title-matches-search-term/175154
2021-02-05 09:52:53 +08:00
Guo Xiang Tan
d10d296e92 FIX: Search topic title headline being truncated.
Need to apply the `HighlightAll` option in order to avoid topic titles
from truncated in headlines when displaying search results.
2020-12-22 09:09:47 +08:00
Sam
293b243aeb
FEATURE: special shortcut for searching for own posts (#11541)
You can now use `@me` to search for posts created by yourself, this is particularly handy if you have a long username.

`@me rainbow` will find all posts you created with the word rainbow.

Also cleans up test suite so it has no warnings.
2020-12-22 10:46:42 +11:00
Arpit Jalan
29c7655221
FEATURE: allow plugins to preload custom data on search (#11518)
This commit allows discourse-assign plugin to show assigned users on
search result topic list.
2020-12-17 21:59:10 +05:30
Dan Ungureanu
9b7525bb03
FIX: Handle uncaught exception (#11263)
After the search term is parsed for advanced search filters, the term may
become empty. Later, the same term will be passed to Discourse.route_for
which will raise an ArgumentError.

> URI(nil)
ArgumentError: bad argument (expected URI object or URI string)
2020-11-20 11:28:14 +02:00
Roman Rizzi
d815b95935
FEATURE: Search filter for searching all PMs on a site for admin. (#11280)
Admins can search all PMS on a site by using the `in:all-pms` advanced filter.
2020-11-19 13:56:19 -03:00
Guo Xiang Tan
68fc2a18b1 FIX: Properly handle quotes and backslash in Search.set_tsquery_weight_filter 2020-10-23 08:43:34 +08:00
Guo Xiang Tan
2607bb602e Fix broken spec.
Follow-up to 3c678df942
2020-10-08 10:52:46 +08:00
Sam
3c678df942
PERF: avoid lookbehinds when indexing search (#10862)
* PERF: avoid lookbehinds when indexing search

Previously we used a `EmailCook.url_regexp` this regex used lookbehinds

Unfortunately certain strings could lead to pathological behavior causing
CPU to skyrocket and regex replace to take a very very long time.

EmailCook still needs a fix, but it is less urgent cause it already splits
to single lines. That said we will correct that as well in a seperate PR.

New implementation is far more naive and relies on the extra spaces search
indexer inserts.
2020-10-08 11:40:13 +11:00
Arpit Jalan
f7940b1d20
FEATURE: advanced search option for max posts count (#10761)
This commit adds an option to search for max posts count and updates
the UI for posts count search to show a min/max range in single line.
2020-09-28 21:34:16 +05:30
Arpit Jalan
4498c59085 FEATURE: add alias for min_post_count search filter 2020-09-28 16:07:44 +05:30
Arpit Jalan
cdf45f4fe6 Update regex for views search filter. 2020-09-24 17:05:55 +05:30
Arpit Jalan
0c5cd0d1ef FEATURE: advanced search filters for view count 2020-09-24 15:22:18 +05:30
Bianca Nenciu
4abbe3d361
FEATURE: Make search filters case insensitive (#10715) 2020-09-23 11:59:42 +03:00
Krzysztof Kotlarek
cb58cbbc2c
FEATURE: allow to extend topic_eager_loads in Search (#10625)
This additional interface is required by encrypt plugin
2020-09-14 11:58:28 +10:00
Guo Xiang Tan
e6ca1b4326
FIX: Admin search for PMs should only search own PMs.
In c6ceda8c, a bug was introduced where an admin searching for his own
private messages will actually end up searching through all private
messages on the site.

Follow-up to c6ceda8c4e
2020-09-10 11:37:18 +08:00
Dan Ungureanu
38c9c87128
FIX: Add to tags result set only visible tags (#10580) 2020-09-02 13:24:40 +03:00
Guo Xiang Tan
40c6d90df3 PERF: Create a partial regular post_search_data index on large sites.
With the addition of `PostSearchData#private_message`, a partial
index consisting of only search data from regular posts can be created.
The partial index helps to speed up searches on large sites since PG
will not have to do an index scan on the entire search data index which
has shown to be a bottle neck.
2020-08-27 13:42:00 +08:00
siriwatknp
1a2800ad07 fix: 🐛 category & tag search regex to support thai character 2020-08-25 16:12:26 +08:00
Guo Xiang Tan
05174df5c0
FIX: Restrict personal_messages: advanced search filter to admin.
The filter noops if an incorrect username is passed. This filter is not
exposed as part of the UI but is only used when an admin transitions
from a search within a user's personal messages to the full page search.

Follow-up to 4b30799054.
2020-08-24 13:53:48 +08:00
Guo Xiang Tan
c6ceda8c4e
PERF: Avoid extra subquery when searching within PMs for normal user.
Note the following query being generated where the filter for a user's
private messages is executed twice.

```sql
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id", (TS_RANK_CD(
  post_search_data.search_data,
  TO_TSQUERY('english', '''test'':*ABCD'),
  0|32
)
 * (
  CASE categories.search_priority
  WHEN 2
  THEN 0.6
  WHEN 3
  THEN 0.8
  WHEN 4
  THEN 1.2
  WHEN 5
  THEN 1.4
  ELSE
    CASE WHEN topics.closed
    THEN 0.9
    ELSE 1
    END
  END
)
) rank, topics.bumped_at topic_bumped_at FROM "posts" INNER JOIN "post_search_data" ON "post_search_data"."post_id" = "posts"."id" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) LEFT JOIN categories ON categories.id = topics.category_id WHERE ("posts"."deleted_at" IS NULL) AND "posts"."post_type" IN (1, 2, 3) AND (topics.visible) AND (topics.archetype = 'private_message' AND post_search_data.private_message) AND (posts.topic_id IN (SELECT topic_id
FROM topic_allowed_users
WHERE user_id = 99999
UNION ALL
SELECT tg.topic_id
FROM topic_allowed_groups tg
JOIN group_users gu ON gu.user_id = 99999 AND gu.group_id = tg.group_id
)) AND (post_search_data.search_data @@ TO_TSQUERY('english', '''test'':*ABCD')) AND (posts.topic_id IN (SELECT topic_id
FROM topic_allowed_users
WHERE user_id = 99999
UNION ALL
SELECT tg.topic_id
FROM topic_allowed_groups tg
JOIN group_users gu ON gu.user_id = 99999 AND gu.group_id = tg.group_id
)) AND ((categories.id IS NULL) OR (NOT categories.read_restricted) OR (categories.id IN (999999))) ORDER BY rank DESC, topic_bumped_at DESC
```
2020-08-24 13:49:43 +08:00
Guo Xiang Tan
2f043dc89a
Fix lint. 2020-08-24 12:38:46 +08:00
Guo Xiang Tan
4b30799054
FIX: Correct personal_messages:<username> advanced search filter.
Renamed from `private_messages` to `personal_messages` without
deprecation because the `private_messages` advanced search filter never
worked in the first place when it was implemented.
2020-08-24 11:54:30 +08:00
Guo Xiang Tan
106a2f58a2
DEV: Drop support for deprecated in:private search filter. 2020-08-21 17:18:39 +08:00
Guo Xiang Tan
0684118008
DEV: Remove array_agg from search orders that does not need it. 2020-08-21 14:39:07 +08:00
Guo Xiang Tan
92b7fe4c62
PERF: Add partial index for non-pm search. 2020-08-18 15:55:08 +08:00
Guo Xiang Tan
248bebb8cd
PERF: Remove extra subquery in search.
I also noticed that removing the subquery helps the planner to plan
better.
2020-08-17 13:52:12 +08:00
Guo Xiang Tan
93f8396b4b
FIX: Limit PG headline based search blurb generation to 200 characters.
* Recovers omission characters '...' in blurb as well.
2020-08-12 15:34:27 +08:00
Guo Xiang Tan
053cbe3112
PERF: Limit characters used to generate headline for search blurb.
We determined using the following benchmark script that limiting to 2500 chars would mean a maximum of
25ms spent generating headlines.

```
require 'benchmark/ips'

string = <<~STRING
Far far away, behind the word mountains...
STRING

def sql_excerpt(string, l = 1000000)
  DB.query_single(<<~SQL)
  SELECT TS_HEADLINE('english', left('#{string}', #{l}), PLAINTO_TSQUERY('mountains'))
  SQL
end

def ruby_excerpt(string)
  output = DB.query_single("SELECT '#{string}'")[0]
  Search::GroupedSearchResults::TextHelper.excerpt(output, 'mountains', radius: 100)
end

puts "Ruby Excerpt: #{ruby_excerpt(string)}"
puts "SQL Excerpt: #{sql_excerpt(string)}"
puts

Benchmark.ips do |x|
  x.time = 10

  [1000, 2500, 5000, 10000, 20000, 50000].each do |l|
    short_string = string[0..l]

    x.report("ts_headline excerpt #{l}") do
      sql_excerpt(short_string, l)
    end

    x.report("actionview excerpt #{l}") do
      ruby_excerpt(short_string)
    end
  end

  x.compare!
end
```

```
actionview excerpt 1000:    20570.7 i/s
actionview excerpt 2500:    17863.1 i/s - 1.15x  (± 0.00) slower
actionview excerpt 5000:    14228.9 i/s - 1.45x  (± 0.00) slower
actionview excerpt 10000:    10906.2 i/s - 1.89x  (± 0.00) slower
actionview excerpt 20000:     6255.0 i/s - 3.29x  (± 0.00) slower
ts_headline excerpt 1000:     4337.5 i/s - 4.74x  (± 0.00) slower
actionview excerpt 50000:     3222.7 i/s - 6.38x  (± 0.00) slower
ts_headline excerpt 2500:     2240.4 i/s - 9.18x  (± 0.00) slower
ts_headline excerpt 5000:     1258.7 i/s - 16.34x  (± 0.00) slower
ts_headline excerpt 10000:      667.2 i/s - 30.83x  (± 0.00) slower
ts_headline excerpt 20000:      348.7 i/s - 58.98x  (± 0.00) slower
ts_headline excerpt 50000:      131.9 i/s - 155.91x  (± 0.00) slower
```
2020-08-07 14:36:52 +08:00
Guo Xiang Tan
e60c74d3c1
FEATURE: Use PG ts_headline for highlighting topic title in search. 2020-08-07 12:43:09 +08:00
Krzysztof Kotlarek
12a00d6dc5
FEATURE: add advanced order to search (#10385)
Similar to `advanced_filter` I introduced `advanced_order`.

I needed a new option because default orders are evaluated after advanced_filter so I couldn't use it.

Also, that part is a little bit more generic
```
elsif word =~ /order:\w+/
  @order = word.gsub('order:', '').to_sym
nil
```

After those changes, I can use them in plugins in this way:
```
Search.advanced_order(:votes) do |posts|
  posts.reorder("COALESCE((SELECT dvvc.counter FROM discourse_voting_vote_counters dvvc WHERE dvvc.topic_id = subquery.topic_id), 0) DESC")
end
```
2020-08-07 12:47:00 +10:00
Guo Xiang Tan
ab2b6f8dea
FIX: Specify config when generating tsquery using ts_headline. 2020-08-07 10:21:14 +08:00
Guo Xiang Tan
2193d02433
PERF: Use PG headlines for blurb generation and highlighting for search. 2020-08-06 14:56:29 +08:00