discourse/spec
Guo Xiang Tan cfd507822f
PERF: Improve quality of PostSearchData#raw_data. (#7275)
This commit fixes the following quality issues with `PostSearchData#raw_data`:

1. URLs are being tokenized, and links whose href and text are similar
are being duplicated in the raw data.

`Post#cooked`:

```
<p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```
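
As a rough illustration of the "After" output above, the sketch below indexes a link's full URL once and then adds only the dot-separated parts of its host as extra terms. `link_search_terms` is a hypothetical helper written for this example using only Ruby's standard `uri` library; it is an assumption about one way to get this result, not the code this commit adds.

```ruby
require "uri"

# Hypothetical helper (not Discourse's actual implementation): keep the full
# URL as a single term, then append the host's dot-separated parts so that
# a search for "discourse" can still match the link.
def link_search_terms(href)
  uri = URI.parse(href)

  # Relative links have no host, so only the href itself is kept.
  return [href] if uri.host.nil?

  [href, *uri.host.split(".")]
rescue URI::InvalidURIError
  # Unparseable hrefs are kept verbatim so they remain searchable.
  [href]
end

puts link_search_terms("https://meta.discourse.org/some.png").join(" ")
# prints: https://meta.discourse.org/some.png meta discourse org
```

Because the path (`/some.png`) is never split into separate terms here, fragments such as `org/some` and `png` from the "Before" output do not appear.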

2. Lightbox content being included in search pollutes the
`PostSearchData#raw_data` unnecessarily.

From 28 March 2018 to 28 March 2019, searches for the term `image` on
`meta.discourse.org` had a click through rate of 2.1%. Non-lightboxed
images are not included in indexing for search, yet we were indexing
content within a lightbox. Also, searches for terms like `image` were
affected because we were using `Pasted image` as the filename for
uploads that were pasted.

`Post#cooked`:

```
<p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```

In terms of indexing performance, we now have to parse the given HTML
through nokogiri twice. However, performance is not a huge worry here,
since a string of length 194170 takes only 30ms to scrub and the
indexing takes place in a background job.
2019-04-01 10:14:29 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| components | DEV: Remove warning. | 2019-04-01 10:11:08 +08:00 |
| fabricators | FIX: Admin search logs should filter by date instead of timestamp. | 2019-03-29 11:50:25 +08:00 |
| fixtures | FIX: upload watched words should use UTF-8 | 2019-03-21 13:46:16 -04:00 |
| helpers | Fix the build. | 2019-03-13 17:39:07 +08:00 |
| import_export | FIX: topic and category exporters were only exporting users who created the first post | 2018-01-16 12:51:53 -05:00 |
| integration | FEATURE: New 'Reviewable' model to make reviewable items generic | 2019-03-28 12:45:10 -04:00 |
| integrity | FIX: Relative links in translations should work with subfolder | 2018-11-08 23:31:05 +00:00 |
| jobs | FIX: Don't index posts with empty Post#raw for search. (#7263) | 2019-04-01 10:06:27 +08:00 |
| lib | FEATURE: unconditionally update Topic updated_at when posts change in topic | 2019-03-28 17:28:01 +11:00 |
| mailers | FIX: Keep original subject in emails to staged users | 2019-01-18 11:07:54 +01:00 |
| models | FIX: Allow users with posts to be rejected | 2019-03-29 13:53:46 -04:00 |
| multisite | DEV: Allow custom value when pausing sidekiq to aid in debugging. | 2019-02-19 10:55:53 +08:00 |
| requests | FEATURE: Let users delete their own topics. (#7267) | 2019-03-29 17:10:05 +01:00 |
| serializers | FEATURE: New 'Reviewable' model to make reviewable items generic | 2019-03-28 12:45:10 -04:00 |
| services | PERF: Improve quality of PostSearchData#raw_data. (#7275) | 2019-04-01 10:14:29 +08:00 |
| support | FEATURE: New 'Reviewable' model to make reviewable items generic | 2019-03-28 12:45:10 -04:00 |
| tasks | suppress print output when running specs | 2017-10-31 16:06:11 +05:30 |
| views/omniauth_callbacks | FEATURE: Use translated name for 'your email has been authenticated by' (#6649) | 2018-11-22 19:12:04 +00:00 |
| rails_helper.rb | FEATURE: New 'Reviewable' model to make reviewable items generic | 2019-03-28 12:45:10 -04:00 |