discourse/spec/services
Guo Xiang Tan cfd507822f
PERF: Improve quality of PostSearchData#raw_data. (#7275)
This commit fixes the following quality issues with `PostSearchData#raw_data`:

1. URLs are being tokenized, and links whose href and link text are the
same are being duplicated in the raw data.

`Post#cooked`:

```
<p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```
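
As a rough illustration of the "After" output above, the sketch below keeps the full URL as one term and adds only the host name split on dots. The helper name `index_terms_for_link` is hypothetical and this is not the code from the commit; it only reproduces the shape of the data shown here, using Ruby's standard `URI` library.

```ruby
require "uri"

# Hypothetical helper (not Discourse's actual implementation): build the
# search terms for a link by keeping the full URL once and appending the
# host name split into its dot-separated parts.
def index_terms_for_link(href)
  host = begin
    URI.parse(href).host
  rescue URI::InvalidURIError
    nil
  end

  # Without a parseable host, fall back to indexing the raw href only.
  return href if host.nil?

  ([href] + host.split(".")).join(" ")
end

puts index_terms_for_link("https://meta.discourse.org/some.png")
# prints "https://meta.discourse.org/some.png meta discourse org"
```

Because the path is not split, fragments such as `org/some` and `png` from the "Before" output no longer appear.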

2. Lightbox content being included in search pollutes
`PostSearchData#raw_data` unnecessarily.

From 28 March 2018 to 28 March 2019, searches for the term `image` on
`meta.discourse.org` had a click-through rate of 2.1%. Non-lightboxed
images are not included in indexing for search, yet we were indexing
content within a lightbox. Searches for terms like `image` were also
affected because we were using `Pasted image` as the filename for
uploads that were pasted.

`Post#cooked`

```
<p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```

In terms of indexing performance, we now have to parse the given HTML
through Nokogiri twice. However, performance is not a huge worry here,
since a string of length 194170 takes only 30ms to scrub and the
indexing takes place in a background job.
2019-04-01 10:14:29 +08:00
anonymous_shadow_creator_spec.rb FEATURE: add more granular user option levels for email notifications (#7143) 2019-03-15 10:55:11 -04:00
auto_silence_spec.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
badge_granter_spec.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
color_scheme_revisor_spec.rb Add rubocop to our build. (#5004) 2017-07-28 10:20:09 +09:00
destroy_task_spec.rb FIX: Allow rake destroy:topics to delete topics in sub-categories 2018-09-10 12:52:14 +01:00
flag_sockpuppets_spec.rb REFACTOR: Remove stubbed methods in tests 2019-02-04 15:06:00 -05:00
group_action_logger_spec.rb FEATURE: Add group settngs to allow users to leave a group freely. 2017-07-28 15:00:25 +09:00
group_mentions_updater_spec.rb REFACTOR: Move queue_jobs out of SiteSetting 2019-03-14 10:47:38 -04:00
group_message_spec.rb Rename "Blocked" to "Silenced" 2017-11-10 14:10:27 -05:00
i18n_interpolation_keys_finder_spec.rb FIX: Detect {{foo}} as interpolation key 2018-09-05 00:47:39 +02:00
notification_emailer_spec.rb FEATURE: add more granular user option levels for email notifications (#7143) 2019-03-15 10:55:11 -04:00
post_action_notifier_spec.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
post_alerter_spec.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
post_owner_changer_spec.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
push_notification_pusher_spec.rb FIX: Properly support defaults for upload site settings. 2019-03-13 16:36:57 +08:00
random_topic_selector_spec.rb Add rubocop to our build. (#5004) 2017-07-28 10:20:09 +09:00
search_indexer_spec.rb PERF: Improve quality of PostSearchData#raw_data. (#7275) 2019-04-01 10:14:29 +08:00
site_settings_spec.rb DEV: Improve tests to be more specific. 2018-11-13 15:02:46 +08:00
staff_action_logger_spec.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
topic_status_updater_spec.rb Enable Lint/ShadowingOuterLocalVariable for Rubocop. 2018-09-04 10:16:42 +08:00
topic_timestamp_changer_spec.rb FEATURE: remove the timecop gem 2017-07-24 12:11:10 -04:00
trust_level_granter_spec.rb FIX: grant trust level when bulk adding users to group 2017-03-06 14:39:53 +05:30
user_activator_spec.rb FIX: Don't allow invalid email to be saved. 2016-12-21 17:47:11 +08:00
user_anonymizer_spec.rb FEATURE: add more granular user option levels for email notifications (#7143) 2019-03-15 10:55:11 -04:00
user_authenticator_spec.rb FIX: apply automatic group rules when using social login providers 2018-05-23 02:26:07 +03:00
user_destroyer_spec.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
user_merger_spec.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
user_silencer_spec.rb FIX: When disagreeing with a flag that silenced a user, unsilence them 2019-02-08 08:50:50 -05:00
user_updater_spec.rb REFACTOR: Move redundant ignored user check into guardian (#7219) 2019-03-20 19:55:46 +00:00
username_changer_spec.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
username_checker_service_spec.rb FIX: Check for group name availability should skip reserved usernames. 2018-08-01 11:09:33 +08:00
wildcard_domain_checker_spec.rb FEATURE: allow multiple secrets for Discourse SSO provider 2018-10-15 16:03:53 +11:00
word_watcher_spec.rb fix the build ❤️ 2019-02-18 10:00:17 +05:30