Commit Graph

613 Commits

Author SHA1 Message Date
Régis Hanol
501b19b6e0
FIX: server-side HtmlToMarkdown improvements (#9586)
TLDR; this commit vastly improves how whitespaces are handled when converting from HTML to Markdown.
It also adds support for converting HTML <tables> to markdown tables.

The previous 'remove_whitespaces!' method was traversing the whole HTML tree and used a heuristic to remove
leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements)

It was a good idea, but it was very limited and led to bad conversions when the HTML had leading whitespace on several lines, for example.
One such example can be found [here](https://meta.discourse.org/t/86782).

For various reasons, most of the whitespace in an HTML file is ignored when the page is displayed in a browser.
The rules that the browsers follow are the [CSS' White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules).
They can be quite complicated when you take into account RTL languages and various other tidbits, but they boil down to the following:

- Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are displayed on the same line)
- Remove any leading/trailing whitespaces inside an inline context

One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'.
We would also need to hoist <pre> elements in order to not mess with their whitespaces.
Unfortunately, this solution lets some whitespaces creep in around HTML tags, which leads to more '.strip!' calls than I can bear.
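
A minimal sketch of the quick & dirty approach described above, assuming plain string input. The `collapse_whitespace` helper is hypothetical and is not part of HtmlToMarkdown; it only illustrates hoisting <pre> blocks around the `gsub` call.

```ruby
# Hypothetical helper, not part of HtmlToMarkdown: collapse whitespace runs
# everywhere except inside <pre> blocks.
def collapse_whitespace(html)
  # Splitting on a capturing group keeps the <pre> blocks in the result.
  html.split(%r{(<pre\b.*?</pre>)}mi).map do |part|
    part.match?(/\A<pre\b/i) ? part : part.gsub(/[[:space:]]+/, " ")
  end.join
end

collapsed = collapse_whitespace("<p>  Foo \n\t bar </p><pre>a\n  b</pre>")
```

Here `collapsed` still has a space right after `<p>` and right before `</p>`, which is the kind of leftover that would need extra stripping.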

I decided to "emulate" the browser's handling of whitespaces and came up with a solution in 4 parts

1. remove_not_allowed!

The HtmlToMarkdown library is recursively "visiting" all the nodes in the HTML in order to convert them to Markdown.
All the nodes that aren't handled by the library (eg. <script>, <style> or any non-textual HTML tags) are "swallowed".
In order to reduce the number of nodes visited, the method 'remove_not_allowed!' will automatically delete all the nodes
that have no "visitor" (eg. a 'visit_<tag>' method) defined.
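
As a rough illustration of the idea (not the library's actual code), the sketch below prunes a toy tree. `SketchNode` and `SketchConverter` are hypothetical stand-ins for a parsed HTML document and the converter, and the set of `visit_*` methods is made up for the example.

```ruby
# Hypothetical stand-in for an HTML node.
SketchNode = Struct.new(:name, :children)

class SketchConverter
  # Only these node types have a visitor in this sketch.
  def visit_p(node); end
  def visit_i(node); end
  def visit_text(node); end

  # Delete every descendant that has no matching visit_<tag> method.
  def remove_not_allowed!(node)
    node.children.select! { |child| respond_to?("visit_#{child.name}") }
    node.children.each { |child| remove_not_allowed!(child) }
    node
  end
end

tree = SketchNode.new("p", [
  SketchNode.new("text", []),
  SketchNode.new("script", [SketchNode.new("text", [])]),
  SketchNode.new("i", [SketchNode.new("style", [])]),
])
SketchConverter.new.remove_not_allowed!(tree)
```

After the call, the <script> and <style> nodes are gone and only the "text" and <i> children remain under <p>.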

2. remove_hidden!

Similar purpose as the previous method (ie. reducing the number of nodes visited): there's no point trying to convert something that is hidden.
The 'remove_hidden!' method removes any nodes that were hidden using the "hidden" HTML attribute, some CSS or with a width or height equal to 0.

3. hoist_line_breaks!

The 'hoist_line_breaks!' method is there to handle <br> tags. I know those tiny <br> don't do much but they can be quite annoying.
The <br> tags are inline elements but they visually work like a block element (ie. they create a new line).
If you have the following HTML "<i>Foo<br>Bar</i>", it ends up visually similar to "<i>Foo</i><br><i>Bar</i>".
The latter is much easier to process than the former, so that's what this method does.
The 'hoist_line_breaks!' method will hoist <br> tags out of inline tags until their parent is a block element.
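
The hoisting can be sketched on a toy tree as follows. This is an illustration under assumptions, not the library's implementation: `BrNode`, `HOIST_INLINE_TAGS`, `hoist_line_breaks` and `to_html` are hypothetical names, and the list of inline tags is deliberately short.

```ruby
# Hypothetical stand-in for an HTML node; `text` is only set on text nodes.
BrNode = Struct.new(:name, :children, :text)
HOIST_INLINE_TAGS = %w[a b em i s span strong].freeze

# Returns the nodes that should take the place of `node` in its parent.
def hoist_line_breaks(node)
  return [node] if node.children.empty?

  kids = node.children.flat_map { |child| hoist_line_breaks(child) }
  has_br = kids.any? { |kid| kid.name == "br" }
  unless HOIST_INLINE_TAGS.include?(node.name) && has_br
    node.children = kids
    return [node]
  end

  # Inline tag containing <br>: split at each <br> and wrap every run of
  # other nodes in its own copy of the tag, leaving the <br> between them.
  kids.slice_when { |a, b| a.name == "br" || b.name == "br" }.flat_map do |run|
    run.first.name == "br" ? run : [BrNode.new(node.name, run, nil)]
  end
end

def to_html(node)
  return node.text if node.name == "text"
  return "<br>" if node.name == "br"

  "<#{node.name}>#{node.children.map { |child| to_html(child) }.join}</#{node.name}>"
end

tree = BrNode.new("p", [
  BrNode.new("i", [
    BrNode.new("text", [], "Foo"),
    BrNode.new("br", [], nil),
    BrNode.new("text", [], "Bar"),
  ], nil),
], nil)
html = to_html(hoist_line_breaks(tree).first)
```

With this input, `html` is "<p><i>Foo</i><br><i>Bar</i></p>": the <br> now sits directly under the block-level <p>.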

4. remove_whitespaces!

The "remove_whitespaces!" is where all the whitespace removal is happening. It's broken down into 4 methods as well

- remove_whitespaces!
- is_inline?
- collapse_spaces!
- remove_trailing_space!

The 'remove_whitespaces!' method recursively walks the HTML tree (skipping <pre> tags).
If a node has any children, they will be chunked into groups of inline elements vs block elements.
For each chunk of inline elements, it will call the "collapse_spaces!" and "remove_trailing_space!" methods.
For each chunk of block elements, it will call "remove_whitespaces!" to keep walking the HTML tree recursively.

The "is_inline?" method determines whether a node is part of an inline context.
A node is inline iff it's a text node or it's an inline tag, but not <br>, and all its children are also inline.
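
The inline test and the chunking of children can be sketched like this. `ChunkNode`, `CHUNK_INLINE_TAGS`, `inline?` and `chunk_children` are hypothetical names chosen for the example, and the tag list is an assumption, not the library's.

```ruby
# Hypothetical stand-in for an HTML node.
ChunkNode = Struct.new(:name, :children)
CHUNK_INLINE_TAGS = %w[a b em i s span strong].freeze

# A node is inline iff it is a text node, or an inline tag other than <br>
# whose children are all inline as well.
def inline?(node)
  return true if node.name == "text"
  return false if node.name == "br"

  CHUNK_INLINE_TAGS.include?(node.name) && node.children.all? { |child| inline?(child) }
end

# Group consecutive children that share the same inline/block status.
def chunk_children(node)
  node.children.chunk_while { |a, b| inline?(a) == inline?(b) }.to_a
end

text = ChunkNode.new("text", [])
root = ChunkNode.new("div", [
  text,
  ChunkNode.new("i", [text]),
  ChunkNode.new("div", []),
  ChunkNode.new("p", [text]),
  ChunkNode.new("b", [text]),
])
groups = chunk_children(root).map { |group| group.map(&:name) }
```

For this tree the children fall into three groups: an inline run ("text", "i"), a block run ("div", "p") and a final inline run ("b").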

The "collapse_spaces!" method will collapse any kind of (white) space into a single space (" ") character, even across tags.
For example, if we have "  Foo \n<i> Bar </i>\t42", it will return "Foo <i>Bar </i>42".

Finally, the "remove_trailing_space!" method is there to remove any trailing space that might creep in at the end of the inline chunk.
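
A simplified sketch of these two steps, assuming the inline chunk has already been flattened into its text nodes in document order (so the tags themselves are not modeled). The method bodies below are an illustration, not the library's code.

```ruby
# `texts` holds the mutable strings of the chunk's text nodes, in order.
def collapse_spaces!(texts)
  previous_ended_with_space = true # also drops the chunk's leading space
  texts.each do |text|
    text.gsub!(/[[:space:]]+/, " ")
    # A space is redundant if the preceding text already ended with one.
    text.sub!(/\A /, "") if previous_ended_with_space
    previous_ended_with_space = text.end_with?(" ") unless text.empty?
  end
  texts
end

def remove_trailing_space!(texts)
  last = texts.reverse.find { |text| !text.empty? }
  last&.sub!(/ \z/, "")
  texts
end

# Text nodes of "  Foo \n<i> Bar </i>\t42"
texts = ["  Foo \n", " Bar ", "\t42"].map(&:dup)
collapse_spaces!(texts)
```

The three strings become "Foo ", "Bar " and "42", which reassemble into "Foo <i>Bar </i>42" as in the example above.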

This solution is not 100% bullet-proof.
It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.

FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the 'allowed_href_schemes' site setting when converting HTML <a> to Markdown
FIX: added support for 'mailto:' scheme when converting <a> from HTML to Markdown
FIX: added support for <img> dimensions when converting from HTML to Markdown
FIX: added support for <dl>, <dd> and <dt> when converting from HTML to Markdown
FIX: added support for multilines emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for <acronym> when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem

Wow, did you just read all that?! Congratz, here's a cookie: 🍪.
2020-04-30 12:21:25 +02:00
Sam Saffron
4f5ed8e781
DEV: pry-nav was holding back on pry upgrades
pry-nav is not yet supported on latest pry, this holds off on
upgrading pry, which in turn holds off on upgrading deps

Stripping pry-nav for now till it works with latest pry
2020-04-30 09:40:50 +10:00
David Taylor
6a9a7b56df
DEV: Bump Hashie and Faraday (#9583)
These were previously pinned due to a dependency in the zendesk plugin. That has now been resolved.
2020-04-29 12:55:30 +01:00
Blake Erickson
a93ef2926d
DEV: Add rswag to aid in api documentation (#9546)
Adding in rswag will allow us to write spec files to document and test
our api.
2020-04-27 16:40:07 -06:00
Jarek Radosz
07e0490fe4
DEV: Update mocha (#9490)
The spec that was blocking the update was fixed in c08753dc34.
2020-04-21 18:32:42 +02:00
Daniel Waterworth
7876ee2d67 DEV: upgrade Rails
Latest version of Rails contains compatibility fixes for Ruby 2.7 and some
minor security fixes we would like to have

It also broke some of the multisite tests.

Rails tries to use the same connection for reading from a replica as writing
to the leader during tests, because, with everything happening in a
transaction, changes to the DB wouldn't otherwise be reflected in the
replica connection.

The difference now is that Rails tries to do this for connections opened
after the test has started which affected rails multisite connections.

The upshot of this is that, as things stand, you are likely to
experience problems if you try to connect to a different multisite DB in
a test when the `current_db` is not 'default'.
2020-04-20 12:55:53 +01:00
Jarek Radosz
7ff889574d
DEV: Add rubocop-rspec (#9288)
This adds rubocop-rspec, and enables some cops that were either already passing or are passing now, after fixing them in this commit.

Some new cops are disabled for now, with annotation: "TODO" or "To be decided". Those either need to be discussed first, or require manual changes, or the number of found and fixed offenses is too large to bundle them up in a single PR.

Includes:

* DEV: Update rubocop's `TargetRubyVersion` to 2.6
* DEV: Enable RSpec/VoidExpect
* DEV: Enable RSpec/SharedContext
* DEV: Enable RSpec/EmptyExampleGroup (Removed an obsolete empty spec file)
* DEV: Enable RSpec/ItBehavesLike
* DEV: Remove RSpec/ScatteredLet (It's too strict, as it doesn't recognize fab! as a let-like)
* DEV: Remove RSpec/MultipleExpectations
2020-03-27 17:35:40 +01:00
Sam Saffron
c7151f0fd6
Revert "DEV: upgrade Rails"
This reverts commit 5b3bb4b2f0.

This erratically breaks multisite operation, we need more debugging
2020-03-24 17:11:13 +11:00
Sam Saffron
5b3bb4b2f0
DEV: upgrade Rails
Latest version of Rails contains compatibility fixes for Ruby 2.7 and some
minor security fixes we would like to have
2020-03-24 16:47:40 +11:00
Sam Saffron
9726a0e0b4
DEV: upgrade json gem and add explicit dependency
json is shipped out of sync with Ruby. Even though we use OJ for many things
we still use the json gem sometimes, this ensures we use the latest

b8b29e79ad/config/initializers/100-oj.rb (L9-L9)
2020-03-24 15:21:50 +11:00
David Taylor
e9a3639b10
DEV: Pin hashie and faraday versions for zendesk api compatibility (#9214) 2020-03-19 19:52:31 +00:00
dependabot-preview[bot]
1b2019e7eb
Build(deps): Bump rack-mini-profiler from 1.1.6 to 2.0.1 (#9222)
* Build(deps): Bump rack-mini-profiler from 1.1.6 to 2.0.1

Bumps [rack-mini-profiler](https://github.com/MiniProfiler/rack-mini-profiler) from 1.1.6 to 2.0.1.
- [Release notes](https://github.com/MiniProfiler/rack-mini-profiler/releases)
- [Changelog](https://github.com/MiniProfiler/rack-mini-profiler/blob/master/CHANGELOG.md)
- [Commits](https://github.com/MiniProfiler/rack-mini-profiler/compare/v1.1.6...v2.0.1)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

* Enable rails patches

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
Co-authored-by: OsamaSayegh <asooomaasoooma90@gmail.com>
2020-03-17 14:09:45 +03:00
OsamaSayegh
b23c2437ae DEV: Revert rack-mini-profiler version bump
New version breaks site deploys. Will investigate and fix.
2020-03-11 22:16:15 +03:00
OsamaSayegh
c928287e0c DEV: Mini Profiler shouldn't be loaded in test environment 2020-03-11 21:31:57 +03:00
dependabot-preview[bot]
a4929661af
Build(deps): Bump rack-mini-profiler from 1.1.6 to 2.0.0 (#9168)
* Build(deps): Bump rack-mini-profiler from 1.1.6 to 2.0.0

Bumps [rack-mini-profiler](https://github.com/MiniProfiler/rack-mini-profiler) from 1.1.6 to 2.0.0.
- [Release notes](https://github.com/MiniProfiler/rack-mini-profiler/releases)
- [Changelog](https://github.com/MiniProfiler/rack-mini-profiler/blob/master/CHANGELOG.md)
- [Commits](https://github.com/MiniProfiler/rack-mini-profiler/compare/v1.1.6...v2.0.0)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

* Enable rails patches

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
Co-authored-by: OsamaSayegh <asooomaasoooma90@gmail.com>
2020-03-11 20:11:12 +03:00
Robin Ward
a3f0543f99
Support for transpiling .js files (#9160)
* Remove some `.es6` from comments where it does not matter

* Use a post processor for transpilation

This will allow us to eventually use the directory structure to
transpile rather than the extension.

* FIX: Some errors and clean up in confirm-new-email

It would throw an error if the webauthn element wasn't present.
Also I changed things so that no-module is not explicitly
referenced.

* Remove `no-module`

Instead we allow a magic comment: `// discourse-skip-module` to prevent
the asset pipeline from creating a module.

* DEV: Enable babel transpilation based on directory

If it's in `app/assets/javascripts/discourse` it will be transpiled
even without the `.es6` extension.

* REFACTOR: Remove Tilt/ES6ModuleTranspiler
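
One possible reading of the two rules above, as a sketch. `transpile?`, `wrap_in_module?` and `TRANSPILED_DIR` are hypothetical names; the commit message does not show the actual asset pipeline hooks.

```ruby
# Hypothetical helpers illustrating the rules described in this commit.
TRANSPILED_DIR = "app/assets/javascripts/discourse/"

# Transpile when the file has the .es6 extension or lives in the directory.
def transpile?(path)
  path.end_with?(".es6") || path.include?(TRANSPILED_DIR)
end

# The magic comment opts a file out of being wrapped in a module.
def wrap_in_module?(source)
  !source.include?("// discourse-skip-module")
end
```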
2020-03-11 09:43:55 -04:00
Sam Saffron
59a7afbde9
DEV: flag MRI specific gems
byebug, ruby-prof, better_errors and rbtrace are very MRI specific, flag
them as such

This helps move forward on potential jruby and truffleruby experiments
2020-02-18 11:04:56 +11:00
David Taylor
5919618a87
DEV: Drop legacy OpenID 2.0 support (#8894)
This is not used in core or official plugins, and has been printing a deprecation notice since v2.3.0beta4. All OpenID 2.0 code and dependencies have been dropped. The user_open_ids table remains for now, in case anyone has missed the deprecation notice, and needs to migrate their data.

Context at https://meta.discourse.org/t/-/113249
2020-02-07 17:32:35 +00:00
Jarek Radosz
53529a3427
DEV: Upgrade Ember to version 3.12.2 (#8753)
* DEV: Use Ember 3.12.2
* Add Ember version to ThemeField's DEPENDENT_CONSTANTS
* DEV: Use `id` instead of `elementId` (See: https://github.com/emberjs/ember.js/issues/18147)
* FIX: Don't leak event listeners (bug introduced in 999e2ff)
2020-02-05 14:51:00 +01:00
Sam Saffron
eb105ba79d DEV: revert upgrade of rack to version 2.0.8
We can not upgrade rack cause it breaks Sidekiq web.

I can not find a trivial fix short of disabling sessions in Sidekiq which
is a security concern.

We need to figure out how to reuse sessions with our Rails application in
Sidekiq.

This gets extra complex cause we use a special cookie store for sessions.

9e399b42b9/lib/discourse_cookie_store.rb (L3-L21)
2020-01-13 18:07:16 +11:00
Martin Brennan
beb91e7eff
FIX: require: false for rotp gem (#8540)
The ROTP gem is only used in a very small number of places in the app, so we don't need to globally require it.

Also set the Addressable gem to not have a specific version range, as it has not been a problem yet.

Some slight refactoring of UserSecondFactor here too to use SecondFactorManager to avoid code repetition
2019-12-17 10:33:51 +10:00
Martin Brennan
edbc356593
FIX: Replace deprecated URI.encode, URI.escape, URI.unescape and URI.unencode (#8528)
The following methods have long been deprecated in ruby due to flaws in their implementation per http://blade.nagaokaut.ac.jp/cgi-bin/vframe.rb/ruby/ruby-core/29293?29179-31097:

URI.escape
URI.unescape
URI.encode
URI.unencode
escape/encode are just aliases for one another. This PR uses the Addressable gem to replace these methods with its own encode, unencode, and encode_component methods where appropriate.

I have put all references to Addressable::URI here into the UrlHelper to keep them corralled in one place to make changes to this implementation easier.

Addressable is now also an explicit gem dependency.
2019-12-12 12:49:21 +10:00
Sam Saffron
b6acfb7847 DEV: upgrade redis-namespace gem
New release has a few extra commands namespaced, nothing we use.

Also added a comment about why this is explicitly required.
2019-12-12 13:36:08 +11:00
Mark VanLandingham
06c6062ed2
DEV: Lock sassc gem at version 2.0.1 with note (#8523) 2019-12-11 06:22:39 -08:00
Sam Saffron
4de39f6596 DEV: hold back mocha upgrade
This breaks our test suite and we want to properly document this.
2019-12-11 12:43:07 +11:00
Sam Saffron
3e0454c97b DEV: add a note about sprockets being held back
We want to upgrade to version 4, but it does not work atm.
2019-12-10 12:31:16 +11:00
Sam Saffron
a06fccae1b DEV: update dependencies and add notes about exceptions
Previously it was unclear why certain gems are being held back cause Gemfile
had no comment explaining it.

I tried to add some explanation from memory and remove some exceptions that
seemed to be superfluous.

This upgrades shoulda to latest, it appears to work once a couple of assertions
are removed

Also update http accept language used to auto detect language from http header
this is tested

Zeitwerk small update seems fine
2019-12-06 13:00:28 +11:00
Arpit Jalan
cab9c7c77e Bump onebox version.
- FIX: use dedicated Vimeo onebox for all video types
2019-11-27 16:22:25 +05:30
Arpit Jalan
7543db086a Bump onebox version.
- FIX: Amazon video oneboxes were not working.
2019-11-20 14:47:59 +05:30
David Taylor
eaf6096890 DEV: Use rubocop-discourse gem to add custom chdir cop
Followup to b27e009655
2019-11-18 15:39:41 +00:00
Sam Saffron
26c0199c01 DEV: update Rails to version 6.0.1
This version of Rails eliminates a monkey patch that is no longer needed!

Additionally it preps us for Ruby 2.7 support.
2019-11-08 16:56:30 +11:00
Arpit Jalan
c5df853dea Bump onebox version.
- fix for gfycat onebox in email
2019-11-07 10:03:12 +05:30
Arpit Jalan
cb9702bf7a Bump onebox version.
- Remove native caching
- FIX: dropbox videos were not loading
2019-11-04 10:46:20 +05:30
Arpit Jalan
12409f63a0 Bump onebox version.
- FIX: Follow redirect returns url if response code is 200
- FIX: do not resize xkcd image
2019-10-22 12:26:01 +05:30
Krzysztof Kotlarek
858cf5836c
FIX: update Redis gem to version 4.1.3
I ran our benchmark on commit with hiredis and redis-4.1.3

Results:
type | hiredis | redis 4.1.3 | percent
--- | --- | --- | ---
Categories-50 | 49 | 50 | 102.04%
Categories-75 | 51 | 51 | 100.00%
Categories-90 | 63 | 64 | 101.59%
Categories-99 | 86 | 85 | 98.84%
Home-50 | 55 | 55 | 100.00%
Home-75 | 56 | 57 | 101.79%
Home-90 | 68 | 69 | 101.47%
Home-99 | 102 | 104 | 101.96%
Topic-50 | 36 | 37 | 102.78%
Topic-75 | 37 | 37 | 100.00%
Topic-90 | 47 | 48 | 102.13%
Topic-99 | 60 | 61 | 101.67%
Categories-admin-50 | 124 | 117 | 94.35%
Categories-admin-75 | 130 | 129 | 99.23%
Categories-admin-90 | 147 | 143 | 97.28%
Categories-admin-99 | 204 | 199 | 97.55%
Home-admin-50 | 146 | 148 | 101.37%
Home-admin-75 | 150 | 152 | 101.33%
Home-admin-90 | 169 | 168 | 99.41%
Home-admin-99 | 232 | 223 | 96.12%
Topic-admin-50 | 60 | 61 | 101.67%
Topic-admin-75 | 64 | 63 | 98.44%
Topic-admin-90 | 76 | 73 | 96.05%
Topic-admin-99 | 124 | 94 | 75.81%
Load rails | 2412 | 2360 | 97.84%
rss | 290204 | 295828 | 101.94%
pss | 277948 | 283624 | 102.04%

Redis gem is manipulating Redis config https://github.com/redis/redis-rb/blob/master/lib/redis/client.rb#L95
therefore we cannot pass the frozen config object.

Passing a copy of the object protects the original config
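
The problem and the fix can be reproduced in isolation. In this sketch `SketchClient` is a hypothetical class that modifies the options it receives; it stands in for the behaviour described above and is not the Redis gem's code.

```ruby
# Hypothetical client that writes into the options hash it is given.
class SketchClient
  attr_reader :options

  def initialize(options)
    options[:scheme] ||= "redis" # mutates the caller's hash
    @options = options
  end
end

CONFIG = { host: "localhost", port: 6379 }.freeze

begin
  SketchClient.new(CONFIG) # raises: the frozen hash cannot be modified
rescue FrozenError => e
  frozen_error = e
end

# Hash#dup returns an unfrozen shallow copy, so the original stays untouched.
client = SketchClient.new(CONFIG.dup)
```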
2019-10-21 09:59:24 +11:00
Sam Saffron
ae2a56999e Revert "FIX: update Redis gem to version 4.1.3 (#8197)"
This reverts commit ab74a50d85.

We really want to upgrade redis, but discovered some edge cases
around failover we need to test.

Holding off on the upgrade till a bit more testing happens
2019-10-17 11:41:46 +11:00
Krzysztof Kotlarek
ab74a50d85 FIX: update Redis gem to version 4.1.3 (#8197)
* FIX: update Redis gem to version 4.1.3

I ran our benchmark on commit with hiredis and redis-4.1.3

Results:
type | hiredis | redis 4.1.3 | percent
--- | --- | --- | ---
Categories-50 | 49 | 50 | 102.04%
Categories-75 | 51 | 51 | 100.00%
Categories-90 | 63 | 64 | 101.59%
Categories-99 | 86 | 85 | 98.84%
Home-50 | 55 | 55 | 100.00%
Home-75 | 56 | 57 | 101.79%
Home-90 | 68 | 69 | 101.47%
Home-99 | 102 | 104 | 101.96%
Topic-50 | 36 | 37 | 102.78%
Topic-75 | 37 | 37 | 100.00%
Topic-90 | 47 | 48 | 102.13%
Topic-99 | 60 | 61 | 101.67%
Categories-admin-50 | 124 | 117 | 94.35%
Categories-admin-75 | 130 | 129 | 99.23%
Categories-admin-90 | 147 | 143 | 97.28%
Categories-admin-99 | 204 | 199 | 97.55%
Home-admin-50 | 146 | 148 | 101.37%
Home-admin-75 | 150 | 152 | 101.33%
Home-admin-90 | 169 | 168 | 99.41%
Home-admin-99 | 232 | 223 | 96.12%
Topic-admin-50 | 60 | 61 | 101.67%
Topic-admin-75 | 64 | 63 | 98.44%
Topic-admin-90 | 76 | 73 | 96.05%
Topic-admin-99 | 124 | 94 | 75.81%
Load rails | 2412 | 2360 | 97.84%
rss | 290204 | 295828 | 101.94%
pss | 277948 | 283624 | 102.04%

* FIX: get rid of redis freedom patch
2019-10-17 08:49:23 +11:00
David Taylor
061c8874f5 FIX: Correct line count link in GitHub commit onebox
Bump onebox version
2019-10-15 23:52:59 +01:00
Sam Saffron
c3cc96084c FIX: remove hiredis gem which is no longer needed
Previously some local micro-benchmarks revealed it was not giving any perf
benefits.

Now that we upgraded to 2.6.5 we are seeing some segfaults.

No need to carry this dependency around anymore.

We can re-evaluate in future if it improves perf and fix the segfaults.
2019-10-15 18:17:14 +11:00
romanrizzi
9845963105 FEATURE: Use the 'ugc' rel attribute alongside 'nofollow' 2019-10-14 15:21:48 -03:00
David Taylor
939a746dcd UX: Use theme colors for GitHub issue labels
Bump onebox version to pull tag rendering bug fix
2019-10-09 12:28:48 +01:00
David Taylor
3edd514c72 FEATURE: Redesigned GitHub oneboxes
Bump onebox version, and add new styling

Commit, PR and Issue oneboxes are updated with a new design. Timestamps are now localized using local-dates (if installed).
2019-10-09 11:47:58 +01:00
David Taylor
e7cc7def8b UX: Stop using fixed-width font to render github issue description
Bump onebox version
2019-10-08 11:48:05 +01:00
David Taylor
615039f228 FEATURE: Improve GitHub commit, PR and issue onebox rendering
Bump onebox version to include new github rendering, and add relevant CSS

Avatars are reduced in size significantly, and icons are added to easily differentiate PRs and commits. The 'Issue:' prefix is removed from issue oneboxes, to make them consistent with commits and PRs.
2019-10-07 19:26:10 +01:00
Sam Saffron
8d5f47dded PERF: optimise preloading application
We preload to ensure as much memory as possible is reused from unicorn master
to various workers using copy-on-write (sidekiq, unicorn)

This migrates the preloading code into the Discourse module for easier
reuse and adds 3 notable preloading changes

1. We attempt to localize a string on each site, ensuring we warmup
the i18n

2. We preload all our templates (compiling .erb to class)

3. We warm-up our search tokenizer which uses cppjieba which is a large
memory consumer, this will only cause a warmup on CJK sites or sites with
the special site setting enabled.
2019-10-07 00:33:37 -04:00
Martin Brennan
68d35b14f4 FEATURE: Webauthn authenticator management with 2FA login (Security Keys) (#8099)
Adds 2 factor authentication method via second factor security keys over [web authn](https://developer.mozilla.org/en-US/docs/Web/API/Web_Authentication_API).

Allows a user to authenticate a second factor on login, login-via-email, admin-login, and change password routes. Adds registration area within existing user second factor preferences to register multiple security keys. Supports both external (yubikey) and built-in (macOS/android fingerprint readers).
2019-10-01 19:08:41 -07:00
Krzysztof Kotlarek
32b8a2ccff DEV: Upgrade Discourse to Rails 6 (#8083)
* Adjustments to pass specs on Rails 6.0.0
* Use classic autoloader instead of Zeitwerk
* Update Rails 6.0.0 deprecated methods
* Rails 6.0.0 not allowing column with integer name
* Drop freedom_patches/rails6.rb
* Default value for trigger_transactional_callbacks? is true
* Bump rspec-rails version to 4.0.0.beta2
2019-09-12 10:41:50 +10:00
Arpit Jalan
4195548a17 Bump onebox version.
- indicate and link to Flickr Album
2019-09-11 23:23:11 +05:30
Sam Saffron
ed00f35306 FEATURE: improve performance of anonymous cache
This commit introduces 2 features:

1. DISCOURSE_COMPRESS_ANON_CACHE (true|false, default false): this allows
you to optionally compress the anon cache body entries in Redis, can be
useful for high load sites where Redis lives on a separate server from
the web servers

2. DISCOURSE_ANON_CACHE_STORE_THRESHOLD (default 2), only pop entries into
redis if we observe them more than N times. This avoids situations where
a crawler can walk a big pile of topics and store them all in Redis never
to be used. Our default anon cache time for topics is only 60 seconds. Anon
cache is in place to avoid the "slashdot" effect where a single topic is
hit by 100s of people in one minute.
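
The interplay of the two settings can be sketched with an in-memory stand-in for Redis. `SketchAnonCache` is hypothetical; using Zlib deflate for compression and treating the threshold as inclusive (store on the Nth observation) are assumptions of this sketch.

```ruby
require "zlib"

# Hypothetical in-memory model of the anon cache; plain hashes replace Redis.
class SketchAnonCache
  def initialize(store_threshold: 2, compress: false)
    @store_threshold = store_threshold
    @compress = compress
    @seen = Hash.new(0) # how often each key was requested
    @store = {}         # cached bodies
  end

  def fetch(key)
    if (cached = @store[key])
      return cached unless @compress

      return Zlib::Inflate.inflate(cached).force_encoding(Encoding::UTF_8)
    end

    body = yield
    @seen[key] += 1
    # Assumption: the entry is stored once it has been seen `store_threshold` times.
    if @seen[key] >= @store_threshold
      @store[key] = @compress ? Zlib::Deflate.deflate(body) : body
    end
    body
  end

  def stored?(key)
    @store.key?(key)
  end
end

cache = SketchAnonCache.new(store_threshold: 2, compress: true)
renders = 0
3.times { cache.fetch("/t/1") { renders += 1; "<html>topic</html>" } }
```

With a threshold of 2 the body is rendered twice and served from the cache on the third request, so a crawler touching each topic once never fills the store.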
2019-09-04 17:18:32 +10:00
Arpit Jalan
e9c971ba77 Bump onebox version.
- allow oneboxing for `www.amazon.com.mx`
2019-08-26 16:44:10 +05:30