discourse/spec
Régis Hanol 501b19b6e0
FIX: server-side HtmlToMarkdown improvements (#9586)
TLDR; this commit vastly improves how whitespaces are handled when converting from HTML to Markdown.
It also adds support for converting HTML <tables> to markdown tables.

The previous 'remove_whitespaces!' method was traversing the whole HTML tree and used a heuristic to remove
leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements)

It was a good idea, but it was very limited and leaded to bad conversion when the html had leading whitespaces on several lines for example.
One such example can be found [here](https://meta.discourse.org/t/86782).

For various reasons, most of the whitespaces in a HTML file is ignored when the page is being displayed in a browser.
The rules that the browsers follow are the [CSS' White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules).
They can be quite complicated when you take into account RTL languages and other various tidbits but they boils down to the following:

- Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are being displaying on the same line)
- Remove any leading/trailing whitespaces inside an inline context

One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'.
We would also need to hoist <pre> elements in order to not mess with their whitespaces.
Unfortunately, this solution let some whitespaces creep around HTML tags which leads to more '.strip!' calls than I can bear.

I decided to "emulate" the browser's handling of whitespaces and came up with a solution in 4 parts

1. remove_not_allowed!

The HtmlToMarkdown library is recursively "visiting" all the nodes in the HTML in order to convert them to Markdown.
All the nodes that aren't handled by the library (eg. <script>, <style> or any non-textual HTML tags) are "swallowed".
In order to reduce the number of nodes visited, the method 'remove_not_allowed!' will automatically delete all the nodes
that have no "visitor" (eg. a 'visit_<tag>' method) defined.

2. remove_hidden!

Similar purpose as the previous method (eg. reducing number of nodes visited), there's no point trying to convert something that is hidden.
The 'remove_hidden!' method removes any nodes that was hidden using the "hidden" HTML attribute, some CSS or with a width or height equal to 0.

3. hoist_line_breaks!

The 'hoist_line_breaks!' method is there to handle <br> tags. I know those tiny <br> don't do much but they can be quite annoying.
The <br> tags are inline elements but they visually work like a block element (ie. they create a new line).
If you have the following HTML "<i>Foo<br>Bar</i>", it ends up visually similar to "<i>Foo</i><br><i>Bar</i>".
The latter being much more easy to process than the former, so that's what this method is doing.
The "hoist_line_breaks" will hoist <br> tags out of inline tags until their parent is a block element.

4. remove_whitespaces!

The "remove_whitespaces!" is where all the whitespace removal is happening. It's broken down into 4 methods as well

- remove_whitespaces!
- is_inline?
- collapse_spaces!
- remove_trailing_space!

The 'remove_whitespace!' method is recursively walking the HTML tree (skipping <pre> tags).
If a node has any children, they will be chunked into groups of inline elements vs block elements.
For each chunks of inline elements, it will call the "collapse_space!" and "remove_trailing_space!" methods.
For each chunks of block elements, it will call "remote_whitespace!" to keep walking the HTML tree recursively.

The "is_inline?" method determines whether a node is part of a inline context.
A node is inline iif it's a text node or it's an inline tag, but not <br>, and all its children are also inline.

The "collapse_spaces!" method will collapse any kind of (white) space into a single space (" ") character, even accros tags.
For example, if we have "  Foo \n<i> Bar </i>\t42", it will return "Foo <i>Bar </i>42".

Finally, the "remove_trailing_space!" method is there to remove any trailing space that might creep in at the end of the inline chunk.

This solution is not 100% bullet-proof.
It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.

FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the 'allowed_href_schemes' site setting when converting HTML <a> to Markdown
FIX: added support for 'mailto:' scheme when converting <a> from HTML to Markdown
FIX: added support for <img> dimensions when converting from HTML to Markdown
FIX: added support for <dl>, <dd> and <dt> when converting from HTML to Markdown
FIX: added support for multilines emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for <acronym> when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem

Wow, did you just read all that?! Congratz, here's a cookie: 🍪.
2020-04-30 12:21:25 +02:00
..
components FIX: server-side HtmlToMarkdown improvements (#9586) 2020-04-30 12:21:25 +02:00
fabricators DEV: Use dynamic/static fabricator attrs correctly (#9519) 2020-04-22 20:49:53 +02:00
fixtures FEATURE: Rake task to export groups (#9450) 2020-04-17 14:59:54 -07:00
helpers DEV: Fix some more flaky tests (#9384) 2020-04-08 12:46:43 +02:00
import_export FEATURE: Rake task to export groups (#9450) 2020-04-17 14:59:54 -07:00
initializers FIX: We need to skip users with associated reviewables when auto-approving (#9080) 2020-03-02 14:33:52 -05:00
integration DEV: enables and fixes multisite-spec (#9557) 2020-04-27 20:55:36 +02:00
integrity DEV: Improve flaky time-sensitive specs (#9141) 2020-03-10 22:13:17 +01:00
jobs FEATURE: Add user_profile to user_archive CSV export (#9571) 2020-04-29 12:09:50 +03:00
lib DEV: Use a tmp directory for storing uploads in tests (#9554) 2020-04-28 14:03:04 +01:00
mailers DEV: Add rubocop-rspec (#9288) 2020-03-27 17:35:40 +01:00
models FIX: error customizing text for badges from plugins 2020-04-28 14:34:41 -04:00
multisite FIX: Change secure media to encompass attachments as well (#9271) 2020-03-26 07:16:02 +10:00
requests FEATURE: Allow user creation with admin api when local logins disabled (#9587) 2020-04-30 11:39:24 +10:00
serializers FIX: Rename all instances of bookmarkWithReminder to just bookmark (#9579) 2020-04-30 10:09:22 +10:00
services FEATURE: support SSO website and location overrides 2020-04-28 16:06:35 +10:00
support DEV: stop freezing frozen strings 2020-04-30 16:48:53 +10:00
tasks FEATURE: Promote bookmarks with reminders to core functionality (#9369) 2020-04-22 13:44:19 +10:00
views/omniauth_callbacks FEATURE: Use full page redirection for all external auth methods (#8092) 2019-10-08 12:10:43 +01:00
rails_helper.rb DEV: upgrade Rails 2020-04-20 12:55:53 +01:00
swagger_helper.rb DEV: Add rswag to aid in api documention (#9546) 2020-04-27 16:40:07 -06:00