discourse/Gemfile
Régis Hanol 501b19b6e0
FIX: server-side HtmlToMarkdown improvements (#9586)
TLDR; this commit vastly improves how whitespaces are handled when converting from HTML to Markdown.
It also adds support for converting HTML <tables> to markdown tables.

The previous 'remove_whitespaces!' method was traversing the whole HTML tree and used a heuristic to remove
leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements)

It was a good idea, but it was very limited and leaded to bad conversion when the html had leading whitespaces on several lines for example.
One such example can be found [here](https://meta.discourse.org/t/86782).

For various reasons, most of the whitespaces in a HTML file is ignored when the page is being displayed in a browser.
The rules that the browsers follow are the [CSS' White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules).
They can be quite complicated when you take into account RTL languages and other various tidbits but they boils down to the following:

- Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are being displaying on the same line)
- Remove any leading/trailing whitespaces inside an inline context

One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'.
We would also need to hoist <pre> elements in order to not mess with their whitespaces.
Unfortunately, this solution let some whitespaces creep around HTML tags which leads to more '.strip!' calls than I can bear.

I decided to "emulate" the browser's handling of whitespaces and came up with a solution in 4 parts

1. remove_not_allowed!

The HtmlToMarkdown library is recursively "visiting" all the nodes in the HTML in order to convert them to Markdown.
All the nodes that aren't handled by the library (eg. <script>, <style> or any non-textual HTML tags) are "swallowed".
In order to reduce the number of nodes visited, the method 'remove_not_allowed!' will automatically delete all the nodes
that have no "visitor" (eg. a 'visit_<tag>' method) defined.

2. remove_hidden!

Similar purpose as the previous method (eg. reducing number of nodes visited), there's no point trying to convert something that is hidden.
The 'remove_hidden!' method removes any nodes that was hidden using the "hidden" HTML attribute, some CSS or with a width or height equal to 0.

3. hoist_line_breaks!

The 'hoist_line_breaks!' method is there to handle <br> tags. I know those tiny <br> don't do much but they can be quite annoying.
The <br> tags are inline elements but they visually work like a block element (ie. they create a new line).
If you have the following HTML "<i>Foo<br>Bar</i>", it ends up visually similar to "<i>Foo</i><br><i>Bar</i>".
The latter being much more easy to process than the former, so that's what this method is doing.
The "hoist_line_breaks" will hoist <br> tags out of inline tags until their parent is a block element.

4. remove_whitespaces!

The "remove_whitespaces!" is where all the whitespace removal is happening. It's broken down into 4 methods as well

- remove_whitespaces!
- is_inline?
- collapse_spaces!
- remove_trailing_space!

The 'remove_whitespace!' method is recursively walking the HTML tree (skipping <pre> tags).
If a node has any children, they will be chunked into groups of inline elements vs block elements.
For each chunks of inline elements, it will call the "collapse_space!" and "remove_trailing_space!" methods.
For each chunks of block elements, it will call "remote_whitespace!" to keep walking the HTML tree recursively.

The "is_inline?" method determines whether a node is part of a inline context.
A node is inline iif it's a text node or it's an inline tag, but not <br>, and all its children are also inline.

The "collapse_spaces!" method will collapse any kind of (white) space into a single space (" ") character, even accros tags.
For example, if we have "  Foo \n<i> Bar </i>\t42", it will return "Foo <i>Bar </i>42".

Finally, the "remove_trailing_space!" method is there to remove any trailing space that might creep in at the end of the inline chunk.

This solution is not 100% bullet-proof.
It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.

FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the 'allowed_href_schemes' site setting when converting HTML <a> to Markdown
FIX: added support for 'mailto:' scheme when converting <a> from HTML to Markdown
FIX: added support for <img> dimensions when converting from HTML to Markdown
FIX: added support for <dl>, <dd> and <dt> when converting from HTML to Markdown
FIX: added support for multilines emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for <acronym> when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem

Wow, did you just read all that?! Congratz, here's a cookie: 🍪.
2020-04-30 12:21:25 +02:00

258 lines
7.2 KiB
Ruby

# frozen_string_literal: true
source 'https://rubygems.org'
# if there is a super emergency and rubygems is playing up, try
#source 'http://production.cf.rubygems.org'
gem 'bootsnap', require: false, platform: :mri
def rails_master?
ENV["RAILS_MASTER"] == '1'
end
if rails_master?
gem 'arel', git: 'https://github.com/rails/arel.git'
gem 'rails', git: 'https://github.com/rails/rails.git'
else
# NOTE: Until rubygems gives us optional dependencies we are stuck with this needing to be explicit
# this allows us to include the bits of rails we use without pieces we do not.
#
# To issue a rails update bump the version number here
gem 'actionmailer', '6.0.2.2'
gem 'actionpack', '6.0.2.2'
gem 'actionview', '6.0.2.2'
gem 'activemodel', '6.0.2.2'
gem 'activerecord', '6.0.2.2'
gem 'activesupport', '6.0.2.2'
gem 'railties', '6.0.2.2'
gem 'sprockets-rails'
end
gem 'json'
# TODO: At the moment Discourse does not work with Sprockets 4, we would need to correct internals
# This is a desired upgrade we should get to.
gem 'sprockets', '3.7.2'
# this will eventually be added to rails,
# allows us to precompile all our templates in the unicorn master
gem 'actionview_precompiler', require: false
gem 'seed-fu'
gem 'mail', require: false
gem 'mini_mime'
gem 'mini_suffix'
gem 'redis'
# This is explicitly used by Sidekiq and is an optional dependency.
# We tell Sidekiq to use the namespace "sidekiq" which triggers this
# gem to be used. There is no explicit dependency in sidekiq cause
# redis namespace support is optional
# We already namespace stuff in DiscourseRedis, so we should consider
# just using a single implementation in core vs having 2 namespace implementations
gem 'redis-namespace'
# NOTE: AM serializer gets a lot slower with recent updates
# we used an old branch which is the fastest one out there
# are long term goal here is to fork this gem so we have a
# better maintained living fork
gem 'active_model_serializers', '~> 0.8.3'
gem 'onebox'
gem 'http_accept_language', require: false
# Ember related gems need to be pinned cause they control client side
# behavior, we will push these versions up when upgrading ember
gem 'ember-rails', '0.18.5'
gem 'discourse-ember-source', '~> 3.12.2'
gem 'ember-handlebars-template', '0.8.0'
gem 'barber'
gem 'message_bus'
gem 'rails_multisite'
gem 'fast_xs', platform: :mri
# may move to xorcist post: https://github.com/fny/xorcist/issues/4
gem 'fast_xor', platform: :mri
gem 'fastimage'
gem 'aws-sdk-s3', require: false
gem 'aws-sdk-sns', require: false
gem 'excon', require: false
gem 'unf', require: false
gem 'email_reply_trimmer'
# Forked until https://github.com/toy/image_optim/pull/162 is merged
# https://github.com/discourse/image_optim
gem 'discourse_image_optim', require: 'image_optim'
gem 'multi_json'
gem 'mustache'
gem 'nokogiri'
gem 'css_parser', require: false
gem 'omniauth'
gem 'omniauth-facebook'
gem 'omniauth-twitter'
gem 'omniauth-instagram'
gem 'omniauth-github'
gem 'omniauth-oauth2', require: false
gem 'omniauth-google-oauth2'
gem 'oj'
gem 'pg'
gem 'mini_sql'
gem 'pry-rails', require: false
gem 'r2', require: false
gem 'rake'
gem 'thor', require: false
gem 'diffy', require: false
gem 'rinku'
gem 'sidekiq'
gem 'mini_scheduler'
gem 'execjs', require: false
gem 'mini_racer'
# TODO: determine why highline is being held back and upgrade to latest
gem 'highline', '~> 1.7.0', require: false
# TODO: Upgrading breaks Sidekiq Web
# This is a bit of a hornets nest cause in an ideal world we much prefer
# if Sidekiq reused session and CSRF mitigation with Discourse on the
# _forum_session cookie instead of a rack.session cookie
gem 'rack', '2.0.8'
gem 'rack-protection' # security
gem 'cbor', require: false
gem 'cose', require: false
gem 'addressable'
# Gems used only for assets and not required in production environments by default.
# Allow everywhere for now cause we are allowing asset debugging in production
group :assets do
gem 'uglifier'
gem 'rtlit', require: false # for css rtling
end
group :test do
gem 'webmock', require: false
gem 'fakeweb', require: false
gem 'minitest', require: false
gem 'simplecov', require: false
gem "test-prof"
end
group :test, :development do
gem 'rspec'
gem 'mock_redis'
gem 'listen', require: false
gem 'certified', require: false
gem 'fabrication', require: false
gem 'mocha', require: false
gem 'rb-fsevent', require: RUBY_PLATFORM =~ /darwin/i ? 'rb-fsevent' : false
# TODO determine if we can update this to 0.10, API changes happened
# we would like to upgrade it if possible
gem 'rb-inotify', '~> 0.9', require: RUBY_PLATFORM =~ /linux/i ? 'rb-inotify' : false
# TODO once 4.0.0 is released upgrade to it, at time of writing 3.9.0 is latest
gem 'rspec-rails', '4.0.0.beta2', require: false
gem 'shoulda-matchers', require: false
gem 'rspec-html-matchers'
gem 'byebug', require: ENV['RM_INFO'].nil?, platform: :mri
gem 'rubocop', require: false
gem "rubocop-discourse", require: false
gem "rubocop-rspec", require: false
gem 'parallel_tests'
gem 'rswag-specs'
end
group :development do
gem 'ruby-prof', require: false, platform: :mri
gem 'bullet', require: !!ENV['BULLET']
gem 'better_errors', platform: :mri
gem 'binding_of_caller'
gem 'yaml-lint'
gem 'annotate'
end
# this is an optional gem, it provides a high performance replacement
# to String#blank? a method that is called quite frequently in current
# ActiveRecord, this may change in the future
gem 'fast_blank', platform: :mri
# this provides a very efficient lru cache
gem 'lru_redux'
gem 'htmlentities', require: false
# IMPORTANT: mini profiler monkey patches, so it better be required last
# If you want to amend mini profiler to do the monkey patches in the railties
# we are open to it. by deferring require to the initializer we can configure discourse installs without it
gem 'flamegraph', require: false
gem 'rack-mini-profiler', require: ['enable_rails_patches']
gem 'unicorn', require: false, platform: :mri
gem 'puma', require: false
gem 'rbtrace', require: false, platform: :mri
gem 'gc_tracer', require: false, platform: :mri
# required for feed importing and embedding
gem 'ruby-readability', require: false
gem 'stackprof', require: false, platform: :mri
gem 'memory_profiler', require: false, platform: :mri
gem 'cppjieba_rb', require: false
gem 'lograge', require: false
gem 'logstash-event', require: false
gem 'logstash-logger', require: false
gem 'logster'
# NOTE: later versions of sassc are causing a segfault, possibly dependent on processer architecture
# and until resolved should be locked at 2.0.1
gem 'sassc', '2.0.1', require: false
gem "sassc-rails"
gem 'rotp', require: false
gem 'rqrcode'
gem 'rubyzip', require: false
gem 'sshkey', require: false
gem 'rchardet', require: false
gem 'lz4-ruby', require: false, platform: :mri
if ENV["IMPORT"] == "1"
gem 'mysql2'
gem 'redcarpet'
# NOTE: in import mode the version of sqlite can matter a lot, so we stick it to a specific one
gem 'sqlite3', '~> 1.3', '>= 1.3.13'
gem 'ruby-bbcode-to-md', git: 'https://github.com/nlalonde/ruby-bbcode-to-md'
gem 'reverse_markdown'
gem 'tiny_tds'
gem 'csv'
end
gem 'webpush', require: false
gem 'colored2', require: false
gem 'maxminddb'