mirror of
https://github.com/discourse/discourse.git
synced 2024-11-22 11:44:49 +08:00
501b19b6e0
TLDR; this commit vastly improves how whitespace is handled when converting from HTML to Markdown. It also adds support for converting HTML <table> elements to Markdown tables.

The previous 'remove_whitespaces!' method traversed the whole HTML tree and used a heuristic to remove leading and trailing whitespace wherever it seemed appropriate (i.e. mostly before and after HTML block elements). It was a good idea, but it was very limited and led to bad conversions when, for example, the HTML had leading whitespace on several lines. One such example can be found [here](https://meta.discourse.org/t/86782).

For various reasons, most of the whitespace in an HTML file is ignored when the page is displayed in a browser. The rules that browsers follow are the [CSS White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules). They can get quite complicated once you take into account RTL languages and other tidbits, but they boil down to the following:

- Collapse whitespace down to one space (0x20) inside an inline context (i.e. nodes/tags that are displayed on the same line)
- Remove any leading/trailing whitespace inside an inline context

One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'. We would also need to hoist <pre> elements in order to not mess with their whitespace. Unfortunately, this solution lets some whitespace creep around HTML tags, which leads to more '.strip!' calls than I can bear.

I decided to "emulate" the browser's handling of whitespace and came up with a solution in 4 parts:

1. remove_not_allowed!

The HtmlToMarkdown library recursively "visits" all the nodes in the HTML in order to convert them to Markdown. All the nodes that aren't handled by the library (eg. <script>, <style> or any non-textual HTML tags) are "swallowed". In order to reduce the number of nodes visited, the 'remove_not_allowed!' method automatically deletes all the nodes that have no "visitor" (i.e. a 'visit_<tag>' method) defined.

2. remove_hidden!

This serves a similar purpose to the previous method (i.e. reducing the number of nodes visited): there's no point trying to convert something that is hidden. The 'remove_hidden!' method removes any node that was hidden using the "hidden" HTML attribute, some CSS, or a width or height equal to 0.

3. hoist_line_breaks!

The 'hoist_line_breaks!' method is there to handle <br> tags. I know those tiny <br> don't do much, but they can be quite annoying. <br> tags are inline elements, but visually they work like block elements (i.e. they create a new line). If you have the following HTML "<i>Foo<br>Bar</i>", it ends up visually similar to "<i>Foo</i><br><i>Bar</i>". The latter is much easier to process than the former, so that's what this method produces. 'hoist_line_breaks!' hoists <br> tags out of inline tags until their parent is a block element.

4. remove_whitespaces!

'remove_whitespaces!' is where all the whitespace removal happens. It's broken down into 4 methods as well:

- remove_whitespaces!
- is_inline?
- collapse_spaces!
- remove_trailing_space!

The 'remove_whitespaces!' method recursively walks the HTML tree (skipping <pre> tags). If a node has any children, they are chunked into groups of inline elements vs block elements. For each chunk of inline elements, it calls the 'collapse_spaces!' and 'remove_trailing_space!' methods. For each chunk of block elements, it calls 'remove_whitespaces!' to keep walking the HTML tree recursively.

The 'is_inline?' method determines whether a node is part of an inline context. A node is inline if and only if it's a text node or an inline tag (but not <br>) and all its children are also inline.

The 'collapse_spaces!' method collapses any kind of (white) space into a single space (" ") character, even across tags. For example, if we have " Foo \n<i> Bar </i>\t42", it will return "Foo <i>Bar </i>42".

Finally, the 'remove_trailing_space!' method is there to remove any trailing space that might creep in at the end of the inline chunk.

This solution is not 100% bullet-proof. It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.

FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the 'allowed_href_schemes' site setting when converting HTML <a> to Markdown
FIX: added support for 'mailto:' scheme when converting <a> from HTML to Markdown
FIX: added support for <img> dimensions when converting from HTML to Markdown
FIX: added support for <dl>, <dd> and <dt> when converting from HTML to Markdown
FIX: added support for multiline emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for <acronym> when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem

Wow, did you just read all that?! Congrats, here's a cookie: 🍪.
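To make the `<br>` hoisting and the cross-tag space collapsing described above more concrete, here is a minimal sketch in plain Ruby. It is not the code from this commit: the real methods operate on a parsed HTML tree, whereas this sketch works on a flat string of inline HTML, and the names `INLINE_TAGS`, `tokenize`, `hoist_line_breaks_sketch` and `collapse_spaces_sketch` are hypothetical, introduced only for illustration.

```ruby
# Sketch only: works on a flat inline HTML string instead of a parsed tree.
# INLINE_TAGS and all method names below are hypothetical.
INLINE_TAGS = %w[a b em i s span strong u].freeze

# Splits a fragment into tags ("<i>", "</i>") and the text runs between them.
def tokenize(fragment)
  fragment.scan(/<[^>]+>|[^<]+/)
end

# Moves every <br> out of the inline tags that enclose it by closing those
# tags before the <br> and reopening them right after it.
def hoist_line_breaks_sketch(fragment)
  open_tags = [] # stack of [tag_name, raw_opening_tag] for open inline tags
  out = +""

  tokenize(fragment).each do |token|
    case token
    when /\A<br\s*\/?>\z/i
      open_tags.reverse_each { |name, _raw| out << "</#{name}>" }
      out << "<br>"
      open_tags.each { |_name, raw| out << raw }
    when %r{\A</(\w+)>\z}
      name = Regexp.last_match(1).downcase
      open_tags.pop if open_tags.last && open_tags.last[0] == name
      out << token
    when /\A<(\w+)[^>]*>\z/
      name = Regexp.last_match(1).downcase
      open_tags.push([name, token]) if INLINE_TAGS.include?(name)
      out << token
    else
      out << token # plain text
    end
  end

  out
end

# Collapses runs of whitespace into one space, even across tags, and drops
# whitespace at the very start of the fragment.
def collapse_spaces_sketch(fragment)
  last_was_space = true # true at the start so leading whitespace is removed

  tokenize(fragment).map do |token|
    next token if token.start_with?("<") # tags pass through untouched

    text = token.gsub(/[[:space:]]+/, " ")
    text = text.sub(/\A /, "") if last_was_space
    last_was_space = text.end_with?(" ") unless text.empty?
    text
  end.join
end

puts hoist_line_breaks_sketch("<i>Foo<br>Bar</i>")     # <i>Foo</i><br><i>Bar</i>
puts collapse_spaces_sketch(" Foo \n<i> Bar </i>\t42") # Foo <i>Bar </i>42
```

Both calls reproduce the examples given in the commit message. The sketch leaves out the block/inline chunking, the `<pre>` skipping and the trailing-space removal, all of which need a real tree to do properly.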
258 lines
7.2 KiB
Ruby
# frozen_string_literal: true

source 'https://rubygems.org'
# if there is a super emergency and rubygems is playing up, try
#source 'http://production.cf.rubygems.org'

gem 'bootsnap', require: false, platform: :mri

def rails_master?
  ENV["RAILS_MASTER"] == '1'
end

if rails_master?
  gem 'arel', git: 'https://github.com/rails/arel.git'
  gem 'rails', git: 'https://github.com/rails/rails.git'
else
  # NOTE: Until rubygems gives us optional dependencies we are stuck with this needing to be explicit
  # this allows us to include the bits of rails we use without pieces we do not.
  #
  # To issue a rails update bump the version number here
  gem 'actionmailer', '6.0.2.2'
  gem 'actionpack', '6.0.2.2'
  gem 'actionview', '6.0.2.2'
  gem 'activemodel', '6.0.2.2'
  gem 'activerecord', '6.0.2.2'
  gem 'activesupport', '6.0.2.2'
  gem 'railties', '6.0.2.2'
  gem 'sprockets-rails'
end

gem 'json'

# TODO: At the moment Discourse does not work with Sprockets 4, we would need to correct internals
# This is a desired upgrade we should get to.
gem 'sprockets', '3.7.2'

# this will eventually be added to rails,
# allows us to precompile all our templates in the unicorn master
gem 'actionview_precompiler', require: false

gem 'seed-fu'

gem 'mail', require: false
gem 'mini_mime'
gem 'mini_suffix'

gem 'redis'

# This is explicitly used by Sidekiq and is an optional dependency.
# We tell Sidekiq to use the namespace "sidekiq" which triggers this
# gem to be used. There is no explicit dependency in sidekiq cause
# redis namespace support is optional
# We already namespace stuff in DiscourseRedis, so we should consider
# just using a single implementation in core vs having 2 namespace implementations
gem 'redis-namespace'

# NOTE: AM serializer gets a lot slower with recent updates
# we used an old branch which is the fastest one out there
# our long term goal here is to fork this gem so we have a
# better maintained living fork
gem 'active_model_serializers', '~> 0.8.3'

gem 'onebox'

gem 'http_accept_language', require: false

# Ember related gems need to be pinned cause they control client side
# behavior, we will push these versions up when upgrading ember
gem 'ember-rails', '0.18.5'
gem 'discourse-ember-source', '~> 3.12.2'
gem 'ember-handlebars-template', '0.8.0'

gem 'barber'

gem 'message_bus'

gem 'rails_multisite'

gem 'fast_xs', platform: :mri

# may move to xorcist post: https://github.com/fny/xorcist/issues/4
gem 'fast_xor', platform: :mri

gem 'fastimage'

gem 'aws-sdk-s3', require: false
gem 'aws-sdk-sns', require: false
gem 'excon', require: false
gem 'unf', require: false

gem 'email_reply_trimmer'

# Forked until https://github.com/toy/image_optim/pull/162 is merged
# https://github.com/discourse/image_optim
gem 'discourse_image_optim', require: 'image_optim'
gem 'multi_json'
gem 'mustache'
gem 'nokogiri'
gem 'css_parser', require: false

gem 'omniauth'
gem 'omniauth-facebook'
gem 'omniauth-twitter'
gem 'omniauth-instagram'
gem 'omniauth-github'

gem 'omniauth-oauth2', require: false

gem 'omniauth-google-oauth2'

gem 'oj'
gem 'pg'
gem 'mini_sql'
gem 'pry-rails', require: false
gem 'r2', require: false
gem 'rake'

gem 'thor', require: false
gem 'diffy', require: false
gem 'rinku'
gem 'sidekiq'
gem 'mini_scheduler'

gem 'execjs', require: false
gem 'mini_racer'

# TODO: determine why highline is being held back and upgrade to latest
gem 'highline', '~> 1.7.0', require: false

# TODO: Upgrading breaks Sidekiq Web
# This is a bit of a hornets nest cause in an ideal world we much prefer
# if Sidekiq reused session and CSRF mitigation with Discourse on the
# _forum_session cookie instead of a rack.session cookie
gem 'rack', '2.0.8'

gem 'rack-protection' # security
gem 'cbor', require: false
gem 'cose', require: false
gem 'addressable'

# Gems used only for assets and not required in production environments by default.
# Allow everywhere for now cause we are allowing asset debugging in production
group :assets do
  gem 'uglifier'
  gem 'rtlit', require: false # for css rtling
end

group :test do
  gem 'webmock', require: false
  gem 'fakeweb', require: false
  gem 'minitest', require: false
  gem 'simplecov', require: false
  gem "test-prof"
end

group :test, :development do
  gem 'rspec'
  gem 'mock_redis'
  gem 'listen', require: false
  gem 'certified', require: false
  gem 'fabrication', require: false
  gem 'mocha', require: false

  gem 'rb-fsevent', require: RUBY_PLATFORM =~ /darwin/i ? 'rb-fsevent' : false

  # TODO determine if we can update this to 0.10, API changes happened
  # we would like to upgrade it if possible
  gem 'rb-inotify', '~> 0.9', require: RUBY_PLATFORM =~ /linux/i ? 'rb-inotify' : false

  # TODO once 4.0.0 is released upgrade to it, at time of writing 3.9.0 is latest
  gem 'rspec-rails', '4.0.0.beta2', require: false

  gem 'shoulda-matchers', require: false
  gem 'rspec-html-matchers'
  gem 'byebug', require: ENV['RM_INFO'].nil?, platform: :mri
  gem 'rubocop', require: false
  gem "rubocop-discourse", require: false
  gem "rubocop-rspec", require: false
  gem 'parallel_tests'

  gem 'rswag-specs'
end

group :development do
  gem 'ruby-prof', require: false, platform: :mri
  gem 'bullet', require: !!ENV['BULLET']
  gem 'better_errors', platform: :mri
  gem 'binding_of_caller'
  gem 'yaml-lint'
  gem 'annotate'
end

# this is an optional gem, it provides a high performance replacement
# to String#blank? a method that is called quite frequently in current
# ActiveRecord, this may change in the future
gem 'fast_blank', platform: :mri

# this provides a very efficient lru cache
gem 'lru_redux'

gem 'htmlentities', require: false

# IMPORTANT: mini profiler monkey patches, so it better be required last
# If you want to amend mini profiler to do the monkey patches in the railties
# we are open to it. by deferring require to the initializer we can configure discourse installs without it

gem 'flamegraph', require: false
gem 'rack-mini-profiler', require: ['enable_rails_patches']

gem 'unicorn', require: false, platform: :mri
gem 'puma', require: false
gem 'rbtrace', require: false, platform: :mri
gem 'gc_tracer', require: false, platform: :mri

# required for feed importing and embedding
gem 'ruby-readability', require: false

gem 'stackprof', require: false, platform: :mri
gem 'memory_profiler', require: false, platform: :mri

gem 'cppjieba_rb', require: false

gem 'lograge', require: false
gem 'logstash-event', require: false
gem 'logstash-logger', require: false
gem 'logster'

# NOTE: later versions of sassc are causing a segfault, possibly dependent on processor architecture
# and until resolved should be locked at 2.0.1
gem 'sassc', '2.0.1', require: false
gem "sassc-rails"

gem 'rotp', require: false
gem 'rqrcode'

gem 'rubyzip', require: false

gem 'sshkey', require: false

gem 'rchardet', require: false
gem 'lz4-ruby', require: false, platform: :mri

if ENV["IMPORT"] == "1"
  gem 'mysql2'
  gem 'redcarpet'

  # NOTE: in import mode the version of sqlite can matter a lot, so we stick it to a specific one
  gem 'sqlite3', '~> 1.3', '>= 1.3.13'
  gem 'ruby-bbcode-to-md', git: 'https://github.com/nlalonde/ruby-bbcode-to-md'
  gem 'reverse_markdown'
  gem 'tiny_tds'
  gem 'csv'
end

gem 'webpush', require: false
gem 'colored2', require: false
gem 'maxminddb'