Commit Graph

31 Commits

Author SHA1 Message Date
Gary Pendergast
ec30b6f6c6
FIX: Inline oneboxes should obey the locale. (#30664)
Following on from f369db5ae9, we need to apply a similar fix to inline oneboxes, since they use a different code path to retrieve the onebox provider data.

This change ensures the Accept-Language header is sent by inline onebox requests, too.
2025-01-09 17:22:22 +11:00
Jarek Radosz
affe26f0dd
DEV: Update nokogiri to 1.18.1 (#30554)
Nokogiri/libxml is now more strict in terms of params it receives.

It uses kwargs vs options object (I fixed an issue there in #30545) doesn't accept nil/blank html (fixed here) and most importantly handles encoding in a different way. It seems to require explicitly specifying UTF8.

* Build(deps): Bump nokogiri from 1.16.8 to 1.18.1

Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.16.8 to 1.18.1.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.16.8...v1.18.1)

---
updated-dependencies:
- dependency-name: nokogiri
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-07 12:05:39 +01:00
Jarek Radosz
74011232e9
FIX: Request html when fetching inline onebox data (#24674)
We do expect to receive html
2023-12-04 11:36:42 +10:00
Ted Johansson
54e813e964
FIX: Don't error out when trying to retrieve title and URL won't encode (#24660) 2023-12-01 15:03:06 +08:00
Ted Johansson
59867cc091
DEV: Gracefully handle user avatar download SSRF errors (#21523)
### Background

When SSRF detection fails, the exception bubbles all the way up, causing a log alert. This isn't actionable, and should instead be ignored. The existing `rescue` does already ignore network errors, but fails to account for SSRF exceptions coming from `FinalDestination`.

### What is this change?

This PR does two things.

---

Firstly, it introduces a common root exception class, `FinalDestination::SSRFError` for SSRF errors. This serves two functions: 1) it makes it easier to rescue both errors at once, which is generally what one wants to do and 2) prevents having to dig deep into the class hierarchy for the constant.

This change is fully backwards compatible thanks to how inheritance and exception handling works.

---

Secondly, it rescues this new exception in `UserAvatar.import_url_for_user`, which is causing sporadic errors to be logged in production. After this SSRF errors are handled the same as network errors.
2023-05-12 15:32:02 +08:00
Daniel Waterworth
666536cbd1
DEV: Prefer \A and \z over ^ and $ in regexes (#19936) 2023-01-20 12:52:49 -06:00
David Taylor
6417173082
DEV: Apply syntax_tree formatting to lib/* 2023-01-09 12:10:19 +00:00
Ted Johansson
06db264f24
FIX: Gracefully handle DNS issued from SSRF lookup when inline oneboxing (#19631)
There is an issue where chat message processing breaks due to
unhandles `SocketError` exceptions originating in the SSRF check,
specifically in `FinalDestination::Resolver`.

This change gives `FinalDestination::SSRFDetector` a new error class
to wrap the `SocketError` in, and haves the `RetrieveTitle` class
handle that error gracefully.
2022-12-28 10:30:20 +08:00
Sam
60685e6984
DEV: improve comment (#18041)
improves comment about caught exception to match implementation
2022-08-23 15:14:24 +10:00
Sam
df04462475
FIX: ignore malformed HTML for title extraction (#18040)
Certain HTML can be rejected by nokogumbo, specifically cases where there
are enormous amounts of attributes

This ensures that malformed HTML is simply skipped instead of leaking out
an exception and terminating downstream processes.
2022-08-23 15:03:57 +10:00
Sérgio Saquetim
300f835703
DEV: Supress logs when RetrieveTitle.crawl fails with Net::ReadTimeout errors (#16971)
This PR changes the rescue block to rescue only Net::TimeoutError exceptions and removes the log line to prevent clutter the logs with errors that are ignored. Other errors can bubble up because they're errors we probably want to know about
2022-06-09 16:30:22 -03:00
Osama Sayegh
d15867463f
FEATURE: Site setting for blocking onebox of URLs that redirect (#16881)
Meta topic: https://meta.discourse.org/t/prevent-to-linkify-when-there-is-a-redirect/226964/2?u=osama.

This commit adds a new site setting `block_onebox_on_redirect` (default off) for blocking oneboxes (full and inline) of URLs that redirect. Note that an initial http → https redirect is still allowed if the redirect location is identical to the source (minus the scheme of course). For example, if a user includes a link to `http://example.com/page` and the link resolves to `https://example.com/page`, then the link will onebox (assuming it can be oneboxed) even if the setting is enabled. The reason for this is a user may type out a URL (i.e. the URL is short and memorizable) with http and since a lot of sites support TLS with http traffic automatically redirected to https, so we should still allow the URL to onebox.
2022-05-23 13:52:06 +03:00
Dan Ungureanu
8e9cbe9db4
FIX: Do not raise if title cannot be crawled (#16247)
If the crawled page returned an error, `FinalDestination#safe_get`
yielded `nil` for `uri` and `chunk` arguments. Another problem is that
`get` did not handle the case when `safe_get` failed and did not return
the `location` and `set_cookie` headers.
2022-03-22 20:13:27 +02:00
Osama Sayegh
ddde94f925
FIX: return nil when RetrieveTitle.crawl fails (#16167) 2022-03-11 23:53:10 +03:00
Osama Sayegh
b0656f3ed0
FIX: Apply onebox blocked domain checks on every redirect (#16150)
The `blocked onebox domains` setting lets site owners change what sites
are allowed to be oneboxed. When a link is entered into a post,
Discourse checks the domain of the link against that setting and blocks
the onebox if the domain is blocked. But if there's a chain of
redirects, then only the final destination website is checked against
the site setting.

This commit amends that behavior so that every website in the redirect
chain is checked against the site setting, and if anything is blocked
the original link doesn't onebox at all in the post. The
`Discourse-No-Onebox` header is also checked in every response and the
onebox is blocked if the header is set to "1".

Additionally, Discourse will now include the `Discourse-No-Onebox`
header with every response if the site requires login to access content.
This is done to signal to a Discourse instance that it shouldn't attempt
to onebox other Discourse instances if they're login-only. Non-Discourse
websites can also use include that header if they don't wish to have
Discourse onebox their content.

Internal ticket: t59305.
2022-03-11 09:18:12 +03:00
Krzysztof Kotlarek
803fd7289d
FIX: inline onebox for github (#15859)
Increase size of downloaded HTML for Github when getting title for
inline Onebox.
2022-02-09 22:53:27 +01:00
Natalie Tay
aac9f43038
Only block domains at the final destination (#15689)
In an earlier PR, we decided that we only want to block a domain if 
the blocked domain in the SiteSetting is the final destination (/t/59305). That 
PR used `FinalDestination#get`. `resolve` however is used several places
 but blocks domains along the redirect chain when certain options are provided.

This commit changes the default options for `resolve` to not do that. Existing
users of `FinalDestination#resolve` are
- `Oneboxer#external_onebox`
- our onebox helper `fetch_html_doc`, which is used in amazon, standard embed 
and youtube
  - these folks already go through `Oneboxer#external_onebox` which already
  blocks correctly
2022-01-31 15:35:12 +08:00
Natalie Tay
f5ea00c73f
FIX: Respect blocked domains list when redirecting (#15656)
Our previous implementation used a simple `blocked_domain_array.include?(hostname)`
so some values were not matching. Additionally, in some configurations like ours, we'd used
"cat.*.dog.com" with the assumption we'd support globbing.

This change implicitly allows globbing by blocking "http://a.b.com" if "b.com" is a blocked 
domain but does not actively do anything for "*".

An upcoming change might include frontend validation for values that can be inserted.
2022-01-20 14:12:34 +08:00
Arpit Jalan
763f48abc7
FIX: increase chunk size to fetch title tag correctly (#14144) 2021-09-03 13:15:58 +05:30
Arpit Jalan
0adeddde61
FIX: follow redirects for inline/mini onebox (#13512) 2021-06-24 19:53:39 +05:30
Osama Sayegh
558e9dd310
FIX: Inline Onebox should use encoding from Content-Type header when present (#11625)
* FIX: Inline onebox should use encoding from Content-Type header when present

* Use Regexp.last_match(1)

Signed-off-by: OsamaSayegh <asooomaasoooma90@gmail.com>
2021-01-04 22:32:08 +03:00
Krzysztof Kotlarek
9bff0882c3
FEATURE: Nokogumbo (#9577)
* FEATURE: Nokogumbo

Use Nokogumbo HTML parser.
2020-05-05 13:46:57 +10:00
Penar Musaraj
067696df8f DEV: Apply Rubocop redundant return style 2019-11-14 15:10:51 -05:00
Krzysztof Kotlarek
427d54b2b0 DEV: Upgrading Discourse to Zeitwerk (#8098)
Zeitwerk simplifies working with dependencies in dev and makes it easier reloading class chains. 

We no longer need to use Rails "require_dependency" anywhere and instead can just use standard 
Ruby patterns to require files.

This is a far reaching change and we expect some followups here.
2019-10-02 14:01:53 +10:00
Sam Saffron
30990006a9 DEV: enable frozen string literal on all files
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.

Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
2019-05-13 09:31:32 +08:00
Guo Xiang Tan
ad5082d969 Make rubocop happy again. 2018-06-07 13:28:18 +08:00
Sam
f946db4afe FIX: inline oneboxer min title length of 2
also: cache mini onebox misses as well to cut down traffic
2018-01-30 08:40:04 +11:00
Sam
fa5880e04f PERF: ability to crawl for titles without extra HEAD req
Also, introduces a much more aggressive timeout for title crawling
and introduces gzip to body that is crawled
2018-01-29 15:40:12 +11:00
Robin Ward
07e84a3afa FIX: Hack our title retriever so that it parses YouTube URLs 2017-09-28 09:30:22 -04:00
Sam
f6bc572fb8 FEATURE: option to enable inline oneboxes for all domains
Also, change to prefer title over open graph which is often way too sparse
2017-08-02 14:27:31 -04:00
Robin Ward
2f8f2aa1dd FEATURE: Whitelists for inline oneboxing 2017-07-21 15:41:47 -04:00