Commit Graph

65 Commits

Author SHA1 Message Date
Roman Rizzi
53abcd825d
FIX: Canonical URLs may be relative (#14825)
FinalDestination's follow_canonical mode used for embedded topics should work when canonical URLs are relative, as specified in [RFC 6596](https://datatracker.ietf.org/doc/html/rfc6596)
2021-11-05 14:20:14 -03:00
jbrw
978a005a42
FIX: resolve responses of 103 should be retried using small_get (#14773)
If the initial `get`/`head` response within `resolve` returns a status code of `103`, attempt to fetch the same URL with the alternative `small_get` method.
2021-10-29 14:51:56 -04:00
Roman Rizzi
4c2d5158c5
FIX: Follow the canonical URL when importing a remote topic. (#14489)
FinalDestination now supports the `follow_canonical` option, which will perform an initial GET request, parse the canonical link if present, and perform a HEAD request to it.

We use this mode during embeds to avoid treating URLs with different query parameters as different topics.
2021-10-01 12:48:21 -03:00
jbrw
2f28ba318c
FEATURE: Onebox can match engines based on the content_type (#13876)
* FEATURE: Onebox can match engines based on the content_type

`FinalDestination` now returns the `content_type` of a resolved URL.

`Oneboxer` passes this value to `Onebox` itself. Onebox engines can now specify a `matches_content_type` regex of content_types that the engine can handle, regardless of the URL.

`ImageOnebox` will match URLs with a content type of `image/png`, `jpg`, `gif`, `bmp`, `tif`, etc.

This will allow images that exist at a URL without a file type extension to be correctly rendered, assuming a valid `content_type` is returned.
2021-07-30 13:36:30 -04:00
jbrw
09d23a37a5
DEV: Add default Accept-Language to FinalDestination requests (#13817)
Not specifying an `Accept-Language` should be equivalent to specifying an `Accept-Language` of `*`, however some webservers seem to prefer it if we are explicit about being able to handle a response of content in any language.
2021-07-22 15:49:59 +10:00
Joffrey JAFFEUX
e50b7e9111
SECURITY: ensures timeouts are correctly used on connect (#13455) 2021-06-21 17:34:01 +02:00
jbrw
19182b1386
DEV: Oneboxer wildcard subdomains (#13015)
* DEV: Allow wildcards in Oneboxer optional domain Site Settings

Allows a wildcard to be used as a subdomain on Oneboxer-related SiteSettings, e.g.:

- `force_get_hosts`
- `cache_onebox_response_body_domains`
- `force_custom_user_agent_hosts`

* DEV: fix typos

* FIX: Try doing a GET after receiving a 500 error from a HEAD

By default we try to do a `HEAD` requests. If this results in a 500 error response, we should try to do a `GET`

* DEV: `force_get_hosts` should be a hidden setting

* DEV: Oneboxer Strategies

Have an alternative oneboxing ‘strategy’ (i.e., set of options) to use when an attempt to generate a Onebox fails. Keep track of any non-default strategies that were used on a particular host, and use that strategy for that host in the future.

Initially, the alternate strategy (`force_get_and_ua`) forces the FinalDestination step of Oneboxing to do a `GET` rather than `HEAD`, and forces a custom user agent.

* DEV: change stubbed return code

The stubbed status code needs to be a value not recognized by FinalDestination
2021-05-13 15:48:35 -04:00
Joffrey JAFFEUX
64dda7112d
FIX: correctly use timeouts in FileHelper and FinalDestination (#12921)
Previous refactors have lost usage of read_timeout in `FileHelper.download` and `FinalDestination` was incorrectly using `Net::HTTP.start` by setting `open_timeout` in the block instead of directly during the invocation.

Couldn't figure how to write a good test for this without slowing the spec.
2021-05-03 09:21:11 +02:00
jbrw
68d0916eb5
FEATURE: Oneboxer cache response body (#12562)
* FEATURE: Cache successful HTTP GET requests during Oneboxing

Some oneboxes may fail if when making excessive and/or odd requests against the target domains. This change provides a simple mechanism to cache the results of succesful GET requests as part of the oneboxing process, with the goal of reducing repeated requests and ultimately improving the rate of successful oneboxing.

To enable:

Set `SiteSetting.cache_onebox_response_body` to `true`

Add the domains you’re interesting in caching to `SiteSetting. cache_onebox_response_body_domains` e.g. `example.com|example.org|example.net`

Optionally set `SiteSetting.cache_onebox_user_agent` to a user agent string of your choice to use when making requests against domains in the above list.

* FIX: Swap order of duration and value in redis call

The correct order for `setex` arguments is `key`, `duration`, and `value`.

Duration and value had been flipped, however the code would not have thrown an error because we were caching the value of `1.day.to_i` for a period of 1 seconds… The intention appears to be to set a value of 1 (purely as a flag) for a period of 1 day.
2021-03-31 13:19:34 -04:00
jbrw
331236d6d7
Onebox improved error handling and support for Instagram Access Tokens (#11253)
* FEATURE: display error if Oneboxing fails due to HTTP error

- display warning if onebox URL is unresolvable
- display warning if attributes are missing

* FEATURE: Use new Instagram oEmbed endpoint if access token is configured

Instagram requires an Access Token to access their oEmbed endpoint. The requirements (from https://developers.facebook.com/docs/instagram/oembed/) are as follows:

- a Facebook Developer account, which you can create at developers.facebook.com
- a registered Facebook app
- the oEmbed Product added to the app
- an Access Token
- The Facebook app must be in Live Mode

The generated Access Token, once added to SiteSetting.facebook_app_access_token, will be passed to onebox. Onebox can then use this token to access the oEmbed endpoint to generate a onebox for Instagram.

* DEV: update user agent string

* DEV: don’t do HEAD requests against news.yahoo.com

* DEV: Bump onebox version from 2.1.5 to 2.1.6

* DEV: Avoid re-reading templates

* DEV: Tweaks to onebox mustache templates

* DEV: simplified error message for missing onebox data

* Apply suggestions from code review
Co-authored-by: Gerhard Schlager <mail@gerhard-schlager.at>
2020-11-18 12:55:16 -05:00
Krzysztof Kotlarek
e0d9232259
FIX: use allowlist and blocklist terminology (#10209)
This is a PR of the renaming whitelist to allowlist and blacklist to the blocklist.
2020-07-27 10:23:54 +10:00
Martin Brennan
edbc356593
FIX: Replace deprecated URI.encode, URI.escape, URI.unescape and URI.unencode (#8528)
The following methods have long been deprecated in ruby due to flaws in their implementation per http://blade.nagaokaut.ac.jp/cgi-bin/vframe.rb/ruby/ruby-core/29293?29179-31097:

URI.escape
URI.unescape
URI.encode
URI.unencode
escape/encode are just aliases for one another. This PR uses the Addressable gem to replace these methods with its own encode, unencode, and encode_component methods where appropriate.

I have put all references to Addressable::URI here into the UrlHelper to keep them corralled in one place to make changes to this implementation easier.

Addressable is now also an explicit gem dependency.
2019-12-12 12:49:21 +10:00
Joffrey JAFFEUX
0d3d2c43a0
DEV: s/\$redis/Discourse\.redis (#8431)
This commit also adds a rubocop rule to prevent global variables.
2019-12-03 10:05:53 +01:00
Arpit Jalan
00c406520e FEATURE: allow FinalDestination to use custom user agent for specific hosts 2019-11-07 14:47:51 +05:30
Arpit Jalan
e90aac11cb fix the build 2019-08-07 16:39:58 +05:30
Arpit Jalan
b0e781e2d4 FIX: do not follow redirect on same host with path /login or /session 2019-08-07 16:26:55 +05:30
Sam Saffron
7429700389 FIX: ensure we can download maxmind without redis or db config
This also corrects FileHelper.download so it supports "follow_redirect"
correctly (it used to always follow 1 redirect) and adds a `validate_url`
param that will bypass all uri validation if set to false (default is true)
2019-05-28 10:28:57 +10:00
Sam Saffron
30990006a9 DEV: enable frozen string literal on all files
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.

Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
2019-05-13 09:31:32 +08:00
Gerhard Schlager
92df6890df FIX: GET request didn't use headers 2019-03-08 21:36:49 +01:00
Sam
cfddfa6de2 SECURITY: bypass long GET requests
In some rare cases we would check URLs with very large payloads
this ensures we always bypass and do not read entire payloads
2019-02-27 14:51:28 +11:00
Arpit Jalan
1ab91f0474 FIX: preserve github fragment URL 2018-12-19 12:34:47 +05:30
Guo Xiang Tan
8dc1463ab3 Enable Lint/ShadowingOuterLocalVariable for Rubocop. 2018-09-04 10:16:42 +08:00
Bianca Nenciu
b6963b8ffb FIX: Ignore OneBox blacklisted domains. 2018-08-27 20:40:55 +02:00
Régis Hanol
de92913bf4 FIX: store the topic links using the cooked upload url 2018-08-14 12:23:32 +02:00
Robin Ward
7058205f70 FIX: Broken specs 2018-07-24 12:00:34 -04:00
Robin Ward
236243f38a SECURITY: Consider 0.0.0.0 a private IP 2018-07-24 11:16:27 -04:00
Guo Xiang Tan
d43895e2a0 Don't log 404s for FinalDestination.
* We can't do anything about 404s
2018-05-25 10:11:16 +08:00
Guo Xiang Tan
142571bba0 Remove use of rescue nil.
* `rescue nil` is a really bad pattern to use in our code base.
  We should rescue errors that we expect the code to throw and
  not rescue everything because we're unsure of what errors the
  code would throw. This would reduce the amount of pain we face
  when debugging why something isn't working as expexted. I've
  been bitten countless of times by errors being swallowed as a
  result during debugging sessions.
2018-04-02 13:52:51 +08:00
Guo Xiang Tan
ee69d58a59 FIX: Tests could get stucked in infinite loop if it fails to resolve IP of a hostname. 2018-03-28 14:49:05 +08:00
Gerhard Schlager
4a54c09e46 FIX: Retry with GET request when HEAD fails with error 400 2018-02-27 12:07:16 +01:00
Régis Hanol
0559a4736a FIX: don't double request when downloading a file 2018-02-24 12:35:57 +01:00
Gerhard Schlager
b6277e208b FIX: Cookies header didn't have the right format 2018-02-19 12:46:57 +01:00
Sam
fa5880e04f PERF: ability to crawl for titles without extra HEAD req
Also, introduces a much more aggressive timeout for title crawling
and introduces gzip to body that is crawled
2018-01-29 15:40:12 +11:00
Gerhard Schlager
e30851e45a Move escape_uri method to a more suitable place 2017-12-12 20:17:46 +01:00
Régis Hanol
de037da731 FIX: FinalDestination's small_get method wasn't using proper request headers 2017-11-17 17:24:35 +01:00
Régis Hanol
aebcd56300 FIX: try a GET for error code 406 2017-11-17 16:59:51 +01:00
Régis Hanol
221ff24418 SQL != Ruby 2017-11-17 16:12:20 +01:00
Régis Hanol
a0fc8bd924 don't log 404s to gravatar.com 2017-11-17 15:38:26 +01:00
Sam
3ac7d041ae UX: generic onebox treats all square images as avatars and renders them smaller 2017-11-13 11:21:19 +11:00
Gerhard Schlager
d1f257d275 FinalDestination should only log when verbose is enabled 2017-10-31 17:16:59 +01:00
Gerhard Schlager
8c27f28dcb add more logging to FinalDestination 2017-10-31 12:26:35 +01:00
Sam Saffron
8185b8cb06 FEATURE: cache https redirects per hostname
If a hostname does an https redirect we cache that so next
lookup does not incur it.

Also, only rate limit per ip once per final destination

Raise final destination protection to 1000 ip lookups an hour
2017-10-17 16:22:54 +11:00
Sam
70bb2aa426 FEATURE: allow specifying s3 config via globals
This refactors handling of s3 so it can be specified via GlobalSetting

This means that in a multisite environment you can configure s3 uploads
without actual sites knowing credentials in s3

It is a critical setting for situations where assets are mirrored to s3.
2017-10-06 16:20:01 +11:00
Sam
8ecf313a81 FIX: correctly raise errors when downloads fail
This corrects an issue where we are hitting Gravatar for 404 over and over

Also ensures file download properly reports errors
2017-09-28 16:35:43 +10:00
Guo Xiang Tan
5324c01209 FIX: Don't raise an error if reading from URL timeout. 2017-09-27 14:53:22 +08:00
Guo Xiang Tan
367fb1c524 FIX: Onebox fails on encoded URL.
https://meta.discourse.org/t/onebox-breaks-if-theres-chinese-text-in-url/67364
2017-09-26 18:34:54 +08:00
Joffrey JAFFEUX
6cd8203686 FIX: allows onebox to force GET hosts returning wrong headers on HEAD 2017-08-08 11:44:27 +02:00
Arpit Jalan
b059a0f789 extract url escaping to a dedicated class method and improved tests 2017-07-29 22:16:51 +05:30
Arpit Jalan
1fe553873c FIX: preserve fragment identifier when escaping url 2017-07-29 17:22:45 +05:30
Guo Xiang Tan
5012d46cbd Add rubocop to our build. (#5004) 2017-07-28 10:20:09 +09:00