fish-shell/tests/checks/locale.fish

#RUN: %fish -C "set fish %fish" %s
# This hangs when running on GitHub Actions with TSan for unknown reasons,
# see #7934.
#REQUIRES: test -z "$GITHUB_WORKFLOW"

# We typically try to force a UTF-8-capable locale;
# this turns that off.
set -gx fish_allow_singlebyte_locale 1

# A function to display bytes, necessary because GNU and BSD implementations of `od` have different output.
# We used to use xxd, but it's not available everywhere. See #3797.
#
# We use the lowest common denominator format, `-b`, because it should work in all implementations.
# I wish we could use the `-t` flag but it isn't available in every OS we're likely to run on.
#
function display_bytes
    od -b | sed -e 's/  */ /g' -e 's/  *$//'
end
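
# A quick sanity check of the helper (an illustrative addition, not part of the
# upstream test): for pure-ASCII input, `od -b` prints one octal value per byte,
# followed by a final line holding the total byte count as an octal offset.
echo -n ABC | display_bytes
#CHECK: 0000000 101 102 103
#CHECK: 0000003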

# Verify that our UTF-8 locale produces the expected output.
echo -n A\u00FCA | display_bytes
#CHECK: 0000000 101 303 274 101
#CHECK: 0000004

# Verify that exporting a change to the C locale produces the expected output.
# The output should include the literal byte \xFC rather than the UTF-8 sequence for \u00FC.
begin
    set -lx LC_ALL C
    echo -n B\u00FCB | display_bytes
end
#CHECK: 0000000 102 374 102
#CHECK: 0000003

# Since the previous change was localized to a block it should no
# longer be in effect and we should be back to a UTF-8 locale.
echo -n C\u00FCC | display_bytes
#CHECK: 0000000 103 303 274 103
#CHECK: 0000004

# Verify that setting a non-exported locale var doesn't affect the behavior.
# The output should include the UTF-8 sequence for \u00FC rather than the
# literal byte \xFC, just like the previous test.
begin
    set -l LC_ALL C
    echo -n D\u00FCD | display_bytes
end
#CHECK: 0000000 104 303 274 104
#CHECK: 0000004

# Verify that fish can pass through non-ASCII characters in the C/POSIX
# locale. This is to prevent regression of
# https://github.com/fish-shell/fish-shell/issues/2802.
#
# These tests are needed because the relevant standards allow the functions
# mbrtowc() and wcrtomb() to treat bytes with the high bit set as either valid
# or invalid in the C/POSIX locales. GNU libc treats those bytes as invalid.
# Other libc implementations (e.g., BSD) treat them as valid. We want fish to
# always treat those bytes as valid.

# The fish in the middle of the pipeline should be receiving a UTF-8 encoded
# version of the Unicode from the echo. It should pass those bytes through
# literally since it is in the C locale. We verify this by first passing the
# echo output directly to `display_bytes`, then via a fish instance. The
# output should be octal "130 303 273 130" (hex 58c3bb58) for the first
# statement and octal "130 303 274 130" (hex 58c3bc58) for the second.
echo -n X\u00FBX | display_bytes
echo X\u00FCX | env LC_ALL=C $fish -c 'read foo; echo -n $foo' | display_bytes
#CHECK: 0000000 130 303 273 130
#CHECK: 0000004
#CHECK: 0000000 130 303 274 130
#CHECK: 0000004

# The next tests deliberately spawn another fish instance to test inheritance of env vars.

# This test is subtle. Despite the presence of the \u00fc Unicode char (a "u"
# with an umlaut), the fact that the locale is C/POSIX will cause the \xfc byte
# to be emitted rather than the usual UTF-8 sequence \xc3\xbc. That's because
# the few single-byte Unicode chars (that are not ASCII) are generally in the
# ISO 8859-x char sets which are encompassed by the C locale. The output should
# be octal "131 374 131" (hex 59fc59).
env LC_ALL=C $fish -c 'echo -n Y\u00FCY' | display_bytes
#CHECK: 0000000 131 374 131
#CHECK: 0000003

# The user can specify a wide Unicode character (one requiring more than a
# single byte). In the C/POSIX locales we substitute a question mark for the
# unencodable wide char. The output should be octal "124 077 124" (hex 543f54).
env LC_ALL=C $fish -c 'echo -n T\u01FDT' | display_bytes
#CHECK: 0000000 124 077 124
#CHECK: 0000003

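# \X byte escapes are decoded eagerly, so a multi-byte UTF-8 sequence written
# as \X bytes matches the literal character, and \X2b ("+") works as a math
# operator just like \x2b does.
string match ö \Xc3\Xb6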
#CHECK: ö

math 5 \X2b 5
#CHECK: 10

math 7 \x2b 7
#CHECK: 14

# \x behaves the same as \X, including for byte values above 0x7f (see #1352).
echo \xc3\xb6
# CHECK: ö
echo \Xc3\Xb6
# CHECK: ö