parisse · 07-17-2020, 08:09 PM

(07-17-2020 08:27 AM)ijabbott Wrote: Would it make more sense for the string to character code sequence (and vice versa) functions to deal with unicode character numbers instead of UTF-8 bytes?

Currently:

asc("πi") -> {207, 128, 105} # UTF-8 sequence for U+03C0, U+0069
char({207, 128, 105}) -> "πi"
ord("πi") -> 207 # only the first byte of the UTF-8 sequence for U+03C0
char(207) -> "" # empty string due to partial UTF-8 sequence

('ord' on a non-ASCII character is particularly useless.)

Preferable (IMHO):

asc("πi") -> {960, 105} # UCS sequence U+03C0, U+0069
char({960, 105}) -> "πi"
ord("πi") -> 960 # UCS U+03C0
char(960) -> "π"

This is certainly not an area where I will make changes, because doing so would open the door to a lot of potential errors. It's also not clear to me that a 16-bit Unicode encoding is really better than UTF-8: both encodings have their advantages and drawbacks, and which one is preferable depends on what you are doing.
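For anyone who wants to compare the two conventions discussed above, here is a rough Python sketch (not calculator code). The helper names `asc_utf8`, `asc_codepoints`, `char_from_utf8` and `char_from_codepoints` are made up for illustration and only mimic the behaviours listed in the quoted post; dropping an incomplete UTF-8 sequence with `errors="ignore"` is an assumption chosen to reproduce the empty string shown for `char(207)`.

```python
# Hypothetical helpers: they imitate the two conventions from the thread,
# they are NOT the calculator's built-in asc/ord/char.

def asc_utf8(s):
    """String -> list of UTF-8 byte values (the 'Currently' behaviour)."""
    return list(s.encode("utf-8"))

def asc_codepoints(s):
    """String -> list of Unicode code points (the 'Preferable' behaviour)."""
    return [ord(c) for c in s]

def char_from_utf8(byte_values):
    """List of UTF-8 byte values -> string.

    Assumption: incomplete sequences are silently dropped, which yields
    an empty string for a lone lead byte such as 207.
    """
    return bytes(byte_values).decode("utf-8", errors="ignore")

def char_from_codepoints(codes):
    """List of Unicode code points -> string."""
    return "".join(chr(c) for c in codes)

s = "\u03c0i"  # the string "πi" (U+03C0, U+0069)

print(asc_utf8(s))                      # [207, 128, 105]
print(asc_codepoints(s))                # [960, 105]
print(char_from_utf8([207, 128, 105]))  # πi
print(repr(char_from_utf8([207])))      # '' (partial UTF-8 sequence)
print(char_from_codepoints([960]))      # π
```

The byte-based view round-trips any string without needing to know about code points, while the code-point view makes single-character operations like `ord` meaningful for non-ASCII input; that is the trade-off both posts are pointing at.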