ijabbott · 07-18-2020, 02:38 PM

(07-17-2020 08:09 PM)parisse Wrote:
(07-17-2020 08:27 AM)ijabbott Wrote: Would it make more sense for the string to character code sequence (and vice versa) functions to deal with unicode character numbers instead of UTF-8 bytes?

Currently:

asc("πi") -> {207, 128, 105} # UTF-8 sequence for U+03C0, U+0069
char({207, 128, 105}) -> "πi"
ord("πi") -> 207 # only the first byte of the UTF-8 sequence for U+03C0
char(207) -> "" # empty string due to partial UTF-8 sequence

('ord' on a non-ASCII character is particularly useless.)

Preferable (IMHO):

asc("πi") -> {960, 105} # UCS sequence U+03C0, U+0069
char({960, 105}) -> "πi"
ord("πi") -> 960 # UCS U+03C0
char(960) -> "π"

This is certainly not an area where I will make changes, because that would certainly introduce a lot of potential errors. And it's not clear to me that a 16-bit Unicode encoding is really better than UTF-8; both encodings have their advantages and drawbacks, and it depends on what you are doing.

I wasn't really suggesting switching to UTF-16 or some other encoding. I was suggesting that the functions could convert between the UTF-8 encoding and UCS character numbers. The UCS numbers are the actual Unicode character numbers, and a UTF code sequence is the representation of one UCS character number as a sequence of one or more storage units (8-bit bytes in the case of UTF-8).
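
To illustrate the distinction, here is a rough sketch in Python rather than on the calculator. The function names (asc_utf8, asc_ucs, char_utf8, char_ucs) are hypothetical, chosen only to mirror the "currently" and "preferable" behaviours listed above; they are not the actual implementation of asc/ord/char.

```python
def asc_utf8(s):
    """Current-style behaviour: one number per UTF-8 byte."""
    return list(s.encode("utf-8"))


def asc_ucs(s):
    """Proposed behaviour: one number per Unicode character number."""
    return [ord(ch) for ch in s]


def char_utf8(codes):
    """Inverse of asc_utf8: treat the numbers as UTF-8 bytes."""
    return bytes(codes).decode("utf-8")


def char_ucs(codes):
    """Inverse of asc_ucs: treat the numbers as Unicode character numbers."""
    return "".join(chr(c) for c in codes)


if __name__ == "__main__":
    print(asc_utf8("πi"))              # [207, 128, 105]
    print(asc_ucs("πi"))               # [960, 105]
    print(char_utf8([207, 128, 105]))  # πi
    print(char_ucs([960, 105]))        # πi
```

In both cases the string can stay stored as UTF-8 internally; only the numbers exposed to the user differ.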