
compsystems · 07-14-2020, 01:56 AM

CAS MODE

asc("≤") [enter] {32,8804,32} ?
[up][up][copy]
asc("≤") [enter] {32,32,8804,32,32} ?
[up][up][copy]
asc("≤") [enter]
{32,32,32,8804,32,32,32}

HOME MODE OK

ijabbott · 07-14-2020, 09:01 AM (This post was last modified: 07-14-2020 09:05 AM by ijabbott.)

(07-14-2020 01:56 AM)compsystems Wrote: CAS MODE

asc("≤") [enter] {32,8804,32} ?
[up][up][copy]
asc("≤") [enter] {32,32,8804,32,32} ?
[up][up][copy]
asc("≤") [enter]
{32,32,32,8804,32,32,32}

HOME MODE OK

It seems to be inserting spaces around any "≤" characters in the string before evaluating the function.

Also, the real calculator in CAS mode seems to treat the string as a sequence of UTF-8 byte codes instead of the actual character codes, and so gives a different answer: {32,226,137,164,32}. But in HOME mode it produces the answer: {8804}.
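
A minimal Python sketch of the two interpretations described above. It is not Prime code; it only reproduces the numbers reported in this thread by reading the same string once as Unicode code points and once as UTF-8 bytes.

```python
# Illustration only: Python stands in for the calculator here.
s = "≤"  # U+2264 LESS-THAN OR EQUAL TO

code_points = [ord(c) for c in s]      # what HOME mode reports
utf8_bytes = list(s.encode("utf-8"))   # what CAS mode works with

# The same string after a space has been added on each side
padded_bytes = list(" ≤ ".encode("utf-8"))

print(code_points)   # [8804]
print(utf8_bytes)    # [226, 137, 164]
print(padded_bytes)  # [32, 226, 137, 164, 32]
```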

parisse · 07-14-2020, 05:59 PM

The CAS is working with UTF-8. Spaces are indeed added by the lexer to ensure that inequations are recognized, even if there are no spaces in the input (that's because the lexer is working in 8-bit mode).
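
A rough sketch of how such a padding pass could behave. The function name `pad_operators` is hypothetical and this is not the actual lexer code; it assumes the pass works on the raw byte stream and does not check whether the operator sits inside a quoted string. It also assumes that an already padded line gets padded again when it is copied from the history and re-entered, which would be consistent with the growing results reported at the top of the thread.

```python
# Hypothetical model of a byte-level padding pass; not the real lexer.
LE = "≤".encode("utf-8")  # the three bytes 226, 137, 164

def pad_operators(source: bytes) -> bytes:
    """Surround every ≤ with spaces, whether or not it is inside a string."""
    return source.replace(LE, b" " + LE + b" ")

line = 'asc("≤")'.encode("utf-8")
once = pad_operators(line)    # first evaluation
twice = pad_operators(once)   # padded text copied and evaluated again

print(once.decode("utf-8"))   # asc(" ≤ ")
print(twice.decode("utf-8"))  # asc("  ≤  ")
```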

DrD · 07-14-2020, 06:56 PM

Notice that repeated calls to that character add additional spaces:

[CAS]
ASC("≤"); ==> {32, 8804, 32}
ASC("≤"); ==> {32, 32, 8804, 32, 32}
ASC("≤"); ==> {32, 32, 32, 8804, 32, 32, 32}

And so forth ...

Joe Horn · 07-14-2020, 07:13 PM

(07-14-2020 09:01 AM)ijabbott Wrote: Also, the real calculator in CAS mode seems to treat the string as a sequence of UTF-8 byte codes instead of the actual character codes, and so gives a different answer: {32,226,137,164,32}. But in HOME mode it produces the answer: {8804}.

Also, CAS treats ASC() and asc() differently for non-ASCII characters. Try both on the Greek lower-case pi character (Shift 3):
ASC("π") --> {960}
asc("π") --> {207, 128}

parisse · 07-17-2020, 05:01 AM

ASC is not a CAS command. CAS commands are objects, so in most situations you can detect a CAS command by typing its name alone:
asc returns 'asc'
ASC errors.

ijabbott · 07-17-2020, 07:49 AM

(07-14-2020 05:59 PM)parisse Wrote: The CAS is working with UTF-8. Spaces are indeed added by the lexer to ensure that inequations are recognized, even if there are no spaces in the input (that's because the lexer is working in 8-bit mode).

Is there a way to create a quoted string without the string content being tampered with by the lexer?

ijabbott · 07-17-2020, 08:27 AM

Would it make more sense for the string-to-character-code-sequence functions (and vice versa) to deal with Unicode character numbers instead of UTF-8 bytes?

Currently:

asc("πi") -> {207, 128, 105} # UTF-8 sequence for U+03C0, U+0069
char({207, 128, 105}) -> "πi"
ord("πi") -> 207 # only the first byte of the UTF-8 sequence for U+03C0
char(207) -> "" # empty string due to partial UTF-8 sequence

('ord' on a non-ASCII character is particularly useless.)

Preferable (IMHO):

asc("πi") -> {960, 105} # UCS sequence U+03C0, U+0069
char({960, 105}) -> "πi"
ord("πi") -> 960 # UCS U+03C0
char(960) -> "π"
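
The two behaviours can be compared side by side in Python. The function names below (`asc_utf8`, `char_utf8`, `asc_ucs`, `char_ucs`) are made up for this illustration, and using `errors="ignore"` to imitate the empty string returned for a partial sequence is an assumption, not a description of how the CAS handles it internally.

```python
# Illustration only; these helpers are hypothetical, not Prime functions.

def asc_utf8(s):
    """Current CAS-style behaviour: one number per UTF-8 byte."""
    return list(s.encode("utf-8"))

def char_utf8(codes):
    """Rebuild a string from UTF-8 byte values, dropping incomplete sequences."""
    return bytes(codes).decode("utf-8", errors="ignore")

def asc_ucs(s):
    """Proposed behaviour: one number per Unicode code point."""
    return [ord(c) for c in s]

def char_ucs(codes):
    """Rebuild a string from Unicode code points."""
    return "".join(chr(c) for c in codes)

print(asc_utf8("πi"))               # [207, 128, 105]
print(char_utf8([207, 128, 105]))   # πi
print(repr(char_utf8([207])))       # '' (lone lead byte is dropped)
print(asc_ucs("πi"))                # [960, 105]
print(char_ucs([960]))              # π
```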

parisse · 07-17-2020, 08:05 PM

(07-17-2020 07:49 AM)ijabbott Wrote: Is there a way to create a quoted string without the string content being tampered with by the lexer?

Unfortunately no. It would require a change in set_lexer_string in the file input_lexer.ll (more precisely, this function would need to determine whether the special characters are inside or outside a string).
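
For illustration, a string-aware version of such a pass might look like the sketch below. This is not the code in input_lexer.ll; the function name `pad_outside_strings` is hypothetical, and the sketch simply toggles a flag at each double quote, so it ignores any escaping rules the real lexer may have.

```python
# Hypothetical sketch of a padding pass that leaves string contents alone.

def pad_outside_strings(source: str, op: str = "≤") -> str:
    """Add spaces around `op` only where it appears outside double quotes."""
    out = []
    in_string = False
    for ch in source:
        if ch == '"':
            in_string = not in_string  # entering or leaving a string literal
            out.append(ch)
        elif ch == op and not in_string:
            out.append(" " + op + " ")
        else:
            out.append(ch)
    return "".join(out)

print(pad_outside_strings('asc("≤")'))  # asc("≤")
print(pad_outside_strings("a≤b"))       # a ≤ b
```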

parisse · 07-17-2020, 08:09 PM

(07-17-2020 08:27 AM)ijabbott Wrote: Would it make more sense for the string-to-character-code-sequence functions (and vice versa) to deal with Unicode character numbers instead of UTF-8 bytes?

Currently:

asc("πi") -> {207, 128, 105} # UTF-8 sequence for U+03C0, U+0069
char({207, 128, 105}) -> "πi"
ord("πi") -> 207 # only the first byte of the UTF-8 sequence for U+03C0
char(207) -> "" # empty string due to partial UTF-8 sequence

('ord' on a non-ASCII character is particularly useless.)

Preferable (IMHO):

asc("πi") -> {960, 105} # UCS sequence U+03C0, U+0069
char({960, 105}) -> "πi"
ord("πi") -> 960 # UCS U+03C0
char(960) -> "π"

This is certainly not an area where I will make changes, because that would mean a lot of potential errors. And it's not clear to me that a 16-bit Unicode encoding is really better than UTF-8; both encodings have their advantages and drawbacks, and it depends on what you are doing.

ijabbott · 07-18-2020, 02:38 PM

(07-17-2020 08:09 PM)parisse Wrote:
(07-17-2020 08:27 AM)ijabbott Wrote: Would it make more sense for the string-to-character-code-sequence functions (and vice versa) to deal with Unicode character numbers instead of UTF-8 bytes?

Currently:

asc("πi") -> {207, 128, 105} # UTF-8 sequence for U+03C0, U+0069
char({207, 128, 105}) -> "πi"
ord("πi") -> 207 # only the first byte of the UTF-8 sequence for U+03C0
char(207) -> "" # empty string due to partial UTF-8 sequence

('ord' on a non-ASCII character is particularly useless.)

Preferable (IMHO):

asc("πi") -> {960, 105} # UCS sequence U+03C0, U+0069
char({960, 105}) -> "πi"
ord("πi") -> 960 # UCS U+03C0
char(960) -> "π"

This is certainly not an area where I will make changes, because that would mean a lot of potential errors. And it's not clear to me that a 16-bit Unicode encoding is really better than UTF-8; both encodings have their advantages and drawbacks, and it depends on what you are doing.

I wasn't really suggesting switching to UTF-16 or some other encoding. I was suggesting that the functions could convert between the UTF encoding and UCS. The UCS numbers are the actual Unicode character numbers, and the UTF code sequence is the representation of that UCS character number as a sequence of one or more storage units (8-bit bytes in the case of UTF-8).
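
To make the distinction concrete, here is a small hand-written converter between UTF-8 byte values and UCS code points, written in Python for illustration. The function names are hypothetical, and the sketch skips the checks a production decoder needs (overlong forms, surrogates, values above U+10FFFF).

```python
# Illustration only: a simplified UTF-8 <-> code point converter.

def utf8_to_code_points(data):
    """Decode a list of UTF-8 byte values into a list of code points."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx: single byte
            n, cp = 0, b
        elif b >> 5 == 0b110:        # 110xxxxx: one continuation byte
            n, cp = 1, b & 0x1F
        elif b >> 4 == 0b1110:       # 1110xxxx: two continuation bytes
            n, cp = 2, b & 0x0F
        elif b >> 3 == 0b11110:      # 11110xxx: three continuation bytes
            n, cp = 3, b & 0x07
        else:
            raise ValueError("invalid lead byte at index %d" % i)
        if i + n >= len(data):
            raise ValueError("truncated sequence at index %d" % i)
        for k in range(1, n + 1):
            c = data[i + k]
            if c >> 6 != 0b10:       # continuation bytes are 10xxxxxx
                raise ValueError("invalid continuation byte at index %d" % (i + k))
            cp = (cp << 6) | (c & 0x3F)
        out.append(cp)
        i += n + 1
    return out

def code_point_to_utf8(cp):
    """Encode one code point as a list of UTF-8 byte values."""
    if cp < 0x80:
        return [cp]
    if cp < 0x800:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]

print(utf8_to_code_points([207, 128, 105]))  # [960, 105]
print(code_point_to_utf8(8804))              # [226, 137, 164]
```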

Claudio L. · 07-23-2020, 07:08 PM

(07-17-2020 08:27 AM)ijabbott Wrote: Preferable (IMHO):

asc("πi") -> {960, 105} # UCS sequence U+03C0, U+0069
char({960, 105}) -> "πi"
ord("πi") -> 960 # UCS U+03C0
char(960) -> "π"

It is somewhat better, and I implemented that in newRPL, but...
Unfortunately, it doesn't solve the whole issue, since some characters are composed of more than one code point, so a string with 2 symbols may return a list with 3 codes, and list(2) doesn't necessarily hold the second character. That alone renders the list-of-codes format pretty much useless as a way of accessing string characters.
To make it even more complex, in some cases the same symbol may be represented as a unique code or as a sequence of codes, so 2 strings may look exactly the same but produce 2 different lists of codes.
Ideally you would have Unicode-aware routines that let you do string(2) and guarantee to return the second character (which may in itself be a string of several codes).
Then you also need a Unicode-aware comparison that can do NFC normalization so it can detect the case when characters are the same expressed differently.
Converting to a list to use generic list functions will never give you a perfect answer, hence there's not a lot of effort put into that conversion.
ASCII was easy, Unicode is no easy subject.
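
The points above can be reproduced with Python's standard `unicodedata` module, used here purely as an illustration (it says nothing about how newRPL or the Prime handle this). The `clusters` helper is a hypothetical, heavily simplified grouping that attaches combining marks to the preceding character; it is not full grapheme cluster segmentation.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Two strings that look identical but give different code lists
print([ord(c) for c in composed])    # [233]
print([ord(c) for c in decomposed])  # [101, 769]
print(composed == decomposed)        # False

# NFC normalization makes the comparison succeed
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

def clusters(s):
    """Simplified: glue each combining mark onto the character before it."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

# Two visible symbols, three code points
text = "e\u0301x"
print(len(text))            # 3
print(len(clusters(text)))  # 2
```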

ijabbott · 07-24-2020, 10:54 PM

(07-23-2020 07:08 PM)Claudio L. Wrote: To make it even more complex, in some cases the same symbol may be represented as a unique code or as a sequence of codes, so 2 strings may look exactly the same but produce 2 different lists of codes.
Ideally you would have Unicode-aware routines that let you do string(2) and guarantee to return the second character (which may in itself be a string of several codes).
Then you also need a Unicode-aware comparison that can do NFC normalization so it can detect the case when characters are the same expressed differently.
Converting to a list to use generic list functions will never give you a perfect answer, hence there's not a lot of effort put into that conversion.
ASCII was easy, Unicode is no easy subject.

It seems that the Prime doesn't support combining character sequences anyway, so normalisation shouldn't be an issue.