Threaded Mode | Linear Mode

John Keith · (This post was last modified: 10-07-2018 03:11 PM by John Keith.)

Reading this post in the HHC 2018 Programming Contests thread reminded me that I had always thought of HEAD as being slow, so I decided to run some tests. HEAD is slightly faster than 1. GET (by about 1 millisecond) but both slow down significantly on large lists. The longer (and much less elegant!) sequence 1. 1. SUB EVAL is only slightly faster than HEAD on small lists, but note the timings on a list of 1000 items:

Code:

HEAD           22ms

1. GET         23ms

1. 1. SUB EVAL 3.1ms

This is shocking to me since HEAD is one of the fundamental list processing commands and one would expect it to be highly optimized. There are situations where 1. 1. SUB EVAL does not give the same result as HEAD (such as a list of programs), but I still can't see why HEAD is so much slower.

Edited to add that 2. OVER SIZE SUB is slower than TAIL (49ms vs 28ms on a list of 1000 integers). Curiouser and curiouser! I suspect that garbage collection is partially responsible for the slowness on large lists but I still don't know why HEAD is that much slower than SUB.

3298 · (This post was last modified: 10-07-2018 09:17 PM by 3298.)

That's curious indeed.
As some forum regulars might remember (despite my relatively low post frequency) I have some SysRPL experience, and with the essentials for on-board SysRPL and ASM development installed, it was easy for me to peek inside the guts of the commands and understand most of it.
HEAD is, as mentioned in the post you linked, a romptr. That's SysRPL talk for a library command; several standard UserRPL commands are like that. I'll make it short and just say that this incurs a certain performance penalty, but that penalty does not depend on what data is on the stack. A garbage collection can be triggered while resolving the romptr, but I don't think that's the reason for your measurements. You'll see why in just a moment.

Most SysRPL commands do not check if they get correct arguments - there are dedicated SysRPL commands to ensure that, and the typical implementation of a UserRPL command consists of a bunch of these checking commands followed by the commands that actually perform the operation. In HEAD's case, after confirming that at least one object is on the stack, and it is a list, this command sequence is run:

Code:

::

  DUPLENCOMP

  #0=case

  PTR 1933C

  CARCOMP

;

where CARCOMP is the command that actually does the work, and the unnamed program at address 1933C is:

Code:

::

  # 501

  ERROROUT

  (some irrelevant stuff follows; it's only here due to extreme code size optimization and cannot be reached normally due to the never-returning error handler launched by the previous command)

;

Translated into UserRPL commands, HEAD would therefore look approximately like this:

Code:

\<<

  DUP SIZE

  IF 0. ==

  THEN #501h DOERR

  END

  (actually retrieve the first element of the list here)

\>>

The crucial bit is the error checking code for empty lists. It counts the elements in the list and checks if that number is 0. In RPL a list is just a sequence of objects terminated by an end-of-list token (called SEMI, its address in the 49g and later is 0312B), similar to a null-terminated character string in C. Getting the list length has to be done like the C function strlen: walking through the list and adding 1 for any element encountered until the end token is reached. That obviously takes longer the more elements the list has.
In the end, even though HEAD should only need to access the first element in the list, it walks through the entire list.
You might wonder why it doesn't just pull the first element and then checks if it's SEMI ... the answer is "don't do that", because a SEMI on the stack can cause serious trouble, e.g. when it's decompiled for displaying or editing (the decompiler goes "what the heck ended here" and might very well crash or wreak havoc in your RAM). The proper solution would be the NULLCOMP? SysRPL command which does the zero-length check in ASM and only pushes the result of the comparison to the stack.

GET is a similar case. If it gets a list and a real number (it can process much more, e.g. a matrix and a list which should contain one or two elements, but we'll ignore that for now), then it checks if the number is 0 or lower (that's an error, of course), and finally if the number is at most the list length. How does it get the list length, you ask? Why, DUPLENCOMP of course! So the performance issue for long lists is the same as with HEAD.
The faster way would have been to use ASM code to walk through the list until either the desired index or the end is reached, which obviously avoids walking all the way to the end of a long list when the index is small. Unfortunately that ASM code would be significantly longer, and the error-generating subprogram of HEAD already showed us a glimpse of how tight the size limits were and what mad tricks the developers used to keep the size down. (As a matter of fact, some runstream abuse would have allowed writing this procedure in SysRPL as well, but at a similar size to the ASM version, and slower for high indices too.)

SUB is a rare beast. Its list-and-two-reals version just converts the numbers to system binary integers and refers to a SysRPL command without prior index-checking. Instead (believe it or not - if you don't, SUBCOMP is the command to read up about), the ASM code contained in that command performs the checks while it's doing its duty, similar to how I outlined in the paragraph about how I would have liked GET to be implemented.
With that knowledge in mind, maybe GET should have just called the internal version of SUB... that would have been possible in the same size GET's list-and-real version has now, but I don't think we'll get another ROM version, so that's a moot point anyway.

Edit: forgot to explain TAIL vs. 2. OVER SIZE SUB. While TAIL is a romptr like HEAD whereas none of the commands in the replacement sequence are, the performance difference is much easier to explain. TAIL actually does the SysRPL equivalent of 2. 1048575. SUB (note: larger-than-length stop index doesn't bother the internal version of SUB, and the UserRPL command SUB even inherits that trait), so the difference is obviously your use of OVER SIZE. Again, measuring the size of a list involves walking all the way through it, so SIZE takes a while to finish on your long list, whereas the large number used by TAIL is just a constant which gets put on the stack without any lengthy calculation. (By the way, the exact value of this constant has a meaning, it's the largest number representable by a 20-bit unsigned integer, which is what system binary integers really are.)

pier4r · 10-07-2018, 09:19 PM

nice input John, and thanks 3298 for the info!

Pretty sure DavidM may add something too (even to confirm things).

John Keith · (This post was last modified: 10-07-2018 09:31 PM by John Keith.)

Thanks for your detailed reply, 3298. I have only a vague knowledge of the internal structure of objects but I had assumed that a list had its size encoded within it somehow. The user command SIZE seems to be quite fast even on large lists.

I am aware of the compromises necessary to fit the operating system and commands into a relatively small amount of memory- "Life is short and ROM is full." It still irks me that such basic commands as HEAD and TAIL were not better thought out.

DavidM · 10-08-2018, 03:54 AM

(10-07-2018 09:19 PM)pier4r Wrote: nice input John, and thanks 3298 for the info!

Pretty sure DavidM may add something too (even to confirm things).

There's nothing to add! Thanks to 3298 for the excellent analysis and explanation.

pier4r · (This post was last modified: 10-08-2018 09:56 AM by pier4r.)

(10-07-2018 09:31 PM)John Keith Wrote: It still irks me that such basic commands as HEAD and TAIL were not better thought out.

With nowadays experience that a lot of software improves over time, I find it ok to first have a solution (the HEAD, TAIL commands) that spares me the time to code a similar routine. The optimizations can come after.
This is possible with products that are subject to upgrades, with a discontinued product one has to accept what one has.

Or one waits for gold code like the one from DavidM (or one does his own code).

3298 · (This post was last modified: 10-08-2018 10:17 AM by 3298.)

Yes you could add something! ... well, to your list commands library.
Proposed name: LHEAD
Proposed implementation:

Code:

::

  CK1&Dispatch

  BINT5

  ::

    DUPNULLCOMP?

    case

    ::

      # 501

      ERROROUT

    ;

    CARCOMP

  ;

;

If you want to be even more efficient, you could investigate how stable PTR 1933C is across ROM versions (or at least across the ROM versions you want to support), and possibly use that in place of the innermost embedded subprogram. With that modification LHEAD would be pretty much identical to a theoretical fixed version of HEAD, except it's missing the string handling part (but because RPL strings have a size field, HEAD's length-checking approach is fine for those; no need to duplicate that code).
I only replaced two 5-nibble commands by two other 5-nibble commands, (DULENCOMP #0=case became DUPNULLCOMP? case), so an adventurous person could even fix their ROM in situ without much headache, but I don't think there are many people who would actually do that.
As a library command LHEAD would obviously have the performance penalty and the length of a romptr, but since HEAD is one too, we don't exactly lose anything there.

TAIL is already as good as it can get. Well, apart from the romptr thing, but if the developers were out of space in the permanently mapped part of ROM, that's one of two solutions, the other being the flashptr introduced with the 49g: 12 nibbles long compared to the romptr's 11 nibbles, and slightly faster (though for long lists this becomes irrelevant because then TAIL spends most of its time walking through it; for TAIL this is unavoidable because it needs to copy all list elements except for the first one to a new list).

Werner · 10-08-2018, 11:03 AM

That's why, in my GX library for the SX, I implemented HEAD as

Code:

ASSEMBLE

        CON(1)  8

RPL

xNAME HEAD

::

  CK1&Dispatch

  FIVE

  ::

    >R

    ticR

    ITE

    RDROP

    INVDIMERR

  ;

  THREE

  ::

    DUPLEN$ #0=case INVDIMERR

    CAR$

    CHR>$

  ;

;

Where INVDIMERR is

Code:

NULLNAME INVDIMERR

:: # 501 ERROROUT ;

Cheers, Werner

pier4r · 10-08-2018, 02:08 PM

In the listExt there should be already LPOP or LPOPR to have a sort of head behavior, isn't it?

DavidM · 10-08-2018, 03:36 PM

(10-08-2018 10:06 AM)3298 Wrote: Yes you could add something! ... well, to your list commands library.
...
If you want to be even more efficient, you could investigate how stable PTR 1933C is across ROM versions (or at least across the ROM versions you want to support), and possibly use that in place of the innermost embedded subprogram.

(10-08-2018 11:03 AM)Werner Wrote: That's why, in my GX library for the SX, I implemented HEAD as
...

^ErrBadDim (FPTR 6 92) appears to be a supported flash pointer that performs the # 501 ERROROUT sequence, at least on the supported platforms for ListExt.

I certainly don't mind including a list-focused HEAD replacement in ListExt ('LHEAD' definitely follows the established naming convention Smile

). Both of the above approaches are able to check for an empty list and isolate the first element fairly efficiently, and a preliminary check on an emulated system shows them to perform similarly. I'll follow that up with some tests on a real 50g when I get a moment, but I'd be surprised if there were any meaningful performance differences between the two approaches. 3298's is a couple of bytes smaller, so I'd lean in that direction if there's truly no performance difference on real hardware.

(10-08-2018 02:08 PM)pier4r Wrote: In the listExt there should be already LPOP or LPOPR to have a sort of head behavior, isn't it?

The big difference with LPOP/R is that they leave the residual list elements on the stack. This means that they both have a need of traversing and copying the entire list in order to return the complete results, thus incurring the performance penalties both of those steps require. The main benefit of something like LHEAD is that it would only need to deal with the first element, and thus it can operate much faster than something that has to scan/rebuild the entire list. If you don't need the residual list elements, LHEAD would be significantly faster than LPOP NIP.

Nice discussion!

David Hayden · 10-08-2018, 03:42 PM

(10-08-2018 10:06 AM)3298 Wrote: Yes you could add something! ... well, to your list commands library.
Proposed name: LHEAD
Proposed implementation:

How about this instead:

Code:

::

  CK1&Dispatch

  BINT5

  ::

    DUPNULLCOMP?

    NOTcase CARCOMP

    # 501

    ERROROUT

  ;

;

This is shorter and probably faster in the expected case because it doesn't have to skip the secondary for the error case.

DavidM · 10-08-2018, 04:12 PM

(10-08-2018 03:42 PM)David Hayden Wrote: How about this instead:

Code:

:: CK1&Dispatch BINT5 :: DUPNULLCOMP? NOTcase CARCOMP # 501 ERROROUT ; ;
This is shorter and probably faster in the expected case because it doesn't have to skip the secondary for the error case.

That's a nice optimization!

It's similar to what I tested on the emulator when I posted the previous message, just in a reversed form:

Code:

::

  CK1&Dispatch

  list ::

    DUPNULLCOMP? case FLASHPTR ErrBadDim

    CARCOMP

  ;

;

For comparison to a similarly-modified version of Werner's code which doesn't include string support:

Code:

::

  CK1&Dispatch

  list ::

    >R

    ticR ITE

      RDROP

      FLASHPTR ErrBadDim

  ;

;

At least on my emulated 50g, their performance is equivalent. I'll check on real hardware later today (I hope).

Any other optimizations jump out?

pier4r · (This post was last modified: 10-08-2018 05:24 PM by pier4r.)

(10-08-2018 03:36 PM)DavidM Wrote: The big difference with LPOP/R is that they leave the residual list elements on the stack. This means that they both have a need of traversing and copying the entire list in order to return the complete results, thus incurring the performance penalties both of those steps require. The main benefit of something like LHEAD is that it would only need to deal with the first element, and thus it can operate much faster than something that has to scan/rebuild the entire list. If you don't need the residual list elements, LHEAD would be significantly faster than LPOP NIP.

Point! Then go for it!
Obviously LTAIL follows.

3298 · 10-08-2018, 05:55 PM

I had missed ^ErrBadDim because it's not filed under the Error Handling category in chapter 2 ("General SysRPL Entries") of SDIAG's EBR, but instead under the Errors category in chapter 4 ("The HP49G CAS"), which I totally forgot about. DavidM, thanks for reminding me about that - I think some of my SysRPL projects could do with an ^ErrBadDim makeover as well. I know for sure that my L856 (unreleased, don't bother trying to find it) has some conditions for raising that error.

I've run some tests on real hardware, with ^ErrBadDim in all versions, and under as equal as possible conditions even in terms of garbage collection influence and memory available. I came to the conclusion that while the three variants still in discussion (mine is out, DavidM's is an obvious improvement over it) are pretty close to each other, Werner's return-stack-based variant is definitely the slowest one. DavidM's somehow managed to beat David Hayden's by a small margin - but consistently. No clue why, because David Hayden's reasoning about skipping appears solid to me. Maybe because SKIPOB has been migrated to ARM code?

Just for fun, I tested vanilla HEAD under the same conditions. Even on the two-element list I used for testing, it was quite a bit slower than the slowest of the three. And when I tested the actual command HEAD as romptr without recalling its contents using ROMPTR@, the time went off the charts. On these small test cases a romptr apparently has a significant impact.

Using PTR 1933C in place of ^ErrBadDim speeds up DavidM's and Werner's versions a bit more. David Hayden's version obviously doesn't benefit from that at all, because it is neither executed nor skipped there. It's also smaller and should be slightly faster in case of an error (maybe it will get caught and handled in some application?), but ultimately its down to DavidM as maintainer of ListExt to decide between that and the guarantees of a stable entry point.

My testing code:

Code:

::

  ' ::

    10000 ZERO_DO

      { %1 %2 }

      OVER EVAL DROP

    LOOP

  ;

  xTEVAL

;

Remember to put the test subject on the stack before running the testing program, it's not secured against empty stack and other bad situations - my thinking behind that was to spend as few CPU cycles as possible outside the test subject for better measurement.

(10-08-2018 05:24 PM)pier4r Wrote: Point! Then go for it!
Obviously LTAIL follows.

I don't think there's much point in trying to improve on TAIL... that would only be for symmetry.

DavidM · 10-08-2018, 07:47 PM

(10-08-2018 05:55 PM)3298 Wrote: I've run some tests on real hardware, with ^ErrBadDim in all versions, and under as equal as possible conditions even in terms of garbage collection influence and memory available. I came to the conclusion that while the three variants still in discussion (mine is out, DavidM's is an obvious improvement over it) are pretty close to each other, Werner's return-stack-based variant is definitely the slowest one.

Thanks for doing that -- you saved me a lot of time!

(10-08-2018 05:55 PM)3298 Wrote: DavidM's somehow managed to beat David Hayden's by a small margin - but consistently. No clue why, because David Hayden's reasoning about skipping appears solid to me. Maybe because SKIPOB has been migrated to ARM code?

NOTcase has one extra jump in it compared to case (presumably to use GONC instead of GOVLNG), but that should be a nit in the final timing. That's the only reason I can think of for a difference, though.

(10-08-2018 05:55 PM)3298 Wrote: Just for fun, I tested vanilla HEAD under the same conditions. Even on the two-element list I used for testing, it was quite a bit slower than the slowest of the three. And when I tested the actual command HEAD as romptr without recalling its contents using ROMPTR@, the time went off the charts. On these small test cases a romptr apparently has a significant impact.

Which also underscores why short-and-fast library routines are much more heavily impacted by where the library is stored: the access/load time for the command code is a much greater percentage of the total execution time, so the system overhead is a more significant part of the overall performance.

(10-08-2018 05:55 PM)3298 Wrote: Using PTR 1933C in place of ^ErrBadDim speeds up DavidM's and Werner's versions a bit more. David Hayden's version obviously doesn't benefit from that at all, because it is neither executed nor skipped there. It's also smaller and should be slightly faster in case of an error (maybe it will get caught and handled in some application?), but ultimately its down to DavidM as maintainer of ListExt to decide between that and the guarantees of a stable entry point.

As far as ListExt is concerned, I'm OK using PTR 1933C if it is stable in all ROM versions of the 49G as well as the released versions of the 49g+/50g/48gII up through 2.15. Since I don't know of a sure-fire way to verify that, I'll probably just stick with ErrBadDim.

(10-08-2018 05:55 PM)3298 Wrote:
(10-08-2018 05:24 PM)pier4r Wrote: Point! Then go for it!
Obviously LTAIL follows.
I don't think there's much point in trying to improve on TAIL... that would only be for symmetry.

I could see perhaps creating a symbolic link for an LTAIL command that simply maps to TAIL, just for the symmetry in naming. Although now that I'm thinking about it, how much time would be saved by not having the dispatch defined for a string object? It strikes me that I've never actually gone to the trouble to see what the incremental performance impact is for each additional defined dispatch type of a single command. More things to check!

John Keith · 10-08-2018, 08:27 PM

I favor not including LTAIL if there is no significant speed improvement. Better not to bulk up the Library or its menus unnecessarily, IMHO.

pier4r · (This post was last modified: 10-10-2018 11:08 AM by pier4r.)

yes I was thinking about LTAIL as super quick like the SUB command. But likely TAIL goes traversing the list as well, so it is not needed. What I think it is needed is a mention in the documentation.
"LTAIL --> use TAIL".

DavidM · 10-10-2018, 02:56 PM

(10-08-2018 08:27 PM)John Keith Wrote: I favor not including LTAIL if there is no significant speed improvement.

(10-10-2018 11:07 AM)pier4r Wrote: What I think it is needed is a mention in the documentation.

I was originally thinking that I could include a symbolic link entry (tNAME) in the ListExt catalog that would go by the name LTAIL but would actually compile to a reference to TAIL (including it in the library's menu wouldn't be necessary). RPLCOMP appears perfectly happy with that situation, but SLOAD sees the ROMPTR as a duplicate entry and thus won't load the reference. So in simpler terms, "it doesn't work".

That means the only way I could manage it would be to create a short object that simply jumps to the TAIL code. That works, but because it is essentially just an extra jump in the execution chain, it can only slow things down.

This is a long way to say that I agree that even just creating an "alias" (LTAIL->TAIL) isn't worthwhile.

As for the documentation, I would think listing "TAIL (standard RPL command)" in the "See also:" section for LHEAD should be adequate.

3298 · 10-10-2018, 05:41 PM

(10-10-2018 11:07 AM)pier4r Wrote: yes I was thinking about LTAIL as super quick like the SUB command. But likely TAIL goes traversing the list as well, so it is not needed. What I think it is needed is a mention in the documentation.
"LTAIL --> use TAIL".

Pretty much that.

TAIL is almost as quick as SUB because it internally uses the same SysRPL command as SUB (that one being SUBCOMP). The only reason it's slower is the romptr: TAIL is one, SUB isn't, but SUB needs two numbers to be equivalent to TAIL, the first being 2 and the second being at least as large as the list length. Let's take a closer look at what benefits each command has.

In terms of size, TAIL should always win. As much as I wish TAIL wasn't a romptr, 5.5 bytes is better than the 7.5 bytes minimum for SUB, because for SUB you need two real numbers as parameters. SUB itself takes 2.5 bytes, the 2. you want as start of the range takes another 2.5 bytes (any single-digit whole number is compiled to a pointer into ROM, so you just need the size of a pointer instead of the size of the actual object), and the end of the range has to be at least as large as the length of the longest list your code needs to handle. If that maximum length is 9 or less, that's another 2.5 bytes (for 7.5 bytes total), other solutions are at least as long because you need at least one command to yank an appropriate number to the correct position on the stack (level 1 on entering SUB).

In terms of speed, the matter is more difficult. TAIL has the overhead of a romptr, but TAIL can also take a while because it has to convert the real numbers (BCD-encoded floating point) to system binary integers (hexadecimal). My tests indicate that if you just plug the real number 1048575. in there (this is the highest value a system binary integer can hold; TAIL uses a system binary integer with that value as a parameter to SUBCOMP), SUB is actually going to lose because the conversion of that number (and the 2. accompanying it) takes a bit longer than the romptr overhead. The less digits the number has, the faster the conversion, so if you know the list is 9 elements or shorter and the two additional bytes compared to TAIL don't bother you, 2. 9. SUB wins. For bigger numbers SUB's advantage diminishes bit by bit, and you'll have to accept either the heavy hit of a real number object (2. 99. SUB is 2.5+10.5+2.5=15.5 bytes long!) or put in a short calculation for an upper bound of the length. Though be alert: OVER SIZE may make the entire TAIL-replacing snippet just 10 bytes long, but SIZE walks through the entire list, which makes this snippet not only longer but at the same time slower than TAIL, especially for huge lists. The best I can think of would be taking the next power of 10 that's larger than the longest expected list length (e.g. 1000), and combining the exponent (which will be a single-digit number, in this example 3) with the command ALOG. That also makes for a 10-byte snippet, but it might still be slightly faster than TAIL.

Oh, and when you want to chop off more than one element at the begin of a list, don't chain multiple TAIL calls. That takes obviously about n times as long as SUB (where n is the number of TAIL calls), and SUB can do it at basically no time penalty by increasing the 2 in our snippet by n.

The bottom line is that if I'd use UserRPL more, I'd indeed prefer TAIL due to its acceptable tradeoff between size and performance. If the list is known to be 9 elements or shorter and performance is a major concern, the SUB alternative might be worth exploring - but if speed is so important, SysRPL is definitely the way to go. If I need to chop off more than one element, SUB is the right tool for the job.

---

A library could win against TAIL if you copy the TAIL code and store the library in port 0, because that port gets favorable treatment. It's part of the RAM area that's permanently mapped into the Saturn processor's address space, so the code will not suddenly be gone when something changes the mapping in the section of the address space where other ports or ROM sections are mapped. The library command loading code is smart enough to know that, so instead of copying the code into TEMPOB (also part of the permanently mapped area; the only part of the mapped area not listed so far that's able to store objects is USEROB a.k.a. HOME and its subdirectories) it just runs it in place. I still don't think it's worth adding it to ListExt, even though one should probably install it into port 0 to get the most out of DavidM's speed optimization.

DavidM · (This post was last modified: 10-20-2018 06:06 PM by DavidM.)

(10-10-2018 05:41 PM)3298 Wrote: A library could win against TAIL if you copy the TAIL code and store the library in port 0, because that port gets favorable treatment.
...

All of this discussion about HEAD and TAIL prompted me to try something I've wanted to do for a while now. Note: this isn't a realistic approach to a TAIL replacement, but was done more out of satisfying a curiosity than anything else. I could perhaps see doing something like this in a very specific application, but not for general use (reasons will be shown below):

Code:

!NO CODE

!RPL

::

   CK1NOLASTWD

   CK&DISPATCH1

   list ::

      ROMPTR2 ~INTEMPOB? ITE

      ( list is in TEMPOB )

      CODE

         % save RPL registers

         SAVE

         % D0 -> first list element OR SEMI if empty list

         A=DAT1 A

         R1=A A      % R1.A -> SL1 object (initially the source list)

         AD0EX

         D0+5

         % exit with no change if list is empty

         A=DAT0 A

         LC(5)SEMI

         ?A=C A ->{ LOADRPL }

         % C.A = size of first list element

         GOSBVL =SKIPOB

         CD0EX

         A=R1 A

         A+5 A

         C-A A

         % move first element back 5 nibbles

         D0=A

         A-5 A

         D1=A           

         GOSBVL =MOVEDOWN

         % write the new list header

         LC(5)DOLIST

         DAT1=C A

         % R2.A -> TAIL list

         AD1EX

         R2=A A

         % load original RPL registers

         LOAD

         % update SL1 pointer to point to new list header

         A=R2 A

         DAT1=A A

         % return to RPL

         RPL

      ENDCODE

      ::

         ( list is NOT in TEMPOB - use standard method )

         DUPNULLCOMP? ?SEMI

         CDRCOMP

      ;

   ;

;

@

The above code can be summarized as follows:

Code:

If list argument is in TEMPOB:

   If list is empty:

      exit with no changes

   else (list has at least one element):

      swap the first element and the original list prologue

      update the SL1 pointer to reference the new location of the list prologue

else (list is NOT in TEMPOB):

   perform normal SysRPL TAIL function

This only provides a benefit if the list argument is in TEMPOB, otherwise it simply does something similar to the regular TAIL command. Objects reside in TEMPOB when they have been created/edited solely as stack entities and haven't been saved into a global variable yet. Further, this doesn't provide any benefit vs. TAIL unless the element count in the source list is big enough -- I don't have a real 50g to test this on at present so I'm not sure where the breakeven point is. An emulated 50g on my laptop shows that it reaches parity with TAIL when the element count reaches about 10 elements, but I would think the count is actually lower than that on real hardware. The execution time is more dependent on the size of the first list element than on the actual count of list elements, which is where its true advantage comes into play. TAIL slows down as the list element count increases, but the above program should execute at near constant speed regardless of the list size.

So why is this not practical? Because it commits a cardinal sin: it alters the list object in place. This is a big no-no for general RPL commands, and breaks some expected functionality such as LASTARG/UNDO, saved stacks, and error recovery constructs. Perhaps one of the most likely problems that could result from this type of operation is that any other references to the original list (usually in the form of stack duplicates) will now simply have the value of the first element instead of the original list. This is clearly not what most would expect to happen, and as such I wouldn't recommend trying something like this except in a very controlled situation.

This was an interesting exercise, though. Unfortunately the TEMPOB check is another ROMPTR, so that negates some of the potential benefit that the program would have otherwise.

Edit:
I tried replacing the ROMPTR ~INTEMPOB? with another embedded code segment, and found that the above program with that change was indeed faster than TAIL for all tests (at least on the emulator -- still don't have access to my 50g).

Also, the SysRPL handler for non-TEMPOB lists doesn't need the null comp test, since CDRCOMP gives that result anyway. So the "DUPNULLCOMP? ?SEMI" line can be removed entirely, which in turn negates the need to encapsulate CDRCOMP in a secondary. Double the savings! Smile