New Saturn asm "add loop" benchmark for the HP48G
10-26-2023, 05:24 PM (This post was last modified: 11-04-2023 01:17 PM by Jonathan Busby.)
Post: #1
--------------------------------------------------------------------------------
WARNING : The first attachment, addloop-0.01b.zip, has a bugged ISR
in addloop and corrupts system time. These problems have been
corrected in version 0.02a . If you run 0.01b, it won't crash your calc ( at
least I don't think ), but it *can* generate bogus results.
--------------------------------------------------------------------------------

Please reference this thread for a list of other implementations.

I just recently wrote a new implementation of pier4r's "add loop" benchmark for the HP48G series.

The code overrides the interrupt system, sets TIMER2 for a 60-second countdown, and then executes :

( The following further optimized add loop was provided by Werner )
Code:
        P=      5
        C=0     W
l1      C=C+1   A
        GONC    l1
        C=C+1   P
        GONC    l1

The count I get is :

4383185

That is not as much of a speedup as I had expected before coding this benchmark.
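For readers who don't speak Saturn, here is a minimal Python sketch (not part of the attached code) of what the loop above does to the C register. It assumes hex mode, treats the A field as nibbles 0-4 (20 bits) and the P=5 nibble as the next 4 bits, and models "carry" as a field wrapping to zero. The function name `add_loop` is made up for illustration.

```python
A_FIELD_BITS = 20                       # A field = nibbles 0-4 of C (assumed hex mode)
A_FIELD_MASK = (1 << A_FIELD_BITS) - 1

def add_loop(iterations):
    """Model `C=C+1 A / GONC l1 / C=C+1 P / GONC l1` with P=5."""
    a_field = 0   # nibbles 0-4
    nibble5 = 0   # the single nibble selected by P=5
    for _ in range(iterations):
        a_field = (a_field + 1) & A_FIELD_MASK   # C=C+1 A
        if a_field != 0:                          # no carry: GONC l1 is taken
            continue
        nibble5 = (nibble5 + 1) & 0xF             # C=C+1 P
        if nibble5 == 0:                          # carry out of nibble 5: loop exits
            break
    return (nibble5 << A_FIELD_BITS) | a_field

# One A-field overflow plus five more increments.
print(hex(add_loop((1 << 20) + 5)))   # 0x100005
```

The point of the sketch is that the slow path (`C=C+1 P`) runs only once per 2^20 increments, so nibbles 0-5 together behave like a plain 24-bit counter.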

Note that the code, just due to my laziness, *cannot* be run on an HP48GX with a merged RAM card -- this limitation can be fixed with a =MOVEDOWN to ADISP. Also, and this is very important :

**Due to the nature of how the code reconfigures IRAM, this code is only known to work on HP48G/GX/G+ version R ROM models**

Have fun!

Jonathan

EDIT #1 : Corrected funky grammar-typo...

EDIT #2 : I just noticed that, because I omitted a "P= 7" instruction, my code corrupts the system time. I'll upload a fixed version tomorrow...

EDIT #3 : Version 0.02a hopefully corrects the aforementioned "P= 7" mistake and the bug(s) in the ISR... Credit goes to Werner for the improved inner add loop...

EDIT #4 : Added latest version as attachment with none, 4x and 8x inner add loop unrolling...


Attached File(s)
.zip  addloop-0.01b.zip (Size: 3.67 KB / Downloads: 8)
.zip  addloop-0.02a.zip (Size: 3.52 KB / Downloads: 6)
.zip  addloop-0.03a.zip (Size: 5.04 KB / Downloads: 5)

Aeternitas modo est. Longa non est, paene nil.
10-26-2023, 06:11 PM
Post: #2
RE: New Saturn asm "add loop" benchmark for the HP48G
I realize this isn't the Prime forum, but I was curious what a Prime G2 could do with just a straight PPL program:

EXPORT ADDD()
BEGIN
A:=0;
WHILE 1 DO
A:=A+1;
END;
END;

Result was 4,979,849
10-27-2023, 12:49 AM
Post: #3
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-26-2023 06:11 PM)Xorand Wrote:  I realize this isn't the Prime forum, but I was curious what a Prime G2 could do with just a straight PPL program:

EXPORT ADDD()
BEGIN
A:=0;
WHILE 1 DO
A:=A+1;
END;
END;

Result was 4,979,849

Sorry, close but no cigar. The Prime has ADHD, not ADDD.

--Bob Prosperi
10-27-2023, 12:56 AM
Post: #4
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-26-2023 06:11 PM)Xorand Wrote:  [...] I was curious what a Prime G2 could do with just a straight PPL program:[...] Result was 4,979,849

Speaking of curiosity, running the equivalent program written in plain old RPN on an old Free42 version running on a 12-year-old iPad 2 gives 5,235,602.

V.

  
All My Articles & other Materials here:  Valentin Albillo's HP Collection
 
10-27-2023, 06:51 AM (This post was last modified: 10-27-2023 07:06 AM by Gjermund Skailand.)
Post: #5
RE: New Saturn asm "add loop" benchmark for the HP48G
Update
DM42 -USB connected screen turned off

01 LBL "xxx"
02 "RefLCD"
03 ASTO ST L
04 RCL IND ST L
05 STO 00
06 0 STO IND ST L
07 1E7
08 ENTER
09 LBL 00
10 DSE ST X
11 GTO 00
12 ASTO ST L
13 RCL 00
14 STO IND ST L
15 END

Manual timing and stop
screen off: 373166 ( screen on: 357200 )
PS: to reset the screen refresh, manually go to line 12 and continue; see the manual for other screen-update options
br Gjermund
10-27-2023, 07:26 AM
Post: #6
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-26-2023 05:24 PM)Jonathan Busby Wrote:  
Code:
        P=      0
        C=0     W
l1      C=C+1   WP
        GONC    l1
        P=P+1
        C=C+1   P
        GONC    l1

Wouldn't using the A-field be faster?

Code:
  P=5
  C=0 W
- C=C+1 A
  GONC -
  C=C+1 P
  GONC -
  P=P+1
  GONC -
Cheers, Werner

41CV†,42S,48GX,49G,DM42,DM41X,17BII,15CE,DM15L,12C,16CE
10-27-2023, 08:19 AM
Post: #7
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-27-2023 07:26 AM)Werner Wrote:  
(10-26-2023 05:24 PM)Jonathan Busby Wrote:  
Code:
        P=      0
        C=0     W
l1      C=C+1   WP
        GONC    l1
        P=P+1
        C=C+1   P
        GONC    l1

Wouldn't using the A-field be faster?

Code:
  P=5
  C=0 W
- C=C+1 A
  GONC -
  C=C+1 P
  GONC -
  P=P+1
  GONC -
Cheers, Werner

I'm not sure this code will work correctly, after you incremented P to 7.
My solution would be something like this:

Code:
        C=0     W
l1      C=C+1   B
        GONC    l1
        C=C+1   XS
        GONC    l1
        C=C+1   M
        GONC    l1

that spends most time incrementing the B-field. Not tested though.
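A small Python sketch of the same idea, for illustration only: it assumes the usual Saturn field layout (B = nibbles 0-1, XS = nibble 2, M = nibbles 3-14), hex mode, and that "carry" means the field wrapped to zero. Under those assumptions the three fields chain into one contiguous counter. `FIELDS` and `count_with_fields` are hypothetical names.

```python
# (field name, width in nibbles), lowest field first. Assumed layout:
# B = nibbles 0-1, XS = nibble 2, M = nibbles 3-14.
FIELDS = [("B", 2), ("XS", 1), ("M", 12)]

def count_with_fields(iterations):
    """Model C=C+1 B / GONC / C=C+1 XS / GONC / C=C+1 M / GONC."""
    values = [0] * len(FIELDS)
    for _ in range(iterations):
        for i, (_, nibbles) in enumerate(FIELDS):
            values[i] = (values[i] + 1) % (16 ** nibbles)
            if values[i] != 0:      # no carry: GONC l1 is taken
                break
        else:
            break                   # carry out of M: the loop falls through
    # Reassemble the fields into one integer, lowest field first.
    total, shift = 0, 0
    for value, (_, nibbles) in zip(values, FIELDS):
        total |= value << shift
        shift += 4 * nibbles
    return total

print(count_with_fields(4103))
```

Here the fast path touches only the 2-nibble B field, and the XS and M increments run once every 256 and 4096 counts respectively.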

J-F
10-27-2023, 08:35 AM
Post: #8
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-27-2023 08:19 AM)J-F Garnier Wrote:  I'm not sure this code will work correctly, after you incremented P to 7.

You are right ;-) but we'll normally not get there: 2^24 ≈ 16.8*10^6, and the count was on the order of 4*10^6. But we can't know up front, so:

Code:
 A=0 W
 C=0 W
 P=5
 C=C+1 P
- A=A+1 A
 GONC -
 A=A+C W
 GONC -

Should be marginally faster than yours and Jonathan's.
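To see why this version cannot run out of digits, here is a hedged Python model of it. It assumes hex mode, a 64-bit W field, and a 20-bit A field; C is set up once to hold a 1 in nibble 5 (that is, 2^20), so the rare slow path adds 2^20 across the whole word and the carry propagates as far as it needs to. The name `wide_add_loop` is invented for this sketch.

```python
WORD_MASK = (1 << 64) - 1    # W field: all 16 nibbles
A_MASK = (1 << 20) - 1       # A field: nibbles 0-4

def wide_add_loop(iterations):
    """Model A=A+1 A / GONC - / A=A+C W / GONC - with C = 1 in nibble 5."""
    a = 0
    c = 1 << 20                              # A=0 W, C=0 W, P=5, C=C+1 P
    for _ in range(iterations):
        low = (a & A_MASK) + 1               # A=A+1 A
        a = (a & ~A_MASK) | (low & A_MASK)
        if low <= A_MASK:                    # no carry: first GONC is taken
            continue
        total = a + c                        # A=A+C W
        a = total & WORD_MASK
        if total > WORD_MASK:                # carry out of the full word: loop exits
            break
    return a

print(wide_add_loop((1 << 20) + 3) == (1 << 20) + 3)
```

Compared with the P-based version, no pointer needs to move, so there is no state that can go stale after the first overflow of nibble 5.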

Cheers, Werner

41CV†,42S,48GX,49G,DM42,DM41X,17BII,15CE,DM15L,12C,16CE
10-31-2023, 01:31 AM
Post: #9
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-27-2023 07:26 AM)Werner Wrote:  
(10-26-2023 05:24 PM)Jonathan Busby Wrote:  
Code:
        P=      0
        C=0     W
l1      C=C+1   WP
        GONC    l1
        P=P+1
        C=C+1   P
        GONC    l1

Wouldn't using the A-field be faster?

Code:
  P=5
  C=0 W
- C=C+1 A
  GONC -
  C=C+1 P
  GONC -
  P=P+1
  GONC -

D'oh! Very clever. Once again I am blinded by my monomania, this time directed towards the custom interrupt routine -- I hadn't given the inner add loop a second thought in terms of performance. In a perfect world, the "C=C+1 WP" instruction would only take two cycles to decode -- and that's what I overlooked. Already at P = 4, the cycle time of "C=C+1 WP" equals the 7 cycles ( ignoring memory access considerations ) of the "C=C+1 A" instruction. If you really want to go nuts, you could unroll the increment code for the B field in a variation on J-F Garnier's code.

Your original A field code brings the count up to around 4382026. So, a few hundred thousand more, but I was hoping for something more dramatic considering the code for an unoptimized HP-71B assembly language add loop, which runs at 640 kHz, reaches over 1E6 in 60 seconds.

Anyways, it seems my ISR code is bugged ( the IRAM reconfiguration code is flawless ) in some strange way which I am still trying to work out ( it doesn't help that I have chronic sleep deprivation ( really ) and I got up at around 4 AM today ). So, I left out checking for C.A overflow when TIMER2 generates an interrupt. This means it's *possible* that an interrupt could corrupt the count, but it's unlikely, and you'd know it if it happened -- I plan on having this fixed soon.

Quote:Cheers, Werner

Regards,

Jonathan

Aeternitas modo est. Longa non est, paene nil.
11-01-2023, 03:49 PM (This post was last modified: 11-05-2023 02:04 AM by Jonathan Busby.)
Post: #10
RE: New Saturn asm "add loop" benchmark for the HP48G
I'm a little bit worried that my code is flawed.

( EDIT #1 : But not for the reasons originally posted here )

Here's my logic : The inner add loop is

Code:
-       C=C+1   A
        GONC    -

We can ignore the "C=C+1 P" as it only executes four times.

Now, according to this , the inner loop should take 2̶4̶.̶5̶ 49 cycles per iteration. If that's true, then we have (3900000*60)/2̶4̶.̶5̶ 49 ~ 9̶5̶5̶1̶0̶2̶0̶ ̶ 4775510. T̶h̶i̶s̶ ̶i̶s̶ ̶a̶ ̶l̶i̶t̶t̶l̶e̶ ̶o̶v̶e̶r̶ ̶t̶w̶i̶c̶e̶ ̶t̶h̶e̶ ̶f̶i̶n̶a̶l̶ ̶c̶o̶u̶n̶t̶ ̶e̶v̶e̶n̶ ̶w̶i̶t̶h̶ ̶W̶e̶r̶n̶e̶r̶'̶s̶ ̶o̶p̶t̶i̶m̶i̶z̶e̶d̶ ̶i̶n̶n̶e̶r̶ ̶a̶d̶d̶ ̶l̶o̶o̶p̶.̶ ̶I̶ ̶k̶n̶o̶w̶ ̶t̶h̶a̶t̶ ̶l̶e̶a̶v̶i̶n̶g̶ ̶T̶I̶M̶E̶R̶2̶ ̶r̶u̶n̶n̶i̶n̶g̶ ̶s̶u̶c̶k̶s̶ ̶a̶ ̶l̶i̶t̶t̶l̶e̶ ̶b̶i̶t̶ ̶o̶f̶ ̶C̶P̶U̶ ̶t̶i̶m̶e̶ ̶d̶u̶e̶ ̶t̶o̶ ̶k̶e̶y̶b̶o̶a̶r̶d̶ ̶s̶c̶a̶n̶n̶i̶n̶g̶,̶ ̶b̶u̶t̶ ̶n̶o̶t̶ ̶*̶T̶H̶A̶T̶*̶ ̶m̶u̶c̶h̶.̶

S̶o̶,̶ ̶s̶h̶o̶u̶l̶d̶ ̶I̶ ̶b̶e̶ ̶w̶o̶r̶r̶i̶e̶d̶?̶ ̶D̶o̶e̶s̶ ̶a̶n̶y̶o̶n̶e̶ ̶h̶a̶v̶e̶ ̶a̶n̶ ̶e̶x̶p̶l̶a̶n̶a̶t̶i̶o̶n̶?̶

I've confirmed that my code runs the add loop for *exactly* 60 seconds -- even used a stopwatch. If there's a problem, I don't know what it is.
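The corrected estimate can be checked in a few lines of Python. This is only a sketch: the 3.9 MHz clock and the 49-cycle figure are the assumptions used in this post, and the measured count is the 4383185 from post #1. The "implied" figure is simply the cycle budget divided by that measured count.

```python
CLOCK_HZ = 3_900_000          # assumed CPU clock, as used in the post
RUN_SECONDS = 60              # TIMER2 countdown
CYCLES_PER_ITERATION = 49     # C=C+1 A plus a taken GONC, per the cycle table
MEASURED_COUNT = 4_383_185    # count reported in post #1

total_cycles = CLOCK_HZ * RUN_SECONDS
expected_count = total_cycles / CYCLES_PER_ITERATION
implied_cycles = total_cycles / MEASURED_COUNT

print(f"expected count : {expected_count:.0f}")
print(f"implied cycles per iteration: {implied_cycles:.2f}")
print(f"measured / expected: {MEASURED_COUNT / expected_count:.3f}")
```

Under these assumptions the measured count comes out within roughly ten percent of the estimate, which corresponds to a few extra cycles per iteration rather than a factor of two.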

Regards,

Jonathan

EDIT #1 : I mistakenly divided the cycle count from Cycles du Saturn by two, therefore rendering my conclusion incorrect -- it's been corrected.

Aeternitas modo est. Longa non est, paene nil.
11-01-2023, 06:48 PM
Post: #11
RE: New Saturn asm "add loop" benchmark for the HP48G
Possibly memory controller behavior. The memory controller is not well described in any public documentation, but it matches the 4-bit Saturn bus to 8-bit memory, and depending on the memory access pattern can introduce delays. Perhaps a tight loop is a case where that happens.

You might try unrolling the loop a bit, perhaps by putting two or four increments consecutively, and see whether the overhead is lower.
11-01-2023, 07:30 PM (This post was last modified: 11-01-2023 08:09 PM by Jonathan Busby.)
Post: #12
RE: New Saturn asm "add loop" benchmark for the HP48G
(11-01-2023 06:48 PM)brouhaha Wrote:  Possibly memory controller behavior.

Well, the cycle reference I used was supposed to take into account the SRAM instruction and data fetch and the extra cycles incurred by the Saturn bus.

Quote:The memory controller is not well described in any public documentation,

I thought the Saturn bus was pretty well described in the various HP-71B internal design specs, including the daisy-chained memory controllers. But, if you're talking about the internals of the Yorke memory controller(s) and e.g. why there are quarter cycles, then there's very little information.

Quote:but it matches the 4-bit Saturn bus to 8-bit memory,

Well, the minimum read cycle time for the Yorke is 1000 ns. But, since bytes are fetched, the effective frequency is 2 MHz, which is the speed of the memory controllers and the Saturn bus.

Quote:and depending on the memory access pattern can introduce delays.

Well, the access pattern on the Saturn bus for the inner add loop is :

  1. NCD -> low
  2. PC READ command driven
  3. NSTR "dummy strobe"
  4. Read data
  5. NCD -> low (1)
  6. PC READ command driven (1)
  7. NSTR "dummy" strobe (1)
  8. Read data
  9. NCD -> low
  10. LOAD PC command driven
  11. New PC address driven
  12. Command auto-switch to PC READ
  13. Read data


(1) I'm not sure if these bus operations take place -- if the Yorke Saturn is designed well then they shouldn't, which means my final cycle count could be off by six...

So, in addition to the execution time, the Saturn bus introduces five ~2MHz cycles of overhead, which equates to ten ~4MHz cycles.
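The unit conversions in this post can be spelled out as follows. This is a rough sketch that takes the numbers above at face value: a 1000 ns byte read, two nibbles per byte, a nominal 4 MHz CPU clock, and five bus cycles of overhead. All constant names are illustrative.

```python
BYTE_READ_NS = 1000          # minimum read cycle quoted for the Yorke
NIBBLES_PER_BYTE = 2         # 8-bit memory feeding a 4-bit bus
CPU_HZ = 4_000_000           # nominal "~4MHz" CPU clock (assumption)
BUS_OVERHEAD_CYCLES = 5      # bus cycles of overhead counted in the post

# Effective nibble rate: two nibbles arrive per 1000 ns byte read.
bus_hz = NIBBLES_PER_BYTE * 1_000_000_000 / BYTE_READ_NS

# Each bus cycle spans CPU_HZ / bus_hz CPU cycles.
overhead_cpu_cycles = BUS_OVERHEAD_CYCLES * CPU_HZ / bus_hz

print(bus_hz, overhead_cpu_cycles)
```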

Quote:Perhaps a tight loop is a case where that happens.

Thanks for the tip. I need to investigate this...

Quote:You might try unrolling the loop a bit, perhaps by putting two or four increments consecutively, and see whether the overhead is lower.

Good idea. I think I'll do a 256 instruction unroll of the B field increment operation with a modified version of J-F Garnier's code.

Regards,

Jonathan

Aeternitas modo est. Longa non est, paene nil.
11-02-2023, 05:45 PM (This post was last modified: 11-02-2023 05:47 PM by Jonathan Busby.)
Post: #13
RE: New Saturn asm "add loop" benchmark for the HP48G
Well, I changed the inner add loop to :

Code:
l1      C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GONC    l1
l2      C=C+1   P
        GONC    l1

and I got a *HUGE* speedup. Now the count is :

6429914

Something weird is going on...

Regards,

Jonathan


Attached File(s)
.zip  addloop.zip (Size: 410 bytes / Downloads: 5)

Aeternitas modo est. Longa non est, paene nil.
11-02-2023, 06:14 PM
Post: #14
RE: New Saturn asm "add loop" benchmark for the HP48G
I think we're hitting the law of diminishing returns.

With the following inner add loop :

Code:
l1      C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GONC    l1
l2      C=C+1   P
        GONC    l1

The count is now 6784080 .

Regards,

Jonathan


Attached File(s)
.zip  addloop-2x.zip (Size: 423 bytes / Downloads: 3)

Aeternitas modo est. Longa non est, paene nil.
11-03-2023, 04:28 PM (This post was last modified: 11-04-2023 01:45 AM by Jonathan Busby.)
Post: #15
RE: New Saturn asm "add loop" benchmark for the HP48G
Well, possibly due to some obscure, arcane, subtle hardware peculiarity or bug, when I run the non-unrolled add loop code, I sometimes get a count of :

8765045

As far as I can tell, there are no obvious bugs in my code and the add loop always runs for exactly 60 seconds. It is then interrupted, and control returns to the OS once an HXS of the count has been pushed to the stack.

Any idea what's going on?

Regards,

Jonathan

EDIT #1 : Corrected horrible style problems...

Aeternitas modo est. Longa non est, paene nil.
11-04-2023, 12:36 AM
Post: #16
RE: New Saturn asm "add loop" benchmark for the HP48G
Is the timing different when the loop starts at even or odd addresses?
11-04-2023, 02:46 AM
Post: #17
RE: New Saturn asm "add loop" benchmark for the HP48G
(11-04-2023 12:36 AM)brouhaha Wrote:  Is the timing different when the loop starts at even or odd addresses?

Indeed, many Saturn commands executed in an external 8-bit memory device have a different timing when they are executed on an even or odd address.
11-04-2023, 02:20 PM (This post was last modified: 11-05-2023 02:15 AM by Jonathan Busby.)
Post: #18
RE: New Saturn asm "add loop" benchmark for the HP48G
For all practical purposes, only the very inner portion of the add loop need be investigated since the outer portion of the loop is only executed about four times :

Code:
l1      C=C+1   A
        GONC    l1

The "GONC" branch instruction takes up the bulk of the CPU time. When the above code starts on an odd address, a total of around 5̶ ( wrong ) 10 additional cycles are required. If X is the total cycle time of the above inner loop, then we have :

\(\displaystyle \dfrac{(X + 10)}{X} \approx \dfrac{8765045}{4383185} \)

( EDIT #1 : "10" was "5" above )

then

\(\displaystyle X \approx \dfrac{-10}{1 - \left (\dfrac{8765045}{4383185} \right )} \)

( EDIT #1 : "-10" was "-5" above )

\(\displaystyle \approx 10 \)

( EDIT #1 : "10" was "5" above )

The above is nonsensical : to get the speedup indicated by the aberrant count, the whole inner loop would have to take just 5̶ 10 cycles, which is impossible. I think, therefore, we can conclude that the speedup, if it isn't a bug in my code, is not due to memory address parity.
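The same calculation in Python, as a sanity check. It assumes, as in the derivation above, that the slow run costs X + 10 cycles per iteration and the fast run costs X, and that the two counts are 4383185 and 8765045. The helper name is hypothetical.

```python
def solve_inner_loop_cycles(extra_cycles, slow_count, fast_count):
    """Solve (X + extra) / X = fast_count / slow_count for X."""
    ratio = fast_count / slow_count
    return extra_cycles / (ratio - 1)

x = solve_inner_loop_cycles(10, 4_383_185, 8_765_045)
print(f"X = {x:.3f} cycles")

# For comparison, the cycle table puts the inner loop at 49 cycles.
print(f"table value / X = {49 / x:.1f}")
```

Because the aberrant count is almost exactly double the normal one, X comes out almost exactly equal to the assumed 10-cycle penalty, far below the 49 cycles the loop is supposed to need.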

Regards,

Jonathan

EDIT #1 : My calculation is in error ( although it doesn't change the conclusion ) since I mistakenly divided the values from Cycles du Saturn by two -- it's been corrected.

Aeternitas modo est. Longa non est, paene nil.
11-04-2023, 03:37 PM
Post: #19
RE: New Saturn asm "add loop" benchmark for the HP48G
(11-02-2023 05:45 PM)Jonathan Busby Wrote:  Well, I changed the inner add loop to :

Code:
l1      C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GOC     l2
        C=C+1   A
        GONC    l1
l2      C=C+1   P
        GONC    l1

and I got a *HUGE* speedup. Now the count is :

6429914

Something weird is going on...

Regards,

Jonathan

This speedup, at least, is to be expected? GOC and GONC have different execution times when they jump or not. Taking from the cycles.pdf in your other post, it's either 11 or 30 cycles (I wonder why you divide those by 2 btw?), so the original loop times 5 was 5*49=245 cycles, and the unrolled one 4*30+49 = 169. The ratio of 169/245 matches 4383185/6429914, well, almost.
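The arithmetic above can be generalized to any unroll depth with a short Python sketch. It assumes the two figures implied by the sums in this post: 49 cycles for an increment followed by a taken branch, and 30 cycles for an increment followed by a not-taken branch. It also ignores the rare `C=C+1 P` path. The measured counts are the ones reported earlier in the thread; the function name is illustrative.

```python
TAKEN = 49       # C=C+1 A followed by a taken GONC (assumed from the post)
NOT_TAKEN = 30   # C=C+1 A followed by a not-taken GOC (assumed from the post)

def cycles_per_increment(increments_per_pass):
    """Average cost of one increment when only the last branch in a pass is taken."""
    return ((increments_per_pass - 1) * NOT_TAKEN + TAKEN) / increments_per_pass

BASELINE_COUNT = 4_383_185
measured = {5: 6_429_914, 9: 6_784_080}   # increments per pass -> reported count

for n, count in measured.items():
    predicted_ratio = cycles_per_increment(n) / cycles_per_increment(1)
    measured_ratio = BASELINE_COUNT / count
    print(f"{n} increments/pass: predicted {predicted_ratio:.3f}, measured {measured_ratio:.3f}")
```

With these assumptions the average cost per increment can never drop below 30 cycles, which is consistent with the diminishing returns seen when going from five to nine increments per pass.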

Cheers, Werner

41CV†,42S,48GX,49G,DM42,DM41X,17BII,15CE,DM15L,12C,16CE
11-04-2023, 05:23 PM
Post: #20
RE: New Saturn asm "add loop" benchmark for the HP48G
(11-04-2023 03:37 PM)Werner Wrote:  This speedup, at least, is to be expected? GOC and GONC have different execution times when they jump or not. Taking from the cycles.pdf in your other post, it's either 11 or 30 cycles (I wonder why you divide those by 2 btw?), so the original loop times 5 was 5*49=245 cycles, and the unrolled one 4*30+49 = 169. The ratio of 169/245 matches 4383185/6429914, well, almost.

Cheers, Werner

Oh my. Have I made a stupid arithmetic error? I thought I was supposed to divide the counts by two to get the cycle count. I don't read or speak French and I was confused by the document.

If I don't divide by two, then the cycle count for the inner add loop is 49. This gives :

\(\displaystyle \dfrac {\left ( 3900000 \cdot 60 \right )}{49} \approx 4775510 \)

Sorry, people.

This still doesn't explain the errant counts I sometimes get...

Regards,

Jonathan

Aeternitas modo est. Longa non est, paene nil.