New Saturn asm "add loop" benchmark for the HP48G
10-26-2023, 05:24 PM
(This post was last modified: 11-04-2023 01:17 PM by Jonathan Busby.)
Post: #1
New Saturn asm "add loop" benchmark for the HP48G
--------------------------------------------------------------------------------
WARNING : The first attachment, addloop-0.01b.zip, has a buggy ISR in addloop and corrupts the system time. These problems have been corrected in version 0.02a. If you run 0.01b, it won't crash your calc ( at least I don't think so ), but it *can* generate bogus results.
--------------------------------------------------------------------------------
Please reference this thread for a list of other implementations.
I recently wrote a new implementation of pier4r's "add loop" benchmark for the HP48G series. The code overrides the interrupt system, sets TIMER2 for a 60-second countdown, and then executes the following ( this further optimized add loop was provided by Werner ) :
Code: P= 5
The count I get is : 4383185
That's not as much of a speedup as I had expected before coding this benchmark.
Note that the code, purely due to my laziness, *cannot* be run on an HP48GX with a merged RAM card -- this limitation can be fixed with a =MOVEDOWN to ADISP. Also, and this is very important :
**Due to the nature of how the code reconfigures IRAM, this code is only known to work on HP48G/GX/G+ version R ROM models**
Have fun!
Jonathan
EDIT #1 : Corrected funky grammar-typo...
EDIT #2 : I just noticed that, due to the omission of a "P= 7" instruction, my code corrupts the system time. I'll upload a fixed version tomorrow...
EDIT #3 : Version 0.02a hopefully corrects the aforementioned "P= 7" mistake and the bug(s) in the ISR... Credit goes to Werner for the improved inner add loop...
EDIT #4 : Added latest version as attachment with none, 4x and 8x inner add loop unrolling...
Aeternitas modo est. Longa non est, paene nil.
|||
10-26-2023, 06:11 PM
Post: #2
RE: New Saturn asm "add loop" benchmark for the HP48G
I realize this isn't the Prime forum, but I was curious what a Prime G2 could do with just a straight PPL program:
EXPORT ADDD()
BEGIN
  A:=0;
  WHILE 1 DO
    A:=A+1;
  END;
END;
Result was 4,979,849
|||
10-27-2023, 12:49 AM
Post: #3
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-26-2023 06:11 PM)Xorand Wrote: I realize this isn't the Prime forum, but I was curious what a Prime G2 could do with just a straight PPL program:
Sorry, close but no cigar. The Prime has ADHD, not ADDD.
--Bob Prosperi
|||
10-27-2023, 12:56 AM
Post: #4
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-26-2023 06:11 PM)Xorand Wrote: [...] I was curious what a Prime G2 could do with just a straight PPL program:[...] Result was 4,979,849
Speaking of curiosity, running the equivalent program written in plain old RPN on an old Free42 version running on a 12-year-old iPad 2 gives 5,235,602.
V.
All My Articles & other Materials here: Valentin Albillo's HP Collection
|||
10-27-2023, 06:51 AM
(This post was last modified: 10-27-2023 07:06 AM by Gjermund Skailand.)
Post: #5
RE: New Saturn asm "add loop" benchmark for the HP48G
Update
DM42, USB connected, screen turned off
01 LBL "xxx"
02 "RefLCD"
03 ASTO ST L
04 RCL IND ST L
05 STO 00
06 0 STO IND ST L
07 1E7
08 ENTER
09 LBL 00
10 DSE ST X
11 GTO 00
12 ASTO ST L
13 RCL 00
14 STO IND ST L
15 END
Manual timing and stop. Screen off: 373166 ( screen on: 357200 )
PS: to reset the screen refresh, manually go to line 12 and continue; see the manual for other screen update options.
br Gjermund
|||
10-27-2023, 07:26 AM
Post: #6
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-26-2023 05:24 PM)Jonathan Busby Wrote:
Wouldn't using the A-field be faster?
Code: P=5
41CV†,42S,48GX,49G,DM42,DM41X,17BII,15CE,DM15L,12C,16CE
|||
10-27-2023, 08:19 AM
Post: #7
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-27-2023 07:26 AM)Werner Wrote:(10-26-2023 05:24 PM)Jonathan Busby Wrote:
I'm not sure this code will work correctly after you have incremented P to 7. My solution would be something like this:
Code: C=0 W
which spends most of its time incrementing the B-field. Not tested though.
J-F
|||
10-27-2023, 08:35 AM
Post: #8
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-27-2023 08:19 AM)J-F Garnier Wrote: I'm not sure this code will work correctly, after you incremented P to 7.
You are right ;-) but we'll normally not get there: 2^24 is about 16.8*10^6, and the count was on the order of 4*10^6. But we can't know up front, so:
Code: A=0 W
Should be marginally faster than yours and Jonathan's.
Cheers, Werner
41CV†,42S,48GX,49G,DM42,DM41X,17BII,15CE,DM15L,12C,16CE
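The overflow argument above can be checked with a few lines of arithmetic. This is only a sketch in Python, assuming hex-mode arithmetic, a 5-nibble A field and a 6-nibble WP field for P=5 (nibbles 0 to 5); the function name `field_capacity` is made up for illustration.

```python
# Capacity of a Saturn register field in hex mode (assumption: 4 bits per nibble,
# A field = 5 nibbles, WP field with P=5 = nibbles 0..5 = 6 nibbles).
NIBBLE_BITS = 4

def field_capacity(nibbles: int) -> int:
    """Number of distinct values a hex-mode field of this width can hold."""
    return 2 ** (NIBBLE_BITS * nibbles)

a_field = field_capacity(5)        # 2^20
wp_field_p5 = field_capacity(6)    # 2^24
measured = 4_383_185               # 60-second count reported in post #1

print(a_field, wp_field_p5)
print(measured < wp_field_p5)      # the 24-bit field does not wrap within the run
print(measured // a_field)         # how many times a 20-bit A field wraps
```

With these assumptions the A field wraps only a handful of times per run, so the carry-handling code hardly affects the total.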
|||
10-31-2023, 01:31 AM
Post: #9
RE: New Saturn asm "add loop" benchmark for the HP48G
(10-27-2023 07:26 AM)Werner Wrote:(10-26-2023 05:24 PM)Jonathan Busby Wrote:
D'oh! Very clever. Once again I am blinded by my monomania, this time directed towards the custom interrupt routine -- I hadn't given the inner add loop a second thought in terms of performance. In a perfect world, the "C=C+1 WP" instruction would only take two cycles to decode -- and that's what I overlooked. Already at P = 4, the cycle time of "C=C+1 WP" equals the 7 cycles ( ignoring memory access considerations ) of the "C=C+1 A" instruction. If you really want to go nuts, you could unroll the increment code for the B field in a variation on J-F Garnier's code.
Your original A field code brings the count up to around 4382026. So, a few hundred thousand more, but I was hoping for something more dramatic, considering that an unoptimized HP-71B assembly language add loop, running at 640 kHz, reaches over 1E6 in 60 seconds.
Anyway, it seems my ISR code is buggy ( the IRAM reconfiguration code is flawless ) in some strange way which I'm still trying to work out ( it doesn't help that I have chronic sleep deprivation ( really ) and I got up at around 4 AM today ). So, I left out checking for C.A overflow when TIMER2 generates an interrupt. This means it's *possible* that an interrupt could corrupt the count, but it's unlikely, and you'd know it if it happened -- I plan on having this fixed soon.
Quote: Cheers, Werner
Regards, Jonathan
Aeternitas modo est. Longa non est, paene nil.
|||
11-01-2023, 03:49 PM
(This post was last modified: 11-05-2023 02:04 AM by Jonathan Busby.)
Post: #10
RE: New Saturn asm "add loop" benchmark for the HP48G
I'm a little bit worried that my code is flawed
( EDIT #1 : But not for the reasons originally posted here ) Here's my logic : The inner add loop is Code: - C=C+1 A We can ignore the "C=C+1 P" as it only executes four times. Now, according to this , the inner loop should take 2̶4̶.̶5̶ 49 cycles per iteration. If that's true, then we have (3900000*60)/2̶4̶.̶5̶ 49 ~ 9̶5̶5̶1̶0̶2̶0̶ ̶ 4775510. T̶h̶i̶s̶ ̶i̶s̶ ̶a̶ ̶l̶i̶t̶t̶l̶e̶ ̶o̶v̶e̶r̶ ̶t̶w̶i̶c̶e̶ ̶t̶h̶e̶ ̶f̶i̶n̶a̶l̶ ̶c̶o̶u̶n̶t̶ ̶e̶v̶e̶n̶ ̶w̶i̶t̶h̶ ̶W̶e̶r̶n̶e̶r̶'̶s̶ ̶o̶p̶t̶i̶m̶i̶z̶e̶d̶ ̶i̶n̶n̶e̶r̶ ̶a̶d̶d̶ ̶l̶o̶o̶p̶.̶ ̶I̶ ̶k̶n̶o̶w̶ ̶t̶h̶a̶t̶ ̶l̶e̶a̶v̶i̶n̶g̶ ̶T̶I̶M̶E̶R̶2̶ ̶r̶u̶n̶n̶i̶n̶g̶ ̶s̶u̶c̶k̶s̶ ̶a̶ ̶l̶i̶t̶t̶l̶e̶ ̶b̶i̶t̶ ̶o̶f̶ ̶C̶P̶U̶ ̶t̶i̶m̶e̶ ̶d̶u̶e̶ ̶t̶o̶ ̶k̶e̶y̶b̶o̶a̶r̶d̶ ̶s̶c̶a̶n̶n̶i̶n̶g̶,̶ ̶b̶u̶t̶ ̶n̶o̶t̶ ̶*̶T̶H̶A̶T̶*̶ ̶m̶u̶c̶h̶.̶ S̶o̶,̶ ̶s̶h̶o̶u̶l̶d̶ ̶I̶ ̶b̶e̶ ̶w̶o̶r̶r̶i̶e̶d̶?̶ ̶D̶o̶e̶s̶ ̶a̶n̶y̶o̶n̶e̶ ̶h̶a̶v̶e̶ ̶a̶n̶ ̶e̶x̶p̶l̶a̶n̶a̶t̶i̶o̶n̶?̶ I've confirmed that my code runs the add loop for *exactly* 60 seconds -- even used a stopwatch. If there's a problem, I don't know what it is. Regards, Jonathan EDIT #1 : I mistakenly divided the cycle count from Cycles du Saturn by two, therefore rendering my conclusion incorrect -- it's been corrected. Aeternitas modo est. Longa non est, paene nil. |
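The estimate in this post is simple enough to reproduce. Here is a minimal Python sketch, assuming the 3.9 MHz clock and the 49-cycle inner loop quoted in the post; `expected_count` is a hypothetical helper name, and the model ignores the interrupt routine and the rare outer-loop carries.

```python
# Back-of-the-envelope iteration count for a fixed-length run.
# Assumptions taken from the post: 3,900,000 cycles per second, 60-second run.
CLOCK_HZ = 3_900_000
RUN_SECONDS = 60

def expected_count(cycles_per_iteration: float) -> int:
    """Whole loop iterations that fit into the run at the given cost per iteration."""
    return int(CLOCK_HZ * RUN_SECONDS / cycles_per_iteration)

corrected = expected_count(49)     # cycle figure used without halving
halved = expected_count(24.5)      # the original, mistaken figure
measured = 4_383_185               # count reported in post #1

print(corrected, halved)
print(measured / corrected)        # fraction of the ideal count actually reached
```

Under these assumptions the measured count comes out at roughly 92% of the 49-cycle estimate, so the corrected figure leaves only a modest gap to explain.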
|||
11-01-2023, 06:48 PM
Post: #11
RE: New Saturn asm "add loop" benchmark for the HP48G
Possibly memory controller behavior. The memory controller is not well described in any public documentation, but it matches the 4-bit Saturn bus to 8-bit memory, and depending on the memory access pattern can introduce delays. Perhaps a tight loop is a case where that happens.
You might try unrolling the loop a bit, perhaps by putting two or four increments consecutively, and see whether the overhead is lower.
|||
11-01-2023, 07:30 PM
(This post was last modified: 11-01-2023 08:09 PM by Jonathan Busby.)
Post: #12
RE: New Saturn asm "add loop" benchmark for the HP48G
(11-01-2023 06:48 PM)brouhaha Wrote: Possibly memory controller behavior.
Well, the cycle reference I used was supposed to take into account the SRAM instruction and data fetch and the extra cycles incurred by the Saturn bus.
Quote: The memory controller is not well described in any public documentation,
I thought the Saturn bus was pretty well described in the various HP-71B internal design specs, including the daisy-chained memory controllers. But, if you're talking about the internals of the Yorke memory controller(s) and e.g. why there are quarter cycles, then there's very little information.
Quote: but it matches the 4-bit Saturn bus to 8-bit memory,
Well, the minimum read cycle time for the Yorke is 1000 ns. But, since bytes are fetched, the effective frequency is 2 MHz, which is the speed of the memory controllers and the Saturn bus.
Quote: and depending on the memory access pattern can introduce delays.
Well, the access pattern on the Saturn bus for the inner add loop is :
(1) I'm not sure if these bus operations take place -- if the Yorke Saturn is designed well then they shouldn't, which means my final cycle count could be off by six...
So, in addition to the execution time, the Saturn bus introduces five ~2 MHz cycles of overhead, which equates to ten ~4 MHz cycles.
Quote: Perhaps a tight loop is a case where that happens.
Thanks for the tip. I need to investigate this...
Quote: You might try unrolling the loop a bit, perhaps by putting two or four increments consecutively, and see whether the overhead is lower.
Good idea. I think I'll do a 256 instruction unroll of the B field increment operation with a modified version of J-F Garnier's code.
Regards, Jonathan
Aeternitas modo est. Longa non est, paene nil.
|||
11-02-2023, 05:45 PM
(This post was last modified: 11-02-2023 05:47 PM by Jonathan Busby.)
Post: #13
RE: New Saturn asm "add loop" benchmark for the HP48G
Well, I changed the inner add loop to :
Code: l1 C=C+1 A
and I got a *HUGE* speedup. Now the count is : 6429914
Something weird is going on...
Regards, Jonathan
Aeternitas modo est. Longa non est, paene nil.
|||
11-02-2023, 06:14 PM
Post: #14
RE: New Saturn asm "add loop" benchmark for the HP48G
I think we're hitting the law of diminishing returns.
With the following inner add loop :
Code: l1 C=C+1 A
The count is now 6784080.
Regards, Jonathan
Aeternitas modo est. Longa non est, paene nil.
|||
11-03-2023, 04:28 PM
(This post was last modified: 11-04-2023 01:45 AM by Jonathan Busby.)
Post: #15
RE: New Saturn asm "add loop" benchmark for the HP48G
Well, possibly due to some obscure, arcane, subtle hardware peculiarity or bug, when I run the non-unrolled add loop code, I sometimes get a count of :
8765045
As far as I can tell, there are no obvious bugs in my code and the add loop always runs for exactly 60 seconds. It's then interrupted and control returns to the OS when a HXS of the count is pushed to the stack. Any idea what's going on?
Regards, Jonathan
EDIT #1 : Corrected horrible style problems...
Aeternitas modo est. Longa non est, paene nil.
|||
11-04-2023, 12:36 AM
Post: #16
RE: New Saturn asm "add loop" benchmark for the HP48G
Is the timing different when the loop starts at even or odd addresses?
|||
11-04-2023, 02:46 AM
Post: #17
RE: New Saturn asm "add loop" benchmark for the HP48G | |||
11-04-2023, 02:20 PM
(This post was last modified: 11-05-2023 02:15 AM by Jonathan Busby.)
Post: #18
RE: New Saturn asm "add loop" benchmark for the HP48G
For all practical purposes, only the very inner portion of the add loop need be investigated since the outer portion of the loop is only executed about four times :
Code: l1 C=C+1 A
The "GONC" branch instruction takes up the bulk of the CPU time. When the above code starts on an odd address, a total of around 5̶ ( wrong ) 10 additional cycles are required. If X is the total cycle time of the above inner loop, then we have :
\(\displaystyle \dfrac{X + 10}{X} \approx \dfrac{8765045}{4383185} \) ( EDIT #1 : "10" was "5" above )
then
\(\displaystyle X \approx \dfrac{-10}{1 - \dfrac{8765045}{4383185}} \) ( EDIT #1 : "-10" was "-5" above )
\(\displaystyle \approx 10 \) ( EDIT #1 : "10" was "5" above )
The above is nonsensical : to get the speedup indicated by the aberrant count, the whole inner loop would have to take just 5̶ 10 cycles, which is impossible. I think, therefore, we can conclude that the speedup, if it isn't a bug in my code, is not due to memory address parity.
Regards, Jonathan
EDIT #1 : My calculation is in error ( although it doesn't change the conclusion ) since I mistakenly divided the values from Cycles du Saturn by two -- it's been corrected.
Aeternitas modo est. Longa non est, paene nil.
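The algebra in this post can be checked numerically. The following is a sketch in Python, assuming the post's model that the slow case costs X + 10 cycles per iteration and the fast case X cycles; `loop_cycles_for_ratio` is a made-up name for the rearranged formula.

```python
# Solve (X + extra) / X = fast_count / slow_count for X.
# Assumption from the post: an odd start address adds about 10 cycles per iteration.
def loop_cycles_for_ratio(extra_cycles: float, fast_count: int, slow_count: int) -> float:
    """Loop cost X that would be needed for the observed count ratio."""
    ratio = fast_count / slow_count
    return extra_cycles / (ratio - 1)

x = loop_cycles_for_ratio(10, 8_765_045, 4_383_185)
print(round(x, 2))   # close to 10 cycles, far below the 49-cycle loop cost
```

Since the count ratio is almost exactly 2, any fixed per-iteration penalty would have to be about as large as the whole loop, which is why the parity explanation does not fit.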
|||
11-04-2023, 03:37 PM
Post: #19
RE: New Saturn asm "add loop" benchmark for the HP48G
(11-02-2023 05:45 PM)Jonathan Busby Wrote: Well, I changed the inner add loop to :
This speedup, at least, is to be expected? GOC and GONC have different execution times when they jump or not. Taking from the cycles.pdf in your other post, it's either 11 or 30 cycles ( I wonder why you divide those by 2, btw? ), so the original loop times 5 was 5*49 = 245 cycles, and the unrolled one 4*30+49 = 169. The ratio of 169/245 matches 4383185/6429914, well, almost.
Cheers, Werner
41CV†,42S,48GX,49G,DM42,DM41X,17BII,15CE,DM15L,12C,16CE
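For anyone who wants to redo the comparison, here is a small Python sketch of the arithmetic above. It assumes, as the post does, 49 cycles for an increment followed by a taken branch and 30 cycles for an increment whose branch falls through, with five increments per pass of the unrolled loop; the constant names are illustrative only.

```python
# Predicted vs. measured effect of unrolling, using the cycle figures quoted above.
LOOPING_STEP = 49       # increment plus taken branch (assumed from the post)
FALLTHROUGH_STEP = 30   # increment plus branch not taken (assumed from the post)
INCREMENTS = 5

original_cycles = INCREMENTS * LOOPING_STEP                           # 5 * 49
unrolled_cycles = (INCREMENTS - 1) * FALLTHROUGH_STEP + LOOPING_STEP  # 4 * 30 + 49

predicted_ratio = unrolled_cycles / original_cycles
measured_ratio = 4_383_185 / 6_429_914   # non-unrolled count / unrolled count

print(original_cycles, unrolled_cycles)
print(round(predicted_ratio, 3), round(measured_ratio, 3))
```

The two ratios differ by about one percent, which fits the "well, almost" remark.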
|||
11-04-2023, 05:23 PM
Post: #20
RE: New Saturn asm "add loop" benchmark for the HP48G
(11-04-2023 03:37 PM)Werner Wrote: This speedup, at least, is to be expected? GOC and GONC have different execution times when they jump or not. Taking from the cycles.pdf in your other post, it's either 11 or 30 cycles (I wonder why you divide those by 2 btw?), so the original loop times 5 was 5*49=245 cycles, and the unrolled one 4*30+49 = 169. The ratio of 169/245 matches 4383185/6429914, well, almost.
Oh my. Have I made a stupid arithmetic error? I thought I was supposed to divide the counts by two to get the cycle count. I don't read or speak French and I was confused by the document. If I don't divide by two, then the cycle count for the inner add loop is 49. This gives :
\(\displaystyle \dfrac{3900000 \cdot 60}{49} \approx 4775510 \)
Sorry, people. This still doesn't explain the errant counts I sometimes get...
Regards, Jonathan
Aeternitas modo est. Longa non est, paene nil.
|||