10-02-2023, 10:34 PM (This post was last modified: 10-02-2023 10:38 PM by jte.)
RE: To Jeff
(10-01-2023 06:09 PM)komame Wrote:  I have a feeling that your response went further than the scope of my question…

(Note: after writing the bulk of this, I had an aha! idea moment: see the blue text below.)

More like: wandered off? ;) Not an unfair call. There are a variety of types of benchmarks, and one question to ask is: what is the purpose of the benchmark? If we're aiming to improve the user experience, the chunks of code users will typically run are likely where to start (or, more realistically, approximations thereof: chunks of code we'd judge similar to those users would run).

(10-01-2023 06:09 PM)komame Wrote:  
I wasn't asking about optimizations but rather about a scenario where a simple example from the 'real world,' which executes quickly in a single iteration, has its timing disrupted by the internal handling of the loop iteration itself, which can constitute a significant portion of the measurement time. In such a case, it would be difficult to determine whether a change in development affected the measured operation or just the loop handling.

My comments regarding optimization have connections with several things, two being: 1. do PPL authors optimize their code using techniques like those I mentioned (e.g., loop unrolling, or, more precisely, are such techniques applied to code hotspots)? If so, benchmark sets including loop-unrolled code would help the benchmark results better reflect "real world" use; and 2. would such optimization techniques improve the run times of typical PPL code (e.g., how beneficial would automatic loop unrolling be)?

If the benchmark collection included typical PPL code whose performance was impacted by loop overhead, and improvements could be made to reduce loop overhead, that would seem to be a good thing to make happen. (Or, at least, a good thing to monitor. And such benchmarks would be reflective of real-world use. So adjusting the system to benefit such code [to improve those benchmarks] would seem entirely appropriate.)

If we're focused too much on one particular line, changes aimed at improving performance may well improve the performance of that one line, but at the cost of others. (At an abstract level, it's a bit like compression: performance is improved for some programs, but at the cost of others. Usually "the others" are not the sort of programs that are of concern [e.g., execute extremely quickly in any case], although this line of reasoning can become less abstract and more practical as performance improves.)

My plan was to have benchmarks run in the standard way as standard PPL programs (apps are also a possibility) on physical calculators, so as to get measurements of the "whole picture" (well, as regards the performance of PPL programs on HP Primes). If we want to aim at code snippets, some other approaches are possible (doing the looping outside of the PPL interpreter; the graphing engines do this sort of thing, and they also carry some interpreter state across evaluations to avoid setup costs).
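To make "the standard way" concrete, here is a minimal sketch of a whole-program measurement written as a plain PPL program (TIMEBENCH is just an illustrative name; this assumes TICKS returns a millisecond tick count, and it reuses komame's example body from below):

Code:
EXPORT TIMEBENCH()
BEGIN
 LOCAL T0,I;
 T0:=TICKS;
 FOR I FROM 1 TO 100 DO
  ΣLIST(MAKELIST(RANDOM(1),X,1,1000));
 END;
 // elapsed wall-clock time for the whole run, in ms
 RETURN TICKS-T0;
END;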

(10-01-2023 06:09 PM)komame Wrote:  
Code:
EXPORT BENCHMARK()
BEGIN
 FOR I FROM 1 TO 100 DO
  ΣLIST(MAKELIST(RANDOM(1),X,1,1000));
 END;
END;
write it as
Code:
EXPORT BENCHMARK()
BEGIN
 FOR I FROM 1 TO 20 DO
  ΣLIST(MAKELIST(RANDOM(1),X,1,1000));
  ΣLIST(MAKELIST(RANDOM(1),X,1,1000));
  ΣLIST(MAKELIST(RANDOM(1),X,1,1000));
  ΣLIST(MAKELIST(RANDOM(1),X,1,1000));
  ΣLIST(MAKELIST(RANDOM(1),X,1,1000));
 END;
END;

This is precisely what I meant by loop unrolling (the second version is the first with its loop body unrolled by a factor of five), in case I wasn't clear.
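To put rough numbers on it (h and b here are just placeholders, not measured values): if one loop iteration costs h of pure loop handling and one ΣLIST(MAKELIST(…)) evaluation costs b, the first version takes about 100·(b + h) while the unrolled version takes about 100·b + 20·h. The hundred executions of the measured operation are identical in both; only the loop-handling share shrinks, to a fifth of what it was.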

(10-01-2023 06:09 PM)komame Wrote:  
That's why I showed measurements of the variability in the speed of empty loops (without code inside) across different firmware versions: in the scenario described above, these changes can have a significant impact on measurement results and lead to incorrect conclusions. For example, you make a change in the firmware and might then think that, due to that change, the operation being measured inside the loop takes longer, even though it runs at the same speed; it's just that the loop itself runs more slowly internally. Reducing the number of loop iterations while maintaining the same number of executions of the measured operation will reduce the impact of the loop handling on the measurement results.
However, perhaps I'm being overly precise with this...
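For what it's worth, that empty-loop cost can also be estimated directly on-calculator. A minimal sketch (assuming TICKS provides a millisecond tick count and that an empty FOR body is accepted; if not, a trivial statement can stand in; the iteration count is arbitrary):

Code:
EXPORT LOOPCOST()
BEGIN
 LOCAL T0,I;
 T0:=TICKS;
 FOR I FROM 1 TO 10000 DO END;
 // average loop-handling cost per iteration, in ms
 RETURN (TICKS-T0)/10000;
END;

Subtracting an estimate like this from looped measurements is one way to keep firmware-to-firmware changes in loop handling from being mistaken for changes in the operation under test.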

As an abstract comment: automation (automated benchmarking) can help us get a better handle on variability, as it makes running more tests practical (both across revisions and to see how other factors influence performance).

If there are a substantial number of benchmark programs, my intuition is that performance changes involving things like loop overhead would show up across a large swath of the results. Thinking on it now, it seems that if loop unrolling is to be used, having that explicitly mentioned in the specially-formatted benchmark comment lines would be proper. I was going to write a bit more about this, but then...
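For instance, the top of a benchmark source might carry something like the following (the tag names here are hypothetical, just to show the idea; the real format would be whatever the specially-formatted comment lines end up being):

Code:
// BENCH name: sum-random-list
// BENCH unroll: 5    (loop body repeated 5 times per iteration)
// BENCH iterations: 20
// BENCH operations: 100  (= unroll × iterations)

A harvesting tool could then normalize results to per-operation times regardless of how a given benchmark was unrolled.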

My aha! idea moment: I was assuming the goal wasn't to measure tiny snippets of fast code: code small enough that putting it into a loop would substantially adjust its measured performance. (That is, code for which the time spent on the entire "real world" use being measured is substantially affected even by wrapping it in a loop that executes once.)