Threaded Mode | Linear Mode

Werner · (This post was last modified: 01-15-2024 04:24 PM by Werner.)

Thanks to:

Valentin Albillo's original post, and fantastic 3-line solution for the 71B
Albert Chan's different solution and
John Keith's improvement

This, then, is an implementation for the 42S of John Keith's improvement of an algorithm posted by Albert Chan solving a problem posted by Valentin Albillo.

usage: with the square matrix in stack register X, XEQ "PER".
The permanent is returned in X, and the original, unchanged matrix is in Y. This is important for low-memory conditions on the 42S.

variables used:

REGS : v (sum of rows)
n : matrix order
i : 2^(n-1)
k : 1..i-1
p : permanent

00 { 147-Byte Prgm }
01▸LBL "PER"
02 ENTER
03 RSUM
04 X=Y?
05 DET
06 REAL?
07 RTN
08 STO "REGS"
09 DIM?
10 R↓
11 STO "n"
12 2
13 X<>Y
14 DSE ST X
15 Y^X
16 STO "i"
17 STO "k"
18 R↓
19 EDIT
20 CLX
21 GTO 14
22▸LBL 10
23 RCL "i"
24 RCL- "k"
25 1
26▸LBL 11
27 RCL ST Y
28 4
29 MOD
30 GTO IND ST X
31▸LBL 01
32▸LBL 03
33 STO+ ST X @ (1,3) -> (-2,2)
34 LASTX
35 -
36 GTO 04
37▸LBL 00
38 X<> ST L
39 STO÷ ST Z @ k := k/4;
40 SQRT @ j := j + 2;
41 +
42 GTO 11
43▸LBL 02
44 SIGN @ j := j + 1;
45 +
46 X<>Y
47 8
48 MOD @ k MOD 8 is 2 or 6
49 4 @ (2,6) -> (-2,2)
50 -
51▸LBL 04
52 RCL "n"
53 1
54 R^
55 STOIJ
56 R↓
57 GETM
58 ×
59 STO+ "REGS"
60 RCL "p"
61▸LBL 14
62 RCL "n"
63 RCL 00
64 DSE ST Y
65▸LBL 13
66 RCL× IND ST Y
67 DSE ST Y
68 GTO 13
69 RCL- ST Z
70 STO "p"
71 DSE "k"
72 GTO 10
73 RCL÷ "i"
74 +/-
75 RCLEL
76 EXITALL
77 X<>Y
78 END

Timing/testing the program is easy, once you realize that the permanent of a matrix of order n, filled with all 1's, is n!.

timing:
order   42S     DM42
5        0:18     0.05s
7        1:31     0.22s
9        7:10     1.03s
12  1:06:15     9.39s

The timing of order 12 for the 42S is but an estimate, fitting a logarithmic curve through the three data points (x=time/n, y=n), yielding a .99999.. correlation, so should be pretty good.

Cheers, Werner
[updated with a shorter version... I seem to always find simple ways of shortening programs, moments after I posted them]

Albert Chan · 01-15-2024, 02:24 PM

Hi, Werner

Perhaps you can add some comments to code?

Example, why does it use SQRT ?
Does it have something to do with j = bit.ffs(k) ?

With decimal machine, getting bit.ffs (Find First Set) is expensive.
My guess is it would cost way more than these lines (j = f[1]) for next loop.

f[1] = 1
f[i] = f[i+1]
f[i+1] = i+1

Werner · (This post was last modified: 01-16-2024 04:54 PM by Werner.)

Hi, Albert!
The SQRT is a quick way to turn a 4 into a 2 ;-)

The only part, I think, that requires some explanation is lines 23-51.
There, I determine j and dj (+/-2) from k, as John Keith explained.

So, j is the place of the least significant bit in k.
Let >>k be k shifted right till the least significant bit is 1,
then dj=2 if >>k MOD 4 equals 3, else dj=-2 (I *add* dj*Mj, so my signs are swapped).
I also technically calculate the permanent of the transposed matrix, which is of course the same - to save a TRANS(M) which on a 42S will duplicate the matrix. I subsequently fetch the j-th *column* of M instead of the row.

Instead of trying to find the least significant bit of k, successively dividing by 2, I perform k MOD 4 right away, and use GTO IND ST X to branch off:
- half of the time, it is either 1 or 3 and we're done (j=1 and dj=-2 for 1 and 2 for 3).
- if it is 0, divide k by 4, add 2 to j and loop
- if it is 2, add 1 to j, and k MOD 8 is now 2 or 6, and dj -2 or 2, respectively

I had a version using v,f and d vectors (even one using only the stack), but this latest version runs almost twice as fast, also because I store the vector v in REGS, so that I don't need to switch indexing/editing matrices, and the calculation of product(v) is faster.

Cheers, Werner

Albert Chan · (This post was last modified: 01-15-2024 04:17 PM by Albert Chan.)

(01-15-2024 02:45 PM)Werner Wrote: dj=2 if >>k MOD 4 equals 3, else dj=-2 (I *add* dj*Mj, so my signs are swapped).

If matrix cells are non-negative, product of initial v (column sums) is biggest.
This is the reason my code does v[j] = v[j] - d[i]*M[i][j]

Trivia:
If n×n matrix are all ones, its' permanent = n!
First term alone give upper limit: n^n / 2^(n-1) = 2 * (n/2)^n
For comparison, Stirling's approximation: n! ~ √(2*pi*n) * (n/e)^n

n=1 --> 2*(n/2)^n = 1 = 1!
n=2 --> 2*(n/2)^n = 2 = 2!
n=3 --> 2*(n/2)^n = 6.75 > (6 = 3!)
n=4 --> 2*(n/2)^n = 32 > (24 = 4!)
...

Quote:Instead of trying to find the least significant bit of k, successively dividing by 2, I perform k MOD 4 right away ...

Nice optimizing trick!

Quote:I had a version using v,f and d vectors (even one using only the stack), but this latest version runs almost twice as fast.

Interestingly, LuaJIT code is faster (almost twice as fast) using v,f and d vectors.
This despite I already use C to code bit.ffs and bit.test.

It is perhaps due to LuaJIT have only 1 kind of number = IEEE double.
Bit manipulations required conversion of double to integer, then back to double.

Guessing which way is faster is hard. We have to code it to know it.

Werner · 01-15-2024, 04:22 PM

(01-15-2024 04:06 PM)Albert Chan Wrote: If matrix cells are non-negative, product of initial v (column sums) is biggest.
This is the reason my code does v[j] = v[j] - d[i]*M[i][j]

? That's what I do as well, but the sign of the d[i]'s are inversed in my code - and btw you don't have a choice, you have to loop over all prod(sum(+/-M[i][j])), and d[j] being -2 means you swap a +M[i][j] for a -M[i][j].

Quote:Guessing which way is faster is hard. We have to code it to know it.

In the case of the 42S, having to access 4 arrays (M,f,v,d) meant a lot of indexing, as you can have only one matrix indexed/edited at a time - save if you store one in REGS and access it with RCL IND ..

Cheers, W.

Albert Chan · 01-16-2024, 01:00 AM

(01-15-2024 12:33 PM)Werner Wrote: timing:
order   42S     DM42
5        0:18     0.05s
7        1:31     0.22s
9        7:10     1.03s
12  1:06:15     9.39s

The timing of order 12 for the 42S is but an estimate, fitting a logarithmic curve through the three data points (x=time/n, y=n), yielding a .99999.. correlation, so should be pretty good.

XCas> Digits := 5
XCas> N := float([5,7,9])
XCas> T := float([18,91,430])
XCas> correlation(N, ln(T/N))                      → 0.99999
XCas> c := linear_regression(N, ln(T/N))      → (0.64641,-1.954)

ln(t/n) = c[0]*n + c[1]
t/n = exp(c[1]) * exp(c[0])^n

--> time (sec) ≈ 0.14170 * n * 1.9087^n
--> n=12 time ≈ 3976 sec = 1:06:16

Fit is excellent, but formula does not make sense.
We would expect time proportional to number of terms = 2^(n-1)
Each term, time to update v's and get its product, plus getting j and d.

Below does not produce as good correlation, but make more sense.

XCas> correlation(N, T/2^N)                      → 0.99917
XCas> c := linear_regression(N, T/2^N)      → (0.069336,0.21908)

t/2^n = c[0]*n + c[1] = c[0] * (n + c[1]/c[0])

--> time (sec) ≈ 0.069336 * (n + 3.1596) * 2^n
--> n=12 time ≈ 4305 sec = 1:11:45

Werner · 01-16-2024, 08:47 AM

(01-16-2024 01:00 AM)Albert Chan Wrote: Fit is excellent, but formula does not make sense.

With only 3 sample points, you can't expect miracles.
I won't burn through a set of batteries to see which estimate proves to be better, probably yours ;-)

Werner

Albert Chan · 01-16-2024, 12:24 PM

We can use DM42 timings, without n=12 numbers.
Then, we predict n=12 timing, compare against actual time (9.39 sec)

N := [5,7,9]
T := [.05, .22, 1.03]

Linear regression, N vs ln(T/N)
--> correlation = 0.99939
--> time (sec) = 0.00046355 * n * 1.8393^n
--> n=12 time ≈ 8.34 sec

Linear regression, N vs T/2^N
--> correlation = 0.98491
--> time (sec) = 0.00011230 * (n + 8.7101) * 2^n
--> n=12 time ≈ 9.53 sec

John Keith · 01-17-2024, 08:17 PM

(01-15-2024 12:33 PM)Werner Wrote: Thanks to:

Valentin Albillo's original post, and fantastic 3-line solution for the 71B

Albert Chan's different solution and

John Keith's improvement

This, then, is an implementation for the 42S of John Keith's improvement of an algorithm posted by Albert Chan solving a problem posted by Valentin Albillo.

... and based on a Python/Numpy program posted by GitHub user "lesshaste", who deserves credit as well.

Nice to see a version for Free42, I will try your optimization on my RPL program when i get the chance.

John Keith · 01-26-2024, 10:28 PM

(01-17-2024 08:17 PM)John Keith Wrote: ... I will try your optimization on my RPL program when i get the chance.

Well, I tried it but it made the program substantially larger and not much faster. RPL doesn't have anything like GTO IND ST X- the closest would be either a CASE statement or nested IF..THEN..ELSE structures. The method might work well on the 71B using ON..GOTO or ON..GOSUB though.