(32S, 32SII) Birthday paradox, hash collision probability - Printable Version +- HP Forums (https://www.hpmuseum.org/forum) +-- Forum: HP Software Libraries (/forum-10.html) +--- Forum: General Software Library (/forum-13.html) +--- Thread: (32S, 32SII) Birthday paradox, hash collision probability (/thread-14890.html) |
(32S, 32SII) Birthday paradox, hash collision probability - Dave Britten - 04-25-2020 04:49 PM The birthday paradox - called a paradox because the results seem so counterintuitive - states that in a group of just 23 people, there is approximately a 50% chance that some will share the same birthday (ignoring year), or stated another way, that the group does not have entirely unique birthdays. This program allows you to compute the probability that there will be at least one collision when selecting S samples from N possibilities, with repetition allowed (sampling with replacement). I won't go too deeply into the math and combinatorics behind it, as there are other resources out there that can do it more elegantly, but I'll describe the two formulas the program uses. The simple formula, which works with smaller inputs such as birthdays, is this: 1-((N-1)/N)^((S^2-S)/2) With large inputs, like calculating hash collisions, this formula potentially loses all of its significant digits - (N-1)/N will be rounded to 1, giving you an erroneous probability of zero. Fortunately there's a second approximation which, while giving poor accuracy for small inputs, provides good accuracy for large ones: 1-e^(-S(S-1)/(2N)) (Reference: https://preshing.com/20110504/hash-collision-probabilities/) Of course, you may notice there's still a problem with this formula: if the x in e^x is very small, the result is 1 + an extremely small number, and the result will still be rounded to 1, giving a probability of 0. Fortunately, there's a simple way to accurately calculate e^x-1 (if you've programmed a TVM solver, you might recognize this): e^x-1 = 2*sinh(x/2)*e^(x/2) In order to use this transformation, we have to first restate the problem in the form of e^x-1, which is easy enough: 1-e^(-S(S-1)/(2N)) = -1*(e^(-S(S-1)/(2N))-1) = -1*2*sinh(-S(S-1)/(2N)/2)*e^(-S(S-1)/(2N)/2) = -2*sinh(-S(S-1)/(4N))*e^(-S(S-1)/(4N)) So this program first attempts to use the formula for small inputs, and if a result of 0 is obtained, it then applies the second formula to produce a result. Usage This program can either be run interactively to compute a probability for a given N and S, or via the solver to calculate N or S given the other two of N, S, and probability. Interactive use: XEQ B Enter value for N, press R/S. Enter value for S, press R/S. Enter 0 for P (this variable is used with the solver), press R/S. Probability will be returned in x. To see the probability expressed as "1 in x chance", press 1/X. Solver use: FN= B Enter initial guesses in the X register and the variable you will be solving for (this is important, as there may be another solution to the equation with a meaningless negative value). SOLVE, choose N S or P Wait a couple of seconds, and results will be returned to x. Examples: What is the probability that at least two people in a group of 30 will share the same birthday? XEQ B "N?" 365 R/S "S?" 30 R/S "P?" 0 R/S .6968, or about 69.68%. What is the probability of a GUID collision if 5 trillion GUIDs are generated? (A GUID is a 128-bit globally unique identifier, with 2^128 possible values.) XEQ B "N?" 2 ENTER 128 y^x R/S "S?" 5 E 12 R/S "P?" 0 R/S 3.6734E-14, or about 1 in 27.22 trillion, i.e. not very likely! How many GUIDs would you need to generate before there is even a 1% chance of a collision? FN= B 2 ENTER 64 y^x STO S (initial guess) SOLVE S "N?" 2 ENTER 128 y^x R/S "P?" .01 R/S "S=2.6153E18" You would need to generate about 2,615,300,000,000,000,000 GUIDs. Program Code Code: LBL B [CK=EB70 058.5] Back story: We had a data copy job at work that was failing, complaining about duplicate values in a GUID column. The table had around 5 million records in it, and I wanted to first assess whether or not there was any possibility that we legitimately had duplicate GUIDs coming from the source system. After determining that was effectively impossible, I traced the issue to some phantom reads that were happening due to a sudden increase in write activity in the source data, and changed the query's concurrency model to prevent the fake duplicate rows from appearing. RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 04-26-2020 07:12 PM (04-25-2020 04:49 PM)Dave Britten Wrote: The simple formula, which works with smaller inputs such as birthdays, is this: First formula can be calculated more accurately using log1p: lua> m = require 'mathx' lua> function p1(n,s) return -m.expm1(0.5*s*(s-1) * m.log1p(-1/n)) end lua> function p2(n,s) return -m.expm1(0.5*s*(s-1) / -n) end lua> function both(n,s) return p1(n,s), p2(n,s) end lua> lua> both(365,30) 0.6968162999539532 0.6963200177219244 lua> both(2^128, 5e12) 3.6734198463188465e-014 3.6734198463188465e-014 lua> both(2^128, 2.6153e18) 0.009999839905579181 0.009999839905579181 Probability of all different birthday, for 30 people = perm(365,30) / 365^30 ≈ 29.37% Probability of "collision" ≈ 1 - 29.37% = 70.63% Approximation p1() seems better than p2(), at least for test cases. https://www.hpmuseum.org/forum/thread-11907-post-119843.html#pid119843 https://www.hpmuseum.org/forum/thread-14344-post-126297.html#pid126297 RE: (32S, 32SII) Birthday paradox, hash collision probability - Dave Britten - 04-26-2020 07:58 PM (04-26-2020 07:12 PM)Albert Chan Wrote: First formula can be calculated more accurately using log1p: Very clever! And also usable on calculators that don't have a built-in ln(1+x) function: ln(1+x) = ln(1+x)*x/(1+x-1) RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 04-27-2020 07:54 PM (04-26-2020 07:58 PM)Dave Britten Wrote: Very clever! And also usable on calculators that don't have a built-in ln(1+x) function: Thanks. Where do you get it ? Note: watch out for divide-by-zero, if x is tiny. Or, use Dieter's log1p and expm1 approximation. I tried to re-use my formula, for log(probability of no repetition) ln_P(n,s) := -s/(12*n*(n-s)) - (s + (n-s+0.5)*ln(1-s/n)) 30 people birthday "collision" ≈ -expm1(ln_P(365,30)) = 0.706316242724 Using exact formula: 1 - perm(365,30)/365^30 = 0.706316242719 For big n, small x = s/n, we can drop the first term. Second term has a problem, even if we have accurate log1p (s + n*ln(1 - s/n)) / n = x + log(1-x) ≈ x - x = 0 There might be a better way. For now, I use: x + log(1-x) = x - (x + x^2/2 + x^3/3 + x^4/4 + ...) = -(x^2/2 + x^3/3 + x^4/4 + ...) Code: m = require 'mathx' lua> p3(365,30) 0.7063895931587847 lua> p3(2^128, 5e12) 3.6734198463188465e-014 lua> p3(2^128, 2.6153e18) 0.009999839905579181 RE: (32S, 32SII) Birthday paradox, hash collision probability - Dave Britten - 04-27-2020 09:16 PM (04-27-2020 07:54 PM)Albert Chan Wrote:(04-26-2020 07:58 PM)Dave Britten Wrote: Very clever! And also usable on calculators that don't have a built-in ln(1+x) function: Can't remember where it came from exactly, but it was related to TVM calculations. 15C Advanced Functions handbook maybe? And yes, if x+1-1 returns 0, then you just return x, since ln(1+x) approaches x as x gets smaller. RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 04-27-2020 10:43 PM (04-27-2020 07:54 PM)Albert Chan Wrote: x + log(1-x) = x - (x + x^2/2 + x^3/3 + x^4/4 + ...) = -(x^2/2 + x^3/3 + x^4/4 + ...) If we let y = x/(2-x) → x = 2*y - x*y x + log(1-x) = x - 2*(y + y^3/3 + y^5/5 + ...) = -x*y - 2*(y^3/3 + y^5/5 + ...) → ln(P(n,s)) ≈ -s/(12*n*(n-s)) + 2*(n-s+0.5)(y^3/3 + y^5/5 + ...) - (s-1)y I find that y^3 term already gives very good approximation. Drop the first term, this is my revised ln_nr(n, s), shorter and more accurate. Code: function ln_nr(n,s) -- log(probability of no repetition) lua> P_nr(365,30) 0.2936840560221026 0.7063159439778974 lua> P_nr(2^128,5e12) 0.9999999999999633 3.6734198463188465e-014 lua> P_nr(2^128,2.6153e18) 0.9900001600944208 0.009999839905579181 We can even drop y^3 term ... lua> function ln_nr(n,s) return s*(s-1)/(s-n-n) end lua> P_nr(365,30) 0.2885585859237581 0.711441414076242 lua> P_nr(2^128,5e12) 0.9999999999999633 3.6734198463188465e-014 lua> P_nr(2^128,2.6153e18) 0.9900001600944208 0.009999839905579181 Update: switched sign of y so x and y has same sign. This make it clear that x + log(1-x) ≤ 0 RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 04-28-2020 03:23 PM (04-27-2020 07:54 PM)Albert Chan Wrote: ln_P(n,s) := -s/(12*n*(n-s)) - (s + (n-s+0.5)*ln(1-s/n)) (04-27-2020 10:43 PM)Albert Chan Wrote: ln(P(n,s)) ≈ -s/(12*n*(n-s)) + 2*(n-s+0.5)(y^3/3 + y^5/5 + ...) - (s-1)y This version use the 2nd form, to avoid numerical cancellation problem of original formula. It assumed IEEE double, ln(2^-52) ≈ -36.0, and sum y^n terms in reverse order With all this work, we might as well add more corrections. XCas> series(1/(e^D-1),D) → 1/D -1/2 +D/12 -D^3/720 +D^5/30240 +D^6*order_size(D) Let u=n-x, f(x) = ln(n-x) = ln(u) → df = ln(n-x) dx = -ln(u) du With x limit from 0 to s, we have u = n-x = from n to n-s XCas> diff(-ln(u),u,3) / -720 → 1/(u^3*360) XCas> diff(-ln(u),u,5) / 30240 → -1/(u^5*1260) Code: function ln_nr(n,s) -- log(probability of no repetition) lua> m = require'mathx' lua> function fac(n) return m.tgamma(n+1) end lua> fac(69.95)/fac(69) 56.58432038352886 lua> function perm(n,s) return exp(ln_nr(n,s)) * n^s end lua> perm(69.95, 0.95) 56.58432038352816 lua> 70*perm(69.95,-0.05) -- 69.95!/69! = 70*(69.95!/70!) 56.58432038352817 lua> 70/perm(70,0.05) -- 69.95!/69! = 70/(70!/69.95!) 56.58432038352817 lua> ((69.95+70)/2)^0.95 -- inspired from Approximating Gamma Ratios 56.58427578530249 RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 05-02-2020 11:15 PM Assume x ≫ a, and let \(m = (x+{1+a\over2})\), prove \(\large {(x+a)!\over x!} ≈ m^a\) \(\large {(x+a)!\over x!} = {m!\,/\,x! \over m!\,/\,(x+a)!} = {perm(m, {1+a\over2}) \over perm(m, {1-a\over2})}\) A few lines in XCas proved this ... XCas> P_nr(n,s) := exp(-s*(s-1)/(2n)) // probability of no repetition, assumed n ≫ s XCas> my_perm(n,s) := P_nr(n,s) * n^s // avoid name conflict with XCas perm() XCas> m := x+(1+a)/2 XCas> simplify(my_perm(m,(1+a)/2) / my_perm(m,(1-a)/2) - m^a) → 0 XCas> subst((x+a)! / x!, [x,a]=[69,0.95]) → 56.5843203835 XCas> subst(m^a, [x,a]=[69,0.95]) → 56.5842757853 XCas> subst(m^a, [x,a]=[70,-0.05]) * 70 // 69.95!/69! = 69.95!/70! * 70 → 56.5843440581 Update: this ratio formula is better XCas> ratio(x,a) := (x*x + (a+1)*(x + (a+2)/6)) ^ (a/2) XCas> ratio(69.,0.95) → 56.5843203845 XCas> ratio(70., -.05) * 70 // 69.95!/69! = 69.95!/70! * 70 → 56.5843203829 RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 05-03-2020 01:00 PM (05-02-2020 11:15 PM)Albert Chan Wrote: XCas> ratio(x,a) := (x*x + (a+1)*(x + (a+2)/6)) ^ (a/2) We can reverse the logic, and produce a compact, accurate ln_nr(n, s) P_nr(n, s) = perm(n, s) / n^s = ratio(n-s, s) / n^s = b^(s/2) XCas> b := expand(ratio(n-s,s)^(2/s) / n^2) → \(-\frac{s}{n}+\frac{s^{2}}{6\cdot n^{2}}+1+\frac{1}{n}-\frac{s}{2\cdot n^{2}}+\frac{1}{n^{2}\cdot 3}\) XCas> factor(b-1) → \(\large\frac{(-s+1) (6\cdot n-s+2)}{6\cdot n^{2}}\) Code: m = require 'mathx' lua> n, s = 365, 30 lua> x1 = ln_nr(n,s) lua> x1, -m.expm1(x1) -1.2252495190150705 0.706315588664643 We can improve above by halving the "interval" lua> h = s/2 lua> x2 = ln_nr(n,h) + ln_nr(n-h,h) + h * m.log1p(-h/n) lua> x2, -m.expm1(x2) -1.2252516087363756 0.7063162023825731 I noticed halving the interval improved ln_nr(n,s) by about 16 times We can extrapolate for better estimate lua> x = x2 + (x2-x1)/15 -- (x - x1) ≈ 16*(x - x2) lua> x, -m.expm1(x) -- extrapolated ln_nr(n,s), collision probability -1.2252517480511294 0.7063162432970561 Update: I found a general formula. However, convergence is not good ... XCas> ln_nr(n,s,kmax) := ln(1+s/(n-s)) - sum(subst(sum((j/n)^k, j=1..jmax),jmax=s)/k, k=1..kmax) XCas> map(range(1,11), k -> [k, 1 - exp(ln_nr(365.,30.,k))]) \(\left(\begin{array}{cc} 1 & 0.695232406385 \\ 2 & 0.705857478651 \\ 3 & 0.706293132999 \\ 4 & 0.706314950579 \\ 5 & 0.706316165395 \\ 6 & 0.706316237864 \\ 7 & 0.706316242403 \\ 8 & 0.706316242698 \\ 9 & 0.706316242718 \\ 10 & 0.706316242719 \end{array}\right) \) RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 05-04-2020 05:56 PM (04-27-2020 09:16 PM)Dave Britten Wrote: Can't remember where it [ ln(1+x) = ln(x+1)*x/(1+x-1) ] came from exactly, but it was related to TVM calculations. You are right ! It was proved in Theorem 4, What Every Computer Scientist Should Know About Floating-Point Arithmetic Quote:And yes, if x+1-1 returns 0, then you just return x, since ln(1+x) approaches x as x gets smaller. I've noticed the formula failed for IEEE double with binary exponent of 53, and odd last bit. It is impossible to represent odd 54-bits integer x with IEEE double (53-bits mantissa). Let "⇒" represent round-to-even rules If x last bit is 0: (x+1)-1 ⇒ (x)-1 ⇒ x If x last bit is 1: (x+1)-1 ⇒ (x+2)-1 ⇒ x+2 lua> log = math.log lua> log1p = require('mathx').log1p lua> b = 2^53 lua> for i=1,1000 do x=b+2*i; print(i, log1p(x) , log(1+x)*x/(x+1-1)) end 1 36.7368005696771 36.736800569677094 2 36.7368005696771 36.7368005696771 3 36.7368005696771 36.73680056967709 4 36.7368005696771 36.7368005696771 5 36.7368005696771 36.736800569677094 ... RE: (32S, 32SII) Birthday paradox, hash collision probability - Werner - 05-05-2020 01:46 PM (05-03-2020 01:00 PM)Albert Chan Wrote: Fantastic, Albert! A 41-compatible routine, input n ENTER s: Code: >LBL "PCOL" Cheers, Werner |