Hashing to Estimate the Size of a Stream

1 · Jeremy Kun · Jan. 4, 2016, 9 a.m.
Summary
Problem: Estimate the number of distinct items in a data stream that is too large to fit in memory. Solution: (in python) import random def randomHash(modulus): a, b = random.randint(0,modulus-1), random.randint(0,modulus-1) def f(x): return (a*x + b) % modulus return f def average(L): return sum(L) / len(L) def numDistinctElements(stream, numParallelHashes=10): modulus = 2**20 hashes = [randomHash(modulus) for _ in range(numParallelHashes)] minima = [modulus] * numParallelHashes currentEstimate...