A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. Imagine you have a very large set of items (like all URLs visited, or all usernames taken) and you want to quickly check if a new item is *already* in the set, without storing the entire set explicitly.
Bloom filters are designed for exactly this! For any query, they give one of two answers: the element is "possibly in the set" or "definitely not in the set".
The key word is probabilistic: they can sometimes make a mistake by saying an element is possibly in the set when it actually isn't (a false positive). However, they will *never* say an element is definitely not in the set if it actually *is* present (no false negatives).
A Bloom filter consists of two components:

- A bit array of m bits, initially all set to 0. This array is typically much smaller than the space needed to store all the actual elements.
- A set of k different, independent hash functions. Each hash function takes an element as input and outputs an index (a position) within the bit array (from 0 to m-1).
The filter supports two operations:

- Adding an element: feed it to each of the k hash functions to get k array indices, then set the bits at those k positions in the bit array to 1. (If a bit is already 1, it stays 1.)
- Querying an element: feed it to the same k hash functions to get k indices, then check the bits at those k indices in the array:
  - If *any* of those bits is 0, the element is definitely not in the set (because if it had been added, all its corresponding bits would have been set to 1).
  - If *all* of those bits are 1, we conclude that the element is possibly in the set. It might be there, or it might be a false positive.
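To make the two operations concrete, here is a minimal Python sketch. It is illustrative only: the class name `BloomFilter` and the salted-SHA-256 scheme for deriving the k indices are assumptions, not a standard API; real implementations typically use faster non-cryptographic hashes or double hashing.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch for illustration (not production-ready)."""

    def __init__(self, m: int, k: int):
        self.m = m           # size of the bit array
        self.k = k           # number of hash functions
        self.bits = [0] * m  # bit array, initially all 0

    def _indices(self, item: str):
        # Derive k indices by hashing the item with k different salts.
        # (Illustrative choice; real implementations often use double hashing.)
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bit at each of the k hashed positions to 1.
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # Any 0 bit -> definitely not in the set; all 1s -> possibly in the set.
        return all(self.bits[idx] for idx in self._indices(item))
```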
A false positive occurs when we check for an element that was never added, but all the bits corresponding to its hash indices happen to have been set to 1 by *other* elements that *were* added. The more elements are added, the more bits get set to 1, and the higher the probability that such collisions produce a false positive.
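This is easy to observe with the sketch above by making the filter deliberately too small for the number of items. (The outcome of the second query depends on the actual hash values, so it may print True or False.)

```python
# A tiny, crowded filter makes collisions (and thus false positives) likely.
bf = BloomFilter(m=16, k=2)
for word in ["apple", "banana", "cherry", "date", "elderberry"]:
    bf.add(word)

print(bf.might_contain("apple"))  # True: actually present (no false negatives)
print(bf.might_contain("zzz"))    # "zzz" was never added; True here would be
                                  # a false positive caused by colliding bits
```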
There's a trade-off between the size of the bit array (m), the number of elements inserted (n), the number of hash functions (k), and the desired false positive probability (p):
- Larger bit array (m) → more space used, lower false positive rate.
- More hash functions (k) → bits get set faster (potentially a higher FP rate at first), but better distribution, which can lower the FP rate up to an optimal point. Too many hash functions fill the filter quickly, increasing false positives.
- More elements inserted (n) → higher chance of collisions, higher false positive rate.

You typically choose m and k based on the expected number of elements n and the acceptable false positive rate p.
Given the number of items you expect to insert (n) and your desired maximum false positive probability (p), you can estimate the optimal size of the bit array (m) and the optimal number of hash functions (k) needed:
m ≈ -(n * ln(p)) / (ln(2))²

k ≈ (m / n) * ln(2)
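As a worked example of these formulas: for n = 1,000,000 expected items and a target p = 1%, m works out to about 9,585,059 bits (roughly 1.14 MiB) and k to about 6.6, rounded to 7. A small Python helper (the function name `optimal_params` is just a placeholder) that computes this:

```python
import math

def optimal_params(n: int, p: float) -> tuple[int, int]:
    """Estimate bit-array size m and hash count k for n items at FP rate p."""
    m = -(n * math.log(p)) / (math.log(2) ** 2)  # m ≈ -(n * ln(p)) / (ln(2))²
    k = (m / n) * math.log(2)                    # k ≈ (m / n) * ln(2)
    return math.ceil(m), round(k)

print(optimal_params(1_000_000, 0.01))  # (9585059, 7): ~1.14 MiB, 7 hash functions
```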
Conversely, given m, n, and k, the approximate false positive probability p is:
p ≈ (1 - e^(-k * n / m))^k
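Plugging the example values from above into this formula (again a sketch; the function name is hypothetical) confirms the sizing:

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate FP probability after inserting n items into m bits with k hashes."""
    return (1 - math.exp(-k * n / m)) ** k

print(false_positive_rate(1_000_000, 9_585_059, 7))  # ~0.0100, i.e. about 1%
```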
Use the simulation below to see how these parameters interact!
[Interactive simulation: set the expected items (n), bit array size (m), and number of hashes (k); it reports the optimal k for the chosen n and m, the theoretical false positive rate (p), the fraction of bits set as items are inserted, and the empirical false positive rate measured by querying items that were never inserted.]