Interactive Bloom Filter Explorer

What is a Bloom Filter?

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. Imagine you have a very large set of items (like all URLs visited, or all usernames taken) and you want to quickly check if a new item is *already* in the set, without storing the entire set explicitly.

Bloom filters are designed for exactly this! A query returns one of only two answers: the element is "possibly in the set" or it is "definitely not in the set".

The key word is probabilistic: they can sometimes make a mistake by saying an element is possibly in the set when it actually isn't (a false positive). However, they will *never* say an element is definitely not in the set if it actually *is* present (no false negatives).

How it Works

  1. Bit Array: We start with an array of m bits, initially all set to 0. This array is typically much smaller than the space needed to store all the actual elements.
  2. Hash Functions: We choose k different and independent hash functions. Each hash function takes an element as input and outputs an index (a position) within the bit array (from 0 to m-1).
  3. Insertion (Adding an element):
    • To add an element (e.g., "apple"), we feed it to each of the k hash functions.
    • This gives us k array indices.
    • We go to these k positions in the bit array and set the bits at those indices to 1. (If a bit is already 1, it stays 1).
  4. Query (Checking for an element):
    • To check if an element (e.g., "banana") might be in the set, we again feed it to the k hash functions to get k indices.
    • We look at the bits at these k indices in the array.
    • If any of these bits is 0, we know the element is definitely not in the set (because if it had been added, all its corresponding bits would have been set to 1).
    • If all of these bits are 1, we conclude that the element is possibly in the set. It might really be there, or it might be a false positive (a minimal code sketch of both operations follows this list).
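
To make the insert and query steps concrete, here is a minimal Python sketch. The class name, the use of hashlib, and the double-hashing trick for deriving the k indices from two base hashes are illustrative choices of this sketch, not something prescribed by the explorer.

import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m              # number of bits in the array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # bit array, initially all 0

    def _indexes(self, item: str):
        # Derive k indices from two base hashes (h1 + i * h2), a common way
        # to approximate k independent hash functions with only two.
        h1 = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Insertion: set the bit at each of the k positions to 1.
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # Query: any 0 bit means "definitely not in the set";
        # all 1s means only "possibly in the set".
        return all(self.bits[idx] == 1 for idx in self._indexes(item))


bf = BloomFilter(m=64, k=3)
bf.add("apple")
print(bf.might_contain("apple"))    # True -- no false negatives
print(bf.might_contain("banana"))   # usually False; a True here would be a false positive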

False Positives

A false positive occurs when we check for an element that was never added, but all the bits corresponding to its hash indices happen to have been set to 1 by *other* elements that *were* added. The more elements added, the more bits get set to 1, and the higher the probability of these collisions leading to a false positive.
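
The snippet below (reusing the illustrative BloomFilter class from the sketch above, with a deliberately undersized bit array) shows how quickly this can happen:

# A deliberately crowded filter: 16 bits, 2 hashes, 6 inserted items.
bf = BloomFilter(m=16, k=2)
for word in ["apple", "banana", "cherry", "date", "elderberry", "fig"]:
    bf.add(word)

# None of these were ever inserted, so any True answer is a false positive.
probes = ["kiwi", "lemon", "mango", "olive", "peach", "plum", "quince", "raisin"]
print([w for w in probes if bf.might_contain(w)])  # with so many bits set, at least one hit is likely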

Trade-offs

There's a trade-off between the size of the bit array (m), the number of elements inserted (n), the number of hash functions (k), and the resulting false positive probability (p): a larger bit array lowers p but costs more memory; inserting more elements sets more bits and raises p; and more hash functions reduce p only up to an optimum, beyond which they fill the array faster and slow down every operation.

You typically choose m and k based on the expected number of elements n and the acceptable false positive rate p.


Parameters and Optimization

Given the number of items you expect to insert (n) and your desired maximum false positive probability (p), you can estimate the optimal size of the bit array (m) and the optimal number of hash functions (k) needed:

Optimal size m ≈ - (n * ln(p)) / (ln(2)²)
Optimal hashes k ≈ (m / n) * ln(2)
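
As a rough sketch, these two formulas translate directly into a small helper (the function name and rounding choices here are mine, not part of the explorer):

import math


def optimal_parameters(n: int, p: float) -> tuple[int, int]:
    # m = -n * ln(p) / (ln 2)^2, rounded up; k = (m / n) * ln(2), rounded to nearest.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k


print(optimal_parameters(n=1000, p=0.01))  # roughly (9586, 7)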

Conversely, given m, n, and k, the approximate false positive probability p is:

p ≈ (1 - e^(-k*n/m))^k
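
In code form (again just a sketch following the formula above):

import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    # Approximate p after inserting n items into an m-bit filter with k hashes.
    return (1 - math.exp(-k * n / m)) ** k


print(false_positive_rate(m=9586, n=1000, k=7))  # about 0.01, matching the target above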

Use the simulation below to see how these parameters interact!

Calculated Parameters

The parameter panel reports the expected number of items (n), the bit array size (m), the number of hash functions (k), the optimal k for the given n and m, and the resulting theoretical false positive rate (p).

Interactive Simulation

The simulation panel tracks how many items have actually been inserted, how many of the m bits are set, the theoretical false positive rate, the number of tested items that were never inserted, the actual false positives found among them, the resulting empirical false positive rate, and a visualization of the full m-bit array.
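
If you cannot run the live page, the following script is a rough offline stand-in for what the simulation measures, reusing the BloomFilter class and the two helper functions sketched above: it inserts n random items, probes items that were never inserted, and compares the empirical false positive rate with the theoretical estimate.

import random
import string


def random_word(rng: random.Random, length: int = 12) -> str:
    # Random lowercase string used as a stand-in for the simulation's test items.
    return "".join(rng.choices(string.ascii_lowercase, k=length))


rng = random.Random(42)
n, p_target = 1000, 0.01
m, k = optimal_parameters(n, p_target)   # helper from the sketch above
bf = BloomFilter(m, k)                   # class from the sketch above

inserted = {random_word(rng) for _ in range(n)}
for item in inserted:
    bf.add(item)

num_probes = 20_000
false_positives = 0
probes_done = 0
while probes_done < num_probes:
    probe = random_word(rng)
    if probe in inserted:
        continue                         # only count items that were never inserted
    probes_done += 1
    if bf.might_contain(probe):
        false_positives += 1

print(f"Bits set: {sum(bf.bits)} / {m}")
print(f"Empirical FP rate:   {false_positives / num_probes:.4f}")
print(f"Theoretical FP rate: {false_positive_rate(m, len(inserted), k):.4f}")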