A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can yield false positives (reporting an element is present when it is not) but never false negatives (it will never miss a true member). It works by hashing elements into a bit array using multiple hash functions. Bloom filters are used in databases, web caches, network routers, and spell checkers where memory efficiency is critical and occasional false positives are tolerable.
P_fp = (1 - e^(-k*n/m))^k
LaTeX: P_{fp} = \left(1 - e^{-kn/m}\right)^k
| Symbol | Meaning | Unit |
|---|---|---|
| P_fp | Probability of a false positive | dimensionless (0 to 1) |
| k | Number of hash functions | count |
| n | Number of elements inserted | count |
| m | Size of the bit array (number of bits) | bits |
| e | Euler's number (≈ 2.718) | dimensionless |
Problem
A Bloom filter uses m = 1000 bits, k = 7 hash functions, and has n = 100 elements inserted. What is the probability of a false positive?
Solution
Step 1: Identify variables: m = 1000, k = 7, n = 100. Step 2: Compute the exponent: -k*n/m = -(7 × 100) / 1000 = -700 / 1000 = -0.7. Step 3: Compute e^(-0.7) ≈ 0.4966. Step 4: Compute 1 - e^(-0.7) = 1 - 0.4966 = 0.5034. Step 5: Raise to the power k: (0.5034)^7 ≈ 0.00819.
Answer
False positive probability ≈ 0.82% (approximately 1 in 122 queries will be a false positive)
| Property | Bloom Filter | Hash Set | Notes |
|---|---|---|---|
| Space per element | Very low (~10 bits) | High (pointer + data) | Bloom filter wins for large n |
| False positives | Possible (tunable) | None | Trade-off for space savings |
| False negatives | Never | Never | Both are reliable for misses |
| Deletion support | No (standard) | Yes | Counting Bloom filter adds deletion |
| Lookup time | O(k) — very fast | O(1) amortized | k is small constant |
Wikimedia Commons, CC BY-SA
A trie (also called a prefix tree or digital tree) is a tree-like data structure used to store strings where each node represents a single character of a key, and paths from root to leaf spell out complete keys. The root represents an empty string, and each edge corresponds to a character. Tries are highly efficient for prefix-based search operations, making them ideal for autocomplete systems, spell checkers, and IP routing tables.
A disjoint set (also called union-find or merge-find set) is a data structure that keeps track of a collection of non-overlapping sets and supports two primary operations: Union (merging two sets) and Find (determining which set an element belongs to by returning a representative). With path compression and union by rank optimizations, operations run in nearly O(1) amortized time per operation (Inverse Ackermann function). Disjoint sets are foundational in Kruskal's minimum spanning tree algorithm, network connectivity checking, and image segmentation.
A graph is a non-linear data structure consisting of a finite set of vertices (nodes) and a set of edges that connect pairs of vertices. Graphs can be directed (edges have a direction) or undirected, and weighted (edges carry a cost) or unweighted. They are used to model real-world problems such as social networks, transportation systems, web page link structures, and dependency resolution.
The Bloom filter was conceived by Burton Howard Bloom in 1970, and is named after him. The word "filter" reflects its function of filtering membership queries. Bloom published it in a paper titled "Space/Time Trade-offs in Hash Coding with Allowable Errors" in the Communications of the ACM.