Overview
A dataset is a sequence of elements. A pair with is called a duplicate in if .
Task: Given a dataset , find all duplicates. Example:
Set of duplicates in :
Naive solution
sort the elements in alphabetic order
Hash function
Let the elements in be drawn from a universe We use a hash function and assume:
- is efficiently computable.
- behaves like a random function, i.e.,
(independently for different )
Important
Each is uniformly distributed at random in , but
We choose to be much smaller than (compression).
Hash-based duplicate detection
Idea: elements that are equal must have the same hash value. For each position , compute and place into bucket . For every bucket , define
If two elements are duplicates, then their indices must appear in the same bucket.
Therefore, we only need to compare elements inside the same bucket.
Algorithm
- Create empty buckets.
- For every :
- compute
- insert into bucket
- For each bucket :
- compare all pairs of indices inside
- if , output
Pseudocode
Create empty buckets B[1], ..., B[m]
for i = 1 to n:
b = h(s_i)
append i to B[b]
Dupl = empty set
for b = 1 to m:
for each pair (i, j) in B[b] with i < j:
if s_i = s_j:
add (i, j) to Dupl
return Dupl