Overview

A dataset is a sequence of elements. A pair with is called a duplicate in if .

Task: Given a dataset , find all duplicates. Example:

Set of duplicates in :

Naive solution

sort the elements in alphabetic order

Hash function

Let the elements in be drawn from a universe We use a hash function and assume:

  • is efficiently computable.
  • behaves like a random function, i.e.,

(independently for different )

Important

Each is uniformly distributed at random in , but

We choose to be much smaller than (compression).

Hash-based duplicate detection

Idea: elements that are equal must have the same hash value. For each position , compute and place into bucket . For every bucket , define

If two elements are duplicates, then their indices must appear in the same bucket.

Therefore, we only need to compare elements inside the same bucket.

Algorithm

  1. Create empty buckets.
  2. For every :
    • compute
    • insert into bucket
  3. For each bucket :
    • compare all pairs of indices inside
    • if , output

Pseudocode

Create empty buckets B[1], ..., B[m]
for i = 1 to n:
    b = h(s_i)
    append i to B[b]
Dupl = empty set
for b = 1 to m:
    for each pair (i, j) in B[b] with i < j:
        if s_i = s_j:
            add (i, j) to Dupl
return Dupl