Finding Duplicates

Overview

A dataset $D = (s_{1}, s_{2}, \dots, s_{n})$ is a sequence of $n$ elements. A pair $(i, j)$ with $1 \leq i < j \leq n$ is called a duplicate in $D$ if $s_{i} = s_{j}$.

Task: Given a dataset $D$, find all duplicates. Example:

$$D = (A, C, B, Z, C, B, C)$$

The elements occupy positions $1, 2, \dots, 7$ from left to right.

Set of duplicates in $D$ :

$$\mathrm{Dupl}(D) = \{(2, 5), (2, 7), (5, 7), (3, 6)\}$$
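The definition can be checked directly by testing every pair of positions. The following is a minimal sketch for illustration only; the function name `duplicates_brute_force` is hypothetical and not part of the notes, and positions are 1-based to match the example.

```python
from itertools import combinations


def duplicates_brute_force(data):
    """Return all pairs (i, j), 1-based with i < j, where the elements are equal."""
    # combinations() yields index pairs in increasing order, so i < j always holds.
    return {
        (i + 1, j + 1)
        for i, j in combinations(range(len(data)), 2)
        if data[i] == data[j]
    }


D = ["A", "C", "B", "Z", "C", "B", "C"]
result = duplicates_brute_force(D)
print(sorted(result))
```

This compares all $n(n-1)/2$ pairs, which is why the approaches below try to avoid most of these comparisons.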

Naive solution

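Sort the elements in alphabetical order, keeping each element's original position alongside it. After sorting, equal elements are adjacent, so every duplicate can be read off from a run of equal elements.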
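A sorting-based approach might look like the sketch below. It assumes the elements are comparable (here: strings), and the name `duplicates_by_sorting` is hypothetical. Positions are sorted instead of the elements themselves so that the original indices are not lost.

```python
def duplicates_by_sorting(data):
    """Find duplicates by sorting positions by element value and scanning runs."""
    # Sort the 1-based positions by the element stored there. sorted() is stable,
    # so positions holding equal elements stay in increasing order.
    order = sorted(range(1, len(data) + 1), key=lambda i: data[i - 1])

    dupl = set()
    start = 0
    while start < len(order):
        # Extend the run while the elements equal the first element of the run.
        end = start
        while end < len(order) and data[order[end] - 1] == data[order[start] - 1]:
            end += 1
        run = order[start:end]
        # Every pair of positions inside a run is a duplicate.
        for a in range(len(run)):
            for b in range(a + 1, len(run)):
                dupl.add((run[a], run[b]))
        start = end
    return dupl


D = ["A", "C", "B", "Z", "C", "B", "C"]
result = duplicates_by_sorting(D)
print(sorted(result))
```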

Hash function

Let the elements in $D$ be drawn from a universe $U$. We use a hash function $h : U \to [m]$ and assume:

$h$ is efficiently computable.
$h$ behaves like a random function, i.e.,

$$\forall u \in U \ \forall i \in [m] : \Pr[h(u) = i] = \frac{1}{m}$$

(independently for different $u$)

Important

Each $h(s_{i})$ is uniformly distributed at random in $[m]$, but equal elements always receive the same hash value: $s_{i} = s_{j} \Rightarrow h(s_{i}) = h(s_{j})$.
We choose $m$ to be much smaller than $|U|$ (compression).
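To make the two properties concrete, here is one possible stand-in for $h$. It is only a sketch: it assumes the elements are strings and that $[m]$ denotes $\{1, \dots, m\}$, and it uses SHA-256 from Python's standard library as a deterministic substitute for the idealized random function, which no real function fully matches.

```python
import hashlib


def h(u, m):
    """Map a string u to a bucket number in {1, ..., m} (assumed meaning of [m])."""
    # SHA-256 is deterministic, so equal inputs always yield equal outputs.
    digest = hashlib.sha256(u.encode("utf-8")).digest()
    # Reduce the 256-bit digest to the range 1..m.
    return int.from_bytes(digest, "big") % m + 1


m = 4
values = {u: h(u, m) for u in ["A", "C", "B", "Z"]}
print(values)
```

Calling `h("C", m)` twice returns the same bucket, which is the property the duplicate search relies on. Distinct elements may still land in the same bucket, since $m$ is much smaller than $|U|$.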

Hash-based duplicate detection

Idea: elements that are equal must have the same hash value. For each position $i$, compute $h(s_{i})$ and place $i$ into bucket $h(s_{i})$. For every bucket $b \in [m]$, define

$$B_{b} = \{i \in [n] : h(s_{i}) = b\}.$$

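If $(i, j)$ is a duplicate, then $i$ and $j$ must appear in the same bucket, because

$$s_{i} = s_{j} \Rightarrow h(s_{i}) = h(s_{j}).$$

Therefore, we only need to compare elements inside the same bucket. The comparison is still required, since distinct elements may also share a hash value (a collision).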

Algorithm

Create $m$ empty buckets.
For every $i = 1, \dots, n$ :
- compute $b = h (s_{i})$
- insert $i$ into bucket $B_{b}$
For each bucket $B_{b}$ :
- compare all pairs of indices $i < j$ inside $B_{b}$
- if $s_{i} = s_{j}$, output $(i, j)$

Pseudocode

Create empty buckets B[1], ..., B[m]
for i = 1 to n:
    b = h(s_i)
    append i to B[b]
Dupl = empty set
for b = 1 to m:
    for each pair (i, j) in B[b] with i < j:
        if s_i = s_j:
            add (i, j) to Dupl
return Dupl
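The pseudocode above can be translated into Python roughly as follows. This is a sketch under assumptions not stated in the notes: the hash function is passed in as a parameter taking the element and $m$, buckets are numbered $1, \dots, m$, and the example uses Python's built-in `hash` as a convenient stand-in for $h$.

```python
from itertools import combinations


def find_duplicates(data, m, h):
    """Return all duplicates (i, j), 1-based with i < j, using m hash buckets."""
    # Create m empty buckets; bucket b is stored at list index b - 1.
    buckets = [[] for _ in range(m)]

    # Place every position i into the bucket given by the hash of its element.
    for i, s in enumerate(data, start=1):
        buckets[h(s, m) - 1].append(i)

    dupl = set()
    for bucket in buckets:
        # Positions were appended in increasing order, so each pair has i < j.
        for i, j in combinations(bucket, 2):
            # Distinct elements can collide, so equality must still be checked.
            if data[i - 1] == data[j - 1]:
                dupl.add((i, j))
    return dupl


def builtin_hash(u, m):
    """Stand-in hash: Python's built-in hash reduced to {1, ..., m}."""
    return hash(u) % m + 1


D = ["A", "C", "B", "Z", "C", "B", "C"]
result = find_duplicates(D, m=4, h=builtin_hash)
print(sorted(result))
```

The result does not depend on which hash function is used, only the number of comparisons does: with a single bucket ($m = 1$) the procedure degenerates into comparing all pairs.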
