deduper

Content-based duplicate file detection CLI for Linux.

Finds exact and near-duplicate files using cryptographic hashing and perceptual fingerprinting. Designed for large media libraries where filename or size comparisons aren't enough.

Phases

| Phase | Status | Method |
|-------|--------|--------|
| Image | Complete | SHA-256 + dHash perceptual hashing |
| Music | 🔲 Planned | Audio fingerprinting |
| Video | 🔲 Planned | Video fingerprinting |

Installation

From source

# Requires Rust 1.85+ (2024 edition)
git clone https://gitea.ingwaz.work/admin/deduper.git
cd deduper
cargo build --release

Binary lands at target/release/deduper.

Usage

deduper <folder> [hamming-threshold]

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| folder | yes | | Root directory to scan (recursive) |
| hamming-threshold | no | 8 | Max Hamming distance for perceptual similarity (0–64) |
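As an illustration of the argument rules above, here is a minimal sketch of how the two positional arguments could be parsed with only the standard library. The `Options` struct, `parse_args` function, and error messages are hypothetical names chosen for this example; they are not taken from the deduper source.

```rust
use std::path::PathBuf;

/// Parsed command-line options (hypothetical struct for illustration).
#[derive(Debug, PartialEq)]
struct Options {
    folder: PathBuf,
    threshold: u32,
}

const DEFAULT_THRESHOLD: u32 = 8;
// A 64-bit hash cannot differ from another in more than 64 bits.
const MAX_THRESHOLD: u32 = 64;

/// Parse `<folder> [hamming-threshold]` from the arguments that follow the program name.
fn parse_args(args: &[String]) -> Result<Options, String> {
    let folder = args
        .first()
        .ok_or_else(|| "usage: deduper <folder> [hamming-threshold]".to_string())?;
    let threshold = match args.get(1) {
        None => DEFAULT_THRESHOLD,
        Some(raw) => {
            let value: u32 = raw
                .parse()
                .map_err(|_| format!("threshold must be an integer, got {raw:?}"))?;
            if value > MAX_THRESHOLD {
                return Err(format!("threshold must be between 0 and 64, got {value}"));
            }
            value
        }
    };
    Ok(Options { folder: PathBuf::from(folder), threshold })
}

fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    match parse_args(&args) {
        Ok(opts) => println!("scanning {} with threshold {}", opts.folder.display(), opts.threshold),
        Err(msg) => eprintln!("{msg}"),
    }
}
```

Rejecting values above 64 up front is a design choice of this sketch; the real CLI may handle out-of-range input differently.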

Examples

Scan a photo library with default threshold:

deduper ~/Pictures

Strict matching (fewer false positives):

deduper ~/Pictures 4

Relaxed matching (catches more variants):

deduper ~/Pictures 12

Output

group 1 [exact]
  /home/user/Pictures/vacation/beach.jpg
  /home/user/Pictures/backup/beach.jpg

group 2 [similar]
  /home/user/Pictures/vacation/sunset.jpg
  /home/user/Pictures/edited/sunset_cropped.jpg
  /home/user/Pictures/thumbs/sunset_small.jpg

Group labels:

  • exact: byte-identical files (SHA-256 match)
  • similar: perceptually similar images (dHash Hamming distance ≤ threshold)
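
The report format shown above can be produced by a small rendering function. The following is a sketch only: `MatchKind`, `Group`, and `render_groups` are hypothetical names for this example, and the real tool may structure its output code differently.

```rust
/// How the members of a group were matched (hypothetical enum for illustration).
#[derive(Clone, Copy)]
enum MatchKind {
    Exact,
    Similar,
}

/// One group of duplicate files (hypothetical struct for illustration).
struct Group {
    kind: MatchKind,
    paths: Vec<String>,
}

/// Render groups as numbered blocks, one indented path per line,
/// with a blank line between consecutive groups.
fn render_groups(groups: &[Group]) -> String {
    let mut out = String::new();
    for (index, group) in groups.iter().enumerate() {
        if index > 0 {
            out.push('\n');
        }
        let label = match group.kind {
            MatchKind::Exact => "exact",
            MatchKind::Similar => "similar",
        };
        out.push_str(&format!("group {} [{}]\n", index + 1, label));
        for path in &group.paths {
            out.push_str(&format!("  {path}\n"));
        }
    }
    out
}

fn main() {
    let groups = vec![
        Group { kind: MatchKind::Exact, paths: vec!["a.jpg".into(), "b.jpg".into()] },
        Group { kind: MatchKind::Similar, paths: vec!["c.jpg".into(), "d.jpg".into()] },
    ];
    print!("{}", render_groups(&groups));
}
```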

How It Works

Image Phase

  1. Recursive scan — Walks the directory tree, filtering by image extension (jpg, jpeg, png, webp, bmp, gif, tif, tiff)
  2. SHA-256 hashing — Identifies byte-identical duplicates
  3. dHash fingerprinting — Resizes each image to 9×8 grayscale, compares adjacent pixels to produce a 64-bit perceptual hash
  4. Hamming distance — Measures bit differences between dHash values. Lower = more similar
  5. Union-Find grouping — Clusters similar images into groups, separating exact from perceptual matches
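
Steps 3 and 4 can be sketched with the standard library alone. The example below assumes the image has already been decoded, converted to grayscale, and resized to 9 columns by 8 rows (the real tool uses the image crate for that part). The comparison direction (left pixel brighter than right) and the bit order are assumptions of this sketch, and `dhash`, `hamming`, and `gradient` are hypothetical names.

```rust
/// Compute a 64-bit difference hash from an 8-row by 9-column grayscale grid.
/// Each row contributes 8 bits, one per pair of horizontally adjacent pixels.
fn dhash(pixels: &[[u8; 9]; 8]) -> u64 {
    let mut hash = 0u64;
    for row in pixels {
        for x in 0..8 {
            // Bit is 1 when the left pixel is brighter than its right neighbor.
            hash = (hash << 1) | u64::from(row[x] > row[x + 1]);
        }
    }
    hash
}

/// Number of bit positions in which two hashes differ.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Test pattern: every row brightens from left to right.
fn gradient() -> [[u8; 9]; 8] {
    let mut grid = [[0u8; 9]; 8];
    for row in grid.iter_mut() {
        for (x, pixel) in row.iter_mut().enumerate() {
            *pixel = (x as u8) * 10;
        }
    }
    grid
}

fn main() {
    let original = gradient();
    // Simulate a small local edit: brighten the top-left pixel.
    let mut edited = original;
    edited[0][0] = 200;

    let (a, b) = (dhash(&original), dhash(&edited));
    println!("original: {a:016x}");
    println!("edited:   {b:016x}");
    println!("hamming distance: {}", hamming(a, b));
}
```

Because each bit depends only on the relative brightness of two neighbors, uniform changes in brightness or mild recompression tend to leave most bits unchanged, which is why a small Hamming distance indicates visual similarity.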

Threshold Guide

| Threshold | Behavior |
|-----------|----------|
| 0 | Exact perceptual match only (identical visual content, different encoding) |
| 1–4 | Very conservative: catches resizes and minor compression artifacts |
| 5–8 | Recommended: catches resizes, crops, slight edits |
| 9–12 | Tolerant: may catch heavily edited versions, higher false positive risk |
| 13+ | Aggressive: will group loosely related images |
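
To see how the threshold interacts with Union-Find grouping, consider the sketch below. It assumes hashes are compared pairwise and merged whenever their distance is within the threshold; whether deduper compares every pair this way is an assumption, and `UnionFind` and `group_by_threshold` are hypothetical names. One consequence of any Union-Find clustering is that matches chain: if A is close to B and B is close to C, all three land in one group even when A and C are farther apart than the threshold.

```rust
/// Minimal union-find (disjoint set) with path compression.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        let p = self.parent[x];
        if p != x {
            let root = self.find(p);
            self.parent[x] = root; // path compression
        }
        self.parent[x]
    }

    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[rb] = ra;
        }
    }
}

/// Return groups (as index lists) of hashes connected by Hamming distance <= threshold.
/// Singletons are dropped because they have no duplicate.
fn group_by_threshold(hashes: &[u64], threshold: u32) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(hashes.len());
    for i in 0..hashes.len() {
        for j in (i + 1)..hashes.len() {
            if (hashes[i] ^ hashes[j]).count_ones() <= threshold {
                uf.union(i, j);
            }
        }
    }
    let mut by_root: Vec<Vec<usize>> = vec![Vec::new(); hashes.len()];
    for i in 0..hashes.len() {
        let root = uf.find(i);
        by_root[root].push(i);
    }
    by_root.retain(|group| group.len() > 1);
    by_root
}

fn main() {
    // Distances: 0-1 is 4 bits, 1-2 is 4 bits, 0-2 is 8 bits; index 3 is far from all.
    let hashes = [0x0u64, 0xF, 0xFF, u64::MAX];
    for threshold in [3, 4, 64] {
        println!("threshold {threshold}: {:?}", group_by_threshold(&hashes, threshold));
    }
}
```

With threshold 4, indices 0 and 2 end up in the same group through index 1 despite being 8 bits apart. This chaining effect is one reason higher thresholds can produce large, loosely related groups.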

Supported Formats

jpg · jpeg · png · webp · bmp · gif · tif · tiff

Non-image files are silently skipped.
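
A filter over the extension list above might look like the following sketch. `is_image_path` and `IMAGE_EXTENSIONS` are hypothetical names, and the case-insensitive comparison is an assumption; the README does not say whether `photo.JPG` is treated the same as `photo.jpg`.

```rust
use std::path::Path;

/// Extensions listed under Supported Formats.
const IMAGE_EXTENSIONS: [&str; 8] = ["jpg", "jpeg", "png", "webp", "bmp", "gif", "tif", "tiff"];

/// Return true when the path has one of the supported image extensions.
/// The comparison ignores ASCII case (an assumption of this sketch).
fn is_image_path(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .is_some_and(|ext| {
            IMAGE_EXTENSIONS
                .iter()
                .any(|known| ext.eq_ignore_ascii_case(known))
        })
}

fn main() {
    for name in ["beach.jpg", "scan.TIFF", "notes.txt", "Makefile"] {
        println!("{name}: {}", is_image_path(Path::new(name)));
    }
}
```

Files without an extension, or with a non-UTF-8 extension, are treated as non-images here and would be skipped.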

Testing

cargo test --all

20 tests covering:

  • Unit (13): Hamming distance, image path filtering, dHash determinism, duplicate grouping logic
  • Integration (7): Real image fixtures, empty directories, cropped/resized detection, non-image exclusion, subdirectory recursion, single-file edge case, CLI binary output

Allure Reports

Test results are published to Allure after each run:

cargo test --all 2>&1 | tee /tmp/test_output.txt
python3 /a0/usr/skills/allure-publish/publish.py deduper

Project Structure

deduper/
├── Cargo.toml          # Dependencies: image, sha2, walkdir, anyhow
├── src/
│   ├── lib.rs          # Core library: scan, hash, group
│   └── main.rs         # CLI entrypoint
├── tests/
│   └── image_phase.rs  # Integration tests
└── README.md

Dependencies

| Crate | Purpose |
|-------|---------|
| image | Image decoding and manipulation |
| sha2 | SHA-256 cryptographic hashing |
| walkdir | Recursive directory traversal |
| anyhow | Error handling |

License

MIT
