deduper
Content-based duplicate file detection CLI for Linux.
Finds exact and near-duplicate files using cryptographic hashing and perceptual fingerprinting. Designed for large media libraries where filename or size comparisons aren't enough.
Phases
| Phase | Status | Method |
|---|---|---|
| Image | ✅ Complete | SHA-256 + dHash perceptual hashing |
| Music | 🔲 Planned | Audio fingerprinting |
| Video | 🔲 Planned | Video fingerprinting |
Installation
From source
# Requires Rust 1.85+ (2024 edition)
git clone https://gitea.ingwaz.work/admin/deduper.git
cd deduper
cargo build --release
Binary lands at target/release/deduper.
Usage
deduper <folder> [hamming-threshold]
| Argument | Required | Default | Description |
|---|---|---|---|
| folder | yes | n/a | Root directory to scan (recursive) |
| hamming-threshold | no | 8 | Max Hamming distance for perceptual similarity (0–64) |
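As a rough illustration of the argument rules in the table, here is a minimal sketch of parsing the two positional arguments with only the standard library. The `Args` struct, the `parse_args` function, and the error messages are hypothetical and not taken from the project. Rejecting values above 64 is also an assumption based on the documented 0–64 range.

```rust
use std::path::PathBuf;

/// Parsed arguments (hypothetical struct, not the project's actual type).
#[derive(Debug, PartialEq)]
struct Args {
    folder: PathBuf,
    threshold: u32,
}

/// Default Hamming threshold, as documented in the table above.
const DEFAULT_THRESHOLD: u32 = 8;

/// Parses `<folder> [hamming-threshold]`; `args` excludes the program name.
fn parse_args(args: &[String]) -> Result<Args, String> {
    let folder = args
        .first()
        .ok_or_else(|| "usage: deduper <folder> [hamming-threshold]".to_string())?;

    let threshold = match args.get(1) {
        // The threshold is optional and falls back to the default.
        None => DEFAULT_THRESHOLD,
        Some(raw) => {
            let value: u32 = raw
                .parse()
                .map_err(|_| format!("invalid threshold: {raw}"))?;
            // A 64-bit hash can differ in at most 64 bits.
            if value > 64 {
                return Err(format!("threshold must be 0-64, got {value}"));
            }
            value
        }
    };

    Ok(Args {
        folder: PathBuf::from(folder),
        threshold,
    })
}

fn main() {
    let raw: Vec<String> = std::env::args().skip(1).collect();
    match parse_args(&raw) {
        Ok(args) => println!(
            "scanning {} with threshold {}",
            args.folder.display(),
            args.threshold
        ),
        Err(msg) => eprintln!("{msg}"),
    }
}
```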
Examples
Scan a photo library with default threshold:
deduper ~/Pictures
Strict matching (fewer false positives):
deduper ~/Pictures 4
Relaxed matching (catches more variants):
deduper ~/Pictures 12
Output
group 1 [exact]
/home/user/Pictures/vacation/beach.jpg
/home/user/Pictures/backup/beach.jpg
group 2 [similar]
/home/user/Pictures/vacation/sunset.jpg
/home/user/Pictures/edited/sunset_cropped.jpg
/home/user/Pictures/thumbs/sunset_small.jpg
- exact — Byte-identical files (SHA-256 match)
- similar — Perceptually similar images (dHash Hamming distance ≤ threshold)
How It Works
Image Phase
- Recursive scan — Walks the directory tree, filtering by image extension (jpg, jpeg, png, webp, bmp, gif, tif, tiff)
- SHA-256 hashing — Identifies byte-identical duplicates
- dHash fingerprinting — Resizes each image to 9×8 grayscale, compares adjacent pixels to produce a 64-bit perceptual hash
- Hamming distance — Measures bit differences between dHash values. Lower = more similar
- Union-Find grouping — Clusters similar images into groups, separating exact from perceptual matches
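The dHash and Hamming distance steps above can be sketched without any image library by starting from an already resized 9×8 grayscale grid. This is a minimal sketch under stated assumptions, not the project's code: the function names `dhash` and `hamming`, the bit order, and the "left pixel darker than right pixel sets the bit" convention are all hypothetical. Decoding and resizing, which the project does with the image crate, are left out.

```rust
/// Computes a 64-bit difference hash from a grid that is 9 pixels wide
/// and 8 pixels tall. Each row yields 8 bits, one per adjacent pixel pair.
/// Assumed convention: a bit is 1 when the left pixel is darker than the right.
fn dhash(pixels: &[[u8; 9]; 8]) -> u64 {
    let mut hash = 0u64;
    for row in pixels {
        for x in 0..8 {
            // Shift the hash left and append the comparison result as the low bit.
            hash = (hash << 1) | u64::from(row[x] < row[x + 1]);
        }
    }
    hash
}

/// Counts the bits that differ between two hashes.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

fn main() {
    // Every row brightens from left to right, so all 64 comparisons are true.
    let rising = [[0u8, 10, 20, 30, 40, 50, 60, 70, 80]; 8];

    // Darken one pixel in the first row so that one comparison flips.
    let mut edited = rising;
    edited[0][4] = 25;

    let distance = hamming(dhash(&rising), dhash(&edited));
    println!("distance = {distance}");
}
```

Because the hash depends only on brightness ordering between neighbors, a small local edit changes few bits, which is what makes a Hamming threshold meaningful.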
Threshold Guide
| Threshold | Behavior |
|---|---|
| 0 | Exact perceptual match only (identical visual content, different encoding) |
| 1–4 | Very conservative: catches resizes and minor compression artifacts |
| 5–8 | Recommended: catches resizes, crops, slight edits |
| 9–12 | Tolerant: may catch heavily edited versions, higher false positive risk |
| 13+ | Aggressive: will group loosely related images |
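To show how a threshold interacts with Union-Find grouping, the following sketch clusters a list of 64-bit hashes by pairwise Hamming distance. It assumes an all-pairs comparison and a plain union-find with path compression. The `UnionFind` type and the `group_by_threshold` function are hypothetical, and the project's actual grouping code may differ.

```rust
use std::collections::BTreeMap;

/// Minimal union-find (disjoint set) over indices 0..n.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    /// Returns the root of `x`, compressing the path along the way.
    fn find(&mut self, x: usize) -> usize {
        let mut root = x;
        while self.parent[root] != root {
            root = self.parent[root];
        }
        let mut cur = x;
        while self.parent[cur] != root {
            let next = self.parent[cur];
            self.parent[cur] = root;
            cur = next;
        }
        root
    }

    /// Merges the sets containing `a` and `b`.
    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[rb] = ra;
        }
    }
}

/// Groups hash indices whose Hamming distance is at most `threshold`.
/// Only groups with two or more members are returned.
fn group_by_threshold(hashes: &[u64], threshold: u32) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(hashes.len());
    for i in 0..hashes.len() {
        for j in (i + 1)..hashes.len() {
            if (hashes[i] ^ hashes[j]).count_ones() <= threshold {
                uf.union(i, j);
            }
        }
    }

    // Collect members under their root; BTreeMap keeps the output order stable.
    let mut buckets: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for i in 0..hashes.len() {
        let root = uf.find(i);
        buckets.entry(root).or_default().push(i);
    }
    buckets.into_values().filter(|g| g.len() > 1).collect()
}

fn main() {
    // Hashes 0 and 1 differ by 4 bits, 1 and 2 by 4 bits, 0 and 2 by 8 bits.
    let hashes = [0u64, 0b1111, 0b1111_1111, u64::MAX];
    println!("{:?}", group_by_threshold(&hashes, 4));
}
```

In this sketch, hashes 0 and 2 land in the same group at threshold 4 even though they are 8 bits apart, because hash 1 links them. Chaining of this kind is one reason higher thresholds can produce looser groups.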
Supported Formats
jpg · jpeg · png · webp · bmp · gif · tif · tiff
Non-image files are silently skipped.
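A filter matching the extension list above might look like the sketch below. The function name `is_image_path` is hypothetical, and case-insensitive matching (so that `.JPG` also counts) is an assumption that the README does not confirm.

```rust
use std::path::Path;

/// Extensions listed under Supported Formats.
const IMAGE_EXTENSIONS: [&str; 8] = [
    "jpg", "jpeg", "png", "webp", "bmp", "gif", "tif", "tiff",
];

/// Returns true when the path has one of the supported image extensions.
/// Paths with no extension, or a non-UTF-8 one, are treated as non-images.
fn is_image_path(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| {
            IMAGE_EXTENSIONS
                .iter()
                .any(|known| ext.eq_ignore_ascii_case(known))
        })
        .unwrap_or(false)
}

fn main() {
    for name in ["beach.jpg", "scan.TIFF", "notes.txt", "Makefile"] {
        println!("{name}: {}", is_image_path(Path::new(name)));
    }
}
```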
Testing
cargo test --all
20 tests covering:
- Unit (13): Hamming distance, image path filtering, dHash determinism, duplicate grouping logic
- Integration (7): Real image fixtures, empty directories, cropped/resized detection, non-image exclusion, subdirectory recursion, single-file edge case, CLI binary output
Allure Reports
Test results are published to Allure after each run:
cargo test --all 2>&1 | tee /tmp/test_output.txt
python3 /a0/usr/skills/allure-publish/publish.py deduper
Project Structure
deduper/
├── Cargo.toml # Dependencies: image, sha2, walkdir, anyhow
├── src/
│ ├── lib.rs # Core library: scan, hash, group
│ └── main.rs # CLI entrypoint
├── tests/
│ └── image_phase.rs # Integration tests
└── README.md
Dependencies
| Crate | Purpose |
|---|---|
| image | Image decoding and manipulation |
| sha2 | SHA-256 cryptographic hashing |
| walkdir | Recursive directory traversal |
| anyhow | Error handling |
License
MIT