# deduper Content-based duplicate file detection CLI for Linux. Finds exact and near-duplicate files using cryptographic hashing and perceptual fingerprinting. Designed for large media libraries where filename or size comparisons aren't enough. ## Phases | Phase | Status | Method | |-------|--------|--------| | **Image** | โœ… Complete | SHA-256 + dHash perceptual hashing | | **Music** | ๐Ÿ”ฒ Planned | Audio fingerprinting | | **Video** | ๐Ÿ”ฒ Planned | Video fingerprinting | ## Installation ### From source ```bash # Requires Rust 1.85+ (2024 edition) git clone https://gitea.ingwaz.work/admin/deduper.git cd deduper cargo build --release ``` Binary lands at `target/release/deduper`. ## Usage ```bash deduper [hamming-threshold] ``` | Argument | Required | Default | Description | |----------|----------|---------|-------------| | `folder` | yes | โ€” | Root directory to scan (recursive) | | `hamming-threshold` | no | `8` | Max Hamming distance for perceptual similarity (0โ€“64) | ### Examples **Scan a photo library with default threshold:** ```bash deduper ~/Pictures ``` **Strict matching (fewer false positives):** ```bash deduper ~/Pictures 4 ``` **Relaxed matching (catches more variants):** ```bash deduper ~/Pictures 12 ``` ### Output ``` group 1 [exact] /home/user/Pictures/vacation/beach.jpg /home/user/Pictures/backup/beach.jpg group 2 [similar] /home/user/Pictures/vacation/sunset.jpg /home/user/Pictures/edited/sunset_cropped.jpg /home/user/Pictures/thumbs/sunset_small.jpg ``` - **exact** โ€” Byte-identical files (SHA-256 match) - **similar** โ€” Perceptually similar images (dHash Hamming distance โ‰ค threshold) ## How It Works ### Image Phase 1. **Recursive scan** โ€” Walks the directory tree, filtering by image extension (`jpg`, `jpeg`, `png`, `webp`, `bmp`, `gif`, `tif`, `tiff`) 2. **SHA-256 hashing** โ€” Identifies byte-identical duplicates 3. **dHash fingerprinting** โ€” Resizes each image to 9ร—8 grayscale, compares adjacent pixels to produce a 64-bit perceptual hash 4. **Hamming distance** โ€” Measures bit differences between dHash values. Lower = more similar 5. **Union-Find grouping** โ€” Clusters similar images into groups, separating exact from perceptual matches ### Threshold Guide | Threshold | Behavior | |-----------|----------| | `0` | Exact perceptual match only (identical visual content, different encoding) | | `1โ€“4` | Very conservative โ€” catches resizes and minor compression artifacts | | `5โ€“8` | **Recommended** โ€” catches resizes, crops, slight edits | | `9โ€“12` | Tolerant โ€” may catch heavily edited versions, higher false positive risk | | `13+` | Aggressive โ€” will group loosely related images | ## Supported Formats `jpg` ยท `jpeg` ยท `png` ยท `webp` ยท `bmp` ยท `gif` ยท `tif` ยท `tiff` Non-image files are silently skipped. ## Testing ```bash cargo test --all ``` **20 tests** covering: - **Unit (13):** Hamming distance, image path filtering, dHash determinism, duplicate grouping logic - **Integration (7):** Real image fixtures, empty directories, cropped/resized detection, non-image exclusion, subdirectory recursion, single-file edge case, CLI binary output ### Allure Reports Test results publish to Allure after each run: ```bash cargo test --all 2>&1 | tee /tmp/test_output.txt python3 /a0/usr/skills/allure-publish/publish.py deduper ``` ## Project Structure ``` deduper/ โ”œโ”€โ”€ Cargo.toml # Dependencies: image, sha2, walkdir, anyhow โ”œโ”€โ”€ src/ โ”‚ โ”œโ”€โ”€ lib.rs # Core library: scan, hash, group โ”‚ โ””โ”€โ”€ main.rs # CLI entrypoint โ”œโ”€โ”€ tests/ โ”‚ โ””โ”€โ”€ image_phase.rs # Integration tests โ””โ”€โ”€ README.md ``` ## Dependencies | Crate | Purpose | |-------|---------| | `image` | Image decoding and manipulation | | `sha2` | SHA-256 cryptographic hashing | | `walkdir` | Recursive directory traversal | | `anyhow` | Error handling | ## License MIT