diff --git a/README.md b/README.md
new file mode 100644
index 0000000..3b10cb7
--- /dev/null
+++ b/README.md
@@ -0,0 +1,145 @@
+# deduper
+
+Content-based duplicate file detection CLI for Linux.
+
+Finds exact and near-duplicate files using cryptographic hashing and perceptual fingerprinting. Designed for large media libraries where filename or size comparisons aren't enough.
+
+## Phases
+
+| Phase | Status | Method |
+|-------|--------|--------|
+| **Image** | ✅ Complete | SHA-256 + dHash perceptual hashing |
+| **Music** | 🔲 Planned | Audio fingerprinting |
+| **Video** | 🔲 Planned | Video fingerprinting |
+
+## Installation
+
+### From source
+
+```bash
+# Requires Rust 1.85+ (2024 edition)
+git clone https://gitea.ingwaz.work/admin/deduper.git
+cd deduper
+cargo build --release
+```
+
+Binary lands at `target/release/deduper`.
+
+## Usage
+
+```bash
+deduper <folder> [hamming-threshold]
+```
+
+| Argument | Required | Default | Description |
+|----------|----------|---------|-------------|
+| `folder` | yes | (none) | Root directory to scan (recursive) |
+| `hamming-threshold` | no | `8` | Max Hamming distance for perceptual similarity (0–64) |
+
+### Examples
+
+**Scan a photo library with default threshold:**
+
+```bash
+deduper ~/Pictures
+```
+
+**Strict matching (fewer false positives):**
+
+```bash
+deduper ~/Pictures 4
+```
+
+**Relaxed matching (catches more variants):**
+
+```bash
+deduper ~/Pictures 12
+```
+
+### Output
+
+```
+group 1 [exact]
+  /home/user/Pictures/vacation/beach.jpg
+  /home/user/Pictures/backup/beach.jpg
+
+group 2 [similar]
+  /home/user/Pictures/vacation/sunset.jpg
+  /home/user/Pictures/edited/sunset_cropped.jpg
+  /home/user/Pictures/thumbs/sunset_small.jpg
+```
+
+- **exact**: Byte-identical files (SHA-256 match)
+- **similar**: Perceptually similar images (dHash Hamming distance ≤ threshold)
+
+## How It Works
+
+### Image Phase
+
+1. **Recursive scan**: Walks the directory tree, filtering by image extension (`jpg`, `jpeg`, `png`, `webp`, `bmp`, `gif`, `tif`, `tiff`)
+2. **SHA-256 hashing**: Identifies byte-identical duplicates
+3. **dHash fingerprinting**: Resizes each image to 9×8 grayscale, compares adjacent pixels to produce a 64-bit perceptual hash
+4. **Hamming distance**: Measures bit differences between dHash values. Lower means more similar
+5. **Union-Find grouping**: Clusters similar images into groups, separating exact from perceptual matches
+
+### Threshold Guide
+
+| Threshold | Behavior |
+|-----------|----------|
+| `0` | Exact perceptual match only (identical visual content, different encoding) |
+| `1–4` | Very conservative: catches resizes and minor compression artifacts |
+| `5–8` | **Recommended**: catches resizes, crops, slight edits |
+| `9–12` | Tolerant: may catch heavily edited versions, higher false positive risk |
+| `13+` | Aggressive: will group loosely related images |
+
+## Supported Formats
+
+`jpg` · `jpeg` · `png` · `webp` · `bmp` · `gif` · `tif` · `tiff`
+
+Non-image files are silently skipped.
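Steps 3 to 5 of the Image Phase (dHash fingerprinting, Hamming distance, Union-Find grouping) can be sketched with the Rust standard library alone. This is a minimal illustration under stated assumptions, not the code in `src/lib.rs`: it assumes each image has already been decoded and resized to a 9×8 grayscale grid, and the names `dhash`, `hamming`, `UnionFind`, and `group_similar` are hypothetical. The bit order and the comparison direction (left pixel brighter than right) are also assumptions.

```rust
use std::collections::BTreeMap;

/// Assumed input: 8 rows of 9 grayscale pixels (the "9x8" grid).
/// Each row yields 8 left-vs-right comparisons, so 8 * 8 = 64 bits.
fn dhash(gray: &[[u8; 9]; 8]) -> u64 {
    let mut hash = 0u64;
    for row in gray {
        for x in 0..8 {
            // Assumption: a bit is set when the left pixel is brighter.
            hash = (hash << 1) | u64::from(row[x] > row[x + 1]);
        }
    }
    hash
}

/// Number of differing bits between two 64-bit hashes (0 to 64).
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Minimal union-find (disjoint set) with path compression.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        let p = self.parent[x];
        if p != x {
            let root = self.find(p);
            self.parent[x] = root; // path compression
        }
        self.parent[x]
    }

    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[rb] = ra;
        }
    }
}

/// Groups indices whose hashes are within `threshold` bits of each other.
/// Pairwise comparison is O(n^2); groups with a single member are dropped.
fn group_similar(hashes: &[u64], threshold: u32) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(hashes.len());
    for i in 0..hashes.len() {
        for j in (i + 1)..hashes.len() {
            if hamming(hashes[i], hashes[j]) <= threshold {
                uf.union(i, j);
            }
        }
    }
    let mut groups: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for i in 0..hashes.len() {
        let root = uf.find(i);
        groups.entry(root).or_default().push(i);
    }
    groups.into_values().filter(|g| g.len() > 1).collect()
}
```

One property of this style of grouping is worth keeping in mind when choosing a threshold: union-find merges transitively, so if A is within the threshold of B and B is within the threshold of C, all three land in one group even when A and C are further apart than the threshold.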
+
+## Testing
+
+```bash
+cargo test --all
+```
+
+**20 tests** covering:
+
+- **Unit (13):** Hamming distance, image path filtering, dHash determinism, duplicate grouping logic
+- **Integration (7):** Real image fixtures, empty directories, cropped/resized detection, non-image exclusion, subdirectory recursion, single-file edge case, CLI binary output
+
+### Allure Reports
+
+Test results publish to Allure after each run:
+
+```bash
+cargo test --all 2>&1 | tee /tmp/test_output.txt
+python3 /a0/usr/skills/allure-publish/publish.py deduper
+```
+
+## Project Structure
+
+```
+deduper/
+├── Cargo.toml          # Dependencies: image, sha2, walkdir, anyhow
+├── src/
+│   ├── lib.rs          # Core library: scan, hash, group
+│   └── main.rs         # CLI entrypoint
+├── tests/
+│   └── image_phase.rs  # Integration tests
+└── README.md
+```
+
+## Dependencies
+
+| Crate | Purpose |
+|-------|---------|
+| `image` | Image decoding and manipulation |
+| `sha2` | SHA-256 cryptographic hashing |
+| `walkdir` | Recursive directory traversal |
+| `anyhow` | Error handling |
+
+## License
+
+MIT