# deduper
Content-based duplicate file detection CLI for Linux.
Finds exact and near-duplicate files using cryptographic hashing and perceptual fingerprinting. Designed for large media libraries where filename or size comparisons aren't enough.
## Phases
| Phase | Status | Method |
|-------|--------|--------|
| **Image** | ✅ Complete | SHA-256 + dHash perceptual hashing |
| **Music** | 🔲 Planned | Audio fingerprinting |
| **Video** | 🔲 Planned | Video fingerprinting |
## Installation
### From source
```bash
# Requires Rust 1.85+ (2024 edition)
git clone https://gitea.ingwaz.work/admin/deduper.git
cd deduper
cargo build --release
```
Binary lands at `target/release/deduper`.
## Usage
```bash
deduper <folder> [hamming-threshold]
```
| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `folder` | yes | — | Root directory to scan (recursive) |
| `hamming-threshold` | no | `8` | Max Hamming distance for perceptual similarity (0-64) |
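
As an illustration of the argument table, here is a minimal sketch of how the two positional arguments could be parsed using only the standard library. The `Options` struct, `parse_args` function, and the choice to reject thresholds above 64 are assumptions made for this example; the real `main.rs` may handle them differently.

```rust
use std::path::PathBuf;

/// Parsed command-line options (hypothetical struct, not taken from the project source).
#[derive(Debug, PartialEq)]
pub struct Options {
    pub folder: PathBuf,
    pub threshold: u32,
}

/// Default threshold from the argument table.
pub const DEFAULT_THRESHOLD: u32 = 8;

/// Parses `<folder> [hamming-threshold]` from the arguments after the program name.
pub fn parse_args<I>(args: I) -> Result<Options, String>
where
    I: IntoIterator<Item = String>,
{
    let mut args = args.into_iter();

    // The folder is required.
    let folder = args
        .next()
        .map(PathBuf::from)
        .ok_or_else(|| "usage: deduper <folder> [hamming-threshold]".to_string())?;

    // The threshold is optional and falls back to the documented default.
    let threshold = match args.next() {
        None => DEFAULT_THRESHOLD,
        Some(raw) => raw
            .parse::<u32>()
            .map_err(|e| format!("invalid threshold {raw:?}: {e}"))?,
    };

    // Assumption: values outside the documented 0-64 range are rejected.
    if threshold > 64 {
        return Err(format!("threshold {threshold} is out of range (0-64)"));
    }

    Ok(Options { folder, threshold })
}
```

A `main` function would typically call this as `parse_args(std::env::args().skip(1))` and print the error message on failure.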
### Examples
**Scan a photo library with default threshold:**
```bash
deduper ~/Pictures
```
**Strict matching (fewer false positives):**
```bash
deduper ~/Pictures 4
```
**Relaxed matching (catches more variants):**
```bash
deduper ~/Pictures 12
```
### Output
```
group 1 [exact]
/home/user/Pictures/vacation/beach.jpg
/home/user/Pictures/backup/beach.jpg
group 2 [similar]
/home/user/Pictures/vacation/sunset.jpg
/home/user/Pictures/edited/sunset_cropped.jpg
/home/user/Pictures/thumbs/sunset_small.jpg
```
- **exact** — Byte-identical files (SHA-256 match)
- **similar** — Perceptually similar images (dHash Hamming distance ≤ threshold)
## How It Works
### Image Phase
1. **Recursive scan** — Walks the directory tree, filtering by image extension (`jpg`, `jpeg`, `png`, `webp`, `bmp`, `gif`, `tif`, `tiff`)
2. **SHA-256 hashing** — Identifies byte-identical duplicates
3. **dHash fingerprinting** — Resizes each image to 9×8 grayscale, compares adjacent pixels to produce a 64-bit perceptual hash
4. **Hamming distance** — Measures bit differences between dHash values. Lower = more similar
5. **Union-Find grouping** — Clusters similar images into groups, separating exact from perceptual matches
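
Steps 3 and 4 can be sketched without the `image` crate by starting from a grid that has already been downscaled to 9×8 grayscale. The function names, the "left pixel brighter than right pixel" convention, and the bit packing order below are assumptions for illustration; the project's `lib.rs` may use a different convention.

```rust
/// Width and height of the downscaled grayscale grid (9 columns, 8 rows).
const GRID_W: usize = 9;
const GRID_H: usize = 8;

/// Computes a 64-bit difference hash from an already downscaled 9x8 grayscale grid.
/// Assumption: a bit is set when a pixel is brighter than its right-hand neighbour,
/// and bits are packed row by row, first comparison in the most significant bit.
pub fn dhash(grid: &[[u8; GRID_W]; GRID_H]) -> u64 {
    let mut hash = 0u64;
    for row in grid {
        // 9 pixels per row give 8 adjacent comparisons, so 8 rows give 64 bits.
        for x in 0..GRID_W - 1 {
            hash = (hash << 1) | u64::from(row[x] > row[x + 1]);
        }
    }
    hash
}

/// Number of differing bits between two hashes (0 = identical, 64 = all bits differ).
pub fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Because the hash only records whether brightness rises or falls between neighbours, uniform brightness shifts and mild recompression tend to leave most bits unchanged, which is why a small Hamming distance suggests visually similar images.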
### Threshold Guide
| Threshold | Behavior |
|-----------|----------|
| `0` | Exact perceptual match only (identical visual content, different encoding) |
| `1-4` | Very conservative: catches resizes and minor compression artifacts |
| `5-8` | **Recommended**: catches resizes, crops, slight edits |
| `9-12` | Tolerant: may catch heavily edited versions, higher false positive risk |
| `13+` | Aggressive: will group loosely related images |
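
To show how the threshold interacts with Union-Find grouping, here is a minimal sketch that merges every pair of hashes whose Hamming distance is within the threshold. The `UnionFind` type and `group_similar` function are hypothetical names for this example, and the all-pairs comparison is an assumption rather than a description of the actual implementation.

```rust
use std::collections::BTreeMap;

/// Minimal disjoint-set (union-find) structure.
pub struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    pub fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    /// Finds the representative of `x`, shortening the path along the way.
    pub fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            // Path halving: point x at its grandparent, then continue from there.
            let grand = self.parent[self.parent[x]];
            self.parent[x] = grand;
            x = grand;
        }
        x
    }

    /// Merges the sets containing `a` and `b`; the smaller root index wins.
    pub fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            let (lo, hi) = if ra < rb { (ra, rb) } else { (rb, ra) };
            self.parent[hi] = lo;
        }
    }
}

/// Groups indices of hashes that are linked by pairs within `threshold` bits.
/// Singletons are dropped; groups are ordered by their smallest member.
pub fn group_similar(hashes: &[u64], threshold: u32) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(hashes.len());
    for i in 0..hashes.len() {
        for j in i + 1..hashes.len() {
            if (hashes[i] ^ hashes[j]).count_ones() <= threshold {
                uf.union(i, j);
            }
        }
    }

    // Collect members under their root; iterating in index order keeps members sorted.
    let mut by_root: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for i in 0..hashes.len() {
        by_root.entry(uf.find(i)).or_default().push(i);
    }
    by_root.into_values().filter(|g| g.len() > 1).collect()
}
```

Grouping of this kind is transitive: if A is within the threshold of B, and B of C, all three land in one group even when A and C are further apart than the threshold. That chaining is one plausible reason higher thresholds can pull loosely related images together.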
## Supported Formats
`jpg` · `jpeg` · `png` · `webp` · `bmp` · `gif` · `tif` · `tiff`
Non-image files are silently skipped.
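
A minimal sketch of the extension filter, using only the standard library. The `is_image_path` name and the case-insensitive comparison are assumptions for this example; the README does not state whether `photo.JPG` is accepted.

```rust
use std::path::Path;

/// Extensions listed under "Supported Formats".
const IMAGE_EXTENSIONS: [&str; 8] = ["jpg", "jpeg", "png", "webp", "bmp", "gif", "tif", "tiff"];

/// Returns true when the path has one of the supported image extensions.
/// Assumption: the comparison ignores ASCII case, so `.JPG` matches `jpg`.
pub fn is_image_path(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| IMAGE_EXTENSIONS.iter().any(|known| ext.eq_ignore_ascii_case(known)))
        .unwrap_or(false)
}
```

Paths without an extension, or with an extension that is not valid UTF-8, fall through to `false` and would be skipped like any other non-image file.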
## Testing
```bash
cargo test --all
```
**20 tests** covering:
- **Unit (13):** Hamming distance, image path filtering, dHash determinism, duplicate grouping logic
- **Integration (7):** Real image fixtures, empty directories, cropped/resized detection, non-image exclusion, subdirectory recursion, single-file edge case, CLI binary output
### Allure Reports
To publish test results to Allure after a run, capture the test output and invoke the publish script:
```bash
cargo test --all 2>&1 | tee /tmp/test_output.txt
python3 /a0/usr/skills/allure-publish/publish.py deduper
```
## Project Structure
```
deduper/
├── Cargo.toml              # Dependencies: image, sha2, walkdir, anyhow
├── src/
│   ├── lib.rs              # Core library: scan, hash, group
│   └── main.rs             # CLI entrypoint
├── tests/
│   └── image_phase.rs      # Integration tests
└── README.md
```
## Dependencies
| Crate | Purpose |
|-------|---------|
| `image` | Image decoding and manipulation |
| `sha2` | SHA-256 cryptographic hashing |
| `walkdir` | Recursive directory traversal |
| `anyhow` | Error handling |
## License
MIT