146 lines
3.9 KiB
Markdown
146 lines
3.9 KiB
Markdown
# deduper
|
||
|
||
Content-based duplicate file detection CLI for Linux.
|
||
|
||
Finds exact and near-duplicate files using cryptographic hashing and perceptual fingerprinting. Designed for large media libraries where filename or size comparisons aren't enough.
|
||
|
||
## Phases
|
||
|
||
| Phase | Status | Method |
|
||
|-------|--------|--------|
|
||
| **Image** | ✅ Complete | SHA-256 + dHash perceptual hashing |
|
||
| **Music** | 🔲 Planned | Audio fingerprinting |
|
||
| **Video** | 🔲 Planned | Video fingerprinting |
|
||
|
||
## Installation
|
||
|
||
### From source
|
||
|
||
```bash
|
||
# Requires Rust 1.85+ (2024 edition)
|
||
git clone https://gitea.ingwaz.work/admin/deduper.git
|
||
cd deduper
|
||
cargo build --release
|
||
```
|
||
|
||
Binary lands at `target/release/deduper`.
|
||
|
||
## Usage
|
||
|
||
```bash
|
||
deduper <folder> [hamming-threshold]
|
||
```
|
||
|
||
| Argument | Required | Default | Description |
|
||
|----------|----------|---------|-------------|
|
||
| `folder` | yes | — | Root directory to scan (recursive) |
|
||
| `hamming-threshold` | no | `8` | Max Hamming distance for perceptual similarity (0–64) |
|
||
|
||
### Examples
|
||
|
||
**Scan a photo library with default threshold:**
|
||
|
||
```bash
|
||
deduper ~/Pictures
|
||
```
|
||
|
||
**Strict matching (fewer false positives):**
|
||
|
||
```bash
|
||
deduper ~/Pictures 4
|
||
```
|
||
|
||
**Relaxed matching (catches more variants):**
|
||
|
||
```bash
|
||
deduper ~/Pictures 12
|
||
```
|
||
|
||
### Output
|
||
|
||
```
|
||
group 1 [exact]
|
||
/home/user/Pictures/vacation/beach.jpg
|
||
/home/user/Pictures/backup/beach.jpg
|
||
|
||
group 2 [similar]
|
||
/home/user/Pictures/vacation/sunset.jpg
|
||
/home/user/Pictures/edited/sunset_cropped.jpg
|
||
/home/user/Pictures/thumbs/sunset_small.jpg
|
||
```
|
||
|
||
- **exact** — Byte-identical files (SHA-256 match)
|
||
- **similar** — Perceptually similar images (dHash Hamming distance ≤ threshold)
|
||
|
||
## How It Works
|
||
|
||
### Image Phase
|
||
|
||
1. **Recursive scan** — Walks the directory tree, filtering by image extension (`jpg`, `jpeg`, `png`, `webp`, `bmp`, `gif`, `tif`, `tiff`)
|
||
2. **SHA-256 hashing** — Identifies byte-identical duplicates
|
||
3. **dHash fingerprinting** — Resizes each image to 9×8 grayscale, compares adjacent pixels to produce a 64-bit perceptual hash
|
||
4. **Hamming distance** — Measures bit differences between dHash values. Lower = more similar
|
||
5. **Union-Find grouping** — Clusters similar images into groups, separating exact from perceptual matches
|
||
|
||
### Threshold Guide
|
||
|
||
| Threshold | Behavior |
|
||
|-----------|----------|
|
||
| `0` | Exact perceptual match only (identical visual content, different encoding) |
|
||
| `1–4` | Very conservative — catches resizes and minor compression artifacts |
|
||
| `5–8` | **Recommended** — catches resizes, crops, slight edits |
|
||
| `9–12` | Tolerant — may catch heavily edited versions, higher false positive risk |
|
||
| `13+` | Aggressive — will group loosely related images |
|
||
|
||
## Supported Formats
|
||
|
||
`jpg` · `jpeg` · `png` · `webp` · `bmp` · `gif` · `tif` · `tiff`
|
||
|
||
Non-image files are silently skipped.
|
||
|
||
## Testing
|
||
|
||
```bash
|
||
cargo test --all
|
||
```
|
||
|
||
**20 tests** covering:
|
||
|
||
- **Unit (13):** Hamming distance, image path filtering, dHash determinism, duplicate grouping logic
|
||
- **Integration (7):** Real image fixtures, empty directories, cropped/resized detection, non-image exclusion, subdirectory recursion, single-file edge case, CLI binary output
|
||
|
||
### Allure Reports
|
||
|
||
Test results publish to Allure after each run:
|
||
|
||
```bash
|
||
cargo test --all 2>&1 | tee /tmp/test_output.txt
|
||
python3 /a0/usr/skills/allure-publish/publish.py deduper
|
||
```
|
||
|
||
## Project Structure
|
||
|
||
```
|
||
deduper/
|
||
├── Cargo.toml # Dependencies: image, sha2, walkdir, anyhow
|
||
├── src/
|
||
│ ├── lib.rs # Core library: scan, hash, group
|
||
│ └── main.rs # CLI entrypoint
|
||
├── tests/
|
||
│ └── image_phase.rs # Integration tests
|
||
└── README.md
|
||
```
|
||
|
||
## Dependencies
|
||
|
||
| Crate | Purpose |
|
||
|-------|---------|
|
||
| `image` | Image decoding and manipulation |
|
||
| `sha2` | SHA-256 cryptographic hashing |
|
||
| `walkdir` | Recursive directory traversal |
|
||
| `anyhow` | Error handling |
|
||
|
||
## License
|
||
|
||
MIT
|