docs: add README with usage instructions, threshold guide, project structure

This commit is contained in:
admin
2026-04-27 23:38:45 +00:00
parent e1f8201b5c
commit 226a5e7e8f

145
README.md Normal file
View File

@@ -0,0 +1,145 @@
# deduper
Content-based duplicate file detection CLI for Linux.
Finds exact and near-duplicate files using cryptographic hashing and perceptual fingerprinting. Designed for large media libraries where filename or size comparisons aren't enough.
## Phases
| Phase | Status | Method |
|-------|--------|--------|
| **Image** | ✅ Complete | SHA-256 + dHash perceptual hashing |
| **Music** | 🔲 Planned | Audio fingerprinting |
| **Video** | 🔲 Planned | Video fingerprinting |
## Installation
### From source
```bash
# Requires Rust 1.85+ (2024 edition)
git clone https://gitea.ingwaz.work/admin/deduper.git
cd deduper
cargo build --release
```
Binary lands at `target/release/deduper`.
## Usage
```bash
deduper <folder> [hamming-threshold]
```
| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `folder` | yes | — | Root directory to scan (recursive) |
| `hamming-threshold` | no | `8` | Max Hamming distance for perceptual similarity (064) |
### Examples
**Scan a photo library with default threshold:**
```bash
deduper ~/Pictures
```
**Strict matching (fewer false positives):**
```bash
deduper ~/Pictures 4
```
**Relaxed matching (catches more variants):**
```bash
deduper ~/Pictures 12
```
### Output
```
group 1 [exact]
/home/user/Pictures/vacation/beach.jpg
/home/user/Pictures/backup/beach.jpg
group 2 [similar]
/home/user/Pictures/vacation/sunset.jpg
/home/user/Pictures/edited/sunset_cropped.jpg
/home/user/Pictures/thumbs/sunset_small.jpg
```
- **exact** — Byte-identical files (SHA-256 match)
- **similar** — Perceptually similar images (dHash Hamming distance ≤ threshold)
## How It Works
### Image Phase
1. **Recursive scan** — Walks the directory tree, filtering by image extension (`jpg`, `jpeg`, `png`, `webp`, `bmp`, `gif`, `tif`, `tiff`)
2. **SHA-256 hashing** — Identifies byte-identical duplicates
3. **dHash fingerprinting** — Resizes each image to 9×8 grayscale, compares adjacent pixels to produce a 64-bit perceptual hash
4. **Hamming distance** — Measures bit differences between dHash values. Lower = more similar
5. **Union-Find grouping** — Clusters similar images into groups, separating exact from perceptual matches
### Threshold Guide
| Threshold | Behavior |
|-----------|----------|
| `0` | Exact perceptual match only (identical visual content, different encoding) |
| `14` | Very conservative — catches resizes and minor compression artifacts |
| `58` | **Recommended** — catches resizes, crops, slight edits |
| `912` | Tolerant — may catch heavily edited versions, higher false positive risk |
| `13+` | Aggressive — will group loosely related images |
## Supported Formats
`jpg` · `jpeg` · `png` · `webp` · `bmp` · `gif` · `tif` · `tiff`
Non-image files are silently skipped.
## Testing
```bash
cargo test --all
```
**20 tests** covering:
- **Unit (13):** Hamming distance, image path filtering, dHash determinism, duplicate grouping logic
- **Integration (7):** Real image fixtures, empty directories, cropped/resized detection, non-image exclusion, subdirectory recursion, single-file edge case, CLI binary output
### Allure Reports
Test results publish to Allure after each run:
```bash
cargo test --all 2>&1 | tee /tmp/test_output.txt
python3 /a0/usr/skills/allure-publish/publish.py deduper
```
## Project Structure
```
deduper/
├── Cargo.toml # Dependencies: image, sha2, walkdir, anyhow
├── src/
│ ├── lib.rs # Core library: scan, hash, group
│ └── main.rs # CLI entrypoint
├── tests/
│ └── image_phase.rs # Integration tests
└── README.md
```
## Dependencies
| Crate | Purpose |
|-------|---------|
| `image` | Image decoding and manipulation |
| `sha2` | SHA-256 cryptographic hashing |
| `walkdir` | Recursive directory traversal |
| `anyhow` | Error handling |
## License
MIT