docs: add README with usage instructions, threshold guide, project structure
145
README.md
Normal file
@@ -0,0 +1,145 @@
# deduper

Content-based duplicate file detection CLI for Linux.

Finds exact and near-duplicate files using cryptographic hashing and perceptual fingerprinting. Designed for large media libraries where filename or size comparisons aren't enough.

## Phases

| Phase | Status | Method |
|-------|--------|--------|
| **Image** | ✅ Complete | SHA-256 + dHash perceptual hashing |
| **Music** | 🔲 Planned | Audio fingerprinting |
| **Video** | 🔲 Planned | Video fingerprinting |

## Installation

### From source

```bash
# Requires Rust 1.85+ (2024 edition)
git clone https://gitea.ingwaz.work/admin/deduper.git
cd deduper
cargo build --release
```

The binary is written to `target/release/deduper`.

## Usage

```bash
deduper <folder> [hamming-threshold]
```

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `folder` | yes | — | Root directory to scan (recursive) |
| `hamming-threshold` | no | `8` | Max Hamming distance for perceptual similarity (0–64) |
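
The argument handling described in the table could look roughly like the sketch below. It is not taken from `src/main.rs`: `parse_args`, `DEFAULT_THRESHOLD`, and the error messages are hypothetical names chosen for illustration, and rejecting values above 64 is an assumption based on the documented 0–64 range.

```rust
use std::path::PathBuf;

/// Default maximum Hamming distance, taken from the argument table.
const DEFAULT_THRESHOLD: u32 = 8;

/// Hypothetical helper: parses `<folder> [hamming-threshold]` from the
/// arguments that follow the program name.
fn parse_args(args: &[String]) -> Result<(PathBuf, u32), String> {
    // The folder is the only required argument.
    let folder = args
        .first()
        .ok_or_else(|| "usage: deduper <folder> [hamming-threshold]".to_string())?;

    // Fall back to the documented default when no threshold is given.
    let threshold = match args.get(1) {
        Some(raw) => raw
            .parse::<u32>()
            .map_err(|e| format!("invalid threshold {raw:?}: {e}"))?,
        None => DEFAULT_THRESHOLD,
    };

    // Two 64-bit hashes can differ in at most 64 bit positions.
    if threshold > 64 {
        return Err(format!("threshold {threshold} is outside 0-64"));
    }

    Ok((PathBuf::from(folder), threshold))
}
```

A caller would typically pass `std::env::args().skip(1).collect::<Vec<_>>()` and print the error string before exiting with a non-zero status.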

### Examples

**Scan a photo library with default threshold:**

```bash
deduper ~/Pictures
```

**Strict matching (fewer false positives):**

```bash
deduper ~/Pictures 4
```

**Relaxed matching (catches more variants):**

```bash
deduper ~/Pictures 12
```

### Output

```
group 1 [exact]
/home/user/Pictures/vacation/beach.jpg
/home/user/Pictures/backup/beach.jpg

group 2 [similar]
/home/user/Pictures/vacation/sunset.jpg
/home/user/Pictures/edited/sunset_cropped.jpg
/home/user/Pictures/thumbs/sunset_small.jpg
```

- **exact** — Byte-identical files (SHA-256 match)
- **similar** — Perceptually similar images (dHash Hamming distance ≤ threshold)

## How It Works

### Image Phase

1. **Recursive scan** — Walks the directory tree, filtering by image extension (`jpg`, `jpeg`, `png`, `webp`, `bmp`, `gif`, `tif`, `tiff`)
2. **SHA-256 hashing** — Identifies byte-identical duplicates
3. **dHash fingerprinting** — Resizes each image to 9×8 grayscale and compares horizontally adjacent pixels (8 comparisons per row × 8 rows) to produce a 64-bit perceptual hash
4. **Hamming distance** — Counts the differing bits between two dHash values; a lower distance means more similar images
5. **Union-Find grouping** — Clusters similar images into groups, separating exact from perceptual matches
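
Steps 3 and 4 can be illustrated with a small sketch that works on an already resized grayscale grid, so it needs no image decoding. The function names are hypothetical, and the comparison direction (left pixel brighter than its right neighbour) and the bit order are assumptions; the actual code in `src/lib.rs` may differ.

```rust
/// Computes a 64-bit dHash from a grayscale grid of 8 rows by 9 columns.
/// Each row contributes 8 bits: a bit is set when a pixel is brighter
/// than its right-hand neighbour (assumed direction and bit order).
fn dhash_from_grid(grid: &[[u8; 9]; 8]) -> u64 {
    let mut hash = 0u64;
    for row in grid {
        for x in 0..8 {
            // Make room for the next bit, then set it if the gradient falls.
            hash <<= 1;
            if row[x] > row[x + 1] {
                hash |= 1;
            }
        }
    }
    hash
}

/// Number of bit positions in which two hashes differ.
fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Because only the relative brightness of neighbouring pixels is recorded, uniform brightness changes and mild recompression tend to leave most bits unchanged, which is why a small Hamming distance is a useful similarity signal.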

### Threshold Guide

| Threshold | Behavior |
|-----------|----------|
| `0` | Exact perceptual match only (identical visual content, different encoding) |
| `1–4` | Very conservative — catches resizes and minor compression artifacts |
| `5–8` | **Recommended** — catches resizes, crops, slight edits |
| `9–12` | Tolerant — may catch heavily edited versions, higher false positive risk |
| `13+` | Aggressive — will group loosely related images |
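
To see how the threshold interacts with Union-Find grouping, here is a minimal sketch that clusters 64-bit hashes whose pairwise Hamming distance is within the threshold. `UnionFind` and `group_similar` are hypothetical names, the all-pairs comparison is the simplest possible strategy, and none of this is claimed to match the project's implementation.

```rust
use std::collections::BTreeMap;

/// Minimal union-find (disjoint set) over indices 0..n.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    /// Returns the representative of `x`, compressing the path on the way.
    fn find(&mut self, x: usize) -> usize {
        let p = self.parent[x];
        if p != x {
            let root = self.find(p);
            self.parent[x] = root;
        }
        self.parent[x]
    }

    /// Merges the sets containing `a` and `b`.
    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[rb] = ra;
        }
    }
}

/// Hypothetical helper: returns groups (as index lists) of hashes linked
/// by a Hamming distance <= `threshold`. Singletons are omitted.
fn group_similar(hashes: &[u64], threshold: u32) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(hashes.len());
    for i in 0..hashes.len() {
        for j in (i + 1)..hashes.len() {
            if (hashes[i] ^ hashes[j]).count_ones() <= threshold {
                uf.union(i, j);
            }
        }
    }

    // Collect members under their representative, in index order.
    let mut by_root: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for i in 0..hashes.len() {
        let root = uf.find(i);
        by_root.entry(root).or_default().push(i);
    }
    by_root.into_values().filter(|g| g.len() > 1).collect()
}
```

Grouping of this kind is transitive: if A is within the threshold of B and B is within the threshold of C, all three land in one group even when A and C are further apart than the threshold. That chaining effect is one reason higher thresholds can produce loosely related groups.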

## Supported Formats

`jpg` · `jpeg` · `png` · `webp` · `bmp` · `gif` · `tif` · `tiff`

Non-image files are silently skipped.
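
A filter over this extension list might be sketched as follows. `is_image_path` and `IMAGE_EXTENSIONS` are hypothetical names, and matching extensions case-insensitively is an assumption rather than documented behavior.

```rust
use std::path::Path;

/// Extensions listed under "Supported Formats".
const IMAGE_EXTENSIONS: [&str; 8] =
    ["jpg", "jpeg", "png", "webp", "bmp", "gif", "tif", "tiff"];

/// Hypothetical helper: true when the path ends in a supported image
/// extension. Case-insensitive matching is an assumption of this sketch.
fn is_image_path(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| IMAGE_EXTENSIONS.iter().any(|known| ext.eq_ignore_ascii_case(known)))
        .unwrap_or(false)
}
```

Paths without an extension, or with a non-UTF-8 extension, fall through to `false`, which matches the "silently skipped" behavior described above.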

## Testing

```bash
cargo test --all
```

**20 tests** covering:

- **Unit (13):** Hamming distance, image path filtering, dHash determinism, duplicate grouping logic
- **Integration (7):** Real image fixtures, empty directories, cropped/resized detection, non-image exclusion, subdirectory recursion, single-file edge case, CLI binary output

### Allure Reports

Test results are published to Allure after each run:

```bash
cargo test --all 2>&1 | tee /tmp/test_output.txt
python3 /a0/usr/skills/allure-publish/publish.py deduper
```

## Project Structure

```
deduper/
├── Cargo.toml          # Dependencies: image, sha2, walkdir, anyhow
├── src/
│   ├── lib.rs          # Core library: scan, hash, group
│   └── main.rs         # CLI entrypoint
├── tests/
│   └── image_phase.rs  # Integration tests
└── README.md
```

## Dependencies

| Crate | Purpose |
|-------|---------|
| `image` | Image decoding and manipulation |
| `sha2` | SHA-256 cryptographic hashing |
| `walkdir` | Recursive directory traversal |
| `anyhow` | Error handling |

## License

MIT