docs: add README with usage instructions, threshold guide, project structure
145
README.md
Normal file
@@ -0,0 +1,145 @@
# deduper

Content-based duplicate file detection CLI for Linux.

Finds exact and near-duplicate files using cryptographic hashing and perceptual fingerprinting. Designed for large media libraries where filename or size comparisons aren't enough.

## Phases

| Phase | Status | Method |
|-------|--------|--------|
| **Image** | ✅ Complete | SHA-256 + dHash perceptual hashing |
| **Music** | 🔲 Planned | Audio fingerprinting |
| **Video** | 🔲 Planned | Video fingerprinting |

## Installation

### From source

```bash
# Requires Rust 1.85+ (2024 edition)
git clone https://gitea.ingwaz.work/admin/deduper.git
cd deduper
cargo build --release
```

Binary lands at `target/release/deduper`.

## Usage

```bash
deduper <folder> [hamming-threshold]
```

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `folder` | yes | none | Root directory to scan (recursive) |
| `hamming-threshold` | no | `8` | Max Hamming distance for perceptual similarity (0–64) |

### Examples

**Scan a photo library with the default threshold:**

```bash
deduper ~/Pictures
```

**Strict matching (fewer false positives):**

```bash
deduper ~/Pictures 4
```

**Relaxed matching (catches more variants):**

```bash
deduper ~/Pictures 12
```

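The argument handling described in the table above could look roughly like the sketch below. It is not taken from the project source: the `Args` struct, the `parse_args` function, and the `String` error type are assumptions made for illustration (the project lists `anyhow` for error handling), and only the default of `8` and the `0–64` range come from this README.

```rust
use std::path::PathBuf;

/// Default threshold from the arguments table.
const DEFAULT_THRESHOLD: u32 = 8;

/// Parsed command-line arguments (hypothetical struct for this sketch).
#[derive(Debug, PartialEq)]
struct Args {
    folder: PathBuf,
    threshold: u32,
}

/// Parses `<folder> [hamming-threshold]` from the arguments that follow
/// the program name.
fn parse_args<I>(mut args: I) -> Result<Args, String>
where
    I: Iterator<Item = String>,
{
    // The folder is required.
    let folder = args
        .next()
        .map(PathBuf::from)
        .ok_or_else(|| "usage: deduper <folder> [hamming-threshold]".to_string())?;

    // The threshold is optional and falls back to the default.
    let threshold = match args.next() {
        None => DEFAULT_THRESHOLD,
        Some(raw) => raw
            .parse::<u32>()
            .map_err(|e| format!("invalid threshold {raw:?}: {e}"))?,
    };

    // A dHash has 64 bits, so a distance above 64 can never occur.
    if threshold > 64 {
        return Err(format!("threshold must be in 0..=64, got {threshold}"));
    }

    Ok(Args { folder, threshold })
}
```

In a real binary this would be called as `parse_args(std::env::args().skip(1))`.
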
### Output

```
group 1 [exact]
/home/user/Pictures/vacation/beach.jpg
/home/user/Pictures/backup/beach.jpg

group 2 [similar]
/home/user/Pictures/vacation/sunset.jpg
/home/user/Pictures/edited/sunset_cropped.jpg
/home/user/Pictures/thumbs/sunset_small.jpg
```

- **exact**: Byte-identical files (SHA-256 match)
- **similar**: Perceptually similar images (dHash Hamming distance ≤ threshold)

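As a rough illustration of the report layout shown above, the following sketch renders groups in the same `group N [kind]` shape. The `MatchKind`, `Group`, and `render_groups` names are hypothetical and not taken from the project source; the exact whitespace of the real output is also an assumption.

```rust
/// How the members of a group were matched (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum MatchKind {
    Exact,
    Similar,
}

/// One group of duplicate files (hypothetical type).
struct Group {
    kind: MatchKind,
    paths: Vec<String>,
}

/// Renders groups as numbered blocks separated by a blank line.
fn render_groups(groups: &[Group]) -> String {
    let mut out = String::new();
    for (i, group) in groups.iter().enumerate() {
        if i > 0 {
            out.push('\n'); // blank line between groups
        }
        let label = match group.kind {
            MatchKind::Exact => "exact",
            MatchKind::Similar => "similar",
        };
        out.push_str(&format!("group {} [{}]\n", i + 1, label));
        for path in &group.paths {
            out.push_str(path);
            out.push('\n');
        }
    }
    out
}
```
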
## How It Works

### Image Phase

1. **Recursive scan**: Walks the directory tree, filtering by image extension (`jpg`, `jpeg`, `png`, `webp`, `bmp`, `gif`, `tif`, `tiff`)
2. **SHA-256 hashing**: Identifies byte-identical duplicates
3. **dHash fingerprinting**: Resizes each image to 9×8 grayscale, then compares horizontally adjacent pixels (8 comparisons per row, 8 rows) to produce a 64-bit perceptual hash
4. **Hamming distance**: Counts the bits that differ between two dHash values; a lower distance means more similar images
5. **Union-Find grouping**: Clusters similar images into groups, separating exact from perceptual matches

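Steps 3 and 4 can be sketched with the standard library alone. The example assumes the image has already been decoded and resized to a 9×8 grayscale grid (the project uses the `image` crate for that part). The function names and the bit convention (a bit is set when the left pixel is brighter, most significant bit first) are assumptions for illustration, not the project's confirmed implementation.

```rust
/// Computes a 64-bit difference hash from a 9x8 grayscale grid.
/// `pixels[row][col]` holds 8 rows of 9 brightness values.
fn dhash(pixels: &[[u8; 9]; 8]) -> u64 {
    let mut hash = 0u64;
    for row in pixels {
        for col in 0..8 {
            // One bit per horizontal neighbor pair:
            // 1 if brightness drops from left to right, else 0.
            hash = (hash << 1) | u64::from(row[col] > row[col + 1]);
        }
    }
    hash
}

/// Number of bit positions in which two hashes differ.
fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Two images count as similar when `hamming_distance(dhash(&a), dhash(&b))` is at most the configured threshold.
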
### Threshold Guide

| Threshold | Behavior |
|-----------|----------|
| `0` | Exact perceptual match only (identical visual content, different encoding) |
| `1–4` | Very conservative: catches resizes and minor compression artifacts |
| `5–8` | **Recommended**: catches resizes, crops, slight edits |
| `9–12` | Tolerant: may catch heavily edited versions, higher false positive risk |
| `13+` | Aggressive: will group loosely related images |

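
To show how the threshold interacts with Union-Find grouping, here is a minimal sketch that clusters precomputed 64-bit hashes. The `UnionFind` struct and `group_similar` function are hypothetical names, and the pairwise O(n²) comparison is an assumption made for brevity rather than a description of the project's code.

```rust
use std::collections::BTreeMap;

/// Minimal union-find (disjoint set) with path compression.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    /// Returns the representative of the set containing `x`.
    fn find(&mut self, x: usize) -> usize {
        let p = self.parent[x];
        if p != x {
            let root = self.find(p);
            self.parent[x] = root; // path compression
        }
        self.parent[x]
    }

    /// Merges the sets containing `a` and `b`.
    fn union(&mut self, a: usize, b: usize) {
        let ra = self.find(a);
        let rb = self.find(b);
        if ra != rb {
            self.parent[rb] = ra;
        }
    }
}

/// Groups indices of `hashes` whose Hamming distance is <= `threshold`.
/// Only groups with at least two members are returned.
fn group_similar(hashes: &[u64], threshold: u32) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(hashes.len());
    for i in 0..hashes.len() {
        for j in (i + 1)..hashes.len() {
            if (hashes[i] ^ hashes[j]).count_ones() <= threshold {
                uf.union(i, j);
            }
        }
    }

    // Collect members by root; BTreeMap keeps the output order stable.
    let mut by_root: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for i in 0..hashes.len() {
        let root = uf.find(i);
        by_root.entry(root).or_default().push(i);
    }
    by_root.into_values().filter(|g| g.len() > 1).collect()
}
```

Because union-find merges transitively, two images can end up in the same group even when their direct distance exceeds the threshold, as long as a chain of close-enough images connects them. This is one reason high thresholds tend to produce loosely related groups.
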
## Supported Formats

`jpg` · `jpeg` · `png` · `webp` · `bmp` · `gif` · `tif` · `tiff`

Non-image files are silently skipped.

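
A filter over the extensions listed above might look like the sketch below. The `is_image_path` name and the case-insensitive comparison are assumptions; the README only states which extensions are accepted.

```rust
use std::path::Path;

/// Extensions listed under "Supported Formats".
const IMAGE_EXTENSIONS: [&str; 8] = [
    "jpg", "jpeg", "png", "webp", "bmp", "gif", "tif", "tiff",
];

/// Returns true if the path has one of the supported image extensions.
/// The comparison ignores ASCII case (an assumption of this sketch).
fn is_image_path(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| {
            IMAGE_EXTENSIONS
                .iter()
                .any(|&known| ext.eq_ignore_ascii_case(known))
        })
        .unwrap_or(false)
}
```

Paths without an extension, or with a non-UTF-8 extension, are treated as non-images and skipped.
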
## Testing

```bash
cargo test --all
```

**20 tests** covering:

- **Unit (13):** Hamming distance, image path filtering, dHash determinism, duplicate grouping logic
- **Integration (7):** Real image fixtures, empty directories, cropped/resized detection, non-image exclusion, subdirectory recursion, single-file edge case, CLI binary output

### Allure Reports

Test results are published to Allure after each run:

```bash
cargo test --all 2>&1 | tee /tmp/test_output.txt
python3 /a0/usr/skills/allure-publish/publish.py deduper
```

## Project Structure

```
deduper/
├── Cargo.toml          # Dependencies: image, sha2, walkdir, anyhow
├── src/
│   ├── lib.rs          # Core library: scan, hash, group
│   └── main.rs         # CLI entrypoint
├── tests/
│   └── image_phase.rs  # Integration tests
└── README.md
```

## Dependencies

| Crate | Purpose |
|-------|---------|
| `image` | Image decoding and manipulation |
| `sha2` | SHA-256 cryptographic hashing |
| `walkdir` | Recursive directory traversal |
| `anyhow` | Error handling |

## License

MIT