clean_dubs/README.md
2025-02-25 18:01:48 +01:00

Compare Directories and Remove Duplicates
This Bash script compares subdirectories in two locations, groups them based on fuzzy name similarity, and automatically removes duplicates. It supports both undesirable-word-based removal and an automatic pruning mechanism (keeping the first directory in each group).
Features
Normalization & Cleaning:
Converts directory names to lowercase, removes punctuation, and strips out any undesirable words specified in a words file.
Fuzzy Matching:
Uses Python's difflib to compute a similarity ratio between cleaned directory names. A configurable threshold determines if two names are “similar.”
Automatic Removal:
Undesirable Words: Automatically removes a directory whose original name contains one of the undesirable words when another directory in the same group of similar names does not.
Duplicate Pruning: After the initial pass, automatically removes all duplicates in a group while keeping the first entry.
Dry-Run Mode:
Preview actions without deleting any directories by using the --dry-run flag.
Requirements
Bash (version 4+ recommended)
Python 3
bc for floating-point comparisons
Installation
Clone the repository:
git clone https://github.com/yourusername/compare-dirs.git
cd compare-dirs
Make the script executable:
chmod +x compare_dirs.sh
Usage
./compare_dirs.sh [--dry-run] [--threshold <threshold>] <dir1> <dir2> <words_file>
Arguments
<dir1>: The first directory containing subdirectories to compare.
<dir2>: The second directory containing subdirectories to compare.
<words_file>: A text file with one undesirable word per line. These words are removed from directory names during the cleaning process.
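The words file and the cleaning step can be illustrated with a minimal Python sketch. It is not the script's actual code: the function names `load_words` and `clean_name` are hypothetical, and replacing punctuation with spaces (rather than deleting it outright) is an assumption made so that names like `My.Movie` still split into separate words.

```python
import string


def load_words(path):
    """Read one undesirable word per line, skipping blank lines (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def clean_name(name, undesirable):
    """Lowercase the name, turn punctuation into spaces, drop undesirable words."""
    # Assumption: punctuation becomes a space so that "a.b" splits into two tokens.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = name.lower().translate(table).split()
    return " ".join(token for token in tokens if token not in undesirable)


if __name__ == "__main__":
    print(clean_name("My.Movie-(2020) [SAMPLE]", {"sample"}))  # my movie 2020
```

With a words file containing the single line `sample`, the directory name `My.Movie-(2020) [SAMPLE]` would be compared as `my movie 2020`.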
Options
--dry-run
Print the actions without actually removing any directories.
--threshold <threshold>
Set the fuzzy similarity threshold (default is 0.8). A lower threshold (e.g., 0.7) will group more directories as duplicates.
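To get a feel for what the threshold does, the following sketch computes the same kind of ratio with `difflib.SequenceMatcher`. The helper names `similarity` and `is_similar` are illustrative only and are not taken from the script.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two cleaned names."""
    return SequenceMatcher(None, a, b).ratio()


def is_similar(a, b, threshold=0.8):
    """Two names count as duplicates when the ratio meets or exceeds the threshold."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    a, b = "holiday photos", "holiday pics"
    print(round(similarity(a, b), 3))  # 0.769
    print(is_similar(a, b, 0.8))       # False
    print(is_similar(a, b, 0.7))       # True
```

Here the two names score about 0.77, so they stay separate at the default threshold of 0.8 but are grouped together at 0.7. Running with --dry-run first is a cheap way to check how a lower threshold changes the grouping.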
Examples
Dry Run with Default Threshold (0.8):
./compare_dirs.sh --dry-run /mnt/dsnas /mnt/dsnas1 ./words
Dry Run with a Custom Threshold (0.7):
./compare_dirs.sh --dry-run --threshold 0.7 /mnt/dsnas /mnt/dsnas1 ./words
Actual Run (without dry-run):
./compare_dirs.sh /mnt/dsnas /mnt/dsnas1 ./words
How It Works
Scanning:
The script scans for immediate subdirectories in the two specified directories.
Normalization & Cleaning:
Each subdirectory name is normalized (converted to lowercase, punctuation removed) and then “cleaned” by stripping out undesirable words (one per line from the words file).
Grouping:
Using a Python helper with difflib.SequenceMatcher, directories are grouped by comparing their cleaned names. If the similarity ratio meets or exceeds the threshold, they are considered duplicates.
Removal:
Automatic Removal Based on Undesirable Words:
Within duplicate groups, if one directory's original name contains an undesirable word while an alternative does not, that directory is flagged for removal.
Duplicate Pruning:
After the undesirable-word check, any remaining duplicate groups are pruned by keeping the first directory in each group and removing the rest.
Dry-Run:
When run with the --dry-run flag, the script will print what it would remove without actually deleting any directories.
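The grouping and removal steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's implementation: the greedy strategy (each directory joins the first group whose first member is similar enough), the function names, and the `[dry-run]` message format are all hypothetical.

```python
import os
import shutil
from difflib import SequenceMatcher


def group_similar(cleaned, threshold=0.8):
    """Greedily group (path, cleaned_name) pairs by similarity to each group's first member."""
    groups = []
    for path, name in cleaned:
        for group in groups:
            if SequenceMatcher(None, group[0][1], name).ratio() >= threshold:
                group.append((path, name))
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([(path, name)])
    return groups


def plan_removals(group, undesirable):
    """Return the paths to remove from one group, in the two passes described above."""
    paths = [path for path, _ in group]

    def flagged(path):
        base = os.path.basename(path).lower()
        return any(word in base for word in undesirable)

    unflagged = [p for p in paths if not flagged(p)]
    # Pass 1: flagged directories go only if an unflagged alternative exists.
    survivors = unflagged if unflagged else paths
    removals = [p for p in paths if p not in survivors]
    # Pass 2: keep the first survivor, prune the rest.
    removals += survivors[1:]
    return removals


def apply_removals(removals, dry_run=True):
    """Delete the planned directories, or only report them in dry-run mode."""
    for path in removals:
        if dry_run:
            print(f"[dry-run] would remove {path}")
        else:
            shutil.rmtree(path)
```

For a group containing `/a/Show SAMPLE`, `/b/Show`, and `/b/Show 2` with `sample` as the only undesirable word, this sketch would plan to remove `/a/Show SAMPLE` in the first pass and `/b/Show 2` in the second, leaving `/b/Show`.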