clean_dubs/README.md
2025-02-25 18:01:48 +01:00

# Compare Directories and Remove Duplicates

This Bash script compares subdirectories in two locations, groups them by fuzzy name similarity, and automatically removes duplicates. It supports both undesirable-word-based removal and an automatic pruning mechanism (keeping the first directory in each group).

## Features

Normalization & Cleaning: Converts directory names to lowercase, removes punctuation, and strips out any undesirable words specified in a words file.

Fuzzy Matching: Uses Python's difflib to compute a similarity ratio between cleaned directory names. A configurable threshold determines if two names are “similar.”

Automatic Removal:

- Undesirable Words: Automatically removes directories whose original name contains one of the undesirable words if another similar directory does not.
- Duplicate Pruning: After the initial pass, automatically removes all duplicates in a group while keeping the first entry.

Dry-Run Mode: Preview actions without deleting any directories by using the --dry-run flag.

## Requirements

- Bash (version 4+ recommended)
- Python 3
- bc for floating point comparisons

## Installation

Clone the repository:

    git clone https://github.com/yourusername/compare-dirs.git
    cd compare-dirs

Make the script executable:

    chmod +x compare_dirs.sh

## Usage

    ./compare_dirs.sh [--dry-run] [--threshold <value>] <first_directory> <second_directory> <words_file>

## Arguments

- `<first_directory>`: The first directory containing subdirectories to compare.
- `<second_directory>`: The second directory containing subdirectories to compare.
- `<words_file>`: A text file with one undesirable word per line. These words are removed from directory names during the cleaning process.

## Options

--dry-run

Print the actions without actually removing any directories.

--threshold

Set the fuzzy similarity threshold (default is 0.8). A lower threshold (e.g., 0.7) will group more directories as duplicates.
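To make the effect of the threshold concrete, here is a small Python sketch built on `difflib.SequenceMatcher`. The helper name `is_similar` and the sample names are hypothetical and not taken from the script. The `>=` comparison follows the "meets or exceeds" wording used later in this README.

```python
from difflib import SequenceMatcher


def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical helper: True if the similarity ratio meets or exceeds the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


# "backup01" and "backupxy" share only the 6-character block "backup",
# so the ratio is 2 * 6 / (8 + 8) = 0.75.
print(is_similar("backup01", "backupxy"))       # False at the default 0.8
print(is_similar("backup01", "backupxy", 0.7))  # True at 0.7
```

The same pair is left alone at 0.8 and treated as a duplicate at 0.7, so it is worth previewing a lower threshold with --dry-run first.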

## Examples

Dry Run with Default Threshold (0.8):

    ./compare_dirs.sh --dry-run /mnt/dsnas /mnt/dsnas1 ./words

Dry Run with a Custom Threshold (0.7):

    ./compare_dirs.sh --dry-run --threshold 0.7 /mnt/dsnas /mnt/dsnas1 ./words

Actual Run (without dry-run):

    ./compare_dirs.sh /mnt/dsnas /mnt/dsnas1 ./words

## How It Works

Scanning: The script scans for immediate subdirectories in the two specified directories.
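As an illustration of the scanning step, the following Python sketch lists only the directories directly inside a root, without recursing. The function name `immediate_subdirs` is hypothetical, and skipping symlinked directories is an assumption; the script itself is written in Bash and may behave differently.

```python
import os


def immediate_subdirs(root: str) -> list[str]:
    """Return sorted paths of directories directly inside root (no recursion)."""
    with os.scandir(root) as entries:
        # Assumption: symlinks to directories are not treated as subdirectories.
        return sorted(e.path for e in entries if e.is_dir(follow_symlinks=False))
```

Calling it once per location, for example on `/mnt/dsnas` and `/mnt/dsnas1`, would give the two candidate lists that the later steps compare.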

Normalization & Cleaning: Each subdirectory name is normalized (converted to lowercase, punctuation removed) and then “cleaned” by stripping out undesirable words (one per line from the words file).
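A minimal Python sketch of this step is shown below. The function names are hypothetical. Two details are assumptions rather than documented behavior: punctuation is replaced by spaces (so that `Holiday_Photos` splits into two words), and undesirable words are matched as whole lowercase words.

```python
import string


def normalize(name: str) -> str:
    """Lowercase the name and replace ASCII punctuation with spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split/join collapses the runs of spaces left behind by punctuation.
    return " ".join(name.lower().translate(table).split())


def clean(name: str, undesirable: set[str]) -> str:
    """Drop whole words that appear in the (lowercase) undesirable set."""
    return " ".join(w for w in normalize(name).split() if w not in undesirable)


print(clean("Holiday_Photos (Copy)", {"copy"}))  # holiday photos
```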

Grouping: Using a Python helper with difflib.SequenceMatcher, directories are grouped by comparing their cleaned names. If the similarity ratio meets or exceeds the threshold, they are considered duplicates.
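The README does not spell out the exact grouping strategy, so the sketch below assumes a simple greedy one: each cleaned name joins the first group whose first member it resembles, otherwise it starts a new group. The function name `group_similar` is hypothetical.

```python
from difflib import SequenceMatcher


def group_similar(names: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedy grouping: compare each name with the first member of every group."""
    groups: list[list[str]] = []
    for name in names:
        for group in groups:
            if SequenceMatcher(None, group[0], name).ratio() >= threshold:
                group.append(name)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([name])
    return groups


print(group_similar(["backup01", "backup02", "music"]))
# [['backup01', 'backup02'], ['music']]
```

Comparing against only the first member keeps the sketch short. A real implementation might compare against every member or carry the original paths alongside the cleaned names.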

Removal:

- Automatic Removal Based on Undesirable Words: Within duplicate groups, if one directory's original name contains an undesirable word while an alternative does not, that directory is flagged for removal.
- Duplicate Pruning: After the undesirable-word check, any remaining duplicate groups are pruned by keeping the first directory in each group and removing the rest.
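The two removal passes can be sketched for a single duplicate group as follows. This is an illustration under assumptions, not the script's code: the function name `select_removals` is hypothetical, and the undesirable-word check is assumed to be a case-insensitive substring match on the original name.

```python
def select_removals(group: list[str], undesirable: set[str]) -> list[str]:
    """Return the entries of one duplicate group that would be removed."""

    def flagged(name: str) -> bool:
        lowered = name.lower()
        return any(word in lowered for word in undesirable)

    # Pass 1: flagged names are dropped only if an unflagged alternative exists.
    survivors = [i for i, name in enumerate(group) if not flagged(name)]
    if not survivors:
        survivors = list(range(len(group)))
    # Pass 2: keep the first survivor and remove everything else.
    keep = survivors[0]
    return [name for i, name in enumerate(group) if i != keep]


print(select_removals(["Trip (copy)", "Trip", "trip"], {"copy"}))
# ['Trip (copy)', 'trip']
```

Note that when every name in a group contains an undesirable word, nothing is removed on that basis alone, and the group falls through to plain keep-the-first pruning.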

Dry-Run: When run with the --dry-run flag, the script will print what it would remove without actually deleting any directories.
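The dry-run pattern itself is simple: the deletion call is guarded by a flag, and the planned action is printed either way. The sketch below assumes a hypothetical `remove_dirs` function and made-up message wording; the script's real output format is not documented here.

```python
import shutil


def remove_dirs(paths: list[str], dry_run: bool = True) -> list[str]:
    """Report each removal; delete from disk only when dry_run is False."""
    messages = []
    for path in paths:
        if dry_run:
            messages.append(f"[dry-run] would remove: {path}")
        else:
            shutil.rmtree(path)
            messages.append(f"removed: {path}")
    for message in messages:
        print(message)
    return messages
```

Defaulting `dry_run` to `True` in the sketch is a deliberate safety choice. The script works the other way around and deletes unless --dry-run is passed.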