From 9e6961c406854df5c392c7e771faa3d36f2e74ca Mon Sep 17 00:00:00 2001
From: masterdraco
Date: Tue, 25 Feb 2025 18:01:48 +0100
Subject: [PATCH] README.md updated

---
 README.md | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index c911b16..b9b749e 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,79 @@
-# clean_dubs
+Compare Directories and Remove Duplicates
+This Bash script compares subdirectories in two locations, groups them based on fuzzy name similarity, and automatically removes duplicates. It supports both undesirable-word–based removal and an automatic pruning mechanism (keeping the first directory in each group).
 
-Clean directories for dublicates
\ No newline at end of file
+Features
+Normalization & Cleaning:
+Converts directory names to lowercase, removes punctuation, and strips out any undesirable words specified in a words file.
+
+Fuzzy Matching:
+Uses Python's difflib to compute a similarity ratio between cleaned directory names. A configurable threshold determines if two names are “similar.”
+
+Automatic Removal:
+
+Undesirable Words: Automatically removes directories whose original name contains one of the undesirable words if another similar directory does not.
+Duplicate Pruning: After the initial pass, automatically removes all duplicates in a group while keeping the first entry.
+Dry-Run Mode:
+Preview actions without deleting any directories by using the --dry-run flag.
+
+Requirements
+Bash (version 4+ recommended)
+Python 3
+bc for floating point comparisons
+Installation
+Clone the repository:
+
+git clone https://github.com/yourusername/compare-dirs.git
+cd compare-dirs
+Make the script executable:
+
+bash
+Kopiér
+chmod +x compare_dirs.sh
+
+Usage
+./compare_dirs.sh [--dry-run] [--threshold ]
+Arguments
+: The first directory containing subdirectories to compare.
+: The second directory containing subdirectories to compare.
+: A text file with one undesirable word per line. These words are removed from directory names during the cleaning process.
+
+Options
+--dry-run
+
+Print the actions without actually removing any directories.
+
+--threshold
+Set the fuzzy similarity threshold (default is 0.8). A lower threshold (e.g., 0.7) will group more directories as duplicates.
+
+Examples
+Dry Run with Default Threshold (0.8):
+
+
+./compare_dirs.sh --dry-run /mnt/dsnas /mnt/dsnas1 ./words
+Dry Run with a Custom Threshold (0.7):
+
+
+./compare_dirs.sh --dry-run --threshold 0.7 /mnt/dsnas /mnt/dsnas1 ./words
+Actual Run (without dry-run):
+
+
+./compare_dirs.sh /mnt/dsnas /mnt/dsnas1 ./words
+How It Works
+Scanning:
+The script scans for immediate subdirectories in the two specified directories.
+
+Normalization & Cleaning:
+Each subdirectory name is normalized (converted to lowercase, punctuation removed) and then “cleaned” by stripping out undesirable words (one per line from the words file).
+
+Grouping:
+Using a Python helper with difflib.SequenceMatcher, directories are grouped by comparing their cleaned names. If the similarity ratio meets or exceeds the threshold, they are considered duplicates.
+
+Removal:
+Automatic Removal Based on Undesirable Words:
+Within duplicate groups, if one directory’s original name contains an undesirable word while an alternative does not, that directory is flagged for removal.
+
+Duplicate Pruning:
+After the undesirable-word check, any remaining duplicate groups are pruned by keeping the first directory in each group and removing the rest.
+
+Dry-Run:
+When run with the --dry-run flag, the script will print what it would remove without actually deleting any directories.
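
Note on the "How It Works" section of the README added above: the normalize, clean, group, and remove steps it describes could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the code of compare_dirs.sh. The function names (`normalize`, `clean`, `has_undesirable`, `group_similar`, `pick_removals`) are hypothetical, the punctuation rule is a guess, and comparing each name against the first member of a group is one plausible reading of "grouped by comparing their cleaned names". The sketch never touches the filesystem; it only reports which names would be removed, as in a dry run.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase the name and replace punctuation with spaces (assumed rule)."""
    no_punct = re.sub(r"[^\w\s]|_", " ", name.lower())
    return " ".join(no_punct.split())


def clean(name, words):
    """Normalize, then drop every token listed in the undesirable-word set."""
    return " ".join(t for t in normalize(name).split() if t not in words)


def has_undesirable(name, words):
    """True if the original name contains at least one undesirable word."""
    return any(t in words for t in normalize(name).split())


def group_similar(names, words, threshold=0.8):
    """Greedy grouping (assumption): a name joins the first group whose
    first member has a cleaned-name similarity ratio >= threshold."""
    groups = []
    for name in names:
        cleaned = clean(name, words)
        for group in groups:
            first = clean(group[0], words)
            if SequenceMatcher(None, first, cleaned).ratio() >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


def pick_removals(group, words):
    """Return the names in one group that the two passes would remove."""
    if len(group) < 2:
        return []
    removals = []
    flagged = [n for n in group if has_undesirable(n, words)]
    # Undesirable-word pass: only when an alternative without such words exists.
    if flagged and len(flagged) < len(group):
        removals.extend(flagged)
    # Pruning pass: keep the first remaining entry, remove the rest.
    remaining = [n for n in group if n not in removals]
    removals.extend(remaining[1:])
    return removals


if __name__ == "__main__":
    # Hypothetical directory names and word list, for illustration only.
    demo_names = ["Holiday.Photos.2020", "holiday photos 2020 BACKUP", "Tax Documents"]
    demo_words = {"backup"}
    for grp in group_similar(demo_names, demo_words):
        for victim in pick_removals(grp, demo_words):
            print("Would remove:", victim)
```

In this sketch the comparison uses `>=`, matching the README's "meets or exceeds the threshold", so lowering the threshold can only merge more names into a group.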