README.md updated

This commit is contained in:
masterdraco 2025-02-25 18:01:48 +01:00
parent 67c7056b2e
commit 9e6961c406

# clean_dubs
Compare directories and clean them of duplicates.

This Bash script compares subdirectories in two locations, groups them based on fuzzy name similarity, and automatically removes duplicates. It supports both undesirable-word-based removal and an automatic pruning mechanism (keeping the first directory in each group).
## Features

- **Normalization & Cleaning:** Converts directory names to lowercase, removes punctuation, and strips out any undesirable words specified in a words file.
- **Fuzzy Matching:** Uses Python's difflib to compute a similarity ratio between cleaned directory names. A configurable threshold determines whether two names are "similar" (a sketch follows this list).
- **Automatic Removal:**
  - **Undesirable words:** Automatically removes a directory whose original name contains one of the undesirable words when a similar directory without one exists.
  - **Duplicate pruning:** After the initial pass, automatically removes all duplicates in a group while keeping the first entry.
- **Dry-Run Mode:** Preview actions without deleting any directories by using the `--dry-run` flag.
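The similarity ratio itself comes from `difflib.SequenceMatcher`, as noted above. Below is a minimal sketch of how such a check could be wired into Bash; the helper name and exact invocation are assumptions for illustration, not the script's actual code:

```bash
# Hypothetical helper: prints the difflib similarity ratio of two names.
similarity() {
    python3 -c 'import sys, difflib; print(round(difflib.SequenceMatcher(None, sys.argv[1], sys.argv[2]).ratio(), 3))' "$1" "$2"
}

similarity "the matrix 1999 1080p" "the matrix 1999"   # ≈ 0.833, above the 0.8 default
```

Because the ratio is computed on cleaned names, stripping undesirable words first (e.g. a release tag like `1080p`) pushes near-identical titles even further above the threshold.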
## Requirements

- Bash (version 4+ recommended)
- Python 3
- `bc` for floating-point comparisons (see the sketch below)
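`bc` is needed because Bash cannot compare floating-point numbers on its own; a threshold check presumably looks something like the following (variable names are illustrative, not taken from the script):

```bash
ratio="0.83"
threshold="0.8"
# bc -l prints 1 when the comparison holds, 0 otherwise.
if [ "$(echo "$ratio >= $threshold" | bc -l)" -eq 1 ]; then
    echo "similar enough to group as duplicates"
fi
```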
## Installation

Clone the repository:

```bash
git clone https://github.com/yourusername/compare-dirs.git
cd compare-dirs
```

Make the script executable:

```bash
chmod +x compare_dirs.sh
```
## Usage

```bash
./compare_dirs.sh [--dry-run] [--threshold <threshold>] <dir1> <dir2> <words_file>
```

### Arguments

- `<dir1>`: The first directory containing subdirectories to compare.
- `<dir2>`: The second directory containing subdirectories to compare.
- `<words_file>`: A text file with one undesirable word per line. These words are removed from directory names during the cleaning process (see the example below).
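For illustration, a words file might contain release tags and other noise that should never influence matching. The entries below are made up; use whatever fits your directory names:

```bash
# Hypothetical words file: one undesirable word per line.
cat > ./words <<'EOF'
1080p
720p
bluray
x264
copy
EOF
```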
### Options

- `--dry-run`
  Print the actions without actually removing any directories.
- `--threshold <threshold>`
  Set the fuzzy similarity threshold (default: 0.8). A lower threshold (e.g., 0.7) will group more directories as duplicates.
## Examples

Dry run with the default threshold (0.8):

```bash
./compare_dirs.sh --dry-run /mnt/dsnas /mnt/dsnas1 ./words
```

Dry run with a custom threshold (0.7):

```bash
./compare_dirs.sh --dry-run --threshold 0.7 /mnt/dsnas /mnt/dsnas1 ./words
```

Actual run (without `--dry-run`):

```bash
./compare_dirs.sh /mnt/dsnas /mnt/dsnas1 ./words
```
## How It Works

1. **Scanning:** The script scans for immediate subdirectories in the two specified directories.
2. **Normalization & Cleaning:** Each subdirectory name is normalized (converted to lowercase, punctuation removed) and then "cleaned" by stripping out undesirable words (one per line from the words file).
3. **Grouping:** Using a Python helper with `difflib.SequenceMatcher`, directories are grouped by comparing their cleaned names. If the similarity ratio meets or exceeds the threshold, they are considered duplicates (a sketch of these steps follows this list).
4. **Removal:**
   - **Automatic removal based on undesirable words:** Within a duplicate group, if one directory's original name contains an undesirable word while an alternative does not, that directory is flagged for removal.
   - **Duplicate pruning:** After the undesirable-word check, any remaining duplicate groups are pruned by keeping the first directory in each group and removing the rest.
5. **Dry-Run:** When run with the `--dry-run` flag, the script prints what it would remove without actually deleting any directories.
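Putting the normalization, cleaning, and grouping steps together, a minimal sketch of the pipeline might look like this. Everything below is illustrative: the function names, the choice to turn punctuation into spaces, and the sample directory names are assumptions, not the script's actual implementation.

```bash
#!/usr/bin/env bash
WORDS_FILE="./words"   # one undesirable word per line
THRESHOLD="0.8"

# Lowercase the name, replace punctuation with spaces, and squeeze repeated spaces.
normalize() {
    echo "$1" | tr '[:upper:]' '[:lower:]' | tr '[:punct:]' ' ' | tr -s ' '
}

# Strip every undesirable word listed in the words file from the name.
clean() {
    local name="$1" word
    while IFS= read -r word; do
        [ -n "$word" ] && name="${name//$word/}"
    done < "$WORDS_FILE"
    echo "$name" | tr -s ' '
}

# Fuzzy similarity of two cleaned names via Python's difflib.
similarity() {
    python3 -c 'import sys, difflib; print(difflib.SequenceMatcher(None, sys.argv[1], sys.argv[2]).ratio())' "$1" "$2"
}

a="$(clean "$(normalize "The.Matrix.1999.1080p")")"
b="$(clean "$(normalize "The Matrix (1999)")")"

if [ "$(echo "$(similarity "$a" "$b") >= $THRESHOLD" | bc -l)" -eq 1 ]; then
    echo "duplicate group: keep the first entry, remove the rest (or only report with --dry-run)"
fi
```

The real script applies this comparison across all subdirectory pairs and honors the undesirable-word rule before pruning, as described in the steps above.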