README.md updated
This commit is contained in:
parent
67c7056b2e
commit
9e6961c406
80
README.md
@ -1,3 +1,79 @@
# clean_dubs
Compare Directories and Remove Duplicates
This Bash script compares subdirectories in two locations, groups them by fuzzy name similarity, and automatically removes duplicates. It supports two removal mechanisms: removal based on undesirable words, and automatic pruning that keeps the first directory in each group.

Clean directories for duplicates

Features
Normalization & Cleaning:
Converts directory names to lowercase, removes punctuation, and strips out any undesirable words listed in a words file.

Fuzzy Matching:
Uses Python's difflib to compute a similarity ratio between cleaned directory names. A configurable threshold determines whether two names count as “similar.”

Automatic Removal:

Undesirable Words: Automatically removes a directory whose original name contains one of the undesirable words when another similar directory does not.
Duplicate Pruning: After the initial pass, automatically removes all remaining duplicates in a group, keeping only the first entry.
Dry-Run Mode:
Preview actions without deleting any directories by using the --dry-run flag.

Requirements
Bash (version 4+ recommended)
Python 3
bc for floating point comparisons

Installation
Clone the repository:

git clone https://github.com/yourusername/compare-dirs.git
cd compare-dirs
Make the script executable:

chmod +x compare_dirs.sh

Usage
./compare_dirs.sh [--dry-run] [--threshold <threshold>] <dir1> <dir2> <words_file>
Arguments
<dir1>: The first directory containing subdirectories to compare.
<dir2>: The second directory containing subdirectories to compare.
<words_file>: A text file with one undesirable word per line. These words are stripped from directory names during cleaning and are also used to decide which duplicate to remove.
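
As a rough illustration of the words file format, the following Python sketch reads one word per line. The function name `load_words` is hypothetical, and skipping blank lines and lowercasing each entry are assumptions made for this example; the script itself may read the file differently.

```python
from pathlib import Path


def load_words(words_file):
    """Return the undesirable words from a file with one word per line.

    Assumption: blank lines are ignored and words are lowercased so they
    can be compared against normalized directory names.
    """
    words = []
    for line in Path(words_file).read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        if word:
            words.append(word)
    return words
```

A file containing the lines `backup`, `old`, and `copy` would yield a three-element list in that order.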

Options
--dry-run

Print the actions without actually removing any directories.

--threshold <threshold>
Set the fuzzy similarity threshold (default: 0.8). A lower threshold (e.g., 0.7) groups more directories as duplicates, because less similar names still qualify.
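
To make the effect of the threshold concrete, here is a small Python sketch using `difflib.SequenceMatcher`. The helper name `is_similar` is hypothetical, and the sample strings are chosen only because their ratio is easy to verify by hand; it is not taken from the script.

```python
from difflib import SequenceMatcher


def is_similar(name_a, name_b, threshold=0.8):
    """Return True if the similarity ratio meets or exceeds the threshold."""
    ratio = SequenceMatcher(None, name_a, name_b).ratio()
    return ratio >= threshold


# The two names share the 6-character prefix "abcdef" out of 8 characters each,
# so the ratio is 2 * 6 / 16 = 0.75.
print(is_similar("abcdefgh", "abcdefxy"))                 # False at the default 0.8
print(is_similar("abcdefgh", "abcdefxy", threshold=0.7))  # True at 0.7
```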

Examples
Dry Run with Default Threshold (0.8):

./compare_dirs.sh --dry-run /mnt/dsnas /mnt/dsnas1 ./words
Dry Run with a Custom Threshold (0.7):

./compare_dirs.sh --dry-run --threshold 0.7 /mnt/dsnas /mnt/dsnas1 ./words
Actual Run (without dry-run):

./compare_dirs.sh /mnt/dsnas /mnt/dsnas1 ./words

How It Works
Scanning:
The script scans for immediate subdirectories in the two specified directories.
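
A minimal Python sketch of this scanning step might look as follows. The function name `list_subdirs` is hypothetical, and sorting the results and ignoring symlinked directories are assumptions for the example, not documented behavior of the script.

```python
import os


def list_subdirs(parent):
    """Return sorted paths of the immediate subdirectories of `parent`.

    Assumption: symlinks are not followed and nested levels are not visited.
    """
    with os.scandir(parent) as entries:
        return sorted(e.path for e in entries if e.is_dir(follow_symlinks=False))
```

Calling it once per location, for example on `/mnt/dsnas` and `/mnt/dsnas1`, gives the two lists of candidates to compare.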

Normalization & Cleaning:
Each subdirectory name is normalized (converted to lowercase, punctuation removed) and then “cleaned” by stripping out undesirable words (read one per line from the words file).
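
The normalization and cleaning described above can be sketched in Python as below. The function names `normalize` and `clean` are hypothetical, and two details are assumptions: punctuation is replaced by a space rather than deleted outright, and undesirable words are removed only when they appear as whole tokens.

```python
import re


def normalize(name):
    """Lowercase the name and replace punctuation with spaces."""
    lowered = name.lower()
    # Assumption: underscores and anything that is not a letter, digit,
    # or whitespace count as punctuation.
    no_punct = re.sub(r"[^\w\s]|_", " ", lowered)
    # Collapse runs of whitespace left behind by the substitution.
    return " ".join(no_punct.split())


def clean(name, undesirable_words):
    """Normalize the name, then drop tokens that are undesirable words."""
    unwanted = {w.lower() for w in undesirable_words}
    tokens = [t for t in normalize(name).split() if t not in unwanted]
    return " ".join(tokens)


print(clean("Holiday_Photos (BACKUP)", ["backup"]))  # holiday photos
```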

Grouping:
Using a Python helper with difflib.SequenceMatcher, directories are grouped by comparing their cleaned names. Two directories are considered duplicates if their similarity ratio meets or exceeds the threshold.
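
One possible way to implement such grouping is sketched below. The function name `group_similar` is hypothetical, and the strategy is an assumption: each directory joins the first existing group whose first member it matches, otherwise it starts a new group. The script's helper may group differently.

```python
from difflib import SequenceMatcher


def group_similar(cleaned, threshold=0.8):
    """Group directory paths whose cleaned names are similar.

    `cleaned` maps each directory path to its cleaned name. Groups keep
    the insertion order of the mapping.
    """
    groups = []
    for path, name in cleaned.items():
        for group in groups:
            representative = cleaned[group[0]]
            ratio = SequenceMatcher(None, representative, name).ratio()
            if ratio >= threshold:
                group.append(path)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([path])
    return groups
```

Comparing only against the first member of each group keeps the sketch simple, at the cost of being sensitive to the order in which directories are scanned.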

Removal:
Automatic Removal Based on Undesirable Words:
Within a duplicate group, if one directory’s original name contains an undesirable word while an alternative does not, that directory is flagged for removal.

Duplicate Pruning:
After the undesirable-word check, any remaining duplicate groups are pruned by keeping the first directory in each group and removing the rest.
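
The two removal passes can be illustrated with the following Python sketch for a single duplicate group. The function name `select_removals` is hypothetical, and matching undesirable words as case-insensitive substrings of the directory's base name is an assumption made for the example.

```python
import os


def select_removals(group, undesirable_words):
    """Return the paths in one duplicate group that would be removed.

    `group` is an ordered list of directory paths judged to be duplicates.
    """
    unwanted = [w.lower() for w in undesirable_words]

    def is_flagged(path):
        # Assumption: case-insensitive substring match on the base name.
        base = os.path.basename(path.rstrip("/")).lower()
        return any(word in base for word in unwanted)

    flagged = [p for p in group if is_flagged(p)]
    unflagged = [p for p in group if not is_flagged(p)]

    # Pass 1: drop flagged directories only when an unflagged alternative exists.
    removals = list(flagged) if unflagged else []
    survivors = [p for p in group if p not in removals]

    # Pass 2: keep the first survivor and prune the remaining duplicates.
    return removals + survivors[1:]
```

For the group `["/mnt/a/Movie (sample)", "/mnt/b/Movie", "/mnt/b/Movie 2"]` with the word `sample`, this keeps `/mnt/b/Movie` and selects the other two for removal.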

Dry-Run:
When run with the --dry-run flag, the script prints what it would remove without deleting any directories.
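
A dry-run guard of this kind can be sketched in Python as follows. The function name `remove_directories` and the message wording are hypothetical. Note that this sketch defaults to dry-run as a safety choice, unlike the script, which deletes unless --dry-run is given.

```python
import shutil


def remove_directories(paths, dry_run=True):
    """Delete each directory, or only report it when dry_run is set."""
    for path in paths:
        if dry_run:
            # Report the action without touching the filesystem.
            print(f"[dry-run] would remove: {path}")
        else:
            print(f"removing: {path}")
            shutil.rmtree(path)
```

Running a dry run first and reviewing the printed list is a sensible precaution, since removal with `shutil.rmtree` is not reversible.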