# Compare Directories and Remove Duplicates

This Bash script compares subdirectories in two locations, groups them based on fuzzy name similarity, and automatically removes duplicates. It supports both undesirable-word-based removal and an automatic pruning mechanism (keeping the first directory in each group).

## Features

- Normalization & Cleaning: Converts directory names to lowercase, removes punctuation, and strips out any undesirable words specified in a words file.
- Fuzzy Matching: Uses Python's `difflib` to compute a similarity ratio between cleaned directory names. A configurable threshold determines whether two names are "similar."
- Automatic Removal:
  - Undesirable Words: Removes a directory whose original name contains one of the undesirable words when another similar directory does not.
  - Duplicate Pruning: After the initial pass, removes all remaining duplicates in a group while keeping the first entry.
- Dry-Run Mode: Preview actions without deleting any directories by using the `--dry-run` flag.

### New Features (in `compare_dirs_improved.sh`)

- Parallel Processing: Both directory scanning and filtering are now processed in parallel for improved performance.
- Configuration File: Support for persistent configuration via a `compare_dirs.conf` file.
- Comprehensive Logging: Detailed logging with configurable levels (DEBUG, INFO, WARNING, ERROR).
- Better Error Handling: Improved error detection and reporting.
- Color-coded Console Output: Better visual distinction between different types of messages.

## Requirements

- Bash (version 4+ recommended)
- Python 3
- `bc` for floating-point comparisons

## Installation

Clone the repository:

    git clone https://github.com/yourusername/compare-dirs.git
    cd compare-dirs

Make the scripts executable:

```bash
chmod +x compare_dirs.sh
chmod +x compare_dirs_improved.sh
```

## Usage

### Original Script

```
./compare_dirs.sh [--dry-run] [--threshold <threshold>] <dir1> <dir2> <words_file>
```

### Improved Script

```
./compare_dirs_improved.sh [--dry-run] [--threshold <threshold>] [--config <config_file>] [--log-file <log_file>] [--log-level <level>] [--parallel <processes>] [<dir1> <dir2> <words_file>]
```

## Arguments

- `<dir1>`: The first directory containing subdirectories to compare.
- `<dir2>`: The second directory containing subdirectories to compare.
- `<words_file>`: A text file with one undesirable word per line. These words are removed from directory names during the cleaning process.

## Options

### Common Options

- `--dry-run`: Print the actions without actually removing any directories.
- `--threshold <threshold>`: Set the fuzzy similarity threshold (default is 0.8). A lower threshold (e.g., 0.7) will group more directories as duplicates.

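To get a feel for what lowering the threshold does, the sketch below scores one pair of names with `difflib.SequenceMatcher` and checks it against both thresholds mentioned above. The helper name `is_similar` and the sample names are made up for illustration; the script's own helper may be structured differently.

```python
from difflib import SequenceMatcher


def is_similar(name_a, name_b, threshold=0.8):
    """Return True when the similarity ratio meets or exceeds the threshold."""
    ratio = SequenceMatcher(None, name_a, name_b).ratio()
    return ratio >= threshold


# Hypothetical cleaned names: 7 of 9 characters line up, so the ratio is 14/18 (about 0.78).
pair = ("trip 2019", "trip 2020")
print(is_similar(*pair, threshold=0.8))  # False: about 0.78 is below 0.8
print(is_similar(*pair, threshold=0.7))  # True: about 0.78 meets 0.7
```

At the default of 0.8 these two directories stay separate; at 0.7 they would be treated as duplicates, so it is worth combining a lower threshold with `--dry-run` first.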
### Improved Script Options

- `--config <config_file>`: Specify a configuration file (default: `./compare_dirs.conf`).
- `--log-file <log_file>`: Specify a log file path (default: `./compare_dirs.log`).
- `--log-level <level>`: Set the logging level (DEBUG, INFO, WARNING, ERROR).
- `--parallel <processes>`: Number of parallel processes to use (0 = auto, uses all CPU cores).
- `--help`: Display usage information.

## Config File

The improved script supports a configuration file (default: `compare_dirs.conf`) with the following parameters:

```
# Configuration file for compare_dirs.sh

# Default directory paths
DIR1="/path/to/dir1"
DIR2="/path/to/dir2"

# Path to words file
WORDS_FILE="./words"

# Similarity threshold (0.0-1.0)
SIMILARITY_THRESHOLD=0.8

# Enable/disable dry run mode (true/false)
DRY_RUN=false

# Number of parallel processes to use (0 = auto)
PARALLEL_PROCESSES=0

# Logging configuration
LOG_ENABLED=true
LOG_FILE="./compare_dirs.log"
LOG_LEVEL="INFO" # DEBUG, INFO, WARNING, ERROR
```

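This README does not describe how the script reads the file or whether it validates the values. As a hypothetical pre-flight check that lives outside the script, the sketch below parses simple `KEY=value` lines and checks the three parameters whose allowed values are documented above. The function names `parse_config` and `validate` are illustrative only.

```python
import shlex

ALLOWED_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def parse_config(text):
    """Parse simple KEY=value lines, skipping blank lines and comments."""
    settings = {}
    for line in text.splitlines():
        # shlex strips the quotes and drops a trailing "# ..." comment.
        tokens = shlex.split(line, comments=True)
        if len(tokens) == 1 and "=" in tokens[0]:
            key, value = tokens[0].split("=", 1)
            settings[key] = value
    return settings


def validate(settings):
    """Return a list of problems found in the documented parameters."""
    problems = []
    if "SIMILARITY_THRESHOLD" in settings:
        try:
            in_range = 0.0 <= float(settings["SIMILARITY_THRESHOLD"]) <= 1.0
        except ValueError:
            in_range = False
        if not in_range:
            problems.append("SIMILARITY_THRESHOLD must be between 0.0 and 1.0")
    if "LOG_LEVEL" in settings and settings["LOG_LEVEL"] not in ALLOWED_LEVELS:
        problems.append("LOG_LEVEL must be one of DEBUG, INFO, WARNING, ERROR")
    if "DRY_RUN" in settings and settings["DRY_RUN"] not in {"true", "false"}:
        problems.append("DRY_RUN must be true or false")
    return problems


sample = 'SIMILARITY_THRESHOLD=1.5\nLOG_LEVEL="INFO" # DEBUG, INFO, WARNING, ERROR\n'
print(validate(parse_config(sample)))
```

The sample deliberately uses an out-of-range threshold, so the check reports one problem.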
## Examples

### Original Script

Dry Run with Default Threshold (0.8):

```
./compare_dirs.sh --dry-run /mnt/dsnas /mnt/dsnas1 ./words
```

Dry Run with a Custom Threshold (0.7):

```
./compare_dirs.sh --dry-run --threshold 0.7 /mnt/dsnas /mnt/dsnas1 ./words
```

Actual Run (without dry-run):

```
./compare_dirs.sh /mnt/dsnas /mnt/dsnas1 ./words
```

### Improved Script

Using Configuration File:

```
./compare_dirs_improved.sh --config my_config.conf
```

With Parallel Processing and Custom Log Level:

```
./compare_dirs_improved.sh --parallel 8 --log-level DEBUG /mnt/dsnas /mnt/dsnas1 ./words
```

## How It Works

### Scanning

The script scans for immediate subdirectories in the two specified directories. In the improved version, this is done in parallel for better performance.

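The scanning step can be sketched in Python as follows. This is an illustration only: the README does not say which mechanism the Bash script uses for parallelism, so the thread pool, the function names, and the `processes` parameter (mirroring `--parallel`, where 0 means "use all CPU cores") are assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_subdirs(root):
    """Return the sorted paths of the immediate subdirectories of root."""
    with os.scandir(root) as entries:
        return sorted(entry.path for entry in entries if entry.is_dir())


def scan_both(dir1, dir2, processes=0):
    """Scan both roots concurrently; processes=0 means use all CPU cores."""
    workers = processes if processes > 0 else (os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        first, second = pool.map(list_subdirs, [dir1, dir2])
    return first + second
```

Calling `scan_both("/mnt/dsnas", "/mnt/dsnas1")` would return one combined list of subdirectory paths, with regular files skipped.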
### Normalization & Cleaning

Each subdirectory name is normalized (converted to lowercase, punctuation removed) and then "cleaned" by stripping out undesirable words (one per line from the words file).

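A minimal sketch of this step, assuming that punctuation is deleted outright (rather than replaced with spaces) and that undesirable words are matched as whole space-separated tokens. Neither detail is specified in this README, and the function names are illustrative.

```python
import string


def normalize(name):
    """Lowercase the name and remove ASCII punctuation."""
    return name.lower().translate(str.maketrans("", "", string.punctuation))


def clean(name, undesirable_words):
    """Normalize the name, then drop any token that is an undesirable word."""
    banned = {normalize(word) for word in undesirable_words}
    tokens = normalize(name).split()
    return " ".join(token for token in tokens if token not in banned)


print(clean("Holiday Photos (Backup)", ["backup"]))  # holiday photos
```

Under these assumptions a name such as `Holiday.Photos` collapses to `holidayphotos`, which is one reason a real implementation might prefer to turn punctuation into spaces instead.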
### Grouping

Using a Python helper with `difflib.SequenceMatcher`, directories are grouped by comparing their cleaned names. If the similarity ratio meets or exceeds the threshold, they are considered duplicates.

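One way such grouping could work is sketched below: each name joins the first existing group whose first member it resembles closely enough, and otherwise starts a new group. The greedy strategy and the name `group_similar` are assumptions made for illustration; only the use of `SequenceMatcher` and the "meets or exceeds" comparison come from the description above.

```python
from difflib import SequenceMatcher


def group_similar(cleaned_names, threshold=0.8):
    """Greedily group names whose ratio to a group's first member meets the threshold."""
    groups = []
    for name in cleaned_names:
        for group in groups:
            if SequenceMatcher(None, group[0], name).ratio() >= threshold:
                group.append(name)
                break
        else:
            # No existing group was close enough, so this name starts its own.
            groups.append([name])
    return groups


names = ["holiday photos", "holiday photo", "tax records"]
print(group_similar(names))
# [['holiday photos', 'holiday photo'], ['tax records']]
```

Because every name is compared only with the first member of each group, the order of the input affects the result, which matters later when the first entry of a group is the one that is kept.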
### Removal

- Automatic Removal Based on Undesirable Words: Within duplicate groups, if one directory's original name contains an undesirable word while an alternative does not, that directory is flagged for removal.
- Duplicate Pruning: After the undesirable-word check, any remaining duplicate groups are pruned by keeping the first directory in each group and removing the rest.

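The two removal passes can be combined into a single decision per group, as in the sketch below. It assumes a case-insensitive substring test on the directory's base name; the README only says the original name "contains" the word, so that detail and the function name `plan_removals` are illustrative.

```python
import os


def plan_removals(group, undesirable_words):
    """Return (kept_path, paths_to_remove) for one group of similar directories."""

    def has_undesirable(path):
        name = os.path.basename(path).lower()
        return any(word.lower() in name for word in undesirable_words)

    # Pass 1: flagged directories lose out only if an unflagged alternative exists.
    unflagged = [path for path in group if not has_undesirable(path)]
    survivors = unflagged if unflagged else list(group)
    # Pass 2: keep the first survivor and prune everything else in the group.
    kept = survivors[0]
    return kept, [path for path in group if path != kept]


group = ["/mnt/dsnas/Movie SAMPLE", "/mnt/dsnas1/Movie", "/mnt/dsnas/Movie."]
print(plan_removals(group, ["sample"]))
# ('/mnt/dsnas1/Movie', ['/mnt/dsnas/Movie SAMPLE', '/mnt/dsnas/Movie.'])
```

If every directory in a group contains an undesirable word, none has a clean alternative, so the sketch falls back to keeping the first entry.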
### Dry-Run

When run with the `--dry-run` flag, the script will print what it would remove without actually deleting any directories.

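The dry-run behaviour amounts to guarding the destructive call behind a flag, roughly as sketched here. The message wording and the function name `remove_directories` are invented for this example and do not reflect the script's actual output.

```python
import shutil


def remove_directories(paths, dry_run=True):
    """Delete each directory tree, or only report it when dry_run is set."""
    messages = []
    for path in paths:
        if dry_run:
            messages.append(f"[dry-run] would remove {path}")
        else:
            shutil.rmtree(path)
            messages.append(f"removed {path}")
    return messages
```

Defaulting `dry_run` to `True` in the sketch is a deliberate choice: deletion then has to be requested explicitly. The script itself works the other way around and deletes unless `--dry-run` is passed.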
### Logging (Improved Script)

The improved script maintains a detailed log of all operations, which can be used for audit purposes or troubleshooting.