clean_dubs/README.md
2025-02-28 11:35:18 +01:00

# Compare Directories and Remove Duplicates
This Bash script compares subdirectories in two locations, groups them based on fuzzy name similarity, and automatically removes duplicates. It supports both undesirable-word-based removal and an automatic pruning mechanism (keeping the first directory in each group).
## Features
- Normalization & Cleaning: Converts directory names to lowercase, removes punctuation, and strips out any undesirable words specified in a words file.
- Fuzzy Matching: Uses Python's `difflib` to compute a similarity ratio between cleaned directory names. A configurable threshold determines if two names are "similar."
- Automatic Removal:
  - Undesirable Words: Automatically removes directories whose original name contains one of the undesirable words if another similar directory does not.
  - Duplicate Pruning: After the initial pass, automatically removes all duplicates in a group while keeping the first entry.
- Dry-Run Mode: Preview actions without deleting any directories by using the `--dry-run` flag.

New Features (in `compare_dirs_improved.sh`):
- Parallel Processing: Both directory scanning and filtering are now processed in parallel for improved performance
- Configuration File: Support for persistent configuration via `compare_dirs.conf` file
- Comprehensive Logging: Detailed logging with configurable levels (DEBUG, INFO, WARNING, ERROR)
- Better Error Handling: Improved error detection and reporting
- Color-coded Console Output: Better visual distinction between different types of messages
## Requirements
- Bash (version 4+ recommended)
- Python 3
- `bc` for floating-point comparisons
## Installation
Clone the repository:
```bash
git clone https://github.com/yourusername/compare-dirs.git
cd compare-dirs
```
Make the scripts executable:
```bash
chmod +x compare_dirs.sh
chmod +x compare_dirs_improved.sh
```
## Usage
### Original Script
```
./compare_dirs.sh [--dry-run] [--threshold <threshold>] <dir1> <dir2> <words_file>
```
### Improved Script
```
./compare_dirs_improved.sh [--dry-run] [--threshold <threshold>] [--config <config_file>] [--log-file <log_file>] [--log-level <level>] [--parallel <processes>] [<dir1> <dir2> <words_file>]
```
## Arguments
- `<dir1>`: The first directory containing subdirectories to compare.
- `<dir2>`: The second directory containing subdirectories to compare.
- `<words_file>`: A text file with one undesirable word per line. These words are removed from directory names during the cleaning process.
## Options
### Common Options
- `--dry-run`: Print the actions without actually removing any directories.
- `--threshold <threshold>`: Set the fuzzy similarity threshold (default is 0.8). A lower threshold (e.g., 0.7) will group more directories as duplicates.
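
To make the effect of the threshold concrete, here is a minimal Python sketch of the comparison. It assumes the ratio comes from `difflib.SequenceMatcher` (as described elsewhere in this README) and that a pair counts as a duplicate when the ratio meets or exceeds the threshold. The function names `similarity` and `is_duplicate` are hypothetical and are not taken from the script.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return the difflib similarity ratio (0.0 to 1.0) of two cleaned names."""
    return SequenceMatcher(None, a, b).ratio()


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two names as duplicates when the ratio meets or exceeds the threshold."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    a, b = "trip 2019", "trip 2020"
    print(round(similarity(a, b), 3))          # 0.778
    print(is_duplicate(a, b, threshold=0.8))   # False
    print(is_duplicate(a, b, threshold=0.7))   # True
```

The two names share the seven-character prefix `trip 20`, so the ratio is 2 × 7 / 18 ≈ 0.778. The pair is kept apart at the default threshold of 0.8 but grouped at 0.7.
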
### Improved Script Options
- `--config <config_file>`: Specify a configuration file (default: `./compare_dirs.conf`)
- `--log-file <log_file>`: Specify a log file path (default: `./compare_dirs.log`)
- `--log-level <level>`: Set logging level (DEBUG, INFO, WARNING, ERROR)
- `--parallel <processes>`: Number of parallel processes to use (0 = auto, uses all CPU cores)
- `--help`: Display usage information
## Config File
The improved script supports a configuration file (default: `compare_dirs.conf`) with the following parameters:
```
# Configuration file for compare_dirs.sh
# Default directory paths
DIR1="/path/to/dir1"
DIR2="/path/to/dir2"
# Path to words file
WORDS_FILE="./words"
# Similarity threshold (0.0-1.0)
SIMILARITY_THRESHOLD=0.8
# Enable/disable dry run mode (true/false)
DRY_RUN=false
# Number of parallel processes to use (0 = auto)
PARALLEL_PROCESSES=0
# Logging configuration
LOG_ENABLED=true
LOG_FILE="./compare_dirs.log"
LOG_LEVEL="INFO" # DEBUG, INFO, WARNING, ERROR
```
## Examples
### Original Script
Dry Run with Default Threshold (0.8):
```
./compare_dirs.sh --dry-run /mnt/dsnas /mnt/dsnas1 ./words
```
Dry Run with a Custom Threshold (0.7):
```
./compare_dirs.sh --dry-run --threshold 0.7 /mnt/dsnas /mnt/dsnas1 ./words
```
Actual Run (without dry-run):
```
./compare_dirs.sh /mnt/dsnas /mnt/dsnas1 ./words
```
### Improved Script
Using Configuration File:
```
./compare_dirs_improved.sh --config my_config.conf
```
With Parallel Processing and Custom Log Level:
```
./compare_dirs_improved.sh --parallel 8 --log-level DEBUG /mnt/dsnas /mnt/dsnas1 ./words
```
## How It Works
Scanning:
The script scans for immediate subdirectories in the two specified directories. In the improved version, this is done in parallel for better performance.
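
As an illustration of the scanning step, the sketch below lists the immediate subdirectories of two roots and scans both roots concurrently. It is written in Python for readability. The actual script is Bash, and how it parallelizes the scan is not documented here, so the thread pool and the names `list_subdirs` and `scan_both` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def list_subdirs(root: str) -> List[Path]:
    """Return the immediate subdirectories of root, sorted by path."""
    return sorted(p for p in Path(root).iterdir() if p.is_dir())


def scan_both(dir1: str, dir2: str) -> List[Path]:
    """Scan both roots concurrently and return dir1's entries followed by dir2's."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # map() preserves input order, so the results unpack as (dir1, dir2).
        first, second = pool.map(list_subdirs, [dir1, dir2])
    return first + second
```

Regular files are skipped, and nothing below the first level is visited, which matches the "immediate subdirectories" behavior described above.
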
Normalization & Cleaning:
Each subdirectory name is normalized (converted to lowercase, punctuation removed) and then "cleaned" by stripping out undesirable words (one per line from the words file).
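
The normalization and cleaning step can be sketched in Python as follows. This is an illustration of the described behavior, not the script's own code. It assumes punctuation is replaced by a space so that words stay separable, and that undesirable words are matched as whole, case-insensitive tokens. The names `load_words`, `normalize`, and `clean` are hypothetical.

```python
import string
from typing import Iterable, Set

# Map every ASCII punctuation character to a space (assumption: the script
# may delete punctuation outright instead of replacing it).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def load_words(path: str) -> Set[str]:
    """Read the words file: one undesirable word per line, blank lines ignored."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def normalize(name: str) -> str:
    """Lowercase the name, drop punctuation, and collapse repeated whitespace."""
    return " ".join(name.lower().translate(_PUNCT_TO_SPACE).split())


def clean(name: str, undesirable: Iterable[str]) -> str:
    """Normalize the name, then strip every token listed as undesirable."""
    blocked = {w.lower() for w in undesirable}
    return " ".join(tok for tok in normalize(name).split() if tok not in blocked)


if __name__ == "__main__":
    print(normalize("My.Movie_(2019)!"))              # my movie 2019
    print(clean("My.Movie.2019.SAMPLE", {"sample"}))  # my movie 2019
```
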
Grouping:
Using a Python helper with difflib.SequenceMatcher, directories are grouped by comparing their cleaned names. If the similarity ratio meets or exceeds the threshold, they are considered duplicates.
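
A minimal sketch of how such grouping could work is shown below. It uses `difflib.SequenceMatcher` as described, but the grouping strategy itself is an assumption: each directory joins the first existing group whose first member is similar enough, otherwise it starts a new group. The function name `group_similar` and the input shape (a mapping from directory path to cleaned name) are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def group_similar(cleaned: Dict[str, str], threshold: float = 0.8) -> List[List[str]]:
    """Group directory paths whose cleaned names meet the similarity threshold.

    `cleaned` maps each directory path to its cleaned name. Paths are compared
    against the first member of each existing group (a greedy strategy).
    """
    groups: List[List[str]] = []
    for path, name in cleaned.items():
        for group in groups:
            representative = cleaned[group[0]]
            if SequenceMatcher(None, name, representative).ratio() >= threshold:
                group.append(path)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([path])
    return groups


if __name__ == "__main__":
    names = {
        "/mnt/dsnas/Movie.2019": "movie 2019",
        "/mnt/dsnas1/movie_2019": "movie 2019",
        "/mnt/dsnas/Other": "zzz",
    }
    for group in group_similar(names):
        print(group)
```

Groups with a single member have no duplicates. Only groups with two or more members are candidates for the removal step.
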
Removal:
Automatic Removal Based on Undesirable Words:
Within duplicate groups, if one directory's original name contains an undesirable word while an alternative does not, that directory is flagged for removal.
Duplicate Pruning:
After the undesirable-word check, any remaining duplicate groups are pruned by keeping the first directory in each group and removing the rest.
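
The two removal passes can be combined into one decision per group, sketched here in Python. It assumes that "contains an undesirable word" means a case-insensitive substring match on the original name, and that group order decides which entry counts as "first". The names `has_undesirable` and `plan_removals` are hypothetical.

```python
from typing import Iterable, List, Tuple


def has_undesirable(name: str, words: Iterable[str]) -> bool:
    """True if the original name contains any undesirable word (case-insensitive)."""
    lowered = name.lower()
    return any(w.strip().lower() in lowered for w in words if w.strip())


def plan_removals(group: List[str], words: Iterable[str]) -> Tuple[str, List[str]]:
    """Return (directory to keep, directories to remove) for one duplicate group."""
    words = list(words)
    clean = [d for d in group if not has_undesirable(d, words)]
    # Pass 1: flagged names are removed only when a clean alternative exists.
    candidates = clean if clean else group
    # Pass 2: among the remaining candidates, keep the first and prune the rest.
    keep = candidates[0]
    return keep, [d for d in group if d != keep]


if __name__ == "__main__":
    keep, remove = plan_removals(["Movie SAMPLE", "Movie", "movie!"], ["sample"])
    print(keep)    # Movie
    print(remove)  # ['Movie SAMPLE', 'movie!']
```

If every member of a group contains an undesirable word, no clean alternative exists, so the sketch falls back to keeping the first entry.
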
Dry-Run:
When run with the --dry-run flag, the script will print what it would remove without actually deleting any directories.
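
The dry-run guard amounts to a single branch around the destructive call. The Python sketch below illustrates the idea. The message format and the function name `remove_dir` are assumptions, and the Bash script performs the deletion with its own commands.

```python
import shutil
from pathlib import Path


def remove_dir(path: Path, dry_run: bool = True) -> str:
    """Delete a directory tree, or only report the action when dry_run is set."""
    if dry_run:
        message = f"[DRY-RUN] Would remove: {path}"
    else:
        shutil.rmtree(path)
        message = f"Removed: {path}"
    print(message)
    return message
```

Defaulting `dry_run` to `True` in this sketch means a caller has to opt in to deletion explicitly. Note that the script itself takes the opposite default and deletes unless `--dry-run` is given.
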
Logging (Improved Script):
The improved script maintains a detailed log of all operations, which can be used for audit purposes or troubleshooting.