clean_dubs/README.md
2025-02-28 11:35:18 +01:00

Compare Directories and Remove Duplicates

This Bash script compares subdirectories in two locations, groups them by fuzzy name similarity, and automatically removes duplicates. It supports both undesirable-word-based removal and automatic pruning, which keeps the first directory in each group.

Features

Normalization & Cleaning: Converts directory names to lowercase, removes punctuation, and strips out any undesirable words specified in a words file.

Fuzzy Matching: Uses Python's difflib to compute a similarity ratio between cleaned directory names. A configurable threshold determines if two names are "similar."

Automatic Removal:

  • Undesirable Words: Automatically removes directories whose original name contains one of the undesirable words if another similar directory does not.
  • Duplicate Pruning: After the initial pass, automatically removes all duplicates in a group while keeping the first entry.

Dry-Run Mode: Preview actions without deleting any directories by using the --dry-run flag.

New Features (in compare_dirs_improved.sh):

  • Parallel Processing: Both directory scanning and filtering are now processed in parallel for improved performance
  • Configuration File: Support for persistent configuration via compare_dirs.conf file
  • Comprehensive Logging: Detailed logging with configurable levels (DEBUG, INFO, WARNING, ERROR)
  • Better Error Handling: Improved error detection and reporting
  • Color-coded Console Output: Better visual distinction between different types of messages

Requirements

  • Bash (version 4+ recommended)
  • Python 3
  • bc for floating point comparisons

Installation

Clone the repository:

git clone https://github.com/yourusername/compare-dirs.git
cd compare-dirs

Make the scripts executable:

chmod +x compare_dirs.sh
chmod +x compare_dirs_improved.sh

Usage

Original Script

./compare_dirs.sh [--dry-run] [--threshold <threshold>] <dir1> <dir2> <words_file>

Improved Script

./compare_dirs_improved.sh [--dry-run] [--threshold <threshold>] [--config <config_file>] [--log-file <log_file>] [--log-level <level>] [--parallel <processes>] [<dir1> <dir2> <words_file>]

Arguments

  • <dir1>: The first directory containing subdirectories to compare.
  • <dir2>: The second directory containing subdirectories to compare.
  • <words_file>: A text file with one undesirable word per line. These words are removed from directory names during the cleaning process.

Options

Common Options

  • --dry-run: Print the actions without actually removing any directories.
  • --threshold <threshold>: Set the fuzzy similarity threshold (default is 0.8). A lower threshold (e.g., 0.7) will group more directories as duplicates.
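
To see how the threshold changes what counts as a duplicate, the following Python sketch computes a similarity ratio with difflib.SequenceMatcher, the class this README names for grouping. The helper names similarity and is_similar and the sample directory names are illustrative assumptions and are not taken from the script.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return difflib's similarity ratio (0.0 to 1.0) for two cleaned names."""
    return SequenceMatcher(None, a, b).ratio()


def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two names as duplicates when the ratio meets or exceeds the threshold."""
    return similarity(a, b) >= threshold


# Hypothetical cleaned directory names, used only for illustration.
first, second = "summer trip", "summer trips 2020"
print(round(similarity(first, second), 3))       # 0.786
print(is_similar(first, second, threshold=0.8))  # False
print(is_similar(first, second, threshold=0.7))  # True
```

With the default of 0.8 this pair stays separate; lowering the threshold to 0.7 would put both names in the same duplicate group.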

Improved Script Options

  • --config <config_file>: Specify a configuration file (default: ./compare_dirs.conf)
  • --log-file <log_file>: Specify a log file path (default: ./compare_dirs.log)
  • --log-level <level>: Set logging level (DEBUG, INFO, WARNING, ERROR)
  • --parallel <processes>: Number of parallel processes to use (0 = auto, uses all CPU cores)
  • --help: Display usage information

Config File

The improved script supports a configuration file (default: compare_dirs.conf) with the following parameters:

# Configuration file for compare_dirs.sh

# Default directory paths
DIR1="/path/to/dir1"
DIR2="/path/to/dir2"

# Path to words file
WORDS_FILE="./words"

# Similarity threshold (0.0-1.0)
SIMILARITY_THRESHOLD=0.8

# Enable/disable dry run mode (true/false)
DRY_RUN=false

# Number of parallel processes to use (0 = auto)
PARALLEL_PROCESSES=0

# Logging configuration
LOG_ENABLED=true
LOG_FILE="./compare_dirs.log"
LOG_LEVEL="INFO"  # DEBUG, INFO, WARNING, ERROR
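
The file above is a flat list of KEY=value lines with # comments. How the script itself loads it is not shown in this README; as a rough illustration of the format only, here is a Python sketch that reads such a file into a dictionary. The function name parse_config is hypothetical.

```python
import shlex


def parse_config(text: str) -> dict:
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        # shlex strips surrounding quotes and drops trailing "# ..." comments.
        tokens = shlex.split(line, comments=True)
        if len(tokens) != 1 or "=" not in tokens[0]:
            continue  # blank line, comment, or a form this sketch does not handle
        key, _, value = tokens[0].partition("=")
        settings[key] = value
    return settings


sample = '''
# Similarity threshold (0.0-1.0)
SIMILARITY_THRESHOLD=0.8
DRY_RUN=false
LOG_LEVEL="INFO"  # DEBUG, INFO, WARNING, ERROR
'''
config = parse_config(sample)
threshold = float(config["SIMILARITY_THRESHOLD"])  # values arrive as strings
dry_run = config["DRY_RUN"] == "true"
print(threshold, dry_run, config["LOG_LEVEL"])
```

All values are read as strings, so numeric and boolean settings need an explicit conversion as in the last lines.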

Examples

Original Script

Dry Run with Default Threshold (0.8):

./compare_dirs.sh --dry-run /mnt/dsnas /mnt/dsnas1 ./words

Dry Run with a Custom Threshold (0.7):

./compare_dirs.sh --dry-run --threshold 0.7 /mnt/dsnas /mnt/dsnas1 ./words

Actual Run (without dry-run):

./compare_dirs.sh /mnt/dsnas /mnt/dsnas1 ./words

Improved Script

Using Configuration File:

./compare_dirs_improved.sh --config my_config.conf

With Parallel Processing and Custom Log Level:

./compare_dirs_improved.sh --parallel 8 --log-level DEBUG /mnt/dsnas /mnt/dsnas1 ./words

How It Works

Scanning: The script scans for immediate subdirectories in the two specified directories. In the improved version, this is done in parallel for better performance.
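
The scanning step can be pictured with a short Python sketch. It lists only the immediate subdirectories of each root and scans the two roots concurrently. The use of threads, the choice not to follow symlinks, and the function names list_subdirs and scan_both are assumptions made for illustration; the script's own parallel mechanism is not described here.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_subdirs(root: str) -> list:
    """Return full paths of the immediate subdirectories of root, sorted."""
    with os.scandir(root) as entries:
        # Assumption: symlinked directories are skipped rather than followed.
        return sorted(e.path for e in entries if e.is_dir(follow_symlinks=False))


def scan_both(dir1: str, dir2: str) -> list:
    """Scan the two roots concurrently and return one combined list."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first, second = pool.map(list_subdirs, [dir1, dir2])
    return first + second
```

Calling scan_both("/mnt/dsnas", "/mnt/dsnas1") would return the subdirectories of the first root followed by those of the second; nested directories and plain files are ignored.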

Normalization & Cleaning: Each subdirectory name is normalized (converted to lowercase, punctuation removed) and then "cleaned" by stripping out undesirable words (one per line from the words file).
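
A minimal Python sketch of this step follows. It assumes that punctuation is replaced by spaces (so that "My.Movie" becomes two words) and that undesirable words are matched as whole words after normalization; the script may differ on both points. The names normalize and clean and the sample word list are hypothetical.

```python
import string


def normalize(name: str) -> str:
    """Lowercase the name and replace ASCII punctuation with spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return name.lower().translate(table)


def clean(name: str, undesirable: set) -> str:
    """Normalize, then drop whole words that appear in the undesirable set."""
    words = normalize(name).split()
    return " ".join(w for w in words if w not in undesirable)


# Hypothetical contents of a words file, one word per line.
undesirable = {"sample", "copy"}
print(clean("My.Movie (2019) [SAMPLE]", undesirable))  # my movie 2019
```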

Grouping: Using a Python helper with difflib.SequenceMatcher, directories are grouped by comparing their cleaned names. If the similarity ratio meets or exceeds the threshold, they are considered duplicates.
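
One way to form such groups is sketched below in Python. It compares each cleaned name against the first member of every existing group and starts a new group when nothing matches. This greedy strategy and the function name group_similar are assumptions; the README does not specify how the helper iterates over candidates.

```python
from difflib import SequenceMatcher


def group_similar(cleaned: dict, threshold: float = 0.8) -> list:
    """Group paths by cleaned-name similarity.

    cleaned maps each directory path to its cleaned name. A path joins the
    first group whose leading member is similar enough, else starts a new one.
    """
    groups = []
    for path, name in cleaned.items():
        for group in groups:
            representative = cleaned[group[0]]
            if SequenceMatcher(None, name, representative).ratio() >= threshold:
                group.append(path)
                break
        else:
            groups.append([path])
    return groups


# Hypothetical paths and cleaned names.
cleaned = {
    "/mnt/dsnas/Summer Trip": "summer trip",
    "/mnt/dsnas1/summer_trip": "summer trip",
    "/mnt/dsnas/Tax Documents": "tax documents",
}
for group in group_similar(cleaned):
    print(group)
```

Groups with a single member have no duplicates and need no further action.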

Removal:

  • Automatic Removal Based on Undesirable Words: Within duplicate groups, if one directory's original name contains an undesirable word while an alternative does not, that directory is flagged for removal.
  • Duplicate Pruning: After the undesirable-word check, any remaining duplicate groups are pruned by keeping the first directory in each group and removing the rest.
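
The two removal passes can be sketched in Python for a single duplicate group. The sketch assumes a case-insensitive substring check against the directory's base name, and the function names contains_undesirable and select_removals are hypothetical. It only decides which paths would go; it deletes nothing.

```python
import os


def contains_undesirable(path: str, undesirable: set) -> bool:
    """True if the directory's original name contains any undesirable word."""
    name = os.path.basename(path).lower()
    return any(word in name for word in undesirable)


def select_removals(group: list, undesirable: set) -> list:
    """Return the paths in one duplicate group that would be removed."""
    flagged = [p for p in group if contains_undesirable(p, undesirable)]
    unflagged = [p for p in group if p not in flagged]
    # Pass 1: drop flagged directories only when an unflagged alternative exists.
    removals = flagged if unflagged else []
    survivors = [p for p in group if p not in removals]
    # Pass 2: keep the first survivor and prune the rest.
    return removals + survivors[1:]


# Hypothetical duplicate group.
group = ["/mnt/dsnas/Movie SAMPLE", "/mnt/dsnas1/Movie", "/mnt/dsnas/Movie 2"]
print(select_removals(group, {"sample"}))
```

Here the flagged directory is removed in the first pass, and "/mnt/dsnas/Movie 2" is pruned in the second pass because "/mnt/dsnas1/Movie" is the first survivor.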

Dry-Run: When run with the --dry-run flag, the script will print what it would remove without actually deleting any directories.
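
The dry-run guard amounts to a single branch around the delete call, as in this Python sketch. The message wording and the function name remove_dirs are assumptions, and the script's actual output may look different.

```python
import shutil


def remove_dirs(paths: list, dry_run: bool = True) -> list:
    """Delete each path unless dry_run is set; return the messages printed."""
    messages = []
    for path in paths:
        if dry_run:
            # Nothing is touched on disk in dry-run mode.
            messages.append(f"[DRY RUN] Would remove: {path}")
        else:
            shutil.rmtree(path)
            messages.append(f"Removed: {path}")
    for message in messages:
        print(message)
    return messages
```

The sketch defaults to dry_run=True so that an accidental call deletes nothing. The script works the other way round, deleting unless --dry-run is given, so a preview run is worth doing first.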

Logging (Improved Script): The improved script maintains a detailed log of all operations, which can be used for audit purposes or troubleshooting.
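
Level-based logging of this kind can be illustrated with Python's standard logging module. The logger name, the line format, and the function name make_logger are assumptions for the sketch; the improved script's actual log format is not documented here.

```python
import logging


def make_logger(log_file: str, level: str = "INFO") -> logging.Logger:
    """Return a file logger that drops messages below the given level."""
    logger = logging.getLogger("compare_dirs_sketch")
    logger.setLevel(level)  # accepts "DEBUG", "INFO", "WARNING", "ERROR"
    logger.handlers.clear()  # avoid duplicate lines if called more than once
    handler = logging.FileHandler(log_file)
    handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(message)s"))
    logger.addHandler(handler)
    return logger
```

With level set to "WARNING", calls to logger.info(...) are filtered out while logger.warning(...) and logger.error(...) are written to the file.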