
Compare Directories and Remove Duplicates

This Bash script compares subdirectories in two locations, groups them by fuzzy name similarity, and automatically removes duplicates. It supports both undesirable-word-based removal and automatic pruning, which keeps the first directory in each group.

Features

Normalization & Cleaning: Converts directory names to lowercase, removes punctuation, and strips out any undesirable words specified in a words file.

Fuzzy Matching: Uses Python's difflib to compute a similarity ratio between cleaned directory names. A configurable threshold determines if two names are "similar."

Automatic Removal:

  • Undesirable Words: Automatically removes directories whose original name contains one of the undesirable words if another similar directory does not.
  • Duplicate Pruning: After the initial pass, automatically removes all duplicates in a group while keeping the first entry.

Dry-Run Mode: Preview actions without deleting any directories by using the --dry-run flag.

New Features (in compare_dirs_improved.sh):

  • Parallel Processing: Both directory scanning and filtering are now processed in parallel for improved performance
  • Configuration File: Support for persistent configuration via compare_dirs.conf file
  • Comprehensive Logging: Detailed logging with configurable levels (DEBUG, INFO, WARNING, ERROR)
  • Better Error Handling: Improved error detection and reporting
  • Color-coded Console Output: Better visual distinction between different types of messages

Requirements

  • Bash (version 4+ recommended)
  • Python 3
  • bc for floating point comparisons

Installation

Clone the repository:

git clone https://github.com/yourusername/compare-dirs.git
cd compare-dirs

Make the scripts executable:

chmod +x compare_dirs.sh
chmod +x compare_dirs_improved.sh

Usage

Original Script

./compare_dirs.sh [--dry-run] [--threshold <threshold>] <dir1> <dir2> <words_file>

Improved Script

./compare_dirs_improved.sh [--dry-run] [--threshold <threshold>] [--config <config_file>] [--log-file <log_file>] [--log-level <level>] [--parallel <processes>] [<dir1> <dir2> <words_file>]

Arguments

<dir1>: The first directory containing subdirectories to compare.
<dir2>: The second directory containing subdirectories to compare.
<words_file>: A text file with one undesirable word per line. These words are removed from directory names during the cleaning process.

Options

Common Options

--dry-run: Print the actions without actually removing any directories.
--threshold <threshold>: Set the fuzzy similarity threshold (default is 0.8). A lower threshold (e.g., 0.7) will group more directories as duplicates.
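
To illustrate how the threshold changes the outcome, here is a minimal Python sketch built on difflib.SequenceMatcher, the library this README names for fuzzy matching. The function name `is_similar` and the sample names are hypothetical and not taken from the script.

```python
from difflib import SequenceMatcher


def is_similar(name_a: str, name_b: str, threshold: float = 0.8) -> bool:
    """Return True when the similarity ratio meets or exceeds the threshold."""
    return SequenceMatcher(None, name_a, name_b).ratio() >= threshold


# Hypothetical pair of cleaned names that share a prefix but differ at the end.
a, b = "movie one", "movie two"
print(is_similar(a, b))                 # checked against the default 0.8
print(is_similar(a, b, threshold=0.7))  # checked against a looser 0.7
```

This pair scores roughly 0.78, so it is treated as distinct at the default threshold and as a duplicate at 0.7. Running with --dry-run first is a sensible way to check a lower threshold before anything is deleted.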

Improved Script Options

--config <config_file>: Specify a configuration file (default: ./compare_dirs.conf)
--log-file <log_file>: Specify a log file path (default: ./compare_dirs.log)
--log-level <level>: Set logging level (DEBUG, INFO, WARNING, ERROR)
--parallel <processes>: Number of parallel processes to use (0 = auto, uses all CPU cores)
--help: Display usage information

Config File

The improved script supports a configuration file (default: compare_dirs.conf) with the following parameters:

# Configuration file for compare_dirs.sh

# Default directory paths
DIR1="/path/to/dir1"
DIR2="/path/to/dir2"

# Path to words file
WORDS_FILE="./words"

# Similarity threshold (0.0-1.0)
SIMILARITY_THRESHOLD=0.8

# Enable/disable dry run mode (true/false)
DRY_RUN=false

# Number of parallel processes to use (0 = auto)
PARALLEL_PROCESSES=0

# Logging configuration
LOG_ENABLED=true
LOG_FILE="./compare_dirs.log"
LOG_LEVEL="INFO"  # DEBUG, INFO, WARNING, ERROR

Examples

Original Script

Dry Run with Default Threshold (0.8):

./compare_dirs.sh --dry-run /mnt/dsnas /mnt/dsnas1 ./words

Dry Run with a Custom Threshold (0.7):

./compare_dirs.sh --dry-run --threshold 0.7 /mnt/dsnas /mnt/dsnas1 ./words

Actual Run (without dry-run):

./compare_dirs.sh /mnt/dsnas /mnt/dsnas1 ./words

Improved Script

Using Configuration File:

./compare_dirs_improved.sh --config my_config.conf

With Parallel Processing and Custom Log Level:

./compare_dirs_improved.sh --parallel 8 --log-level DEBUG /mnt/dsnas /mnt/dsnas1 ./words

How It Works

Scanning: The script scans for immediate subdirectories in the two specified directories. In the improved version, this is done in parallel for better performance.
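
The scanning step can be sketched in Python as follows. This is an illustration only: the script itself is Bash and its parallel mechanism is not described here, so the thread pool, the function names (`resolve_workers`, `list_subdirs`, `scan_parallel`), and the sorted output are all assumptions. The one detail taken from this README is that a worker count of 0 means "use all CPU cores".

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(requested: int) -> int:
    """Map 0 to the number of CPU cores; otherwise use the requested count."""
    if requested < 0:
        raise ValueError("worker count must be 0 (auto) or positive")
    return requested or (os.cpu_count() or 1)


def list_subdirs(root: str) -> list[str]:
    """Return the immediate subdirectories of root, sorted by path."""
    with os.scandir(root) as entries:
        return sorted(entry.path for entry in entries if entry.is_dir())


def scan_parallel(roots: list[str], requested: int = 0) -> dict[str, list[str]]:
    """Scan each root directory in its own worker (assumed: a thread pool)."""
    with ThreadPoolExecutor(max_workers=resolve_workers(requested)) as pool:
        return dict(zip(roots, pool.map(list_subdirs, roots)))
```

Only immediate subdirectories are listed; regular files and nested levels are ignored, matching the description above.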

Normalization & Cleaning: Each subdirectory name is normalized (converted to lowercase, punctuation removed) and then "cleaned" by stripping out undesirable words (one per line from the words file).
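
A minimal Python sketch of the normalization and cleaning step described above. The function names `normalize` and `clean` are hypothetical, and two details are assumptions not stated in this README: punctuation is replaced with a space rather than deleted outright, and undesirable words are matched as whole tokens.

```python
import string

# Translation table that turns every ASCII punctuation character into a space
# (assumption: the script may instead delete punctuation outright).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(name: str) -> str:
    """Lowercase the name, drop punctuation, and collapse repeated whitespace."""
    return " ".join(name.lower().translate(_PUNCT_TO_SPACE).split())


def clean(name: str, undesirable_words: list[str]) -> str:
    """Normalize the name, then remove tokens listed in the words file."""
    banned = {w.strip().lower() for w in undesirable_words if w.strip()}
    return " ".join(tok for tok in normalize(name).split() if tok not in banned)


# Hypothetical directory name and words-file content.
print(clean("Holiday.Photos-2019 [BACKUP]", ["backup"]))
```

The cleaned string is what gets compared during grouping; the original name is kept for the later removal checks.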

Grouping: Using a Python helper with difflib.SequenceMatcher, directories are grouped by comparing their cleaned names. If the similarity ratio meets or exceeds the threshold, they are considered duplicates.
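
One way the grouping step could work is sketched below. The README confirms the use of difflib.SequenceMatcher and the "meets or exceeds the threshold" rule; the greedy strategy (compare each name against the first member of every existing group) and the names `similarity` and `group_similar` are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio between two cleaned names, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def group_similar(cleaned: dict[str, str], threshold: float = 0.8) -> list[list[str]]:
    """Greedily group paths whose cleaned names are similar.

    `cleaned` maps each directory path to its cleaned name. A path joins the
    first group whose representative (first member) is similar enough;
    otherwise it starts a new group.
    """
    groups: list[list[str]] = []
    for path, name in cleaned.items():
        for group in groups:
            if similarity(name, cleaned[group[0]]) >= threshold:
                group.append(path)
                break
        else:
            groups.append([path])
    return groups


# Hypothetical input: paths mapped to their cleaned names.
example = {
    "/mnt/dsnas/Holiday Photos": "holiday photos",
    "/mnt/dsnas1/holiday_photo": "holiday photo",
    "/mnt/dsnas/Taxes": "taxes",
}
for group in group_similar(example):
    print(group)
```

Groups with a single member have no duplicates and need no further handling. With a greedy approach the order of the input affects which directory becomes the first entry in each group.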

Removal:

  • Automatic Removal Based on Undesirable Words: Within duplicate groups, if one directory's original name contains an undesirable word while an alternative does not, that directory is flagged for removal.

  • Duplicate Pruning: After the undesirable-word check, any remaining duplicate groups are pruned by keeping the first directory in each group and removing the rest.
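
The undesirable-word check can be sketched as follows. The rule itself (flag a directory only when some other member of its group is free of undesirable words) comes from the description above; the function names and the case-insensitive substring match are assumptions.

```python
import os


def contains_undesirable(path: str, words: list[str]) -> bool:
    """True if the directory's original name contains any undesirable word."""
    name = os.path.basename(path.rstrip("/")).lower()
    return any(w.strip().lower() in name for w in words if w.strip())


def flag_for_removal(group: list[str], words: list[str]) -> list[str]:
    """Return the paths in a duplicate group that should be removed.

    Nothing is flagged when every member contains an undesirable word,
    because no clean alternative would remain.
    """
    tainted = [p for p in group if contains_undesirable(p, words)]
    if len(tainted) == len(group):
        return []
    return tainted


# Hypothetical group: one clean copy and one marked as a sample.
print(flag_for_removal(["/mnt/dsnas/Movie", "/mnt/dsnas1/Movie SAMPLE"], ["sample"]))
```

Directories that survive this check stay in their group and are handled by the pruning pass.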

Dry-Run: When run with the --dry-run flag, the script will print what it would remove without actually deleting any directories.
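
Pruning and dry-run mode fit together as in the sketch below: the first directory in a group is kept and the rest are removed, or only reported when dry-run is active. The function name `prune_group` and the message format are hypothetical, and defaulting to dry-run here is a cautious choice for the example, not the script's behavior (its config default is DRY_RUN=false).

```python
import shutil


def prune_group(group: list[str], dry_run: bool = True) -> list[str]:
    """Keep the first path in a group and remove (or report) the others.

    Returns the list of paths that were, or would be, removed.
    """
    keep, duplicates = group[0], group[1:]
    for path in duplicates:
        if dry_run:
            print(f"[dry-run] would remove {path} (keeping {keep})")
        else:
            shutil.rmtree(path)  # deletes the directory and all of its contents
    return duplicates


# Hypothetical duplicate group; nothing is deleted because dry_run is True.
prune_group(["/mnt/dsnas/Movie", "/mnt/dsnas1/Movie"], dry_run=True)
```

Because removal is recursive and permanent, reviewing the dry-run output before a real run is the safer workflow.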

Logging (Improved Script): The improved script maintains a detailed log of all operations, which can be used for audit purposes or troubleshooting.
