program updates

2025-02-28 11:35:18 +01:00
parent 56d41fb487
commit 6f0781ea12
3 changed files with 585 additions and 17 deletions
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ Normalization & Cleaning:
 Converts directory names to lowercase, removes punctuation, and strips out any undesirable words specified in a words file.

 Fuzzy Matching:
-Uses Python's difflib to compute a similarity ratio between cleaned directory names. A configurable threshold determines if two names are “similar.”
+Uses Python's difflib to compute a similarity ratio between cleaned directory names. A configurable threshold determines if two names are "similar."

 Automatic Removal:

@@ -15,65 +15,136 @@ Duplicate Pruning: After the initial pass, automatically removes all duplicates
 Dry-Run Mode:
 Preview actions without deleting any directories by using the --dry-run flag.

+New Features (in compare_dirs_improved.sh):
+- Parallel Processing: Both directory scanning and filtering are now processed in parallel for improved performance
+- Configuration File: Support for persistent configuration via compare_dirs.conf file
+- Comprehensive Logging: Detailed logging with configurable levels (DEBUG, INFO, WARNING, ERROR)
+- Better Error Handling: Improved error detection and reporting
+- Color-coded Console Output: Better visual distinction between different types of messages
+
 Requirements
 Bash (version 4+ recommended)
 Python 3
 bc for floating point comparisons
+
 Installation
 Clone the repository:

 git clone https://github.com/yourusername/compare-dirs.git
 cd compare-dirs
-Make the script executable:

-bash
-Kopiér
+Make the scripts executable:
+
+```bash
 chmod +x compare_dirs.sh
+chmod +x compare_dirs_improved.sh
+```

 Usage
+## Original Script
+```
 ./compare_dirs.sh [--dry-run] [--threshold <threshold>] <dir1> <dir2> <words_file>
+```
+
+## Improved Script
+```
+./compare_dirs_improved.sh [--dry-run] [--threshold <threshold>] [--config <config_file>] [--log-file <log_file>] [--log-level <level>] [--parallel <processes>] [<dir1> <dir2> <words_file>]
+```
+
 Arguments
 <dir1>: The first directory containing subdirectories to compare.
 <dir2>: The second directory containing subdirectories to compare.
 <words_file>: A text file with one undesirable word per line. These words are removed from directory names during the cleaning process.

 Options
--dry-run
+### Common Options
+- `--dry-run`: Print the actions without actually removing any directories.
+- `--threshold <threshold>`: Set the fuzzy similarity threshold (default is 0.8). A lower threshold (e.g., 0.7) will group more directories as duplicates.

-Print the actions without actually removing any directories.
+### Improved Script Options
+- `--config <config_file>`: Specify a configuration file (default: ./compare_dirs.conf)
+- `--log-file <log_file>`: Specify a log file path (default: ./compare_dirs.log)
+- `--log-level <level>`: Set logging level (DEBUG, INFO, WARNING, ERROR)
+- `--parallel <processes>`: Number of parallel processes to use (0 = auto, uses all CPU cores)
+- `--help`: Display usage information

--threshold <threshold>
-Set the fuzzy similarity threshold (default is 0.8). A lower threshold (e.g., 0.7) will group more directories as duplicates.
+Config File
+The improved script supports a configuration file (default: compare_dirs.conf) with the following parameters:
+
+```
+# Configuration file for compare_dirs.sh
+
+# Default directory paths
+DIR1="/path/to/dir1"
+DIR2="/path/to/dir2"
+
+# Path to words file
+WORDS_FILE="./words"
+
+# Similarity threshold (0.0-1.0)
+SIMILARITY_THRESHOLD=0.8
+
+# Enable/disable dry run mode (true/false)
+DRY_RUN=false
+
+# Number of parallel processes to use (0 = auto)
+PARALLEL_PROCESSES=0
+
+# Logging configuration
+LOG_ENABLED=true
+LOG_FILE="./compare_dirs.log"
+LOG_LEVEL="INFO"  # DEBUG, INFO, WARNING, ERROR
+```

 Examples
+### Original Script
+
 Dry Run with Default Threshold (0.8):
-
-
+```
 ./compare_dirs.sh --dry-run /mnt/dsnas /mnt/dsnas1 ./words
+```
+
 Dry Run with a Custom Threshold (0.7):
-
-
+```
 ./compare_dirs.sh --dry-run --threshold 0.7 /mnt/dsnas /mnt/dsnas1 ./words
+```
+
 Actual Run (without dry-run):
-
-
+```
 ./compare_dirs.sh /mnt/dsnas /mnt/dsnas1 ./words
+```
+
+### Improved Script
+
+Using Configuration File:
+```
+./compare_dirs_improved.sh --config my_config.conf
+```
+
+With Parallel Processing and Custom Log Level:
+```
+./compare_dirs_improved.sh --parallel 8 --log-level DEBUG /mnt/dsnas /mnt/dsnas1 ./words
+```
+
 How It Works
 Scanning:
-The script scans for immediate subdirectories in the two specified directories.
+The script scans for immediate subdirectories in the two specified directories. In the improved version, this is done in parallel for better performance.

 Normalization & Cleaning:
-Each subdirectory name is normalized (converted to lowercase, punctuation removed) and then “cleaned” by stripping out undesirable words (one per line from the words file).
+Each subdirectory name is normalized (converted to lowercase, punctuation removed) and then "cleaned" by stripping out undesirable words (one per line from the words file).

 Grouping:
 Using a Python helper with difflib.SequenceMatcher, directories are grouped by comparing their cleaned names. If the similarity ratio meets or exceeds the threshold, they are considered duplicates.

 Removal:
 Automatic Removal Based on Undesirable Words:
-Within duplicate groups, if one directory’s original name contains an undesirable word while an alternative does not, that directory is flagged for removal.
+Within duplicate groups, if one directory's original name contains an undesirable word while an alternative does not, that directory is flagged for removal.

 Duplicate Pruning:
 After the undesirable-word check, any remaining duplicate groups are pruned by keeping the first directory in each group and removing the rest.

 Dry-Run:
 When run with the --dry-run flag, the script will print what it would remove without actually deleting any directories.
+
+Logging (Improved Script):
+The improved script maintains a detailed log of all operations, which can be used for audit purposes or troubleshooting.
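Review note: the normalization step the README describes can be tried in isolation. The sketch below mirrors the `clean_name` function added in compare_dirs_improved.sh in this commit; the sample directory names are hypothetical. Punctuation is deleted rather than replaced by spaces, so dot-separated names collapse into a single token before fuzzy matching.

```shell
#!/bin/bash
# Mirrors clean_name from compare_dirs_improved.sh: lowercase the name, drop
# everything except a-z, 0-9 and spaces, then let xargs trim and collapse whitespace.
clean_name() {
    local name="$1"
    echo "$name" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9 ]//g' | xargs
}

# Hypothetical directory names, for illustration only.
clean_name "The.Movie (2019) 1080p"    # prints: themovie 2019 1080p
clean_name "  Some   Show - S01  "     # prints: some show s01
```

Because "The.Movie" becomes "themovie" while "The Movie" stays "the movie", two spellings of the same title may score below the threshold; mapping punctuation to spaces instead would be one possible alternative, at the cost of changing existing groupings.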
--- a/compare_dirs.conf
+++ b/compare_dirs.conf
@@ -0,0 +1,22 @@
+# Configuration file for compare_dirs.sh
+
+# Default directory paths
+DIR1="/path/to/dir1"
+DIR2="/path/to/dir2"
+
+# Path to words file
+WORDS_FILE="./words"
+
+# Similarity threshold (0.0-1.0)
+SIMILARITY_THRESHOLD=0.8
+
+# Enable/disable dry run mode (true/false)
+DRY_RUN=false
+
+# Number of parallel processes to use (0 = auto)
+PARALLEL_PROCESSES=0
+
+# Logging configuration
+LOG_ENABLED=true
+LOG_FILE="./compare_dirs.log"
+LOG_LEVEL="INFO"  # DEBUG, INFO, WARNING, ERROR
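Review note: the improved script loads this file with `source`, which executes whatever the file contains. A minimal sketch of an alternative that only reads `KEY=VALUE` pairs for the keys shown above is given here. The function name `load_config_safe` is hypothetical and not part of this commit, and the parser assumes values never contain a whitespace-then-`#` sequence.

```shell
#!/bin/bash
# Hypothetical helper (not part of the commit): read KEY=VALUE pairs from a
# config file without executing it, accepting only the keys listed in "allowed".
load_config_safe() {
    local config_file="$1" line key value
    local allowed=" DIR1 DIR2 WORDS_FILE SIMILARITY_THRESHOLD DRY_RUN PARALLEL_PROCESSES LOG_ENABLED LOG_FILE LOG_LEVEL "

    while IFS= read -r line || [[ -n "$line" ]]; do
        # Skip blank lines and full-line comments.
        case "$line" in
            ''|'#'*) continue ;;
        esac
        # Ignore anything that is not an assignment.
        [[ "$line" == *=* ]] || continue

        key="${line%%=*}"
        value="${line#*=}"

        # Ignore keys that are not in the allow-list.
        [[ "$allowed" == *" $key "* ]] || continue

        # Drop a trailing "# comment", then trailing whitespace, then one pair of quotes.
        value="${value%%[[:space:]]#*}"
        value="${value%"${value##*[![:space:]]}"}"
        value="${value#\"}"
        value="${value%\"}"

        # Assign to the global variable named by $key without using eval.
        printf -v "$key" '%s' "$value"
    done < "$config_file"
}

# Usage (path is illustrative): load_config_safe ./compare_dirs.conf
```

The trade-off is that shell features such as variable expansion inside the config file stop working; whether that matters depends on how the file is used in practice.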
--- a/compare_dirs_improved.sh
+++ b/compare_dirs_improved.sh
@@ -0,0 +1,475 @@
+#!/bin/bash
+# compare_dirs_improved.sh
+#
+# Usage: ./compare_dirs_improved.sh [--dry-run] [--threshold <threshold>] [--config <config_file>] [--log-file <log_file>] [--log-level <level>] [--parallel <processes>] [<dir1> <dir2> <words_file>]
+#
+# This script:
+#   1. Scans immediate subdirectories in <dir1> and <dir2> in parallel.
+#   2. For each directory, if its name contains any undesirable word (one per line in <words_file>),
+#      the directory is removed outright.
+#   3. The remaining directories are "cleaned" (converted to lowercase, punctuation removed)
+#      and then grouped by fuzzy similarity using a configurable threshold.
+#      The fuzzy similarity process is optimized with a multiprocessing helper.
+#   4. Within each group, if one directory's name contains "2160p" and another contains "1080p",
+#      the 1080p directory(ies) are removed (or flagged in dry-run mode).
+#   5. For any remaining duplicate groups, the user is prompted to select a directory to remove.
+#   6. A --dry-run mode is available to preview removals without actually deleting any directories.
+#   7. Supports configuration files for persistent settings.
+#   8. Provides comprehensive logging of operations.
+
+set -euo pipefail
+
+# Default configuration file location
+CONFIG_FILE="./compare_dirs.conf"
+
+# Initialize log function
+log() {
+    local level="$1"
+    local message="$2"
+    local timestamp=$(date "+%Y-%m-%d %H:%M:%S")
+    
+    # Only log if logging is enabled
+    if [[ "$LOG_ENABLED" == "true" ]]; then
+        # Log level filtering
+        case "$LOG_LEVEL" in
+            "DEBUG")
+                ;;
+            "INFO")
+                if [[ "$level" == "DEBUG" ]]; then return; fi
+                ;;
+            "WARNING")
+                if [[ "$level" == "DEBUG" || "$level" == "INFO" ]]; then return; fi
+                ;;
+            "ERROR")
+                if [[ "$level" == "DEBUG" || "$level" == "INFO" || "$level" == "WARNING" ]]; then return; fi
+                ;;
+        esac
+        
+        # Print to console with color
+        case "$level" in
+            "DEBUG")   echo -e "\033[36m[$timestamp] [$level] $message\033[0m" ;;  # Cyan
+            "INFO")    echo -e "\033[32m[$timestamp] [$level] $message\033[0m" ;;  # Green
+            "WARNING") echo -e "\033[33m[$timestamp] [$level] $message\033[0m" ;;  # Yellow
+            "ERROR")   echo -e "\033[31m[$timestamp] [$level] $message\033[0m" ;;  # Red
+            *)         echo "[$timestamp] [$level] $message" ;;
+        esac
+        
+        # Write to log file (without color codes)
+        echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
+    fi
+}
+
+# Default options
+DRY_RUN=false
+SIMILARITY_THRESHOLD=0.8
+LOG_ENABLED=true
+LOG_FILE="./compare_dirs.log"
+LOG_LEVEL="INFO"
+PARALLEL_PROCESSES=0
+
+# Load configuration file if it exists
+load_config() {
+    local config_file="$1"
+    
+    if [[ -f "$config_file" ]]; then
+        # Source the config file
+        source "$config_file"
+        echo "Configuration loaded from $config_file"
+        return 0
+    else
+        echo "Warning: Configuration file '$config_file' not found. Using defaults."
+        return 1
+    fi
+}
+
+# Process directories in parallel
+process_directories_parallel() {
+    local dir="$1"
+    local max_procs="${2:-4}"  # Default to 4 processes if not specified
+    local temp_dir=$(mktemp -d)
+    local pids=()
+    local count=0
+    local result=()
+    
+    if [[ ! -d "$dir" ]]; then
+        log "ERROR" "Directory '$dir' not found in parallel processing."
+        return 1
+    fi
+    
+    # If PARALLEL_PROCESSES is 0, use available CPU cores
+    if [[ "$max_procs" -eq 0 ]]; then
+        max_procs=$(nproc 2>/dev/null || echo 4)
+    fi
+    
+    log "DEBUG" "Processing directory '$dir' with $max_procs parallel processes"
+    
+    # Get all directories to process
+    local all_dirs=()
+    for d in "$dir"/*; do
+        if [[ -d "$d" ]]; then
+            all_dirs+=("$d")
+        fi
+    done
+    
+    local total_dirs=${#all_dirs[@]}
+    log "DEBUG" "Found $total_dirs directories to process"
+    
+    # Process in batches based on max_procs
+    for ((i=0; i<total_dirs; i++)); do
+        local d="${all_dirs[i]}"
+        local base=$(basename "$d")
+        local output_file="$temp_dir/$count"
+        
+        # Process directory in background
+        {
+            remove_flag=false
+            # Check if the directory name contains any undesirable word (case-insensitive)
+            for word in "${words[@]}"; do
+                if echo "$base" | grep -qiF -- "$word"; then
+                    remove_flag=true
+                    break
+                fi
+            done
+            
+            if $remove_flag; then
+                echo "REMOVE:$d"
+            else
+                echo "KEEP:$d"
+            fi
+        } > "$output_file" &
+        
+        pids+=($!)
+        count=$((count + 1))  # ((count++)) returns status 1 when count is 0, which trips set -e
+        
+        # If we've reached max_procs or this is the last directory, wait for processes to finish
+        if [[ ${#pids[@]} -eq $max_procs || $i -eq $((total_dirs-1)) ]]; then
+            for pid in "${pids[@]}"; do
+                wait "$pid"
+            done
+            
+            # Read results
+            for ((j=0; j<${#pids[@]}; j++)); do
+                local file="$temp_dir/$j"
+                while IFS= read -r line; do
+                    result+=("$line")
+                done < "$file"
+            done
+            
+            # Reset for next batch
+            pids=()
+            count=0
+        fi
+    done
+    
+    # Clean up temporary directory
+    rm -rf "$temp_dir"
+    
+    # Output results
+    for line in "${result[@]}"; do
+        echo "$line"
+    done
+}
+
+# Process command-line flags
+while [[ $# -gt 0 && "$1" == --* ]]; do
+    case "$1" in
+        --dry-run)
+            DRY_RUN=true
+            shift
+            ;;
+        --threshold)
+            SIMILARITY_THRESHOLD="$2"
+            shift 2
+            ;;
+        --config)
+            CONFIG_FILE="$2"
+            shift 2
+            ;;
+        --log-file)
+            LOG_FILE="$2"
+            shift 2
+            ;;
+        --log-level)
+            LOG_LEVEL="$2"
+            shift 2
+            ;;
+        --parallel)
+            PARALLEL_PROCESSES="$2"
+            shift 2
+            ;;
+        --help)
+            echo "Usage: $0 [OPTIONS] [<dir1> <dir2> <words_file>]"
+            echo
+            echo "OPTIONS:"
+            echo "  --dry-run              Preview actions without deleting any directories"
+            echo "  --threshold <value>    Set the fuzzy similarity threshold (default: 0.8)"
+            echo "  --config <file>        Specify a configuration file to use"
+            echo "  --log-file <file>      Specify a log file to use"
+            echo "  --log-level <level>    Set log level: DEBUG, INFO, WARNING, ERROR"
+            echo "  --parallel <num>       Number of parallel processes (0 = auto)"
+            echo "  --help                 Display this help message"
+            echo
+            echo "If no directories and words file are specified, values from config file will be used."
+            exit 0
+            ;;
+        *)
+            echo "Unknown option: $1"
+            echo "Use --help for usage information."
+            exit 1
+            ;;
+    esac
+done
+
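+# Load the configuration file if it exists. It is sourced after flag parsing, so any value it sets overrides the matching command-line flag (including --dry-run).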
+if [[ -f "$CONFIG_FILE" ]]; then
+    load_config "$CONFIG_FILE"
+fi
+
+# Initialize logging
+if [[ "$LOG_ENABLED" == "true" ]]; then
+    touch "$LOG_FILE"
+    log "INFO" "Logging initialized"
+fi
+
+# Check if arguments were provided to override config values
+if [ $# -eq 3 ]; then
+    DIR1="$1"
+    DIR2="$2"
+    WORDS_FILE="$3"
+    log "INFO" "Using command-line arguments for directories and words file"
+elif [ $# -ne 0 ]; then
+    log "ERROR" "Incorrect number of arguments"
+    echo "Usage: $0 [OPTIONS] [<dir1> <dir2> <words_file>] (use --help to list all options)"
+    exit 1
+fi
+
+log "INFO" "Script started with parameters: DIR1=$DIR1, DIR2=$DIR2, WORDS_FILE=$WORDS_FILE"
+log "INFO" "Configuration: THRESHOLD=$SIMILARITY_THRESHOLD, DRY_RUN=$DRY_RUN, PARALLEL_PROCESSES=$PARALLEL_PROCESSES"
+
+# Verify input paths
+if [ ! -d "$DIR1" ]; then
+    log "ERROR" "Directory '$DIR1' not found."
+    echo "Error: Directory '$DIR1' not found."
+    exit 1
+fi
+if [ ! -d "$DIR2" ]; then
+    log "ERROR" "Directory '$DIR2' not found."
+    echo "Error: Directory '$DIR2' not found."
+    exit 1
+fi
+if [ ! -f "$WORDS_FILE" ]; then
+    log "ERROR" "Words file '$WORDS_FILE' not found."
+    echo "Error: Words file '$WORDS_FILE' not found."
+    exit 1
+fi
+
+# Read undesirable words (one per line) into an array, ignoring blank lines.
+mapfile -t words < <(grep -v '^[[:space:]]*$' "$WORDS_FILE")
+log "INFO" "Loaded ${#words[@]} undesirable words from $WORDS_FILE"
+
+echo "=== Pre-filtering Directories by Undesirable Words ==="
+log "INFO" "Starting parallel directory filtering"
+
+# Process directories in parallel and process results
+filtered_dirs=()
+
+# Process DIR1
+log "INFO" "Processing directories in $DIR1"
+while IFS= read -r line; do
+    if [[ "$line" == KEEP:* ]]; then
+        dir="${line#KEEP:}"
+        filtered_dirs+=("$dir")
+        log "DEBUG" "Keeping directory: $dir"
+    elif [[ "$line" == REMOVE:* ]]; then
+        dir="${line#REMOVE:}"
+        log "INFO" "Removing directory with undesirable word: $dir"
+        if $DRY_RUN; then
+            echo "Dry-run: would remove '$dir'"
+            log "INFO" "Dry-run: would remove '$dir'"
+        else
+            rm -rf "$dir"
+            log "INFO" "Removed '$dir'"
+            echo "Removed '$dir'"
+        fi
+    fi
+done < <(process_directories_parallel "$DIR1" "$PARALLEL_PROCESSES")
+
+# Process DIR2
+log "INFO" "Processing directories in $DIR2"
+while IFS= read -r line; do
+    if [[ "$line" == KEEP:* ]]; then
+        dir="${line#KEEP:}"
+        filtered_dirs+=("$dir")
+        log "DEBUG" "Keeping directory: $dir"
+    elif [[ "$line" == REMOVE:* ]]; then
+        dir="${line#REMOVE:}"
+        log "INFO" "Removing directory with undesirable word: $dir"
+        if $DRY_RUN; then
+            echo "Dry-run: would remove '$dir'"
+            log "INFO" "Dry-run: would remove '$dir'"
+        else
+            rm -rf "$dir"
+            log "INFO" "Removed '$dir'"
+            echo "Removed '$dir'"
+        fi
+    fi
+done < <(process_directories_parallel "$DIR2" "$PARALLEL_PROCESSES")
+
+log "INFO" "Filtered directories remaining: ${#filtered_dirs[@]}"
+
+# Function: Normalize and clean a directory name.
+clean_name() {
+    local name="$1"
+    echo "$name" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9 ]//g' | xargs
+}
+
+# Function: Compute fuzzy similarities between a target and a list of strings using multiprocessing.
+compute_similarities() {
+    local target="$1"
+    shift
+    # Pass target and the list of representatives as command-line arguments to Python.
+    python3 - "$target" "$@" <<EOF
+import sys
+from difflib import SequenceMatcher
+from multiprocessing import Pool, cpu_count
+
+target = sys.argv[1]
+reps = sys.argv[2:]
+
+def similarity(rep):
+    return SequenceMatcher(None, target, rep).ratio()
+
+# Use all available CPUs
+with Pool(processes=cpu_count()) as pool:
+    results = pool.map(similarity, reps)
+
+print(" ".join(map(str, results)))
+EOF
+}
+
+echo "=== Grouping Remaining Directories by Fuzzy Similarity ==="
+log "INFO" "Starting fuzzy similarity grouping"
+
+# Initialize grouping arrays.
+declare -a group_rep=()   # Array for representative cleaned names.
+declare -A groups=()      # Associative array: groups[i] holds newline-separated directory paths.
+
+# Group the directories in filtered_dirs.
+total_dirs=${#filtered_dirs[@]}
+log "INFO" "Grouping $total_dirs directories based on similarity threshold $SIMILARITY_THRESHOLD"
+
+for d in "${filtered_dirs[@]}"; do
+    base=$(basename "$d")
+    cleaned=$(clean_name "$base")
+    added=false
+    if [ "${#group_rep[@]}" -gt 0 ]; then
+        # Compute similarities between the cleaned name and all group representatives concurrently.
+        similarities=$(compute_similarities "$cleaned" "${group_rep[@]}")
+        read -r -a sims <<< "$similarities"
+        for i in "${!sims[@]}"; do
+            if (( $(echo "${sims[$i]} >= $SIMILARITY_THRESHOLD" | bc -l) )); then
+                groups["$i"]+=$'\n'"$d"
+                log "DEBUG" "Added '$d' to group $i (${group_rep[$i]})"
+                added=true
+                break
+            fi
+        done
+    fi
+    if [ "$added" = false ]; then
+        new_index=${#group_rep[@]}
+        group_rep+=("$cleaned")
+        groups["$new_index"]="$d"
+        log "DEBUG" "Created new group $new_index with representative '$cleaned'"
+    fi
+done
+
+log "INFO" "Created ${#group_rep[@]} groups after fuzzy similarity matching"
+
+echo "=== Resolution Preference Filtering ==="
+log "INFO" "Starting resolution preference filtering"
+
+# For each group, if one directory contains "2160p" and another contains "1080p",
+# remove the 1080p directory(ies).
+for key in "${!groups[@]}"; do
+    mapfile -t paths <<< "${groups[$key]}"  # read -a would stop at the first newline
+    has_2160p=false
+    has_1080p=false
+    for path in "${paths[@]}"; do
+        base=$(basename "$path")
+        if echo "$base" | grep -qi "2160p"; then
+            has_2160p=true
+            log "DEBUG" "Found 2160p in group $key: $path"
+        fi
+        if echo "$base" | grep -qi "1080p"; then
+            has_1080p=true
+            log "DEBUG" "Found 1080p in group $key: $path"
+        fi
+    done
+    if $has_2160p && $has_1080p; then
+        log "INFO" "Group $key (representative: ${group_rep[$key]:-unknown}) has both 1080p and 2160p directories"
+        echo "Group (representative: ${group_rep[$key]:-unknown}) has both 1080p and 2160p directories."
+        new_group=()
+        for path in "${paths[@]}"; do
+            base=$(basename "$path")
+            if echo "$base" | grep -qi "1080p"; then
+                log "INFO" "Removing '$path' because a 2160p version is present"
+                echo "Removing '$path' because a 2160p version is present."
+                if $DRY_RUN; then
+                    echo "Dry-run: would remove '$path'"
+                    log "INFO" "Dry-run: would remove '$path'"
+                else
+                    rm -rf "$path"
+                    log "INFO" "Removed '$path'"
+                    echo "Removed '$path'"
+                fi
+            else
+                new_group+=("$path")
+            fi
+        done
+        groups["$key"]=$(printf "%s\n" "${new_group[@]}")
+    fi
+done
+
+echo "=== Interactive Duplicate Resolution ==="
+log "INFO" "Starting interactive duplicate resolution"
+
+# For each group that still contains more than one directory, prompt the user to select one to remove.
+for key in "${!groups[@]}"; do
+    mapfile -t paths <<< "${groups[$key]}"  # read -a would stop at the first newline
+    # Filter out directories that no longer exist.
+    existing=()
+    for path in "${paths[@]}"; do
+        if [ -d "$path" ]; then
+            existing+=("$path")
+        fi
+    done
+    if [ "${#existing[@]}" -gt 1 ]; then
+        log "INFO" "Prompting user for duplicate group: ${group_rep[$key]:-unknown}"
+        echo "Duplicate group (representative: ${group_rep[$key]:-unknown}):"
+        i=1
+        for p in "${existing[@]}"; do
+            echo "  [$i] $p"
+            ((i++))
+        done
+        echo -n "Enter the number of the directory you want to remove (or 0 to skip): "
+        read -r choice
+        if [[ "$choice" =~ ^[0-9]+$ ]] && [ "$choice" -gt 0 ] && [ "$choice" -le "${#existing[@]}" ]; then
+            dir_to_remove="${existing[$((choice-1))]}"
+            log "INFO" "User selected to remove: $dir_to_remove"
+            if $DRY_RUN; then
+                echo "Dry-run: would remove '$dir_to_remove'"
+                log "INFO" "Dry-run: would remove '$dir_to_remove'"
+            else
+                rm -rf "$dir_to_remove"
+                log "INFO" "Removed '$dir_to_remove'"
+                echo "Removed '$dir_to_remove'"
+            fi
+        else
+            log "INFO" "User skipped removal for group: ${group_rep[$key]:-unknown}"
+            echo "No removal selected for this group."
+        fi
+    fi
+done
+
+log "INFO" "Script completed successfully"
+echo "Script completed. See $LOG_FILE for detailed log."
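Review note: the grouping code stores each group as a single newline-separated string in the `groups` associative array, so how that string is split back into an array decides how many paths the later stages see. A minimal sketch comparing `read -a` and `mapfile` on such a string follows; the paths are hypothetical.

```shell
#!/bin/bash
# Build a group the same way the script does: one string, paths joined by newlines.
declare -A groups=()
groups[0]="/mnt/dsnas/Movie 2160p"
groups[0]+=$'\n'"/mnt/dsnas1/Movie 1080p"

# read consumes a single line, so only the first path ends up in the array.
IFS=$'\n' read -r -a first_only <<< "${groups[0]}"

# mapfile reads every line into its own array element.
mapfile -t paths <<< "${groups[0]}"

echo "read: ${#first_only[@]} path(s); mapfile: ${#paths[@]} path(s)"
# prints: read: 1 path(s); mapfile: 2 path(s)
```

With only the first path visible, a group never appears to contain both a 2160p and a 1080p entry and never has more than one existing directory, so the resolution filter and the interactive prompt would both be skipped.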