
Find Duplicate Column Values (qcoldup.sql)

This script identifies duplicate values in any specified table column, showing each duplicate value along with its occurrence count. It’s essential for data quality analysis, constraint validation, and identifying potential primary key candidates. The script groups all rows by the specified column and filters to show only values that appear more than once, providing a clear view of data duplicates.

rem qcoldup.sql
rem
rem Prompts for a table name and a column name, then lists every value
rem of that column that occurs more than once, with a per-value count
rem and a grand total of duplicate rows.
rem
col column_value heading 'COLUMN VALUE'
col group_count format 999,999,999 heading 'ROW COUNT'
rem
ttitle 'Duplicate Column Distribution'
rem
break on report
compute sum of group_count on report
rem
select &&column column_value, count(*) group_count
from &table
group by &column
having count(*) > 1;
rem
undefine column;

Key Features:

  • Simple and Flexible: Works with any table and column combination
  • Duplicate Detection: Shows only values that appear multiple times
  • Occurrence Counting: Displays exact count of each duplicate value
  • Statistical Summary: Provides total count of duplicate rows
  • Universal Compatibility: Works with any data type column
  • Performance Optimized: Uses efficient GROUP BY with HAVING clause

Usage:

@qcoldup.sql

Input Parameters:

  • table: Table name to analyze (required)
  • column: Column name to check for duplicates (required; the && prefix stores the value at the first prompt so it is reused in the GROUP BY; see the pre-define sketch below)
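
Both substitution variables can also be pre-defined so the script runs without prompting; a minimal sketch with hypothetical EMPLOYEES/DEPARTMENT_ID names (the script itself undefines column when it finishes):

define column = department_id
define table = employees
@qcoldup.sql
undefine table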

Required Privileges:

  • SELECT on the specified table

Sample Output:

Duplicate Column Distribution

COLUMN VALUE                                                         ROW COUNT
---------------------------------------------------------------- -----------
                                                                  -----------
sum                                                                        16

Output Columns:

  • COLUMN VALUE: The actual duplicate value found in the column
  • ROW COUNT: Number of times this value appears in the table
-- Duplicates can indicate:
-- Data entry errors
-- Missing unique constraints
-- ETL process issues
-- Business rule violations
-- Data integration problems
-- Common duplicate scenarios:
-- Email addresses in user tables
-- Product codes in inventory
-- Customer IDs in transactions
-- Reference data inconsistencies
-- Duplicate values can cause:
-- Report inaccuracies
-- Customer communication issues
-- Inventory management problems
-- Financial discrepancies
-- Process automation failures
-- Regulatory considerations:
-- Data accuracy requirements
-- Audit trail integrity
-- Customer data privacy
-- Financial reporting accuracy
  1. Data Quality Assessment

    • Identify data entry errors and inconsistencies
    • Validate data cleansing processes
    • Assess data migration quality
    • Support data governance initiatives
  2. Constraint Validation

    • Verify unique constraint requirements
    • Identify primary key candidates
    • Validate business rules
    • Support database design decisions
  3. Data Cleansing Preparation

    • Identify records needing deduplication
    • Plan data cleansing strategies
    • Assess cleansing impact
    • Support data standardization efforts
  4. Business Rule Enforcement

    • Validate business logic constraints
    • Identify policy violations
    • Support compliance monitoring
    • Plan process improvements
-- Analyze duplicate frequency patterns:
-- High occurrence counts = systematic issues
-- Low occurrence counts = random errors
-- Frequency distribution = data quality patterns
-- Outlier identification = specific problem areas
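The frequency-pattern idea above can be checked directly. A minimal sketch, with placeholder table_name/column_name, showing how many values share each duplicate group size:

select group_count, count(*) as values_with_this_count
from (
  select column_name, count(*) as group_count
  from table_name
  group by column_name
  having count(*) > 1
)
group by group_count
order by group_count;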
-- Assess duplicate impact:
-- Calculate percentage of affected rows
-- Identify most problematic values
-- Estimate cleansing effort required
-- Plan remediation strategies
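To surface the most problematic values first, a hedged variation with placeholder names (FETCH FIRST requires Oracle 12c or later):

select column_name, count(*) as group_count
from table_name
group by column_name
having count(*) > 1
order by group_count desc
fetch first 10 rows only;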
-- Investigate duplicate sources:
-- Data entry process issues
-- System integration problems
-- Migration process errors
-- Business process gaps
-- Find duplicate email addresses:
Enter value for column: email_address
Enter value for table: customers
-- Check for duplicate product codes:
Enter value for column: product_code
Enter value for table: products
-- Validate customer ID uniqueness:
Enter value for column: customer_id
Enter value for table: orders
-- Find duplicate SSNs:
Enter value for column: ssn
Enter value for table: employees
-- For critical duplicates:
-- Review each duplicate case individually
-- Determine which record is authoritative
-- Merge or delete inappropriate records
-- Update references to consolidated records
-- For systematic duplicates:
-- Develop deduplication rules
-- Implement automated merge logic
-- Create backup before processing
-- Validate results after deduplication
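One common remediation pattern, shown only as a hedged sketch with placeholder names (take a backup first, as noted above): keep the lowest-ROWID row for each value and delete the rest.

-- keep one row per value, delete the surplus rows
delete from table_name
where rowid not in (
  select min(rowid)
  from table_name
  group by column_name
);
-- review the result, then COMMIT (or ROLLBACK)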
-- Implement preventive constraints:
-- UNIQUE constraints on columns
-- Composite unique constraints
-- Check constraints for business rules
-- Foreign key constraints for referential integrity
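A hedged sketch of the constraint side, with placeholder table, column, and constraint names:

-- single-column uniqueness (only after existing duplicates are resolved)
alter table table_name
  add constraint table_name_col_uk unique (column_name);

-- composite uniqueness across a pair of columns
alter table table_name
  add constraint table_name_pair_uk unique (column1, column2);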
-- Application-level prevention:
-- Input validation routines
-- Duplicate checking before insert
-- User interface controls
-- Business logic enforcement
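For "duplicate checking before insert", one server-side option is MERGE; a minimal sketch with placeholder names and a hypothetical :new_value bind variable:

merge into table_name t
using (select :new_value as column_name from dual) s
on (t.column_name = s.column_name)
when not matched then
  insert (column_name)
  values (s.column_name);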
-- For large tables:
-- Consider adding WHERE clauses for date ranges
-- Use partitioned tables for better performance
-- Schedule analysis during off-peak hours
-- Monitor resource utilization
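A hedged example of the date-range idea, assuming a hypothetical created_date column; it restricts the duplicate check to the last month:

select column_name, count(*) as group_count
from table_name
where created_date >= add_months(trunc(sysdate), -1)
group by column_name
having count(*) > 1;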
-- Optimize for duplicate checking:
-- Ensure indexes exist on analyzed columns
-- Consider composite indexes for multi-column analysis
-- Monitor query execution plans
-- Plan index maintenance
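A sketch of the index and execution-plan checks with placeholder names (DBMS_XPLAN is a standard supplied package):

create index table_name_col_idx on table_name (column_name);

explain plan for
select column_name, count(*)
from table_name
group by column_name
having count(*) > 1;

select * from table(dbms_xplan.display);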
-- Manage resource usage:
-- Monitor CPU and memory consumption
-- Consider parallel processing for large datasets
-- Plan disk space for temporary operations
-- Schedule during maintenance windows
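For very large tables, a parallel hint is one option; a hedged sketch in which degree 4 is only an example and should match your system's capacity:

select /*+ parallel(t 4) */ column_name, count(*) as group_count
from table_name t
group by column_name
having count(*) > 1;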
-- Include in data profiling:
-- Regular duplicate analysis schedules
-- Trend monitoring over time
-- Quality metric tracking
-- Report generation automation
-- ETL process integration:
-- Pre-load duplicate checking
-- Post-load validation
-- Exception handling procedures
-- Quality gate implementation
-- Ongoing monitoring:
-- Automated duplicate detection
-- Alert generation for new duplicates
-- Quality dashboard integration
-- Trend analysis and reporting
-- For multi-column duplicate analysis, modify the script:
select column1, column2, count(*) group_count
from table_name
group by column1, column2
having count(*) > 1;
-- Calculate the percentage of rows whose column value is duplicated:
WITH duplicate_count AS (
  SELECT NVL(SUM(cnt), 0) AS dup_rows
  FROM (
    SELECT column_name, COUNT(*) AS cnt
    FROM table_name
    GROUP BY column_name
    HAVING COUNT(*) > 1
  )
),
total_count AS (
  SELECT COUNT(*) AS total_rows
  FROM table_name
)
SELECT (dup_rows * 100.0 / total_rows) AS duplicate_percentage
FROM duplicate_count, total_count;
-- For near-duplicate detection:
-- Use SOUNDEX for phonetic matching
-- Implement string similarity functions
-- Consider case-insensitive comparisons
-- Apply data standardization before comparison
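Two hedged near-duplicate sketches with placeholder names: SOUNDEX groups phonetically similar strings, and UPPER/TRIM normalizes case and surrounding whitespace before comparison.

-- phonetic grouping
select soundex(column_name) as phonetic_key, count(*) as group_count
from table_name
group by soundex(column_name)
having count(*) > 1;

-- case- and whitespace-insensitive comparison
select upper(trim(column_name)) as normalized_value, count(*) as group_count
from table_name
group by upper(trim(column_name))
having count(*) > 1;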
-- Generate quality reports:
-- Duplicate summary statistics
-- Trend analysis over time
-- Before/after cleansing comparisons
-- Quality improvement metrics
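A one-row summary sketch for the statistics above, with placeholder names (COUNT of a column and COUNT DISTINCT both ignore NULLs):

select count(*) as total_rows,
       count(column_name) as non_null_rows,
       count(distinct column_name) as distinct_values,
       count(column_name) - count(distinct column_name) as surplus_rows
from table_name;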
-- Communicate findings:
-- Executive summary of data quality
-- Impact assessment on business processes
-- Recommended remediation actions
-- Timeline for quality improvements
-- Maintain audit documentation:
-- Record of duplicate analysis results
-- Actions taken for remediation
-- Validation of cleansing results
-- Process improvement recommendations
-- Schedule regular duplicate checks:
-- Daily for critical business data
-- Weekly for reference data
-- Monthly for historical data
-- After major data loads
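One way to schedule a recurring check is DBMS_SCHEDULER; a hedged sketch that assumes a hypothetical dup_check_log table and uses placeholder object names:

begin
  dbms_scheduler.create_job(
    job_name        => 'daily_dup_check',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin
                          insert into dup_check_log (check_time, column_value, row_count)
                          select sysdate, email_address, count(*)
                          from customers
                          group by email_address
                          having count(*) > 1;
                          commit;
                        end;',
    repeat_interval => 'FREQ=DAILY;BYHOUR=2',
    enabled         => true);
end;
/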
-- Implement alerting:
-- Threshold-based duplicate alerts
-- Trend change notifications
-- Quality degradation warnings
-- Process failure notifications
-- Monitor integration:
-- Database monitoring tools
-- Data quality dashboards
-- Business intelligence reports
-- Operational health checks
-- Plan duplicate analysis:
-- Identify critical columns for uniqueness
-- Prioritize by business impact
-- Schedule appropriate frequency
-- Define acceptable quality thresholds
-- Develop remediation approach:
-- Create data backup before changes
-- Implement validation procedures
-- Plan rollback strategies
-- Test in development environment
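Before any deletes, a simple working copy can serve as the backup mentioned above; a sketch with placeholder names:

create table table_name_backup as
select * from table_name;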
-- Integrate with business processes:
-- Include in data governance framework
-- Align with business rules
-- Coordinate with stakeholders
-- Document procedures and standards
-- Script execution problems:
-- Verify table and column names
-- Check user privileges
-- Ensure column data type compatibility
-- Monitor resource utilization
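A quick check that the table and column are visible to your session, reusing the CUSTOMERS/EMAIL_ADDRESS example from earlier:

select owner, table_name, column_name, data_type
from all_tab_columns
where table_name = 'CUSTOMERS'
  and column_name = 'EMAIL_ADDRESS';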
-- Performance optimization:
-- Add appropriate indexes
-- Consider partitioning for large tables
-- Use parallel processing if available
-- Schedule during off-peak hours
-- Handle different data types:
-- Date columns may need formatting
-- CLOB/BLOB columns require special handling
-- Number precision considerations
-- Character set compatibility
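Two hedged data-type sketches with placeholder names: truncating DATE values removes time-of-day noise, and DBMS_LOB.SUBSTR lets CLOBs be compared on a leading prefix (an approximation, not exact equality).

-- DATE values grouped at day precision
select trunc(date_column) as value_day, count(*) as group_count
from table_name
group by trunc(date_column)
having count(*) > 1;

-- CLOB values compared on their first 4,000 characters
select dbms_lob.substr(clob_column, 4000, 1) as clob_prefix, count(*) as group_count
from table_name
group by dbms_lob.substr(clob_column, 4000, 1)
having count(*) > 1;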

This script is essential for:

  1. Data Quality Assessment - Identifying and quantifying duplicate values
  2. Constraint Validation - Verifying uniqueness requirements before constraint creation
  3. Data Cleansing - Planning and executing deduplication processes
  4. Business Rule Enforcement - Validating data against business requirements
  5. Compliance Monitoring - Supporting regulatory and audit requirements