Column Value Distribution Analysis (qcolgrp.sql)
What This Script Does
This script analyzes the distribution of values in any specified table column, showing each unique value along with its occurrence count. It is useful for data profiling, understanding data patterns, identifying data quality issues, and supporting statistical analysis. The output is a complete frequency distribution of the column's values, including NULLs.
The Script
```sql
rem qcolgrp.sql
rem
accept table char prompt 'Table: '
accept column char prompt 'Column: '
rem
ttitle 'Column Distribution'
rem
break on report
compute sum of total on report
rem
select &column, count(*) total
from &table
group by &column;
rem
undefine table
undefine column
```
Key Features
- Universal Column Analysis: Works with any table and column combination
- Complete Distribution: Shows all unique values with occurrence counts
- Interactive Input: User-friendly prompts for table and column names
- Statistical Summary: Provides total row count across all values
- NULL Value Handling: Includes NULL values in the distribution analysis
- Data Type Agnostic: Works with any data type (VARCHAR2, NUMBER, DATE, etc.)
- Variable Cleanup: Automatically undefines variables after execution
Usage

```sql
@qcolgrp.sql
```
Interactive Prompts:
- Table: Table name to analyze
- Column: Column name to analyze for value distribution
Required Privileges
SELECT on the specified table
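If the analyzing account lacks this privilege, the table owner (or a DBA) can grant it. A minimal example; the schema, table, and grantee names here are illustrative only:

```sql
-- Illustrative grant: allow user SCOTT to query HR.EMPLOYEES
grant select on hr.employees to scott;
```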
Sample Output
```
                   Column Distribution

DEPARTMENT_NAME                               TOTAL
---------------------------------------- ----------
ACCOUNTING                                        3
ADMINISTRATION                                    1
EXECUTIVE                                         1
FINANCE                                           6
HUMAN RESOURCES                                   1
IT                                                5
MARKETING                                         2
OPERATIONS                                        4
PROCUREMENT                                       1
PUBLIC RELATIONS                                  1
PURCHASING                                        5
RESEARCH                                          2
SALES                                            34
SHIPPING                                         45
                                         ----------
sum                                             111
```
Key Output Columns
- Column Value: The actual unique value found in the analyzed column
- TOTAL: Number of times this value appears in the table
- sum: Total count of all rows in the table
Understanding Value Distribution Analysis
Data Profiling Applications

Distribution Patterns

Distribution analysis reveals:
- Data concentration patterns
- Value frequency distributions
- Outliers and anomalies
- Data quality issues
- Business pattern insights
Statistical Insights
Key statistical measures:
- Most and least frequent values
- Data skewness indicators
- Cardinality assessment
- NULL value prevalence
- Uniformity evaluation
Business Intelligence Applications
Pattern Recognition

Business patterns revealed:
- Customer behavior patterns
- Product popularity trends
- Geographic distribution
- Seasonal variations
- Market segmentation data
Decision Support
Distribution data supports business decisions such as:
- Resource allocation planning
- Marketing strategy development
- Inventory management optimization
- Customer segmentation strategies
Common Use Cases
1. Data Quality Assessment
   - Identify unexpected or invalid values
   - Detect data entry errors and inconsistencies
   - Assess data completeness (NULL analysis)
   - Validate data migration quality
2. Business Analysis
   - Understand customer demographics
   - Analyze product sales patterns
   - Evaluate geographic distribution
   - Study transaction patterns
3. Database Design Support
   - Determine appropriate data types
   - Plan indexing strategies
   - Assess normalization needs
   - Design constraint requirements
4. Performance Optimization
   - Analyze column selectivity for indexing
   - Understand data skew for partitioning
   - Plan optimizer statistics strategies
   - Support query optimization efforts
Advanced Analysis Techniques
Cardinality Analysis

Use the distribution to assess column cardinality:
- High cardinality = many unique values (good for indexing)
- Low cardinality = few unique values (consider bitmap indexes)
- Medium cardinality = balanced distribution
- Zero cardinality = all NULL values
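The cardinality categories above can be measured directly. A minimal sketch, reusing the same `&table` and `&column` substitution variables as qcolgrp.sql:

```sql
-- Sketch: measure column cardinality (assumes SELECT access to &table)
select count(distinct &column)                                   as distinct_values,
       count(*)                                                  as total_rows,
       round(count(distinct &column) / nullif(count(*), 0), 4)   as cardinality_ratio
from &table;
```

A ratio near 1 points toward a B-tree index candidate; a ratio near 0 means few distinct values, where a bitmap index may fit better.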
Data Quality Metrics
Quality metrics you can calculate from the distribution:
- Completeness = (total rows - NULL count) / total rows
- Uniqueness = distinct values / total rows
- Validity = valid values / total rows
- Consistency = values matching expected patterns / total values
Statistical Analysis
Statistical measures you can derive:
- Mode = most frequently occurring value
- Frequency distribution = value occurrence patterns
- Outlier detection = values with unusual frequencies
- Concentration ratio = distribution of top values
Practical Usage Examples
Customer Status Analysis

Analyze the customer status distribution:

Table: customers
Column: status

Results might show:
- ACTIVE: 15,450
- INACTIVE: 2,340
- SUSPENDED: 156
- PENDING: 89
Product Category Distribution
Analyze product categories:

Table: products
Column: category

Results reveal:
- Electronics: 2,500
- Clothing: 1,800
- Books: 3,200
- Home & Garden: 950
Geographic Analysis
Analyze customer geographic distribution:

Table: customers
Column: state

Results show:
- CA: 5,230
- TX: 3,890
- NY: 3,456
- FL: 2,890
Order Status Tracking
Analyze order status distribution:

Table: orders
Column: order_status

Results indicate:
- SHIPPED: 45,670
- PROCESSING: 1,230
- PENDING: 567
- CANCELLED: 234
Data Quality Insights
Anomaly Detection

Unexpected Values

The distribution can surface data quality issues such as:
- Misspellings in categorical data
- Invalid codes or statuses
- Inconsistent formatting
- Unexpected NULL values
Data Entry Errors
Common error patterns:
- Case sensitivity variations ("Active" vs. "ACTIVE")
- Spelling variations ("Cancelled" vs. "Canceled")
- Extra spaces or special characters
- Invalid enumerated values
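Case and whitespace variants of the same logical value can be surfaced by grouping on a normalized form. A sketch using the same substitution variables as the main script:

```sql
-- Sketch: find values that differ only in case or surrounding whitespace
select upper(trim(&column))      as normalized_value,
       count(distinct &column)   as variant_count,
       count(*)                  as total_rows
from &table
where &column is not null
group by upper(trim(&column))
having count(distinct &column) > 1;
```

Any row returned means at least two raw spellings collapse to the same normalized value, which usually indicates a data entry inconsistency.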
Completeness Assessment
NULL Value Analysis

High NULL counts may indicate:
- Data collection issues
- Optional field usage patterns
- Data migration problems
- Business process gaps
Data Coverage
Assess data coverage by checking for:
- Expected vs. actual value ranges
- Missing categorical values
- Incomplete data populations
- Temporal data gaps
Performance and Optimization
Index Planning

Selectivity Assessment

Use the distribution for index planning:
- High-selectivity columns = good B-tree index candidates
- Low-selectivity columns = consider bitmap indexes
- Very low selectivity = may not benefit from indexing
- Skewed distributions = consider histogram statistics
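Selectivity can also be read from the optimizer statistics, provided they are reasonably current. A sketch against the standard dictionary views, with the table name supplied through the same `&table` variable:

```sql
-- Sketch: per-column selectivity from dictionary statistics
-- (assumes statistics have been gathered recently for the table)
select c.column_name,
       c.num_distinct,
       c.num_nulls,
       round(c.num_distinct / nullif(t.num_rows, 0), 4) as selectivity
from user_tab_col_statistics c
join user_tables t on t.table_name = c.table_name
where c.table_name = upper('&table')
order by selectivity desc;
```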
Composite Index Strategy
When planning composite indexes:
- Lead with high-selectivity columns
- Consider query patterns
- Balance maintenance overhead
- Assess space requirements
Partitioning Strategy
Partition Key Selection

Evaluate partitioning potential:
- Even distribution = good hash partitioning candidate
- Skewed distribution = consider list partitioning
- Date/time patterns = range partitioning opportunity
- NULL handling = partition by special values
Data Skew Management
To handle data skew:
- Identify heavily populated values
- Plan partition boundaries accordingly
- Consider sub-partitioning strategies
- Monitor partition balance over time
Business Intelligence Applications
Market Analysis

Customer Segmentation

Customer analysis applications:
- Demographics distribution
- Purchase behavior patterns
- Geographic market analysis
- Customer lifecycle stages
Product Performance
Product analysis insights:
- Category performance comparison
- Seasonal demand patterns
- Inventory turnover rates
- Market penetration metrics
Operational Insights
Process Efficiency

Operations analysis:
- Workflow status distributions
- Processing time patterns
- Error rate analysis
- Resource utilization patterns
Quality Monitoring
Quality control applications:
- Defect rate distributions
- Quality grade patterns
- Compliance status tracking
- Process improvement opportunities
Integration with Analytics
Section titled “Integration with Analytics”Statistical Software
Export the results for further analysis:
- Import into R or Python for advanced statistics
- Create visualizations in BI tools
- Generate management reports
- Support predictive modeling
Data Warehousing
Data warehouse applications:
- Dimension table profiling
- Fact table distribution analysis
- ETL quality validation
- Historical trend analysis
Machine Learning
ML model support:
- Feature distribution analysis
- Class imbalance detection
- Data preprocessing planning
- Model validation support
Automation and Reporting
Automated Profiling

Automate distribution analysis:
- Schedule regular profiling jobs
- Generate automated reports
- Monitor data quality trends
- Alert on significant changes
Dashboard Integration
Dashboard integration ideas:
- Real-time distribution monitoring
- Quality metric dashboards
- Trend analysis visualizations
- Executive summary reports
Change Detection
Monitor distribution changes over time:
- Compare current vs. historical distributions
- Detect significant pattern shifts
- Alert on data quality degradation
- Track improvement initiatives
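One way to detect change is to persist a snapshot of the distribution and diff it against the current run. A sketch, where `dist_snapshot(col_value, total)` is a hypothetical table you populate on a schedule:

```sql
-- Sketch: diff the current distribution against a saved snapshot
-- (dist_snapshot is an assumed helper table, not part of qcolgrp.sql)
select coalesce(c.col_value, s.col_value)   as col_value,
       nvl(s.total, 0)                      as previous_total,
       nvl(c.total, 0)                      as current_total,
       nvl(c.total, 0) - nvl(s.total, 0)    as delta
from (select to_char(&column) col_value, count(*) total
      from &table
      group by &column) c
full outer join dist_snapshot s
  on s.col_value = c.col_value
order by abs(nvl(c.total, 0) - nvl(s.total, 0)) desc;
```

Rows with large absolute deltas are the first candidates for a data quality review.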
Best Practices
Analysis Planning

Plan effective analysis:
- Start with critical business columns
- Consider the performance impact on large tables
- Schedule during off-peak hours
- Document findings and insights
Performance Considerations
Optimize for performance:
- Ensure appropriate indexes exist
- Consider sampling for very large tables
- Monitor resource utilization
- Plan for temporary space requirements
Data Interpretation
Interpret results effectively:
- Understand the business context
- Consider temporal factors
- Validate unexpected patterns
- Collaborate with business stakeholders
Related Scripts
- qcoldup.sql - Find duplicate column values
- dtabcol.sql - Table column analysis
- dtable.sql - General table analysis
- dstatbig.sql - Statistics for large tables
Advanced Modifications
Percentage Distribution

Add a percentage column to the distribution:

```sql
select &column,
       count(*) total,
       round(count(*) * 100.0 / sum(count(*)) over (), 2) as percentage
from &table
group by &column
order by count(*) desc;
```
Top N Analysis
Show only the top N values:

```sql
select *
from (select &column, count(*) total
      from &table
      group by &column
      order by count(*) desc)
where rownum <= 10;
```
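On Oracle 12c and later, the ROWNUM wrapper can be replaced with the ANSI row-limiting clause, which avoids the inline view entirely:

```sql
-- 12c+ alternative: limit rows with FETCH FIRST instead of ROWNUM
select &column, count(*) total
from &table
group by &column
order by count(*) desc
fetch first 10 rows only;
```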
Statistical Summary
Add summary statistics for the column:

```sql
select count(distinct &column)                      as distinct_values,
       count(*)                                     as total_rows,
       count(&column)                               as non_null_rows,
       count(*) - count(&column)                    as null_rows,
       round(count(&column) * 100.0 / count(*), 2)  as completeness_pct
from &table;
```
Summary
This script is essential for:
- Data Profiling - Understanding data characteristics and patterns
- Quality Assessment - Identifying data quality issues and anomalies
- Business Analysis - Supporting business intelligence and decision making
- Database Optimization - Planning indexing and partitioning strategies
- Pattern Recognition - Discovering business insights and operational patterns