Column Value Distribution Analysis (qcolgrp.sql)
What This Script Does
This script analyzes the distribution of values in any specified table column, showing each unique value along with its occurrence count. It’s essential for data profiling, understanding data patterns, identifying data quality issues, and supporting statistical analysis. The script provides a complete frequency distribution of all values in the specified column, including NULL values.
The Script
```sql
rem qcolgrp.sql
rem
accept table char prompt 'Table: '
accept column char prompt 'Column: '
rem
ttitle 'Column Distribution'
rem
break on report
compute sum of total on report
rem
select &column, count(*) total
from &table
group by &column;
rem
undefine table
undefine column
```

Key Features
- Universal Column Analysis: Works with any table and column combination
- Complete Distribution: Shows all unique values with occurrence counts
- Interactive Input: User-friendly prompts for table and column names
- Statistical Summary: Provides total row count across all values
- NULL Value Handling: Includes NULL values in the distribution analysis (see the example after this list)
- Data Type Agnostic: Works with any data type (VARCHAR2, NUMBER, DATE, etc.)
- Variable Cleanup: Automatically undefines variables after execution
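Because GROUP BY treats all NULLs as a single group, rows with a NULL in the analyzed column appear as one line (with a blank value) in SQL*Plus output. A minimal sketch of this behavior, using a hypothetical `employees` table with a nullable `commission_pct` column:

```sql
-- All NULL commission_pct rows collapse into a single group;
-- the value column prints blank for that row in SQL*Plus
select commission_pct, count(*) total
from employees
group by commission_pct;
```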
```sql
@qcolgrp.sql
```

Interactive Prompts:
- Table: Table name to analyze
- Column: Column name to analyze for value distribution
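A sample interactive session might look like this (the table and column shown are illustrative; they match the sample output below):

```
SQL> @qcolgrp.sql
Table: departments
Column: department_name
```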
Required Privileges
SELECT on the specified table

Sample Output
```
                 Column Distribution

DEPARTMENT_NAME                               TOTAL
---------------------------------------- ----------
ACCOUNTING                                        3
ADMINISTRATION                                    1
EXECUTIVE                                         1
FINANCE                                           6
HUMAN RESOURCES                                   1
IT                                                5
MARKETING                                         2
OPERATIONS                                        4
PROCUREMENT                                       1
PUBLIC RELATIONS                                  1
PURCHASING                                        5
RESEARCH                                          2
SALES                                            34
SHIPPING                                         45
                                         ----------
sum                                             111
```

Key Output Columns
Section titled “Key Output Columns”- Column Value: The actual unique value found in the analyzed column
- TOTAL: Number of times this value appears in the table
- sum: Total count of all rows in the table
Understanding Value Distribution Analysis
Data Profiling Applications
Distribution Patterns
```sql
-- Distribution analysis reveals:
-- Data concentration patterns
-- Value frequency distributions
-- Outliers and anomalies
-- Data quality issues
-- Business pattern insights
```

Statistical Insights
```sql
-- Key statistical measures:
-- Most/least frequent values
-- Data skewness indicators
-- Cardinality assessment
-- Null value prevalence
-- Uniformity evaluation
```

Business Intelligence Applications
Pattern Recognition
```sql
-- Business patterns revealed:
-- Customer behavior patterns
-- Product popularity trends
-- Geographic distribution
-- Seasonal variations
-- Market segmentation data
```

Decision Support
```sql
-- Support business decisions:
-- Resource allocation planning
-- Marketing strategy development
-- Inventory management optimization
-- Customer segmentation strategies
```

Common Use Cases
- Data Quality Assessment
  - Identify unexpected or invalid values
  - Detect data entry errors and inconsistencies
  - Assess data completeness (NULL analysis)
  - Validate data migration quality
- Business Analysis
  - Understand customer demographics
  - Analyze product sales patterns
  - Evaluate geographic distribution
  - Study transaction patterns
- Database Design Support
  - Determine appropriate data types
  - Plan indexing strategies
  - Assess normalization needs
  - Design constraint requirements
- Performance Optimization
  - Analyze column selectivity for indexing
  - Understand data skew for partitioning
  - Plan optimizer statistics strategies
  - Support query optimization efforts
Advanced Analysis Techniques
Cardinality Analysis
```sql
-- Assess column cardinality:
-- High cardinality = many unique values (good for indexing)
-- Low cardinality = few unique values (consider bitmap indexes)
-- Medium cardinality = balanced distribution
-- Zero cardinality = all NULL values
```
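A quick cardinality check can be run with the same substitution variables; this is a minimal sketch (a ratio near 1 suggests high cardinality, near 0 suggests low cardinality):

```sql
-- Distinct-to-total ratio as a rough cardinality indicator
select count(distinct &column) as distinct_values,
       count(*) as total_rows,
       round(count(distinct &column) / nullif(count(*), 0), 4) as cardinality_ratio
from &table;
```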
Data Quality Metrics

```sql
-- Calculate quality metrics:
-- Completeness = (Total rows - NULL count) / Total rows
-- Uniqueness = Distinct values / Total rows
-- Validity = Valid values / Total rows
-- Consistency = Expected patterns / Total values
```
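The first two metrics translate directly into SQL; a sketch using the same substitution variables:

```sql
-- Completeness and uniqueness computed in one pass
select round(count(&column) / nullif(count(*), 0), 4)          as completeness,
       round(count(distinct &column) / nullif(count(*), 0), 4) as uniqueness
from &table;
```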
Statistical Analysis

```sql
-- Derive statistical measures:
-- Mode = Most frequently occurring value
-- Frequency distribution = Value occurrence patterns
-- Outlier detection = Values with unusual frequencies
-- Concentration ratio = Distribution of top values
```
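For the mode, Oracle's STATS_MODE aggregate avoids sorting the distribution by hand; a minimal sketch (on ties it returns one of the most frequent values):

```sql
-- Most frequent value in the column
select stats_mode(&column) as most_common_value
from &table;
```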
Practical Usage Examples

Customer Status Analysis
```
-- Analyze customer status distribution:
Table: customers
Column: status

-- Results might show:
-- ACTIVE: 15,450
-- INACTIVE: 2,340
-- SUSPENDED: 156
-- PENDING: 89
```

Product Category Distribution
```
-- Analyze product categories:
Table: products
Column: category

-- Results reveal:
-- Electronics: 2,500
-- Clothing: 1,800
-- Books: 3,200
-- Home & Garden: 950
```

Geographic Analysis
```
-- Analyze customer geographic distribution:
Table: customers
Column: state

-- Results show:
-- CA: 5,230
-- TX: 3,890
-- NY: 3,456
-- FL: 2,890
```

Order Status Tracking
```
-- Analyze order status distribution:
Table: orders
Column: order_status

-- Results indicate:
-- SHIPPED: 45,670
-- PROCESSING: 1,230
-- PENDING: 567
-- CANCELLED: 234
```

Data Quality Insights
Anomaly Detection
Unexpected Values
```sql
-- Identify data quality issues:
-- Misspellings in categorical data
-- Invalid codes or statuses
-- Inconsistent formatting
-- Unexpected NULL values
```
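Low-frequency values are a common symptom of misspellings in categorical data; a sketch that surfaces them (the threshold of 5 is an arbitrary choice):

```sql
-- Rare values in a categorical column are often typos
select &column, count(*) total
from &table
group by &column
having count(*) <= 5
order by count(*);
```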
Data Entry Errors

```sql
-- Common error patterns:
-- Case sensitivity variations ("Active" vs. "ACTIVE")
-- Spelling variations ("Cancelled" vs. "Canceled")
-- Extra spaces or special characters
-- Invalid enumerated values
```
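Case-only variants can be found by grouping on a normalized form of the value; a minimal sketch:

```sql
-- Values that differ only by case are likely entry errors
select upper(&column) as normalized_value,
       count(distinct &column) as case_variants,
       count(*) as total
from &table
where &column is not null
group by upper(&column)
having count(distinct &column) > 1;
```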
Completeness Assessment

NULL Value Analysis
```sql
-- NULL value implications:
-- High NULL counts may indicate:
-- - Data collection issues
-- - Optional field usage patterns
-- - Data migration problems
-- - Business process gaps
```

Data Coverage
```sql
-- Assess data coverage:
-- Expected vs. actual value ranges
-- Missing categorical values
-- Incomplete data populations
-- Temporal data gaps
```

Performance and Optimization
Index Planning
Selectivity Assessment
```sql
-- Use distribution for index planning:
-- High selectivity columns = good B-tree index candidates
-- Low selectivity columns = consider bitmap indexes
-- Very low selectivity = may not benefit from indexing
-- Skewed distributions = consider histogram statistics
```
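When the distribution is heavily skewed, a histogram helps the optimizer estimate cardinalities; a sketch using DBMS_STATS (SIZE 254 is a common maximum bucket count, and the table/column are the script's substitution variables):

```sql
-- Gather a histogram on the skewed column
begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => '&table',
    method_opt => 'FOR COLUMNS &column SIZE 254');
end;
/
```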
Composite Index Strategy

```sql
-- Plan composite indexes:
-- Lead with high-selectivity columns
-- Consider query patterns
-- Balance maintenance overhead
-- Assess space requirements
```

Partitioning Strategy
Partition Key Selection
```sql
-- Evaluate partitioning potential:
-- Even distribution = good hash partitioning candidate
-- Skewed distribution = consider list partitioning
-- Date/time patterns = range partitioning opportunity
-- Null handling = partition by special values
```
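As an illustration, the skewed order_status distribution from the earlier example could drive a list-partitioned design; a hypothetical sketch (the table and columns are invented for the example):

```sql
-- Heavy values get their own partitions; the rest share a default
create table orders_part (
  order_id     number,
  order_status varchar2(20)
)
partition by list (order_status) (
  partition p_shipped values ('SHIPPED'),
  partition p_open    values ('PROCESSING', 'PENDING'),
  partition p_other   values (default)
);
```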
Data Skew Management

```sql
-- Handle data skew:
-- Identify heavily populated values
-- Plan partition boundaries accordingly
-- Consider sub-partitioning strategies
-- Monitor partition balance over time
```

Business Intelligence Applications
Market Analysis
Customer Segmentation
```sql
-- Customer analysis applications:
-- Demographics distribution
-- Purchase behavior patterns
-- Geographic market analysis
-- Customer lifecycle stages
```

Product Performance
```sql
-- Product analysis insights:
-- Category performance comparison
-- Seasonal demand patterns
-- Inventory turnover rates
-- Market penetration metrics
```

Operational Insights
Process Efficiency
```sql
-- Operations analysis:
-- Workflow status distributions
-- Processing time patterns
-- Error rate analysis
-- Resource utilization patterns
```

Quality Monitoring
```sql
-- Quality control applications:
-- Defect rate distributions
-- Quality grade patterns
-- Compliance status tracking
-- Process improvement opportunities
```

Integration with Analytics
Statistical Software
```sql
-- Export results for analysis:
-- Import into R or Python for advanced statistics
-- Create visualizations in BI tools
-- Generate management reports
-- Support predictive modeling
```

Data Warehousing
```sql
-- Data warehouse applications:
-- Dimension table profiling
-- Fact table distribution analysis
-- ETL quality validation
-- Historical trend analysis
```

Machine Learning
```sql
-- ML model support:
-- Feature distribution analysis
-- Class imbalance detection
-- Data preprocessing planning
-- Model validation support
```

Automation and Reporting
Automated Profiling
```sql
-- Automate distribution analysis:
-- Schedule regular profiling jobs
-- Generate automated reports
-- Monitor data quality trends
-- Alert on significant changes
```
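A scheduled profiling job can be set up with DBMS_SCHEDULER; a hypothetical sketch (the profile_column procedure is assumed to exist and wrap the distribution query):

```sql
-- Nightly profiling run at 02:00; job name and procedure are illustrative
begin
  dbms_scheduler.create_job(
    job_name        => 'PROFILE_ORDER_STATUS',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin profile_column(''ORDERS'', ''ORDER_STATUS''); end;',
    repeat_interval => 'FREQ=DAILY;BYHOUR=2',
    enabled         => true);
end;
/
```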
Dashboard Integration

```sql
-- Dashboard integration:
-- Real-time distribution monitoring
-- Quality metric dashboards
-- Trend analysis visualizations
-- Executive summary reports
```

Change Detection
```sql
-- Monitor distribution changes:
-- Compare current vs. historical distributions
-- Detect significant pattern shifts
-- Alert on data quality degradation
-- Track improvement initiatives
```
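One way to compare current and historical distributions is a snapshot table populated by each profiling run; a hypothetical sketch (col_dist_history and its columns are invented for the example, and NULL values would need special handling in the join):

```sql
-- History table written by each scheduled profiling run
create table col_dist_history (
  snap_date date,
  col_value varchar2(100),
  total     number
);

-- Delta between the latest snapshot and the live distribution
select h.col_value,
       h.total           as previous_total,
       c.total           as current_total,
       c.total - h.total as delta
from col_dist_history h
join (select &column as col_value, count(*) as total
      from &table
      group by &column) c
  on c.col_value = h.col_value
where h.snap_date = (select max(snap_date) from col_dist_history);
```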
Best Practices

Analysis Planning
```sql
-- Plan effective analysis:
-- Start with critical business columns
-- Consider performance impact on large tables
-- Schedule during off-peak hours
-- Document findings and insights
```

Performance Considerations
```sql
-- Optimize for performance:
-- Ensure appropriate indexes exist
-- Consider sampling for very large tables
-- Monitor resource utilization
-- Plan for temporary space requirements
```
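For very large tables, Oracle's SAMPLE clause gives an approximate distribution at a fraction of the cost; a minimal sketch (a 1% block sample, with counts scaled up as rough estimates):

```sql
-- Approximate distribution from a 1% block sample
select &column, count(*) * 100 as estimated_total
from &table sample block (1)
group by &column
order by count(*) desc;
```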
Data Interpretation

```sql
-- Interpret results effectively:
-- Understand business context
-- Consider temporal factors
-- Validate unexpected patterns
-- Collaborate with business stakeholders
```

Related Scripts
- qcoldup.sql - Find duplicate column values
- dtabcol.sql - Table column analysis
- dtable.sql - General table analysis
- dstatbig.sql - Statistics for large tables
Advanced Modifications
Percentage Distribution
```sql
-- Calculate percentages:
select &column,
       count(*) total,
       round(count(*) * 100.0 / sum(count(*)) over (), 2) as percentage
from &table
group by &column
order by count(*) desc;
```

Top N Analysis
```sql
-- Show only top N values:
select *
from (select &column, count(*) total
      from &table
      group by &column
      order by count(*) desc)
where rownum <= 10;
```

Statistical Summary
```sql
-- Add statistical measures:
select count(distinct &column)                     as distinct_values,
       count(*)                                    as total_rows,
       count(&column)                              as non_null_rows,
       count(*) - count(&column)                   as null_rows,
       round(count(&column) * 100.0 / count(*), 2) as completeness_pct
from &table;
```

Summary
This script is essential for:
- Data Profiling - Understanding data characteristics and patterns
- Quality Assessment - Identifying data quality issues and anomalies
- Business Analysis - Supporting business intelligence and decision making
- Database Optimization - Planning indexing and partitioning strategies
- Pattern Recognition - Discovering business insights and operational patterns