
Column Value Distribution Analysis (qcolgrp.sql)

This script analyzes the distribution of values in any specified table column, showing each unique value along with its occurrence count. It’s essential for data profiling, understanding data patterns, identifying data quality issues, and supporting statistical analysis. The script provides a complete frequency distribution of all values in the specified column, including NULL values.

rem qcolgrp.sql
rem Frequency distribution of values in a user-specified table column
accept table char prompt 'Table: '
accept column char prompt 'Column: '
rem
ttitle 'Column Distribution'
rem
break on report
compute sum of total on report
rem
select &column, count(*) total
from &table
group by &column;
rem
undefine table
undefine column
Key Features:

  • Universal Column Analysis: Works with any table and column combination
  • Complete Distribution: Shows all unique values with occurrence counts
  • Interactive Input: User-friendly prompts for table and column names
  • Statistical Summary: Provides total row count across all values
  • NULL Value Handling: Includes NULL values in the distribution analysis
  • Data Type Agnostic: Works with any data type (VARCHAR2, NUMBER, DATE, etc.)
  • Variable Cleanup: Automatically undefines variables after execution
Usage:

@qcolgrp.sql

Interactive Prompts:

  • Table: Table name to analyze
  • Column: Column name to analyze for value distribution
Required Privileges:

  • SELECT on the specified table
Sample Output:

Column Distribution
DEPARTMENT_NAME TOTAL
---------------------------------------- ----------
ACCOUNTING 3
ADMINISTRATION 1
EXECUTIVE 1
FINANCE 6
HUMAN RESOURCES 1
IT 5
MARKETING 2
OPERATIONS 4
PROCUREMENT 1
PUBLIC RELATIONS 1
PURCHASING 5
RESEARCH 2
SALES 34
SHIPPING 45
----------
sum 111
Output Columns:

  • Column Value: The actual unique value found in the analyzed column
  • TOTAL: Number of times this value appears in the table
  • sum: Total count of all rows in the table
-- Distribution analysis reveals:
-- Data concentration patterns
-- Value frequency distributions
-- Outliers and anomalies
-- Data quality issues
-- Business pattern insights
-- Key statistical measures:
-- Most/least frequent values
-- Data skewness indicators
-- Cardinality assessment
-- Null value prevalence
-- Uniformity evaluation
-- Business patterns revealed:
-- Customer behavior patterns
-- Product popularity trends
-- Geographic distribution
-- Seasonal variations
-- Market segmentation data
-- Support business decisions:
-- Resource allocation planning
-- Marketing strategy development
-- Inventory management optimization
-- Customer segmentation strategies
  1. Data Quality Assessment

    • Identify unexpected or invalid values
    • Detect data entry errors and inconsistencies
    • Assess data completeness (NULL analysis)
    • Validate data migration quality
  2. Business Analysis

    • Understand customer demographics
    • Analyze product sales patterns
    • Evaluate geographic distribution
    • Study transaction patterns
  3. Database Design Support

    • Determine appropriate data types
    • Plan indexing strategies
    • Assess normalization needs
    • Design constraint requirements
  4. Performance Optimization

    • Analyze column selectivity for indexing
    • Understand data skew for partitioning
    • Plan optimizer statistics strategies
    • Support query optimization efforts
-- Assess column cardinality:
-- High cardinality = many unique values (good for indexing)
-- Low cardinality = few unique values (consider bitmap indexes)
-- Medium cardinality = balanced distribution
-- Zero cardinality = all NULL values
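The cardinality buckets above can be estimated in a single pass over the column; this is a minimal sketch reusing the same &table and &column substitution variables, and the high/low thresholds remain rules of thumb rather than fixed limits:

-- Cardinality ratio: distinct non-NULL values relative to total rows
select count(distinct &column)                                  as distinct_values,
       count(*)                                                 as total_rows,
       round(count(distinct &column) / nullif(count(*), 0), 4)  as cardinality_ratio
from   &table;

A ratio close to 1 suggests high cardinality (B-tree candidate); a ratio close to 0 suggests low cardinality.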
-- Calculate quality metrics:
-- Completeness = (Total rows - NULL count) / Total rows
-- Uniqueness = Distinct values / Total rows
-- Validity = Valid values / Total rows
-- Consistency = Expected patterns / Total values
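Completeness and uniqueness follow directly from COUNT variants (the statistical query shown later in this page covers them); validity needs a business-defined list of allowed values, sketched here with a hypothetical &valid_list substitution variable:

-- Validity sketch: &valid_list is a hypothetical comma-separated list of allowed values, e.g. 'ACTIVE','INACTIVE'
select round(sum(case when &column in (&valid_list) then 1 else 0 end)
             / nullif(count(*), 0), 4) as validity
from   &table;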
-- Derive statistical measures:
-- Mode = Most frequently occurring value
-- Frequency distribution = Value occurrence patterns
-- Outlier detection = Values with unusual frequencies
-- Concentration ratio = Distribution of top values
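As a sketch of the first two measures, Oracle's STATS_MODE aggregate returns the most frequent value, and a simple top-5 share approximates the concentration ratio (the cutoff of 5 is arbitrary):

-- Mode: the single most frequent value (ties are broken arbitrarily)
select stats_mode(&column) as mode_value
from   &table;

-- Concentration ratio: share of all rows held by the 5 most frequent values
select round(sum(case when rn <= 5 then cnt else 0 end) / sum(cnt), 4) as top5_share
from  (select count(*) as cnt,
              row_number() over (order by count(*) desc) as rn
       from   &table
       group  by &column);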
-- Analyze customer status distribution:
Table: customers
Column: status
-- Results might show:
-- ACTIVE: 15,450
-- INACTIVE: 2,340
-- SUSPENDED: 156
-- PENDING: 89
-- Analyze product categories:
Table: products
Column: category
-- Results reveal:
-- Electronics: 2,500
-- Clothing: 1,800
-- Books: 3,200
-- Home & Garden: 950
-- Analyze customer geographic distribution:
Table: customers
Column: state
-- Results show:
-- CA: 5,230
-- TX: 3,890
-- NY: 3,456
-- FL: 2,890
-- Analyze order status distribution:
Table: orders
Column: order_status
-- Results indicate:
-- SHIPPED: 45,670
-- PROCESSING: 1,230
-- PENDING: 567
-- CANCELLED: 234
-- Identify data quality issues:
-- Misspellings in categorical data
-- Invalid codes or statuses
-- Inconsistent formatting
-- Unexpected NULL values
-- Common error patterns:
-- Case sensitivity variations ("Active" vs. "ACTIVE")
-- Spelling variations ("Cancelled" vs. "Canceled")
-- Extra spaces or special characters
-- Invalid enumerated values
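A hedged variation on the base query can surface case-only and whitespace-only variants by grouping on a normalized form of the value; spelling variations still need manual review:

-- Values that collapse to the same normalized form in more than one raw spelling
select upper(trim(&column))     as normalized_value,
       count(distinct &column)  as raw_variants,
       count(*)                 as total
from   &table
group  by upper(trim(&column))
having count(distinct &column) > 1;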
-- NULL value implications:
-- High NULL counts may indicate:
-- - Data collection issues
-- - Optional field usage patterns
-- - Data migration problems
-- - Business process gaps
-- Assess data coverage:
-- Expected vs. actual value ranges
-- Missing categorical values
-- Incomplete data populations
-- Temporal data gaps
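Expected-versus-actual coverage can be checked with a MINUS against a reference list; the expected_values table below is hypothetical and would normally be a lookup table or the source of a check constraint:

-- Expected categorical values that never occur in the analyzed column
select expected_value
from   expected_values
minus
select distinct &column
from   &table;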
-- Use distribution for index planning:
-- High selectivity columns = good B-tree index candidates
-- Low selectivity columns = consider bitmap indexes
-- Very low selectivity = may not benefit from indexing
-- Skewed distributions = consider histogram statistics
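Where the output shows a visibly skewed distribution, gathering a histogram on the column lets the optimizer account for that skew; this DBMS_STATS call is a sketch that assumes the current schema owns the table:

begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => upper('&table'),
    method_opt => 'FOR COLUMNS &column SIZE 254');
end;
/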
-- Plan composite indexes:
-- Lead with high-selectivity columns
-- Consider query patterns
-- Balance maintenance overhead
-- Assess space requirements
-- Evaluate partitioning potential:
-- Even distribution = good hash partitioning candidate
-- Skewed distribution = consider list partitioning
-- Date/time patterns = range partitioning opportunity
-- Null handling = partition by special values
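As an illustration only, a skewed status-style column (like the order_status example earlier) might map to list partitions along these lines; the table definition and partition groupings are hypothetical:

-- Hypothetical list-partitioned table shaped by an observed skewed distribution
create table orders_part (
  order_id     number,
  order_status varchar2(20)
)
partition by list (order_status) (
  partition p_shipped values ('SHIPPED'),
  partition p_active  values ('PROCESSING', 'PENDING'),
  partition p_other   values (default)
);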
-- Handle data skew:
-- Identify heavily populated values
-- Plan partition boundaries accordingly
-- Consider sub-partitioning strategies
-- Monitor partition balance over time
-- Customer analysis applications:
-- Demographics distribution
-- Purchase behavior patterns
-- Geographic market analysis
-- Customer lifecycle stages
-- Product analysis insights:
-- Category performance comparison
-- Seasonal demand patterns
-- Inventory turnover rates
-- Market penetration metrics
-- Operations analysis:
-- Workflow status distributions
-- Processing time patterns
-- Error rate analysis
-- Resource utilization patterns
-- Quality control applications:
-- Defect rate distributions
-- Quality grade patterns
-- Compliance status tracking
-- Process improvement opportunities
-- Export results for analysis:
-- Import into R or Python for advanced statistics
-- Create visualizations in BI tools
-- Generate management reports
-- Support predictive modeling
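A minimal SQL*Plus sketch for exporting the distribution to CSV for downstream tools; SET MARKUP CSV needs SQL*Plus 12.2 or later, and older clients can fall back to COLSEP with SPOOL:

set markup csv on
spool colgrp_&table..csv
select &column, count(*) as total
from   &table
group  by &column
order  by count(*) desc;
spool off
set markup csv off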
-- Data warehouse applications:
-- Dimension table profiling
-- Fact table distribution analysis
-- ETL quality validation
-- Historical trend analysis
-- ML model support:
-- Feature distribution analysis
-- Class imbalance detection
-- Data preprocessing planning
-- Model validation support
-- Automate distribution analysis:
-- Schedule regular profiling jobs
-- Generate automated reports
-- Monitor data quality trends
-- Alert on significant changes
-- Dashboard integration:
-- Real-time distribution monitoring
-- Quality metric dashboards
-- Trend analysis visualizations
-- Executive summary reports
-- Monitor distribution changes:
-- Compare current vs. historical distributions
-- Detect significant pattern shifts
-- Alert on data quality degradation
-- Track improvement initiatives
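One simple way to track drift is to persist each run into a snapshot table and compare runs by date; the col_distribution_history table below is a hypothetical structure, not part of qcolgrp.sql:

-- Hypothetical snapshot table for distribution history
create table col_distribution_history (
  snap_date   date default sysdate,
  table_name  varchar2(128),
  column_name varchar2(128),
  col_value   varchar2(4000),
  total       number
);

insert into col_distribution_history (table_name, column_name, col_value, total)
select upper('&table'), upper('&column'), to_char(&column), count(*)
from   &table
group  by &column;
commit;

Comparing two snap_date values with a full outer join then highlights values that appeared, disappeared, or changed materially between runs.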
-- Plan effective analysis:
-- Start with critical business columns
-- Consider performance impact on large tables
-- Schedule during off-peak hours
-- Document findings and insights
-- Optimize for performance:
-- Ensure appropriate indexes exist
-- Consider sampling for very large tables
-- Monitor resource utilization
-- Plan for temporary space requirements
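For very large tables, Oracle's SAMPLE clause gives an approximate distribution at a fraction of the cost; the 1 percent block sample here is an arbitrary choice, and the scaled counts are estimates rather than exact totals:

-- Approximate distribution from a 1% block sample; scale counts by 100 to estimate totals
select &column, count(*) * 100 as estimated_total
from   &table sample block (1)
group  by &column
order  by count(*) desc;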
-- Interpret results effectively:
-- Understand business context
-- Consider temporal factors
-- Validate unexpected patterns
-- Collaborate with business stakeholders
-- Calculate percentages:
select &column,
count(*) total,
round(count(*) * 100.0 / sum(count(*)) over(), 2) as percentage
from &table
group by &column
order by count(*) desc;
-- Show only top N values:
select * from (
select &column, count(*) total
from &table
group by &column
order by count(*) desc
) where rownum <= 10;
-- Add statistical measures:
select
count(distinct &column) as distinct_values,
count(*) as total_rows,
count(&column) as non_null_rows,
count(*) - count(&column) as null_rows,
round(count(&column) * 100.0 / count(*), 2) as completeness_pct
from &table;

This script is essential for:

  1. Data Profiling - Understanding data characteristics and patterns
  2. Quality Assessment - Identifying data quality issues and anomalies
  3. Business Analysis - Supporting business intelligence and decision making
  4. Database Optimization - Planning indexing and partitioning strategies
  5. Pattern Recognition - Discovering business insights and operational patterns