Data Profiling Guide

Data profiling is the process of examining data to understand its structure, content, quality, and patterns. With Sparvi, you can easily profile your database tables to gain deep insights into your data.

Understanding Sparvi Profiles

When you profile a table with Sparvi, it computes a variety of metrics and statistics that give you a comprehensive view of your data. A profile includes:

Basic Metrics

  • Row count: Total number of rows in the table
  • Column count: Total number of columns
  • Duplicate count: Number of duplicate rows
  • Timestamp: When the profile was created

Completeness Metrics

For each column, Sparvi calculates:

  • Null count: Number of NULL values
  • Null percentage: Percentage of NULL values
  • Distinct count: Number of unique values
  • Distinct percentage: Percentage of unique values

Numeric Column Statistics

For numeric columns, Sparvi calculates:

  • Min/Max: Minimum and maximum values
  • Mean/Median: Average and middle values
  • Standard Deviation: Measure of data dispersion
  • Quartiles: 25th, 50th, and 75th percentiles
  • Outliers: Values that fall significantly outside the normal range

Text Column Statistics

For text columns, Sparvi analyzes:

  • Min/Max/Avg Length: Character length statistics
  • Pattern Recognition: Common patterns in the data
  • Format Consistency: Checks for consistent formatting
  • Empty String Detection: Identifies empty strings (different from NULLs)

Date Column Statistics

For date and timestamp columns, Sparvi calculates:

  • Min/Max Date: Earliest and latest dates
  • Date Range: Span of time represented
  • Temporal Patterns: Distribution across time periods
  • Future Dates: Detection of dates in the future
  • Unrealistic Dates: Identification of dates that are unreasonably old

Anomaly Detection

Sparvi automatically detects anomalies such as:

  • Statistical Outliers: Values far from the statistical norm
  • Pattern Deviations: Inconsistencies in data patterns
  • Distribution Anomalies: Unusual distributions of values
  • Completeness Issues: Unexpected NULL patterns
  • Schema Shifts: Changes in table schema compared to historical data

Using Data Profiling in Sparvi Cloud

Sparvi Cloud provides automated data profiling through the web interface:

  1. Connect Your Database: Add your database connection in the Connections section
  2. Navigate to Data Explorer: Browse your database schemas and tables
  3. View Profiles: Click on any table to see its comprehensive data profile
  4. Automated Scheduling: Configure automated profiling schedules for continuous monitoring
  5. Historical Comparison: Track changes over time with automatic historical comparisons
  6. Anomaly Detection: Get alerts when unusual patterns or data quality issues are detected

Analyzing Profiles in Sparvi Cloud

Sparvi Cloud provides comprehensive profile analysis through interactive dashboards and visualizations.

Viewing Profile Metrics

In the Data Explorer, when you view a table profile, Sparvi Cloud displays:

  • Table Overview: Row count, column count, duplicate count, and profile timestamp
  • Column Completeness Dashboard: Visual representation of null percentages and distinct value counts for all columns
  • Interactive Charts: Sortable and filterable views of all column statistics

Identifying Data Quality Issues

Sparvi Cloud automatically highlights problematic columns:

  • High Null Rates: Columns with elevated null percentages are flagged with warning indicators
  • Low Cardinality Issues: Columns with very low distinct value percentages are identified
  • Sorting and Filtering: Sort columns by null percentage, distinct count, or other metrics to quickly find issues

Numeric Column Analysis

For numeric columns, Sparvi Cloud provides:

  • Statistical Summary: View min, max, average, median, and standard deviation
  • Distribution Visualizations: Histograms and box plots showing data distribution
  • Outlier Detection: Automatically identified outliers with example values
  • Trend Analysis: Historical comparison of numeric statistics over time

Anomaly Detection

Sparvi Cloud automatically detects and displays anomalies:

  • Anomaly List: All detected anomalies with severity levels (high, medium, low)
  • Anomaly Details: Description, affected columns, and detection timestamp
  • Schema Change Tracking: Automatic detection of schema shifts with before/after comparison
  • Filterable Views: Filter anomalies by type, severity, or affected table/column

Dashboard and Reporting

The Sparvi Cloud dashboard provides comprehensive visualizations:

  • Anomaly Trends Over Time: Line charts showing anomaly detection patterns
  • Severity Breakdown: Visual distribution of high, medium, and low severity issues
  • Anomaly Type Analysis: Charts showing which anomaly types are most common
  • Table Health Heatmap: Color-coded view of data quality across all tables
  • Export Options: Download reports and share with your team

Best Practices

  1. Profile Regularly: Run profiles on a regular schedule to track changes over time
  2. Use Historical Comparisons: Always compare current profiles with previous ones
  3. Start with Basic Metrics: Begin analysis with row counts and null percentages
  4. Focus on Outliers: Pay special attention to statistical outliers
  5. Look for Patterns: Analyze patterns in text data for inconsistencies
  6. Check Date Ranges: Ensure date ranges fall within expected bounds
  7. Track Schema Changes: Monitor changes to table structures over time
  8. Automate Alerting: Set up alerts for significant changes in key metrics
  9. Adjust Thresholds: Customize anomaly detection thresholds for your data
  10. Document Findings: Keep records of data quality issues for future reference

Next Steps

After profiling your data, consider:

  1. Setting up Validation Rules based on profile insights
  2. Exploring Anomaly Detection for more advanced monitoring
  3. Connecting your databases with Sparvi Cloud Database Connections