LM-Polygraph Normalization: Impact Areas and Default Behaviors

Normalization Impact Areas

1. Score Transformation

Raw Uncertainty Scores

  • Original uncertainty estimates in unbounded ranges

  • Higher values indicate more uncertainty

  • Various scales depending on estimation method

Normalized Confidence Values

  • Bounded in [0,1] range

  • Higher values indicate more confidence

  • Directly interpretable probabilities

  • Preserves relative ordering of original scores (for Isotonic PCC)

2. Evaluation Pipeline

Calibration Stage

  • Uses calibration dataset to learn normalization parameters

  • Requires generation quality metrics for Performance-Calibrated Confidence

  • Can use either task-specific or background data

  • Parameters are saved for reuse

Inference Stage

  • Applies learned normalization to new uncertainty estimates

  • No additional model inference required

  • Fast transformation using stored parameters

  • Can be applied to any compatible uncertainty estimator

3. Quality Metrics Integration

Metric Normalization

  • Quality metrics are normalized to [0,1] range

  • Enables consistent calibration across different metrics

  • Handles various metric types (ROUGE, BLEU, accuracy, etc.)

  • Supports both bounded and unbounded metrics

Metric Selection

  • Different metrics for different tasks

  • Task-specific normalization of quality scores

  • Multiple metrics can be used simultaneously

  • Quality metrics guide confidence calibration

4. Model Types Support

White-box Models

  • Access to internal probabilities and logits

  • Can normalize token-level uncertainties

  • Supports both sequence and token-level calibration

  • Works with HuggingFace models

Black-box Models

  • Limited to output-based uncertainty estimation

  • Only sequence-level normalization

  • Compatible with API-based models (OpenAI, etc.)

  • No access to internal model states

Default Behaviors

1. Score Processing

# Default score processing behavior
normalize_scores = {
    'clip_values': True,           # Clip to [0,1] range
    'flip_uncertainty': True,      # Convert uncertainty to confidence
    'preserve_order': True,        # Maintain sample ordering
    'handle_nans': 'ignore'        # Skip NaN values in calibration
}

2. Calibration Settings

# Default calibration configuration
calibration_defaults = {
    'strategy': 'dataset_specific',    # Use task-specific calibration
    'num_samples': 1000,               # Default calibration set size
    'background_data': None,           # No background data by default
    'split_ratio': None,               # No train/test split
    'cache_enabled': True              # Cache calibration parameters
}

3. Method Selection

# Default normalization method selection
method_defaults = {
    'primary_method': 'isotonic_pcc',  # Default to Isotonic PCC
    'fallback_method': 'minmax',       # Use MinMax as fallback
    'combine_methods': False,          # Don't combine multiple methods
    'quality_metric': 'auto'           # Auto-select appropriate metric
}

4. Task-Specific Defaults

# Task-type specific defaults
task_defaults:
  qa:
    metric: 'accuracy'
    normalize_answers: true
    ignore_case: true

  translation:
    metric: 'bleu'
    normalize_translations: true
    source_cleaning: true

  summarization:
    metric: 'rouge'
    normalize_summaries: true
    trim_outputs: true

5. Error Handling

# Default error handling behavior
error_handling = {
    'invalid_scores': 'skip',          # Skip invalid uncertainty scores
    'missing_metrics': 'error',        # Raise error for missing metrics
    'calibration_fails': 'fallback',   # Use fallback method if calibration fails
    'out_of_bounds': 'clip'           # Clip out-of-bounds values
}

6. Memory Management

# Default memory management settings
memory_settings = {
    'cache_location': '~/.cache/lm-polygraph/norm',
    'max_cache_size': '1GB',
    'clear_cache_on_exit': False,
    'compression': True
}

Usage Guidelines

1. Choosing Calibration Data

  • Use task-specific data when available

  • Ensure calibration set is representative

  • Consider using background data for sparse tasks

  • Monitor calibration set size vs. performance

2. Method Selection

  • Start with Isotonic PCC for best balance

  • Use MinMax for simple scaling needs

  • Consider Binned PCC for interpretability

  • Evaluate multiple methods if uncertain

3. Error Handling

  • Monitor normalization failures

  • Validate calibration success

  • Check normalized score distributions

  • Verify quality metric calculations

4. Performance Optimization

  • Enable caching for repeated use

  • Adjust calibration set size as needed

  • Use appropriate quality metrics

  • Monitor memory usage