LM-Polygraph Normalization: Impact Areas and Default Behaviors
===============================================================

Normalization Impact Areas
--------------------------

1. Score Transformation
^^^^^^^^^^^^^^^^^^^^^^^

**Raw Uncertainty Scores**

- Original uncertainty estimates in unbounded ranges
- Higher values indicate more uncertainty
- Various scales depending on estimation method

**Normalized Confidence Values**

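- Bounded in the [0, 1] range
- Higher values indicate more confidence
- For Performance-Calibrated Confidence (PCC) methods, interpretable as the expected generation quality; MinMax output is only a rescaled score
- Isotonic PCC is monotonic: a sample with higher raw uncertainty never receives higher confidence, although ties can appear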
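
The direction flip described above can be illustrated with a minimal MinMax-style sketch. The function name ``minmax_confidence`` is hypothetical and is not part of the LM-Polygraph API; it only shows how an unbounded uncertainty becomes a bounded confidence.

.. code-block:: python

    def minmax_confidence(uncertainties):
        """Map raw uncertainties to confidences in [0, 1]; higher means more confident."""
        lo, hi = min(uncertainties), max(uncertainties)
        if hi == lo:
            # Degenerate case: identical scores carry no ranking information.
            return [0.5 for _ in uncertainties]
        # Rescale to [0, 1], then flip so that low uncertainty gives high confidence.
        return [1.0 - (u - lo) / (hi - lo) for u in uncertainties]

    raw = [0.2, 3.5, 1.1, 7.8]
    conf = minmax_confidence(raw)

The least uncertain sample receives confidence 1.0 and the most uncertain receives 0.0, so the ranking of samples is reversed relative to the raw scores.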

2. Evaluation Pipeline
^^^^^^^^^^^^^^^^^^^^^^

**Calibration Stage**

- Uses calibration dataset to learn normalization parameters
- Requires generation quality metrics for Performance-Calibrated Confidence
- Can use either task-specific or background data
- Parameters are saved for reuse

**Inference Stage**

- Applies learned normalization to new uncertainty estimates
- No additional model inference required
- Fast transformation using stored parameters
- Can be applied to any compatible uncertainty estimator
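
The two stages can be sketched as a fit step and an apply step. The example below is an assumption-laden illustration, not LM-Polygraph's implementation: it fits a non-increasing step function with the pool-adjacent-violators algorithm, serializes the parameters as JSON, and applies them to new scores. The names ``fit_isotonic`` and ``apply_isotonic`` and the JSON layout are hypothetical.

.. code-block:: python

    import bisect
    import json

    def fit_isotonic(uncertainties, qualities):
        """Calibration stage: fit a non-increasing map from uncertainty to quality."""
        pairs = sorted(zip(uncertainties, qualities))
        blocks = []  # each block: [quality_sum, count, upper_uncertainty_edge]
        for u, q in pairs:
            blocks.append([q, 1, u])
            # Pool adjacent violators: merge while quality rises with uncertainty.
            while (len(blocks) > 1
                   and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
                q_sum, count, upper = blocks.pop()
                blocks[-1][0] += q_sum
                blocks[-1][1] += count
                blocks[-1][2] = upper
        return {
            "edges": [b[2] for b in blocks],
            "values": [b[0] / b[1] for b in blocks],
        }

    def apply_isotonic(params, uncertainty):
        """Inference stage: look up the confidence for one new uncertainty score."""
        idx = bisect.bisect_left(params["edges"], uncertainty)
        idx = min(idx, len(params["values"]) - 1)  # clip scores beyond the fitted range
        return params["values"][idx]

    # Calibration: uncertainties paired with quality scores already in [0, 1].
    params = fit_isotonic([0.5, 1.0, 2.0, 3.0, 4.0], [1.0, 0.8, 0.9, 0.2, 0.1])
    saved = json.dumps(params)      # parameters can be stored for reuse
    restored = json.loads(saved)    # and reloaded without refitting
    confidence = apply_isotonic(restored, 1.5)

The apply step is a single sorted lookup, which is why no further model inference is needed once the parameters exist.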

3. Quality Metrics Integration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Metric Normalization**

- Quality metrics are normalized to [0,1] range
- Enables consistent calibration across different metrics
- Handles various metric types (ROUGE, BLEU, accuracy, etc.)
- Supports both bounded and unbounded metrics
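
One way to bring metrics onto a common [0, 1] scale is sketched below. It assumes that bounded metrics come with a known range and that unbounded metrics fall back to the range observed on the calibration set; ``normalize_metric`` is a hypothetical helper, not a library function.

.. code-block:: python

    def normalize_metric(values, lower=None, upper=None):
        """Rescale quality scores to [0, 1] using known or observed bounds."""
        lo = min(values) if lower is None else lower
        hi = max(values) if upper is None else upper
        if hi <= lo:
            raise ValueError("metric range must have positive width")
        # Clip so that values outside the stated bounds stay inside [0, 1].
        return [min(1.0, max(0.0, (v - lo) / (hi - lo))) for v in values]

    # Bounded metric reported on a 0-100 scale (assumed here for BLEU).
    bleu = normalize_metric([12.5, 40.0, 100.0], lower=0.0, upper=100.0)
    # Unbounded metric: bounds are taken from the observed values.
    unbounded = normalize_metric([-3.0, 0.0, 5.0])

Observed bounds depend on the calibration sample, so an unbounded metric normalized this way is only comparable across runs that share the same calibration data.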

**Metric Selection**

- Different metrics for different tasks
- Task-specific normalization of quality scores
- Multiple metrics can be used simultaneously
- Quality metrics guide confidence calibration

4. Model Types Support
^^^^^^^^^^^^^^^^^^^^^^

**White-box Models**

- Access to internal probabilities and logits
- Can normalize token-level uncertainties
- Supports both sequence and token-level calibration
- Works with HuggingFace models

**Black-box Models**

- Limited to output-based uncertainty estimation
- Only sequence-level normalization
- Compatible with API-based models (OpenAI, etc.)
- No access to internal model states

Default Behaviors
-----------------

1. Score Processing
^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Default score processing behavior
    normalize_scores = {
        'clip_values': True,           # Clip to [0,1] range
        'flip_uncertainty': True,      # Convert uncertainty to confidence
        'preserve_order': True,        # Maintain sample ordering
        'handle_nans': 'ignore'        # Skip NaN values in calibration
    }

2. Calibration Settings
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Default calibration configuration
    calibration_defaults = {
        'strategy': 'dataset_specific',    # Use task-specific calibration
        'num_samples': 1000,               # Default calibration set size
        'background_data': None,           # No background data by default
        'split_ratio': None,               # No train/test split
        'cache_enabled': True              # Cache calibration parameters
    }

3. Method Selection
^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Default normalization method selection
    method_defaults = {
        'primary_method': 'isotonic_pcc',  # Default to Isotonic PCC
        'fallback_method': 'minmax',       # Use MinMax as fallback
        'combine_methods': False,          # Don't combine multiple methods
        'quality_metric': 'auto'           # Auto-select appropriate metric
    }

4. Task-Specific Defaults
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: yaml

    # Task-type specific defaults
    task_defaults:
      qa:
        metric: 'accuracy'
        normalize_answers: true
        ignore_case: true

      translation:
        metric: 'bleu'
        normalize_translations: true
        source_cleaning: true

      summarization:
        metric: 'rouge'
        normalize_summaries: true
        trim_outputs: true

5. Error Handling
^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Default error handling behavior
    error_handling = {
        'invalid_scores': 'skip',          # Skip invalid uncertainty scores
        'missing_metrics': 'error',        # Raise error for missing metrics
        'calibration_fails': 'fallback',   # Use fallback method if calibration fails
        'out_of_bounds': 'clip'            # Clip out-of-bounds values
    }
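
The policies listed above could be realized roughly as follows. This is a sketch under the assumption that scores arrive as plain Python floats and that a missing quality metric is represented as ``None``; both helper names are hypothetical.

.. code-block:: python

    import math

    def clean_calibration_pairs(uncertainties, qualities):
        """Skip invalid uncertainty scores; raise on missing quality metrics."""
        kept = []
        for u, q in zip(uncertainties, qualities):
            if q is None:
                raise ValueError("missing quality metric for a calibration sample")
            if u is None or not math.isfinite(u):
                continue  # 'skip' policy: drop NaN or infinite uncertainty scores
            kept.append((u, q))
        return kept

    def clip_confidence(value):
        """'clip' policy: force an out-of-bounds confidence back into [0, 1]."""
        return min(1.0, max(0.0, value))

    pairs = clean_calibration_pairs(
        [0.3, float("nan"), 1.2, float("inf")],
        [1.0, 0.5, 0.0, 0.2],
    )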

6. Memory Management
^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Default memory management settings
    memory_settings = {
        'cache_location': '~/.cache/lm-polygraph/norm',
        'max_cache_size': '1GB',
        'clear_cache_on_exit': False,
        'compression': True
    }

Usage Guidelines
----------------

1. Choosing Calibration Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Use task-specific data when available
- Ensure the calibration set is representative of the inputs seen at inference time
- Consider using background data for tasks with little task-specific data
- Track how calibration quality changes with calibration set size

2. Method Selection
^^^^^^^^^^^^^^^^^^^

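- Start with Isotonic PCC: it calibrates confidence against quality while keeping the ranking implied by the raw scores
- Use MinMax when simple rescaling is enough and no quality metrics are available
- Consider Binned PCC when each confidence value should correspond to a readable bin of similar samples
- Evaluate multiple methods on held-out data if uncertain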
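
To make the Binned PCC option concrete, here is a minimal sketch that assumes equal-frequency bins and uses the mean quality of each bin as its confidence. The function names and the binning rule are illustrative assumptions and may differ from the library's own implementation.

.. code-block:: python

    import bisect

    def fit_binned_pcc(uncertainties, qualities, num_bins):
        """Split sorted calibration samples into bins; store each bin's mean quality."""
        pairs = sorted(zip(uncertainties, qualities))
        size = len(pairs) // num_bins
        if size == 0:
            raise ValueError("need at least one calibration sample per bin")
        edges, values = [], []
        for b in range(num_bins):
            start = b * size
            # The last bin absorbs any remainder samples.
            end = len(pairs) if b == num_bins - 1 else start + size
            chunk = pairs[start:end]
            edges.append(chunk[-1][0])  # upper uncertainty edge of the bin
            values.append(sum(q for _, q in chunk) / len(chunk))
        return edges, values

    def apply_binned_pcc(edges, values, uncertainty):
        """Return the mean quality of the bin that the uncertainty falls into."""
        idx = min(bisect.bisect_left(edges, uncertainty), len(values) - 1)
        return values[idx]

    edges, values = fit_binned_pcc(
        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        [1.0, 1.0, 0.5, 0.5, 0.0, 0.0],
        num_bins=3,
    )
    confidence = apply_binned_pcc(edges, values, 0.35)

Each output value is the average quality of a concrete group of calibration samples, which is what makes the result easy to explain. Unlike an isotonic fit, bin means are not guaranteed to decrease as uncertainty grows.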

3. Error Handling
^^^^^^^^^^^^^^^^^

- Monitor normalization failures
- Validate calibration success
- Check normalized score distributions
- Verify quality metric calculations
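
Some of these checks can be automated. The sketch below uses a hypothetical helper, ``check_normalized_scores``, that flags a few common symptoms of failed normalization; the specific checks chosen are assumptions about what is worth monitoring.

.. code-block:: python

    import math

    def check_normalized_scores(confidences):
        """Return a list of human-readable problems found in normalized scores."""
        if not confidences:
            return ["no scores"]
        problems = []
        if any(not math.isfinite(c) for c in confidences):
            problems.append("non-finite values")
        finite = [c for c in confidences if math.isfinite(c)]
        if any(c < 0.0 or c > 1.0 for c in finite):
            problems.append("values outside [0, 1]")
        if finite and max(finite) == min(finite):
            # A collapsed distribution usually means calibration learned nothing.
            problems.append("all scores identical")
        return problems

    report = check_normalized_scores([0.1, 0.5, 0.9])

An empty list means none of the checked symptoms were found; it does not prove that the scores are well calibrated.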

4. Performance Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^

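- Enable caching so that calibration parameters are reused instead of refit
- Adjust the calibration set size to balance fitting time against calibration quality
- Prefer quality metrics that are cheap to compute when the calibration set is large
- Monitor memory usage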