Dataset-Specific Normalization Configurations ============================================== Overview -------- Dataset-specific normalization configurations in LM-Polygraph allow fine-tuning how uncertainty scores are normalized for different tasks and data types. These configurations can be found in the evaluation config files under ``/examples/configs/`` and its subfolders. Configuration Structure ----------------------- 1. Common Parameters ^^^^^^^^^^^^^^^^^^^^ Every dataset-specific configuration includes these core normalization parameters: .. code-block:: yaml # Dataset sampling configuration subsample_background_train_dataset: 1000 # Size of background dataset for normalization subsample_train_dataset: 1000 # Size of task-specific calibration dataset subsample_eval_dataset: -1 # Size of evaluation dataset (-1 = full) # Training data settings train_dataset: null # Optional separate training dataset train_test_split: false # Whether to split data for calibration test_split_size: 1 # Test split ratio if splitting enabled # Background dataset configuration background_train_dataset: allenai/c4 # Default background dataset background_train_dataset_text_column: text # Text column name background_train_dataset_label_column: url # Label column name background_load_from_disk: false # Loading mode 2. Task-Specific Configurations ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Question-Answering Tasks (TriviaQA, MMLU, CoQA) """"""""""""""""""""""""""""""""""""""""""""""" .. code-block:: yaml # Additional QA-specific settings process_output_fn: path: output_processing_scripts/qa_normalize.py fn_name: normalize_qa normalize: true normalize_metrics: true target_ignore_regex: null Translation Tasks (WMT) """"""""""""""""""""""" .. code-block:: yaml # Translation-specific normalization source_ignore_regex: "^.*?: " # Regex to clean source text target_ignore_regex: null # Regex to clean target text normalize_translations: true Summarization Tasks (XSum, AESLC) """"""""""""""""""""""""""""""""" .. code-block:: yaml # Summarization normalization normalize_summaries: true output_ignore_regex: null processing: trim_outputs: true lowercase: true 3. Language-Specific Settings ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For multilingual tasks (especially in claim-level fact-checking): .. code-block:: yaml # Language-specific normalization language: "en" # Options: en, zh, ar, ru multilingual_normalization: enabled: true use_language_specific_bins: true combine_language_statistics: false Usage Examples -------------- 1. Basic QA Task Configuration ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml hydra: run: dir: ${cache_path}/${task}/${model.path}/${dataset}/${now:%Y-%m-%d} defaults: - model: default - _self_ dataset: triviaqa subsample_train_dataset: 1000 normalize: true process_output_fn: path: output_processing_scripts/triviaqa.py fn_name: normalize_qa 2. Translation Task Setup ^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml dataset: wmt14_deen subsample_train_dataset: 2000 source_ignore_regex: "^Translation: " normalize_translations: true background_train_dataset: null 3. Multilingual Configuration ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml dataset: person_bio language: zh multilingual_normalization: enabled: true use_language_specific_bins: true subsample_train_dataset: 1000 background_train_dataset: allenai/c4 Key Considerations ------------------ 1. Dataset Size and Sampling ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Use ``subsample_train_dataset`` to control calibration dataset size - Larger values provide better calibration but increase compute time - Default value of 1000 works well for most tasks 2. Background Dataset Usage ^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Background dataset provides additional calibration data - Useful for tasks with limited in-domain data - C4 dataset is default choice for English tasks 3. Processing and Cleaning ^^^^^^^^^^^^^^^^^^^^^^^^^^ - Task-specific normalization functions handle special cases - Regular expressions clean input/output texts - Language-specific processing for multilingual tasks 4. Performance Impact ^^^^^^^^^^^^^^^^^^^^^ - Larger sample sizes increase normalization quality but computational cost - Background dataset usage adds overhead - Consider caching normalized values for repeated evaluations Best Practices -------------- 1. Dataset Size Selection ^^^^^^^^^^^^^^^^^^^^^^^^^ - Use at least 1000 samples for calibration - Increase for complex tasks or when accuracy is critical - Consider computational resources available 2. Background Dataset Usage ^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Use for tasks with limited training data - Ensure background data distribution matches task - Consider language and domain compatibility 3. Processing Configuration ^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Configure task-specific normalization functions - Use appropriate regex patterns for cleaning - Enable language-specific processing for multilingual tasks 4. Optimization Tips ^^^^^^^^^^^^^^^^^^^^ - Cache normalized values when possible - Use smaller sample sizes during development - Enable background dataset loading from disk for large datasets