Advanced Usage

Multi-reference datasets

When running a benchmark on a dataset with multiple reference values (like TriviaQA with multiple alias values for each question), you can evaluate generation metrics against each provided reference. Resulting metric value will be the maximum among all references.

CLI

When running benchmark from CLI using polygraph_eval script, just set multiref config option to true.

Python API

If you are calling UEManager directly from Python code, you’ll need to wrap each generation metric in AggregatedMetric before passing them to UEManager constructor:

from lm_polygraph.generation_metrics import AggregatedMetric, RougeMetric
from lm_polygraph.utils.manager import UEManager

metrics = [
    AggregatedMetric(base_metric=RougeMetric('rouge1'))
    AggregatedMetric(base_metric=RougeMetric('rouge2'))
    AggregatedMetric(base_metric=RougeMetric('rougeL'))
]

man = UEManager(
    dataset,
    model,
    estimators,
    generation_metrics,
    ue_metrics
    **other_args)

man()

Constrained generation

WiP

Uncertainty calibration

WiP

Custom modules

WiP