Advanced Usage

Multi-reference datasets

When running a benchmark on a dataset that provides multiple reference values per input (for example, TriviaQA, where each question has several answer aliases), you can evaluate generation metrics against every provided reference. The resulting metric value is the maximum over all references.
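The max-over-references rule can be illustrated with a minimal, library-independent sketch. Both `unigram_f1` and `max_over_references` are hypothetical names introduced here for illustration only. They are not part of lm_polygraph, and the toy metric merely stands in for a real generation metric such as ROUGE.

```python
from typing import Callable, Sequence


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy stand-in metric (hypothetical): F1 over sets of lowercased tokens."""
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def max_over_references(
    metric: Callable[[str, str], float],
    hypothesis: str,
    references: Sequence[str],
) -> float:
    """Score the hypothesis against each reference and keep the best score."""
    if not references:
        raise ValueError("at least one reference is required")
    return max(metric(hypothesis, ref) for ref in references)


# One generated answer scored against several acceptable aliases.
aliases = ["Obama", "Barack Obama", "President Obama"]
best = max_over_references(unigram_f1, "barack obama", aliases)
```

Taking the maximum means a generation is not penalized for matching only one of several equally valid aliases.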

CLI

When running a benchmark from the CLI with the polygraph_eval script, set the multiref config option to true.

Python API

If you are calling UEManager directly from Python code, wrap each generation metric in AggregatedMetric before passing the metrics to the UEManager constructor:

from lm_polygraph.generation_metrics import AggregatedMetric, RougeMetric
from lm_polygraph.utils.manager import UEManager

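# Wrap each base metric so it is scored against every reference.
metrics = [
    AggregatedMetric(base_metric=RougeMetric('rouge1')),
    AggregatedMetric(base_metric=RougeMetric('rouge2')),
    AggregatedMetric(base_metric=RougeMetric('rougeL')),
]

# dataset, model, estimators, ue_metrics and other_args are assumed
# to be defined earlier in your code.
man = UEManager(
    dataset,
    model,
    estimators,
    metrics,
    ue_metrics,
    **other_args,
)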

man()

Constrained generation

WiP

Uncertainty calibration

WiP

Custom modules

WiP