Advanced Usage ============== Multi-reference datasets ------------------------ When running a benchmark on a dataset with multiple reference values (like TriviaQA with multiple ``alias`` values for each question), you can evaluate generation metrics against each provided reference. Resulting metric value will be the maximum among all references. CLI ~~~ When running benchmark from CLI using ``polygraph_eval`` script, just set ``multiref`` config option to ``true``. Python API ~~~~~~~~~~ If you are calling ``UEManager`` directly from Python code, you'll need to wrap each generation metric in ``AggregatedMetric`` before passing them to ``UEManager`` constructor: .. code-block:: python from lm_polygraph.generation_metrics import AggregatedMetric, RougeMetric from lm_polygraph.utils.manager import UEManager metrics = [ AggregatedMetric(base_metric=RougeMetric('rouge1')) AggregatedMetric(base_metric=RougeMetric('rouge2')) AggregatedMetric(base_metric=RougeMetric('rougeL')) ] man = UEManager( dataset, model, estimators, generation_metrics, ue_metrics **other_args) man() Constrained generation ---------------------- WiP Uncertainty calibration ----------------------- WiP Custom modules -------------- WiP