UE Manager

UEManager is the central class for estimating uncertainty scores, calculating required underlying statistics based on model generation, and storing the results. It is invoked by the polygraph_eval script, and after successful evaluation of the benchmark, UEManager will store various statistics and results in the following attributes:

  • stats is a defaultdict where keys are names of statistics and values are the statistics themselves. Values of the statistics are not restricted to any particular type and can be anything from a single number to a complex object. Most (but not all) statistics are outputs of StatCalculator objects that were used during the evaluation.

  • estimations stores outputs of the Estimator objects that were specified at manager creation. It is a defaultdict where keys are (level, estimator_name) tuples and values are the corresponding estimator’s outputs. The level can be one of (sequence, token, claim) and represents the type of uncertainty estimation method. For sequence-level estimators, values are 1D numpy arrays with length equal to the number of examples in the dataset. For token-level estimators, values are lists of numpy arrays, where the length of the outer list is the number of examples, and each inner array has length equal to the number of tokens generated by the model for the corresponding example, excluding the EOS token. For claim-level estimators, values are lists of numpy arrays, where the length of the outer list is the number of examples, and each inner array has length equal to the number of claims generated by the model for the corresponding example.

  • gen_metrics keeps quality metrics of generated sequences. It is a defaultdict where keys are (level, metric_name) tuples and values hold the metric values. The level can be one of (sequence, claim) and represents the type of quality metric. For sequence-level metrics, values are 1D numpy arrays with length equal to the number of examples in the dataset. For claim-level metrics, values are lists of numpy arrays, where the length of the outer list is the number of examples, and each inner array has length equal to the number of claims generated by the model for the corresponding example.

  • metrics stores comparative scores of the uncertainty estimation methods under evaluation. It is a dict that holds a score for each combination of compatible estimator, generation metric and uncertainty estimation metric (e.g. PRR, RCC). Only pairs of estimators and generation metrics that have the same level are included.
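The layout of these attributes can be illustrated with a small mock-up. This is only a sketch built from the description above, not output of a real evaluation: the estimator and metric names (ExampleSeqEstimator, ExampleTokenEstimator, ExampleMetric) and all numeric values are hypothetical placeholders. The last step mimics the rule that estimators are only paired with generation metrics of the same level.

```python
from collections import defaultdict

import numpy as np

n_examples = 3  # hypothetical dataset size

# Keys are (level, estimator_name) tuples; all names here are placeholders.
estimations = defaultdict(list)
# Sequence level: one 1D array with one score per example.
estimations[("sequence", "ExampleSeqEstimator")] = np.array([0.12, 0.80, 0.45])
# Token level: one array per example, one score per generated token.
estimations[("token", "ExampleTokenEstimator")] = [
    np.array([0.1, 0.3]),
    np.array([0.2, 0.2, 0.9, 0.4]),
    np.array([0.5]),
]

# Keys are (level, metric_name) tuples.
gen_metrics = defaultdict(list)
gen_metrics[("sequence", "ExampleMetric")] = np.array([1.0, 0.0, 0.5])

# Only estimators and generation metrics that share a level are compatible.
pairs = [
    (e_level, e_name, g_name)
    for (e_level, e_name) in estimations
    for (g_level, g_name) in gen_metrics
    if e_level == g_level
]
print(pairs)
```

In this mock-up the token-level estimator has no matching generation metric, so it does not appear in any pair.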

The UEManager object can be persisted using the save method:

man = UEManager(*args, **kwargs)
man()
man.save('path/to/save')

When using the polygraph_eval script, the manager object is saved automatically to the directory specified by the save_path config parameter. The manager does not serialize itself in full; instead, it stores the attributes described above as a dict using torch.save. The saved object can therefore be loaded with torch.load, or directly with the load method of UEManager itself:

man = UEManager.load('path/to/save')
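Loading the saved dict with plain torch.load might look like the sketch below. Since no real save file is available here, the example first writes a stand-in dict to a temporary file. The key names of that dict (stats, estimations, gen_metrics, metrics) are an assumption based on the attribute names above and may not match the actual file layout, and the estimator and metric names are placeholders.

```python
import os
import tempfile

import numpy as np
import torch

# Stand-in for a file written by the manager; key names are assumed,
# estimator and metric names are hypothetical.
saved = {
    "stats": {},
    "estimations": {
        ("sequence", "ExampleSeqEstimator"): np.array([0.12, 0.80, 0.45]),
    },
    "gen_metrics": {
        ("sequence", "ExampleMetric"): np.array([1.0, 0.0, 0.5]),
    },
    "metrics": {},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ue_manager_example")
    torch.save(saved, path)
    # weights_only=False permits unpickling arbitrary Python objects such as
    # numpy arrays; use it only for files from a trusted source.
    loaded = torch.load(path, weights_only=False)

# Inspect what was stored for each estimator.
for (level, name), values in loaded["estimations"].items():
    print(level, name, len(values))
```

Loading through torch.load yields a plain dict, whereas UEManager.load returns a manager object, so the latter is likely the more convenient choice when the attributes are accessed by name.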