lm_polygraph.estimators.focus module
- class lm_polygraph.estimators.focus.Focus(gamma: float, p: float, model_name: str, path: str, idf_dataset: str, trust_remote_code: bool, idf_seed: int, idf_dataset_size: int, spacy_path: str, idf_dataset_text_column: str = 'text')[source]
Bases: Estimator
Implements the Focus uncertainty estimator as described in: “Hallucination Detection in Neural Text Generation via Focused Uncertainty Estimation” (https://arxiv.org/abs/2311.13230).
- Args:
gamma (float): Context penalty coefficient that controls the influence of surrounding tokens.
p (float): Probability threshold below which token predictions are masked out.
model_name (str): Hugging Face model name or path to the tokenizer.
path (str): Path to save or load precomputed IDF values.
idf_dataset (str): Dataset name used to calculate IDF values.
trust_remote_code (bool): Whether to allow loading of custom dataset scripts.
idf_seed (int): Random seed used to shuffle or sample the dataset.
idf_dataset_size (int): Number of examples to use for IDF computation (-1 for all).
spacy_path (str): Name or path of the spaCy language model to use for POS/NER parsing.
idf_dataset_text_column (str): Name of the text column in the dataset (default: “text”).
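To make the role of the `p` threshold concrete, here is a minimal sketch of masking out low-probability predictions. The helper name `mask_low_probability` and the dictionary representation are hypothetical and chosen for illustration only; how the estimator applies the threshold internally is not described on this page.

```python
def mask_low_probability(token_probs, p):
    """Keep only candidate tokens whose probability is at least p.

    Hypothetical helper for illustration; not part of lm_polygraph.
    """
    return {tok: prob for tok, prob in token_probs.items() if prob >= p}


candidates = {"Paris": 0.70, "Lyon": 0.29, "Rome": 0.01}
kept = mask_low_probability(candidates, p=0.05)
print(kept)
```

In this toy case the candidate with probability 0.01 falls below the threshold and is dropped, while the other two are kept unchanged.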
- class lm_polygraph.estimators.focus.IDFStats(token_idf: List, NER_type: List[str], pos_tag: List[str], nlp: Language)[source]
Bases: object
Container for IDF-related statistics and resources used in the Focus estimator.
- Attributes:
token_idf (List): List of IDF values per token index.
NER_type (List[str]): Named entity types considered important.
pos_tag (List[str]): POS tags considered important.
nlp (Language): Loaded spaCy NLP pipeline.
- NER_type: List[str]
- nlp: Language
- pos_tag: List[str]
- token_idf: List
- lm_polygraph.estimators.focus.calcu_idf(tokenizer_path, path, idf_dataset, trust_remote_code, idf_seed, idf_dataset_size, idf_dataset_text_column='text')[source]
Calculate inverse document frequency (IDF) scores for each token using a Hugging Face tokenizer and dataset. Results are saved to disk for reuse.
- Args:
tokenizer_path (str): Path to the tokenizer model.
path (str): File path to save computed IDF values.
idf_dataset (str): Hugging Face dataset identifier for IDF computation.
trust_remote_code (bool): Whether to trust remote code when loading the dataset.
idf_seed (int): Random seed for dataset shuffling.
idf_dataset_size (int): Max number of documents to use (-1 for all).
idf_dataset_text_column (str): Name of the text column in the dataset (default: “text”).
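As a rough illustration of what an IDF computation involves, the sketch below counts in how many documents each token appears and takes the logarithm of the inverse fraction. It is an assumption-laden toy: `compute_idf` is a hypothetical name, whitespace splitting stands in for the Hugging Face tokenizer, an in-memory list stands in for the dataset, and the unsmoothed formula `log(N / df)` may differ from the one `calcu_idf` actually uses.

```python
import math
from collections import Counter


def compute_idf(documents, tokenize=str.split):
    """Return {token: log(N / document_frequency)} for a list of texts.

    Hypothetical helper; the real function works on tokenizer output
    and may use a different (e.g. smoothed) formula.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each token at most once per document.
        doc_freq.update(set(tokenize(doc)))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


docs = ["the cat sat", "the dog ran", "the cat ran"]
idf = compute_idf(docs)
print(sorted(idf.items(), key=lambda kv: kv[1]))
```

Tokens that occur in every document (here "the") receive an IDF of zero, while rarer tokens such as "dog" score higher.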
- lm_polygraph.estimators.focus.entropy2(p)[source]
Compute the entropy of a probability distribution using base-2 logarithm.
- Args:
p (array-like): Probability distribution.
- Returns:
float: Entropy value.
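For reference, base-2 entropy is H(p) = -Σ p_i log2(p_i), measured in bits. The following is a minimal standard-library sketch of that formula; `entropy_base2` is a hypothetical name, and the sketch assumes the input is already a normalized distribution. Whether `entropy2` normalizes its input or how it treats zero entries is not stated here.

```python
import math


def entropy_base2(p):
    """Shannon entropy in bits of a normalized probability distribution.

    Hypothetical re-implementation for illustration. Zero-probability
    entries are skipped, following the convention 0 * log(0) = 0.
    """
    return -sum(q * math.log2(q) for q in p if q > 0)


print(entropy_base2([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits
print(entropy_base2([0.5, 0.5]))                # fair coin: 1.0 bit
```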
- lm_polygraph.estimators.focus.load_idf(model_name: str, path: str, idf_dataset: str, trust_remote_code: bool, idf_seed: int, idf_dataset_size: int, spacy_path: str, idf_dataset_text_column: str = 'text') → IDFStats[source]
Load IDF statistics and spaCy model, computing IDF values if not already saved.
- Args:
model_name (str): Tokenizer model name or path.
path (str): Path to load or save the IDF file.
idf_dataset (str): Dataset name used to calculate IDF.
trust_remote_code (bool): Trust remote dataset loading code.
idf_seed (int): Random seed for sampling.
idf_dataset_size (int): Max number of samples to use for IDF.
spacy_path (str): Name or path of the spaCy language model.
idf_dataset_text_column (str): Name of the text column in the dataset (default: “text”).
- Returns:
IDFStats: Loaded or computed IDF statistics.
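The "compute if not already saved" behaviour is a load-or-compute cache. A minimal sketch of that pattern follows. It assumes a pickle file as the storage format, and both `load_or_compute` and `compute_fn` are hypothetical names; the on-disk format `load_idf` actually uses is not documented here.

```python
import os
import pickle


def load_or_compute(path, compute_fn):
    """Return cached values from `path`, computing and saving them if absent.

    Hypothetical helper; pickle is an assumed storage format.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    values = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(values, f)
    return values
```

With this pattern the expensive pass over the dataset runs only on the first call for a given path; later calls read the saved file, so a stale file must be deleted by hand if the tokenizer or dataset changes.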