lm_polygraph.utils.dataset module
- class lm_polygraph.utils.dataset.Dataset(x: List[str], y: List[str], batch_size: int, images: str | None = None)[source]
Bases: object
Seq2seq or vision-language dataset for evaluating the quality of uncertainty estimation methods.
- static from_csv(csv_path: str, x_column: str, y_column: str, batch_size: int, prompt: str = '', **kwargs)[source]
Creates the dataset from a .csv table.
- Parameters:
csv_path (str): path to the .csv table.
x_column (str): name of the column to take input texts from.
y_column (str): name of the column to take target texts from.
batch_size (int): number of texts per batch.
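The column and batching parameters above can be illustrated with a minimal standard-library sketch. It does not call lm_polygraph: `read_xy_batches` is a hypothetical helper, and the way it groups texts into batches is an assumption about what `batch_size` means, not the library's implementation.

```python
import csv
import os
import tempfile
from typing import Iterator, List, Tuple


def read_xy_batches(
    csv_path: str, x_column: str, y_column: str, batch_size: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Hypothetical helper: yield (inputs, targets) batches from a CSV table."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    x = [row[x_column] for row in rows]
    y = [row[y_column] for row in rows]
    # Slice both lists with the same bounds so inputs and targets stay aligned.
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]


# Build a tiny three-row table to demonstrate with.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "qa.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "answer"])
        writer.writerows([["2+2?", "4"], ["Capital of France?", "Paris"], ["3*3?", "9"]])
    batches = list(read_xy_batches(path, "question", "answer", batch_size=2))
```

With three rows and a batch size of 2, the last batch holds the single remaining row.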
- static from_datasets(dataset_path: str | List[str], x_column: str, y_column: str, batch_size: int, im_column: str | None = None, prompt: str = '', description: str = '', mmlu_max_subject_size: int = 100, n_shot: int = 0, few_shot_split: str = 'train', few_shot_prompt: str | None = None, instruct: bool = False, split: str = 'test', size: int = None, **kwargs)[source]
Creates the dataset from Hugging Face datasets.
- Parameters:
dataset_path (str or List[str]): HF path to the dataset.
x_column (str): name of the column to take input texts from.
y_column (str): name of the column to take target texts from.
batch_size (int): number of texts per batch.
prompt (str): prompt template to use for input texts (default: '').
split (str): dataset split to take data from (default: 'test').
size (Optional[int]): size to subsample the dataset to. If None, the full dataset split is taken. Default: None.
- static load(path_or_path_and_files: str | List[str], *args, **kwargs)[source]
Creates the dataset from either a local .csv file (if it exists) or Hugging Face datasets. See the from_csv and from_datasets static methods for a description of the *args and **kwargs arguments.
- Parameters:
path_or_path_and_files (str or List[str]): local path to .csv table or HF path to dataset.
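The dispatch described above (local .csv file if it exists, otherwise a Hugging Face dataset) might look roughly like the following. This is a sketch under assumptions: `pick_constructor` is a hypothetical function that only reports which constructor would be chosen, and the existence check is a guess at the rule, not the library's actual code.

```python
import os
import tempfile
from typing import List, Union


def pick_constructor(path_or_path_and_files: Union[str, List[str]]) -> str:
    """Hypothetical: name the constructor a load-style dispatcher would use."""
    # Assumption: only a single string that points at an existing local file
    # is treated as a .csv table; everything else is treated as an HF path.
    if isinstance(path_or_path_and_files, str) and os.path.isfile(path_or_path_and_files):
        return "from_csv"
    return "from_datasets"


with tempfile.TemporaryDirectory() as tmp:
    local_csv = os.path.join(tmp, "data.csv")
    with open(local_csv, "w", encoding="utf-8") as f:
        f.write("question,answer\n")
    local_choice = pick_constructor(local_csv)

# The temporary directory is gone now, so the same path no longer exists.
missing_choice = pick_constructor(local_csv)
# A list (placeholder dataset name plus config) is never a local file path.
hub_choice = pick_constructor(["some-org/some-dataset", "some-config"])
```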
- select(indices: List[int])[source]
Shrinks the dataset down to only the texts at the specified indices.
- Parameters:
indices (List[int]): indices to keep in the dataset. Must have the same length as input texts.
- subsample(size: int, seed: int)[source]
Subsamples the dataset to the provided size.
- Parameters:
size (int): size of the resulting dataset.
seed (int): seed for the random subsampling.
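Seeded subsampling of paired texts can be sketched as follows. `subsample_pairs` is a hypothetical stand-in that uses Python's `random` module; the library may use a different random number generator, so the same seed should not be expected to pick the same items.

```python
import random
from typing import List, Tuple


def subsample_pairs(
    x: List[str], y: List[str], size: int, seed: int
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: draw `size` aligned (input, target) pairs."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    size = min(size, len(x))   # assumption: never request more than available
    indices = rng.sample(range(len(x)), size)
    # Index both lists with the same indices to keep pairs aligned.
    return [x[i] for i in indices], [y[i] for i in indices]


x = [f"question {i}" for i in range(10)]
y = [f"answer {i}" for i in range(10)]
sub_x, sub_y = subsample_pairs(x, y, size=4, seed=0)
again_x, again_y = subsample_pairs(x, y, size=4, seed=0)
```

Fixing the seed makes the draw repeatable, which is why the second call returns the same pairs as the first.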
- train_test_split(test_size: int, seed: int, split: str = 'train')[source]
Splits the dataset into train and test parts.
- Parameters:
test_size (int): size of the test dataset.
seed (int): seed for the random split.
split (str): either 'train' or 'test'. If 'train', keeps only the train data in the current dataset object. If 'test', keeps only the test data. Default: 'train'.
- Returns:
- Tuple[List[str], List[str], List[str], List[str]]: train input and target text lists, and test input and target text lists.
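A seeded split with an integer `test_size` can be sketched as below. `split_pairs` is a hypothetical pure function: it returns nested `(inputs, targets)` pairs of its own design rather than the library's 4-tuple, and it does not modify any dataset object in place the way the `split` argument does.

```python
import random
from typing import List, Tuple

Pairs = Tuple[List[str], List[str]]


def split_pairs(
    x: List[str], y: List[str], test_size: int, seed: int
) -> Tuple[Pairs, Pairs]:
    """Hypothetical helper: return ((train_x, train_y), (test_x, test_y))."""
    if not 0 <= test_size <= len(x):
        raise ValueError("test_size must be between 0 and the dataset size")
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    train = ([x[i] for i in train_idx], [y[i] for i in train_idx])
    test = ([x[i] for i in test_idx], [y[i] for i in test_idx])
    return train, test


x = [f"question {i}" for i in range(10)]
y = [f"answer {i}" for i in range(10)]
(train_x, train_y), (test_x, test_y) = split_pairs(x, y, test_size=3, seed=42)
```

The two parts are disjoint and together cover the original data, with `test_size` items in the test part.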