lm_polygraph.utils.dataset module

class lm_polygraph.utils.dataset.Dataset(x: List[str], y: List[str], batch_size: int, images: str | None = None)[source]

Bases: object

Seq2seq or vision-language dataset for evaluating the quality of uncertainty estimation methods.

static from_csv(csv_path: str, x_column: str, y_column: str, batch_size: int, prompt: str = '', **kwargs)[source]

Creates the dataset from a .csv table.

Parameters:

csv_path (str): path to the .csv table,
x_column (str): name of the column to take input texts from,
y_column (str): name of the column to take target texts from,
batch_size (int): the size of each batch of texts.
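As a rough illustration of what these parameters describe, the sketch below reads two named columns from a CSV table and groups them into batches using only the standard library. The helper read_xy_batches and the sample table are hypothetical; this is not the library's implementation, and how prompt is applied is not shown.

```python
import csv
import io
from typing import Iterator, List, Tuple


def read_xy_batches(
    csv_file, x_column: str, y_column: str, batch_size: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (inputs, targets) batches from a CSV file object (hypothetical helper)."""
    rows = list(csv.DictReader(csv_file))
    x = [row[x_column] for row in rows]
    y = [row[y_column] for row in rows]
    # Walk over the texts in steps of batch_size; the last batch may be shorter.
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]


# Made-up table standing in for a file on disk.
table = io.StringIO("question,answer\nq1,a1\nq2,a2\nq3,a3\n")
batches = list(read_xy_batches(table, "question", "answer", batch_size=2))
```

With three rows and a batch size of 2, the sketch produces one full batch and one batch holding the remaining row.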

static from_datasets(dataset_path: str | List[str], x_column: str, y_column: str, batch_size: int, im_column: str | None = None, prompt: str = '', description: str = '', mmlu_max_subject_size: int = 100, n_shot: int = 0, few_shot_split: str = 'train', few_shot_prompt: str | None = None, instruct: bool = False, split: str = 'test', size: int = None, **kwargs)[source]

Creates the dataset from Hugging Face datasets.

Parameters:

dataset_path (str): HF path to the dataset,
x_column (str): name of the column to take input texts from,
y_column (str): name of the column to take target texts from,
batch_size (int): the size of each batch of texts,
prompt (str): prompt template to use for input texts (default: ''),
split (str): dataset split to take data from (default: 'test'),
size (Optional[int]): size to subsample the dataset to. If None, the full dataset split is taken. Default: None.

static get_images(images: List[Image | str | bytes])[source]

static load(path_or_path_and_files: str | List[str], *args, **kwargs)[source]

Creates the dataset either from a local .csv file (if one exists at the given path) or from Hugging Face datasets. See the from_csv and from_datasets static methods for a description of the *args and **kwargs arguments.

Parameters:

path_or_path_and_files (str or List[str]): local path to a .csv table or HF path to a dataset.
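The choice between the two loaders can be pictured with the sketch below. It assumes the decision rests only on whether a string argument points to an existing local file; the real check may differ. The function choose_loader and the path "some/dataset" are hypothetical placeholders, and nothing here calls the library.

```python
import os
import tempfile
from typing import List, Union


def choose_loader(path_or_paths: Union[str, List[str]]) -> str:
    """Return the name of the loader that would apply (hypothetical helper)."""
    # Assumption: only a single string can name a local .csv file.
    if isinstance(path_or_paths, str) and os.path.isfile(path_or_paths):
        return "from_csv"
    return "from_datasets"


with tempfile.TemporaryDirectory() as tmp:
    local_csv = os.path.join(tmp, "data.csv")
    with open(local_csv, "w") as f:
        f.write("question,answer\nq1,a1\n")
    local_choice = choose_loader(local_csv)

# A list argument never matches a local file in this sketch.
hub_choice = choose_loader(["some/dataset", "subset"])
```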

static load_hf_dataset(path: str | List[str], split: str, **kwargs)[source]

select(indices: List[int])[source]

Shrinks the dataset down to only the texts at the specified indices.

Parameters:

indices (List[int]): indices to keep in the dataset. Must have the same length as input texts.
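Index-based selection over paired inputs and targets can be sketched as follows. The function select_by_indices is a hypothetical stand-in that returns new lists, whereas the method described above is documented as shrinking the dataset itself.

```python
from typing import List, Tuple


def select_by_indices(
    x: List[str], y: List[str], indices: List[int]
) -> Tuple[List[str], List[str]]:
    """Keep only the input/target pairs at the given positions (hypothetical helper)."""
    # The same indices are applied to both lists so pairs stay aligned.
    return [x[i] for i in indices], [y[i] for i in indices]


x = ["q0", "q1", "q2", "q3"]
y = ["a0", "a1", "a2", "a3"]
kept_x, kept_y = select_by_indices(x, y, [0, 2])
```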

subsample(size: int, seed: int)[source]

Subsamples the dataset to the provided size.

Parameters:

size (int): size of the resulting dataset,
seed (int): seed to perform random subsampling with.
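The role of seed is easiest to see in a small example: the same seed always picks the same subset. The sketch below uses Python's random module as an assumption; the library may rely on a different random number generator, so the exact items chosen will not match. The helper subsample_pairs is hypothetical.

```python
import random
from typing import List, Tuple


def subsample_pairs(
    x: List[str], y: List[str], size: int, seed: int
) -> Tuple[List[str], List[str]]:
    """Draw `size` input/target pairs without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    indices = rng.sample(range(len(x)), size)
    return [x[i] for i in indices], [y[i] for i in indices]


x = [f"q{i}" for i in range(10)]
y = [f"a{i}" for i in range(10)]
sub_x, sub_y = subsample_pairs(x, y, size=4, seed=0)
again = subsample_pairs(x, y, size=4, seed=0)  # same seed, same subset
```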

train_test_split(test_size: int, seed: int, split: str = 'train')[source]

Splits the dataset into train and test parts.

Parameters:

test_size (int): size of the test dataset,
seed (int): seed to perform random splitting with,
split (str): either 'train' or 'test'. If 'train', only the train data is left in the current dataset object. If 'test', only the test data is left. Default: 'train'.

Returns:

Tuple[List[str], List[str], List[str], List[str]]: train input and target text lists, followed by test input and target text lists.
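A seeded split of paired texts can be sketched as below. This is an illustration under assumptions, not the library's code: the helper split_pairs is hypothetical, it uses Python's random module, and the order of the returned lists follows the description above, which should be checked against the source before relying on it.

```python
import random
from typing import List, Tuple


def split_pairs(
    x: List[str], y: List[str], test_size: int, seed: int
) -> Tuple[List[str], List[str], List[str], List[str]]:
    """Split pairs into train and test parts of fixed test size (hypothetical helper)."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    x_train = [x[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    # Assumed order: train inputs, train targets, test inputs, test targets.
    return x_train, y_train, x_test, y_test


x = [f"q{i}" for i in range(10)]
y = [f"a{i}" for i in range(10)]
x_train, y_train, x_test, y_test = split_pairs(x, y, test_size=3, seed=1)
```

Every pair ends up in exactly one of the two parts, and inputs stay aligned with their targets.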