Data

class dataset2vec.data.Dataset2VecLoader(data: Path | list[Path] | list[DataFrame] | list[ndarray[Any, dtype[generic]]] | list[Tensor], batch_size: int = 32, n_batches: int = 100)

Bases: object

Data loader that generates the examples needed to train Dataset2Vec. Each iteration returns a tuple \((X_1, y_1, X_2, y_2, label)\). \(X_1, X_2\) are subsets (of both records and columns) of the feature matrices of the passed datasets. \(y_1, y_2\) are subsets of the targets of the datasets (for now, the target is the last column of each dataset). The label is 1 when \((X_1, y_1)\) and \((X_2, y_2)\) originate from the same dataset and 0 otherwise. \(X_1, y_1, X_2, y_2\) are torch.Tensor objects.
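The pairing scheme described above can be sketched in plain Python. This is an illustration of the idea only, not the library's implementation: the helper names `sample_subset` and `sample_pair`, the 50% chance of drawing both subsets from the same dataset, and the subset sizes are all assumptions, and plain lists stand in for torch.Tensor.

```python
import random


def sample_subset(dataset, rng, n_rows, n_cols):
    """Pick random rows and feature columns; the last column is the target."""
    n_features = len(dataset[0]) - 1
    row_idx = rng.sample(range(len(dataset)), n_rows)
    col_idx = sorted(rng.sample(range(n_features), n_cols))
    X = [[dataset[r][c] for c in col_idx] for r in row_idx]
    y = [dataset[r][-1] for r in row_idx]
    return X, y


def sample_pair(datasets, rng, n_rows=2, n_cols=1):
    """Return (X1, y1, X2, y2, label); label is 1 for same-source subsets."""
    i = rng.randrange(len(datasets))
    if rng.random() < 0.5:  # assumed probability of a same-dataset pair
        j = i
    else:
        j = rng.choice([k for k in range(len(datasets)) if k != i])
    X1, y1 = sample_subset(datasets[i], rng, n_rows, n_cols)
    X2, y2 = sample_subset(datasets[j], rng, n_rows, n_cols)
    return X1, y1, X2, y2, int(i == j)


# Two toy datasets with different numbers of feature columns.
toy_datasets = [
    [[0.1, 0.2, 0], [0.3, 0.4, 1], [0.5, 0.6, 0]],
    [[1.0, 2.0, 3.0, 1], [4.0, 5.0, 6.0, 0], [7.0, 8.0, 9.0, 1]],
]
rng = random.Random(0)
X1, y1, X2, y2, label = sample_pair(toy_datasets, rng)
```

Because the subsets are drawn independently, \(X_1\) and \(X_2\) may contain different rows and columns even when the label is 1.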

Parameters:

data (Path | list[Path] | list[pd.DataFrame] | list[NDArray] | list[Tensor]) – Input data for the loader. If a Path, all CSV files in that directory are read. If a list of paths, the CSV files at those paths are read. If a list of pd.DataFrame or np.ndarray objects, they are converted to torch.Tensor. When the loader is created, the data is imputed and standardized, and categorical columns are one-hot encoded.
batch_size (int, optional) – Number of observations in a single batch. Defaults to 32.
n_batches (int, optional) – Number of batches that the loader can generate. Defaults to 100.
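The preprocessing mentioned for the data parameter (imputation, standardization, one-hot encoding) can be illustrated with a minimal standard-library sketch. The documentation does not say which imputation strategy is used, so mean imputation here is an assumption, and the helper names `impute_and_standardize` and `one_hot` are hypothetical.

```python
from statistics import mean, pstdev


def impute_and_standardize(column):
    """Fill missing values (None) with the column mean, then scale the
    column to zero mean and unit variance."""
    observed = [v for v in column if v is not None]
    fill_value = mean(observed)  # assumed strategy: mean imputation
    filled = [fill_value if v is None else v for v in column]
    mu = mean(filled)
    sigma = pstdev(filled)
    if sigma == 0:
        # A constant column carries no information after centering.
        return [0.0 for _ in filled]
    return [(v - mu) / sigma for v in filled]


def one_hot(column):
    """Encode a categorical column as one 0/1 indicator per category."""
    categories = sorted(set(column))
    return [[1.0 if v == c else 0.0 for c in categories] for v in column]


scaled = impute_and_standardize([20.0, None, 40.0])
encoded = one_hot(["a", "b", "a"])
```

Here the missing middle value is replaced by the mean of the observed values, so it maps to 0 after standardization.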

class dataset2vec.data.RepeatableDataset2VecLoader(data: Path | list[Path] | list[DataFrame] | list[ndarray[Any, dtype[generic]]] | list[Tensor], batch_size: int = 32, n_batches: int = 100)

Bases: object

Loader with an interface similar to Dataset2VecLoader, except that every iter call returns the same list of batches. Useful for validation and testing.

Parameters:

data (Path | list[Path] | list[pd.DataFrame] | list[NDArray] | list[Tensor]) – Input data for the loader. If a Path, all CSV files in that directory are read. If a list of paths, the CSV files at those paths are read. If a list of pd.DataFrame or np.ndarray objects, they are converted to torch.Tensor. When the loader is created, the data is imputed and standardized, and categorical columns are one-hot encoded.
batch_size (int, optional) – Number of observations in a single batch. Defaults to 32.
n_batches (int, optional) – Number of batches that the loader can generate. Defaults to 100.
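The repeatable behaviour can be sketched as a wrapper that draws its batches once and replays them on every pass. This is a conceptual illustration, not the library's code: the class `RepeatableBatches` and its `make_batch` argument are hypothetical, and a random float stands in for a real batch.

```python
import random


class RepeatableBatches:
    """Generate a fixed list of batches once and replay it on each iteration."""

    def __init__(self, make_batch, n_batches):
        # All randomness is consumed here, so later passes are deterministic.
        self._batches = [make_batch() for _ in range(n_batches)]

    def __iter__(self):
        return iter(self._batches)

    def __len__(self):
        return len(self._batches)


rng = random.Random(0)
loader = RepeatableBatches(lambda: rng.random(), n_batches=3)

first_pass = list(loader)
second_pass = list(loader)  # identical to first_pass
```

Fixing the batches in this way keeps validation metrics comparable across epochs, since every evaluation sees exactly the same examples.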