Model

class dataset2vec.model.Dataset2Vec(config: ~dataset2vec.config.Dataset2VecConfig = Dataset2VecConfig(activation_cls=<class 'torch.nn.modules.activation.ReLU'>, f_dense_hidden_size=32, f_res_hidden_size=32, f_res_n_layers=3, f_block_repetitions=7, f_out_size=32, g_layers_sizes=[32, 16, 8], h_dense_hidden_size=16, h_res_hidden_size=16, h_res_n_layers=3, h_block_repetitions=3, output_size=16), optimizer_config: ~dataset2vec.config.OptimizerConfig = OptimizerConfig(gamma=1, optimizer_cls=<class 'torch.optim.adam.Adam'>, learning_rate=0.0001, weight_decay=0.0001))

Bases: LightningBase

Dataset2Vec meta-feature extractor implemented using torch.

calculate_loss(labels: Tensor, similarities: Tensor) → Tensor

Calculates the loss, which is the cross-entropy of the binary classification task of deciding whether two datasets originate from the same source.

Parameters:

labels (Tensor) – True labels of the data. Can be either discrete or continuous.
similarities (Tensor) – Similarities predicted by the model.

Returns:

Value of the loss function.

Return type:

Tensor
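The description above can be illustrated with a minimal sketch in plain Python. It assumes the loss is the standard mean binary cross-entropy between labels and predicted similarities; the function name `cross_entropy_sketch` and the `eps` clamping are hypothetical, and the library's actual implementation (which operates on tensors) may differ in reduction or numerical details.

```python
import math


def cross_entropy_sketch(labels, similarities, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted similarities.

    Hypothetical helper for illustration only, not part of dataset2vec.
    Labels may be discrete (0 or 1) or continuous values in [0, 1].
    """
    total = 0.0
    for label, sim in zip(labels, similarities):
        # Clamp to avoid log(0) when a similarity is exactly 0 or 1.
        sim = min(max(sim, eps), 1.0 - eps)
        total -= label * math.log(sim) + (1.0 - label) * math.log(1.0 - sim)
    return total / len(labels)


# Similarities that agree with the labels give a lower loss than ones that disagree.
confident = cross_entropy_sketch([1.0, 0.0], [0.9, 0.1])
wrong = cross_entropy_sketch([1.0, 0.0], [0.1, 0.9])

# A continuous (soft) label is handled by the same formula.
soft = cross_entropy_sketch([0.5], [0.5])
```

Under this assumption, a similarity close to 1 is rewarded for pairs from the same source and penalized for pairs from different sources.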

forward(X: Tensor, y: Tensor) → Any

Generates an encoding of the dataset. The size of the output does not depend on the dimensionality of the data. The encoding is computed as follows:

\[\varphi(x) = h\left( \frac{1}{|M||T|}\sum_{m \in M, t \in T} g\left( \frac{1}{N}\sum_{i=1, \dots, N}f(X_{i, m}, y_{i, t}) \right) \right)\]

\(f\) is the network responsible for the interdependency encoding, \(g\) generates representations of the joint distributions, and \(h\) generates the final encoding of the dataset. \(X_{i, m}\) and \(y_{i, t}\) are the \(m\)-th feature and \(t\)-th target of the \(i\)-th observation of the dataset, and \(N\) is the number of observations. \(M\) and \(T\) are the sets of feature and target columns, so \(|M|\) and \(|T|\) are their cardinalities.

Parameters:

X (Tensor) – Feature matrix
y (Tensor) – Targets matrix

Returns:

Encoding of the input dataset with output_size dimensionality

Return type:

Tensor
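The structure of the encoding formula can be traced with a small sketch in plain Python. Here `f`, `g`, and `h` are toy stand-ins for the learned networks, and `encode` and `mean_vectors` are hypothetical names; none of them come from the library. The point is only to show the order of the averages and why the output size is independent of the number of feature and target columns.

```python
def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def f(x, y):
    # Toy stand-in for the interdependency network: one (feature, target) value pair.
    return [x, y, x * y]


def g(v):
    # Toy stand-in for the joint-distribution network (an element-wise ReLU).
    return [max(0.0, c) for c in v]


def h(v):
    # Toy stand-in for the final encoding network.
    return [2.0 * c for c in v]


def encode(X, y):
    """Apply h(mean over (m, t) of g(mean over i of f(X[i][m], y[i][t])))."""
    n_obs, n_features, n_targets = len(X), len(X[0]), len(y[0])
    pair_representations = []
    for m in range(n_features):
        for t in range(n_targets):
            # Inner average over the N observations for one (feature, target) pair.
            inner = mean_vectors([f(X[i][m], y[i][t]) for i in range(n_obs)])
            pair_representations.append(g(inner))
    # Outer average over all |M||T| pairs, followed by h.
    return h(mean_vectors(pair_representations))


# Two datasets with different numbers of features and targets.
enc_small = encode([[1.0, 2.0], [3.0, 4.0]], [[0.0], [1.0]])
enc_wide = encode([[1.0, 2.0, 5.0, 7.0], [3.0, 4.0, 6.0, 8.0]],
                  [[0.0, 1.0], [1.0, 0.0]])
```

Both encodings have the same length because every (feature, target) pair is mapped to a fixed-size vector before averaging; in the model that length is set by output_size.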