smote_cd.dataset_generation.generate_dataset

smote_cd.dataset_generation.generate_dataset(n_features, n_classes, size, betas=None, random_state=None)

Generate a synthetic dataset with compositional labels.

Parameters:

n_features (int): The desired number of features for the dataset.
n_classes (int): The desired number of classes for the dataset.
size (int): The number of points to create in the dataset.
betas (array_like, shape (n_classes, n_features+1), optional): The betas matrix used to generate the data. If None, a random one is created.
random_state (int, optional): The random seed to use.

Returns:

X (numpy.ndarray, shape (size, n_features)): The feature array of the created dataset.
y (numpy.ndarray, shape (size, n_classes)): The label array of the created dataset.
betas (numpy.ndarray, shape (n_classes, n_features+1)): The betas matrix, either the one created or the one passed as input.

Notes

Each feature is drawn uniformly from [-10, 10]. The label at a given index, where the features are \((x_1, \dots, x_p)\), is drawn from a Dirichlet distribution with parameter \(\alpha\), where \(\alpha\) is:

\[\alpha = \mbox{softmax} (B_{0,1} + B_{1,1} x_1 + \dots + B_{p,1} x_p, \dots, B_{0,K} + B_{1,K} x_1 + \dots + B_{p,K} x_p),\]

where \(B\) denotes the betas matrix, \(p\) is the number of features (n_features), and \(K\) is the number of classes (n_classes).
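The generative process described above can be sketched with the standard library alone. This is an illustrative re-implementation and not the smote_cd source: the function names `softmax`, `sample_dirichlet`, and `sketch_generate` are hypothetical, and the sketch assumes that each row of betas holds the intercept followed by the feature coefficients for one class, which matches the documented shape (n_classes, n_features+1). It will not reproduce the exact numbers returned by generate_dataset, since the random number generator and draw order differ.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing independent Gamma draws."""
    # gammavariate requires a strictly positive shape, so floor tiny values.
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sketch_generate(n_features, size, betas, seed=None):
    """Hypothetical sketch of the process in the Notes (not the library code).

    Assumes betas[k] = [intercept, coef_1, ..., coef_p] for class k.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(size):
        # Each feature is uniform on [-10, 10].
        x = [rng.uniform(-10, 10) for _ in range(n_features)]
        # One linear predictor per class, then softmax to obtain alpha.
        scores = [
            row[0] + sum(b * xi for b, xi in zip(row[1:], x))
            for row in betas
        ]
        alpha = softmax(scores)
        X.append(x)
        y.append(sample_dirichlet(alpha, rng))
    return X, y


if __name__ == "__main__":
    X, y = sketch_generate(n_features=1, size=3, betas=[[0.5, 0.7], [0.6, 0.5]], seed=0)
    for features, label in zip(X, y):
        print(features, label)
```

Each generated label is a composition: its entries are non-negative and sum to 1, which is what "compositional labels" refers to in the description.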

Examples

>>> from smote_cd import dataset_generation

If betas is not provided, a betas matrix is created. Because the matrix is created with the seed random_state, specifying this parameter means the created matrix, and the generated points as well, will always be the same.

>>> X, _, betas = dataset_generation.generate_dataset(n_features=1, n_classes=2, size=5, random_state=0)
>>> print(X)
[[ 4.30378733]
 [ 0.89766366]
 [ 2.91788226]
 [ 7.83546002]
 [-2.33116962]]
>>> print(betas)
[[0.5488135  0.71518937]
 [0.60276338 0.54488318]]

A common use is to fix the betas matrix so that data can be generated as many times as wanted, always from the same distribution. The following code returns 10 random points that follow the same distribution on every call:

>>> betas = dataset_generation.generate_betas(n_features=1, n_classes=2, random_state=0)
>>> X, y, _ = dataset_generation.generate_dataset(n_features=1, n_classes=2, betas=betas, size=10)

However, the following code returns 10 random points that do not follow the same distribution on every call, because a new random betas matrix is created each time:

>>> X, y, _ = dataset_generation.generate_dataset(n_features=1, n_classes=2, size=10)