smote_cd.dataset_generation.generate_dataset

smote_cd.dataset_generation.generate_dataset(n_features, n_classes, size, betas=None, random_state=None)

Generate a synthetic dataset with compositional labels.

Parameters:
n_features : int

The desired number of features for the dataset.

n_classes : int

The desired number of classes for the dataset.

size : int

The number of points to create in the dataset.

betas : array_like, shape (n_classes, n_features+1), optional

The betas matrix used to generate the data. If None, a random one is created.

random_state : int, optional

The random seed to use.

Returns:
X : numpy.ndarray, shape (size, n_features)

The array of the features of the created dataset.

y : numpy.ndarray, shape (size, n_classes)

The array of the labels of the created dataset.

betas : numpy.ndarray, shape (n_classes, n_features+1)

The betas matrix: the one passed as input if provided, otherwise the randomly created one.

Notes

Each feature is drawn uniformly from the interval [-10, 10]. For a point with features \((x_1, \dots, x_p)\), the label is drawn from a Dirichlet distribution with parameter \(\alpha\), where \(\alpha\) is:

\[\alpha = \mbox{softmax} (B_{0,1} + B_{1,1} x_1 + \dots + B_{p,1} x_p, \dots, B_{0,K} + B_{1,K} x_1 + \dots + B_{p,K} x_p),\]

where \(B\) denotes the betas matrix, \(p\) is the number of features, and \(K\) is the number of classes.
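The generative process described in these notes can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the function name `sketch_generate` and its `seed` argument are hypothetical, and the sketch assumes that row k of `betas` holds the coefficients of class k, with the intercept in column 0. It will not reproduce the exact numbers returned by `generate_dataset`.

```python
import numpy as np


def sketch_generate(n_features, n_classes, size, betas=None, seed=None):
    """Hypothetical sketch of the process in the Notes (not the library code)."""
    rng = np.random.default_rng(seed)

    # Assumption: betas has shape (n_classes, n_features + 1), intercept first.
    if betas is None:
        betas = rng.random((n_classes, n_features + 1))
    betas = np.asarray(betas, dtype=float)

    # Each feature is drawn uniformly from [-10, 10].
    X = rng.uniform(-10, 10, size=(size, n_features))

    # Prepend a column of ones so the intercept is part of the matrix product.
    X1 = np.column_stack([np.ones(size), X])
    logits = X1 @ betas.T  # shape (size, n_classes)

    # Row-wise softmax, shifted by the row maximum for numerical stability.
    logits -= logits.max(axis=1, keepdims=True)
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)

    # One Dirichlet draw per point, using that point's alpha as parameter.
    y = np.vstack([rng.dirichlet(a) for a in alpha])
    return X, y, betas
```

Each row of `y` is a composition: its entries are non-negative and sum to 1, which is what makes the labels compositional.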

Examples

>>> from smote_cd import dataset_generation

If betas is not provided, a betas matrix is created. Because the matrix is generated with the seed random_state, specifying this parameter makes both the created matrix and the generated points reproducible.

>>> X, _, betas = dataset_generation.generate_dataset(n_features=1, n_classes=2, size=5, random_state=0)
>>> print(X)
[[ 4.30378733]
 [ 0.89766366]
 [ 2.91788226]
 [ 7.83546002]
 [-2.33116962]]
>>> print(betas)
[[0.5488135  0.71518937]
 [0.60276338 0.54488318]]

A common usage is to fix the betas matrix so that new data can be generated as many times as needed, always from the same distribution. The following code returns 10 random points that follow the same distribution at every call:

>>> betas = dataset_generation.generate_betas(n_features=1, n_classes=2, random_state=0)
>>> X, y, _ = dataset_generation.generate_dataset(n_features=1, n_classes=2, betas=betas, size=10)

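However, the following code returns 10 random points that do not follow the same distribution at each call, because neither betas nor random_state is set and a new random betas matrix is therefore created every time: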

>>> X, y, _ = dataset_generation.generate_dataset(n_features=1, n_classes=2, size=10)