smote_cd.dataset_generation.generate_dataset
- smote_cd.dataset_generation.generate_dataset(n_features, n_classes, size, betas=None, random_state=None)
Generate a synthetic dataset with compositional labels.
- Parameters:
- n_features : int
The desired number of features for the dataset.
- n_classes : int
The desired number of classes for the dataset.
- size : int
The number of points to create in the dataset.
- betas : array_like, shape (n_classes, n_features+1), optional
The betas matrix used to generate the data. If None, a random one is created.
- random_state : int, optional
The random seed to use.
- Returns:
- X : numpy.ndarray, shape (size, n_features)
The array of the features of the created dataset.
- y : numpy.ndarray, shape (size, n_classes)
The array of the labels of the created dataset.
- betas : numpy.ndarray, shape (n_classes, n_features+1)
The betas matrix, either the created one or the one passed as input.
Notes
Each feature is drawn uniformly from the interval [-10, 10]. For a point with features \((x_1, \dots, x_p)\), the label is drawn from a Dirichlet distribution with parameter \(\alpha\), where \(\alpha\) is:
\[\alpha = \operatorname{softmax} (B_{0,1} + B_{1,1} x_1 + \dots + B_{p,1} x_p, \dots, B_{0,K} + B_{1,K} x_1 + \dots + B_{p,K} x_p),\]
where \(B\) denotes the matrix betas, \(p\) is n_features and \(K\) is n_classes.
Examples
>>> from smote_cd import dataset_generation
If betas is not provided, a betas matrix is created. Because this matrix is generated from the seed random_state, specifying that parameter makes both the created matrix and the generated points reproducible.
>>> X, _, betas = dataset_generation.generate_dataset(n_features=1, n_classes=2, size=5, random_state=0)
>>> print(X)
[[ 4.30378733]
 [ 0.89766366]
 [ 2.91788226]
 [ 7.83546002]
 [-2.33116962]]
>>> print(betas)
[[0.5488135  0.71518937]
 [0.60276338 0.54488318]]
A common usage is to fix a betas matrix so that data can be generated as many times as wanted, always from the same distribution. The following code returns 10 random points that follow the same distribution at each call:
>>> betas = dataset_generation.generate_betas(n_features=1, n_classes=2, random_state=0)
>>> X, y, _ = dataset_generation.generate_dataset(n_features=1, n_classes=2, betas=betas, size=10)
However, the following code returns 10 random points that do not follow the same distribution at each call, because a new random betas matrix is created every time:
>>> X, y, _ = dataset_generation.generate_dataset(n_features=1, n_classes=2, size=10)
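The generative process described in the Notes can be sketched with NumPy alone. The block below is an illustration under stated assumptions, not the library's implementation: the function name sketch_generate_dataset is hypothetical, the random betas are assumed to be uniform on [0, 1), the betas matrix is read row by row (one row per class, intercept first), and the random generator differs from the one used by smote_cd, so the values will not match the outputs shown above.

```python
import numpy as np


def sketch_generate_dataset(n_features, n_classes, size, betas=None, random_state=None):
    """Hypothetical sketch of the process in the Notes (not the smote_cd code)."""
    rng = np.random.default_rng(random_state)

    if betas is None:
        # Assumption: a random betas matrix with entries uniform on [0, 1).
        betas = rng.uniform(size=(n_classes, n_features + 1))
    betas = np.asarray(betas, dtype=float)

    # Each feature is drawn uniformly from [-10, 10].
    X = rng.uniform(-10, 10, size=(size, n_features))

    # Prepend a column of ones so that column 0 of betas acts as the intercept.
    X_design = np.hstack([np.ones((size, 1)), X])

    # Linear predictors, one column per class, then a numerically stable softmax.
    scores = X_design @ betas.T
    scores -= scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)

    # One Dirichlet draw per point, using that point's alpha as the parameter.
    y = np.vstack([rng.dirichlet(a) for a in alpha])
    return X, y, betas


X, y, betas = sketch_generate_dataset(n_features=1, n_classes=2, size=5, random_state=0)
print(X.shape, y.shape, betas.shape)
print(y.sum(axis=1))  # each label row is a composition, so it sums to 1
```

Passing the returned betas back into a later call keeps the distribution fixed while the points themselves change, which mirrors the usage pattern shown in the examples above.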