smote_cd.oversampling_multioutput

smote_cd.oversampling_multioutput(df_features, df_labels, label_distance='logratio', normalize=False, k=5, n_iter_max=100, norm=2, verbose=0, choice_new_point='min')

Oversample data whose labels are compositional.

Parameters:

df_features : array_like, shape (n, k)

The features (X) of the data to be oversampled.

df_labels : array_like, shape (n, q)

The labels (y) of the data to be oversampled.

label_distance : {‘compositional’, ‘euclidian’, ‘logratio’}, optional

The distance used to compute the label of the new point from two existing points and a random weight (the default is ‘logratio’).

If ‘compositional’, the label is computed with the operations of the Simplex space, as defined in Aitchison 1982, “The statistical analysis of compositional data”.

If ‘euclidian’, the label is computed with the Euclidean operators (not recommended, as it does not follow the principles of the Simplex space geometry).

If ‘logratio’, the logratio transform is first applied to the labels, the Euclidean operations are then used to compute the new label, and the result is transformed back into the Simplex space.

normalize : bool, optional

Whether to normalize the features at the beginning of the algorithm (the default is False).

k : int, optional

The number of nearest neighbors among which a random neighbor is chosen (the default is 5).

n_iter_max : int, optional

The maximum number of iterations to be performed (the default is 100).

norm : {non-zero int, inf, -inf, ‘fro’, ‘nuc’}, optional

The order of the norm used to compute the nearest neighbors (the default is 2).

verbose : int or bool, optional

Whether to print text detailing the steps of the algorithm (the default is 0).

choice_new_point : {‘min’, ‘random’}, optional

How a new point is selected: in the smallest class (‘min’) or randomly (‘random’) (the default is ‘min’).

Returns:

features : numpy.ndarray, shape (n+m, k)

The oversampled features, containing the old ones (first n values) and the created ones (last m values).

labels : numpy.ndarray, shape (n+m, q)

The oversampled labels, containing the old ones (first n values) and the created ones (last m values).
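
The roles of k and norm in the neighbor search can be pictured with a short sketch. This is an illustration only, not the library's implementation: the helper name pick_random_neighbor, the use of numpy.linalg.norm for the distances, and the exclusion of the point itself from its own neighbors are all assumptions made for this example.

```python
import numpy as np

def pick_random_neighbor(X, i, k=5, norm=2, rng=None):
    """Return the index of a random point among the k nearest neighbors of X[i].

    Hypothetical helper for illustration; not part of smote_cd.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    # Distance from X[i] to every row, using the requested vector norm order.
    dists = np.linalg.norm(X - X[i], ord=norm, axis=1)
    # Assumption: a point is never its own neighbor.
    dists[i] = np.inf
    # Indices of the k closest rows.
    nearest = np.argsort(dists)[:k]
    return int(rng.choice(nearest))

X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [6.0, 5.0]])
j = pick_random_neighbor(X_demo, 0, k=2, norm=2, rng=np.random.default_rng(0))
print(j)  # one of the two rows closest to X_demo[0]
```

A larger k widens the pool from which the neighbor is drawn, while norm changes which points count as close (for example, norm=1 for Manhattan distance or norm=np.inf for the maximum coordinate difference).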

Examples

The oversampling algorithm can be tried on a synthetically generated dataset.

>>> import numpy as np
>>> import smote_cd

We first generate the synthetic dataset and keep only 20 points from one of the classes to make it imbalanced.

>>> X,y,_ = smote_cd.dataset_generation.generate_dataset(n_features=2,n_classes=2,size=500,random_state=1)
>>> y = np.concatenate((y[np.argmax(X,axis=1)==0][:20],y[np.argmax(X,axis=1)==1]))
>>> X = np.concatenate((X[np.argmax(X,axis=1)==0][:20],X[np.argmax(X,axis=1)==1]))
>>> print(sum(y)/np.sum(y))
[0.29337655 0.70662345]

We then apply the oversampling and check the balance.

>>> X_os,y_os = smote_cd.oversampling_multioutput(X,y)
>>> print(sum(y_os)/np.sum(y_os))
[0.47518739 0.52481261]
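
The three label_distance options differ in how the label of a new point is interpolated between the labels of two existing points. The sketch below illustrates the idea with plain NumPy. It is not the library's code: the function names are hypothetical, the centered logratio (clr) is assumed as the logratio transform, and the weight w is assumed to move the result from the first label toward the second.

```python
import numpy as np

def closure(v):
    """Rescale a positive vector so that its components sum to 1."""
    v = np.asarray(v, dtype=float)
    return v / v.sum()

def interpolate_euclidean(y1, y2, w):
    # Ordinary convex combination, ignoring the Simplex geometry.
    return (1 - w) * y1 + w * y2

def interpolate_compositional(y1, y2, w):
    # Aitchison operations: y1 (+) (w (.) (y2 (-) y1)),
    # i.e. perturbation by the powered difference, followed by closure.
    diff = closure(y2 / y1)
    return closure(y1 * diff ** w)

def clr(v):
    # Centered logratio transform (assumed here; requires strictly positive parts).
    logv = np.log(v)
    return logv - logv.mean()

def clr_inv(z):
    return closure(np.exp(z))

def interpolate_logratio(y1, y2, w):
    # Interpolate linearly in logratio coordinates, then map back to the Simplex.
    return clr_inv((1 - w) * clr(y1) + w * clr(y2))

y1 = np.array([0.7, 0.2, 0.1])
y2 = np.array([0.1, 0.3, 0.6])
w = 0.25

print(interpolate_euclidean(y1, y2, w))
print(interpolate_compositional(y1, y2, w))
print(interpolate_logratio(y1, y2, w))
```

All three results sum to 1. Under the assumptions of this sketch, the compositional and logratio versions give the same point, because clr maps the Simplex operations to ordinary vector addition and scaling, while the Euclidean version gives a different one. Labels containing zeros would need special handling before taking logarithms.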