smote_cd.oversampling_multioutput
- smote_cd.oversampling_multioutput(df_features, df_labels, label_distance='logratio', normalize=False, k=5, n_iter_max=100, norm=2, verbose=0, choice_new_point='min')
Perform oversampling on data that has compositional labels.
- Parameters:
- df_features : array_like, shape (n,k)
The features (X) of the data to be oversampled.
- df_labels : array_like, shape (n,q)
The labels (y) of the data to be oversampled.
- label_distance : {‘compositional’, ‘euclidian’, ‘logratio’}, optional
The distance used to compute the label of the new point from two existing points and a random weight (the default is ‘logratio’).
If ‘compositional’, the label is computed with the operations on the simplex defined in Aitchison (1982), “The statistical analysis of compositional data”.
If ‘euclidian’, the label is computed with the Euclidean operators (not recommended, as it does not follow the principles of simplex geometry).
If ‘logratio’, the logratio transform is first applied to the labels, the Euclidean operations are used to compute the new label, and the result is transformed back into the simplex.
- normalize : bool, optional
Whether to normalize the features at the beginning of the algorithm (the default is False).
- k : int, optional
The number of nearest neighbors among which a random neighbor is chosen (the default is 5).
- n_iter_max : int, optional
The maximum number of iterations to be performed (the default is 100).
- norm : {non-zero int, inf, -inf, ‘fro’, ‘nuc’}, optional
The order of the norm used to compute the nearest neighbors (the default is 2).
- verbose : int or bool, optional
Whether to print text detailing the steps of the algorithm (the default is 0).
- choice_new_point : {‘min’, ‘random’}, optional
How a new point is selected: randomly or in the smallest class (the default is ‘min’).
- Returns:
- features : numpy.ndarray, shape (n+m,k)
The oversampled features: the original points (first n rows) followed by the created ones (last m rows).
- labels : numpy.ndarray, shape (n+m,q)
The oversampled labels: the original labels (first n rows) followed by the created ones (last m rows).
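The three label_distance options can be illustrated with a minimal sketch. This is not the library's implementation: the helper names (closure, clr, clr_inv, new_label) are hypothetical, the centered log-ratio is assumed as the logratio transform (the description above does not say which one is used), and every label component is assumed to be strictly positive.

```python
import math


def closure(v):
    """Rescale positive components so that they sum to 1."""
    total = sum(v)
    return [x / total for x in v]


def clr(v):
    """Centered log-ratio transform (assumed here as the logratio transform)."""
    logs = [math.log(x) for x in v]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]


def clr_inv(z):
    """Map clr coordinates back onto the simplex."""
    return closure([math.exp(x) for x in z])


def new_label(y_a, y_b, w, label_distance="logratio"):
    """Label of a point at weight w between y_a (w=0) and y_b (w=1)."""
    if label_distance == "euclidian":
        # Plain convex combination of the two labels.
        return [(1 - w) * a + w * b for a, b in zip(y_a, y_b)]
    if label_distance == "compositional":
        # Aitchison perturbation and powering, written in closed form.
        return closure([a ** (1 - w) * b ** w for a, b in zip(y_a, y_b)])
    if label_distance == "logratio":
        # Interpolate in clr coordinates, then transform back.
        z = [(1 - w) * a + w * b for a, b in zip(clr(y_a), clr(y_b))]
        return clr_inv(z)
    raise ValueError(f"unknown label_distance: {label_distance!r}")
```

In this sketch the ‘compositional’ and ‘logratio’ branches return the same label, because the clr transform is linear with respect to the Aitchison operations; only the ‘euclidian’ branch gives a different result.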
Examples
The oversampling algorithm can be tried on a synthetically generated dataset.
>>> import numpy as np
>>> import smote_cd
We first generate the synthetic dataset and keep only 20 points of one of the classes to make it imbalanced.
>>> X,y,_ = smote_cd.dataset_generation.generate_dataset(n_features=2,n_classes=2,size=500,random_state=1)
>>> y = np.concatenate((y[np.argmax(X,axis=1)==0][:20],y[np.argmax(X,axis=1)==1]))
>>> X = np.concatenate((X[np.argmax(X,axis=1)==0][:20],X[np.argmax(X,axis=1)==1]))
>>> print(sum(y)/np.sum(y))
[0.29337655 0.70662345]
We then apply the oversampling and check the balance.
>>> X_os,y_os = smote_cd.oversampling_multioutput(X,y)
>>> print(sum(y_os)/np.sum(y_os))
[0.47518739 0.52481261]
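The balance check printed in these examples divides the per-class column sums of the labels by the total label mass. A minimal pure-Python sketch of the same quantity follows; the helper name class_shares and the toy labels are made up for illustration and are not part of smote_cd.

```python
def class_shares(labels):
    """Share of the total label mass carried by each class (one class per column)."""
    column_totals = [sum(column) for column in zip(*labels)]
    grand_total = sum(column_totals)
    return [total / grand_total for total in column_totals]


# Made-up compositional labels: each row sums to 1.
toy_labels = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
shares = class_shares(toy_labels)
```

Shares close to 1/q for each of the q classes indicate a balanced dataset, which is what the oversampling aims for.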