Oversampling

Oversampling of compositional data is performed with the function smote_cd.oversampling_multioutput().

The method is an extension of the SMOTE technique to compositional data. A detailed description and tests of the algorithm are given in Nguyen et al. (2023), “SMOTE for compositional data”.
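To give some intuition for why interpolation-based oversampling suits compositional labels, here is a minimal sketch of the interpolation step of classic SMOTE applied to two label rows. It is a generic illustration, not the algorithm implemented in smote_cd (see the paper for that); the helper name interpolate_pair and the toy values are assumptions made for this example. A convex combination of two rows that each sum to 1 also sums to 1, so the synthetic row stays compositional.

```python
import numpy as np


def interpolate_pair(a, b, lam):
    """Return the point a + lam * (b - a) on the segment between a and b.

    Hypothetical helper for illustration only; it is not part of smote_cd.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a + lam * (b - a)


rng = np.random.default_rng(0)

# Two toy compositional label rows (each sums to 1).
y_a = np.array([0.7, 0.2, 0.1])
y_b = np.array([0.1, 0.3, 0.6])

# Classic SMOTE draws the interpolation weight uniformly in [0, 1).
lam = rng.uniform(0.0, 1.0)
y_new = interpolate_pair(y_a, y_b, lam)

# The synthetic row is still a valid composition.
print(y_new, y_new.sum())
```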

The main use case is as follows: given a dataset with features X of shape (n, K) and labels y of shape (n, J), you can use smote_cd if y is compositional, i.e. if every row of y is a vector of proportions that sums to 1. This can be checked with np.allclose(np.sum(y, axis=1), 1), which tolerates floating-point rounding, unlike an exact comparison such as np.unique(np.sum(y, axis=1)) == 1. In that case, the oversampling call is:

X_oversampled, y_oversampled = smote_cd.oversampling_multioutput(X,y)
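Before making this call, it can help to verify that the labels really are compositional. The following is a minimal sketch of such a check using NumPy only; the function name is_compositional and the tolerance value are assumptions for this example and are not part of smote_cd.

```python
import numpy as np


def is_compositional(y, tol=1e-8):
    """Check that y is a 2-D array of non-negative rows that each sum to 1.

    Hypothetical helper, not part of smote_cd. The row sums are compared
    with an absolute tolerance to allow for floating-point rounding.
    """
    y = np.asarray(y, dtype=float)
    if y.ndim != 2:
        return False
    row_sums = y.sum(axis=1)
    return bool(np.all(y >= 0) and np.allclose(row_sums, 1.0, rtol=0.0, atol=tol))


# Toy labels: the first array is compositional, the second is not
# (its first row sums to 1.1).
y_ok = np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]])
y_bad = np.array([[0.5, 0.3, 0.3], [0.1, 0.1, 0.8]])

print(is_compositional(y_ok))
print(is_compositional(y_bad))
```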

In some cases, if the \(p\) biggest classes are too large, you may want to undersample them with smote_cd.random_undersampling() before applying the oversampling:

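import numpy as np
# Row indexes selected for removal by the undersampling
indexes_to_remove = smote_cd.random_undersampling(y)
# Drop the same rows from y and X so that they stay aligned
y_undersampled = np.delete(y, indexes_to_remove, axis=0)
X_undersampled = np.delete(X, indexes_to_remove, axis=0)
# Oversample the reduced dataset
X_oversampled, y_oversampled = smote_cd.oversampling_multioutput(X_undersampled, y_undersampled)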
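The row-removal step can be tried in isolation on toy arrays. In this sketch the indexes are hard-coded as a stand-in for the output of smote_cd.random_undersampling(), and the data are made up. It only shows that passing the same indexes to np.delete with axis=0 keeps features and labels aligned.

```python
import numpy as np

# Toy dataset: 5 samples, 2 features, 2 label components (made-up values).
X = np.arange(10).reshape(5, 2)
y = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.20, 0.80],
    [0.85, 0.15],
    [0.10, 0.90],
])

# Hypothetical indexes, standing in for the result of the undersampling step.
indexes_to_remove = np.array([0, 3])

# Deleting along axis=0 removes whole rows; np.delete returns new arrays
# and leaves X and y unchanged.
X_kept = np.delete(X, indexes_to_remove, axis=0)
y_kept = np.delete(y, indexes_to_remove, axis=0)

print(X_kept.shape, y_kept.shape)
```

Because np.delete returns copies, the original X and y remain available if you want to compare results with and without undersampling.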

Practical examples are available on the page smote_cd.oversampling_multioutput().