Introduction

Detecting malicious activity in cryptocurrency networks like Bitcoin relies on datasets in which transactions are labeled as licit or illicit. However, these datasets often suffer from class imbalance: illicit samples are far scarcer than licit ones. In the Bitcoin network, which contains over 1 billion transactions, illicit transactions account for less than 0.2% of the total volume according to several industry reports (e.g., Chainalysis).

This scarcity of labeled illicit activity is further worsened by the difficulty in reliably labeling malicious behavior, meaning many illicit samples remain undetected or misclassified. As a result, machine learning models trained on such datasets risk learning only from the majority class.

Why Class Imbalance Matters in Machine Learning

Most machine learning models make these basic assumptions:

  • Data points are independent and identically distributed.
  • Class distributions are relatively balanced.

When these assumptions are violated, as in class-imbalanced datasets:

  • The model becomes biased towards the majority class (licit).
  • Minority class (illicit) instances are misclassified, resulting in a high false negative rate.

Impact on Evaluation Metrics

Let us define: TP as True Positives, TN as True Negatives, FP as False Positives and FN as False Negatives. Standard metrics like accuracy become misleading:

    \[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

With severe imbalance, a classifier predicting all samples as negative (licit) might still score >99% accuracy while failing to detect any illicit activity.
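
The sketch below illustrates this failure mode with synthetic data standing in for a real labeled transaction set (the make_classification call and the ~1% illicit ratio are illustrative assumptions, not real Bitcoin data):

    # Minimal sketch: on a ~99%-licit dataset, an "always licit" classifier
    # reaches ~99% accuracy while detecting zero illicit transactions.
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, recall_score

    # Synthetic stand-in for a labeled transaction set (~1% illicit).
    X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                               random_state=42)

    clf = DummyClassifier(strategy="most_frequent").fit(X, y)  # always predicts licit
    y_pred = clf.predict(X)

    print(f"Accuracy:       {accuracy_score(y, y_pred):.3f}")  # ~0.99
    print(f"Illicit recall: {recall_score(y, y_pred):.3f}")    # 0.000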

Alternative Metrics for Imbalanced Classification

To evaluate model performance on a class-imbalanced dataset, several metrics better represent the real predictive capability of the model. These metrics typically balance trade-off pairs such as sensitivity (the true positive rate) and specificity (the true negative rate):

    \[\text{Precision} = \frac{TP}{TP + FP}\]

    \[\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}\]

    \[\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

    \[\text{Balanced Accuracy} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)\]

    \[\text{G-Mean} = \sqrt{ \frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP} }\]
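
All of these can be computed directly from the confusion matrix. The minimal sketch below uses scikit-learn where a ready-made function exists (the tiny hand-written label arrays are purely illustrative):

    import numpy as np
    from sklearn.metrics import (precision_score, recall_score, f1_score,
                                 balanced_accuracy_score, confusion_matrix)

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 1 = illicit
    y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)  # TP rate (illicit recall)
    specificity = tn / (tn + fp)  # TN rate (licit recall)

    print("Precision:        ", precision_score(y_true, y_pred))
    print("Recall:           ", recall_score(y_true, y_pred))
    print("F1-score:         ", f1_score(y_true, y_pred))
    print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
    print("G-mean:           ", np.sqrt(sensitivity * specificity))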

Strategies to Handle Imbalance

To mitigate the risks that come with imbalanced data, many techniques have emerged in recent years, and several have substantially improved model performance. Imbalanced learning techniques generally fall into four categories:

Data-Level Approaches

These approaches directly modify the training data, rebalancing the class distribution within the training set (a short sketch follows the list):

  • Undersampling: Remove instances from the majority class (e.g., random undersampling or KNN-based methods such as NearMiss).
  • Oversampling: Duplicate or synthetically generate minority class samples (e.g., SMOTE).
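
Both strategies are available in the imbalanced-learn library. The sketch below applies them to the same synthetic ~1%-illicit dataset used earlier (dataset and parameters are illustrative assumptions):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                               random_state=42)
    print("Original:           ", Counter(y))

    # Oversampling: SMOTE interpolates new minority samples between neighbors.
    X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
    print("After SMOTE:        ", Counter(y_over))

    # Undersampling: randomly drop majority-class samples.
    X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
    print("After undersampling:", Counter(y_under))
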
Algorithm-Level Approaches

Algorithm-based methods consist of adapting or developing machine learning algorithms to handle imbalanced datasets effectively, prioritizing the algorithms’ capability to classify minority class instances accurately.

  • Cost-sensitive learning: Assign higher misclassification costs to minority class errors (see the sketch after this list).
  • Weighted shallow neural networks: Use class-weighted losses in small, fast neural nets such as the Extreme Learning Machine (ELM), the Random Vector Functional Link Network (RVFLN), and the Broad Learning System (BLS).
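
Cost-sensitive learning is the easiest of these to try: most scikit-learn estimators accept a class_weight parameter. A minimal sketch, again on synthetic data (the logistic model, the dataset, and the "balanced" weighting are illustrative choices):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                               random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

    # "balanced" weights errors inversely to class frequency, so each missed
    # illicit transaction costs far more than a missed licit one.
    plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    weighted = LogisticRegression(max_iter=1000,
                                  class_weight="balanced").fit(X_tr, y_tr)

    print("Illicit recall, unweighted:    ", recall_score(y_te, plain.predict(X_te)))
    print("Illicit recall, cost-sensitive:", recall_score(y_te, weighted.predict(X_te)))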

Because algorithm-level approaches reshape the loss function itself, emphasizing the minority class without modifying the data or adding significant computational cost, they are often preferred.

Hybrid Approaches

These methods incorporate techniques from both the data level and the algorithm level, combining the advantages of both strategies (a sketch follows the list):

  • SMOTE + Ensemble methods
  • Adaptive oversampling with boosting/bagging
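
One common hybrid chains SMOTE with an ensemble learner. imbalanced-learn's Pipeline applies the resampling only during fit, never to evaluation data. A minimal sketch (the boosting model and dataset are illustrative assumptions):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                               random_state=42)

    # SMOTE rebalances the training data; boosting then learns on it.
    hybrid = Pipeline([
        ("smote", SMOTE(random_state=42)),
        ("boost", GradientBoostingClassifier(random_state=42)),
    ])
    hybrid.fit(X, y)
    print(hybrid.predict(X[:5]))
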
Ensemble Learning Methods

Ensemble learning combines multiple classifiers or models to improve classification performance. It leverages the strengths of different classifiers, enhancing predictive accuracy for the minority class. A typical workflow creates several subsets of the imbalanced dataset through resampling techniques; individual classifiers are then trained on these subsets, and their predictions are combined using voting or weighted averaging schemes. A sketch follows.
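
The imbalanced-learn library packages this pattern into ready-made estimators; a minimal sketch (the classifier choices and the synthetic dataset are illustrative assumptions):

    from sklearn.datasets import make_classification
    from imblearn.ensemble import (BalancedRandomForestClassifier,
                                   EasyEnsembleClassifier)

    X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                               random_state=42)

    # Random forest in which each tree is grown on a balanced bootstrap sample.
    brf = BalancedRandomForestClassifier(random_state=42).fit(X, y)

    # Bag of AdaBoost learners, each trained on a randomly undersampled,
    # balanced subset; predictions are combined by voting.
    easy = EasyEnsembleClassifier(random_state=42).fit(X, y)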

Conclusion

Class imbalance is a critical challenge in the detection of illicit activity on Bitcoin. Ignoring it leads to biased models that fail to detect what matters most. Using resampling techniques, weighted learning, and appropriate evaluation metrics can significantly improve detection capabilities.

Further Reading

  • Chainalysis Team. 2025 Crypto Crime Trends from Chainalysis (link).
  • Weber et al. Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics (link).
  • Chen, W., Yang, K., Yu, Z. et al. A survey on imbalanced learning: latest research, applications and future directions. Artif Intell Rev 57, 137, 2024 (link).
  • Chawla et al. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 2002 (link).

I hope you all liked this post and that it brought something new to someone. I am Mario Amador Hurtado 😉
