Smarter annotation strategies for bird detection models

A new paper from the TABMON team has been published in the Journal of the Acoustical Society of America: “Data-driven sampling strategies for fine-tuning bird detection models” by Corentin Bernard, Ben McEwen, Benjamin Cretois, Hervé Glotin, Dan Stowell, and Ricard Marxer.

The full paper is freely available here: doi.org/10.1121/10.0043947

The challenge: too much data, too few annotators

Passive acoustic monitoring produces enormous volumes of audio. Identifying bird species automatically requires fine-tuning pre-trained models like BirdNET on local recordings, but expert annotation is slow, costly, and scarce. With an annotation budget of only 500 samples out of ~18,000 (about 2.8% of the available data), which recordings should you label to get the biggest improvement in model performance?

This paper provides concrete, practical answers.

Three contributions

1. Fighting catastrophic forgetting with L2-SP regularisation

When fine-tuning BirdNET on new local data, the model tends to “forget” species it already knew — a phenomenon called catastrophic forgetting. The authors show that a simple regularisation technique (L2-SP) that penalises large deviations from the original BirdNET weights largely prevents this, improving detection performance across 106 European bird species without any additional data.
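The idea of penalising deviations from the pre-trained weights can be sketched in a few lines. This is a minimal illustration, assuming the common formulation of the penalty as (alpha / 2) times the squared distance between the current weights and the starting point; the function names, the value of `alpha`, and the toy arrays are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def l2_sp_penalty(weights, pretrained, alpha=0.01):
    """Penalty (alpha / 2) * ||w - w0||^2, summed over all parameter arrays."""
    return 0.5 * alpha * sum(
        float(np.sum((w - w0) ** 2)) for w, w0 in zip(weights, pretrained)
    )

def l2_sp_gradient(weights, pretrained, alpha=0.01):
    """Gradient of the penalty: it pulls each weight back towards its pre-trained value."""
    return [alpha * (w - w0) for w, w0 in zip(weights, pretrained)]

# Toy parameters standing in for pre-trained and fine-tuned weights (hypothetical values).
pretrained = [np.array([1.0, -2.0]), np.array([[0.5]])]
finetuned = [np.array([1.5, -2.0]), np.array([[0.0]])]

penalty = l2_sp_penalty(finetuned, pretrained, alpha=0.1)
grads = l2_sp_gradient(finetuned, pretrained, alpha=0.1)
```

During fine-tuning, a term like this would be added to the task loss, so that plain weight decay towards zero is replaced by decay towards the pre-trained solution.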

2. A new “influence score” to find the most valuable samples

The core methodological contribution is the influence score: a data-driven measure of how much including a given audio sample in the fine-tuning set actually improves model performance. It is computed using reverse correlation: the model is fine-tuned on thousands of random sub-selections, and each sample’s presence or absence is correlated with the resulting model performance.
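The reverse-correlation idea can be illustrated with a small simulation. This is a sketch under strong assumptions: the "model performance" of each trial is a synthetic number (a hidden per-sample value plus noise), whereas in the study each trial would mean fine-tuning and evaluating a real model. All names, sizes, and the noise level are hypothetical.

```python
import numpy as np

def influence_scores(masks, scores):
    """Pearson correlation between each sample's presence (0/1) and the trial score.

    masks:  (n_trials, n_samples) array, 1 if the sample was in the subset
    scores: (n_trials,) array of performance values, one per trial
    """
    m = masks - masks.mean(axis=0)
    s = scores - scores.mean()
    cov = m.T @ s / len(scores)
    denom = masks.std(axis=0) * scores.std()
    # Samples that were always or never selected get an influence of 0.
    return cov / np.where(denom > 0, denom, np.inf)

rng = np.random.default_rng(0)
n_samples, n_trials, budget = 50, 4000, 10

# Synthetic ground truth: only the first five samples help performance.
true_value = np.zeros(n_samples)
true_value[:5] = 1.0

# Each trial draws a random subset of `budget` samples.
masks = np.zeros((n_trials, n_samples))
for t in range(n_trials):
    masks[t, rng.choice(n_samples, size=budget, replace=False)] = 1.0

# Stand-in for "fine-tune on the subset, then evaluate": value sum plus noise.
scores = masks @ true_value + rng.normal(0.0, 0.5, size=n_trials)

influence = influence_scores(masks, scores)
top5 = np.argsort(influence)[-5:]
```

In this toy setting the five highest-influence samples are exactly the ones given a non-zero hidden value, which is the behaviour the method relies on.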

The analysis reveals that samples containing rare species tend to have the highest influence: intuitively, they are the most underrepresented and therefore the most informative. Fine-tuning on the 500 highest-influence samples achieves a cmAP (class-wise mean average precision) of 0.456, compared to 0.373 for random selection, a substantial gain obtained purely by choosing different samples to annotate.
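For readers unfamiliar with the metric, here is a minimal sketch of how a cmAP value can be computed, assuming it denotes class-wise mean average precision: average precision is computed per species, then averaged over species. Skipping classes with no positive examples is an assumption of this sketch, and the labels and scores are invented for illustration.

```python
import numpy as np

def average_precision(y_true, y_score):
    """Average precision for one class: mean precision at the rank of each positive."""
    order = np.argsort(-y_score, kind="stable")
    hits = y_true[order]
    n_pos = hits.sum()
    if n_pos == 0:
        return float("nan")  # undefined when the class never occurs
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision_at_k * hits) / n_pos)

def class_mean_average_precision(y_true, y_score):
    """Mean of per-class average precision, ignoring classes without positives."""
    aps = [average_precision(y_true[:, c], y_score[:, c]) for c in range(y_true.shape[1])]
    return float(np.nanmean(aps))

# Four clips, two species (invented labels and scores).
y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.8, 0.1], [0.3, 0.7], [0.1, 0.6]])

cmap_value = class_mean_average_precision(y_true, y_score)
```

Because every species contributes equally to the average, rare species weigh as much as common ones, which is consistent with rare-species samples being highly influential.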

Because computing influence scores requires a fully annotated dataset (impractical in the field), the authors also train a linear model on BirdNET embeddings to predict influence scores for unseen data, enabling practical deployment.
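The prediction step can be sketched as an ordinary regression problem. The example below assumes a ridge-regularised linear model fitted by least squares; the paper only specifies "a linear model", so the regulariser, the embedding dimension, and the synthetic data are all hypothetical stand-ins rather than the authors' setup.

```python
import numpy as np

def fit_linear(X, y, lam=1.0):
    """Ridge-regularised least squares with an unpenalised bias term."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    reg = lam * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0  # do not shrink the bias
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)

def predict_linear(X, coef):
    """Apply the fitted coefficients (last entry is the bias)."""
    return np.hstack([X, np.ones((len(X), 1))]) @ coef

rng = np.random.default_rng(1)
dim = 16  # stand-in size, not the real embedding dimension

# Synthetic "embeddings" and "influence scores" for an annotated set.
w_true = rng.normal(size=dim)
X_annotated = rng.normal(size=(200, dim))
y_annotated = X_annotated @ w_true + 0.3 + rng.normal(0.0, 0.05, size=200)

# Unseen, unannotated data: predict influence and pick the top candidates.
X_new = rng.normal(size=(50, dim))
predicted = predict_linear(X_new, fit_linear(X_annotated, y_annotated))
to_annotate = np.argsort(predicted)[-10:]
```

The point of the design is that embeddings are available for every recording without labels, so once the predictor is trained on one annotated dataset it can rank new recordings at no annotation cost.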

3. Practical sampling strategies: acoustic indices vs. model predictions

Finally, the paper compares a range of annotation-free and model-based strategies for selecting samples to annotate:

Strategy	cmAP
Pre-trained BirdNET (no fine-tuning)	0.367
Random selection	0.373
Best acoustic index (TFSD threshold)	0.394
Model uncertainty (high entropy)	0.406
Predicted influence score	0.410
Oracle top influence (upper bound)	0.456

Model uncertainty (selecting samples where BirdNET is most uncertain) emerges as the best practical strategy when model predictions are available: it comes within 0.004 cmAP of the predicted influence score without needing an annotated dataset to train a predictor. When model predictions are not available, for example with unsupervised models or entirely new species, the acoustic index TFSD (Time-Frequency Second Derivative, capturing spectral modulations in the 2–10 kHz bird frequency range) is the strongest alternative and can be computed directly from raw audio with no prior annotations.
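Both selection criteria can be sketched briefly. The code below makes two loud assumptions. First, uncertainty is taken as the mean binary entropy of independent per-species probabilities, which may differ from the paper's exact definition. Second, the acoustic index is a simplified TFSD-like ratio (share of the absolute mixed time-frequency derivative of a spectrogram that falls in the 2–10 kHz band), not the paper's implementation. All function names and toy inputs are hypothetical.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Mean binary entropy per clip; probs has shape (n_clips, n_species)."""
    p = np.clip(probs, eps, 1.0 - eps)
    h = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return h.mean(axis=1)

def tfsd_like_index(spec, freqs, band=(2000.0, 10000.0)):
    """Share of time-frequency modulation inside `band`.

    spec:  (n_freqs, n_frames) spectrogram levels
    freqs: (n_freqs,) centre frequency of each row, in Hz
    """
    # Difference along frequency, then along time, gives a mixed derivative.
    d2 = np.abs(np.diff(np.diff(spec, axis=0), axis=1))
    f_mid = 0.5 * (freqs[:-1] + freqs[1:])
    in_band = (f_mid >= band[0]) & (f_mid <= band[1])
    total = d2.sum()
    return float(d2[in_band].sum() / total) if total > 0 else 0.0

# Uncertainty sampling: rank clips from most to least uncertain.
probs = np.array([[0.5, 0.5], [0.99, 0.01], [0.7, 0.2]])
ranking = np.argsort(-prediction_entropy(probs))

# Toy spectrograms: on/off modulation at 4-6 kHz versus below 1 kHz.
freqs = np.linspace(0.0, 12000.0, 49)  # 250 Hz steps
spec_bird = np.zeros((49, 40))
spec_bird[16:24, ::2] = 1.0
spec_low = np.zeros((49, 40))
spec_low[1:4, ::2] = 1.0

index_bird = tfsd_like_index(spec_bird, freqs)
index_low = tfsd_like_index(spec_low, freqs)
```

A threshold on such an index keeps clips whose modulation is concentrated in the bird band; the entropy ranking instead needs model outputs but no audio features.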

Why it matters for TABMON

These findings directly inform how TABMON manages its annotation pipeline. The network continuously collects audio across four countries, and deciding which recordings to send to expert annotators for labelling is a real operational challenge. The strategies validated here, particularly uncertainty sampling and TFSD-based selection, are being integrated into TABMON’s active learning infrastructure to make the most of limited expert time.

The code and data used in the study are openly available:

GitHub: github.com/mim-team/PAM_data_sampling
Zenodo: zenodo.org/records/19206665