AI-enabled supervision is often marketed with bold promises of "99% accuracy" or near-total false positive reduction. A previous instalment of this series examined the limits of such claims, particularly in environments where misconduct is extremely rare.
Theta Lake recently explored what genuinely matters when evaluating AI-based surveillance systems, how false positives can be reduced responsibly, and what firms should ask vendors before signing a contract.
In corporate communications surveillance, the underlying rate of actual misconduct is typically tiny – often below 0.01%. In such low base-rate environments, “accuracy” becomes a dangerously misleading metric. A model that labels every message as compliant could technically achieve 99.99% accuracy while failing to detect a single fraudulent email. The issue lies in imbalance: when negatives vastly outnumber positives, a naïve classifier can appear statistically impressive while delivering no meaningful risk detection. What truly matters is whether the system can reliably identify the rare “needle” in the overwhelming “haystack”.
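The arithmetic behind this "accuracy paradox" is worth making concrete. The following toy sketch uses an invented corpus of 1,000,000 messages at the 0.01% base rate mentioned above; the figures are illustrative, not drawn from any real surveillance dataset.

```python
# Toy illustration of the accuracy paradox in a low base-rate corpus.
# Assumed figures: 1,000,000 messages, 100 genuinely problematic (0.01%).
total_messages = 1_000_000
misconduct_cases = 100  # 0.01% base rate

# A "classifier" that labels every message as compliant:
true_negatives = total_messages - misconduct_cases  # clean messages, correctly passed
false_negatives = misconduct_cases                  # every real case missed

accuracy = true_negatives / total_messages
print(f"Accuracy: {accuracy:.4%}")  # 99.9900% — yet not one case detected
print(f"Misconduct detected: 0 of {misconduct_cases}")
```

Despite a headline figure of 99.99%, the system has zero detection capability: every one of the 100 real cases slips through.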
Instead of accuracy, firms should prioritise precision and recall. Precision answers a simple but critical question: of all the communications flagged as misconduct, how many were genuinely problematic? High precision directly translates into fewer false positives and greater trust from compliance teams.
Recall, by contrast, measures how many actual misconduct cases were successfully identified. High recall reduces missed risks. These two metrics are inherently in tension: increasing recall usually lowers precision, and vice versa. The F1-score, which balances both through a harmonic mean, provides a more honest single-number summary of model performance. Unlike a basic average, it penalises systems that perform well on one dimension but poorly on the other.
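The difference between a harmonic mean and a plain average is easy to see from raw counts. The sketch below uses hypothetical alert figures (not vendor data) for a model that flags almost everything, achieving high recall but very low precision.

```python
# Precision, recall and F1 from raw alert counts — a minimal sketch with
# hypothetical figures for a model that over-alerts heavily.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # The harmonic mean collapses towards the weaker of the two metrics.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# 95 of 100 real cases caught, but 9,905 false alerts raised alongside them.
p, r, f1 = precision_recall_f1(tp=95, fp=9905, fn=5)
print(f"precision={p:.4f} recall={r:.2f} F1={f1:.4f}")
```

Here recall is 0.95 but precision is under 0.01. A plain average would report a respectable-looking ~0.48; the harmonic-mean F1 stays below 0.02, exposing the weak precision.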
Even more informative than a single score is the confusion matrix. This 2×2 structure shows true positives, false positives, true negatives and false negatives, offering a transparent breakdown of error types. For compliance teams, understanding whether a model fails by over-alerting or under-detecting is far more valuable than headline accuracy claims.
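A confusion matrix can be built directly from paired labels. The example below is an illustrative sketch with invented labels; the four cells make the over-alerting versus under-detecting distinction explicit.

```python
# Building a 2×2 confusion matrix from labelled examples — an illustrative
# sketch; the label sequences are invented for the example.
from collections import Counter

def confusion_matrix(actual: list[int], predicted: list[int]) -> dict[str, int]:
    """Labels: 1 = misconduct, 0 = compliant."""
    cells = Counter(zip(actual, predicted))
    return {
        "TP": cells[(1, 1)],  # real misconduct, correctly flagged
        "FP": cells[(0, 1)],  # clean message, wrongly flagged (over-alerting)
        "FN": cells[(1, 0)],  # real misconduct, missed (under-detecting)
        "TN": cells[(0, 0)],  # clean message, correctly passed
    }

actual    = [1, 0, 0, 1, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
print(confusion_matrix(actual, predicted))
```

A compliance team reading this breakdown can immediately see whether errors are concentrated in the FP cell (alert fatigue) or the FN cell (missed risk), which no single headline number reveals.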
Reducing false positives in highly skewed datasets is technically complex. Standard machine learning models tend to prioritise the majority class, effectively ignoring the rare events that matter most. To address this, data scientists deploy a range of specialised techniques. Contextual embeddings powered by large language models (LLMs) move beyond crude keyword matching, enabling systems to distinguish between genuinely risky phrases and harmless idioms.
Cost-sensitive learning adjusts the training process to penalise certain types of errors more heavily, depending on business priorities. Ensemble methods combine multiple models and rule-based systems to compensate for individual weaknesses and improve robustness. Threshold tuning allows firms to adjust probability cut-offs, balancing alert volume against detection sensitivity. Post-processing filters suppress clearly benign contexts, while resampling methods – such as undersampling or synthetic oversampling techniques like SMOTE – rebalance skewed training data.
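Of the techniques above, threshold tuning is the most direct lever firms themselves can pull. The sketch below uses synthetic model scores to show the trade-off; a real system would sweep thresholds across a held-out validation set rather than a handful of hand-picked values.

```python
# Threshold tuning on model probability scores — a minimal sketch with
# synthetic scores; real systems sweep thresholds on a validation set.
def flag_alerts(scores: list[float], threshold: float) -> list[bool]:
    """Raise an alert for every score at or above the cut-off."""
    return [s >= threshold for s in scores]

scores = [0.02, 0.10, 0.35, 0.55, 0.71, 0.93]  # model-assigned risk scores

# A low cut-off maximises detection sensitivity but raises alert volume...
print(sum(flag_alerts(scores, 0.30)))  # 4 alerts
# ...a higher cut-off suppresses volume at the risk of missing cases.
print(sum(flag_alerts(scores, 0.70)))  # 2 alerts
```

Moving the cut-off does not change the model at all; it redistributes errors between the false positive and false negative cells, which is exactly the precision/recall tension described earlier.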
However, even the most sophisticated modelling approach can be undermined by weak validation practices. Vendors may demonstrate performance using artificially balanced datasets where 50% of examples are fraudulent. While impressive in a lab setting, such results often collapse when deployed against real-world data containing less than 1% misconduct.
Firms should therefore insist on performance metrics calculated on stratified test sets that mirror their actual data distribution. Precision, recall and F1-scores should be requested alongside a confusion matrix. Definitions also matter: whether a “hit” refers to a single sentence or an entire email thread can materially alter reported figures. Out-of-domain testing – using data the model has never seen before – is another essential safeguard.
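What a stratified test set means in practice can be sketched simply: sample the same fraction from each class so the holdout preserves the production base rate, instead of the artificial 50/50 split of a lab demo. The counts below are invented for illustration.

```python
# Sketch of a stratified holdout that mirrors the production base rate,
# rather than an artificially balanced 50/50 lab set. Counts are invented.
import random

def stratified_sample(positives: list, negatives: list,
                      test_frac: float, seed: int = 0) -> list:
    """Draw the same fraction from each class, preserving the base rate."""
    rng = random.Random(seed)
    test_pos = rng.sample(positives, int(len(positives) * test_frac))
    test_neg = rng.sample(negatives, int(len(negatives) * test_frac))
    return test_pos + test_neg

positives = list(range(100))           # 100 misconduct examples (0.1% of corpus)
negatives = list(range(100, 100_000))  # 99,900 compliant examples
test = stratified_sample(positives, negatives, test_frac=0.2)
print(f"test size={len(test)}, base rate={20 / len(test):.4%}")  # 0.1000%
```

Because both classes are sampled at the same rate, the 0.1% base rate of the full corpus carries through to the test set, so the reported precision and recall reflect deployment conditions.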
AI supervision is not static. Language evolves, misconduct tactics shift, and data distributions drift over time. Effective systems incorporate structured feedback loops, enabling human reviewers to flag false positives and feed those insights back into model training. Ongoing drift detection is equally important: sudden changes in alert rates may signal degradation or environmental shifts rather than genuine risk spikes.
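A basic form of drift detection is simply monitoring the alert rate against a baseline window. The sketch below flags large relative shifts; the window sizes and tolerance are illustrative assumptions, not a production policy.

```python
# A simple alert-rate drift check: compare the recent alert rate against a
# baseline window and flag large relative shifts. The 50% tolerance is an
# illustrative assumption, not a recommended production setting.
def alert_rate_drift(baseline_alerts: int, baseline_total: int,
                     recent_alerts: int, recent_total: int,
                     tolerance: float = 0.5) -> bool:
    """Return True if the recent rate deviates from baseline by more than
    `tolerance` in relative terms."""
    baseline_rate = baseline_alerts / baseline_total
    recent_rate = recent_alerts / recent_total
    return abs(recent_rate - baseline_rate) / baseline_rate > tolerance

# Baseline: 50 alerts per 100,000 messages; recent window: 120 per 100,000.
print(alert_rate_drift(50, 100_000, 120, 100_000))  # True — investigate
print(alert_rate_drift(50, 100_000, 55, 100_000))   # False — within tolerance
```

A triggered check does not prove misconduct has risen; it prompts investigation into whether the model has degraded, the data distribution has shifted, or risk has genuinely increased.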
For compliance leaders evaluating surveillance platforms, the key takeaway is caution combined with rigour. AI can meaningfully reduce false positives and free supervisory teams to focus on high-quality alerts. But headline claims must be interrogated. Firms should ask whether evaluation datasets reflect realistic base rates, request detailed precision and recall metrics, understand expected daily alert volumes, and clarify how the system incorporates analyst feedback. Only through this disciplined approach can AI-enabled supervision deliver sustainable and defensible improvements in risk detection.
Copyright © 2026 RegTech Analyst