Many organisations are unknowingly setting themselves up for failure by building AI models that are only ever trained once. Without a framework for ongoing improvement, these systems quickly fall out of step with real-world data — becoming less accurate, more costly to maintain, and ultimately obsolete.
Theta Lake, a compliance and risk management technology firm, has made continuous learning a cornerstone of how it builds and sustains its AI classifiers.
In a recent post, the firm discussed how to avoid the one-off model trap and why continuous learning makes AI sustainable.
The data behind the model
It is easy to focus on the sophistication of the model itself, but the real differentiator in any high-performing classifier is the quality and diversity of the training data underpinning it. This lesson has been relearned repeatedly across more than two decades of machine learning practice. Because many organisations work from the same open-source libraries and fine-tuned implementations, it is ultimately the training data that separates good classifiers from great ones.
Every classifier Theta Lake builds begins with a clearly defined behaviour that needs to be detected — whether that relates to regulatory compliance, data privacy, security, or AI usage. These initial concepts are shaped by domain experts, evolving regulatory guidance, and direct input from customers. From there, specific positive examples are used to give abstract ideas a concrete form, drawing on regulatory actions, public domain sources, and other approved materials.
Feeding and refining classifiers
Once a first-draft classifier is in place, the next stage is to expand its training data through text augmentation — systematically generating variants of each example. This might mean swapping in different locations, organisations, currencies or amounts; introducing common spelling or grammatical errors; paraphrasing; replacing words with phonetic equivalents to simulate transcription mistakes; or switching between active and passive voice. For multilingual classifiers, this process extends across multiple languages.
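As an illustration of the idea only (not Theta Lake's actual tooling), a minimal augmentation pass might swap named entities and inject transposition typos. The substitution table and error injector below are hypothetical stand-ins for a much richer pattern library:

```python
import random

# Illustrative substitution table -- a real augmentation library is far larger.
SWAPS = {
    "London": ["New York", "Singapore"],
    "GBP": ["USD", "EUR"],
}

def swap_entities(text: str) -> list[str]:
    """Generate variants by swapping locations/currencies for alternatives."""
    variants = [text]
    for original, alternatives in SWAPS.items():
        if original in text:
            variants += [text.replace(original, alt) for alt in alternatives]
    return variants

def inject_typo(text: str, rng: random.Random) -> str:
    """Simulate a common keyboard slip by transposing two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

rng = random.Random(0)
seed = "Transfer 5000 GBP to the London account"
augmented = set()
for variant in swap_entities(seed):
    augmented.add(variant)                  # clean copy
    augmented.add(inject_typo(variant, rng))  # noisy copy
print(len(augmented))
```

Real pipelines layer many more transforms (paraphrase, phonetic substitution, voice changes) over each seed example, multiplying a small labelled set into a much larger one.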
Theta Lake works across a notably complex mix of data sources — emails, chat logs, audio and video transcripts, AI-generated content, and optical character recognition (OCR) from documents and screens. The firm deliberately exploits this diversity, using it to understand the unique error patterns and variations found in each medium. Over time, it has built up an extensive library of these patterns and applies them to augment training data in ways that closely reflect real-world conditions. Patented techniques are also used to correct for source-specific errors, ensuring that lexicons and fuzzy text matching remain robust.
Labelling with precision
Theta Lake uses patented technology to select the most effective training data for each classifier across multiple iterations, continuously assessing performance against large volumes of unlabelled data. This same technology is used to validate label accuracy, surfacing any examples that may be borderline or incorrect. Crucially, all labelling is carried out in-house — never outsourced — and is subject to ongoing review by both internal experts and the firm’s technology systems.
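Theta Lake's selection and validation technology is patented and not public, but the general idea of surfacing suspect labels can be sketched: flag examples where a model's score is borderline, or where it confidently contradicts the human label. The `toy_scorer`, thresholds, and data below are purely illustrative:

```python
def surface_suspect_labels(examples, predict_proba, low=0.4, high=0.6):
    """Flag examples whose model score is borderline, or confidently
    contradicts the assigned label -- candidates for human re-review."""
    suspect = []
    for text, label in examples:
        p = predict_proba(text)
        borderline = low <= p <= high
        contradicts = (label == 1 and p < 0.1) or (label == 0 and p > 0.9)
        if borderline or contradicts:
            suspect.append((text, label, round(p, 2)))
    return suspect

# Hypothetical stand-in scorer: keyword presence -> high score.
def toy_scorer(text: str) -> float:
    return 0.95 if "guarantee" in text.lower() else 0.05

labelled = [
    ("I guarantee this stock will double", 1),  # scorer agrees with label
    ("Lunch at noon?", 0),                      # scorer agrees
    ("We guarantee delivery by Friday", 0),     # scorer contradicts label
]
flagged = surface_suspect_labels(labelled, toy_scorer)
print(flagged)  # only the contradicted example is surfaced
```

In practice the scorer would be the classifier itself, and flagged examples would be routed to in-house reviewers rather than relabelled automatically.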
Where classifiers span multiple languages, training data is incorporated in each relevant language. Large language models (LLMs) are also employed to generate new examples, create variations of existing data, and identify patterns that may otherwise be overlooked.
A particularly significant challenge in this space is the rarity of the behaviours being detected. When positive examples are scarce relative to negative ones, standard accuracy metrics can be deeply misleading — a model may appear to perform well while completely failing to catch the rare instances that matter most. Theta Lake addresses this by carefully managing the mix of positive and negative examples and using appropriate evaluation metrics.
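A quick worked example shows how accuracy misleads under class imbalance: with 10 positives in 1,000 messages, a model that flags nothing at all still scores 99% accuracy while achieving zero recall.

```python
def confusion_counts(y_true, y_pred):
    """Tally true/false positives and negatives for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

# 1,000 messages, only 10 genuinely risky; the model flags nothing.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.3f}")  # 0.990 -- looks excellent
print(f"recall={recall:.3f}")      # 0.000 -- catches none of the risk
```

This is why precision and recall on the positive class, rather than raw accuracy, are the appropriate yardsticks for rare-behaviour detection.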
Combining models for better results
Rather than relying on a single model, Theta Lake’s classifiers bring together combinations of machine learning techniques — including nearest-neighbour methods, tree-based approaches, maximum margin methods, neural networks, and small language models — alongside lexicons and fuzzy rules. An automated process then selects the best-performing subset of these components, an “ensemble”, which is continuously refined as new data becomes available.
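The component models and the automated selection process are Theta Lake's own, but the subset-search idea can be sketched with toy components: enumerate subsets, score each by majority vote on a validation set, and keep the best. Everything below (the component functions, the validation data) is illustrative:

```python
import itertools

# Toy component classifiers: each is a callable text -> 0/1.
components = {
    "lexicon":  lambda t: int("guarantee" in t),
    "length":   lambda t: int(len(t) > 40),
    "keyword2": lambda t: int("risk-free" in t),
}

validation = [
    ("guarantee a risk-free return", 1),
    ("see you at the gym", 0),
    ("this message is quite long but entirely harmless chatter", 0),
]

def vote(subset, text):
    """Strict-majority vote over the chosen components."""
    votes = sum(components[name](text) for name in subset)
    return int(votes * 2 > len(subset))

best_subset, best_score = None, -1.0
for r in range(1, len(components) + 1):
    for subset in itertools.combinations(components, r):
        score = sum(vote(subset, t) == y for t, y in validation) / len(validation)
        if score > best_score:
            best_subset, best_score = subset, score

print(best_subset, best_score)
```

A production version would use proper trained models, larger held-out sets, and re-run the search whenever new data arrives, which is what makes the ensemble "continuously refined".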
The final stage of development involves calibrating the classifier against large volumes of real-world data to fine-tune hit rates, precision, and recall in line with business risk, and to set appropriate thresholds for production environments.
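Threshold-setting of this kind is commonly done by scanning candidate thresholds against labelled scores; a minimal sketch (not the firm's actual calibration process) picks the lowest threshold that meets a precision target, which maximises recall subject to that constraint:

```python
def pick_threshold(scores, labels, min_precision=0.9):
    """Return the lowest score threshold whose precision meets the target.

    Scanning from low to high and stopping at the first qualifying
    threshold keeps recall as high as the precision constraint allows.
    """
    for thr in sorted(set(scores)):
        preds = [int(s >= thr) for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        if tp + fp == 0:
            continue
        if tp / (tp + fp) >= min_precision:
            return thr
    return None

# Hypothetical validation scores and ground-truth labels.
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.3]
labels = [1,    1,   1,   0,   1,   0]
thr = pick_threshold(scores, labels, min_precision=0.75)
print(thr)
```

The `min_precision` target is where "business risk" enters: a compliance team tolerant of review workload would set it lower to catch more, while a team drowning in alerts would set it higher.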
Learning does not stop at deployment
For Theta Lake, deploying a classifier marks the beginning of a process, not the end. The firm continuously monitors model performance and tracks for data and model drift, with updates driven by customer feedback, internal metrics, regulatory changes, and software engineering requirements.
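The article does not say which drift measures Theta Lake uses; one common, simple signal is the Population Stability Index (PSI), which compares the distribution of model scores in production against a baseline. The score samples below are invented for illustration:

```python
import math

def psi(expected, actual, bins=4):
    """Population Stability Index over equal-width score bins.

    A rough rule of thumb treats PSI above ~0.25 as significant drift.
    """
    lo = min(expected + actual)
    hi = max(expected + actual)
    width = (hi - lo) / bins or 1.0

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # Smooth to avoid log(0) for empty bins.
        return [(c + 0.5) / (len(xs) + 0.5 * bins) for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
drifted  = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.95]
stable   = [0.15, 0.2, 0.25, 0.3, 0.45, 0.5, 0.55, 0.7]

print(psi(baseline, drifted) > psi(baseline, stable))  # True: drift raises PSI
```

A monitoring job would compute this (or a similar statistic) on a schedule and alert when it crosses a threshold, triggering the retraining loop the article describes.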
This stands in stark contrast to a common industry pattern in which models are tuned once or twice early in their lifecycle and then left to stagnate. Vendors often struggle to sustain a business model that requires ongoing updates across numerous one-off customer implementations — a failure point that harms both providers and the clients depending on them. Theta Lake’s approach is specifically designed to prevent this kind of decay, ensuring that its classifiers keep pace with shifting business needs and regulatory landscapes.
Read the full Theta Lake post here.
Copyright © 2026 RegTech Analyst