Selecting privacy detection models in regulated environments: why reported precision is not enough

By Juan F. Cobo · May 27, 2026

When a technical team selects a model to detect sensitive information in a regulated environment, the most common criterion is the model's reported precision: how well it identifies the entities it claims to handle. The model with the highest precision is preferred, the comparison seems straightforward, and the decision is documented as justified. It is not. Two models with identical reported precision can leave radically different residual risk in the same domain, depending on which sensitive classes each model covers, how well it handles the entities that regulators actually require, and what happens when it fails. The standard metric says nothing about the deployment domain, so it does not reveal that gap. It conceals it.

Why standard metrics fail in regulated contexts

Standard metrics — precision, recall, F1 — each capture detection quality as a single value across all entity types a model handles. That consolidated view can mask critical deficiencies. A model that detects names, email addresses, and telephone numbers with high precision but misses medical record numbers, insurance identifiers, and device identifiers can still achieve a high overall score — particularly if the former classes are more frequent in the evaluation dataset.
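The masking effect can be reproduced with a few lines of arithmetic. The sketch below uses invented evaluation counts (the class names and numbers are assumptions for illustration, not results from any real model) and compares a pooled recall figure with the per-class values behind it.

```python
# Hypothetical evaluation counts per entity class: (true positives, false negatives).
# All numbers are invented for illustration only.
counts = {
    "NAME": (950, 50),
    "EMAIL": (480, 20),
    "PHONE": (470, 30),
    "MEDICAL_RECORD_NUMBER": (5, 45),
    "DEVICE_ID": (2, 18),
}


def recall(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 when the class has no examples."""
    total = tp + fn
    return tp / total if total else 0.0


# Pooled (micro-averaged) recall: dominated by the frequent classes.
total_tp = sum(tp for tp, _ in counts.values())
total_fn = sum(fn for _, fn in counts.values())
pooled_recall = recall(total_tp, total_fn)

# Per-class recall: exposes the rare classes that the pooled figure hides.
per_class_recall = {name: recall(tp, fn) for name, (tp, fn) in counts.items()}

print(f"pooled recall: {pooled_recall:.3f}")
for name, value in per_class_recall.items():
    print(f"{name:>22}: {value:.3f}")
```

With these counts the pooled recall is about 0.92, while recall on medical record numbers and device identifiers is 0.10. The single headline number gives no hint of that.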

In regulated environments, this matters because the classes a model misses are not interchangeable. The HIPAA Safe Harbor method requires the removal of 18 specified categories of identifiers before health information can be treated as de-identified. A model that cannot detect even one of those categories (medical record numbers, biometric identifiers, or device identifiers, for example) does not reduce compliance risk in proportion to the categories it does handle. It leaves that risk almost entirely in place.

The GDPR defines personal data by what it enables: the identification or identifiability of a natural person. Whether a piece of information is identifying depends on context — a date of birth combined with a postal code and a clinical site may be identifying in one setting and not in another. A model evaluated on general benchmarks has no knowledge of that contextual structure.

False negatives carry a different cost than false positives in this domain. A false positive causes unnecessary redaction and reduces data utility. A false negative leaves sensitive information exposed. In de-identification tasks, as Hartman et al. (2020) note in the context of clinical notes, high sensitivity is a prerequisite rather than a tradeoff: downstream controls cannot undo the legal and ethical exposure created by residual protected information.

What a domain-adjusted ranking measures

Cobo and Leenen (2026) propose a multicriteria function called the Domain-Adjusted Privacy Detection Score designed to answer two operational questions: what is the score of a model in a given domain, and which model should be selected for that domain. The function integrates five dimensions into a normalized multiplicative index that combines empirical performance with domain and regulatory context:

Detection performance is calculated per sensitive class using an alternative metric that assigns greater weight to missed detections than to false alarms — specifically to prevent good overall performance from hiding failures on critical categories. The class-level scores are aggregated using domain-specific criticality weights, not uniformly across all classes. For readers interested in the formal treatment, the metric and its properties are discussed in detail in Cobo and Leenen (2026).

Taxonomic coverage measures the proportion of domain-required sensitive classes that the model can actually detect. A model that supports only part of the classes required by a given taxonomy receives a coverage coefficient that reflects that gap, regardless of how well it performs on the classes it does cover.

Domain fit captures whether the model was validated in the target sector. A general-purpose model applied without adaptation to a regulated sector receives a coefficient that reflects the distance between its training context and the operational context.

Regulatory fit introduces the applicable normative framework explicitly. Different regulatory frameworks — HIPAA, the GDPR, or Argentina's personal data protection legislation — identify different categories of sensitive information and impose different obligations. The model is assessed against whichever framework governs the deployment environment.

False negative penalty applies a severity-weighted measure across critical classes, functioning as an explicit measure of residual domain risk. It penalizes models that fail on the most critical categories even when overall performance appears acceptable.
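Three of these dimensions can be computed directly from evaluation counts. The sketch below is a minimal illustration, not the formulation in Cobo and Leenen (2026): it uses the F-beta score with beta = 2 as a stand-in for the recall-weighted metric, a simple ratio for coverage, and a severity-weighted mean miss rate for the false negative penalty. The function names, class labels, counts, and weights are all assumptions.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from raw counts; beta > 1 weights missed detections above false alarms."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def weighted_detection(counts, weights, beta=2.0):
    """Criticality-weighted mean of per-class F-beta scores (assumed aggregation)."""
    total_weight = sum(weights[c] for c in counts)
    return sum(weights[c] * f_beta(*counts[c], beta=beta) for c in counts) / total_weight


def taxonomic_coverage(required, supported):
    """Share of domain-required classes that the model can detect at all."""
    required = set(required)
    if not required:
        raise ValueError("the required taxonomy must not be empty")
    return len(required & set(supported)) / len(required)


def false_negative_penalty(counts, severity):
    """Severity-weighted mean miss rate FN / (TP + FN) over the critical classes."""
    total_severity = sum(severity.values())
    weighted_misses = 0.0
    for cls, weight in severity.items():
        tp, _fp, fn = counts[cls]
        weighted_misses += weight * (fn / (tp + fn) if (tp + fn) else 0.0)
    return weighted_misses / total_severity


# Hypothetical per-class counts (tp, fp, fn) and weights, invented for illustration.
counts = {"NAME": (90, 10, 10), "MEDICAL_RECORD_NUMBER": (40, 0, 60)}
criticality = {"NAME": 1.0, "MEDICAL_RECORD_NUMBER": 3.0}
required = {"NAME", "MEDICAL_RECORD_NUMBER", "DEVICE_ID", "BIOMETRIC_ID"}
supported = {"NAME", "MEDICAL_RECORD_NUMBER", "EMAIL"}

detection = weighted_detection(counts, criticality)
coverage = taxonomic_coverage(required, supported)
penalty = false_negative_penalty(counts, criticality)

print(f"detection={detection:.3f} coverage={coverage:.2f} fn_penalty={penalty:.3f}")
```

Because the critical class carries three times the weight of names in this example, its poor score pulls the weighted detection figure well below the unweighted average. Domain fit and regulatory fit are not shown because, as noted later in this article, they depend on expert assessment rather than on counts.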

The multiplicative structure is intentional. A high score in four dimensions does not compensate for a critical deficiency in a fifth. A model with strong detection performance but incomplete taxonomic coverage should not rank above a model that covers all required classes, even if the latter shows lower overall performance. The index prevents a strong general result from masking a specific regulatory gap.
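The difference between a multiplicative and an additive aggregate is easy to see numerically. The sketch below assumes each dimension is expressed as a coefficient in [0, 1] where higher is better, with the false negative penalty entering as a factor of one minus the penalty. That encoding, the plain product, and the two model profiles are assumptions for illustration; the exact normalization is defined in Cobo and Leenen (2026).

```python
from math import prod


def multiplicative_score(coefficients):
    """Product of coefficients in [0, 1]: one weak dimension drags the whole score down."""
    return prod(coefficients.values())


def additive_score(coefficients):
    """Arithmetic mean, shown only as a contrast: strong dimensions offset weak ones."""
    return sum(coefficients.values()) / len(coefficients)


# Hypothetical profiles. Model A detects well but covers 40% of the required classes.
model_a = {
    "detection": 0.98,
    "coverage": 0.40,
    "domain_fit": 0.95,
    "regulatory_fit": 0.95,
    "fn_factor": 0.95,  # assumed to be 1 - false negative penalty
}
# Model B is weaker on most dimensions but covers every required class.
model_b = {
    "detection": 0.85,
    "coverage": 1.00,
    "domain_fit": 0.80,
    "regulatory_fit": 0.80,
    "fn_factor": 0.75,
}

for name, model in (("A", model_a), ("B", model_b)):
    print(
        f"model {name}: mean={additive_score(model):.3f} "
        f"product={multiplicative_score(model):.3f}"
    )
```

Under the mean, model A narrowly outranks model B. Under the product, the coverage gap dominates and model B ranks first, which is the behavior the multiplicative structure is meant to enforce.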

What the risk index adds

The methodology also produces a risk index separate from the score. The risk index combines two components: the failure rate in the most vulnerable class (the weakest-link risk) and the product of the false negative penalty and coverage (the multiplicative residual risk). The combination follows a probabilistic union — the system is considered at risk if either condition is met. This gives the compliance team a direct measure of exposure, not just a ranking.
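A probabilistic union of two components can be written as one minus the product of their complements. The sketch below assumes both components are already scaled to [0, 1] and takes the weakest-link risk to be the highest per-class miss rate. The function names and input values are hypothetical, and the published definition may normalize the components differently.

```python
def weakest_link_risk(miss_rates):
    """Failure rate of the most vulnerable class: the largest per-class miss rate."""
    return max(miss_rates.values())


def risk_index(weakest_link, residual):
    """Probabilistic union of two risk components: 1 - (1 - a) * (1 - b)."""
    for value in (weakest_link, residual):
        if not 0.0 <= value <= 1.0:
            raise ValueError("risk components must lie in [0, 1]")
    return 1.0 - (1.0 - weakest_link) * (1.0 - residual)


# Hypothetical inputs, invented for illustration.
miss_rates = {"NAME": 0.10, "MEDICAL_RECORD_NUMBER": 0.60}
fn_penalty = 0.475
coverage = 0.50

# Multiplicative residual risk, following the description in the text above.
residual = fn_penalty * coverage
risk = risk_index(weakest_link_risk(miss_rates), residual)

print(f"weakest link={weakest_link_risk(miss_rates):.2f} "
      f"residual={residual:.4f} risk index={risk:.3f}")
```

The union is never lower than either component on its own, so a single badly handled class keeps the index high regardless of the other term.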

A concrete case: a virtual assistant for retired beneficiaries

An obra social (a health insurance fund within Argentina's social security system) serving retired beneficiaries deployed a virtual assistant as its primary service channel. The assistant handles a range of queries: benefit status, reimbursement requests, provider information, and account updates.

The team evaluating detection models for this channel considered models benchmarked on healthcare PHI datasets. Those benchmarks emphasized clinical entities — diagnoses, medications, treatment codes, medical record numbers — which is appropriate for a hospital's document processing pipeline, but not for this channel.

The sensitive information that actually flows through a virtual assistant serving retired beneficiaries is predominantly financial and identity-related: bank account numbers, CBU codes (Argentina's standardized bank account identifiers), national identity numbers, home addresses, and insurance membership numbers. Diagnostic information is not typically exchanged through this interface.

A model trained and benchmarked on clinical PHI taxonomies may report a high score on its published evaluation set and still fail substantially on the entities that constitute the real risk for this deployment. The domain-adjusted ranking exposes that failure before deployment, not after an incident. The case illustrates the methodology's central premise: the sector does not define the risk. The actual data flow in the operational context defines it.

What does each stakeholder gain from a domain-adjusted ranking?

For the data protection officer, the domain-adjusted score provides a documented and defensible basis for model selection. The decision is not "the vendor's performance looked acceptable." The decision is: this model, applied to this domain, covers the classes the regulatory framework requires, handles the most critical ones with the sensitivity the domain demands, and aligns with the applicable regulatory obligations. That chain of reasoning is defensible in a regulatory review in a way that a single performance metric is not.

For the chief risk officer, the risk index is the actionable output. The score ranks models; the risk index quantifies residual exposure. Choosing a model with a higher domain-adjusted score but a higher residual risk index is a different decision from choosing one with a somewhat lower score but substantially lower exposure. Where residual exposure is moderate, a model may be operationally suitable with targeted mitigations. Where it is high, the deployment requires either a different model or a compensating control architecture, regardless of what the vendor's benchmark shows.

For the CTO or lead architect, the framework governs procurement and architectural decisions. The selection criteria are explicit, reproducible, and tied to the operational context. When a new model is released and the vendor claims superior performance, the evaluation process is already defined — run it on the validated domain dataset, apply the same taxonomy weights, and compare scores under consistent conditions. The comparison is no longer between numbers produced by different vendors under different conditions. The selection rationale is traceable end to end.

When this approach does not apply

The domain-adjusted score requires a validated dataset for the target domain. Without labeled examples of the sensitive classes the domain actually requires, the detection scores cannot be calculated, and the ranking collapses back to vendor-reported metrics.

Assembling a validated domain dataset is not a trivial task. It requires annotators with domain knowledge, a clear taxonomy definition, and alignment between the annotation schema and the model's entity vocabulary. For organizations that lack this infrastructure, the score can still be partially applied — particularly the coverage, domain fit, and regulatory fit dimensions — but the detection component will rely on approximate estimates rather than empirical measurements.

The methodology also does not resolve the question of whether any model is sufficient. A high domain-adjusted score is operationally favorable, but it does not constitute a guarantee of regulatory compliance. The score is a structured decision tool, not a compliance certification.

Conclusion

The selection of models for sensitive information detection in regulated environments is a risk decision. It requires criteria that reflect the domain, the applicable regulatory framework, the actual taxonomy of sensitive entities in the operational context, and the higher cost of false negatives relative to false positives.

Standard metrics answer a different question. They describe how a model performs across the classes its developers chose to evaluate, on datasets that may not represent the operational context. In low-risk settings, that is sufficient. In regulated environments, it is not.

A domain-adjusted ranking does not eliminate judgment from the selection process. Domain fit and regulatory fit require expert assessment. Class criticality weights require deliberate choices that reflect the organization's risk policy. What the methodology does is operationalize those judgments into a reproducible, defensible framework — one that compliance officers, risk officers, and technical teams can inspect, challenge, and present before regulators require it.

References

Cobo, J. F. and Leenen, L. (2026). Domain-adjusted scoring and risk for the selection of privacy detection models in regulated environments. EthiCompass. [link to be added upon publication]
U.S. Department of Health and Human Services. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the HIPAA Privacy Rule. 2025.
European Union. Regulation (EU) 2016/679 — General Data Protection Regulation. 2016.
República Argentina. Ley 25.326 de Protección de Datos Personales. 2000.
Agencia de Acceso a la Información Pública. Resolución 4/2019: Criterios orientadores e indicadores de mejores prácticas en la aplicación de la Ley 25.326. 2019.
European Union. Regulation (EU) 2024/1689 — Artificial Intelligence Act. 2024.
PCI Security Standards Council. Can card verification codes be stored for card-on-file or recurring transactions? 2026.
National Institute of Standards and Technology. SP 800-63B: Digital Identity Guidelines — Authentication and Lifecycle Management. 2025.
Cybersecurity and Infrastructure Security Agency. ED 24-02: Mitigating the Significant Risk from Nation-State Compromise of Microsoft Corporate Email Systems. 2026.
FedPayments Improvement. Check Fraud Explained. 2026.
Information Commissioner's Office. What are identifiers and related factors? 2026.
Hartman, T. et al. (2020). Customization scenarios for de-identification of clinical notes. BMC Medical Informatics and Decision Making, 20, 14.
Neamatullah, I. et al. (2008). Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making, 8, 32.
Abdalla, S. et al. (2025). Evaluating GPT models for clinical note de-identification. PMC.
Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. Proceedings of IJCAI 2001, 973–978.
Van Rijsbergen, C. J. (1979). Information Retrieval. Butterworth-Heinemann.
Hripcsak, G. and Rothschild, A. S. (2005). Agreement, the F-Measure, and Reliability in Information Retrieval. Journal of the American Medical Informatics Association, 12(3).
Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
