Data Quality Management: What It Is, How to Measure It and Why the AI Act Cares

Q: What data quality tools exist?

The main data quality management tools in 2026 are: dbt tests for validations integrated into the transformation pipeline, Great Expectations and Soda for defining quality SLAs as code, Monte Carlo and Bigeye for rule-free anomaly detection, and Collibra DQ or IBM InfoSphere QualityStage for enterprise solutions. The choice depends on the stack, budget, and team maturity.

Q: What's the difference between data quality and data validation?

Data validation is a one-time check: at the moment of ingestion or transformation, data is checked against rules. Data quality management is a continuous process: it defines the rules, applies them on every pipeline run, measures the result with metrics, alerts when thresholds are breached, and manages remediation with assigned owners. Validation is a step within data quality management, not a substitute for it.

What Data Quality Management Is

Data quality management is the set of processes, policies, roles, and tools that ensure an organization's data is accurate, complete, consistent, timely, and fit for its intended use. It isn't a one-time check at ingestion. It isn't a quality report someone generates once a quarter. It's a continuous process with defined metrics, thresholds agreed with the business, automatic alerts when they're breached, and remediation owners with a real mandate to act.

The distinction between data validation and data quality management matters: validation is a one-time check at a point in the pipeline. Data quality management is the system that surrounds that validation: who defined the rules, why, with what acceptance threshold, who receives the alert when it fails, and what remediation process kicks in. Without that system, validation is noise without context.

The Six Dimensions of Data Quality

The industry reference standard, DAMA-DMBOK, defines six main dimensions for measuring data quality. Each dimension requires specific metrics and rules, and its relative importance varies by domain and use case.

1. Completeness

Measures whether all required values are present. A null mandatory field is a completeness problem. It's expressed as a percentage: what proportion of records have the field filled in. The acceptance threshold is defined by the business: for some fields 95% is acceptable; for high-risk AI training data, 100% may be required.
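
As a rough illustration, the completeness percentage described above can be computed in a few lines. This is a minimal sketch in plain Python; the `completeness` helper, the sample records, and the 95% threshold are illustrative assumptions, not part of any specific tool.

```python
from typing import Any, Iterable, Mapping


def completeness(records: Iterable[Mapping[str, Any]], field: str) -> float:
    """Return the share of records (0.0 to 1.0) where `field` is filled in."""
    rows = list(records)
    if not rows:
        return 1.0  # Assumption: an empty batch counts as complete.
    # Assumption: None and empty strings both count as "not filled in".
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


# Hypothetical sample data.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": ""},
]

email_completeness = completeness(customers, "email")
meets_threshold = email_completeness >= 0.95  # threshold agreed with the business
```

What counts as "empty" (nulls only, or also blank strings and placeholder values such as "N/A") is itself a business decision and should be agreed along with the threshold.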

2. Uniqueness

Measures whether there are duplicates where there shouldn't be: a customer registered twice with the same ID, a transaction processed twice, a product with two active records. Undetected duplicates create bias in AI models and errors in financial reports. It's measured as the percentage of distinct records out of the total number of records.
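
A uniqueness check on a business key might look like the sketch below. It assumes uniqueness is judged on a single key column; the `uniqueness` function and the sample IDs are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def uniqueness(keys: Iterable[Hashable]) -> Tuple[float, List[Hashable]]:
    """Return (distinct keys / total keys, sorted list of duplicated keys)."""
    values = list(keys)
    if not values:
        return 1.0, []  # Assumption: an empty batch has no duplicates.
    counts = Counter(values)
    duplicated = sorted(key for key, n in counts.items() if n > 1)
    return len(counts) / len(values), duplicated


# Hypothetical customer IDs: C2 appears twice, C3 three times.
ratio, duplicated_ids = uniqueness(["C1", "C2", "C2", "C3", "C3", "C3"])
```

Returning the duplicated keys along with the ratio gives whoever handles remediation something concrete to investigate.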

3. Validity

Measures whether values comply with defined format, range, and domain rules. A six-digit postal code where there should only be five, a birth date in the future, a negative amount where only positives are allowed. Validity rules are the easiest to automate and the first to implement in any data pipeline.
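
The three examples above translate almost directly into code. The following is a sketch under stated assumptions: the field names (`postal_code`, `birth_date`, `amount`) and the five-digit postal format are illustrative, and real rules would come from the business.

```python
import re
from datetime import date
from typing import Any, List, Mapping

POSTAL_CODE = re.compile(r"\d{5}")  # Assumption: exactly five digits.


def validity_errors(record: Mapping[str, Any], today: date) -> List[str]:
    """Return a list of validity problems found in one record."""
    errors = []
    if not POSTAL_CODE.fullmatch(str(record.get("postal_code", ""))):
        errors.append("postal_code: expected exactly five digits")
    birth = record.get("birth_date")
    if birth is None or birth > today:
        errors.append("birth_date: missing or in the future")
    amount = record.get("amount")
    if amount is None or amount < 0:
        errors.append("amount: missing or negative")
    return errors


reference_day = date(2026, 1, 1)  # fixed date so the example is reproducible
bad = {"postal_code": "280013", "birth_date": date(2030, 5, 1), "amount": -5}
good = {"postal_code": "28001", "birth_date": date(1990, 5, 1), "amount": 12.5}

bad_errors = validity_errors(bad, reference_day)
good_errors = validity_errors(good, reference_day)
```

Passing `today` as a parameter rather than calling `date.today()` inside the function keeps the check deterministic and easy to test.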

4. Consistency

Measures whether the same data has the same value across different systems or different points in the pipeline. The customer who has one name in the CRM and another in the billing system. The metric that's X in the sales report and Y in the finance report despite sharing the same definition. Consistency is the hardest dimension to measure because it requires cross-referencing sources, and the costliest to get wrong because it destroys organization-wide trust in data.
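
A cross-system comparison can be sketched as follows. The two dictionaries stand in for a CRM and a billing extract keyed by customer ID, and the normalization (trim, collapse whitespace, ignore case) is an assumption about what should count as "the same value".

```python
from typing import Dict, List


def normalize(name: str) -> str:
    """Assumed normalization: trim, collapse whitespace, ignore case."""
    return " ".join(name.split()).casefold()


def mismatched_keys(system_a: Dict[str, str], system_b: Dict[str, str]) -> List[str]:
    """Return keys present in both systems whose normalized values differ."""
    shared = system_a.keys() & system_b.keys()
    return sorted(
        key for key in shared
        if normalize(system_a[key]) != normalize(system_b[key])
    )


# Hypothetical extracts from two systems.
crm = {"C1": "Ana Garcia", "C2": "Luis Perez", "C3": "Marta Ruiz"}
billing = {"C1": "ana  garcia ", "C2": "Luis Peretz", "C4": "Ivan Soto"}

mismatches = mismatched_keys(crm, billing)
only_in_one_system = sorted(crm.keys() ^ billing.keys())
```

Keys that exist in only one system are reported separately here, because they usually point to a different problem (missing synchronization) than conflicting values.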

5. Timeliness

Measures whether data is available when needed. Correct data that arrives late is useless for the decision that depended on it. It's measured as delay relative to the agreed SLA: if the pipeline should complete at 8:00 and completes at 10:00, that's a timeliness problem regardless of whether the data is correct.
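
The 8:00 versus 10:00 example can be expressed as a simple delay calculation. This is a minimal sketch; the `sla_delay` helper and the specific date are illustrative.

```python
from datetime import datetime, timedelta


def sla_delay(completed_at: datetime, sla_deadline: datetime) -> timedelta:
    """Return how late the pipeline finished; zero if it met the deadline."""
    return max(completed_at - sla_deadline, timedelta(0))


deadline = datetime(2026, 1, 15, 8, 0)   # agreed SLA: ready by 8:00
finished = datetime(2026, 1, 15, 10, 0)  # actual completion: 10:00

delay = sla_delay(finished, deadline)
is_late = delay > timedelta(0)
```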

6. Accuracy

Measures whether the value reflects the reality it's meant to represent. It's the hardest dimension to measure automatically because it requires checking against an external source of truth. A weight recorded as 75 kg when the real weight is 82 kg is an accuracy problem no validation rule will catch if 75 is within the acceptable range. In AI training data, poor accuracy is the most common source of undetected bias.
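
The sketch below shows why a range rule misses the 75 kg versus 82 kg case while a comparison against a reference source catches it. It assumes a trusted reference measurement exists for a sample of records; the 30 to 250 kg range and the 1 kg tolerance are illustrative values.

```python
from typing import Dict


def accuracy_rate(recorded: Dict[str, float], reference: Dict[str, float],
                  tolerance: float) -> float:
    """Share of shared keys whose recorded value is within `tolerance` of the reference."""
    shared = recorded.keys() & reference.keys()
    if not shared:
        raise ValueError("no overlapping records to compare")
    accurate = sum(
        1 for key in shared if abs(recorded[key] - reference[key]) <= tolerance
    )
    return accurate / len(shared)


# Hypothetical recorded weights and a trusted re-measurement, in kg.
recorded = {"P1": 75.0, "P2": 64.0}
reference = {"P1": 82.0, "P2": 64.5}

# A validity rule passes both records, because both weights are plausible.
passes_range_rule = all(30 <= weight <= 250 for weight in recorded.values())

# Only the comparison against the reference exposes the inaccurate record.
rate = accuracy_rate(recorded, reference, tolerance=1.0)
```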

Why the AI Act Requires Data Quality Management

Article 10 of the AI Act is explicit: training, validation, and testing data for high-risk AI systems must meet quality criteria suited to the system's purpose. Specifically, the Regulation requires that the data:

Be relevant, sufficiently representative and, to the best extent possible, free of errors and complete.
Have appropriate statistical properties, including with respect to the people or groups the system is intended to be used on.
Be subject to appropriate data governance and management practices, including examination of possible biases.

This isn't a statement of principles: it's an auditable obligation. AESIA and the AEPD can request evidence that training data meets these criteria. Without a documented, continuous data quality management process — with metrics, thresholds, alerts, and remediation logs — that evidence doesn't exist.
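
One way to produce evidence for the representativeness requirement is to compare group shares in the training set against a reference population. The sketch below is only an illustration: the age bands, the population shares, and the 10-point tolerance are hypothetical, and the Regulation doesn't prescribe this or any other specific metric.

```python
from collections import Counter
from typing import Dict, Iterable


def representation_gaps(sample_groups: Iterable[str],
                        population_shares: Dict[str, float],
                        tolerance: float) -> Dict[str, float]:
    """Return observed-minus-expected share for groups outside the tolerance."""
    counts = Counter(sample_groups)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    gaps = {}
    for group, expected in population_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 4)
    return gaps


# Hypothetical training sample and reference population shares.
sample = ["18-34"] * 60 + ["35-54"] * 30 + ["55+"] * 10
population = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

gaps = representation_gaps(sample, population, tolerance=0.10)
```

A positive gap means the group is over-represented in the sample and a negative gap means it is under-represented. Storing the result alongside the dataset version gives the bias analysis a dated, reproducible record.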

To see how data quality management fits into the full governance framework and the key compliance dates, see How to Implement an Effective Data Governance Framework in the AI Act Era and AI Act Key Dates: What Your Company Must Do at Each Regulatory Milestone.

How to Implement Data Quality Management Step by Step

Step 1: Define Quality Rules With the Business, Not Just the Tech Team

Data quality rules can't be defined by the engineering team alone. They need business input: what values are acceptable for this field, what percentage of nulls this process can tolerate, what value range makes sense for this metric. The technical team translates those rules into code; the Data Owner validates that the rules reflect business reality; the Data Steward keeps them up to date as the business changes.

Step 2: Implement the Rules as Code in the Pipeline

Quality rules should live in pipeline code, not in an external document. Tools like dbt tests, Great Expectations, or Soda let you define these rules as versioned code, integrate them into the CI/CD process, and run them automatically on every pipeline run. If a rule fails, the pipeline can stop, fire an alert, or log the anomaly depending on the defined criticality.
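
To make "rules as code" concrete without tying the example to one tool, here is a plain-Python stand-in. It is not the API of dbt, Great Expectations, or Soda; the `Rule` class, the rule names, and the sample batch are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping, Sequence

Row = Mapping[str, Any]


@dataclass(frozen=True)
class Rule:
    """A named, versionable quality rule evaluated row by row."""
    name: str
    check: Callable[[Row], bool]


# Rules live next to the pipeline code and go through code review like any change.
RULES = [
    Rule("customer_id_not_null", lambda r: r.get("customer_id") is not None),
    Rule("amount_non_negative",
         lambda r: r.get("amount") is not None and r["amount"] >= 0),
]


def run_rules(rows: Sequence[Row], rules: Sequence[Rule]) -> Dict[str, float]:
    """Return the pass rate (0.0 to 1.0) for each rule over the batch."""
    results = {}
    for rule in rules:
        passed = sum(1 for row in rows if rule.check(row))
        results[rule.name] = passed / len(rows) if rows else 1.0
    return results


# Hypothetical batch produced by one pipeline run.
batch = [
    {"customer_id": "C1", "amount": 10.0},
    {"customer_id": None, "amount": 5.0},
    {"customer_id": "C3", "amount": -1.0},
    {"customer_id": "C4", "amount": 7.5},
]
pass_rates = run_rules(batch, RULES)
```

The point of the pattern is that every rule has a name, lives in version control, and produces a measurable result on each run, whichever tool ends up executing it.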

Step 3: Define Acceptance Thresholds and Criticality Levels

Not all quality rules carry the same weight. A free-text comment field with 10% nulls is tolerable; a customer ID with 1% nulls is critical. Define for each rule: the acceptance threshold (percentage of records that must meet it), the criticality level (blocks the pipeline, fires an alert, or just logs), and the remediation SLA (how quickly it must be resolved if it fails).
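
The three attributes per rule can be captured in a small policy table. This sketch assumes pass rates are computed elsewhere; the rule names, thresholds, and SLA hours are illustrative values, not recommendations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Criticality(Enum):
    BLOCK = "block"  # stop the pipeline
    ALERT = "alert"  # notify the owner, keep running
    LOG = "log"      # record only


@dataclass(frozen=True)
class RulePolicy:
    threshold: float            # minimum share of records that must pass
    criticality: Criticality    # what happens when the threshold is missed
    remediation_sla_hours: int  # how quickly a failure must be resolved


# Hypothetical policies agreed with the business.
POLICIES = {
    "customer_id_not_null": RulePolicy(1.00, Criticality.BLOCK, 4),
    "comment_not_null": RulePolicy(0.90, Criticality.LOG, 72),
}


def action_for(rule_name: str, pass_rate: float) -> Optional[Criticality]:
    """Return the action to take, or None if the rule met its threshold."""
    policy = POLICIES[rule_name]
    return None if pass_rate >= policy.threshold else policy.criticality


id_action = action_for("customer_id_not_null", 0.99)  # 1% nulls on a critical key
comment_action = action_for("comment_not_null", 0.90)  # 10% nulls is tolerated
```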

Step 4: Publish Quality Dashboards Visible to Data Owners

Data quality can't be information only the technical team sees. Data Owners need visibility into their domain's quality status: which rules are failing, how often, what trend they show, and what impact they have on downstream systems. A quality dashboard in Power BI or the data quality tool itself turns that information into something actionable for whoever has the responsibility and authority to decide on the data.

Step 5: Establish Remediation Processes With Clear Owners

A quality alert without a remediation process is noise. Define for each incident type: who receives the alert, what they must do, within what timeframe, and how the resolution is documented. Remediation can be automatic — a pipeline fix — or manual — intervention by the Data Steward or the source system — but it must always have an owner and a record. That record is part of the data quality management evidence the AI Act may require.
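
A remediation record needs little more than an owner, timestamps, and a resolution note. The `Incident` class below is a hypothetical sketch of such a record, serialized as one JSON line so it can be appended to a log; the field names and the example owner address are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    rule: str
    owner: str
    opened_at: datetime
    sla: timedelta
    resolved_at: Optional[datetime] = None
    resolution: Optional[str] = None

    def resolve(self, when: datetime, note: str) -> None:
        """Close the incident and document how it was fixed."""
        self.resolved_at = when
        self.resolution = note

    def breached_sla(self, now: datetime) -> bool:
        """True if the incident took (or is taking) longer than its SLA."""
        end = self.resolved_at or now
        return end - self.opened_at > self.sla

    def to_log_line(self) -> str:
        """Serialize the record; datetimes and timedeltas become strings."""
        return json.dumps(asdict(self), default=str, sort_keys=True)


incident = Incident(
    rule="customer_id_not_null",
    owner="data.steward@example.com",  # hypothetical owner
    opened_at=datetime(2026, 1, 15, 8, 0),
    sla=timedelta(hours=4),
)
incident.resolve(datetime(2026, 1, 15, 11, 0), "Source system mapping fixed")
log_line = incident.to_log_line()
```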

Data Quality Management Tools in 2026

Tool	Approach	Best for	Integration	Cost
dbt tests	Transformation-stage validation	Teams standardized on dbt	Native in dbt, any DWH	Open source (dbt Core) / Paid cloud
Great Expectations	Quality SLAs as code	Python / Spark pipelines	Snowflake, BigQuery, Pandas, Spark	Open source / Paid cloud
Soda	Quality SLAs as code	Teams that prefer YAML over Python	Snowflake, BigQuery, Databricks, dbt	Open source / Paid SaaS
Monte Carlo	Data observability (anomalies)	Detection without predefined rules	Snowflake, BigQuery, Databricks, Airflow	SaaS, mid-high price
Bigeye	Data observability (anomalies)	Teams with high table volume	Snowflake, BigQuery, Redshift	SaaS, mid price
Collibra DQ	Enterprise data quality	Large corporations using Collibra	Any JDBC source + Collibra platform	High (enterprise license)

For teams just starting out, the dbt tests + Soda combination covers 80% of use cases at minimal cost with a reasonable learning curve. Monte Carlo or Bigeye add value when table volume makes manually defining rules for everything unviable and automatic anomaly detection is needed.

Data Quality Management in Complex Environments: What Works

Two-Tier Quality Rules in Multi-Entity Environments

In environments with multiple airlines sharing a data platform, quality rules need to operate on two layers. Global rules cover shared master domains: customer IDs, product codes, operation dates. These rules are identical across all entities and apply at the integration layer. Local rules cover each market's particulars: acceptable price ranges by route, documentation formats by country, occupancy thresholds by aircraft type. Without this distinction, global rules generate false positives on locally correct data, and local rules don't scale to the group.
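
The two layers can be sketched as a global rule set merged with a per-entity one. The entity codes ("ES", "BR"), the price ranges, and the field names below are hypothetical.

```python
from typing import Any, Callable, Dict, List, Mapping

Check = Callable[[Mapping[str, Any]], bool]

# Global rules: identical for every entity, applied at the integration layer.
GLOBAL_RULES: Dict[str, Check] = {
    "customer_id_present": lambda r: bool(r.get("customer_id")),
}

# Local rules: each entity defines its own acceptable ranges.
LOCAL_RULES: Dict[str, Dict[str, Check]] = {
    "ES": {"price_in_range": lambda r: 20 <= r["price"] <= 400},
    "BR": {"price_in_range": lambda r: 100 <= r["price"] <= 3000},
}


def failed_rules(record: Mapping[str, Any], entity: str) -> List[str]:
    """Apply global rules plus the entity's local rules; return failing rule names."""
    rules = {**GLOBAL_RULES, **LOCAL_RULES.get(entity, {})}
    return sorted(name for name, check in rules.items() if not check(record))


fare = {"customer_id": "C1", "price": 1500}
failures_under_own_rules = failed_rules(fare, "BR")    # locally correct
failures_under_other_rules = failed_rules(fare, "ES")  # would be a false positive
```

In this sketch a local rule with the same name as a global one would override it; in practice that is probably something to forbid rather than allow.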

Integrating dbt Tests Into the Deployment Cycle

In commercial BI projects, integrating dbt tests into the CI/CD process radically changes how the team treats data quality. When a failing quality test blocks a model's deployment, quality stops being something reviewed after the fact and becomes a delivery requirement. Analysts learn to catch quality problems before they reach the dashboard, not after Data Owners report them in a meeting.
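
CI systems generally decide whether a step passed from its exit code, and dbt returns a non-zero exit code when tests fail, which is what makes the blocking behavior possible. The sketch below illustrates the same gating logic in plain Python; the test names and failure counts are hypothetical, and the dictionary stands in for whatever results the test runner reports.

```python
import sys
from typing import Dict, Set


def ci_exit_code(failures_by_test: Dict[str, int], blocking: Set[str]) -> int:
    """Return 1 if any blocking test has failing rows, otherwise 0."""
    blocking_failures = {
        name: count for name, count in failures_by_test.items()
        if count > 0 and name in blocking
    }
    for name, count in sorted(blocking_failures.items()):
        print(f"BLOCKING: {name} failed on {count} rows", file=sys.stderr)
    return 1 if blocking_failures else 0


# Hypothetical results: failing row counts per test.
results = {"orders_unique_id": 3, "orders_comment_not_null": 120}

exit_code = ci_exit_code(results, blocking={"orders_unique_id"})
# A CI wrapper script would end with sys.exit(exit_code) to stop the deployment.
```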

The Quality Dashboard as a Management Tool for the Data Owner

In environments with critical operational data, publishing data quality status on a dashboard accessible to Data Owners transforms the conversation about quality. The Data Owner stops receiving reactive incident reports and gains proactive visibility: which fields are trending toward degradation, which rules are near their failure threshold, which domain has the worst completeness rate this week. That visibility creates real accountability: a Data Owner who sees their domain at 88% completeness on a critical field has an incentive to act before it becomes a business problem.
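
The "proactive" signals mentioned above, rules close to their threshold and fields trending downward, can be derived from stored pass rates. This is a rough sketch; the margin, the metric names, and the strict week-over-week definition of "degrading" are assumptions.

```python
from typing import Dict, List, Sequence


def near_threshold(current: Dict[str, float], thresholds: Dict[str, float],
                   margin: float) -> List[str]:
    """Rules still passing but within `margin` of their failure threshold."""
    return sorted(
        name for name, rate in current.items()
        if thresholds[name] <= rate < thresholds[name] + margin
    )


def is_degrading(history: Sequence[float]) -> bool:
    """Assumed definition: every measurement is lower than the previous one."""
    return len(history) >= 2 and all(
        later < earlier for earlier, later in zip(history, history[1:])
    )


# Hypothetical current pass rates and agreed thresholds.
current = {"email_completeness": 0.955, "id_uniqueness": 1.0}
thresholds = {"email_completeness": 0.95, "id_uniqueness": 0.98}

at_risk = near_threshold(current, thresholds, margin=0.01)
weekly_completeness = [0.97, 0.93, 0.88]  # last three weeks
trending_down = is_degrading(weekly_completeness)
```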

Common Mistakes in Data Quality Management

Treating data quality as a project with an end date. Data quality is perishable: a pipeline that produces correct data today can produce incorrect data tomorrow if the source system changes, the schema changes, or the business changes. Without continuous monitoring, quality degrades silently until someone catches an error in production.
Defining quality rules without business input. Rules defined by the technical team alone tend to be incomplete or wrong for the real use case. The technical team knows what values are technically possible; the business knows which are operationally acceptable. Both perspectives are needed.
Measuring quality only at ingestion, not across the whole chain. Data can enter the data lake correct and exit the semantic model incorrect if an intermediate transformation introduces errors. Quality must be measured at every relevant pipeline layer, not just the entry point.
Alerts without a remediation process. An alert without a clear recipient, defined process, and resolution SLA is an alert that gets ignored. 80% of data quality management's value is in remediation, not detection.
Ignoring representativeness for AI data. In AI training data, completeness and validity are necessary but not sufficient. Representativeness — whether the dataset adequately reflects the distribution of the population the model will operate on — is the most critical dimension for the AI Act and the one most frequently left out of quality processes.
Not documenting quality results as evidence. For the AI Act, having quality rules isn't enough: you must be able to show they run, results are monitored, and incidents are remediated. Without historical records of quality runs, that evidence doesn't exist.
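
For the last mistake in the list, the simplest form of evidence is an append-only history with one record per rule per run. The sketch below writes JSON Lines to any text stream; the in-memory `io.StringIO` stands in for a file or table, and the field names are assumptions.

```python
import io
import json
from datetime import datetime, timezone
from typing import TextIO


def append_run_record(stream: TextIO, rule: str, pass_rate: float,
                      threshold: float, run_at: datetime) -> None:
    """Append one quality-run result as a JSON line."""
    record = {
        "run_at": run_at.isoformat(),
        "rule": rule,
        "pass_rate": pass_rate,
        "threshold": threshold,
        "passed": pass_rate >= threshold,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")


history = io.StringIO()  # stand-in for an append-only file or table
run_time = datetime(2026, 1, 15, 8, 0, tzinfo=timezone.utc)

append_run_record(history, "customer_id_not_null", 0.998, 1.0, run_time)
append_run_record(history, "email_completeness", 0.97, 0.95, run_time)

records = [json.loads(line) for line in history.getvalue().splitlines()]
```

Because each line is self-describing, the history can later be filtered by rule and date to show that a check ran, what it measured, and whether it passed.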

Conclusion: Data Quality Isn't a Technical KPI, It's a Business SLA

Effective data quality management isn't measured by the number of rules defined or the tool chosen. It's measured by whether the data feeding business decisions — and the AI models supporting those decisions — is correct, complete, and representative enough for its intended use.

As the AI Act's obligations for high-risk systems come into application, that question now carries a direct regulatory dimension. Organizations that already have a documented, continuous data quality management process (with metrics, alerts, remediation, and records) have a significant advantage: Article 10 compliance is a natural extension of what they already do. Those that don't should start with the most critical domains, the most important rules, and the simplest tools that work on their current stack.

Checklist: Operational Data Quality Management

Critical domains and datasets identified and prioritized for initial implementation.
Quality rules defined per domain with input from the Data Owner and Data Steward.
All six dimensions assessed per domain: completeness, uniqueness, validity, consistency, timeliness, and accuracy.
Rules implemented as code in the pipeline (dbt tests, Great Expectations, or Soda).
Acceptance thresholds defined per rule, agreed with the business, and documented.
Criticality levels assigned: block, alert, or log depending on impact.
Automatic alerts configured with a clear recipient and defined response SLA.
Remediation process documented with assigned owners by incident type.
Quality dashboard published and accessible to Data Owners and the compliance team.
Historical records of quality runs retained as auditable evidence.
Representativeness and bias analysis documented for AI training datasets (AI Act Art. 10).
Periodic rule review process when the business or source system changes.

Frequently Asked Questions About Data Quality Management

What is data quality management?

Data quality management is the set of processes, policies, roles, and tools that ensure an organization's data is accurate, complete, consistent, timely, and fit for its intended use. It isn't a one-time check: it's a continuous process with defined metrics, thresholds agreed with the business, automatic alerts, and remediation owners with a real mandate to act.

What are the dimensions of data quality?

The six main dimensions according to DAMA-DMBOK are: completeness (are all required values present?), uniqueness (are there duplicates?), validity (do values comply with format and range rules?), consistency (does the same data have the same value across systems?), timeliness (is the data available when needed?), and accuracy (does the value reflect reality?). For AI training data, add representativeness as an additional critical dimension.

What data quality tools exist?

The main ones in 2026 are: dbt tests for validations integrated into the pipeline, Great Expectations and Soda for quality SLAs as code, Monte Carlo and Bigeye for anomaly detection without predefined rules, and Collibra DQ for enterprise solutions. For teams just starting out, dbt tests plus Soda covers 80% of use cases at minimal cost.

Why does the AI Act require data quality management?

Article 10 of the AI Act requires that training, validation, and testing data for high-risk AI systems be relevant, sufficiently representative and, to the best extent possible, free of errors and complete, and that possible biases be examined. Without a documented, continuous data quality management process, demonstrating this compliance to AESIA or the AEPD with real evidence isn't possible.

What's the difference between data quality and data validation?

Data validation is a one-time check at a point in the pipeline. Data quality management is the complete system surrounding that validation: who defined the rules, with what threshold, who receives the alert when it fails, and what remediation process kicks in. Validation is a component of data quality management, not a substitute for it. An organization that only validates at ingestion has partial coverage; one that does data quality management has continuous coverage and auditable evidence.