The 6 Dimensions of Data Quality, Explained With Real Cases
When someone says "the data quality is bad," they're not saying much. Data can fail in very different ways: being incomplete, being wrong, contradicting itself, arriving late, having the wrong format, or being duplicated. Each of these failures is a different dimension of quality, requires a different measurement, and generates a different business impact. Understanding all six is the first step to building a data quality program that solves real problems.
Why Distinguishing the Dimensions Matters
A 94% overall quality rate can hide very different realities. A domain with 94% completeness and 100% accuracy has a manageable problem: there are empty fields that can be recovered. A domain with 100% completeness and 94% accuracy has a more serious problem: every field has a value, but roughly one in seventeen values is wrong, and that's much harder to detect and fix.
The six data quality dimensions are the industry standard per DAMA-DMBOK and the implicit reference of AI Act Article 10 when it requires training data to be "relevant, sufficiently representative, and to the best extent possible, free of errors and complete." Each of those words points to a specific dimension.
1. Completeness: The Most Visible and Most Ignored Problem
Completeness measures whether all required values are present. It's the easiest dimension to detect automatically — a null field is a null field — and, paradoxically, one of the most ignored in quality programs because it's seen as a minor problem.
What It Measures Exactly
There are three levels of completeness worth distinguishing. Field completeness measures what percentage of records have a value in a specific field. Record completeness measures what percentage of mandatory fields have a value in a given record. Entity completeness measures whether all expected records exist for an entity (for example, whether every product in the catalog has at least one associated price).
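The three levels can be sketched in plain Python. This is a minimal illustration, assuming records are dictionaries and that `None` or a blank string counts as missing. The function names, field names, and sample data are hypothetical.

```python
def is_present(value):
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is not None and str(value).strip() != ""


def field_completeness(records, field):
    """Share of records that have a value in one specific field."""
    if not records:
        return 1.0
    filled = sum(1 for r in records if is_present(r.get(field)))
    return filled / len(records)


def record_completeness(record, mandatory_fields):
    """Share of mandatory fields that have a value in one record."""
    filled = sum(1 for f in mandatory_fields if is_present(record.get(f)))
    return filled / len(mandatory_fields)


def entity_completeness(expected_ids, related_records, key):
    """Share of expected entities with at least one related record."""
    expected = set(expected_ids)
    seen = {r[key] for r in related_records}
    return len(expected & seen) / len(expected)


# Hypothetical sample data: three products, two of which have a price.
products = [
    {"product_id": "P1", "name": "Lamp", "category": "Home"},
    {"product_id": "P2", "name": "Desk", "category": ""},
    {"product_id": "P3", "name": None, "category": "Home"},
]
prices = [
    {"product_id": "P1", "price": 19.9},
    {"product_id": "P2", "price": 120.0},
]

name_rate = field_completeness(products, "name")
p2_rate = record_completeness(products[1], ["product_id", "name", "category"])
price_coverage = entity_completeness(["P1", "P2", "P3"], prices, "product_id")
```

In this sample all three rates come out at two thirds, but each one points to a different gap: a missing name, a blank category, and a product without a price.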
Real-World Impact Cases
In an e-commerce system, 8% of order records have an empty shipping address field. The data is operationally correct during checkout because the address is pulled in real time from the customer's profile. But when that order is exported to the logistics system, which expects the address embedded in the record, 8% of orders fail silently. The problem wasn't data quality at the source: it was incompleteness in the integration format.
In a training dataset for a credit scoring model, 12% of records have an empty payment history field. The model learns to make decisions without that field, but in production that field is usually present. The model doesn't generalize well because it learned from a biased subset of reality.
How to Measure It and What Threshold to Set
In dbt: a not_null test per field. In Soda: the missing_count and missing_percent checks. The threshold depends on the field: 100% for primary and foreign keys, ≥ 99% for critical business fields, ≥ 95% for relevant optional fields.
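As a rough sketch of how per-field thresholds might be applied, the snippet below computes a missing percentage for each column and compares the resulting completeness with the threshold for its field class. It does not use the Soda or dbt APIs. The column names, the classification of each field, and the sample values are assumptions made for illustration.

```python
# Minimum completeness per field class, taken from the thresholds above.
THRESHOLDS = {"key": 1.00, "critical": 0.99, "optional": 0.95}


def missing_percent(values):
    """Percentage of values that are None or blank."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    return 100.0 * missing / len(values)


def check_completeness(columns, field_classes):
    """Return {field: (completeness, passed)} for every classified field."""
    results = {}
    for field, field_class in field_classes.items():
        completeness = 1.0 - missing_percent(columns[field]) / 100.0
        results[field] = (completeness, completeness >= THRESHOLDS[field_class])
    return results


# Hypothetical columns of 100 values each.
columns = {
    "order_id": list(range(100)),                    # no missing values
    "shipping_address": ["addr"] * 92 + [None] * 8,  # 8% missing
    "gift_note": ["note"] * 96 + [""] * 4,           # 4% missing
}
field_classes = {
    "order_id": "key",
    "shipping_address": "critical",
    "gift_note": "optional",
}
report = check_completeness(columns, field_classes)
```

Here only `shipping_address` fails: 92% completeness is below the 99% expected of a critical field, while 96% is acceptable for an optional one.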
2. Accuracy: The Costliest and Hardest-to-Detect Failure
Accuracy measures whether a data value correctly reflects the reality it's meant to represent. It's the most critical dimension for decision quality and the hardest to measure automatically, because it requires knowing the reality to compare the data against.
What It Measures Exactly
Data can be present, correctly formatted, and technically valid, and still be inaccurate. A recorded weight of 75 kg when the real weight is 82 kg is complete, valid, and wrong. The name "John Smith" assigned to a person actually named "John Doe" meets every validity criterion and is completely inaccurate.
Real-World Impact Cases
In an asset management system, a fixed asset's book value is wrong in 3% of cases due to errors in manual depreciation entries. Balance sheets reconcile — because the errors offset each other in many cases — but per-asset profitability analyses are systematically wrong. The problem has been in the system for years because no automated control catches it: the values are within the expected range and have the correct format.
In a product recommendation model, user ratings have an accuracy bias: a bug in the rating form reversed the scale for three weeks (5 stars was recorded as 1 and vice versa). The model trained on that data systematically recommends the worst-rated products as if they were the best.
How to Measure It and What Threshold to Set
The most effective approach is validation against authoritative sources: master reference tables, official records (tax IDs, IBANs, postal codes), or designated systems of record. For fields without an external source of truth, business rules (expected ranges, historical distributions) are the best approximation. Reference threshold: ≥ 98%.
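Validation against an authoritative source can be sketched as a lookup against a reference table. The example assumes the system of record is available as a simple mapping. The identifiers, postal codes, and function name are hypothetical.

```python
def accuracy_rate(records, reference, key, field):
    """Share of checkable records whose field matches the reference value.

    Records whose key is absent from the reference cannot be verified and
    are reported separately rather than counted as right or wrong.
    """
    matches = 0
    checked = 0
    unverifiable = []
    for record in records:
        record_key = record[key]
        if record_key not in reference:
            unverifiable.append(record_key)
            continue
        checked += 1
        if record[field] == reference[record_key]:
            matches += 1
    rate = matches / checked if checked else None
    return rate, unverifiable


# Hypothetical system of record: customer_id -> postal code.
reference_postal_codes = {"C1": "28001", "C2": "08002", "C3": "41003"}

crm_records = [
    {"customer_id": "C1", "postal_code": "28001"},
    {"customer_id": "C2", "postal_code": "08020"},  # valid format, wrong value
    {"customer_id": "C3", "postal_code": "41003"},
    {"customer_id": "C4", "postal_code": "46001"},  # not in the reference
]

rate, unverifiable = accuracy_rate(
    crm_records, reference_postal_codes, "customer_id", "postal_code"
)
meets_threshold = rate is not None and rate >= 0.98
```

Keeping unverifiable records out of the rate is a design choice: counting them as correct would inflate accuracy, and counting them as wrong would mix an accuracy problem with a coverage problem in the reference source.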
3. Consistency: When the Data Says Different Things Depending on Where You Look
Consistency measures whether data is free of contradictions with other data in the same system or in related systems. It's the dimension that most frequently erodes trust in data, because its failures are visible in meetings: the same KPI gives different results depending on which system it's pulled from.
What It Measures Exactly
There are two types of inconsistencies. Internal inconsistencies occur within the same record: the country is "ES" but the phone prefix is "+1," or the cancellation date is earlier than the signup date. External inconsistencies occur between systems: the CRM shows 45,000 active customers, the billing system shows 43,200, and the data warehouse shows 46,800. Each is "the truth" according to the system generating it, but they're incompatible with each other.
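Both types can be illustrated with a short sketch: cross-field rules for a single record, and a relative spread for the same count reported by several systems. The prefix mapping, the record layout, and the function names are assumptions. The three customer counts are the ones quoted above.

```python
from datetime import date

# Hypothetical mapping of country codes to phone prefixes.
COUNTRY_PREFIX = {"ES": "+34", "US": "+1", "FR": "+33"}


def internal_inconsistencies(record):
    """Return the names of the cross-field rules a single record violates."""
    violations = []
    expected = COUNTRY_PREFIX.get(record["country"])
    if expected is not None and not record["phone"].startswith(expected):
        violations.append("country_vs_phone_prefix")
    cancelled = record.get("cancellation_date")
    if cancelled is not None and cancelled < record["signup_date"]:
        violations.append("cancellation_before_signup")
    return violations


def count_spread(counts):
    """Relative gap between the highest and lowest count across systems."""
    low, high = min(counts.values()), max(counts.values())
    return (high - low) / high


customer = {
    "country": "ES",
    "phone": "+1 555 0100",
    "signup_date": date(2024, 3, 1),
    "cancellation_date": date(2024, 2, 15),
}
violations = internal_inconsistencies(customer)

# The active-customer counts quoted above, one per system.
active_customers = {"crm": 45_000, "billing": 43_200, "warehouse": 46_800}
spread = count_spread(active_customers)
```

The sample record breaks both rules, and the spread between the billing system and the data warehouse is about 7.7% of the highest count.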
Real-World Impact Cases
An airline has the number of passengers carried last month calculated three different ways depending on the system: the flight operations system counts boarded passengers, the revenue management system counts tickets sold, and the data warehouse aggregates both with deduplication logic nobody documented correctly. The leadership committee can't compare operational performance with commercial performance because the underlying numbers are incompatible.
The root cause isn't technical: it's the absence of a canonical definition of "passenger carried" agreed upon by the owners of each system. That's exactly the job of the Data Steward and the Data Governance Committee.
How to Measure It and What Threshold to Set
In dbt: relationships tests and custom SQL tests validating conditions between fields. In Great Expectations: expect_column_pair_values_to_be_equal for field-level consistency. Inconsistencies between systems require cross-comparisons in the integration pipeline. Threshold: ≥ 99%.
4. Timeliness: Correct Data That Arrives Late Is Useless Data
Timeliness measures whether data is available when it's needed. It's the most frequently ignored dimension in quality programs because it's perceived as an infrastructure problem, not a quality one. That's a mistake: timeliness is a quality dimension as relevant as accuracy in contexts where decisions depend on real-time data or very short time windows.
What It Measures Exactly
There are two aspects of timeliness. Latency measures the time between when data changes in the source system and when it's available in the consuming system. Freshness measures whether the data in the consuming system matches the current state of reality, not a past state that's no longer relevant.
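The difference between the two aspects can be sketched as follows: latency is computed per change event, while freshness compares what the consumer currently holds against the current state of the source. The event layout, SKU names, and the 4-hour target are illustrative assumptions.

```python
from datetime import datetime, timedelta


def latencies(events):
    """Delay between a change in the source and its arrival downstream."""
    return [e["loaded_at"] - e["changed_at"] for e in events]


def stale_keys(source_state, consumer_state):
    """Keys whose value downstream no longer matches the source right now."""
    return sorted(
        key for key, value in source_state.items()
        if consumer_state.get(key) != value
    )


# Hypothetical change events for two stock records.
events = [
    {"sku": "A", "changed_at": datetime(2024, 5, 1, 9, 0),
     "loaded_at": datetime(2024, 5, 1, 12, 0)},
    {"sku": "B", "changed_at": datetime(2024, 5, 1, 11, 30),
     "loaded_at": datetime(2024, 5, 1, 12, 0)},
]
worst_latency = max(latencies(events))
within_target = worst_latency <= timedelta(hours=4)

# Current stock in the store system versus the last loaded snapshot.
store_stock = {"A": 0, "B": 7, "C": 3}
warehouse_stock = {"A": 2, "B": 7, "C": 3}
stale = stale_keys(store_stock, warehouse_stock)
```

In this sample the latency target is met, yet SKU "A" is still stale downstream, which shows that acceptable latency does not guarantee freshness at the moment of the query.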
Real-World Impact Cases
In a retail chain's inventory management system, in-store stock is updated in the data warehouse every 4 hours. The e-commerce system queries the warehouse to show real-time availability. During the 4 hours between updates, the system can confirm orders for products no longer available in store. The result is 2.3% of orders with availability issues, handled manually at a cost of €12 per incident. Multiplied by order volume, the annual cost of that latency exceeds the cost of implementing real-time synchronization.
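The arithmetic behind that comparison can be made explicit. The case gives the incident rate and the cost per incident but not the order volume or the cost of real-time synchronization, so both inputs below are hypothetical placeholders rather than figures from the case.

```python
from decimal import Decimal

# Figures stated in the case above.
INCIDENT_RATE = Decimal("0.023")    # 2.3% of orders
COST_PER_INCIDENT = Decimal("12")   # EUR per manually handled incident


def annual_latency_cost(orders_per_year):
    """Expected yearly cost of availability incidents, in EUR."""
    return Decimal(orders_per_year) * INCIDENT_RATE * COST_PER_INCIDENT


def break_even_orders(sync_cost_per_year):
    """Order volume at which incident handling costs as much as the fix."""
    return Decimal(sync_cost_per_year) / (INCIDENT_RATE * COST_PER_INCIDENT)


# Hypothetical inputs, not taken from the case.
cost = annual_latency_cost(500_000)
threshold = break_even_orders(100_000)
```

Each order carries an expected incident cost of about €0.28 (2.3% of €12), so the break-even volume is simply the yearly synchronization cost divided by that amount.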
How to Measure It and What Threshold to Set
Monitor the last-update timestamp of each critical table and compare it against the agreed SLA. In dbt, a custom test that alerts if max(updated_at) < current_timestamp - interval 'X hours' is enough to catch update delays. The threshold depends on the domain: minutes for operational data, hours for analytical data, days for master data.
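The same comparison can be sketched outside the warehouse. The snippet assumes the latest `updated_at` per table has already been fetched. The table names and SLA values are hypothetical and only follow the minutes, hours, and days guidance above.

```python
from datetime import datetime, timedelta

# Hypothetical SLAs per table; the domains follow the guidance above.
SLA = {
    "stock_levels": timedelta(minutes=15),  # operational data
    "sales_daily": timedelta(hours=6),      # analytical data
    "product_master": timedelta(days=2),    # master data
}


def late_tables(last_updates, now):
    """Return tables whose latest update is older than their SLA allows."""
    late = {}
    for table, max_updated_at in last_updates.items():
        age = now - max_updated_at
        if age > SLA[table]:
            late[table] = age
    return late


now = datetime(2024, 5, 1, 12, 0)
last_updates = {
    "stock_levels": datetime(2024, 5, 1, 8, 0),      # 4 hours old
    "sales_daily": datetime(2024, 5, 1, 7, 0),       # 5 hours old
    "product_master": datetime(2024, 4, 30, 12, 0),  # 1 day old
}
alerts = late_tables(last_updates, now)
```

Passing `now` as a parameter instead of reading the clock inside the function keeps the check deterministic and easy to test.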
5. Validity: The Rules Data Must Meet Before Being Useful
Validity measures whether a data value meets the format, range, and domain rules defined for that field. It's the easiest dimension to implement automatically and the one that shows results fastest in a quality program starting from scratch.
What It Measures Exactly
There are three types of validity rules. Format rules verify the value has the correct structure (an email with @, a tax ID with a valid check character, a date that exists on the calendar). Range rules verify the value is within acceptable limits (an age between 0 and 120, a percentage between 0 and 100). Domain rules verify the value belongs to an allowed set (an order status is one of those defined in the catalog).
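A minimal sketch of the three rule types, using only the Python standard library. The email pattern is deliberately loose, and the status catalog and record layout are hypothetical.

```python
import re
from datetime import date

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose
ORDER_STATUSES = {"pending", "paid", "shipped", "cancelled"}  # hypothetical catalog


def valid_email(value):
    """Format rule: the value has the basic structure of an email address."""
    return bool(EMAIL_PATTERN.match(value))


def valid_date(text):
    """Format rule: the ISO string is a date that exists on the calendar."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def valid_age(value):
    """Range rule: age between 0 and 120 inclusive."""
    return 0 <= value <= 120


def valid_status(value):
    """Domain rule: the status belongs to the allowed set."""
    return value in ORDER_STATUSES


def validate(record):
    """Return the names of the fields that break a validity rule."""
    checks = {
        "email": valid_email(record["email"]),
        "birth_date": valid_date(record["birth_date"]),
        "age": valid_age(record["age"]),
        "status": valid_status(record["status"]),
    }
    return [field for field, ok in checks.items() if not ok]


record = {
    "email": "ana@example.com",
    "birth_date": "2023-02-30",  # 30 February does not exist
    "age": 134,
    "status": "Shipped",         # case variant outside the catalog
}
invalid_fields = validate(record)
```

The sample record passes the email check and fails the other three. None of these checks says whether a value is accurate, only whether it is admissible.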
Real-World Impact Cases
In a patient management system, the birth date field allows future dates to be entered without form validation. 0.7% of records have birth dates later than today, making age calculations negative and causing age-group filters to exclude them from every analysis. In a healthcare context, systematically excluding those patients from analyses can distort clinical study results.
In a training dataset for a classification model, the product category field has 47 distinct values, but the official catalog only defines 12. The other 35 values are typographical variants, encoding errors, and obsolete categories the model learns as if they were distinct categories, reducing its ability to generalize.
How to Measure It and What Threshold to Set
In dbt: accepted_values for domains, expression_is_true (from the dbt_utils package) for ranges, and regex tests with custom macros. In Great Expectations: expect_column_values_to_match_regex and expect_column_values_to_be_between. Reference threshold: ≥ 99.5% for critical business fields, 100% for keys and identifier fields.
6. Uniqueness: Duplicates That Cost More Than They Seem To
Uniqueness measures whether duplicate records exist where they shouldn't. It's the quality problem with the highest direct economic impact, and the one most often discovered late: after duplicates have been in the system for months or years, contaminating analyses, models, and decisions.
What It Measures Exactly
There are two types of duplicates. Exact duplicates are identical records, or records with the same unique identifier, appearing more than once. They're easy to detect with a SQL query. Fuzzy duplicates are records representing the same real entity but with variations: "John A. Smith," "J. Smith," and "JOHN SMITH" are three distinct system records representing the same person. Detecting them requires text similarity algorithms.
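Both kinds of detection can be sketched in a few lines. The fuzzy part below uses a plain Levenshtein edit distance on normalized names, which is only one possible text similarity measure and far simpler than what a dedicated matching tool does. The record layout and function names are hypothetical.

```python
import re
from collections import Counter


def exact_duplicates(records, key):
    """Key values that appear more than once."""
    counts = Counter(r[key] for r in records)
    return sorted(k for k, n in counts.items() if n > 1)


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z\s]", "", name.lower())
    return " ".join(cleaned.split())


def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical after cleanup."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


customers = [
    {"customer_id": 1, "name": "John A. Smith"},
    {"customer_id": 2, "name": "J. Smith"},
    {"customer_id": 3, "name": "JOHN SMITH"},
    {"customer_id": 3, "name": "JOHN SMITH"},
]
repeated_ids = exact_duplicates(customers, "customer_id")
score_full = similarity("John A. Smith", "JOHN SMITH")
score_initial = similarity("J. Smith", "JOHN SMITH")
```

The abbreviated name scores lower than the variant with a middle initial. Where to draw the line between "same person" and "different person" is a business decision, and in practice names are usually combined with other attributes such as email or date of birth.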
Real-World Impact Cases
At a telecom company, migrating two separate CRMs after a merger creates a 4.2% duplicate customer rate in the consolidated database. The sales team doesn't know and assigns two different reps to the same customer. The customer gets two renewal calls the same day, with different offers and different prices. The impact isn't just operational: it creates a customer experience that destroys trust and, in some cases, triggers a switching request.
In a recommendation model trained on purchase history, duplicate customers cause the model to overweight their purchase patterns: a customer with three records carries three times the training weight of a customer with a single record. The model learns to recommend products liked by duplicated customers, not by the real customer base.
How to Measure It and What Threshold to Set
For exact duplicates: a unique test in dbt on the natural key. For fuzzy duplicates: tools like Splink (open source) or IBM MDM (enterprise), which rely on string comparison algorithms such as Levenshtein or Jaro-Winkler. The exact-duplicate threshold should be 0% for primary keys and < 0.5% for combinations of business attributes.
Summary: The 6 Dimensions at a Glance
| Dimension | Question It Answers | Reference Threshold | dbt Tool |
|---|---|---|---|
| Completeness | Are all needed values present? | ≥ 99% | not_null |
| Accuracy | Does the value reflect reality? | ≥ 98% | Custom test vs source |
| Consistency | No internal or cross-system contradictions? | ≥ 99% | relationships, custom SQL |
| Timeliness | Does data arrive when needed? | Per domain SLA | Test on max(updated_at) |
| Validity | Does the value meet format, range, and domain rules? | ≥ 99.5% | accepted_values, regex |
| Uniqueness | Are there no duplicates? | 0% on PK; < 0.5% on natural key | unique |
The Dimension the AI Act Adds: Representativeness
The six DAMA standard dimensions cover data quality in operational and analytical environments. The AI Act implicitly adds a seventh dimension for training datasets: representativeness.
A dataset can be complete, accurate, consistent, timely, valid, and free of duplicates and still train a biased model if it doesn't adequately represent the distribution of the population the system will operate on in production. A credit scoring model trained mainly on urban, mid-to-high income customer data can be technically correct on all six dimensions and systematically unfair to rural or low-income customers.
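One simple way to start quantifying this is to compare each group's share of the training data with its share of the target population. This is a sketch only: the groups, counts, population shares, and the 5-point tolerance are all hypothetical, and a real analysis would be specific to the AI system's use case.

```python
def shares(counts):
    """Convert group counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_gaps(training_counts, population_shares):
    """Training share minus population share, per group."""
    training = shares(training_counts)
    return {
        group: training.get(group, 0.0) - expected
        for group, expected in population_shares.items()
    }


def underrepresented(gaps, tolerance):
    """Groups whose training share falls short by more than the tolerance."""
    return sorted(g for g, gap in gaps.items() if gap < -tolerance)


# Hypothetical figures for a credit scoring dataset.
training_counts = {"urban": 8_500, "rural": 1_500}
population_shares = {"urban": 0.70, "rural": 0.30}

gaps = representation_gaps(training_counts, population_shares)
flagged = underrepresented(gaps, tolerance=0.05)
```

With these figures rural customers make up 15% of the training data against 30% of the population, so the group is flagged even though every individual record could pass all six standard checks.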
Article 10.4 of the AI Act requires taking into account "the necessary statistical properties, including the representation of the persons or groups of persons the system will operate on." Measuring and documenting that representativeness is a regulatory obligation no standard data quality tool covers natively — it requires domain-specific analysis for each AI system's use case.
For a look at how this fits into the full data governance framework for the AI Act, see How to Implement an Effective Data Governance Framework in the AI Act Era.
Why the Dimensions Aren't Independent
The most common mistake when implementing a data quality program is treating each dimension independently. In practice, failures in one dimension can cause failures in others. A completeness problem — incomplete records — can cause consistency problems if the empty field gets filled with the wrong default value downstream. A uniqueness problem — duplicates — can cause accuracy problems if the deduplication process picks the wrong record as the master.
The root cause analysis of any quality problem should explore all six dimensions, not just the one where the failure became visible. The symptom and the cause can live in different dimensions.
Conclusion: The Dimensions Are the Vocabulary, Not the Solution
Knowing the six dimensions of data quality lets you diagnose problems precisely, communicate them clearly, and measure their evolution objectively. But diagnosis isn't the solution.
Knowing the Customers domain's duplicate rate is 3.2% doesn't solve the problem. What solves it is understanding where those duplicates come from, what process generates them, what business rule should be applied to identify the master record, and who has the authority to decide that rule. That's exactly the Data Governance conversation the technical team can't have alone.
To understand how to measure these dimensions with concrete KPIs and actionable dashboards, see How to Measure Data Quality: KPIs, Thresholds and Dashboards.
Checklist: Assessing the 6 Dimensions
- Completeness assessed in mandatory, relevant fields and relationships between entities.
- Accuracy validated against authoritative sources or master reference tables.
- Consistency checked both in internal record rules and across related systems.
- Timeliness monitored with a per-domain SLA and active delay alerts.
- Validity implemented as code in the pipeline (format, range, allowed domain).
- Uniqueness verified on primary keys (0%) and on natural business keys.
- Representativeness analyzed for AI training datasets (AI Act Art. 10.4).
- Root cause analysis documented when a failure in one dimension affects others.
Frequently Asked Questions
What are the 6 dimensions of data quality?
The six standard dimensions per DAMA-DMBOK are: completeness (no nulls in mandatory fields), accuracy (the data reflects reality), consistency (no contradictions between fields or systems), timeliness (available when needed), validity (meets format and domain rules), and uniqueness (no duplicates).
Which data quality dimension is the most important?
It depends on the use case. For high-risk AI systems, accuracy and representativeness are the most critical. For real-time operations, timeliness. For financial analysis, consistency and uniqueness. No dimension is universally more important: each one's weight should be set based on its impact for the specific use case.
How are data consistency problems detected?
Through cross-rules between fields in the same record or across related tables. In dbt, with relationships tests and custom SQL tests. In Great Expectations, with expect_column_pair_values_to_be_equal. Inconsistencies between systems require cross-comparisons in the integration pipeline.
What's the difference between validity and accuracy?
Validity measures whether the value meets format and domain rules (a correctly formatted tax ID). Accuracy measures whether the value reflects reality (that tax ID is the correct one for that person). Data can be valid but inaccurate, and that's impossible to detect with format rules alone.
How does data quality affect AI models?
A model learns from patterns in its training data. Inaccurate data produces models that learn incorrect patterns. Data with representativeness problems produces models that don't generalize well. Duplicate data overweights certain examples during training. The AI Act requires documenting training dataset quality analysis precisely for these reasons.