What Data Lineage Is and Why the AI Act Requires It

Q: What is data lineage?

Data lineage is the complete traceability of a piece of data's journey: from its origin in a source system to its final point of consumption, through every intermediate transformation. It answers the question: where does this data come from, what happened to it along the way, and where is it used?

Q: What's the difference between technical lineage and business lineage?

Technical lineage records the data's physical journey: which table feeds which table, which pipeline runs which SQL transformation. Business lineage adds the context: what business transformation that step represents, what rule was applied, who approved it, and whether the resulting data is subject to any regulatory restriction. Both layers are necessary; technical lineage without business context isn't enough for the AI Act.

Q: Why does the AI Act require data lineage?

Article 10 of the AI Act requires documenting the training datasets of high-risk AI systems: their origin, the transformations applied, and the selection criteria. Without lineage, that documentation is neither traceable nor verifiable. Article 12 additionally requires event logging over the system's lifecycle. Both requirements presuppose a traceability infrastructure that data lineage provides.

Q: What tools automate data lineage?

The main tools for automating lineage are: dbt (automatic lineage between SQL models), OpenMetadata (end-to-end lineage with connectors to Snowflake, Airflow, dbt, and Power BI), Apache Atlas (legacy Hadoop environments), Microsoft Purview (Azure ecosystem), and Collibra (enterprise with business lineage). The dbt plus OpenMetadata combination covers 80% of use cases in modern cloud stacks.

Q: How much does implementing data lineage cost?

The cost depends on the stack and the tool chosen. In stacks with dbt and Snowflake, basic technical lineage can be automated in under a week with dbt Docs at no license cost. End-to-end lineage with self-hosted OpenMetadata adds two to four weeks of setup. Enterprise solutions like Collibra or Purview have significant license costs but reduce implementation time with included support.

What Data Lineage Is

Data lineage is the complete traceability of a piece of data's journey: from its origin in a source system — a transactional database, an external file, an API, a sensor — to its final point of consumption, which could be a dashboard, a business API, a machine learning model, or a regulatory report. Along that journey, the data goes through multiple transformations: it's cleaned, aggregated, joined with other data, recalculated, filtered. Lineage records every one of those steps.

In practical terms, lineage answers three questions many organizations today can't answer immediately:

Where does this data come from? What system generated it, when, and under what conditions.
What happened to it along the way? What transformations, filters, aggregations, or enrichments it underwent before reaching its destination.
Where is it used? What reports, models, APIs, or decisions depend on this data. This is what lets you assess the impact of a quality problem.
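The three questions above map naturally onto a directed graph. The following is a minimal sketch, not a production design: the table and system names in EDGES are hypothetical, and each pair simply means "left feeds right". Walking the graph backwards answers "where does it come from", and walking it forwards answers "where is it used".

```python
from collections import deque

# Hypothetical edges: each pair means "left feeds right".
EDGES = [
    ("crm.orders", "raw_transactions"),
    ("raw_transactions", "stg_transactions"),
    ("stg_transactions", "fact_revenue"),
    ("fact_revenue", "revenue_dashboard"),
    ("fact_revenue", "churn_model"),
]


def build_index(edges):
    """Index edges in both directions for upstream and downstream lookups."""
    up, down = {}, {}
    for src, dst in edges:
        down.setdefault(src, []).append(dst)
        up.setdefault(dst, []).append(src)
    return up, down


def _walk(start, neighbours):
    """Breadth-first traversal returning every node reachable from start."""
    seen, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


def upstream(node, edges):
    """Where does this data come from?"""
    up, _ = build_index(edges)
    return _walk(node, up)


def downstream(node, edges):
    """Where is it used?"""
    _, down = build_index(edges)
    return _walk(node, down)
```

The middle question, what happened along the way, would need each edge to carry a description of its transformation; that is left out here to keep the sketch short.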

Types of Data Lineage: Technical, Business and Operational

Not all lineage is the same. Depending on the level of detail and context captured, three complementary types stand out:

Technical Lineage

Records the data's physical journey at the system level: which table feeds which table, which pipeline runs which SQL transformation, which Airflow job runs which task on which dataset. This is the lineage tools like dbt and OpenMetadata generate automatically from code and platform metadata. It's necessary but not sufficient: knowing that table fact_revenue comes from raw_transactions doesn't explain what business rule was applied to calculate it.

Business Lineage

Adds business context to the technical journey: what business transformation each step represents, what rule was applied, who defined and approved it, and whether the resulting data is subject to any regulatory or privacy restriction. This is the lineage Data Stewards manually enrich on top of the automatic technical base. It answers why the data has the value it has, not just how it was calculated.

Operational Lineage

Records the specific runs: when the pipeline ran, with what input data, what volume it processed, whether there were errors, and what outputs it generated. This is the lineage that lets you reconstruct exactly what happened in a specific run, essential for investigating quality incidents or answering the regulator about a specific decision made by an AI system.
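One way to picture how the three types complement each other is as three records attached to the same lineage step. This is only an illustrative data model: the class and field names are assumptions made for this sketch and do not correspond to any tool's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TechnicalEdge:
    source: str
    target: str
    transformation: str  # e.g. the SQL file or job that produces the target


@dataclass
class BusinessContext:
    rule: str
    approved_by: str
    restrictions: list = field(default_factory=list)


@dataclass
class RunRecord:
    started_at: datetime
    rows_in: int
    rows_out: int
    status: str


@dataclass
class LineageStep:
    edge: TechnicalEdge
    business: Optional[BusinessContext] = None
    runs: list = field(default_factory=list)

    def missing_layers(self):
        """List the lineage layers this step still lacks."""
        missing = []
        if self.business is None:
            missing.append("business")
        if not self.runs:
            missing.append("operational")
        return missing


# Hypothetical step: the technical edge exists, the other layers do not yet.
step = LineageStep(
    edge=TechnicalEdge("raw_transactions", "fact_revenue", "models/fact_revenue.sql")
)
```

A step with only its technical edge reports both other layers as missing, which is the typical state right after automated lineage is switched on.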

Why the AI Act Makes Lineage Mandatory

The AI Act doesn't explicitly mention the word "lineage," but Articles 10 and 12 directly presuppose it. For high-risk AI systems:

Article 10 (data and data governance): requires documenting training, validation, and testing datasets: their origin, transformations applied, selection and exclusion criteria, and representativeness metrics. Without technical and business lineage, that documentation isn't traceable: it's a Word document someone wrote once that doesn't reflect the data's reality.
Article 12 (record-keeping): requires high-risk systems to technically allow the automatic recording of events (logs) over the system's lifetime, so that the circumstances of a relevant decision can be reconstructed. That's operational lineage applied to the model's lifecycle: what input data it received, what output it produced, in what context, and with what confidence level.
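To make the event-logging idea concrete, here is a hedged sketch of what one logged decision could look like. The field names (model_version, input_sha256, training_lineage_ref) are assumptions chosen for illustration; the AI Act does not prescribe this schema, and a real system would define its fields with legal and security input.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_event(model_version, features, output, lineage_ref, when=None):
    """Build one log record for a model decision.

    The input is stored as a hash of its canonical JSON form, so the record
    can be matched to the original payload without copying personal data.
    """
    when = when or datetime.now(timezone.utc)
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return {
        "timestamp": when.isoformat(),
        "model_version": model_version,
        "input_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "output": output,
        # Hypothetical pointer to the lineage graph node of the training set.
        "training_lineage_ref": lineage_ref,
    }


def to_log_line(event):
    """Serialize the event as one JSON line for an append-only log."""
    return json.dumps(event, sort_keys=True)
```

Linking each event to a lineage reference is the design choice that ties operational logs back to the dataset documentation, rather than keeping the two in separate silos.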

The practical consequence is that an organization deploying a high-risk AI system without documented lineage can't sustainably comply with Article 10. It can generate a one-time compliance document, but that document will become obsolete the moment the dataset or pipeline changes. Only automated lineage guarantees documentation stays current without manual work.

For a look at how lineage fits into the full governance framework, see How to Implement an Effective Data Governance Framework in the AI Act Era.

How Data Lineage Works in Practice

Automatic Lineage With dbt

dbt is today the de facto standard for data transformation in modern stacks with Snowflake, BigQuery, or Databricks. One of its most valuable governance capabilities is automatic lineage: because dbt knows the dependencies between models (which SQL model depends on which other), it automatically generates a lineage graph showing the full journey from source tables (source) to final models (mart). This graph is published in dbt Docs, and catalog tools like OpenMetadata can ingest it from dbt's generated artifacts once their dbt connector is pointed at them.

dbt's lineage covers the journey within the transformation layer. For truly end-to-end lineage, you need to connect it with ingestion lineage — which source system feeds the raw tables — and with consumption lineage — which dashboards or AI models consume the final marts.
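The dependency information dbt relies on is written to its manifest.json artifact, which includes a parent_map of each node's direct parents. As a rough sketch, the edges can be read out like this; the project name shop and the node identifiers in SAMPLE are hypothetical, and the default path assumes dbt's usual target directory.

```python
import json


def load_manifest(path="target/manifest.json"):
    """Load a dbt manifest from disk (path assumes dbt's default target dir)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def edges_from_manifest(manifest):
    """Return sorted (parent, child) pairs from the manifest's parent_map."""
    edges = []
    for child, parents in manifest.get("parent_map", {}).items():
        for parent in parents:
            edges.append((parent, child))
    return sorted(edges)


# Hypothetical, heavily trimmed manifest used only to illustrate the shape.
SAMPLE = {
    "parent_map": {
        "source.shop.raw.raw_transactions": [],
        "model.shop.stg_transactions": ["source.shop.raw.raw_transactions"],
        "model.shop.fact_revenue": ["model.shop.stg_transactions"],
    }
}
```

Calling edges_from_manifest(SAMPLE) yields the two transformation-layer edges. Ingestion and consumption edges are not in the manifest and have to come from other systems.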

End-to-End Lineage With OpenMetadata

OpenMetadata connects lineage from the different stack layers into a single navigable graph. Through its connectors, it imports technical lineage from dbt, ingestion lineage from Airflow or Fivetran, and consumption lineage from Power BI or Tableau. The result is a complete view of the data's journey from the transactional system to the dashboard or AI model, with every step documented and navigable from the catalog interface.

On top of that technical graph, Data Stewards can add business context: descriptions of transformations, rules applied, owners of each step, and sensitivity classifications. Business lineage is built incrementally on top of the automated technical base.
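Conceptually, connecting the layers means merging edge lists from different systems and then checking where the chain breaks. The sketch below illustrates that idea in plain Python; it does not use OpenMetadata's API, and all table and system names are hypothetical.

```python
def stitch(*layers):
    """Merge edge lists from several layers, dropping duplicate edges."""
    merged = []
    for layer in layers:
        for edge in layer:
            if edge not in merged:
                merged.append(edge)
    return merged


def untraced_roots(edges, known_sources):
    """Nodes with no upstream edge that are not declared source systems.

    Each one marks a point where end-to-end traceability is broken.
    """
    targets = {dst for _, dst in edges}
    nodes = {node for edge in edges for node in edge}
    return sorted(n for n in nodes if n not in targets and n not in known_sources)


# Hypothetical edges as they might arrive from three different connectors.
ingestion = [("erp.bookings", "raw_bookings")]
transformation = [("raw_bookings", "fact_bookings"), ("raw_fares", "fact_bookings")]
consumption = [("fact_bookings", "bookings_dashboard")]

graph = stitch(ingestion, transformation, consumption)
gaps = untraced_roots(graph, known_sources={"erp.bookings"})
```

Here raw_fares shows up as a gap: it feeds a mart, but no ingestion edge explains where it comes from.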

Snowflake Lineage With Access History

Snowflake natively records access history and dependencies between objects: which query read which table, which materialized view depends on which base table, which user ran which operation. This information, combined with dbt lineage and OpenMetadata or Purview connectors, lets you build complete operational lineage with no extra infrastructure. Where several entities share the same data platform, as in a multi-airline group, this operational lineage is also the technical evidence backing access reviews and security audits.
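As an illustration of how access records turn into lineage, the sketch below derives read-to-write edges from rows shaped like Snowflake's ACCESS_HISTORY view. It assumes the rows were already fetched and that the BASE_OBJECTS_ACCESSED and OBJECTS_MODIFIED columns were parsed into Python lists under lower-cased keys; that row shape, and the object names, are assumptions of this example.

```python
def edges_from_access_history(rows):
    """Map (read_object, written_object) to the query ids that link them."""
    edges = {}
    for row in rows:
        reads = [obj["objectName"] for obj in row.get("base_objects_accessed", [])]
        writes = [obj["objectName"] for obj in row.get("objects_modified", [])]
        for src in reads:
            for dst in writes:
                if src != dst:
                    edges.setdefault((src, dst), []).append(row["query_id"])
    return edges


# Hypothetical rows: one query that writes a mart, one that only reads it.
ROWS = [
    {
        "query_id": "q1",
        "base_objects_accessed": [{"objectName": "DB.RAW.BOOKINGS"}],
        "objects_modified": [{"objectName": "DB.MART.FACT_BOOKINGS"}],
    },
    {
        "query_id": "q2",
        "base_objects_accessed": [{"objectName": "DB.MART.FACT_BOOKINGS"}],
        "objects_modified": [],
    },
]
```

Keeping the query ids on each edge preserves the operational detail: the edge says what depends on what, and the ids say which runs established it.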

Tools to Automate Data Lineage

Tool	Lineage type	Main integration	Automation level	Cost
dbt + dbt Docs	Technical (transformation)	Snowflake, BigQuery, Databricks, Redshift	High — native in code	Free
OpenMetadata	End-to-end technical	dbt, Airflow, Snowflake, Power BI, Tableau	High — automatic connectors	Open source / Paid SaaS
Microsoft Purview	Technical + business	Azure, Power BI, SQL Server, M365	High in Microsoft ecosystem	Usage-based pricing (Azure)
Apache Atlas	Technical (Hadoop)	Hive, Spark, Kafka, HBase	Medium — requires setup	Open source
Collibra	Technical + business	Snowflake, dbt, Tableau, SAP	High with enterprise connectors	High (enterprise license)
Alation	Technical + business	Snowflake, BigQuery, Looker, Tableau	High — automatic crawling	Mid-high

For most organizations with a modern cloud stack, the dbt + OpenMetadata combination covers end-to-end lineage at minimal cost with high automation. Moving to enterprise solutions makes sense when environment complexity, business workflow requirements, or compliance needs exceed what open source options can offer.

How to Implement Data Lineage Step by Step

Step 1: Start With Critical Domains, Not Everything

The most common mistake is trying to build lineage for all of the organization's data from day one. The result is a project that never ends. Start with the three to five domains with the most business or regulatory compliance impact: data feeding high-risk AI systems, data underpinning leadership reports, or data subject to specific regulation (financial data, personal data, critical operational data).

Step 2: Automate Technical Lineage Before Documenting Business Lineage

Set up your lineage tool's connectors to the data platforms before writing a single manual description. Technical lineage should flow automatically. If the technical graph depends on manual work, it'll be outdated from the first change to code or schema. With dbt and OpenMetadata, this initial setup can be completed in under a week on standard stacks.

Step 3: Assign Data Stewards to Enrich the Business Context

On top of the automatic technical graph, Data Stewards add the business context: what business transformation each step represents, what rule was applied, who approved it, and whether there are usage restrictions. This phase is iterative: you don't need to document everything at once. Start with the most-queried nodes — the tables or models generating the most questions — and expand from there.

Step 4: Link Lineage With the Catalog and Access Control

Isolated lineage has limited value. Its full potential is unlocked when it's integrated with the data catalog — where the user can see any asset's lineage directly from its record — and with access control — where the Data Owner can see what data in their domain feeds which downstream systems before approving or denying access. To see how to structure these integrations, see What a Data Catalog Is and What It's Really For.

Step 5: Document AI Dataset Spec Sheets With Lineage as the Foundation

For every high-risk AI system, create a dataset spec sheet that uses lineage as its backbone: data origin, transformations applied (with a reference to the lineage graph node), selection and exclusion criteria, and owner of each layer. This sheet is the documentary evidence for AI Act Article 10. With automated lineage as its foundation, it stays up to date on its own when the pipeline changes; without it, it's a static document that ages from the moment it's created.
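A spec sheet with lineage as its backbone can be generated rather than written by hand. The following is a minimal sketch under stated assumptions: lineage steps are given as dictionaries with source, target, and transformation keys, owners is a plain mapping, and the output layout is illustrative rather than a format the AI Act defines.

```python
def build_spec_sheet(dataset, steps, owners, selection_criteria):
    """Assemble a dataset spec sheet by walking lineage upstream."""
    by_target = {}
    for step in steps:
        by_target.setdefault(step["target"], []).append(step)

    transformations, origins = [], []
    pending, visited = [dataset], set()
    while pending:
        node = pending.pop()
        if node in visited:
            continue
        visited.add(node)
        incoming = by_target.get(node, [])
        if not incoming and node != dataset:
            origins.append(node)  # nothing feeds this node: it is an origin
        for step in incoming:
            transformations.append(
                {"lineage_node": step["target"], "transformation": step["transformation"]}
            )
            pending.append(step["source"])

    return {
        "dataset": dataset,
        "origins": sorted(origins),
        "transformations": transformations,
        "owners": {n: owners.get(n, "UNASSIGNED") for n in sorted(visited)},
        "selection_criteria": selection_criteria,
    }


# Hypothetical lineage for a training set.
STEPS = [
    {"source": "raw_transactions", "target": "stg_transactions",
     "transformation": "deduplicate and cast types"},
    {"source": "stg_transactions", "target": "training_set",
     "transformation": "filter to last 24 months"},
]
sheet = build_spec_sheet(
    "training_set", STEPS,
    owners={"training_set": "ml-team", "stg_transactions": "data-eng"},
    selection_criteria="customers with at least one purchase",
)
```

Because the sheet is derived from the steps, regenerating it after a pipeline change updates origins and transformations automatically; nodes without an owner surface as UNASSIGNED instead of being silently omitted.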

What I've Seen in Complex Environments

Broken Lineage Between Layers in a Multi-Airline Group

In an environment with several airlines sharing a Snowflake data platform, technical lineage within each airline was reasonably well documented in dbt. The problem was cross-entity lineage: data flowing from one airline's systems to the group's shared models had no documented traceability. When the audit team asked where a specific field in consolidated reporting came from, the investigation took days.

The solution was extending OpenMetadata's connectors to cover the cross-entity flow — including the integration jobs between Snowflake schemas — and designating a corporate Data Steward per domain responsible for maintaining the business context of the integration nodes. The consolidated lineage cut audit query response time from days to minutes.

Lineage as the Foundation for AI Act Dataset Spec Sheets

In projects where analytical models' training data came from multiple Snowflake tables processed by dbt, the automatic lineage generated by dbt Docs and enriched in OpenMetadata provided the technical foundation for the dataset spec sheets AI Act Article 10 requires. Instead of drafting a compliance document from scratch, the governance team exported the training dataset's lineage graph — with origin, transformations, and owners — and completed it with the quality and representativeness metrics Data Stewards had documented. The result was an automatically updatable spec sheet, not a static PDF.

The Impact of Missing Lineage on Quality Incidents

In environments without documented lineage, a data quality incident — a null field that shouldn't be, an out-of-range value, an unexpected duplicate — can take hours or days to locate because nobody knows which pipelines it affects or which downstream systems depend on the affected data. With end-to-end lineage, the same incident is located in minutes: the graph shows exactly which tables and dashboards are using the affected field, allowing remediation to be prioritized and the real impact communicated to the corresponding Data Owners.

Common Mistakes in Data Lineage Implementation

Technical-only lineage, with no business context. Knowing table A feeds table B isn't enough for the AI Act or the business. The context of what business transformation that step represents is what turns technical lineage into real compliance evidence.
Manual lineage in spreadsheets. Documenting lineage in Excel or a wiki guarantees it'll be outdated within weeks. Sustainable lineage is the kind generated automatically from code and platform metadata.
Partial coverage that creates false security. Having lineage within the transformation layer but not in ingestion or consumption gives an incomplete picture that can be more dangerous than having none at all: it suggests traceability is solved when it actually has critical gaps.
Lineage disconnected from the catalog. A lineage graph that only exists in the lineage tool and isn't navigable from the data catalog is a capability few people use. Integration between lineage and catalog is what makes lineage accessible to business users, not just the technical team.
Not updating lineage when the pipeline changes. If the code update process doesn't include updating business lineage, the technical graph will evolve on its own but the business context will become obsolete. Maintaining business lineage must be part of the change process, not a separate task done "when there's time."
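One way to make business lineage part of the change process is a check that fails when a model lacks its business context. The sketch below reads the description and meta fields that dbt records for each node in manifest.json; the required meta keys owner and approved_by are a hypothetical team convention, not something dbt enforces.

```python
def models_missing_context(manifest, required_meta=("owner", "approved_by")):
    """Return {model_id: [missing items]} for models lacking business context."""
    problems = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        missing = []
        if not (node.get("description") or "").strip():
            missing.append("description")
        meta = node.get("meta") or {}
        missing += [key for key in required_meta if not meta.get(key)]
        if missing:
            problems[unique_id] = missing
    return problems


# Hypothetical, trimmed manifest: one documented model, one undocumented.
MANIFEST = {
    "nodes": {
        "model.shop.fact_revenue": {
            "resource_type": "model",
            "description": "Revenue net of refunds.",
            "meta": {"owner": "finance", "approved_by": "cfo-office"},
        },
        "model.shop.stg_transactions": {
            "resource_type": "model",
            "description": "",
            "meta": {"owner": "data-eng"},
        },
    }
}
```

Run in a CI step, a non-empty result could block the merge, so a code change cannot ship without its business context.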

Conclusion: Data Lineage Is the Memory of the Data Ecosystem

Without data lineage, the organization doesn't know where its numbers come from, can't explain why an AI model made a decision, and can't prove to the regulator that its training datasets are what it claims they are. Lineage isn't an advanced feature for exceptionally mature organizations: it's the basic infrastructure for trust in data.

With the AI Act in force and its obligations for high-risk systems phasing in, that infrastructure goes from being a best practice to a legal requirement. Organizations that already have automated lineage have a significant advantage: Article 10 compliance is a byproduct of their normal operations, not an additional project. Those that don't have it should start now, with critical domains and the tool that best fits their current stack.

Checklist: Operational Data Lineage

Critical domains identified and prioritized for initial lineage implementation.
Technical lineage connectors configured with data platforms (dbt, Snowflake, Airflow, Power BI).
End-to-end lineage graph visible: ingestion → transformation → consumption.
Cross-entity lineage documented in environments with multiple source systems or subsidiaries.
Data Stewards assigned for business lineage enrichment per domain.
Business context documented at critical nodes: what transformation it represents, what rule was applied, and who approved it.
Lineage integrated and navigable from the data catalog.
AI dataset spec sheets built on top of the lineage graph (AI Act Art. 10).
Operational lineage (run logging) active for high-risk systems (AI Act Art. 12).
Business lineage update process integrated into the pipeline change cycle.
Lineage coverage dashboard: percentage of assets with documented lineage per domain.
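The coverage figure in the last checklist item is a simple ratio per domain. A minimal sketch, assuming each asset is exported from the catalog as a dictionary with a domain and a has_lineage flag (both names are assumptions for this example):

```python
def coverage_by_domain(assets):
    """Percentage of assets with documented lineage, per domain."""
    totals, covered = {}, {}
    for asset in assets:
        domain = asset["domain"]
        totals[domain] = totals.get(domain, 0) + 1
        covered[domain] = covered.get(domain, 0) + (1 if asset["has_lineage"] else 0)
    return {d: round(100 * covered[d] / totals[d], 1) for d in sorted(totals)}


# Hypothetical catalog export.
ASSETS = [
    {"name": "fact_revenue", "domain": "finance", "has_lineage": True},
    {"name": "fact_costs", "domain": "finance", "has_lineage": True},
    {"name": "budget_upload", "domain": "finance", "has_lineage": False},
    {"name": "fact_bookings", "domain": "operations", "has_lineage": True},
]
```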

Frequently Asked Questions About Data Lineage

What is data lineage?

Data lineage is the complete traceability of a piece of data's journey from its origin in a source system to its final point of consumption, through every intermediate transformation. It answers three questions: where does this data come from, what happened to it along the way, and where is it used? It's the trust infrastructure that lets you understand, audit, and defend any of the organization's data.

What's the difference between technical lineage and business lineage?

Technical lineage records the data's physical journey: which table feeds which table, which pipeline runs which transformation. Business lineage adds context: what business transformation that step represents, what rule was applied, who approved it, and whether the resulting data is subject to restrictions. Both layers are necessary; technical lineage without business context isn't enough to comply with AI Act Article 10.

Why does the AI Act require data lineage?

Article 10 requires documenting high-risk AI systems' training datasets with origin and transformation traceability. Article 12 requires event logging throughout the system's lifecycle. Both requirements presuppose a traceability infrastructure only automated lineage can sustainably keep up to date. Without lineage, compliance documentation is a static document that ages from the moment it's created.

What tools automate data lineage?

The main ones are: dbt for lineage within the transformation layer, OpenMetadata for end-to-end lineage with connectors to multiple platforms, Microsoft Purview for Azure ecosystems, Apache Atlas for legacy Hadoop environments, and Collibra or Alation for enterprise solutions with integrated business lineage. The dbt plus OpenMetadata combination covers 80% of use cases in modern cloud stacks at minimal cost.

How much does implementing data lineage cost?

In stacks with dbt and Snowflake, basic technical lineage can be automated in under a week with dbt Docs at no license cost. End-to-end lineage with self-hosted OpenMetadata adds two to four weeks of setup. Enterprise solutions like Collibra or Purview have significant license costs but reduce implementation time with included support. The real cost isn't the tool: it's the Data Stewards' time to enrich business lineage.