
Historic enterprise data is not an asset for AI adoption

  • Ingenie
  • Dec 1, 2025
  • 5 min read

Enterprise data archives are not AI-ready by default. An estimated 85% of enterprise data is never used beyond the transaction that generated it, and most of it is ungoverned and architecturally inaccessible. The ITV case study shows why governance discipline, not data volume, drives AI value.


The training data constraint in AI investments


There is broad consensus that the availability of quality training data has become the primary constraint on AI progress. Frontier models have largely exhausted structured, high-quality public data. The next wave of capability improvements depends on proprietary, domain-specific data held within large enterprises: financial institutions, healthcare systems, broadcasters, and industrial operators.


Organisations sitting on decades of accumulated data are treating their archives as strategic assets. The investment case assumes that the data exists, that the competitive advantage is proprietary, and that activation is an execution problem rather than a value question. This framing contains a material error.


The case for treating accumulated data as a strategic asset is well established. Technologists have long argued that data carries perpetual value: unlike physical assets, it does not depreciate with use, cannot easily be deleted, and replicates at near-zero marginal cost. Historic data can surface patterns that recent data cannot reveal. Proprietary data estates, if genuinely unique, represent competitive assets that are difficult for others to replicate.


Generative AI has reinforced this position by creating institutional demand for training data at a scale that did not previously exist. Large language models ingest and embed data for fine-tuning and inference at scale, and organisations that shifted their data estates to the cloud through lift-and-shift migrations are now discovering that their commercial terms were structured around pre-AI access patterns. The assumption that archived data can serve as AI training input is where the investment case begins to diverge from commercial reality.


Accumulated data is not the same as deployable data


Gartner estimates that 85% of all enterprise data is collected and stored but never actively used for any purpose beyond the transaction that generated it. It accumulates not because it is valuable and awaiting activation, but because no decision was ever made to delete it, structure it, or establish what it is worth.


The constraints are architectural as much as commercial. Enterprise data infrastructure is built for static Business Intelligence reporting and batch data processing: periodic, scheduled, and tolerant of latency. AI model training and inference require streaming data pipelines, high-capacity compute, and low-latency data access. Cold data in deep archive, which represents the majority of most enterprise data estates, cannot meet those requirements without re-architecture. These constraints rarely surface at the planning stage. They become visible after capital is committed.
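
To make the access-pattern mismatch concrete, the sketch below contrasts a scheduled BI workload with a single training run over the same estate. It is an illustration only: the dataset size, report cadence, scan fraction, and epoch count are assumed figures, not benchmarks.

# Illustrative contrast of monthly data-access volume: scheduled BI
# reporting vs. one model training run over the same 50 TB dataset.
# All figures are assumptions for illustration.

DATASET_TB = 50

# BI pattern: a nightly batch job scans a small aggregate slice.
BI_RUNS_PER_MONTH = 30
BI_SCAN_FRACTION = 0.02              # each report touches ~2% of the data (assumed)
bi_tb_read = BI_RUNS_PER_MONTH * DATASET_TB * BI_SCAN_FRACTION

# Training pattern: every epoch re-reads the full dataset.
EPOCHS = 10                          # assumed number of training epochs
training_tb_read = EPOCHS * DATASET_TB

print(f"BI reads per month:     {bi_tb_read:,.0f} TB")        # 30 TB
print(f"One training run reads: {training_tb_read:,.0f} TB")  # 500 TB

In practice the training data would be staged to fast storage rather than read from archive each epoch, but that staging layer is precisely the re-architecture the paragraph describes, and it does not exist in estates built for batch reporting.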


The financial exposure compounds the architectural problem. Cloud commercial contracts are priced around predictable, recurring access patterns. Archived data retrieval at AI training scale sits outside those parameters and triggers a materially higher cost tier. Usage costs of three times or more above the expected monthly baseline are not uncommon, and they are rarely modelled correctly before commitment. Beyond cost, archived enterprise data that pre-dates current AI governance frameworks frequently lacks the consent documentation, provenance records, and bias controls that deployment reviews now require. Data that passes a security audit may still fail an AI deployment review. The compliance exposure materialises at the point of use, not the point of storage.
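
As a rough illustration of how that exposure scales, the back-of-envelope sketch below compares a deep-archive storage baseline with a one-off bulk retrieval at training scale. The archive size and per-GB rates are illustrative assumptions, not vendor pricing.

# Back-of-envelope sketch: archive activation cost vs. the monthly
# storage baseline. All rates are illustrative assumptions, not
# actual vendor pricing.

ARCHIVE_TB = 500                 # cold data targeted for AI training (assumed)
STORAGE_PER_GB_MONTH = 0.001     # deep-archive storage, $/GB-month (assumed)
RETRIEVAL_PER_GB = 0.02          # bulk retrieval, $/GB (assumed)
EGRESS_PER_GB = 0.05             # egress to training environment, $/GB (assumed)

gb = ARCHIVE_TB * 1024

monthly_storage = gb * STORAGE_PER_GB_MONTH            # the expected baseline
activation = gb * (RETRIEVAL_PER_GB + EGRESS_PER_GB)   # one-off retrieval + egress

print(f"Monthly storage baseline: ${monthly_storage:,.0f}")   # ~$512
print(f"One-off activation cost:  ${activation:,.0f}")        # ~$35,840
print(f"Multiple of baseline:     {activation / monthly_storage:.0f}x")

Even under these deliberately modest assumed rates, a single activation pass costs tens of monthly baselines, and iterative model development repeats it.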


ITV case study: enterprise data audit


ITV holds one of the most valuable content libraries in British broadcasting: nearly 100,000 hours of programming across more than 1,000 formats. When Apple’s entry into streaming signalled a competitive shift, ITV’s executives made a logical decision: audit the archive and capitalise on what the business already owned. The audit returned a result that surprised the room. There was no significant new revenue to unlock. The commercially viable content was already known and in use. The rest had remained in archive not because it was awaiting activation, but because no governance framework had ever required a decision about it.


What the process did yield was more commercially significant. The audit forced ITV to confront the cost of maintaining over 1,000 content formats. The response was a governance framework that cut permissible formats from more than 1,000 to 50, imposed rigorous tagging disciplines on programme producers, and introduced content minimisation KPIs that changed how content was created at source.


Over the following decade, ITV doubled its revenues. Not by unlocking its archive. By establishing the governance discipline that determined what constituted a usable asset. The pattern recurs consistently across enterprise AI programmes: the most valuable data is already in production, archived data persists because it was never worth activating, and the real value creation mechanism is the governance discipline the process demands, not the data itself.


What this means for enterprise AI programmes


The training data scarcity argument is valid. The enterprise response to it, specifically the instinct to mine historical data deposits, is reasonable but requires rigorous qualification before capital is committed.


Five principles hold consistently across data governance programmes:


1. The most valuable data asset is already in production. Clean, current, governed data at the core of the business model is the asset that AI models can most immediately and cost-effectively use. Capital directed at improving its quality, pipeline architecture, and accessibility consistently outperforms capital directed at activating archived data at the periphery.


2. Archived data is archived for a reason. The commercial decision not to activate it has been made repeatedly, and not by accident. There is a reason it remained unused: the cost of establishing provenance, cleaning, tagging, and making it architecturally accessible has consistently exceeded the return it generates. Before committing capital to reversing that decision at AI scale, the more productive question is whether the data is worth activating at all, or whether those resources are better directed at the governed data already in production.


3. Scale is not signal. A large data estate is not evidence of AI readiness. Accumulating data across multiple formats, storage tiers, and architectures compounds cost without improving deployment viability. The organisations where AI programmes perform are those that applied data minimisation discipline at the point of creation: defining what constitutes a governed, usable asset in the architectural design, before the data exists, not after it has accumulated. Less data, correctly governed from source, consistently outperforms larger estates built without that discipline.


4. Access costs are systematically underestimated. The financial exposure in enterprise AI data programmes materialises in retrieval and processing costs, not storage costs. Unless streaming-accessible, governed data pipelines are already in production, the cost of making archived data usable for AI deployment should be treated as unquantified until technically assessed.


5. Governance established before scaling is the investment. ITV’s outcome was produced by governance discipline, not data activation. Organisations that define what constitutes a usable, governed data asset at the point of architectural design, before the AI programme is built and before it is scaled, are the ones where the AI business case holds over the investment horizon.
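
What defining a usable, governed asset at the point of design can look like in practice is a record schema that makes governance metadata mandatory at creation time. The sketch below is hypothetical, not ITV's actual framework: the field names and the permitted-format list are illustrative assumptions.

# Hypothetical sketch of a governed-asset record enforced at creation.
# Field names and the format whitelist are illustrative assumptions,
# not any organisation's actual schema.

from dataclasses import dataclass
from datetime import date

PERMITTED_FORMATS = {"mxf-op1a", "prores-422", "wav-bwf"}  # assumed whitelist

@dataclass(frozen=True)
class GovernedAsset:
    asset_id: str
    media_format: str         # must come from the permitted list
    provenance: str           # who created it, from what source
    consent_reference: str    # link to consent / rights documentation
    created: date
    retention_review: date    # forces a future keep-or-delete decision

    def __post_init__(self):
        # Reject assets that would otherwise accumulate ungoverned.
        if self.media_format not in PERMITTED_FORMATS:
            raise ValueError(f"Format {self.media_format!r} is not permitted")
        if not self.provenance.strip() or not self.consent_reference.strip():
            raise ValueError("Provenance and consent must be recorded at creation")

The specific fields matter less than the mechanism: the keep-or-delete decision and the evidence a deployment review needs (provenance, consent) are captured when the asset is created, not reconstructed decades later.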


The gap between investment decisions and implementation reality


For investment committees, the data estate question belongs in the investment thesis, not in a footnote to the technology programme. A proprietary data deposit that is ungoverned, architecturally inaccessible, or lacking consent lineage is a remediation issue. The investment case that depends on it as a source of competitive advantage needs to price that remediation before capital is committed, not discover it after close.


The same constraint applies to the teams tasked with delivering AI value creation programmes once capital is committed. The governance deficit that went unidentified at the investment stage surfaces once the programme team is committed to the delivery plan, the timeline is set, and data remediation has become the defining operational problem.


