Is a data pipeline a platform?
All organisations would benefit from a place where they can collect and analyse data from different parts of the business. Historically, this need has often been met by a data platform: a centralised data store where useful data is collected and made available to approved people. But, whether they like it or not, most organisations are, in fact, a dynamic mesh of data connections that needs to be continually maintained and updated.
Following a single-platform pattern often leads to a central data engineering team tasked with implementing every data flow. The complexity of meeting everyone’s needs and ensuring appropriate information governance, combined with a lack of self-service, often makes it hard to ingest new data sources. The result is a growing backlog, frustrated data users, and frustrated data engineers.
Thinking of these data flows as pipelines shifts the mindset away from monolithic solutions and towards a more decentralised approach: understanding which pipes and data stores you need, implementing them in the way that suits each case, and reusing components where appropriate.
Experience report:
I recall one engagement where the client’s data engineering team was in the unfortunate position of dealing with the company’s secrets on a daily basis and drowning in their backlog. In my experience, that’s a common problem with centralised data warehouses and centralised data engineering teams!
The team I was working with was really forward-thinking and technically capable, wanting to create data-driven functionality and get actionable insights. They had been unable to get this work prioritised, and access to the data warehouse was only permitted for the sole data scientist on the team, through a special VM running in the data warehouse project. The data scientist would run their analyses and models on an ad-hoc basis, copying outputs to the team for deployment. It really limited what was possible.
As we were on Google Cloud Platform, and we didn’t need or want access to secrets, I suggested we could use an Authorised View from our own GCP project to safely access just what we needed. I was able to work directly with the client’s compliance team to agree on what would be exposed by the view, moving us from high risk to their lowest classification. The view was trivial for the overburdened data engineering team to deploy, and we went from completely deadlocked to iterating as fast as we could design and run our A/B tests! Sister teams were able to steal the idea and get unstuck too.
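To make the idea concrete, here is a minimal sketch of the two pieces an authorised view needs, assuming the warehouse was BigQuery (the report does not name the product). It is not the code we deployed. Every project, dataset, table and column name is hypothetical, and the helper functions are my own illustration rather than part of any Google library. The first helper builds the view definition that exposes only the agreed columns; the second builds the access entry, in the shape of the `access[].view` field of the BigQuery dataset resource, that the warehouse owners would add to the source dataset so the view can read it on our behalf.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """Fully qualified table or view name (all values used here are hypothetical)."""

    project: str
    dataset: str
    table: str

    def sql_name(self) -> str:
        # Backticks allow project IDs that contain hyphens.
        return f"`{self.project}.{self.dataset}.{self.table}`"


def build_view_sql(view: TableRef, source: TableRef, approved_columns: list[str]) -> str:
    """Return DDL for a view that exposes only the approved columns of the source."""
    if not approved_columns:
        raise ValueError("at least one approved column is required")
    columns = ", ".join(approved_columns)
    return (
        f"CREATE OR REPLACE VIEW {view.sql_name()} AS "
        f"SELECT {columns} FROM {source.sql_name()}"
    )


def build_access_entry(view: TableRef) -> dict:
    """Return the access entry that authorises the view on the source dataset."""
    return {
        "view": {
            "projectId": view.project,
            "datasetId": view.dataset,
            "tableId": view.table,
        }
    }


# Hypothetical names: the view lives in the consuming team's project,
# the source table stays in the warehouse project.
view = TableRef("team-project", "shared_views", "orders_safe")
source = TableRef("warehouse-project", "sales", "orders")

print(build_view_sql(view, source, ["order_id", "order_total"]))
print(build_access_entry(view))
```

The split mirrors the division of responsibility in the story: the list of approved columns is what gets agreed with compliance, while the access entry is the only change the warehouse owners have to make on their side. Consumers then query the view in their own project without being granted access to the source dataset itself.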