A Personal Take On Data Engineering

Most of the posts I’ve written so far talk about one aspect or another of data engineering. My definition can differ a bit from other people’s, so I thought I’d write down my understanding of it.

In short, data engineering is how you make data useful: it enables other people to do their jobs. In my case that has mostly meant training ML models, but it can also mean getting insights from the data itself - for example for science. The job involves all the steps from acquiring data to having usable datasets. Read on.

A picture of an abacus

Data acquisition: data can be found, well, almost everywhere. But making data useful means acquiring the right data for the problem at hand. I think it’s the most difficult part of data engineering: it can take weeks to months, it can be expensive, and in a business that keeps changing you’ll have to anticipate a lot. You have to talk to the people who will use your data and understand their needs so you can collect data that will be useful to them. At the same time, those needs will likely change before you have the data ready, so it’s your job to make sure the data you’re collecting now, under a given time/money/effort budget, can be adapted to multiple uses in the future. That requires a deep understanding of the application, the business needs, and the end users. Examples of data acquisition: adding telemetry to a product to measure how it is used; hiring paid contributors to generate new data for you (images, voice, etc.); scraping the web for public information; purchasing existing datasets from a company, etc.

Data processing: even with full control over the data acquisition, the data you’ll receive will not be usable as-is. There are always mistakes, inconsistencies, missing data, and issues of various kinds. Examples: the telemetry system had a partial outage for a few hours, and part of the data is missing; the contributors you hired initially misunderstood your instructions, so the data changes in the middle of the campaign; you acquired two datasets from two different providers who use different terminologies. Some postprocessing is required to eliminate or mitigate these issues and make the data straightforward to use. As much as possible, keep track of all the transformations that you’ve applied to the data. Keeping a read-only copy of the original data is a minimum. The ideal scenario is to keep track of every single change that has been applied to the data, and why. In practice you always end up somewhere in the middle. A lot of the change tracking can be automated: for example with a postprocessing pipeline that can be re-run periodically against the original data. Tracking manual changes is much more difficult. What trips up most software engineers used to working with version control systems is that there’s no real data version control - although a few companies have started doing it. These days datasets are just too big to keep all the intermediate changes (even for companies like Google). Changing data is risky, especially when done in bulk.
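To make the "re-runnable pipeline" idea a bit more concrete, here is a minimal sketch of what I mean. Everything in it is made up for illustration (the `Pipeline` class, the `drop_missing` and `unify_labels` steps, the record fields), and a real pipeline would work on files or tables rather than a handful of dicts. The point is only that the original data is never modified, and that each step records what it did and why.

```python
from collections.abc import Callable, Sequence
from dataclasses import dataclass, field

Record = dict
Step = Callable[[list[Record]], list[Record]]


@dataclass
class Pipeline:
    """Hypothetical helper: an ordered list of named steps plus a change log."""

    steps: list[tuple[str, str, Step]] = field(default_factory=list)
    log: list[dict] = field(default_factory=list)

    def add(self, name: str, reason: str, fn: Step) -> "Pipeline":
        self.steps.append((name, reason, fn))
        return self

    def run(self, original: Sequence[Record]) -> list[Record]:
        # Work on copies so the original records are never mutated.
        data = [dict(r) for r in original]
        self.log = []
        for name, reason, fn in self.steps:
            rows_in = len(data)
            data = fn(data)
            # Record what was applied, why, and its effect on the row count.
            self.log.append(
                {"step": name, "reason": reason,
                 "rows_in": rows_in, "rows_out": len(data)}
            )
        return data


def drop_missing(rows: list[Record]) -> list[Record]:
    # Example issue: a telemetry outage left some values empty.
    return [r for r in rows if r["value"] is not None]


# Example issue: two providers use different terms for the same thing.
LABEL_MAP = {"automobile": "car"}


def unify_labels(rows: list[Record]) -> list[Record]:
    return [{**r, "label": LABEL_MAP.get(r["label"], r["label"])} for r in rows]


# Stand-in for the read-only original data.
original = (
    {"label": "car", "value": 1.0},
    {"label": "automobile", "value": 2.0},
    {"label": "car", "value": None},
)

pipeline = (
    Pipeline()
    .add("drop_missing", "telemetry outage left empty values", drop_missing)
    .add("unify_labels", "providers use different terminology", unify_labels)
)
cleaned = pipeline.run(original)
```

Because `run` always starts from the original, changing a step and re-running gives you a fresh result plus a fresh log, with no need to store every intermediate version of the dataset.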

Privacy & legal. Yes, that’s part of data engineering - at least in my view. Unless you’re measuring something like the weather or some voltage in a lab experiment, data is usually connected to humans, which means potential privacy issues or legal restrictions. A simple example is data from an electricity meter at someone’s place: knowing the energy consumption every minute versus every month makes a big difference in terms of privacy. Knowing the former, you can infer when someone wakes up and goes to bed, possibly how many people are in the house, when they are around and when they are not, etc. This can leak a lot of personal information and can even put someone at risk. Knowing the energy consumption every month lowers that risk significantly. Unless you have a really good reason to need the fine-grained data, going for the coarse-grained one is usually much better. When collecting fine-grained information, ask for the person’s consent (remember those cookie popups?), but keep in mind that recording certain types of information can be prohibited no matter what (for example, recording ethnic data is illegal in France and in other countries).
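For the electricity meter example, going from fine-grained to coarse-grained can be as simple as summing readings per month before anything gets stored. This is only a sketch: the function name `monthly_totals`, the `(timestamp, kWh)` tuple format, and the sample values are assumptions of mine, not what any real meter produces.

```python
from collections import defaultdict
from datetime import datetime


def monthly_totals(readings):
    """Sum (ISO timestamp, kWh) readings into one total per calendar month."""
    totals = defaultdict(float)
    for timestamp, kwh in readings:
        month = datetime.fromisoformat(timestamp).strftime("%Y-%m")
        totals[month] += kwh
    return dict(totals)


# Hypothetical minute-level readings: (timestamp, kWh used in that minute).
minute_readings = [
    ("2024-01-01T07:00:00", 0.5),
    ("2024-01-01T07:01:00", 0.25),
    ("2024-02-01T07:00:00", 0.125),
]

coarse = monthly_totals(minute_readings)
```

Only `coarse` would be kept. The time of day is gone, so the wake-up and bedtime patterns can no longer be read from the data.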

Making the data available. That’s the final stage, and usually the easiest: put the data in a form that can be used by downstream practitioners. If the goal is to train ML models, you probably want to split the data into training, validation and test sets, then publish the data or integrate it into an MLOps pipeline. If you’re working with data scientists, the postprocessed data is likely enough, but you’ll probably have to provide much more metadata (you recorded that too, right?).
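One way to do the split, assuming every example has a stable identifier, is to hash that identifier and pick the split from the hash. An example then stays in the same split even when the dataset grows or gets reprocessed. The function name `assign_split`, the 80/10/10 proportions, and the `example-N` identifiers below are my own choices for illustration.

```python
import hashlib


def assign_split(example_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map an example ID to 'train', 'validation' or 'test'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a bucket number in 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


# Hypothetical identifiers; in practice use whatever stable ID your data has.
ids = [f"example-{i}" for i in range(1000)]
splits = {example_id: assign_split(example_id) for example_id in ids}
```

The proportions are only approximate on a given dataset, and if several examples come from the same person or source you would want to hash that source's ID instead, so related examples don't end up on both sides of the split.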

Maintaining the data. As I mentioned in a previous post, datasets are never finished: using the data can reveal issues that you missed during postprocessing, requirements can change and call for different postprocessing, the data may need to move to the shiny new storage system your organization is now using, the law can change, etc. Keeping data usable is a significant amount of work that is often overlooked.

I hope this gave you a good summary of how I see data engineering. Doing it well requires a large set of skills and quite a bit of cross-functional work. Lots of challenges are not solved yet, and the field keeps changing. That’s what kept me going for years.

Photo by Crissy Jarvis on Unsplash

LinkedIn Post - if you have comments.