In a previous article I argued that “Data Science should be treated as part of the business, not part of IT.” Of course that is not the whole story.

Over ten years ago I was working for a major reinsurance firm in their Market Risk department, helping them produce financial risk reports every night. These reports detailed the financial risks of a portfolio worth around six billion dollars: they told the firm how much it would gain or lose if any of a number of events happened. Specialists set up the substantial mathematical calculations (which, frankly, were beyond me); I helped set up the software to fetch the required data each night, do the maths, and generate the reports. We didn't call it that at the time, but this was Data Engineering. We called it "plumbing".

This gives a clear example of how Data Engineering differs from Data Science. Data Engineers don't come up with the maths themselves, but they make sure the maths can be done in a repeatable and reliable manner: that is, engineering. They may have some idea of what the business data means, but they are rarely experts in it.


The Case For Consistent Data Engineering Design

Can you do without Data Engineers? If the company is small enough then Data Engineering might be done by other engineers or data scientists. But such companies may fall foul of the following problems.

There are self-service analytics systems where a business person designs the reports they want and pushes them into production. This may work for one-offs and small cases, but it can get out of hand.

Likewise, a Data Scientist may have the technical knowledge to run a model repeatedly, asking the same question of the most recent data. However, no one wants to spend their working life cleaning data and re-running reports by hand. Automating the process relies on the data being available, cleaned regularly, and delivered reliably. Data Scientists are also unlikely to have the experience to know whether their model is running efficiently, or how to work with the Operations team when problems arise.
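To make the point concrete, here is a minimal sketch of the kind of nightly job a Data Engineer would automate so that no one has to clean and score data by hand. Every function and field name here is invented for illustration; it is not any real system's pipeline.

```python
from datetime import date

def clean(rows):
    """Drop records with missing amounts and normalise the amount field."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in rows
        if r.get("amount") not in (None, "")
    ]

def score(rows):
    """Stand-in for the Data Scientist's model: flag large exposures."""
    return [{**r, "flagged": r["amount"] > 1000.0} for r in rows]

def run_nightly(rows, run_date):
    """The repeatable pipeline: clean, score, summarise."""
    report = score(clean(rows))
    return {
        "date": run_date.isoformat(),
        "flagged": sum(r["flagged"] for r in report),
    }

raw = [
    {"trade": "T1", "amount": "2500"},
    {"trade": "T2", "amount": None},   # dirty record, dropped by clean()
    {"trade": "T3", "amount": "400"},
]
result = run_nightly(raw, date(2024, 1, 2))
```

The value is not in any one function but in the fact that the whole sequence runs unattended, every night, the same way.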

A good Data Engineer will take both of these problems and solve them within the architectural design of the enterprise's data systems. These systems take the complexity inherent in the business and make it manageable. There may be hundreds or thousands of data feeds, reports, and models. Data may be of known or unknown quality. It may be necessary to track the provenance of data so that reports can clearly be connected to their data sources. It may be necessary to label data as containing Personal Information, so that a strict set of governance rules is followed.

Having a design authority for your data engineering should result in the following positives:

  1. Maintainability – using consistent styles means engineers have less to learn in order to understand their systems.
  2. Efficiency – improvements to ingestion and data transformation techniques can be applied throughout your data engineering landscape.
  3. Predictability – if your data engineering workloads are roughly of the same type, you have a good understanding of the resources required. Jobs that use radically different technology may either swamp the systems or run slowly.
  4. Better Data Quality – using a consistent set of technologies means you can have a standard set of rules for quality checking and error detection.
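As a small illustration of the last point, a "standard set of rules" can literally be one shared rule table applied to every feed, rather than ad-hoc checks written per team. The rule names and record fields below are invented for the example.

```python
# One shared rule set, applied identically to every incoming record.
RULES = {
    "customer_id": lambda v: isinstance(v, str) and v != "",
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def check(record):
    """Return the names of the fields that violate the shared rules."""
    return [field for field, ok in RULES.items() if not ok(record.get(field))]

good = {"customer_id": "C42", "amount": 9.99}
bad = {"customer_id": "", "amount": -1}
```

Because every feed goes through the same `check`, error detection and reporting can be centralised instead of being reinvented for each pipeline.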


Now let’s look at some aspects of Data Engineering. It is usually split into two styles: Batch or Streaming. (I am ignoring ‘real time’, which is similar to Streaming but usually much faster.)

Batch

Historically a lot of Data Engineering was done in batch mode. The challenge was usually to periodically read data from relational databases with ETL (Extract, Transform, Load) tools and write the results back to them. This morphed into ELT (the same steps, done in the order Extract, Load, Transform) with the prevalence of Big Data systems such as Hadoop and Data Lakes. Although they took a little while to catch up, the large, expensive ETL tools now often do Big Data ELT as well. This appeals to large enterprises because such work can be staffed by junior techs with some SQL experience rather than full-blown Data Engineers.
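The ETL/ELT distinction is purely about the order of the steps. A toy sketch, with in-memory lists standing in for a source database and a data lake (real pipelines would of course use an ETL tool or a SQL engine):

```python
def extract(source):
    """Read rows out of the source system."""
    return list(source)

def transform(rows):
    """Trivial stand-in transformation."""
    return [r.upper() for r in rows]

source = ["alice", "bob"]

# ETL: transform in flight, load only the finished result.
etl_target = []
etl_target.extend(transform(extract(source)))

# ELT: load the raw data first, transform later inside the target system.
elt_raw = []
elt_raw.extend(extract(source))   # Load raw data as-is
elt_result = transform(elt_raw)   # Transform where the data now lives
```

ELT suits Big Data systems because the raw data lands cheaply first, and the (often SQL-based) transformation runs on the target's own compute.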

Batch processing of data is still valuable. For example, machine learning models might be trained against the last n months of data. These models might be retrained every few days to keep up with the most recent conditions; retraining any faster would be a waste of resources.

Event Streaming

Nowadays the business is less keen on reports produced just once a month: it needs to know the current state right now. Almost all events must be treated as urgent and used straight away. Although we have had message queue technology for many years in the transaction processing world, it is rarely capable of processing the required volume of information. For example, a traditional message queue might be sensible for recording all purchases made on a web shop, but it would probably be overwhelmed by recording every page a customer looked at, along with the products they considered, the shopping carts that were never checked out, and so on. These last examples may not affect the balance sheet directly, but learning why potential customers don’t purchase is obviously valuable.


If you factor in “Internet of Things” devices, where we start to consider thousands of data generators, then you realise that this is not a trivial problem.

  • You need some way of getting data from where it is generated to where it is needed (e.g. with tools like Apache NiFi). 
  • You probably need a scalable landing zone for these events and a way of making them available to clients (e.g. with the popular Apache Kafka). 
  • Your design may also offer big data techniques for processing that data, such as Apache Flink, Spark Streaming, or Kafka’s own stream processing. 
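To show the shape of the processing step, here is a toy, pure-Python sketch of the kind of aggregation a system like Flink or Kafka Streams would do at scale: consuming a stream of events and keeping running counts. The event shape and field names are invented for illustration.

```python
from collections import Counter

def process(stream):
    """Consume events one at a time, maintaining running aggregates."""
    page_views = Counter()
    abandoned_carts = []
    for event in stream:
        if event["type"] == "page_view":
            page_views[event["customer"]] += 1
        elif event["type"] == "cart_abandoned":
            abandoned_carts.append(event["customer"])
    return page_views, abandoned_carts

events = [
    {"type": "page_view", "customer": "anna"},
    {"type": "page_view", "customer": "anna"},
    {"type": "page_view", "customer": "ben"},
    {"type": "cart_abandoned", "customer": "ben"},
]
views, carts = process(iter(events))
```

The real engineering challenges are everything this sketch omits: partitioning the stream across machines, surviving failures without losing or double-counting events, and keeping the state manageable as volumes grow.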

Data Engineers are skilled in such tools, but they need design coordination so they don’t all do their own thing. Even the best engineers will often do what is best for their own department rather than for the corporation as a whole. A good designer should be able to act as a Design Authority in such cases, providing a shared vision that keeps all Data Engineers working towards a common goal.

Credits

Photo of fields by Marc Pell on Unsplash

Author

Alex McLintock has over 25 years in the IT industry, with the last 8 of those focussed on the architecture of Big Data analytics systems.