What are the Data Roles in a typical analytics or big data system? It used to be that there was a DBA – a Database Administrator, and that was all you needed to run a complicated data system. That is not the case now. DBAs seem to be rarer, but many other roles have grown up to do related overlapping tasks. In this document I lay out my view of those roles. I, Alex McLintock, am a Big Data consultant and co-founder of Alephant, a Big Data Consulting Firm.
- DBAs – Database Administrators
- Data Architects
- Data Warehouse Engineers
- ETL Engineers
- Data Analysts
- Data Governance Team
- Data Scientist
- Data Journalist
- Data Engineer
- Platform Engineer
- Operations
- ML Engineer
- DataOps Engineer
- Cloud Architects/Engineers
- ELT Engineers
- BI (Business Intelligence)/Data Visualisation
Some Definitions
OLTP vs OLAP
Online Transaction Processing or “OLTP” is the sort of database which drives most applications. It is typically relational (RDBMS) and ACID compliant (atomicity, consistency, isolation, durability; ref https://en.wikipedia.org/wiki/ACID). Typically just one or two records are read at a time.
Online Analytical Processing or “OLAP” is the sort of database which is designed for reporting. Typically large numbers of records are read at once and statistics done on them. Although records are added to OLAP databases most activity is “reads”, with little to no “updates”.
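To make the difference in access pattern concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table and its rows are made up for illustration, and SQLite is only standing in for a real OLTP or OLAP system. The first query reads one record by key; the second reads every record and does statistics on them.

```python
import sqlite3


def build_orders_db():
    """Create an in-memory database with a small, made-up orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    rows = [
        (1, "alice", 10.0),
        (2, "bob", 25.0),
        (3, "alice", 5.0),
        (4, "carol", 40.0),
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def oltp_lookup(conn, order_id):
    """OLTP-style access: read a single record by its key."""
    return conn.execute(
        "SELECT id, customer, amount FROM orders WHERE id = ?", (order_id,)
    ).fetchone()


def olap_report(conn):
    """OLAP-style access: scan all records and aggregate them per customer."""
    return conn.execute(
        "SELECT customer, COUNT(*), SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()


conn = build_orders_db()
print(oltp_lookup(conn, 2))
print(olap_report(conn))
```

On a table this small both queries are instant. The distinction only starts to matter when the table is large enough that scanning all of it competes with the application's single-record reads and writes.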
Data Warehouse
A type of database which is usually strongly relational and usually distributed. It is aimed at OLAP workloads such as reports.
Big Data
I cannot give an absolute definition for Big Data because people have been arguing about it for years. I can say however that there are two important aspects of Big Data:
- Scale (or “Volume”). If you can do the work with just one or a few machines then it is not Big Data.
- Flexible or No Schema. This is often called “Variety”. If all you have is a well known schema then you can use a distributed RDBMS, which does not really have much similarity to Big Data.
Consultant Jesse Anderson described big data using a different heuristic in his recent book “Data Teams”. He described Big Data problems as what you get when your normal engineers say that you can’t do something with the normal system for some technical reason to do with scale or speed. Though I am sure he said it better.
Other ‘V’s are important, but less helpful. For example “Velocity” can be true of big and small data. “Veracity” is important for any database.
The Roles
In this list I am starting with my understanding of the oldest roles, and progressing to roles which are more popular now, or have been created in recent years.
It should be remembered that this is my personal opinion. Individual jobs may be any combination of these roles, or a subset of individual roles.
DBAs – Database Administrators
Where databases are used for applications the design of that database is usually defined by the application developers. DBAs typically set up those database systems, and run them. They will be responsible for the hardware and software, and may advise on the schema, such as recommending indexes based upon activity. Historically they specialised in one particular software supplier such as Oracle, Sybase, or Microsoft (SQL Server).
They may or may not know anything about distributed databases.
This role might be done by more general operators or other support staff.
I saw a blog recently which split up DBAs into Infrastructure DBAs who design and implement the database infrastructure, and the Application DBAs who design the databases which applications use. It is common for application software developers to pretend they can do the last job and for proper DBAs to complain that no, they really can’t.
Data Architects
This role changes depending on who you talk to but typically there are two aspects:
- Data Architects are designers who design the data, including logical and physical models.
- Data Architects sometimes design the physical infrastructure for data stores in a similar way to DBAs.
Some people just do the first of these.
Data Architects should also have a strong understanding of the organisation’s data governance policies and data strategy, even though large organisations may have separate people for that.
Some Data Architects specialise in Data Warehouse development which is different because the schema needs to be designed around the sorts of queries needed for it to be efficient.
Personally I call myself a designer of Big Data Infrastructure and sometimes that gets shortened to Data Architect.
Data Warehouse Engineers
These are typically Data Architects with some capability to design and implement a Data Warehouse. For the purpose of this document a Data Warehouse is an RDBMS system with a well defined schema which is optimised for OLAP workloads instead of OLTP workloads. They are often distributed nowadays and implemented in the cloud. If this is done well, and allows for swift scaling it makes them ‘almost Big Data’ in some ways.
ETL Engineers
“Extract, Transform, Load”. Traditionally databases needed batch jobs to take data out of one database, manipulate it and insert that data into other databases. This was typically done through SQL, but did not always require software engineering skills. Instead GUIs were often given to business people who understood the data, and who dragged symbols around a screen to create the ETL processes.
Some large companies grew up just selling ETL software.
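As a rough illustration of the three steps, here is a minimal sketch in Python using the built-in sqlite3 module. The table names (`sales`, `sales_clean`), the columns and the sample rows are all invented for this example, and real ETL tools do far more than this. The point to notice is that the transform step happens outside both databases.

```python
import sqlite3


def extract(source):
    """Extract: pull the raw rows out of the source database."""
    return source.execute("SELECT name, amount_pence FROM sales").fetchall()


def transform(rows):
    """Transform: tidy up names and convert pence to pounds, outside any database."""
    return [(name.strip().title(), pence / 100) for name, pence in rows]


def load(target, rows):
    """Load: insert the cleaned rows into the target database."""
    target.executemany("INSERT INTO sales_clean VALUES (?, ?)", rows)
    target.commit()


# A made-up source database holding messy data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (name TEXT, amount_pence INTEGER)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("  alice ", 1050), ("BOB", 200)]
)

# A made-up target database with the schema the reports expect.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE sales_clean (name TEXT, amount_pounds REAL)")

load(target, transform(extract(source)))
result = target.execute(
    "SELECT name, amount_pounds FROM sales_clean ORDER BY name"
).fetchall()
print(result)
```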
Data Analysts
For some years it was a specific job to look at the data and produce reports out of it. This fell to Data Analysts. They are typically mathematicians or statisticians. They understand the data, and are typically not software engineers.
This role is becoming rarer nowadays as the work is typically done by the business people themselves through the use of GUIs such as Tableau, Qlik, or PowerBI.
Some Data Analysts specialise in Data Visualization – typically generating graphs and charts from the data which are used by business people.
Data Governance Team
The people who decide what the rules are for the organisation as a whole within the context of the legal framework surrounding the organisation and its data strategy.
Applying the data governance rules is the responsibility of everyone in the organisation no matter what their roles.
Data Scientist
Data Scientists are in essence advanced Data Analysts with extra techniques. They typically come from an academic background and often have a PhD in a numerical subject. Their statistics skills are usually more advanced than those of a traditional data analyst, and they typically know how to program, though they are usually less skilled at it than a full time software engineer. Their languages are usually ones suitable for large scale mathematics such as Python or R.
Rather than just create reports a Data Scientist’s task is usually to create hypotheses useful to the business, and test whether the hypothesis is consistent with the historical data. Or to put it another way – a data scientist’s job is to guide the business by supplying insights which the business can use to make decisions or change its operations in some way.
Data scientists should be considered part of the money making business, rather than just a sunk IT cost.
It is often said that 80% of a data scientist’s job is cleaning data. This is because the data they are working with is typically not already being cleaned as part of normal day to day operations.
It is possible that data scientists train mathematical models based on historical data. They may hand these models over to the Data Engineers and ML Engineers in the organisation, who may incorporate the model in the normal day to day operation of the business. It is a common mistake to think that just because a data scientist built a model which they were able to train once and test once they can productionise it to do the same thing every day automatically.
Data Journalist
A Data Journalist is a special, and quite rare, form of Data Scientist and Data Visualisation expert. They are most famous for working in newspapers and other news organisations, but not always.
It is often said that a data scientist needs to explain a narrative – for instance to explain causation rather than just correlation. A Data Journalist conveys that narrative to a wider audience, perhaps through infographics and explanation.
Data Engineer
People doing Data Engineering started to be called that with the Big Data revolution, but to some extent the discipline existed before. These people are typically software engineers who develop data pipelines. These are resilient and reliable processes for transferring data from one place to another, or one state to another. These pipelines are still typically either batch or event streaming, and the two fields require slightly different skill sets.
For example, many many years ago I looked after a system which fetched some data files from an FTP site, checked that they were valid, and pushed them into a Sybase database. There was a system for scheduling these jobs, rerunning them if they failed, and also for running downstream jobs if they succeeded. Subsequently we ran reports on the Sybase database which calculated the Market Risk for the organisation, and these reports were read by specialists the next morning. I did not understand all the maths at the time as the valuation methods were defined by financial specialists.
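The shape of a pipeline like that can be sketched in a few lines of Python. This is not the original system: the file format (`instrument,price`), the function names and the simulated failure are all assumptions made up for illustration. It shows the three ideas in the story, which are validating the input, rerunning a failed job, and only running the downstream job once the load has succeeded.

```python
def validate(lines):
    """Check every line is 'instrument,price' with a numeric price.

    Raises ValueError on the first malformed line, so a bad file is rejected whole.
    """
    records = []
    for line in lines:
        instrument, price = line.split(",")
        records.append((instrument, float(price)))
    return records


def run_with_retries(job, attempts=3):
    """Run job(); if it raises, rerun it, up to `attempts` times in total."""
    last_error = None
    for _ in range(attempts):
        try:
            return job()
        except Exception as error:
            last_error = error
    raise RuntimeError(f"job failed after {attempts} attempts") from last_error


def run_pipeline(fetch, load, downstream, attempts=3):
    """Fetch, validate and load; run the downstream job only if all of that worked."""
    lines = run_with_retries(fetch, attempts)
    records = validate(lines)
    load(records)
    return downstream(records)


# Hypothetical stand-ins for the FTP fetch, the database and the report.
calls = {"fetch": 0}


def flaky_fetch():
    calls["fetch"] += 1
    if calls["fetch"] == 1:
        raise ConnectionError("simulated FTP failure")
    return ["BOND1,101.5", "BOND2,99.0"]


database = []
total = run_pipeline(
    flaky_fetch, database.extend, lambda records: sum(p for _, p in records)
)
print(total, calls["fetch"])
```

In practice the scheduling, retrying and dependency handling is usually left to a dedicated scheduler rather than hand-written like this, but the responsibilities are the same.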
Data Engineers may use all aspects of software engineering including version control, Test Driven Development, DevOps, and so on.
It is possible that they may be taking the work of a Data Scientist and productionising it – eg making sure that it runs regularly and reliably with no human intervention.
It is possible that a Data Scientist and a Data Engineer may be the same person, but in general I think it is unwise because they have quite different skill sets. However there is a special case of ML Engineers. Machine Learning Engineers are a special kind of Data Engineer who know how to do a lot of the training of models and how to apply those models in production. It may be that they understand the maths involved in the training, but it is not absolutely necessary. For example an ML Engineer might be able to build a recommendation engine while using algorithms supplied from a library without being able to do the maths themselves.
Platform Engineer
Data Engineers are typically solving business problems in their work. What they manipulate are data flows or pipelines. Their customers are typically parts of the business who need the output from their work.
However it is typically the case that a certain amount of software development needs to be done to provide for a system or platform which is used by large numbers of Data Engineers. So for example Platform engineers may:
- run and maintain the data lake storage in a Big Data system,
- run the scheduling software which kicks off data pipelines when the source data arrives
- run the Apache Kafka system which acts as a landing zone for incoming data and works in a similar way to a message queue on steroids (OK, Kafka is not really a message queue but go with it for now),
- create Chef/Puppet/Ansible scripts for creating working machines which form part of a Big Data system.
It may be useful to think of the Platform Engineers as people who help the Data Engineers do their job.
Operations
If you have any kind of long running software systems you typically need some kind of operations staff to monitor them and deal with any problems. It may be true that with the cloud you need fewer operations staff, but possibly more highly skilled ones.
It may also be the case that in a DevOps orientated environment a lot of the problems get pushed straight to Data Engineers and Platform Engineers who may be doing 2nd line support.
ML / Machine Learning Engineer
See Data Engineer…
DataOps Engineer
DataOps is a specialised form of DevOps which tries to generalise and make sense of the lifecycle that data goes through. I have not seen many DataOps people as it is more likely that they are specialised forms of Data or Platform Engineers.
Cloud Architects/Engineers
Cloud Architects are a specific form of platform engineer that (unsurprisingly) specialise in creating data pipeline environments in a cloud. They will typically specialise in one of the largest public clouds (e.g. Google Cloud Platform, Amazon AWS, Microsoft Azure) with some knowledge of the others. However it is also possible that they will be aware of private cloud systems such as OpenStack.
Due to the rise of “Infrastructure as Code” there is very little difference between a Cloud Architect and a Cloud Engineer. In either case the design of cloud based infrastructure can be specified in code (Terraform, CloudFormation, Chef, Puppet, Ansible for example) and that code used to repeatedly spin up the necessary cloud hosted hardware.
Almost uniquely in this list, there are lots of certifications for evaluating the skills of Cloud Architects, such as the AWS Certified Solutions Architect certifications, which come in different grades including “Associate” and “Professional”.
ELT Engineers
I have included this for completeness, though in reality few, if any, people have this as a job title. In a large Big Data environment the business people who may previously have done ETL (Extract-Transform-Load) into traditional RDBMSs now find themselves using similar tools to do “ELT” (Extract-Load-Transform). The change in order is down to the size of the data: it is easier to transform *Big* Data once it is already in a big data system than to work hard to manipulate it outside of the system scaled for that task.
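The change in order can be sketched as follows, again with Python's built-in sqlite3 module standing in for a much larger system. The table names (`raw_sales`, `sales_by_customer`) and the rows are invented for illustration. The raw data is loaded untouched, and the transformation is then expressed as SQL that runs inside the system holding the data.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Extract and Load: the raw rows go in exactly as they arrived, mess and all.
warehouse.execute("CREATE TABLE raw_sales (name TEXT, amount_pence INTEGER)")
warehouse.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("  alice ", 1050), ("BOB", 200), ("alice", 450)],
)

# Transform: done afterwards, by the system that already holds the data.
warehouse.execute(
    "CREATE TABLE sales_by_customer AS "
    "SELECT LOWER(TRIM(name)) AS customer, "
    "       SUM(amount_pence) / 100.0 AS total_pounds "
    "FROM raw_sales GROUP BY LOWER(TRIM(name))"
)

result = warehouse.execute(
    "SELECT customer, total_pounds FROM sales_by_customer ORDER BY customer"
).fetchall()
print(result)
```

Because the raw table is kept, the transformation can be changed and rerun later without going back to the source.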
BI (Business Intelligence)/Data Visualisation
There are still people doing business intelligence, even in a Big Data world. They typically still use Tableau, PowerBI and other reporting or graphing tools. Their techniques may change slightly due to the scale of the data, but their job remains roughly the same.
Author
Alex McLintock runs Alephant, which helps companies in London to design new systems for Big Data Analytics and Data Science. He has written several articles here. Contact us for all your Big Data design problems.
Image of leaves in autumn taken from a photo by Robert Thiemann originally from Unsplash
Photo of Alex taken by Ellen Sturm.
I mentioned Jesse Anderson’s book Data Teams. For more information about it please visit https://www.datateams.io/
Discussion
If you want to comment on this blog please do so on my LinkedIn profile