What is the difference between a Database, Data Warehouse and Data Lake?

Introduction

The terms Database, Data warehouse and Data lake are becoming more frequent in todays world as we collect and store more and more data. In this article I would like to explain the difference between each of these terms and why you should care about the subtle differences as an engineer.

Why should reliability engineers care?

You may not be directly involved in the use of these different data storage systems; however, it is useful to have an awareness and not be blinded by jargon. The chances are that if you are working with large datasets and 3rd party software packages, these terms will arise at some stage. To set the scene and jump straight into this topic we will begin with an example.

A reliability engineering example

Imagine you are working in an operational factory of some sort. You perform system reliability simulations as part of your APM programme. This requires you to combine data about asset cost, downtime cost, maintenance schedules, number of expected failures, maintenance and spare parts cost. These are the inputs to your reliability modelling. The question is, how should you acquire and store all of this data?

After identifying the required data, you need to source that data. In some cases we will find exactly what we are looking for, and in other we have to derive it from other data. Once we have sourced our data, we must then transfer it into a single source and store this combined data. These steps are explained in more detail below.

1. Sourcing your data

The first thing you need to do is source your data, which exists in various databases across your business, such as:

Enterprise Asset Management (EAM)
containing data on: Asset cost, lifecycle, safety information etc.
Computerised Maintenance Management System (CMMS)
containing data on: maintenance tasks, schedules, spares etc.)
Data Recording, Analysis & Corrective Action System (DRACAS)
containing data on: failures, problems and resolutions etc.

2. Combining your data

The next thing you need to do is bring the data together into a single source, let’s say into a Data Warehouse, requiring you to extract the data from your Databases, perform some transformations and then load the data into your warehouse.

Finally, you import your up to date, clean data into your reliability analysis tool and run your analysis. The data may also be used by other members of your team for reports and other analysis - a major perk of having your Data Warehouse all set up.

Similar examples requiring this type of data handling could be criticality analysis, life data analysis, RCM etc. At the end of the day, whether you are accessing qualitative or quantitative data, it is likely stored in one of the aforementioned sources. So, let’s break down some of these data storage terms.

What is a Database?

Database summary:

Data is real-time.
Transactions are regular in frequency (e.g. daily).
Data is stored in a strict manner and therefore has a higher level of trust within an organisation.
Databases use Online Transactional Processing (OLTP) to create, update and delete records in real time.

Database description

Ok so what does that mean. A Database is likely the most familiar of our 3 terms, chances are that you interact with multiple databases daily, whether you know it or not. They are widely used and capture data from all kinds of transactions. This is the primary aspect of Databases; they provide up to date, live information.

Note that Databases come in many forms, the highest level of distinction is if a database is relational or not. We won’t be delving into the details between these here, but the more common flavour is (for now) the relational Database, which we interact with using Structured Query Language (SQL).

Various systems are available to manage these relational databases, appropriately named Relational Database Management Systems, or RDBMS for short, examples include Microsoft SQL Server, Oracle and MySQL.

What is a Data Warehouse?

Data Warehouse summary:

Data is collected from multiple sources.
Data is historic (not real-time).
Data is used for analysis and reporting.
Online Analytical Processing (OLAP) is used to allow fast execution of more complex queries.
The Extract, Transform, Load (ETL) process is used.

Data Warehouse description:
Now onto the Data Warehouse, you will notice some key differences from the Database; in essence the Data Warehouse is the playground of the analyst.

The analyst does not want to repeatedly run large queries to collect data, re-shape, group, summarise etc. this is why we have a Data Warehouse. The data within the Data Warehouse will have already been subjected to these kinds of pre-processing.

The data may not be real-time, but the environment is optimised to favour analysis, with the ultimate aim to support business decisions. The data warehouse often becomes the source of our analysis data, and may be used for various types of analysis. You may even have different Data Warehouses for different types of analysis or teams.

What is a Data Lake?

Data Lake summary:

Data is collected from multiple sources.
Stores structured and unstructured data in its raw form.
Data is stored more efficiently than in a Data Warehouse.
The Extract, Load, Transform (ELT) process is used.

Data Lake description:
The Data Lake holds raw unprocessed data collected from multiple sources, this data may be structured or unstructured. There can be a real mix of data in a Data Lake, everything from images, bodies of text, sensor data, transaction data, and everything else you can think of. Because of this, Data Lakes typically hold massive amounts of data.

The purpose of the data in a Data Lake is not necessarily defined, unlike that in the Database and Data Warehouse (this is why I didn’t include it in the example above). Not everyone needs a Data Lake, and it really depends on the scale of your business.

The analogy of a lake is very relevant, imagine the water (data) being brought in from multiple sources, some may be ‘processed’ some may not, it is likely that not all of the water will be used, but it can be sent to different locations then further processed for specific uses.

Here is an example of a Data lake with a variety of data sources, you can see that a Data warehouse is just one of the destinations from the Data Lake.

Summary

I hope this article has helped you understand the main different types of data storage you might encounter on your journey as a reliability engineer. Don’t be afraid to ask for clarification when you hear these terms being used. There are no hard and fast rules on how the data must flow from one system to the next, and for this reason every company will be different. Performing a gap analysis and developing a thorough reliability programme plan is a good place to start if you are unsure about what your data needs are.

How we can help you

Wilde analysis can help you with this whole process, from form design, to tool configuration, and finally data migration. If you would like to discuss how we can support you in any of these activities please get in touch by contacting us on +44 (0) 333 996 9930 or email: info@wilderisk.co.uk to discuss your specific needs.

Close the GAP between your Performance and Potential

Do your FMEAs fit within a wider Business Reliability Strategy?

Do your Alarm Systems Meet Industry Standards? Check with EEMUA191

ReliaSoft: Request a Trial

Ansys medini analyze: Free Trial

Alarm Systems Management aligned with EEMUA 191