A quick guide to understanding the different types of databases and common data architecture patterns | by Hanson Chiu | April 2022
I have worked on data-related jobs and projects for over 6 years. As data technologies evolve (along with the buzzwords, especially around the different layers of data), more confusion and misunderstanding arise, which makes technical communication harder.
This story aims to clarify the different data terms in a simple way with supporting diagrams to illustrate the high-level data architecture model.
If I had to rank the top 3 most confusing data terms, I would vote for the following three; even some data practitioners with many years of experience cannot tell them apart.
Database: this is the most generic term for an organized collection of stored data. Generally, when people compare a database to a data warehouse, the database refers to the “Operational Database”, which 1) serves as backend data storage for applications, 2) is used for transaction processing, and 3) has a flat relational data structure.
The data architecture for the operational database is relatively simple and directly related to the application.
Data Warehouse: a centralized data store that integrates transactional data and historical data from operational databases / source systems for analytical purposes (e.g. data science, BI).
The data architecture for the data warehouse is the next level up from that of the operational database: it consolidates data from different operational databases and performs the transformation/computation via ETL. There is a staging area that receives source files from operational databases / source systems before they are ingested into the data warehouse.
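The staging-then-warehouse flow described above can be sketched in a few lines. This is only a minimal illustration, assuming an in-memory SQLite database as a stand-in for a real warehouse; the table names (`staging_sales`, `dw_daily_sales`) and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical extracts from two operational databases / source systems.
ORDERS_SYSTEM = [("2022-04-01", "A100", "19.99"), ("2022-04-01", "A101", "5.00")]
BILLING_SYSTEM = [("2022-04-02", "A100", "19.99")]


def run_etl(conn):
    """Extract rows into a staging table, then transform and load them."""
    cur = conn.cursor()
    # Staging area: raw rows land here unchanged (amount is still text).
    cur.execute("CREATE TABLE staging_sales (sale_date TEXT, product TEXT, amount TEXT)")
    # Warehouse table: typed and aggregated per day and product.
    cur.execute("CREATE TABLE dw_daily_sales (sale_date TEXT, product TEXT, total REAL)")

    # Extract: copy the source rows into staging.
    cur.executemany(
        "INSERT INTO staging_sales VALUES (?, ?, ?)",
        ORDERS_SYSTEM + BILLING_SYSTEM,
    )

    # Transform + load: cast the text amounts and aggregate into the warehouse.
    cur.execute(
        """
        INSERT INTO dw_daily_sales
        SELECT sale_date, product, SUM(CAST(amount AS REAL))
        FROM staging_sales
        GROUP BY sale_date, product
        """
    )
    conn.commit()
    return cur.execute(
        "SELECT sale_date, product, total FROM dw_daily_sales "
        "ORDER BY sale_date, product"
    ).fetchall()


conn = sqlite3.connect(":memory:")
warehouse_rows = run_etl(conn)
```

In a real setup the extract step would read files or query the source systems, and the transformation logic would usually be much richer; the point here is only the separation between the staging table and the warehouse table.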
The data mart concept is also introduced with the data warehouse. Different data marts are designed and created for different purposes, with regular transformations/calculations via ETL within a single data warehouse. Each of these subsets often covers a siloed segment of the business (for example, sales, finance, or marketing). In other words, a data warehouse can contain multiple data marts.
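As a rough sketch of the idea that one warehouse can feed several purpose-built subsets, the snippet below derives one small table per business segment from a shared warehouse table. It assumes SQLite as a stand-in for the warehouse; `dw_transactions`, the `mart_` prefix, and the sample departments are hypothetical names chosen for illustration.

```python
import sqlite3

# Only these segment names may be turned into table names.
DEPARTMENTS = ("sales", "marketing")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_transactions (department TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO dw_transactions VALUES (?, ?, ?)",
    [
        ("sales", "laptop", 1200.0),
        ("sales", "mouse", 25.0),
        ("marketing", "ad campaign", 300.0),
    ],
)


def build_mart(conn, department):
    """Create a per-department subset table from the shared warehouse table."""
    if department not in DEPARTMENTS:
        raise ValueError(f"unknown department: {department}")
    table = f"mart_{department}"  # safe: department was checked against DEPARTMENTS
    conn.execute(f"CREATE TABLE {table} (item TEXT, total REAL)")
    conn.execute(
        f"INSERT INTO {table} "
        "SELECT item, SUM(amount) FROM dw_transactions "
        "WHERE department = ? GROUP BY item",
        (department,),
    )
    return conn.execute(f"SELECT item, total FROM {table} ORDER BY item").fetchall()


sales_mart = build_mart(conn, "sales")
marketing_mart = build_mart(conn, "marketing")
```

Each segment then queries only its own small table, while the warehouse table remains the single source they are all derived from.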
Data Lake: unlike a database, the data lake does not sit inside a database engine/structure. It is a storage repository that holds a large amount of raw data in its native format (for example images, MP3, JSON, or CSV files), whether structured, semi-structured, or unstructured.
The data architecture for the data lake is an evolution of the data warehouse: it expands the storage area with the ability to capture and store raw files (including semi-structured/unstructured data) from different sources in a consolidated landing zone before loading them into the data warehouse.
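To make the landing-zone idea concrete, here is a small sketch that uses a temporary directory as the "lake" and SQLite as the "warehouse". Both are assumptions for illustration only, as are the file names and the `dw_orders` table; a real data lake would typically live on object storage or a distributed file system.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path


def land_raw_files(landing_zone):
    """Store files from different sources as-is, in their native formats."""
    clicks = [{"customer": "u1", "page": "/home"}, {"customer": "u2", "page": "/pricing"}]
    (landing_zone / "clicks.json").write_text(json.dumps(clicks))  # semi-structured
    with open(landing_zone / "orders.csv", "w", newline="") as f:  # structured
        writer = csv.writer(f)
        writer.writerow(["customer", "amount"])
        writer.writerow(["u1", "19.99"])
    # Unstructured: placeholder bytes standing in for an image file.
    (landing_zone / "logo.png").write_bytes(b"\x89PNG placeholder bytes")


def load_orders(landing_zone, conn):
    """Load only the structured file into a warehouse table."""
    conn.execute("CREATE TABLE dw_orders (customer TEXT, amount REAL)")
    with open(landing_zone / "orders.csv", newline="") as f:
        rows = [(r["customer"], float(r["amount"])) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO dw_orders VALUES (?, ?)", rows)
    return conn.execute("SELECT customer, amount FROM dw_orders").fetchall()


with tempfile.TemporaryDirectory() as tmp:
    zone = Path(tmp)
    land_raw_files(zone)
    landed = sorted(p.name for p in zone.iterdir())
    conn = sqlite3.connect(":memory:")
    loaded = load_orders(zone, conn)
```

Note that the JSON and image files stay in the landing zone untouched; only the data that is needed in tabular form is transformed and loaded into the warehouse.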
Data Lakehouse: a newer data architecture concept that combines elements of the data warehouse with those of the data lake. With modern technologies, different data citizens (including data scientists, data engineers, business analysts, and BI developers) can benefit from centralized data storage for different analytical activities.
We will dive deeper into the design of the Data Lakehouse in another story, with a comparison against other common data architectures.
When people decide on a database, they often refer to OLAP (“online analytical processing”) and OLTP (“online transaction processing”). These terms are used to define the purpose of the system, which in turn clarifies the usage of and requirements for the underlying database.
OLAP: OLAP applications are designed for analysis that supports decision making. Examples include machine learning/AI and business intelligence applications for data visualization.
The data warehouse (along with data marts) is normally the primary database consumed by OLAP applications. The data cube is also a popular form of OLAP application database; it pre-calculates aggregations/rollups to increase analysis efficiency.
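The pre-calculation behind a data cube can be illustrated with plain Python. This is a toy sketch with hypothetical facts and dimension names, not how any particular OLAP engine stores its cubes: every combination of dimensions is aggregated up front, with `"ALL"` marking a rolled-up dimension.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows with two dimensions and one measure.
FACTS = [
    {"region": "EU", "product": "laptop", "amount": 1000},
    {"region": "EU", "product": "mouse", "amount": 20},
    {"region": "US", "product": "laptop", "amount": 1500},
]
DIMENSIONS = ("region", "product")


def build_cube(facts, dimensions):
    """Pre-aggregate the measure for every subset of dimensions."""
    cube = defaultdict(int)
    for fact in facts:
        # For each subset of dimensions to keep, roll the others up to "ALL".
        for size in range(len(dimensions) + 1):
            for kept in combinations(dimensions, size):
                key = tuple(fact[d] if d in kept else "ALL" for d in dimensions)
                cube[key] += fact["amount"]
    return dict(cube)


cube = build_cube(FACTS, DIMENSIONS)

# Analytical questions become dictionary lookups instead of scans over FACTS.
total_sales = cube[("ALL", "ALL")]
eu_sales = cube[("EU", "ALL")]
laptop_sales = cube[("ALL", "laptop")]
```

The trade-off is the usual one: more storage and a heavier load step in exchange for fast answers to rollup queries.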
OLTP: OLTP applications allow a large number of users to execute transactions in real time. These operational applications could be banking transaction systems (e.g. ATMs) or hotel booking/reservation systems.
Operational databases are normally designed for OLTP applications with a flat/simple relational database structure.
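A banking-style transfer is the classic OLTP example: either both account updates happen or neither does. The sketch below assumes SQLite via Python's `sqlite3` module as a stand-in for an operational database; the `accounts` table and the `transfer` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint stops any account from going negative.
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint; the credit was rolled back too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts"))


accepted = transfer(conn, "alice", "bob", 30)   # enough funds
rejected = transfer(conn, "alice", "bob", 500)  # would overdraw alice
final_balances = balances(conn)
```

The second transfer fails after the credit to `bob` has already been applied inside the transaction, so the rollback is what keeps the two balances consistent.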
DBMS (“Database Management System”): this is software used to store, retrieve, and query data. It provides an interface for interacting with the database (e.g. to create, read, update, and delete data in the database). Some examples are MySQL, PostgreSQL, Microsoft Access, SQL Server, FileMaker, Oracle, and dBASE.
Strictly speaking, the term DBMS on its own does not imply that relationships between the data are modeled; a basic DBMS may only support flat datasets.
RDBMS (“Relational Database Management System”): a type of database management system (DBMS) that stores data in a row-based table structure and connects related data items to form a relational database.
In other words, users can define relationships using primary keys and foreign keys, with integrity checks enforced by the system, and transactions typically follow the ACID properties.
Some examples of RDBMSs include IBM Db2, Oracle, MySQL, Microsoft SQL Server, and PostgreSQL.
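Here is a short sketch of primary keys, foreign keys, and an integrity check in action. It assumes SQLite through Python's `sqlite3` module (where foreign key enforcement has to be switched on with `PRAGMA foreign_keys = ON`); the `customers` and `orders` tables and the `try_insert_order` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    )
    """
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1, 19.99)")


def try_insert_order(conn, order_id, customer_id, amount):
    """Return True if the order was stored, False if an integrity check rejected it."""
    try:
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer_id, amount)
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Customer 999 does not exist, so the foreign key check rejects this order.
orphan_accepted = try_insert_order(conn, 11, 999, 5.0)

# The declared relationship is what makes the join between the tables meaningful.
joined = conn.execute(
    "SELECT c.name, o.amount FROM orders o JOIN customers c ON c.id = o.customer_id"
).fetchall()
```

Without the foreign key, the orphan order would be stored silently and only surface later as a row that never matches any customer.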
As a data scientist, understanding the different data solutions with their natures and uses is essential to help you choose the best backend data storage for different business use cases.
There could be horrible consequences if the wrong solution is selected and implemented. Mismatches in performance, efficiency and cost could lead to real losses for organizations.
In the next story, we’ll dive deeper into modern data architecture design in Data Lakehouse and compare the pros and cons with traditional data architecture. Stay tuned.