Big Data: An Evolutionary Perspective on Data Warehouse Architecture

By Moises J Nascimento, Chief Data Architect, PayPal

The challenge of developing an enterprise data system that is able to meet millisecond transaction response times—and, simultaneously, integrate data fast enough for near real-time analysis across multiple data platforms—can be overwhelming. Most companies struggle with multiple data silos built across platforms that blend transactional database systems, data warehousing, Big Data systems, NoSQL, in-memory stores and message buses, dragging years of technical debt in their wake.

What is the way out? Let’s first take a look at how we got here.

From Mainframe to Data Warehousing

In the early 90s, when I was around seventeen years old, I started working as an intern on the database architecture team at a large car factory in Brazil. We were using the “state of the art” technology: an IBM mainframe running IMS (hierarchical) and DB2 (relational) databases. On the mainframe, we had a very consistent and mature data management system, with data models, physical schemas, metadata, and standardization. However, reporting capabilities were very limited.

At the time, delivering month-end reports meant loading stacks of boxes containing the printouts into a pickup truck. The reports went to an army of analysts, who would laboriously enter the data into Excel so that they could run analyses, build aggregations, and produce charts.

When client/server became an alternative to the mainframe, I started working as an Oracle DBA. Though I could expand and leverage the same data architecture principles from the mainframe, the new architecture presented a major challenge: how to integrate data across platforms.

The answer to that was Data Warehousing: I started building a marketing and financial database that would bring all the data back together. In those early days, the common challenges centered on physical design, ETL architecture, network latency, unstable storage systems, and database servers that were just starting to add features like parallel processing.

While most DW literature made references to data mining and unstructured data management, there was no technology available outside the mainframe to process massive amounts of data. And even as relational EDW databases matured, they were not designed to deal with unstructured data sets. We attempted creative solutions, but the growth of data and of its uses would soon reveal that the EDW model was not going to scale, or make sense from a cost perspective.

The EDW’s limitations worsened with the growth of e-commerce, when the amount of unstructured data started to increase: web traffic data, logs, and data from social networks. We clearly needed a new massively parallel processing paradigm.
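To make the shift concrete: the paradigm that was about to arrive splits work into a map phase that processes each record independently and a reduce phase that aggregates results by key. Here is a toy, single-process Python illustration (the log lines are invented); on a real cluster, each phase runs in parallel across many machines.

```python
# Toy, single-process illustration of the map/reduce paradigm Hadoop
# would popularize: count page hits per URL from raw log lines. On a
# real cluster, the map and reduce phases each run in parallel.
from collections import defaultdict

logs = [
    "2014-06-01 GET /home",
    "2014-06-01 GET /cart",
    "2014-06-01 GET /home",
]

# Map: each record independently emits a (key, 1) pair.
mapped = [(line.split()[-1], 1) for line in logs]

# Shuffle: group the pairs by key.
groups = defaultdict(list)
for url, one in mapped:
    groups[url].append(one)

# Reduce: each group is aggregated independently.
hits = {url: sum(ones) for url, ones in groups.items()}
print(hits)  # {'/home': 2, '/cart': 1}
```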

Hadoop: A Data Warehouse Architect’s Dream Come True

Data Warehouse architecture helped us address many data management requirements in the context of a largely distributed database environment. However, unstructured data management, as well as scientific data processing and mining, remained a major gap.

When I started researching Hadoop, I felt really excited about the possibilities of addressing these gaps. However, with the new technology came the hype, and it looked like, all of a sudden, Data Warehousing was a thing of the past. What many new data professionals missed is that, to successfully harness the power of vast amounts of unstructured data, the company’s core transactional data still had to be integrated and modeled together, in order to give the unstructured data context.

Therefore, in order to architect a data system that leverages both the EDW and Hadoop, we need to revisit some old EDW “facts” that are no longer true:

1) All data must be in one monolithic EDW server.
2) Analytical data duplication and redundancy are bad.
3) EDW is a downstream system.

From the “Warehouse” into a “Store” next to you!

Before we cover my view on the next generation of data systems, let me talk a bit about how I see the Hadoop ecosystem today. The system is in its initial phase of evolution, and I believe the paradigm will continue to evolve to perform DW/RDBMS-like functions on a much cheaper, open source, shared-nothing architecture. Solutions like HBase, Impala, and Drill are proof of this trend, but we are not there yet.
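As a concrete illustration of that trend, a classic warehouse-style rollup can already be pushed down to Hadoop-resident data through Impala’s SQL interface. The sketch below assumes the impyla Python client and a reachable Impala daemon; the host, table, and column names are hypothetical:

```python
# Minimal sketch: running a DW-style aggregate on Hadoop via Impala.
# Assumes the `impyla` package; host, table, and columns are hypothetical.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# A classic warehouse-style rollup, pushed down to the Hadoop cluster.
cur.execute("""
    SELECT event_date,
           COUNT(*)                AS events,
           COUNT(DISTINCT user_id) AS users
    FROM web_events
    WHERE event_date >= '2014-01-01'
    GROUP BY event_date
    ORDER BY event_date
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```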

Therefore, considering the current maturity of the Hadoop ecosystem, and applying the core Data Warehousing and Data Architecture functions, let’s look at the traditional role of the EDW and at how it can integrate with Hadoop in an effective architecture: one that uses storage and access patterns to rationalize data across the platform and throughout its lifecycle.

In the new architecture, we expand all layers to create a flexible, on-time (online, near real-time, batch) and democratic data platform without losing control over data governance, quality, and the source of record. To achieve that, we combine the Operational Data Store (ODS) with the core EDW layer, where we get a lower-latency repository for all sources of record and core metrics, so that the data can be distributed for all the different kinds of analytical usage. We also achieve lower latency by bringing data streaming, real-time analytics engines (such as Storm), and Hadoop into the integration layer, performing all data processing closer to the source with controlled performance and SLAs; a minimal sketch of this dual-path integration appears after the list below. The three EDW “facts” I mentioned earlier can then be rewritten in this new architecture:

1) Data is rationalized and stored in an integrated data platform dedicated to real-time streaming, batch, and core EDW metrics processing.

2) Data is replicated and distributed into the end-user RDBMS and Hadoop environments while, at the same time, maintaining the EDW concept of a single source of truth.

3) The EDW and ODS concepts merge and become an enterprise data store and source of record for the core data sets and metrics. The EDW becomes an integration environment without ad-hoc querying.
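As promised above, here is a minimal sketch of the dual-path integration layer: a single consumer reads each event off a message bus, rationalizes it once close to the source, and fans it out to both a low-latency streaming path and a batch path landed for Hadoop and the core EDW. All the class, topic, and field names are hypothetical stubs rather than any specific product’s API:

```python
# Minimal sketch of the dual-path integration layer described above.
# MessageBus, MetricStore, and BatchLanding are hypothetical stubs standing
# in for a real message bus, a low-latency metrics store, and an HDFS-style
# landing zone; the topic and field names are illustrative only.
import json

class MessageBus:
    """Stub source: yields raw events from a topic."""
    def consume(self, topic):
        yield json.dumps({"user_id": 42, "amount": 9.99, "ts": "2014-06-01T12:00:00"})

class MetricStore:
    """Stub sink for the streaming path (real-time dashboards, alerts)."""
    def increment(self, key, value):
        print("streaming path:", key, "+=", value)

class BatchLanding:
    """Stub sink for the batch path (landed for Hadoop / core EDW loads)."""
    def append(self, record):
        print("batch path:", record)

def integrate(bus, metrics, landing, topic="payments"):
    """Rationalize each event once, close to the source, then fan out."""
    for raw in bus.consume(topic):
        event = json.loads(raw)                               # parse near the source
        metrics.increment(event["user_id"], event["amount"])  # near-real-time metric
        landing.append(event)                                 # durable copy for batch

if __name__ == "__main__":
    integrate(MessageBus(), MetricStore(), BatchLanding())
```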

In conclusion, when we leverage the strengths of data warehousing architecture, Big Data technologies, and cloud computing principles, we can build a data platform where data and insights can be delivered as a Service.
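As one illustration of what “delivered as a Service” could look like, a governed core metric might be exposed behind a thin HTTP endpoint. This is a minimal sketch using only the Python standard library; the metric name, path, and returned values are hypothetical placeholders:

```python
# Minimal sketch of "data as a Service": exposing a governed core metric
# over HTTP. Standard library only; the metric name, path, and values are
# hypothetical placeholders for data served from the integrated platform.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def lookup_metric(name):
    # Stub: a real platform would read from the integrated data store.
    return {"metric": name, "value": 1234.5, "as_of": "2014-06-01"}

class MetricHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /metrics/daily_volume -> the single-source-of-truth value
        if self.path.startswith("/metrics/"):
            body = json.dumps(lookup_metric(self.path.split("/", 2)[2])).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), MetricHandler).serve_forever()
```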