Everything You Know About Business Intelligence, Data Warehousing and ETL is Wrong — Part I

June 24, 2013

A History of Yesterday

For at least the last 25 years or so, certainly ever since researchers Barry Devlin and Paul Murphy coined the term “business data warehouse”, various vendors and technologies have been carving up and attempting to lay exclusive claim to overlapping slices of the data warehouse ecosystem - the sum total of the tools and methods required to support a data warehouse from source systems to end-users. You are familiar with these slices; they go by names and acronyms like: Extract, Transform & Load (ETL); Extract, Load & Transform (ELT); Data Quality (DQ); Data Profiling (DP); Master Data Management (MDM); Datamarting and Cubing; Database Federation; Data Warehouse Appliances (DWA); Business Intelligence (BI); Decision Support Systems (DSS); Executive Information Systems (EIS); Query & Reporting (Q&R); Enterprise Information Integration (EII); Advanced Analytics (AA); and Visualization, among many others. For each of these you can probably name at least two or three distinct vendors off the top of your head. The thing is, though, these are first and foremost marketing distinctions, driven by the needs of these vendors to differentiate themselves; secondarily, these slices are historical atavisms, reflective of sometimes decades-old technological limitations. In truth, data warehousing begins with data and it ends with data, and there is nothing in between but data. To understand how fundamentally this should impact both your strategic and operational approaches to information architecture, data governance, and vendor management, we need a quick review of the history of data warehousing.

In the beginning, there were systems, usually mainframes, optimized for the processing of business transactions. These systems were rather straightforwardly known as OLTP, or online transaction processing, systems. OLTP systems were (and still are) great for handling large numbers of concurrent transactions which require the application of complex business rules. OLTP systems were (and still are) terrible at organizing, aggregating and trending either their input or output data values; in other words, they are terrible at actually reporting on the business processes they support. The effort required to collect, clean, organize, aggregate and store these input and output data values for reporting was dear in terms of time, people and dollars. Worse, the effort was often repeated independently for each new report. It was in response to this business pain that Devlin and Murphy in 1988 proposed an architecture for a “business data warehouse”.

Their architecture made use of some new and some old technologies, most notably dimensional data schemas, which had been around since the ‘60s, and the database management systems developed in the ‘70s, which were optimized to query them. Within 5 years of Devlin and Murphy's proposal came a series of firsts: the first database optimized for data warehousing; the first software for developing data warehouses; the first book on data warehousing; and the first publication of the 12 rules of on-line analytical processing (OLAP), which have provided conceptual and architectural underpinnings for OLAP products ever since. By 1996 the two major philosophies of data warehousing were established and doing battle to the death: Bill Inmon’s top-down, subject-oriented, non-volatile and integrated corporate information factory versus Ralph Kimball’s bottom-up, departmentally-oriented, versioned and conforming datamarts.

The chip, memory, disk, bus and software architectures of the early- to mid-‘90s severely restricted both the size and the speed of the data warehouse relative to the amount of data that was available for collection and processing. Furthermore, the implementation of a data warehouse architecture created an absolute need for the movement and manipulation of relatively large amounts of data between physical devices and logical schemas. This was the fertile soil in which a profusion of vendors and proprietary technologies germinated, each trying to define and grow into a niche from which to out-compete both their direct and next-nearest rivals. What had begun as a somewhat academic exercise in the ‘60s and ‘70s was a crowded and growing, multi-billion dollar, world-wide market by the turn of the millennium.

It was also around this time, in the early ‘00s, that many companies which had been relying on extremely labor-intensive processes, such as custom-coded applications, manual data extracts, and analyst-maintained spreadsheets, began to become aware of a better way to manage their data. As they began to look to the consultants and vendors who could help them understand and implement this better way, they encountered and internalized the sprawling ecosystem of acronyms with which we began this editorial. The model for a data warehouse implementation was to engage a systems integration consulting firm, which would recommend the purchase of several distinct and expensive best-of-breed tools, each with its own dedicated hardware, and then to spend years stitching all of these pieces together while integrating them into the existing corporate business processes and IT infrastructure.

Suddenly data warehouses were big, expensive, inefficient and prone to failure. Somewhere, somehow “state-of-the-art”, tool-centric data warehousing had recreated nearly every one of the business pains which had inspired the original “business data warehouse” architecture.

If much of the ‘90s and early ‘00s were about the proliferation of specialized vendors, technologies, methodologies and proprietary hardware/software, the latter half of the ‘00s has been about the consolidation of vendors through acquisition, the integration of technologies via either metadata “glue” or operating system coupling, the convergence of methodologies and the commoditization of data warehousing hardware and software. Unfortunately, this consolidation has been driven less by a vision of what data warehousing should be than by a defensive strategy intended to forestall market disruption. Over the last 5 years, Open Source Software (OSS), especially Free Open Source Software (F/OSS) and Hybrid Open Source Software (H/OSS), has matured from a fringe movement of academics and anti-corporate radicals into the mainstream of enterprise software development. In fact, it is essentially impossible to find an enterprise software suite or platform today which does not contain a significant amount of OSS code. Furthermore, for just about every acronym in the first paragraph of the first post in this series, there are now one or more OSS applications with anywhere from 40% to 80% of the functionality, features, performance and stability of their proprietary progenitors.

One final development completes our quick review of the history of data warehousing, and this is the rise of The Cloud. The Cloud is an overhyped buzzword, and it is in many ways simply a repackaging and updating of old mainframe timesharing technologies from the ‘60s and ‘70s and/or client-server technologies from the ‘80s and/or grid computing technologies from the ‘90s and/or virtualization technologies from the ‘00s. But this view misses the point. The Cloud is really three different on-demand, scalable, zero-latency services: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS). IaaS eliminates the need to install, configure and maintain server and network hardware, while PaaS and SaaS eliminate the need to install, configure and maintain enterprise platform and application software. All three eliminate, and this is the key, the need to purchase and maintain excess capacity as a buffer against both anticipated and unanticipated changes in future demand. This is where the bulk of the cost savings in The Cloud comes from.
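
The excess-capacity argument can be made concrete with a little arithmetic. The sketch below is purely illustrative: the demand figures, the 25% safety buffer, the unit cost and the 50% on-demand price premium are all hypothetical assumptions, not numbers from any vendor or study. It compares owning enough capacity to cover peak demand plus a buffer against paying only for what is used each hour.

```python
def provisioned_cost(hourly_demand, buffer_fraction, unit_cost):
    """Cost of owning capacity sized for peak demand plus a safety buffer.

    The full capacity is paid for in every hour, whether it is used or not.
    """
    capacity = max(hourly_demand) * (1 + buffer_fraction)
    return capacity * unit_cost * len(hourly_demand)


def on_demand_cost(hourly_demand, unit_cost, premium):
    """Cost of paying only for capacity actually used, at a marked-up rate."""
    return sum(hourly_demand) * unit_cost * (1 + premium)


# Hypothetical demand, in capacity units per hour, with one spike.
demand = [10, 10, 10, 40, 10, 10]

fixed = provisioned_cost(demand, buffer_fraction=0.25, unit_cost=1.0)
elastic = on_demand_cost(demand, unit_cost=1.0, premium=0.5)

print(f"provisioned for peak: {fixed:.0f}")
print(f"on demand:            {elastic:.0f}")
```

Under these assumed numbers the on-demand option comes out cheaper even at a substantial per-unit premium, because the owned capacity sits idle outside the spike. With flatter demand or a higher premium the comparison can reverse.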

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.
