In my first article, I shared the observation that healthcare organizations are behaving like technology early adopters and deploying next-generation analytic architectures. This is counter to the healthcare industry’s customary, more conservative behavior regarding technology adoption. I offered the explanation that the well-known challenges and limitations of the traditional enterprise data warehouse (EDW) approach to analytics were the impetus for healthcare organizations being willing to try something new and different. I also made the point, however, that the EDW approach is far from obsolete and absolutely has a role to play in a modern-day analytics architecture. This role is as a centralized repository of trustworthy data that has been tightly governed and can be used broadly across the enterprise by a variety of data consumers with differing skills. With an EDW, we design a place for everything – and everything must be in its place!

This stands in contrast to the data lake approach, which offers far greater flexibility and agility in getting data in, but at the cost of greater complexity and the skills and expertise required to derive valid data and insights.
The data lake also presents the challenge of effectively cataloging and managing an inventory of the data in the lake, along with the associated access permissions, given the diverse sources and types of data the data lake approach can support.

As with any technology solution, there is seldom a 100% right answer to any question, and choosing an EDW approach versus a data lake approach for any particular data management and analytics need is no different. However, we can pick a couple of examples that illustrate a situation where an EDW is clearly the preferred approach, and conversely, an example where a data lake is clearly the preferred approach.

In evaluating the preferred approach, we’ll look across six dimensions or characteristics of the data management and analytics use-cases that I have found helpful. These are:

⦁ Who is the consumer of the data? Will the data be accessed and used by report writers who largely know only SQL and may be unfamiliar with the nuances and different interpretations of what the data means? Or will the data be accessed by a limited audience of data scientists and “data jockeys” with in-depth statistical knowledge and familiarity with the variety of possible interpretations of what the data may mean?

⦁ How broadly will the data be used across the enterprise? Will the data be used broadly across the enterprise in support of many, diverse analytical and reporting scenarios where consistency and reliability are highly valued? Or will the data be used by a small group of individuals to conduct analysis or reporting, potentially on an ad hoc basis or answering questions of interest to a limited audience?

⦁ How well understood is the data itself? Is the data and its meaning well understood – whether through longstanding familiarity with the data domain, meaning established through regulation, or business practices such as financial reporting? Alternatively, is the data unknown, subject to multiple potentially conflicting interpretations, or in need of combining or augmenting with other sources of data in order to be useful?

⦁ How well understood and defined are the analytic or reporting requirements? What’s the plan for using the data? Are the reports, formats and analytics that will be applied to the data well defined and understood? Or do we first need to figure out what the data might be useful for and discover how the data might be valuable?

⦁ What are the sources of the data? Does the data come from internal business applications or systems that are owned and controlled by the enterprise? The source system or application matters, since it largely determines our ability to understand the data and the context in which it was acquired, as well as to influence that acquisition to resolve any ambiguities or other data quality issues. Where we control the upstream source of the data, we have more options to correct data issues at the source and make data “right.” Absent this control, we are left with the larger challenge of getting value from data that is suspect or known to be of poor quality.

⦁ Who is the audience for the insights and conclusions drawn from the data? Will the reports and/or analytic insights be used to satisfy regulatory reporting requirements – perhaps something that needs to be certified or signed off on by the CEO or CFO? Will they be presented to the executive committee or board of directors? Alternatively, will the insights be more informational in nature, for a smaller audience perhaps interested in identifying “order-of-magnitude” effects of different scenarios where “close is good enough” is the standard of proof?

In reading these dimensions and the associated descriptions, my hope is that your reaction is one of “Well shoot!
You’ve stacked the deck so badly the conclusions are obvious!” That is exactly my hope: when we decompose the who, what, how and why of the data and who will use it, the appropriate approach becomes self-evident. Guilty as charged for selecting dimensions that drive us toward one approach over another – there are countless other dimensions that could be relevant in any evaluation.

So, let’s take these dimensions out for a test drive and see how they apply in some real-world situations, as described in Table A below:

USE-CASE 1: MONTHLY/QUARTERLY/ANNUAL FINANCIAL REPORTING

USE-CASE 2: DISCOVERY ANALYTICS COMBINING HEALTH SYSTEM ENCOUNTER DATA WITH CLAIMS DATA AND GEOSPATIAL DATA TO DETERMINE ANY CORRELATIONS THAT ARE PREDICTIVE OF COMPLIANCE WITH FOLLOW-UP APPOINTMENTS

Who is the consumer of the data?
⦁ Use-case 1: Financial analysts and report writers with SQL skills only.
⦁ Use-case 2: Data scientists and “data jockeys” with skills in design of experiments, advanced analytics and statistical techniques.

How broadly will the data be used across the enterprise?
⦁ Use-case 1: Data is used throughout finance and virtually every department across the enterprise to understand and quantify business performance.
⦁ Use-case 2: The data will be accessed by a small, limited group of individuals with knowledge of the data, its limitations, and therefore the reliability of any insights derived.

How well understood are the data themselves?
⦁ Use-case 1: Meaning of the data is unambiguous and frequently determined by accounting standards or other documentation.
⦁ Use-case 2: Data sets are being seen for the first time, and meaning must be derived and inferred rather than established through documentation.

How well understood are the analytic or reporting requirements?
⦁ Use-case 1: Financial reporting requirements are well established through regulation, accounting standards, and best practices and customs throughout the industry and enterprise.
⦁ Use-case 2: While there are some ideas of what types of insights the data might yield, these ideas are uncertain; they must be tested and the validity of the results vetted.

What are the sources of data?
⦁ Use-case 1: Data is sourced from enterprise and departmental business applications under the direct ownership or control of the organization.
⦁ Use-case 2: Data is from external sources and must be used “as-is,” with little ability to influence upstream sources for either understanding or data integrity/quality purposes.

Who is the audience for the insights and conclusions drawn from the data?
⦁ Use-case 1: Reports and analysis are submitted to regulators, used by the executive team, reviewed by the board of directors, and frequently must be certified and signed off by the CEO/CXO.
⦁ Use-case 2: A limited audience of interested individuals with knowledge and understanding of the vagaries of the analytics and approach.

CONCLUSION
⦁ Use-case 1: Enterprise data warehouse approach is clearly preferred.
⦁ Use-case 2: Data lake is clearly preferred.
Table A: Illustrative Use-cases for the EDW and the Data Lake

The FINANCIAL REPORTING use-case clearly favors an EDW approach, since nothing about the use-case requires the agility to accommodate new sources of data or answer new questions. This use-case also strongly favors accurate and reliable data that can be used broadly across the enterprise. Moreover, these data sets are absolutely of known value and worthy of the significant up-front design, governance and rigorous data management practices required to get the data right. If I’m a CEO or CFO, I can feel confident that data sourced from a well-designed, well-implemented and well-managed EDW is accurate and reliable.

Conversely, the DISCOVERY ANALYTICS use-case clearly favors agility rather than intensive data management and governance discipline – and therefore a data lake approach. We know little about the sources of data, or whether the resulting analytics will reveal any useful insights, and therefore we really want to “give it a whirl” and see what comes out. We certainly do not want to put a lot of upfront effort into designing a data model, or strict data quality processes and rules with accompanying data governance, for data that may prove to be of little value and yield no real business insights. From an investment perspective, the data lake is the right amount of effort to apply – at least until we know the value of the insights may be worth more upfront effort. This is a situation where the value and trustworthiness of the result depend strongly on the skill and knowledge of the individual working with the data.

Modern-day EDWs are absolutely succeeding at the FINANCIAL REPORTING use-case, and this is an affirmation of their value within the enterprise. There are many other use-cases – around customer acquisition, inventory management, customer retention and more – that possess similar characteristics and lend themselves to the EDW approach.
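For readers who think in code, the six-dimension evaluation can be reduced to a simple tally: score each dimension for the approach it favors, and see which column of Table A the use-case most resembles. The sketch below is purely illustrative – the function name, the dimension keys and the equal weighting are my own simplifications, not a prescriptive tool; any real evaluation will weigh the dimensions differently and consider factors beyond these six.

```python
# Hypothetical sketch of the six-dimension rubric from Table A.
# Dimension names and the one-vote-per-dimension scoring are
# illustrative assumptions, not an established methodology.

DIMENSIONS = [
    "consumer",            # SQL report writers (edw) vs. data scientists (lake)
    "breadth_of_use",      # enterprise-wide (edw) vs. small group (lake)
    "data_understanding",  # well understood (edw) vs. unknown/ambiguous (lake)
    "requirements",        # well defined (edw) vs. exploratory (lake)
    "sources",             # internal and controlled (edw) vs. external as-is (lake)
    "audience",            # regulators/board (edw) vs. informal, limited (lake)
]

def preferred_approach(answers: dict) -> str:
    """Tally which approach each dimension favors ("edw" or "lake")."""
    edw_votes = sum(1 for d in DIMENSIONS if answers[d] == "edw")
    lake_votes = len(DIMENSIONS) - edw_votes
    if edw_votes == lake_votes:
        return "mixed – consider combining the EDW and the data lake"
    return "EDW" if edw_votes > lake_votes else "data lake"

# The FINANCIAL REPORTING column of Table A: every dimension favors the EDW.
financial_reporting = {d: "edw" for d in DIMENSIONS}
print(preferred_approach(financial_reporting))  # EDW

# The DISCOVERY ANALYTICS column: every dimension favors the data lake.
discovery_analytics = {d: "lake" for d in DIMENSIONS}
print(preferred_approach(discovery_analytics))  # data lake
```

The interesting output, of course, is the "mixed" case – which is exactly where the rest of this article goes: real-world use-cases often split their votes across the two columns.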
Where we have traditionally struggled is in trying to make the design- and investment-intensive EDW approach work for use-cases of uncertain value, with data that is not well understood and is poorly behaved – namely, the DATA DISCOVERY use-case. Data discovery has always been the Achilles heel of the EDW, since the upfront investment required meant we just couldn’t take on data of unknown value. Or, if we did put in the effort and it turned out not to be useful, we got a black eye for putting all that effort into something that ultimately did not yield any value. The data discovery use-case has not been left undone, however; rather, it has been accomplished through the heroic efforts of data scientists, or super-financial-analysts, using spreadsheets, scripting and other dark arts of data manipulation to come up with an answer to an immediate, pressing need. The opportunity with the data lake is to provide an enterprise-class platform to support these sorts of initiatives so that they are part of an overall enterprise data management strategy and architecture, rather than islands of ad hoc analytics with little transparency or ability to inspect and validate the approaches used.

At this point, I have hopefully established two clearly differentiated use-cases that lead us down the path to either an EDW approach or a data lake approach. Ultimately, we need both. But of course, the real world is never so clear. Therefore, I suggest the EDW and data lake should not be implemented or operated in isolation, but instead should operate in combination to meet the diverse needs of the enterprise.

What about the situation where I need to quickly support financial reporting across a newly merged enterprise, with new data sources from the merged entity? This is a FINANCIAL REPORTING use-case, but one where agility is critical.
Alternatively, what about a situation where a quick-and-dirty analysis conducted by a data scientist using the data lake (a classic DATA DISCOVERY use-case) yielded valuable insights that I now want to make broadly available and repeatable for data consumers with only SQL skills?

Figure 1: Modern Analytics Architecture Incorporating both an EDW and Data Lake

These situations illustrate the need for the EDW and data lake to operate in unison, as illustrated in Figure 1. Equally important is the reason not to allow independent islands of “data heroes” to persist: these sorts of isolated efforts do not lend themselves to an easy on-ramp for inclusion into the EDW. For quickly delivering financial reports following a merger, I can use the data lake, pairing a “data jockey” with deep technical skills with a financial analyst with deep business knowledge, to quickly derive accurate, reliable financial reporting from data in unfamiliar formats from unfamiliar sources. Using the data lake need not imply results that are slipshod or somehow of less accuracy or validity than the EDW; getting these trustworthy results just requires a different set of skills and tools. Similarly, once I have done the data discovery work to figure out the format and meaning of new external data sources, and established the valuable questions that can be answered, it should be a straightforward and predictable process to “industrialize” this use-case by enhancing the EDW data model and building the corresponding data loading routines and appropriate data governance to promote the use-case to the EDW.

In summary, the EDW is alive and well and has a critical role to play in a modern analytics architecture. In fact, pairing an EDW with a data lake allows the EDW to focus on what it really does well – and to finally shed the “costs-too-much, takes-too-long, isn’t-agile” label it has labored under for far too long.
Similarly, many of the indictments of the data lake as being prone to becoming a “data swamp” can be mitigated by pairing the data lake with a well-designed, well-implemented and well-managed EDW – and then using the data lake when and where its virtues of agility are merited.

Next up: how do we govern data in this new world, where data can exist in the EDW or the data lake, and be used by different people, with varying skills, for different purposes?