The Informed Company - Dave Fowler
Stage 3. Warehouse (Single Source of Truth)
As more people begin to work with the data lake, questions multiply: What data is where? Why? What criteria should queries use when looking for data insights? What do these schemata mean? Unavoidable complexities make it harder to obtain data, especially for less‐technical colleagues. Even among in‐house experts, more schemata and entities (i.e. tables and views) cause more communication headaches. In time, the data lake serves all data but makes it harder to obtain the right data. It gets harder to write queries and share knowledge within an organization.
All of these problems can be addressed with a clean and simplified version of the data, something we refer to as “a single source of truth.”
This stage, creating a data warehouse, has historically been quite a nightmare, and many books have been written on how best to model data for analytical processing. These days, though, there are more straightforward paradigms that have been tried and tested: ones that not only streamline documenting the oddities found across an organization's schemata but also save the time spent repeating, editing, and maintaining messy “boilerplate” query steps (e.g. “every time you query the orders table, make sure to adjust all orders from England to be in local time”).
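To make the “boilerplate” example concrete, here is a minimal sketch of how such a rule might be written once in a warehouse view rather than repeated in every query. It uses Python's built-in sqlite3 module so that it runs anywhere. The names raw_orders, orders_clean, and created_at_local are hypothetical, as is the fixed one-hour offset for England; a real model would apply proper time zone rules.

```python
import sqlite3


def build_warehouse(conn: sqlite3.Connection) -> None:
    """Create a raw table plus a cleaned view that encodes the rule once."""
    conn.executescript(
        """
        CREATE TABLE raw_orders (
            id         INTEGER PRIMARY KEY,
            country    TEXT NOT NULL,
            created_at TEXT NOT NULL  -- assumed to be stored in UTC
        );

        -- Hypothetical cleaning rule: shift orders from England by a
        -- fixed +1 hour. This offset is an illustrative assumption only.
        CREATE VIEW orders_clean AS
        SELECT id,
               country,
               CASE WHEN country = 'England'
                    THEN datetime(created_at, '+1 hours')
                    ELSE datetime(created_at)
               END AS created_at_local
        FROM raw_orders;
        """
    )


conn = sqlite3.connect(":memory:")
build_warehouse(conn)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        (1, "England", "2021-06-01 12:00:00"),
        (2, "France", "2021-06-01 12:00:00"),
    ],
)

# Analysts query the view and never need to remember the adjustment.
rows = conn.execute(
    "SELECT id, created_at_local FROM orders_clean ORDER BY id"
).fetchall()
print(rows)
```

Because the adjustment lives in the view, anyone querying orders_clean gets the corrected timestamps without knowing the rule exists, which is the sense in which a warehouse acts as a single source of truth.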
In the data warehouse section of the book, we review how to clean data lakes and investigate standard practices for managing data complexity. In addition, we offer ways to establish an architecture with data integrity in mind. We provide modeling tool suggestions and an example SQL style guide. Finally, we give our recommendations for team structure, such as a lead to oversee this process and warehouse maintenance.