Cloud Pak for Data


Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

  • 1.  Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Tue May 10, 2022 03:57 PM
    Edited by System Test Fri January 20, 2023 04:39 PM
    We'd like to answer your questions about the differences between Data Fabric, Data Mesh, Data Lakes, and Data Warehouses. 
    We've arranged for experts from across IBM to answer your questions right here in this forum thread on May 26 at 2pm Eastern/11am Pacific for a whole hour of AMA (Ask Me Anything).  Our topic is Data Fabric, Data Mesh, Data Lake, Data Warehouse: which one do you need, or do you need all of them? If you have questions, please start posting them as responses to this post.

    Our experts will hop on the Cloud Pak for Data Community discussion forum on May 26 at 2pm Eastern/11am Pacific and start answering your questions right here in this thread. 

    To learn more, or to get this AMA on your calendar, go to the AMA Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need? event page. This event will take place entirely in the discussion forum, so there is no meeting to join.  If you can't be online during the hour, don't worry; you can post your questions in advance and read the responses later.  


    ------------------------------
    Shannon Rouiller
    Content strategist, Cloud Pak for Data
    ------------------------------
    #CloudPakforDataGroup


  • 2.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Wed May 25, 2022 04:01 PM
    Hi, I'm looking forward to this AMA. My question is:  What's the difference between a data fabric and a data mesh?

    ------------------------------
    Karin Moore
    ------------------------------



  • 3.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:18 PM

     

    A data fabric and a data mesh both aim to help control the proliferation of data sources across enterprises.  Both use much of the same technology (virtualization, ETL, governance tools), but they differ in how that technology is deployed organizationally, particularly around ownership. A data mesh is less centralized: lines of business own their data products and maintain them themselves, holding the "contract" for each.  In a data fabric, teams still own data, but the approach is more centralized.  There are still separate data stores in a fabric, but the management is centralized.  In a mesh, it is not; it's much more a matter of peers working together as they see fit.  Both are powerful and are built out of a lot of the same tech, but the organizational structure in each is quite different.

    ------------------------------
    Trent Gray Donald
    Distinguished Engineer
    IBM
    Ottawa ON
    ------------------------------



  • 4.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Wed May 25, 2022 04:15 PM
    Thanks for hosting this AMA!

    When should I be using a data warehouse vs. a data lake? 


    ------------------------------
    Kelley Tai
    ------------------------------



  • 5.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:36 PM
    Data warehouses + data lakes each have distinct characteristics which optimize for different use cases. The data lake is more flexible: raw data can land in its raw format without any predefined schema, and that data can then be manipulated via multiple open source tools interoperating on open formats. It is also designed to efficiently process massive volumes (many petabytes) of data, making it ideal for data exploration, data engineering, and data science tasks on less "refined" data. The data warehouse, on the other hand, is optimized for high performance with many concurrent consumers. It is characterized by data in a defined schema, stored in a highly optimized format and operated on by a highly optimized engine in order to predictably meet stringent SLAs. A warehouse is typically leveraged for BI workloads whose requirements for both high performance and high concurrency the data lake engines cannot meet.
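
    A minimal sketch of that contrast, using only the Python standard library. It assumes a toy "landing zone" of JSON lines standing in for the lake and an in-memory SQLite table standing in for the warehouse; the record fields, table name, and helper functions are hypothetical and are not taken from any IBM product.

```python
import json
import sqlite3

# Lake-style: raw records land as-is; no schema is enforced at write time.
raw_landing_zone = [
    '{"order_id": 1, "amount": 19.99, "region": "EU"}',
    '{"order_id": 2, "amount": "n/a"}',                     # bad amount, still accepted
    '{"order_id": 3, "amount": 5.00, "coupon": "SPRING"}',  # extra field, still accepted
]


def read_with_schema(lines):
    """Apply a schema at read time, skipping records that do not fit."""
    for line in lines:
        record = json.loads(line)
        amount = record.get("amount")
        if isinstance(amount, (int, float)):
            yield record["order_id"], float(amount), record.get("region")


# Warehouse-style: the schema is declared up front and enforced on write.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " amount REAL NOT NULL CHECK (typeof(amount) = 'real'),"
    " region TEXT)"
)

# Only records that pass the read-time schema are loaded into the warehouse.
clean_rows = list(read_with_schema(raw_landing_zone))
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean_rows)


def try_insert_bad_row():
    """Return True if the warehouse rejects a row that violates the schema."""
    try:
        warehouse.execute("INSERT INTO orders VALUES (?, ?, ?)", (4, "n/a", "US"))
    except sqlite3.IntegrityError:
        return True
    return False


total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

    The landing zone accepts all three records, including the malformed one, while the table rejects anything that breaks its declared schema. That trade of flexibility for predictability is the one described above, at toy scale.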

    ------------------------------
    DAVID KALMUK
    ------------------------------



  • 6.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Wed May 25, 2022 04:23 PM

    Thanks for this AMA, lots of great questions here! My question is:  what is a lakehouse and what is it used for? 



    ------------------------------
    Trish Smith, MBA, BMath, Mom
    Content Developer
    IBM
    Ottawa ON
    613-356-5435
    ------------------------------



  • 7.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:20 PM

    A Lakehouse is a new architecture that tries to combine the best aspects of Data Lakes and Data Warehouses. In general, it combines the low-cost storage of Data Lakes with the management and query features of a Data Warehouse. We see Lakehouses as a new paradigm that sits somewhere between data lakes and warehouses in the data management spectrum, allowing users to perform both BI and Data Science on their data lake data.


    Some key technologies are evolving to support Data Lakehouses:

    • Open data formats make data more easily accessible
    • High performance query engines that can provide warehouse-grade performance on data lake data
    • Metadata layers that provide ACID transactions
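
    To make the last bullet more concrete, here is a toy sketch of a metadata layer that gives atomic, versioned commits over immutable data files. It assumes an optimistic-concurrency scheme in which a writer must commit against the version it read; the class, method, and file names are hypothetical, and this is not the design of any specific table format or IBM product.

```python
from dataclasses import dataclass, field


@dataclass
class TableLog:
    """Toy metadata layer: an append-only list of committed snapshots.

    Each snapshot is the set of data files that make up the table at
    that version. Data files themselves are never modified in place.
    """
    snapshots: list = field(default_factory=lambda: [frozenset()])

    def current_version(self):
        return len(self.snapshots) - 1

    def files(self, version=None):
        """Data files visible at a version (defaults to the latest)."""
        return self.snapshots[self.current_version() if version is None else version]

    def commit(self, read_version, add=(), remove=()):
        """Publish a new snapshot, or fail if another writer committed first."""
        if read_version != self.current_version():
            raise RuntimeError("conflict: table changed since it was read")
        new_files = (self.snapshots[read_version] - frozenset(remove)) | frozenset(add)
        self.snapshots.append(new_files)
        return self.current_version()


log = TableLog()
v1 = log.commit(0, add=["part-0001.parquet", "part-0002.parquet"])
# Rewrite one file: readers see either all of version 1 or all of version 2.
v2 = log.commit(v1, add=["part-0003.parquet"], remove=["part-0001.parquet"])
```

    Readers that pin a version keep seeing a consistent set of files, and a writer holding a stale version is rejected instead of silently overwriting newer data. Real metadata layers add durability and multi-process coordination, which this sketch leaves out.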


    ------------------------------
    Joshua Kim
    ------------------------------



  • 8.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Wed May 25, 2022 05:17 PM
    Looking forward to reading through this AMA. Here's my question: What are the differences between a Data Lake and a Data Lakehouse?

    ------------------------------
    SHARYN RICHARD
    ------------------------------



  • 9.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:29 PM
    It could be said that data lakehouses evolved from data lakes. Both are used to store Big Data, unstructured and structured. However, data lakehouses address the challenges that data lakes have, namely performance for BI and data reliability. Data lakehouses are able to support a wider variety of workloads, with higher performing query engines for BI than traditional data lake technologies. Further, data lakehouses are ACID compliant, which makes data consistency much easier to achieve than in data lakes. It is important to note that lakehouses are still relatively new. Although they are promising, it will be some time before we can claim that they replace data lakes. Nonetheless, they are definitely offering solutions to challenges that data lakes have had.

    ------------------------------
    Joshua Kim
    ------------------------------



  • 10.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Wed May 25, 2022 10:02 PM
    Edited by System Test Fri January 20, 2023 04:20 PM
    Hello. Thanks for answering these questions. Here is my question: Is Data Lakehouse replacing data warehousing?

    Thank you,
    Jennifer

    ------------------------------
    Jennifer Smith Gray
    ------------------------------



  • 11.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:41 PM
    I think it's important to differentiate "Lakehouse" as a marketing term vs. Lakehouse as technology. As a marketing vehicle some industry participants are using "Lakehouse" to claim they will enhance their open data lake platform to cover the full range of data warehousing use cases, and at this point that concept is purely aspirational.

    In terms of technology the picture's a little bit more nuanced - a lot of the open lake vendors are working very hard to build data warehousing technology, but in so doing they are needing to introduce more and more proprietary elements into their platforms which are required to match warehouse performance / behavior. On the other side warehouses are evolving to increasingly integrate with the data lake and open formats. Ultimately I think where you end up is a set of distinct capabilities + engines optimized around data interchange and loosely defined schemas which are ideal for data exploration / engineering, and another set of capabilities + engines optimized for performance and heavy workload which requires more tightly defined schemas, and more controlled data - but moving forward they will increasingly integrate better. So you can call that "Lakehouse" and say it supersedes both data lakes + warehouses, but technology-wise you still have different technologies for different use cases.


    ------------------------------
    DAVID KALMUK
    ------------------------------



  • 12.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 10:27 AM
    How does IBM Cloud Pak for Data integrate with Guardium and Optim?

    ------------------------------
    Mike Ferguson
    ------------------------------



  • 13.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:15 PM
    IBM Cloud Pak for Data includes Watson Knowledge Catalog (WKC), an enterprise data governance catalog solution. In WKC, data stewards, chief information security officers, or chief data privacy officers can author policies for data protection, data access, data placement, etc. Guardium is an IBM solution that monitors database access to ensure that only authorized access happens, and that otherwise prevents access in real time. In upcoming releases in H2 2022, WKC and Guardium will deliver integrated capabilities so that policies for access and for data protection via masking can be enforced with Guardium as well. 

    ------------------------------
    Martin Oberhofer
    ------------------------------



  • 14.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 10:27 AM
    How can IBM enforce data governance policies (defined in Watson Knowledge Catalog) across non-IBM databases when none of these vendors support a standard like Egeria or ODRL?

    ------------------------------
    Mike Ferguson
    ------------------------------



  • 15.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:25 PM
    IBM Cloud Pak for Data has a comprehensive set of capabilities to enforce policies on many different persistency types from non-IBM vendors as well as IBM. For example, with Watson Query, we can enforce policies during SQL execution to mask sensitive data if needed before returning it to the application which issued the SQL request. Watson Query supports a broad range of persistencies from many vendors. In addition, Cloud Pak for Data has dozens of connectors for traditional persistencies like Oracle, Db2, and many other databases, as well as cloud-native and non-SQL persistencies. All these connectors have been built on a common connectivity framework that supports policy enforcement. As a result, policies can be enforced whenever data is accessed, for example on a Watson Knowledge Catalog data asset preview screen: if there are sensitive fields the user should not see in plain text, the data values are shown in masked form. 
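
    The general pattern of masking on read can be sketched in a few lines. This is only an illustration of the idea, not how Watson Query or the connectivity framework is implemented; the policy format, role name, and function names below are all hypothetical.

```python
import hashlib

# Hypothetical policy: column name -> masking method.
# This is NOT a Watson Knowledge Catalog policy format.
MASKING_POLICY = {"ssn": "redact", "email": "hash"}


def mask_value(value, method):
    """Mask a single value according to the named method."""
    if method == "redact":
        # Replace every character so only the length is preserved.
        return "X" * len(str(value))
    if method == "hash":
        # Stable pseudonym: the same input always maps to the same token.
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]
    raise ValueError(f"unknown masking method: {method}")


def enforce(rows, user_roles, policy=MASKING_POLICY, exempt_role="privacy_officer"):
    """Return rows with sensitive columns masked, unless the user is exempt.

    The masking happens in the access path, after the data is read from
    the source and before it is handed back to the requesting application.
    """
    if exempt_role in user_roles:
        return [dict(row) for row in rows]
    return [
        {col: mask_value(val, policy[col]) if col in policy else val
         for col, val in row.items()}
        for row in rows
    ]
```

    Because the masking sits between the source and the consumer, the source database does not need to understand the policy itself, which is the property the answer above relies on for non-IBM persistencies.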


    ------------------------------
    Martin Oberhofer
    ------------------------------



  • 16.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 10:27 AM
    How does IBM Watson Knowledge Catalog connect to and discover data in SaaS applications when several SaaS application vendors do not allow access to the underlying data stores holding the data for those applications and there is no metadata API?

    ------------------------------
    Mike Ferguson
    ------------------------------



  • 17.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:32 PM
    For some aaS applications, we have out-of-the-box connectors that allow metadata access as well as read-write access to the data. However, if, as in your question, there is no API to interact with that aaS application for data access and the only path is through a UX, then we currently can't pull the metadata from that aaS application. 

    ------------------------------
    Martin Oberhofer
    ------------------------------



  • 18.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 11:13 AM
    At what level of an organization should a data fabric/lake/house be instantiated? Should you try to deploy it at the highest level possible, one WKC to rule them all, or break it down into smaller domains and sub-domains? Is there a way in WKC to pull in data from multiple WKCs? Like a network of catalogs or something like that? Perhaps asked another way, what are the considerations for how to deploy these new data paradigms into large organizations?

    ------------------------------
    Jim Herrmann
    ------------------------------



  • 19.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:31 PM

    I'm going to narrow the answer down to catalogs / metadata – as the question around data lakes/houses is around physical data storage, vs metadata – and is a very different discussion.

     

    There are a few approaches, but we often see customers approaching the problem by having a single main catalog for the enterprise which is highly curated (along with business terms, etc.), and then other catalogs that can be considered "feeders".  A lot of the power of a catalog comes from global awareness of what's in it, and the policies on it, so that's important to preserve if possible.  However, there are circumstances where private catalogs make sense, especially ones where even the existence of certain metadata could be problematic. Tools (such as Egeria) are emerging that can help bridge between metadata repositories and allow for different shapes as befits the desired patterns.



    ------------------------------
    Trent Gray Donald
    Distinguished Engineer
    IBM
    Ottawa ON
    ------------------------------



  • 20.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:11 PM
    Thank you for this AMA! My question is: what are some important things to consider when embarking on an enterprise data warehouse migration to the cloud using Db2 in an effort to both reduce excess data and simplify data?

    ------------------------------
    Brian Bui
    Cloud Engineer
    IBM
    ------------------------------



  • 21.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:53 PM
    Edited by System Test Fri January 20, 2023 04:21 PM
    I would say that there are two approaches to migrating warehouses to the cloud.


    A simple lift and shift, whereby your warehouse on prem moves as is to the cloud, which is possible as we have Db2 both on prem and in cloud. This is most appropriate for the smaller data marts. There are nuances there about number of MLNs, but honestly not too much to be concerned about.


    The second approach is rearchitecting, which I am actually more in favor of. Many clients have very large EDWs on prem, often with many database instances within a single EDW that sat in a single appliance. A migration to cloud is often an opportunity to break down large EDWs, which were complex to manage. Breaking these down so that, for example, each data set becomes its own Cloud Data Warehouse can greatly help simplify data access and control. Although it is more work, in the long run it will allow you to take full advantage of a cloud data warehouse, for example by scaling workloads more granularly.


    Db2 is great in terms of its hybridity, on prem and in the cloud. In many cases, you can simply back up and restore databases from on-prem to cloud. We also offer tools like Lift, and capability built into the console, to help with database migrations to the cloud.


    One thing to watch out for is performance. The performance characteristics of on-prem and cloud are different. However, cloud data warehouses do have the advantage of being able to scale up and scale down. If your query is slow, you can scale up your warehouse.


    ------------------------------
    Joshua Kim
    ------------------------------



  • 22.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:31 PM
    Yesterday, we used ETL to build a clean, well-governed data warehouse based on operational data. There could be multiple sources, but they were well optimized in data marts, and data reconciliation allowed us to build a model that could answer a large variety of business questions.
    This DWH is a single source of truth, and multiple users can easily compare their results and make the correct decision.
    Since this DWH is already clean and keeps all history, it can be used to build predictive results; data scientists can leverage it to build their models.
    ==> the total number of tools needed: 4 (ETL, DB, data science tool, data visualization tool (or a business intelligence suite if we need to address governance))
    ==> I believe this approach was, and still is, suitable for the majority of companies
    Today: we are providing a huge number of tools with different levels of complexity, and the marketing behind them seems to focus mostly on AI and data science, multiplying the ways to access and process data: one tool for governance, one tool for lineage, one tool for data science, one tool for this and one tool for that.
    What I don't understand now:
    ==> Where is the single source of truth? How do you know where the information comes from and whether the result being shown is accurate (or do we need a tool on top of the other tools to make that possible)?
    ==> Is data warehousing still done the way we were doing it, or is it outdated? There is still a huge number of companies that do not have real-time data or very large volumes of data. They will be lost in a data fabric as we are presenting it.
    ==> At some point in time, DWH and business intelligence were replaced with data science, data fabric, and total end-user independence, which in my opinion is a non-achievable objective; or at least, for the second part to be achieved, the old architecture is still mandatory. Am I mistaken?

    ------------------------------
    Mhamed Ben Jmaa
    ------------------------------



  • 23.  RE: Questions for AMA: Data Fabric, Data Mesh, Data Lake, Data Warehouse, which one do you need?

    Posted Thu May 26, 2022 02:58 PM

    We are now living in a much broader ecosystem of data consumers.  Some are highly skilled and deeply technical, and others are business users who want to interact at the business term level.  EDWs still fulfill a key role in providing normalized and trusted data to specific users, but without additional capability around self-serve consumption, they are insufficient.

     

    Using an EDW and/or data lake is still a viable and possible approach for some companies.  However, we've seen customers struggle with velocity, complexity, and the fundamental reality that most businesses have data in many locations (often for good reasons).  This leads to significant delays between "I want data for business reason X" and "I have the data in a form I can use".  Consumers face multiple challenges. First, they often can't discover the data without significant expertise (due to challenges mapping from technical naming to the business meaning, additional software to find it in the various data stores, etc.).  The next challenge is around tracking access and ensuring data is only being given to the right folks (i.e., following business policies).  Finally, it's about ensuring the use can scale.

     

    Data Lakehouses bring an additional option to the table in the age-old EDW vs data lake discussion.  While not a direct replacement for either, they are helping cover more use cases at attractive price points.

     

    On the topic of ML/data science vs BI users - I do not see data science replacing BI.  It's just a new class of users that want data (often in different shapes).  BI users are still the majority of consumers and need a scalable and safe system to use that scales with their skill sets.   The core technology of a DWH is great, but it needs help around things like discoverability, quality, and policy.  That's really what fabrics are aiming to add.



    ------------------------------
    Trent Gray Donald
    Distinguished Engineer
    IBM
    Ottawa ON
    ------------------------------