Cloud Pak for Data

  • 1.  How to remove fuzzy duplicate in IBM Cloud Pak

    Posted Mon September 12, 2022 04:55 PM
    Hello All,

    • I want to remove fuzzy duplicates using IBM Cloud Pak.
    • Can someone please help me with this?


    ------------------------------
    RAJNI HARYANI
    ------------------------------

    #CloudPakforDataGroup


  • 2.  RE: How to remove fuzzy duplicate in IBM Cloud Pak

    Posted Tue September 13, 2022 02:17 AM
    Hi Rajni, what do you mean by "remove fuzzy duplicates"? Can you elaborate on what you want to do? Thanks

    ------------------------------
    JOHN MATTHEWS
    ------------------------------



  • 3.  RE: How to remove fuzzy duplicate in IBM Cloud Pak

    Posted Tue September 13, 2022 02:34 AM
    Hello All,

    In IBM Cloud Pak, I want to remove fuzzy duplicates (records whose column values are almost the same but not identical).
    Can you please explain how to achieve this?

    Input data :
    ---------------
    In the input data, records 1 and 2 have the same email ID but slightly different names. I want to remove this type of duplication from my data.
    Similarly, records 4 and 5 have the same email ID but slightly different names; it looks like the same person's data repeated twice.

    ID  NAME           EMAIL ID
    1   RAJNI HARYANI  abcd@gmail.com
    2   RAJNI H        abcd@gmail.com
    3   JOHN MATHEW    xyz@gmail.com
    4   DEB CARRY      aaaa@gmail.com
    5   DEB C          aaaa@gmail.com

    Output data :
    ---------------
    ID  NAME           EMAIL ID
    1   RAJNI HARYANI  abcd@gmail.com
    3   JOHN MATHEW    xyz@gmail.com
    4   DEB CARRY      aaaa@gmail.com


    Thanks & Regards
    Rajni haryani





  • 4.  RE: How to remove fuzzy duplicate in IBM Cloud Pak

    Posted Tue September 13, 2022 03:12 AM
    Edited by System Fri January 20, 2023 04:35 PM
    Hi Rajni,

    You can achieve this in Watson Knowledge Catalog (Cloud Pak for Data) by using the Data Refinery capability. See the "remove duplicates" section in this link: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=data-gui-operations

    This will work for the simple example you gave. However, your example does not have enough data to determine reliably whether the records belong to the same person. You will need more attributes to match the person records accurately, and then you will need survivorship rules that decide which data values to keep when you match and merge two or more records. Another service in Cloud Pak for Data, Match 360, enables you to do this. See this link for more information on Match 360: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=services-match-360-watson . You can match the input data together to form entities and then export the entities as a file.

    Lastly, the DataStage service (ETL/ELT/data pipelines) also has a Remove Duplicates stage: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=stages-remove-duplicates
    as well as some matching stages, e.g. the One-source Match stage: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=stages-one-source-match
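
    To illustrate what a key-based "remove duplicates" operation does on the sample data from the earlier post, here is a minimal sketch in plain Python. It is not the code that Data Refinery or DataStage runs; the function name and the choice to keep the first record per email are assumptions made only for this illustration.

```python
# Illustration only: plain Python, not Data Refinery or DataStage code.
records = [
    {"id": 1, "name": "RAJNI HARYANI", "email": "abcd@gmail.com"},
    {"id": 2, "name": "RAJNI H", "email": "abcd@gmail.com"},
    {"id": 3, "name": "JOHN MATHEW", "email": "xyz@gmail.com"},
    {"id": 4, "name": "DEB CARRY", "email": "aaaa@gmail.com"},
    {"id": 5, "name": "DEB C ", "email": "aaaa@gmail.com"},
]


def remove_duplicates_by_key(rows, key):
    """Keep the first row seen for each value of `key`; drop later rows."""
    seen = set()
    kept = []
    for row in rows:
        # Normalise case and surrounding spaces so trivial differences
        # do not hide an otherwise identical key value.
        value = row[key].strip().lower()
        if value not in seen:
            seen.add(value)
            kept.append(row)
    return kept


deduped = remove_duplicates_by_key(records, "email")
print([row["id"] for row in deduped])  # [1, 3, 4]
```

    This only works because the email values are exactly equal. As soon as the key itself differs slightly between records, an exact comparison like this no longer finds the duplicates.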

    ------------------------------
    JOHN MATTHEWS
    ------------------------------



  • 5.  RE: How to remove fuzzy duplicate in IBM Cloud Pak

    Posted Wed September 14, 2022 01:42 AM
    Hello John,

    Thank you so much for the information.

    I have 2 more doubts-

    DOUBT 1 :
     If I have records with the same cust code (like records 1 and 2), then I want to remove the records from the table whose ADDRESS field matches 90%.
    How can I achieve this in IBM Cloud Pak?

    INPUT
    ID  NAME           EMAIL ID              cust code  Address
    1   MOHIT HARYANI  abcd@gmail.com        1          100 main Road, Indore
    2   MOHIT H        abcdefgh@gmail.com    1          100 main Rd
    3   JOHN MATHEW    xyz@gmail.com         2          200 central region, Bhopal
    4   DEB CARRY      aaaa@gmail.com        3          300 south Junction , Hyderabad
    5   DEB C          aaaabbbccc@gmail.com  3          300 south Junct , Hyderabad
    OUTPUT
    ID  NAME           EMAIL ID              cust code  Address
    1   MOHIT HARYANI  abcd@gmail.com        1          100 main Road, Indore
    3   JOHN MATHEW    xyz@gmail.com         2          200 central region, Bhopal
    4   DEB CARRY      aaaa@gmail.com        3          300 south Junction , Hyderabad


    DOUBT 2 :
    In IBM Cloud Pak, when we do a METADATA IMPORT, a DATA ASSET also arrives along with the metadata, either in the project or in the catalog.
    Are this metadata and data asset stored in IBM Cloud?
     
    Thanks in advance.

    Regards
    Rajni








  • 6.  RE: How to remove fuzzy duplicate in IBM Cloud Pak

    Posted Wed September 14, 2022 10:15 AM
    Edited by System Fri January 20, 2023 04:15 PM
    DOUBT 1:

    This is exactly what Match 360 does. It uses a matching engine based on probabilistic matching concepts. You decide which record attributes from your data source(s) the matching engine should use, e.g. date of birth, first name, last name, email address, and address. The more attributes you select, and the better the underlying data quality, the better the overall match results will be. The match process uses advanced comparison features to determine how closely an attribute from one record matches the same attribute in another (they don't need to be exact matches), and it creates a score for each. It then combines the scores for all the attributes you specified for matching into an overall score. If that score is greater than the match threshold (which you also specify), the records are matched (merged) to create an entity. You can choose which attributes, from which records, survive into the entity, creating an overall composite entity view. All of this is made easy by an intuitive UI and embedded AI/ML that helps you and suggests things for you.
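
    The idea of per-attribute scores, an overall score, and a match threshold can be sketched in a few lines of Python. This is a toy illustration, not the Match 360 algorithm: it uses the string ratio from Python's standard difflib module as a stand-in for the comparison functions, and the weights, the threshold, the field names, and the "keep the first record" survivorship rule are all assumptions chosen for the example. Records are only compared when they share the same cust code.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold, chosen only for this illustration.
WEIGHTS = {"name": 0.3, "email": 0.3, "address": 0.4}
MATCH_THRESHOLD = 0.7


def similarity(a, b):
    """Return a 0..1 similarity ratio after lowercasing and collapsing spaces."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def overall_score(rec_a, rec_b):
    """Weighted sum of the per-attribute similarity scores."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def dedupe(rows):
    """Keep the first record of each matched group, comparing within a cust code."""
    kept = []
    for row in rows:
        is_duplicate = any(
            row["cust_code"] == k["cust_code"]
            and overall_score(row, k) >= MATCH_THRESHOLD
            for k in kept
        )
        if not is_duplicate:
            kept.append(row)
    return kept


records = [
    {"id": 1, "name": "MOHIT HARYANI", "email": "abcd@gmail.com",
     "cust_code": 1, "address": "100 main Road, Indore"},
    {"id": 2, "name": "MOHIT H", "email": "abcdefgh@gmail.com",
     "cust_code": 1, "address": "100 main Rd"},
    {"id": 3, "name": "JOHN MATHEW", "email": "xyz@gmail.com",
     "cust_code": 2, "address": "200 central region, Bhopal"},
    {"id": 4, "name": "DEB CARRY", "email": "aaaa@gmail.com",
     "cust_code": 3, "address": "300 south Junction , Hyderabad"},
    {"id": 5, "name": "DEB C ", "email": "aaaabbbccc@gmail.com",
     "cust_code": 3, "address": "300 south Junct , Hyderabad"},
]

survivors = dedupe(records)
print([r["id"] for r in survivors])  # [1, 3, 4]
```

    Note that with this particular similarity measure, the addresses of records 4 and 5 score above 0.9, but "100 main Road, Indore" and "100 main Rd" score well below it. A strict 90% rule on the address alone would therefore miss records 1 and 2, which is one reason to combine several attributes and to standardise abbreviations before matching.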

    Look at these two tutorial links from our match 360 SaaS offering on IBM Cloud. Each link contains a video which you can watch to gain a better understanding of how all this works:
    - Customer 360 tutorial: Configure a 360-degree view: https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/df_360_configure.html?adoper=178484_1_GS1
    - Customer 360 tutorial: Explore your customers: https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/df_360_explore.html?adoper=178484_1_GS1
    You can of course watch all of the video content, but in video 1 the specific discussion of the matching algorithm starts at 7:30, and in video 2 the specific discussion of the match threshold starts at 1:50.
    You can also work through the tutorials yourself, using free IBM Cloud SaaS services. All the information you need is in the links provided.

    DOUBT 2:

    Cloud Pak for Data generally stores metadata in secure internal databases (relational or graph), depending on the service being used. Watson Knowledge Catalog can use both.

    ------------------------------
    JOHN MATTHEWS
    ------------------------------