Global AI and Data Science


Big Churn: Subsetting and Other Tricks to Support Rapid ML Progress: Chat with The Lab Webinar Series

  • 1.  Big Churn: Subsetting and Other Tricks to Support Rapid ML Progress: Chat with The Lab Webinar Series

    Posted Sun June 07, 2020 03:57 PM
    Edited by System Fri January 20, 2023 04:19 PM

    Most of the work we do as data scientists is highly experimental and iterative. Analyzing, cleansing, and preparing data, engineering features, and selecting model algorithms all involve a lot of discovery through trial and error. While we generally like to work with as much data as we can get our hands on, sometimes it's too much to trawl through repeatedly. In addition, the data may be at the wrong level of detail and require aggregation. In such cases, a little basic data management and manipulation can go a long way toward making the fun parts of problem solving and machine learning model building a great deal more productive.

    This talk describes how we managed a customer-churn use case in which we were presented with daily usage and transaction data for millions of customers of a telecommunications provider. Initially, it took hours just to read in one table, so we could make almost no progress. Creating a subset selector that applies aggregation while performing the selection allowed us to extract manageable data sets on which we could iterate rapidly. This approach also lets us address class imbalance, which is common in these kinds of classification problems, in a natural way, without resorting to techniques such as synthetic-data generation. By appropriately using randomization seeds, we can both create repeatable experiments and test whether subsets are large and unbiased enough to represent the entire data set. Finally, the subset selection takes into account the time-dependent nature of churn use cases, where objects (customers) enter and drop out of the data set at different times and a minimum residence time must be enforced for a pattern to emerge.
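
The ideas in the paragraph above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code from the talk or its repository: the talk's implementation is based on Spark, while this sketch uses the standard library so it runs anywhere. The function name `select_subset`, the record layout, and the parameters `majority_fraction`, `min_days`, and `cutoff` are all hypothetical. It also assumes that "addressing class imbalance in a natural way" means keeping every churner and sampling only a fraction of the non-churners.

```python
import random
from datetime import date
from typing import Dict, Iterable, List, Optional, Tuple


def select_subset(
    customers: List[dict],
    usage: Iterable[Tuple[int, date, float]],
    seed: int,
    majority_fraction: float,
    min_days: int,
    cutoff: date,
) -> Dict[int, dict]:
    """Pick a repeatable customer subset and aggregate its usage in one pass.

    customers: dicts with keys "id", "start" (date), "churn_date" (date or None).
    usage: (customer_id, day, minutes) rows, one per customer per day.
    All names and the record layout are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)  # fixed seed -> the same subset on every run
    keep = set()
    # Sort by id so the random draws do not depend on input order.
    for c in sorted(customers, key=lambda c: c["id"]):
        churn_date: Optional[date] = c["churn_date"]
        end = churn_date if churn_date is not None else cutoff
        # Enforce a minimum residence time: too little history, no pattern.
        if (end - c["start"]).days < min_days:
            continue
        # Assumption: keep all churners, sample the non-churner majority.
        if churn_date is not None or rng.random() < majority_fraction:
            keep.add(c["id"])

    # Aggregate daily rows to one record per selected customer while filtering,
    # so the full-detail data never has to be held for unselected customers.
    totals = {cid: {"total_minutes": 0.0, "active_days": 0} for cid in keep}
    for cid, _day, minutes in usage:
        if cid in totals:
            totals[cid]["total_minutes"] += minutes
            totals[cid]["active_days"] += 1
    return totals
```

Running the selector with several different seeds and comparing summary statistics across the resulting subsets is one way to check whether a subset is large and unbiased enough; rerunning with the same seed reproduces an experiment exactly.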

    We will share a link to a GitHub repository with generic Python code, based on Spark for handling big data, which demonstrates this approach. It comes complete with a small set of fake data, ready for you to try and customize to your data and use case.

    Share any of your questions below and watch the on-demand recording here.



    ------------------------------
    JORGE CASTANON
    Chat with labs webinar series: https://ibm.co/Chat-With-The-Lab-Webinar
    ------------------------------
    #GlobalAIandDataScience
    #GlobalDataScience


  • 2.  RE: Big Churn: Subsetting and Other Tricks to Support Rapid ML Progress: Chat with The Lab Webinar Series

    Posted Mon June 29, 2020 07:23 PM
    Edited by System Fri January 20, 2023 04:25 PM
    Hello everyone,
    You can watch the full webinar recording and demo here. The slides can be downloaded here. Please reply below with any questions.



    ------------------------------
    Robert Uleman
    Data Science Engineer
    IBM Data Science and AI Elite
    CA
    ------------------------------