Data tables: Long or Wide?

By DIANE REYNOLDS posted Thu July 26, 2018 10:16 AM

  

Data modeling isn’t always considered a core data science skill (it should be!), but it came in handy during this project. It allowed us to separate the project into an application development component and a data science engine component – one to collect and distribute data, the other to run different kinds of analysis on it.

It may be “old fashioned”, but so many machine learning models rely on tabular data that it just makes sense for me to plan for the data to end up that way. Most analytics require a single table of input, where each row (or record) contains a separate observation or experiment and each column represents an observed element or attribute. Most data science tools today use the dataframe construct to embody a table ready for analysis. However, some constraints are imposed on that table for the statistics to work efficiently, and it can get very large and very messy! Fortunately, there is a large body of work on efficient data modeling, largely written between 1970 and 2000 by data management experts. For a data scientist’s take on it, check out this paper by Hadley Wickham.

Essentially, this paper and its relatives in the world of data modeling are discussing the concept of normalization. Normalization is commonly used in relational database systems (e.g., Oracle, DB2, Sybase) to create small, coherent tables and then link them together with automatically generated identifiers. One version of it, third normal form, is a popular data warehousing approach, while another, the dimensional model, is popular in business analytics tools (e.g., Cognos, Tableau, Business Objects). There are many benefits to this approach, including efficient space usage, clarity and maintainability. However, it is not especially compatible with statistical analyses.
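To make the contrast concrete, here is a minimal Spark/Scala sketch of a normalized lookup that has to be joined back together before an analysis sees a single flat table. The tables, columns and values (regions, customers) are invented purely for illustration and are not the project’s schema.

```scala
// Minimal sketch, assuming Spark/Scala; tables and columns are invented
// to illustrate normalization, not taken from the project.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("normalization-sketch").getOrCreate()
import spark.implicits._

// Normalized form: a small, coherent lookup table linked to the main table
// by an automatically generated identifier.
val regions = Seq((1, "Ontario"), (2, "Quebec"))
  .toDF("region_id", "region_name")
val customers = Seq(("C001", 1), ("C002", 2), ("C003", 1))
  .toDF("customer_id", "region_id")

// Statistical tools want one flat table of observations, so every analysis
// starts by joining the pieces back together.
val flat = customers.join(regions, Seq("region_id"), "left")

// Denormalized form: the label is written straight into the customer row,
// which is the pattern the rest of this post argues for.
val denormalized = Seq(("C001", "Ontario"), ("C002", "Quebec"), ("C003", "Ontario"))
  .toDF("customer_id", "region_name")
```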

The project I’ve been working on, called IBM Customer Insights, involves allowing a client to run predictive analytics to support client engagement activities (e.g., marketing, support centres, coupons/offers). Each client has a small team using the system in the cloud. Data owners work with their IT department to provide relatively clean and consistent data to the engines running the analyses. Those engines can run either automatically with pre-built models, or a data scientist can extend the engine with additional models. So, we need a model that is easy for the data owners to move data into and easy for data scientists to use.

First things first, we looked at options for analytics and chose Spark as the foundation. It offers file-based, big-data-style Parquet storage that optimizes large tables, and it interoperates with a wide range of data science and development tools. Our principle has been to store large tables of highly denormalized data, each covering a theme or domain. Separate tables are used only for different domains or for very different levels of granularity. This approach led to a significant reduction in the number of tables (compared to a previous dimensional data mart model) and, more importantly, a dramatic decrease in the complexity of the data model.
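As a rough sketch of that storage principle – the paths and column names below are hypothetical, not the project’s actual layout – each domain becomes one wide Parquet table that Spark reads and writes directly:

```scala
// Sketch only: hypothetical paths and schema, one denormalized Parquet table per domain.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("domain-tables").getOrCreate()

// Each domain (customer, account, policy, ...) lives in its own Parquet table.
val customer = spark.read.parquet("/data/domains/customer")
val account  = spark.read.parquet("/data/domains/account")

// Parquet is columnar, so an analysis that touches only a few of the many
// nullable attributes scans only those columns.
val subset = customer.select("customer_id", "marital_status", "education_level")

// Refreshed snapshots are written back out as Parquet as well.
subset.write.mode("overwrite").parquet("/data/extracts/customer_profile")
```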

Some examples of domains include: customer, account, policy, advisor/agent, and asset. Much of the difficult-to-maintain metadata has been integrated directly into the core tables, maximizing flexibility. For example, instead of having separate tables for elements like geographic region, status, education level or marital status, the actual word or words representing each category are written into the corresponding field in the customer table. Spark can cope with such categorical fields during analysis, so a lot of time and energy is saved. So far, so good. We have a small number of tables (20), each with a large number of nullable attributes (circa 100).
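To illustrate that point, here is a hedged sketch of one way Spark ML can consume categorical labels stored as plain strings. The column names and label values are made up, and StringIndexer is just one of the options Spark provides:

```scala
// Sketch with invented data: Spark ML indexes raw string categories directly,
// with no separate lookup tables behind status, education level, etc.
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("categorical-sketch").getOrCreate()
import spark.implicits._

val customers = Seq(
  ("C001", "Married", "Bachelor"),
  ("C002", "Single",  "Master"),
  ("C003", "Married", "High School")
).toDF("customer_id", "marital_status", "education_level")

// Each categorical column becomes a numeric index for modeling;
// handleInvalid = "keep" gives labels unseen at fit time their own bucket.
val maritalIndexer = new StringIndexer()
  .setInputCol("marital_status").setOutputCol("marital_idx")
  .setHandleInvalid("keep")
val educationIndexer = new StringIndexer()
  .setInputCol("education_level").setOutputCol("education_idx")
  .setHandleInvalid("keep")

val indexed = new Pipeline()
  .setStages(Array[PipelineStage](maritalIndexer, educationIndexer))
  .fit(customers)
  .transform(customers)
```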

It all makes sense, right? I certainly think so, when I consider the application. However, this pattern of data is at odds with decades of best practice. We do lose a fair amount of certainty and efficiency in this denormalization. For example, instead of storing a short identifier that points to another record and looking up the larger label only when needed, we store the label itself repeatedly, taking up space. Fortunately, space is very cheap these days, so the extra space requirements are no longer the concern they once were. The certainty is still an issue, however. Certainty comes from having a machine-generated ID upon which to rely for two important things: table joins and metadata identification. Without those identifiers, we are prey to things like “New York” and “NYC” appearing as two separate labels, or “Springfield” being indistinguishable across states. This issue is mitigated in our system by the fact that, generally, we are receiving data that has been carefully curated elsewhere. In general, though, big data faces this challenge on a regular basis, and many models and data cleansing tools are available to help cope with the most egregious issues.
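For illustration, a minimal sketch of the kind of cleansing step that reins in inconsistent labels. The alias map and column names are invented, and real cleansing tools go much further (fuzzy matching, reference data, and so on):

```scala
// Sketch with invented aliases: collapse label variants onto one canonical
// spelling so "NYC" and "New York" stop counting as two categories.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder.appName("label-cleansing").getOrCreate()
import spark.implicits._

val raw = Seq(("C001", "NYC"), ("C002", "New York"), ("C003", " new york "))
  .toDF("customer_id", "city")

// Known aliases mapped to a canonical label; unknown values pass through trimmed.
val aliases = Map("nyc" -> "New York", "new york" -> "New York")
val canonical = udf { (city: String) =>
  Option(city).map(c => aliases.getOrElse(c.trim.toLowerCase, c.trim)).orNull
}

val cleaned = raw.withColumn("city", canonical(col("city")))
```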


[Table 1: wide-form data – one row per customer, with one indicator column per product]
It also doesn’t resolve all of the original issues, which I’ll illustrate with an example. Consider the list of products held by each customer. Suppose we are a firm that offers a variety of products: a chequing account, a savings account, a credit card, a car loan, a mortgage and/or a brokerage account. No customer is likely to have all of them, but each customer could have one or more. For analysis, we might need data that looks like Table 1, whereas a data owner would typically have data that looks like Table 2.

We’ve created a standard where the data owner provides “long form data” as per Table 2 and we automatically create the necessary “wide form data” shown in Table 1 before the analytics engine uses the data for modeling. This approach allows us to take in conveniently formatted data from its source and to use conveniently formatted data for modeling, housing the standard transformations in high-efficiency, professionally written, low-level Scala code rather than in one-off transformations built through data-science-friendly interfaces. By preparing this data in a separate stage, we can also let multiple models – built-in and user-created – leverage even larger tables (200-500 attributes), effectively reducing the data cleansing and management required of the models.

[Table 2: long-form data – one row per customer-product pair]
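Here is a hedged sketch of that long-to-wide step in Spark/Scala. The schema, product list and sample rows are purely illustrative, and our production code differs, but the pivot captures the idea:

```scala
// Sketch only: illustrative schema and data, showing the long-form to
// wide-form pivot performed before modeling.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

val spark = SparkSession.builder.appName("long-to-wide").getOrCreate()
import spark.implicits._

// Long form, as a data owner would supply it: one row per customer-product pair.
val longForm = Seq(
  ("C001", "chequing"),
  ("C001", "credit_card"),
  ("C002", "savings"),
  ("C002", "mortgage"),
  ("C002", "brokerage")
).toDF("customer_id", "product")

// Fixing the full product list keeps the wide schema stable even when a
// particular extract happens to contain, say, no car loans.
val products = Seq("chequing", "savings", "credit_card", "car_loan", "mortgage", "brokerage")

// Wide form, as the analytics engine consumes it: one row per customer,
// one indicator column per product (missing combinations become 0).
val wideForm = longForm
  .groupBy("customer_id")
  .pivot("product", products)
  .agg(count(lit(1)))
  .na.fill(0)
```

Counts greater than one (say, a customer holding two credit cards) can be capped to a 0/1 indicator or kept as-is, depending on what the downstream model expects.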

This manner of data collection and storage is very consistent with Big Data principles, which favour denormalization over SQL-style joins across large numbers of interconnected tables. The extra layer of data, pre-transformed in high-efficiency Scala code, enables data scientists to begin work with tabulated, formatted data, reducing development and implementation time for new and enhanced models. It also lets a clear line be drawn between the data provider, the development team and the data science team – each has a clear interface and set of expectations to work towards. Together with a couple of other key principles and assumptions, this standard has facilitated the transformation of a traditional, installed solution into a big-data-compatible, REST API service in the cloud.


#GlobalAIandDataScience
#GlobalDataScience

Comments

Thu September 13, 2018 08:54 AM

This is certainly a problem for many of us! Due to best practices, I always start my analyses with a cup of coffee and a data transpose. :)

Tall, narrow datasets are easier to store, maintain, back up, expand, and filter, BUT you can't do any actual analysis with them. I like the idea of having a standard transpose as an automatic part of accessing the dataset. It takes one step out of preparing the data before you can actually do anything with it.