Global Data Science Forum


How to impute missing string value in a csv?

  • 1.  How to impute missing string value in a csv?

    Posted Mon November 16, 2020 01:49 PM
    Hi Community,

    We can impute missing values (especially when they are integers) with scikit-learn's Imputer. However, can you please suggest
    how I can impute a string value in a CSV file?

    For example, here is a portion of a CSV file with missing string values:

    managed_by     os                        support_group
    John Wick      NaN                       IBM-AP-Support
    NaN            NaN                       IBM-AP-Support
    Derek Olsen    Linux                     IBM-VDI-Support
    Mark Robinson  Windows 2012 R2 Standard  IBM-VDI-Support
    NaN            Windows 2012 R2 Standard  IBM-VDI-Support
    Nick jagger    Windows 2012 R2 Standard  NaN


    ------------------------------
    Tej Yadav
    ------------------------------


  • 2.  RE: How to impute missing string value in a csv?

    Posted Tue November 17, 2020 01:44 AM
    If the dataframe is big enough, the most elegant solution is simply to drop the rows that contain NaN. With numerical columns you can statistically infer some of the missing values without changing the weight of the distribution. The problem with categorical variables is that you can't infer someone's name or the operating system they were using.
    If you want to keep those NaN rows, you can fill the missing categorical values with a placeholder that is meaningful to you and that will show up in the final calculations. With a pandas DataFrame `df`, you can use

    df.fillna('value')

    That way you can understand the role of the missing data.
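
    As a rough sketch of the two options above, the snippet below rebuilds the sample rows from the question and compares dropping incomplete rows with filling them. The placeholder string "Unknown" and the variable names `dropped` and `filled` are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question; NaN marks a missing string.
df = pd.DataFrame({
    "managed_by": ["John Wick", np.nan, "Derek Olsen",
                   "Mark Robinson", np.nan, "Nick jagger"],
    "os": [np.nan, np.nan, "Linux", "Windows 2012 R2 Standard",
           "Windows 2012 R2 Standard", "Windows 2012 R2 Standard"],
    "support_group": ["IBM-AP-Support", "IBM-AP-Support", "IBM-VDI-Support",
                      "IBM-VDI-Support", "IBM-VDI-Support", np.nan],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: keep all rows and mark the gaps with an explicit placeholder
# ("Unknown" is an arbitrary label, pick whatever is meaningful to you).
filled = df.fillna("Unknown")

print(len(df), len(dropped))        # rows before and after dropping
print(filled["os"].value_counts())  # the placeholder shows up as its own value
```

    On a sample this small, dropping discards most of the rows, which is why it is only attractive when the dataframe is large.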


    ------------------------------
    Diego Cardalliaguet
    Europe GEO Technical Sales
    IBM
    ------------------------------



  • 3.  RE: How to impute missing string value in a csv?

    Posted Tue November 17, 2020 02:51 AM
    Thanks Diego, that's a good idea.

    I have found one more way to fill such categorical data, using the pandas mode method. However, it's not accurate for names.

    It works on both Series and DataFrames, filling the missing values in each column with that column's own most frequent value:

    df = df.fillna(df.mode().iloc[0])
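
    To see what that one-liner does on the sample rows from the question, here is a small self-contained sketch. The variable names `modes` and `filled` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question; NaN marks a missing string.
df = pd.DataFrame({
    "managed_by": ["John Wick", np.nan, "Derek Olsen",
                   "Mark Robinson", np.nan, "Nick jagger"],
    "os": [np.nan, np.nan, "Linux", "Windows 2012 R2 Standard",
           "Windows 2012 R2 Standard", "Windows 2012 R2 Standard"],
    "support_group": ["IBM-AP-Support", "IBM-AP-Support", "IBM-VDI-Support",
                      "IBM-VDI-Support", "IBM-VDI-Support", np.nan],
})

# df.mode() returns one row per tied mode; .iloc[0] keeps the first
# mode of each column as a Series indexed by column name.
modes = df.mode().iloc[0]

# fillna with a Series fills each column with its own entry.
filled = df.fillna(modes)

print(modes)
print(filled)
```

    For `os` and `support_group` the fill value is a genuine majority value. In `managed_by` every name occurs exactly once, so the "mode" is just one of the existing names, which illustrates why this approach is not accurate for names.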

    ------------------------------
    Tej Yadav
    ------------------------------



  • 4.  RE: How to impute missing string value in a csv?

    Posted Tue November 17, 2020 03:58 AM

    Well, in your example, the `os` string column is categorical, which can be handled by SimpleImputer.

    Imputers are actually lightweight predictors, so the choice also depends on whether the absence of a value is significant for the rest of your processing, versus having a 'guessed'/'best shot' value.
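
    A minimal sketch of the SimpleImputer route for the `os` column, assuming scikit-learn's `sklearn.impute.SimpleImputer` with `strategy="most_frequent"` (one possible choice, not the only one). The output column name `os_imputed` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The `os` column from the question, kept as plain Python objects.
df = pd.DataFrame(
    {"os": [np.nan, np.nan, "Linux", "Windows 2012 R2 Standard",
            "Windows 2012 R2 Standard", "Windows 2012 R2 Standard"]},
    dtype=object,
)

# "most_frequent" works on string data: every NaN is replaced
# by the most common value seen during fit.
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform expects a 2-D input and returns a 2-D array,
# so select with double brackets and flatten the result.
df["os_imputed"] = imputer.fit_transform(df[["os"]]).ravel()

print(df)
```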

    For example, in the case of `os`, depending on the use case you may want to handle it differently.

    If you are in an I/T support kind of use case, you will probably want to retain a missing `os` value as a category of its own, because it is likely to be correlated with the diagnostics. An unknown OS can be an indication of a totally computer-illiterate user, which usually comes with benign non-issues reported to support. In a corporate I/T context, it can be an indication of a non-instrumented machine (i.e. no agent on it), which is a category that could have issues of its own.

    Trying to fill in non-existent values may actually have an adverse effect on the predictions you'll derive from the dataset.
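
    One way to keep "missing" as information rather than guessing a value is sketched below with plain pandas. The flag column `os_missing`, the label "unknown", and the per-group summary are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample columns taken from the question; NaN marks a missing string.
df = pd.DataFrame({
    "os": [np.nan, np.nan, "Linux", "Windows 2012 R2 Standard",
           "Windows 2012 R2 Standard", "Windows 2012 R2 Standard"],
    "support_group": ["IBM-AP-Support", "IBM-AP-Support", "IBM-VDI-Support",
                      "IBM-VDI-Support", "IBM-VDI-Support", np.nan],
})

# Record the fact that the value was missing before touching the column.
df["os_missing"] = df["os"].isna()

# Give the gap its own explicit category instead of a guessed OS.
df["os"] = df["os"].fillna("unknown").astype("category")

# Share of rows with a missing OS per support group
# (rows whose support_group is itself NaN are left out by groupby).
missing_rate = df.groupby("support_group")["os_missing"].mean()

print(df["os"].cat.categories.tolist())
print(missing_rate)
```

    A downstream model can then use either the "unknown" category or the flag to pick up any correlation between missing `os` values and the outcome.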



    ------------------------------
    Philippe Gregoire
    IBM France - TSP & ISV Technical Enablement - Data&AI, IoT Europe
    NICE, France
    ------------------------------