Global AI and Data Science

How to impute missing string value in a csv?

1. How to impute missing string value in a csv?

Tej Yadav

Posted Mon November 16, 2020 01:49 PM

Hi Community,

We can impute missing values (especially when they are integers) with scikit-learn's imputer. However, could you please suggest
how I can impute a string value in a CSV file?

For example, here is a portion of a CSV file with missing string values:

managed_by	os	support_group
John Wick	NaN	IBM-AP-Support
NaN	NaN	IBM-AP-Support
Derek Olsen	Linux	IBM-VDI-Support
Mark Robinson	Windows 2012 R2 Standard	IBM-VDI-Support
NaN	Windows 2012 R2 Standard	IBM-VDI-Support
Nick jagger	Windows 2012 R2 Standard	NaN

------------------------------
Tej Yadav
------------------------------

#GlobalAIandDataScience
#GlobalDataScience

2. RE: How to impute missing string value in a csv?

Diego Cardalliaguet

Posted Tue November 17, 2020 01:44 AM

If the dataframe is big enough, the most elegant solution is simply to drop the rows that contain NaN. With numeric variables you can statistically infer some of the missing values without noticeably shifting the distribution. The problem with categorical variables is that you can't infer someone's name or the operating system they were using.
If you want to keep those NaN rows, you can fill the missing categorical values with something meaningful to you, so that it shows up in the final calculations. You can use

df.fillna('value')

That way you can understand the role of the missing data.
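As a minimal sketch of the two options above, the snippet below rebuilds the sample rows from the question and shows both dropping the incomplete rows and filling them with a placeholder. The placeholder text "Unknown" and the variable names are assumptions chosen for illustration, not something from the original data.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; NaN marks the missing strings.
df = pd.DataFrame({
    "managed_by": ["John Wick", np.nan, "Derek Olsen",
                   "Mark Robinson", np.nan, "Nick jagger"],
    "os": [np.nan, np.nan, "Linux", "Windows 2012 R2 Standard",
           "Windows 2012 R2 Standard", "Windows 2012 R2 Standard"],
    "support_group": ["IBM-AP-Support", "IBM-AP-Support", "IBM-VDI-Support",
                      "IBM-VDI-Support", "IBM-VDI-Support", np.nan],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: keep all rows and mark the gaps with an explicit placeholder
# ("Unknown" is an arbitrary label picked for this example).
filled = df.fillna("Unknown")

print(dropped)
print(filled)
```

On a sample this small, dropping rows discards most of the data, which is why the placeholder route may be preferable unless the dataframe is large.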

------------------------------
Diego Cardalliaguet
Europe GEO Technical Sales
IBM
------------------------------

3. RE: How to impute missing string value in a csv?

Tej Yadav

Posted Tue November 17, 2020 02:51 AM

Thanks Diego, that's a good idea.

I have found one more way to fill such categorical data, using the pandas mode method. However, it is not accurate for names.

It works on both Series and DataFrames, filling the missing values in each column with that column's own most frequent value:

df = df.fillna(df.mode().iloc[0])
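To illustrate why the mode works for `os` and `support_group` but not for names, here is a small sketch on the sample rows from the question. Restricting the fill to a hand-picked list of columns (`cols`) is an assumption added for this example; the one-liner above fills every column.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "managed_by": ["John Wick", np.nan, "Derek Olsen",
                   "Mark Robinson", np.nan, "Nick jagger"],
    "os": [np.nan, np.nan, "Linux", "Windows 2012 R2 Standard",
           "Windows 2012 R2 Standard", "Windows 2012 R2 Standard"],
    "support_group": ["IBM-AP-Support", "IBM-AP-Support", "IBM-VDI-Support",
                      "IBM-VDI-Support", "IBM-VDI-Support", np.nan],
})

# Every name occurs exactly once, so all four names tie as "the mode".
# Picking the first of them would assign an arbitrary person to the gaps.
name_modes = df["managed_by"].mode()

# Fill only the columns where "most frequent value" is a sensible guess.
cols = ["os", "support_group"]
filled = df.copy()
filled[cols] = filled[cols].fillna(filled[cols].mode().iloc[0])

print(name_modes)
print(filled)
```

With this approach the `managed_by` gaps stay as NaN, so they can be handled separately.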

------------------------------
Tej Yadav
------------------------------

4. RE: How to impute missing string value in a csv?

Philippe Gregoire

Posted Tue November 17, 2020 03:58 AM

Well, in your example, that `os` string column is a categorical, which can be handled by SimpleImputer.
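A minimal sketch of that idea, assuming scikit-learn's `SimpleImputer` with the `most_frequent` strategy and the `os` values from the question. The output column name `os_imputed` is made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The os column from the question, kept as plain Python objects.
df = pd.DataFrame(
    {"os": [np.nan, np.nan, "Linux", "Windows 2012 R2 Standard",
            "Windows 2012 R2 Standard", "Windows 2012 R2 Standard"]},
    dtype=object,
)

# most_frequent works on string columns; mean and median do not.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# fit_transform expects 2D input and returns a 2D array, hence the
# double brackets on the way in and ravel() on the way out.
df["os_imputed"] = imputer.fit_transform(df[["os"]]).ravel()

print(imputer.statistics_)  # the value learned for each input column
print(df)
```

The `constant` strategy with a `fill_value` is the other `SimpleImputer` option that accepts strings, for cases where a fixed label is preferred over a guess.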

Imputers are actually lightweight predictors, so the choice also depends on whether the absence of a value is significant for the rest of your processing, compared with having a 'guessed'/'best shot' value.

For example, in the case of `os`, you may want to handle it differently depending on the use case.

If you are in an I/T support kind of use case, you will probably want to retain a missing `os` value as a category of its own, because this is likely to be correlated with the diagnostics. An unknown OS can be an indication of a totally computer-illiterate user, which usually comes with benign non-issues reported to support. In a corporate I/T context, it can be an indication of a non-instrumented machine (i.e. no agent on it), which is a category that could have issues of its own.

Trying to fill in non-existent values may actually have an adverse effect on the predictions you'll derive from the dataset.
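One possible way to retain a missing `os` as a category of its own is sketched below with pandas only. The category label "missing", the flag column `os_missing` and the `os` prefix on the encoded columns are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# The os column from the question.
df = pd.DataFrame(
    {"os": [np.nan, np.nan, "Linux", "Windows 2012 R2 Standard",
            "Windows 2012 R2 Standard", "Windows 2012 R2 Standard"]},
    dtype=object,
)

# Record which rows had no os, so the information survives later steps.
df["os_missing"] = df["os"].isna()

# Convert to a categorical, register "missing" as a valid category,
# then use it for the gaps instead of guessing a real operating system.
os_cat = df["os"].astype("category").cat.add_categories("missing").fillna("missing")

# One-hot encode; the missing values get their own indicator column.
encoded = pd.get_dummies(os_cat, prefix="os")

print(os_cat.value_counts())
print(encoded)
```

A downstream model can then learn whether "no OS reported" is itself predictive, rather than having that signal overwritten by an imputed value.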

------------------------------
Philippe Gregoire
IBM France - TSP & ISV Technical Enablement - Data&AI, IoT Europe
NICE, France
------------------------------
