Well, in your example, that `os` string column is a categorical, which can be handled by scikit-learn's `SimpleImputer` (with the `most_frequent` or `constant` strategy, since the numeric strategies do not apply to strings).
Imputers are actually lightweight predictors, so it will also depend on whether not having a value is significant for the rest of your processing, versus having a 'guessed'/'best shot' value.
For example, in the case of `os`, you may want to handle it differently depending on the use case.
If you are in an I/T support kind of use case, you will probably want to retain a missing `os` value as a category of its own, because it is likely to be correlated with the diagnostics. An unknown OS can be an indication of a totally computer-illiterate user, which usually comes with benign non-issues reported to support. In a corporate I/T context, it can be an indication of a non-instrumented machine (i.e. no agent on it), which is a category that could have issues of its own.
Trying to fill in non-existent values may actually have an adverse effect on the predictions you'll derive from the dataset.
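As a rough sketch of the 'best shot' approach, here is how `SimpleImputer` could fill the `os` column with its most frequent value. The DataFrame below is typed in by hand from the sample in your question (in practice you would load it with `pd.read_csv`), and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The `os` column from the sample in the question, typed in by hand.
# dtype=object keeps the strings and NaN together in one plain object column.
df = pd.DataFrame(
    {
        "os": [
            np.nan,
            np.nan,
            "Linux",
            "Windows 2012 R2 Standard",
            "Windows 2012 R2 Standard",
            "Windows 2012 R2 Standard",
        ]
    },
    dtype=object,
)

# "most_frequent" works on string columns; "mean" and "median" do not.
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform expects a 2D input, hence the double brackets.
filled = imputer.fit_transform(df[["os"]])

# Flatten the single column back to a plain list for inspection.
os_filled = filled[:, 0].tolist()
print(os_filled)
```

Here the two missing values become "Windows 2012 R2 Standard", simply because it is the most common value in the column.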
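If you prefer to keep a missing `os` as a category of its own, one possible sketch is to use the `constant` strategy with a placeholder label. The label "unknown" is just an assumption for illustration (pick whatever fits your data), and the DataFrame is again typed in by hand from the sample in the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The `os` column from the sample in the question, typed in by hand.
df = pd.DataFrame(
    {
        "os": [
            np.nan,
            np.nan,
            "Linux",
            "Windows 2012 R2 Standard",
            "Windows 2012 R2 Standard",
            "Windows 2012 R2 Standard",
        ]
    },
    dtype=object,
)

# Replace every missing value with one explicit placeholder category.
# "unknown" is a hypothetical label, not something the data provides.
imputer = SimpleImputer(strategy="constant", fill_value="unknown")
filled = imputer.fit_transform(df[["os"]])
os_filled = filled[:, 0].tolist()

# Count how many rows landed in each category, including the placeholder.
counts = pd.Series(os_filled).value_counts().to_dict()
print(counts)
```

A downstream encoder or model then sees "unknown" as a regular category, so the fact that the value was missing stays available as a signal instead of being overwritten by a guess.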
------------------------------
Philippe Gregoire
IBM France - TSP & ISV Technical Enablement - Data&AI, IoT Europe
NICE, France
------------------------------
Original Message:
Sent: Sun November 15, 2020 12:41 PM
From: Tej Yadav
Subject: How to impute missing string value in a csv?
Hi Community,
We can impute missing values (especially when they are integers) with scikit-learn's imputer. However, can you please suggest how I can impute a string value in a CSV file?
For example, here is a portion of a CSV file with missing string values:
managed_by | os | support_group |
John Wick | NaN | IBM-AP-Support |
NaN | NaN | IBM-AP-Support |
Derek Olsen | Linux | IBM-VDI-Support |
Mark Robinson | Windows 2012 R2 Standard | IBM-VDI-Support |
NaN | Windows 2012 R2 Standard | IBM-VDI-Support |
Nick jagger | Windows 2012 R2 Standard | NaN |
------------------------------
Tej Yadav
------------------------------