Global AI and Data Science


Cancer Diagnosis Using Linear Support Vector Machine (LSVM)

By RUI WANG posted Fri November 23, 2018 03:48 AM

  

Tony, a medical researcher, obtained a data set that contains characteristics of a number of human cell samples that are extracted from patients who were believed to be at risk of developing cancer. He analyzed the original data and found that many of the characteristics differed significantly between benign and malignant samples.

He wants to develop a model that uses the values of these cell characteristics in samples from other patients to predict whether those samples are benign or malignant.

Many algorithms can be used for a binary classification problem like this one, such as Random Forests, C5, Neural Net and so on. In this use case, Tony uses the IBM SPSS LSVM algorithm.

LSVM is one of the most popular algorithms for performing binary or multi-class classification and regression on large, wide, and sparse data sets.
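
To make the idea concrete, here is a minimal sketch of how a generic linear SVM scores a sample: it computes a weighted sum of the inputs plus a bias and takes the sign of the result. This is not the internals of IBM SPSS LSVM. The weights, the bias, and the names `decision` and `hingeLoss` are made up for illustration.

```scala
// Generic linear SVM scoring, for illustration only (not the SPSS LSVM internals).
// The weights and bias below are hypothetical values, not a fitted model.
def decision(w: Array[Double], b: Double, x: Array[Double]): Double =
  w.zip(x).map { case (wi, xi) => wi * xi }.sum + b

// Hinge loss for a label y in {-1, +1}: zero when the sample lies on the
// correct side of the margin, growing linearly otherwise.
def hingeLoss(y: Double, score: Double): Double =
  math.max(0.0, 1.0 - y * score)

val w = Array(0.5, -0.25)
val b = -1.0
val x = Array(4.0, 2.0)

val score = decision(w, b, x) // 0.5*4.0 - 0.25*2.0 - 1.0
val predicted = if (score >= 0) 1 else -1
println(s"score=$score predicted=$predicted loss=${hingeLoss(1.0, score)}")
```

Training a linear SVM amounts to choosing the weights and bias that keep this loss small over the training data, usually with a regularization term added.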

IBM SPSS LSVM is available in product integration with UI and through the Spark and Python API.

In this use case, Tony uses the default settings of the LSVM Spark API to produce a basic model relatively quickly in an IBM Watson Studio notebook that uses Spark 2.1 and Scala 2.11.

Data Description

Tony gets the data (breast-cancer-wisconsin.data) and its description (breast-cancer-wisconsin.names) from the UCI Machine Learning Repository. The data set consists of 699 human cell sample records, each of which contains the values of a set of cell characteristics. The fields and their value ranges are:

  1. Sample code number: id number
  2. Clump Thickness: 1 - 10
  3. Uniformity of Cell Size: 1 - 10
  4. Uniformity of Cell Shape: 1 - 10
  5. Marginal Adhesion: 1 - 10
  6. Single Epithelial Cell Size: 1 - 10
  7. Bare Nuclei: 1 - 10
  8. Bland Chromatin: 1 - 10
  9. Normal Nucleoli: 1 - 10
  10. Mitoses: 1 - 10
  11. Class: (2 for benign, 4 for malignant)
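
The field list above maps naturally onto a typed record. The sketch below parses one comma-separated line into a hypothetical `CellSample` case class. The class name, the helper `parseLine`, and the sample line are illustrative assumptions, not part of the data set or of any SPSS API. Any characteristic value that is not an integer is treated as missing.

```scala
import scala.util.Try

// Hypothetical typed view of one record; field order follows the list above.
case class CellSample(
  id: Long,
  features: Vector[Option[Int]], // the nine characteristics, each expected in 1..10
  cls: Int                       // 2 = benign, 4 = malignant
)

// Parse one CSV line. Returns None when the line does not have 11 fields
// or when the id or class cannot be read as a number.
def parseLine(line: String): Option[CellSample] = {
  val parts = line.split(",", -1).map(_.trim)
  if (parts.length != 11) None
  else
    for {
      id  <- Try(parts(0).toLong).toOption
      cls <- Try(parts(10).toInt).toOption
    } yield CellSample(
      id,
      parts.slice(1, 10).toVector.map(p => Try(p.toInt).toOption),
      cls
    )
}

// Made-up line for illustration, not taken from the data set.
val sample = parseLine("1234567,5,1,1,1,2,1,3,1,1,2")
println(sample)
```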

Upload and Access Data in IBM Watson Studio Notebook

Tony creates the project and notebook in Watson Studio and selects Spark 2.1 with Scala 2.11 as the notebook's kernel.

Then, he follows the guide to upload and access the data.

The Insert to code button supports CSV and JSON files only. For convenience, Tony renames the data file from breast-cancer-wisconsin.data to breast-cancer-wisconsin.csv before uploading it, and changes the header option from true to false in the generated source code because the file has no header row.

Tony is familiar with SPSS algorithms, so he uses the SPSS enrich function to discover more metadata for each field of the original DataFrame.

import com.ibm.spss.ml.utils.DataFrameImplicits._
val df_enriched = dfData1.enrich
println("----- Metadata information on original dataframe -----")
dfData1.schema.foreach { x => println(x.metadata.toString()) }
println("----- Metadata information on enriched dataframe -----")
df_enriched.schema.foreach { x => println(x.metadata.toString()) }
----- Metadata information on original dataframe -----
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
----- Metadata information on enriched dataframe -----
{"ml_attr":{"name":"_c0","role":"input","label":"_c0","mean":1071704.0987124462,"min":61634.0,"std":617095.7298192448,"max":1.3454352E7,"storage":"integer","validCount":699}}
{"ml_attr":{"name":"_c1","role":"input","vals":["1","2","3","4","5","6","7","8","9","10"],"label":"_c1","storage":"integer","validCount":699,"type":"nominal","ord":true}}
{"ml_attr":{"name":"_c2","role":"input","vals":["1","2","3","4","5","6","7","8","9","10"],"label":"_c2","storage":"integer","validCount":699,"type":"nominal","ord":true}}
{"ml_attr":{"name":"_c3","role":"input","vals":["1","2","3","4","5","6","7","8","9","10"],"label":"_c3","storage":"integer","validCount":699,"type":"nominal","ord":true}}
{"ml_attr":{"name":"_c4","role":"input","vals":["1","2","3","4","5","6","7","8","9","10"],"label":"_c4","storage":"integer","validCount":699,"type":"nominal","ord":true}}
{"ml_attr":{"name":"_c5","role":"input","vals":["1","2","3","4","5","6","7","8","9","10"],"label":"_c5","storage":"integer","validCount":699,"type":"nominal","ord":true}}
{"ml_attr":{"name":"_c6","role":"input","vals":["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],"label":"_c6","storage":"real","validCount":699,"type":"nominal","ord":true}}
{"ml_attr":{"name":"_c7","role":"input","vals":["1","2","3","4","5","6","7","8","9","10"],"label":"_c7","storage":"integer","validCount":699,"type":"nominal","ord":true}}
{"ml_attr":{"name":"_c8","role":"input","vals":["1","2","3","4","5","6","7","8","9","10"],"label":"_c8","storage":"integer","validCount":699,"type":"nominal","ord":true}}
{"ml_attr":{"name":"_c9","role":"input","vals":["1","2","3","4","5","6","7","8","10"],"label":"_c9","storage":"integer","validCount":699,"type":"nominal","ord":false}}
{"ml_attr":{"name":"_c10","role":"input","vals":["2","4"],"label":"_c10","storage":"integer","validCount":699,"type":"nominal","ord":false}}
Next, Tony splits the data set into two parts by using the randomSplit method with weights 0.8 and 0.2 and a fixed seed. About 80% of the data is used for training the model and about 20% is held out for testing it, so that overfitting can be detected.
val Array(training, test) = df_enriched.randomSplit(Array(0.8, 0.2), seed = 123456)
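
As a plain-Scala illustration of what a seeded weighted split does, the sketch below assigns each row to a split by drawing a uniform random number. It is not Spark's implementation. The helper `splitBySeed` and the stand-in rows are assumptions made for this example. Two properties carry over: the same seed reproduces the same split, and the split sizes are only approximately 80/20.

```scala
import scala.util.Random

// Assign each row to a split by drawing a uniform number in [0, 1).
// Similar in spirit to a weighted random split; sizes are approximate.
def splitBySeed[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  val tagged = rows.map(r => (r, rng.nextDouble()))
  val (train, test) = tagged.partition { case (_, u) => u < trainFraction }
  (train.map(_._1), test.map(_._1))
}

val records = 1 to 699 // stand-in for the 699 records
val (trainA, testA) = splitBySeed(records, 0.8, seed = 123456L)
val (trainB, testB) = splitBySeed(records, 0.8, seed = 123456L)

println(s"training=${trainA.size}, test=${testA.size}")
println(s"reproducible=${trainA == trainB && testA == testB}")
```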

Build Model

Tony maps the columns of the data (breast-cancer-wisconsin.data) to the descriptions of the cell characteristics (breast-cancer-wisconsin.names) as follows.

  1. _c0: Sample code number: id number
  2. _c1: Clump Thickness: 1 - 10
  3. _c2: Uniformity of Cell Size: 1 - 10
  4. _c3: Uniformity of Cell Shape: 1 - 10
  5. _c4: Marginal Adhesion: 1 - 10
  6. _c5: Single Epithelial Cell Size: 1 - 10
  7. _c6: Bare Nuclei: 1 - 10
  8. _c7: Bland Chromatin: 1 - 10
  9. _c8: Normal Nucleoli: 1 - 10
  10. _c9: Mitoses: 1 - 10
  11. _c10: Class: (2 for benign, 4 for malignant)

Because the purpose is to predict whether patients’ samples are benign or malignant, Tony selects the "_c10" field as the target and the fields "_c2" to "_c7" as the predictors, and uses the default settings to build the model.

import com.ibm.spss.ml.classificationandregression.LinearSupportVectorMachine
val lsvm = LinearSupportVectorMachine().
      setTargetField("_c10").
      setInputFieldList(Array("_c2", "_c3", "_c4", "_c5","_c6", "_c7"))
      
val lsvmModel = lsvm.fit(training)

Evaluate the Model

Tony knows that the Model Viewer offers interactive tables and charts that help evaluate and improve a predictive analytics model in a notebook, so he uses the Model Viewer API in Watson Studio to evaluate the effectiveness of the LSVM model. Before starting the Model Viewer, he completes the required "Inserting Project Token" step.

From the Model Viewer output, Tony sees the model accuracy, residuals, and other model-related information.

Predictor Importance

Tony wants to know which predictors affect the target the most, so he uses the predictor importance API.

import com.ibm.spss.ml.utils.PredictorImportance

val pmml=lsvmModel.toPMML()
val pi = PredictorImportance(pmml)
val piModel = pi.fit(training)
val piPMML =  piModel.toPMML()

val printer = new scala.xml.PrettyPrinter(800, 2)

println(printer.format(scala.xml.XML.loadString(piPMML)))
<GeneralRegressionModel modelType="multinomialLogistic" targetVariableName="_c10" algorithmName="linearSVM" functionName="classification" targetReferenceCategory="2">
        <Extension extender="spss.com" name="modelID" value="0"/>
        <Extension extender="spss.com" name="allowMissingFactors" value="true"/>
        <MiningSchema>
            <MiningField name="_c2" importance="0.13404081589509154"/>
            <MiningField name="_c3" importance="0.18105193857009028"/>
            <MiningField name="_c4" importance="0.056402197229650196"/>
            <MiningField name="_c5" importance="0.1301081997118011"/>
            <MiningField name="_c6" importance="0.34451844956217215"/>
            <MiningField name="_c7" importance="0.15387839903119477"/>
            <MiningField name="_c10" usageType="predicted"/>
        </MiningSchema>


Then, from the predictor importance PMML, he finds that "_c6" (Bare Nuclei) has the greatest impact on the target "_c10" (Class: benign or malignant).

<MiningField name="_c6" importance="0.34451844956217215"/>
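
Reading the ranking by eye works for six predictors, but it can also be extracted programmatically. The following sketch pulls the importance values out of the MiningSchema fragment shown above and sorts them. It uses a regular expression as a shortcut for this flat fragment, and a real XML parser would be the more robust choice for a full PMML document. The names `miningSchema`, `fieldNames`, and `ranked` are introduced here for illustration.

```scala
// MiningField lines copied from the predictor importance output above.
val miningSchema =
  """<MiningField name="_c2" importance="0.13404081589509154"/>
    |<MiningField name="_c3" importance="0.18105193857009028"/>
    |<MiningField name="_c4" importance="0.056402197229650196"/>
    |<MiningField name="_c5" importance="0.1301081997118011"/>
    |<MiningField name="_c6" importance="0.34451844956217215"/>
    |<MiningField name="_c7" importance="0.15387839903119477"/>
    |<MiningField name="_c10" usageType="predicted"/>""".stripMargin

// Column-to-characteristic mapping from the Build Model section.
val fieldNames = Map(
  "_c2" -> "Uniformity of Cell Size",
  "_c3" -> "Uniformity of Cell Shape",
  "_c4" -> "Marginal Adhesion",
  "_c5" -> "Single Epithelial Cell Size",
  "_c6" -> "Bare Nuclei",
  "_c7" -> "Bland Chromatin"
)

// Matches only fields that carry an importance attribute, so the target is skipped.
val pattern = """<MiningField name="([^"]+)" importance="([^"]+)"/>""".r

val ranked = pattern.findAllMatchIn(miningSchema)
  .map(m => (m.group(1), m.group(2).toDouble))
  .toList
  .sortBy { case (_, importance) => -importance }

ranked.foreach { case (field, importance) =>
  val name = fieldNames.getOrElse(field, field)
  println(f"$field%-4s $name%-28s $importance%.4f")
}
```

The six importance values sum to 1, so each can be read as a share of the total: Bare Nuclei accounts for roughly a third, and Marginal Adhesion contributes the least.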

Prediction

Tony uses the test data to get predictions.

val predictions = lsvmModel.transform(test)
predictions.show(5)
An approximate probability for each prediction is also produced.
+------+---+---+---+---+---+----+---+---+---+----+----------+------------------+--------------------+------------------+--------------------+
|   _c0|_c1|_c2|_c3|_c4|_c5| _c6|_c7|_c8|_c9|_c10|prediction|          $LC-_c10|               $LP-2|             $LP-4|       rawPrediction|
+------+---+---+---+---+---+----+---+---+---+----+----------+------------------+--------------------+------------------+--------------------+
| 63375|  9|  1|  2|  6|  4|10.0|  7|  7|  2|   4|         4|0.7042655455523911| 0.29573445444760893|0.7042655455523911|[0.29573445444760...|
|142932|  7|  6| 10|  5|  3|10.0|  9| 10|  2|   4|         4|0.7653807235774707| 0.23461927642252933|0.7653807235774707|[0.23461927642252...|
|160296|  5|  8|  8| 10|  5|10.0|  8| 10|  3|   4|         4|0.9960750430935084|0.003924956906491644|0.9960750430935084|[0.00392495690649...|
|255644| 10|  5|  8| 10|  3|10.0|  5|  1|  3|   4|         4|0.9941445068558745|0.005855493144125...|0.9941445068558745|[0.00585549314412...|
|314428|  7|  9|  4| 10| 10| 3.0|  5|  3|  3|   4|         4|0.9703319265260232| 0.02966807347397682|0.9703319265260232|[0.02966807347397...|
+------+---+---+---+---+---+----+---+---+---+----+----------+------------------+--------------------+------------------+--------------------+
only showing top 5 rows
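
In the rows shown above, the two probability columns ($LP-2 and $LP-4) sum to 1, the prediction is the class with the larger probability, and $LC-_c10 equals that larger value. This is a reading of the displayed output, not a documented definition of those columns. The sketch below checks the relationship on values copied from the first three rows, and the helper `pick` is hypothetical.

```scala
// ($LP-2, $LP-4) pairs copied from the first three rows of the output above.
val probabilities = Seq(
  (0.29573445444760893, 0.7042655455523911),
  (0.23461927642252933, 0.7653807235774707),
  (0.003924956906491644, 0.9960750430935084)
)

// Choose the class with the larger probability and report that probability
// as the confidence. Class codes: 2 = benign, 4 = malignant.
def pick(p2: Double, p4: Double): (Int, Double) =
  if (p4 > p2) (4, p4) else (2, p2)

val picked = probabilities.map { case (p2, p4) => pick(p2, p4) }

picked.foreach { case (cls, confidence) =>
  println(s"prediction=$cls confidence=$confidence")
}
```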

How good are the predictions? Tony uses a simple method that computes the accuracy by comparing the label column of the test data with the prediction column. In this case, the evaluation returns 93.13% accuracy.

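// Column 10 is the label (_c10) and column 11 is the prediction.
// Each row becomes 0.0 for a match and 1.0 for a mismatch, so the sum counts the errors.
val predictionAndLabels = predictions.map{r => (r.getLong(10),r.getLong(11))}.rdd.map{r => if (r._1 == r._2) 0.0 else 1.0}
// Accuracy = (rows - errors) / rows
val accuracy = (predictionAndLabels.count - predictionAndLabels.sum)/predictionAndLabels.count
println("%2.2f%%".format(accuracy*100))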
93.13%
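
The figure above is the share of test rows whose prediction matches the label, which is accuracy. For a diagnosis task it can also help to look at precision and recall for the malignant class, because a missed malignant sample and a false alarm carry different costs. The sketch below computes all three from (label, prediction) pairs in plain Scala. The `Confusion` class, the `confusion` helper, and the sample pairs are made up for illustration and are not results from Tony's model.

```scala
// Outcome counts for a binary problem, with one class treated as "positive".
// precision and recall are NaN when their denominators are zero.
case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int) {
  def accuracy: Double  = (tp + tn).toDouble / (tp + fp + tn + fn)
  def precision: Double = tp.toDouble / (tp + fp)
  def recall: Double    = tp.toDouble / (tp + fn)
}

// Tally (label, prediction) pairs into the four outcome counts.
def confusion(pairs: Seq[(Int, Int)], positive: Int): Confusion =
  pairs.foldLeft(Confusion(0, 0, 0, 0)) { case (c, (label, pred)) =>
    (label == positive, pred == positive) match {
      case (true, true)   => c.copy(tp = c.tp + 1)
      case (false, true)  => c.copy(fp = c.fp + 1)
      case (false, false) => c.copy(tn = c.tn + 1)
      case (true, false)  => c.copy(fn = c.fn + 1)
    }
  }

// Made-up (label, prediction) pairs; 4 = malignant, 2 = benign.
val pairs = Seq((4, 4), (4, 4), (4, 2), (2, 2), (2, 2), (2, 2), (2, 4), (2, 2))
val cm = confusion(pairs, positive = 4)

println(cm)
println(f"accuracy=${cm.accuracy}%.4f precision=${cm.precision}%.4f recall=${cm.recall}%.4f")
```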

Deployment

Tony gets an acceptable model that has 99.6% accuracy on the training data and 93.13% on the test data. Next, he wants to deploy it in IBM Watson Studio Cloud as an API to make predictions on new data instances, so he needs to work out how to represent the built model and deliver it for the deployment step.

Fortunately, Tony finds that the LSVM model supports export to the Predictive Model Markup Language (PMML) with the following single line of code. PMML is the leading standard for statistical and data mining models. With PMML, you can share models with other applications that support this format, such as IBM Watson Studio Cloud.

val PMML = lsvmModel.toPMML()
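
If the deployment step expects a file rather than an in-memory string (an assumption here), the exported text can be written out with the standard Java file API. The sketch below uses a placeholder string in place of the real export and a temporary file in place of a real destination. The names `pmmlText` and `pmmlPath` are illustrative.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Placeholder for the string returned by toPMML(); a real export is a full PMML document.
val pmmlText = "<PMML></PMML>"

// Write the text as UTF-8 to a temporary file (a stand-in for the real destination).
val pmmlPath: Path = Files.createTempFile("lsvm-model", ".xml")
Files.write(pmmlPath, pmmlText.getBytes(StandardCharsets.UTF_8))

// Read it back to confirm that the file holds exactly what was exported.
val roundTrip = new String(Files.readAllBytes(pmmlPath), StandardCharsets.UTF_8)
println(s"wrote ${roundTrip.length} characters to $pmmlPath")
```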
Finally, Tony successfully deploys the model to IBM Watson Studio Cloud by following the blog post Deploying Machine Learning Models in IBM Watson Studio Cloud as APIs.

Tony put the details of this blog post into the notebook LSVM-Classifying-Cell-Samples.

