Recently, I've been getting familiar with car crash data for the Chicago area. This data is available from the City of Chicago site as part of its open government initiative.
Before we can use data for modeling, we must get familiar with it, evaluate its quality, and figure out whether we need derived features or additional data from other sources. In my case, the scope was much smaller: I wanted to find out how I could take advantage of the crashes file alone, one of the three files available. Much more could be done by also using the complementary files on vehicles and people.
The crashes file is a comma-separated (CSV) file with 48 columns. The version I worked on has 221,600 rows; a newer download would include more rows since the file is updated regularly.
The first step of the exploration is to compute basic statistics, such as how many non-null values we have in each column. This way, we can quickly eliminate columns that don't have enough values. For example, the column CRASH_DATE_EST_I only has 16,521 values, so it is absent about 92% of the time.
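With the crashes file loaded into Spark and registered as the collisions table (the table name used later in this notebook), the non-null counts can be computed along these lines (a sketch, not the exact code I used):

from pyspark.sql import functions as F

# Count the non-null values in each of the 48 columns
crashes = spark.table("collisions")
crashes.select([F.count(F.col(c)).alias(c) for c in crashes.columns]) \
       .show(vertical=True, truncate=False)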
Other columns may be eliminated even though they are filled in for all records. The WEATHER_CONDITION field, for example, is set to CLEAR 178,041 times (80%), so it may not discriminate much between accidents.
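A quick way to see this kind of value distribution (a sketch, using the same collisions table as above):

# Frequency of each WEATHER_CONDITION value, most common first
spark.table("collisions").groupBy("WEATHER_CONDITION") \
     .count().orderBy("count", ascending=False).show()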
One surprise was POSTED_SPEED_LIMIT, which has 35 different values between 0 and 99. The field may still be usable, but some values should be consolidated, and it is likely that 0 and 99 should be treated as null values. These decisions should be made before using the field in modeling.
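One way to apply that consolidation (a sketch; the decision itself is the important part, not this particular code):

from pyspark.sql import functions as F

# Treat the implausible 0 and 99 speed limits as nulls before modeling
crashes = spark.table("collisions").withColumn(
    "POSTED_SPEED_LIMIT",
    F.when(F.col("POSTED_SPEED_LIMIT").isin(0, 99), F.lit(None))
     .otherwise(F.col("POSTED_SPEED_LIMIT")))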
Many other basic statistics can be computed, such as the number of accidents per month of the year, day of the week, hour of the day, street name, and so on.
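For instance, accidents per hour of the day come from a simple aggregation (a sketch; CRASH_HOUR is the column name I would expect in the crashes file, so treat it as an assumption):

# Number of accidents per hour of the day (CRASH_HOUR assumed to be 0-23)
spark.sql("""
    select CRASH_HOUR, count(*) as nb_accidents
    from collisions
    group by CRASH_HOUR
    order by CRASH_HOUR
""").show(24)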
There are additional fields that give us the exact location of the accident and are present 99.6% of the time: LATITUDE and LONGITUDE.
We can get a feel for the distribution of the accidents by using simple plotting with matplotlib.
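A minimal version of that plot, pulling the coordinates into pandas first (a sketch; with roughly 220,000 points this is still reasonable for a notebook):

import matplotlib.pyplot as plt

# Collect the coordinates and plot them as small dots
locations = spark.sql("""
    select LONGITUDE, LATITUDE from collisions
    where LATITUDE is not null and LONGITUDE is not null
""").toPandas()

plt.figure(figsize=(8, 10))
plt.scatter(locations.LONGITUDE, locations.LATITUDE, s=0.5)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Chicago accident locations")
plt.show()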

It is amazing to see that the accident locations give us a crude street map of the Chicago area. We can take it one step further by color coding the accident types.

We see the accidents with fatalities in red, the ones with injuries in yellow, and the ones with only material damage in green.
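A sketch of that color coding, assuming the crashes file has INJURIES_FATAL and INJURIES_TOTAL columns (these column names are my assumption, not something shown above):

import matplotlib.pyplot as plt

# red = at least one fatality, yellow = injuries, green = material damage only
severity = spark.sql("""
    select LONGITUDE, LATITUDE, INJURIES_FATAL, INJURIES_TOTAL
    from collisions
    where LATITUDE is not null and LONGITUDE is not null
""").toPandas()

colors = ["red" if f > 0 else ("yellow" if i > 0 else "green")
          for f, i in zip(severity.INJURIES_FATAL, severity.INJURIES_TOTAL)]

plt.figure(figsize=(8, 10))
plt.scatter(severity.LONGITUDE, severity.LATITUDE, s=0.5, c=colors)
plt.show()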
The question I had at this point was: Is it possible to find center locations of accidents? Can we group them in a way that gives us an indication of the frequency of accidents in a specific area? This is where k-means comes in.
At first, I thought I could group accidents by reducing the precision of their longitude and latitude. Then it dawned on me that grouping (or clustering) is exactly what k-means is good for. Of course, visualization is key to a better understanding of the results. So, I installed the appropriate libraries in my Watson Studio Python notebook. This meant using the PixieDust library with Mapbox to display the results.
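In a Watson Studio notebook, that setup is typically just a pip install and an import (your environment may differ):

# Install and load PixieDust; its display() function can render a Spark
# DataFrame on a Mapbox map directly in the notebook
!pip install --user --upgrade pixiedust
import pixiedust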
I used the k-means algorithm that is part of the Spark environment. The code was surprisingly simple. I first created a new data frame with the longitude and latitude in the proper format:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

# Build a DataFrame with a single "features" vector column (longitude, latitude)
data1 = spark.createDataFrame(
    spark.sql("""
        select LONGITUDE, LATITUDE from collisions
        where LATITUDE is not null
        and LONGITUDE is not null
    """).rdd.map(lambda r: Row(Vectors.dense([r.LONGITUDE, r.LATITUDE]))),
    ["features"])
Then I created the model and extracted the centers:
from pyspark.ml.clustering import KMeans

# Fit a 10-cluster model and extract the cluster centers
kmeans = KMeans(k=10, seed=123)
model = kmeans.fit(data1)
centers = model.clusterCenters()
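To attach the number of accidents to each center, one option (a sketch, not necessarily the exact code from my notebook) is to run the data back through the model and count the rows assigned to each cluster:

# "prediction" is the default output column of the Spark KMeans model:
# it holds the index of the closest center for each accident
assignments = model.transform(data1)
counts_per_center = assignments.groupBy("prediction").count().orderBy("prediction")
counts_per_center.show()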
With those counts attached to each center, I was able to get the following map:

The number of accidents per center seemed well balanced. Still, 10 centers for 220,747 accidents is probably too few to give us all the information we need. Changing the number of centers is a simple change to the k parameter of the model.
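The same approach can also be applied to a subset of the data, for example only the fatal accidents (a sketch; INJURIES_FATAL is an assumed column name in the crashes file):

# Cluster only the fatal accidents
fatal = spark.createDataFrame(
    spark.sql("""
        select LONGITUDE, LATITUDE from collisions
        where LATITUDE is not null
        and LONGITUDE is not null
        and INJURIES_FATAL > 0
    """).rdd.map(lambda r: Row(Vectors.dense([r.LONGITUDE, r.LATITUDE]))),
    ["features"])

fatal_model = KMeans(k=10, seed=123).fit(fatal)
fatal_centers = fatal_model.clusterCenters()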
When we look only at the 180 fatal accidents, using 10 centers may be more revealing, and it gives us the following map:

Through this exploration, we found that k-means may be useful for finding out where accidents are concentrated. It could even serve as a decision point for pricing car insurance policies based on the k-means centers closest to where people live and work.
If you want to know more about data science, please take a look at and follow the byte-size-data-science YouTube channel.