Global AI and Data Science

Spatio-Temporal Point Process Modeling for Predicting Crime Occurrences

By XIAO YAN ZHANG posted Fri December 07, 2018 01:28 AM

  

Introduction to Spatio-Temporal Point Process Modeling

Spatial (location) and temporal (time) information is embedded in nearly all business data. It is increasingly common for business users to combine this information with other business data to gain critical insights for optimal decision making. There are four kinds of STEM components that account for spatial (location) and temporal (time) information when you build models. Spatio-temporal point process modeling (PPM) is one of these components.

Spatio-temporal point processes (PPM) are models for events that occur in a continuous space-time domain. The events can occur at any point in time and at any location within the study area. The likelihood of an event occurring at a specific time and place is described by an intensity measure, and it is this intensity, rather than the individual occurrences, that point process models express.

The goal of the spatio-temporal point process model is to predict intensities at unobserved locations or at unobserved time points in the future. The method not only extrapolates trends in time and space, but also explicitly incorporates covariate information, which enables forecasting based on influential external factors and the evaluation of what-if scenarios. We aggregate occurrence data into regular time intervals and well-defined spatial areas. That is, we assume that the observed responses are counts of events in equally spaced time intervals on a time-invariant spatial lattice.
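The aggregation step described above can be sketched in a few lines of plain Python. The event tuples below are hypothetical; each one stands for a single crime occurrence tagged with its month index and region identifier:

```python
from collections import Counter

# Hypothetical raw events: (month index, region id) for individual crimes.
events = [
    (1, "tract_A"), (1, "tract_A"), (1, "tract_B"),
    (2, "tract_A"), (2, "tract_B"), (2, "tract_B"), (2, "tract_B"),
]

# Count events per (time interval, region) cell.
counts = Counter(events)

# Emit one row per cell of the time-invariant spatial lattice,
# filling cells with no events with a count of 0.
regions = sorted({r for _, r in events})
months = sorted({t for t, _ in events})
lattice = [(t, r, counts.get((t, r), 0)) for t in months for r in regions]
print(lattice)
```

The resulting rows, counts of events in equally spaced intervals on a fixed lattice, are exactly the response format the point process model expects.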

 

Background of Crime Occurrences Prediction

A police department of a city wants to predict crime occurrences for the next few months to better plan and allocate resources. Monthly crime events that occurred in the city area from January 1996 to December 2003 were collected. These data can be described with a point process model, which can then predict crime events for the next few months. Data quality issues, such as missing values in the crime occurrence data, are handled automatically during the modeling process. To provide model-ready data, data preparation integrates multiple data sources that cover the following information:

  • Geographic information (shape files for centroids and area size for studied regions).
  • Crime event location (point coordinates of the crime events).
  • Demographic information as influential external factors.
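As a minimal sketch of that integration step, the three sources can be joined on a shared region key with pandas. All the fragments and values below are hypothetical stand-ins for the real shape files, event coordinates, and census data:

```python
import pandas as pd

# Hypothetical fragments of the three data sources described above.
geo = pd.DataFrame({"Location.Index": [1, 2],
                    "Longitude": [-84.39, -84.35],
                    "Latitude": [33.75, 33.77],
                    "Area": [4.2, 3.1]})
crime = pd.DataFrame({"Location.Index": [1, 1, 2],
                      "Time.Index": [1, 2, 1],
                      "y": [120, 98, 87]})
demo = pd.DataFrame({"Location.Index": [1, 2],
                     "X1": [0.8, 0.5],   # population density
                     "X2": [1.2, 2.0],   # per capita income
                     "X3": [0.9, 1.1]})  # male-to-female ratio

# Join all sources on the shared region key to get model-ready rows.
model_input = crime.merge(geo, on="Location.Index").merge(demo, on="Location.Index")
print(model_input.shape)
```

Each resulting row carries the counts, coordinates, area size, and demographic covariates for one region at one time interval.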

 

Solution in IBM Watson Studio

The model is built based on the information of past crime occurrences, local demographic profiles, and the spatio-temporal dependence structure that is intrinsic to the data. The demographic information that is used in building the model includes population density, per capita income, ethnic diversity, median age, and male-to-female ratio. The model takes monthly crime occurrences and demographic data of census tracts of Atlanta, GA from January 1996 to December 2003 and then issues predictions for future months.

PPM is integrated with IBM Watson Studio. In this blog, we demonstrate how to use PPM to analyze and forecast in IBM Watson Studio with Spark 2.1 and Python 2.


Analysis and forecasting steps

Create a notebook in IBM Watson Studio. The following instructions show how to implement this solution with Python code.

Reading Data

To read data:

  1. In IBM Watson Studio, upload the data before you build the model.
  2. Click Insert to code, and then select Insert SparkSession Dataframe to add the data as a Spark data frame.

The following code for data import is generated. The variable “df_data” is the data frame that is loaded by Spark from the original data.

import ibmos2spark

# @hidden_cell
credentials = {
    'endpoint': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'api_key': '********',  # credential redacted
    'service_id': '********',  # credential redacted
    'iam_service_endpoint': 'https://iam.bluemix.net/oidc/token'}

configuration_name = 'os_699d0da9ddcd49efb074facb0b9536bd_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_data = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .option('nullValue', 'NA')\
  .load(cos.url('input_to_model_modeling.csv', 'xz-donotdelete-pr-9lww1siqcgbvfr'))
df_data.show(10)

The following information is a preview of a small segment of records from the imported data frame:


The input data for PPM must contain the space dimension, the time dimension, the area size, the predictors (input fields for prediction), and the target field.

  • Space dimension:

“Location.Index” is the index of the region where each crime event occurs.

“Longitude” and “Latitude” are the coordinates of the tract centroids of crime occurrences.

“Area” is the area size of each region.

  • Time dimension:

“Time.Index” indicates when the crime event occurred.

  • Predictors:

“X1”, “X2”, and “X3” are predictors that serve as influential external factors. They represent demographic information: population density, per capita income, and male-to-female ratio (MFRatio), respectively.

  • Target:

“y” represents the crime counts, which are regarded as the target.

Model the crime counts with PPM

In PPM modeling, we need to specify the target, input field (predictors), location field (space dimension), area size, and time index (time dimension) from the previously loaded data.

#Build PPM Model
from pyspark.sql.types import *
from spss.ml.common.wrapper import LocalContainerManager
from spss.ml.spatiotemporal.spatiotemporalpointprocessmodeling import SpatioTemporalPointProcessModeling, SpatioTemporalPointProcessModelingModel
from lxml import etree

model_estimator = (SpatioTemporalPointProcessModeling()
    .setTargetField("y")
    .setInputFieldList(["X1", "X2", "X3"])
    .setLocationFieldList(["Longitude", "Latitude"])  # Specify the list of location fields
    .setTimeIndexField("Time.Index")
    .setIntercept(True)
    .setAreaSizeField("Area")                         # Specify the area size of each region
    .setArLag(1))                                     # Specify the maximum autoregression lag

ppm_model = model_estimator.fit(df_data)
cons = ppm_model.containerSeq()

ppmxml = cons.entryStringContent("PPMXML.xml").encode('utf-8')
ppmstatxml = cons.entryStringContent("StatXML.xml").encode('utf-8')
print etree.tostring(etree.fromstring(ppmxml), pretty_print=True)
print etree.tostring(etree.fromstring(ppmstatxml), pretty_print=True)

The PPM model outputs PMML and StatXML files after model building finishes. The output files are saved as PPMXML.xml and StatXML.xml.

PMML is used for predictions. StatXML includes all settings and statistics from the building process. Both files contain useful information about the model building analysis, such as:

  • Significance of each predictor:

In the following StatXML fragment, the value of the 'sig' attribute for X1, X2, and X3 is 0.0, which is less than 0.05. This indicates that the three factors (population density, per capita income, and MFRatio) are important for crime occurrence counts.

Therefore, when the police department wants to predict crime occurrences for the next few months to better plan and allocate resources, it can consider how to control these factors.

<ParameterEstimates parameterSource="regressionCoefficients">
        <ParameterStats paramName="P0000001" paramLabel="Intercept" estimate="6.8766405700804825" stdError="7.172726475412377E-4" sig="0.0" tTest="9587.205916262306" df="4812.0" confIntervalLower="6.8752343878280175" confIntervalUpper="6.878046752332947">
          <RegressionParameter isIntercept="true"/>
        </ParameterStats>
        <ParameterStats paramName="P0000002" paramLabel="X1" estimate="0.18994392782372838" stdError="4.953267412143501E-4" sig="0.0" tTest="383.4719832772588" df="4812.0" confIntervalLower="0.18897286099850596" confIntervalUpper="0.1909149946489508">
          <Statistic name="fixedEffectIndex" value="0"/>
          <RegressionParameter>
            <Covariate field="X1" power="1"/>
          </RegressionParameter>
        </ParameterStats>
        <ParameterStats paramName="P0000003" paramLabel="X2" estimate="0.04496844639052904" stdError="5.263995157887103E-4" sig="0.0" tTest="85.42645850111072" df="4812.0" confIntervalLower="0.04393646272386416" confIntervalUpper="0.046000430057193925">
          <Statistic name="fixedEffectIndex" value="1"/>
          <RegressionParameter>
            <Covariate field="X2" power="1"/>
          </RegressionParameter>
        </ParameterStats>
        <ParameterStats paramName="P0000004" paramLabel="X3" estimate="-0.1452183509858444" stdError="5.123556033621047E-4" sig="0.0" tTest="-283.432737014905" df="4812.0" confIntervalLower="-0.14622280216472078" confIntervalUpper="-0.14421389980696803">
          <Statistic name="fixedEffectIndex" value="2"/>
          <RegressionParameter>
            <Covariate field="X3" power="1"/>
          </RegressionParameter>
        </ParameterStats>
      </ParameterEstimates>
  • R square:
<Statistic name="RSquare" value="0.8820604531336578"/>
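StatXML fragments like the one above can also be inspected programmatically. The sketch below uses the standard library XML parser on a trimmed, hypothetical version of the fragment to pull out the predictors whose significance value falls below the 0.05 threshold:

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical version of the StatXML fragment shown above.
statxml = """
<ParameterEstimates parameterSource="regressionCoefficients">
  <ParameterStats paramLabel="X1" estimate="0.1899" sig="0.0"/>
  <ParameterStats paramLabel="X2" estimate="0.0450" sig="0.0"/>
  <ParameterStats paramLabel="X3" estimate="-0.1452" sig="0.0"/>
</ParameterEstimates>
"""

root = ET.fromstring(statxml)

# Keep predictors whose 'sig' value is below the 0.05 threshold.
significant = [ps.get("paramLabel")
               for ps in root.iter("ParameterStats")
               if float(ps.get("sig")) < 0.05]
print(significant)
```

The same pattern applies to the full StatXML.xml content returned by entryStringContent.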


To evaluate the model, the R square definition from regression can be borrowed to show the model’s goodness of fit. In regression, R square (the coefficient of determination) is a statistical measure of how well the regression predictions approximate the real data points.

R square summarizes the proportion of variance in the dependent variable that is associated with the predictor (independent) variables, with larger values (up to a maximum of 1) indicating that more of the variation is explained by the model. Here the R square value is 0.8820604531336578, which indicates a good model fit.
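For intuition, the R square computation can be reproduced by hand from observed and fitted counts. The numbers below are made up for illustration only:

```python
# Hypothetical observed and fitted counts for a handful of lattice cells.
observed = [120.0, 98.0, 87.0, 140.0, 110.0]
fitted = [115.0, 102.0, 90.0, 135.0, 108.0]

# R square = 1 - (residual sum of squares / total sum of squares).
mean_obs = sum(observed) / len(observed)
ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
ss_tot = sum((o - mean_obs) ** 2 for o in observed)
r_square = 1.0 - ss_res / ss_tot
print(round(r_square, 4))
```

A value close to 1 means the fitted counts track the observed counts closely, which is how the reported 0.882 should be read.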

Predicting the future with PPM

Based on the built model, crime counts for the next few months can be predicted with the following scripts.

df_score = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .option('nullValue', 'NA')\
  .load(cos.url('input_to_model_score.csv', 'xz-donotdelete-pr-9lww1siqcgbvfr'))
df_score.show(5)

ppm_score = ppm_model.setFutureTimeSteps(20).transform(df_score) 
ppm_score.show(5)

After the prediction, three new fields are generated: the predicted target ($PPM-y) and the prediction interval bounds ($PPMLCI-y and $PPMUCI-y), as shown in the following table.

  • The predicted target ($PPM-y) is a prediction of the crime counts within the specified region at a certain time in the future.
  • The prediction interval ($PPMLCI-y and $PPMUCI-y) gives a range for the predicted counts: the observed counts have a 95% probability of falling within the interval between $PPMLCI-y and $PPMUCI-y.
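As a hedged illustration of how that 95% interval can later be checked against reality (all numbers below are made up), the empirical coverage is simply the fraction of observed counts that fall between the two bounds:

```python
# Hypothetical scored rows: ($PPMLCI-y, $PPM-y, $PPMUCI-y, observed count).
rows = [
    (90.0, 105.0, 120.0, 110.0),
    (60.0, 72.0, 84.0, 85.0),    # observed falls above the interval
    (30.0, 40.0, 50.0, 38.0),
    (100.0, 118.0, 136.0, 125.0),
]

# Empirical coverage: fraction of observations inside their interval.
covered = sum(1 for lci, _, uci, obs in rows if lci <= obs <= uci)
coverage = covered / float(len(rows))
print(coverage)
```

With enough scored periods, a coverage near 0.95 would indicate that the intervals are well calibrated.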

 

Visualization for Prediction Results

The results of the PPM prediction can be visualized from different angles with Python Matplotlib. The diagrams can provide insights into particular patterns that help the police department plan better and allocate resources efficiently.

Visualization at one region over time

A trend diagram serves this goal. It shows how the crime counts in a single region evolve over time, extending to future time points.

The observed values are split into a model part and a score part. After PPM produces the predicted results, the observed counts line and the predicted counts line are plotted together.

#Draw lines
from pyspark.sql.types import *
import matplotlib.pyplot as plt
import pandas as pd

# Keep only region 1 of the training data, renaming the dotted column
# names so that they can be referenced as data frame attributes.
df_data = df_data.withColumnRenamed("Location.Index", "Location_Index")
df_data = df_data.where(df_data.Location_Index == 1)
df_data = df_data.withColumnRenamed("Time.Index", "Time_Index")
df_data_pd = df_data.toPandas()

# Keep the scored rows for the same region.
ppm_score = ppm_score.where(ppm_score.Location_Index == 1)
ppm_score_pd = ppm_score.toPandas()

plt.figure()

x0 = df_data_pd['Time_Index']
y0 = df_data_pd['y']

x1 = ppm_score_pd['Time_Index']
y1 = ppm_score_pd['y']
y1_score = ppm_score_pd['y_score']
y1_lci = ppm_score_pd['y_LCI']
y1_uci = ppm_score_pd['y_UCI']

lines = plt.plot(x0, y0, x1, y1, x1, y1_score, marker='None')
plt.setp(lines[0], color='lightblue')
plt.setp(lines[1], color='lightblue')
plt.setp(lines[2], linewidth=3, color='pink')
plt.legend([lines[0], lines[2]], ['Observed Crime Counts', 'Predicted Crime Counts'],loc='upper right',fontsize=8)

plt.xlabel("Time Index")
plt.ylabel("Counts of Crime")
plt.title("Prediction over Time\n",fontsize = 15, fontweight='bold')

The following diagram shows that the trend of the predicted counts is consistent with the change in the actual observed counts.

PPM can predict crime counts for the future, and the predicted count trend closely matches the observed counts.

Visualization at multiple regions at a certain time

The following script plots the distribution of the crime counts at different locations at a certain time.

import matplotlib.pyplot as plt

ppm_score_1time = ppm_score.where(ppm_score.Time_Index == 81)
ppm_score_1time_pd = ppm_score_1time.toPandas()

latitudes = ppm_score_1time_pd['Latitude']
longitudes = ppm_score_1time_pd['Longitude']
y = ppm_score_1time_pd['y']

#### Draw scatter plots
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.scatter(longitudes,latitudes,c=y,cmap='RdYlBu_r',vmin=0,vmax=1000,s=50,edgecolors='none',marker='o')
fig.colorbar(cax)
ax.set_xlabel('Longitude') 
ax.set_ylabel('Latitude')

plt.title('Distribution of Observed Crime Counts\n', fontsize = 15, fontweight='bold')

plt.show()

The following scatter chart displays the distribution of observed crime counts at different regions for a certain time.

import matplotlib.pyplot as plt

y_score = ppm_score_1time_pd['y_score']

fig = plt.figure()
ax = fig.add_subplot(111)

cax = ax.scatter(longitudes,latitudes,c=y_score,cmap='RdYlBu_r',vmin=0,vmax=1000,s=50,edgecolors='none',marker='o')
fig.colorbar(cax)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')

plt.title('Distribution of Predicted Crime Counts\n', fontsize = 15, fontweight='bold')
plt.show()

The following scatter chart displays the distribution of predicted crime counts at different regions for a certain time.

In the previous two charts, the red points indicate high crime counts, whereas the blue points indicate low crime counts. The locations with high crime occurrences can be read from the combination of the longitude and latitude fields.
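Beyond reading locations off the chart, the regions can be ranked directly by their predicted counts. A minimal sketch with hypothetical tract names and values:

```python
# Hypothetical predicted crime counts per region at one time point.
predictions = {
    "tract_A": 412.0,
    "tract_B": 175.0,
    "tract_C": 890.0,
    "tract_D": 640.0,
}

# Rank regions by predicted counts to prioritize patrol allocation.
ranked = sorted(predictions, key=predictions.get, reverse=True)
top_two = ranked[:2]
print(top_two)
```

The same ranking could be computed on the scored data frame by sorting on the $PPM-y column for a fixed time index.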

Making a decision after analysis

As revealed by the PPM modeling results, the three external predictors (population density, per capita income, and male-to-female ratio) are important factors in determining crime occurrences. The police department needs to pay close attention to changes in these factors and allocate its resources accordingly.

For example, suppose a large-scale sports event is held in a region and more male fans are expected to attend. The population density and the male-to-female ratio will change. The police department can use PPM to predict the crime occurrences for that time and adjust resources for the locations and time windows with high predicted crime counts.
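A what-if scenario of this kind amounts to adjusting the covariates in the scoring data before scoring. The sketch below uses hypothetical rows and adjustment factors:

```python
# Hypothetical scoring rows for the region hosting the event.
score_rows = [
    {"Location.Index": 7, "Time.Index": 97, "X1": 0.80, "X3": 0.95},
    {"Location.Index": 7, "Time.Index": 98, "X1": 0.80, "X3": 0.95},
]

# What-if scenario: the event raises population density (X1) by 25%
# and the male-to-female ratio (X3) by 10% during the event months.
scenario = [dict(row, X1=row["X1"] * 1.25, X3=row["X3"] * 1.10)
            for row in score_rows]
print(scenario[0]["X1"], round(scenario[0]["X3"], 3))
```

The adjusted rows would then replace df_score when calling ppm_model.transform to obtain the scenario forecast.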

Crime occurrences can be affected by many factors. PPM can build a model and predict the occurrences by using all the related factors.

Based on the predicted results, the police department can then allocate more resources and increase the police presence at the locations and times with high predicted crime counts.

 

Reference

API Documentation for Spark and Python
  • You can get the API Documentation for Spark and Python here.

#GlobalAIandDataScience
#GlobalDataScience