SPSS Statistics

 View Only

Estimating group based trajectory models using SPSS and R

By Archive User posted Tue August 12, 2014 10:34 AM

  
For a project I have been estimating group based trajectory models for counts of crime at micro places. Synonymous with the trajectory models David Weisburd and colleagues estimated for street segments in Seattle. Here I will show how using SPSS and the R package crimCV one can estimate similar group based trajectory models. Here is the crimCV package help and here is a working paper by the authors on the methodology. The package specifically estimates a zero inflated poisson model with the options to make the 0-1 part and/or the count part have quadratic or cubic terms - and of course allows you specify the number of mixture groups to search for.

So first lets make a small set of fake data to work with. I will make 100 observations with 5 time points. The trajectories are three separate groups (with no zero inflation).
*Make Fake data.
SET SEED 10.
INPUT PROGRAM.
LOOP Id = 1 TO 100.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME OrigData.
*Make 3 fake trajectory profiles.
VECTOR Count(5,F3.0).
DO REPEAT Count = Count1 TO Count5 /#iter = 1 TO 5.
COMPUTE #i = #iter - 3.
COMPUTE #ii = #i**2.
COMPUTE #iii = #i**3.
DO IF Id <= 30.
COMPUTE #P = 10 + #i*0.3 + #ii*-0.1 + #iii*0.05.
COMPUTE Count = RV.POISSON(#P).
ELSE IF Id <=60.
COMPUTE #P = 5 + #i*-0.8 + #ii*0.3 + #iii*0.05.
COMPUTE Count = RV.POISSON(#P).
ELSE.
COMPUTE #P = 4 + #i*0.8 + #ii*-0.5 + #iii*0.
COMPUTE Count = RV.POISSON(#P).
END IF.
END REPEAT.
FORMATS Id Count1 TO Count5 (F3.0).
EXECUTE.

Note The crimCV package wants the data to be wide format for the models, that is each separate time point in its own column. Now we can call R code using BEGIN PROGRAM R to recreate the wide SPSS dataset in an R data frame.
*Recreate SPSS data in R data frame.
BEGIN PROGRAM R.
casedata <- spssdata.GetDataFromSPSS(variables=c("Id","Count1","Count2",
"Count3","Count4","Count5")) #grab data from SPSS
casedataMat <- as.matrix(casedata[,2:6]) #turn into matrix
#summary(casedata) #check contents
#casedataMat[1:10,1:5]
END PROGRAM.

Now to fit one model with 3 groups (without calculating the cross validation statistics) the code would be as simple as:
*Example estimating model with 3 groups and no CV.
BEGIN PROGRAM R.
library(crimCV)
crimCV(casedataMat,3,rcv=FALSE,dpolyp=3,dpolyl=3)
END PROGRAM.

But when we are estimating these group based trajectory models we don't know the number of groups in advance. So typically one progressively fits more groups and then uses model selection criteria to pick the mixture solution that best fits the data. Here is a loop I created to successively estimate models with more groups and stuffs the models results in a list. It also makes a separate data frame that saves the model fit statistics, so you can see which solution fits the best (at least based on these statistics). Here I loop through estimates of 1 through 4 groups (this takes about 2 minutes in this example). Be warned - here are some bad programming practices in R (the for loops are defensible, but growing the lists within the loop is not - they are small though in my application and I am lazy).
*looping through models 1 through 4.
BEGIN PROGRAM R.
results <- c() #initializing a set of empty lists to store the seperate models
measures <- data.frame(cbind(groups=c(),llike=c(),AIC=c(),BIC=c(),CVE=c())) #nicer dataframe to check out model
#model selection diagnostics
max <- 4 #this is the number of grouping solutions to check

#looping through models
for (i in 1:max){
model <- crimCV(casedataMat,i,rcv=TRUE,dpolyp=3,dpolyl=3)
results <- c(results, list(model))
measures <- rbind(measures,data.frame(cbind(groups=i,llike=model$llike,
AIC=model$AIC,BIC=model$BIC,CVE=model$cv)))
#save(measures,results,file=paste0("Traj",as.character(i),".RData")) #save result
}
#table of the model results
measures
END PROGRAM.

In my actual application the groups take a long time to estimate, so I have the commented line saving the resulting list in a file. Also if the model does not converge it breaks the loop. So here we see that the mixture with 3 groups is the best fit according to the CVE error, but the 2 group solution would be chosen by AIC or BIC criteria. Just for this tutorial I will stick with the 3 group solution. We can plot the predicted trajectories right within R by selecting the nested list.
*plot best fitting model.
BEGIN PROGRAM R.
plot(results[[3]])
#getAnywhere(plot.dmZIPt) #this is the underlying code
END PROGRAM.

Now the particular object that stores the probabilities is within the gwt attribute, so we can transform this to a data frame, merge in the unique identifier, and then use the STATS GET R command to grab the resulting R data frame back into an SPSS dataset.
*Grab probabiltiies back SPSS dataset.
BEGIN PROGRAM R.
myModel <- results[[3]] #grab model
myDF <- data.frame(myModel$gwt) #probabilites into dataframe
myDF$Id <- casedata$Id #add in Id
END PROGRAM.
*back into SPSS dataset.
STATS GET R FILE=* /GET DATAFRAME=myDF DATASET=TrajProb.

Then we can merge this R data frame into SPSS. After that, we can classify the observations into groups based on the maximum posterior probability of belonging to a particular group.
*Merge into original dataset, and the assign groups.
DATASET ACTIVATE OrigData.
MATCH FILES FILE = *
/FILE = 'TrajProb'
/BY ID
/DROP row.names.
DATASET CLOSE TrajProb.
*Assign group based on highest prob.
*If tied last group wins.
VECTOR X = X1 TO X3.
COMPUTE #MaxProb = MAX(X1 TO X3).
LOOP #i = 1 TO 3.
IF X(#i) = #MaxProb Group = #i.
END LOOP.
FORMATS Group (F1.0).

Part of the motivation for doing this is not only to replicate some of the work of Weisburd and company, but that it has utility for identifying long term hot spots. Part of what Weisburd (and similarly I am) finding is that crime at small places is pretty stable over long periods of time. So we don't need to make up to date hotspots to allocate police resources, but are probably better off looking at crime over much longer periods to identify places for targeted strategies. Trajectory models are a great tool to help identify those long term high crime places, same as geographic clustering is a great tool to help identify crime hot spots.








#CrimeAnalysis
#data-manipulation
#R
#SPSS
#SPSSStatistics
#time-series
7 comments
7 views

Permalink

Comments

Mon September 10, 2018 12:48 PM

No you can't use it for logit unfortunately. I know there are different R packages for mixture modelling, but not sure what the best alternative would be for this (maybe the lcmm package using ordinal). I know proc traj (in either SAS or Stata) has a logit model as well.

Tue September 04, 2018 12:57 PM

I know this is an old discussion but I will still ask since i am stuck, can i use crimCV for a logit model instead of ZIP?

Mon December 28, 2015 01:59 PM

The error message means that the extension command STATS GET R is not installed. Use the Utilities message to install it. You may also need to quote the * in the file specification.

Fri December 11, 2015 04:58 PM

Have you installed the extension? You would also have needed to install the R essentials plug in as well.

Fri December 11, 2015 03:54 AM

about "STATS GET R FILE=* /GET DATAFRAME=myDF DATASET=TrajProb."

when execute in spss syntax, it shows behind:

Error: unexpected symbol in "GET R"

>Error # 1. Command name: STATS
>The first word in the line is not recognized as an SPSS Statistics command.
>Execution of this command stops.

what wrong? how can i fix it?

Wed August 13, 2014 01:09 PM

Yes the STATS GET R command is much simpler than doing all of the variable specifications to create the dictionary yourself (thank you!). I remember in one case in some of my older code I simply exported to csv and then read the csv in SPSS when I got frustrated troubleshooting code to set the dictionary!

Wed August 13, 2014 12:59 PM

One thing that caught my eye in this is the use of STATS GET R to create the new data generated by the R code as a Statistics dataset. Since there are already apis in the R plug-in to do that, it never occurred to me when I wrote GET R that using it would sometimes be an easier way to do this. The apis are a little difficult to figure out, and using GET R really simplifies it.