I have planned a series of posts on some data manipulation for network data. Here I am going to show how to go from either a list of network relations in long or wide format to a list of (non-redundant) edges in SPSS.
So to start off lets define what I mean by long, wide and edge format. Long format would consist of a table where there is an ID column defining a shared instance, and a sperate column defining the nodes that have relations within that event. Imagine an event is a criminal incident, and a sharing relation can be offender(s) or victim(s).
So a long table might look like this;
Incident# Name Status
--------------
1 Mary O
1 Joe O
1 Steve V
2 Bob O
2 Ted V
2 Jeff V
Here, incident 1
has three nodes, Mary
, Joe
and Steve
. The Status
field represents if the node is either an offender, O
, or a victim, V
. They are all related through the incident number field. Wide format of this data would have only one record for each unique incident number, and the people "nodes" would span across multiple columns. It might then look something like below, where a .
represents missing data.
Incident# Offender1 Offender2 Victim1 Victim2
---------------------------------------------
1 Mary Joe Steve .
2 Bob . Ted Jeff .
I've encountered both of these types of data formats in police RMS databases, and the following solution I propose to make an edge list with go back and forth between the two to produce the final list. So what do I mean by an edge list? Below is an example;
Incident# FromID ToID FromStatus ToStatus
-----------------------------------------
1 Mary Joe O O
1 Mary Steve O V
1 Joe Steve O V
2 Bob Ted O V
2 Bob Jeff O V
2 Jeff Ted V V
Here we define all possible relationship among the two incidents ignoring the FromID
and the ToID
fields order (e.g. Mary Joe
is equivalent to Joe Mary
). Why do we want an edge list like this? In further posts I will show to do some data manipulation to find neighbors of different degrees using data formatted like this, but suffice to say many graph drawing algorithms need data in this format (or return data in this format).
So below I will show an example in SPSS going from the long format to the edge list format. In doing show I will transform the long list to the wide format, so it is trivial to adapt the code to go from the wide format to the edge list (instead of generating the wide table from the long table, you would generate the long table from the wide table).
So to start lets use some data in long format.
data list free / MyID (F1.0) Name (A1) Status (A1).
begin data
1 A O
1 B O
1 C V
1 D V
2 E O
2 F O
2 G V
end data.
dataset name long.
Now I will make a copy of this dataset, and reshape to the wide format. Then I merge the wide dataset into the long dataset.
*make copy and reshape to one row.
dataset copy wide.
dataset activate wide.
casestovars
/id MyID
/seperator = "".
*merge back into main dataset.
dataset activate long.
match files file = *
/table = 'wide'
/by MyID.
dataset close wide.
From here we will reshape the dataset to wide again, and this will create a full expansion of possible pairs. This produces much redundancy though. So first, before I reshape wide to long, I get rid of values in the new set of Name?
variables that match the original Name
variable (can't have an edge with oneself). You could technically do this after VARSTOCASES
, but I prefer to make as little extra data as possible. With big datasets this can expand to be very big - a case with n people would expand to be n^2
, by eliminating self-referencing edges it will only expand n(n-1)
. Also I eliminate cases simply based on the sort order between Name
and the wide XName?
variables. By eliminating cases based on the ordering it reduces it to n(n-1)/2
total cases after the VARSTOCASES
command (which by default drops missing data).
*Reshape to long again!
do repeat XName = Name1 to Name4 /XStatus = Status1 to Status4.
DO IF Name = XName OR Name > XName.
compute Xname = " ".
compute XStatus = " ".
END IF.
end repeat.
VARSTOCASES
/make XName from Name1 to Name4
/make XStatus from Status1 to Status4.
So you end up with a list of non-redundant edges with supplemental information on the nodes (note you can change the DO IF
command to just Name > XName
, here I leave it as is to further distinguish between them). To follow are some more posts about manipulating this data further to produce neighbor lists. I'd be interested to see in anyone has better ideas about how to make the edge list. It is easier to make pairwise comparisons in MATRIX programs, but I don't go that route here because my intended uses are datasets too big to fit into memory. My code will certainly be slow though (CASESTOVARS and VARSTOCASES are slow operations on large datasets). Maybe an efficient XSAVE
? (Not sure - let me know in the comments!) The Wikipedia page on SQL joins has an example of using a self join to produce the same type of edge table as well.
Below is the full code from the post without the text in between (for easier copying and pasting).
data list free / MyID (F1.0) Name (A1) Status (A1).
begin data
1 A O
1 B O
1 C V
1 D V
2 E O
2 F O
2 G V
end data.
dataset name long.
*make copy and reshape to one row.
dataset copy wide.
dataset activate wide.
casestovars
/id MyID
/seperator = "".
*merge back into main dataset.
dataset activate long.
match files file = *
/table = 'wide'
/by MyID.
dataset close wide.
*Reshape to long again! - and then get rid of duplicates.
do repeat XName = Name1 to Name4 /XStatus = Status1 to Status4.
DO IF Name = XName OR Name > XName.
compute Xname = " ".
compute XStatus = " ".
END IF.
end repeat.
VARSTOCASES
/make XName from Name1 to Name4
/make XStatus from Status1 to Status4.
#data-manipulation#network#SPSS#SPSSStatistics