
Working with American Community Survey Data in SPSS

By Archive User posted Thu February 20, 2014 05:11 PM

  

Going through the documentation and downloading data from the Census is quite a chore. Here I give some example SPSS macros I have created for working with the plain text 5-year summary files available from the Census's FTP site. I mainly use these for mapping purposes, in particular mapping the small area census geographies. I have posted the code used for this analysis.


To start off, last time I checked you cannot get block group data from the Census's point-and-click download interface, so if you want small geographies you have to grab them from the plain text files. Of course, if you check out the technical document you will see there are hundreds of tables, each of which has hundreds of variables. If you navigate to Appendix E (page 45) of the Tech Doc, you will see that a set of variables in a table, say table B01001 (which contains variables related to Sex by Age), is available at the block group level and is in the summary file sequence number 1.


Slightly confusingly, the sequence number is what signals which plain text file the data is located in. If you download and unzip the state table you will see a set of text files with names like e20125ny0002000.txt or m20125ny0002000.txt. The e stands for estimates, and the m stands for margin of error. These comma separated files (with no text qualifiers, as they do not have strings with commas) contain a set of 6 consistent variables at the start of each row, followed by a number of variables that differs from file to file. From here on when I refer to a table, I don't mean the B01001 descriptor, I mean either the sequence number and/or the actual text file the data is located in.


Associating the particular variable in a table to its definition is accomplished with the sequence number and table number lookup file. I think I am just going to say look at my code on how to associate those two tables - I'm pretty sure anything I narrate will only confuse matters. Unfortunately the line number field does not correspond to the actual variable order in the text file - you have to take into account that the same text file contains multiple sequences of line numbers that restart at 1.
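As a rough sketch of the idea only (this is not the code in VariableLabels.sps), the position of each variable within a text file can be recovered by keeping a running count that resets whenever the sequence number changes. The variable names Seq and LineNum are hypothetical stand-ins for the lookup file's sequence number and line number columns, and the sketch assumes that rows which do not hold data have a missing or fractional line number.

```spss
* Sketch only, with hypothetical variable names Seq and LineNum.
* The lookup rows are assumed to be in their original file order.
COMPUTE IsCell = 0.
IF (LineNum >= 1 AND TRUNC(LineNum) = LineNum) IsCell = 1.
* Reset the running count whenever a new sequence starts.
IF ($CASENUM = 1 OR Seq <> LAG(Seq)) #Run = 0.
COMPUTE #Run = #Run + IsCell.
* Only rows that hold data receive a position.
IF (IsCell = 1) CellOrder = #Run.
EXECUTE.
```

The scratch variable #Run keeps its value from case to case, which is what lets the count carry across the several tables stored in one file even though each table's line numbers restart at 1.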


So again I have all of the materials I will use in the post available to download (linked earlier in the post), but to follow along with your own data you will need:



  • The ACS Technical Doc (to look up what variables you want).

  • The sequence number and table number lookup file (to figure out what the variables represent)

  • An unzipped file of the actual data

  • The SPSS MACRO to grab the ACS data (contained in the ACS_MACRO.sps file) and the VariableLabels.sps file that helps to figure out what the variables are.


Here I placed those files and my syntax all in the same folder, so to reference them I only need to define one file handle. To start, let's define a file handle named data and then insert my two other syntax files. The first INSERT command grabs the sequence number table lookup (naming the SPSS dataset MetaACS) and does some data manipulations on that lookup table. The second INSERT command defines our macro to grab the actual ACS data. (You can open up the ACS_Examples.sps syntax to follow along - the example tables come from the New York State file that contains block groups only.)



FILE HANDLE data /name = "!Your file location Here!".
INSERT FILE = "data\VariableLabels.sps" CD=YES.
INSERT FILE = "data\ACS_MACRO.sps".


So now from looking at the technical document I know I want to grab the information from the Sex by Age table. This happens to be sequence number 2. So first I run the command:



!ACSTable Seq = 2.


And this produces a table that looks like below:





In this table the TableTitle column is the description of the variable, and the Order column tells you the variable's position in the corresponding text file. Not all rows refer to a variable: the first row is just the SEX BY AGE table header. For the rows after it, V1 is the Total population, V2 is the Male population, and V3 is the Male population Under 5 years of age. Most of the variables provided by the ACS have this nesting structure, so although thousands of variables exist across all of the tables, they tend to be demographic breakdowns into progressively more specific categories.


The right-most column tells us that this table contains 235 variables in total (besides the 6 that are at the start of every table). So now we can call the syntax to grab the actual data.



!ImportACS File = 'data\e20125ny0002000.txt' Table = T2 Cells = 235.


This !ImportACS macro takes as parameters:



  • File - the full file location (in quotes) of the text file that contains the data

  • Table - this token assigns the dataset name and the prefix for all of the variables in the file (excluding the 6 consistent ones), so it needs to follow SPSS's naming conventions for datasets and variables.

  • Cells - this defines the total number of variables that the table contains after the 6 consistent ones.


So after you run this syntax it will open a table that has the variables as below:





So we can see the variables FileID, Filetype, State, chariter, sequence, and LOGRECNO will always be the first six variables. After those we have a set of 235 variables of the form T2_1, T2_2, ..., T2_235.
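To make that layout concrete, here is a minimal hand-written import of the same file, limited to the first three cells. It is a sketch under assumptions and not the macro's actual code: the string widths for the six leading variables are guesses based on typical values in these files, and it relies on the data file handle defined earlier.

```spss
* Sketch only: a hand-written import of the six leading fields and three cells.
* The string widths below are assumptions and may need adjusting.
GET DATA
  /TYPE=TXT
  /FILE="data\e20125ny0002000.txt"
  /DELCASE=LINE
  /DELIMITERS=","
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=1
  /VARIABLES=
    FileID A6
    Filetype A6
    State A2
    chariter A3
    sequence A4
    LOGRECNO A7
    T2_1 F9.0
    T2_2 F9.0
    T2_3 F9.0.
DATASET NAME T2.
```

Writing out all 235 cells this way would be tedious, which is the reason for wrapping the import in a macro that only needs the prefix and the cell count.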


As I noted from the original !ACSTable macro, we can look up each individual value, and so we know T2_1 is the total population, T2_2 is the male population, and T2_3 is the male population under 5 years of age. So when I grabbed this table I actually wanted the entire population between 5 and 17 years old (not just males or females). So to calculate that variable I need to sum several variables together.



COMPUTE Under17 = SUM(T2_4 to T2_6,T2_28 to T2_30).
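For readers who prefer to see every component, the same total can be sketched with each variable listed individually, plus a label and a share of the total population. This assumes the standard B01001 ordering, in which cells 4 to 6 are males aged 5 to 9, 10 to 14 and 15 to 17, and cells 28 to 30 are the matching female groups. The names Age5to17 and PctAge5to17 are hypothetical.

```spss
* Sketch only: Age5to17 and PctAge5to17 are hypothetical variable names.
COMPUTE Age5to17 = SUM(T2_4, T2_5, T2_6, T2_28, T2_29, T2_30).
VARIABLE LABELS Age5to17 'Population aged 5 to 17 (both sexes)'.
* Skip block groups with no residents to avoid dividing by zero.
IF (T2_1 > 0) PctAge5to17 = 100 * Age5to17 / T2_1.
EXECUTE.
```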


I have some further examples in the ACS_Example.sps syntax that grabs data on race, children in female headed households, Spanish speaking households, and households below poverty. I then merge the tables together using the LOGRECNO variable (which is the census geography id).
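A minimal sketch of such a merge, assuming a second table has already been imported into a dataset named T5 (a hypothetical name used only for illustration), could look like this. Both datasets are sorted on LOGRECNO first because MATCH FILES with BY expects sorted input.

```spss
* Sketch only: T5 is a hypothetical second table imported the same way as T2.
DATASET ACTIVATE T2.
SORT CASES BY LOGRECNO.
DATASET ACTIVATE T5.
SORT CASES BY LOGRECNO.
* The six leading variables appear in both datasets and are kept once.
MATCH FILES
  /FILE = T2
  /FILE = T5
  /BY LOGRECNO.
DATASET NAME Merged.
EXECUTE.
```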


From this you can grab whatever tables you want and then merge them together. Digging through the documentation tends to be the hardest part, given how large it is. I originally wrote this for the 5 year estimates in 2010 and recently needed to revisit with 2012 data. The format of the data is the same, but the sequence numbers differed from 2010 to 2012. I only provide examples with the estimates data here, but the macro should work just fine with the margin of error data files as well.
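As an untested sketch of what that might look like for the margin of error file, the call below reuses the macro with a different prefix. M2 is a hypothetical prefix chosen for illustration. The second step uses the Census Bureau's documented approximation for the margin of error of a sum (the square root of the sum of squared margins), and it assumes the component margins are ordinary non-missing values.

```spss
* Sketch only: M2 is a hypothetical prefix for the margin of error table.
!ImportACS File = 'data\m20125ny0002000.txt' Table = M2 Cells = 235.
DATASET ACTIVATE M2.
* Approximate margin of error for the summed age 5 to 17 estimate.
COMPUTE MoeAge5to17 = SQRT(M2_4**2 + M2_5**2 + M2_6**2 + M2_28**2 + M2_29**2 + M2_30**2).
EXECUTE.
```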









Comments

Mon February 24, 2014 07:58 AM

Awesome - thank you Steven!

If you want to work with more data (or migrate the info to a spatial database like PostGIS), those scripts are likely a better workflow solution than my SPSS scripts. Mine are more suited to looking at half a dozen variables (and even then they aren't that convenient!).

Mon February 24, 2014 07:36 AM

Andrew, this is very helpful, we'll take a close look.

Just an FYI: a few years ago a CUNY Graduate Center graduate student (Lee Hachadoorian, now at Dartmouth) developed a set of similar scripts, but for PostgreSQL - see http://www.gc.cuny.edu/Page-Elements/Academics-Research-Centers-Initiatives/Centers-and-Institutes/Center-for-Urban-Research/CUR-research-initiatives/ACS-2005-09--Getting-It-All

And the IRE project censusreporter.org based a set of PostgreSQL scripts on Lee's work, see their writeup here http://censusreporter.tumblr.com/post/55886690087/using-census-data-in-postgresql

Hope this is interesting/helpful.