SPSS Statistics

SPSS Statistics

Your hub for statistical analysis, data management, and data documentation. Connect, learn, and share with your peers! 

 View Only
Expand all | Collapse all

Address Matching

  • 1.  Address Matching

    Posted Mon March 14, 2022 12:28 PM
    Morning,
    I have two files that I need to match on address.  One file has address in one field with varying degrees of format.  The second file has both parsed out and single fields with a pretty standard format.  I'm trying to figure out what would be an efficient way to merge the two.  Thanks Arthur.

    ------------------------------
    Art Jack
    ------------------------------

    #SPSSStatistics


  • 2.  RE: Address Matching

    Posted Mon March 14, 2022 12:37 PM
    You will need to make a variable or set of variables that match in the two files (and then use MATCH FILES).  The details on how to do that depend on the formats of the address items.  If you can supply samples from the two files, we may be able to help.

    --





  • 3.  RE: Address Matching

    Posted Mon March 14, 2022 12:38 PM
    Hi. So in one file you have one variable, ADDRESS, that has something like: "123 Main Street Chicago IL 60014" and in the other you have maybe 4 fields, for example, "123 Main Street" "Chicago" , "IL", "60014"? 

    If so, then my advice would be in the 2nd file, use CONCAT to create a new string variable named ADDRESS with a consistent format (the A width of the string in the first file), then MATCH the files on that key. If case is a problem, then in both files, create a new key variable based on ADDRESS that is in all lower or upper case.

    ------------------------------
    Rick Marcantonio
    Quality Assurance
    IBM
    ------------------------------



  • 4.  RE: Address Matching

    Posted Mon March 14, 2022 01:14 PM
    Below is sort of example.  In one it's standard with parsed fields & concat.  The second file is one field but formatting is not standard.

    file_1_address file_1A file_1b file_1c file_1d file_1e file_2_address
    123 Main Dr, 265, 37244 123 Main Dr 265 37244 The Apartments at Chicago Land, 123 Main DR, APT 265


    ------------------------------
    Art Jack
    ------------------------------



  • 5.  RE: Address Matching

    Posted Mon March 14, 2022 01:39 PM
    Interesting. Clearly a 1-to-1 match is out, case issues aside.

    Looks like multiple keys are needed. You need to create separate string variables in file2 that are the same as those in file1, and like file1 they have to be the same "things": street_number, street, city, state, zip, etc.

    I guess I'd try breaking up the file 2 addresses into those variables using (CHAR) SUBSTR and INDEX. I'm not sure what to break them up by; not spaces, maybe commas. I can't know without looking at a lot of the addresses. Obviously, there may be parts of the file 2 address that you won't be able to use; that "The Apartments at Chicago Land" part of file2 address will have no match in file1, at least for this case.




    ------------------------------
    Rick Marcantonio
    Quality Assurance
    IBM
    ------------------------------



  • 6.  RE: Address Matching

    Posted Mon March 14, 2022 01:45 PM
    It's really going to depend on how much variation there is in the formats.  I would be inclined to use Python pattern match along with the SPSSINC TRANS extension command, but the pattern variations need to be defined.  It might not be possible to get 100%, but a high percentage is conceivable.

    --





  • 7.  RE: Address Matching

    Posted Mon March 14, 2022 01:54 PM
    Jon - I guess I don't know enough about pattern matching in Python, but wouldn't the files have to be already joined to do that? What are you pattern-matching in each of the two separate files before you MATCH them? I was thinking of using Python to create a "checksum" variable on the address for each case, but I still don't know what the resulting MATCH on that checksum var would look like. Maybe I'll experiment when I find the time. ;)

    ------------------------------
    Rick Marcantonio
    Quality Assurance
    IBM
    ------------------------------



  • 8.  RE: Address Matching

    Posted Mon March 14, 2022 02:03 PM
    What I would do would be to "canonicalize" the addresses in both files.  For example, you might discard everything before the first numerical sequence and take those numbers as the street number and the following items up to a comma as the street name (and lower case), etc.

    --





  • 9.  RE: Address Matching

    Posted Mon March 14, 2022 02:07 PM
    OK, I see.

    ------------------------------
    Rick Marcantonio
    Quality Assurance
    IBM
    ------------------------------



  • 10.  RE: Address Matching

    Posted Mon March 14, 2022 02:33 PM
    Is there an example of that somewhere?  Thanks.

    ------------------------------
    Art Jack
    ------------------------------



  • 11.  RE: Address Matching

    Posted Mon March 14, 2022 02:43 PM
    It would need some Python tweaking.  If you can send me samples of the two files, I'll see if I can come up with something (jkpeck@gmail.com).

    --





  • 12.  RE: Address Matching

    Posted Tue March 15, 2022 11:27 AM
    Hello,

    I'm saler on the IBM SPSS

    ------------------------------



    Jeannette MAGNE
    Business Intelligence & Predictive Analytics
    IBM Hybrid Cloud
    Tel. +33 (0)6 31 05 61 04
    Cognos : https://youtu.be/NvKjf6JM-ZU
    Contact Support : https://www.ibm.com/analytics/us/en/spss/spss-support/index.htmlor https://www.ibm.com/mysupport/s/

    ------------------------------