SPSS Statistics

SPSS Statistics

Your hub for statistical analysis, data management, and data documentation. Connect, learn, and share with your peers! 

 View Only
  • 1.  Possible Error with SPSS's Excel Writer

    Posted Fri January 05, 2024 03:26 PM

    A string value containing "_x0001_", when exported into an Excel file from SPSS, is replaced with the control character "start of header" (a.k.a. SOH).

    Actually this happens for any hexadecimal value (e.g. "_x0009_" becomes a tab, "_x0030_" becomes a zero). It's just comparatively more inconvenient with control characters, since they can register as corruption in some downstream consumers of Excel data files.

    I investigated how Excel itself wants data like this to be stored. I created an Excel file directly with Excel and included "_x0001_" in a cell. I unzipped the resultant file and found that the value was actually stored as "_x005F_x0001_". In other words, the leading underscore was encoded as a hexadecimal value to prevent "_x0001_" from being read as a hexadecimal value. The file that SPSS wrote contains "_x0001_" exactly.

    It seems like SPSS's Excel writer is not properly encoding data before writing. Would this qualify as a defect? Or if this is considered a feature, is this behavior documented anywhere?

    Here's some syntax for reproducing the behavior. It creates a range of test data, exports to Excel, then re-importing it into SPSS. You'll also see the issue if you manually open the created Excel file.

    * Create the test data.
    DATA LIST /testvar 1-50 (A).
    BEGIN DATA
    _x0000_
    _x0001_
    _x0002_
    _x0003_
    _x0004_
    _x0005_
    _x0006_
    _x0007_
    _x0008_
    _x0009_
    _x000A_
    _x000B_
    _x000C_
    _x000D_
    _x000E_
    _x000F_
    _x0030_
    also works with leading _x0030_ or trailing data
    END DATA.
    DATASET NAME DataSet1 WINDOW=FRONT.
    
    * Export to Excel and re-import.
    DATASET ACTIVATE DataSet1 WINDOW=FRONT.
    SAVE TRANSLATE OUTFILE='exported.xlsx'
      /TYPE=XLS
      /VERSION=12
      /FIELDNAMES
      /CELLS=VALUES
      /REPLACE.
    GET DATA
      /TYPE=XLSX
      /FILE='exported.xlsx'
      /CELLRANGE=FULL
      /READNAMES=ON.
    EXECUTE.
    DATASET NAME DataSet2 WINDOW=FRONT.
    
    * See the Data View and how values were mutated. (Note that most control characters display as an empty box.)
    
    * Alternatively, export to a text file for easier inspection of the byte data.
    DATASET ACTIVATE DataSet2 WINDOW=FRONT.
    SAVE TRANSLATE OUTFILE='exported.csv'
      /TYPE=CSV
      /MAP
      /REPLACE
      /FIELDNAMES
      /CELLS=VALUES.

    I've reproduced this behavior in SPSS Statistics 27.0.1.0 and SPSS Statistics 20.0.0. Obviously the latter is out of support, but it seems like this is a long-standing issue with the writer.

    And incidentally, why is "SPSS" not an option for the languages in "Insert/edit code sample"?



    ------------------------------
    Dominic Ricottone
    ------------------------------


  • 2.  RE: Possible Error with SPSS's Excel Writer

    Posted Sat January 06, 2024 03:41 PM
    SPSS does not have a specific hex datatype,  so I would expect it to export such values literally.  Excel does have hex to decimal and decimal to hex conversion functions, though.

    Note that 0x0005F  is just the ascii code for "_".

    I am not familiar with all the nuances of the Excel sheet schema, but if I look at the xml for a value like _x0001 entered directly in Excel, it appears to just have the value 1..

    <c r="A3" t="s">
    <v>1</v>
    </c>


    --





  • 3.  RE: Possible Error with SPSS's Excel Writer

    Posted Mon January 08, 2024 11:33 AM

    Hi Jon,

    I expect that, if we use different versions of Excel, we will get slightly different results from a manual file creation. I have some more notes on this below. But I am interested in knowing: Were you able to make SPSS produce an Excel file with un-mutated data?

    Admittedly a bit off topic for an SPSS forum, but here's my (limited) understanding of how modern Excel works:

    I believe that the "1" you see is an index in the shared strings table. You'll need to look in "xl/sharedStrings.xml" instead of "xl/worksheets/sheet1.xml" for the actual data. Any cell (c tag) in a spreadsheet that has the t="s" attribute actually stores an index to an external shared strings table. This is a space saving strategy, because duplicate cells can reference to the same actual data on the shared strings table.

    To be perfectly explicit, if I manually create an Excel file as described above, I get a "xl/worksheets/sheet1.xml" that looks like:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <worksheet
    	xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
    	xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    	xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac xr xr2 xr3"
    	xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac"
    	xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision"
    	xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2"
    	xmlns:xr3="http://schemas.microsoft.com/office/spreadsheetml/2016/revision3" xr:uid="{00000000-0001-0000-0000-000000000000}">
    	<dimension ref="A1:A4"/>
    	<sheetViews>
    		<sheetView tabSelected="1" workbookViewId="0">
    			<selection activeCell="A5" sqref="A5"/>
    		</sheetView>
    	</sheetViews>
    	<sheetFormatPr defaultRowHeight="15" x14ac:dyDescent="0.25"/>
    	<sheetData>
    		<row r="1" spans="1:1" x14ac:dyDescent="0.25">
    			<c r="A1" t="s">
    				<v>0</v>
    			</c>
    		</row>
    	</sheetData>
    	<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
    </worksheet>

    (I ran the raw XML through https://codebeautify.org/xml-pretty-print to make it easier to read.) (Not a sponsorship, other prettify-ers are available.)

    The "xl/sharedStrings.xml" file then contains:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <sst
    	xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="1" uniqueCount="1">
    	<si>
    		<t>_x005F_x0001_</t>
    	</si>
    </sst>

    If I repeat this process but instead write "leading data leading data_x0001_ trailing data trailing data" into a cell, the only change is that the "xl/sharedStrings.xml" file then contains:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <sst
    	xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="1" uniqueCount="1">
    	<si>
    		<t>leading data leading data_x005F_x0001_ trailing data trailing data</t>
    	</si>
    </sst>


    ------------------------------
    Dominic Ricottone
    ------------------------------



  • 4.  RE: Possible Error with SPSS's Excel Writer

    Posted Mon January 08, 2024 01:07 PM
    I see that with SPSS Statistics v29, the file still exports wrong. I don't know how that code works, though.  If I export using the Database Wizard, it does export correctly in either xls or xlsx format.  The DBW uses ODBC, so it would use Microsoft's own driver rather than the direct Excel export which is based on what was known about the Excel file format.  I suspect that this problem didn't occur with xls format.

    --





  • 5.  RE: Possible Error with SPSS's Excel Writer

    Posted Mon January 08, 2024 12:13 PM
      |   view attached

    I'm attaching an Excel file here. This is what SPSS 27.0.1.0 wrote on my machine with the syntax quoted above. If you open it directly or read it into SPSS, I expect that you will not see the original string data.



    ------------------------------
    Dominic Ricottone
    ------------------------------

    Attachment(s)

    xlsx
    exported.xlsx   5 KB 1 version