I see that with SPSS Statistics v29, the file still exports wrong. I don't know how that code works, though. If I export using the Database Wizard, it does export correctly in either xls or xlsx format. The DBW uses ODBC, so it would use Microsoft's own driver rather than the direct Excel export which is based on what was known about the Excel file format. I suspect that this problem didn't occur with xls format.
--
Original Message:
Sent: 1/8/2024 11:33:00 AM
From: Dominic Ricottone
Subject: RE: Possible Error with SPSS's Excel Writer
Hi Jon,
I expect that, if we use different versions of Excel, we will get slightly different results from a manual file creation. I have some more notes on this below. But I am interested in knowing: Were you able to make SPSS produce an Excel file with un-mutated data?
Admittedly a bit off topic for an SPSS forum, but here's my (limited) understanding of how modern Excel works:
I believe that the "1" you see is an index in the shared strings table. You'll need to look in "xl/sharedStrings.xml" instead of "xl/worksheets/sheet1.xml" for the actual data. Any cell (c tag) in a spreadsheet that has the t="s" attribute actually stores an index to an external shared strings table. This is a space saving strategy, because duplicate cells can reference to the same actual data on the shared strings table.
To be perfectly explicit, if I manually create an Excel file as described above, I get a "xl/worksheets/sheet1.xml" that looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac xr xr2 xr3"
xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac"
xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision"
xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2"
xmlns:xr3="http://schemas.microsoft.com/office/spreadsheetml/2016/revision3" xr:uid="{00000000-0001-0000-0000-000000000000}">
<dimension ref="A1:A4"/>
<sheetViews>
<sheetView tabSelected="1" workbookViewId="0">
<selection activeCell="A5" sqref="A5"/>
</sheetView>
</sheetViews>
<sheetFormatPr defaultRowHeight="15" x14ac:dyDescent="0.25"/>
<sheetData>
<row r="1" spans="1:1" x14ac:dyDescent="0.25">
<c r="A1" t="s">
<v>0</v>
</c>
</row>
</sheetData>
<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>
(I ran the raw XML through https://codebeautify.org/xml-pretty-print to make it easier to read.) (Not a sponsorship, other prettify-ers are available.)
The "xl/sharedStrings.xml" file then contains:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="1" uniqueCount="1">
<si>
<t>_x005F_x0001_</t>
</si>
</sst>
If I repeat this process but instead write "leading data leading data_x0001_ trailing data trailing data" into a cell, the only change is that the "xl/sharedStrings.xml" file then contains:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="1" uniqueCount="1">
<si>
<t>leading data leading data_x005F_x0001_ trailing data trailing data</t>
</si>
</sst>
------------------------------
Dominic Ricottone
------------------------------
Original Message:
Sent: Sat January 06, 2024 03:40 PM
From: Jon Peck
Subject: Possible Error with SPSS's Excel Writer
SPSS does not have a specific hex datatype, so I would expect it to export such values literally. Excel does have hex to decimal and decimal to hex conversion functions, though.
Note that 0x0005F is just the ascii code for "_".
I am not familiar with all the nuances of the Excel sheet schema, but if I look at the xml for a value like _x0001 entered directly in Excel, it appears to just have the value 1..
<c r="A3" t="s">
<v>1</v>
</c>
--
Original Message:
Sent: 1/5/2024 3:17:00 PM
From: Dominic Ricottone
Subject: Possible Error with SPSS's Excel Writer
A string value containing "_x0001_", when exported into an Excel file from SPSS, is replaced with the control character "start of header" (a.k.a. SOH).
Actually this happens for any hexadecimal value (e.g. "_x0009_" becomes a tab, "_x0030_" becomes a zero). It's just comparatively more inconvenient with control characters, since they can register as corruption in some downstream consumers of Excel data files.
I investigated how Excel itself wants data like this to be stored. I created an Excel file directly with Excel and included "_x0001_" in a cell. I unzipped the resultant file and found that the value was actually stored as "_x005F_x0001_". In other words, the leading underscore was encoded as a hexadecimal value to prevent "_x0001_" from being read as a hexadecimal value. The file that SPSS wrote contains "_x0001_" exactly.
It seems like SPSS's Excel writer is not properly encoding data before writing. Would this qualify as a defect? Or if this is considered a feature, is this behavior documented anywhere?
Here's some syntax for reproducing the behavior. It creates a range of test data, exports to Excel, then re-importing it into SPSS. You'll also see the issue if you manually open the created Excel file.
* Create the test data.DATA LIST /testvar 1-50 (A).BEGIN DATA_x0000__x0001__x0002__x0003__x0004__x0005__x0006__x0007__x0008__x0009__x000A__x000B__x000C__x000D__x000E__x000F__x0030_also works with leading _x0030_ or trailing dataEND DATA.DATASET NAME DataSet1 WINDOW=FRONT.* Export to Excel and re-import.DATASET ACTIVATE DataSet1 WINDOW=FRONT.SAVE TRANSLATE OUTFILE='exported.xlsx' /TYPE=XLS /VERSION=12 /FIELDNAMES /CELLS=VALUES /REPLACE.GET DATA /TYPE=XLSX /FILE='exported.xlsx' /CELLRANGE=FULL /READNAMES=ON.EXECUTE.DATASET NAME DataSet2 WINDOW=FRONT.* See the Data View and how values were mutated. (Note that most control characters display as an empty box.)* Alternatively, export to a text file for easier inspection of the byte data.DATASET ACTIVATE DataSet2 WINDOW=FRONT.SAVE TRANSLATE OUTFILE='exported.csv' /TYPE=CSV /MAP /REPLACE /FIELDNAMES /CELLS=VALUES.
I've reproduced this behavior in SPSS Statistics 27.0.1.0 and SPSS Statistics 20.0.0. Obviously the latter is out of support, but it seems like this is a long-standing issue with the writer.
And incidentally, why is "SPSS" not an option for the languages in "Insert/edit code sample"?
------------------------------
Dominic Ricottone
------------------------------