Cloud Pak for Data

How to work with Parquet local files?

  • 1.  How to work with Parquet local files?

    Posted Fri December 11, 2020 06:07 PM
    Hi!

    Sorry, I'm a CPD newbie and am just getting started, but how exactly do I work with Parquet files? Here's why I ask. I started a new Project, then added one new data asset from a local file, which is a Parquet file. When I click "Preview" for this file, I get the error "An error occurred attempting to preview this asset.  The data for this data asset can't be retrieved. The data for this asset can't be retrieved. Ensure your connection is working on this asset." Please see the attached screenshots. I generated the file from a Pandas DataFrame in Python using PyArrow. Note that I tried both Parquet versions 1 and 2, with the same result. The DataFrame has 5 columns and 152900 rows. Three columns are int64 and two are string; four columns have names and one is unnamed. I searched the forums for related questions but didn't find any. Sorry if this has been asked/answered already. Anyway, thanks!


    P.S. Here's what the file I wrote out looks like when I read it back with PyArrow.
    >>> import pyarrow.parquet as pq
    >>> tb = pq.read_table('data/hcv_sample.train.2.parquet')
    >>> tb.schema
    Unnamed: 0: int64
    q_id: int64
    q_content: string
    HCV: int64
    q_body: string
    metadata
    --------
    {b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
                b'stop": 152900, "step": 1}], "column_indexes": [{"name": null, "f'
                b'ield_name": null, "pandas_type": "unicode", "numpy_type": "objec'
                b't", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "U'
                b'nnamed: 0", "field_name": "Unnamed: 0", "pandas_type": "int64", '
                b'"numpy_type": "int64", "metadata": null}, {"name": "q_id", "fiel'
                b'd_name": "q_id", "pandas_type": "int64", "numpy_type": "int64", '
                b'"metadata": null}, {"name": "q_content", "field_name": "q_conten'
                b't", "pandas_type": "unicode", "numpy_type": "object", "metadata"'
                b': null}, {"name": "HCV", "field_name": "HCV", "pandas_type": "in'
                b't64", "numpy_type": "int64", "metadata": null}, {"name": "q_body'
                b'", "field_name": "q_body", "pandas_type": "unicode", "numpy_type'
                b'": "object", "metadata": null}], "creator": {"library": "pyarrow'
                b'", "version": "0.15.1"}, "pandas_version": "0.25.3"}'}
    >>> tb.num_rows
    152900
    >>> tb.num_columns
    5


    ------------------------------
    David Ventimiglia
    ------------------------------

    #CloudPakforDataGroup


  • 2.  RE: How to work with Parquet local files?

    Posted Mon April 19, 2021 09:58 AM
    The simplest solution would be to read the "Unnamed: 0" column as the index. To do that, pass index_col=[0] to the read_csv() function, and it will read the first column in as the index.

    pd.read_csv('file.csv', index_col=[0])

    The "Unnamed: 0" column usually appears because to_csv() saved the DataFrame index as an unnamed first column. You can avoid this in the first place by passing "index=False" when writing the CSV from the DataFrame.

    df.to_csv('file.csv', index=False)
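
    To make the two options above concrete, here is a minimal sketch of where the "Unnamed: 0" column comes from and three ways to get rid of it. The tiny DataFrame and the in-memory buffers are made up for illustration (only the column names q_id and HCV are borrowed from the schema in the question), and this only addresses the stray column; whether it has anything to do with the preview error is an assumption that would need checking.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real data; only the column names come from the thread.
df = pd.DataFrame({"q_id": [10, 11, 12], "HCV": [0, 1, 0]})

# Writing with the default index=True stores the index as a nameless first column.
buf = io.StringIO()
df.to_csv(buf)

# Reading that CSV back without further arguments yields an "Unnamed: 0" column.
buf.seek(0)
with_stray = pd.read_csv(buf)

# Option 1: tell read_csv that the first column is the index.
buf.seek(0)
as_index = pd.read_csv(buf, index_col=0)

# Option 2: do not write the index in the first place.
buf_clean = io.StringIO()
df.to_csv(buf_clean, index=False)
buf_clean.seek(0)
clean = pd.read_csv(buf_clean)

# Option 3: the stray column is already in a DataFrame, so drop it explicitly.
dropped = with_stray.drop(columns=["Unnamed: 0"])

print(list(with_stray.columns))
print(list(dropped.columns))
```

    If the cleaned DataFrame should end up as Parquet again, dropped.to_parquet(path, index=False) is one way to write it, assuming a Parquet engine such as pyarrow is installed.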


    ------------------------------
    quincy batten
    ------------------------------