BUG: read_feather doesn't work when columns are shuffle #33878

Benjamin15 · 2020-04-29T20:54:57Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2],
    "B": ["x", "y"],
    "C": [True, False]
})
df.to_feather("./test_data.feather")

df2 = pd.read_feather("./test_data.feather", columns=['B', 'A'])

Error message

ArrowInvalid                              Traceback (most recent call last)
<ipython-input-4-1e23cf201732> in <module>
     15 
     16 
---> 17 df2 = pd.read_feather("/misc/labshare/datasets3/rating/data/preprocessing/tests/test_data.feather", columns=['B', 'A'])

~/.conda/envs/venv/lib/python3.6/site-packages/pandas/io/feather_format.py in read_feather(path, columns, use_threads)
    101     path = stringify_path(path)
    102 
--> 103     return feather.read_feather(path, columns=columns, use_threads=bool(use_threads))

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
    206     """
    207     _check_pandas_version()
--> 208     return (read_table(source, columns=columns, memory_map=memory_map)
    209             .to_pandas(use_threads=use_threads))
    210 

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
    237         return reader.read_indices(columns)
    238     elif all(map(lambda t: t == str, column_types)):
--> 239         return reader.read_names(columns)
    240 
    241     column_type_names = [t.__name__ for t in column_types]

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.read_names()

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Schema at index 0 was different: 
B: string
A: int64
vs
A: int64
B: string

Problem description

We don't always know the order in which our columns are stored.
The issue appeared when we updated pyarrow to 0.17.0.

Requesting the columns in the order they are stored in the file works fine:

df2 = pd.read_feather("./test_data.feather", columns=['A', 'B'])

Should we apply a fix here or in the pyarrow repository?
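
While the regression is present, one possible workaround is to avoid passing `columns=` to the reader at all and do the selection and reordering in pandas instead. The sketch below is only an illustration: `read_feather_reordered` is a hypothetical helper name, not part of pandas or pyarrow, and it assumes the whole file fits in memory, because every column is read before the unwanted ones are dropped.

```python
import pandas as pd


def read_feather_reordered(path, columns, reader=pd.read_feather):
    """Read a Feather file and return `columns` in the requested order.

    Hypothetical helper (not a pandas or pyarrow API). It calls the reader
    without `columns=`, so the underlying library is never asked to return
    the columns in an order that differs from the file's stored order.
    """
    df = reader(path)          # reads every column in stored order
    return df[list(columns)]   # subset and reorder on the pandas side


# Usage, assuming the file from the code sample above exists:
# df2 = read_feather_reordered("./test_data.feather", ["B", "A"])
```

The `reader` parameter exists only so the helper can be exercised without a file on disk. The cost of this approach is that unused columns are loaded too, so it suits small or medium files better than wide ones.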

Expected Output

df2 = pd.DataFrame({
    "A": [1, 2],
    "B": ["x", "y"],
})

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-91-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3
Cython : 0.29.15
pytest : 5.3.2
hypothesis : 5.5.4
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.3
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pytest : 5.3.2
pyxlsb : None
s3fs : None
scipy : 1.2.3
sqlalchemy : 1.3.13
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.48.0


jorisvandenbossche · 2020-04-30T06:41:26Z

@Benjamin15 Thanks a lot for the report! This is indeed a regression. I opened an issue for this on the Arrow side (since the bug is in the latest pyarrow 0.17 release): https://issues.apache.org/jira/browse/ARROW-8641

jorisvandenbossche · 2020-06-20T13:46:15Z

Closed by #34883

Benjamin15 added the Bug and Needs Triage labels Apr 29, 2020

jorisvandenbossche added the IO Parquet (parquet, feather) label and removed the Needs Triage label Apr 30, 2020

alimcmaster1 mentioned this issue Jun 20, 2020

TST: Feather RoundTrip Column Ordering #34883

Merged

4 tasks

jreback added this to the 1.1 milestone Jun 20, 2020

jorisvandenbossche closed this as completed Jun 20, 2020

asfimport mentioned this issue May 12, 2020

[Python] Regression in feather: no longer supports permutation in column selection apache/arrow#24802

Closed
