BUG: read_feather doesn't work when columns are shuffle #33878

Closed
3 tasks done
Benjamin15 opened this issue Apr 29, 2020 · 2 comments
Labels: Bug, IO Parquet (parquet, feather)
Milestone

Comments

@Benjamin15
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2],
    "B": ["x", "y"],
    "C": [True, False]
})
df.to_feather("./test_data.feather")

df2 = pd.read_feather("./test_data.feather", columns=['B', 'A'])

Error message

ArrowInvalid                              Traceback (most recent call last)
<ipython-input-4-1e23cf201732> in <module>
     15 
     16 
---> 17 df2 = pd.read_feather("/misc/labshare/datasets3/rating/data/preprocessing/tests/test_data.feather", columns=['B', 'A'])

~/.conda/envs/venv/lib/python3.6/site-packages/pandas/io/feather_format.py in read_feather(path, columns, use_threads)
    101     path = stringify_path(path)
    102 
--> 103     return feather.read_feather(path, columns=columns, use_threads=bool(use_threads))

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
    206     """
    207     _check_pandas_version()
--> 208     return (read_table(source, columns=columns, memory_map=memory_map)
    209             .to_pandas(use_threads=use_threads))
    210 

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
    237         return reader.read_indices(columns)
    238     elif all(map(lambda t: t == str, column_types)):
--> 239         return reader.read_names(columns)
    240 
    241     column_type_names = [t.__name__ for t in column_types]

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.read_names()

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Schema at index 0 was different: 
B: string
A: int64
vs
A: int64
B: string

Problem description

We don't always know the order in which the columns are stored in the file.
The issue appeared when we updated pyarrow to 0.17.0.

This line, which requests the columns in the same order as they are stored in the file, works fine:

df2 = pd.read_feather("./test_data.feather", columns=['A', 'B'])

Should the fix be applied here or in the pyarrow repository?

Expected Output

df2 = pd.DataFrame({
"A": [1, 2],
"B": ["x", "y"],
})

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-91-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3
Cython : 0.29.15
pytest : 5.3.2
hypothesis : 5.5.4
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.3
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pytest : 5.3.2
pyxlsb : None
s3fs : None
scipy : 1.2.3
sqlalchemy : 1.3.13
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.48.0

@Benjamin15 added the Bug and Needs Triage labels on Apr 29, 2020
@jorisvandenbossche added the IO Parquet label and removed the Needs Triage label on Apr 30, 2020
@jorisvandenbossche
Member

@Benjamin15 Thanks a lot for the report! This is indeed a regression. I opened an issue for this on the Arrow side (since the bug is in the latest pyarrow 0.17 release): https://issues.apache.org/jira/browse/ARROW-8641
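For anyone stuck on pyarrow 0.17.0 in the meantime, a possible workaround is sketched below. It is not taken from the pandas or pyarrow fix. It relies only on the observation in the report that requesting columns in the file's own order works. The helper names `file_ordered` and `read_feather_any_order`, and the optional `file_columns` argument, are hypothetical. When the file's column order is unknown, the sketch falls back to reading every column and selecting afterwards, which assumes the full read is affordable.

```python
def file_ordered(requested, file_columns):
    """Return the requested column names arranged in the file's column order.

    Hypothetical helper: `file_columns` is the column order of the Feather
    file, which the caller has to know or obtain separately.
    """
    wanted = set(requested)
    missing = wanted - set(file_columns)
    if missing:
        raise KeyError("columns not in file: {}".format(sorted(missing)))
    return [name for name in file_columns if name in wanted]


def read_feather_any_order(path, columns, file_columns=None):
    """Read `columns` from a Feather file and return them in the requested order."""
    # Imported lazily so file_ordered() stays usable without pandas installed.
    import pandas as pd

    if file_columns is None:
        # Assumption: the column order is unknown and reading all columns is
        # acceptable. Select and reorder on the pandas side afterwards.
        return pd.read_feather(path)[list(columns)]

    # Ask for the columns in file order (the case reported as working),
    # then reorder the resulting DataFrame to what the caller requested.
    df = pd.read_feather(path, columns=file_ordered(columns, file_columns))
    return df[list(columns)]
```

With the example from the report, `read_feather_any_order("./test_data.feather", ["B", "A"])` would read the whole file and then select the two columns, while passing `file_columns=["A", "B", "C"]` limits the read to the requested columns.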

@jreback added this to the 1.1 milestone on Jun 20, 2020
@jorisvandenbossche
Member

Closed by #34883
