read_csv: if parse_dates columns don't appear in usecols, we get a traceback #31251

Closed
teto opened this issue Jan 23, 2020 · 5 comments · Fixed by #32320
Labels
good first issue IO CSV read_csv, to_csv Needs Tests Unit test(s) needed to prevent regressions

@teto

teto commented Jan 23, 2020

Code Sample, a copy-pastable example if possible

# Minimal reproduction: parse_dates names a column that usecols excludes
import pandas as pd
import io

content = io.StringIO('''
time,val
212.23, 32
''')

date_cols = ['time']

df = pd.read_csv(
    content,
    sep=',',
    usecols=['val'],
    dtype={'val': int},
    parse_dates=date_cols,
)

Problem description

triggers

Traceback (most recent call last):
  File "test.py", line 16, in <module>
    parse_dates=date_cols,
  File "/nix/store/k4fd48jzsyafvcifa6wi6pk4vaprnw36-python3.7-pandas-0.25.3/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/nix/store/k4fd48jzsyafvcifa6wi6pk4vaprnw36-python3.7-pandas-0.25.3/lib/python3.7/site-packages/pandas/io/parsers.py", line 463, in _read
    data = parser.read(nrows)
  File "/nix/store/k4fd48jzsyafvcifa6wi6pk4vaprnw36-python3.7-pandas-0.25.3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1154, in read
    ret = self._engine.read(nrows)
  File "/nix/store/k4fd48jzsyafvcifa6wi6pk4vaprnw36-python3.7-pandas-0.25.3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2134, in read
    names, data = self._do_date_conversions(names, data)
  File "/nix/store/k4fd48jzsyafvcifa6wi6pk4vaprnw36-python3.7-pandas-0.25.3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1885, in _do_date_conversions
    keep_date_col=self.keep_date_col,
  File "/nix/store/k4fd48jzsyafvcifa6wi6pk4vaprnw36-python3.7-pandas-0.25.3/lib/python3.7/site-packages/pandas/io/parsers.py", line 3335, in _process_date_conversion
    data_dict[colspec] = converter(data_dict[colspec])
KeyError: 'time'

That is, if parse_dates names a column that doesn't appear in usecols, read_csv fails with a bare KeyError.

Either the parse_dates columns should be added to usecols automatically, or the documentation should state this requirement. An explicit check with a descriptive error message would also make the failure easier to understand.
pandas 0.25.3
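
Until the behavior is settled, a caller-side check along the lines suggested above could look like the sketch below. The helper names `missing_parse_dates` and `read_csv_checked` are hypothetical (they are not part of pandas), and the sketch assumes `parse_dates` is a list of column names or a list of lists of column names, not a bool or a dict.

```python
from typing import List, Sequence, Union

# Assumption: each parse_dates entry is a column name or a list of column names.
DateSpec = Union[str, Sequence[str]]


def missing_parse_dates(usecols: Sequence[str],
                        parse_dates: Sequence[DateSpec]) -> List[str]:
    """Return the parse_dates column names that usecols does not select."""
    selected = set(usecols)
    missing: List[str] = []
    for spec in parse_dates:
        # A single name is treated as a one-element group.
        names = [spec] if isinstance(spec, str) else list(spec)
        for name in names:
            if name not in selected and name not in missing:
                missing.append(name)
    return missing


def read_csv_checked(source, usecols, parse_dates, **kwargs):
    """Hypothetical wrapper: fail early with a readable message."""
    missing = missing_parse_dates(usecols, parse_dates)
    if missing:
        raise ValueError(
            f"parse_dates columns not present in usecols: {missing}"
        )
    # Imported here so the check above runs without pandas installed.
    import pandas as pd
    return pd.read_csv(source, usecols=usecols, parse_dates=parse_dates, **kwargs)
```

With the reproduction above, `read_csv_checked(content, usecols=['val'], parse_dates=['time'])` would raise a `ValueError` naming `time` before pandas is ever called.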

@murphybrendan

I think this might have been fixed. I'm not able to reproduce that error in 1.0.0rc.

@mroeschke
Member

Looks fixed on master as well. This could use a regression test, since the whatsnew notes for 1.0.0rc don't explicitly mention a fix.
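
Since reports in this thread differ on whether the error still reproduces, a small version-agnostic probe may help compare environments. This is only a sketch: `classify_repro` is a hypothetical helper, it takes the `read_csv` callable as an argument, and it makes no claim about which outcome any particular pandas version produces.

```python
import io


def classify_repro(read_csv):
    """Run the issue's reproduction and report what happened as a string."""
    content = io.StringIO("time,val\n212.23,32\n")
    try:
        df = read_csv(content, usecols=["val"], parse_dates=["time"])
    except KeyError as exc:
        # The failure mode shown in the traceback in this issue.
        return f"KeyError: {exc}"
    except ValueError as exc:
        # A validation-style error, if the installed version raises one.
        return f"ValueError: {exc}"
    return f"ok: columns={list(df.columns)}"
```

Calling `classify_repro(pd.read_csv)` after `import pandas as pd`, and posting the result next to `pd.__version__`, gives a one-line report that can be compared across installations. A regression test could assert on whichever outcome the maintainers decide is correct.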

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jan 24, 2020
@ShaharNaveh ShaharNaveh removed their assignment Jan 24, 2020
@ShaharNaveh
Member

I am getting the same error as well. (Python 3.8.1)

INSTALLED VERSIONS

commit : aad377d
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.13.a-1-hardened
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.0rc0+190.gaad377dd7
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.1
setuptools : 45.1.0
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.3.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : 0.8.6
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0

@teto
Author

teto commented Jan 24, 2020

Thanks for looking into it.
One thing I don't get about the API is why dates are handled differently from converters. Is it because date parsing can span multiple columns? We could have multicolumn converters instead, or drop the separate date handling altogether (I mention this just in case 1.0 is allowed to break the API).

sathyz added a commit to sathyz/pandas that referenced this issue Feb 1, 2020
@jschendel jschendel added this to the 1.1 milestone Feb 1, 2020
sathyz added a commit to sathyz/pandas that referenced this issue Feb 8, 2020
@jbrockmendel jbrockmendel added the IO CSV read_csv, to_csv label Feb 25, 2020
@ishanema1 ishanema1 removed their assignment Mar 5, 2020
@sathyz
Contributor

sathyz commented Mar 8, 2020

take.
