Dates are parsed with read_csv thousand separator #4678

Closed
hayd opened this issue Aug 26, 2013 · 5 comments · Fixed by #4945
Labels: Bug, IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label)


@hayd
Contributor

hayd commented Aug 26, 2013

When reading a csv with a date column, the date is sometimes parsed as a number:

In [1]: s = '06.02.2013;13:00;1.000,215;0,215;0,185;0,205;0,00'

In [2]: pd.read_csv(StringIO(s), sep=';', header=None, parse_dates={'Dates': [0, 1]}, index_col=0, decimal=',', thousands='.')
Out[2]:
                        2      3      4      5  6
Dates
6022013 13:00   1.000,215  0.215  0.185  0.205  0

Here 06.02.2013 is read as the number 6022013 (the . thousands separator is stripped first) before the date is parsed (which then fails)... I think dates are sometimes written this way on the continent (along with . thousands).

This was found in #4322 (though that issue was more about . being ignored). I guess another test case would be with -:

In [3]: s = '06-02-2013;13:00;1.000,215;0,215;0,185;0,205;0,00'

In [4]: pd.read_csv(StringIO(s), sep=';', header=None, parse_dates={'Dates': [0, 1]}, decimal=',', thousands='-')
Out[4]: 
           Dates          2      3      4      5  6
0  6022013 13:00  1.000,215  0.215  0.185  0.205  0
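
Both outputs are consistent with the thousands separator being stripped from the date field before any date parsing happens. Here is a minimal pure-Python sketch of that failure mode. It is not the pandas implementation; `strip_thousands` and `naive_convert` are hypothetical names made up for illustration.

```python
def strip_thousands(token, thousands):
    """Remove the thousands separator from a raw field, as a numeric pre-pass might."""
    return token.replace(thousands, "")


def naive_convert(token, thousands):
    """Hypothetical converter: strip the separator, then try int before anything else."""
    cleaned = strip_thousands(token, thousands)
    try:
        return int(cleaned)
    except ValueError:
        # Not numeric even after stripping, so leave the field untouched.
        return token


dotted = naive_convert("06.02.2013", ".")   # date written with '.'
dashed = naive_convert("06-02-2013", "-")   # date written with '-'
time_field = naive_convert("13:00", ".")    # contains no separator

print(dotted, dashed, time_field)  # 6022013 6022013 13:00
```

Once the field has collapsed to the integer 6022013, the leading zero and the day/month/year boundaries are gone, so a later date parser has nothing sensible left to work with.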

@jreback suggests:

but it should ignore dates columns entirely (for thousands parsing...)

cc #4598 @guyrt

@guyrt
Contributor

guyrt commented Aug 26, 2013

I'm not an expert on this IO code just yet, but it would seem that maybe the numeric parser is running first? In that case, we wouldn't even try the datetime converter, would we?

https://github.com/pydata/pandas/blob/master/pandas/parser.pyx#L1648

@jreback
Contributor

jreback commented Aug 26, 2013

Things are parsed (with thousands/decimal substitutions) and then passed to the dtype converter (and NA converter), so I think this would have to change based on whether parse_dates is set for a particular column; it might be tricky (or not).
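
A rough sketch of the idea, in plain Python rather than the actual parser code: skip the thousands/decimal substitution for any column that is going to be handed to the date parser. The function `convert_row`, its `date_cols` argument, and the day-first format string are all assumptions for illustration only.

```python
from datetime import datetime


def convert_row(fields, date_cols, thousands=".", decimal=","):
    """Convert one already-split row, leaving date columns untouched."""
    out = []
    for i, raw in enumerate(fields):
        if i in date_cols:
            # Date columns bypass separator substitution entirely.
            out.append(raw)
            continue
        cleaned = raw.replace(thousands, "").replace(decimal, ".")
        try:
            out.append(float(cleaned))
        except ValueError:
            # Not numeric; keep the original text.
            out.append(raw)
    return out


line = "06.02.2013;13:00;1.000,215;0,215;0,185;0,205;0,00"
row = convert_row(line.split(";"), date_cols={0, 1})

# Assumed day-first format, matching the "continental" style in the report.
stamp = datetime.strptime(row[0] + " " + row[1], "%d.%m.%Y %H:%M")
print(row[2], stamp)
```

With the date columns excluded, "06.02.2013" reaches the date step intact, while "1.000,215" is still converted to 1000.215.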

@jreback
Contributor

jreback commented Sep 21, 2013

@guyrt having a look at this?

@guyrt
Contributor

guyrt commented Sep 23, 2013

@jreback I am. Got sidetracked on a few other things, but I'll carve out some time to look at it over the next few days. What I know so far is that the second example works on the Python parser. It's not clear yet what is causing it to fail on the C parser, but I'll keep digging.

The first example is a problem with the date parser, which doesn't parse the day part correctly.

@guyrt
Contributor

guyrt commented Sep 23, 2013

Fix for the C parser submitted, but I found an error in the Python parser as well. That one will come in the next commit.

#4945

guyrt added a commit to guyrt/pandas that referenced this issue Sep 23, 2013
Fixes issue where thousands separator could conflict with date
parsing.

This is only fixed in the C parser.

Closes issue pandas-dev#4678
guyrt added a commit to guyrt/pandas that referenced this issue Sep 24, 2013
3 participants