Skip to content

BUG: Conflict between thousands sep and date parser. #4945

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 26, 2013

Conversation

guyrt
Copy link
Contributor

@guyrt guyrt commented Sep 23, 2013

closes #4678
closes #4322

I've fixed the C parser portion. The issue there was that it did not handle the case where parse_dates is a dict.

Python parser fix yet to come. That test still fails.

Example:
s = '06-02-2013;13:00;1-000,215;0,215;0,185;0,205;0,00'
d = pd.read_csv(StringIO(s), header=None, parse_dates={'Date': [0, 1]}, thousands='-', sep=';', decimal=',')

Then d should have a column 'Date' with value "2013-06-02 13:00:00"

Fixes issue where thousands separator could conflict with date
parsing.

This is only fixed in the C parser.

Closes issue pandas-dev#4678
@jreback
Copy link
Contributor

jreback commented Sep 23, 2013

@guyrt go ahead and put in this same PR.....(both c and python)

@guyrt
Copy link
Contributor Author

guyrt commented Sep 23, 2013

@jreback I've pushed a fix for Python parser, but I'm not completely happy with it.

The issue is that the Python parser has special handling for the first line when headers=None (https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1414). In this section, we read the first line without having access to the column names in every case. Therefore, if a column is identified by name as in a date (so we skip thousands parsing) then we are unable to identify which column to skip. This is a pretty rare case requiring:

  • no header
  • parse_dates = a dict that refers to columns by name
  • those columns use the thousands separator.

I don't see an easy way to fix this problem without rethinking the first line processing completely.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2013

hmm....but in that case you would HAVE to have names specified (otherwise its an error to put the name in parse_dates)...pretty sure that is available in the passed kwds

@guyrt
Copy link
Contributor Author

guyrt commented Sep 23, 2013

You would have to now. Previously, any use of self.names was deferred until after the first line was read so names could be inferred set. Now, we have to be prepared to read that line and selectively remove thousands separators, so we have to use names before the first line is read.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2013

makes sense...whole ordering bizness is somewhat complicated....mk if you need help

@guyrt
Copy link
Contributor Author

guyrt commented Sep 23, 2013

Turns out there was an easy fix. The first line is pushed into a buffer, but it isn't fully processed until later. I deferred checking for thousands until later.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2013

gr8
ping when u think ready

@guyrt
Copy link
Contributor Author

guyrt commented Sep 23, 2013

@jreback tests pass

@@ -1500,7 +1541,7 @@ def _next_line(self):
line = next(self.data)

line = self._check_comments([line])[0]
line = self._check_thousands([line])[0]
#line = self._check_thousands([line])[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just delete this line

@jreback
Copy link
Contributor

jreback commented Sep 23, 2013

do both test cases from the issue pass? or is there something else going on?

@guyrt
Copy link
Contributor Author

guyrt commented Sep 24, 2013

There were two open issues:
#4322 (thousands separator)
#4678 (date/thousands conflict)

Both are fixed.

@jreback
Copy link
Contributor

jreback commented Sep 24, 2013

does this also fix #4382 ?

@guyrt
Copy link
Contributor Author

guyrt commented Sep 24, 2013

So such luck on #4382

@jreback
Copy link
Contributor

jreback commented Sep 24, 2013

can you add the #4322 reference to the release notes and a test for that as well?

@guyrt
Copy link
Contributor Author

guyrt commented Sep 24, 2013

#4322 is already in the release notes and was tested in a previous PR.

What happened is that I fixed 4322 (thousands operator ignored) which uncovered this bug about a conflict between dates and the thousands operator. Someone reopened 4322 but also created a new ticket, so we can close them both.

@jreback jreback merged commit fedb26d into pandas-dev:master Sep 26, 2013
@jreback
Copy link
Contributor

jreback commented Sep 26, 2013

@guyrt thanks!

@jreback
Copy link
Contributor

jreback commented Sep 26, 2013

@guyrt interested in #4335, #4201?

@guyrt
Copy link
Contributor Author

guyrt commented Sep 26, 2013

I'll take a look. Working on #3866 right now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dates are parsed with read_csv thousand seperator read csv thousands separator
2 participants