-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Conflict between thousands sep and date parser. #4945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Conflict between thousands sep and date parser. #4945
Conversation
Fixes issue where thousands separator could conflict with date parsing. This is only fixed in the C parser. Closes issue pandas-dev#4678
@guyrt go ahead and put in this same PR.....(both c and python) |
@jreback I've pushed a fix for Python parser, but I'm not completely happy with it. The issue is that the Python parser has special handling for the first line when headers=None (https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1414). In this section, we read the first line without having access to the column names in every case. Therefore, if a column is identified by name as in a date (so we skip thousands parsing) then we are unable to identify which column to skip. This is a pretty rare case requiring:
I don't see an easy way to fix this problem without rethinking the first line processing completely. |
hmm....but in that case you would HAVE to have |
You would have to now. Previously, any use of self.names was deferred until after the first line was read so names could be inferred set. Now, we have to be prepared to read that line and selectively remove thousands separators, so we have to use names before the first line is read. |
makes sense...whole ordering bizness is somewhat complicated....mk if you need help |
Turns out there was an easy fix. The first line is pushed into a buffer, but it isn't fully processed until later. I deferred checking for thousands until later. |
gr8 |
@jreback tests pass |
@@ -1500,7 +1541,7 @@ def _next_line(self): | |||
line = next(self.data) | |||
|
|||
line = self._check_comments([line])[0] | |||
line = self._check_thousands([line])[0] | |||
#line = self._check_thousands([line])[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just delete this line
do both test cases from the issue pass? or is there something else going on? |
does this also fix #4382 ? |
So such luck on #4382 |
can you add the #4322 reference to the release notes and a test for that as well? |
#4322 is already in the release notes and was tested in a previous PR. What happened is that I fixed 4322 (thousands operator ignored) which uncovered this bug about a conflict between dates and the thousands operator. Someone reopened 4322 but also created a new ticket, so we can close them both. |
@guyrt thanks! |
I'll take a look. Working on #3866 right now |
closes #4678
closes #4322
I've fixed the C parser portion. The issue there was that it did not handle the case where parse_dates is a dict.
Python parser fix yet to come. That test still fails.
Example:
s = '06-02-2013;13:00;1-000,215;0,215;0,185;0,205;0,00'
d = pd.read_csv(StringIO(s), header=None, parse_dates={'Date': [0, 1]}, thousands='-', sep=';', decimal=',')
Then d should have a column 'Date' with value "2013-06-02 13:00:00"