BUG: Conflict between thousands sep and date parser. #4945

guyrt · 2013-09-23T03:43:26Z

closes #4678
closes #4322

I've fixed the C parser portion. The issue there was that it did not handle the case where parse_dates is a dict.

Python parser fix yet to come. That test still fails.

Example:
from io import StringIO
import pandas as pd

s = '06-02-2013;13:00;1-000,215;0,215;0,185;0,205;0,00'
d = pd.read_csv(StringIO(s), header=None, parse_dates={'Date': [0, 1]}, thousands='-', sep=';', decimal=',')

Then d should have a column 'Date' with value "2013-06-02 13:00:00"

Fixes issue where thousands separator could conflict with date parsing. This is only fixed in the C parser. Closes issue pandas-dev#4678
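
To make the conflict concrete, here is a minimal pure-Python sketch that does not use pandas at all. The helper `strip_thousands` and its `skip_columns` parameter are hypothetical names introduced for illustration, not the parser's actual code; the row is the sample from the example above. It only shows why stripping the separator from every field damages the date, and how exempting the date columns avoids that.

```python
def strip_thousands(fields, thousands, skip_columns=frozenset()):
    """Remove the thousands separator from each field.

    Hypothetical helper: fields whose position is in skip_columns
    (for example, columns destined for date parsing) are left untouched.
    """
    cleaned = []
    for index, value in enumerate(fields):
        if index in skip_columns:
            cleaned.append(value)
        else:
            cleaned.append(value.replace(thousands, ""))
    return cleaned


row = "06-02-2013;13:00;1-000,215;0,215;0,185;0,205;0,00".split(";")

# Stripping '-' everywhere also removes it from the date field.
naive = strip_thousands(row, "-")

# Exempting columns 0 and 1 (the ones named in parse_dates) keeps the date intact.
aware = strip_thousands(row, "-", skip_columns={0, 1})

print(naive[0], naive[2])
print(aware[0], aware[2])
```

In the naive pass the date field loses its separators and can no longer be parsed as a date, while the number in column 2 is cleaned the same way in both passes.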

jreback · 2013-09-23T19:31:38Z

@guyrt go ahead and put in this same PR.....(both c and python)

guyrt · 2013-09-23T19:47:32Z

@jreback I've pushed a fix for Python parser, but I'm not completely happy with it.

The issue is that the Python parser has special handling for the first line when header=None (https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1414). In this section, we read the first line before the column names are necessarily available. Therefore, if a date column is identified by name (so that we skip thousands parsing for it), we cannot tell which column to skip. This is a pretty rare case requiring:

no header
parse_dates = a dict that refers to columns by name
those columns use the thousands separator.

I don't see an easy way to fix this problem without rethinking the first line processing completely.
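
The dependency on column names can be sketched in a few lines of plain Python. The function `date_column_indices` below is a hypothetical illustration, not code from pandas/io/parsers.py: it turns a parse_dates dict into a set of column positions, which is only possible for name-based references once the names are known.

```python
def date_column_indices(parse_dates, names=None):
    """Resolve the columns referenced by a parse_dates dict to positions.

    Hypothetical helper: integer references are used directly, while
    string references need the list of column names to be resolved.
    """
    indices = set()
    for columns in parse_dates.values():
        for column in columns:
            if isinstance(column, int):
                indices.add(column)
            elif names is not None and column in names:
                indices.add(names.index(column))
            else:
                raise ValueError("cannot resolve column %r without names" % (column,))
    return indices


# Positional references work even before any names are available.
by_position = date_column_indices({"Date": [0, 1]})

# Name-based references need names up front.
by_name = date_column_indices({"Date": ["day", "time"]}, names=["day", "time", "value"])

# Without names, a name-based reference cannot be mapped to a column.
try:
    date_column_indices({"Date": ["day", "time"]})
    unresolved = False
except ValueError:
    unresolved = True
```

This is the rare case listed above: with no header and name-based parse_dates, the first line cannot be cleaned selectively until the names have been supplied.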

jreback · 2013-09-23T19:50:04Z

hmm....but in that case you would HAVE to have names specified (otherwise it's an error to put the name in parse_dates)...pretty sure that is available in the passed kwds

guyrt · 2013-09-23T19:55:40Z

You would have to now. Previously, any use of self.names was deferred until after the first line was read, so that names could be inferred or set. Now, we have to be prepared to read that line and selectively remove thousands separators, so we have to use names before the first line is read.

jreback · 2013-09-23T19:57:43Z

makes sense...the whole ordering business is somewhat complicated....lmk if you need help

guyrt · 2013-09-23T21:08:29Z

Turns out there was an easy fix. The first line is pushed into a buffer, but it isn't fully processed until later. I deferred checking for thousands until later.
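
The idea of deferring the check can be illustrated with a small stand-alone sketch. `LazyFirstLineReader`, `peek_width`, and `date_columns` are hypothetical names chosen for this example and do not correspond to the parser's real attributes; the sketch assumes only what is described above, namely that the first line is buffered raw and cleaned later, once the date columns are known.

```python
class LazyFirstLineReader:
    """Buffer the first line unprocessed and clean every row on consumption."""

    def __init__(self, lines, sep, thousands):
        self._lines = iter(lines)
        self._sep = sep
        self._thousands = thousands
        self._buffer = []
        # Positions to exempt from thousands stripping; may be set after peeking.
        self.date_columns = set()

    def peek_width(self):
        """Read the first line to count its fields, storing it raw in the buffer."""
        fields = next(self._lines).split(self._sep)
        self._buffer.append(fields)
        return len(fields)

    def _clean(self, fields):
        return [
            value if index in self.date_columns else value.replace(self._thousands, "")
            for index, value in enumerate(fields)
        ]

    def rows(self):
        """Yield cleaned rows: buffered lines first, then the remaining input."""
        pending, self._buffer = self._buffer, []
        for fields in pending:
            yield self._clean(fields)
        for line in self._lines:
            yield self._clean(line.split(self._sep))


reader = LazyFirstLineReader(
    ["06-02-2013;13:00;1-000,215", "07-02-2013;14:00;2-500,100"],
    sep=";",
    thousands="-",
)
width = reader.peek_width()      # first line is read but not cleaned yet
reader.date_columns = {0, 1}     # date columns become known only afterwards
rows = list(reader.rows())
```

Because the first line sits in the buffer untouched, the date columns can be decided after it has been read, and the first row is then treated the same as every later row.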

jreback · 2013-09-23T21:17:32Z

gr8
ping when u think ready

guyrt · 2013-09-23T21:49:35Z

@jreback tests pass

jreback · 2013-09-23T22:03:21Z

pandas/io/parsers.py

@@ -1500,7 +1541,7 @@ def _next_line(self):
            line = next(self.data)

        line = self._check_comments([line])[0]
-        line = self._check_thousands([line])[0]
+        #line = self._check_thousands([line])[0]


you can just delete this line

jreback · 2013-09-23T22:05:38Z

do both test cases from the issue pass? or is there something else going on?

guyrt · 2013-09-24T00:04:00Z

There were two open issues:
#4322 (thousands separator)
#4678 (date/thousands conflict)

Both are fixed.

jreback · 2013-09-24T01:10:37Z

does this also fix #4382 ?

guyrt · 2013-09-24T01:25:34Z

No such luck on #4382.

jreback · 2013-09-24T11:52:00Z

can you add the #4322 reference to the release notes and a test for that as well?

guyrt · 2013-09-24T12:25:37Z

#4322 is already in the release notes and was tested in a previous PR.

What happened is that I fixed #4322 (thousands separator ignored), which uncovered this bug about a conflict between dates and the thousands separator. Someone reopened #4322 but also created a new ticket, so we can close them both.

jreback · 2013-09-26T01:04:19Z

@guyrt thanks!

jreback · 2013-09-26T03:03:09Z

@guyrt interested in #4335, #4201?

guyrt · 2013-09-26T03:11:29Z

I'll take a look. Working on #3866 right now

guyrt mentioned this pull request Sep 23, 2013

Dates are parsed with read_csv thousand seperator #4678

Closed

BUG: Conflict between thousands sep and date parser.

c6bf2eb


jreback reviewed Sep 23, 2013

BUG: fix issue pandas-dev#4678 for Python parser

fedb26d

jreback merged commit fedb26d into pandas-dev:master Sep 26, 2013
