
csv_import: Thousands separator works in floating point numbers #4598


Merged: 3 commits into pandas-dev:master on Aug 23, 2013

Conversation

@guyrt (Contributor) commented Aug 18, 2013

Closes issue #4322

Adds support for the thousands separator character in the csv parser for floats.

Previously, the thousands separator character was not passed into the core floating point parsing routine:
https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L1861

I added an argument to this function and provided a test. Now, in a file like this:

A|B|C
1|2,334.01|5
10|13|10

Column B would import as float type.
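For reference, a minimal sketch of how such a file would be read (the dtype check is illustrative, not part of the PR):

import pandas as pd
from io import StringIO

data = 'A|B|C\n1|2,334.01|5\n10|13|10'

# With thousands=',' the separator is stripped before numeric conversion,
# so column B should come back as float64 rather than object.
df = pd.read_csv(StringIO(data), sep='|', thousands=',')
print(df.dtypes)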

Also related to issue #2594

@jreback (Contributor) commented Aug 21, 2013

can you do the comparisons with full data frames (rather than .values)? to make sure that the dtypes are what they should be (they are, just want to have the test reflect that)

is there a test for a thousands sep that has len > 1 (does it raise?), what does the decimal separator do for the same issue?

what about a thousands sep that is None? (or ''), None is default so that shouldn't matter I guess

also what about a multiple seps in the data e.g. 2,,334?

just trying to bulletproof a bit...

great PR btw!

@guyrt (Contributor, Author) commented Aug 21, 2013

can you do the comparisons with full data frames (rather than .values)? to make sure that the dtypes are what they should be (they are, just want to have the test reflect that)

Done

is there a test for a thousands sep that has len > 1 (does it raise?), what does the decimal separator do for the same issue?

It does raise a ValueError, though the message was wrong. I've added tests for both decimal and thousands.

what about a thousands sep that is None? (or ''), None is default so that shouldn't matter I guess

This also raises a ValueError. Test added.
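A minimal sketch of those invalid-separator cases from the caller's side (the data is illustrative; only the ValueError behaviour is from the discussion above):

import pandas as pd
from io import StringIO

data = 'A|B\n1|2,334.01\n'

# A thousands separator longer than one character is rejected ...
try:
    pd.read_csv(StringIO(data), sep='|', thousands=',,')
except ValueError as err:
    print('len > 1 rejected:', err)

# ... and so is an empty string (None, the default, simply disables it).
try:
    pd.read_csv(StringIO(data), sep='|', thousands='')
except ValueError as err:
    print('empty string rejected:', err)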

also what about a multiple seps in the data e.g. 2,,334?

What should we do here? Right now, all parsers seem to treat the thousands separator as a "skip character" that can be in an arbitrary position. See #4602 where we simply remove the character. In this case, multiple separators are all ignored. I'm happy to write it to require every third character, but we should decide on a standard first.
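To illustrate the "skip character" behaviour mentioned above: the separator is effectively dropped wherever it appears before the number is parsed. A rough pure-Python equivalent (illustrative only, not the actual C code path):

def parse_with_thousands(token, tsep=','):
    # drop every occurrence of the separator, then parse what is left
    return float(token.replace(tsep, ''))

print(parse_with_thousands('2,334.01'))  # 2334.01
print(parse_with_thousands('2,,334'))    # 2334.0 -- multiple separators are ignored too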

@jreback (Contributor) commented Aug 21, 2013

your last comment about 2,,334...what happens now?

@guyrt (Contributor, Author) commented Aug 21, 2013

Using ints:

$ cat tmp.csv 
A|B|C
1|2,,334|5
10|13,,|10.

Yields:

> df = pandas.read_csv('tmp.csv', thousands=',', sep='|')
> print df
    A     B   C
0   1  2334   5
1  10    13  10

So it looks like we're ignoring all of the thousands separators. However, there is a bug with leading thousands separators. The file:

$ cat tmp.csv 
A|B|C
1|2,,334|5
10|,,13|10.

imports B as an object.

@jreback (Contributor) commented Aug 21, 2013

what about the above example with floats?

do you know what the problem is with leading ',,'?

@guyrt (Contributor, Author) commented Aug 21, 2013

With floats it imports as strings. This PR fixes that: previously the thousands separator was not passed through to the float parser, so values containing it tripped up the parser.

It turns out the leading separator behavior is intentional, and probably for good reason:
https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L2034

Inline review comment on the test diff (Contributor):

df = self.read_csv(StringIO(data), sep='|', thousands=',')
assert_almost_equal(df.values, expected)
I would use assert_frame_equal here (and I think in a couple of other tests)
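For context, a sketch of the suggested pattern, comparing the full parsed frame (values and dtypes) rather than just .values; the expected frame here is illustrative:

import pandas as pd
import pandas.util.testing as tm  # pandas.testing in modern versions
from io import StringIO

data = 'A|B|C\n1|2,334.01|5\n10|13|10'
expected = pd.DataFrame({'A': [1, 10], 'B': [2334.01, 13.0], 'C': [5, 10]})

df = pd.read_csv(StringIO(data), sep='|', thousands=',')
# assert_frame_equal checks dtypes as well, so a column that silently came
# back as object would now fail the test.
tm.assert_frame_equal(df, expected)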

@jreback (Contributor) commented Aug 21, 2013

ok....so that's fine...so you covered all the issues? docs ok? (I think that was the problem in that they said it should work but it didn't)

#2594 is independent of this? (as that fixes the PythonParser) IIRC?

@jreback (Contributor) commented Aug 21, 2013

sorry....

can you add some tests with dtype specified as well (as that's what #2594 looks at)

then can close that too

@jtratner (Contributor):

Please make sure that this works with something like:

1.250,01

Given that I think we changed the thousands separator parser to accept . for thousands sep.
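A sketch of that European-style case, assuming decimal=',' is passed alongside thousands='.' (the data is illustrative):

import pandas as pd
from io import StringIO

# 1.250,01 uses '.' as the thousands separator and ',' as the decimal mark.
data = 'A;B\n1;1.250,01\n2;2.500,50'
df = pd.read_csv(StringIO(data), sep=';', thousands='.', decimal=',')
print(df['B'])  # expect a float64 column: 1250.01 and 2500.5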

@guyrt (Contributor, Author) commented Aug 22, 2013

@jtratner I added numbers like 1.250,01 as a test case.

@jreback I've switched to assert_frame_equal. Please also confirm that my dtype test is correct (it's in commit 2cca2ba)

I'm pretty sure the docs were written for the PythonParser and did not take a thousands separator in floats into account. I've fixed both issues.

#2594 does not reference the PythonParser directly, so I would argue for closing that ticket as worded. However, the PythonParser has logic to explicitly reject separator and decimal characters other than the defaults. That is a PythonParser-specific limitation.

@guyrt (Contributor, Author) commented Aug 22, 2013

My last commit is purely a cleanup of some import redundancies.

@wesm (Member) commented Aug 22, 2013

Can you check briefly whether there's any performance penalty for adding the extra if statement? Maybe something as simple as parsing a file containing a single column with 10 million rows. It's probably small relative to the other costs.
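A rough sketch of the kind of check being asked for: generate a single-column file with 10 million formatted rows and time the parse (file name and sizes are illustrative):

import random
import time
import pandas as pd

# Write ~10 million numbers formatted with ',' as the thousands separator.
with open('bench.csv', 'w') as fh:
    fh.write('A\n')
    fh.writelines('{:,}\n'.format(random.randint(1000, 10000000))
                  for _ in range(10000000))

start = time.time()
pd.read_csv('bench.csv', thousands=',')
print('parse time: %.2fs' % (time.time() - start))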

@cancan101 (Contributor):

FWIW, you can write that line without adding a branch / if statement:

p +=  (tsep != '\0' & *p == tsep)

@guyrt (Contributor, Author) commented Aug 23, 2013

@wesm I put in @cpcloud's improvement and tested with a 10,000,000 by 5 file. Separator caused no statistically significant change in import time.

Example runtimes with hot file:

$ cat import_file.py
import pandas
a = pandas.read_csv('tmp.csv', thousands='|')



$ cat tmp.csv > /dev/null

$ workon pandas_dev
$ /usr/bin/time python import_file.py
11.52user 0.86system 0:12.42elapsed 99%CPU (0avgtext+0avgdata 4136960maxresident)k 0inputs+0outputs (0major+391461minor)pagefaults 0swaps

$ workon stable
$ /usr/bin/time python import_file.py
11.65user 0.91system 0:14.34elapsed 87%CPU (0avgtext+0avgdata 4136960maxresident)k 27128inputs+0outputs (55major+391443minor)pagefaults 0swaps

@jreback (Contributor) commented Aug 23, 2013

@guyrt ok...you need to rebase on master (prob just a release notes conflict), and squash the commits down a bit (whatever is reasonable)

@cancan101 (Contributor):

@guyrt I think you want to use single & rather than a double &&. I am a little fuzzy on this, but I think you do not want the short circuit logic in the double && since that will potentially introduce a branch.

@jreback (Contributor) commented Aug 23, 2013

@guyrt actually...why don't you add an additional vbench in vb_suite/parsers.py; just copy an example and modify it to also parse the thousands.....

@guyrt (Contributor, Author) commented Aug 23, 2013

@jreback There is already a vb_suite test using separators: https://github.com/pydata/pandas/blob/master/vb_suite/parser.py#L31
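For reference, a hypothetical vbench entry along those lines; the import path, variable names, and setup string are assumptions based on the usual vb_suite pattern, not code from this PR:

from datetime import datetime
from vbench.api import Benchmark

# setup writes a pipe-delimited file whose B column carries thousands separators
setup = """
from pandas import read_csv
with open('thousands.csv', 'w') as fh:
    fh.write('A|B|C\\n')
    fh.writelines('1|2,334.01|5\\n' for _ in range(50000))
"""

parser_thousands_float = Benchmark("read_csv('thousands.csv', sep='|', thousands=',')",
                                   setup,
                                   start_date=datetime(2013, 8, 1))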

@cancan101 Good catch.

@guyrt (Contributor, Author) commented Aug 23, 2013

@jreback I've squashed to three commits: main fix, cleanup in test, and changing && to &. Happy to squash further if you prefer.

Also rebased

@jreback (Contributor) commented Aug 23, 2013

@guyrt no that's fine....ping me when travis is done and will merge it

@guyrt (Contributor, Author) commented Aug 23, 2013

@jreback ping

jreback added a commit that referenced this pull request on Aug 23, 2013:
csv_import: Thousands separator works in floating point numbers

@jreback merged commit d536ff6 into pandas-dev:master on Aug 23, 2013
@jreback (Contributor) commented Aug 23, 2013

@guyrt great thanks!!!!

@jreback (Contributor) commented Aug 26, 2013

@guyrt @hayd is going to open an issue about an odd edge case where the thousands sep is in a date column (and getting handled where it should not)....can you take a look?

thanks
