
csv_import: Thousands separator works in floating point numbers #4598


Merged: 3 commits into pandas-dev:master on Aug 23, 2013

Conversation

@guyrt (Contributor) commented Aug 18, 2013

Closes issue #4322

Adds support for the thousands separator character in the csv parser for floats.

Previously, the thousands separator character was not passed into the core floating point parsing routine:
https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L1861

I added an argument to this function and provided a test. Now, in a file like this:

A|B|C
1|2,334.01|5
10|13|10

Column B would import as float type.
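For reference, a minimal sketch of how such a file would be read (the dtype check is illustrative, not part of the PR):

import pandas as pd
from io import StringIO

data = 'A|B|C\n1|2,334.01|5\n10|13|10'

# With thousands=',' the separator is stripped before numeric conversion,
# so column B should come back as float64 rather than object.
df = pd.read_csv(StringIO(data), sep='|', thousands=',')
print(df.dtypes)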

Also related to issue #2594

@jreback (Contributor) commented Aug 21, 2013

can you do the comparisons with full data frames (rather than .values)? to make sure that the dtypes are what they should be (they are, just want to have the test reflect that)

is there a test for a thousands sep that has len > 1 (does it raise?), what does the decimal separator do for the same issue?

what about a thousands sep that is None? (or ''), None is default so that shouldn't matter I guess

also what about a multiple seps in the data e.g. 2,,334?

just trying to bulletproof a bit...

great PR btw!

@guyrt (Contributor, Author) commented Aug 21, 2013

can you do the comparisons with full data frames (rather than .values)? to make sure that the dtypes are what they should be (they are, just want to have the test reflect that)

Done

is there a test for a thousands sep that has len > 1 (does it raise?), what does the decimal separator do for the same issue?

It does raise a ValueError, though the message was wrong. I've added tests for both decimal and thousands.

what about a thousands sep that is None? (or ''), None is default so that shouldn't matter I guess

This also raises a ValueError. Test added.
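A minimal sketch of those invalid-separator cases from the caller's side (the data is illustrative; only the ValueError behaviour is from the discussion above):

import pandas as pd
from io import StringIO

data = 'A|B\n1|2,334.01\n'

# A thousands separator longer than one character is rejected ...
try:
    pd.read_csv(StringIO(data), sep='|', thousands=',,')
except ValueError as err:
    print('len > 1 rejected:', err)

# ... and so is an empty string (None, the default, simply disables it).
try:
    pd.read_csv(StringIO(data), sep='|', thousands='')
except ValueError as err:
    print('empty string rejected:', err)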

also what about a multiple seps in the data e.g. 2,,334?

What should we do here? Right now, all parsers seem to treat the thousands separator as a "skip character" that can be in an arbitrary position. See #4602 where we simply remove the character. In this case, multiple separators are all ignored. I'm happy to write it to require every third character, but we should decide on a standard first.
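To illustrate the "skip character" behaviour mentioned above: the separator is effectively dropped wherever it appears before the number is parsed. A rough pure-Python equivalent (illustrative only, not the actual C code path):

def parse_with_thousands(token, tsep=','):
    # drop every occurrence of the separator, then parse what is left
    return float(token.replace(tsep, ''))

print(parse_with_thousands('2,334.01'))  # 2334.01
print(parse_with_thousands('2,,334'))    # 2334.0 -- multiple separators are ignored too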

@jreback (Contributor) commented Aug 21, 2013

your last comment about 2,,334...what happens now?

@guyrt (Contributor, Author) commented Aug 21, 2013

Using ints:

$ cat tmp.csv 
A|B|C
1|2,,334|5
10|13,,|10.

Yields:

> df = pandas.read_csv('tmp.csv', thousands=',', sep='|')
> print df
    A     B   C
0   1  2334   5
1  10    13  10

So it looks like we're ignoring all of the thousands separators. However, there is a bug with leading thousands separators. The file:

$ cat tmp.csv 
A|B|C
1|2,,334|5
10|,,13|10.

imports B as an object.

@jreback (Contributor) commented Aug 21, 2013

what about the above example with floats?

do you know what the problem is with leading ',,'?

@guyrt (Contributor, Author) commented Aug 21, 2013

With floats it imports as strings. This PR fixes that: previously the thousands separator was not passed through to the float parser, so values containing it tripped up the parser.

It turns out the leading separator behavior is intentional, and probably for good reason:
https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L2034

Inline review comment on the test diff (Contributor):

df = self.read_csv(StringIO(data), sep='|', thousands=',')
assert_almost_equal(df.values, expected)
I would use assert_frame_equal here (and I think in a couple of other tests)
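For context, a sketch of the suggested pattern, comparing the full parsed frame (values and dtypes) rather than just .values; the expected frame here is illustrative:

import pandas as pd
import pandas.util.testing as tm  # pandas.testing in modern versions
from io import StringIO

data = 'A|B|C\n1|2,334.01|5\n10|13|10'
expected = pd.DataFrame({'A': [1, 10], 'B': [2334.01, 13.0], 'C': [5, 10]})

df = pd.read_csv(StringIO(data), sep='|', thousands=',')
# assert_frame_equal checks dtypes as well, so a column that silently came
# back as object would now fail the test.
tm.assert_frame_equal(df, expected)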

@jreback (Contributor) commented Aug 21, 2013

ok....so that's fine...so you covered all the issues? docs ok? (I think that was the problem in that they said it should work but it didn't)

#2594 is independent of this? (as that fixes the PythonParser) IIRC?

@jreback (Contributor) commented Aug 21, 2013

sorry....

can you add some tests with dtype specified as well (as that's what #2594 looks at)

then can close that too

@jtratner (Contributor):

Please make sure that this works with something like:

1.250,01

Given that I think we changed the thousands separator parser to accept . for thousands sep.
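A sketch of that European-style case, assuming decimal=',' is passed alongside thousands='.' (the data is illustrative):

import pandas as pd
from io import StringIO

# 1.250,01 uses '.' as the thousands separator and ',' as the decimal mark.
data = 'A;B\n1;1.250,01\n2;2.500,50'
df = pd.read_csv(StringIO(data), sep=';', thousands='.', decimal=',')
print(df['B'])  # expect a float64 column: 1250.01 and 2500.5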

@guyrt (Contributor, Author) commented Aug 22, 2013

@jtratner I added numbers like 1.250,01 as a test case.

@jreback I've switched to assert_frame_equal. Please also confirm that my dtype test is correct (it's in commit 2cca2ba)

I'm pretty sure the docs were written for the PythonParser and did not take a thousands separator in floats into account. I've fixed both issues.

#2594 does not reference the PythonParser directly, so I would argue for closing that ticket as worded. However, the PythonParser has logic to explicitly reject separator and decimal characters other than the defaults. That is a PythonParser-specific limitation.

@guyrt (Contributor, Author) commented Aug 22, 2013

My last commit is purely a cleanup of some import redundancies.

@wesm (Member) commented Aug 22, 2013

Can you check briefly whether there's any performance penalty for adding the extra if statement? Maybe something as simple as parsing a file containing a single column with 10 million rows. It's probably small relative to the other costs.
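A rough sketch of the kind of check being asked for: generate a single-column file with 10 million formatted rows and time the parse (file name and sizes are illustrative):

import random
import time
import pandas as pd

# Write ~10 million numbers formatted with ',' as the thousands separator.
with open('bench.csv', 'w') as fh:
    fh.write('A\n')
    fh.writelines('{:,}\n'.format(random.randint(1000, 10000000))
                  for _ in range(10000000))

start = time.time()
pd.read_csv('bench.csv', thousands=',')
print('parse time: %.2fs' % (time.time() - start))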

@cancan101 (Contributor):

FWIW, you can write that line without adding a branch / if statement:

p +=  (tsep != '\0' & *p == tsep)

@guyrt (Contributor, Author) commented Aug 23, 2013

@wesm I put in @cpcloud's improvement and tested with a 10,000,000 by 5 file. Separator caused no statistically significant change in import time.

Example runtimes with hot file:

$ cat import_file.py
import pandas
a = pandas.read_csv('tmp.csv', thousands='|')



$ cat tmp.csv > /dev/null

$ workon pandas_dev
$ /usr/bin/time python import_file.py
11.52user 0.86system 0:12.42elapsed 99%CPU (0avgtext+0avgdata 4136960maxresident)k 0inputs+0outputs (0major+391461minor)pagefaults 0swaps

$ workon stable
$ /usr/bin/time python import_file.py
11.65user 0.91system 0:14.34elapsed 87%CPU (0avgtext+0avgdata 4136960maxresident)k 27128inputs+0outputs (55major+391443minor)pagefaults 0swaps

@jreback (Contributor) commented Aug 23, 2013

@guyrt ok...you need to rebase on master (prob just a release notes conflict), and squash the commits down a bit (whatever is reasonable)

@cancan101 (Contributor):

@guyrt I think you want to use single & rather than a double &&. I am a little fuzzy on this, but I think you do not want the short circuit logic in the double && since that will potentially introduce a branch.

@jreback (Contributor) commented Aug 23, 2013

@guyrt actually...why don't you add an additional vbench in vb_suite/parsers.py; just copy an example and modify it to also parse the thousands.....

@guyrt (Contributor, Author) commented Aug 23, 2013

@jreback There is already a vb_suite test using separators: https://github.com/pydata/pandas/blob/master/vb_suite/parser.py#L31
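For reference, a hypothetical vbench entry along those lines; the import path, variable names, and setup string are assumptions based on the usual vb_suite pattern, not code from this PR:

from datetime import datetime
from vbench.api import Benchmark

# setup writes a pipe-delimited file whose B column carries thousands separators
setup = """
from pandas import read_csv
with open('thousands.csv', 'w') as fh:
    fh.write('A|B|C\\n')
    fh.writelines('1|2,334.01|5\\n' for _ in range(50000))
"""

parser_thousands_float = Benchmark("read_csv('thousands.csv', sep='|', thousands=',')",
                                   setup,
                                   start_date=datetime(2013, 8, 1))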

@cancan101 Good catch.

@guyrt (Contributor, Author) commented Aug 23, 2013

@jreback I've squashed to three commits: main fix, cleanup in test, and changing && to &. Happy to squash further if you prefer.

Also rebased

@jreback (Contributor) commented Aug 23, 2013

@guyrt no that's fine....ping me when travis is done and will merge it

@guyrt (Contributor, Author) commented Aug 23, 2013

@jreback ping

jreback added a commit that referenced this pull request on Aug 23, 2013:
csv_import: Thousands separator works in floating point numbers

@jreback merged commit d536ff6 into pandas-dev:master on Aug 23, 2013
@jreback (Contributor) commented Aug 23, 2013

@guyrt great thanks!!!!

@jreback (Contributor) commented Aug 26, 2013

@guyrt @hayd is going to open an issue about an odd edge case where the thousands sep is in a date column (and getting handled where it should not)....can you take a look?

thanks
