csv_import: Thousands separator works in floating point numbers #4598
Conversation
can you do the comparisons with full data frames (rather than just the values)? is there a test for a thousands sep that has len > 1 (does it raise?)? what does the decimal separator do for the same issue? what about a thousands sep that is None (or '')? None is the default so that shouldn't matter I guess. also what about multiple seps in the data, e.g. 2,,334? just trying to bulletproof a bit... great PR btw!
Done
It does raise a ValueError, though the message was wrong. I've added tests for both decimal and thousands.
This also raises a ValueError. Test added.
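For reference, a minimal sketch of the kinds of inputs these tests guard against (hypothetical data and column names; the PR's actual tests live in the diff):

```python
from io import StringIO

import pandas as pd
import pytest

data = "A|B\n1|2,334\n2|13"

# A thousands separator longer than one character raises ValueError...
with pytest.raises(ValueError):
    pd.read_csv(StringIO(data), sep="|", thousands=",,")

# ...and the same length-1 restriction applies to the decimal marker
with pytest.raises(ValueError):
    pd.read_csv(StringIO(data), sep="|", decimal="..")
```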
Your last comment about multiple separators in the data (e.g. 2,,334):
Using ints, it looks like we're ignoring all of the thousands separators. However, there is a bug with leading thousands separators: a file whose B column has a leading separator imports B as an object (a reconstruction follows below).
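A hypothetical reconstruction of the integer examples being described (exact dtypes depend on the pandas version):

```python
from io import StringIO

import pandas as pd

# Embedded separators are stripped, so B parses as an integer column
good = "A|B\n1|2,,334\n2|13"
print(pd.read_csv(StringIO(good), sep="|", thousands=",").dtypes)

# A leading separator defeats the numeric tokenizer; B falls back to object
bad = "A|B\n1|,,334\n2|13"
print(pd.read_csv(StringIO(bad), sep="|", thousands=",").dtypes)
```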
what about the above example with floats? do you know what the problem is with leading ',,'?
With floats it imports as strings. This PR fixes the fact that the thousands separator was being ignored and tripping up the parser. It turns out the leading-separator behavior is intentional, and probably for good reason.
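As a sketch of the fix itself: with floats, a thousands separator in the value should now reach the float tokenizer instead of tripping it (hypothetical data, not the PR's actual test):

```python
from io import StringIO

import pandas as pd

data = "A|B\n1|1,234.56\n2|2,000.01"
df = pd.read_csv(StringIO(data), sep="|", thousands=",")
print(df["B"].dtype)    # float64 after this fix; object (strings) before it
print(df["B"].iloc[0])  # 1234.56
```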
The inline review below refers to this hunk of the new test:

```
})

df = self.read_csv(StringIO(data), sep='|', thousands=',')
assert_almost_equal(df.values, expected)
```
i would use assert_frame_equal here (and I think in a couple of other tests)
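A sketch of the suggested change (hypothetical data; the import path shown is the modern one, whereas pandas of this era used pandas.util.testing):

```python
from io import StringIO

import pandas as pd
from pandas.testing import assert_frame_equal

data = "A|B\n1|2,334.01\n2|13"
df = pd.read_csv(StringIO(data), sep="|", thousands=",")
expected = pd.DataFrame({"A": [1, 2], "B": [2334.01, 13.0]})

# Unlike assert_almost_equal on .values, this also checks the index,
# column labels, and dtypes
assert_frame_equal(df, expected)
```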
ok....so that's fine...so you covered all the issues? docs ok? (I think that was the problem, in that they said it should work but it didn't) #2594 is independent of this (as that fixes the PythonParser), IIRC?
sorry.... can you add some tests with dtype specified as well (as that's what #2594 looks at)? then we can close that too
Please make sure that this works with something like: 1.250,01. Given that I think we changed the thousands separator parser to accept …
@jtratner I added numbers like 1.250,01 as a test case.

@jreback I've switched to assert_frame_equal. Please also confirm that my dtype test is correct (it's in commit 2cca2ba).

I'm pretty sure the docs were written for the PythonParser and did not take a thousands separator in floats into account. I've fixed both issues. #2594 does not reference the PythonParser directly, so I would argue for closing that ticket as worded. However, the PythonParser has logic to explicitly reject separator and decimal characters other than the defaults; that is a PythonParser-specific limitation.
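For reference, hedged sketches of the two cases just discussed, European-style numbers and an explicit dtype (hypothetical data and column names, not the PR's actual tests):

```python
from io import StringIO

import numpy as np
import pandas as pd

# European convention: '.' as the thousands separator, ',' as the decimal point
eu = "A|B\n1|1.250,01\n2|2.000,50"
df = pd.read_csv(StringIO(eu), sep="|", thousands=".", decimal=",")
print(df["B"].tolist())  # [1250.01, 2000.5]

# Thousands separator combined with an explicit dtype (the #2594 angle)
us = "A|B\n1|2,334.1\n2|13"
df = pd.read_csv(StringIO(us), sep="|", thousands=",",
                 dtype={"A": np.int64, "B": np.float64})
print(df.dtypes)
```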
My last commit is purely a cleanup of some import redundancies.
Can you check briefly whether there's any performance penalty for adding the extra if statement? Maybe something as simple as parsing a file containing a single column with 10 million rows. It's probably small relative to the other expenses.
FWIW, you can write that line without adding a branch / if statement:
@wesm I put in @cpcloud's improvement and tested with a 10,000,000 by 5 file. The separator caused no statistically significant change in import time. Example runtimes with a hot file cache:
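A rough sketch of the kind of timing check described above (hypothetical sizes and data; real numbers will vary by machine and pandas version):

```python
import time
from io import StringIO

import pandas as pd

rows = 1_000_000  # scale toward 10_000_000 to mirror the test above

sep_csv = "A|B|C|D|E\n" + "1,234|5,678.9|2|3|4\n" * rows
t0 = time.time()
pd.read_csv(StringIO(sep_csv), sep="|", thousands=",")
print("with thousands kwarg:", time.time() - t0)

plain_csv = "A|B|C|D|E\n" + "1234|5678.9|2|3|4\n" * rows
t0 = time.time()
pd.read_csv(StringIO(plain_csv), sep="|")
print("no separators:       ", time.time() - t0)
```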
@guyrt ok...you need to rebase on master (prob just a release notes conflict), and squash the commits down a bit (whatever is reasonable)
@guyrt I think you want to use a single & rather than a double &&. I am a little fuzzy on this, but I think you do not want the short-circuit logic of &&, since that will potentially introduce a branch.
@guyrt actually...why don't you add an additional vbench in vb_suite/parser.py?
@jreback There is already a vb_suite test using separators: https://github.com/pydata/pandas/blob/master/vb_suite/parser.py#L31

@cancan101 Good catch.
Adds support for the thousands character in csv parser for floats. Updated docs to reflect bug fix.
@jreback I've squashed to three commits: main fix, cleanup in test, and changing && to &. Happy to squash further if you prefer. Also rebased.
@guyrt no that's fine....ping me when travis is done and will merge it

@jreback ping
@guyrt great thanks!!!!
Closes issue #4322
Adds support for the thousands character in csv parser for floats.
Previously, the thousands separator character did not pass into the core floating point number parsing algorithm:
https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L1861
I added an argument to this function and provided a test. Now, in a file like the one sketched below, column B imports as float type.
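A file of the sort the description alludes to might look like this (hypothetical reconstruction; the original example was not preserved on this page):

```python
from io import StringIO

import pandas as pd

data = "A|B\n1|2,334.01\n2|13\n"
df = pd.read_csv(StringIO(data), sep="|", thousands=",")
print(df.dtypes)  # B is float64 now that the separator reaches the float parser
```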
Also related to issue #2594