Skip to content

BUG, ENH: Improve infinity parsing for read_csv #13274

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented May 24, 2016

  1. Allow mixed-case infinity strings for the Python engine

Bug was traced back via lib.maybe_convert_numeric to the floatify function in pandas/src/parse_helper.h. In addition to correcting the bug and adding tests for it, this PR also moves the test_inf_parsing test from c_parser_only.py to common.py in the pandas/io/tests/parser dir.

  1. Interpret +inf as positive infinity for both engines

float('+inf') in Python is interpreted as positive infinity, so we should allow it too in parsing.

@gfyoung gfyoung force-pushed the floatify-infinity-parsing branch from 84f91d4 to e9d3ac9 Compare May 24, 2016 23:25
@@ -252,3 +252,4 @@ Bug Fixes
- Bug in ``groupby`` where ``apply`` returns different result depending on whether first result is ``None`` or not (:issue:`12824`)

- Bug in ``Categorical.remove_unused_categories()`` changes ``.codes`` dtype to platform int (:issue:`13261`)
- Bug in ``pd.lib.maybe_convert_numeric`` in which infinities of mixed-case forms were not being interpreted properly (:issue:`13274`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

give a more user friendly example (the user doesn't see this function as its internal)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I'll rewrite it to refer to read_csv and the Python engine.

@jreback
Copy link
Contributor

jreback commented May 24, 2016

ok looks reasonable. pls run parser asv's just to be sure (I doubt we have something that specifically has infs in it; you can add if you want).

@jreback jreback added IO CSV read_csv, to_csv Bug labels May 24, 2016
@jreback
Copy link
Contributor

jreback commented May 24, 2016

btw, not averse to signed inf (e.g. +inf), if its in the IEEE spec (not sure)

@gfyoung
Copy link
Member Author

gfyoung commented May 25, 2016

  1. Not sure if it's necessary to add a benchmark for this. A couple of string checks shouldn't have much effect, so unless there is a strong objection, I'll leave it for now.

  2. I can't actually run benchmarks at the moment. There seems to be an issue with linux-64 channels in Anaconda for tables and libpython that prevents me from running the benchmarks. Can you confirm on a different architecture?

  3. Can't confirm if +inf is in IEEE standard (Googled "IEEE infinity," and that didn't turn up anything). However, float('+inf') does work in Python, so I guess we should accept it. However, it is not supported in the CParser either right now, so I'll need to see where infinity parsing is done there.

@jreback
Copy link
Contributor

jreback commented May 25, 2016

pls just run the current benchmarks - don't need a new one

you don't need libpython on Linux

@gfyoung
Copy link
Member Author

gfyoung commented May 25, 2016

I'm just running asv continuous master floatify-infinity-parsing -b io_bench, and that's what it's complaining about, nothing special here, right?

@gfyoung gfyoung force-pushed the floatify-infinity-parsing branch from e9d3ac9 to 784f07a Compare May 25, 2016 00:46
@gfyoung gfyoung changed the title Fix infinity parsing in parse_helper.floatify BUG, ENH: Improve infinity parsing for read_csv May 25, 2016
@gfyoung gfyoung force-pushed the floatify-infinity-parsing branch from 784f07a to 8455f53 Compare May 25, 2016 00:50
@gfyoung
Copy link
Member Author

gfyoung commented May 25, 2016

@jreback : Added +inf parsing to both parsers, and Travis is still giving the green light. While I cannot run the benchmarks because of that Anaconda issue, I wouldn't be surprised if they remain relatively the same since my changes shouldn't down anything too much. Ready to merge if there are no other concerns.

strcpy(inf_str, data);
for ( ; *inf_ptr; ++inf_ptr) *inf_ptr = tolower(*inf_ptr);

if (0 == strcmp(inf_str, "-inf")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why aren't you using strcasecomp here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! I just do include "headers/portable.h" and then I can call it? That would indeed save me pain of this char copying.

Copy link
Contributor

@jreback jreback May 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't do C very much :> but I think yes; though its included the same way as strcmp

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, fair enough. Will give it a shot and see. Otherwise, I'll ask Google.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also not sure if this will be more clear to make a if/else if for len 3 and another for len 4 (I 'c' lots of c code doing things like this)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Can break up for clarity.

@gfyoung gfyoung force-pushed the floatify-infinity-parsing branch from 8455f53 to 9d60901 Compare May 25, 2016 14:56
1) Python infinity parsing bug

Initially an attempt to fix a Python parsing bug of
mixed-case infinity strings, the bug was traced back
via lib.maybe_convert_numeric to the 'floatify' method
in pandas/src/parse_helper.h.

In addition to correcting the bug and adding tests for
it, this commit also moves the infinity-parsing test
from CParser-only to common.

2) Interpret '+inf' as positive infinity

This is consistent with the Python API, where
float('+inf') is interpreted as positive infinity.
@gfyoung gfyoung force-pushed the floatify-infinity-parsing branch from 9d60901 to f37b130 Compare May 25, 2016 14:59
@jreback jreback added this to the 0.18.2 milestone May 25, 2016
@jreback
Copy link
Contributor

jreback commented May 25, 2016

lgtm, ping on green. we don't have an associated issue right?

@gfyoung
Copy link
Member Author

gfyoung commented May 25, 2016

Nope. Discovered it when I was refactoring test_parsers, and someone had inadvertently put read_csv instead of self.read_csv for that test I moved in this PR.

@gfyoung
Copy link
Member Author

gfyoung commented May 25, 2016

@jreback : Travis giving the green light . Ready to merge if there is nothing else.

@jreback jreback closed this in da5fc17 May 25, 2016
@jreback
Copy link
Contributor

jreback commented May 25, 2016

nice PR @gfyoung

only comment is that don't put Bug Fixes at the very end, causes conflicts, stick them somewhat randomly in the spaces within the current list (that's why I leave the space)

@gfyoung gfyoung deleted the floatify-infinity-parsing branch May 25, 2016 19:26
@gfyoung
Copy link
Member Author

gfyoung commented May 25, 2016

@jreback : Ah, fair point. Bad habit on my part. Thanks for the reminder!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants