Skip to content

BUG: Parse trailing NaN values for the Python parser #13320

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented May 29, 2016

Fixes bug in which the Python parser failed to detect trailing NaN values in rows

@gfyoung gfyoung force-pushed the trailing-nan-conversion branch 3 times, most recently from f9db081 to 7515029 Compare May 29, 2016 23:34
@codecov-io
Copy link

codecov-io commented May 30, 2016

Current coverage is 84.22%

Merging #13320 into master will decrease coverage by <.01%

@@             master     #13320   diff @@
==========================================
  Files           138        138          
  Lines         50710      50667    -43   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          42712      42672    -40   
+ Misses         7998       7995     -3   
  Partials          0          0          

Powered by Codecov. Last updated by 8bbd2bc...7515029

@jreback jreback added Bug IO CSV read_csv, to_csv labels May 30, 2016
@jreback jreback added this to the 0.18.2 milestone May 30, 2016
@@ -296,6 +296,7 @@ Bug Fixes


- Bug in ``pd.read_csv()`` with ``engine='python'`` in which infinities of mixed-case forms were not being interpreted properly (:issue:`13274`)
- Bug in ``pd.read_csv()`` with ``engine='python'`` in which trailing NaN values were not being parsed (:issue:`13320`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double backticks around NaN

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

@jreback
Copy link
Contributor

jreback commented May 30, 2016

rebase after #13325

@jreback
Copy link
Contributor

jreback commented May 30, 2016

ok merged #13325 so go ahead and rebase this and na_filter one

@gfyoung
Copy link
Member Author

gfyoung commented May 30, 2016

Sounds good. Don't have access to my computer at the moment, but will do ASAP.

@gfyoung gfyoung force-pushed the trailing-nan-conversion branch 7 times, most recently from e8704ce to 7d72579 Compare May 30, 2016 23:52
@@ -1132,15 +1132,15 @@ def map_infer(ndarray arr, object f, bint convert=1):
return result


def to_object_array(list rows):
def to_object_array(list rows, int width=0):
cdef:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string. what exactly does width do?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc-string added

@gfyoung gfyoung force-pushed the trailing-nan-conversion branch 2 times, most recently from 20462b5 to f6f7cc2 Compare May 31, 2016 17:46
cdef:
Py_ssize_t i, j, n, k, tmp
ndarray[object, ndim=2] result
list row

n = len(rows)

k = 0
k = width
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I would make width=None the default. If its specified, then you can skip the first part of the check which is just to figure out k.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? That's just asking for data loss (i.e. when k < len(row)). Having width=0 as default also helps to keep the checking largely intact from before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you misunderstand. This ONLY matters if width is NOT NONE. If its None then you compute as now. The point is you a) you are not testing if width exceeds the width of the structure or what happens if its less. (so I guess you can mitigate to mean at least themax(width or 0, max_found_width)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The former case is the whole point of this PR.
  2. You didn't address the data loss issue when k < len(row)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well what should you do here? e.g. you are specifying width, which could be less < len(row).?

Copy link
Member Author

@gfyoung gfyoung May 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the doc-string I added would illuminate the meaning of width and the purpose it serves?

@gfyoung gfyoung force-pushed the trailing-nan-conversion branch from f6f7cc2 to 590874d Compare May 31, 2016 21:52
@gfyoung
Copy link
Member Author

gfyoung commented Jun 1, 2016

@jreback : Rebased and Travis is happy. Ready to merge if there are no other concerns.

@jreback jreback closed this in 45bab82 Jun 1, 2016
@jreback
Copy link
Contributor

jreback commented Jun 1, 2016

thanks!

@gfyoung gfyoung deleted the trailing-nan-conversion branch June 1, 2016 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants