read_fwf with passing converters and na_values at the same time #15144

gitporst · 2017-01-17T13:35:49Z

Duplicate of #13302

When reading via read_fwf and passing both na_values and converters at the same time, na_value detection does not work:

Code Sample, a copy-pastable example if possible

Assume a fwf-style text file like this:

A B C
10 -1 10
10 10 10

In: read_fwf(file_path, converters={'B':lambda x: int(x)/10.}, na_values={'B':-1})
Out:
>     A    B   C
> 0  10 -0.1  10
> 1  10  1.0  10

Expected Output

>     A    B   C
> 0  10  NaN  10
> 1  10  1.0  10

It would be nice if this could be handled behind the scenes.
In pandas.io.parsers._clean_na_values, float representations of the na_values are produced by _floatify_na_values. Maybe passing the converters to this function might be a solution?

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 2.7.10.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 23 Stepping 10, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.19.2
(...)

The text was updated successfully, but these errors were encountered:

jreback · 2017-01-17T14:48:46Z

cc @gfyoung

gfyoung · 2017-01-18T05:30:55Z

So this is in fact a point of discussion for the API (and not a bug per se):

When should na_values be applied?

Currently, as evidenced by the code (and personal testing) they are applied after the converters are applied. This is why there are no NaN values because post-conversion, there are no more NaN values technically.

I don't see the harm in choosing one or the other, but we just need agreement and consistency (we have the latter currently). Regardless of how this issue goes, documentation updates will be needed to explain when na_values are applied and to give an example in io.rst

jorisvandenbossche · 2017-01-18T09:20:18Z

I we choose, I would find it more logical that na_values is applied first.

gfyoung · 2017-01-18T09:23:56Z

@jorisvandenbossche : Very well. So we have two people (@gitporst and @jorisvandenbossche) in favor of applying the na_values before the converters are applied. Is anyone willing to play devil's advocate for applying after the converters? Otherwise, I think we'll go with trying to apply before (I say trying because who knows what tests will break).

gfyoung · 2017-01-22T04:04:23Z

@jorisvandenbossche : Argh...I forgot. We have this ugly bug that I reported in #13302. We might need to put a hold on resolving this and save it for a longer-term revamp of read_csv in 1.0. What do you think?

jreback · 2017-01-22T15:04:55Z

@gfyoung as long as we are consistent with the approach I think its fine (IOW apply na_values first).

to be honest, converters is just trying to push more functionaility into read_csv, when you really want to do the opposite. IOW, do conversions after, where the tooling is much more idiomatic, rather than trying to shove everything into passed conversion functions.

not sure what this means

save it for a longer-term revamp of read_csv in 1.0.

what exactly are you referring? (do you have some plans for this)?

gfyoung · 2017-01-22T19:45:19Z

@jreback : I was just saying that patching converters is not easy in the short-term because of the bug I pointed and the inconsistency I pointed out in the issue.

However, at least we seem to have agreement that converters should come after na_values. For bookkeeping purposes, I would be in favor of closing this and then moving some of this information into my issue (#13302).

jorisvandenbossche · 2017-01-23T20:23:17Z

Yes, this indeeds seems like a duplicate of #13302. @gfyoung can you move the necessary information? (just the agreement that na_values should be applied before converters I think?)

I am also not sure why solving this or the bug in #13302 would need a 1.0. Do you mean that the logical way to solve it would have a lot of other consequences?

gfyoung · 2017-01-24T04:29:36Z

@jorisvandenbossche : The consequences of changing it would be too big I think. I gave it a try before a couple of months ago, and things just broke. Though if someone wants to prove me wrong, be my guest! 😄

jreback added the IO CSV read_csv, to_csv label Jan 17, 2017

jorisvandenbossche closed this as completed Jan 23, 2017

jorisvandenbossche added the Duplicate Report Duplicate issue or pull request label Jan 23, 2017

jorisvandenbossche added this to the No action milestone Jan 23, 2017

gfyoung mentioned this issue Jan 24, 2017

Inconsistent Handling of na_values and converters in read_csv #13302

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read_fwf with passing converters and na_values at the same time #15144

read_fwf with passing converters and na_values at the same time #15144

gitporst commented Jan 17, 2017 •

edited by jorisvandenbossche

Loading

jreback commented Jan 17, 2017

gfyoung commented Jan 18, 2017 •

edited

Loading

jorisvandenbossche commented Jan 18, 2017

gfyoung commented Jan 18, 2017

gfyoung commented Jan 22, 2017 •

edited

Loading

jreback commented Jan 22, 2017

gfyoung commented Jan 22, 2017

jorisvandenbossche commented Jan 23, 2017

gfyoung commented Jan 24, 2017 •

edited

Loading

read_fwf with passing converters and na_values at the same time #15144

read_fwf with passing converters and na_values at the same time #15144

Comments

gitporst commented Jan 17, 2017 • edited by jorisvandenbossche Loading

Code Sample, a copy-pastable example if possible

Expected Output

Output of pd.show_versions()

jreback commented Jan 17, 2017

gfyoung commented Jan 18, 2017 • edited Loading

jorisvandenbossche commented Jan 18, 2017

gfyoung commented Jan 18, 2017

gfyoung commented Jan 22, 2017 • edited Loading

jreback commented Jan 22, 2017

gfyoung commented Jan 22, 2017

jorisvandenbossche commented Jan 23, 2017

gfyoung commented Jan 24, 2017 • edited Loading

gitporst commented Jan 17, 2017 •

edited by jorisvandenbossche

Loading

Output of `pd.show_versions()`

gfyoung commented Jan 18, 2017 •

edited

Loading

gfyoung commented Jan 22, 2017 •

edited

Loading

gfyoung commented Jan 24, 2017 •

edited

Loading