Skip to content

read_fwf with passing converters and na_values at the same time #15144

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gitporst opened this issue Jan 17, 2017 · 9 comments
Closed

read_fwf with passing converters and na_values at the same time #15144

gitporst opened this issue Jan 17, 2017 · 9 comments
Labels
Duplicate Report Duplicate issue or pull request IO CSV read_csv, to_csv

Comments

@gitporst
Copy link

gitporst commented Jan 17, 2017

Duplicate of #13302


When reading via read_fwf and passing both na_values and converters at the same time, na_value detection does not work:

Code Sample, a copy-pastable example if possible

Assume a fwf-style text file like this:

A B C
10 -1 10
10 10 10

In: read_fwf(file_path, converters={'B':lambda x: int(x)/10.}, na_values={'B':-1})
Out:
>     A    B   C
> 0  10 -0.1  10
> 1  10  1.0  10

Expected Output

>     A    B   C
> 0  10  NaN  10
> 1  10  1.0  10

It would be nice if this could be handled behind the scenes.
In pandas.io.parsers._clean_na_values, float representations of the na_values are produced by _floatify_na_values. Maybe passing the converters to this function might be a solution?

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.10.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 23 Stepping 10, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.19.2
(...)

@jreback
Copy link
Contributor

jreback commented Jan 17, 2017

cc @gfyoung

@jreback jreback added the IO CSV read_csv, to_csv label Jan 17, 2017
@gfyoung
Copy link
Member

gfyoung commented Jan 18, 2017

So this is in fact a point of discussion for the API (and not a bug per se):

When should na_values be applied?

Currently, as evidenced by the code (and personal testing) they are applied after the converters are applied. This is why there are no NaN values because post-conversion, there are no more NaN values technically.

I don't see the harm in choosing one or the other, but we just need agreement and consistency (we have the latter currently). Regardless of how this issue goes, documentation updates will be needed to explain when na_values are applied and to give an example in io.rst

@jorisvandenbossche
Copy link
Member

I we choose, I would find it more logical that na_values is applied first.

@gfyoung
Copy link
Member

gfyoung commented Jan 18, 2017

@jorisvandenbossche : Very well. So we have two people (@gitporst and @jorisvandenbossche) in favor of applying the na_values before the converters are applied. Is anyone willing to play devil's advocate for applying after the converters? Otherwise, I think we'll go with trying to apply before (I say trying because who knows what tests will break).

@gfyoung
Copy link
Member

gfyoung commented Jan 22, 2017

@jorisvandenbossche : Argh...I forgot. We have this ugly bug that I reported in #13302. We might need to put a hold on resolving this and save it for a longer-term revamp of read_csv in 1.0. What do you think?

@jreback
Copy link
Contributor

jreback commented Jan 22, 2017

@gfyoung as long as we are consistent with the approach I think its fine (IOW apply na_values first).

to be honest, converters is just trying to push more functionaility into read_csv, when you really want to do the opposite. IOW, do conversions after, where the tooling is much more idiomatic, rather than trying to shove everything into passed conversion functions.

not sure what this means

save it for a longer-term revamp of read_csv in 1.0.

what exactly are you referring? (do you have some plans for this)?

@gfyoung
Copy link
Member

gfyoung commented Jan 22, 2017

@jreback : I was just saying that patching converters is not easy in the short-term because of the bug I pointed and the inconsistency I pointed out in the issue.

However, at least we seem to have agreement that converters should come after na_values. For bookkeeping purposes, I would be in favor of closing this and then moving some of this information into my issue (#13302).

@jorisvandenbossche
Copy link
Member

Yes, this indeeds seems like a duplicate of #13302. @gfyoung can you move the necessary information? (just the agreement that na_values should be applied before converters I think?)

I am also not sure why solving this or the bug in #13302 would need a 1.0. Do you mean that the logical way to solve it would have a lot of other consequences?

@jorisvandenbossche jorisvandenbossche added the Duplicate Report Duplicate issue or pull request label Jan 23, 2017
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Jan 23, 2017
@gfyoung
Copy link
Member

gfyoung commented Jan 24, 2017

@jorisvandenbossche : The consequences of changing it would be too big I think. I gave it a try before a couple of months ago, and things just broke. Though if someone wants to prove me wrong, be my guest! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Duplicate Report Duplicate issue or pull request IO CSV read_csv, to_csv
Projects
None yet
Development

No branches or pull requests

4 participants