MetaIssue: Pandas tries too hard to be smart #12865


Closed
DavidEscott opened this issue Apr 11, 2016 · 12 comments

@DavidEscott

...and ends up doing things wrong.

I had the following task: given a CSV file A and a set of identically structured files X, Y, Z (keys in the first two columns, values in the subsequent columns), I wanted to create a new file whose contents are MAX(A.column_val, X.column_val + Y.column_val + Z.column_val).

I had previously written a pure-python function to sort the individual columns and then perform a merge-join on them. Conceptually, however, this task is only a few lines of pandas code:

import pandas

summed = pandas.concat([X, Y, Z]).groupby(["key_col1", "key_col2"]).sum()
output = pandas.concat([A, summed]).groupby(["key_col1", "key_col2"]).max()
output.to_csv("something.csv")

So surely switching to pandas would make the code easier to understand and maintain. Right? Nope!!!

All of the following were issues I had to work around (a small repro of several of them follows the list):

  1. The results of sum()/max() have hierarchical indexes, so I must liberally sprinkle reset_index() at particular points in my code.
  2. Pandas tries to be smart and assumes that the empty string in my key_cols is a NULL string, converting it to NaN. This is annoying because now I have to add na_rep="" to my to_csv call.
  3. More significantly, pandas thinks that a NaN in a groupby column is reason to exclude the entire row... so now I have to add a bunch of lines calling fillna('') (and that doesn't modify the frame you call it on, but rather returns a copy, so you have to fillna('', inplace=True)).
  4. Because pandas doesn't support columns with both integer values and NA values, it has cast all my ints to floats, and my output now looks like 1.0, 2.0 instead of 1, 2. So I have to add a float_format directive to to_csv.
  5. Pandas also likes to output a row number, so index=False has to get added to to_csv.
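
A small repro of points 2-4, assuming the data is read with default read_csv settings (the file contents here are made up for illustration):

import io
import pandas

csv = "FIRST,LAST,VAL\nA,,1\nB,b,2\nC,c,\n"
df = pandas.read_csv(io.StringIO(csv))

print(df["LAST"].tolist())               # [nan, 'b', 'c'] -- the empty field became NaN (point 2)
print(df["VAL"].dtype)                   # float64 -- one missing value forced the ints to float (point 4)
print(df.groupby("LAST")["VAL"].sum())   # the A row, whose key is NaN, silently vanishes (point 3)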

The end result is that I'm not convinced the resulting pandas implementation is any more legible or maintainable. In three weeks, am I going to remember why I set index=False, or what the hell that means? Am I going to understand why the calls to reset_index are necessary, or what they do? Probably not, which makes me concerned about using this code in production.

@jreback
Contributor

jreback commented Apr 11, 2016

Well, if you post a copy-pastable example then you will probably get more help.

I am not sure why any of this is surprising. This is all quite well documented and as expected.

@DavidEscott
Author

This isn't a usage question; I eventually did figure out how to get pandas to format the output properly. This is more of a general comment on the usability of pandas. These default behaviors don't seem sensible to me as a user, and having to work around them makes pandas feel both harder to use and more fragile than one would like. The end result is that I probably won't use pandas at all.

@jreback
Contributor

jreback commented Apr 11, 2016

@DavidEscott well if you want to help people, being more specific with an example would help.

@jreback
Contributor

jreback commented Apr 11, 2016

So here are some responses to your points, keeping in mind that I don't have a good example of what you are doing, so I have to guess a bit:

  1. The results of sum()/max() have hierarchical indexes and so I must liberally sprinkle reset_index() at particular points in my code.

This is so that you can USE the hierarchical indexes; you can pass as_index=False to avoid it. But that is the rare case, hence the default behavior.
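
A minimal sketch of the as_index=False option mentioned above (the frame here is made up):

import pandas

df = pandas.DataFrame({"key_col1": ["A", "A"], "key_col2": ["x", "x"], "val": [1, 2]})

# Default: the group keys become the (hierarchical) index of the result.
print(df.groupby(["key_col1", "key_col2"]).sum())

# With as_index=False the keys stay as ordinary columns, so no reset_index is needed.
print(df.groupby(["key_col1", "key_col2"], as_index=False).sum())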

  2. Pandas tries to be smart and assumes that the empty string in my key_cols is a NULL string and converts it to NaN. This is annoying because now I have to add na_rep="" to my to_csv command.

This is standard practice. What else would an empty field mean, then?
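
For readers who do want empty fields read back as empty strings, read_csv has a switch for this on the read side (na_rep addresses the write side):

import io
import pandas

csv = "FIRST,LAST,VAL\nA,,1\n"

# Default: the empty field becomes NaN.
print(pandas.read_csv(io.StringIO(csv))["LAST"].tolist())                         # [nan]
# keep_default_na=False leaves empty fields as empty strings;
# na_values can still name explicit NA markers if needed.
print(pandas.read_csv(io.StringIO(csv), keep_default_na=False)["LAST"].tolist())  # ['']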

  3. More significantly though, pandas thinks that a NaN in a groupby column is reason to exclude the entire row... so now I have to add a bunch of lines to fillna('') (and that doesn't modify the frame you call it on, but rather returns a copy, so you have to fillna('', inplace=True)).

Exactly: you DO normally exclude these. Not dropping them is a feature which may eventually be added.
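
For later readers: the feature alluded to here did eventually land (see #3729, linked further down). A sketch of both the fillna workaround and the later option, assuming pandas 1.1+ for the dropna flag:

import numpy
import pandas

df = pandas.DataFrame({"key": ["a", numpy.nan], "val": [1, 2]})

# Historical default: the row whose key is NaN silently disappears.
print(df.groupby("key")["val"].sum())                       # only 'a'
# The fillna('') workaround keeps the row by making the key a real value.
print(df.fillna({"key": ""}).groupby("key")["val"].sum())
# pandas 1.1+ makes this an option directly:
print(df.groupby("key", dropna=False)["val"].sum())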

  4. Because pandas doesn't support columns with integer values and NA values it has cast all my ints to floats and my output now looks like 1.0, 2.0 instead of 1, 2. So I have to add a float_format directive to the to_csv.

This may be added at some point, but it is a fact of life ATM with numpy as the only backend.
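
This, too, was added later: pandas 0.24+ ships a nullable integer extension dtype. A sketch, not available when this thread was written:

import pandas

s = pandas.Series([1, 2, None])
print(s.dtype)            # float64 -- the missing value forces the cast
# The nullable "Int64" dtype holds integers alongside <NA>:
print(s.astype("Int64"))  # 1, 2, <NA>, dtype: Int64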

  5. Pandas also likes to output a row number, so index=False has to get added to to_csv.

This is outputting the default index. The point is to make to_csv and read_csv round-trippable.
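
A small illustration of that round trip (the frame and buffer are made up):

import io
import pandas

df = pandas.DataFrame({"a": [1, 2]}, index=["x", "y"])

buf = io.StringIO()
df.to_csv(buf)                             # the index is written by default...
buf.seek(0)
print(pandas.read_csv(buf, index_col=0))   # ...and index_col=0 restores it
print(df.to_csv(index=False))              # index=False drops it on the way out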

@DavidEscott
Author

I think your responses and mine will generally come down to this: what is a rare case for you is not a rare case for me, and vice versa.

Ultimately, for me, there are too many hoops to jump through to get sensible behavior, and it isn't worth it. I know of no other system that, when asked to aggregate values, drops rows because the key contains a NULL value. Granted, I predominantly work in SQL, but this seems very strange to me.

The use of an empty string to mean an empty string (and not a null string) is ambiguous, but again very common in the kinds of files and data I work with. It would be great if we never had to deal with CSV and its inability to express NULL, but that isn't likely to happen soon.

So if I were to make concrete suggestions:

  1. Allow me to specify that "" is not null for string columns (it has to be for numeric columns). [Perhaps read_table could take an argument "allow_blank_strings".]
  2. Group NaNs together in min/max/sum.
  3. Perhaps implement a to_table function that is a closer inverse of read_table, because to_csv isn't (a sketch of what that might look like follows below).

As it is, the default behavior is not "standard csv" (whatever that may mean) but rather "pandas's version of csv", and that is very confusing to new users. I do not expect csv files to have row numbers in the first column. I do not expect multi-line headers, etc. I don't care if some data is lost, but I really need to be able to get the data out with as little modification/metadata as possible. I just want the bare output of column headers and contents.
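
A sketch of what suggestion 3 might look like; to_table and its behavior are hypothetical, not pandas API:

import pandas

def to_table(df, path, sep="\t"):
    """Hypothetical inverse of read_table: bare column headers and contents,
    no index column, empty string for missing values."""
    df.to_csv(path, sep=sep, index=False, na_rep="")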

@jreback
Contributor

jreback commented Apr 11, 2016

pls show a specific example, your code above is not very idiomatic.

@DavidEscott
Author

I'm not sure what you consider non-idiomatic about this. All I am doing is ensuring that the file lhs obeys the restriction lhs.column >= SUM(rhs.column for rhs in rhs_list) for all columns in lhs except FIRST_NAME and LAST_NAME:

import pandas

lhs_frame = pandas.read_table(infile)
lhs_frame.FIRST_NAME.fillna('', inplace=True)
lhs_frame.LAST_NAME.fillna('', inplace=True)
rhs_frames = []
for rhs_file in rhs_list:   # note: must not shadow infile, which names the output below
    df = pandas.read_table(rhs_file)
    df.FIRST_NAME.fillna('', inplace=True)
    df.LAST_NAME.fillna('', inplace=True)
    rhs_frames.append(df)
# sum of RHS
rhs = (pandas.concat(rhs_frames).groupby(["FIRST_NAME", "LAST_NAME"])
                                .sum().reset_index())
# max of lhs and rhs
patched = (pandas.concat([lhs_frame, rhs]).groupby(["FIRST_NAME", "LAST_NAME"])
                 .max().reset_index())
patched.to_csv(infile + ".patched", sep="\t", na_rep="", index=False, float_format="%.0f")

@TomAugspurger
Contributor

@DavidEscott why are you calling reset_index() instead of using "FIRST_NAME", "LAST_NAME" as your index in lhs? It would be much easier to help if you provided a copy-pastable example.

@DavidEscott
Author

@TomAugspurger I have to call reset_index otherwise I get errors or nonsense:

import pandas

lhs = pandas.DataFrame([["A", "B", 1]], columns=["FIRST", "SECOND", "VAL"])
rhs = pandas.DataFrame([["A", "B", 1], ["A", "B", 2], ["C", "D", 3]], columns=["FIRST", "SECOND", "VAL"])
summed = rhs.groupby(["FIRST", "SECOND"]).sum()
union = pandas.concat([lhs, summed])
# this union object is really strange
patched = union.groupby(["FIRST", "SECOND"]).max()

The end result I get from that is wrong. It tells me the max of the lhs and the summed rhs is A,B,1 rather than A,B,3\nC,D,3 as it should be: it completely ignored the rhs. While pandas allows me to concatenate the grouped result and a regular dataframe, it doesn't do anything sensible with such objects, hence the reset_index to get back to a normal dataframe.

I'm not asking for how to make pandas do this correctly. After many hours I did eventually figure it out. I'm pointing out how counter-intuitively pandas is behaving.

@TomAugspurger
Contributor

I'm not asking for how to make pandas do this correctly

Gotcha. In case anyone else comes across this,

lhs = lhs.set_index(['FIRST', 'SECOND'])  # you would probably just do this in read_csv
(pd.concat([lhs, rhs.groupby(['FIRST', 'SECOND']).sum()])
    .groupby(level=['FIRST', 'SECOND'])
    .max()
)

I'm pointing out how counter-intuitively pandas is behaving.

The tough part with that is that what's intuitive is in the eye of the beholder. The solution is consistent behavior and good docs. Let us know if you find any areas where those are lacking.

@jorisvandenbossche
Member

FYI, the issue on not dropping NaNs in groupby keys is here: #3729 (and there is also a PR in progress)

@Tim-K-DFW

Tim-K-DFW commented Feb 4, 2021

Totally agree. Pandas thinks there's a certain way of doing things, and whatever you decide to do with it, you always need to check what that way is or risk a lot of aggravation and wasted time. And after you do figure it out, from that point on you always need to add some boilerplate code to work around it. Yes, everything is documented, but the issue is that there are too many extremely opinionated (some may say unnatural) defaults and artifacts. Here are some of my favorites:

  1. by far, the most annoying thing is the default NA handling (keep_default_na and friends) in read_csv. It would make much more sense to flip the default: read all the N/A, NA, etc. stuff as strings unless told otherwise, maybe show a warning message, but don't assume anything.

  2. another nice one is the preservation of indexes on filtering, so once you filter and then try to get something by loc, you're in for a surprise (see the sketch below). The whole index system is perplexing. It's like having a shadow "axis" with a mind of its own that you always have to keep in mind and reset (not forgetting drop=True, of course!) every time you do something. Indexes are all cost (time and attention) and no benefit. So, second suggestion: make indexing optional, or at least minimally intrusive.
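
A small sketch of the loc surprise after filtering (the frame here is made up):

import pandas

df = pandas.DataFrame({"val": [10, 20, 30]})
kept = df[df["val"] > 10]

print(kept.index.tolist())                  # [1, 2] -- the original labels survive the filter
# kept.loc[0]                               # KeyError: label 0 was filtered away
print(kept.iloc[0])                         # positional access still works
print(kept.reset_index(drop=True).loc[0])   # or renumber explicitly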

In general, I sorely miss R and its simplicity and power. It does exactly what it needs to, not a bit less, not a bit more; it doesn't get in your way and doesn't impose its artifacts and arbitrary defaults. Anything you do in pandas is longer and more complicated than in R. Part of that is due to Python and outside of your control, of course, but a very large part is due to all the defaults and assumptions pandas makes about what the user wants to do and how.

With all that said, I genuinely appreciate all you guys' hard work maintaining and developing this project. I'm not in any position to complain, given that it's free and I haven't contributed anything, yet I use it 10 hours a day. I just hope you'll consider this and other similar feedback. Thank you very much for what you do! 🙏🏽
