MetaIssue: Pandas tries too hard to be smart #12865
well if you post a copy-pastable example then you will prob get more help. I am not sure why any of this is surprising. This is all quite well documented and as expected.
This isn't a usage question. I eventually did figure out how to get Pandas to properly format output. This is more of a general comment on the usability of pandas. These default behaviors don't seem sensible to me as a user, and having to work around them makes pandas seem both harder to use and more fragile than one would like. The end result being that I probably won't use pandas at all.
@DavidEscott well if you want to help people, being more specific with an example would help.
So here are some responses to your points, keeping in mind that I don't have a good example of what you are doing, so I have to guess a bit:
- The idea is to USE the hierarchical indexes; you can pass `as_index=False` to `groupby` if you don't want them.
- This is standard practice. What else would a null string mean then?
- Exactly, you DO normally exclude these. This is a feature which may eventually be added.
- This may be added at some point. But it is a fact of life ATM with numpy as the only backend.
- This is outputting the default index. This is to make the frame round-trippable; pass `index=False` if you don't want it.
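A minimal illustration of the two knobs mentioned above (`as_index` on `groupby` and `index` on `to_csv`); the toy data is not from the thread:

```python
import pandas as pd

df = pd.DataFrame({"k1": ["a", "a", "b"],
                   "k2": ["x", "y", "x"],
                   "v": [1, 2, 3]})

# as_index=False keeps the group keys as ordinary columns instead of
# building a hierarchical (MultiIndex) result
flat = df.groupby(["k1", "k2"], as_index=False).sum()

# index=False suppresses the row-number column in the CSV output
flat.to_csv("out.csv", index=False)
```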
I think your response and my response will generally come down to what is a rare case for you is not a rare case for me, and vice versa. Ultimately, for me it comes down to there being too many hoops to jump through to get sensible behavior; it isn't worth it. I know of no other system that, when asked to aggregate values, drops rows because the key contains a NULL value. Now, I predominantly work in SQL, but this seems very strange to me. The use of an empty string to mean an empty string (and not a null string) is ambiguous, but again very common in the kinds of files and data I work with. It would be great if we never had to deal with CSV and its inability to express NULL, but that isn't likely to happen soon. So if I were to make a concrete suggestion:
As it is, the default behavior is not "standard csv" (whatever that may mean) but rather "pandas' version of csv", and that is very confusing to new users. I do not expect csv files to have row numbers in the first column. I do not expect multi-line headers, etc. I don't care if some data is lost, but I really need to be able to get the data out with as little modification/metadata as possible. I just want the bare output of column headers and contents.
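A minimal sketch of the NULL-key behavior under discussion (toy data, not from the thread):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", np.nan], "val": [1, 2]})

# rows whose group key is NaN are silently dropped by default
print(df.groupby("key").sum())   # only the "a" row survives

# the workaround used in this thread: map NaN keys to empty strings first
df["key"] = df["key"].fillna("")
print(df.groupby("key").sum())   # both rows are kept
```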
pls show a specific example, your code above is not very idiomatic.
I'm not sure what you consider to not be idiomatic about this. All I am doing is ensuring that the frames obey the restriction that the key columns never contain nulls:

```python
import pandas

# lhs_file: path to file A; rhs_list: paths to files X, Y, Z
lhs_frame = pandas.read_table(lhs_file)
lhs_frame.FIRST_NAME.fillna('', inplace=True)
lhs_frame.LAST_NAME.fillna('', inplace=True)

rhs_frames = []
for infile in rhs_list:
    df = pandas.read_table(infile)
    df.FIRST_NAME.fillna('', inplace=True)
    df.LAST_NAME.fillna('', inplace=True)
    rhs_frames.append(df)

# sum of RHS
rhs = (pandas.concat(rhs_frames)
             .groupby(["FIRST_NAME", "LAST_NAME"])
             .sum()
             .reset_index())

# max of lhs and rhs
patched = (pandas.concat([lhs_frame, rhs])
                 .groupby(["FIRST_NAME", "LAST_NAME"])
                 .max()
                 .reset_index())

patched.to_csv(lhs_file + ".patched", sep="\t", na_rep="",
               index=False, float_format="%.0f")
```
@DavidEscott why are you calling `reset_index`?
@TomAugspurger I have to call `reset_index` because without it the following does not behave sensibly:

```python
lhs = pandas.DataFrame([["A", "B", 1]], columns=["FIRST", "SECOND", "VAL"])
rhs = pandas.DataFrame([["A", "B", 1], ["A", "B", 2], ["C", "D", 3]],
                       columns=["FIRST", "SECOND", "VAL"])
summed = rhs.groupby(["FIRST", "SECOND"]).sum()
union = pandas.concat([lhs, summed])
# this union object is really strange
patched = union.groupby(["FIRST", "SECOND"]).max()
```

The end result I get from that is wrong. It tells me the max of the lhs and the summed rhs is `A,B,1`, not `A,B,3` and `C,D,3` as it should be. It completely ignored the rhs. While pandas allows me to concatenate a grouped result and a regular dataframe, it doesn't do anything sensible with such objects. Hence the `reset_index` to get back to a normal dataframe. I'm not asking for how to make pandas do this correctly. After many hours I did eventually figure it out. I'm pointing out how counter-intuitively pandas is behaving.
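For readers wondering why the rhs vanishes here, a self-contained sketch of the mechanics (my annotation of the behavior, not text from the thread):

```python
import pandas as pd

lhs = pd.DataFrame([["A", "B", 1]], columns=["FIRST", "SECOND", "VAL"])
rhs = pd.DataFrame([["A", "B", 1], ["A", "B", 2], ["C", "D", 3]],
                   columns=["FIRST", "SECOND", "VAL"])
summed = rhs.groupby(["FIRST", "SECOND"]).sum()

# the rows that came from `summed` carry their keys in a MultiIndex,
# not in the FIRST/SECOND *columns*; after the concat those columns are
# NaN for them, and groupby drops NaN keys, so only the lhs row survives
union = pd.concat([lhs, summed])
print(union[["FIRST", "SECOND"]])               # NaN for rows from `summed`
print(union.groupby(["FIRST", "SECOND"]).max()) # only (A, B) -> 1
```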
Gotcha. In case anyone else comes across this:

```python
import pandas as pd

# lhs, rhs as defined in the previous snippet
lhs = lhs.set_index(['FIRST', 'SECOND'])  # you would probably just do this in read_csv

(pd.concat([lhs, rhs.groupby(['FIRST', 'SECOND']).sum()])
   .groupby(level=['FIRST', 'SECOND'])
   .max()
)
```

The tough thing with that is that what's intuitive is in the eye of the beholder. The solution is consistent behavior and good docs. Let us know if you find any areas where those are lacking.
FYI, the issue on not dropping NaNs in groupby keys is here: #3729 (and there is also a PR in progress)
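For anyone landing here from a newer pandas: `groupby` eventually grew a `dropna` flag (pandas 1.1+) that addresses exactly this. A quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", np.nan], "val": [1, 2]})

# pandas >= 1.1: keep NaN group keys instead of silently dropping them
print(df.groupby("key", dropna=False).sum())
```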
Totally agree. Pandas thinks there's a certain way of doing things, and whatever you decide to do with it, you always need to check what that way is, or risk a lot of aggravation and wasted time. And after you do figure it out, from that point on you always need to add some boilerplate code to work around it. Yes, everything is documented, but the issue is that there are too many extremely opinionated (and, some may say, unnatural) defaults and artifacts. Here are some of my favorites:
In general, I sorely miss R and its simplicity and power. It does exactly what it needs to, not a bit less, not a bit more; it doesn't get in your way and doesn't impose its artifacts and arbitrary defaults. Anything you do in Pandas is longer and more complicated than in R; part of that is due to Python and is outside of your control, of course, but a very large part is due to all the defaults and assumptions it makes about what the user wants to do and how. With all that said, I genuinely appreciate all your hard work maintaining and developing this project. I'm not in any position to complain, given that it's free and I haven't contributed anything, and yet I use it 10 hours a day. I just hope you'll consider this and other similar feedback. Thank you very much for what you do! 🙏🏽
...and ends up doing things wrong.
I had the following task: Given a csv file A and a set of identically structured files X, Y, Z [which have keys in the first two columns, and values in subsequent columns]: I wanted to create a new file which has contents MAX(A.column_val, X.column_val + Y.column_val + Z.column_val).
I had previously written a pure Python function to sort the individual columns and then perform a merge-join on them. However, conceptually this task is only a few lines of pandas code:
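The original post doesn't include that snippet, but it presumably looked something like this sketch (file and column names here are placeholders, not from the post):

```python
import pandas as pd

a = pd.read_table("A.tsv")
parts = [pd.read_table(name) for name in ("X.tsv", "Y.tsv", "Z.tsv")]

# sum X + Y + Z per key, then take the per-key max against A
keys = ["KEY1", "KEY2"]
summed = pd.concat(parts).groupby(keys).sum().reset_index()
result = pd.concat([a, summed]).groupby(keys).max().reset_index()

result.to_csv("out.tsv", sep="\t", index=False)
```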
So surely switching to pandas would make the code easier to understand and maintain. Right? Nope!!!
All of the following were issues I had to work around:

- Grouping produces hierarchical indexes, so I have to call `reset_index()` at particular points in my code.
- Getting NaN values written out as empty strings requires adding `na_rep=""` to my `to_csv` command.
- Empty strings in the input are read back as NaN, so I have to call `fillna('')` (and that doesn't modify the frame you call it on, but rather returns a copy, so you have to `fillna('', inplace=True)`).
- Integer columns containing NaN are converted to floats and written out as `1.0, 2.0` instead of `1, 2`. So I have to add a `float_format` directive to the `to_csv`.
- Row numbers appear in the first column of the output unless `index=False` gets added to `to_csv`.
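For what it's worth, the empty-string round-trip can also be handled at read time; a sketch using `read_table`'s `keep_default_na` option (worth checking against your pandas version):

```python
import pandas as pd

# keep_default_na=False stops the reader from turning empty fields into
# NaN, so the fillna('') pass on the key columns becomes unnecessary
# (caveat: empty fields in numeric columns then stay as strings)
df = pd.read_table("in.tsv", keep_default_na=False)
```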
The end result is that I am not of the opinion that the resulting pandas implementation is any more legible or maintainable. In three weeks, am I going to remember why I have set `index=False`, or what the hell that means? Am I going to understand why the calls to `reset_index` are necessary, or what they do? Probably not, which makes me concerned about using this code in production.