Skip to content

Un-deprecate DataFrame.from_items() #21850

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jzwinck opened this issue Jul 11, 2018 · 22 comments
Closed

Un-deprecate DataFrame.from_items() #21850

jzwinck opened this issue Jul 11, 2018 · 22 comments

Comments

@jzwinck
Copy link
Contributor

jzwinck commented Jul 11, 2018

Under #18262 a FutureWarning was added suggesting that existing code like this:

pd.DataFrame.from_items(x)

Should be changed to this:

import collections
pd.DataFrame.from_dict(collections.OrderedDict(x))

The fact that from_items() appeared only 6 times (now 8 times) in a Stack Overflow search was used as partial justification for removing it. But if you search on GitHub, pd.DataFrame.from_items() appears more than 15,000 times in Python--almost half as many as from_records()!

We should celebrate the fact that this function doesn't cause enough confusion to appear often on Stack Overflow. But it does occur (a lot!) in real code, and deprecating it is a mistake.

If constructing a temporary OrderedDict around items is the best way to construct a DataFrame, Pandas should implement that as a short function called DataFrame.from_items(), rather than asking thousands of people to busy themselves to accommodate this unnecessary API change.

I recommend removing the FutureWarning, and retaining this widely-used, longstanding function.

For reference, the FutureWarning starts in 0.23 and looks like this:

FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.

@gfyoung
Copy link
Member

gfyoung commented Jul 12, 2018

cc @jreback @chris-b1

@jzwinck
Copy link
Contributor Author

jzwinck commented Jul 19, 2018

@jreback I would appreciate your review of this topic.

@jorisvandenbossche
Copy link
Member

cc @reidy-p @chris-b1

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Jul 26, 2018
@jreback
Copy link
Contributor

jreback commented Jul 26, 2018

I disgree, this is a good deprecation. Keeping around old code is not helpful. So -1 on adding back.

@jzwinck
Copy link
Contributor Author

jzwinck commented Jul 26, 2018

@jreback It is clear that you think this is a good deprecation, otherwise you would not have done it. However, the initial justification ("only 6 hits on Stack Overflow") completely ignored the larger picture, which is that this function is widely used in a huge number of applications and libraries.

It might be the case that the original implementation was unwieldy or costly to maintain. But what could be the reason not to maintain backward compatibility by a 3-line function that delegates the work to OrderedDict and the main DataFrame constructor?

In other words, maybe the old code was not helpful. Would you accept new code to implement this widely-used function in a single place, rather than having thousands of users sprinkle the same into their local codebases?

@jreback
Copy link
Contributor

jreback commented Jul 26, 2018

there is a burden to maintain apis and code
why should this be any different from other ‘highly used code’

@jreback jreback closed this as completed Jul 26, 2018
jzwinck added a commit to jzwinck/pandas that referenced this issue Jul 28, 2018
This removes the deprecation warnings introduced in pandas-dev#18262,
by reimplementing DataFrame.from_items() in the recommended
way using DataFrame.from_dict() and collections.OrderedDict.

This eliminates the maintenance burden of separate code for
from_items(), while allowing existing uses to keep working.

A small cleanup can be done once pandas-dev#8425 is fixed.
jzwinck added a commit to jzwinck/pandas that referenced this issue Jul 28, 2018
This removes the deprecation warnings introduced in pandas-dev#18262,
by reimplementing DataFrame.from_items() in the recommended
way using DataFrame.from_dict() and collections.OrderedDict.

This eliminates the maintenance burden of separate code for
from_items(), while allowing existing uses to keep working.

A small cleanup can be done once pandas-dev#8425 is fixed.
jzwinck added a commit to jzwinck/pandas that referenced this issue Jul 28, 2018
This removes the deprecation warnings introduced in pandas-dev#18262,
by reimplementing DataFrame.from_items() in the recommended
way using DataFrame.from_dict() and collections.OrderedDict.

This eliminates the maintenance burden of separate code for
from_items(), while allowing existing uses to keep working.

A small cleanup can be done once pandas-dev#8425 is fixed.
@jorisvandenbossche
Copy link
Member

I agree with @jzwinck that "old code" and "burden to maintain apis and code" is not a good reason for deprecating from_items. from_items is really not that of a complicated function (compared to many other things in pandas) and as @jzwinck shows can be easily cleaned-up.

If that was the original reason to deprecate it, I think the PR of @jzwinck is a perfectly reasonable solution.

However, often when we deprecate one of the public methods, the main reasons are related to the "User API" design: inconsistent apis, duplicate functionality, hardly useful functionality that clobbers the API, etc are all reasons to consider deprecating a functionality.
We have a huge number of methods on the DataFrame object, and this huge number results in an overload for users. Therefore for having something as a method on the DataFrame, there should be a clear added value that is big enough to balance the added complexity / API load.

So also for from_items: is it worth having this as a DataFrame method given the relatively limited use case / relatively easy alternative?

And to be clear, such a trade-off is difficult and subjective.
And when it is about removing existing functionality (instead of adding a new method), the balance should typically be more strongly against having the method (because then there is also the disruption of the existing use cases).


That said, I personally never had a use case myself where from_items would have been fitting. That is not saying that those do not exist, but for me it is very hard to judge its usefulness.

If this would have been a new method proposal, I would probably be -1 to add it. But given that this is about removing an already existing method, I personally don't know.

@jzwinck
Copy link
Contributor Author

jzwinck commented Jul 30, 2018

Thanks for the insight @jorisvandenbossche. If this were a brand new API, I would not promote it either. But as you point out, it is an API Pandas has had for a long time.

To me the largest concerns are

  1. Existing uses of from_items() in the wild.
  2. The inadequacy (even bugs, see below) of the refactoring proposed by the deprecation warning. Reading ENH: DataFrame.from_xy methods are duplicates #4916, it was not totally clear even to the experts how to replace from_items() in some cases.

On 24 August 2017, @bashtage asked:

What about `orient='index'? I used this just last week.

This was a prescient question: it is exactly the case which triggers #8425 if you can't use from_items(). But no one replied. Evidently no one tried it either, because one of the existing Pandas test cases fails if you try to replace from_items(orient='index') with from_dict().

I could see making the docs say from_items() is deprecated, but when the difference between user code working and not working comes down to 15 lines of Python, I don't see why we'd break those apps or force them all to copy-paste an incomplete and buggy workaround.

@jzwinck jzwinck closed this as completed Aug 17, 2018
@PatrickDRusk
Copy link

PatrickDRusk commented Sep 14, 2018

I haven't seen any mention that the from_dict(OrderedDict(items)... doesn't work in cases where there would be duplicates in the index. Here is a test that would fail:

import pandas
from collections import OrderedDict
def test_from_dict_replacing_from_items():
    rows = [(1, (2,)), (1, (2,))]
    df1 = pandas.DataFrame.from_items(rows, columns=('a', ), orient='index')
    df2 = pandas.DataFrame.from_dict(OrderedDict(rows), columns=('a', ), orient='index')
    pandas.testing.assert_frame_equal(df1, df2)

@jzwinck
Copy link
Contributor Author

jzwinck commented Jan 29, 2019

@PatrickDRusk Good point. There are several cases where the replacement for from_items() is not obvious. Pandas maintainers simply don't care about backward compatibility--deleting working code from Pandas is more important than providing a smooth experience for long-time Pandas users like you and me.

@gfyoung
Copy link
Member

gfyoung commented Jan 29, 2019

Pandas maintainers simply don't care about backward compatibility--deleting working code from Pandas is more important than providing a smooth experience for long-time Pandas users like you and me.

@jzwinck : I should point out that @jorisvandenbossche was willing to reopen the issue, but you promptly closed it. From the looks of it, your comments are in fact somewhat hypocritical.

I would caution your tone before you make such broad statements about us like that. We don't delete code on a whim, and with so many users of this codebase it is difficult to please everyone when it comes to making a library like this work.

If you want to have a discussion again about this topic, then please re-open the issue.

@jzwinck jzwinck reopened this Jan 29, 2019
@jzwinck
Copy link
Contributor Author

jzwinck commented Jan 29, 2019

@gfyoung OK, I've reopened this issue. The current state of play is that there are a number of users of Pandas who can't easily upgrade to 0.23 or later because from_items() is deprecated and the suggested workaround only covers some fraction of use cases (maybe 30% by cardinality and 70% by popularity). My PR #22094 eliminates most of the burdensome implementation code for from_items() and covers maybe 90% of the use cases (but not the one from @PatrickDRusk, I think).

What should we do next?

@gfyoung gfyoung removed this from the 0.24.0 milestone Jan 29, 2019
@gfyoung
Copy link
Member

gfyoung commented Jan 29, 2019

Okay, let's take stock of what we have here:

I haven't seen any mention that the from_dict(OrderedDict(items)... doesn't work in cases where there would be duplicates in the index.

Hmmm...duplicate indices generally have not been a fan-favorite for us as maintainers because such support can be very hard to maintain, not just for one case, but as a "feature" across the entire interface of pandas.

@PatrickDRusk : So admittedly I'm somewhat torn: the statement is indeed inaccurate with regards to duplicate indices, but committing to supporting such behavior could send the wrong message that we intend to support duplicates full-throttle, when that is far from the case in our codebase.

@pandas-dev/pandas-core : Thoughts on this?

the suggested workaround only covers some fraction of use cases (maybe 30% by cardinality and 70% by popularity)

@jzwinck : This statement unfortunately is too vague for us to work with. Especially since we already made the decision to deprecate, we will need more than words and GitHub search results to make a convincing argument. Meet us halfway and present us with more examples to illustrate that the suggested change is incompatible.

Caveat: if they fall into same category of duplicates, I personally wouldn't count it unless it is somehow different (I'll leave it to your discretion to judge appropriately) from what @PatrickDRusk has given.

@toobaz
Copy link
Member

toobaz commented Jan 29, 2019

@pandas-dev/pandas-core : Thoughts on this?

I'm +0.5 on keeping the method. In general, I don't see "easy to replace with user code" as an argument to deprecate - maybe the opposite, if the code is so simple and so widely used...
I'm -1.5 on "we don't want to encourage duplicated labels" as an argument. Pandas supports them, let's face it. It is perfectly fine to not support them - and document it - in those cases when it's just a mess. But not supporting them "just to give a message to users", in cases where it's extremely simple to support them, is not nice.

Anyway, if we really want to discourage users from using duplicated labels, the the correct way to do it is to introduce a general warning for it, inside the Index (and subclasses) constructor. Maybe even a DeprecationWarning, giving users several releases to adapt, would allow us to eventually simplify our code. I just don't think it's worth the pain (but this is very IMHO).

@toobaz
Copy link
Member

toobaz commented Jan 29, 2019

By the way: deprecating DataFrame.from_items because the API is too large is particularly funny because we just introduced the very specific MultiIndex.from_frame, which is a very simple wrapper.

@jzwinck
Copy link
Contributor Author

jzwinck commented Jan 29, 2019

@gfyoung I don't personally need support for duplicate values in the index, and I wrote #22094 without considering that. For people who like reduced implementation code and don't want duplicate values in the index, #22094 provides both.

You said:

Meet us halfway and present us with more examples to illustrate that the suggested change is incompatible.

Please look at the logic in #22094. That's what I had to write to make the preexisting unit tests pass. In other words, the unit tests I restored in that PR are the examples you are looking for. Any code using from_items() with the columns and/or orient parameters requires more than the suggested workaround using OrderedDict (and that additional code is what my PR implements). I wrote that PR for two reasons: to solve the problem, and to demonstrate that the trivial-seeming workaround code using OrderedDict is not sufficient.

@gfyoung
Copy link
Member

gfyoung commented Jan 29, 2019

"we don't want to encourage duplicated labels" as an argument. Pandas supports them, let's face it. It is perfectly fine to not support them - and document it

@toobaz : I beg to differ to the extent in which duplicates of labels, whether it be in columns or indices is supported as well as non-duplicates, but let's agree to disagree on this one for the moment. My point is not about "sending a message" by not supporting it. It's actually about "not sending the wrong message" that's more important IMO.

@gfyoung
Copy link
Member

gfyoung commented Jan 29, 2019

Please look at the logic in #22094. That's what I had to write to make the preexisting unit tests pass.

@jzwinck : I read it, and I'm really confused here...what pre-existing unit test was failing before your change? You seemed to have removed several tests + removed deprecation warnings that we had in place that we were checking for...

@jzwinck
Copy link
Contributor Author

jzwinck commented Jan 29, 2019

@gfyoung I should have said "to make the tests pass as they did before the deprecation warnings were added." In other words, I put the unit tests back to the way they were before the deprecation, and then implemented the logic in from_items() to make them all work again without the "burdensome" logic (replacing that with the OrderedDict solution plus a bit more logic to cover all the test cases).

@shoyer
Copy link
Member

shoyer commented Jan 29, 2019

I would also lean towards keeping from_items around. It may not be an optimal API, but it solves a real use case and it isn't broken. I don't see it really adding much to the pandas maintenance burden.

@jzwinck jzwinck closed this as completed May 13, 2019
@jolespin
Copy link

jolespin commented Apr 9, 2020

If .from_items isn't going to be un-deprecated then Rpy2 is going to need to update their code:

@ri2py.register(DataFrame)
def ri2py_dataframe(obj):
    items = tuple((k, ri2py(v)) for k, v in obj.items())
    res = PandasDataFrame.from_items(items)
    return res

I set up a GitHub issue to let them know the change breaks their code.
rpy2/rpy2#680

@aflag
Copy link

aflag commented Mar 31, 2022

Is there any workaround that allows us to have a fast version of from_items which allows for duplicated columns? I think I've missed it. Otherwise, can we reopen the issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants