-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: fix issue with concat creating SparseFrame if not all series are sparse. #18924
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
this should not to be necessary. rather this is a bug in how reindexing is implemented |
K I'll look at reindex then. I was a bit hesitant to push this fix because it felt a bit off. |
I think it might still be get_dummies because if I create a SparseDataFrame from scratch it works fine import pandas.util.testing as tm
df2 = pd.SparseDataFrame()
df2['GDP'] = pd.Series([1,2])
df2['Nation_AB'] = pd.SparseSeries([1,0])
df2['Nation_CD'] = pd.SparseSeries([0,1])
df = pd.DataFrame.from_items([('GDP', [1, 2]),('Nation', ['AB', 'CD'])])
df = pd.get_dummies(df, columns=['Nation'], sparse=True) # SparseDataFrame
tm.assert_frame_equal(df, df2) # => Works
df2.reindex(columns=['GDP']) #=> doesn't fail
df.reindex(columns=['GDP']) #=> fails
df.GDP #=> fails since it's cast as a SparseSeries but has an IntBlock underneath
df[['GDP']] #=> works fine
df2.GDP.block #=> gets cast to SparseBlock somewhere in transmission Somehow... GDP is being stored as a IntBlock but marked as SparseBlock. It doesn't feel like this is reindex's fault but I can definitely be wrong. I might have to come back to this in the morning. Not sure where to take it. |
Codecov Report
@@ Coverage Diff @@
## master #18924 +/- ##
==========================================
- Coverage 91.62% 91.6% -0.03%
==========================================
Files 150 150
Lines 48724 48726 +2
==========================================
- Hits 44642 44634 -8
- Misses 4082 4092 +10
Continue to review full report at Codecov.
|
Alright since it's after the holiday I'll take a look today 😄. |
I think I need some guidance on this one... It really doesn't appear to be reindex to me. Doing some debugging I couldn't re-create it without utilizing concat. The problem is that this column "GDP" is being cast as a SparseSeries when the underlying block is an IntBlock. But reindex thinks that since it's a SparseSeries it should re-index using SparseArray only which is why it's throwing errors. IF I create the sparse data frame from scratch I can't re-create the issue but if I use concat I can. from pandas.core.reshape.reshape import _get_dummies_1d
from pandas.core.reshape.concat import concat
df = pd.DataFrame.from_items([('GDP', [1, 2]),('Nation', ['AB', 'CD'])])
dummy = _get_dummies_1d(df['Nation'], sparse=True, prefix='Nation')
isinstance(dummy, pd.SparseDataFrame)
isinstance(df, pd.DataFrame)
# I think this is the real problem.
concatted = concat([df.drop('Nation', axis=1), dummy], axis=1)
concatted.reindex(columns=['GDP']) # => Fails because reindex thinks it's reindexing a SparseArray but it's an IntBlock
concatted.GDP # => Fails since the underlying block is an IntBlock not a SparseBlock I also looked at concat but have run out of my open source time for the day :(. Let me know if I'm on the right track or barking up a tree I shouldn't be barking up. Thanks @jreback and @TomAugspurger. |
Alright I looked at this again today... Here is the problem as I see it: SparseFrame is instantiating only SparseSeries even though 'GDP' isn't a SparseSeries. On top of that SparseSeries when wrapping a SingleBlockManager doesn't coerce it into a Sparse data structure. I think SingleBlockManager should be coerced into a sparse structure since everything else is on instantiation. So I will try that tomorrow. |
yeah this should actually be returning a |
cb099dd
to
2612dcc
Compare
Hello @hexgnu! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 31, 2018 at 11:26 Hours UTC |
7b34c07
to
a37ea0c
Compare
daac0c8
to
4061e67
Compare
Alright so I just pushed some test fixes. One thing that kind of confuses me is why is there both: pandas.core.dtypes.concat._get_series_result_type
pandas.core.dtypes.concat._get_frame_result_type They are almost identical minus some things. I didn't want to get stuck in a refactor loop but it seems like these functions should be one instead of two? |
pandas/tests/reshape/test_reshape.py
Outdated
df = get_dummies(df, columns=['Nation'], sparse=True) | ||
df2 = df.reindex(columns=['GDP']) | ||
|
||
tm.assert_index_equal(df.index, df2.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert the frame itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the sparse arg here (so it handles both cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a few parameterized changes. This one as well as the test in test_combine_concat.py which honestly looks like it needs some cleanup someday.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -363,6 +363,7 @@ Reshaping | |||
- Bug in :func:`Series.rank` where ``Series`` containing ``NaT`` modifies the ``Series`` inplace (:issue:`18521`) | |||
- Bug in :func:`cut` which fails when using readonly arrays (:issue:`18773`) | |||
- Bug in :func:`Dataframe.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`) | |||
- Bug in :func:`pandas.core.dtypes.concat._get_series_result_type` which returns SparseDataFrame even if not all contained in Frame are sparse. (:issue:`18914` and :issue:`18686`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the whatsnew note should be from a user perspective (e.g. the error that happens with get_dummies).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to be about pd.concat I think that's user facing right? While get_dummies is the symptom pd.concat is the real problem. Lemme know if you want me to change that thanks!
have a look at https://github.com/pandas-dev/pandas/issues?q=is%3Aopen+is%3Aissue+label%3ASparse+label%3AReshaping, see if this closes any of those (aside from the already marked ones) |
4061e67
to
2ef4ed1
Compare
Also I found one more issue that was directly tied to this. I opened up a new issue though asking about DataFrame.density and whether that should be defined or not. |
2ef4ed1
to
41411d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment. did you cover the tests in the issue you are marking as closing?
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -363,6 +363,7 @@ Reshaping | |||
- Bug in :func:`Series.rank` where ``Series`` containing ``NaT`` modifies the ``Series`` inplace (:issue:`18521`) | |||
- Bug in :func:`cut` which fails when using readonly arrays (:issue:`18773`) | |||
- Bug in :func:`Dataframe.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`) | |||
- Bug in :func:`concat` when concatting sparse and dense series it returns only a SparseDataFrame. Should be a DataFrame. (:issue:`18914`, :issue:`18686`, and :issue:`16874`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double backticks around DataFrame, SparseDataFrame.
looks like we can add some tests from the issues. (make the new tests and add the gh issue on them). |
Yea I wondered if you wanted me to do that. I’ll get on it soon thanks!
…On Thu, Jan 4, 2018 at 8:44 AM Jeff Reback ***@***.***> wrote:
looks like we can add some tests from the issues. (make the new tests and
add the gh issue on them).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#18924 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAIdDu0VpdVzT6fxj5n5M_0wg32MRz1mks5tHB7ugaJpZM4RL0Op>
.
|
So on #18686 it will close that problem but to_coo won't be defined on DataFrame. Do you want me to tack onto that existing issue I made? I added a test for the other one plus a fix for what you pointed out in whatsnew. |
This was originally brought up in :issue:`18686` and :issue:`18914`. Basically the problem is when you use get_dummies with sparse=True it will return a SparseDataFrame with sparse and dense columns. This is in fact not what we want. What we want is a DataFrame with sparse and dense columns. Inside of pandas.core.dtypes.concat is a function that defines the factory class which needed to be changed.
bfa713d
to
6fc6369
Compare
Ok this should be good to go after tests run @jreback |
@hexgnu close now! Just some linting failures: https://travis-ci.org/pandas-dev/pandas/jobs/334545951#L3209 Can you fix those and repush? |
@TomAugspurger done! thanks :) been waiting for that fix in master to show up. |
pandas/core/dtypes/concat.py
Outdated
return SparseDataFrame | ||
else: | ||
return objs[0] | ||
return next(obj for obj in objs if not type(obj) == SparseDataFrame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use ABCSparseDataFrame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually need to add this in pandas/core/dtypes/generic.py (and a test for it)
use isinstance(obj, ABCSparseDataFrame)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm you can just return a DataFrame
(this migth be subclasses though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to keep this as similar to the original as possible. For instance someone might have their own subclass defined of DataFrame.
assert isinstance(res, pd.SparseDataFrame) | ||
tm.assert_frame_equal(res.to_dense(), exp) | ||
|
||
res = pd.concat([self.dense3, sparse], axis=1) | ||
exp = pd.concat([self.dense3, self.dense1], axis=1) | ||
assert isinstance(res, pd.SparseDataFrame) | ||
# See GH18914 and #18686 for why this should be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just construct the dataframe and compare?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to follow the style of this test. I can definitely do that but figured since the rest of the test was following that standard I'd keep it similar.
assert res.iloc[0, 0] == self.dense3.iloc[0, 0] | ||
|
||
for column in self.dense3.columns: | ||
tm.assert_series_equal(res[column], exp[column]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be all for splitting this long test up a bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright split it up a bit. Also added some new test cases since it wasn't covering all aspects like concating onself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. just some test cleanup. ping on green.
|
||
assert isinstance(res, pd.SparseDataFrame) | ||
tm.assert_frame_equal(res.to_dense(), exp) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here (and in test below), the purposes of the reverses & the loop
res = pd.concat(sparse_frame, axis=1) | ||
exp = pd.concat(dense_frame, axis=1) | ||
|
||
# See GH18914 and #18686 for why this should be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove these asserts (357-364), the assert on the frame/series is comprehensive. you can not the issue numbers at the top of the test instead
pandas/core/dtypes/concat.py
Outdated
if any(b.is_sparse for b in result.blocks): | ||
from pandas.core.sparse.api import SparseDataFrame | ||
|
||
from pandas.core.sparse.api import SparseDataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could put this right before you actually use SparseDataFrame
lgtm. ping on green. |
alrighty all green thanks! @jreback |
thanks @hexgnu nice patch! keep em coming! |
Ok so after trying a few things out. This seems to be the problem.
When concatting multiple data frames together if any of the containing series are sparse then the entire data frame becomes sparse (or SparseDataFrame).
This is in fact not what we want. We want a DataFrame that contains a SparseSeries and a Series.
closes #18914,
closes #18686
closes #16874
closes #18551
git diff upstream/master -u -- "*.py" | flake8 --diff