Skip to content

BUG: GH10536 in concat for SparseSeries #10626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.17.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -388,3 +388,5 @@ Bug Fixes
- Reading "famafrench" data via ``DataReader`` results in HTTP 404 error because of the website url is changed (:issue:`10591`).

- Bug in `read_msgpack` where DataFrame to decode has duplicate column names (:issue:`9618`)

- Bug in ``concat`` with ``SparseSeries`` (:issue:`10536`)
16 changes: 13 additions & 3 deletions pandas/tools/merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,8 @@
from pandas.util.decorators import Appender, Substitution
from pandas.core.common import ABCSeries
from pandas.io.parsers import TextFileReader

from pandas.sparse.series import SparseSeries
from pandas.sparse.frame import SparseDataFrame
import pandas.core.common as com

import pandas.lib as lib
Expand Down Expand Up @@ -838,6 +839,7 @@ def __init__(self, objs, axis=0, join='outer', join_axes=None,
axis = 1 if axis == 0 else 0

self._is_series = isinstance(sample, ABCSeries)
self._is_sp_series = isinstance(sample, SparseSeries)
if not 0 <= axis <= sample.ndim:
raise AssertionError("axis must be between 0 and {0}, "
"input was {1}".format(sample.ndim, axis))
Expand Down Expand Up @@ -894,13 +896,21 @@ def get_result(self):
if self.axis == 0:
new_data = com._concat_compat([x.values for x in self.objs])
name = com._consensus_name_attr(self.objs)
return Series(new_data, index=self.new_axes[0], name=name).__finalize__(self, method='concat')
if self._is_sp_series:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't like all of these if-thens. The basic problem is the existence of SparseSeries/SparseDataFrame, which don't really need to exist at all. The blocks actually hold the sparse data. But that is a bit bigger project.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you do ._constructor to solve the immediate issue and allow for those objects to be removed in time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my first approach but I also wanted to keep the symmetry between axis = 1 and axis = 0. And if you want to check fill values etc, the case work seems necessary (if ugly) without refactoring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't replace this block with self._constructor(...?

                if self._is_sp_series:
                    klass = SparseDataFrame
                else:
                    klass = DataFrame
                return klass(...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not seeing how that would work? Do Series and DataFrame share the same constructor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a little fast on the trigger. I didn't realize that you need ._is_frame etc elsewhere.

For this code, this is what I was originally thinking. Tell me if I'm still off, though

In __init__

self.obj_constructor=sample._constructor

and then in get_result

self.obj_constructor(...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the result could be a SparseDataFrame when the sample is a SparseSeries

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sample is a SparseSeries, then sample._constructor will also be a SparseSeries...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can do that if self.axis==0, but concatenation could happen along a different axis, so I thought the code would be easier to understand if the different cases were handled in roughly the same way.

klass = SparseSeries
else:
klass = Series
return klass(new_data, index=self.new_axes[0], name=name).__finalize__(self, method='concat')

# combine as columns in a frame
else:
data = dict(zip(range(len(self.objs)), self.objs))
index, columns = self.new_axes
tmpdf = DataFrame(data, index=index)
if self._is_sp_series:
klass = SparseDataFrame
else:
klass = DataFrame
tmpdf = klass(data, index=index)
if columns is not None:
tmpdf.columns = columns
return tmpdf.__finalize__(self, method='concat')
Expand Down
22 changes: 21 additions & 1 deletion pandas/tools/tests/test_merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,9 +17,11 @@
assert_almost_equal,
makeCustomDataframe as mkdf,
assertRaisesRegexp)
from pandas import isnull, DataFrame, Index, MultiIndex, Panel, Series, date_range, read_table, read_csv
from pandas import (isnull, DataFrame, Index, MultiIndex, Panel, Series, date_range,
read_table, read_csv, SparseSeries, SparseDataFrame)
import pandas.algos as algos
import pandas.util.testing as tm
from pandas.sparse.tests.test_sparse import assert_sp_series_equal, assert_sp_frame_equal

a_ = np.array

Expand Down Expand Up @@ -2476,6 +2478,24 @@ def test_concat_invalid_first_argument(self):
expected = read_csv(StringIO(data))
assert_frame_equal(result,expected)

def test_concat_sp_series(self):
# GH10536
data = [0, 1, 1, 2, 3, 0, np.nan]
index = [1, 2, 3, 4, 5, 6, 7]
sp = SparseSeries(data, index=index)
result = concat([sp, sp], axis=0)
expected = SparseSeries(data * 2, index=index * 2, kind='integer')
assert_sp_series_equal(result, expected)

def test_concat_sp_dataframe(self):
# GH10536
data = [0, 1, 1, 2, 3, 0, np.nan]
sp = SparseDataFrame(data)
result = concat([sp, sp], axis=1, ignore_index=True)
expected = SparseDataFrame({0: data, 1: data})
assert_sp_frame_equal(result, expected)


class TestOrderedMerge(tm.TestCase):

def setUp(self):
Expand Down