
BUG: fix NDFrame.as_blocks() for sparse containers #6748


Merged: 1 commit, Mar 31, 2014
1 change: 1 addition & 0 deletions doc/source/release.rst
@@ -296,6 +296,7 @@ Bug Fixes
 - Bug in consistency of groupby aggregation when passing a custom function (:issue:`6715`)
 - Bug in resample when ``how=None`` resample freq is the same as the axis frequency (:issue:`5955`)
 - Bug in downcasting inference with empty arrays (:issue:`6733`)
+- Bug in ``obj.blocks`` on sparse containers dropping all but the last item of the same dtype (:issue:`6748`)

 pandas 0.13.1
 -------------
18 changes: 12 additions & 6 deletions pandas/core/generic.py
@@ -2003,7 +2003,7 @@ def ftypes(self):
         return Series(self._data.get_ftypes(), index=self._info_axis,
                       dtype=np.object_)

-    def as_blocks(self, columns=None):
+    def as_blocks(self):
         """
         Convert the frame to a dict of dtype -> Constructor Types that each has
         a homogeneous dtype.
@@ -2025,12 +2025,18 @@ def as_blocks(self, columns=None):
         """
         self._consolidate_inplace()

-        bd = dict()
+        bd = {}
         for b in self._data.blocks:
-            b = b.reindex_items_from(columns or b.items)
-            bd[str(b.dtype)] = self._constructor(
-                BlockManager([b], [b.items, self.index])).__finalize__(self)
-        return bd
+            bd.setdefault(str(b.dtype), []).append(b)

Contributor:

had meant to move the guts of this to internals anyhow; e.g. maybe get_as_blocks could return the dict of str(dtype) to block manager (already combined), so generic would just need to iterate and call _constructor(...).__finalize__(...)

if it works better to refactor this later (in your internal changes checklist), ok too

lmk

Contributor Author:

This will most likely get overwritten during refactoring.

And it did occur to me that this is very much like "group blocks by ftype" operation that happens during merging/concatenation. And the implementation does look rather trivial:

import itertools

def mapreduce_blocks(mgr, keyfunc, reducer):
    # groupby takes ``key`` and only groups consecutive runs of items
    return dict((key, reducer(val_iter))
                for key, val_iter in itertools.groupby(mgr.blocks, key=keyfunc))

def group_blocks_by_ftype(mgr):
    return mapreduce_blocks(mgr, keyfunc=lambda b: str(b.ftype),
                            reducer=list)

def combine_blocks_by_dtype(mgr):
    return mapreduce_blocks(mgr, keyfunc=lambda b: str(b.dtype),
                            reducer=mgr.combine)

Such one-liner functions are indeed a natural extension of the BlockManager external API, but I'm not sure they deserve to be part of it.
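A hypothetical, self-contained version of that sketch (FakeBlock and the sorting step are my additions; itertools.groupby only groups consecutive items, so the input must be sorted by the same key first):

```python
import itertools

class FakeBlock:
    """Hypothetical stand-in for a pandas internals block."""
    def __init__(self, dtype):
        self.dtype = dtype

def mapreduce_blocks(blocks, keyfunc, reducer):
    # groupby only merges consecutive equal keys, so sort first
    ordered = sorted(blocks, key=keyfunc)
    return {key: reducer(group)
            for key, group in itertools.groupby(ordered, key=keyfunc)}

def group_blocks_by_dtype(blocks):
    return mapreduce_blocks(blocks, keyfunc=lambda b: str(b.dtype),
                            reducer=list)

blocks = [FakeBlock('float64'), FakeBlock('int64'), FakeBlock('float64')]
grouped = group_blocks_by_dtype(blocks)
# both float64 blocks land in the same group rather than only the last
```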

Contributor:

ok....maybe just reference this issue then in your refactoring...

will merge then

Contributor Author:

I've decided to split this out into a separate comment.

Which brings me to another topic: minimalism in API design. I tend to share the opinion that an API is perfect not when you can't add anything else, but when you can't take anything away from it. It happens quite often that an interface gets overburdened with small details; it hurts my eyes to wade through numpy.ndarray attrs/methods when I forget the exact spelling of the thing I want to use but know is there:

In [1]: np.arange(10)
Out[1]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [2]: _1.
_1.T             _1.choose        _1.data          _1.flatten       _1.nbytes        _1.repeat        _1.sort          _1.tostring
_1.all           _1.clip          _1.diagonal      _1.getfield      _1.ndim          _1.reshape       _1.squeeze       _1.trace
_1.any           _1.compress      _1.dot           _1.imag          _1.newbyteorder  _1.resize        _1.std           _1.transpose
_1.argmax        _1.conj          _1.dtype         _1.item          _1.nonzero       _1.round         _1.strides       _1.var
_1.argmin        _1.conjugate     _1.dump          _1.itemset       _1.prod          _1.searchsorted  _1.sum           _1.view
_1.argsort       _1.copy          _1.dumps         _1.itemsize      _1.ptp           _1.setfield      _1.swapaxes      
_1.astype        _1.ctypes        _1.fill          _1.max           _1.put           _1.setflags      _1.take          
_1.base          _1.cumprod       _1.flags         _1.mean          _1.ravel         _1.shape         _1.tofile        
_1.byteswap      _1.cumsum        _1.flat          _1.min           _1.real          _1.size          _1.tolist        

And, unfortunately, pandas containers add more on top of that:

In [1]: pd.Series()
Out[1]: Series([], dtype: float64)

In [2]: _1.
Display all 225 possibilities? (y or n)

The desire to have a bit of everything at your fingertips is tempting indeed, but 225 (two hundred!!) methods and properties is a bit too many for my liking. And it is a lot to wrap your head around when you're only starting.

Contributor:

that's a valid point, except that you can search the docs (lots of docs), and you can always start the possibilities with a letter (to narrow down the search).

pandas provides a lot of functionality (as does numpy). not sure that this is a problem per se.

the alternative is cryptic functions that do too much with too much overloading. which IMHO is a bigger/worse problem.

Contributor Author:

the alternative is cryptic functions that do too much with too much overloading. which IMHO is a bigger/worse problem.

I like how overloading is handled in BLAS interfaces: you have a convention about prefixes/suffixes that correspond to a certain variation of the core algorithm, so you as a programmer can make an educated guess based on what the data is going to look like, or just use the generic implementation.

As for the naming, I've gone through the reference a bit and found a "ray of hope": the string functions, which obviously form a cluster of functionality available for Series containers, are conveniently put under the .str attribute. That's a very nice application of namespacing and a viable approach to the API-clutter problem. Following the Zen: Namespaces are one honking great idea -- let's do more of those!
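A minimal sketch of that accessor pattern (the names here are made up for illustration, not the actual pandas implementation):

```python
class StrAccessor:
    """Groups string methods under one attribute instead of the root."""
    def __init__(self, values):
        self._values = values

    def upper(self):
        return [v.upper() for v in self._values]

class MiniSeries:
    def __init__(self, values):
        self._values = list(values)

    @property
    def str(self):
        # a fresh accessor bound to this series' data
        return StrAccessor(self._values)

s = MiniSeries(['a', 'b'])
result = s.str.upper()  # ['A', 'B']
```

Only one name (`str`) lands in the container's root namespace, however many string methods the accessor grows.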

Contributor:

Whenever I monkey-patch new functionality I try to create namespaces. Since we control the IPython autocomplete, we could always move methods from the root into namespaces. The original methods would still work, they just wouldn't autocomplete at the root. I do this for subclasses of pandas objects: if I created a subclass like OHLC, I'd rather only see the custom methods I added.

+
+        result = {}
+        for dtype, blocks in bd.items():
+            # Must combine even after consolidation, because there may be
+            # sparse items which are never consolidated into one block.
+            combined = self._data.combine(blocks, copy=True)
+            result[dtype] = self._constructor(combined).__finalize__(self)
+
+        return result

     @property
     def blocks(self):
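The core of the fix — bucket blocks by dtype with setdefault, then combine each bucket — can be sketched independently of the pandas internals (Block and the combine callable below are hypothetical stand-ins):

```python
from collections import namedtuple

# hypothetical stand-in for a pandas internals block
Block = namedtuple('Block', ['dtype', 'values'])

def as_blocks_sketch(blocks, combine):
    bd = {}
    for b in blocks:
        # keep *every* block of a dtype; plain dict assignment here
        # would drop all but the last sparse block of that dtype
        bd.setdefault(str(b.dtype), []).append(b)
    return {dtype: combine(group) for dtype, group in bd.items()}

blocks = [Block('float64', [1.1]), Block('float64', [3.3]), Block('int64', [2])]
out = as_blocks_sketch(blocks,
                       combine=lambda bs: [v for b in bs for v in b.values])
# out == {'float64': [1.1, 3.3], 'int64': [2]}
```

Sparse columns each live in their own block and never consolidate, which is exactly the case where the old single-assignment version lost data.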
8 changes: 8 additions & 0 deletions pandas/sparse/tests/test_sparse.py
@@ -1515,6 +1515,14 @@ def test_sparse_pow_issue(self):

         self.assertEqual(len(r2.sp_values), len(r1.sp_values))

+    def test_as_blocks(self):
+        df = SparseDataFrame({'A': [1.1, 3.3], 'B': [nan, -3.9]},
+                             dtype='float64')
+
+        df_blocks = df.blocks
+        self.assertEqual(list(df_blocks.keys()), ['float64'])
+        assert_frame_equal(df_blocks['float64'], df)
+
+
 def _dense_series_compare(s, f):
     result = f(s)