BUG: fix NDFrame.as_blocks() for sparse containers #6748
Conversation
SparseBlocks don't consolidate, so the previous implementation silently dropped all but the last block for a given dtype. Also, drop the `columns` kwarg; there's `reindex` for that.
```python
BlockManager([b], [b.items, self.index])).__finalize__(self)
return bd
bd.setdefault(str(b.dtype), []).append(b)
```
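The failure mode is easy to reproduce outside pandas. Here is a minimal, hypothetical sketch (a plain dict and a stand-in `Block` class, not the real pandas internals) of why plain assignment into a dtype-keyed dict loses blocks when they aren't consolidated, while `setdefault`/`append` keeps them all:

```python
# Hypothetical stand-in for a pandas Block: just a dtype and some values.
class Block:
    def __init__(self, dtype, values):
        self.dtype = dtype
        self.values = values

# Two float64 blocks that were never consolidated into one, plus an int64 block.
blocks = [Block("float64", [1.0]), Block("float64", [2.0]), Block("int64", [3])]

# Buggy grouping: plain assignment keeps only the *last* block per dtype.
buggy = {}
for b in blocks:
    buggy[str(b.dtype)] = b

# Fixed grouping: setdefault/append collects *every* block per dtype.
fixed = {}
for b in blocks:
    fixed.setdefault(str(b.dtype), []).append(b)

print(buggy["float64"].values)  # [2.0] -- the first float64 block was lost
print(len(fixed["float64"]))    # 2
```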
had meant to move the guts of this to internals anyhow; e.g. maybe get_as_blocks could return the dict of str(dtype) to block manager (already combined), so generic just needs to iterate and _constructor....__finalize__...
if it works better to refactor this later (in your internals changes checklist), ok too
lmk
This will most likely get overwritten during refactoring.
And it did occur to me that this is very much like the "group blocks by ftype" operation that happens during merging/concatenation. The implementation does look rather trivial:

```python
import itertools

def mapreduce_blocks(mgr, keyfunc, reducer):
    # itertools.groupby takes `key=` (not `keyfunc=`) and only groups
    # consecutive items, so sort the blocks by the same key first
    blocks = sorted(mgr.blocks, key=keyfunc)
    return dict((key, reducer(list(val_iter)))
                for key, val_iter in itertools.groupby(blocks, key=keyfunc))

def group_blocks_by_ftype(mgr):
    return mapreduce_blocks(mgr, keyfunc=lambda b: str(b.ftype),
                            reducer=list)

def combine_blocks_by_dtype(mgr):
    return mapreduce_blocks(mgr, keyfunc=lambda b: str(b.dtype),
                            reducer=mgr.combine)
```

Such one-liner functions are indeed a natural extension of the BlockManager external API, but I'm not sure they deserve to be part of it.
ok....maybe just reference this issue then in your refactoring...
will merge then
I've decided to split this out into a separate comment.
Which brings me to another topic: minimalism in API design. I tend to share the opinion that an API is perfect not when you can't add anything else, but rather when you can't take anything away from it. It happens quite often that an interface gets overburdened with small details; e.g. it hurts my eyes to wade through numpy.ndarray attrs/methods when I forget the exact spelling of the thing I want to use and know is there:
```
In [1]: np.arange(10)
Out[1]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [2]: _1.
_1.T          _1.choose     _1.data       _1.flatten       _1.nbytes   _1.repeat        _1.sort      _1.tostring
_1.all        _1.clip       _1.diagonal   _1.getfield      _1.ndim     _1.reshape       _1.squeeze   _1.trace
_1.any        _1.compress   _1.dot        _1.imag          _1.newbyteorder  _1.resize   _1.std       _1.transpose
_1.argmax     _1.conj       _1.dtype      _1.item          _1.nonzero  _1.round         _1.strides   _1.var
_1.argmin     _1.conjugate  _1.dump       _1.itemset       _1.prod     _1.searchsorted  _1.sum       _1.view
_1.argsort    _1.copy       _1.dumps      _1.itemsize      _1.ptp      _1.setfield      _1.swapaxes
_1.astype     _1.ctypes     _1.fill       _1.max           _1.put      _1.setflags      _1.take
_1.base       _1.cumprod    _1.flags      _1.mean          _1.ravel    _1.shape         _1.tofile
_1.byteswap   _1.cumsum     _1.flat       _1.min           _1.real     _1.size          _1.tolist
```
And, unfortunately, pandas containers add more on top of that:
```
In [1]: pd.Series()
Out[1]: Series([], dtype: float64)

In [2]: _1.
Display all 225 possibilities? (y or n)
```
The desire to have a bit of everything at your fingertips is tempting indeed, but 225 (over two hundred!) methods and properties is a bit too many for my liking. And it is a lot to wrap your head around when you're only starting.
that's a valid point, except that you have search in the docs (lots of docs), and you can always start the possibilities with a letter (to narrow down the search).
pandas provides a lot of functionality (as does numpy). not sure that this is a problem per se.
the alternative is cryptic functions that do too much with too much overloading, which IMHO is a bigger/worse problem.
> the alternative is cryptic functions that do too much with too much overloading, which IMHO is a bigger/worse problem.

I like how overloading is handled in BLAS interfaces: there is a convention about prefixes/suffixes that correspond to a certain variation of the core algorithm, so you as a programmer can make an educated guess based on what the data is going to look like, or just use the generic implementation.
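That convention can be sketched with a toy name builder. This is an illustration only, not a BLAS binding; the s/d/c/z type prefixes are the standard BLAS ones, while the function and dict names are made up here:

```python
# Toy illustration of the BLAS naming convention (not a BLAS binding):
# a one-letter type prefix selects the precision/type variant of a core routine.
TYPE_PREFIX = {"float32": "s", "float64": "d",
               "complex64": "c", "complex128": "z"}

def blas_routine(core, dtype):
    """Build a BLAS-style routine name, e.g. ('gemm', 'float64') -> 'dgemm'."""
    return TYPE_PREFIX[dtype] + core

print(blas_routine("gemm", "float64"))     # dgemm
print(blas_routine("axpy", "complex128"))  # zaxpy
```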
As for the naming, I've gone through the reference a bit and found a "ray of hope": the string functions, which obviously form a cluster of functionality available for Series containers, are conveniently put under the .str attribute. That's a very nice application of namespacing and a very viable approach to the API-clutter problem. Following the Zen: Namespaces are one honking great idea -- let's do more of those!
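The pattern is easy to sketch. Below, `Container` and `StringMethods` are hypothetical classes loosely modeled on pandas' accessor pattern: related methods live under one namespace attribute instead of being flattened onto the container itself:

```python
# Hypothetical accessor class: groups string-related methods in one place.
class StringMethods:
    def __init__(self, data):
        self._data = data

    def upper(self):
        return [s.upper() for s in self._data]

    def startswith(self, prefix):
        return [s.startswith(prefix) for s in self._data]

# Hypothetical container: exposes one `.str` attribute instead of
# many top-level string methods, keeping its own namespace small.
class Container:
    def __init__(self, data):
        self._data = list(data)

    @property
    def str(self):
        return StringMethods(self._data)

c = Container(["alpha", "beta"])
print(c.str.upper())          # ['ALPHA', 'BETA']
print(c.str.startswith("a"))  # [True, False]
```

Autocomplete on the container then shows one entry (`str`) rather than a dozen string methods, which is exactly the clutter reduction being argued for.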
Whenever I monkey-patch new functionality I try to create namespaces. Since we control the IPython autocomplete, we could always move methods from the root to namespaces. The original methods would still work; they just wouldn't autocomplete at the root. I do this for subclasses of pandas objects: if I created a subclass like OHLC, I'd rather only see the custom methods I made.
BUG: fix NDFrame.as_blocks() for sparse containers
SparseBlocks don't consolidate, so the previous implementation silently dropped all but the last block for a given dtype:

when the last output should be:

This also drops the `columns` kwarg of `as_blocks`, since it duplicates `reindex` functionality.