-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
CLN: revisit & simplify Block/BlockManager, remove axes #6745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
whoosh! cleans up a lots of code |
Thanks for the info, I'm pretty sure it's a lot slower right now, but I'm going to address performance when the situation stabilizes. As of now, it's breakage everywhere. |
ok np but the general concept seems like a nice idea does make it a lot simpler so big +1 in that perf will come |
iirc their is an OrdersDict somewhere in compat (so that fixes the. 2.6 errors) |
@jreback, one of timeseries-related slow tests fails under weird circumstances on py2.6, any idea on why that may happen? I tried to troubleshoot that but to no avail. |
that uses numpy 1.6.1, maybe set up a virtual env to debug. Their is some coercion in make_block to make sure that a datetime-like is actually coerced to the DatetimeBlock (normally this coercion happens upon in _sanitize_array), but their are cases like this (e.g. insertion to a DataFrame), when that is not called (and the coercion happens in make_block). I on't recall the exact reason that _sanitize_array is NOT used. It might be that is the best soln and eliminate the coercion in make_block. |
Yeah, I figured that out from logs, although I did install some dependencies by hand... |
So, the branch has been green for quite a while, there's still some cleanup potential, but in terms of functionality it's complete. Several issues still exist though:
|
any perf issues? |
if u find a 3rd party library that has a compatible license (bsd) and it can be fully incorporated then it's ok to include it would be shipped inline with pandas (kind of like ujson / msgpack / khash stuff) |
if u have something in mind put up a link and I'll take a look |
index is not new_data.axes[baxis]): | ||
new_data = new_data.copy(deep=copy) | ||
new_data.set_axis(baxis, index) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is copy handled in BlockManager.reindex_indexers ?
I don't think this is well tested and most of the time it should prob copy unless identical indexes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A rule of thumb I try to follow is, yeah, that reindex should copy, unless there's "inplace=True" kwarg somewhere.
But point taken, need to double check that.
I don't think the mode should be changed on the computation/tests/test_eval.py I turn off all mode changing to the repor FYI |
@immerr - okay I got part way through adding a range index. I'll see if I can dust it off soon. Not sure if you want to wait though. |
@jtratner that's awesome, I'm fine with using BlockPlacement for now but I look forward to dropping that kludge in favour of what you're going to implement. |
@jreback, I don't see any computation/tests directory there... where should I look? |
it's the first file in the changes list (it's the mode that is changed) |
Ahh, will fix, thanks |
ping when you want me to look again and squash as appropraite |
There's something weird with how sparse blocks handle reindexing: In [1]: pd.__version__
Out[1]: '0.13.1'
In [2]: pd.DataFrame({'a': pd.SparseArray([1,2,3, np.nan, np.nan]), 'b': [1,2,3,np.nan,np.nan]})
Out[2]:
a b
0 1 1
1 2 2
2 3 3
3 NaN NaN
4 NaN NaN
[5 rows x 2 columns]
In [3]: _2.reindex(_2.index[1:], fill_value=10.)
Out[3]:
a b
1 2 2
2 3 3
3 10 NaN
4 10 NaN
[4 rows x 2 columns] I thought this was a result of my intervention, but this occurs in upstream, too. @jreback, is this expected? |
Ok, this has to be a bug... In [4]: _2.reindex(_2.index[1:], fill_value='foo')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-a9f0316534b7> in <module>()
----> 1 _2.reindex(_2.index[1:], fill_value='foo')
/home/immerrr/.conda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py in reindex(self, index, columns, **kwargs)
2160 def reindex(self, index=None, columns=None, **kwargs):
2161 return super(DataFrame, self).reindex(index=index, columns=columns,
-> 2162 **kwargs)
2163
2164 @Appender(_shared_docs['reindex_axis'] % _shared_doc_kwargs)
/home/immerrr/.conda/envs/py33/lib/python3.3/site-packages/pandas/core/generic.py in reindex(self, *args, **kwargs)
1563 return self._reindex_axes(axes, level, limit,
1564 method, fill_value, copy,
-> 1565 takeable=takeable).__finalize__(self)
1566
1567 def _reindex_axes(self, axes, level, limit, method, fill_value, copy,
/home/immerrr/.conda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py in _reindex_axes(self, axes, level, limit, method, fill_value, copy, takeable)
2115 if index is not None:
2116 frame = frame._reindex_index(index, method, copy, level,
-> 2117 fill_value, limit, takeable=takeable)
2118
2119 return frame
/home/immerrr/.conda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py in _reindex_index(self, new_index, method, copy, level, fill_value, limit, takeable)
2127 return self._reindex_with_indexers({0: [new_index, indexer]},
2128 copy=copy, fill_value=fill_value,
-> 2129 allow_dups=takeable)
2130
2131 def _reindex_columns(self, new_columns, copy, level, fill_value=NA,
/home/immerrr/.conda/envs/py33/lib/python3.3/site-packages/pandas/core/generic.py in _reindex_with_indexers(self, reindexers, method, fill_value, limit, copy, allow_dups)
1689 new_data = new_data.reindex_indexer(index, indexer, axis=baxis,
1690 fill_value=fill_value,
-> 1691 allow_dups=allow_dups)
1692
1693 elif (baxis == 0 and index is not None and
/home/immerrr/.conda/envs/py33/lib/python3.3/site-packages/pandas/core/internals.py in reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups)
3249 for block in self.blocks:
3250 newb = block.reindex_axis(
-> 3251 indexer, axis=axis, fill_value=fill_value)
3252 new_blocks.append(newb)
3253
/home/immerrr/.conda/envs/py33/lib/python3.3/site-packages/pandas/core/internals.py in reindex_axis(self, indexer, method, axis, fill_value, limit, mask_info)
1878 fill_value = self.fill_value
1879 return self.make_block(self.values.take(indexer), items=self.items,
-> 1880 fill_value=fill_value)
1881
1882 def reindex_items_from(self, new_ref_items, indexer=None, method=None,
/home/immerrr/.conda/envs/py33/lib/python3.3/site-packages/pandas/core/internals.py in make_block(self, values, items, ref_items, sparse_index, kind, dtype, fill_value, copy, fastpath)
1827 new_values = SparseArray(values, sparse_index=sparse_index,
1828 kind=kind or self.kind, dtype=dtype,
-> 1829 fill_value=fill_value, copy=copy)
1830 return make_block(new_values, items, ref_items, ndim=self.ndim,
1831 fastpath=fastpath)
/home/immerrr/.conda/envs/py33/lib/python3.3/site-packages/pandas/sparse/array.py in __new__(cls, data, sparse_index, index, kind, fill_value, dtype, copy)
150 if sparse_index is None:
151 values, sparse_index = make_sparse(data, kind=kind,
--> 152 fill_value=fill_value)
153 else:
154 values = data
/home/immerrr/.conda/envs/py33/lib/python3.3/site-packages/pandas/sparse/array.py in make_sparse(arr, kind, fill_value)
509 length = len(arr)
510
--> 511 if np.isnan(fill_value):
512 mask = -np.isnan(arr)
513 else:
TypeError: Not implemented for this type |
the reindex looks like a bug |
reported it as #6949, do you want me to fix this upstream before fixing in this branch? |
Now, there's one authoritative mapping from BlockManager axis0 to axis0 of subordinate blocks: * _blknos[loc] is block number where loc-th item resides * _blklocs[loc] is location in given block of loc-th item Blocks support this mapping with blk._ref_locs attributes that enumerate all items contained in that block.
PERF: add BlockPlacement class to optimize range-like cases CLN: unify get_slice & take/reindex ops along axis0 TST: add slicing/taking/reindexing tests for BlockManagers CLN: remove unused code
@jreback, rebased and green |
ok...bombs away |
and joy you get to support this going forward :) |
CLN: revisit & simplify Block/BlockManager, remove axes
sure...that's reasonable, #163 will leave open though |
ok...travis passed (except a trivial failure), but windows is blowing up (it tests all versions, this just so far)
|
I think you need to wrap all calls the to cython routines with windows does weird things, e.g. when contructing a numpy array, be sure to ALWAYS make it if you have a patch I can test it you can also try to setup a build environment for windows....(see |
Can this be reproduced on 32-bit Linux? |
hmm...I would say it could; has to do with the default type for numpy |
@jreback could you try immerrr@0054ec0 ? |
fails pretty much everywhere
|
instead of using memoryviews, you can prob use: |
buffers didn't work well for generator function afair |
2nd portion of 32-bit fixes: immerrr@dd75f0f |
Is there some sort of a policy (or requirement) to have all pandas-internal ints 64 bit? Platform-int seems a better decision performance wise, no? |
their is an issue about that somewhere, in fact to change to all 64-bit. for simplicity. perf I dont' think makes a huge difference. It just causes massive code complexity (e.g. Index is a case in point, all kinds of conversions to/from platform). The internal doesn't matter per se, but it avoids issues like this. |
This particular issue was caused by me putting int64_t to function declarations (rather tentatively) instead of NPY_INTP. I started changing types to int64_t only to find out that |
ok....your fixes worked....do you want to do a PR for that (make sure it passes travis)... |
will do tomorrow probably, feel free to pull them in directly if you like. |
picked 2972f37 |
an u add a release note and in v0.14.0 enhancement section |
# blocks list, that way new blkno is exactly len(blocks). | ||
# | ||
# FIXME: mgr_groupby_blknos must return mgr_locs in ascending order, | ||
# pytables serialization will break otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback long-shot any idea if this FIXME is still needed or what it would take to address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no idea
sorry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is sure quite a trip down the memory lane :)
Unfortunately, I don't quite remember what was the problem with pytables serialisation here, but looks like mgr_locs are returned in ascending order from lib.get_blkno_indexers
anyway, so it's likely that the FIXME is not needed anymore.
This is a WIP request for Stage 1 deliverable of #6744.
Roadmap:
items
ref_items