Skip to content

BUG: GH10536 in concat for SparseSeries #10626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

kawochen
Copy link
Contributor

To address #10536, but it's clearly not enough. What should be done for SparseSeries of different kinds and different fill values?

@jreback
Copy link
Contributor

jreback commented Jul 19, 2015

you can only promote if the things u r concatting have the same fill

the kind doesn't matter

@sinhrks sinhrks added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type labels Jul 20, 2015
@sinhrks sinhrks added this to the 0.17.0 milestone Jul 20, 2015
@artemyk
Copy link
Contributor

artemyk commented Jul 20, 2015

Not sure if this is accurate, but would calling values on the all the concatenated objects convert them to dense first? It would be nice if sparse concat didn't do this.

@kawochen
Copy link
Contributor Author

.values returns a SparseArray and there is a _concat_compat specialized for sparse types, but beyond that I haven't really checked.

@@ -894,13 +896,21 @@ def get_result(self):
if self.axis == 0:
new_data = com._concat_compat([x.values for x in self.objs])
name = com._consensus_name_attr(self.objs)
return Series(new_data, index=self.new_axes[0], name=name).__finalize__(self, method='concat')
if self._is_sp_series:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't like all of these if-thens. The basic problem is the existence of SparseSeries/SparseDataFrame, which don't really need to exist at all. The blocks actually hold the sparse data. But that is a bit bigger project.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you do ._constructor to solve the immediate issue and allow for those objects to be removed in time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my first approach but I also wanted to keep the symmetry between axis = 1 and axis = 0. And if you want to check fill values etc, the case work seems necessary (if ugly) without refactoring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't replace this block with self._constructor(...?

                if self._is_sp_series:
                    klass = SparseDataFrame
                else:
                    klass = DataFrame
                return klass(...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not seeing how that would work? Do Series and DataFrame share the same constructor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a little fast on the trigger. I didn't realize that you need ._is_frame etc elsewhere.

For this code, this is what I was originally thinking. Tell me if I'm still off, though

In __init__

self.obj_constructor=sample._constructor

and then in get_result

self.obj_constructor(...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the result could be a SparseDataFrame when the sample is a SparseSeries

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sample is a SparseSeries, then sample._constructor will also be a SparseSeries...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can do that if self.axis==0, but concatenation could happen along a different axis, so I thought the code would be easier to understand if the different cases were handled in roughly the same way.

@kawochen
Copy link
Contributor Author

Since SparseSeries is for float64 only according to this doc string, are the tests here valid?

@artemyk
Copy link
Contributor

artemyk commented Jul 22, 2015

@jreback Out of curiosity --- is there any particular reason for the SparseDataFrame vs DataFrame class distinction (as opposed to just having sparse vs dense underlying block managers?)

@jreback
Copy link
Contributor

jreback commented Jul 22, 2015

@artemyk I made this comment in another issue I think, but can't seem to find it.

The short answer is sort-of. I think the original intent of SparseDataFrame was to have a matrix like object that understood indexes. Kind of like a csr/coo (or pick your favorite storage format). That would then for example be all a single dtype, and thus could handle auto-filling.

However after the refactor to move Series out of sub-classing ndarray (0.13). This was no longer necessary (e.g. the SparseBlock became a real thing). So at this point SparseDataFrame is not really necessary at all. SparseSeries OTOH, does have utiliity as it carries around the filling property for example.

So bottom line, I think we could entirely drop SparseDataFrame and everything would work (and of course drop SparsePanel which has an original non-useful internal implementation).

@artemyk
Copy link
Contributor

artemyk commented Jul 22, 2015

@jreback Thanks, that makes sense. I"m actually feeling like the underlying sparse implementation could use an overhaul. For example, if I understand correctly, sparse blocks cannot be consolidated right now. So, if get_dummies is called on a column with 100,000 values, we get a dataframe with 100,000 blocks. I imagine this can strongly hurt performance in some areas. An underlying sparse block manager that can truly handle sparse matrices (not just series) of various data types would be nice.

@jreback
Copy link
Contributor

jreback commented Jul 22, 2015

well, we have been thinking about what to do with sparse. I am not 100% sure why @wesm originally wrote this. But then again I don't know the state of scipy sparse back in 2010.

So we could have an alternative (or maybe THE) sparse repr that actually is something from scipy, e.g. coo/csr. And just use the current index stuff. It would be something of a project, but yes, could be very nice.

@artemyk
Copy link
Contributor

artemyk commented Jul 22, 2015

Unfortunately it seems that scipy.sparse only supports numeric datatypes. Perhaps a wrapper around something like the Boost sparse matrix library (http://www.boost.org/doc/libs/1_45_0/libs/numeric/ublas/doc/matrix_sparse.htm#2CompressedMatrix) could be used instead.

@wesm
Copy link
Member

wesm commented Jul 22, 2015

@jreback AFAIK isn't pandas.sparse the only solution out there for "mostly NA" sparse tables (scipy.sparse is "mostly 0", which is not the same thing at all)? That was the original use case: panel regressions involving future contract data over a long history. So you might have a time series that only appears for 3-6 months out of a 30 year period, and want to convert that to a long (dense) panel form without insane memory use

@jreback
Copy link
Contributor

jreback commented Jul 28, 2015

yeah, have been thinking about sparse and what to do. a suggestion was made to split this off into pandas-sparse, to allow a more freedom to experiment (but still keep some integration). The actual sparse impl is pretty ok. I think we need to blow away SparsePanel (as it has an older impl and not really much supported). Maybe SparseDataFrame (as the blocks can simply be included in a DataFrame), though the usecase for a 2-d sparse repr is there for it. Suggestions welcome, esp someone to lead this effort.

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 16, 2015
@jreback
Copy link
Contributor

jreback commented Oct 11, 2015

@kawochen you want to rebase / update and can review

@jreback
Copy link
Contributor

jreback commented Nov 10, 2015

@kawochen yeh, the sparse stuff needs an overhaul....

closing, but if you'd like to update, pls reopen

@jreback jreback closed this Nov 10, 2015
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Next Major Release Jul 21, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants