Sparse columns #55


Open
ogrisel opened this issue Aug 25, 2021 · 8 comments
Labels
enhancement New feature or request


ogrisel commented Aug 25, 2021

Should a dedicated API and/or column metadata for efficiently supporting sparse columns be part of the spec?

Context

It can be the case that more than 99% of a given column's values are null or missing (or equal to some other repeated constant value). In that situation, explicitly materializing the repeated values wastes both memory and computation, so a dedicated memory representation that does not materialize them would be valuable.

Use cases

  • efficient computation: e.g. computing the mean and standard deviation of a sparse column with more than 99% zeros
  • efficient computation: e.g. computing the nanmean and nanstd of a sparse column where more than 99% of the values are missing
  • some machine learning estimators have special treatment for sparse columns (e.g. for memory-efficient representation of one-hot encoded categorical data), but often they could (in theory) be changed to handle categorical variables using a different representation if the columns were explicitly tagged as such.
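
To illustrate the first two bullets, here is a minimal sketch (using scipy.sparse purely for demonstration; none of these names are part of the proposed spec) of computing the mean and standard deviation of a very sparse column while only touching the explicitly stored values:

```python
import numpy as np
from scipy import sparse

n = 1_000_000
# a column where ~0.1% of the values are explicitly stored non-zeros
col = sparse.random(n, 1, density=0.001, format="csc", random_state=0)

# mean and std from the stored values only; the implicit zeros
# contribute nothing to the sums, so nothing is densified
stored = col.data
mean = stored.sum() / n
std = np.sqrt((stored ** 2).sum() / n - mean ** 2)  # E[x^2] - E[x]^2

# agrees with the dense computation (up to floating-point rounding)
dense = col.toarray().ravel()
assert np.isclose(mean, dense.mean())
assert np.isclose(std, dense.std())
```

The point of the sketch is that the cost scales with the number of stored values (~1000 here), not with the logical length of the column.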

Limitations

  • treating sparsity at the single-column level can be limiting: some machine learning algorithms that leverage sparsity can only do so when considering many sparse columns together as a sparse matrix in a Compressed Sparse Row (CSR) representation (e.g. logistic regression with non-coordinate-based gradient solvers (SGD, L-BFGS...) and kernel machines (support vector machines, Gaussian processes, kernel approximation methods...))
  • others can leverage sparsity in a column-wise manner, typically by accepting Compressed Sparse Column (CSC) data (e.g. coordinate descent solvers for the Lasso, random forests, gradient boosted trees...)
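
For concreteness, a sketch (scipy.sparse, illustrative only) of how individual sparse columns would need to be combined into the matrix layouts mentioned above:

```python
from scipy import sparse

# three sparse single-column matrices, each roughly 20% dense
cols = [sparse.random(10, 1, density=0.2, format="csc", random_state=i)
        for i in range(3)]

# row-oriented layout, suited to SGD/L-BFGS solvers and kernel machines
X_csr = sparse.hstack(cols, format="csr")
# column-oriented layout, suited to coordinate descent and tree ensembles
X_csc = X_csr.tocsc()

assert X_csr.shape == (10, 3)
assert (X_csr != X_csc).nnz == 0  # same values, different layout
```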

Survey of existing support

(incomplete, feel free to edit or comment)

Questions:

  • Should sparse data structures be allowed to represent both missingness and nullness, or only one of those? (I assume both would be useful, as pandas does with the fill_value param)
  • Should this be some kind of optional module / extension of the main dataframe API spec?
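
To make the first question concrete, pandas already lets the implicit (non-stored) value be either zero or NaN via fill_value; the snippet below shows pandas' current behavior as a reference point, not a spec proposal:

```python
import numpy as np
import pandas as pd

# "zeroness": the values not stored are implicitly 0
s_zero = pd.arrays.SparseArray([0.0, 0.0, 1.0, 0.0], fill_value=0.0)
# "missingness": the values not stored are implicitly NaN
s_nan = pd.arrays.SparseArray([np.nan, np.nan, 1.0, np.nan],
                              fill_value=np.nan)

# both store a single explicit value out of four logical values
assert s_zero.density == 0.25
assert s_nan.density == 0.25
assert np.isnan(s_nan.fill_value)
```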

ogrisel commented Aug 25, 2021

Note: there is a dedicated discussion for single-column categorical data representation in #41.

@rgommers rgommers added the enhancement New feature or request label Aug 25, 2021
@rgommers

> Should sparse data structures be allowed to represent both missingness and nullness, or only one of those? (I assume both would be useful, as pandas does with the fill_value param)

That's a really subtle question, which isn't even worked out in array/tensor libraries that provide sparse data structures. My first inclination was to leave it undefined, because interpretation does not necessarily depend on memory layout. However, there is already an interaction with the missing-data support, so that may not be feasible.

fill_value was something that was looked at quite a bit for PyTorch, but it seems like there's additional complexity and very limited use cases for non-zero fill values.

@rgommers

> Should this be some kind of optional module / extension of the main dataframe API spec?

It seems like only a few libraries support sparse columns. Perhaps a first step would be to use the metadata attribute to store a sparse column and see if two of those libraries can be made to work together. A concrete use case would help a lot.

Memory-layout-wise, sparse is a bit of a problem. Pandas seems to use COO; scipy.sparse has many formats, of which CSR/CSC are the most performant. It'd be nontrivial to write a clear memory layout description here that isn't overly complex.
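
The layout differences are easy to see on a tiny example (scipy.sparse, shown purely for illustration):

```python
import numpy as np
from scipy import sparse

A = sparse.coo_matrix(np.array([[0.0, 1.0],
                                [2.0, 0.0]]))

# COO: parallel (row, col, value) triplets
print(A.row, A.col, A.data)   # [0 1] [1 0] [1. 2.]

# CSR: per-row pointers into sorted column indices
print(A.tocsr().indptr)       # [0 1 2]

# CSC: per-column pointers into sorted row indices
print(A.tocsc().indptr)       # [0 1 2]
```

A memory layout description in the spec would have to pin down which of these index/pointer buffers is exchanged, which is where the complexity comes from.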


ogrisel commented Sep 1, 2021

> A concrete use case would help a lot.

A concrete use case would be to do lossless round-trip conversions of very sparse data between libraries that implement sparse columns, either for zeroness or missingness (or ideally both), without triggering an unexpectedly large memory allocation, a MemoryError, or the OOM killer.

For instance, we could have a dataframe storing the one-hot encoded representation of 6M Wikipedia abstracts, with 100,000 columns for the 100,000 most frequent words in Wikipedia. Assuming Wikipedia abstracts average far fewer than 1,000 words, this easily fits in memory using a sparse representation, but it would probably break (or be very inefficient) if the conversion silently tries to materialize the zeros.


ogrisel commented Sep 1, 2021

That being said, I am not sure dataframe libraries are often used for this kind of sparse data manipulation. Furthermore, text processing with one-hot encoding is less and less popular now that most interesting NLP tasks are done using lower-dimensional dense embeddings from pre-trained neural networks.


rgommers commented Sep 2, 2021

Thanks @ogrisel, the application makes a lot of sense.

> That being said, I am not sure that dataframe libraries are used often for this kind of sparse data manipulation.

Indeed, by use case I also meant: can this actually be done today with two dataframe libraries? If no two libraries support the same format of sparse data, then adding the capability to the protocol may be a bit premature.


ogrisel commented Sep 3, 2021

pandas and vaex both support sparse data (for zeroness) without materialization, although with different memory layouts: vaex uses a scipy.sparse CSR matrix, while pandas has individual sparse columns.

arrow has null chunks that do not store any values if a full chunk is null.


rgommers commented Sep 6, 2021

So we probably should have a prototype that goes from one of pandas/Vaex/Arrow to another one of those libraries without a densification step in between. That may result in something that can be generalized. Given that scipy.sparse should be able to convert between CSR and COO efficiently and pandas is based on COO (with a df.sparse.to_coo() to export to scipy.sparse format), that should be doable.
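
A minimal sketch of the pandas → scipy.sparse leg of such a prototype, using pandas' existing sparse accessor (the Vaex/Arrow side is not shown):

```python
import pandas as pd
from scipy import sparse

df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0.0, 0.0, 1.0, 0.0], fill_value=0.0),
    "b": pd.arrays.SparseArray([0.0, 2.0, 0.0, 0.0], fill_value=0.0),
})

coo = df.sparse.to_coo()  # export to scipy COO, no densification
csr = coo.tocsr()         # cheap layout conversion for row-wise consumers

assert sparse.issparse(csr)
assert csr.shape == (4, 2)
assert csr.nnz == 2       # only the two non-zero values were materialized
```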
