Sparse columns #55
Note: there is a dedicated discussion for single-column categorical data representation in #41.

That's a really subtle question, which isn't even worked out in array/tensor libraries that provide sparse data structures. My first impression was to leave it undefined, because interpretation does not necessarily depend on memory layout. However, there is an interaction with the existing missing data support, so that may not be feasible.
It seems like there are only a few libraries that support sparse columns. Perhaps a first step would be to use the … Memory-layout-wise, sparse is a bit of a problem. Pandas seems to use COO; …
A concrete use case would be to do lossless round-trip conversions of very sparse data between libraries that implement sparse columns either for zeroness or missingness (or both, ideally) without triggering an unexpectedly large memory allocation or a …

For instance, we could have a dataframe storing the one-hot encoded representation of 6M Wikipedia abstracts, with 100,000 columns for the 100,000 most frequent words in Wikipedia. Assuming Wikipedia abstracts have much fewer than 1000 words on average, this should easily fit in memory using a sparse representation, but this would probably break (or be very inefficient) if the conversion tries to materialize the zeros silently.
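The memory argument above can be sketched numerically with scipy. This is a hypothetical illustration (scaled down from the 6M × 100,000 Wikipedia example to keep it fast); the sizes and the random one-hot data are made up for the demo:

```python
import numpy as np
import scipy.sparse as sp

# Scaled-down stand-in for the Wikipedia one-hot use case:
# 10k "documents", 100k vocabulary, ~50 words per document.
n_docs, n_vocab, avg_words = 10_000, 100_000, 50
rng = np.random.default_rng(0)
rows = np.repeat(np.arange(n_docs), avg_words)
cols = rng.integers(0, n_vocab, size=n_docs * avg_words)
data = np.ones(len(rows), dtype=np.float32)
X = sp.coo_matrix((data, (rows, cols)), shape=(n_docs, n_vocab)).tocsr()

# CSR stores only the non-zero values plus index arrays.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
# A silently densified float32 copy would need n_docs * n_vocab * 4 bytes.
dense_bytes = n_docs * n_vocab * 4
print(sparse_bytes, dense_bytes)  # sparse is orders of magnitude smaller
```

At the full 6M × 100,000 scale the densified copy would be in the terabyte range, which is why a conversion path that materializes zeros silently would break this use case.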
That being said, I am not sure that dataframe libraries are used often for this kind of sparse data manipulation. Furthermore, text processing with one-hot encoding is less and less popular now that most interesting NLP tasks are done using lower dimensional dense embeddings from pre-trained neural networks.
Thanks @ogrisel, the application makes a lot of sense.
Indeed, by use case I also meant: can this actually be done today with two dataframe libraries? If there are no two libraries with support for the same format of sparse data, then adding the capability to the protocol may be a bit premature.
pandas and vaex both support sparse data (for zeroness) without materialization, although with different memory layouts. vaex uses a scipy.sparse CSR matrix, while pandas has individual sparse columns. Arrow has null chunks that do not store any values if a full chunk is null.
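The pandas side of this can be sketched with its sparse extension arrays: a sparse column stores only the non-fill values plus an integer index, and a frame of sparse columns can round-trip to a scipy COO matrix without densifying. A minimal sketch, assuming pandas >= 1.0 with scipy installed (the toy data is made up):

```python
import numpy as np
import pandas as pd

# A mostly-zero column: only 3 of 1000 values are materialized.
dense = np.zeros(1000)
dense[[3, 7, 500]] = [1.0, 2.0, 3.0]
s = pd.Series(pd.arrays.SparseArray(dense, fill_value=0.0))

print(s.sparse.npoints)    # number of stored (non-fill) values: 3
print(s.sparse.sp_values)  # the stored values themselves

# A frame whose columns are all sparse converts to scipy COO
# without materializing the zeros.
df = pd.DataFrame({"a": s})
coo = df.sparse.to_coo()
print(coo.nnz)  # 3
```

A prototype for the protocol would presumably need to expose something like these value/index buffers directly, since vaex's CSR layout and pandas' per-column layout do not share a common in-memory form today.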
So we probably should have a prototype that goes from one of pandas/vaex/Arrow to another of those libraries without a densification step in between. That may result in something that can be generalized. Given that …
Should a dedicated API/column metadata to efficiently support sparse columns be part of the spec?
Context
It can be the case that a given column has more than 99% of its values null or missing (or equal to some other repeated constant value), and therefore we would waste both memory and computation unless we use a dedicated memory representation that does not materialize these repeated values explicitly.
Use cases
nanmean and nanstd of a sparse column with more than 99% of its values missing

Limitations
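The nanmean/nanstd use case can be sketched with a pandas sparse column whose fill value is NaN, so the missing entries are never materialized. A hypothetical illustration (array sizes made up), assuming pandas >= 1.0:

```python
import numpy as np
import pandas as pd

# One million entries, of which only 100 (0.01%) are non-missing.
dense = np.full(1_000_000, np.nan)
dense[:100] = np.arange(100, dtype=float)
# fill_value=NaN means the NaN positions are simply not stored.
s = pd.Series(pd.arrays.SparseArray(dense, fill_value=np.nan))

print(s.sparse.density)  # fraction of stored values: 0.0001
print(s.mean())          # nanmean over the 100 stored values -> 49.5
```

The point for the spec question is that a consumer receiving such a column through the protocol should be able to run this kind of reduction against the stored values only, rather than a densified million-element buffer.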
Survey of existing support
(incomplete, feel free to edit or comment)
Questions:
fill_value (pandas' fill_value param)