ENH: Add high-performance DecimalArray type? #34166
Comments
Would be +1 on this, and using pyarrow is fine as we already use it for some other EA types.
Great, thanks for the quick response! I have started working on it.
I think this would indeed be interesting to see happen! But two notes: …

Can you explain a bit more how you see this part?
@jorisvandenbossche Thanks for the detailed response! I think I agree that pandas probably should not implement these low-level operations if there is another library (numpy for the base types, pyarrow or something else for these new extension types) that already implements them. It looks like pyarrow has started implementing operations on arrays for some types in the C++ library at cpp/src/arrow/compute/kernels, but it doesn't look like they've added any decimal functions, other than casting Decimal to Decimal or Decimal to Int. I also don't think …

Regarding fletcher, it also seems that they don't support decimal types right now (see …):

```python
from decimal import Decimal

import pyarrow
import fletcher
import pandas as pd

arrow_array = pyarrow.array([
    Decimal('21.3000'), Decimal('1.141'), None, Decimal('0.53')
])
df = pd.DataFrame({
    'col': fletcher.FletcherChunkedArray(arrow_array)
})
```

So it seems like the best move for me would be to try to contribute these basic operations to pyarrow? And then, once that interface is ready, think about adding a wrapper to either pandas or fletcher?

(One other note: the ExtensionArray implementation I was playing with would have converted the underlying data, which arrow stores as 128-bit signed ints, to 64-bit signed ints to make them easier to work with in numpy. But that is probably also not a great idea, since we would like the type to roundtrip perfectly between Arrow/Parquet and pandas. So I guess that's another reason to do this in pyarrow itself?)
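For reference, the casts mentioned above can be exercised directly from Python today. A minimal sketch (the inferred type and the cast target here are assumptions, not from the thread):

```python
import pyarrow as pa
from decimal import Decimal

# pyarrow infers a decimal128 type from Python Decimal inputs
arr = pa.array([Decimal('21.30'), Decimal('0.53')])
print(arr.type)  # decimal128(4, 2)

# Decimal -> Decimal casting (widening the precision/scale) works today
widened = arr.cast(pa.decimal128(10, 4))
print(widened.type)
```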
There is also a 2-prong strategy here. I would certainly store the data in pyarrow arrays and then use any operations that pyarrow defines. For missing operations you can both: …
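A rough sketch of the conversion fallback discussed here, i.e. round-tripping through Decimal objects for operations pyarrow lacks (the doubling op is a hypothetical stand-in):

```python
import pyarrow as pa
from decimal import Decimal

arr = pa.array([Decimal('1.50'), Decimal('2.25')])

# Fallback for ops pyarrow doesn't implement natively: convert to Python
# Decimal objects, operate element-wise, convert back -- correct but slow.
as_objects = arr.to_pylist()
doubled = pa.array([x * 2 if x is not None else None for x in as_objects])
print(doubled)
```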
IMO converting to Decimal in operations sounds like a bad idea. That would be as slow as the current solution of using object arrays (actually slower, since it would require the additional step of converting to and from Decimal each time). So I think I should look into contributing to pyarrow then! Thanks for the guidance, and feel free to close the issue.
Note there is a bit more than just casting. What is e.g. exposed in pyarrow and works for decimals is …

Further, there are a bunch of element-wise kernels available in the Gandiva submodule, which is a JIT expression compiler for Arrow memory, and which can be used from Python on decimal arrays (but also no reductions). Querying what is currently available for decimal types:
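A sketch of such a query, assuming a pyarrow build that ships the Gandiva bindings (`pyarrow.gandiva`):

```python
import pyarrow
import pyarrow.gandiva as gandiva

# Collect the names of registered Gandiva kernels that accept at least
# one decimal-typed parameter.
signatures = gandiva.get_registered_function_signatures()
decimal_kernels = sorted({
    sig.name() for sig in signatures
    if any(pyarrow.types.is_decimal(t) for t in sig.param_types())
})
print(decimal_kernels)
```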
And there is currently work being done in pyarrow to refactor the kernel implementation, which, amongst other things, should allow those Gandiva functions to be reused in the precompiled kernels as well (see https://issues.apache.org/jira/browse/ARROW-8792). Once that has landed, it should be more straightforward to add a bunch of kernels for the decimal type.
It's certainly the goal of pyarrow to provide the core computational building blocks (so a developer API, just not an end-user API), and it should (eventually) provide everything needed to allow fletcher / another project to build a decimal ExtensionArray.
I am not sure whether this is on purpose, or whether they would welcome contributions (cc @xhochy)
I agree that such conversion should always be avoided. I think it might be good to keep this issue open as a general issue about "fast decimal support in pandas", until there is an actual solution / project we can point users towards.
@sid-kap Something working but slow and available now is much better than fast but not available.
I would be happy to merge things into … Also feel free to make a PR to fletcher and add …
@xhochy Thanks, will do!
@mroeschke does your pyarrow-decimal PR close this? |
Yes, agreed this would …
Is your feature request related to a problem?
Doing operations with decimal types is currently very slow, since pandas stores them as python Decimal objects. I think it would be nice to have a high-performance, vectorized DecimalArray ExtensionArray type. Would you be open to a PR adding a type like this, or is this something you think belongs outside the pandas library?
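As a minimal illustration of the status quo (the values here are arbitrary): decimals land in an object-dtype Series, so every operation dispatches to Python-level Decimal methods element by element.

```python
import pandas as pd
from decimal import Decimal

s = pd.Series([Decimal('1.10'), Decimal('2.25'), Decimal('3.50')])
print(s.dtype)  # object: each element is a boxed Python Decimal
print(s + s)    # works, but runs as a Python-level loop, not vectorized
```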
Describe the solution you'd like
Create an ExtensionArray datatype that internally represents numbers similarly to pyarrow (as ints, with a precision and scale). (Require that all values have the same precision and scale.) Implement basic arithmetic operations (+, -, *, /, etc.).
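A minimal sketch of that representation (the class and method names are hypothetical, and it uses 64-bit ints rather than Arrow's 128-bit storage for simplicity):

```python
import numpy as np
from decimal import Decimal

class DecimalArraySketch:
    """Decimals stored as scaled integers: value = data * 10**-scale."""

    def __init__(self, data: np.ndarray, precision: int, scale: int):
        self._data = data.astype(np.int64)
        self.precision = precision
        self.scale = scale

    @classmethod
    def from_decimals(cls, values, precision, scale):
        factor = 10 ** scale
        data = np.array([int(v * factor) for v in values], dtype=np.int64)
        return cls(data, precision, scale)

    def __add__(self, other):
        # With a shared scale, addition is plain vectorized integer addition.
        assert self.scale == other.scale
        return DecimalArraySketch(self._data + other._data,
                                  self.precision, self.scale)

    def to_decimals(self):
        return [Decimal(int(x)).scaleb(-self.scale) for x in self._data]

arr = DecimalArraySketch.from_decimals([Decimal('21.3'), Decimal('1.141')], 10, 4)
print((arr + arr).to_decimals())  # [Decimal('42.6000'), Decimal('2.2820')]
```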
API breaking implications
By adding a `__from_arrow__` and `__to_arrow__` field to DecimalArray, this would cause decimal data from pyarrow to be converted to DecimalArray rather than an object array by default, which would be a breaking change.

Describe alternatives you've considered
The status quo (loading as object arrays) works, but is slow.
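A sketch of the conversion the hook above would change, as things stand today (the `DecimalDtype` named in the comment is hypothetical):

```python
import pyarrow as pa
from decimal import Decimal

table = pa.table({'x': pa.array([Decimal('1.10'), Decimal('2.25')])})

df = table.to_pandas()
print(df['x'].dtype)  # object today; with a registered __from_arrow__
                      # hook this would become the new DecimalDtype instead
```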
Additional context
This would also be really helpful for loading parquet files that contain decimal types via pyarrow more quickly, since the step of instantiating a Python Decimal object for each value can be very slow for large DataFrames.
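A sketch of that load path (the file name and column are arbitrary):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from decimal import Decimal

table = pa.table({'price': pa.array([Decimal('9.99'), Decimal('10.50')])})
pq.write_table(table, 'prices.parquet')

# Reading back currently materializes one Python Decimal per value; a
# native DecimalArray could keep the scaled-integer buffers instead.
df = pq.read_table('prices.parquet').to_pandas()
print(df['price'].dtype)  # object
```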