ENH: Add high-performance DecimalArray type? #34166

Closed
sid-kap opened this issue May 13, 2020 · 12 comments · Fixed by #50964
Labels
Enhancement, ExtensionArray

Comments

@sid-kap

sid-kap commented May 13, 2020

Is your feature request related to a problem?

Doing operations with decimal types is currently very slow, since pandas stores them as Python Decimal objects in object-dtype arrays. I think it would be nice to have a high-performance, vectorized DecimalArray ExtensionArray type. Would you be open to a PR adding a type like this, or is this something you think belongs outside the pandas library?

Describe the solution you'd like

Create an ExtensionArray dtype that internally represents numbers the way pyarrow does: as integers, together with a precision and scale. Require that all values have the same precision and scale. Implement basic arithmetic operations (+, -, *, /, etc.).
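
For illustration, a minimal sketch of that scaled-integer idea (the names and values here are made up for the example; this is not an existing pandas API):

from decimal import Decimal
import numpy as np

# Scaled-integer representation: with scale=4, a value is stored as
# value * 10**scale in an int64 array.
scale = 4
values = [Decimal("21.3000"), Decimal("1.1410"), Decimal("0.5300")]

# 21.3000 -> 213000, 1.1410 -> 11410, 0.5300 -> 5300
ints = np.array([int(v.scaleb(scale)) for v in values], dtype=np.int64)

# Addition works directly on the integers because both operands share the scale.
total = ints[0] + ints[1]            # 224410
Decimal(int(total)).scaleb(-scale)   # Decimal('22.4410')

Multiplication and division would need explicit rescaling (two scale-4 operands multiply to a scale-8 result), which is part of what a dedicated implementation would have to handle.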

API breaking implications

Adding __from_arrow__ and __to_arrow__ hooks to DecimalArray would cause decimal data coming from pyarrow to be converted to DecimalArray rather than to an object array by default, which would be a breaking change.
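
As a very rough, incomplete sketch of the hook involved (only the __from_arrow__ protocol name comes from pandas' extension interface; DecimalDtype, DecimalArray, and _from_arrow are hypothetical placeholders):

import pyarrow as pa
from pandas.api.extensions import ExtensionDtype

class DecimalDtype(ExtensionDtype):
    # name, type, construct_array_type, etc. omitted from this sketch

    def __from_arrow__(self, array):
        # pandas calls this when converting pyarrow data (e.g. the result of
        # pd.read_parquet) into this dtype instead of an object array.
        if isinstance(array, pa.ChunkedArray):
            array = array.combine_chunks()
        return DecimalArray._from_arrow(array)  # hypothetical constructor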

Describe alternatives you've considered

The status quo (loading as object arrays) works, but is slow.

Additional context

This would also be really helpful for loading Parquet files that contain decimal types via pyarrow more quickly, since instantiating a Python Decimal object for every value can be very slow for large DataFrames.
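
For reference, this is the slow path in question: by default, pyarrow decimal data (e.g. read from Parquet) is converted to an object-dtype column holding one Python Decimal per row. A small illustration with made-up values:

from decimal import Decimal
import pyarrow as pa

table = pa.table({"amount": pa.array(
    [Decimal("21.3000"), Decimal("1.1410"), None],
    type=pa.decimal128(10, 4),
)})

df = table.to_pandas()
df["amount"].dtype  # dtype('O'): one Python Decimal object per row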

sid-kap added the Enhancement and Needs Triage labels on May 13, 2020
@jreback
Contributor

jreback commented May 13, 2020

Would be +1 on this, and using pyarrow is fine, as we already use it for some other EA types.

@sid-kap
Author

sid-kap commented May 14, 2020

Great, thanks for the quick response! I have started working on it

@jorisvandenbossche
Member

I think this would indeed be interesting to see happen! But two notes:
1) Nothing stops someone from already implementing this in a separate package (the ExtensionArray interface should make this fully possible), and personally I think it makes sense to first experiment with this outside of pandas core; we can still include it later.
2) The fletcher library is already kind of doing this (but for all Arrow types, not just decimal, which might also be a distraction if you want to focus on decimal).

Create an ExtensionArray dtype that internally represents numbers the way pyarrow does: as integers, together with a precision and scale. Require that all values have the same precision and scale. Implement basic arithmetic operations (+, -, *, /, etc.).

Can you explain a bit more how you see this part?
I would say that we should rely on pyarrow itself to store the array (so not just store the data similarly to how pyarrow stores it, but actually use pyarrow), and we should also try to rely on pyarrow for operations (I don't think we would want to implement our own decimal algorithms for basic operations in pandas).

jorisvandenbossche added the Needs Discussion label and removed the Needs Triage label on May 14, 2020
@sid-kap
Author

sid-kap commented May 15, 2020

@jorisvandenbossche Thanks for the detailed response! I think I agree that pandas probably should not implement these low-level operations if there is another library (numpy for the base types, pyarrow or something else for these new extension types) that already implements them. It looks like pyarrow has started implementing operations on arrays for some types in the C++ library at cpp/src/arrow/compute/kernels, but it doesn't look like they've added any decimal functions, other than casting Decimal to Decimal or Decimal to Int. I also don't think pyarrow exposes any of these compute functions yet in the python interface, so in any case we wouldn't be able to use them. (I'm not sure if exposing that interface is one of pyarrow's goals, or if they want other libraries like fletcher to make that interface.)

Regarding fletcher, it also seems that they don't support decimal types right now (see fletcher.base._python_type_map).
So the following code unfortunately fails with a KeyError, since pa.Decimal128Type() is not in that map:

from decimal import Decimal

import pyarrow
import fletcher
import pandas as pd

# pyarrow infers a decimal128 type from the Decimal values
arrow_array = pyarrow.array([
    Decimal('21.3000'), Decimal('1.141'), None, Decimal('0.53')
])
# raises KeyError: decimal types are missing from fletcher's _python_type_map
df = pd.DataFrame({
    'col': fletcher.FletcherChunkedArray(arrow_array)
})

So it seems like the best move for me would be to try to contribute these basic operations to pyarrow, and then, once that interface is ready, think about adding a wrapper to either pandas or fletcher?

(One other note: the ExtensionArray implementation I was playing with would have converted the underlying data, which arrow stores as 128-bit signed ints, to 64-bit signed ints to make them easier to work with in numpy. But that is probably also not a great idea, since we would like the type to roundtrip perfectly between Arrow/Parquet and pandas. So I guess that's another reason to do this in pyarrow itself?)

@jreback
Contributor

jreback commented May 15, 2020

there is also a two-pronged strategy here

I would certainly store the data in pyarrow arrays and then use any operations that pyarrow defines.

for missing operations you could do both:

  • coerce to Python Decimal objects and perform the ops there, to have a working interface (see the sketch after this list)
  • contribute code to pyarrow to implement these kernels
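
A minimal sketch of what the first option could look like, assuming the data lives in pyarrow decimal arrays (decimal_fallback_op is a hypothetical helper, not an existing API):

from decimal import Decimal
import operator
import pyarrow as pa

def decimal_fallback_op(left, right, op):
    # Fallback path: round-trip through Python Decimal objects and apply the
    # operation element-wise. Correct, but no faster than object arrays.
    result = [
        op(a, b) if a is not None and b is not None else None
        for a, b in zip(left.to_pylist(), right.to_pylist())
    ]
    return pa.array(result)

a = pa.array([Decimal("1.10"), Decimal("2.25"), None])
b = pa.array([Decimal("0.90"), Decimal("0.75"), Decimal("1.00")])
decimal_fallback_op(a, b, operator.add)  # [Decimal('2.00'), Decimal('3.00'), None]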

@sid-kap
Author

sid-kap commented May 16, 2020

IMO converting to Decimal in operations sounds like a bad idea. That would be as slow as the current solution of using object arrays (actually slower, since it would require the additional step of converting to and from Decimal each time).

So I think I should look into contributing to pyarrow then! Thanks for the guidance, and feel free to close the issue.

@jorisvandenbossche
Member

It looks like pyarrow has started implementing operations on arrays for some types in the C++ library at cpp/src/arrow/compute/kernels, but it doesn't look like they've added any decimal functions, other than casting Decimal to Decimal or Decimal to Int.

Note there is a bit more than just casting. What is exposed in pyarrow and already works for decimals is, for example, unique / dictionary_encode (which could already support groupby with a decimal column as key), and take for general subsetting/indexing.
(I know this is not much in the way of computation, but these are still essential pieces of functionality for getting it working as a pandas extension array.)
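
For example, these already work today on a pyarrow decimal array (a quick sketch with arbitrary values):

from decimal import Decimal
import pyarrow as pa

arr = pa.array([Decimal("1.50"), Decimal("2.25"), Decimal("1.50"), None])

arr.unique()                # the distinct decimal values
arr.dictionary_encode()     # dictionary-encoded array, usable as a groupby key
arr.take(pa.array([2, 0]))  # general subsetting / indexing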

Further, there are a bunch of element-wise kernels available in the Gandiva submodule, a JIT expression compiler for Arrow memory, which can be used from Python on decimal arrays (though, like the precompiled kernels, it does not cover reductions).

Querying what is currently available for decimal types:

In [1]: import pyarrow as pa

In [2]: from pyarrow import gandiva

In [3]: [f for f in gandiva.get_registered_function_signatures() if any(pa.types.is_decimal(t) for t in f.param_types()) or pa.types.is_decimal(f.return_type())]
Out[3]: 
[FunctionSignature(int64 castBIGINT(decimal(38, 0))),
 FunctionSignature(double castFLOAT8(decimal(38, 0))),
 FunctionSignature(decimal(38, 0) castDECIMAL(int32)),
 FunctionSignature(decimal(38, 0) castDECIMAL(int64)),
 FunctionSignature(decimal(38, 0) castDECIMAL(float)),
 FunctionSignature(decimal(38, 0) castDECIMAL(double)),
 FunctionSignature(decimal(38, 0) castDECIMAL(decimal(38, 0))),
 FunctionSignature(decimal(38, 0) castDECIMAL(string)),
 FunctionSignature(decimal(38, 0) castDECIMALNullOnOverflow(decimal(38, 0))),
 FunctionSignature(decimal(38, 0) add(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(decimal(38, 0) subtract(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(decimal(38, 0) multiply(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(decimal(38, 0) divide(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(decimal(38, 0) mod(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(decimal(38, 0) modulo(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(bool equal(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(bool eq(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(bool same(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(bool not_equal(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(bool less_than(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(bool less_than_or_equal_to(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(bool greater_than(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(bool greater_than_or_equal_to(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(int32 hash(decimal(38, 0))),
 FunctionSignature(int32 hash32(decimal(38, 0))),
 FunctionSignature(int32 hash32AsDouble(decimal(38, 0))),
 FunctionSignature(int32 hash32(decimal(38, 0), int32)),
 FunctionSignature(int32 hash32AsDouble(decimal(38, 0), int32)),
 FunctionSignature(int64 hash64(decimal(38, 0))),
 FunctionSignature(int64 hash64AsDouble(decimal(38, 0))),
 FunctionSignature(int64 hash64(decimal(38, 0), int64)),
 FunctionSignature(int64 hash64AsDouble(decimal(38, 0), int64)),
 FunctionSignature(bool isnull(decimal(38, 0))),
 FunctionSignature(bool isnotnull(decimal(38, 0))),
 FunctionSignature(bool isnumeric(decimal(38, 0))),
 FunctionSignature(bool is_distinct_from(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(bool is_not_distinct_from(decimal(38, 0), decimal(38, 0))),
 FunctionSignature(decimal(38, 0) abs(decimal(38, 0))),
 FunctionSignature(decimal(38, 0) ceil(decimal(38, 0))),
 FunctionSignature(decimal(38, 0) floor(decimal(38, 0))),
 FunctionSignature(decimal(38, 0) round(decimal(38, 0))),
 FunctionSignature(decimal(38, 0) truncate(decimal(38, 0))),
 FunctionSignature(decimal(38, 0) trunc(decimal(38, 0))),
 FunctionSignature(decimal(38, 0) round(decimal(38, 0), int32)),
 FunctionSignature(decimal(38, 0) truncate(decimal(38, 0), int32)),
 FunctionSignature(decimal(38, 0) trunc(decimal(38, 0), int32)),
 FunctionSignature(string castVARCHAR(decimal(38, 0), int64))]

And there is currently work being done in pyarrow to refactor the kernel implementation, which among other things should allow those Gandiva functions to be reused in the precompiled kernels as well (see https://issues.apache.org/jira/browse/ARROW-8792). Once that has landed, it should be more straightforward to add a bunch of kernels for the decimal type.

I'm not sure if exposing that interface is one of pyarrow's goals, or if they want other libraries like fletcher to make that interface

It's certainly the goal of pyarrow to provide the core computational building blocks (so a developer API, just not an end-user API); it should (eventually) provide everything needed to allow fletcher / another project to build a decimal ExtensionArray.

Regarding fletcher, it also seems that they don't support decimal types right now

I am not sure whether this is on purpose, or whether they would welcome contributions (cc @xhochy)

IMO converting to Decimal in operations sounds like a bad idea. That would be as slow as the current solution of using object arrays (actually slower, since it would require the additional step of converting to and from Decimal each time).

I agree that such conversion should always be avoided.
But I also think it can be beneficial to already start experimenting with the pandas ExtensionArray side of things (to allow people to start using it for those things it implements, and provide feedback etc), before all necessary computational functionality is available in pyarrow.


I think it might be good to keep this issue open as general issue about "fast decimal support in pandas", until there is an actual solution / project we can point users towards.

@jreback
Contributor

jreback commented May 16, 2020

IMO converting to Decimal in operations sounds like a bad idea. That would be as slow as the current solution of using object arrays (actually slower, since it would require the additional step of converting to and from Decimal each time).

So I think I should look into contributing to pyarrow then! Thanks for the guidance, and feel free to close the issue.

@sid-kap something working but slow and available now is much better than fast but not available

@xhochy
Contributor

xhochy commented May 18, 2020

I would be happy to merge things into fletcher to support Decimal. It is not yet included in the tests as I don't use the type myself. While you will have to add most of the functionality to pyarrow itself, fletcher should already give you tooling to better test what is missing to make it a fully functional Arrow-backed ExtensionArray.

Also feel free to make a PR to fletcher and add Decimal to the test matrix in https://github.com/xhochy/fletcher/blob/bbb8ab3840bbd9474a3d23018e473e9d7f2fc184/tests/test_pandas_extension.py#L92. Even if some tests are failing, a PR would be welcome. I can then support you in fixing the missing things.

@sid-kap
Author

sid-kap commented May 19, 2020

@xhochy Thanks, will do!

mroeschke added the ExtensionArray label and removed the Needs Discussion label on Aug 7, 2021
@jbrockmendel
Member

@mroeschke does your pyarrow-decimal PR close this?

@mroeschke
Member

Yes, agreed, this would.
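
Recent pandas versions expose pyarrow-backed decimals through ArrowDtype, which is the kind of solution discussed above. A minimal usage sketch (assuming pandas 2.0+ and a recent pyarrow; the version requirements and values are assumptions for illustration, not stated in this thread):

from decimal import Decimal
import pandas as pd
import pyarrow as pa

s = pd.Series(
    [Decimal("21.3000"), Decimal("1.1410"), None],
    dtype=pd.ArrowDtype(pa.decimal128(10, 4)),
)

# Stays pyarrow-backed; arithmetic dispatches to pyarrow compute kernels
# rather than per-element Python Decimal operations.
s + s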
