
[POC] PERF: 1d version of the cython grouped aggregation algorithms #39861


Conversation

jorisvandenbossche
Member

@jorisvandenbossche commented Feb 17, 2021

For now, this is just a small example to open discussion about how to tackle this topic (not meant for merging).

Context: for a columnar dataframe (with ArrayManager), we would also perform the groupby operations column by column (instead of on the 2D blocks). Most of our custom cython aggregation algorithms are currently written for 2D arrays (e.g. group_add, group_mean, etc.; I think only fill/shift/any/all are 1D). Because they are 2D, they carry some performance penalty when used on a single column.

Experiment: I copied the implementation for "mean" and made a 1D version of it (exactly the same code, but with the for loop over the columns removed and the indexing simplified to index into 1D arrays instead of 2D arrays).
With that, I did some timings to see the impact:

import numpy as np
import pandas as pd

N = 1_000_000
df = pd.DataFrame(np.random.randn(N, 10), columns=list("abcdefghij"))
df["key"] = np.random.randint(0, 100, size=N)

# below I am manually doing the aggregation part for one column of the following groupby operation
expected = df.groupby("key").mean()

Using the existing 2D algorithm on a single column:

codes, uniques = pd.factorize(df["key"], sort=True)
out_shape = (len(uniques), 1)
out_dtype = "float64"

values = df[['a']]._mgr.blocks[0].values.T

result = np.empty(out_shape, out_dtype)
result.fill(np.nan)
counts = np.zeros(len(uniques), dtype=np.int64)

pd._libs.groupby.group_mean_float64(result, counts, values, codes, min_count=-1)

Using the new 1D version:

out_shape1d = (len(uniques),)
out_dtype = "float64"

values1d = df['a']._mgr.blocks[0].values.copy()

result1d = np.empty(out_shape1d, out_dtype)
result1d.fill(np.nan)
counts = np.zeros(len(uniques), dtype=np.int64)

pd._libs.groupby.group_mean_1d_float64(result1d, counts, values1d, codes, min_count=-1)

Ensure that both give the same result:

In [4]: np.allclose(result.squeeze(), expected['a'].values)
Out[4]: True

In [5]: np.allclose(result1d, expected['a'].values)
Out[5]: True

And timings for both:

In [8]: %%timeit
   ...: result = np.empty(out_shape, out_dtype)
   ...: result.fill(np.nan)
   ...: counts = np.zeros(len(uniques), dtype=np.int64)
   ...: pd._libs.groupby.group_mean_float64(result, counts, values, codes, min_count=-1)

4.15 ms ± 54.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %%timeit
   ...: result1d = np.empty(out_shape1d, out_dtype)
   ...: result1d.fill(np.nan)
   ...: counts = np.zeros(len(uniques), dtype=np.int64)
   ...: pd._libs.groupby.group_mean_1d_float64(result1d, counts, values1d, codes, min_count=-1)

2.62 ms ± 38.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So for a single column, the 1D version is ca 30-40% faster (when increasing the size of the array by 10x, I also get ~40% faster).

Some notes:

  • The actual aggregation is only a part of the full groupby operation, but for sufficiently large data it becomes the most significant portion. E.g. for the df.groupby("key").mean() example above, group_mean takes ca 1.5x the time of the factorization step (that's for multiple columns, though; when profiling the grouping of a single column, the ratio flips: ca 38% in group_mean and ca 62% in factorization). A quick way to inspect this split is sketched after this list.
  • If we want to have an optimized groupby for the ArrayManager-backed dataframe, I think we will need those 1D versions of the cython algos. Of course, what I did here would give a huge duplication in code (and is also not easily templateable) in the short term.
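One way to inspect that split (a rough sketch; output not reproduced here) is with IPython's profiler:

# rough view of where time goes in the full groupby operation
%prun -l 15 df.groupby("key").mean()        # multiple columns
%prun -l 15 df.groupby("key")["a"].mean()   # single column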

xref #39146

@jorisvandenbossche marked this pull request as draft February 17, 2021 09:10
@jreback
Contributor

jreback commented Feb 17, 2021

this is a really strange result
you eliminated a loop over the columns

i wouldn't expect any difference here

my guess is that the 2d result array memory layout is unfavorable compared to the 1d

@jorisvandenbossche
Member Author

The inner part of the function for the 2D version is:

                val = values[i, j]
                # not nan
                if val == val:
                    nobs[lab, j] += 1
                    y = val - compensation[lab, j]
                    t = sumx[lab, j] + y
                    compensation[lab, j] = t - sumx[lab, j] - y
                    sumx[lab, j] = t

while for the 1D version it is:

            val = values[i]
            # not nan
            if val == val:
                nobs[lab] += 1
                y = val - compensation[lab]
                t = sumx[lab] + y
                compensation[lab] = t - sumx[lab] - y
                sumx[lab] = t

So all arr[i, j] 2D indexing steps are replaced with the simpler arr[i]. Personally, I don't find it surprising that this gives some improvement (how much improvement to expect, I have no idea apart from the benchmarking I did, as nothing else changed).
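For reference, a rough sketch of what a complete 1D kernel along these lines could look like (illustrative only, not the exact code in this branch; the float64-only signature and the name group_mean_1d are placeholders), assembled from the loop body above:

%%cython
import cython
from cython import Py_ssize_t

import numpy as np

cimport numpy as cnp
from numpy cimport float64_t, int64_t
from libc.math cimport NAN
cnp.import_array()


@cython.wraparound(False)
@cython.boundscheck(False)
def group_mean_1d(float64_t[::1] out,
                  int64_t[::1] counts,
                  const float64_t[::1] values,
                  const int64_t[::1] labels,
                  Py_ssize_t min_count=-1):
    cdef:
        Py_ssize_t i, lab, N = len(values), ncounts = len(counts)
        float64_t val, count, y, t
        float64_t[::1] sumx = np.zeros(ncounts, dtype=np.float64)
        float64_t[::1] compensation = np.zeros(ncounts, dtype=np.float64)
        int64_t[::1] nobs = np.zeros(ncounts, dtype=np.int64)

    with nogil:
        for i in range(N):
            lab = labels[i]
            if lab < 0:
                continue
            counts[lab] += 1
            val = values[i]
            # not nan
            if val == val:
                nobs[lab] += 1
                # Kahan summation, same as the 2D version
                y = val - compensation[lab]
                t = sumx[lab] + y
                compensation[lab] = t - sumx[lab] - y
                sumx[lab] = t

        for i in range(ncounts):
            count = nobs[i]
            if nobs[i] == 0:
                out[i] = NAN
            else:
                out[i] = sumx[i] / count

It would be called the same way as group_mean_1d_float64 in the timings above.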

my guess is that the 2d result array memory layout is unfavorable compared to the 1d

That was my first guess as well (for example, for the plain (non-grouped) reductions this matters), but from testing it with the example above, that doesn't seem to be the case here. Explicitly testing the different memory layouts with the 2D algorithm:

In [48]: values = df[['a']]._mgr.blocks[0].values.T

In [49]: valuesC = values.copy(order="C")

In [50]: valuesF = values.copy(order="F")

In [51]: %%timeit
    ...: result = np.empty(out_shape, out_dtype)
    ...: result.fill(np.nan)
    ...: counts = np.zeros(len(uniques), dtype=np.int64)
    ...: pd._libs.groupby.group_mean_float64(result, counts, values, codes, min_count=-1)
4.12 ms ± 90.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [52]: %%timeit
    ...: result = np.empty(out_shape, out_dtype)
    ...: result.fill(np.nan)
    ...: counts = np.zeros(len(uniques), dtype=np.int64)
    ...: pd._libs.groupby.group_mean_float64(result, counts, valuesC, codes, min_count=-1)
4.26 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [53]: %%timeit
    ...: result = np.empty(out_shape, out_dtype)
    ...: result.fill(np.nan)
    ...: counts = np.zeros(len(uniques), dtype=np.int64)
    ...: pd._libs.groupby.group_mean_float64(result, counts, valuesF, codes, min_count=-1)
4.09 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And additionally, explicitly testing how it would work for a 1D column that needs to be reshaped before being passed into the algorithm:

In [63]: values1d = df['a']._mgr.blocks[0].values.copy()

In [64]: values1d_reshaped = values1d[:, None]

In [65]: %%timeit
    ...: result = np.empty(out_shape, out_dtype)
    ...: result.fill(np.nan)
    ...: counts = np.zeros(len(uniques), dtype=np.int64)
    ...: pd._libs.groupby.group_mean_float64(result, counts, values1d_reshaped, codes, min_count=-1)
4.13 ms ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So none of those show any significant difference.
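For completeness, the layouts being compared can be inspected directly (no output shown here, but for an (N, 1) column the C/F distinction largely collapses, which is consistent with the near-identical timings):

# quick sanity check of the layouts being timed above
for name, arr in [("values", values), ("valuesC", valuesC), ("valuesF", valuesF)]:
    print(name, arr.shape, arr.strides,
          arr.flags.c_contiguous, arr.flags.f_contiguous)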

@jorisvandenbossche
Member Author

And to be explicit: I am testing the 2D (vs 1D) algorithm here for a single column (so an array of shape (N, 1)). I suppose that the 2D algorithm will still be faster for an actual 2D array from multiple columns, compared to calling the 1D algorithm multiple times on the different columns.
But what I am benchmarking here is exactly the case where we have our columns stored as 1D arrays but still need to use the 2D algorithm, versus having a 1D algorithm.

@jreback
Contributor

jreback commented Feb 17, 2021

i think if u reverse the ordering

eg j, lab this will have same perf (need to transpose when creating memory)

as the indexing is the costly part here

@jorisvandenbossche
Member Author

as the indexing is the costly part here

Exactly, and I reduce the amount of indexing (indexing into 2D vs 1D)

i think if u reverse the ordering, eg j, lab this will have same perf (need to transpose when creating memory)

Not fully sure what you mean, but I did verify the result for all cases to be identical with the actual df.groupby().mean(), so I don't think I did something wrong in the for loop (otherwise I wouldn't get the correct result?)
And I did transpose the block values in the first case.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 17, 2021

i think if u reverse the ordering, eg j, lab this will have same perf

Ah, I think I understand now. You didn't mean something was wrong in the current example, but that if the 2D algorithm would be written to work on data transposed compared to what it is now, it would be faster?

Currently, the values are passed in as an array with shape (n_rows, n_cols), and you suggest trying what happens if they are passed in as (n_cols, n_rows)? (so the indexing like sumx[lab, j] would become sumx[j, lab])

(I don't directly see how that would make a difference, though, as we are still indexing in a 2D array to get a single scalar)

@jreback
Contributor

jreback commented Feb 17, 2021

yea exactly

try to fill in reversed indexing order to see if that makes the 2d case better

@jreback
Contributor

jreback commented Feb 17, 2021

the issue is the cost of the indexing is dependent on the memory layout

@jreback
Contributor

jreback commented Feb 17, 2021

and i think the 2d case would be more performing if we allocate 1d contiguous blocks as these will be cache performant

and since then aggregating you don't need to concat then

@jorisvandenbossche
Member Author

Will try that. Maybe it indeed has a significant enough impact for the 1D-as-2D array case.

@jorisvandenbossche
Member Author

I gave it a try (see the last commit, please check if this is what you meant), but unfortunately not seeing any improvement.

Using the same example from above, first the 2D version as currently on master (testing on a single column as an (N, 1) array):

out_shape = (len(uniques), 1)
out_dtype = "float64"
values = df[['a']]._mgr.blocks[0].values.T

result = np.empty(out_shape, out_dtype)
result.fill(np.nan)
counts = np.zeros(len(uniques), dtype=np.int64)
pd._libs.groupby.group_mean_float64(result, counts, values, codes, min_count=-1)

and then the "transposed" version (so without transposing the block values, and thus the array passed into the function has shape (1, N) instead of (N, 1)):

out_shape_transposed = (1, len(uniques))
values_transposed = df[['a']]._mgr.blocks[0].values

result_transposed = np.empty(out_shape_transposed, out_dtype)
result_transposed.fill(np.nan)
counts = np.zeros(len(uniques), dtype=np.int64)
pd._libs.groupby.group_mean_transposed_float64(result_transposed, counts, values_transposed, codes, min_count=-1)

Ensuring we still get the same correct result:

In [9]: result.shape
Out[9]: (100, 1)

In [10]: result_transposed.shape
Out[10]: (1, 100)

In [11]: np.allclose(result.squeeze(), result_transposed.squeeze())
Out[11]: True

And now timing it:

In [15]: %%timeit
    ...: result_transposed = np.empty(out_shape_transposed, out_dtype)
    ...: result_transposed.fill(np.nan)
    ...: counts = np.zeros(len(uniques), dtype=np.int64)
    ...: pd._libs.groupby.group_mean_transposed_float64(result_transposed, counts, values_transposed, codes, min_count=-1)
4.1 ms ± 42.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [16]: values_transposedC = values_transposed.copy("C")

In [17]: values_transposedF = values_transposed.copy("F")

In [18]: %%timeit
    ...: result_transposed = np.empty(out_shape_transposed, out_dtype)
    ...: result_transposed.fill(np.nan)
    ...: counts = np.zeros(len(uniques), dtype=np.int64)
    ...: pd._libs.groupby.group_mean_transposed_float64(result_transposed, counts, values_transposedC, codes, min_count=-1)
4.13 ms ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [19]: %%timeit
    ...: result_transposed = np.empty(out_shape_transposed, out_dtype)
    ...: result_transposed.fill(np.nan)
    ...: counts = np.zeros(len(uniques), dtype=np.int64)
    ...: pd._libs.groupby.group_mean_transposed_float64(result_transposed, counts, values_transposedF, codes, min_count=-1)
4.27 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the 2D version with reversed indexing (transposed input array) is giving more or less the same speed as the 2D version currently in pandas.

@jorisvandenbossche
Member Author

@jreback my suspicion is that we don't see improvement because it's mostly the array setitem that is costly (at least relative to the array getitem), and for setitem it seems to make less difference what the memory layout is of the array we are setting values to.
(based on some small experiments in cython with a function just getting/setting values in an array, and timing that with different memory layouts. For only getting values I do see a difference)
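A minimal sketch of that kind of experiment (hypothetical helper names, not the exact code that was run) could look like this, again using the %%cython magic:

%%cython
import cython
from cython import Py_ssize_t

cimport numpy as cnp
from numpy cimport float64_t
cnp.import_array()


@cython.wraparound(False)
@cython.boundscheck(False)
def only_get(float64_t[:, :] values):
    # only read elements; accumulate so the loop is not optimized away
    cdef:
        Py_ssize_t i, j
        float64_t total = 0.0
    for i in range(values.shape[0]):
        for j in range(values.shape[1]):
            total += values[i, j]
    return total


@cython.wraparound(False)
@cython.boundscheck(False)
def only_set(float64_t[:, :] out, float64_t val):
    # only write elements
    cdef Py_ssize_t i, j
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = val

and then timing reads and writes separately against different layouts:

import numpy as np
arr = np.random.randn(1_000_000, 1)
out = np.empty_like(arr)
%timeit only_get(arr)                     # read-only, original layout
%timeit only_get(arr.copy(order="F"))     # read-only, different layout
%timeit only_set(out, 0.0)                # write-only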

@jreback
Contributor

jreback commented Feb 17, 2021

ok my point is that there is something fundamentally different here in indexing - i'm this is the speed up - just have to repro

@jorisvandenbossche
Member Author

i'm this is the speed up - just have to repro

Can you clarify this part of your sentence ;)


counts[lab] += 1
for j in range(K):
    val = values[j, i]
Member


would it make a difference if we defined temp_values = values[:, i] (or values[i] in non-transposed version) outside the j-loop and then set val = temp_values[j] inside the loop?

definitely seems weird that this would have a big perf impact

@jreback
Contributor

jreback commented Feb 17, 2021

i'm this is the speed up - just have to repro

Can you clarify this part of your sentence ;)

What I mean (and this is a theory) is that the 2d array (even though it's a single column) is created in a strided way, and so indexing it is more expensive than indexing a single contiguous block. IOW I would expect that creating the 2D array in a transposed way (and changing the indexing to access the last dim first) would negate any indexing penalty vs indexing a single contiguous array.

@jbrockmendel
Member

another possibility: the memoryview declarations dont include anything about the strides, might be optimizations available there

@jbrockmendel
Member

Just tried adding the contiguity to the declaration and the times from the OP went

3.33 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master
2.68 ms ± 80 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- branch

@jreback
Contributor

jreback commented Feb 17, 2021

Just tried adding the contiguity to the declaration and the times from the OP went

3.33 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master
2.68 ms ± 80 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- branch

is this on the original?

@jbrockmendel
Member

is this on the original?

The exact code being profiled is based on the OP:

import pandas as pd
import numpy as np
N = 1_000_000
df = pd.DataFrame(np.random.randn(N, 10), columns=list("abcdefghij"))
df["key"] = np.random.randint(0, 100, size=N)

codes, uniques = pd.factorize(df["key"], sort=True)
out_shape = (len(uniques), 1)
out_dtype = "float64"
values = df[['a']]._mgr.blocks[0].values.T

In [8]: %%timeit
   ...: result = np.empty(out_shape, out_dtype)
   ...: result.fill(np.nan)
   ...: counts = np.zeros(len(uniques), dtype=np.int64)
   ...: pd._libs.groupby.group_mean_float64(result, counts, values, codes, min_count=-1)
3.35 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master
2.68 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- branch

where branch's diff from master is

 @cython.wraparound(False)
 @cython.boundscheck(False)
-def _group_mean(floating[:, :] out,
-                int64_t[:] counts,
+def _group_mean(floating[:, ::1] out,
+                int64_t[::1] counts,
                 ndarray[floating, ndim=2] values,
-                const int64_t[:] labels,
+                const int64_t[::1] labels,
                 Py_ssize_t min_count=-1):
     cdef:
         Py_ssize_t i, j, N, K, lab, ncounts = len(counts)
         floating val, count, y, t
-        floating[:, :] sumx, compensation
-        int64_t[:, :] nobs
+        floating[:, ::1] sumx, compensation
+        int64_t[:, ::1] nobs
         Py_ssize_t len_values = len(values), len_labels = len(labels)

@jreback
Contributor

jreback commented Feb 18, 2021

great so this is a memory layout issue

happy to take this improvement then

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 18, 2021

another possibility: the memoryview declarations dont include anything about the strides, might be optimizations available there

Thanks Brock! Indeed, specifying the strides means that cython generates simpler (faster) code.
(I see only a smaller speed-up though, around 5%)

And since in your diff, it's only specified for result (and other arrays created specifically for the groupby operation), we can be sure they are in C order, and it should be safe to apply this optimization, I suppose (the values are not necessarily in C order, unless we would add a check for that / potentially copy).
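(As an illustration, such a guard could be as simple as the following, at the cost of a potential copy:)

# hypothetical check before calling a kernel declared with e.g. floating[:, ::1] values
if not values.flags.c_contiguous:
    values = np.ascontiguousarray(values)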

Now, for me, adding the same trick to the 1D version of the algorithm also gives a speed-up, and so the 1D version still gives a similar relative speed-up as before (around 30%).
So I think the main point (a 1D version of the algorithm can give a performance benefit when starting from 1D arrays) still stands. Now, this was mostly meant as experimentation (I don't expect us to duplicate the full groupby.pyx file at this point). But we then might need to take into account, when comparing performance of the ArrayManager for groupby, that this can still be sped up with a simple change (actually a simplification).

@jreback
Contributor

jreback commented Feb 18, 2021

@jorisvandenbossche i still don't understand where this benefit is coming from

can u be more specific - again you are now indexing 1d vs 2d right? is that the difference ?
why does this give a speed up at all?

@jorisvandenbossche
Member Author

i still don't understand where this benefit is coming from

All the code is here, the algorithm is in the diff, the code that I ran is above. If you don't believe me, you can test it yourself.

It's quite simple: indexing a 1D array is cheaper than indexing a 2D array (i.e. values[i] vs values[i, j]). Even when the memory layout is otherwise optimal, working with 1D is still faster. Personally, I don't find that surprising.

@jbrockmendel
Member

@scoder the timings here seem very weird to 2/3 of this thread's participants. do they make sense to you?

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 18, 2021

I don't think it's worth @scoder's time to go through our long discussion and the complex example. So here is a simplified reproducer that shows the difference.
Below are two small functions that just get a value from one array and set it in another array: one version does that for a 2D array, the other for a 1D array (I ran it in a notebook, so the %%cython magic is included):

%%cython
import cython
from cython import Py_ssize_t

import numpy as np

cimport numpy as cnp
from numpy cimport float64_t, ndarray
cnp.import_array()


@cython.wraparound(False)
@cython.boundscheck(False)
def function_2D(ndarray[float64_t, ndim=2] values, float64_t[:, ::1] out):
    cdef:
        Py_ssize_t i, j, N, K
        float64_t val
    
    N, K = (<object>values).shape
    
    with nogil:
        for i in range(N):
            for j in range(K):
                val = values[i, j]
                out[i, j] = val

                
@cython.wraparound(False)
@cython.boundscheck(False)
def function_1D(ndarray[float64_t, ndim=1] values, float64_t[::1] out):
    cdef:
        Py_ssize_t i, N
        float64_t val
    
    N, = (<object>values).shape
    
    with nogil:
        for i in range(N):
            val = values[i]
            out[i] = val

and then testing and timing it with:

N = 1_000_000
values1D = np.random.randn(N)
values2D = values1D[:, None]

assert values1D.flags.c_contiguous
assert values2D.flags.c_contiguous
assert values1D.shape == (N, )
assert values2D.shape == (N, 1)

out2D = np.empty(values2D.shape, dtype="float64")
out1D = np.empty(values1D.shape, dtype="float64")
%timeit function_2D(values2D, out2D)
1.25 ms ± 69.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit function_1D(values1D, out1D)
969 µs ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So indexing into a 1D array is faster than indexing into a 2D array (even though the latter is in practice just a 1D array reshaped to (N, 1), so the same number of elements are read/written). Is this to be expected?

@scoder

scoder commented Feb 18, 2021

One thing that jumps out is that you're mixing the old ndarray buffer syntax with memoryviews here. Memoryviews are designed for fast index calculations that the C compiler can optimise well, but the old buffers weren't. That seems the most likely reason to me, without looking into it more deeply.

@scoder

scoder commented Feb 18, 2021

Also seems worth testing if PGO makes a difference here. If the buffer index calculation code is the issue, PGO tuning might be able to make up for its disadvantages. At least somewhat.

@jorisvandenbossche
Member Author

Yes, in my small example I used the old ndarray way for the input values, to mimic how it's done in the actual groupby algorithm (I am not fully sure if there is a specific reason why we are not using memoryviews there, though).

But, surprisingly, quickly changing ndarray[float64_t, ndim=2] values to float64_t[:, ::1] values (and similar for the 1D function) in the example above actually slows things down quite a bit for the 2D case:

%timeit function_2D(values2D, out2D)
2.53 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit function_1D(values1D, out1D)
901 µs ± 46.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@jbrockmendel
Member

https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby.pyx#L650 With the exception of the values arg, I think we are using the memoryview syntax as consistently as possible. values I don't think we can change yet, until const floating[:, :] becomes available.

#39861 (comment) tries making the memoryview declarations more specific and gets a decent speedup.

Broadly consistent with Joris's results, I'm finding that if I further change the signature from #39861 (comment) to replace ndarray[floating, ndim=2] values with floating[:, :] values, the runtime increases from ~2.6 ms to ~2.9 ms.

(using floating[:, ::1] gets that down to 2.8ms (though with a much higher variance))

@scoder
Copy link

scoder commented Feb 21, 2021

  1. Could you try a larger data set? 2.6 ms vs. 2.9 ms seems rather tiny, especially given that three arrays are being created initially, plus two Python instances of the memory view ((<object>out/values).shape). Those are fairly costly operations in comparison to the rest.

  2. Try using PGO to see what the C compiler can really make of it. A difference of 0.3 ms may well be smaller than the benefit that PGO could give you.

@jbrockmendel
Member

Trying on a 2-column case (groupby.pyx edits the same as #39861 (comment))

import pandas as pd
import numpy as np

N = 1_000_000
M = 10
K = 2

df = pd.DataFrame(np.random.randn(N, M))
df["key"] = np.random.randint(0, 100, size=N)

codes, uniques = pd.factorize(df["key"], sort=True)
out_shape = (len(uniques), K)
out_dtype = "float64"
values = df[list(range(K))]._mgr.blocks[0].values.T

result = np.empty(out_shape, out_dtype)
result.fill(np.nan)
counts = np.zeros(len(uniques), dtype=np.int64)

In [8]: %%timeit
   ...: pd._libs.groupby.group_mean_float64(result, counts, values, codes, min_count=-1)

4.29 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- branch
8.93 ms ± 54.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master

2x speedup definitely worth implementing. Still not clear why the K=1 case has the perf penalty vs the 1D version.

@jbrockmendel
Member

plus two Python instances of the memory view ((<object>out/values).shape). Those are fairly costly operations in comparison to the rest.

btw why is the cast necessary? i thought memoryviews have a shape attribute (searching for "shape" https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html and looking at the last result)

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 23, 2021

Could you try a larger data set?

Also with a 100x larger array, I see a similar difference: 130ms for the 2D function vs 110ms for the 1D function (using the dummy get/set functions from #39861 (comment)).

Try using PGO to see what the C compiler can really make of it.

Is specifying --pgo with the IPython magic a correct way to use it? (in combination with calling the function in the cell that compiles the code)
If that's correct, I don't see any improvement (actually it's worse: the 2D function becomes slower, like when using a memoryview for the values).


Can someone try to explain why you don't expect the 1D function to be faster than the 2D function?
The 1D version only needs to take into account a single stride, so it can do faster indexing?

@jbrockmendel
Member

Can someone try to explain why you don't expect the 1D function to be faster than the 2D function?

incrementing values2d[0, i] to values2d[0, i+1] should be equivalent (same number of strides/bytes?) to incrementing values1d[i] to values1d[i+1]

@jorisvandenbossche
Member Author

How is that the same number of strides? values2d[0, i] is 2 strides to check, values1d[i] only 1 (the fact that the 0 index in the first case is constant is not something cython/c knows? Maybe that's what the PGO is meant to optimize, but based on my test that doesn't happen).

To be very explicit, this is what cython generates for out[i, j] = val and out[i] = val

__pyx_t_13 = __pyx_v_i;
__pyx_t_12 = __pyx_v_j;
*((__pyx_t_5numpy_float64_t *) ( /* dim=1 */ ((char *) (((__pyx_t_5numpy_float64_t *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_13 * __pyx_v_out.strides[0]) )) + __pyx_t_12)) )) = __pyx_v_val;
__pyx_t_8 = __pyx_v_i;
*((__pyx_t_5numpy_float64_t *) ( /* dim=0 */ ((char *) (((__pyx_t_5numpy_float64_t *) __pyx_v_out.data) + __pyx_t_8)) )) = __pyx_v_val;

Because out is specified as C contiguous, it knows it can just increment for the last dimension (without checking the strides), but for the 2D version it needs to check the strides for the first dimension each time (even though this is constant 0 for the 1D array passed to the 2D function).
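(The extra stride bookkeeping can also be seen from Python; no output shown here, but the (N, 1) view carries two strides where the 1D array has a single one:)

values1d = df['a']._mgr.blocks[0].values.copy()
values2d = values1d[:, None]
print(values1d.strides)   # one stride, combined with the single index
print(values2d.strides)   # two strides, even though the second axis has length 1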

@jbrockmendel
Member

the fact that the 0 index in the first case is constant is not something cython/c knows

OK, I was under the impression the compiler could figure that out.

@simonjayhawkins added the Performance label Mar 14, 2021
@jbrockmendel
Member

would it make sense to have the 2D functions check if they are single-column and if so dispatch to the 1D version?
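(A rough illustration of what such a dispatch could look like, assuming a hypothetical 1D counterpart exists — not actual pandas code:)

def group_mean(out, counts, values, labels, min_count=-1):
    # hypothetical dispatcher: use the 1D kernel when there is only one column
    if values.shape[1] == 1:
        # 1D slices of an (N, 1) array are contiguous, so they could be passed
        # to a kernel declared with ::1 memoryviews (a check/copy may be needed
        # for more general strided input)
        group_mean_1d(out[:, 0], counts, values[:, 0], labels, min_count=min_count)
    else:
        group_mean_2d(out, counts, values, labels, min_count=min_count)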

@jreback
Contributor

jreback commented Mar 19, 2021

yep!

@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions bot added the Stale label Apr 19, 2021
@jbrockmendel
Member

@jorisvandenbossche gentle ping

@jorisvandenbossche
Member Author

would it make sense to have the 2D functions check if they are single-column and if so dispatch to the 1D version?

Do we actually want to add such 1D versions in the short term? As I said in the top post, this would give a lot of duplication (I don't think it is easily template-able to still share the algorithm), so I mainly did this as an experiment to check what the benefit would be / open discussion.

(I am personally fine with the duplication in the idea that it is only temporary, but I have the impression that you both want to limit that as much as possible)

@jreback
Contributor

jreback commented Apr 22, 2021

yea would rather limit code duplicate like this

@jbrockmendel
Member

Agreed
