PERF: Avoid materializing values in Categorical.set_categories #17508

Closed
TomAugspurger opened this issue Sep 13, 2017 · 0 comments
Labels
Categorical (Categorical Data Type), Performance (Memory or execution speed performance)
Comments

TomAugspurger (Contributor) commented Sep 13, 2017

In Categorical.set_categories, we allocate an array of the values, which may be expensive:

values = cat.__array__()

It should be possible to do this operation by just manipulating the codes.

In [6]: c = pd.Categorical(['a'] * 100000)

In [7]: c.set_categories(['a', 'b'])
Out[7]:
[a, a, a, a, a, ..., a, a, a, a, a]
Length: 100000
Categories (2, object): [a, b]

See 5ab0123 for how this might work. That commit will probably be squashed, but it contains the implementation of Categorical._set_dtype in #16015.

I may get to this as a followup to that PR.
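As a rough illustration of what "just manipulating the codes" could look like, here is a minimal sketch using only public pandas and NumPy calls. The helper name `recode_for_new_categories` is hypothetical and this is not the code from 5ab0123 or #16015. It assumes missing values are stored as code `-1`, and it maps each old category to its position in the new categories without ever building the array of values.

```python
import numpy as np
import pandas as pd


def recode_for_new_categories(codes, old_categories, new_categories):
    """Map integer codes from old_categories onto new_categories.

    Hypothetical helper for illustration only; not the pandas implementation.
    """
    old_categories = pd.Index(old_categories)
    new_categories = pd.Index(new_categories)
    # Position of every old category inside the new categories (-1 if dropped).
    indexer = new_categories.get_indexer(old_categories)
    # Append -1 so that the missing-value code (-1) indexes the last slot
    # of the lookup table and therefore stays -1.
    lookup = np.append(indexer, -1)
    # Work is proportional to len(codes); no object array of values is built.
    return lookup[codes]


c = pd.Categorical(['a', 'b', 'a', None, 'c'])
new_cats = ['c', 'a', 'd']

new_codes = recode_for_new_categories(c.codes, c.categories, new_cats)
result = pd.Categorical.from_codes(new_codes, categories=new_cats)

# Reference result from the existing public method.
expected = c.set_categories(new_cats)
```

The lookup table has one entry per old category, so the only large allocation is the new integer codes array. Values whose category is dropped (here `'b'`) end up as `-1`, matching what `set_categories` does when it turns them into missing values.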

TomAugspurger added the Performance (Memory or execution speed performance) label Sep 13, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017
Master:

```python
In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10,
                                                      size=500000)];
s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
68.5 ms ± 846 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

HEAD:

```python
In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10,
                                                      size=500000)]
s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
7.43 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Closes pandas-dev#17508
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017

(cherry picked from commit 9b311f4)
jreback added this to the 0.21.0 milestone Sep 13, 2017
jreback added the Categorical (Categorical Data Type) label Sep 13, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 14, 2017
jreback pushed a commit that referenced this issue Sep 14, 2017
Closes #17508
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
2 participants