duplicated() performance and bug on long rows regression from 0.15.2->0.16.0 #10161

Closed
eyaler opened this issue May 17, 2015 · 16 comments
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
@eyaler

eyaler commented May 17, 2015

The following works quickly in 0.15.2 but has a performance issue on the last operation, df.T.duplicated(), in 0.16.0 and 0.16.1.
Also, on a private data set that works on 0.15.2, I get an error on 0.16.0 and 0.16.1 on the same operation.

code:

import pandas
import numpy

df = pandas.DataFrame({'A': [1 for x in range(1000)],
                       'B': [1 for x in range(1000)]})

print(numpy.count_nonzero(df.duplicated()))
print(numpy.count_nonzero(df.T.duplicated()))

df = pandas.DataFrame({'A': [1 for x in range(1000000)],
                       'B': [1 for x in range(1000000)]})

print(numpy.count_nonzero(df.duplicated()))
print(numpy.count_nonzero(df.T.duplicated()))


This is the error I get on the private data set (not yet reproduced with synthetic data):
  File "C:\Anaconda3\lib\site-packages\pandas\util\decorators.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2867, in duplicated
    labels, shape = map(list, zip( * map(f, vals)))
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2856, in f
    labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
  File "C:\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 135, in factorize
    labels = table.get_labels(vals, uniques, 0, na_sentinel)
  File "pandas\hashtable.pyx", line 813, in pandas.hashtable.PyObjectHashTable.get_labels (pandas\hashtable.c:14025)
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
@shoyer
Member

shoyer commented May 18, 2015

Looks like this may be related to #9398.

@behzadnouri any ideas?

@jreback jreback added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 18, 2015
@behzadnouri
Contributor

Yes, #9398 will not scale well with a very wide frame, in the exact same way that joining two frames on 1000000 columns will not scale well.

For the ValueError: Buffer ... error, my guess is that the column names are not unique, so when it iterates over the columns, for one of them it gets a two-dimensional array, hence the wrong number of dimensions (expected 1, got 2) error.
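A quick way to see why duplicate column names would produce that shape mismatch: selecting a duplicated label from a DataFrame returns a 2-D DataFrame rather than a 1-D Series, which is the guess above in miniature (the frame here is purely illustrative).

```python
import numpy as np
import pandas as pd

# Illustrative frame with a duplicated column name
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'A'])

# Selecting 'A' yields a 2-D DataFrame, not a 1-D Series, so code that
# expects a 1-D buffer per column (like factorize) would fail here.
print(df['A'].ndim)
```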

@eyaler
Author

eyaler commented May 18, 2015

Thanks. The column names are unique. Moreover, it works when using only partial data (fewer rows), so I guess it's related to the wide-frame issue on the transposed data.

@behzadnouri
Contributor

@eyaler are you transposing the frame? If you are transposing the frame before calling .duplicated, then the row names should be unique.

@eyaler
Author

eyaler commented May 18, 2015

After transposing, the column names become the previous row numbers, which are unique.
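That is, the columns of df.T are the original row labels, a default RangeIndex here, so they are unique by construction. A minimal check:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})

# After transposing, the columns are the original row labels (0, 1, 2),
# which are unique because the default index never repeats.
print(list(df.T.columns))
print(df.T.columns.is_unique)
```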

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Jun 1, 2015
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.17.0, 0.16.2 Jun 1, 2015
@jreback
Contributor

jreback commented Jun 2, 2015

@behzadnouri any thoughts on this?

@behzadnouri
Contributor

@jreback I guess the easiest solution would be to switch to the old code for wide frames if no subset is selected.
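A rough sketch of what such a dispatch could look like, assuming a hypothetical width cutoff (the name `WIDE_FRAME_THRESHOLD` and the tuple-hashing fallback are illustrative, not the actual patch):

```python
import pandas as pd

# Hypothetical cutoff; the real fix would pick whatever width makes sense.
WIDE_FRAME_THRESHOLD = 1000

def duplicated_dispatch(df, subset=None, keep='first'):
    # For very wide frames with no subset selected, fall back to hashing
    # each row as a tuple instead of factorizing column by column.
    if subset is None and df.shape[1] > WIDE_FRAME_THRESHOLD:
        keys = pd.Series(list(map(tuple, df.values)), index=df.index)
        return keys.duplicated(keep=keep)
    return df.duplicated(subset=subset, keep=keep)

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 1, 3]})
print(duplicated_dispatch(df).tolist())
```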

@jreback jreback modified the milestones: 0.17.0, 0.16.2 Jun 7, 2015
@jreback
Contributor

jreback commented Jun 7, 2015

@behzadnouri that sounds ok

@jorisvandenbossche
Member

@behzadnouri would you be able to put up a fix for this in the coming days? As this is a regression, I think we should try to include a fix in 0.16.2, to be released this Friday.

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 19, 2015
@samuelclark

This issue seems to still exist, running Python 2.7.6, on a relatively small DataFrame: 5000 rows of mostly numeric data with a few date and short object columns (no duplicate columns).

import pandas as pd

In [14]: pd.__version__
Out[14]: '0.16.2'

In [15]: df.shape
Out[15]: (5000, 35) 

%timeit df.T.duplicated()
1 loops, best of 3: 19min 37s per loop

In an older version of pandas (0.12.0) on the same dataframe

In [13]: %timeit -n 10 dataframe.T.duplicated()
10 loops, best of 3: 23.8 ms per loop

In [14]: pd.__version__
Out[14]: '0.12.0'

For now I have reimplemented the old duplicated function and am calling it separately.

I also tested this in Python 3.4 and had the same results.
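The kind of workaround described here can be sketched as follows; this is a hypothetical reimplementation in the spirit of the old behavior (treating each row as a tuple of its values), not the actual pre-0.16 source:

```python
import pandas as pd

def duplicated_old_style(frame, keep='first'):
    # Build a tuple key per row by zipping the columns element-wise,
    # then reuse Series.duplicated on the object-dtype key column.
    # Assumes unique column names, as in the frame discussed above.
    keys = pd.Series(list(zip(*(frame[c] for c in frame.columns))),
                     index=frame.index)
    return keys.duplicated(keep=keep)

df = pd.DataFrame({'A': [1, 2, 1], 'B': ['x', 'y', 'x']})
print(duplicated_old_style(df).tolist())
```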

@jreback
Contributor

jreback commented Sep 23, 2015

The issue is still marked as open, though your timings are a bit odd. You might want to show df.info() on the frame you are using.

In [1]: df = pd.concat([Series(np.arange(5000))]*35,axis=1)

In [2]: %timeit df.duplicated()
100 loops, best of 3: 7.97 ms per loop

In [3]: %timeit df.T.duplicated()
1 loops, best of 3: 378 ms per loop

In [4]: pd.__version__
Out[4]: '0.16.2'

I see that if I add a different dtype, then this does slow down.

@samuelclark

Great, here is the info from the DataFrame I used:

In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 35 columns):
...
dtypes: float64(7), int64(12), object(16)
memory usage: 1.4+ MB

@behzadnouri
Contributor

I will patch this later today.

@jreback
Contributor

jreback commented Sep 25, 2015

Closed by #11180.

@jreback jreback closed this as completed Sep 25, 2015
@philippschw

philippschw commented Dec 13, 2017

The issue seems to persist on an extremely wide pandas DataFrame with 79 columns. Columns with the same name have been removed!
The error is:

C:\Users***\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
558 uniques = vec_klass()
559 check_nulls = not is_integer_dtype(original)
--> 560 labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
561
562 labels = _ensure_platform_int(labels)

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Float64HashTable.get_labels (pandas\_libs\hashtable.c:8705)()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)



`pd.__version__`
> '0.20.3'

@jreback
Contributor

jreback commented Dec 13, 2017

This is a long-closed issue. If you have a case,
then open a new issue with a reproducible example.
