duplicated() performance and bug on long rows regression from 0.15.2->0.16.0 #10161

Closed
eyaler opened this issue May 17, 2015 · 16 comments
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
@eyaler

eyaler commented May 17, 2015

The following works quickly in 0.15.2 but has a performance issue on the last operation, df.T.duplicated(), in 0.16.0 and 0.16.1.
Also, on a private data set that works on 0.15.2, I get an error on 0.16.0 and 0.16.1 on the same operation.

code:

import pandas
import numpy

df = pandas.DataFrame({'A': [1 for x in range(1000)],
                       'B': [1 for x in range(1000)]})

print(numpy.count_nonzero(df.duplicated()))
print(numpy.count_nonzero(df.T.duplicated()))

df = pandas.DataFrame({'A': [1 for x in range(1000000)],
                       'B': [1 for x in range(1000000)]})

print(numpy.count_nonzero(df.duplicated()))
print(numpy.count_nonzero(df.T.duplicated()))


This is the error I get on the private data set (not yet reproduced with synthetic data):
  File "C:\Anaconda3\lib\site-packages\pandas\util\decorators.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2867, in duplicated
    labels, shape = map(list, zip( * map(f, vals)))
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2856, in f
    labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
  File "C:\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 135, in factorize
    labels = table.get_labels(vals, uniques, 0, na_sentinel)
  File "pandas\hashtable.pyx", line 813, in pandas.hashtable.PyObjectHashTable.get_labels (pandas\hashtable.c:14025)
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
@shoyer
Member

shoyer commented May 18, 2015

Looks like this may be related to #9398.

@behzadnouri any ideas?

@jreback jreback added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 18, 2015
@behzadnouri
Contributor

Yes, #9398 will not scale well with a very wide frame, in the exact same way that joining two frames on 1000000 columns will not scale well.

For the ValueError: Buffer ... error, my guess is that the column names are not unique, so when it iterates over the columns, for one of them it gets a two-dimensional array, hence the wrong number of dimensions (expected 1, got 2) error.
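A quick way to see why duplicate column names would produce that shape mismatch: selecting a duplicated label from a DataFrame returns a 2-D DataFrame rather than a 1-D Series, which is the guess above in miniature (the frame here is purely illustrative).

```python
import numpy as np
import pandas as pd

# Illustrative frame with a duplicated column name
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'A'])

# Selecting 'A' yields a 2-D DataFrame, not a 1-D Series, so code that
# expects a 1-D buffer per column (like factorize) would fail here.
print(df['A'].ndim)
```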

@eyaler
Author

eyaler commented May 18, 2015

Thanks. The column names are unique. Moreover, it works when using only partial data (fewer rows), so I guess it's related to the wide-frame issue on the transposed data.

@behzadnouri
Contributor

@eyaler are you transposing the frame? If you are transposing the frame before calling .duplicated, then the row names should be unique.

@eyaler
Author

eyaler commented May 18, 2015

After transposing, the column names become the previous row numbers, which are unique.
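That is, the columns of df.T are the original row labels, a default RangeIndex here, so they are unique by construction. A minimal check:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})

# After transposing, the columns are the original row labels (0, 1, 2),
# which are unique because the default index never repeats.
print(list(df.T.columns))
print(df.T.columns.is_unique)
```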

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Jun 1, 2015
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.17.0, 0.16.2 Jun 1, 2015
@jreback
Contributor

jreback commented Jun 2, 2015

@behzadnouri any thoughts on this?

@behzadnouri
Contributor

@jreback I guess the easiest solution would be to switch to the old code for wide frames if no subset is selected.
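A rough sketch of what such a dispatch could look like, assuming a hypothetical width cutoff (the name `WIDE_FRAME_THRESHOLD` and the tuple-hashing fallback are illustrative, not the actual patch):

```python
import pandas as pd

# Hypothetical cutoff; the real fix would pick whatever width makes sense.
WIDE_FRAME_THRESHOLD = 1000

def duplicated_dispatch(df, subset=None, keep='first'):
    # For very wide frames with no subset selected, fall back to hashing
    # each row as a tuple instead of factorizing column by column.
    if subset is None and df.shape[1] > WIDE_FRAME_THRESHOLD:
        keys = pd.Series(list(map(tuple, df.values)), index=df.index)
        return keys.duplicated(keep=keep)
    return df.duplicated(subset=subset, keep=keep)

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 1, 3]})
print(duplicated_dispatch(df).tolist())
```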

@jreback jreback modified the milestones: 0.17.0, 0.16.2 Jun 7, 2015
@jreback
Contributor

jreback commented Jun 7, 2015

@behzadnouri that sounds ok

@jorisvandenbossche
Member

@behzadnouri would you be able to put up a fix for this in the coming days? As this is a regression, I think we should try to include a fix in 0.16.2, to be released this Friday.

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 19, 2015
@samuelclark

This issue seems to still exist, running Python 2.7.6, on a relatively small DataFrame: 5000 rows of mostly numeric data with a few date and short object columns (no duplicate columns).

import pandas as pd

In [14]: pd.__version__
Out[14]: '0.16.2'

In [15]: df.shape
Out[15]: (5000, 35) 

%timeit df.T.duplicated()
1 loops, best of 3: 19min 37s per loop

In an older version of pandas (0.12.0) on the same dataframe

In [13]: %timeit -n 10 dataframe.T.duplicated()
10 loops, best of 3: 23.8 ms per loop

In [14]: pd.__version__
Out[14]: '0.12.0'

For now I have reimplemented the old duplicated function and am calling it separately.

I also tested this in Python 3.4 and had the same results.
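The kind of workaround described here can be sketched as follows; this is a hypothetical reimplementation in the spirit of the old behavior (treating each row as a tuple of its values), not the actual pre-0.16 source:

```python
import pandas as pd

def duplicated_old_style(frame, keep='first'):
    # Build a tuple key per row by zipping the columns element-wise,
    # then reuse Series.duplicated on the object-dtype key column.
    # Assumes unique column names, as in the frame discussed above.
    keys = pd.Series(list(zip(*(frame[c] for c in frame.columns))),
                     index=frame.index)
    return keys.duplicated(keep=keep)

df = pd.DataFrame({'A': [1, 2, 1], 'B': ['x', 'y', 'x']})
print(duplicated_old_style(df).tolist())
```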

@jreback
Contributor

jreback commented Sep 23, 2015

The issue is still marked as open, though your timings are a bit odd. You might want to show df.info() on the frame you are using.

In [1]: df = pd.concat([Series(np.arange(5000))]*35,axis=1)

In [2]: %timeit df.duplicated()
100 loops, best of 3: 7.97 ms per loop

In [3]: %timeit df.T.duplicated()
1 loops, best of 3: 378 ms per loop

In [4]: pd.__version__
Out[4]: '0.16.2'

I see that if I add a different dtype, then this does slow down.

@samuelclark

Great, here is the info from the DataFrame I used:

In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 35 columns):
...
dtypes: float64(7), int64(12), object(16)
memory usage: 1.4+ MB

@behzadnouri
Contributor

I will patch this later today.

@jreback
Contributor

jreback commented Sep 25, 2015

Closed by #11180.

@jreback jreback closed this as completed Sep 25, 2015
@philippschw

philippschw commented Dec 13, 2017

The issue seems to persist on an extremely wide pandas DataFrame with 79 columns. Columns with the same name have been removed!
The error is:

C:\Users***\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
558 uniques = vec_klass()
559 check_nulls = not is_integer_dtype(original)
--> 560 labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
561
562 labels = _ensure_platform_int(labels)

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Float64HashTable.get_labels (pandas\_libs\hashtable.c:8705)()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)



`pd.__version__`
> '0.20.3'

@jreback
Contributor

jreback commented Dec 13, 2017

This is a long-closed issue. If you have a case,
then open a new issue with a reproducible example.
