Series.unique() dies with many NaNs #714

Closed
kieranholland opened this issue Jan 30, 2012 · 5 comments
@kieranholland

Series.unique() dies with many NaNs:

import time

from pandas import Series

def test_unique(obj):
    for n in range(6):
        objs = Series([obj] * 10 ** n)
        start = time.time()
        objs.unique()
        stop = time.time()
        print('%6.0f %s' % (len(objs), stop - start))

test_unique('a')

       1 4.98294830322e-05
      10 2.40802764893e-05
     100 4.10079956055e-05
    1000 6.91413879395e-05
   10000 0.000164985656738
  100000 0.0013279914856

test_unique(float('nan'))

       1 3.91006469727e-05
      10 3.60012054443e-05
     100 0.000331163406372
    1000 0.0283088684082
   10000 2.71325206757
  100000 Boom!

@adamklein
Contributor

Interesting. I'm guessing this is because the hash back-end doesn't account for nan != nan.
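
To illustrate that guess, here is a minimal sketch of a toy chained hash set that relies on `==` alone to detect duplicates. It is not the pandas hash back-end; `count_comparisons` is a hypothetical helper written only for this example. Because NaN never compares equal to itself, every NaN looks like a new key and is compared against all previously stored NaNs, so the work grows quadratically, which matches the shape of the timings reported above.

```python
def count_comparisons(values):
    """Insert values into a toy chained hash set that uses == only.

    Returns the number of equality comparisons performed.
    Hypothetical helper for illustration, not pandas code.
    """
    buckets = {}  # hash value -> list of keys stored under that hash
    comparisons = 0
    for v in values:
        chain = buckets.setdefault(hash(v), [])
        found = False
        for k in chain:
            comparisons += 1
            if k == v:  # always False when both are NaN
                found = True
                break
        if not found:
            chain.append(v)
    return comparisons


nan = float('nan')
n = 1000
print(count_comparisons(['a'] * n))  # n - 1 comparisons
print(count_comparisons([nan] * n))  # n * (n - 1) / 2 comparisons
```

With `n = 1000` the string case needs 999 comparisons while the NaN case needs 499500, since each new NaN is checked against every NaN already in the chain.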

@wesm wesm closed this as completed in b50af20 Jan 30, 2012
@wesm
Member

wesm commented Jan 30, 2012

went a different route, added a float64 hash table with NA handling. getting this result now:


In [1]: paste
import time

def test_unique(obj):
    for n in range(6):
        objs = Series([obj] * 10 ** n)
        start = time.time()
        objs.unique()
        stop = time.time()
        print('%6.0f %s' % (len(objs), stop - start))

test_unique('a')
## -- End pasted text --
     1 7.48634338379e-05
    10 5.3882598877e-05
   100 5.31673431396e-05
  1000 7.41481781006e-05
 10000 0.000144004821777
100000 0.00133991241455

In [2]: test_unique(float('nan'))
     1 5.19752502441e-05
    10 3.40938568115e-05
   100 3.19480895996e-05
  1000 3.91006469727e-05
 10000 9.29832458496e-05
100000 0.000288009643555
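
The idea of "NA handling" in a uniqueness pass can be sketched in pure Python as follows. This is only an assumption about the general approach (treat every NaN as one key, checked separately from the hash lookup); it is not the float64 hash table from the commit, and `unique_with_nan` is a hypothetical name.

```python
import math


def unique_with_nan(values):
    """Return unique values in first-seen order, collapsing all NaNs to one.

    Hypothetical sketch; not the pandas implementation.
    """
    seen = set()
    seen_nan = False
    out = []
    for v in values:
        # NaN is detected explicitly instead of relying on == or hashing.
        if isinstance(v, float) and math.isnan(v):
            if not seen_nan:
                seen_nan = True
                out.append(v)
        elif v not in seen:
            seen.add(v)
            out.append(v)
    return out


result = unique_with_nan([float('nan'), 'a', float('nan'), 'a', 1.0])
print(len(result))  # 3: one NaN, 'a', and 1.0
```

Each element is handled in roughly constant time, so the NaN case no longer costs more than the string case.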

@kieranholland
Author

Thanks for the quick response.
I encountered an issue with the new version.
It only happens when multiple distinct nan instances are mixed with other objects.

import time

def test_unique_single_nan_instance_and_non_nans():
    for n in range(6):
        s = []
        nan = float('nan')
        for _ in range(10 ** n):
            s.append(nan)
            s.append('a')
        objs = Series(s)
        start = time.time()
        objs.unique()
        stop = time.time()
        print('%6.0f %s' % (len(objs), stop - start))

test_unique_single_nan_instance_and_non_nans() # fine

def test_unique_multiple_nan_instances():
    for n in range(6):
        s = []
        for _ in range(10 ** n):
            s.append(float('nan'))
            s.append(float('nan'))
        objs = Series(s)
        start = time.time()
        objs.unique()
        stop = time.time()
        print('%6.0f %s' % (len(objs), stop - start))

test_unique_multiple_nan_instances() # fine

def test_unique_multiple_nan_instances_and_non_nans():
    for n in range(6):
        s = []
        for _ in range(10 ** n):
            s.append(float('nan'))
            s.append('a')
        objs = Series(s)
        start = time.time()
        objs.unique()
        stop = time.time()
        print('%6.0f %s' % (len(objs), stop - start))

test_unique_multiple_nan_instances_and_non_nans() # not so fine
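
For context on why a single nan instance and multiple nan instances can behave differently, here is a small sketch using plain Python sets on CPython. It only shows the language-level behaviour (identity versus equality); the thread does not state what the actual cause inside pandas was.

```python
nan = float('nan')

same_object = [nan] * 3                              # one object, repeated
distinct_objects = [float('nan') for _ in range(3)]  # three separate objects

# NaN never compares equal, not even to itself.
print(nan == nan)  # False

# CPython sets check identity before equality, so the repeated object
# collapses to one entry while the distinct objects do not.
print(len(set(same_object)))       # 1
print(len(set(distinct_objects)))  # 3
```

A container that falls back on object identity therefore handles the repeated-instance case, but sees every freshly created `float('nan')` as a new key.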

@wesm wesm reopened this Feb 2, 2012
@wesm
Member

wesm commented Feb 2, 2012

reopened the issue and will take a look

@wesm
Member

wesm commented Feb 4, 2012

fixed this in master, let me know if you have any more issues

@wesm wesm closed this as completed Feb 4, 2012
yarikoptic added a commit to neurodebian/pandas that referenced this issue Feb 10, 2012
* commit 'v0.7.0rc1-94-ge3df4e2':
  DOC: added info on encoding parameter for csv i/o
  TST: renamed io b/c module conflict, made suite check for config
  added vbench for write csv
  BUG: made encoding optional on csv read/write, addresses pandas-dev#717
  BUG: float64 hash table for handling NAs in Series.unique, close pandas-dev#714
  TST: add bench_unique.py
  TST: added better testing for pandas-dev#709
  BUG: closes pandas-dev#709, bug in ix + multiindex use case
  DOC: release notes
  BUG: don't assume that each object contains every unique block type in concat, GH pandas-dev#708
  BUG: inconsistency in .ix with integer label and float index
  Fix test that assumed py2.
  Don't use unnecessary UnicodeReader on Python 3.
  BUG: remove poor man's breakpoint
  BUG: closes pandas-dev#705, csv is encoded utf-8 and then decoded on the read side
  updated support contact info
  DOC: note EWMA adjustment, closes pandas-dev#703
  ENH: close pandas-dev#694, pandas-dev#693, pandas-dev#692
  BUG: Bar plot fails if axis parameter supplied, closes pandas-dev#702