Series.unique() dies with many NaNs #714

kieranholland · 2012-01-30T11:08:22Z

Series.unique() dies with many NaNs:

import time

from pandas import Series

def test_unique(obj):
    for n in range(6):
        objs = Series([obj] * 10 ** n)
        start = time.time()
        objs.unique()
        stop = time.time()
        print('%6.0f %s' % (len(objs), stop - start))

test_unique('a')

       1 4.98294830322e-05
      10 2.40802764893e-05
     100 4.10079956055e-05
    1000 6.91413879395e-05
   10000 0.000164985656738
  100000 0.0013279914856

test_unique(float('nan'))

       1 3.91006469727e-05
      10 3.60012054443e-05
     100 0.000331163406372
    1000 0.0283088684082
   10000 2.71325206757
  100000 Boom!


adamklein · 2012-01-30T15:05:58Z

Interesting. Guessing because the hash back-end doesn't realize nan != nan
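To illustrate the guess above, here is a minimal sketch using a plain Python dict rather than the pandas hash back-end (the helper name `naive_unique` is hypothetical). Because a NaN never compares equal to another NaN, an equality-based lookup never finds an existing entry for a distinct NaN object, so every NaN is inserted as a new key:

```python
def naive_unique(values):
    """Equality-based de-duplication with no special NaN handling."""
    seen = {}
    for v in values:
        # For two distinct NaN objects, both the identity check and the
        # equality check fail, so this lookup never matches.
        if v not in seen:
            seen[v] = True
    return list(seen)


# Five separate NaN objects are all treated as different values.
distinct_nans = [float('nan') for _ in range(5)]
print(len(naive_unique(distinct_nans)))  # 5

# Ordinary floats de-duplicate as expected.
print(naive_unique([1.0, 1.0, 2.0]))  # [1.0, 2.0]
```

If the table also places all those NaN entries in the same bucket, each insert has to walk past every earlier NaN, which would be consistent with the roughly quadratic timings reported above. That part is an assumption about the back-end, not something confirmed in this thread.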

wesm · 2012-01-30T23:24:53Z

went a different route, added a float64 hash table with NA handling. getting this result now:


In [1]: paste
import time

def test_unique(obj):
    for n in range(6):
        objs = Series([obj] * 10 ** n)
        start = time.time()
        objs.unique()
        stop = time.time()
        print('%6.0f %s' % (len(objs), stop - start))

test_unique('a')
## -- End pasted text --
     1 7.48634338379e-05
    10 5.3882598877e-05
   100 5.31673431396e-05
  1000 7.41481781006e-05
 10000 0.000144004821777
100000 0.00133991241455

In [2]: test_unique(float('nan'))
     1 5.19752502441e-05
    10 3.40938568115e-05
   100 3.19480895996e-05
  1000 3.91006469727e-05
 10000 9.29832458496e-05
100000 0.000288009643555
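The idea of NA handling in a float lookup can be sketched in pure Python as follows. This is only an illustration of the technique, not the float64 hash table that was actually added; the function name `unique_floats_na_aware` is hypothetical. NaNs are tracked with a separate flag instead of being inserted into the table, so they cost constant time each:

```python
import math


def unique_floats_na_aware(values):
    """Return unique floats in first-seen order, treating all NaNs as one value."""
    seen = set()
    seen_nan = False
    out = []
    for v in values:
        if math.isnan(v):
            # NaN never equals itself, so keep it out of the set entirely.
            if not seen_nan:
                seen_nan = True
                out.append(v)
        elif v not in seen:
            seen.add(v)
            out.append(v)
    return out


result = unique_floats_na_aware([float('nan')] * 3 + [1.0, float('nan'), 1.0, 2.0])
print(len(result))  # 3
```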

kieranholland · 2012-02-01T08:18:20Z

Thanks for the quick response.
I encountered an issue with the new version.
It only happens when multiple distinct nan instances are mixed with other objects.

import time

from pandas import Series

def test_unique_single_nan_instance_and_non_nans():
    for n in range(6):
        s = []
        nan = float('nan')
        for _ in range(10 ** n):
            s.append(nan)
            s.append('a')
        objs = Series(s)
        start = time.time()
        objs.unique()
        stop = time.time()
        print('%6.0f %s' % (len(objs), stop - start))

test_unique_single_nan_instance_and_non_nans() # fine

def test_unique_multiple_nan_instances():
    for n in range(6):
        s = []
        for _ in range(10 ** n):
            s.append(float('nan'))
            s.append(float('nan'))
        objs = Series(s)
        start = time.time()
        objs.unique()
        stop = time.time()
        print('%6.0f %s' % (len(objs), stop - start))

test_unique_multiple_nan_instances() # fine

def test_unique_multiple_nan_instances_and_non_nans():
    for n in range(6):
        s = []
        for _ in range(10 ** n):
            s.append(float('nan'))
            s.append('a')
        objs = Series(s)
        start = time.time()
        objs.unique()
        stop = time.time()
        print('%6.0f %s' % (len(objs), stop - start))

test_unique_multiple_nan_instances_and_non_nans() # not so fine
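A possible explanation for why only the third case misbehaves, offered as an assumption rather than a confirmed diagnosis: mixing floats with strings yields an object array, and in Python containers a single shared NaN object is still found through the identity check, while separately created NaN objects are not. The sketch below shows that difference and one way to collapse "different" NaNs in a list of arbitrary hashable objects (the helper `unique_objects_na_aware` is hypothetical):

```python
shared = float('nan')
print(shared in [shared])        # True: same object, matched by identity
print(float('nan') in [shared])  # False: different object, and nan != nan


def unique_objects_na_aware(values):
    """Unique hashable objects in first-seen order, collapsing all float NaNs."""
    seen = set()
    seen_nan = False
    out = []
    for v in values:
        # v != v is true only for NaN; the isinstance check skips non-floats.
        if isinstance(v, float) and v != v:
            if not seen_nan:
                seen_nan = True
                out.append(v)
        elif v not in seen:
            seen.add(v)
            out.append(v)
    return out


# Same shape as the failing case: a fresh NaN object alternating with 'a'.
mixed = []
for _ in range(100):
    mixed.append(float('nan'))
    mixed.append('a')

result = unique_objects_na_aware(mixed)
print(len(result))  # 2
```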

wesm · 2012-02-02T20:27:51Z

reopened the issue and will take a look


wesm · 2012-02-04T22:40:26Z

fixed this in master, let me know if you have any more issues

* commit 'v0.7.0rc1-94-ge3df4e2':
  DOC: added info on encoding parameter for csv i/o
  TST: renamed io b/c module conflict, made suite check for config
  added vbench for write csv
  BUG: made encoding optional on csv read/write, addresses pandas-dev#717
  BUG: float64 hash table for handling NAs in Series.unique, close pandas-dev#714
  TST: add bench_unique.py
  TST: added better testing for pandas-dev#709
  BUG: closes pandas-dev#709, bug in ix + multiindex use case
  DOC: release notes
  BUG: don't assume that each object contains every unique block type in concat, GH pandas-dev#708
  BUG: inconsistency in .ix with integer label and float index
  Fix test that assumed py2.
  Don't use unnecessary UnicodeReader on Python 3.
  BUG: remove poor man's breakpoint
  BUG: closes pandas-dev#705, csv is encoded utf-8 and then decoded on the read side
  updated support contact info
  DOC: note EWMA adjustment, closes pandas-dev#703
  ENH: close pandas-dev#694, pandas-dev#693, pandas-dev#692
  BUG: Bar plot fails if axis parameter supplied, closes pandas-dev#702


wesm closed this as completed in b50af20 Jan 30, 2012

wesm reopened this Feb 2, 2012

wesm added a commit that referenced this issue Feb 4, 2012

BUG: fix Series.unique with 'different' NA values in an object array #…

443dcc5

…714

wesm closed this as completed Feb 4, 2012

dan-nadler pushed a commit to dan-nadler/pandas that referenced this issue Sep 23, 2019

Fixes pandas-dev#714 Do not write to changes collection

b1862e4

Based on discussions and given the fact that this collection is not exposed via any api, it should be safe to remove all usage of this.

dan-nadler pushed a commit to dan-nadler/pandas that referenced this issue Sep 23, 2019

Merge pull request pandas-dev#715 from shashank88/remove_changes_coll

99b956e

Fixes pandas-dev#714 Do not write to changes collection
