BUG: use size attribute (not method call) #7089

cpcloud · 2014-05-09T19:08:36Z

cpcloud · 2014-05-09T19:11:25Z

@jreback if u want to check this out... not exactly sure where the upcast issue is ... i construct results from the itemsize of the dtype, so any float32 should be fine. can u check with other gruopby float32 methods?

jreback · 2014-05-09T21:06:19Z

C:\Users\Jeff Reback\Documents\GitHub\pandas>c:\python26-32\Scripts\nosetests.exe build\lib.win32-2.6\pandas\tests\test_groupby.py --pdb --pdb-failure

.......................................> c:\users\jeff reback\documents\github\pandas\testing.pyx(136)pandas._testing.assert_almost_equal (pandas\src\
testing.c:2299)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\testing.pyx(93)pandas._testing.assert_almost_equal (pandas\src\testing.c:1793)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\testing.pyx(58)pandas._testing.assert_almost_equal (pandas\src\testing.c:2465)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\util\testing.py(520)assert_series_equal()
-> assert_almost_equal(left.values, right.values, check_less_precise)
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\util\testing.py(573)assert_frame_equal()
-> check_exact=check_exact)
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\tests\test_groupby.py(2017)test_count()
-> assert_frame_equal(count_not_as, expected.reset_index())
(Pdb) l
2012            count_as = df.groupby('A').count()
2013            count_not_as = df.groupby('A', as_index=False).count()
2014
2015            expected = DataFrame([[1, 2], [0, 0]], columns=['B', 'C'], index=[1,3])
2016            expected.index.name='A'
2017 ->         assert_frame_equal(count_not_as, expected.reset_index())
2018            assert_frame_equal(count_as, expected)
2019
2020            count_B = df.groupby('A')['B'].count()
2021            assert_series_equal(count_B, expected['B'])
2022
(Pdb) p df
   A   B    C
0  1   2  foo
1  1 NaN  bar
2  3 NaN  NaN

[3 rows x 3 columns]
(Pdb) p count_not_as
   A  B  C
0  1  2  2
1  3  1  1

[2 rows x 3 columns]
(Pdb) p expected.reset_index()
   A  B  C
0  1  1  2
1  3  0  0

[2 rows x 3 columns]
(Pdb) p expected.reset_index().dtypes
A    int64
B    int64
C    int64
dtype: object
(Pdb) p count_not_as.dtypes
A    int64
B    int64
C    int64
dtype: object
(Pdb) c
F> c:\users\jeff reback\documents\github\pandas\testing.pyx(136)pandas._testing.assert_almost_equal (pandas\src\testing.c:2299)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\testing.pyx(93)pandas._testing.assert_almost_equal (pandas\src\testing.c:1793)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\testing.pyx(58)pandas._testing.assert_almost_equal (pandas\src\testing.c:2465)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\util\testing.py(520)assert_series_equal()
-> assert_almost_equal(left.values, right.values, check_less_precise)
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\tests\test_groupby.py(2034)test_count_object()
-> tm.assert_series_equal(result, expected)
(Pdb) l
2029
2030            df = pd.DataFrame({'a': ['a', np.nan, np.nan] + ['b'] * 3,
2031                               'c': [2] * 3 + [3] * 3})
2032            result = df.groupby('c').a.count()
2033            expected = pd.Series([1, 3], index=[2, 3], name='a')
2034 ->         tm.assert_series_equal(result, expected)
2035
2036        def test_non_cython_api(self):
2037
2038            # GH5610
2039            # non-cython calls should not include the grouper
(Pdb) p result
c
2    3
3    3
Name: a, dtype: int64
(Pdb) p expected
2    1
3    3
Name: a, dtype: int64
(Pdb)

jreback · 2014-05-09T22:49:52Z

The problem is here: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1398

when evaluting the object type, a float result to hold the data is created (and not object type).

jreback · 2014-05-09T23:00:55Z

This patch fixes. Note that I still don't think we are testing what happens if the cython aggregation fails (but I am not sure what triggers it to fail anyhow, or if their are ANY tests for a cython function failling back).

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index 094324d..b13b212 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -1390,16 +1390,19 @@ class BaseGrouper(object):
         if is_numeric_dtype(values.dtype):
             values = com.ensure_float(values)
             is_numeric = True
+            out_dtype = 'f%d' % values.dtype.itemsize
         else:
             is_numeric = issubclass(values.dtype.type, (np.datetime64,
                                                         np.timedelta64))
+            out_dtype = 'float64'
             if is_numeric:
                 values = values.view('int64')
             else:
                 values = values.astype(object)

         # will be filled in Cython function
-        result = np.empty(out_shape, dtype='f%d' % values.dtype.itemsize)
+        result = np.empty(out_shape, dtype=out_dtype)
+
         result.fill(np.nan)
         counts = np.zeros(self.ngroups, dtype=np.int64)

diff --git a/pandas/tests/test_groupby.py b/pandas/tests/test_groupby.py
index 8b95748..c46ae4b 100644
--- a/pandas/tests/test_groupby.py
+++ b/pandas/tests/test_groupby.py
@@ -2009,11 +2009,12 @@ class TestGroupBy(tm.TestCase):
         df = pd.DataFrame([[1, 2, 'foo'], [1, nan, 'bar'], [3, nan, nan]],
                           columns=['A', 'B', 'C'])

+        expected = DataFrame([[1, 2], [0, 0]], columns=['B', 'C'], index=[1,3])
+        expected.index.name='A'
+
         count_as = df.groupby('A').count()
         count_not_as = df.groupby('A', as_index=False).count()

-        expected = DataFrame([[1, 2], [0, 0]], columns=['B', 'C'], index=[1,3])
-        expected.index.name='A'
         assert_frame_equal(count_not_as, expected.reset_index())
         assert_frame_equal(count_as, expected)

cpcloud · 2014-05-10T15:18:41Z

@jreback i applied the patch. so this works on windows?

jreback · 2014-05-10T15:23:05Z

yep; I think the problem is that dtype.itemsize for object is 8 on linux and 4 on windows (for 32-bit)

BUG: use size attribute (not method call)

jreback · 2014-05-10T15:23:43Z

thanks!

will let you know of any other windows issues

cpcloud · 2014-05-10T15:26:01Z

ok i'll keep that in mind when doing other stuff with object dtype

jreback · 2014-05-10T16:01:29Z

ok...passes now on windows

thanks!

BUG: use size attribute (not method call)

35b5aed

jreback added Bug labels May 9, 2014

jreback added this to the 0.14.0 milestone May 9, 2014

BUG: force float64 dtype in object results for groupby

b722dee

cpcloud self-assigned this May 10, 2014

jreback added a commit that referenced this pull request May 10, 2014

Merge pull request #7089 from cpcloud/groupby-size-windows-bug

19fe376

BUG: use size attribute (not method call)

jreback merged commit 19fe376 into pandas-dev:master May 10, 2014

cpcloud deleted the groupby-size-windows-bug branch May 10, 2014 15:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: use size attribute (not method call) #7089

BUG: use size attribute (not method call) #7089

cpcloud commented May 9, 2014

cpcloud commented May 9, 2014

jreback commented May 9, 2014

jreback commented May 9, 2014

jreback commented May 9, 2014

cpcloud commented May 10, 2014

jreback commented May 10, 2014

jreback commented May 10, 2014

cpcloud commented May 10, 2014

jreback commented May 10, 2014

BUG: use size attribute (not method call) #7089

BUG: use size attribute (not method call) #7089

Conversation

cpcloud commented May 9, 2014

cpcloud commented May 9, 2014

jreback commented May 9, 2014

jreback commented May 9, 2014

jreback commented May 9, 2014

cpcloud commented May 10, 2014

jreback commented May 10, 2014

jreback commented May 10, 2014

cpcloud commented May 10, 2014

jreback commented May 10, 2014