Skip to content

BUG: use size attribute (not method call) #7089

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 10, 2014
Merged

BUG: use size attribute (not method call) #7089

merged 2 commits into from
May 10, 2014

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented May 9, 2014

xref: #7055

@cpcloud
Copy link
Member Author

cpcloud commented May 9, 2014

@jreback if u want to check this out... not exactly sure where the upcast issue is ... i construct results from the itemsize of the dtype, so any float32 should be fine. can u check with other gruopby float32 methods?

@jreback
Copy link
Contributor

jreback commented May 9, 2014

C:\Users\Jeff Reback\Documents\GitHub\pandas>c:\python26-32\Scripts\nosetests.exe build\lib.win32-2.6\pandas\tests\test_groupby.py --pdb --pdb-failure

.......................................> c:\users\jeff reback\documents\github\pandas\testing.pyx(136)pandas._testing.assert_almost_equal (pandas\src\
testing.c:2299)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\testing.pyx(93)pandas._testing.assert_almost_equal (pandas\src\testing.c:1793)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\testing.pyx(58)pandas._testing.assert_almost_equal (pandas\src\testing.c:2465)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\util\testing.py(520)assert_series_equal()
-> assert_almost_equal(left.values, right.values, check_less_precise)
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\util\testing.py(573)assert_frame_equal()
-> check_exact=check_exact)
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\tests\test_groupby.py(2017)test_count()
-> assert_frame_equal(count_not_as, expected.reset_index())
(Pdb) l
2012            count_as = df.groupby('A').count()
2013            count_not_as = df.groupby('A', as_index=False).count()
2014
2015            expected = DataFrame([[1, 2], [0, 0]], columns=['B', 'C'], index=[1,3])
2016            expected.index.name='A'
2017 ->         assert_frame_equal(count_not_as, expected.reset_index())
2018            assert_frame_equal(count_as, expected)
2019
2020            count_B = df.groupby('A')['B'].count()
2021            assert_series_equal(count_B, expected['B'])
2022
(Pdb) p df
   A   B    C
0  1   2  foo
1  1 NaN  bar
2  3 NaN  NaN

[3 rows x 3 columns]
(Pdb) p count_not_as
   A  B  C
0  1  2  2
1  3  1  1

[2 rows x 3 columns]
(Pdb) p expected.reset_index()
   A  B  C
0  1  1  2
1  3  0  0

[2 rows x 3 columns]
(Pdb) p expected.reset_index().dtypes
A    int64
B    int64
C    int64
dtype: object
(Pdb) p count_not_as.dtypes
A    int64
B    int64
C    int64
dtype: object
(Pdb) c
F> c:\users\jeff reback\documents\github\pandas\testing.pyx(136)pandas._testing.assert_almost_equal (pandas\src\testing.c:2299)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\testing.pyx(93)pandas._testing.assert_almost_equal (pandas\src\testing.c:1793)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\testing.pyx(58)pandas._testing.assert_almost_equal (pandas\src\testing.c:2465)()
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\util\testing.py(520)assert_series_equal()
-> assert_almost_equal(left.values, right.values, check_less_precise)
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\tests\test_groupby.py(2034)test_count_object()
-> tm.assert_series_equal(result, expected)
(Pdb) l
2029
2030            df = pd.DataFrame({'a': ['a', np.nan, np.nan] + ['b'] * 3,
2031                               'c': [2] * 3 + [3] * 3})
2032            result = df.groupby('c').a.count()
2033            expected = pd.Series([1, 3], index=[2, 3], name='a')
2034 ->         tm.assert_series_equal(result, expected)
2035
2036        def test_non_cython_api(self):
2037
2038            # GH5610
2039            # non-cython calls should not include the grouper
(Pdb) p result
c
2    3
3    3
Name: a, dtype: int64
(Pdb) p expected
2    1
3    3
Name: a, dtype: int64
(Pdb)

@jreback
Copy link
Contributor

jreback commented May 9, 2014

The problem is here: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1398

when evaluting the object type, a float result to hold the data is created (and not object type).

@jreback
Copy link
Contributor

jreback commented May 9, 2014

This patch fixes. Note that I still don't think we are testing what happens if the cython aggregation fails (but I am not sure what triggers it to fail anyhow, or if their are ANY tests for a cython function failling back).

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index 094324d..b13b212 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -1390,16 +1390,19 @@ class BaseGrouper(object):
         if is_numeric_dtype(values.dtype):
             values = com.ensure_float(values)
             is_numeric = True
+            out_dtype = 'f%d' % values.dtype.itemsize
         else:
             is_numeric = issubclass(values.dtype.type, (np.datetime64,
                                                         np.timedelta64))
+            out_dtype = 'float64'
             if is_numeric:
                 values = values.view('int64')
             else:
                 values = values.astype(object)

         # will be filled in Cython function
-        result = np.empty(out_shape, dtype='f%d' % values.dtype.itemsize)
+        result = np.empty(out_shape, dtype=out_dtype)
+
         result.fill(np.nan)
         counts = np.zeros(self.ngroups, dtype=np.int64)

diff --git a/pandas/tests/test_groupby.py b/pandas/tests/test_groupby.py
index 8b95748..c46ae4b 100644
--- a/pandas/tests/test_groupby.py
+++ b/pandas/tests/test_groupby.py
@@ -2009,11 +2009,12 @@ class TestGroupBy(tm.TestCase):
         df = pd.DataFrame([[1, 2, 'foo'], [1, nan, 'bar'], [3, nan, nan]],
                           columns=['A', 'B', 'C'])

+        expected = DataFrame([[1, 2], [0, 0]], columns=['B', 'C'], index=[1,3])
+        expected.index.name='A'
+
         count_as = df.groupby('A').count()
         count_not_as = df.groupby('A', as_index=False).count()

-        expected = DataFrame([[1, 2], [0, 0]], columns=['B', 'C'], index=[1,3])
-        expected.index.name='A'
         assert_frame_equal(count_not_as, expected.reset_index())
         assert_frame_equal(count_as, expected)

@jreback jreback added this to the 0.14.0 milestone May 9, 2014
@cpcloud cpcloud self-assigned this May 10, 2014
@cpcloud
Copy link
Member Author

cpcloud commented May 10, 2014

@jreback i applied the patch. so this works on windows?

@jreback
Copy link
Contributor

jreback commented May 10, 2014

yep; I think the problem is that dtype.itemsize for object is 8 on linux and 4 on windows (for 32-bit)

jreback added a commit that referenced this pull request May 10, 2014
BUG: use size attribute (not method call)
@jreback jreback merged commit 19fe376 into pandas-dev:master May 10, 2014
@jreback
Copy link
Contributor

jreback commented May 10, 2014

thanks!

will let you know of any other windows issues

@cpcloud
Copy link
Member Author

cpcloud commented May 10, 2014

ok i'll keep that in mind when doing other stuff with object dtype

@cpcloud cpcloud deleted the groupby-size-windows-bug branch May 10, 2014 15:26
@jreback
Copy link
Contributor

jreback commented May 10, 2014

ok...passes now on windows

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants