Skip to content

BUG: upcasting on reshaping ops #13247 #15594

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 14 commits into from
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -771,3 +771,5 @@ Bug Fixes
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
- Bug in ``pd.read_msgpack`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`)

- Concating multiple objects will no longer result in automatically upcast to `float64`, and instead try to find the smallest `dtype` that would suffice (:issue:`13247`)
20 changes: 16 additions & 4 deletions pandas/core/internals.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@
_maybe_convert_string_to_object,
_maybe_upcast,
_maybe_convert_scalar, _maybe_promote,
is_float_dtype, is_numeric_dtype,
_infer_dtype_from_scalar,
_soft_convert_objects,
_possibly_convert_objects,
Expand Down Expand Up @@ -4523,6 +4524,8 @@ def _interleaved_dtype(blocks):
return np.dtype('int%s' % (lcd.itemsize * 8 * 2))
return lcd

elif have_int and have_float and not have_complex:
return np.dtype('float64')
elif have_complex:
return np.dtype('c16')
else:
Expand Down Expand Up @@ -4892,6 +4895,8 @@ def get_empty_dtype_and_na(join_units):
upcast_cls = 'datetime'
elif is_timedelta64_dtype(dtype):
upcast_cls = 'timedelta'
elif is_float_dtype(dtype) or is_numeric_dtype(dtype):
upcast_cls = dtype.name
else:
upcast_cls = 'float'

Expand All @@ -4916,8 +4921,6 @@ def get_empty_dtype_and_na(join_units):
return np.dtype(np.bool_), None
elif 'category' in upcast_classes:
return np.dtype(np.object_), np.nan
elif 'float' in upcast_classes:
return np.dtype(np.float64), np.nan
elif 'datetimetz' in upcast_classes:
dtype = upcast_classes['datetimetz']
return dtype[0], tslib.iNaT
Expand All @@ -4926,7 +4929,17 @@ def get_empty_dtype_and_na(join_units):
elif 'timedelta' in upcast_classes:
return np.dtype('m8[ns]'), tslib.iNaT
else: # pragma
raise AssertionError("invalid dtype determination in get_concat_dtype")
g = pandas.types._.find_common_type(upcast_classes, [])
if is_float_type(g):
return g, g.type(np.nan)
elif is_numeric_dtype(g):
if has_none_blocks:
return np.float64, np.nan
else:
return g, None
else:
msg = "invalid dtype determination in get_concat_dtype"
raise AssertionError(msg)


def concatenate_join_units(join_units, concat_axis, copy):
Expand Down Expand Up @@ -5191,7 +5204,6 @@ def is_null(self):
return True

def get_reindexed_values(self, empty_dtype, upcasted_na):

if upcasted_na is None:
# No upcasting is necessary
fill_value = self.block.fill_value
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/test_internals.py
Original file line number Diff line number Diff line change
Expand Up @@ -651,7 +651,7 @@ def test_interleave(self):
mgr = create_mgr('a: f8; b: i8')
self.assertEqual(mgr.as_matrix().dtype, 'f8')
mgr = create_mgr('a: f4; b: i8')
self.assertEqual(mgr.as_matrix().dtype, 'f4')
self.assertEqual(mgr.as_matrix().dtype, 'f8')
mgr = create_mgr('a: f4; b: i8; d: object')
self.assertEqual(mgr.as_matrix().dtype, 'object')
mgr = create_mgr('a: bool; b: i8')
Expand Down
61 changes: 61 additions & 0 deletions pandas/tests/tools/test_concat.py
Original file line number Diff line number Diff line change
Expand Up @@ -1899,3 +1899,64 @@ def test_concat_multiindex_dfs_with_deepcopy(self):
tm.assert_frame_equal(result_copy, expected)
result_no_copy = pd.concat(example_dict, names=['testname'])
tm.assert_frame_equal(result_no_copy, expected)

def test_concat_no_unnecessary_upcats(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u write this in pytest form using parametrize (needs to be a separate function)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but I need to read up on documentation and it might take some time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure will be something like

@parametrize('dtype', np.sctypes('float'))
@parametrize('klass', [Series, DataFrame])
def test_concat_no_unecessary_upcasts(klass, dtype):
     # dtype and klass are defined

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI only need to do Series/DataFrame/Panel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need pytest module for using @parametrize.
Should I create another class for pytest?

# GH 13247
for pdt in [pd.Series, pd.DataFrame, pd.Panel, pd.Panel4D]:
dims = pdt().ndim
for dt in np.sctypes['float']:
dfs = [pdt(np.array([1], dtype=dt, ndmin=dims)),
pdt(np.array([np.nan], dtype=dt, ndmin=dims)),
pdt(np.array([5], dtype=dt, ndmin=dims))]
x = pd.concat(dfs)
self.assertTrue(x.values.dtype == dt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just do assert x.values.dtype == dt, no need for any reference to self


for dt in (np.sctypes['int'] + np.sctypes['uint']):
dfs = [pdt(np.array([1], dtype=dt, ndmin=dims)),
pdt(np.array([5], dtype=dt, ndmin=dims))]
x = pd.concat(dfs)
self.assertTrue(x.values.dtype == dt)

objs = []
objs.append(pdt(np.array([1], dtype=np.float32, ndmin=dims)))
objs.append(pdt(np.array([1], dtype=np.float16, ndmin=dims)))
self.assertTrue(pd.concat(objs).values.dtype == np.float32)

objs = []
objs.append(pdt(np.array([1], dtype=np.int32, ndmin=dims)))
objs.append(pdt(np.array([1], dtype=np.int64, ndmin=dims)))
self.assertTrue(pd.concat(objs).values.dtype == np.int64)

# not sure what is the best answer here
objs = []
objs.append(pdt(np.array([1], dtype=np.int32, ndmin=dims)))
objs.append(pdt(np.array([1], dtype=np.float16, ndmin=dims)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need these tests as you have the ones below

self.assertTrue(pd.concat(objs).values.dtype == np.float64)

@parametrize('dtype', np.sctypes('float'))
@parametrize('klass', [Series, DataFrame, Panel])
def test_concat_no_unnecessary_upcats_pytest(self, dtype, klass):
# GH 13247
for pdt in klass:
Copy link
Contributor

@TomAugspurger TomAugspurger Mar 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be misreading this, but I don't think you need the for loop over klass anymore. pytest will handle that for you, since you're doing the parametrize.
I think this will also fail, since this class eventually inherits from TestCase, which doesn't play well with parametrized fixtures. I'd try it like this:

import numpy as np
from pandas import Series, DataFrame, Panel
import pandas as pd
import pytest


class TestConcatUpcast(object):

    @pytest.mark.parametrize('pdt', [Series, DataFrame, Panel])
    def test_concat_no_unnecessary_upcats_pytest(self, pdt):
        # GH 13247
        dims = pdt().ndim
        for dt in np.sctypes['float']:
            dfs = [pdt(np.array([1], dtype=dt, ndmin=dims)),
                   pdt(np.array([np.nan], dtype=dt, ndmin=dims)),
                   pdt(np.array([5], dtype=dt, ndmin=dims))]
            x = pd.concat(dfs)
            assert x.values.dtype == dt

            objs = []
            objs.append(pdt(np.array([1], dtype=np.float32, ndmin=dims)))
            objs.append(pdt(np.array([1], dtype=np.float16, ndmin=dims)))
            assert pd.concat(objs).values.dtype == np.float32

            objs = []
            objs.append(pdt(np.array([1], dtype=np.int32, ndmin=dims)))
            objs.append(pdt(np.array([1], dtype=np.int64, ndmin=dims)))
            assert pd.concat(objs).values.dtype == np.int64

            objs = []
            objs.append(pdt(np.array([1], dtype=np.int32, ndmin=dims)))
            objs.append(pdt(np.array([1], dtype=np.float16, ndmin=dims)))
            assert pd.concat(objs).values.dtype == np.float64

This doesn't parametrize all that well over the dtype argument, so probably best to just leave it there.

If you want to leave comments on #15608, would appreciate your insight.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm okay.
I'm trying to set up the test environment, but conda doesn't like me.
I'll try your code after I finish setting it up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right.
I can't use @parametrize without importing pytest module.
I will try creating another class.

dims = pdt().ndim
for dt in dtype:
dfs = [pdt(np.array([1], dtype=dt, ndmin=dims)),
pdt(np.array([np.nan], dtype=dt, ndmin=dims)),
pdt(np.array([5], dtype=dt, ndmin=dims))]
x = pd.concat(dfs)
self.assertTrue(x.values.dtype == dt)

objs = []
objs.append(pdt(np.array([1], dtype=np.float32, ndmin=dims)))
objs.append(pdt(np.array([1], dtype=np.float16, ndmin=dims)))
self.assertTrue(pd.concat(objs).values.dtype == np.float32)

objs = []
objs.append(pdt(np.array([1], dtype=np.int32, ndmin=dims)))
objs.append(pdt(np.array([1], dtype=np.int64, ndmin=dims)))
self.assertTrue(pd.concat(objs).values.dtype == np.int64)

objs = []
objs.append(pdt(np.array([1], dtype=np.int32, ndmin=dims)))
objs.append(pdt(np.array([1], dtype=np.float16, ndmin=dims)))
self.assertTrue(pd.concat(objs).values.dtype == np.float64)