
pd.Series.apply can't be made to produce Series of expected dtype when base series is empty #29323


Closed
JCourt1 opened this issue Nov 1, 2019 · 9 comments
Labels
Apply (Apply, Aggregate, Transform, Map), Dtype Conversions (Unexpected or buggy dtype conversions)

Comments

JCourt1 commented Nov 1, 2019

Code Sample

df = pd.DataFrame(data=[3], columns=['colA'], dtype=int)
print(df.colA.dtype)
#> dtype('int64')

# Here things are as expected. A `Series` of dtype=bool is produced
res = df.colA.apply(lambda row: True)
print(res.dtype)
#> dtype('bool')

df = pd.DataFrame(columns=['colA'], dtype=int)
print(df.colA.dtype)
#> dtype('int64')

# Here colA is empty. As a result, a `Series` of dtype=int64 is produced
res = df.colA.apply(lambda row: True)
print(res.dtype)
#> dtype('int64')

Problem description

May be related to #28427.

When calling apply() on a series, you expect to get back a series with a certain dtype based on the function you give it. However, if the base Series is empty, there is nothing to apply the function to, so the resultant Series just has the same dtype as the base.
My actual use case is more along the lines of this:

import pandas as pd
import datetime

# Is a List[datetime.datetime], but could be empty
datetime_data = [datetime.datetime.now()]
# datetime_data = []

df = pd.DataFrame(data=[False for i in range(len(datetime_data))], columns=['colA'], index=pd.DatetimeIndex(datetime_data))

df['colB'] = df.index.to_series().apply(lambda x: True if x > datetime.datetime.now() else False)

start_dt = datetime.datetime.now() - datetime.timedelta(days=5)
end_dt = datetime.datetime.now() + datetime.timedelta(days=5)
new_dt_range = pd.date_range(start_dt, end_dt)
df = df.reindex(new_dt_range, fill_value=True)

# If you run this, it will work. But uncomment the empty list datetime_data, and you get:

#> TypeError: Cannot convert input [True] of type <type 'bool'> to Timestamp

I would have thought there would be a way to force the dtype of the return value of .apply even when the base Series is empty. You can obviously get around this easily by checking whether the base is empty when you assign colB:

df['colB'] = df.index.to_series().apply(lambda row: True if row > datetime.datetime.now() else False) if not df.empty else True

This coerces the dtype without setting any values, but it seems hacky; it feels like it should be possible to just make apply do the right thing.
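Another workaround (a sketch, not an endorsed pandas idiom) is to cast the result explicitly with .astype, which yields the intended dtype whether or not the base Series is empty:

```python
import pandas as pd

s = pd.Series([], dtype='int64')  # empty base Series, as in the report

# apply() alone keeps the base dtype (int64) here; an explicit astype
# coerces the (empty) result to the dtype the function implies
res = s.apply(lambda x: x > 0).astype(bool)
print(res.dtype)  #> bool
```

This avoids the `if not df.empty` branch, at the cost of stating the expected dtype twice (once implicitly in the function, once in the cast).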

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.2.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 18.1
setuptools : 40.6.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

pambot (Contributor) commented Nov 4, 2019

I'm interested in taking a look.

mroeschke added the Apply (Apply, Aggregate, Transform, Map) and Dtype Conversions (Unexpected or buggy dtype conversions) labels Nov 4, 2019
pambot (Contributor) commented Nov 4, 2019

My initial thought is that it has something to do with this logic that goes first in the definition for apply in pandas.core.series:

if len(self) == 0:
    return self._constructor(dtype=self.dtype, index=self.index).__finalize__(self)

This would make it so that, no matter what, an empty series always returns a copy of itself and never checks whether the function given to apply might change the dtype. It's hard to think of a good solution because return types are determined at runtime, so it almost seems like we'd have to seed in values of a particular type, try out the function, get the intended return type, and then feed that into the constructor.
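A rough sketch of that seeding idea (illustrative only; `infer_result_dtype` is a hypothetical helper, not pandas API, and calling the function on a dummy value could misfire if the function has side effects or rejects the dummy):

```python
import numpy as np
import pandas as pd

def infer_result_dtype(series, func):
    # Hypothetical helper: construct a dummy value of the series' dtype,
    # call func on it, and take the result's type as the output dtype.
    dummy = series.dtype.type()           # e.g. np.int64() -> 0
    return np.result_type(type(func(dummy)))

s = pd.Series([], dtype='int64')
print(infer_result_dtype(s, lambda x: x > 0))  #> bool
```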

What about just making the return type object if the data is empty? Would this be too disruptive?

Also, while I could reproduce the behaviour in the spec, I couldn't reproduce the motivating example with the datetime -> bool conversion when I uncommented the empty list creation. The resulting series was an empty series with dtype of object.

One objection to this approach is that the default dtype for an empty series (i.e. Series(data=[])) is float64, so it would be confusing for it to be initialized as float64 and then converted to object once apply is called, even when the applied function doesn't do anything. Maybe both defaults should be changed to object?

JCourt1 (Author) commented Nov 4, 2019

Yeah, looks like it is that. From a quick blame check in tig, it seems to have been introduced in b7a6d1b on the grounds that it fixed a test; before that, in ddbfb3c, it just returned Series(), which is just as arbitrary, since that will always have dtype float as you say.

A quick fix would of course be to allow the user to pass a kwarg to apply() that enforces the dtype of the return value, but that seems sloppy.

I couldn't reproduce the motivating example with the datetime

What does pd.show_versions() show for you? I just recreated the environment on a different machine to the one I used before and am getting the same result

Maybe both defaults should be changed to object?

I think other things probably depend on the default being float.

pambot (Contributor) commented Nov 5, 2019

It looks like changing the default dtype of all empty series away from float64 would cause a whole bunch of test errors.

A quick fix would of course be to allow the user to pass a kwarg to apply() that enforces the dtype of the return value, but that seems sloppy.

I could put that in. Alternatively, we could do dtype_if_empty=<some type>, which is sloppy too. Basically I can't think of the best design here. Leaving it as float64 won't work and doesn't make sense, but we can't infer function return types statically. We could generate an instance of the dtype's type, run the function on it, and use the type of the result as the output dtype. That seems complicated, but it's the only thing I can think of that would produce the expected behaviour. I'd be happy to implement it, but I don't want to go down a rabbit hole unless people think it's a good idea. Could someone give me a design pointer?
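What the proposed keyword might behave like, sketched as a standalone wrapper (apply_with_dtype and dtype_if_empty are hypothetical names, not pandas API):

```python
import pandas as pd

def apply_with_dtype(series, func, dtype_if_empty):
    # Hypothetical wrapper: only the empty case needs the caller-supplied
    # dtype; the non-empty case infers the dtype from the results as usual.
    if series.empty:
        return pd.Series([], dtype=dtype_if_empty, index=series.index)
    return series.apply(func)

s = pd.Series([], dtype='int64')
print(apply_with_dtype(s, lambda x: x > 0, bool).dtype)  #> bool
```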

Here's my pd.show_versions()

### INSTALLED VERSIONS

commit : aa0f138
python : 3.7.2.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.24.0.dev0+3421.gaa0f1382f.dirty
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0
Cython : 0.29.14
pytest : 5.2.2
hypothesis : 4.43.1
sphinx : 2.2.1
blosc : 1.8.1
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : 0.3.5
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.6.1
xarray : 0.14.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2

jreback (Contributor) commented Nov 5, 2019

@pambot you can just run the function for the empty case and return the result

might just work

pambot (Contributor) commented Nov 6, 2019

@TomAugspurger I think you recommended we close this?

TomAugspurger (Contributor) commented
Probably.

For DataFrame.apply on an empty df, we should be able to pass the empty cols to the user's function.

For Series.apply, we can't do anything. There's no row to call the function on.

@jreback I think this issue is specifically about Series.apply.

JCourt1 (Author) commented Nov 6, 2019

That's right, it's about Series.apply. In a statically typed language, you would generally have the following situation:

Series<sometype> series_x = // whatever
Series<othertype> series_y = series_x.apply(some_lambda)

where the type of series_y is known statically (even if the input were empty). Obviously Python isn't like that, but at least having some way of forcing apply to produce the right dtype would be helpful, as we effectively get the expected behaviour except in the edge case where series_x is empty.

Although I understand your point when you say:

we can't do anything. There's no row to call the function on

But that argues it should actually throw an exception (I'm not suggesting it do so, as that would obviously cause a lot of breakages). So giving the option to force a dtype may be better than nothing.

jreback (Contributor) commented Nov 6, 2019

ok closing

yeah this is basically impossible to infer for a Series because we can’t call the function

jreback closed this as completed Nov 6, 2019
5 participants