COMPAT: .map iterates over python types rather than storage type #13236


Closed
glaucouri opened this issue May 20, 2016 · 17 comments · Fixed by #17491
Labels
API Design Compat pandas objects compatability with Numpy or Python functions Docs Dtype Conversions Unexpected or buggy dtype conversions
Milestone

Comments

@glaucouri

glaucouri commented May 20, 2016

Code Sample, a copy-pastable example if possible

import pandas as P
S=P.Series([0.6,0.2,15])

pandas 0.18 + numpy 1.10:

In [1]: print S.dtype
float64

In [2]: print S.values.dtype
float64

In [3]: print S.map(type)
0    <type 'numpy.float64'>
1    <type 'numpy.float64'>
2    <type 'numpy.float64'>
dtype: object

pandas 0.18.1 + numpy 1.11.0:

In [5]: print S.dtype
float64

In [6]: print S.values.dtype
float64

In [7]: print S.map(type)
0    <type 'float'>
1    <type 'float'>
2    <type 'float'>
dtype: object

I expect to get the same type from all three prints. Why did this change in the latest version?
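For reference, a quick check (a sketch, assuming a recent pandas ≥ 1.0 is installed) shows that on current versions both .map and plain iteration over a float64 Series yield built-in Python floats:

```python
import pandas as pd

s = pd.Series([0.6, 0.2, 15])

# .map applies the function to boxed (Python) scalars
print(s.map(type).iloc[0])          # <class 'float'>

# plain iteration over the Series also yields Python floats
print(type(next(iter(s))))          # <class 'float'>
```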

output of pd.show_versions()

pandas: 0.18.1
nose: 1.3.7
pip: 1.5.4
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2016.4
blosc: 1.2.7
bottleneck: None
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.4.0
bs4: 4.4.1
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Thank you
Gla

@jreback
Contributor

jreback commented May 20, 2016

#12473 fixed map, which was failing on non-numpy-aware dtypes. This now iterates over Python float values, so it is correct (it was essentially a bug before).

cc @sinhrks

@jreback jreback closed this as completed May 20, 2016
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Compat pandas objects compatability with Numpy or Python functions Usage Question labels May 20, 2016
@jreback jreback added this to the No action milestone May 20, 2016
@glaucouri
Author

glaucouri commented May 20, 2016

But the underlying data type IS a numpy dtype.

In this example it is reasonable to get float32 for all elements, the same type.

In [4]: S.astype(P.np.float32).map(type)
Out[4]: 
0    <type 'float'>
1    <type 'float'>
2    <type 'float'>
dtype: object

In [5]: map(type, S.astype(P.np.float32))
Out[5]: [numpy.float32, numpy.float32, numpy.float32]

thank you
Gla
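If the numpy scalar types are what you want, one workaround (a sketch, not an official guarantee) is to iterate over the underlying array rather than the Series, since iterating a numpy ndarray still yields numpy scalars:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.6, 0.2, 15]).astype(np.float32)

# iterating the raw ndarray preserves the storage type
types = [type(x) for x in s.to_numpy()]
print(types[0])   # <class 'numpy.float32'>
```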

@jreback
Contributor

jreback commented May 20, 2016

So < 0.18.1 we were NOT using .asobject and instead iterating over the actual numpy types. Now these are converted to python types and then iterated.

In [1]: s = Series([1],dtype='float32')

In [2]: pd.lib.map_infer(s.values, type)
Out[2]: array([<type 'numpy.float32'>], dtype=object)

In [4]: pd.lib.map_infer(s.asobject, type)
Out[4]: array([<type 'float'>], dtype=object)

I think this is more correct actually (I don't think this was tested before), and we did not provide a guarantee (either way) of what types it would return.

cc @sinhrks
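pd.lib is private and has since been removed from the public namespace; the same boxing can be reproduced today with ndarray.astype(object), which converts each numpy scalar to its Python equivalent (a sketch of the same comparison):

```python
import numpy as np

arr = np.array([1.0], dtype='float32')

# indexing the raw array yields a numpy scalar
print(type(arr[0]))                  # <class 'numpy.float32'>

# astype(object) boxes each element as a Python float, as .asobject did
print(type(arr.astype(object)[0]))  # <class 'float'>
```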

@jreback
Contributor

jreback commented May 20, 2016

I will reopen for discussion.

@jreback jreback reopened this May 20, 2016
@jreback jreback removed this from the No action milestone May 20, 2016
@jreback jreback changed the title strange float type promotion COMPAT: .map iterates over python types rather than storage type May 20, 2016
@sinhrks
Member

sinhrks commented May 21, 2016

Consider map applied to a datetime-like Series. Because np.datetime64 doesn't have convenient properties like pd.Timestamp does, map coerces from the numpy dtype to the pandas/python type (Timestamp inherits from datetime.datetime).

s = pd.Series([pd.Timestamp('2011-01-01')])
s.dtype
# dtype('<M8[ns]')

s.map(type)
# 0    <class 'pandas.tslib.Timestamp'>
# dtype: object

I feel it is natural that numerics coerce to the Python representation as well. Can you provide a usage in which np.float is preferable?
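The convenience gap is easy to see: pd.Timestamp carries calendar accessors that np.datetime64 simply does not have (a quick illustration):

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp('2011-01-01')
print(ts.year)                    # 2011

dt64 = np.datetime64('2011-01-01')
print(hasattr(dt64, 'year'))      # False
```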

@jreback
Contributor

jreback commented May 21, 2016

yeah I think this only matters for non-extension & i8 types (e.g. int/float)
(strings / bools, extension types, and i8 are better off with python (and pandas) types for sure).

I think it does make sense to return native types (float vs np.float), as it has a lot more utility.

Let's update the doc-string to indicate this?

so repurposing this issue.

@jreback jreback added this to the Next Major Release milestone May 21, 2016
@jreback jreback modified the milestones: 0.18.2, Next Major Release May 21, 2016
@jreback
Contributor

jreback commented May 21, 2016

@glaucouri want to do a pull-request?

@glaucouri
Author

Before making the proposal I need to understand what you mean by "it does make sense to return native types".

Why is it preferred to convert every single element of a Series if it is not explicitly required?

It is probably more efficient to iterate over the stored values and let the user cast only if necessary (with the useful map method).

Thank you
Glauco

@sinhrks
Member

sinhrks commented May 23, 2016

You shouldn't use map at all if you care about performance; it focuses on convenience. Do you have any other reason to prefer numpy dtypes?

@jreback
Contributor

jreback commented May 23, 2016

Before making the proposal i need to understand well what do you mean with: " it does make sense to return native types" .

When iterating you want to have native types if at all possible, so the user doesn't even need to think about this. numpy types are not normal for iterating in python; yes they are the 'same', but they often don't have the right methods or behaviors, especially for things like `np.timedelta64`/`np.datetime64`, not to mention extension types. Since we are already converting to pythonic dtypes, it makes sense to complete this with int/float.
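One concrete case where numpy scalars "don't have the right behaviors" (an illustrative sketch): the standard json module serializes a built-in float but rejects an np.float32.

```python
import json
import numpy as np

x = np.float32(0.5)

# converting to a Python float first works fine
print(json.dumps(float(x)))   # 0.5

# the raw numpy scalar is not JSON serializable
try:
    json.dumps(x)
except TypeError as e:
    print(e)
```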

@glaucouri
Author

This is true, map is not for performance.
Jeff, what you're saying is reasonable, but my doubt is simple: if pandas is a layer over numpy and internally uses its dtypes, I think it is consistent to have the same behavior from builtins.map and S.map. That is what actually happens for Timestamp.

Why do you expect that the map iterator must change the underlying types (sometimes)?

In the example above, if I cast a Series to float64 I expect the type I've cast to, in the same manner as, when I iterate with map over a datetime64[*], I expect to be able to use "to_period".

Thank you
your interest is very appreciated
Glauco

@jreback
Contributor

jreback commented May 23, 2016

@glaucouri well, pandas is not simply a layer over numpy; it hides numpy more and more (and will continue to do so).

The iterator type will always change the type to a python one. I don't see any argument not to do this. Essentially you are creating scalars, python scalars are perfectly fine types, vastly superior to numpy scalars.

I'd appreciate it if you would update the documentation.

so __iter__ boxes datetime (and extension) types; categoricals don't matter here. I think we should also return python ints/floats; I'll make that a separate issue. .map has different logic because it can possibly handle things in a more vectorized way rather than as an iterator.

@glaucouri
Author

glaucouri commented May 24, 2016

Hi Jeff,
it is not my intent to underestimate the work done by pandas...
my intent was to focus on the underlying data types.

Do you want to extend the box-to-python-type behaviour to all kinds of iteration over Series?

Take the first example:

S=P.Series([0.6,0.2,15])

And now we try various kinds of iteration on it:

In [24]: map(type,S)
Out[24]: [numpy.float64, numpy.float64, numpy.float64]

In [25]: map(type,[x for x in S])
Out[25]: [numpy.float64, numpy.float64, numpy.float64]

In [26]: map(type, S.tolist())
Out[26]: [numpy.float64, numpy.float64, numpy.float64]

In [27]: map(type, list(S))
Out[27]: [numpy.float64, numpy.float64, numpy.float64]

In [28]: S.map(type)
Out[28]: 
0    <type 'float'>
1    <type 'float'>
2    <type 'float'>
dtype: object

In [29]: map(type, tuple(S))
Out[29]: [numpy.float64, numpy.float64, numpy.float64]

In [30]: map(type, iter(S))
Out[30]: [numpy.float64, numpy.float64, numpy.float64]

If I'm not wrong, some of these examples call iter explicitly but actually work differently from map. Am I missing something?

Thank you Jeff, I will work on the docs (I just figured out how to do it).

Gla

@jreback
Contributor

jreback commented May 24, 2016

@glaucouri use master and you will see that .tolist() has changed. You are still showing what is happening, not why you think it's a good idea to revert this.

@glaucouri
Author

glaucouri commented May 26, 2016

Ok, I did not realize they were already available in master.

In [3]: import pandas as P

In [4]: P.__version__
Out[4]: '0.18.1+70.gb4e2d34'

In [5]: S=P.Series([0.6,0.2,15])

In [6]: map(type,S)
Out[6]: [numpy.float64, numpy.float64, numpy.float64]

In [7]: map(type,[x for x in S])
Out[7]: [numpy.float64, numpy.float64, numpy.float64]

In [8]: map(type, S.tolist())
Out[8]: [float, float, float]

In [9]: map(type, list(S))
Out[9]: [numpy.float64, numpy.float64, numpy.float64]

In [10]: S.map(type)
Out[10]: 
0    <type 'float'>
1    <type 'float'>
2    <type 'float'>
dtype: object

In [11]: map(type, tuple(S))
Out[11]: [numpy.float64, numpy.float64, numpy.float64]

In [12]:  map(type, iter(S))
Out[12]: [numpy.float64, numpy.float64, numpy.float64]

So the approach is to implement a kind of 'Python nativization' explicitly, only in the tolist and map methods?

To be honest I prefer a solution where the type is not changed at all, but if a solution like this one is preferred, I expect all kinds of __iter__ calls to work in the same manner. Especially map(type, S) and S.map(type): behaving differently seems odd to me.

Moreover, this casting has a non-negligible ~20% overhead:

In [17]: %timeit map(type,S)
1 loop, best of 3: 723 ms per loop

In [18]: %timeit S.map(type)
1 loop, best of 3: 873 ms per loop

Thank you again Jeff.
Glauco

@jreback
Contributor

jreback commented May 26, 2016

@glaucouri see #13258 to fix the iteration.

if you care about perf you are going about this the wrong way. .map is exactly that: iteration in python space, which is non-performant compared to vectorized methods.
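As a concrete contrast (a sketch), a vectorized operation gives the same result as .map without any Python-level iteration, and is the idiomatic choice when performance matters:

```python
import pandas as pd

s = pd.Series([0.6, 0.2, 15])

# element-wise: calls the Python function once per boxed scalar
via_map = s.map(lambda x: x * 2)

# vectorized: one call into compiled code, no per-element boxing
vectorized = s * 2

print((via_map == vectorized).all())   # True
```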

@sinhrks
Member

sinhrks commented May 26, 2016

You should provide enough reason to break existing users' code that relies on numpy dtypes.

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 Aug 15, 2016
@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 10, 2017