ENH: add empty() methods for DataFrame and Series #12291

gfyoung · 2016-02-11T11:18:36Z

Added empty() methods to the Series and DataFrame classes analogous to the empty() function in the numpy library that can also accept scipy duck-type dtypes in addition to numpy dtypes.

gfyoung · 2016-02-11T11:39:32Z

Besides the flake8 issues, I am somewhat confused by all of these Travis failures, as I thought my additions were isolated from the rest of the codebase. Why are they occurring?

jreback · 2016-02-11T12:02:49Z

.empty is a property of NDFrames already
and this method is not necessary

Series(index=range(4)) does this already for example

gfyoung · 2016-02-11T13:11:28Z

On second look, your suggestion doesn't quite entirely match what I was proposing:

>>> from pandas import Series
>>> Series(index=range(4), dtype=int)
0    NaN
1    NaN
2    NaN
3    NaN
dtype: float64

I would think that Series initialization should respect the dtype, which is what my PR does.

jreback · 2016-02-11T13:15:12Z

and so now numpy supports missing values with int?

that is the exception to the rule atm.

gfyoung · 2016-02-11T13:17:39Z

I'm not sure I understand your question.

TomAugspurger · 2016-02-11T13:41:41Z

You can't store the value np.nan in an int typed container

In [1]: np.array([np.nan], dtype='int64')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-12005059c4f1> in <module>()
----> 1 np.array([np.nan], dtype='int64')

ValueError: cannot convert float NaN to integer

See here for more. This could change at some point, but it's currently how things are.

Your change of filling with random bits of memory via np.empty is not the same as filling with the specific value of np.nan.
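
The distinction can be checked directly in numpy. The following is a minimal sketch, not code from the PR: it shows that an integer array rejects NaN, while np.empty honors the requested dtype but leaves the contents as whatever happened to be in memory.

```python
import numpy as np

# Storing NaN in an integer array fails, because NaN is a float value.
try:
    np.array([np.nan], dtype="int64")
    nan_rejected = False
except ValueError:
    nan_rejected = True

# np.empty allocates without initializing: the dtype is honored,
# but the values are arbitrary and must not be read as "missing".
uninitialized = np.empty(3, dtype="int64")

# A float array can hold NaN as an explicit missing-value marker.
missing = np.full(3, np.nan)

print(nan_rejected)
print(uninitialized.dtype)
print(np.isnan(missing).all())
```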

gfyoung · 2016-02-11T13:56:47Z

@TomAugspurger : Agreed. Nevertheless, being able to create dummy Series or DataFrame in a similar manner to numpy with the specified dtype would be useful. My other PR #12284 is one example of this.

jreback · 2016-02-11T14:02:10Z

@gfyoung how is Series(index=range(3),dtype=int) not exactly what you are after? (not related to the other issue), but a general empty creation method?

pandas coerces from a user perspective, so you can give a specification which is not supported and it will just work.

TomAugspurger · 2016-02-11T14:03:36Z

And if you really need the empty data, then Series(np.empty(3, dtype=int)) will work.

gfyoung · 2016-02-11T14:08:59Z

@jreback @TomAugspurger : Maybe I have been stuck in the numpy library for too long, but when I create an "empty" array-object with a specified dtype, I wouldn't expect to have to check that the dtype is respected during initialization. And as you can observe in #12284, creating a "dummy" Series object with the specified dtype is in fact useful. The fact that Series initialization with a numerical dtype and empty data currently casts everything to np.int64 is the reason why I cannot remove the numpy dependency in that PR ATM, for lib.reduce expects the values and arr dtypes to match.

jreback · 2016-02-11T14:41:00Z

@gfyoung empty is not uninitialized as it is in numpy. It's full of dtype-compatible missing values.

The other issue is related to the internals and how to handle these types of numpy bugs/issues. We have to work around them.

A series constructor should just work, coercing dtypes if needed. As a user you don't have to be concerned about it. As a code contributor, however, you have to be aware (and compensate for) these issues.

gfyoung · 2016-02-11T14:52:38Z

@jreback : I presume you are referring to my empty methods? I chose to put None in because that goes along similarly to what numpy does in some cases IINM here.

jreback · 2016-02-11T14:57:11Z

@gfyoung I appreciate what numpy does, but that is wrong IMHO. It only makes sense if it's object dtype. pandas has in effect a much more detailed and richer missing value support system, so we really really try hard to have appropriate values. Nothing is ever uninitialized, it's just missing. The exception is really int, which forces a casting to float because of the storage medium (numpy).

gfyoung · 2016-02-11T15:03:02Z

@jreback : Fair enough. I do think though it would be good to be able to create "dummy" Series and DataFrame objects on the fly with "dummy" or "missing" values with the specified dtype nonetheless. My initial documentation for these methods is inconsistent (and probably misleading) with what pandas is trying to do, but I think the overall behaviour of those methods conforms relatively well with what you are saying. For "nice" datatypes (e.g. numerical datatypes), I would contend that numpy does just fine with putting in "appropriate" values.

jreback · 2016-02-11T15:06:37Z

@gfyoung absolutely. And let me say I certainly appreciate your numpy background and viewpoints.

For all dtypes, passing a dtype= to a Series/DataFrame works exactly as intended, with the exception of int, which coerces to float.

gfyoung · 2016-02-11T15:08:53Z

@jreback : Also str. It gets coerced to object by Series at least.

jreback · 2016-02-11T15:10:03Z

that is also a representation in how pandas deals with strings. These are by definition object dtype. fixed length is not supported.

gfyoung · 2016-02-11T15:11:39Z

@jreback : So even for Python strings (variable length), that's how it is treated? Just curious, why is that the case?

Also, in light of your point about the exception, would it be worth re-opening this PR so that we can then create "dummy" Series and DataFrame objects with int data types?

jreback · 2016-02-11T15:17:57Z

@gfyoung

@wesm of course would have the original motivation, but I suspect here are some reasons why fixed length strings are not a great idea in pandas:

- generally more memory efficient when dealing with variable length strings (since we don't support in-memory compression)
- we already support object types, so this avoids having to support yet another type
- I think the biggest reason, though, is that setting values when indexing becomes much simpler: you don't have to worry about truncation or buffer reallocation, you simply reassign the value. With fixed-length strings this could be a humongous cost if you are incrementally assigning longer strings to a str series.
- you further don't have to worry about coercion to python strings when you are getting values
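
The truncation point in the list above can be illustrated with plain numpy. This is a minimal sketch, not pandas internals: it compares a fixed-width unicode array with an object array when a longer string is assigned.

```python
import numpy as np

# Fixed-width unicode array: the width is set by the longest initial string.
fixed = np.array(["ab", "cd"])
fixed[0] = "longer"            # silently truncated to the 2-character width

# Object array: each slot holds a reference to an ordinary Python str.
flexible = np.array(["ab", "cd"], dtype=object)
flexible[0] = "longer"         # stored in full, the array itself is not resized

print(fixed.dtype.kind, fixed[0])
print(flexible.dtype.kind, flexible[0])
```

Keeping the longer string in the fixed-width case would require allocating a new, wider array and copying every element, which is the reallocation cost mentioned above.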

not sure why you would want to expose 'dummy' for any purpose, it's purely internal to .apply to try to determine whether a udf reduces, it's not generally useful. And how would this actually be different than what currently exists?

gfyoung · 2016-02-11T15:48:59Z

@jreback : Well, it was while working on the PR for .apply that I came up with the idea for this PR. I'm not sure it's really a matter of exposing dummy so much as allowing you to create dummy objects if need be as a user. It would not be extremely different from what currently exists, except that you would now be able to create dummy Series and DataFrame objects with the specified integer dtypes.

Another reason (though this might be moot, I am not entirely sure): if you know, for example, what sort of DataFrame or Series object you will need, including dimensions and dtype, it seems perfectly reasonable IMHO that you should be able to just initialize the object beforehand and then populate it as you go.

jreback · 2016-02-11T15:58:21Z

How can you create dummies with integer dtypes?

it is not efficient at all to create 'dummies' then populate them. In the world of a single dtype, sure you can, but when you have multiple dtypes (and esp lots of inference on the indexers), this is not a good pattern.

gfyoung · 2016-02-11T18:07:49Z

I'll concede that in the context of the DataFrame, it does not make as much sense, but a Series?

gfyoung · 2016-02-11T18:09:28Z

When I say create a dummy with integer dtypes, it's essentially initializing an np.empty array with the integer dtype and then casting it into a Series for example.

jorisvandenbossche · 2016-02-11T23:32:44Z

Which is what @TomAugspurger said above: pd.Series(np.empty(3, dtype=int)), or is that not the desired result?

gfyoung · 2016-02-11T23:40:00Z

EDIT: @jorisvandenbossche : Sorry, misread your comment the first time. Yes, that is what I am looking for. But why not abstract into a method, which is what my PR does?

gfyoung · 2016-02-11T23:41:30Z

If the user wants to create an empty Series with the desired dtype, there is no reason why he/she should have to think about whether or not the dtype is numerical. That should be handled behind the scenes.

jreback · 2016-02-11T23:45:47Z

@gfyoung

if you want to do

arr = np.empty(3,dtype=int)
s = Series(arr)

# this is a view
s.values[0] = 5

will work, but ONLY for a Series or a single dtyped DataFrame. However this is not a typical pattern at all in pandas.
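
A minimal sketch of the allocate-then-populate pattern under discussion, assuming a single-dtype Series. Whether the Series shares memory with arr (the "view" in the snippet above) depends on the pandas version and its copy-on-write settings, so this sketch fills the Series through .iloc instead of writing to s.values.

```python
import numpy as np
import pandas as pd

# Allocate an integer buffer without initializing it, then wrap it.
arr = np.empty(3, dtype="int64")
s = pd.Series(arr)

# Populate every slot; integer assignments keep the integer dtype.
for position, value in enumerate([5, 6, 7]):
    s.iloc[position] = value

print(s.dtype)
print(s.tolist())
```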

gfyoung · 2016-02-11T23:49:23Z

@jreback : Fair enough. I thought there might be a use-case for it, but if that isn't something people do too often or at all, then we can lay this PR to rest then. :)

jorisvandenbossche · 2016-02-11T23:51:11Z

> If the user wants to create an empty Series with the desired dtype, there is no reason why he/she should have to think about whether or not the dtype is numerical. That should be handled behind the scenes.

I don't understand this. Why would you have to think about that? You just specify the dtype you want in the empty function?

> But why not abstract into a method, which is what my PR does?

I don't have a strong opinion on this, but in any case the approach in this PR is not possible given that empty is already taken. A function (like np.empty) is also a possibility?

gfyoung · 2016-02-12T00:08:03Z

@jorisvandenbossche :

>>> from pandas import Series
>>> Series(index=range(4), dtype=int)
0    NaN
1    NaN
2    NaN
3    NaN
dtype: float64

Notice how the dtype is not respected. The idea behind my PR was that you could call Series.empty(length, dtype=int) and get back Series(np.empty(length, dtype=int)). For non-numerical dtypes, you would just get Series(index=range(length), dtype=dtype).
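
The dispatch described above can be sketched as a standalone function. This is not the code from the PR: empty_series is a hypothetical name, and treating "numerical" as "a plain numpy bool, integer, or float dtype" is an assumption made for the illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype


def empty_series(length, dtype):
    """Hypothetical helper, not a pandas API: build a placeholder Series."""
    dtype = pandas_dtype(dtype)
    # Assumption: "numerical" means a plain numpy bool/int/uint/float dtype.
    if isinstance(dtype, np.dtype) and dtype.kind in "biuf":
        # Contents are uninitialized; only the length and dtype are meaningful.
        return pd.Series(np.empty(length, dtype=dtype))
    # Everything else falls back to the constructor's missing-value fill.
    return pd.Series(index=range(length), dtype=dtype)


ints = empty_series(4, int)
objs = empty_series(4, object)
print(ints.dtype.kind, len(ints))
print(objs.dtype, objs.isna().all())
```

Note that the integer result holds arbitrary values rather than missing values, which is the objection raised earlier in the thread.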

The name was not really the issue, besides the fact that I had forgotten about empty already being an attribute of Series and DataFrame. np.empty only works for numpy data types and cannot handle pandas duck-typed dtypes. I was aiming to create np.empty-ish functionality that could also handle those additional dtypes.

In any case, this discussion is moot, since @jreback pointed out that such a use case is not as common compared to np.empty.

ENH: add empty() methods for DataFrame and Series

2fbb61c

jreback closed this Feb 11, 2016

gfyoung deleted the empty_struct branch February 11, 2016 12:18

gfyoung mentioned this pull request Feb 11, 2016

BUG: Allow 'apply' to be used with non-numpy-dtype DataFrames #12284

Closed

gfyoung added this to the No action milestone Nov 18, 2019

gfyoung added the DataFrame, Enhancement, and Series labels Nov 18, 2019
