
ENH: add empty() methods for DataFrame and Series #12291

Closed
wants to merge 1 commit into from


gfyoung
Member

@gfyoung gfyoung commented Feb 11, 2016

Added empty() methods to the Series and DataFrame classes analogous to the empty() function in the numpy library that can also accept scipy duck-type dtypes in addition to numpy dtypes.

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

Besides the flake8 issues, I am somewhat confused by all of these Travis failures, as I thought my additions were isolated from the rest of the codebase. Why are they occurring?

@jreback
Contributor

jreback commented Feb 11, 2016

.empty is a property of NDFrames already
and this method is not necessary

Series(index=range(4)) does this already for example
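To make the name clash concrete, here is a minimal sketch (assuming pandas is imported as pd; the variable names are illustrative only). It suggests that `.empty` already exists as a boolean property, and that it reports whether the object has zero elements rather than whether its values are uninitialized:

```python
import pandas as pd

# .empty is an existing property: True only when the object has no elements.
no_rows = pd.Series(dtype="float64")
all_missing = pd.Series(index=range(4), dtype="float64")

print(no_rows.empty)         # zero-length Series
print(all_missing.empty)     # four NaN entries, so it is not "empty"
print(pd.DataFrame().empty)  # DataFrame with no rows or columns
```

Because the attribute name is already in use, a method with the same name could not be added without changing existing behaviour.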

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

On second look, your suggestion doesn't quite match what I was proposing:

>>> from pandas import Series
>>> Series(index=range(4), dtype=int)
0    NaN
1    NaN
2    NaN
3    NaN
dtype: float64

I would think that Series initialization should respect the dtype, which is what my PR does.

@jreback
Contributor

jreback commented Feb 11, 2016

And so now numpy supports missing values with int?

that is the exception to the rule atm.

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

I'm not sure I understand your question.

@TomAugspurger
Contributor

You can't store the value np.nan in an int typed container

In [1]: np.array([np.nan], dtype='int64')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-12005059c4f1> in <module>()
----> 1 np.array([np.nan], dtype='int64')

ValueError: cannot convert float NaN to integer

See here for more. This could change at some point, but it's currently how things are.

Your change of filling with random bits of memory via np.empty is not the same as filling with the specific value of np.nan.
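The distinction can be sketched as follows, assuming only NumPy (variable names are illustrative): `np.empty` allocates storage without initializing it, NaN cannot be stored in an integer array, and a float array is needed to hold an explicit missing value.

```python
import numpy as np

# np.empty only allocates memory; the contents are arbitrary, not "missing".
buf = np.empty(3, dtype="int64")
print(buf.dtype, buf.shape)

# NaN is a floating-point value, so an integer array cannot hold it.
try:
    np.array([np.nan], dtype="int64")
    nan_in_int_ok = True
except ValueError as exc:
    nan_in_int_ok = False
    print("ValueError:", exc)

# Casting to float first makes room for an explicit missing value.
floats = np.array([1, 2, 3], dtype="int64").astype("float64")
floats[1] = np.nan
print(floats)
```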

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

@TomAugspurger : Agreed. Nevertheless, being able to create dummy Series or DataFrame in a similar manner to numpy with the specified dtype would be useful. My other PR #12284 is one example of this.

@jreback
Contributor

jreback commented Feb 11, 2016

@gfyoung how is Series(index=range(3), dtype=int) not exactly what you are after (not in relation to the other issue, but as a general empty creation method)?

pandas coerces from a user perspective, so you can give a specification which is not supported and it will just work.

@TomAugspurger
Contributor

And if you really need the empty data, then Series(np.empty(3, dtype=int)) will work.
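A short sketch of that suggestion, assuming NumPy and pandas are imported as np and pd. The dtype is spelled "int64" here only to keep the result platform-independent, and the `np.zeros` variant is an added illustration rather than something proposed in the thread:

```python
import numpy as np
import pandas as pd

# Wrapping an uninitialized integer array keeps the integer dtype,
# but the values themselves are arbitrary.
raw = np.empty(3, dtype="int64")
s = pd.Series(raw)
print(s.dtype, len(s))

# If predictable contents matter, np.zeros is a drop-in alternative.
filled = pd.Series(np.zeros(3, dtype="int64"))
print(filled.tolist())
```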

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

@jreback @TomAugspurger : Maybe I have been stuck in the numpy library for too long, but when I create an "empty" array-object with a specified dtype, I wouldn't think to have to make sure that the dtype is respected during initialization. And as you can observe in #12284, creating a "dummy" Series object with the specified dtype is in fact useful. The fact that current Series initialization with a numerical dtype and empty data always casts to np.int64 is the reason why I cannot remove the numpy dependency in that PR ATM, for lib.reduce expects the values and arr dtypes to match.

@jreback
Contributor

jreback commented Feb 11, 2016

@gfyoung empty is not uninitialized as it is in numpy. It's full of dtype-compatible missing values.

The other issue is related to the internals and how to handle these types of numpy bugs/issues. We have to work around them.

A series constructor should just work, coercing dtypes if needed. As a user you don't have to be concerned about it. As a code contributor, however, you have to be aware (and compensate for) these issues.

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

@jreback : I presume you are referring to my empty methods? I chose to put None in because that goes along similarly to what numpy does in some cases IINM here.

@jreback
Contributor

jreback commented Feb 11, 2016

@gfyoung I appreciate what numpy does, but that is wrong IMHO. It only makes sense if it's object dtype. pandas has in effect a much more detailed and richer missing value support system, so we really try hard to have appropriate values. Nothing is ever uninitialized, it's just missing. The exception is really int, which forces a casting to float because of the storage medium (numpy).
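The int exception can be illustrated with a small sketch (assuming pandas imported as pd; the data is made up): introducing a missing entry into an integer Series makes pandas upcast it to float so that NaN can be stored.

```python
import pandas as pd

ints = pd.Series([1, 2, 3], dtype="int64")

# Label 3 has no value in `ints`, so reindexing introduces a missing entry.
# NumPy integers cannot hold NaN, so the result is upcast to float64.
widened = ints.reindex(range(4))

print(ints.dtype)
print(widened.dtype)
print(widened)
```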

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

@jreback : Fair enough. I do think though it would be good to be able to create "dummy" Series and DataFrame objects on the fly with "dummy" or "missing" values with the specified dtype nonetheless. My initial documentation for these methods is inconsistent with (and probably misleading about) what pandas is trying to do, but I think the overall behaviour of those methods conforms relatively well with what you are saying. For "nice" datatypes (e.g. numerical datatypes), I would contend that numpy does just fine with putting in "appropriate" values.

@jreback
Contributor

jreback commented Feb 11, 2016

@gfyoung absolutely. And let me say I certainly appreciate your numpy background and viewpoints.

For all dtypes, passing a dtype= to a Series/DataFrame works exactly as intended, with the exception of int, which coerces to float.

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

@jreback : Also str. It gets coerced to object by Series at least.

@jreback
Contributor

jreback commented Feb 11, 2016

That is also a reflection of how pandas deals with strings. These are by definition object dtype; fixed length is not supported.

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

@jreback : So even for Python strings (variable length), that's how it is treated? Just curious, why is that the case?

Also, in light of your point about the exception, would it be worth re-opening this PR so that we can then create "dummy" Series and DataFrame objects with int data types?

@jreback
Contributor

jreback commented Feb 11, 2016

@gfyoung

@wesm of course would have the original motivation, but I suspect here are some reasons why fixed length strings are not a great idea in pandas:

  • object dtype is generally more memory efficient when dealing with variable-length strings (since we don't support in-memory compression)
  • we already support object types, so this avoids having to support yet another type
  • I think the biggest reason, though, is that setting values when indexing becomes much simpler: you don't have to worry about truncation or buffer reallocation, you simply reassign the value. With fixed-length strings, that could be a humungous cost if you are incrementally assigning longer strings to a str series.
  • you further don't have to worry about coercion to python strings when you are getting values
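The truncation concern in the list above can be sketched with plain NumPy (an illustration with made-up values, not pandas internals): a fixed-width string array silently cuts off longer values, whereas an object array just stores a reference to the new string.

```python
import numpy as np

# Fixed-width unicode array: the width is set by the longest initial element.
fixed = np.array(["ab", "cd"])
fixed[0] = "longer"          # does not fit in two characters, silently truncated

# Object array: each slot holds a reference to a Python str of any length.
boxed = np.array(["ab", "cd"], dtype=object)
boxed[0] = "longer"          # the slot is simply reassigned

print(fixed.dtype, fixed[0])
print(boxed.dtype, boxed[0])
```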

Not sure why you would want to expose 'dummy' for any purpose. It's purely internal to .apply, to try to determine whether a udf reduces; it's not generally useful. And how would this actually be different than what currently exists?

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

@jreback : Well, it was while working on the PR for .apply that I came up with the idea for this PR. I'm not sure if it's really a matter of exposing dummy rather than allowing you to create dummy objects if need be as a user. It would not be extremely different from what currently exists, except that you would now be able to create dummy Series and DataFrame objects with the specified integer dtypes.

Another reason (though this might be moot; I am not entirely sure): if you know, for example, what sort of DataFrame or Series object you will need, including dimensions and dtype, it seems perfectly reasonable IMHO that you should be able to just initialize the object beforehand and then populate it as you go.

@jreback
Contributor

jreback commented Feb 11, 2016

How can you create dummies with integer dtypes?

It is not efficient at all to create 'dummies' then populate them. In the world of a single dtype, sure you can, but when you have multiple dtypes (and esp. lots of inference on the indexers), this is not a good pattern.
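One common alternative to "allocate, then populate" is to collect the data first and construct the object once. The sketch below assumes pandas imported as pd; the column names and values are hypothetical and only there for illustration:

```python
import pandas as pd

# Hypothetical records, gathered as plain Python objects first.
records = []
for i in range(3):
    records.append({"id": i, "score": i * 0.5})

# Construct once: each column's dtype is inferred from the complete data,
# so a mixed-dtype frame needs no preallocation or per-cell assignment.
df = pd.DataFrame(records)
print(df.dtypes)
```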

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

I'll concede that in the context of the DataFrame, it does not make as much sense, but a Series?

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

When I say create a dummy with integer dtypes, it's essentially initializing an np.empty array with the integer dtype and then casting it into a Series for example.

@jorisvandenbossche
Member

Which is what @TomAugspurger said above: pd.Series(np.empty(3, dtype=int)), or is that not the desired result?

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

EDIT: @jorisvandenbossche : Sorry, misread your comment the first time. Yes, that is what I am looking for. But why not abstract into a method, which is what my PR does?

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

If the user wants to create an empty Series with the desired dtype, there is no reason why he/she should have to think about whether or not the dtype is numerical. That should be handled behind the scenes.

@jreback
Contributor

jreback commented Feb 11, 2016

@gfyoung

if you want to do

import numpy as np
from pandas import Series

arr = np.empty(3, dtype=int)
s = Series(arr)

# this is a view
s.values[0] = 5

will work, but ONLY for a Series or a single dtyped DataFrame. However this is not a typical pattern at all in pandas.

@gfyoung
Member Author

gfyoung commented Feb 11, 2016

@jreback : Fair enough. I thought there might be a use-case for it, but if that isn't something people do too often or at all, then we can lay this PR to rest. :)

@jorisvandenbossche
Member

> If the user wants to create an empty Series with the desired dtype, there is no reason why he/she should have to think about whether or not the dtype is numerical. That should be handled behind the scenes.

I don't understand this. Why would you have to think about that? You just specify the dtype you want in the empty function?

> But why not abstract into a method, which is what my PR does?

I don't have a strong opinion on this, but in any case the approach in this PR is not possible given that empty is already taken. A function (like np.empty) is also a possibility?

@gfyoung
Member Author

gfyoung commented Feb 12, 2016

@jorisvandenbossche :

>>> from pandas import Series
>>> Series(index=range(4), dtype=int)
0    NaN
1    NaN
2    NaN
3    NaN
dtype: float64

Notice how the dtype is not respected. The idea behind my PR was that you could call Series.empty(length, dtype=int) and get back Series(np.empty(length, dtype=int)). For non-numerical dtypes, you would just get Series(index=range(length), dtype=dtype).
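The behaviour described above could be sketched as a standalone function. This is a hypothetical helper written for illustration, not the code from the PR; the name `empty_series` and the rule used to decide what counts as "numerical" (NumPy kinds i, u and f) are assumptions:

```python
import numpy as np
import pandas as pd


def empty_series(length, dtype):
    """Hypothetical helper (not the PR's code): np.empty-like Series creation."""
    try:
        np_dtype = np.dtype(dtype)
    except TypeError:
        np_dtype = None  # not something NumPy understands, e.g. a pandas-only dtype

    if np_dtype is not None and np_dtype.kind in "iuf":
        # Numeric NumPy dtype: keep it by wrapping uninitialized storage.
        return pd.Series(np.empty(length, dtype=np_dtype))

    # Anything else: let the Series constructor fill in missing values.
    return pd.Series(index=range(length), dtype=dtype)


ints = empty_series(4, "int64")   # integer dtype kept, values arbitrary
objs = empty_series(4, object)    # object dtype, filled with missing values
print(ints.dtype, len(ints))
print(objs.dtype, len(objs))
```

The values in the numeric branch are whatever happened to be in memory, which is the property the maintainers object to elsewhere in this thread.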

The name was not really the issue, besides the fact that I had forgotten about empty already being an attribute of Series and DataFrame. np.empty only works for numpy data types and cannot handle pandas duck-typed dtypes. I was aiming to create np.empty-ish functionality that could also handle those additional dtypes.

In any case, this discussion is moot, since @jreback pointed out that such a use case is not as common compared to np.empty.

@gfyoung gfyoung added this to the No action milestone Nov 18, 2019
@gfyoung gfyoung added DataFrame DataFrame data structure Enhancement Series Series data structure labels Nov 18, 2019

4 participants