BUG: don't lose dtypes when concatenating empty array-likes #5742

Contributor

immerrr commented Dec 19, 2013

I develop an application that does quite a bit of data manipulation. Since pandas is functional but not heavily optimized, I use it to maintain label consistency and for grouping/merging data; the heavy-duty maths is usually done with numpy ufuncs. The application contains entities that start out with no data and receive it over their lifetimes. Every once in a while an incoming data chunk contains no data for a certain entity. Usually that's fine, but if the entity has just been created, the following happens:

In [1]: pd.__version__ 
Out[1]: '0.13.0rc1-92-gf6fd509'

In [2]: data = pd.Series(dtype=np.float)

In [3]: chunk = pd.Series(dtype=np.float)

In [4]: pd.concat([data, chunk])
Out[4]: Series([], dtype: object)

After that, ufuncs like isnan stop working on data.values because its dtype has changed to object. This PR fixes it.
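A minimal sketch of the symptom, assuming a pandas release that already includes this fix and a NumPy where `np.float` has to be spelled `np.float64`. The variable names `as_object` and `isnan_failed` are made up for the illustration:

```python
import numpy as np
import pandas as pd

# Two empty float Series, mirroring the session above.
data = pd.Series(dtype=np.float64)
chunk = pd.Series(dtype=np.float64)

# With the fix in place, the float dtype should survive the concatenation.
result = pd.concat([data, chunk])
print(result.dtype)

# Why an object dtype hurts: np.isnan has no loop for object arrays,
# so calling it raises TypeError instead of returning a boolean mask.
as_object = np.array([1.0, np.nan], dtype=object)
try:
    np.isnan(as_object)
    isnan_failed = False
except TypeError:
    isnan_failed = True
print(isnan_failed)
```

On the version shown in the session above, the first print would report object instead of a float dtype.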

Contributor

jreback commented Dec 19, 2013

this is fine, pls add a release notes entry (use this PR number as the issue number); you can add to bug_fixes at the end

@@ -11806,6 +11806,23 @@ def test_to_csv_date_format(self):

assert_frame_equal(test, nat_frame)

def test_concat_empty_accounts_dtypes(self):

can't use the name accounts for this. test_concat_empty_dataframe_dtypes is fine.

Contributor Author

immerrr commented Dec 20, 2013

Done that and squashed all the commits to a single one.

Contributor

jreback commented Dec 20, 2013

@immerrr pls rebase...this is going to break after #5757, so need to resolve that

Contributor Author

immerrr commented Dec 20, 2013

@jreback here you go

UPD: doesn't work, I guess simply merging it was too easy to be the solution :) will see to it later.

Contributor

jreback commented Dec 20, 2013

@immerrr thanks

Contributor Author

immerrr commented Dec 21, 2013

The test fails because bool_ columns are made equivalent to object_ in _concat_single_item (pandas/tools/merge.py) which seems weird: any numeric type is good enough to represent the boolean domain. And numpy kind of agrees with that: np.result_type(np.bool_, np.int8) == np.int8.

Is there a reason I don't see behind this decision?
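For reference, a short sketch of the NumPy promotion behaviour mentioned above. It only shows what `np.result_type` and `np.concatenate` do with boolean inputs and makes no claim about what pandas does internally; the variable names are illustrative:

```python
import numpy as np

# NumPy's promotion rules treat bool as the "smallest" numeric type.
bool_int = np.result_type(np.bool_, np.int8)
bool_float = np.result_type(np.bool_, np.float64)
bool_object = np.result_type(np.bool_, np.object_)
print(bool_int, bool_float, bool_object)

# Concatenating a bool array with an int8 array yields int8,
# with True/False stored as 1/0.
mixed = np.concatenate([np.array([True, False]),
                        np.array([1, 2], dtype=np.int8)])
print(mixed.dtype, mixed.tolist())
```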

Contributor

jreback commented Dec 21, 2013

concat single item is only called when u r appending different dtypes in a single column (which is generally odd)

this is quite tricky because u don't want to automatically cast to object (which is in general the result type for pretty much any object and anything else)

because u can sometimes cast to a more appropriate dtype

for example if u have bool and then other frame is empty no matter the dtype you would be ok

so may have to handle that a bit like I do datetime/timedelta

unlike datelike, with bools you cannot have a column that is non-empty but all NaN - u can only append bools to bools or bools to an empty frame

u can make a case for allowing appending with uint8 but I would not allow it
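The rule described in this comment could be sketched roughly as follows. This is not the pandas implementation: `resolve_concat_dtype` is a hypothetical helper invented for illustration, it assumes a non-empty list of 1-d NumPy arrays, and it leaves out the datetime/timedelta handling entirely:

```python
import numpy as np

def resolve_concat_dtype(arrays):
    """Hypothetical helper: choose a result dtype for concatenating 1-d arrays."""
    # Empty inputs carry no values, so they should not influence the result.
    non_empty = [a.dtype for a in arrays if a.size > 0]
    if not non_empty:
        # Everything is empty: promote across the declared dtypes instead.
        return np.result_type(*[a.dtype for a in arrays])
    kinds = {dt.kind for dt in non_empty}
    # Bools only combine with bools; anything else falls back to object
    # rather than being coerced to a numeric dtype.
    if "b" in kinds and len(kinds) > 1:
        return np.dtype(object)
    return np.result_type(*non_empty)

bool_plus_empty = resolve_concat_dtype(
    [np.array([True, False]), np.array([], dtype=np.float64)])
bool_plus_uint8 = resolve_concat_dtype(
    [np.array([True]), np.array([1], dtype=np.uint8)])
all_empty = resolve_concat_dtype(
    [np.array([], dtype=np.float64), np.array([], dtype=np.float64)])
print(bool_plus_empty, bool_plus_uint8, all_empty)
```

Under these assumptions, bool plus an empty frame stays bool, bool plus uint8 becomes object, and two empty float inputs keep their float dtype.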

Contributor

jreback commented Dec 21, 2013

I am not a big dan of coercing bools to numeric either - u could put this in but again requires some special logic (eg if all types can be cast to numeric and u have bools, but no date like then prob ok)

Contributor

jreback commented Dec 21, 2013

Dan -> fan

@jtratner
Contributor

I'm -1 on having pandas internals coerce bool to unsigned right now. We haven't built up particularly good support for unsigned ints yet.

Contributor Author

immerrr commented Dec 23, 2013

Ok, it feels like boolean coercion itself is worth a discussion that won't fit here, so let's skip that.

Now, to the issue. _concat_compat created 1-d arrays unconditionally (even when 2-d arrays were passed in), which raised exceptions and caused a fallback to item-by-item concatenation. There goes another round of fixing/rebasing, have a look.
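The dimensionality problem can be illustrated with plain NumPy. This is only a sketch of why a 1-d empty array cannot stand in for an empty 2-d block; it does not reproduce `_concat_compat` itself, and the variable names are made up:

```python
import numpy as np

block_2d = np.ones((2, 3), dtype=np.float64)

# An empty block that keeps its second dimension concatenates cleanly
# and preserves both shape and dtype.
empty_2d = np.empty((0, 3), dtype=np.float64)
stacked = np.concatenate([empty_2d, block_2d], axis=0)
print(stacked.shape, stacked.dtype)

# An empty array flattened to 1-d no longer matches the 2-d block,
# so np.concatenate raises ValueError.
empty_1d = np.empty(0, dtype=np.float64)
try:
    np.concatenate([empty_1d, block_2d], axis=0)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```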

Contributor

jreback commented Jan 15, 2014

can you rebase and move notes to 0.13.1....

Contributor Author

immerrr commented Jan 16, 2014

Sure

Contributor

jreback commented Jan 17, 2014

can you just squash this down to 1 commit, thanks...otherwise looks fine

jreback added a commit that referenced this pull request Jan 18, 2014
…mpty-arraylikes

BUG: don't lose dtypes when concatenating empty array-likes
@jreback jreback merged commit a37900e into pandas-dev:master Jan 18, 2014
Contributor

jreback commented Jan 18, 2014

thanks!

@immerrr immerrr deleted the dont-lose-dtype-concatenating-empty-arraylikes branch February 12, 2014 23:43
3 participants