Skip to content

ENH: try to preserve the dtype on combine_first for the case where the two DataFrame objects have the same columns #39051

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jan 15, 2021

Conversation

danielhrisca
Copy link
Contributor

…e two DataFrame objects have the same columns

@pep8speaks
Copy link

pep8speaks commented Jan 9, 2021

Hello @danielhrisca! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-01-15 14:44:28 UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls always add tests first.

@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Jan 9, 2021
@danielhrisca danielhrisca force-pushed the keep_dtypes_on_combine_first branch from 7626b8b to 0c1d126 Compare January 11, 2021 08:25
@danielhrisca
Copy link
Contributor Author

@jreback I've changed the implementation and added tests

@@ -6438,6 +6440,11 @@ def combine_first(self, other: DataFrame) -> DataFrame:
other : DataFrame
Provided DataFrame to use to fill null values.

preserve_dtypes : bool, default False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not want to add a flag for this. simply change it.

Please add some examples for this behaivor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that maybe it is a good idea to keep the current behavior as default, and provide the new behavior as an option

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no its better to just fix this, you can add a whatsnew note in 1.3. everywhere else we cast to common dtypes, this should be no different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm running into some failed tests that exceed my understand of the lib. Is it expected that if a Series is constructed from a list of None then the result of this combined with some other Series should have the latter's dtype (coercing to the respective NaN value)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list of None -> object, so combined -> object

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need a whatsnew note in 1.3 under breaking changes, IOW it will need a subsection showing the before and after

# if the column has different dtype in the
# DataFrame objects then add the common dtype
# to the columns dtype conversion dict
if combined.dtypes[col] != self.dtypes[col]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use is_dtype_equal here

dtypes[col] = find_common_type(
[self.dtypes[col], other.dtypes[col]]
)
except TypeError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not want to do multiple try/excepts ever as these tend to hide errors.
in fact you should not need here at all. find_common_type will always succeed (it could of course be object).


dtypes = {}

for col in self.columns.intersection(other.columns):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a simple list-comprehension

frame = DataFrame([[na_value, na_value]], columns=["a", "b"])
other = DataFrame([[scalar1, scalar2]], columns=["b", "c"])

try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use try/except in tests. explicity specify the expected



def test_combine_preserve_dtypes():
a = Series(["a", "b"], index=range(2))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment

@danielhrisca
Copy link
Contributor Author

I think all request have been met. The failed test does not seem to be related to the changes done in this PR, most probably come after I re-synced with the master branch


.. ipython:: python

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=[0, 1, 2])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can do the construction in an ioythok block before both of these as it's the same

only show the combined (and dtypes) in the before / after

col: find_common_type([self.dtypes[col], other.dtypes[col]])
for col in self.columns.intersection(other.columns)
if not is_dtype_equal(combined.dtypes[col], self.dtypes[col])
and not is_dtype_equal(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do u have this complicated condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first if check for columns that are candidates for preserving the dtype, and the second if avoids extra computation if the common dtype is the same as the combined dtype

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok no need for that, as astype will handle this if you pass copy=False, though i actually don't think this really matters, i would just remove the redudant check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, doc comments. ping on green.


.. code-block:: ipython

In [1]: (combined, "---------------", combined.dtypes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need this extra printing, you can just whos combine.dtypes


.. ipython:: python

In [1]: (combined, "---------------", combined.dtypes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't add ipython prompts, just the code

@danielhrisca
Copy link
Contributor Author

@jreback I've done the changes but again there are failed builds for which I can see no connection with this PR

@jreback jreback added this to the 1.3 milestone Jan 14, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. @jbrockmendel if any comments.


c = Series(["a", "b"], index=range(5, 7))
b = Series(range(-1, 1), index=range(5, 7))
g = DataFrame({"B": b, "C": c})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: can you avoid 1-letter variable names? makes it harder to grep for things

@jbrockmendel
Copy link
Member

minor nitpick, otherwise looks good

@jreback jreback merged commit 963cf2b into pandas-dev:master Jan 15, 2021
@jreback
Copy link
Contributor

jreback commented Jan 15, 2021

thanks @danielhrisca very nice

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

combine_first not retaining dtypes
4 participants