-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: try to preserve the dtype on combine_first for the case where the two DataFrame objects have the same columns #39051
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: try to preserve the dtype on combine_first for the case where the two DataFrame objects have the same columns #39051
Conversation
Hello @danielhrisca! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-01-15 14:44:28 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls always add tests first.
7626b8b
to
0c1d126
Compare
@jreback I've changed the implementation and added tests |
pandas/core/frame.py
Outdated
@@ -6438,6 +6440,11 @@ def combine_first(self, other: DataFrame) -> DataFrame: | |||
other : DataFrame | |||
Provided DataFrame to use to fill null values. | |||
|
|||
preserve_dtypes : bool, default False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do not want to add a flag for this. simply change it.
Please add some examples for this behaivor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that maybe it is a good idea to keep the current behavior as default, and provide the new behavior as an option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no its better to just fix this, you can add a whatsnew note in 1.3. everywhere else we cast to common dtypes, this should be no different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm running into some failed tests that exceed my understand of the lib. Is it expected that if a Series is constructed from a list of None
then the result of this combined with some other Series should have the latter's dtype (coercing to the respective NaN
value)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list of None -> object, so combined -> object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will need a whatsnew note in 1.3 under breaking changes, IOW it will need a subsection showing the before and after
pandas/core/frame.py
Outdated
# if the column has different dtype in the | ||
# DataFrame objects then add the common dtype | ||
# to the columns dtype conversion dict | ||
if combined.dtypes[col] != self.dtypes[col]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use is_dtype_equal
here
pandas/core/frame.py
Outdated
dtypes[col] = find_common_type( | ||
[self.dtypes[col], other.dtypes[col]] | ||
) | ||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do not want to do multiple try/excepts ever as these tend to hide errors.
in fact you should not need here at all. find_common_type
will always succeed (it could of course be object
).
pandas/core/frame.py
Outdated
|
||
dtypes = {} | ||
|
||
for col in self.columns.intersection(other.columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be a simple list-comprehension
frame = DataFrame([[na_value, na_value]], columns=["a", "b"]) | ||
other = DataFrame([[scalar1, scalar2]], columns=["b", "c"]) | ||
|
||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use try/except in tests. explicity specify the expected
|
||
|
||
def test_combine_preserve_dtypes(): | ||
a = Series(["a", "b"], index=range(2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number as a comment
I think all request have been met. The failed test does not seem to be related to the changes done in this PR, most probably come after I re-synced with the master branch |
doc/source/whatsnew/v1.3.0.rst
Outdated
|
||
.. ipython:: python | ||
|
||
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=[0, 1, 2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can do the construction in an ioythok block before both of these as it's the same
only show the combined (and dtypes) in the before / after
pandas/core/frame.py
Outdated
col: find_common_type([self.dtypes[col], other.dtypes[col]]) | ||
for col in self.columns.intersection(other.columns) | ||
if not is_dtype_equal(combined.dtypes[col], self.dtypes[col]) | ||
and not is_dtype_equal( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do u have this complicated condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first if check for columns that are candidates for preserving the dtype, and the second if avoids extra computation if the common dtype is the same as the combined dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok no need for that, as astype will handle this if you pass copy=False
, though i actually don't think this really matters, i would just remove the redudant check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, doc comments. ping on green.
doc/source/whatsnew/v1.3.0.rst
Outdated
|
||
.. code-block:: ipython | ||
|
||
In [1]: (combined, "---------------", combined.dtypes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need this extra printing, you can just whos combine.dtypes
doc/source/whatsnew/v1.3.0.rst
Outdated
|
||
.. ipython:: python | ||
|
||
In [1]: (combined, "---------------", combined.dtypes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't add ipython prompts, just the code
@jreback I've done the changes but again there are failed builds for which I can see no connection with this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. @jbrockmendel if any comments.
|
||
c = Series(["a", "b"], index=range(5, 7)) | ||
b = Series(range(-1, 1), index=range(5, 7)) | ||
g = DataFrame({"B": b, "C": c}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: can you avoid 1-letter variable names? makes it harder to grep for things
minor nitpick, otherwise looks good |
thanks @danielhrisca very nice |
…e two DataFrame objects have the same columns (pandas-dev#39051)
…e two DataFrame objects have the same columns