ENH: try to preserve the dtype on combine_first for the case where the two DataFrame objects have the same columns #39051

danielhrisca · 2021-01-09T08:27:59Z

…e two DataFrame objects have the same columns

closes combine_first not retaining dtypes #7509
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

pep8speaks · 2021-01-09T08:28:05Z

Hello @danielhrisca! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-01-15 14:44:28 UTC

jreback

pls always add tests first.

danielhrisca · 2021-01-11T08:46:47Z

@jreback I've changed the implementation and added tests

jreback · 2021-01-11T14:13:01Z

pandas/core/frame.py

@@ -6438,6 +6440,11 @@ def combine_first(self, other: DataFrame) -> DataFrame:
        other : DataFrame
            Provided DataFrame to use to fill null values.

+        preserve_dtypes : bool, default False


we do not want to add a flag for this. simply change it.

Please add some examples for this behavior

I was thinking that maybe it is a good idea to keep the current behavior as default, and provide the new behavior as an option

no, it's better to just fix this; you can add a whatsnew note in 1.3. everywhere else we cast to common dtypes, this should be no different.

I'm running into some failed tests that exceed my understanding of the lib. Is it expected that if a Series is constructed from a list of None, then the result of combining it with some other Series should have the latter's dtype (coercing to the respective NaN value)?

list of None -> object, so combined -> object
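For reference, a minimal sketch of the inference rule behind that answer, assuming only public pandas constructors: a list holding nothing but `None` is inferred as `object`, whereas a list of `np.nan` is inferred as `float64`, so the NA value chosen in a test changes the starting dtype. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A list containing only None has no numeric information, so pandas infers object.
all_none = pd.Series([None, None])

# A list containing only np.nan is a list of floats, so pandas infers float64.
all_nan = pd.Series([np.nan, np.nan])

print(all_none.dtype)  # object
print(all_nan.dtype)   # float64
```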

jreback

this will need a whatsnew note in 1.3 under breaking changes, IOW it will need a subsection showing the before and after
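A sketch of what such a before/after subsection could demonstrate, assuming pandas 1.3 or later. `df1` reuses the constructor from the whatsnew diff in this PR; `df2` is an illustrative frame chosen so that column "B" is shared and ends up with no missing values.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=[0, 1, 2])
# Illustrative second frame (an assumption, not taken from the visible diff).
df2 = pd.DataFrame({"B": [4, 5, 6], "C": [1, 2, 3]}, index=[2, 3, 4])

combined = df1.combine_first(df2)

# "A" and "C" are missing on some rows of the combined index, so they hold NaN
# and become float64. "B" is fully populated; with the change discussed here it
# is cast back to the common dtype of the two inputs instead of staying float64.
print(combined)
print(combined.dtypes)
```

Before the change, all three columns would be reported as `float64`.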

jreback · 2021-01-11T22:49:58Z

pandas/core/frame.py

+                # if the column has different dtype in the
+                # DataFrame objects then add the common dtype
+                # to the columns dtype conversion dict
+                if combined.dtypes[col] != self.dtypes[col]:


use is_dtype_equal here
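A small sketch of the suggested helper, using the public import path `pandas.api.types.is_dtype_equal`. It accepts dtype objects as well as dtype strings, which is one reason it is preferred over a bare `!=` comparison. The Series below are illustrative.

```python
import pandas as pd
from pandas.api.types import is_dtype_equal

int_col = pd.Series([1, 2, 3], dtype="int64")
float_col = pd.Series([1.0, 2.0, 3.0], dtype="float64")

# A dtype object can be compared against a dtype string.
same = is_dtype_equal(int_col.dtype, "int64")

# Two different dtypes compare as not equal.
different = is_dtype_equal(int_col.dtype, float_col.dtype)

print(same, different)
```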

jreback · 2021-01-11T22:51:09Z

pandas/core/frame.py

+                    dtypes[col] = find_common_type(
+                        [self.dtypes[col], other.dtypes[col]]
+                    )
+            except TypeError:


we do not want to do multiple try/excepts ever as these tend to hide errors.
in fact you should not need one here at all. find_common_type will always succeed (it could of course be object).
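To illustrate the point about the fallback, a hedged sketch using the internal helper `pandas.core.dtypes.cast.find_common_type` (an internal import path, so it may move between pandas versions): when no narrower dtype fits both inputs, the result is `object` rather than an exception.

```python
import numpy as np
from pandas.core.dtypes.cast import find_common_type  # internal pandas helper

# Two numeric dtypes resolve to the wider numeric dtype.
numeric = find_common_type([np.dtype("int64"), np.dtype("float64")])

# A numeric dtype and object have nothing narrower in common, so object is returned.
mixed = find_common_type([np.dtype("int64"), np.dtype("object")])

print(numeric)  # float64
print(mixed)    # object
```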

jreback · 2021-01-11T22:51:22Z

pandas/core/frame.py

+
+        dtypes = {}
+
+        for col in self.columns.intersection(other.columns):


this should be a simple list-comprehension

jreback · 2021-01-11T22:51:56Z

pandas/tests/frame/methods/test_combine_first.py

    frame = DataFrame([[na_value, na_value]], columns=["a", "b"])
    other = DataFrame([[scalar1, scalar2]], columns=["b", "c"])

+    try:


don't use try/except in tests. explicitly specify the expected
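A minimal sketch of a test written in that style, assuming pandas 1.3 or later. The test name and frames are hypothetical; the expected frame is spelled out and compared with the public `pandas.testing.assert_frame_equal`, which also checks dtypes by default.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def test_combine_first_keeps_int_dtype():
    # Hypothetical test: both frames share column "a" and have disjoint indexes,
    # so the combined column has no missing values.
    left = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
    right = pd.DataFrame({"a": [3, 4]}, index=[2, 3])

    # The expected result is stated explicitly, including its int64 dtype.
    expected = pd.DataFrame({"a": [1, 2, 3, 4]}, index=[0, 1, 2, 3])

    result = left.combine_first(right)
    assert_frame_equal(result, expected)
```

If the dtype regressed to `float64`, the comparison would fail with a dtype mismatch instead of being swallowed by an `except` clause.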

jreback · 2021-01-11T22:52:12Z

pandas/tests/frame/methods/test_combine_first.py

+
+
+def test_combine_preserve_dtypes():
+    a = Series(["a", "b"], index=range(2))


add the issue number as a comment

danielhrisca · 2021-01-12T14:47:50Z

I think all requests have been met. The failed test does not seem to be related to the changes in this PR; it most probably appeared after I re-synced with the master branch.

jreback · 2021-01-13T01:04:38Z

doc/source/whatsnew/v1.3.0.rst

+
+.. ipython:: python
+
+   df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=[0, 1, 2])


you can do the construction in an ipython block before both of these as it's the same

only show the combined (and dtypes) in the before / after

jreback · 2021-01-13T01:05:35Z

pandas/core/frame.py

+            col: find_common_type([self.dtypes[col], other.dtypes[col]])
+            for col in self.columns.intersection(other.columns)
+            if not is_dtype_equal(combined.dtypes[col], self.dtypes[col])
+            and not is_dtype_equal(


why do u have this complicated condition?

The first `if` checks for columns that are candidates for preserving the dtype, and the second `if` avoids extra computation when the common dtype is the same as the combined dtype

ok no need for that, as astype will handle this if you pass copy=False, though i actually don't think this really matters, i would just remove the redundant check
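A rough sketch of the simplified shape this thread converges on, written as a standalone function so it can be run outside pandas. The function name `restore_common_dtypes` and its arguments are hypothetical, and `find_common_type` is imported from an internal pandas module; this is not the merged implementation, only an illustration of the single-condition comprehension followed by `astype`.

```python
import pandas as pd
from pandas.api.types import is_dtype_equal
from pandas.core.dtypes.cast import find_common_type  # internal pandas helper


def restore_common_dtypes(combined, left, right):
    """Cast shared columns of `combined` to the common dtype of `left` and `right`."""
    # Only columns present in both inputs are candidates, and only those whose
    # dtype changed relative to `left` need to be cast.
    dtypes = {
        col: find_common_type([left.dtypes[col], right.dtypes[col]])
        for col in left.columns.intersection(right.columns)
        if not is_dtype_equal(combined.dtypes[col], left.dtypes[col])
    }
    return combined.astype(dtypes) if dtypes else combined


left = pd.DataFrame({"B": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"B": [3, 4]}, index=[2, 3])
# Stand-in for an intermediate result that was upcast to float64 during alignment.
upcast = pd.DataFrame({"B": [1.0, 2.0, 3.0, 4.0]}, index=[0, 1, 2, 3])

restored = restore_common_dtypes(upcast, left, right)
print(restored.dtypes)
```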

jreback

lgtm, doc comments. ping on green.

jreback · 2021-01-13T14:01:22Z

doc/source/whatsnew/v1.3.0.rst

+
+.. code-block:: ipython
+
+   In [1]: (combined, "---------------", combined.dtypes)


don't need this extra printing, you can just show combined.dtypes

jreback · 2021-01-13T14:01:39Z

doc/source/whatsnew/v1.3.0.rst

+
+.. ipython:: python
+
+   In [1]: (combined, "---------------", combined.dtypes)


don't add ipython prompts, just the code

doc/source/whatsnew/v1.3.0.rst

danielhrisca · 2021-01-13T16:16:52Z

@jreback I've done the changes but again there are failed builds for which I can see no connection with this PR

jreback

lgtm. @jbrockmendel if any comments.

jbrockmendel · 2021-01-15T02:16:32Z

pandas/tests/frame/methods/test_combine_first.py

+
+    c = Series(["a", "b"], index=range(5, 7))
+    b = Series(range(-1, 1), index=range(5, 7))
+    g = DataFrame({"B": b, "C": c})


nitpick: can you avoid 1-letter variable names? makes it harder to grep for things

jbrockmendel · 2021-01-15T02:17:28Z

minor nitpick, otherwise looks good

jreback · 2021-01-15T16:30:28Z

thanks @danielhrisca very nice

…e two DataFrame objects have the same columns (pandas-dev#39051)

danielhrisca mentioned this pull request Jan 9, 2021

combine_first not retaining dtypes #7509

Closed

jreback requested changes Jan 9, 2021

jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Jan 9, 2021

ENH: add argument to preserve dtypes of common columns in combine_first

0c1d126

danielhrisca force-pushed the keep_dtypes_on_combine_first branch from 7626b8b to 0c1d126 Compare January 11, 2021 08:25

danielhrisca added 2 commits January 11, 2021 10:38

fix black code style

1a5fe0f

fix misspelled word in docstring

24f6ffc

jreback requested changes Jan 11, 2021

danielhrisca added 2 commits January 11, 2021 20:24

update tests and remove preserve_dtypes argument from combine_first

d0f9ed3

fix isort and flake8 errors

198eaa4

jreback requested changes Jan 11, 2021

danielhrisca added 4 commits January 12, 2021 09:05

updates according to erview

f209590

update whatsnew with example code

5e252d0

wrong header in documentation

7c67e3c

fix black code style

47d0911

jreback requested changes Jan 13, 2021

danielhrisca added 2 commits January 13, 2021 08:03

update whatsnew with example code as requested in the review

1b5691c

remove redundant check

ba49f9c

jreback requested changes Jan 13, 2021

further fix and polish the whatsnew entry

a2d4e38

jreback added this to the 1.3 milestone Jan 14, 2021

jreback approved these changes Jan 14, 2021

jbrockmendel reviewed Jan 15, 2021

fix single letter variable names

f937928

jreback merged commit 963cf2b into pandas-dev:master Jan 15, 2021

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

ENH: try to preserve the dtype on combine_first for the case where th…

96188d6

…e two DataFrame objects have the same columns (pandas-dev#39051)

jessestone7 mentioned this pull request Mar 3, 2023

BUG: preserve the dtype on Series.combine_first #51764

Closed

3 tasks

jessestone7 mentioned this pull request Nov 1, 2024

BUG: Series.combine_first loss of precision #60128

Open

3 tasks
