Skip to content

BUG: pd.DataFrame.from_records() raises a KeyError if passed a string index and an empty iterable #47319

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.5.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -776,7 +776,7 @@ Conversion
Strings
^^^^^^^
- Bug in :meth:`str.startswith` and :meth:`str.endswith` when using other series as parameter _pat_. Now raises ``TypeError`` (:issue:`3485`)
-
- Bug in :meth:`DataFrame.from_records` raises a ``KeyError`` if passed a string index and an empty iterable (:isue:`#47285`)

Interval
^^^^^^^^
Expand Down
7 changes: 7 additions & 0 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -2291,6 +2291,13 @@ def maybe_reorder(
arr_columns = ensure_index(arr_columns)
if columns is None:
columns = arr_columns

if index is not None:
if isinstance(index, str) or not hasattr(index, "__iter__"):
if index not in columns:
result_index = [index]
index = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This swallows KeyErrors when the iterable is populated but the column is not present.

data = [
    {"col_1": 3, "col_2": "a"},
    {"col_1": 2, "col_2": "b"},
    {"col_1": 1, "col_2": "c"},
    {"col_1": 0, "col_2": "d"},
]
result = pd.DataFrame.from_records(data, index="col_14")
print(result)

This should still raise, I think.

cc @simonjayhawkins

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that maybe is another bug, see #47285 (comment) where we get NaN in the index for a missing key.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On main, this raises a KeyError right now, which is fine I think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. if we do have data (not the empty list case) we should probably raise on missing keys

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I think so too.

Copy link
Author

@Kelvingandhi Kelvingandhi Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. if we do have data (not the empty list case) we should probably raise on missing keys

Yeah agree with you.
Furthermore, when we have EMPTY list and we pass some list index then currently on Main, it returns that list index as output with empty list. We want same behavior for string index as well for EMPTY list instead of raising KeyError.

Copy link
Author

@Kelvingandhi Kelvingandhi Jun 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This swallows KeyErrors when the iterable is populated but the column is not present.

data = [
    {"col_1": 3, "col_2": "a"},
    {"col_1": 2, "col_2": "b"},
    {"col_1": 1, "col_2": "c"},
    {"col_1": 0, "col_2": "d"},
]
result = pd.DataFrame.from_records(data, index="col_14")
print(result)

This should still raise, I think.

cc @simonjayhawkins

As @simonjayhawkins mentioned, and I also tested it, it generates ValueError: Length of values (4) does not match length of index (1) when we've iterable populated. And if it should raise KeyError then that is another issue.

Here, in my PR, I tried to address actual issue when we have empty list ([]) and we pass as string index, It gives KeyError. But Ideally in output we want Empty list along with passed string as index.

We want Exact same behaviour when we have EMPTY list and we pass some list index then currently on Main, it returns that list index as output with empty list.

Code change I put here, it will swallow KeyError only when we have EMPTY list passed with string index. Otherwise it won't get executed.

But please correct me if I misunderstood any of your concerns in discussion. Thanks.

Screenshot from 2022-06-19 18-06-26

Screenshot from 2022-06-19 18-12-29

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My code example should still raise the same error. We do not want to change this behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to fix the bug without breaking anything else.

Please also add my example as a test with the KeyError as expected

Copy link
Author

@Kelvingandhi Kelvingandhi Jun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohk. I apologize, earlier I misunderstood it. I fixed this issue and now it is behaving as expected (exactly what currently in main this raises a KeyError). I have added your example as a test as well. Thank you.


else:
arrays, arr_columns, result_index = maybe_reorder(
arrays, arr_columns, columns, index
Expand Down
7 changes: 7 additions & 0 deletions pandas/tests/io/formats/test_info.py
Original file line number Diff line number Diff line change
Expand Up @@ -491,3 +491,10 @@ def test_info_int_columns():
"""
)
assert result == expected


def test_info_from_rec():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add the test we discussed where the behavior changed initially?

Also pre-commit is failing

Copy link
Author

@Kelvingandhi Kelvingandhi Jun 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After our discussion, I fixed that issue and also added that test in test_info.py (please check lines 502-509) plus in that test case, I made correction as well suggested by @simonjayhawkins (#47319 (comment)).

Furthermore, I already have initial test case included as well (lines 498-500).

I'm unsure what else is required.

On pre-commit failure, I'm beginner here and not sure what causing it to fail (when last time it passed for earlier commits). Any help to get this resolved would be appreciated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, missed that. Could you split that into two tests? Since technically they are testing different things.

You can run pre-commit locally and check the log. This should show the errors

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I will do that. Thank you for the response.

# GH47285
df = DataFrame.from_records([], index="foo")
buf = StringIO()
df.info(buf=buf)