DOC: pandas.read_csv #48487

ryanbaekr · 2022-09-09T18:33:05Z

Pandas version checks

I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Documentation problem

There are certain values that read_csv interprets as booleans even if the true_values and false_values parameters are set to None or an empty list. In my opinion, there are two oversights in the documentation here.

There is nothing in the documentation that lists which string values are interpreted as booleans by default. I found true, True, TRUE, false, False, and FALSE, but there could be more, and it would be nice to know.
There is no direct way to force read_csv to treat a column as strings when every value in it could be interpreted as a boolean.
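The behavior described above can be reproduced with a short script. This is a minimal sketch, assuming the default parser engine; the sample data and variable names are made up for illustration.

```python
import io

import pandas as pd

# Illustrative data: every value is a case variant of "true" or "false".
csv_text = "flag\ntrue\nFALSE\nTrue\n"

# Default call: the column is inferred as boolean.
default_df = pd.read_csv(io.StringIO(csv_text))

# Passing empty lists does not switch the inference off.
empty_lists_df = pd.read_csv(
    io.StringIO(csv_text), true_values=[], false_values=[]
)

print(default_df["flag"].dtype)
print(empty_lists_df["flag"].dtype)
```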

Suggested fix for documentation

The values that are always interpreted as true and false should be documented.
I managed to get around the issue by reading all columns as strings and then applying the to_numeric function to each column. If that is the only way to prevent those values from being interpreted as booleans, I think it should be noted on the page. I have an example of what I'm talking about here.
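The workaround described above might look roughly like the following. This is a sketch, not the poster's linked example: the helper name `read_csv_keep_bool_strings` is hypothetical, and it assumes that a column `pd.to_numeric` cannot parse should simply stay as strings.

```python
import io

import pandas as pd


def read_csv_keep_bool_strings(source):
    """Hypothetical helper: read everything as text, then restore numbers."""
    # dtype=str stops read_csv from inferring any types, booleans included.
    raw = pd.read_csv(source, dtype=str)
    converted = {}
    for name in raw.columns:
        try:
            # Numeric-looking columns are converted back to numbers.
            converted[name] = pd.to_numeric(raw[name])
        except (ValueError, TypeError):
            # Anything that is not numeric (e.g. "True") stays as text.
            converted[name] = raw[name]
    return pd.DataFrame(converted)


df = read_csv_keep_bool_strings(io.StringIO("flag,count\nTrue,1\nFalse,2\n"))
print(df["flag"].tolist())
print(df["count"].tolist())
```

If only one column is affected, passing a per-column mapping such as `dtype={"flag": str}` to read_csv may be enough, leaving the other columns to normal inference.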


AlexKirko · 2022-09-10T20:27:11Z

Confirmed. Looks like true and false (case-insensitive) are treated as True and False by default.

Should we just add to the docs that the default values for true_values and false_values are None, but any case-insensitive variant of true and false is treated as boolean in addition to that?

ryanbaekr · 2022-09-10T22:56:49Z

> Confirmed. Looks like true and false (case-insensitive) are treated as True and False by default.
>
> Should we just add to the docs that the default values for true_values and false_values are None, but any case-insensitive variant of true and false is treated as boolean in addition to that?

I think that's a good start. I guess I'd like a definitive list. I tried to trace the source code but got to a .so file and went no further. For all we know, there could be more than just true and false that are treated this way.

I also think the second part of my request is important: documenting how to prevent this behavior.

AlexKirko · 2022-09-11T06:45:08Z

Weird that you got stuck on .so. By default we use the python CSV engine, I think, which is coded here.
Do you mind looking there to check what might be causing this behavior? If not, no sweat, I'll take a deeper look tomorrow and try to figure it out.

ryanbaekr · 2022-09-11T19:27:36Z

> Weird that you got stuck on .so. By default we use the python CSV engine, I think, which is coded here. Do you mind looking there to check what might be causing this behavior? If not, no sweat, I'll take a deeper look tomorrow and try to figure it out.

Not sure if this is right, but I was looking here, which eventually calls this, where I see libops.maybe_convert_bool. Sorry if I'm going down completely the wrong path here, but that's what I remember looking at when I first ran into this a while back.

AlexKirko · 2022-09-12T06:48:10Z

Got it. The code you were looking for is in _libs/ops.pyx, and it's written in Cython. Here is the link and the relevant bit:

    # the defaults
    true_vals = {'True', 'TRUE', 'true'}
    false_vals = {'False', 'FALSE', 'false'}

    if true_values is not None:
        true_vals = true_vals | set(true_values)

    if false_values is not None:
        false_vals = false_vals | set(false_values)
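
The snippet above explains why an empty list has no effect: user-supplied values are unioned with the defaults rather than replacing them. A plain-Python sketch of the same logic follows; the function name `effective_bool_values` and the module-level constants are hypothetical and not part of pandas.

```python
# Defaults copied from the quoted Cython snippet.
DEFAULT_TRUE = {"True", "TRUE", "true"}
DEFAULT_FALSE = {"False", "FALSE", "false"}


def effective_bool_values(true_values=None, false_values=None):
    """Hypothetical helper mirroring the set union in the quoted snippet."""
    true_vals = set(DEFAULT_TRUE)
    false_vals = set(DEFAULT_FALSE)
    if true_values is not None:
        true_vals |= set(true_values)
    if false_values is not None:
        false_vals |= set(false_values)
    return true_vals, false_vals


# Empty lists leave the defaults in place.
print(effective_bool_values([], []))
# Extra values are added on top of the defaults.
print(effective_bool_values(["yes"], ["no"]))
```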

AlexKirko · 2022-09-12T06:53:01Z

My experiments also show that any case variant, such as TrUe or FAlse, also gets treated as one of these, so there is probably a lower or upper call somewhere, though I can't find it at the moment.

@ryanbaekr, would you like to make a PR with the docs modification?
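An experiment along these lines can be sketched as follows, assuming the default parser engine. The sample data is illustrative, and the second call shows the documented use of true_values and false_values to add further spellings.

```python
import io

import pandas as pd

# Mixed-case spellings that are not in the three-item default sets.
mixed_case = pd.read_csv(io.StringIO("flag\nTrUe\nFAlse\n"))
print(mixed_case["flag"].dtype)

# Additional spellings can be supplied explicitly.
yes_no = pd.read_csv(
    io.StringIO("flag\nyes\nno\n"), true_values=["yes"], false_values=["no"]
)
print(yes_no["flag"].dtype)
```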

AlexKirko · 2022-09-13T07:01:01Z

No worries, I'll do it then.

ryanbaekr added the Docs and Needs Triage labels Sep 9, 2022

AlexKirko added the IO CSV label and removed the Needs Triage label Sep 10, 2022

AlexKirko self-assigned this Sep 13, 2022

AlexKirko mentioned this issue Sep 17, 2022 in "DOC: include defaults into docs for true_values and false_values #48597" (merged)

mroeschke closed this as completed in #48597 Sep 23, 2022
