
BUG: AttributeError: 'BooleanArray' object has no attribute 'sum' while infer types #44079

Closed
zhangxiaoxing opened this issue Oct 18, 2021 · 4 comments · Fixed by #44442
Labels: Bug; IO CSV (read_csv, to_csv); NA - MaskedArrays (related to pd.NA and nullable extension arrays)


@zhangxiaoxing
Contributor

zhangxiaoxing commented Oct 18, 2021

Line 694 in the pandas/io/parsers/base_parser.py file,

        if issubclass(values.dtype.type, (np.number, np.bool_)):
            # error: Argument 2 to "isin" has incompatible type "List[Any]"; expected
            # "Union[Union[ExtensionArray, ndarray], Index, Series]"
            mask = algorithms.isin(values, list(na_values))  # type: ignore[arg-type]
            na_count = mask.sum()  # <-- this line raises the AttributeError
            if na_count > 0:
                if is_integer_dtype(values):
                    values = values.astype(np.float64)
                np.putmask(values, mask, np.nan)
            return values, na_count

This started raising "AttributeError: 'BooleanArray' object has no attribute 'sum'" after a global Anaconda update. Changing the line to

na_count = mask.astype('uint8').sum()

fixed the error in my case.
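The idea behind this workaround (count the True entries without relying on the mask type having its own sum method) can be sketched outside pandas internals. This is only an illustration, not the change that was eventually merged: count_true is a hypothetical helper name, and treating missing entries as False is an assumption made here for the example.

```python
import numpy as np
import pandas as pd


def count_true(mask):
    """Count True entries in a NumPy bool array or a pandas BooleanArray.

    Hypothetical helper for illustration; not part of pandas.
    """
    if isinstance(mask, pd.arrays.BooleanArray):
        # Assumption: a missing entry (pd.NA) counts as "not matched".
        mask = mask.to_numpy(dtype=bool, na_value=False)
    # After conversion, a plain NumPy sum works for both input types.
    return int(np.asarray(mask, dtype=bool).sum())


numpy_mask = np.array([True, False, True])
nullable_mask = pd.array([True, False, None], dtype="boolean")

print(count_true(numpy_mask))
print(count_true(nullable_mask))
```

Converting to a NumPy boolean array first keeps the counting step independent of which reductions a given pandas version implements on nullable arrays.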

Edit by rhshadrach:

Minimal example

from io import StringIO

import pandas as pd

df = pd.read_csv(
     StringIO("0,1"),
     names=['a', 'b'],
     index_col=['a'],
     dtype={'a': 'UInt8'},
)
@rhshadrach
Member

Can you post a reproducible example, along with the version of pandas you're using?
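For anyone checking whether their own installation is affected, a small sketch like the following reports the pandas version and whether the minimal example from the original post raises the AttributeError. The function name reproduces_issue is hypothetical, and matching on the text 'BooleanArray' in the exception message is an assumption based on the error quoted in this thread.

```python
from io import StringIO

import pandas as pd


def reproduces_issue():
    """Return True if the minimal example raises the reported AttributeError."""
    try:
        pd.read_csv(
            StringIO("0,1"),
            names=["a", "b"],
            index_col=["a"],
            dtype={"a": "UInt8"},
        )
    except AttributeError as exc:
        # Assumption: the message mentions BooleanArray, as in this report.
        return "BooleanArray" in str(exc)
    return False


print("pandas version:", pd.__version__)
print("issue reproduced:", reproduces_issue())
```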

@rhshadrach rhshadrach added the Needs Info Clarification about behavior needed to assess issue label Oct 18, 2021
@mroeschke mroeschke added the Bug label Oct 30, 2021
@LawrenceJGD

I'm having the same problem, but triggered in a different way: it happens when I use read_csv to read a CSV file with multiple index columns that use the UInt8 dtype. It also happens with UInt32; I don't know whether the other numeric dtypes are affected.

This is the code that I'm using:

>>> import pandas as pd
>>> df = pd.read_csv(
...     'geo20210715_pp.txt',
...     names=[
...         'cod_est', 'cod_mun', 'cod_par',
...         'nombre_est', 'nombre_mun', 'nombre_par'
...     ],
...     index_col=['cod_est', 'cod_mun', 'cod_par'],
...     dtype={
...         'cod_est': 'UInt8', 'cod_mun': 'UInt8', 'cod_par': 'UInt8',
...         'nombre_est': 'string', 'nombre_mun': 'string',
...         'nombre_par': 'string'
...     },
...     engine='c',
...     encoding='cp1252'
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 488, in _read
    return parser.read(nrows)
  File "/usr/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/usr/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 309, in read
    index, names = self._make_index(data, alldata, names)
  File "/usr/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py", line 416, in _make_index
    index = self._agg_index(index)
  File "/usr/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py", line 512, in _agg_index
    arr, _ = self._infer_types(arr, col_na_values | col_na_fvalues)
  File "/usr/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py", line 695, in _infer_types
    na_count = mask.sum()
AttributeError: 'BooleanArray' object has no attribute 'sum'

And this is the CSV file (don't worry, it is public information): geo20210715_pp.txt.

My workaround is this:

>>> import pandas as pd
>>> df = pd.read_csv(
...     '/home/galaxyljgd/Documentos/otros/registro_electoral_2021/geo20210715_pp.txt',
...     names=[
...         'cod_est', 'cod_mun', 'cod_par',
...         'nombre_est', 'nombre_mun', 'nombre_par'
...     ],
...     dtype={
...         'cod_est': 'UInt8', 'cod_mun': 'UInt8', 'cod_par': 'UInt8',
...         'nombre_est': 'string', 'nombre_mun': 'string',
...         'nombre_par': 'string'
...     },
...     engine='c',
...     encoding='cp1252'
... )
>>> df.set_index(['cod_est', 'cod_mun', 'cod_par'], inplace=True)
>>> df
                           nombre_est      nombre_mun                 nombre_par
cod_est cod_mun cod_par                                                         
21      1       3          EDO. ZULIA      MP. BARALT   PQ. MANUEL GUANIPA MATOS
                4          EDO. ZULIA      MP. BARALT      PQ. MARCELINO BRICEÑO
                5          EDO. ZULIA      MP. BARALT            PQ. SAN TIMOTEO
                6          EDO. ZULIA      MP. BARALT           PQ. PUEBLO NUEVO
        2       1          EDO. ZULIA  MP. SANTA RITA  PQ. PEDRO LUCAS URRIBARRI
...                               ...             ...                        ...
6       1       11       EDO. BOLIVAR      MP. CARONI             PQ. 5 DE JULIO
99      13      4            EMBAJADA          CANADA                  VANCOUVER
        97      1            EMBAJADA        JORDANIA                      AMMAN
        98      1            EMBAJADA          CHIPRE                    NICOSIA
        99      1            EMBAJADA          SERBIA                   BELGRADO

[1394 rows x 3 columns]
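The same workaround can be tried without the original file. The sketch below uses a few in-memory rows modelled on the output above (the CSV text is made up for illustration and the column list is shortened): the columns are read with nullable dtypes and no index_col, and the index is set afterwards.

```python
from io import StringIO

import pandas as pd

# Made-up sample rows, loosely modelled on the output shown above.
csv_text = (
    "21,1,EDO. ZULIA,MP. BARALT\n"
    "21,2,EDO. ZULIA,MP. SANTA RITA\n"
    "6,1,EDO. BOLIVAR,MP. CARONI\n"
)

# Step 1: read without index_col, so the index inference path is not used.
df = pd.read_csv(
    StringIO(csv_text),
    names=["cod_est", "cod_mun", "nombre_est", "nombre_mun"],
    dtype={
        "cod_est": "UInt8",
        "cod_mun": "UInt8",
        "nombre_est": "string",
        "nombre_mun": "string",
    },
)

# Step 2: build the MultiIndex after parsing.
df = df.set_index(["cod_est", "cod_mun"])
print(df)
```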

@rhshadrach
Member

Thanks @LawrenceJGD, I've updated the OP with a minimal example and confirmed this exists on master. Investigations and PRs to fix are most welcome!

@rhshadrach rhshadrach added this to the Contributions Welcome milestone Nov 12, 2021
@rhshadrach rhshadrach added IO CSV read_csv, to_csv and removed Needs Info Clarification about behavior needed to assess issue labels Nov 12, 2021
@zhangxiaoxing
Contributor Author

take
