MultiIndex.__contains__ try-catch exception types too narrow #24570
Comments
can you provide an example that doesn't use an external package |
Sure. Here's a reduced example that uses only pandas and numpy. The exception is the same.
import pandas as pd
import numpy as np
tx = pd.timedelta_range('09:30:00', '16:00:00', freq='30 min')
# ys is a Series with a MultiIndex whose first level is Timedelta
ys = pd.DataFrame(np.random.randn(len(tx), 2), index=tx, columns=('c0', 'c1')).set_index('c0', append=True)['c1']
# this is what sklearn.metrics would do
hasattr(ys, 'fit')
Output
In [4]: hasattr(ys, 'fit')
ValueError Traceback (most recent call last)
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_hold_identifiers_and_holds_name(self, name)
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in __contains__(self, key)
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _get_level_indexer(self, key, level, indexer)
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.__new__()
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()
ValueError: unit abbreviation w/o a number |
can you edit the example at the top with this |
yeah this should be handled here:
where if the conversion fails (e.g. put a try/except ValueError) around this. PRs welcome. |
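A minimal sketch of the idea in the comment above, written in plain Python rather than against the pandas source. The names to_timedelta_key, get_loc, and contains are hypothetical stand-ins (they are not pandas functions), and float() plays the role of the string-to-Timedelta conversion that can raise ValueError. The point is only that translating the conversion failure into a KeyError lets the existing (LookupError, TypeError) handler treat the key as "not present".

```python
from datetime import timedelta


def to_timedelta_key(key):
    """Hypothetical converter standing in for Timedelta(key)."""
    if isinstance(key, timedelta):
        return key
    # float("fit") raises ValueError, much like an unparsable timedelta string
    return timedelta(seconds=float(key))


def get_loc(positions, key):
    """Look up a key, reporting an unconvertible key as a missing key."""
    try:
        converted = to_timedelta_key(key)
    except ValueError:
        # Conversion failed: the key cannot be in the index, so raise KeyError
        raise KeyError(key) from None
    return positions[converted]


def contains(positions, key):
    """Membership test that only handles lookup-style errors."""
    try:
        get_loc(positions, key)
        return True
    except (LookupError, TypeError):
        return False


# Toy "index": maps each timedelta label to its position
positions = {timedelta(hours=9, minutes=30): 0, timedelta(hours=10): 1}

fit_present = contains(positions, "fit")
ten_present = contains(positions, timedelta(hours=10))
seconds_present = contains(positions, 34200)  # 9 h 30 min expressed in seconds
```

With this arrangement the membership test does not need to know about ValueError at all, because the lookup function never lets one escape for a key it cannot convert.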
OK, I updated the example and output in the first comment. Regarding the fix, why not try-catch Exception in the _can_hold_identifiers_and_holds_name function? This example is caused by Timedelta, but it could just as well be caused by Timestamp, or any other type that fails to be constructed from a str. |
yes u can try that see if anything breaks |
The patch is very simple, and I could easily create a pull request. However, the tests (python setup.py test) failed right away, and not because of my changes. Can anyone tell me where the problem might be? Thanks.
run test output
(pandas-dev) [02:19:47|xwang0-el7 pandas-xw]$ python setup.py test
I forked from master; here's the top of my 'git log --name-status' output:
git log --name-status
commit 6f9ab14
M pandas/core/indexes/base.py
commit f57bd72
M pandas/io/formats/format.py |
I was about to raise an issue for a problem with
chart_data.pivot_table(
    index='age',
    columns='priority',
    values='key',
    aggfunc='count'
)
Here Before upgrading to Pandas 0.24.1 this worked. It now throws:
If I change the |
@optilude can you try to give a small reproducible example? (eg by creating a dummy dataframe with the specific dtypes you have that shows the problem) |
Sure:
import pandas as pd
df = pd.DataFrame({
    'key': ['A', 'B', 'C'],
    'age': [pd.Timedelta('1 days'), pd.Timedelta('2 days'), pd.Timedelta('2 days')],
    'priority': ['Low', 'Low', 'High']})
df.pivot_table(index='age', columns='priority', values='key', aggfunc='count')
Throws:
|
OK, thanks. I can confirm the regression. The underlying cause looks similar to that of #25087, another reported regression. |
This is still broken in 0.24.2, with the same stack trace. |
Code Sample
Problem description
sklearn tries to handle numpy array-like arguments, including pd.Series and pd.DataFrame. While validating the argument, sklearn calls hasattr(ys, 'fit'), which calls into the _can_hold_identifiers_and_holds_name function of the MultiIndex object. Line 2111 of pandas/core/indexes/base.py checks whether 'fit' is in the MultiIndex, which in turn calls into pandas/core/indexes/multi.py line 539 for 'self.get_loc(key)'. The get_loc call is wrapped in a try-catch, but only for LookupError and TypeError, whereas here a ValueError is thrown.
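The call chain described above can be illustrated with a small toy model in plain Python. ToyLevelIndex and ToySeries are hypothetical classes invented for this sketch; they only mimic the shape of the problem and are not pandas code. The sketch relies on one fact about Python 3: hasattr only swallows AttributeError, so any other exception raised inside __getattr__ reaches the caller.

```python
class ToyLevelIndex:
    """Hypothetical stand-in for an index level that parses string keys."""

    def __init__(self, values):
        self._positions = {value: i for i, value in enumerate(values)}

    def get_loc(self, key):
        if isinstance(key, str):
            # Mimic a string-to-Timedelta parse that rejects the key
            raise ValueError("unit abbreviation w/o a number")
        return self._positions[key]  # KeyError if the key is absent

    def __contains__(self, key):
        try:
            self.get_loc(key)
            return True
        except (LookupError, TypeError):
            # ValueError is not listed here, so it propagates to the caller
            return False


class ToySeries:
    """Hypothetical container that resolves unknown attributes via its index."""

    def __init__(self, index):
        self._index = index

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        if name in self._index:
            return self._index.get_loc(name)
        raise AttributeError(name)


def probe(obj, name):
    """Run hasattr and report a leaked ValueError instead of crashing."""
    try:
        return hasattr(obj, name)
    except ValueError as exc:
        return f"ValueError: {exc}"


series = ToySeries(ToyLevelIndex([10, 20]))
result = probe(series, "fit")          # the ValueError escapes hasattr
missing_int = 30 in series._index      # KeyError is caught, so this is False
```

A missing integer key is handled quietly because KeyError is a LookupError, while the string key surfaces as a ValueError all the way out of hasattr.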
The old version 0.22.0 works fine.
For now I added a try-catch to the _can_hold_identifiers_and_holds_name function that returns False. This works around the problem. It looks like this:
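The original snippet is not preserved in this copy of the thread. As a rough sketch of the kind of workaround described, assuming a standalone function rather than the real pandas method, the membership check can be wrapped so that any exception is treated as "name not held". StrictIndex is a hypothetical class used only to trigger the failure; the function name mirrors the pandas one but this is not the pandas implementation.

```python
class StrictIndex:
    """Hypothetical index whose membership test can raise on odd keys."""

    def __init__(self, labels):
        self._labels = set(labels)

    def __contains__(self, key):
        if isinstance(key, str) and not key[:1].isdigit():
            # Mimic a key parse that fails for strings without a leading number
            raise ValueError("unit abbreviation w/o a number")
        return key in self._labels


def can_hold_identifiers_and_holds_name(index, name):
    """Return True only if the membership check succeeds and finds the name."""
    try:
        return name in index
    except Exception:
        # Any failure while checking membership is treated as "not present"
        return False


index = StrictIndex(["1 days", "2 days"])
holds_fit = can_hold_identifiers_and_holds_name(index, "fit")
holds_one_day = can_hold_identifiers_and_holds_name(index, "1 days")
```

Catching Exception this broadly is a trade-off: it covers Timedelta, Timestamp, and any other type that fails to parse a str, but it can also hide unrelated bugs inside the membership check.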
Expected Output
In [4]: hasattr(ys, 'fit')
ValueError Traceback (most recent call last)
in
----> 1 hasattr(ys, 'fit')
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
4372 return object.__getattribute__(self, name)
4373 else:
-> 4374 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4375 return self[name]
4376 return object.__getattribute__(self, name)
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_hold_identifiers_and_holds_name(self, name)
2109 """
2110 if self.is_object() or self.is_categorical():
-> 2111 return name in self
2112 return False
2113
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in __contains__(self, key)
547 hash(key)
548 try:
--> 549 self.get_loc(key)
550 return True
551 except (LookupError, TypeError):
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
2235
2236 if not isinstance(key, tuple):
-> 2237 loc = self._get_level_indexer(key, level=0)
2238
2239 # _get_level_indexer returns an empty slice if the key has
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _get_level_indexer(self, key, level, indexer)
2494 else:
2495
-> 2496 loc = level_index.get_loc(key)
2497 if isinstance(loc, slice):
2498 return loc
~/app/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
783
784 if _is_convertible_to_td(key):
--> 785 key = Timedelta(key)
786 return Index.get_loc(self, key, method, tolerance)
787
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.__new__()
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()
ValueError: unit abbreviation w/o a number
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
and sklearn: 0.20.1