Skip to content

Join several dataframes with MultiIndex containing dates raises ValueError #33692

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
neshitov opened this issue Apr 21, 2020 · 8 comments · Fixed by #46660
Closed
2 of 3 tasks

Join several dataframes with MultiIndex containing dates raises ValueError #33692

neshitov opened this issue Apr 21, 2020 · 8 comments · Fixed by #46660
Assignees
Labels
datetime.date stdlib datetime.date support Datetime Datetime data dtype good first issue Needs Tests Unit test(s) needed to prevent regressions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@neshitov
Copy link

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

When I try to join 3 dataframes with multiindices containing dates I got ValueError:

import pandas as pd

date = pd.Timestamp(2000, 1, 1).date()

df1 = pd.DataFrame({'index_0': 0, 'date': date, 'col1': [0]})
df2 = pd.DataFrame({'index_0': 0, 'date': date, 'col2': [0]})
df1 = df1.groupby(by=['index_0', 'date']).first()
df2 = df2.groupby(by=['index_0', 'date']).first()

multi_index = pd.MultiIndex.from_tuples([(0, date)], names=['index_0', 'date'])
df3 = pd.DataFrame(index=multi_index, columns=['col3'], data=[0])

result = df1.join([df2, df3])

Traceback

ValueError                                Traceback (most recent call last)
<ipython-input-555-5947f20f11fd> in <module>
      9 df3 = pd.DataFrame(index=multi_index, columns=['col3'], data=[0])
     10 
---> 11 result = df1.join([df2, df3])
     12 

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
   7207         """
   7208         return self._join_compat(
-> 7209             other, on=on, how=how, lsuffix=lsuffix, rsuffix=rsuffix, sort=sort
   7210         )
   7211 

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   7248                         frames, axis=1, join="outer", verify_integrity=True, sort=sort
   7249                     )
-> 7250                     return res.reindex(self.index, copy=False)
   7251                 else:
   7252                     return concat(

/opt/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    225         @wraps(func)
    226         def wrapper(*args, **kwargs) -> Callable[..., Any]:
--> 227             return func(*args, **kwargs)
    228 
    229         kind = inspect.Parameter.POSITIONAL_OR_KEYWORD

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in reindex(self, *args, **kwargs)
   3854         kwargs.pop("axis", None)
   3855         kwargs.pop("labels", None)
-> 3856         return super().reindex(**kwargs)
   3857 
   3858     def drop(

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in reindex(self, *args, **kwargs)
   4542         # perform the reindex on the axes
   4543         return self._reindex_axes(
-> 4544             axes, level, limit, tolerance, method, fill_value, copy
   4545         ).__finalize__(self)
   4546 

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   3742         if index is not None:
   3743             frame = frame._reindex_index(
-> 3744                 index, method, copy, level, fill_value, limit, tolerance
   3745             )
   3746 

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _reindex_index(self, new_index, method, copy, level, fill_value, limit, tolerance)
   3758     ):
   3759         new_index, indexer = self.index.reindex(
-> 3760             new_index, method=method, level=level, limit=limit, tolerance=tolerance
   3761         )
   3762         return self._reindex_with_indexers(

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in reindex(self, target, method, level, limit, tolerance)
   2414                     )
   2415                 else:
-> 2416                     raise ValueError("cannot handle a non-unique multi-index!")
   2417 
   2418         if not isinstance(target, MultiIndex):

ValueError: cannot handle a non-unique multi-index!

Problem description

The issue seems to be that groupby produces dataframes df1 and df2 having multiindex with level 'date' consisting of 'dateteime.date' objects, while in df3 we get multiindex with level 'date' consisting of 'pandas._libs.tslibs.timestamps.Timestamp' objects. This does not lead to an error in subsequent joins:

result = df1.join(df2)
result = result.join(df3)

or when dealing with single level indices, but creates a problem when joining 3 dataframes at once.

Expected Output

                    col1  col2  col3
index_0 date                        
0       2000-01-01     0     0     0

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-124-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 45.2.0.post20200210
Cython : 0.29.12
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : 0.3.0
lxml.etree : 4.3.4
matplotlib : 3.1.0
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.0.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.5
tables : 3.5.2
tabulate : 0.8.3
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8
numba : 0.45.0

@neshitov neshitov added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 21, 2020
@TomAugspurger
Copy link
Contributor

Thanks for the report @neshitov. Something strange is going on here. Are you interested in investigating more? Can you start by looking at the behavior of concat with date and Timestamp objects?

Note that we're hitting this date vs. datetime vs. Timestamp issue in several places right now (#35830 and related issues).

@TomAugspurger TomAugspurger added Datetime Datetime data dtype Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 4, 2020
@phofl
Copy link
Member

phofl commented Sep 12, 2020

@TomAugspurger

I looked a bit into it. The error is in here:

fast_unique_multiple in https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx

Gets a datetime.date and a Timestamp as input. Are they supposed to be equal?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 14, 2020 via email

@jbrockmendel
Copy link
Member

#36131 deprecates date vs Timestamp comparisons to match the stdlib datetime.

@jbrockmendel jbrockmendel added the datetime.date stdlib datetime.date support label Jun 5, 2021
@mroeschke
Copy link
Member

It appears this works in master now. Could use a test

In [2]: import pandas as pd
   ...:
   ...: date = pd.Timestamp(2000, 1, 1).date()
   ...:
   ...: df1 = pd.DataFrame({'index_0': 0, 'date': date, 'col1': [0]})
   ...: df2 = pd.DataFrame({'index_0': 0, 'date': date, 'col2': [0]})
   ...: df1 = df1.groupby(by=['index_0', 'date']).first()
   ...: df2 = df2.groupby(by=['index_0', 'date']).first()
   ...:
   ...: multi_index = pd.MultiIndex.from_tuples([(0, date)], names=['index_0', '
   ...: date'])
   ...: df3 = pd.DataFrame(index=multi_index, columns=['col3'], data=[0])
   ...:
   ...: result = df1.join([df2, df3])

In [3]: result
Out[3]:
                    col1  col2  col3
index_0 date
0       2000-01-01     0     0     0

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Jul 31, 2021
@DAKSHA2001
Copy link

Give me this issue. I will work on it

@jreback jreback added this to the Contributions Welcome milestone Nov 28, 2021
@anchal0997
Copy link

Can I have a look at this issue?

@dospix
Copy link
Contributor

dospix commented Apr 5, 2022

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datetime.date stdlib datetime.date support Datetime Datetime data dtype good first issue Needs Tests Unit test(s) needed to prevent regressions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
9 participants