BUG: merging with mixed types objects in py3 when unorderable #12814

Closed
sbuser opened this issue Apr 6, 2016 · 6 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@sbuser

sbuser commented Apr 6, 2016

Code Sample, a copy-pastable example if possible

    df1 = dfAllStudies.copy(deep=True)  #make 2 copies of the same data
    df2 = dfAllStudies.copy(deep=True)

    #take a subset of 1 of the identical frames by dropping rows with certain dates
    df2 = df2[(df2['Reviewed on'] >= np.datetime64(DateWindowStart))]  

    #attempt to merge the frames on indexes
    common = pd.merge(df1, df2, left_index=True, right_index=True)

Whatever my data looks like, a copy, minus some rows, merged (on indexes) with a copy should function without error, no?

Expected Output

a merged dataframe and not:

    \Anaconda3\lib\site-packages\pandas\tools\merge.py", line 535, in _get_join_indexers
        llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
    TypeError: type object argument after * must be a sequence, not map

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 30 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.0
setuptools: 20.2.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.0.3
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
None

@jreback
Contributor

jreback commented Apr 6, 2016

pls provide a minimal but complete copy-pastable example

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Apr 6, 2016
@sbuser
Author

sbuser commented Apr 6, 2016

Weirdly the following trivial example does work:

    import pandas as pd

    d = {'col1': 'foo', 'col2': 'bar'}
    df = pd.DataFrame(data=d, index=[1, 2, 3, str(1)])

    df1 = df.copy(deep=True)
    print(df1)

    df2 = df.copy(deep=True)
    print(df2)

    df2 = df2[(df2.index != 1)]  # take a subset
    print(df2)

    common = pd.merge(df1, df2, left_index=True, right_index=True)
    print(common)

So the problem must be with my data somehow. What's the best way to get my data into a trivial example? If I copy out strings I potentially lose datatypes and so forth. Will a pickled version of a few rows work?

@jreback
Contributor

jreback commented Apr 6, 2016

Sure, if you have a reproducible example and you don't mind sharing it, then that would work.

@adamdivak

This is rather tricky to reproduce, but I had the same issue. Here is a minimal example that triggers it for me:

    import pandas as pd
    from math import nan
    a = pd.DataFrame({'a': [1, 2, 3]}, index=[1, 2, 'a'])
    b = pd.DataFrame({'b': [2, 3, 4]}, index=[1, nan, nan])
    a.join(b)

I had to try a lot of combinations to nail it down, and it seems that the following conditions are needed to trigger this:

  • Exactly one of the indices is of type object, the other one is of type float - that's why one index contains a string in the example. (If both are object or both are float then it does not produce an error)
  • One index contains at least two nan values
  • The values are irrelevant, this is specifically about indices
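The conditions listed above can be checked before joining. The following is only a rough sketch: `index_summary` is a hypothetical helper name (not part of pandas), and it uses `np.nan` instead of `math.nan` to build the same two frames as in the example.

```python
import numpy as np
import pandas as pd

# Same frames as in the example above, built with np.nan.
a = pd.DataFrame({'a': [1, 2, 3]}, index=[1, 2, 'a'])
b = pd.DataFrame({'b': [2, 3, 4]}, index=[1, np.nan, np.nan])

def index_summary(df):
    """Hypothetical helper: return (index dtype name, number of NaN labels)."""
    return str(df.index.dtype), int(df.index.isna().sum())

print(index_summary(a))  # the string label makes this an object index
print(index_summary(b))  # all-numeric labels give a float index with two NaNs
```

If one summary reports an object dtype and the other a float dtype with two or more NaN labels, the frames match the triggering pattern described in the list.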

Cheers,
Adam

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-36-generic
machine: x86_64
processor: 
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.1
setuptools: 22.0.0
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.8.9
lxml: None
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Jun 14, 2016

You realize that doing `from math import nan` is unnecessary: numpy is the definer of nan (they are the same), so importing it from math is non-idiomatic and just plain confusing.

The issue is mixed-type object indexes. It is never a good idea to have mixed types like this in the index (or in a column, for that matter).

Yes this does trigger an error. If you want to have a look, go for it.

    ipdb> p list(map(fkeys, left_keys, right_keys))
    *** TypeError: unorderable types: str() > int()
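The underlying error can be seen without pandas at all, since Python 3 refuses to order a `str` against an `int`. The sketch below illustrates that; `can_sort` is a hypothetical helper name, and the exact wording of the `TypeError` message differs between Python versions.

```python
def can_sort(values):
    """Return True if Python can order the values, False on TypeError."""
    try:
        sorted(values)
        return True
    except TypeError:
        # Raised in Python 3 when comparing unorderable types such as str and int.
        return False

print(can_sort([1, 2, 3]))    # homogeneous integers sort fine
print(can_sort([1, 2, 'a']))  # mixed int/str cannot be ordered in Python 3
```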

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Difficulty Intermediate labels Jun 14, 2016
@jreback jreback added this to the Next Major Release milestone Jun 14, 2016
@jreback jreback changed the title from "pd.merge() fails in odd ways?" to "BUG: merging with mixed types objects in py3 when unorderable" Jun 14, 2016
@jreback
Contributor

jreback commented Jun 14, 2016

xref #13432 which is the same unsortable condition.

cc @pijucha

@jreback jreback modified the milestones: 0.18.2, Next Major Release Jun 27, 2016
pijucha added a commit to pijucha/pandas that referenced this issue Jul 17, 2016
1. Added an internal `safe_sort` to safely sort mixed-integer
arrays in Python3.

2. Changed Index.difference and Index.symmetric_difference
in order to:
- sort mixed-int Indexes (pandas-dev#13432)
- improve performance (pandas-dev#12044)

3. Fixed DataFrame.join which raised in Python3 with mixed-int
non-unique indexes (issue with sorting mixed-ints, pandas-dev#12814)

4. Fixed Index.union returning an empty Index when one of
arguments was a named empty Index (pandas-dev#13432)
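
To give a feel for what safely sorting a mixed-integer array can mean, here is a minimal sketch. It is not the `safe_sort` added in the commit; `safe_sort_sketch` is a hypothetical name, and the approach (sort numbers and strings separately, numbers first) is an assumption chosen for illustration. It also assumes the input holds only numbers and strings and no NaN values.

```python
import numbers

def safe_sort_sketch(values):
    """Sort a mixed list with numbers first (ascending), then strings.

    Illustrative only, not the pandas implementation. Assumes every
    element is either a number or a str; other types are not handled.
    """
    nums = sorted(v for v in values if isinstance(v, numbers.Number))
    strs = sorted(v for v in values if isinstance(v, str))
    return nums + strs

print(safe_sort_sketch([3, 'b', 1, 'a', 2]))  # [1, 2, 3, 'a', 'b']
```

Splitting by type avoids ever comparing a `str` with an `int`, which is the comparison that raises in Python 3.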
3 participants