DataFrame that deserialized from pyarrow cannot be used to `merge` #27943

sighingnow · 2019-08-16T03:34:21Z

Code Sample, a copy-pastable example if possible

In [1]: import numpy as np
In [2]: import pandas as pd

In [3]: data1 = pd.DataFrame(np.arange(20).reshape((4, 5)) + 1, columns=['a', 'b', 'c', 'd', 'e'])
In [4]: data2 = pd.DataFrame(np.arange(20).reshape((5, 4)) + 1, columns=['a', 'b', 'x', 'y'])

In [5]: import pyarrow as pa
In [6]: d1 = pa.deserialize(pa.serialize(data1).to_buffer())
In [7]: d2 = pa.deserialize(pa.serialize(data2).to_buffer())

In [8]: d1.merge(d2)

Problem description

The above code raises an exception:

In [8]: d1.merge(d2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-f852b96f603a> in <module>
----> 1 d1.merge(d2)

~/pandas/pandas/core/frame.py in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
   7261             copy=copy,
   7262             indicator=indicator,
-> 7263             validate=validate,
   7264         )
   7265

~/pandas/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     82         validate=validate,
     83     )
---> 84     return op.get_result()
     85
     86

~/pandas/pandas/core/reshape/merge.py in get_result(self)
    625             self.left, self.right = self._indicator_pre_merge(self.left, self.right)
    626
--> 627         join_index, left_indexer, right_indexer = self._get_join_info()
    628
    629         ldata, rdata = self.left._data, self.right._data

~/pandas/pandas/core/reshape/merge.py in _get_join_info(self)
    842             )
    843         else:
--> 844             (left_indexer, right_indexer) = self._get_join_indexers()
    845
    846             if self.right_index:

~/pandas/pandas/core/reshape/merge.py in _get_join_indexers(self)
    821         """ return the join indexers """
    822         return _get_join_indexers(
--> 823             self.left_join_keys, self.right_join_keys, sort=self.sort, how=self.how
    824         )
    825

~/pandas/pandas/core/reshape/merge.py in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
   1285
   1286     # get left & right join labels and num. of levels at each location
-> 1287     llab, rlab, shape = map(list, zip(*map(fkeys, left_keys, right_keys)))
   1288
   1289     # get flat i8 keys from label lists

~/pandas/pandas/core/reshape/merge.py in _factorize_keys(lk, rk, sort)
   1882     rizer = klass(max(len(lk), len(rk)))
   1883
-> 1884     llab = rizer.factorize(lk)
   1885     rlab = rizer.factorize(rk)
   1886

~/pandas/pandas/_libs/hashtable.pyx in pandas._libs.hashtable.Int64Factorizer.factorize()
    109         return self.count
    110
--> 111     def factorize(self, int64_t[:] values, sort=False,
    112                   na_sentinel=-1, na_value=None):
    113         """

~/pandas/pandas/_libs/hashtable.cpython-37m-darwin.so in View.MemoryView.memoryview_cwrapper()

~/pandas/pandas/_libs/hashtable.cpython-37m-darwin.so in View.MemoryView.memoryview.__cinit__()

ValueError: buffer source array is read-only
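
For readers unfamiliar with the message: the error comes from asking for a writable view of memory that is flagged read-only. The sketch below reproduces that state with plain NumPy. It is an assumption that the pyarrow round trip leaves the frames backed by an immutable buffer in a similar way; `np.frombuffer` over a `bytes` object is only a stand-in, not what pyarrow does internally.

```python
import numpy as np

# An immutable bytes object stands in for a serialized buffer
# (assumption: this only mimics a zero-copy deserialization).
raw = np.arange(5, dtype=np.int64).tobytes()

# frombuffer does not copy, so the array inherits the buffer's read-only status.
view = np.frombuffer(raw, dtype=np.int64)

print(view.flags.writeable)       # the array is flagged as not writeable
print(memoryview(view).readonly)  # the buffer protocol reports the same thing

# Writing through the read-only array is rejected by NumPy.
try:
    view[0] = 99
    write_failed = False
except ValueError as exc:
    write_failed = True
    print(exc)

# A copy owns fresh, writable memory, which is why copying first avoids the problem.
writable = view.copy()
writable[0] = 99
```

Any code path that insists on a writable buffer, even if it only reads, fails on `view` but works on `writable`.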

Expected Output

Deep-copying both frames first works around the error, and `d1.copy(deep=True).merge(d2.copy(deep=True))` gives the correct result:

In [10]: d1.copy(deep=True).merge(d2.copy(deep=True))
Out[10]:
   a  b  c  d  e  x  y
0  1  2  3  4  5  3  4

Output of `pd.show_versions()`

I'm working from pandas master, so `pd.show_versions()` doesn't work. The git commit hash is a818281a45f7b5bd24f050e5d6868894c5108db6 (the latest commit on the master branch as of 2019-08-16).


TomAugspurger · 2019-08-16T11:03:30Z

We're missing a `const` here:

diff --git a/pandas/_libs/hashtable.pyx b/pandas/_libs/hashtable.pyx
index 3e620f593..b8df78e60 100644
--- a/pandas/_libs/hashtable.pyx
+++ b/pandas/_libs/hashtable.pyx
@@ -108,7 +108,7 @@ cdef class Int64Factorizer:
     def get_count(self):
         return self.count
 
-    def factorize(self, int64_t[:] values, sort=False,
+    def factorize(self, const int64_t[:] values, sort=False,
                   na_sentinel=-1, na_value=None):
         """
         Factorize values with nans replaced by na_sentinel

Just need to construct a test-case that doesn't rely on PyArrow.
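
One possible shape for such a test, as a minimal sketch rather than the test that was actually merged: mark the underlying NumPy arrays read-only with `setflags(write=False)` and build the frames with `copy=False`. The variable names are illustrative, and whether the frame really shares the read-only memory is an assumption that may depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Same data as in the report, with an explicit dtype so the int64 code path is used.
left_values = np.arange(20, dtype=np.int64).reshape((4, 5)) + 1
right_values = np.arange(20, dtype=np.int64).reshape((5, 4)) + 1

# Flag both arrays read-only; no pyarrow involved.
left_values.setflags(write=False)
right_values.setflags(write=False)

# copy=False asks pandas to reuse the arrays instead of copying them
# (assumption: the resulting frames are then backed by read-only memory).
left = pd.DataFrame(left_values, columns=["a", "b", "c", "d", "e"], copy=False)
right = pd.DataFrame(right_values, columns=["a", "b", "x", "y"], copy=False)

# With the const fix this merges on the shared columns "a" and "b";
# without it, the ValueError from the report would be expected here.
result = left.merge(right)
print(result)
```

Only the key pair `a=1, b=2` appears in both frames, so the result should match the single row shown under Expected Output.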

sighingnow · 2019-08-16T11:29:09Z

Thanks for fixing that so quickly!

TomAugspurger added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Aug 16, 2019

TomAugspurger added this to the 0.25.1 milestone Aug 16, 2019

TomAugspurger mentioned this issue Aug 16, 2019

BUG: Merge with readonly arrays #27946

Merged

jreback closed this as completed in #27946 Aug 16, 2019
