BUG: in case of dtype mismatch (int vs category), error message from concat is not crystal clear #42552


Closed
3 tasks done
yohplala opened this issue Jul 15, 2021 · 12 comments
Labels
Bug Categorical Categorical Data Type Error Reporting Incorrect or improved errors from pandas IO Parquet parquet, feather Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@yohplala

yohplala commented Jul 15, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
exi = pd.read_parquet('/home/yoh/Documents/code/data/existing.parquet', engine='fastparquet')
new = pd.read_parquet('/home/yoh/Documents/code/data/new.parquet', engine='fastparquet')
to_record = pd.concat([exi, new])

Please find the files attached; I could not manage to re-create the faulty dataframes manually.
faulty_dataframes.zip
(each is 5 rows × 8 columns)

Problem description

Before installing pandas 1.3.0, I was using pandas 1.2.5 with fastparquet 0.6.4.dev0, and this data extract was not causing any problem.
Since installing pandas 1.3.0, the concat command raises the following error:

to_record = pd.concat([exi, new])
Traceback (most recent call last):

  File "<ipython-input-2-9967cb321e9e>", line 4, in <module>
    to_record = pd.concat([exi, new])

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 307, in concat
    return op.get_result()

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 532, in get_result
    new_data = concatenate_managers(

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 222, in concatenate_managers
    values = _concatenate_join_units(join_units, concat_axis, copy=copy)

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 486, in _concatenate_join_units
    to_concat = [

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 487, in <listcomp>
    ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 403, in get_reindexed_values
    values = self.block.get_values()

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 1360, in get_values
    return np.asarray(values).reshape(self.shape)

ValueError: cannot reshape array of size 5 into shape (1,0)

Reading the files back with pyarrow instead produces dataframes that do not trigger the error.
I also tried concatenating various extracts of the dataframes, selecting columns one by one or several at once, and this does not raise the error either. For instance, the following concats run fine:

to_record = pd.concat([exi[['timestamp','period','side']], new[['side','timestamp','period']]])
to_record = pd.concat([exi[['period','id']], new[['id','period']]])
to_record = pd.concat([exi['tracking'], new['tracking']])
# etc...

I am at a loss to reduce the problem to its root cause.
Would anyone have some advice?

Expected Output

No error :)

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f00ed8f
python : 3.8.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-59-generic
Version : #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 1.4.4
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.06.0
fastparquet : 0.6.4.dev0
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.19
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

@yohplala yohplala added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 15, 2021
@ParfaitG
Contributor

How were your parquet files created? Did you specify engine='fastparquet' in DataFrame.to_parquet()? You may not be able to interchange engine types (i.e., files written with the pyarrow engine may not read back cleanly with the fastparquet engine). The IO docs describe the differences between these engines.

The example below suggests your zipped files may have been pyarrow engine outputs, since reading them with the default pyarrow engine raises no error, while staying consistent with fastparquet across write/read operations also works without issue.

# READ WITH DEFAULT pyarrow ENGINE
exi = pd.read_parquet('existing.parquet')
new = pd.read_parquet('new.parquet')

to_record = pd.concat([exi, new])


# WRITE WITH fastparquet ENGINE
exi.to_parquet('existing2.parquet', engine='fastparquet')
new.to_parquet('new2.parquet', engine='fastparquet')

# READ WITH fastparquet ENGINE
exi2 = pd.read_parquet('existing2.parquet', engine='fastparquet')
new2 = pd.read_parquet('new2.parquet', engine='fastparquet')

to_record2 = pd.concat([exi2, new2])
to_record2

# COMPARE DATA FRAMES
to_record.eq(to_record2)
#    price  amount    id  volume  tracking  timestamp  period  side
# 0   True    True  True    True      True       True    True  True
# 1   True    True  True    True      True       True    True  True
# 2   True    True  True    True      True       True    True  True
# 3   True    True  True    True      True       True    True  True
# 4   True    True  True    True      True       True    True  True
# 0   True    True  True    True      True       True    True  True
# 1   True    True  True    True      True       True    True  True
# 2   True    True  True    True      True       True    True  True
# 3   True    True  True    True      True       True    True  True
# 4   True    True  True    True      True       True    True  True

Given that pandas and fastparquet (and pyarrow, for that matter) are actively developed libraries, staying consistent with one engine rather than mixing them may be the best strategy going forward.
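
One way to enforce that consistency for a whole session is the io.parquet.engine option, which sets the default engine for both read_parquet and to_parquet (a small sketch using this built-in option):

import pandas as pd

# Pin the default parquet engine so every read/write in the session
# goes through the same implementation.
pd.set_option('io.parquet.engine', 'fastparquet')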

@yohplala
Author

yohplala commented Jul 17, 2021

How were your parquet files created? Did you specify engine='fastparquet' in DataFrame.to_parquet()?

Hi, thanks for trying to help.
The original parquet files were both written with fastparquet directly, not through pandas:

import fastparquet as fp
fp.write(filename, data)

@yohplala yohplala reopened this Jul 17, 2021
@ParfaitG
Contributor

ParfaitG commented Jul 17, 2021

Hmmmm... I cannot reproduce the issue with the latest fastparquet pip release (0.7.0), as opposed to the dev 0.6.4.dev0.

import pandas as pd
import fastparquet as fp

exi = pd.DataFrame(
    {"price":{"0":10488.01,"1":10486.01,"2":10488.0,"3":10488.0,"4":10486.01},
     "amount":{"0":0.001144,"1":0.020454,"2":0.020194,"3":0.001549,"4":0.035631},
     "id":{"0":390626450,"1":390627927,"2":390626448,"3":390626449,"4":390627926},
     "volume":{"0":11.99828344,"1":214.48084854,"2":211.794672,"3":16.245912,"4":373.62702231},
     "tracking":{"0":"not_verified","1":"not_verified","2":"not_verified","3":"not_verified","4":"not_verified"},
     "timestamp":{"0":1601676533012,"1":1601676839128,"2":1601676532198,"3":1601676532647,"4":1601676839128},
     "period":{"0":1601769600,"1":1601769600,"2":1601769600,"3":1601769600,"4":1601769600},
     "side":{"0":"buy","1":"buy","2":"sell","3":"sell","4":"sell"}}
)
fp.write("existing2.parquet", exi)

new = pd.DataFrame(
    {"price":{"0":10488.01,"1":10488.0,"2":10488.01,"3":10563.85,"4":10563.97},
     "amount":{"0":0.001144,"1":0.047674,"2":0.01986,"3":0.030651,"4":0.029388},
     "id":{"0":390626450,"1":390626458,"2":390626459,"3":390637018,"4":390637019},
     "volume":{"0":11.99828344,"1":500.004912,"2":208.2918786,"3":323.79256635,"4":310.45395036},
     "tracking":{"0":"not_verified","1":"not_verified","2":"not_verified","3":"not_verified","4":"not_verified"},
     "side":{"0":"buy","1":"sell","2":"buy","3":"buy","4":"buy"},
     "timestamp":{"0":1601676533012,"1":1601676533647,"2":1601676534556,"3":1601678073988,"4":1601678073988},
     "period":{"0":1601769600,"1":1601769600,"2":1601769600,"3":1601769600,"4":1601769600}
    }
)
fp.write("new2.parquet", new)

exi2 = pd.read_parquet("existing2.parquet")
new2 = pd.read_parquet("new2.parquet")

df = pd.concat([exi2, new2])
df
#           price    amount         id  ...      timestamp      period  side
# index                                 ...                                 
# 0      10488.01  0.001144  390626450  ...  1601676533012  1601769600   buy
# 1      10486.01  0.020454  390627927  ...  1601676839128  1601769600   buy
# 2      10488.00  0.020194  390626448  ...  1601676532198  1601769600  sell
# 3      10488.00  0.001549  390626449  ...  1601676532647  1601769600  sell
# 4      10486.01  0.035631  390627926  ...  1601676839128  1601769600  sell
# 0      10488.01  0.001144  390626450  ...  1601676533012  1601769600   buy
# 1      10488.00  0.047674  390626458  ...  1601676533647  1601769600  sell
# 2      10488.01  0.019860  390626459  ...  1601676534556  1601769600   buy
# 3      10563.85  0.030651  390637018  ...  1601678073988  1601769600   buy
# 4      10563.97  0.029388  390637019  ...  1601678073988  1601769600   buy
# 
# [10 rows x 8 columns]

@yohplala
Author

yohplala commented Jul 17, 2021

Hmmmm... I cannot reproduce the issue with the latest fastparquet pip release (0.7.0), as opposed to the dev 0.6.4.dev0.

I am not sure that questioning the use of fastparquet is the right track.

OK, I have been diving deep into the pandas sources, and I have some more information to bring up.
First off, I rewrote the data manually, and this version does not raise any issue with concat.

e_pr = [10488.01, 10486.01, 10488.00, 10488.00, 10486.01]
n_pr = [10488.01, 10488.00, 10488.01, 10563.85, 10563.97]
e_am = [0.001144, 0.020454, 0.020194, 0.001549, 0.035631]
n_am = [0.001144, 0.047674, 0.019860, 0.030651, 0.029388]
e_id = [390626450, 390627927, 390626448, 390626449, 390627926]
n_id = [390626450, 390626458, 390626459, 390637018, 390637019]
e_vo = [11.998283, 214.480849, 211.794672, 16.245912, 373.627022]
n_vo = [11.998283, 500.004912, 208.291879, 323.792566, 310.453950]
e_tr = ['not_verified', 'not_verified', 'not_verified', 'not_verified', 'not_verified']
n_tr = ['not_verified', 'not_verified', 'not_verified', 'not_verified', 'not_verified']
e_ts = [pd.Timestamp('2020-10-02 22:08:53.012999'),
        pd.Timestamp('2020-10-02 22:13:59.128999'),
        pd.Timestamp('2020-10-02 22:08:52.198999'),
        pd.Timestamp('2020-10-02 22:08:52.647000'),
        pd.Timestamp('2020-10-02 22:13:59.128999')]
n_ts = [pd.Timestamp('2020-10-02 22:08:53.012999'),
        pd.Timestamp('2020-10-02 22:08:53.647000'),
        pd.Timestamp('2020-10-02 22:08:54.556000'),
        pd.Timestamp('2020-10-02 22:34:33.988000'),
        pd.Timestamp('2020-10-02 22:34:33.988000')]
e_pe = [1601769600, 1601769600, 1601769600, 1601769600, 1601769600]
n_pe = [1601769600, 1601769600, 1601769600, 1601769600, 1601769600]
e_si = ['buy', 'buy', 'sell', 'sell', 'sell']
n_si = ['buy', 'sell', 'buy', 'buy', 'buy']

exi2 = pd.DataFrame({'price': e_pr,
                    'amount': e_am,
                    'id': e_id,
                    'volume': e_vo,
                    'tracking': e_tr,
                    'timestamp': e_ts,
                    'period': e_pe,
                    'side': e_si})
new2 = pd.DataFrame({'price': n_pr,
                    'amount': n_am,
                    'id': n_id,
                    'volume': n_vo,
                    'tracking': n_tr,
                    'side': n_si,
                    'timestamp': n_ts,
                    'period': n_pe})

from pandas.api.types import CategoricalDtype
tracking = CategoricalDtype(categories=['not_verified',
                                        'verified_ok',
                                        'hole_start',
                                        'hole_end'],
                            ordered = False)

exi2['tracking'] = exi2['tracking'].astype(tracking)
exi2['side'] = exi2['side'].astype('category')
exi2['period'] = exi2['period'].astype('category')
new2['tracking'] = new2['tracking'].astype(tracking)
new2['side'] = new2['side'].astype('category')
new2['period'] = new2['period'].astype('category')

to_record = pd.concat([exi2, new2])    # no error this time

We can check that the dataframes created manually and those read from the files compare fine, so the index information is OK (only some volume values differ, which I don't think is important).

exi2 == exi
Out[46]: 
   price  amount    id  volume  tracking  timestamp  period  side
0   True    True  True   False      True       True    True  True
1   True    True  True   False      True       True    True  True
2   True    True  True    True      True       True    True  True
3   True    True  True    True      True       True    True  True
4   True    True  True   False      True       True    True  True

new2 == new
Out[47]: 
       price  amount    id  volume  tracking  side  timestamp  period
index                                                                
0       True    True  True   False      True  True       True    True
1       True    True  True    True      True  True       True    True
2       True    True  True   False      True  True       True    True
3       True    True  True   False      True  True       True    True
4       True    True  True   False      True  True       True    True

And, as said, concat is working on these manually created dataframes.

Now the additional information I could gather using print statements in pandas sources (pandas/core/internals/blocks.py & pandas/core/internals/concat.py) is that:

  • the column that is raising trouble is period, which holds categorical data.
  • in concat.py, with exi2 and new2 (the dataframes created manually), this column passes the _is_uniform_join_units check for some reason, and is therefore handled by this part of the code (line 205):
            vals = [ju.block.values for ju in join_units]

            if not blk.is_extension:
                # _is_uniform_join_units ensures a single dtype, so
                #  we can use np.concatenate, which is more performant
                #  than concat_compat
                values = np.concatenate(vals, axis=blk.ndim - 1)
            else:
                # TODO(EA2D): special-casing not needed with 2D EAs
                values = concat_compat(vals, axis=1)
                values = ensure_block_shape(values, blk.ndim)

            values = ensure_wrapped_if_datetimelike(values)

            fastpath = blk.values.dtype == values.dtype
  • whereas with exi and new (read from the recorded files), this column fails the _is_uniform_join_units check for some reason, and is therefore handled by the subsequent part of the code (line 221):
            values = _concatenate_join_units(join_units, concat_axis, copy=copy)
            fastpath = False

The error is then raised in _concatenate_join_units, which I suspect is not fit to handle categorical data (this is a guess).

I am unfortunately not able to go further, since inspecting _is_uniform_join_units requires a join_unit as input. I identify this as the next step, but I simply don't know what a join_unit is, nor how to create or retrieve one from the input DataFrames.
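
In the meantime, the closest I can get is listing each frame's internal blocks and their dtypes. This is a sketch relying on private pandas internals (_mgr is the BlockManager in the 1.3.x layout), so it is not a stable API:

# Private internals: _mgr is the BlockManager, .blocks its Block objects.
for name, df in [('exi', exi), ('new', new)]:
    for blk in df._mgr.blocks:
        cols = df.columns[blk.mgr_locs.as_array].tolist()
        print(name, blk.dtype, cols)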

Would someone be able to give some pointers starting from here? Thanks in advance, bests

@rhshadrach rhshadrach added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jul 18, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3.2 milestone Jul 27, 2021
@yohplala yohplala changed the title BUG: concat 'broken' in pandas 1.3.0? BUG: in case of dtype mismatch (int vs category), error message from concat is not crystal clear Aug 1, 2021
@yohplala
Author

yohplala commented Aug 1, 2021

@ParfaitG @rhshadrach @simonjayhawkins

Hi,
OK, I just got back to this topic, and I have finally clarified what is wrong.
'Before', I can say it was working, but I cannot say why; maybe concat was more permissive in 1.2.5.

So the cause of the trouble is the period column.
In exi, the data are categorical, while in new, they are int:

new['period']
Out[17]: 
index
0    1601769600
1    1601769600
2    1601769600
3    1601769600
4    1601769600
Name: period, dtype: int64

exi['period']
Out[18]: 
0    1601769600
1    1601769600
2    1601769600
3    1601769600
4    1601769600
Name: period, dtype: category
Categories (1, int64): [1601769600]
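
For reference, a quick way to surface this kind of mismatch up front is to align both frames' dtypes on column name and keep only the differing entries (a sketch, assuming the frames share the same column names):

left = exi.dtypes
right = new.dtypes.reindex(left.index)
print(pd.DataFrame({'exi': left, 'new': right})[left != right])
# period: category on the exi side, int64 on the new side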

Fixing the mismatch by converting the dtype of the period column in new to category solves the problem.
If something is to be done in pandas, I would suggest raising an error before the concat step in case of a dtype mismatch, as sketched below.
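
A hypothetical pre-concat guard could look like this (the helper name is mine, not an existing pandas API):

def assert_matching_dtypes(left, right):
    # Fail early, with a readable message, if shared columns disagree on dtype.
    shared = left.columns.intersection(right.columns)
    mismatched = [(col, left[col].dtype, right[col].dtype)
                  for col in shared
                  if left[col].dtype != right[col].dtype]
    if mismatched:
        raise TypeError(f"concat dtype mismatch: {mismatched}")

assert_matching_dtypes(exi, new)  # would flag the 'period' column here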

Thanks for trying to help @ParfaitG, and sorry for the trouble.
Have a good day,
Bests,

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021
@mroeschke mroeschke added Error Reporting Incorrect or improved errors from pandas and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 21, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@simonjayhawkins
Member

changing milestone to 1.3.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
@jbrockmendel
Member

@yohplala is there a copy/paste-able example of the problem? https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@yohplala
Author

@yohplala is there a copy/paste-able example of the problem? https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

Hi, I tried to reproduce this on a small test case, shown below.
With the latest pandas version, I can no longer trigger the error.
Closing the ticket.
Thanks for taking care of this issue.
Bests

import pandas as pd

e_pe = [1601769600, 1601769600, 1601769600, 1601769600, 1601769600]
n_pe = [1601769600, 1601769600, 1601769600, 1601769600, 1601769600]

exi2 = pd.DataFrame({'period': e_pe})
new2 = pd.DataFrame({'period': n_pe})

exi2['period'] = exi2['period'].astype('category')
new2['period'] = new2['period'].astype('int64')

to_record = pd.concat([exi2, new2])
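
For completeness, aligning the dtypes before concatenating (casting the int column to the existing categorical dtype) removes the ambiguity:

new2['period'] = new2['period'].astype(exi2['period'].dtype)
to_record = pd.concat([exi2, new2])  # both sides are now categorical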

@simonjayhawkins simonjayhawkins added Categorical Categorical Data Type IO Parquet parquet, feather labels Nov 19, 2021
@simonjayhawkins
Member

@yohplala Thanks for the code sample. That code works in all 1.3.x releases?

@yohplala
Author

@yohplala Thanks for the code sample. That code works in all 1.3.x releases?

Hi @simonjayhawkins ,
I have not tested this exact test case against 1.3.0, but as per my previous comments, the original data was failing on 1.3.0.
The pandas version I now have is 1.3.4, and no error is raised.

@simonjayhawkins
Member

I have run the code sample #42552 (comment) on 1.3.0 and it does not raise the ValueError.

@yohplala
Author

yohplala commented Nov 19, 2021

I have run the code sample #42552 (comment) on 1.3.0 and it does not raise the ValueError.

OK, I am sorry; I have been wasting your time here then.
I thought I had understood the cause of the error.
Since then, I have not been working with those files, and I have not encountered the trouble again.
I propose we drop this here.
Thanks again for trying to help, and sorry again for the disturbance.
Bests,
