BUG: in case of dtype mismatch (int vs category), error message from concat is not crystal clear #42552
How were your parquet files created? The run below shows that your zipped files read back and concatenate without error, and that a fastparquet round trip gives equal frames:

# READ WITH DEFAULT pyarrow ENGINE
exi = pd.read_parquet('existing.parquet')
new = pd.read_parquet('new.parquet')
to_record = pd.concat([exi, new])
# WRITE WITH fastparquet ENGINE
exi.to_parquet('existing2.parquet', engine='fastparquet')
new.to_parquet('new2.parquet', engine='fastparquet')
# READ WITH fastparquet ENGINE
exi2 = pd.read_parquet('existing2.parquet', engine='fastparquet')
new2 = pd.read_parquet('new2.parquet', engine='fastparquet')
to_record2 = pd.concat([exi2, new2])
to_record2
# COMPARE DATA FRAMES
to_record.eq(to_record2)
# price amount id volume tracking timestamp period side
# 0 True True True True True True True True
# 1 True True True True True True True True
# 2 True True True True True True True True
# 3 True True True True True True True True
# 4 True True True True True True True True
# 0 True True True True True True True True
# 1 True True True True True True True True
# 2 True True True True True True True True
# 3 True True True True True True True True
# 4     True   True  True    True     True      True    True  True
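Worth noting: element-wise .eq() compares values only and says nothing about dtypes, so it cannot surface the kind of mismatch discussed later in this thread. A small sketch to check the dtypes directly (variable names from the snippet above):

# .eq() ignores dtypes; compare the dtypes themselves to spot an
# int64-vs-category mismatch between the two concat results.
print(to_record.dtypes)
print(to_record2.dtypes)
# Compare as dicts so column order does not matter:
print(dict(to_record.dtypes) == dict(to_record2.dtypes))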
Hi, thanks for trying to help.
Hmmm... I cannot reproduce the issue with the latest versions:

import pandas as pd
import fastparquet as fp
exi = pd.DataFrame(
{"price":{"0":10488.01,"1":10486.01,"2":10488.0,"3":10488.0,"4":10486.01},
"amount":{"0":0.001144,"1":0.020454,"2":0.020194,"3":0.001549,"4":0.035631},
"id":{"0":390626450,"1":390627927,"2":390626448,"3":390626449,"4":390627926},
"volume":{"0":11.99828344,"1":214.48084854,"2":211.794672,"3":16.245912,"4":373.62702231},
"tracking":{"0":"not_verified","1":"not_verified","2":"not_verified","3":"not_verified","4":"not_verified"},
"timestamp":{"0":1601676533012,"1":1601676839128,"2":1601676532198,"3":1601676532647,"4":1601676839128},
"period":{"0":1601769600,"1":1601769600,"2":1601769600,"3":1601769600,"4":1601769600},
"side":{"0":"buy","1":"buy","2":"sell","3":"sell","4":"sell"}}
)
fp.write("existing2.parquet", exi)
new = pd.DataFrame(
{"price":{"0":10488.01,"1":10488.0,"2":10488.01,"3":10563.85,"4":10563.97},
"amount":{"0":0.001144,"1":0.047674,"2":0.01986,"3":0.030651,"4":0.029388},
"id":{"0":390626450,"1":390626458,"2":390626459,"3":390637018,"4":390637019},
"volume":{"0":11.99828344,"1":500.004912,"2":208.2918786,"3":323.79256635,"4":310.45395036},
"tracking":{"0":"not_verified","1":"not_verified","2":"not_verified","3":"not_verified","4":"not_verified"},
"side":{"0":"buy","1":"sell","2":"buy","3":"buy","4":"buy"},
"timestamp":{"0":1601676533012,"1":1601676533647,"2":1601676534556,"3":1601678073988,"4":1601678073988},
"period":{"0":1601769600,"1":1601769600,"2":1601769600,"3":1601769600,"4":1601769600}
}
)
fp.write("new2.parquet", new)
exi2 = pd.read_parquet("existing2.parquet")
new2 = pd.read_parquet("new2.parquet")
df = pd.concat([exi2, new2])
df
# price amount id ... timestamp period side
# index ...
# 0 10488.01 0.001144 390626450 ... 1601676533012 1601769600 buy
# 1 10486.01 0.020454 390627927 ... 1601676839128 1601769600 buy
# 2 10488.00 0.020194 390626448 ... 1601676532198 1601769600 sell
# 3 10488.00 0.001549 390626449 ... 1601676532647 1601769600 sell
# 4 10486.01 0.035631 390627926 ... 1601676839128 1601769600 sell
# 0 10488.01 0.001144 390626450 ... 1601676533012 1601769600 buy
# 1 10488.00 0.047674 390626458 ... 1601676533647 1601769600 sell
# 2 10488.01 0.019860 390626459 ... 1601676534556 1601769600 buy
# 3 10563.85 0.030651 390637018 ... 1601678073988 1601769600 buy
# 4 10563.97 0.029388 390637019 ... 1601678073988 1601769600 buy
#
# [10 rows x 8 columns]
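One detail: the read_parquet calls above use the default (pyarrow) engine, whereas the original report read with fastparquet. A sketch of how to match that setup exactly, untested here (exi3/new3 are new names introduced for illustration):

# Read back with the fastparquet engine explicitly, mirroring the
# original report; this may restore the categorical columns.
exi3 = pd.read_parquet("existing2.parquet", engine="fastparquet")
new3 = pd.read_parquet("new2.parquet", engine="fastparquet")
pd.concat([exi3, new3])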
I am not sure challenging the use of fastparquet is the right track. OK, I have been deep diving into the pandas sources, and I have some more information to bring up. First, the dataframes recreated manually:

e_pr = [10488.01, 10486.01, 10488.00, 10488.00, 10486.01]
n_pr = [10488.01, 10488.00, 10488.01, 10563.85, 10563.97]
e_am = [0.001144, 0.020454, 0.020194, 0.001549, 0.035631]
n_am = [0.001144, 0.047674, 0.019860, 0.030651, 0.029388]
e_id = [390626450, 390627927, 390626448, 390626449, 390627926]
n_id = [390626450, 390626458, 390626459, 390637018, 390637019]
e_vo = [11.998283, 214.480849, 211.794672, 16.245912, 373.627022]
n_vo = [11.998283, 500.004912, 208.291879, 323.792566, 310.453950]
e_tr = ['not_verified', 'not_verified', 'not_verified', 'not_verified', 'not_verified']
n_tr = ['not_verified', 'not_verified', 'not_verified', 'not_verified', 'not_verified']
e_ts = [pd.Timestamp('2020-10-02 22:08:53.012999'),
pd.Timestamp('2020-10-02 22:13:59.128999'),
pd.Timestamp('2020-10-02 22:08:52.198999'),
pd.Timestamp('2020-10-02 22:08:52.647000'),
pd.Timestamp('2020-10-02 22:13:59.128999')]
n_ts = [pd.Timestamp('2020-10-02 22:08:53.012999'),
pd.Timestamp('2020-10-02 22:08:53.647000'),
pd.Timestamp('2020-10-02 22:08:54.556000'),
pd.Timestamp('2020-10-02 22:34:33.988000'),
pd.Timestamp('2020-10-02 22:34:33.988000')]
e_pe = [1601769600, 1601769600, 1601769600, 1601769600, 1601769600]
n_pe = [1601769600, 1601769600, 1601769600, 1601769600, 1601769600]
e_si = ['buy', 'buy', 'sell', 'sell', 'sell']
n_si = ['buy', 'sell', 'buy', 'buy', 'buy']
exi2 = pd.DataFrame({'price': e_pr,
'amount': e_am,
'id': e_id,
'volume': e_vo,
'tracking': e_tr,
'timestamp': e_ts,
'period': e_pe,
'side': e_si})
new2 = pd.DataFrame({'price': n_pr,
'amount': n_am,
'id': n_id,
'volume': n_vo,
'tracking': n_tr,
'side': n_si,
'timestamp': n_ts,
'period': n_pe})
from pandas.api.types import CategoricalDtype
tracking = CategoricalDtype(categories=['not_verified',
'verified_ok',
'hole_start',
'hole_end'],
ordered = False)
exi2['tracking'] = exi2['tracking'].astype(tracking)
exi2['side'] = exi2['side'].astype('category')
exi2['period'] = exi2['period'].astype('category')
new2['tracking'] = new2['tracking'].astype(tracking)
new2['side'] = new2['side'].astype('category')
new2['period'] = new2['period'].astype('category')
to_record = pd.concat([exi2, new2])  # no error this time

We can check that the dataframes created manually and those read from the files compare equal, so the index information is OK (only some volume values are not equal, which I don't think is important).

exi2 == exi
Out[46]:
price amount id volume tracking timestamp period side
0 True True True False True True True True
1 True True True False True True True True
2 True True True True True True True True
3 True True True True True True True True
4 True True True False True True True True
new2 == new
Out[47]:
price amount id volume tracking side timestamp period
index
0 True True True False True True True True
1 True True True True True True True True
2 True True True False True True True True
3 True True True False True True True True
4      True    True  True   False     True  True      True    True

And, as an aside, here is the additional information I could gather by stepping through the pandas concatenation internals. In the uniform case, execution goes through this branch:
vals = [ju.block.values for ju in join_units]

if not blk.is_extension:
    # _is_uniform_join_units ensures a single dtype, so
    # we can use np.concatenate, which is more performant
    # than concat_compat
    values = np.concatenate(vals, axis=blk.ndim - 1)
else:
    # TODO(EA2D): special-casing not needed with 2D EAs
    values = concat_compat(vals, axis=1)
    values = ensure_block_shape(values, blk.ndim)

values = ensure_wrapped_if_datetimelike(values)
fastpath = blk.values.dtype == values.dtype

and, in the non-uniform branch:

values = _concatenate_join_units(join_units, concat_axis, copy=copy)
fastpath = False

The error is then raised further down this code path. Unfortunately, I am not able to go further by inspection alone. Would someone be able to give some pointers starting from here? Thanks in advance, best regards.
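For anyone retracing this, a much cheaper check than stepping through the internals is to compare the inputs' column dtypes directly, which is effectively how the cause was eventually found (a sketch; frame names from the code above):

# Any column whose dtype differs between the two frames is a
# candidate for the concat failure.
for col in exi2.columns.intersection(new2.columns):
    if exi2[col].dtype != new2[col].dtype:
        print(col, exi2[col].dtype, new2[col].dtype)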
@ParfaitG @rhshadrach @simonjayhawkins Hi, so the cause of the trouble is a dtype mismatch on the 'period' column:

new['period']
Out[17]:
index
0 1601769600
1 1601769600
2 1601769600
3 1601769600
4 1601769600
Name: period, dtype: int64
exi['period']
Out[18]:
0 1601769600
1 1601769600
2 1601769600
3 1601769600
4 1601769600
Name: period, dtype: category
Categories (1, int64): [1601769600]

Fixing the mismatch by converting the columns to a common dtype makes the error go away. Thanks for trying to help @ParfaitG, and sorry for the trouble.
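For reference, a minimal sketch of that fix, assuming we align both frames on int64 (frame names as in the outputs above):

# Cast the categorical 'period' back to plain int64 so both frames
# share one dtype before concatenating.
exi['period'] = exi['period'].astype('int64')
to_record = pd.concat([exi, new])  # no longer raises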
Changing milestone to 1.3.5.
@yohplala is there a copy/paste-able example of the problem? https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
Hi, here is an attempt to reproduce on a small test case:

import pandas as pd
e_pe = [1601769600, 1601769600, 1601769600, 1601769600, 1601769600]
n_pe = [1601769600, 1601769600, 1601769600, 1601769600, 1601769600]
exi2 = pd.DataFrame({'period': e_pe})
new2 = pd.DataFrame({'period': n_pe})
exi2['period'] = exi2['period'].astype('category')
new2['period'] = new2['period'].astype('int64')
to_record = pd.concat([exi2, new2])
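On versions where this minimal case does not raise, one way to see what concat actually did is to inspect the resulting column (a sketch; the exact result dtype may vary across pandas versions):

# The mismatched category/int64 column is silently unified rather
# than raising; print the resulting dtype to see what concat chose.
to_record = pd.concat([exi2, new2])
print(to_record['period'].dtype)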
@yohplala Thanks for the code sample. Does that code work in all 1.3.x releases?
Hi @simonjayhawkins,
I have run the code sample from #42552 (comment) on 1.3.0 and it does not raise.
OK, I am sorry, I am wasting your time here then.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Please find the files enclosed; I could not manage to re-create the faulty dataframes manually.
faulty_dataframes.zip
(each file is 5 rows x 8 columns)
Problem description
Before installing pandas 1.3.0, I was using pandas 1.2.5 and fastparquet 0.6.4.dev0, and this extract of data was not causing any problem. After installing pandas 1.3.0, the concat command raises an error.
I noticed that reading the files back with pyarrow yields dataframes that do not cause any error.
I also tried to concat various extracts of the dataframes, selecting columns one by one or even several at once, and that does not raise the error either.
I am at a loss to reduce the trouble to its root cause. Please, would anyone have some advice?
Expected Output
No error :)
Output of pd.show_versions()
INSTALLED VERSIONS
commit : f00ed8f
python : 3.8.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-59-generic
Version : #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.0
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 1.4.4
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.06.0
fastparquet : 0.6.4.dev0
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.19
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1