DataFrame shift along columns not respecting position when dtypes are mixed #29417
Comments
Nice catch! Investigation and PR are welcome!
Can I give this issue a try?
@haichuan0424 : Yes! You are more than welcome to!
@gfyoung The result becomes: It seems to downcast the floating-point number to an integer if the fractional part is zero. I have also tried booleans, and that works as well. Do you think this is worth submitting a pull request? Or does a float with a zero fractional part have to stay a float?
Try making the change and see if tests pass for you. If they do, add a test for this issue, and submit the PR for further review.
After some investigation, I found that a mixed-dtype DataFrame is broken down internally into several blocks, one per dtype. Below is an example:
df = pd.DataFrame(dict(
A=[1, 2], B=[3., 4.], C=['X', 'Y'],
D=[5., 6.], E=[7, 8], F=['W', 'Z']
))
df
A B C D E F
0 1 3.0 X 5.0 7 W
1 2 4.0 Y 6.0 8 Z
Internally, this DataFrame is split into three blocks, each addressed by an index slice, and each block is shifted separately:
BlockManager
Items: Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
FloatBlock: slice(1, 5, 2), 2 x 2, dtype: float64
IntBlock: slice(0, 8, 4), 2 x 2, dtype: int64
ObjectBlock: slice(2, 8, 3), 2 x 2, dtype: object
FloatBlock:
[[3. 4.]
[5. 6.]]
after shift:
[[nan nan]
[ 3. 4.]]
IntBlock:
[[1 2]
[7 8]]
after shift:
[[nan nan]
[ 1. 2.]]
ObjectBlock:
[['X' 'Y']
['W' 'Z']]
after shift:
[[nan nan]
['X' 'Y']]
After each block is shifted, the blocks are reassembled according to their original index slices. This is why the result is:
A B C D E F
0 NaN NaN NaN 3.0 1.0 X
1 NaN NaN NaN 4.0 2.0 Y
This is more complicated than I originally thought. We probably need to cast everything to object so that it is treated as one block, shift so that only the first column becomes NaN, and then cast the result back to the appropriate dtypes so we don't lose precision.
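The "cast to object, shift, cast back" idea in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: `shift_columns_by_position` is a hypothetical helper name, it moves values with plain NumPy slicing rather than pandas internals, and it is not necessarily how the eventual fix was implemented.

```python
import numpy as np
import pandas as pd


def shift_columns_by_position(df, periods=1):
    """Hypothetical helper: shift values across columns by position, ignoring dtype."""
    # Interleave every column into one object array, so there is effectively one block.
    values = df.to_numpy(dtype=object)
    shifted = np.full(values.shape, np.nan, dtype=object)
    ncols = values.shape[1]
    if periods >= 0:
        # Move each column `periods` places to the right; the leading columns stay NaN.
        shifted[:, periods:] = values[:, : max(ncols - periods, 0)]
    else:
        # Move each column to the left; the trailing columns stay NaN.
        shifted[:, :periods] = values[:, -periods:]
    result = pd.DataFrame(shifted, index=df.index, columns=df.columns)
    # Let pandas re-infer a concrete dtype for each column from its new values.
    return result.infer_objects()


# Same frame as in the comment above.
df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))
result = shift_columns_by_position(df)
print(result)
```

With this approach each column takes the values of its left-hand neighbour, so the resulting dtypes follow the moved values rather than the original column dtypes.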
Hi, can I work on this issue?
Shifts are done on NDFrames and NDFrame executes
Closed by #35578
Setup
Outputs
Problem description
See the associated Stack Overflow question.
The shift places a column's values into the next column that shares the same dtype. The expectation is that the column's values are placed into the adjacent column.
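The expectation stated above can be turned into a small check. The sketch below is an assumption-laden illustration: it reuses the mixed-dtype frame from the investigation comment (not the reporter's original setup, which is not shown here), and `moved_to_adjacent_column` is a hypothetical helper name. The `reported` frame is transcribed from the result quoted in that comment.

```python
import numpy as np
import pandas as pd


def moved_to_adjacent_column(original, shifted):
    """Hypothetical check: True if column i of `original` landed in column i + 1."""
    return all(
        shifted.iloc[:, i + 1].tolist() == original.iloc[:, i].tolist()
        for i in range(original.shape[1] - 1)
    )


# Mixed-dtype frame taken from the investigation comment in this thread.
df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))

# The result quoted in the thread: values jump to the next column of the same dtype.
reported = pd.DataFrame(dict(
    A=[np.nan, np.nan], B=[np.nan, np.nan], C=[np.nan, np.nan],
    D=[3., 4.], E=[1., 2.], F=['X', 'Y']
))

# The reported result does not satisfy the positional expectation.
print(moved_to_adjacent_column(df, reported))

# On a pandas release that includes the fix, this is expected to satisfy it.
print(moved_to_adjacent_column(df, df.shift(axis=1)))
```

Running the second check against an installed pandas version is a quick way to tell whether that version still shifts block by block.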
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1062.4.1.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.12
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : 0.4.0
xlsxwriter : None
lxml.etree : 4.3.4
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: 0.7.4
bs4 : 4.7.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.0
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.11.1
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.5
tables : 3.5.2
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None