Skip to content

ENH: Add strings_as_fixed_length parameter for df.to_records() (#18146) #22229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Dec 30, 2018

Conversation

qinghao1
Copy link
Contributor

@qinghao1 qinghao1 commented Aug 7, 2018

This option changes DataFrame.to_records() dtype for string arrays
to 'Sx', where x is the length of the longest string, instead of 'O"

@qinghao1 qinghao1 changed the title ENH: Add string_as_bytes option for df.to_records() (#18146) ENH: Add strings_as_bytes parameter for df.to_records() (#18146) Aug 7, 2018
@qinghao1 qinghao1 changed the title ENH: Add strings_as_bytes parameter for df.to_records() (#18146) ENH: Add strings_as_bytes parameter for df.to_records() (#18146) Aug 7, 2018
@codecov
Copy link

codecov bot commented Aug 7, 2018

Codecov Report

Merging #22229 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22229      +/-   ##
==========================================
+ Coverage   92.31%   92.31%   +<.01%     
==========================================
  Files         166      166              
  Lines       52391    52426      +35     
==========================================
+ Hits        48363    48397      +34     
- Misses       4028     4029       +1
Flag Coverage Δ
#multiple 90.73% <100%> (ø) ⬆️
#single 42.82% <17.24%> (-0.23%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 96.95% <100%> (+0.04%) ⬆️
pandas/core/dtypes/inference.py 98.46% <100%> (+0.07%) ⬆️
pandas/core/arrays/timedeltas.py 87.42% <0%> (-0.26%) ⬇️
pandas/util/testing.py 87.68% <0%> (-0.1%) ⬇️
pandas/core/arrays/period.py 98.35% <0%> (-0.04%) ⬇️
pandas/core/arrays/datetimes.py 97.91% <0%> (+0.17%) ⬆️
pandas/core/arrays/datetimelike.py 95.75% <0%> (+0.27%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c4ac0d6...ec69fe0. Read the comment docs.

@qinghao1 qinghao1 force-pushed the string-dtype branch 5 times, most recently from 04624a7 to 7672439 Compare August 7, 2018 09:34
@gfyoung gfyoung added Enhancement Output-Formatting __repr__ of pandas objects, to_string Dtype Conversions Unexpected or buggy dtype conversions labels Aug 7, 2018
@qinghao1
Copy link
Contributor Author

qinghao1 commented Aug 8, 2018

Thanks for the comments! Made the changes you requested.

@qinghao1 qinghao1 force-pushed the string-dtype branch 3 times, most recently from ff9a0e3 to 4b02c89 Compare August 10, 2018 01:39
@qinghao1
Copy link
Contributor Author

qinghao1 commented Aug 10, 2018

@jreback Thanks for the feedback! I've made the changes. They are squashed into a new commit and rebased on top of upstream/master.

@qinghao1 qinghao1 changed the title ENH: Add strings_as_bytes parameter for df.to_records() (#18146) ENH: Add strings_as_fixed_length parameter for df.to_records() (#18146) Aug 10, 2018
@jreback
Copy link
Contributor

jreback commented Sep 23, 2018

can you rebase

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

can you merge master

@jreback
Copy link
Contributor

jreback commented Dec 14, 2018

@gfyoung would you be able to merge master and update this

@gfyoung
Copy link
Member

gfyoung commented Dec 15, 2018

@jreback: Sure. I’ll have time next week.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 28, 2018 via email

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 28, 2018 via email

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 28, 2018 via email

@TomAugspurger
Copy link
Contributor

Does anyone have thoughts on the API discussion for combining index_dtypes and column_dtypes?

@gfyoung
Copy link
Member

gfyoung commented Dec 28, 2018

@TomAugspurger : I actually commented about this earlier, but not seeing it here in the comments...weird...

I actually prefer to keep separate, in part because I suspect the use case in the wild will largely focus on the columns, not the indices. Thus, it makes sense to keep separate to facilitate this use case, while still making it possible to specify dtypes for the indices if so desired.

Also, I prefer to avoid nested dict to reduce complexity for end users (in the case when we pass in different dtypes for columns and index).

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 28, 2018 via email

@gfyoung
Copy link
Member

gfyoung commented Dec 29, 2018

@jreback @TomAugspurger : Managed to expand is_dict_like without issues, and I added a test in to_records to ensure that generic dict-like's are accepted. PTAL.

@gfyoung
Copy link
Member

gfyoung commented Dec 30, 2018

@jreback : Added a test for the expanded is_dict_like, which is all green. PTAL.

@jreback jreback added this to the 0.24.0 milestone Dec 30, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung lgtm. just a couple of minor things. merge on green.

Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes pandas-devgh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.
The original parameter was causing a lot of acrobatics
with regards to string dtypes between 2.x and 3.x.

The new parameters simplify the internal logic and
pass the responsibility and motivation of memory
efficiency back to the users.
More generic than checking whether our
mappings are instances of dict.

Expands is_dict_like check to include
whether it has a __contains__ method.
@gfyoung gfyoung merged commit 0769688 into pandas-dev:master Dec 30, 2018
@gfyoung
Copy link
Member

gfyoung commented Dec 30, 2018

Thanks for the help, @qinghao1 !

thoo added a commit to thoo/pandas that referenced this pull request Dec 30, 2018
* upstream/master:
  REF/TST: replace capture_stdout with pytest capsys fixture (pandas-dev#24501)
  BUG: fix .iat assignment creates a new column (pandas-dev#24495)
  DOC: add checks on the returns section in the docstrings (pandas-dev#23138) (pandas-dev#23432)
  ENH: Add strings_as_fixed_length parameter for df.to_records() (pandas-dev#18146) (pandas-dev#22229)
  TST: Skip db tests unless explicitly specified in -m pattern (pandas-dev#24492)
  Mix EA into DTA/TDA; part of 24024 (pandas-dev#24502)
  DOC: Fix building of a single API document (pandas-dev#24506)
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
…s-dev#18146) (pandas-dev#22229)

* ENH: Allow fixed-length strings in df.to_records()

Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes pandas-devgh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.

* Add dtype parameters instead of fix-string-like

The original parameter was causing a lot of acrobatics
with regards to string dtypes between 2.x and 3.x.

The new parameters simplify the internal logic and
pass the responsibility and motivation of memory
efficiency back to the users.

* MAINT: Use is_dict_like in to_records

More generic than checking whether our
mappings are instances of dict.

Expands is_dict_like check to include
whether it has a __contains__ method.

* TST: Add test for is_dict_like expanded def

* MAINT: Address final comments
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
…s-dev#18146) (pandas-dev#22229)

* ENH: Allow fixed-length strings in df.to_records()

Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes pandas-devgh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.

* Add dtype parameters instead of fix-string-like

The original parameter was causing a lot of acrobatics
with regards to string dtypes between 2.x and 3.x.

The new parameters simplify the internal logic and
pass the responsibility and motivation of memory
efficiency back to the users.

* MAINT: Use is_dict_like in to_records

More generic than checking whether our
mappings are instances of dict.

Expands is_dict_like check to include
whether it has a __contains__ method.

* TST: Add test for is_dict_like expanded def

* MAINT: Address final comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: add to_records() option to output NumPy string dtypes, not objects
5 participants