ENH: Add strings_as_fixed_length parameter for df.to_records() (#18146) #22229

qinghao1 · 2018-08-07T03:49:30Z

This option changes the dtype that DataFrame.to_records() uses for string arrays
to 'Sx', where x is the length of the longest string, instead of 'O'.

closes ENH: add to_records() option to output NumPy string dtypes, not objects #18146
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
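
To make the 'Sx' format from the description concrete, here is a minimal plain-Python sketch of how such a format string could be derived from a column of strings. The helper name `fixed_length_bytes_format` is hypothetical, and measuring length after UTF-8 encoding is an assumption of this sketch; the PR itself may compute the width differently.

```python
def fixed_length_bytes_format(values, encoding="utf-8"):
    """Return a NumPy-style format string such as 'S3' for a list of str.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # 'S' dtypes hold bytes, so measure each string after encoding it.
    longest = max((len(v.encode(encoding)) for v in values), default=0)
    # Use a floor of 1 so an empty column still gets a non-zero width
    # (a choice made for this sketch).
    return "S{}".format(max(longest, 1))


fmt = fixed_length_bytes_format(["a", "abc", "ab"])
print(fmt)
```

The resulting string is the kind of value that could be passed wherever NumPy accepts a dtype specification.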

codecov · 2018-08-07T04:31:41Z

Codecov Report

Merging #22229 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22229      +/-   ##
==========================================
+ Coverage   92.31%   92.31%   +<.01%     
==========================================
  Files         166      166              
  Lines       52391    52426      +35     
==========================================
+ Hits        48363    48397      +34     
- Misses       4028     4029       +1

Flag	Coverage Δ
#multiple	`90.73% <100%> (ø)`	⬆️
#single	`42.82% <17.24%> (-0.23%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`96.95% <100%> (+0.04%)`	⬆️
pandas/core/dtypes/inference.py	`98.46% <100%> (+0.07%)`	⬆️
pandas/core/arrays/timedeltas.py	`87.42% <0%> (-0.26%)`	⬇️
pandas/util/testing.py	`87.68% <0%> (-0.1%)`	⬇️
pandas/core/arrays/period.py	`98.35% <0%> (-0.04%)`	⬇️
pandas/core/arrays/datetimes.py	`97.91% <0%> (+0.17%)`	⬆️
pandas/core/arrays/datetimelike.py	`95.75% <0%> (+0.27%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c4ac0d6...ec69fe0.

pandas/core/frame.py

qinghao1 · 2018-08-08T03:00:57Z

Thanks for the comments! Made the changes you requested.

pandas/core/frame.py

pandas/tests/frame/test_convert_to.py

qinghao1 · 2018-08-10T01:39:58Z

@jreback Thanks for the feedback! I've made the changes. They are squashed into a new commit and rebased on top of upstream/master.

pandas/core/frame.py

pandas/tests/frame/test_convert_to.py

pandas/core/frame.py

jreback · 2018-09-23T21:41:44Z

can you rebase

jreback · 2018-12-03T01:44:28Z

can you merge master

jreback · 2018-12-14T15:57:14Z

@gfyoung would you be able to merge master and update this

gfyoung · 2018-12-15T00:25:04Z

@jreback: Sure. I’ll have time next week.

TomAugspurger · 2018-12-28T17:09:14Z

collections.abc.Mapping ensures that. Seems like exactly what we want here. Messing with the various is_*_like methods makes me nervous.

On Fri, Dec 28, 2018 at 11:06 AM gfyoung wrote:

> In pandas/core/frame.py:
>
> > +
> > +        index_len = len(index_names)
> > +        formats = []
> > +
> > +        for i, v in enumerate(arrays):
> > +            index = i
> > +
> > +            if index < index_len:
> > +                dtype_mapping = index_dtypes
> > +                name = index_names[index]
> > +            else:
> > +                index -= index_len
> > +                dtype_mapping = column_dtypes
> > +                name = self.columns[index]
> > +
> > +            if isinstance(dtype_mapping, dict):
>
> One thing though: is_dict_like does not guarantee existence of the __contains__ method, which is what my implementation also relies on. I could do some try-except logic to make it more generic, or I could make the is_dict_like check slightly stricter. Thoughts?
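
The point about `__contains__` can be illustrated with a small sketch. Both classes below are hypothetical and exist only for this example: one exposes just `keys()` and `__getitem__`, the other subclasses `collections.abc.Mapping`, which supplies `__contains__` as a mixin method.

```python
from collections.abc import Mapping


class KeysOnly:
    """Hypothetical duck-typed object with keys() and __getitem__ only."""

    def __init__(self, data):
        self._data = dict(data)

    def keys(self):
        return self._data.keys()

    def __getitem__(self, key):
        return self._data[key]


class DtypeMap(Mapping):
    """Hypothetical Mapping subclass; __contains__ comes from the ABC."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


loose = KeysOnly({"A": "int64"})
strict = DtypeMap({"A": "int64"})

# The duck-typed object never defined __contains__; the Mapping inherits it.
has_contains_loose = hasattr(loose, "__contains__")
has_contains_strict = hasattr(strict, "__contains__")
print(has_contains_loose, has_contains_strict)
```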

TomAugspurger · 2018-12-28T17:29:22Z

FWIW, collections.abc.Mapping is *the* definition of dict-like in the Python world.

On Fri, Dec 28, 2018 at 11:15 AM Jeff Reback wrote:

> In pandas/core/frame.py:
>
> > +
> > +        index_len = len(index_names)
> > +        formats = []
> > +
> > +        for i, v in enumerate(arrays):
> > +            index = i
> > +
> > +            if index < index_len:
> > +                dtype_mapping = index_dtypes
> > +                name = index_names[index]
> > +            else:
> > +                index -= index_len
> > +                dtype_mapping = column_dtypes
> > +                name = self.columns[index]
> > +
> > +            if isinstance(dtype_mapping, dict):
>
> I really hate to use other things because they are completely unmaintainable in the codebase, hey just pick your favorite thing to check (dict, collections.Mapping), oh and you need something else, just reinvent it.
>
> This is the reason we have the is_* things.

TomAugspurger · 2018-12-28T17:37:11Z

Unfortunately, isinstance checks on ABCs are relatively slow, so I don't think they should be used everywhere.
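
One way to check the relative cost on a given interpreter is a small `timeit` comparison such as the sketch below. The three check functions are hypothetical stand-ins written for this example, and the numbers will vary by Python version and machine, so no particular result is implied.

```python
import timeit
from collections.abc import Mapping

sample = {"A": "int64"}


def check_concrete():
    # isinstance against the concrete dict type
    return isinstance(sample, dict)


def check_abc():
    # isinstance against the abstract base class
    return isinstance(sample, Mapping)


def check_duck():
    # attribute-based check, in the spirit of the is_* helpers
    return hasattr(sample, "keys") and hasattr(sample, "__getitem__")


timings = {
    fn.__name__: timeit.timeit(fn, number=100_000)
    for fn in (check_concrete, check_abc, check_duck)
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print("{:<15} {:.4f}s".format(name, seconds))
```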

TomAugspurger · 2018-12-28T18:28:05Z

Does anyone have thoughts on the API discussion for combining index_dtypes and column_dtypes?

gfyoung · 2018-12-28T18:38:03Z

@TomAugspurger : I actually commented about this earlier, but not seeing it here in the comments...weird...

I actually prefer to keep separate, in part because I suspect the use case in the wild will largely focus on the columns, not the indices. Thus, it makes sense to keep separate to facilitate this use case, while still making it possible to specify dtypes for the indices if so desired.

Also, I prefer to avoid nested dict to reduce complexity for end users (in the case when we pass in different dtypes for columns and index).
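
As a rough illustration of the two-parameter design, the sketch below mirrors the loop quoted earlier in the thread using plain Python lists. The function name `resolve_formats` is hypothetical, and everything after the `isinstance` check (per-name lookup, a single dtype applied to all entries, fallback to a default) is an assumption about how the branches could be completed, not the merged implementation.

```python
def resolve_formats(index_names, column_names, default_formats,
                    index_dtypes=None, column_dtypes=None):
    """Pick a dtype for each index level, then each column (sketch)."""
    index_len = len(index_names)
    formats = []

    for i, default in enumerate(default_formats):
        if i < index_len:
            dtype_mapping = index_dtypes
            name = index_names[i]
        else:
            dtype_mapping = column_dtypes
            name = column_names[i - index_len]

        if isinstance(dtype_mapping, dict):
            # Per-name override; keep the default when the name is absent.
            formats.append(dtype_mapping.get(name, default))
        elif dtype_mapping is not None:
            # Assumption: a single dtype applies to every entry on that axis.
            formats.append(dtype_mapping)
        else:
            formats.append(default)

    return formats


# Column-only override, the use case expected to be most common.
columns_only = resolve_formats(
    ["idx"], ["A", "B"], ["<i8", "<i8", "O"], column_dtypes={"B": "<U1"}
)
print(columns_only)
```

With separate parameters, the common column-only case needs a single flat dict, and index dtypes can still be given independently without nesting.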

TomAugspurger · 2018-12-28T19:30:10Z

Fair enough.


gfyoung · 2018-12-29T02:43:19Z

@jreback @TomAugspurger : Managed to expand is_dict_like without issues, and I added a test in to_records to ensure that generic dict-like's are accepted. PTAL.
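
A stricter duck-typed check of the kind described here might look like the following sketch. The name `looks_dict_like` and the helper classes are hypothetical; the attribute list follows the discussion above (requiring `__contains__` in addition to `keys` and `__getitem__`), and the real `is_dict_like` in pandas may differ in detail.

```python
def looks_dict_like(obj):
    """Return True if obj exposes the attributes the lookup code relies on."""
    required = ("__getitem__", "keys", "__contains__")
    return all(hasattr(obj, attr) for attr in required)


class NoContains:
    """Hypothetical object that would pass a looser keys/__getitem__ check."""

    def keys(self):
        return []

    def __getitem__(self, key):
        raise KeyError(key)


class WithContains(NoContains):
    """Same as above, plus an explicit __contains__."""

    def __contains__(self, key):
        return False


results = [
    looks_dict_like({}),
    looks_dict_like([]),
    looks_dict_like(NoContains()),
    looks_dict_like(WithContains()),
]
print(results)
```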

pandas/core/dtypes/inference.py

gfyoung · 2018-12-30T02:17:20Z

@jreback : Added a test for the expanded is_dict_like, which is all green. PTAL.

pandas/core/frame.py

pandas/tests/dtypes/test_inference.py

jreback

@gfyoung lgtm. just a couple of minor things. merge on green.

@qinghao1

Adds parameter to allow string-like columns to be cast as fixed-length string-like dtypes for more efficient storage. Closes gh-18146. Originally authored by @qinghao1 but cleaned up by @gfyoung to fix merge conflicts.

The original parameter was causing a lot of acrobatics with regards to string dtypes between 2.x and 3.x. The new parameters simplify the internal logic and pass the responsibility and motivation of memory efficiency back to the users.

More generic than checking whether our mappings are instances of dict. Expands is_dict_like check to include whether it has a __contains__ method.

gfyoung · 2018-12-30T22:40:49Z

Thanks for the help, @qinghao1 !

* upstream/master:
  REF/TST: replace capture_stdout with pytest capsys fixture (pandas-dev#24501)
  BUG: fix .iat assignment creates a new column (pandas-dev#24495)
  DOC: add checks on the returns section in the docstrings (pandas-dev#23138) (pandas-dev#23432)
  ENH: Add strings_as_fixed_length parameter for df.to_records() (pandas-dev#18146) (pandas-dev#22229)
  TST: Skip db tests unless explicitly specified in -m pattern (pandas-dev#24492)
  Mix EA into DTA/TDA; part of 24024 (pandas-dev#24502)
  DOC: Fix building of a single API document (pandas-dev#24506)

@qinghao1

…s-dev#18146) (pandas-dev#22229)

* ENH: Allow fixed-length strings in df.to_records()

  Adds parameter to allow string-like columns to be cast as fixed-length string-like dtypes for more efficient storage. Closes gh-18146. Originally authored by @qinghao1 but cleaned up by @gfyoung to fix merge conflicts.

* Add dtype parameters instead of fix-string-like

  The original parameter was causing a lot of acrobatics with regards to string dtypes between 2.x and 3.x. The new parameters simplify the internal logic and pass the responsibility and motivation of memory efficiency back to the users.

* MAINT: Use is_dict_like in to_records

  More generic than checking whether our mappings are instances of dict. Expands is_dict_like check to include whether it has a __contains__ method.

* TST: Add test for is_dict_like expanded def

* MAINT: Address final comments


qinghao1 changed the title ~~ENH: Add string_as_bytes option for df.to_records() (#18146)~~ ENH: Add strings_as_bytes parameter for df.to_records() (#18146) Aug 7, 2018

qinghao1 changed the title ~~ENH: Add strings_as_bytes parameter for df.to_records() (#18146)~~ ENH: Add strings_as_bytes parameter for df.to_records() (#18146) Aug 7, 2018

qinghao1 force-pushed the string-dtype branch from 674053d to dcb9d0e Compare August 7, 2018 04:31

qinghao1 force-pushed the string-dtype branch 5 times, most recently from 04624a7 to 7672439 Compare August 7, 2018 09:34

gfyoung added the Enhancement, Output-Formatting, and Dtype Conversions labels Aug 7, 2018

gfyoung reviewed Aug 7, 2018

pandas/core/frame.py (outdated, resolved)

gfyoung reviewed Aug 7, 2018

pandas/core/frame.py (outdated, resolved)

gfyoung requested a review from jreback August 7, 2018 16:58

qinghao1 force-pushed the string-dtype branch from 7672439 to d10605d Compare August 8, 2018 02:54

qinghao1 force-pushed the string-dtype branch from d10605d to 1a2ee8f Compare August 8, 2018 03:02

jreback requested changes Aug 8, 2018

pandas/core/frame.py (resolved)

pandas/tests/frame/test_convert_to.py (outdated, resolved)

qinghao1 force-pushed the string-dtype branch 3 times, most recently from ff9a0e3 to 4b02c89 Compare August 10, 2018 01:39

qinghao1 force-pushed the string-dtype branch from 4b02c89 to 089defc Compare August 10, 2018 05:35

qinghao1 changed the title ~~ENH: Add strings_as_bytes parameter for df.to_records() (#18146)~~ ENH: Add strings_as_fixed_length parameter for df.to_records() (#18146) Aug 10, 2018

jreback requested changes Aug 10, 2018

pandas/core/frame.py (outdated, resolved)

pandas/tests/frame/test_convert_to.py (resolved)

pandas/core/frame.py (resolved)

gfyoung force-pushed the string-dtype branch from 2028225 to 4b2eb47 Compare December 29, 2018 02:01

TomAugspurger approved these changes Dec 29, 2018

jreback reviewed Dec 29, 2018

pandas/core/dtypes/inference.py (resolved)

gfyoung force-pushed the string-dtype branch from 4b2eb47 to 75a4ded Compare December 30, 2018 01:16

jreback approved these changes Dec 30, 2018

jreback added this to the 0.24.0 milestone Dec 30, 2018

jreback reviewed Dec 30, 2018

pandas/core/frame.py (resolved)

jreback reviewed Dec 30, 2018

pandas/tests/dtypes/test_inference.py (outdated, resolved)

jreback approved these changes Dec 30, 2018

gfyoung added 4 commits December 30, 2018 20:52

ENH: Allow fixed-length strings in df.to_records()

24d1b6a

Adds parameter to allow string-like columns to be cast as fixed-length string-like dtypes for more efficient storage. Closes gh-18146. Originally authored by @qinghao1 but cleaned up by @gfyoung to fix merge conflicts.

Add dtype parameters instead of fix-string-like

fa5e1ea

The original parameter was causing a lot of acrobatics with regards to string dtypes between 2.x and 3.x. The new parameters simplify the internal logic and pass the responsibility and motivation of memory efficiency back to the users.

MAINT: Use is_dict_like in to_records

4cacb52

More generic than checking whether our mappings are instances of dict. Expands is_dict_like check to include whether it has a __contains__ method.

TST: Add test for is_dict_like expanded def

3b100c3

gfyoung force-pushed the string-dtype branch from 75a4ded to 2857f87 Compare December 30, 2018 20:59

MAINT: Address final comments

ec69fe0

gfyoung force-pushed the string-dtype branch from 2857f87 to ec69fe0 Compare December 30, 2018 21:00

gfyoung merged commit 0769688 into pandas-dev:master Dec 30, 2018

qwhelan mentioned this pull request Jan 24, 2019

BUG: support dtypes in column_dtypes for to_records() #24895

Merged

4 tasks
