-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add strings_as_fixed_length parameter for df.to_records() (#18146) #22229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
strings_as_bytes
parameter for df.to_records()
(#18146)
strings_as_bytes
parameter for df.to_records()
(#18146)
Codecov Report
@@ Coverage Diff @@
## master #22229 +/- ##
==========================================
+ Coverage 92.31% 92.31% +<.01%
==========================================
Files 166 166
Lines 52391 52426 +35
==========================================
+ Hits 48363 48397 +34
- Misses 4028 4029 +1
Continue to review full report at Codecov.
|
04624a7
to
7672439
Compare
Thanks for the comments! Made the changes you requested. |
ff9a0e3
to
4b02c89
Compare
@jreback Thanks for the feedback! I've made the changes. They are squashed into a new commit and rebased on top of upstream/master. |
can you rebase |
can you merge master |
@gfyoung would you be able to merge master and update this |
@jreback: Sure. I’ll have time next week. |
collections.abc.Mapping ensures that. Seems like exactly what we want here.
Messing with the various is_*_like methods makes me nervous.
…On Fri, Dec 28, 2018 at 11:06 AM gfyoung ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/core/frame.py
<#22229 (comment)>:
> +
+ index_len = len(index_names)
+ formats = []
+
+ for i, v in enumerate(arrays):
+ index = i
+
+ if index < index_len:
+ dtype_mapping = index_dtypes
+ name = index_names[index]
+ else:
+ index -= index_len
+ dtype_mapping = column_dtypes
+ name = self.columns[index]
+
+ if isinstance(dtype_mapping, dict):
One thing though: is_dict_like does not guarantee existence of the
__contains__ method, which is what my implementation also relies on.
I could do some try-except logic to make it more generic, or I could make
the is_dict_like check slightly stricter.
Thoughts?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#22229 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIo4XYuy2E7pohfVAISp-Np7aMnRMks5u9k-jgaJpZM4VxcGz>
.
|
FWIW, collections.abc.Mapping is *the* definition of dict-like in the
Python world.
…On Fri, Dec 28, 2018 at 11:15 AM Jeff Reback ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/core/frame.py
<#22229 (comment)>:
> +
+ index_len = len(index_names)
+ formats = []
+
+ for i, v in enumerate(arrays):
+ index = i
+
+ if index < index_len:
+ dtype_mapping = index_dtypes
+ name = index_names[index]
+ else:
+ index -= index_len
+ dtype_mapping = column_dtypes
+ name = self.columns[index]
+
+ if isinstance(dtype_mapping, dict):
I really hate to use other things because they are completely
unmaintanable in the codebase, hey just pick your favorite thing to check
(dict, collections.Mapping), oh and you need something else, just reinvent
it.
This is the reason we have the is_* things.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#22229 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIq0eEBMyVHCUczCsJStAAl1qhoU-ks5u9lGbgaJpZM4VxcGz>
.
|
Unfortunately, isinstance checks on ABCs are relatively slow, so I don't
think it should be used everywhere.
On Fri, Dec 28, 2018 at 11:29 AM Tom Augspurger <[email protected]>
wrote:
… FWIW, collections.abc.Mapping is *the* definition of dict-like in the
Python world.
On Fri, Dec 28, 2018 at 11:15 AM Jeff Reback ***@***.***>
wrote:
> ***@***.**** commented on this pull request.
> ------------------------------
>
> In pandas/core/frame.py
> <#22229 (comment)>:
>
> > +
> + index_len = len(index_names)
> + formats = []
> +
> + for i, v in enumerate(arrays):
> + index = i
> +
> + if index < index_len:
> + dtype_mapping = index_dtypes
> + name = index_names[index]
> + else:
> + index -= index_len
> + dtype_mapping = column_dtypes
> + name = self.columns[index]
> +
> + if isinstance(dtype_mapping, dict):
>
> I really hate to use other things because they are completely
> unmaintanable in the codebase, hey just pick your favorite thing to check
> (dict, collections.Mapping), oh and you need something else, just reinvent
> it.
>
> This is the reason we have the is_* things.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#22229 (comment)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/ABQHIq0eEBMyVHCUczCsJStAAl1qhoU-ks5u9lGbgaJpZM4VxcGz>
> .
>
|
Does anyone have thoughts on the API discussion for combining |
@TomAugspurger : I actually commented about this earlier, but not seeing it here in the comments...weird... I actually prefer to keep separate, in part because I suspect the use case in the wild will largely focus on the columns, not the indices. Thus, it makes sense to keep separate to facilitate this use case, while still making it possible to specify dtypes for the indices if so desired. Also, I prefer to avoid nested dict to reduce complexity for end users (in the case when we pass in different dtypes for columns and index). |
Fair enough.
…On Fri, Dec 28, 2018 at 12:38 PM gfyoung ***@***.***> wrote:
@TomAugspurger <https://github.com/TomAugspurger> : I actually commented
about this earlier, but not seeing it here in the comments...weird...
I actually prefer to keep separate, in part because I suspect the use case
in the wild will largely focus on the columns, not the indices. Thus, it
makes sense to keep separate to facilitate this use case, while still
making it possible to specify dtypes for the indices if so desired.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#22229 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIkK0L3_zTj-1nfoSCKnfRWaOVmeyks5u9mUPgaJpZM4VxcGz>
.
|
2028225
to
4b2eb47
Compare
@jreback @TomAugspurger : Managed to expand |
4b2eb47
to
75a4ded
Compare
@jreback : Added a test for the expanded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfyoung lgtm. just a couple of minor things. merge on green.
Adds parameter to allow string-like columns to be cast as fixed-length string-like dtypes for more efficient storage. Closes pandas-devgh-18146. Originally authored by @qinghao1 but cleaned up by @gfyoung to fix merge conflicts.
The original parameter was causing a lot of acrobatics with regards to string dtypes between 2.x and 3.x. The new parameters simplify the internal logic and pass the responsibility and motivation of memory efficiency back to the users.
More generic than checking whether our mappings are instances of dict. Expands is_dict_like check to include whether it has a __contains__ method.
75a4ded
to
2857f87
Compare
2857f87
to
ec69fe0
Compare
Thanks for the help, @qinghao1 ! |
* upstream/master: REF/TST: replace capture_stdout with pytest capsys fixture (pandas-dev#24501) BUG: fix .iat assignment creates a new column (pandas-dev#24495) DOC: add checks on the returns section in the docstrings (pandas-dev#23138) (pandas-dev#23432) ENH: Add strings_as_fixed_length parameter for df.to_records() (pandas-dev#18146) (pandas-dev#22229) TST: Skip db tests unless explicitly specified in -m pattern (pandas-dev#24492) Mix EA into DTA/TDA; part of 24024 (pandas-dev#24502) DOC: Fix building of a single API document (pandas-dev#24506)
…s-dev#18146) (pandas-dev#22229) * ENH: Allow fixed-length strings in df.to_records() Adds parameter to allow string-like columns to be cast as fixed-length string-like dtypes for more efficient storage. Closes pandas-devgh-18146. Originally authored by @qinghao1 but cleaned up by @gfyoung to fix merge conflicts. * Add dtype parameters instead of fix-string-like The original parameter was causing a lot of acrobatics with regards to string dtypes between 2.x and 3.x. The new parameters simplify the internal logic and pass the responsibility and motivation of memory efficiency back to the users. * MAINT: Use is_dict_like in to_records More generic than checking whether our mappings are instances of dict. Expands is_dict_like check to include whether it has a __contains__ method. * TST: Add test for is_dict_like expanded def * MAINT: Address final comments
…s-dev#18146) (pandas-dev#22229) * ENH: Allow fixed-length strings in df.to_records() Adds parameter to allow string-like columns to be cast as fixed-length string-like dtypes for more efficient storage. Closes pandas-devgh-18146. Originally authored by @qinghao1 but cleaned up by @gfyoung to fix merge conflicts. * Add dtype parameters instead of fix-string-like The original parameter was causing a lot of acrobatics with regards to string dtypes between 2.x and 3.x. The new parameters simplify the internal logic and pass the responsibility and motivation of memory efficiency back to the users. * MAINT: Use is_dict_like in to_records More generic than checking whether our mappings are instances of dict. Expands is_dict_like check to include whether it has a __contains__ method. * TST: Add test for is_dict_like expanded def * MAINT: Address final comments
This option changes DataFrame.to_records() dtype for string arrays
to 'Sx', where x is the length of the longest string, instead of 'O"
git diff upstream/master -u -- "*.py" | flake8 --diff