Skip to content

BUG: avoid "b" prefix for bytes in to_csv() on Python 3 (#9712) #13890

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

jzwinck
Copy link
Contributor

@jzwinck jzwinck commented Aug 3, 2016

@@ -1378,6 +1378,12 @@ def __init__(self, obj, path_or_buf=None, sep=",", na_rep='',
self.has_mi_columns = (isinstance(obj.columns, MultiIndex) and
not self.tupleize_cols)

# in Python 3, decode bytes to str so strings print without b''
if sys.version_info[0] >= 3:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use compat.PY3

if self.bytes_encoding:
if ix.dtype == object:
lib.object_array_decode_bytes(ix, self.bytes_encoding)
for arr in self.data:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be pushed down to to_native_types as an addtional parameter.

@sinhrks sinhrks added Bug IO CSV read_csv, to_csv labels Aug 3, 2016
@codecov-io
Copy link

codecov-io commented Aug 3, 2016

Current coverage is 85.31% (diff: 90.00%)

Merging #13890 into master will increase coverage by <.01%

@@             master     #13890   diff @@
==========================================
  Files           139        139          
  Lines         50143      50161    +18   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42777      42793    +16   
- Misses         7366       7368     +2   
  Partials          0          0          

Powered by Codecov. Last update 7e15923...4d88771

@@ -2020,6 +2020,14 @@ def re_replacer(s):

return block

def to_native_types(self, slicer=None, na_rep='nan', quoting=None,
bytes_encoding=None, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to Block, where to_native_types already exists, you just need to incorporate this argument after stringifying (we have another conditional if quoting or not, but and am not sure if you need to do this or not for that one as I think the csv writer handles that).

around line 590.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it in ObjectBlock because that's the only class that could contain objects--correct? Given that a precondition of the code I'm adding is dtype==object, are you sure you'd want the logic to go in the base class for all types (even if it's overridden by some)?

@jzwinck
Copy link
Contributor Author

jzwinck commented Aug 5, 2016

OK so let me summarize where this PR is at:

  1. Cython crashes on the Python 2 build only. I haven't investigated this yet, because:
  2. @jreback seems to dislike the fact that new code is required to fix this problem, so:
  3. I posted a question on Stack Overflow to see if Python 3's csv module can be made to do the heavy lifting, but:
  4. Writing bytes without the prefixes and quotes does not seem to be supported in Python 3 directly.

It might be interesting to try to get support in Python 3.7 (or whatever) to write bytes as plain strings in the csv module. But this would take a long time to reach Pandas users. I do think the patch I've provided here is useful and appropriate, and I do not have a simpler one to offer.

I hope we can get #9712 fixed somehow. PRs welcome! ;-)

""" slice and dice then format """
values = self
if slicer is not None:
values = values[slicer]
return values._format_native_types(**kwargs)
result = values._format_native_types(**kwargs)
if bytes_encoding is not None and result.dtype == object:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use is_object_dtype function

@jreback
Copy link
Contributor

jreback commented Aug 6, 2016

@jzwinck I have made some more comments. Please use the existing functions. It is not clear that adding code is needed here, it is just more repeated code and adds to the maintenance burden.

@jreback
Copy link
Contributor

jreback commented Sep 9, 2016

can you rebase / update?

@jzwinck
Copy link
Contributor Author

jzwinck commented Sep 13, 2016

@jreback I'm not sure what you're asking for. I know what rebase means but the last time we spoke about this PR you had fundamental problems with the way I was approaching it. I don't have much time these days to do anything more with it.

@jreback
Copy link
Contributor

jreback commented Sep 13, 2016

@jzwinck rebase and update means exactly what it says. rebase on master and respond to comments. pls reopen if you want to continue work.

@jreback jreback closed this Sep 13, 2016
@jorisvandenbossche
Copy link
Member

@jzwinck Looking back at the comments, I think one of the main suggestions of @jreback is to use the existing pd.core.strings.str_decode function instead of adding your new cython object_array_decode_bytes function, which seems to do exactly what you need from a quick glance (only cannot deal with mixed types).
What is the reason you cannot use this?

In [73]: pd.core.strings.str_decode(np.array([b'a', np.nan], dtype=object), encoding='UTF8')
Out[73]: array(['a', nan], dtype=object)

@jzwinck
Copy link
Contributor Author

jzwinck commented Sep 14, 2016

@jorisvandenbossche As you say, object_array_decode_bytes does not deal with mixed types. I wanted the patch to work even in the presence of mixed types, and included a unit test demonstrating this. Do you think it is better to make a patch which does not address the case of mixed types?

In any case, I have limited time to work on this further now.

@jreback
Copy link
Contributor

jreback commented Sep 14, 2016

the mixed-dtype case in not relevant (nor tested). expanding the code base at the expense of readability is not necessary nor warranted. as I said, if you want to fix according to comments, pls do so. or maybe someone else will pick this up.

@jzwinck
Copy link
Contributor Author

jzwinck commented Sep 14, 2016

@jreback If I am the user, and I am telling you I want support for mixed dtypes, how do you then conclude it is "not relevant?" Also you are mistaken when you say it is "not tested" because it plainly is in the unit test I added: (b'hello', ['a', b'b']),.

@jreback
Copy link
Contributor

jreback commented Sep 14, 2016

@jzwinck mixed-types (in a single array) are a complete disaster. I would rather not make them worse by adding additional code which makes them worse.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 14, 2016

@jzwinck Although the wording of @jreback was maybe a bit harsh on the 'relevant' aspect, there is a difference between "relevant for you as a user" and "relevant enough in general to warrant additional code-complexity in pandas" (and to be clear: I am not not saying it is one of either, just noting that it is debatable)
I am not in a good position to judge its relevance as I don't have a personal use case for it, but if supporting writing bytes to csv without "b" for columns with all bytes (so not mixed) is less complex to implement (because you can use existing functionality), IMO it is seems better to start with that, as it probably already solves the main use cases?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_csv and bytes on Python 3.
6 participants