BUG: avoid "b" prefix for bytes in to_csv() on Python 3 (#9712) #13890

jzwinck · 2016-08-03T19:34:05Z

closes to_csv and bytes on Python 3. #9712
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

jreback · 2016-08-03T19:37:18Z

pandas/formats/format.py

@@ -1378,6 +1378,12 @@ def __init__(self, obj, path_or_buf=None, sep=",", na_rep='',
        self.has_mi_columns = (isinstance(obj.columns, MultiIndex) and
                               not self.tupleize_cols)

+        # in Python 3, decode bytes to str so strings print without b''
+        if sys.version_info[0] >= 3:


use compat.PY3

jreback · 2016-08-03T19:42:58Z

pandas/formats/format.py

+        if self.bytes_encoding:
+            if ix.dtype == object:
+                lib.object_array_decode_bytes(ix, self.bytes_encoding)
+            for arr in self.data:


this needs to be pushed down to to_native_types as an addtional parameter.

codecov-io · 2016-08-03T22:11:11Z

Current coverage is 85.31% (diff: 90.00%)

Merging #13890 into master will increase coverage by <.01%

@@             master     #13890   diff @@
==========================================
  Files           139        139          
  Lines         50143      50161    +18   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42777      42793    +16   
- Misses         7366       7368     +2   
  Partials          0          0

Powered by Codecov. Last update 7e15923...4d88771

)

jreback · 2016-08-03T22:31:31Z

pandas/core/internals.py

@@ -2020,6 +2020,14 @@ def re_replacer(s):

        return block

+    def to_native_types(self, slicer=None, na_rep='nan', quoting=None,
+                        bytes_encoding=None, **kwargs):


move this to Block, where to_native_types already exists, you just need to incorporate this argument after stringifying (we have another conditional if quoting or not, but and am not sure if you need to do this or not for that one as I think the csv writer handles that).

around line 590.

I put it in ObjectBlock because that's the only class that could contain objects--correct? Given that a precondition of the code I'm adding is dtype==object, are you sure you'd want the logic to go in the base class for all types (even if it's overridden by some)?

jzwinck · 2016-08-05T22:52:45Z

OK so let me summarize where this PR is at:

Cython crashes on the Python 2 build only. I haven't investigated this yet, because:
@jreback seems to dislike the fact that new code is required to fix this problem, so:
I posted a question on Stack Overflow to see if Python 3's csv module can be made to do the heavy lifting, but:
Writing bytes without the prefixes and quotes does not seem to be supported in Python 3 directly.

It might be interesting to try to get support in Python 3.7 (or whatever) to write bytes as plain strings in the csv module. But this would take a long time to reach Pandas users. I do think the patch I've provided here is useful and appropriate, and I do not have a simpler one to offer.

I hope we can get #9712 fixed somehow. PRs welcome! ;-)

jreback · 2016-08-06T23:10:09Z

pandas/indexes/base.py

        """ slice and dice then format """
        values = self
        if slicer is not None:
            values = values[slicer]
-        return values._format_native_types(**kwargs)
+        result = values._format_native_types(**kwargs)
+        if bytes_encoding is not None and result.dtype == object:


use is_object_dtype function

jreback · 2016-08-06T23:12:56Z

@jzwinck I have made some more comments. Please use the existing functions. It is not clear that adding code is needed here, it is just more repeated code and adds to the maintenance burden.

jreback · 2016-09-09T22:36:33Z

can you rebase / update?

jzwinck · 2016-09-13T07:16:57Z

@jreback I'm not sure what you're asking for. I know what rebase means but the last time we spoke about this PR you had fundamental problems with the way I was approaching it. I don't have much time these days to do anything more with it.

jreback · 2016-09-13T10:05:42Z

@jzwinck rebase and update means exactly what it says. rebase on master and respond to comments. pls reopen if you want to continue work.

jorisvandenbossche · 2016-09-13T22:44:19Z

@jzwinck Looking back at the comments, I think one of the main suggestions of @jreback is to use the existing pd.core.strings.str_decode function instead of adding your new cython object_array_decode_bytes function, which seems to do exactly what you need from a quick glance (only cannot deal with mixed types).
What is the reason you cannot use this?

In [73]: pd.core.strings.str_decode(np.array([b'a', np.nan], dtype=object), encoding='UTF8')
Out[73]: array(['a', nan], dtype=object)

jzwinck · 2016-09-14T01:56:34Z

@jorisvandenbossche As you say, object_array_decode_bytes does not deal with mixed types. I wanted the patch to work even in the presence of mixed types, and included a unit test demonstrating this. Do you think it is better to make a patch which does not address the case of mixed types?

In any case, I have limited time to work on this further now.

jreback · 2016-09-14T01:58:38Z

the mixed-dtype case in not relevant (nor tested). expanding the code base at the expense of readability is not necessary nor warranted. as I said, if you want to fix according to comments, pls do so. or maybe someone else will pick this up.

jzwinck · 2016-09-14T13:09:46Z

@jreback If I am the user, and I am telling you I want support for mixed dtypes, how do you then conclude it is "not relevant?" Also you are mistaken when you say it is "not tested" because it plainly is in the unit test I added: (b'hello', ['a', b'b']),.

jreback · 2016-09-14T13:19:20Z

@jzwinck mixed-types (in a single array) are a complete disaster. I would rather not make them worse by adding additional code which makes them worse.

jorisvandenbossche · 2016-09-14T13:21:29Z

@jzwinck Although the wording of @jreback was maybe a bit harsh on the 'relevant' aspect, there is a difference between "relevant for you as a user" and "relevant enough in general to warrant additional code-complexity in pandas" (and to be clear: I am not not saying it is one of either, just noting that it is debatable)
I am not in a good position to judge its relevance as I don't have a personal use case for it, but if supporting writing bytes to csv without "b" for columns with all bytes (so not mixed) is less complex to implement (because you can use existing functionality), IMO it is seems better to start with that, as it probably already solves the main use cases?

jreback reviewed Aug 3, 2016
View reviewed changes

jzwinck force-pushed the fix-9712 branch from b68488d to da92586 Compare August 3, 2016 19:41

jreback reviewed Aug 3, 2016
View reviewed changes

jzwinck force-pushed the fix-9712 branch from da92586 to 1c2c346 Compare August 3, 2016 19:53

sinhrks added Bug IO CSV read_csv, to_csv labels Aug 3, 2016

BUG: avoid "b" prefix for bytes in to_csv() on Python 3 (pandas-dev#9712

d165a4b

)

jzwinck force-pushed the fix-9712 branch from 1c2c346 to d165a4b Compare August 3, 2016 22:21

jreback reviewed Aug 3, 2016
View reviewed changes

jreback reviewed Aug 6, 2016
View reviewed changes

Merge branch 'master' into fix-9712

4d88771

jreback closed this Sep 13, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: avoid "b" prefix for bytes in to_csv() on Python 3 (#9712) #13890

BUG: avoid "b" prefix for bytes in to_csv() on Python 3 (#9712) #13890

jzwinck commented Aug 3, 2016 •

edited

Loading

jreback Aug 3, 2016

jreback Aug 3, 2016

codecov-io commented Aug 3, 2016 •

edited

Loading

jreback Aug 3, 2016

jzwinck Aug 4, 2016

jzwinck commented Aug 5, 2016

jreback Aug 6, 2016

jreback commented Aug 6, 2016

jreback commented Sep 9, 2016

jzwinck commented Sep 13, 2016

jreback commented Sep 13, 2016

jorisvandenbossche commented Sep 13, 2016

jzwinck commented Sep 14, 2016

jreback commented Sep 14, 2016

jzwinck commented Sep 14, 2016

jreback commented Sep 14, 2016 •

edited

Loading

jorisvandenbossche commented Sep 14, 2016 •

edited

Loading

BUG: avoid "b" prefix for bytes in to_csv() on Python 3 (#9712) #13890

BUG: avoid "b" prefix for bytes in to_csv() on Python 3 (#9712) #13890

Conversation

jzwinck commented Aug 3, 2016 • edited Loading

jreback Aug 3, 2016

Choose a reason for hiding this comment

jreback Aug 3, 2016

Choose a reason for hiding this comment

codecov-io commented Aug 3, 2016 • edited Loading

Current coverage is 85.31% (diff: 90.00%)

jreback Aug 3, 2016

Choose a reason for hiding this comment

jzwinck Aug 4, 2016

Choose a reason for hiding this comment

jzwinck commented Aug 5, 2016

jreback Aug 6, 2016

Choose a reason for hiding this comment

jreback commented Aug 6, 2016

jreback commented Sep 9, 2016

jzwinck commented Sep 13, 2016

jreback commented Sep 13, 2016

jorisvandenbossche commented Sep 13, 2016

jzwinck commented Sep 14, 2016

jreback commented Sep 14, 2016

jzwinck commented Sep 14, 2016

jreback commented Sep 14, 2016 • edited Loading

jorisvandenbossche commented Sep 14, 2016 • edited Loading

jzwinck commented Aug 3, 2016 •

edited

Loading

codecov-io commented Aug 3, 2016 •

edited

Loading

jreback commented Sep 14, 2016 •

edited

Loading

jorisvandenbossche commented Sep 14, 2016 •

edited

Loading