BUG: Error in to_stata when DataFrame contains non-string column names #6622

bashtage · 2014-03-13T00:37:24Z

to_stata does not work correctly when used with non-string names. Since
Stata requires string names, the proposed fix attempts to rename columns using
the string representation of the column name used. A warning is raised if
the column name is changed.

bashtage · 2014-03-13T08:23:25Z

There was a unicode - string issue in my first attempt. I looked in compat but didn't see a good way to safely text whether a column label was with string or unicode in Python 2 or just string in Python 3, so I added a simple one in stata.py. If this doesn't exist, perhaps it should be in compat?

bashtage · 2014-03-13T08:44:49Z

I found the already provided method - isinstance(value,string_types) and have removed my unnecessary solution.

jreback · 2014-03-13T10:23:31Z

can you add a test to assert that this warning is raised
and add a release notes (closing 4558)

bashtage · 2014-03-13T11:22:14Z

This has created one other problem - since the writer expects columns to have string names, I am now converting columns to have unique string names, no problem. However, if someone tries to use {0:'tc'} to convert column 0 to a date, then there now is a new problem.

The solution as it now stands would be to check for this too and to convert to {'0':'tc'}, which would allow the rest of the code to run without further modification. However, the variable name '0' is also not legal and so it is converted later to _0 in a different step. Assuming that the magic shoudl be preserved, then it would make sense to do all of the at once so that the dataframe is known from the beginning to only have legal variable names.

I can fix the date conversion issue too, but am starting to wonder if it might be more appropriate to raise an InvalidColumnName exception with an explanation rather than performing lots of magic behind the scenes. If you think this is better (explicit vs now implicit) then it will require refactoring some tests too.

@jseabold @jreback Any thoughts on magic vs. error?

jreback · 2014-03-13T11:29:00Z

@bashtage all for raising when it gets ambiguous / too tricky. This is an export format that has certain requirements if its difficult to meet those then simply defer back to the user.

bashtage · 2014-03-13T11:32:27Z

My preference would be to kick any non-trivial variable name change, that is something other than 'my variable' -> 'my_variable' back to the user.

Currently the conversion handles lots of other invalid names such as unicode characters, reserved word or variable names that start with an number, by either replacement with _ or prepending _ with some other rules to ensure uniqueness.

jreback · 2014-03-13T11:35:12Z

you could have a argument, say translation=strict|substitue|drop which will determine what do to with the various cases, e.g. raise, translate as now, or drop.

I am not sure how common this is, maybe best simply to kick it back to the user at this point.

bashtage · 2014-03-13T12:32:43Z

I have decided to simply consolidate the magic to 1 place which makes it seem somewhat less magical. It also means no API changes.

It this passes on 3.x and 2.6 then I'll do the doc update and it should be ready to close.

bashtage · 2014-03-13T17:21:08Z

@jreback Rebased on master, added issue, improved the commit message a bit, and so hopefully ready to close.

jreback · 2014-03-13T17:48:00Z

ideally can you put in a test for tm.assert_produces_warning(InvalidColumnName) e.g. checks that you are in fact warning at the correcttime (and not at the wrong time).

bashtage · 2014-03-13T18:05:04Z

Made this change.

jreback · 2014-03-13T18:43:50Z

From the 2.7 build. This is ok, but maybe you want to catch this too? (ideally the nosetests won't produce warnings).

So maybe need another assert_produces? (or can simply filter around the other ones)

packages/pandas/io/stata.py:1178: InvalidColumnName: 
Not all pandas column names were valid Stata variable names.
The following replacements have been made:
    0   ->   _0
    1   ->   _1
    2   ->   _2
    3   ->   _3
    4   ->   _4
If this is not what you expect, please make sure you have Stata-compliant
column names in your DataFrame (strings only, max 32 characters, only alphanumerics and
underscores, no Stata reserved words)
  warnings.warn(ws, InvalidColumnName)

to_stata does not work correctly when used with non-string names. Since Stata requires string names, the proposed fix attempts to rename columns using the string representation of the column name used. The main method that reformats column names was refactored to handle this case. Patch includes additional fixes for detecting invalid names. Patch includes some minor documentation fixes.

bashtage · 2014-03-13T21:46:39Z

Trapped that warning, so looks clean now.

BUG: Error in to_stata when DataFrame contains non-string column names

jreback · 2014-03-13T22:19:45Z

thanks! you are getting really fast / good with these!

bashtage mentioned this pull request Mar 13, 2014

API/ENH: stata export should better support np.datetime64 based times #4558

Closed

jreback added Bug labels Mar 13, 2014

jreback added this to the 0.14.0 milestone Mar 13, 2014

jreback added a commit that referenced this pull request Mar 13, 2014

Merge pull request #6622 from bashtage/stata-timestamps

403f778

BUG: Error in to_stata when DataFrame contains non-string column names

jreback merged commit 403f778 into pandas-dev:master Mar 13, 2014

bashtage deleted the stata-timestamps branch March 13, 2014 22:25

bashtage restored the stata-timestamps branch March 21, 2014 19:49

bashtage deleted the stata-timestamps branch March 23, 2014 15:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Error in to_stata when DataFrame contains non-string column names #6622

BUG: Error in to_stata when DataFrame contains non-string column names #6622

bashtage commented Mar 13, 2014

bashtage commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014

bashtage commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014

BUG: Error in to_stata when DataFrame contains non-string column names #6622

BUG: Error in to_stata when DataFrame contains non-string column names #6622

Conversation

bashtage commented Mar 13, 2014

bashtage commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014

bashtage commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014

bashtage commented Mar 13, 2014

jreback commented Mar 13, 2014