Skip to content

BUG: Error in to_stata when DataFrame contains non-string column names #6622

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 13, 2014

Conversation

bashtage
Copy link
Contributor

closes #4558

to_stata does not work correctly when used with non-string names. Since
Stata requires string names, the proposed fix attempts to rename columns using
the string representation of the column name used. A warning is raised if
the column name is changed.

@bashtage
Copy link
Contributor Author

There was a unicode - string issue in my first attempt. I looked in compat but didn't see a good way to safely text whether a column label was with string or unicode in Python 2 or just string in Python 3, so I added a simple one in stata.py. If this doesn't exist, perhaps it should be in compat?

@bashtage
Copy link
Contributor Author

I found the already provided method - isinstance(value,string_types) and have removed my unnecessary solution.

@jreback jreback added this to the 0.14.0 milestone Mar 13, 2014
@jreback
Copy link
Contributor

jreback commented Mar 13, 2014

can you add a test to assert that this warning is raised
and add a release notes (closing 4558)

@bashtage
Copy link
Contributor Author

This has created one other problem - since the writer expects columns to have string names, I am now converting columns to have unique string names, no problem. However, if someone tries to use {0:'tc'} to convert column 0 to a date, then there now is a new problem.

The solution as it now stands would be to check for this too and to convert to {'0':'tc'}, which would allow the rest of the code to run without further modification. However, the variable name '0' is also not legal and so it is converted later to _0 in a different step. Assuming that the magic shoudl be preserved, then it would make sense to do all of the at once so that the dataframe is known from the beginning to only have legal variable names.

I can fix the date conversion issue too, but am starting to wonder if it might be more appropriate to raise an InvalidColumnName exception with an explanation rather than performing lots of magic behind the scenes. If you think this is better (explicit vs now implicit) then it will require refactoring some tests too.

@jseabold @jreback Any thoughts on magic vs. error?

@jreback
Copy link
Contributor

jreback commented Mar 13, 2014

@bashtage all for raising when it gets ambiguous / too tricky. This is an export format that has certain requirements if its difficult to meet those then simply defer back to the user.

@bashtage
Copy link
Contributor Author

My preference would be to kick any non-trivial variable name change, that is something other than 'my variable' -> 'my_variable' back to the user.

Currently the conversion handles lots of other invalid names such as unicode characters, reserved word or variable names that start with an number, by either replacement with _ or prepending _ with some other rules to ensure uniqueness.

@jreback
Copy link
Contributor

jreback commented Mar 13, 2014

you could have a argument, say translation=strict|substitue|drop which will determine what do to with the various cases, e.g. raise, translate as now, or drop.

I am not sure how common this is, maybe best simply to kick it back to the user at this point.

@bashtage
Copy link
Contributor Author

I have decided to simply consolidate the magic to 1 place which makes it seem somewhat less magical. It also means no API changes.

It this passes on 3.x and 2.6 then I'll do the doc update and it should be ready to close.

@bashtage
Copy link
Contributor Author

@jreback Rebased on master, added issue, improved the commit message a bit, and so hopefully ready to close.

@jreback
Copy link
Contributor

jreback commented Mar 13, 2014

ideally can you put in a test for tm.assert_produces_warning(InvalidColumnName) e.g. checks that you are in fact warning at the correcttime (and not at the wrong time).

@bashtage
Copy link
Contributor Author

Made this change.

@jreback
Copy link
Contributor

jreback commented Mar 13, 2014

From the 2.7 build. This is ok, but maybe you want to catch this too? (ideally the nosetests won't produce warnings).

So maybe need another assert_produces? (or can simply filter around the other ones)

packages/pandas/io/stata.py:1178: InvalidColumnName: 
Not all pandas column names were valid Stata variable names.
The following replacements have been made:
    0   ->   _0
    1   ->   _1
    2   ->   _2
    3   ->   _3
    4   ->   _4
If this is not what you expect, please make sure you have Stata-compliant
column names in your DataFrame (strings only, max 32 characters, only alphanumerics and
underscores, no Stata reserved words)
  warnings.warn(ws, InvalidColumnName)

to_stata does not work correctly when used with non-string names.  Since
Stata requires string names, the proposed fix attempts to rename columns using
the string representation of the column name used.  The main method that
reformats column names was refactored to handle this case.

Patch includes additional fixes for detecting invalid names.

Patch includes some minor documentation fixes.
@bashtage
Copy link
Contributor Author

Trapped that warning, so looks clean now.

jreback added a commit that referenced this pull request Mar 13, 2014
BUG: Error in to_stata when DataFrame contains non-string column names
@jreback jreback merged commit 403f778 into pandas-dev:master Mar 13, 2014
@jreback
Copy link
Contributor

jreback commented Mar 13, 2014

thanks! you are getting really fast / good with these!

@bashtage bashtage deleted the stata-timestamps branch March 13, 2014 22:25
@bashtage bashtage restored the stata-timestamps branch March 21, 2014 19:49
@bashtage bashtage deleted the stata-timestamps branch March 23, 2014 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API/ENH: stata export should better support np.datetime64 based times
2 participants