
Three or more unnamed fields block loc assignment #13017


Closed
JakeCowton opened this issue Apr 28, 2016 · 10 comments
Labels
Bug · Indexing (related to indexing on series/frames, not to indexes themselves) · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@JakeCowton

Writing a dataframe to csv using df.to_csv("/path/to/file.csv") creates an "unnamed" field containing the indexes of the rows. Continually writing/reading to/from this file will result in many "unnamed" fields. Once there are three unnamed fields, you can no longer use loc to replace values; what's more, it fails silently.

In [2]: df = DataFrame({'A': [1,2,3],'B': [4,5,501]})

In [3]: df
Out[3]:
   A    B
0  1    4
1  2    5
2  3  501

In [4]: df.B.loc[df.B > 500]
Out[4]:
2    501
Name: B, dtype: int64

In [5]: df.B.loc[df.B > 500] = None

In [6]: df
Out[6]:
   A    B
0  1  4.0
1  2  5.0
2  3  NaN

So far so good: I was able to replace the values in df.B above 500 with NaN. I then write this out and read it back in


In [7]: df.to_csv("./test.csv")

In [8]: df = read_csv("./test.csv")

In [9]: df.columns
Out[9]: Index([u'Unnamed: 0', u'A', u'B'], dtype='object')

In [10]: df
Out[10]:
   Unnamed: 0  A    B
0           0  1  4.0
1           1  2  5.0
2           2  3  NaN

As you can see, this has created an unnamed field, but let's continue

In [14]: df.B.fillna(501, inplace=True)

This is just to get 501 back in place of the NaN I created earlier, which I forgot to do before writing out

In [15]: df.B
Out[15]:
0      4.0
1      5.0
2    501.0
Name: B, dtype: float64

In [16]: df.B.loc[df.B > 500]
Out[16]:
2    501.0
Name: B, dtype: float64

In [17]: df.B.loc[df.B > 500] = None

...SettingWithCopyWarning...

In [18]: df.B
Out[18]:
0    4.0
1    5.0
2    NaN
Name: B, dtype: float64

Everything working fine


In [19]: df.fillna(501, inplace=True)

In [20]: df.to_csv("./test.csv")

In [21]: df = read_csv("./test.csv")

In [22]: df.columns
Out[22]: Index([u'Unnamed: 0', u'Unnamed: 0.1', u'A', u'B'], dtype='object')

In [23]: df
Out[23]:
   Unnamed: 0  Unnamed: 0.1  A      B
0           0             0  1    4.0
1           1             1  2    5.0
2           2             2  3  501.0

Writing and reading again creates a 2nd unnamed field


In [24]: df.B.loc[df.B > 500]
Out[24]:
2    501.0
Name: B, dtype: float64

In [25]: df.B.loc[df.B > 500] = None

In [26]: df.B
Out[26]:
0    4.0
1    5.0
2    NaN
Name: B, dtype: float64

Which is no problem, everything still works so far...however


In [27]: df.fillna(501, inplace=True)

In [28]: df
Out[28]:
   Unnamed: 0  Unnamed: 0.1  A      B
0           0             0  1    4.0
1           1             1  2    5.0
2           2             2  3  501.0

In [29]: df.to_csv("./test.csv")

In [30]: df = read_csv("./test.csv")

In [31]: df.columns
Out[31]: Index([u'Unnamed: 0', u'Unnamed: 0.1', u'Unnamed: 0.1', u'A', u'B'], dtype='object')

In [32]: df
Out[32]:
   Unnamed: 0  Unnamed: 0.1  Unnamed: 0.1  A      B
0           0             0             0  1    4.0
1           1             1             1  2    5.0
2           2             2             2  3  501.0

We now have 3 unnamed fields


In [33]: df.B.loc[df.B > 500]
Out[33]:
2    501.0
Name: B, dtype: float64

In [34]: df.B.loc[df.B > 500] = None

In [35]: df.B
Out[35]:
0      4.0
1      5.0
2    501.0
Name: B, dtype: float64

The method of replacing all values over 500 with NaN no longer works, but it also throws no errors or warnings.

You CAN get around this using df.loc[df.B > 500, 'B'] = None but obviously you shouldn't have to.
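For illustration, a minimal sketch of that single-`.loc` workaround on a frame with duplicated column names like the one produced by the repeated round-trips (data values made up; `np.nan` used for the fill rather than `None`):

```python
import numpy as np
import pandas as pd

# A frame with duplicate "Unnamed" columns, like the one after two round-trips.
df = pd.DataFrame(
    [[0, 0, 1, 4.0], [1, 1, 2, 5.0], [2, 2, 3, 501.0]],
    columns=["Unnamed: 0", "Unnamed: 0", "A", "B"],
)

# One .loc call on the frame itself (no chained indexing) still works
# despite the duplicate column names.
df.loc[df.B > 500, "B"] = np.nan

print(df["B"].isna().tolist())  # [False, False, True]
```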

@jreback
Contributor

jreback commented Apr 28, 2016

you need to specify index_col=0 on the read-back.

or use DataFrame.from_csv, which is the inverse of .to_csv
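A sketch of the first suggestion: telling read_csv that the first CSV column is the index makes the round-trip clean. (DataFrame.from_csv has since been deprecated in favor of read_csv.)

```python
import io

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 501]})

# Read the first CSV column back as the index instead of letting it
# become a new "Unnamed: 0" data column.
roundtripped = pd.read_csv(io.StringIO(df.to_csv()), index_col=0)

print(list(roundtripped.columns))  # ['A', 'B']
```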

@jreback
Contributor

jreback commented Apr 28, 2016

or don't write the index in the first place, e.g. index=False.
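And a sketch of the second suggestion: with index=False nothing unnamed is ever written, so repeated round-trips are stable.

```python
import io

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 501]})

# Skip the index on write; the CSV then contains only the real columns.
text = df.to_csv(index=False)
roundtripped = pd.read_csv(io.StringIO(text))

# A second round-trip changes nothing.
again = pd.read_csv(io.StringIO(roundtripped.to_csv(index=False)))

print(list(again.columns))  # ['A', 'B']
```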

@jreback jreback added the IO CSV (read_csv, to_csv) label Apr 28, 2016
@JakeCowton
Author

I'm aware you can get around it, but it's still a bug

@jreback
Contributor

jreback commented Apr 28, 2016

@JakeCowton how so? pls read the doc-string. It is very clear. You are not reading with a correct option.

@JakeCowton
Author

  1. What if the fields were not the indexes, but instead were genuine, unnamed fields?
  2. I'm calling a command that works in other circumstances but here fails without any error or warning.
  3. Having 2 unnamed fields is fine, but 3 isn't. Where is the logic in that?

I'm not demanding a fix for it. Like I said in my post, you can work around it and you've provided 2 ways to avoid getting into the situation in the first place; but it is a bug.

@jreback
Contributor

jreback commented Apr 28, 2016

@JakeCowton I still don't understand what you are saying, pls provide a short, self-reproducing, copy-pastable example. What you did above is pure usage.

@jreback
Contributor

jreback commented Apr 28, 2016

I suppose you are saying this is a bug.

In [33]: df = DataFrame({'A': [1,2,3],'B': [4,5,501]})

In [34]: pd.read_csv(StringIO(df.to_csv()))          
Out[34]: 
   Unnamed: 0  A    B
0           0  1    4
1           1  2    5
2           2  3  501

In [35]: pd.read_csv(StringIO(pd.read_csv(StringIO(df.to_csv())).to_csv()))
Out[35]: 
   Unnamed: 0  Unnamed: 0.1  A    B
0           0             0  1    4
1           1             1  2    5
2           2             2  3  501

This looks legitimate, and to be honest is user error if this is the case.

@jorisvandenbossche
Member

The actual issue you raised is a bit buried in the long explanation, but I think you wanted to highlight the following:

In [27]: df = pd.DataFrame(np.random.randn(3,3), columns=['a', 'b', 'c'])

In [28]: df.c.loc[df.c > 0] = None

In [29]: df.c
Out[29]:
0         NaN
1         NaN
2   -0.244531
Name: c, dtype: float64

In [30]: df = pd.DataFrame(np.random.randn(3,3), columns=['a', 'a', 'c'])

In [31]: df.c.loc[df.c > 0] = None

In [32]: df.c
Out[32]:
0   -0.474796
1    2.337849
2   -1.880815
Name: c, dtype: float64

So the last assignment (df.c.loc[df.c > 0] = None) does not work when there are duplicate column names.
Either this is a bug in the assignment, or a failure to raise a SettingWithCopyWarning.

In any case, not using chained assignment works:

In [37]: df.loc[df.c > 0, 'c'] = np.nan

In [38]: df
Out[38]:
          a         a         c
0 -0.747081 -0.900634 -0.474796
1  0.587197 -1.547151       NaN
2 -0.107341 -1.428424 -1.880815

Although using None here instead of np.nan also raises an error which looks like a bug:

In [39]: df.loc[df.c > 0, 'c'] = None
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 0)

The reason you get duplicate column names here is also due to read_csv changing the "Unnamed: 0" to "Unnamed: 0.1", but not considering there is already a "Unnamed: 0.1" column. This could be regarded as a bug, although I would personally label that as a wont-fix.
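The renaming described above is read_csv's deduplication of repeated header names, which appends ".1", ".2", and so on. A minimal sketch on a hand-written header (the column names here are made up):

```python
import io

import pandas as pd

# Two columns literally named "a" in the header.
text = "a,a,b\n1,2,3\n"
df = pd.read_csv(io.StringIO(text))

# read_csv mangles the duplicate into "a.1" rather than keeping two "a"s.
print(list(df.columns))  # ['a', 'a.1', 'b']
```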

@jorisvandenbossche jorisvandenbossche added the Indexing label and removed the Usage Question label Apr 29, 2016
@jreback jreback added this to the Next Major Release milestone Apr 29, 2016
@jreback jreback added the API Design, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), and Difficulty Intermediate labels and removed the API Design label Apr 29, 2016
@gfyoung gfyoung removed the IO CSV (read_csv, to_csv) label Nov 4, 2018
gfyoung added commits to forking-repos/pandas that referenced this issue Nov 4, 2018
@gfyoung gfyoung added the Bug label Nov 4, 2018
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this issue Nov 14, 2018
Pingviinituutti pushed commits to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
@NxNiki

NxNiki commented Sep 26, 2019

I suppose you are saying this is a bug.

In [33]: df = DataFrame({'A': [1,2,3],'B': [4,5,501]})

In [34]: pd.read_csv(StringIO(df.to_csv()))          
Out[34]: 
   Unnamed: 0  A    B
0           0  1    4
1           1  2    5
2           2  3  501

In [35]: pd.read_csv(StringIO(pd.read_csv(StringIO(df.to_csv())).to_csv()))
Out[35]: 
   Unnamed: 0  Unnamed: 0.1  A    B
0           0             0  1    4
1           1             1  2    5
2           2             2  3  501

This looks legitimate, and to be honest is user error if this is the case.

Hi, the problem is that when you set index_col=0 in the last line, the column name is still changed to "Unnamed: 0.1":

In [35]: pd.read_csv(StringIO(pd.read_csv(StringIO(df.to_csv())).to_csv()), index_col=0)

Out[35]:
   Unnamed: 0.1  A    B
0             0  1    4
1             1  2    5
2             2  3  501

I expect it should be:

Out[35]:
   Unnamed: 0  A    B
0           0  1    4
1           1  2    5
2           2  3  501

@mroeschke
Member

As mentioned in Joris' comment, it appears the "bug" is in a chained indexing operation, which is highly discouraged and which we're not actively looking to support. Agreed that this is a won't-fix. Closing, but happy to reopen if I misunderstood.


7 participants