
Three or more unnamed fields block loc assignment #13017


Closed
JakeCowton opened this issue Apr 28, 2016 · 10 comments
Labels
Bug · Indexing (related to indexing on series/frames, not to indexes themselves) · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@JakeCowton

Writing a dataframe to csv using df.to_csv("/path/to/file.csv") creates an "unnamed" field containing the indexes of the rows. Continually writing/reading to/from this file will result in many "unnamed" fields. Once there are three unnamed fields, you can no longer use loc to replace values; what's more, it fails silently.

In [2]: df = DataFrame({'A': [1,2,3],'B': [4,5,501]})

In [3]: df
Out[3]:
   A    B
0  1    4
1  2    5
2  3  501

In [4]: df.B.loc[df.B > 500]
Out[4]:
2    501
Name: B, dtype: int64

In [5]: df.B.loc[df.B > 500] = None

In [6]: df
Out[6]:
   A    B
0  1  4.0
1  2  5.0
2  3  NaN

So far so good: I was able to replace the values in df.B above 500 with NaN. I then write this out and read it back in


In [7]: df.to_csv("./test.csv")

In [8]: df = read_csv("./test.csv")

In [9]: df.columns
Out[9]: Index([u'Unnamed: 0', u'A', u'B'], dtype='object')

In [10]: df
Out[10]:
   Unnamed: 0  A    B
0           0  1  4.0
1           1  2  5.0
2           2  3  NaN

As you can see, this has created an unnamed field, but let's continue

In [14]: df.B.fillna(501, inplace=True)

This is just to get 501 back in place of the NaN I created earlier, which I forgot to do before writing out

In [15]: df.B
Out[15]:
0      4.0
1      5.0
2    501.0
Name: B, dtype: float64

In [16]: df.B.loc[df.B > 500]
Out[16]:
2    501.0
Name: B, dtype: float64

In [17]: df.B.loc[df.B > 500] = None

...SettingWithCopyWarning...

In [18]: df.B
Out[18]:
0    4.0
1    5.0
2    NaN
Name: B, dtype: float64

Everything working fine


In [19]: df.fillna(501, inplace=True)

In [20]: df.to_csv("./test.csv")

In [21]: df = read_csv("./test.csv")

In [22]: df.columns
Out[22]: Index([u'Unnamed: 0', u'Unnamed: 0.1', u'A', u'B'], dtype='object')

In [23]: df
Out[23]:
   Unnamed: 0  Unnamed: 0.1  A      B
0           0             0  1    4.0
1           1             1  2    5.0
2           2             2  3  501.0

Writing and reading again creates a 2nd unnamed field


In [24]: df.B.loc[df.B > 500]
Out[24]:
2    501.0
Name: B, dtype: float64

In [25]: df.B.loc[df.B > 500] = None

In [26]: df.B
Out[26]:
0    4.0
1    5.0
2    NaN
Name: B, dtype: float64

Which is no problem, everything still works so far...however


In [27]: df.fillna(501, inplace=True)

In [28]: df
Out[28]:
   Unnamed: 0  Unnamed: 0.1  A      B
0           0             0  1    4.0
1           1             1  2    5.0
2           2             2  3  501.0

In [29]: df.to_csv("./test.csv")

In [30]: df = read_csv("./test.csv")

In [31]: df.columns
Out[31]: Index([u'Unnamed: 0', u'Unnamed: 0.1', u'Unnamed: 0.1', u'A', u'B'], dtype='object')

In [32]: df
Out[32]:
   Unnamed: 0  Unnamed: 0.1  Unnamed: 0.1  A      B
0           0             0             0  1    4.0
1           1             1             1  2    5.0
2           2             2             2  3  501.0

We now have 3 unnamed fields


In [33]: df.B.loc[df.B > 500]
Out[33]:
2    501.0
Name: B, dtype: float64

In [34]: df.B.loc[df.B > 500] = None

In [35]: df.B
Out[35]:
0      4.0
1      5.0
2    501.0
Name: B, dtype: float64

The method of replacing all values over 500 with NaN no longer works, but it also throws no errors or warnings.

You CAN get around this using df.loc[df.B > 500, 'B'] = None but obviously you shouldn't have to.
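For illustration, a minimal sketch of that single-`.loc` workaround on a frame with duplicated column names like the one produced by the repeated round-trips (data values made up; `np.nan` used for the fill rather than `None`):

```python
import numpy as np
import pandas as pd

# A frame with duplicate "Unnamed" columns, like the one after two round-trips.
df = pd.DataFrame(
    [[0, 0, 1, 4.0], [1, 1, 2, 5.0], [2, 2, 3, 501.0]],
    columns=["Unnamed: 0", "Unnamed: 0", "A", "B"],
)

# One .loc call on the frame itself (no chained indexing) still works
# despite the duplicate column names.
df.loc[df.B > 500, "B"] = np.nan

print(df["B"].isna().tolist())  # [False, False, True]
```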

@jreback
Contributor

jreback commented Apr 28, 2016

you need to specify index_col=0 on the read-back.

or use DataFrame.from_csv, which is the inverse of .to_csv
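A sketch of the first suggestion: telling read_csv that the first CSV column is the index makes the round-trip clean. (DataFrame.from_csv has since been deprecated in favor of read_csv.)

```python
import io

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 501]})

# Read the first CSV column back as the index instead of letting it
# become a new "Unnamed: 0" data column.
roundtripped = pd.read_csv(io.StringIO(df.to_csv()), index_col=0)

print(list(roundtripped.columns))  # ['A', 'B']
```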

@jreback
Contributor

jreback commented Apr 28, 2016

or don't write the index in the first place, e.g. index=False.
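And a sketch of the second suggestion: with index=False nothing unnamed is ever written, so repeated round-trips are stable.

```python
import io

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 501]})

# Skip the index on write; the CSV then contains only the real columns.
text = df.to_csv(index=False)
roundtripped = pd.read_csv(io.StringIO(text))

# A second round-trip changes nothing.
again = pd.read_csv(io.StringIO(roundtripped.to_csv(index=False)))

print(list(again.columns))  # ['A', 'B']
```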

@jreback jreback added the IO CSV (read_csv, to_csv) label Apr 28, 2016
@JakeCowton
Author

I'm aware you can get around it, but it's still a bug

@jreback
Contributor

jreback commented Apr 28, 2016

@JakeCowton how so? pls read the doc-string. It is very clear. You are not reading with a correct option.

@JakeCowton
Author

  1. What if the fields were not the indexes, but instead were genuine, unnamed fields?
  2. I'm calling a command that works in other circumstances but here fails without any error or warning.
  3. Having 2 unnamed fields is fine, but 3 isn't. Where is the logic in that?

I'm not demanding a fix for it. Like I said in my post, you can work around it and you've provided 2 ways to avoid getting into the situation in the first place; but it is a bug.

@jreback
Contributor

jreback commented Apr 28, 2016

@JakeCowton I still don't understand what you are saying, pls provide a short, self-reproducing, copy-pastable example. What you did above is pure usage.

@jreback
Contributor

jreback commented Apr 28, 2016

I suppose you are saying this is a bug.

In [33]: df = DataFrame({'A': [1,2,3],'B': [4,5,501]})

In [34]: pd.read_csv(StringIO(df.to_csv()))          
Out[34]: 
   Unnamed: 0  A    B
0           0  1    4
1           1  2    5
2           2  3  501

In [35]: pd.read_csv(StringIO(pd.read_csv(StringIO(df.to_csv())).to_csv()))
Out[35]: 
   Unnamed: 0  Unnamed: 0.1  A    B
0           0             0  1    4
1           1             1  2    5
2           2             2  3  501

This looks legitimate, and to be honest is user error if this is the case.

@jorisvandenbossche
Member

The actual issue you raised is a bit buried in the long explanation, but I think you wanted to highlight the following:

In [27]: df = pd.DataFrame(np.random.randn(3,3), columns=['a', 'b', 'c'])

In [28]: df.c.loc[df.c > 0] = None

In [29]: df.c
Out[29]:
0         NaN
1         NaN
2   -0.244531
Name: c, dtype: float64

In [30]: df = pd.DataFrame(np.random.randn(3,3), columns=['a', 'a', 'c'])

In [31]: df.c.loc[df.c > 0] = None

In [32]: df.c
Out[32]:
0   -0.474796
1    2.337849
2   -1.880815
Name: c, dtype: float64

So the last assignment (df.c.loc[df.c > 0] = None) does not work when there are duplicate column names.
Either this is a bug in the assignment, or a failure to raise a SettingWithCopyWarning.

In any case, not using chained assignment works:

In [37]: df.loc[df.c > 0, 'c'] = np.nan

In [38]: df
Out[38]:
          a         a         c
0 -0.747081 -0.900634 -0.474796
1  0.587197 -1.547151       NaN
2 -0.107341 -1.428424 -1.880815

Although using None here instead of np.nan also raises an error which looks like a bug:

In [39]: df.loc[df.c > 0, 'c'] = None
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 0)

The reason you get duplicate column names here is also due to read_csv changing the "Unnamed: 0" to "Unnamed: 0.1", but not considering there is already a "Unnamed: 0.1" column. This could be regarded as a bug, although I would personally label that as a wont-fix.
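The renaming described above is read_csv's deduplication of repeated header names, which appends ".1", ".2", and so on. A minimal sketch on a hand-written header (the column names here are made up):

```python
import io

import pandas as pd

# Two columns literally named "a" in the header.
text = "a,a,b\n1,2,3\n"
df = pd.read_csv(io.StringIO(text))

# read_csv mangles the duplicate into "a.1" rather than keeping two "a"s.
print(list(df.columns))  # ['a', 'a.1', 'b']
```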

@jorisvandenbossche jorisvandenbossche added the Indexing label and removed the Usage Question label Apr 29, 2016
@jreback jreback added this to the Next Major Release milestone Apr 29, 2016
@jreback jreback added the API Design, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), and Difficulty Intermediate labels and removed the API Design label Apr 29, 2016
@gfyoung gfyoung removed the IO CSV (read_csv, to_csv) label Nov 4, 2018
gfyoung added commits to forking-repos/pandas that referenced this issue Nov 4, 2018
@gfyoung gfyoung added the Bug label Nov 4, 2018
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this issue Nov 14, 2018
Pingviinituutti pushed commits to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
@NxNiki

NxNiki commented Sep 26, 2019

I suppose you are saying this is a bug.

In [33]: df = DataFrame({'A': [1,2,3],'B': [4,5,501]})

In [34]: pd.read_csv(StringIO(df.to_csv()))          
Out[34]: 
   Unnamed: 0  A    B
0           0  1    4
1           1  2    5
2           2  3  501

In [35]: pd.read_csv(StringIO(pd.read_csv(StringIO(df.to_csv())).to_csv()))
Out[35]: 
   Unnamed: 0  Unnamed: 0.1  A    B
0           0             0  1    4
1           1             1  2    5
2           2             2  3  501

This looks legitimate, and to be honest is user error if this is the case.

Hi, the problem is that when you set index_col=0 in the last line, the column name is still changed to "Unnamed: 0.1":

In [35]: pd.read_csv(StringIO(pd.read_csv(StringIO(df.to_csv())).to_csv()), index_col=0)

Out[35]:
   Unnamed: 0.1  A    B
0             0  1    4
1             1  2    5
2             2  3  501

I expect it should be:

Out[35]:
   Unnamed: 0  A    B
0           0  1    4
1           1  2    5
2           2  3  501

@mroeschke
Member

As mentioned in Joris' comment, it appears the "bug" is in a chained indexing operation, which is highly discouraged and which we're not actively looking to support. Agreed that this is a won't-fix. Closing, but happy to reopen if I misunderstood.


7 participants