Pandas get_dummies validate "columns" input #28383

TonyCongqianWang · 2019-09-11T05:35:00Z

Code Sample, a copy-pastable example if possible

import pandas

p_csv = pandas.read_csv("my_dir/myFile", index_col=0)

# Sorted unique separator names; one indicator column is built per name.
sepanames = sorted(p_csv["SEPARATOR"].unique())

for i in range(0, 14):
    print(i)
    # Insert each new column right after "SEPARATOR", in sorted-name order.
    col = p_csv.columns.get_loc("SEPARATOR") + 1 + i
    p_csv.insert(col, "SEPARATOR_" + sepanames[i].upper(), p_csv["SEPARATOR"].apply(lambda x: int(x == sepanames[i])))

p_csv.to_csv("my_dir/new_file.csv")

''' p_csv= pandas.read_csv(("/myDir/myFile.csv"), index_col=0)
pandas.get_dummies(p_csv, prefix="SEPARATOR_", columns="SEPARATOR")
p_csv.to_csv("/myDir/myNew.csv")'''

FILES:
https://www.amazon.de/clouddrive/share/h37d1hqtrj5SrZTvKdrs9gXltVKUgo8Is9BxL8WH7Sf

Problem description

This is all of my code. The quoted part is what I first tried, but after 20 minutes it was terminated by SIGKILL without producing any result. The required files are available for download. I would think that my code does the equivalent in this case, and it works just fine in under a minute.

Pandas Version: 0.25.1
Pandas git version: '171c71611886aab8549a8620c5b0071a129ad685'

Expected Output

No error and changed csv file

jbrockmendel · 2019-09-11T14:59:59Z

Can you post an example that we can copy/paste to replicate the behavior?

TonyCongqianWang · 2019-09-11T16:37:02Z

I already posted the example in quotes. It is:

import pandas

p_csv= pandas.read_csv(("/myDir/myFile.csv"), index_col=0)
pandas.get_dummies(p_csv, prefix="SEPARATOR_", columns="SEPARATOR")
p_csv.to_csv("/myDir/myNew.csv")

WillAyd · 2019-09-11T16:57:13Z

That code sample is not self-contained, so no one can copy / paste to replicate your issue. The below link might help:

http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

TonyCongqianWang · 2019-09-11T18:02:11Z

Alright, here is a copy-pastable version with random data:

###CODE

import numpy as np
import pandas as pd

columns = np.random.rand(145)
columns = columns.astype(str)
columns[110] = "SEPARATOR"
columns[0] = "INSTANCES"

separators = ["and", "clique", "gomory", "gomory1", "gomory2", "gomory3", "gomory4", "zerohalf"]
instances = ["021erkdjfiejrk", "5484ierdkj", "5487ehej"]

d = {}

for i in range(145):
    d[columns[i]]= np.random.rand(145000)

d["SEPARATOR"] = [np.random.choice(instances) for i in range(145000)]
d["INSTANCES"] = [np.random.choice(instances) for i in range(145000)]

p_csv= pd.DataFrame(d)
p_csv["SEPARATOR"]
print("getting dummies", p_csv.shape)
pd.get_dummies(p_csv, prefix="SEPARATOR_", columns="SEPARATOR")
print("success")

jbrockmendel · 2019-09-11T18:10:20Z

That is liable to produce a DataFrame with shape on the order of (145000, 145000), which I think is on the order of 80GB, which certainly won't fit on my laptop. SIGKILL is likely coming from the OOM killer.
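As a rough back-of-envelope check of that estimate, the sketch below computes the memory a dense array of the quoted shape would need. The element sizes (1 byte for uint8, 8 bytes for float64) are assumptions for illustration; the thread does not say which dtype the intermediate result would have.

```python
# Shape quoted in the comment above.
n_rows = 145_000
n_cols = 145_000
cells = n_rows * n_cols

# Assumed element sizes in bytes (illustrative, not taken from the thread).
bytes_per_cell = {"uint8": 1, "float64": 8}

for dtype, size in bytes_per_cell.items():
    gigabytes = cells * size / 1e9
    print(f"{dtype}: {gigabytes:.1f} GB")
```

Under either assumption the result runs to tens or hundreds of gigabytes, which is the same order of magnitude as the figure quoted above and far more than a typical laptop has.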

TonyCongqianWang · 2019-09-11T18:13:13Z

That is what I guessed had probably happened. But why would the API function do that when there are only 6 different values for separator?

jbrockmendel · 2019-09-11T19:06:25Z

(edited your comment to make it copy/paste more neatly, LMK if you object and I'll revert)

jbrockmendel · 2019-09-11T19:10:37Z

do you get the expected result if you pass columns=["SEPARATOR"] instead of columns="SEPARATOR"?

TonyCongqianWang · 2019-09-11T19:14:51Z

Yes! That worked, very fast too. Now that you mention it, I do see that the expected parameter is "list-like". I didn't expect that to be a problem, since I didn't get any error, and I do get an error if I pass 'columns="fake_column"'.
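For reference, a minimal sketch of the working call on a tiny frame. The data and the `value` column are made up for illustration; only the `columns=["SEPARATOR"]` argument comes from the thread.

```python
import pandas as pd

# Tiny made-up frame standing in for the real CSV.
df = pd.DataFrame(
    {
        "value": [0.1, 0.2, 0.3, 0.4],
        "SEPARATOR": ["and", "clique", "gomory", "and"],
    }
)

# Pass a list, even when encoding a single column.
encoded = pd.get_dummies(df, columns=["SEPARATOR"])
print(list(encoded.columns))
```

Note that `get_dummies` returns a new DataFrame rather than modifying `df`, so the result has to be assigned before it is written out with `to_csv`.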

WillAyd · 2019-09-12T00:06:00Z

I think this should raise if columns is a string to avoid confusion like this. @TonyCongqianWang any interest in submitting a PR for that?

saurav2608 · 2019-09-12T05:13:18Z

I will take a shot at this.

R1j1t · 2019-09-17T05:04:35Z

@WillAyd I created a pull request for this validation, but I wanted to confirm one thing. If columns is not list-like, should it raise an error or a warning?
I referred to the code in reshape/utils.py, and it raises an error there. I would like to verify that I am interpreting it correctly.

WillAyd · 2019-09-17T15:40:26Z

Yea I think that would be good to emulate - nice find!
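The validation agreed on above could look roughly like the sketch below. `check_columns_arg` is a hypothetical helper written for illustration with the standard library only; it is not the code merged in #28463, and the error message is an assumption. pandas also exposes `pandas.api.types.is_list_like`, which could serve the same purpose.

```python
from collections.abc import Iterable


def check_columns_arg(columns):
    """Hypothetical check: accept None or a list-like, reject a bare string."""
    if columns is None:
        return None
    # A str is iterable, but iterating it yields characters, not column names,
    # so it is rejected explicitly along with other non-iterables.
    if isinstance(columns, (str, bytes)) or not isinstance(columns, Iterable):
        raise TypeError("Input must be a list-like for parameter `columns`")
    return list(columns)


print(check_columns_arg(["SEPARATOR"]))
try:
    check_columns_arg("SEPARATOR")
except TypeError as exc:
    print("rejected:", exc)
```

Raising instead of warning means the mistake surfaces immediately, rather than after a long and memory-hungry computation.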

WillAyd added the "Needs Info" label (Clarification about behavior needed to assess issue) Sep 11, 2019

WillAyd changed the title ~~Pandas get dummies sigkill~~ Pandas get_dummies validate "columns" input Sep 12, 2019

WillAyd added the "API Design" and "Error Reporting" (Incorrect or improved errors from pandas) labels and removed the "Needs Info" (Clarification about behavior needed to assess issue) label Sep 12, 2019

WillAyd added this to the Contributions Welcome milestone Sep 12, 2019

R1j1t mentioned this issue Sep 16, 2019: Pandas get_dummies validate columns input #28463 (Merged)

jreback added the "Reshaping" label (Concat, Merge/Join, Stack/Unstack, Explode) Sep 18, 2019

WillAyd closed this as completed in #28463 Oct 22, 2019
