BUG: convert nan to None before insert data into mysql #4200

Closed
wants to merge 1 commit into from

Conversation

@simomo commented Jul 11, 2013

For issue #4199

@hayd (Contributor) commented Jul 11, 2013

cc #2754 @danielballan @jreback

@danielballan (Contributor)

@simomo Can you include a test demonstrating expected behavior? See pandas/io/tests/test_sql.py for examples.

Also, of course including None requires the object dtype. Does this come with a performance cost? @jreback?

In [30]: frame
Out[30]: 
   0         1
0  2       NaN
1  3 -1.029046

In [31]: frame.dtypes
Out[31]: 
0      int64
1    float64
dtype: object

In [32]: frame.where(pd.notnull(frame), None).dtypes
Out[32]: 
0     int64
1    object
dtype: object

@jreback (Contributor) commented Jul 11, 2013

This is quite complicated and shouldn't be done this way. The main issue is different reprs for datetime/non-datetime. There are better ways of doing this (these are internal routines), e.g. see core/format.py/CSVFormatter/_save_chunk. This has to do with how things are converted/passed to SQL, e.g. whether they need to be stringified or not.

You are going to need to segregate by block type, then convert (or not) as needed, substituting appropriate 'null' sentinels (which might be different for different flavors?)
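
A rough sketch of the kind of per-block segregation being suggested (the helper name and the choice of None as the sentinel are illustrative, not pandas internals):

import numpy as np
import pandas as pd

def frame_to_insert_values(df, null_sentinel=None):
    # hypothetical helper: build rows of plain Python objects, substituting
    # a backend-appropriate sentinel for missing values, per column dtype
    columns = []
    for col in df.columns:
        s = df[col]
        mask = pd.isnull(s).values
        if np.issubdtype(s.dtype, np.datetime64):
            # datetime block: stringify values, then drop the sentinel in for NaT
            vals = s.astype(str).values.astype(object)
        else:
            # numeric / object blocks: just upcast to object
            vals = s.values.astype(object)
        vals[mask] = null_sentinel
        columns.append(vals)
    # rows ready for an executemany-style insert
    return list(zip(*columns))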

@hayd (Contributor) commented Jul 11, 2013

@jreback Why is it necessary to have different sentinels for NaN and NaT?

(I agree this should be done on write/like _save_chunk...)

@jreback (Contributor) commented Jul 11, 2013

It's not necessary per se, but I suspect that the different SQL flavors have different sentinels (if None works for everything then great)....

for perf though...this may need to be optimized

@hayd (Contributor) commented Jul 11, 2013

Idea being to abstract the problem of None away to SQLAlchemy, assuming it Just Works™. Which I thought was kind of the point of it...

Yeah, perf could be an issue - in which case we'll end up writing a load of platform-specific stuff? :s

@jreback (Contributor) commented Jul 11, 2013

I am not sure what the perf diff will be, just have to profile it. You might simply want to do something like:

values = df.values.astype(object)
values[pd.isnull(df)] = None

prob should work and be pretty fast (not 100% sure what this will do to dates though)
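
Fleshed out into a standalone snippet (reusing the two-row frame from earlier in the thread):

import numpy as np
import pandas as pd

frame = pd.DataFrame({0: [2, 3], 1: [np.nan, -1.029046]})

# upcast to object so None can be stored, then overwrite the missing cells
values = frame.values.astype(object)
values[pd.isnull(frame).values] = None

print(values)
# [[2.0 None]
#  [3.0 -1.029046]]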

@jreback (Contributor) commented Jul 11, 2013

I think you are going to have to do backend-specific conversions (mostly on NaN/None, but also datetimes). IIRC mysql stores dates as strings? Though some of this may be converted from datetimes.

@hayd (Contributor) commented Jul 11, 2013

@danielballan do we care how it's stored... provided the roundtrip works (and probably also that the dtype is sensible) it's all good...?

Do we need to know in order to query (for None)? Can't SQLAlchemy compile your query in a clever way (worrying about the platform-specific bit), or maybe I've got it totally wrong?

@danielballan (Contributor)

No, I don't think we care. The second comment on the SO question is troubling, or at least confusing to me. I think all flavors of SQL just have NULL, and we'll want those to ultimately come out as np.nan. Certainly not 'NaN'.

@jreback (Contributor) commented Jul 11, 2013

@danielballan you will for sure need to do type conversions on the readback, e.g. make sure dates are correct (you can just use convert_objects(convert_dates='coerce'))

you also may want to do convert_objects(convert_numeric=True) on the numeric columns (may only be necessary depending on how results are returned)
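
(convert_objects has since been removed from pandas; the same coercions can be sketched with pd.to_datetime / pd.to_numeric. Column names below are made up.)

import pandas as pd

# object columns as they might come back from the SQL driver
raw = pd.DataFrame({
    "when": ["2013-07-11", None, "2013-07-12"],
    "amount": ["1.5", "2.0", None],
})

# the rough equivalent of convert_objects(convert_dates='coerce')
raw["when"] = pd.to_datetime(raw["when"], errors="coerce")

# and of convert_objects(convert_numeric=True)
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

print(raw.dtypes)
# when      datetime64[ns]
# amount           float64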

@jreback (Contributor) commented Jul 20, 2013

related to #4163

@hayd mentioned this pull request Jul 20, 2013
@stared commented Sep 11, 2013

Filling NaN with None:

df['col1'].fillna(None)

produces an error:

ValueError: must specify a fill method or value

Is it the same bug as reported in this thread?

@jreback (Contributor) commented Sep 11, 2013

@stared what are you trying to do?

@stared commented Sep 12, 2013

@jreback Convert np.nan fields to None values (for dtype=object, of course).

At the same time I can do (i.e. there is no error):

df['col1'].apply(lambda x: None if pd.isnull(x) else x)

which seems to be equivalent to:

df['col1'].fillna(None)
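
A vectorized way to get the same result is the object-array trick from the snippet above, applied to a single column (a minimal sketch with made-up data):

import numpy as np
import pandas as pd

col = pd.Series([1.0, np.nan, 3.0], name='col1')

# upcast to object, then overwrite the missing positions with None
vals = col.values.astype(object)
vals[pd.isnull(col).values] = None
col_with_none = pd.Series(vals, index=col.index, name=col.name)

print(col_with_none.tolist())
# [1.0, None, 3.0]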

@jreback (Contributor) commented Sep 12, 2013

@stared

Well, aside from apply being MUCH slower, fillna also does dtype inference.

I meant what is your purpose in doing this?

@stared commented Sep 12, 2013

@jreback I am performing an outer join of two tables, so I am getting np.nans.
Later I am using this data to interact with MongoDB and I want to have None for missing fields. (I don't want to make conversions each time I read from or write to the database.)

BTW: why is apply much slower? Or, in general, what should be used for mapping columns?

@jreback (Contributor) commented Sep 12, 2013

ahh...then this is the same issue (it has to do with exporting np.nan -> None, or the appropriate sentinel if, say, it's NaT)

well, apply is not vectorized, so you should avoid it if at all possible; fillna is cython based so it's pretty fast.

apply is very general though

@cpcloud (Member) commented Sep 12, 2013

@stared apply will be much slower than most (all?) ops that are built into pandas already. E.g., fillna does a specific thing, so it doesn't need to accept an arbitrary Python function like apply does. It is therefore free to use whatever numpy and maybe Cython code is available to do its job. apply must be very general, so it is going to be slow.
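
A rough way to see the gap (timings vary by machine; fillna(0) stands in for the comparison because fillna(None) raises, as noted above):

import timeit
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(1_000_000))
s[::10] = np.nan

t_fillna = timeit.timeit(lambda: s.fillna(0), number=10)
t_apply = timeit.timeit(lambda: s.apply(lambda x: 0 if pd.isnull(x) else x), number=10)

print("fillna: %.3fs  apply: %.3fs" % (t_fillna, t_apply))
# expect apply to be one to two orders of magnitude slower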

@jreback (Contributor) commented Sep 28, 2013

@hayd I believe you are going to do this as part of the big SQL refactor (and it's already linked), so closing
