problem with to_sql with NA #8778


Closed
clembou opened this issue Nov 10, 2014 · 12 comments · Fixed by #8926 or #8973
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement IO SQL to_sql, read_sql, read_sql_query

@clembou commented Nov 10, 2014

I am experiencing issues when writing a boolean column containing missing (NA) values to SQL.

data = [True, None, False]
df = pd.DataFrame(data, columns=['test'])

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 1 columns):
test    2 non-null object
dtypes: object(1)
memory usage: 48.0 bytes

I am reading data from an MS SQL Server and am trying to write it unchanged into another table.
As the data contains NULL values, pandas changes the dtype to object. When calling to_sql, object columns seem to be mapped to a "text" column type, and the export then fails:

df.to_sql('test_table', engine_local, if_exists='append', index=True)

DataError: (DataError) ('22018', '[22018] [Microsoft][ODBC SQL Server Driver][SQL Server]Operand type clash: bit is incompatible with text (206) (SQLExecDirectW); [42000] [Microsoft][ODBC SQL Server Driver][SQL Server]Statement(s) could not be prepared. (8180)') u'INSERT INTO test_table ([index], test) VALUES (?, ?)' ((0L, True), (1L, None), (2L, False))

'object' dtypes are tricky to handle systematically, I guess, but maybe we could add the ability to tell pandas how to handle them on a column-by-column basis?

something like:

df.to_sql('test_table', engine_local, if_exists='append', index=True, col_types={'test': bool})

could do the trick?

@jorisvandenbossche jorisvandenbossche added the IO SQL to_sql, read_sql, read_sql_query label Nov 10, 2014
@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Nov 10, 2014
@jorisvandenbossche (Member)

I think this would be a reasonable addition to the API (a dtype kwarg in to_sql): a way to override, per column, the default sqlalchemy type that is used. That would make it possible to work around corner cases like the one above.
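For reference, the keyword that eventually landed in #8926 is called dtype. A minimal sketch of the workaround (assuming pandas >= 0.15.2; an in-memory SQLite connection stands in for the MSSQL engine in the report):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'test': [True, None, False]})

# In-memory SQLite stands in for the SQL Server engine here.
con = sqlite3.connect(':memory:')

# With a plain DBAPI connection, dtype values are SQL type strings;
# with a SQLAlchemy engine they would be types such as sqlalchemy.Boolean.
df.to_sql('test_table', con, index=False, dtype={'test': 'BOOLEAN'})

result = pd.read_sql('SELECT * FROM test_table', con)
```

The explicit BOOLEAN type keeps the column from being created as text, so the NULL row round-trips alongside the boolean values.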

@jorisvandenbossche (Member)

comment from @artemyk in #8926:

What if instead of having to specify the types as in this PR, we iterated over the columns with dtype==object, and did pandas.core.common._infer_dtype_from_scalar on their first non-null entry?

I think the problem with that is that the series can also be genuinely mixed (and not only because of the NaNs), in which case checking a single scalar can give wrong results.
But maybe it would be interesting to have some kind of infer function that ignores NaNs when determining whether something is mixed, so that the following would return 'boolean' instead of 'mixed':

In [61]: s = pd.Series([False, True, np.nan])

In [62]: pd.lib.infer_dtype(s)
Out[62]: 'mixed'

@artemyk (Contributor) commented Nov 30, 2014

@jorisvandenbossche Yes, I think that would be very useful (though perhaps the real solution would be to have typed NA values; I suppose that would require some big structural changes). I can see how the dtype parameter is good for corner cases, and for solving this issue now, but it would be great if (at least in most cases) NAs could be inserted without having to specify dtypes.

@artemyk (Contributor) commented Nov 30, 2014

@jorisvandenbossche I looked into how DataFrame infers types, and it seems like it mostly uses numpy.array inference. We could use something similar or, perhaps more robustly:

import pandas as pd

def infer_notnull_dtype(series):
    # Non-object dtypes are already specific enough.
    if series.dtype != 'object':
        return series.dtype
    else:
        # Rebuilding the series from only the non-null values lets
        # pandas infer a more specific dtype.
        s = pd.Series(series[~series.isnull()].tolist())
        return s.dtype

Not sure how slow this would be for large dataframes.

@jorisvandenbossche (Member)

The tolist will certainly not perform well for large dataframes, I think ...
But you have lib.infer_dtype for this, and maybe just passing series[~series.isnull()] to it can already do the trick?

@jorisvandenbossche (Member)

In [1]: s = pd.Series([False, True, np.nan])

In [2]: pd.lib.infer_dtype(s)
Out[2]: 'mixed'

In [3]: pd.lib.infer_dtype(s[s.notnull()])
Out[3]: 'boolean'
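As a side note for later readers: pd.lib.infer_dtype has since moved to pandas.api.types.infer_dtype, which grew a skipna argument that does exactly this. A sketch, assuming a recent pandas:

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

s = pd.Series([False, True, np.nan])

# Counting the NaN, the series looks mixed...
with_na = infer_dtype(s, skipna=False)    # 'mixed'
# ...but ignoring nulls recovers the underlying type.
without_na = infer_dtype(s, skipna=True)  # 'boolean'
```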

Hmm, this certainly seems like something that can be added to the type inference, just before the last step where Text is returned for object columns.

@jreback (Contributor) commented Nov 30, 2014

@jorisvandenbossche yep, much better. In fact the only series you really need to re-infer is a mixed-type boolean, I think (e.g. if it's object, then infer it; if it comes back mixed, remove the NaNs and infer again).
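That re-infer step could be sketched like this (a hypothetical helper, not the actual sql.py code, using the modern pandas.api.types.infer_dtype spelling):

```python
import pandas as pd
from pandas.api.types import infer_dtype

def infer_nonnull_dtype(series):
    """Infer a dtype name, retrying without NaNs if the result is mixed."""
    # Non-object dtypes are already specific enough.
    if series.dtype != object:
        return series.dtype.name
    inferred = infer_dtype(series, skipna=False)
    # Only a 'mixed' result warrants the second, NaN-free pass.
    if inferred == 'mixed':
        inferred = infer_dtype(series[series.notnull()], skipna=False)
    return inferred
```

With this, a boolean column holding a None would come back as 'boolean' rather than 'mixed', while a genuinely mixed column stays mixed.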

@artemyk (Contributor) commented Nov 30, 2014

@jorisvandenbossche Oh, nice! Didn't realize that lib.infer_dtype already exists. I could probably do the patch, though it would be easier to wait until #8926 is merged.

@artemyk (Contributor) commented Nov 30, 2014

A little weird, though, since infer_dtype returns differently named types than the numpy types we test for in sql.py, e.g. 'boolean' vs 'bool', 'integer' vs 'int', etc. Is there an existing way to standardize these, or should we just set up checks for both variants?

@jreback (Contributor) commented Dec 1, 2014

@artemyk these are specific to pandas and not real types; they exist to figure out what you have when given unknown input, and are purely internal.

@jorisvandenbossche (Member)

Going to keep this open for now, given the discussion above.

@clembou (Author) commented Dec 15, 2014

@artemyk @jorisvandenbossche @jreback Thanks a lot all. I just upgraded to 0.15.2 and can confirm that I can now successfully round-trip data from/to SQL following this bug fix. Thanks to the second PR, I don't even need to use the dtype parameter in my test case!
