-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Cannot convert numbers to strings when reading an Excel spreadsheet #5891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
have you tried passing the |
Thanks, but the |
did you try it? it passes thru other keywords thru. |
Yes. I get 'ValueError: dtype is not supported with python parser' |
How exactly are you doing that? As far as I can see from the So unless I am missing something I don't see how Pandas can be made to read the columns as strings. It will always get a float from The CSV readers behave differently and can be type specified because they return all read data as strings regardless of whether it looks like a number of not. Perhaps @jreback or @jtratner can suggest a different way to convert a column from But apart from a workaround I don't think this issue is fixable in the way that you would like. |
Have you tried this in 0.13? If all your numbers are ints, pandas will read them in as ints (recent change in 0.13) and you can just do |
closing as fixed in 0.13 |
why is this fixed? There is still no way of reading excel file columns as objects. |
see @jmcnamara comment unless their is an easy way to tell xlrd to read it as not a number (eg do not do any conversions) |
Would it be possible then to keep every column as a float? Because this means datetime columns in excel are very hard to handle (cannot read as text, cannot convert as text) |
I believe that xlrd converts datetime properly (iirc you have to specify what is a date column) is their a problem with some conversion? |
Well, in my case I don't know the file structure before hand, and hence i cannot specify which columns are dates at the moment of import. Maybe there is another way I can detect the date columns before using the read_excel function? |
Not easily. Dates in Excel are stored as real numbers. The only thing that distinguishes them from other real numbers is the number format that is applied to them. These number formats aren't generally stored with the number in the Excel file and in general some level of heuristic is required to determine if the cell contains a date. Xlrd does a lot of work around this so you don't have to. However, could we take a step back and ask what you are trying and do? |
well, my company receives several files of client data (normally Excel files) on a constant basis. I am in charge of ensuring that data is formatted according to our main import logic. I built a Flask app to help our account managers formatting the files. This app relies on pandas heavily, because pandas is amazing. However, since these files are coming straight from clients, my app doesn't know what is on each file, so each account manager uses my app to decide if a column should be an int, or a date, etc. So how i built the app it would turn every column into a dtype object column and then format every column according to the requirements. This works great with csv files (because of the dtype arg), but with excel files the date columns are giving me a hard type. |
read_json does this type of inference |
I have the same problem about numbers and string trying to convert an xls/xlsx to csv. Reading from excel, pandas read them as integers and cut all the 0's and results to "7453" , "314".. (which is wrong in my case). |
Pandas uses the xlrd module for reading Excel files and xlrd preserves Excel's types when there is a corresponding type in Python. Therefore if You can check that yourself by selecting that cell in Excel and changing the format to If the data is actually a string but is being read as a number then that would be an issue. If not then there isn't any workaround when reading the data. Perhaps it would be possible to reformat the data prior to writing it to the CSV file. @jreback might have some suggestions there. |
Just to say that the solution I adopted was to use xlrd to inspect the type and then convert if necessary. This worked for my use case (which was avoid large number-like strings being converted to scientific notation; I don't think it will help with the removal of leading zeroes):
|
Maybe it would be possible to use xlrd to create a csv tempfile of the csv that pandas would then import with the same arguments as read_csv? |
the 'real' soln here is to support a PR anyone? |
Exactly, that would be the solution. |
not sure |
vote for suggested solution which is to support dtype for excel file. |
i can't believe it, i have the same problem right now and i have to use somewhat else to read excel files because of this issue |
I'm having an issue with this too. @jreback -- there is an issue with xlrd: < python-excel/xlrd#77 >. It appears to clip millisecond information off of the column when reading a time. My fall back was going to be reading the column as a string and then manually using I suppose my best option right now is to open the files manually and re-save them as csv, then load them with read_csv so I can use the dtype argument. Is that correct? |
@spearsem I think that is right |
@spearsem Try upgrade to the latest version of xlrd. |
Here is a workaround to the problem of leading zeros. If you know that your column has a fixed-width you can first turn the column into a string, and then add the gone padding zeros with the string method zfill, as done in here: http://stackoverflow.com/questions/339007/nicest-way-to-pad-zeroes-to-string Overall, I agree that read_excel should have a dtype option. |
where are we on getting read_excel to have a dtype option? I'm reading in an excessively large excel file which has over 1000 columns and 33 thousand rows. The cell values are comprised of ints, floats, and strings, but it seems that read_excel is treating all the numerics (ints and floats) as floats. So a 99 is becoming 99.0. Having the ability to specify to treat everything as a string would be ideal. |
see the open issue, #5342; pull-requests are welcome for that. Its actually pretty easy. Though you realize that you generally don't need it. Only to specific make things Furthermore treating ints/floats as a string is NOT good practice at all, and can lead to completely non-performant code. |
I'm reading some excel spreadsheets (xlsx format) into pandas using
read_excel
, which generally works great. The problem I have is that when a column contains numbers, pandas converts these to float64 type, and I would like them to be treated as strings. After reading them in, I can convert the column to str:This works as far as assigning the right type to the column, but when I view the values in this column, the strings are formatted in scientific-format e.g. 8.027770e+14, which is not what I want. I like to be able to tell Pandas to read the columns as strings. My current solution involves dropping down to xlrd to read the spreadsheet.
See stackoverflow question: http://stackoverflow.com/q/20970483/690890
The text was updated successfully, but these errors were encountered: