KeyError from pandas DataFrame groupby for Windows based csv files #16690
Comments
Can you paste the output of
You may need
@TomAugspurger, oh, sorry, the full question/info is from https://stackoverflow.com/questions/44508502/. Here is the part you wanted:
Note that, using
OK, I'll try with a newer version of Python (v3), but for pandas I'm not sure, because I'm using IBM's datascientistworkbench.com, which updates its toolchain quite often but is out of my control. So how can I use the
Is pandas 0.18.1 not recent enough? That bug was nearly a year ago.
The root problem is that you have a BOM (
I don't think BOMs are considered whitespace by Python, so they won't be stripped.
pass it to
That fix was in 0.19 (check the milestone on the pull request).
And the problem being a BOM at the start of the file is just a guess. I could be wrong.
Hi @TomAugspurger, you are absolutely right. Using
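For readers following the BOM diagnosis above, here is a minimal sketch of how a UTF-8 BOM ends up inside the first column name. It uses only the standard library so it does not depend on any particular pandas version. The file contents and the `read_header` helper are hypothetical stand-ins, not the reporter's data, and the `utf-8-sig` codec is one common way to consume a BOM, not necessarily the option suggested in this thread.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the reporter's file:
# UTF-8 with a leading BOM (EF BB BF) and Windows CRLF line endings.
raw = b"\xef\xbb\xbfId,Value\r\n1,a\r\n1,b\r\n2,c\r\n"


def read_header(path, encoding):
    """Return the first CSV row (the header) decoded with the given encoding."""
    with open(path, newline="", encoding=encoding) as fh:
        return next(csv.reader(fh))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    with open(path, "wb") as fh:
        fh.write(raw)
    # Plain utf-8 keeps the BOM as the character U+FEFF in the first field.
    header_plain = read_header(path, "utf-8")
    # utf-8-sig consumes a leading BOM if one is present.
    header_sig = read_header(path, "utf-8-sig")

print(repr(header_plain[0]))
print(repr(header_sig[0]))
```

The two header names look identical when printed normally, which is why a lookup by the visible name can fail even though the column appears to exist.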
Code Sample, a copy-pastable example if possible
Problem description
The df.groupby(['Id']) call threw an exception:
Expected Output
No exception. Returns a pandas.core.groupby.DataFrameGroupBy object.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-57-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
pandas: 0.18.1
nose: 1.3.4
pip: 8.1.2
setuptools: None
Cython: 0.19.2
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.8.0
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.0.0
numexpr: 2.2.2
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)
jinja2: 2.8
boto: None
pandas_datareader: None
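The failure described in the problem description can be sketched without the original file. The example below assumes the first column label carries an invisible U+FEFF prefix, as it would if a BOM were read into the header. The frame contents are hypothetical, and stripping the prefix with `rename` is just one illustrative workaround.

```python
import pandas as pd

# Hypothetical data: the first label is "\ufeffId", which displays as "Id".
df = pd.DataFrame({"\ufeffId": [1, 1, 2], "Value": [10, 20, 5]})

# Grouping by the visible name fails because no column is literally "Id".
try:
    df.groupby(["Id"])
    raised = False
except KeyError:
    raised = True

# Strip the BOM character from every label, then group as intended.
clean = df.rename(columns=lambda name: name.lstrip("\ufeff"))
sums = clean.groupby(["Id"])["Value"].sum()

print(raised)
print(list(clean.columns))
```

Checking `list(df.columns)` or `repr(df.columns[0])` is a quick way to spot a hidden prefix of this kind.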
Further explanation
There are time-formatted strings in the csv file, but that's not the cause:
I am certain that the real problem is the format of the file -- the "test.csv" is Windows based, and is output from SQL Server SSMS.
Using file test.csv under Linux shows:
Here are the first several bytes of the file:
This is very important and is the root cause. Proof: after converting the file with dos2unix under Linux, the same code above works. The groupby no longer throws an exception.
Here are the first several bytes of the working file:
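The byte-level comparison above can also be done from Python. This is a minimal sketch, assuming the interesting differences are a leading UTF-8 BOM and CRLF line endings. The `describe_bytes` helper and both sample byte strings are hypothetical, not the reporter's actual data.

```python
import codecs


def describe_bytes(data):
    """Report whether raw bytes start with a UTF-8 BOM and which line endings they use."""
    crlf = data.count(b"\r\n")
    return {
        "utf8_bom": data.startswith(codecs.BOM_UTF8),
        "crlf": crlf,
        # Line feeds that are not part of a CRLF pair.
        "bare_lf": data.count(b"\n") - crlf,
    }


# Hypothetical samples standing in for the failing and the working file.
windows_sample = b"\xef\xbb\xbfId,Value\r\n1,a\r\n"
unix_sample = b"Id,Value\n1,a\n"

windows_info = describe_bytes(windows_sample)
unix_info = describe_bytes(unix_sample)

print(windows_info)
print(unix_info)
```

For a real file, the bytes could come from `open(path, "rb").read()`. Reporting the two properties separately helps tell a BOM problem apart from a line-ending problem.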