#26545 Fix: same .tsv file, get different data-frame structure using engine 'python' and 'c' #26634

Merged
merged 11 commits into from
Jun 12, 2019

Conversation

luckydenis
Contributor

@luckydenis luckydenis commented Jun 3, 2019

BugFix:
When using engine='python', the columns were parsed incorrectly if the first header field began with a BOM (byte order mark).

Bug:

In [1]: import pandas as pd                                                             

In [2]: pd.read_csv('test.txt', engine='python', delimiter='\t')              
Out[2]: 
Empty DataFrame
Columns: [Project ID]
Index: []

In [3]: pd.read_csv('test.txt', engine='python', delimiter='\t').shape        
Out[3]: (0, 1)

In [4]: pd.read_csv('test.txt', delimiter='\t')                               
Out[4]: 
Empty DataFrame
Columns: [Project ID, Project Name, Product Name]
Index: []

In [5]: pd.read_csv('test.txt', delimiter='\t').shape
Out[5]: (0, 3)

Fix:

In [1]: import pandas as pd                                                             

In [2]: pd.read_csv('test.txt', engine='python', delimiter='\t')              
Out[2]: 
Empty DataFrame
Columns: [Project ID, Project Name, Product Name]
Index: []

In [3]: pd.read_csv('test.txt', engine='python', delimiter='\t').shape        
Out[3]: (0, 3)

In [4]: pd.read_csv('test.txt', delimiter='\t')                               
Out[4]: 
Empty DataFrame
Columns: [Project ID, Project Name, Product Name]
Index: []

In [5]: pd.read_csv('test.txt', delimiter='\t').shape
Out[5]: (0, 3)
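
For readers following along, the behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the pandas implementation: the helper name `strip_bom_from_header` and the sample header row are assumptions (the actual contents of test.txt are not shown in this thread). The point it illustrates is that removing the BOM from the first header field must keep the remaining fields, otherwise only one column survives.

```python
import csv
import io

BOM = "\ufeff"  # the byte order mark as it appears after decoding


def strip_bom_from_header(header):
    """Return a copy of `header` with a leading BOM removed from the first field.

    Hypothetical helper for illustration only; not part of pandas.
    """
    if not header:
        return header
    first = header[0]
    if first.startswith(BOM):
        first = first[len(BOM):]
    # Keep every remaining column. Returning only [first] here would
    # reproduce the one-column symptom shown in the "Bug" session.
    return [first] + header[1:]


# Assumed sample input: a tab-separated header row preceded by a BOM.
raw = BOM + "Project ID\tProject Name\tProduct Name\n"
header = next(csv.reader(io.StringIO(raw), delimiter="\t"))
cleaned = strip_bom_from_header(header)
print(cleaned)
```

The BOM only ever affects the first field, so the helper touches that field and concatenates the rest unchanged.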

@pep8speaks

pep8speaks commented Jun 3, 2019

Hello @luckydenis! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-11 07:36:17 UTC

@WillAyd
Member

WillAyd commented Jun 3, 2019

This closes an issue, right? If so, can you update the OP to reflect that?

@WillAyd WillAyd added the IO CSV read_csv, to_csv label Jun 3, 2019
@codecov

codecov bot commented Jun 3, 2019

Codecov Report

Merging #26634 into master will decrease coverage by 50.09%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #26634      +/-   ##
==========================================
- Coverage   91.88%   41.78%   -50.1%     
==========================================
  Files         174      174              
  Lines       50692    50692              
==========================================
- Hits        46576    21182   -25394     
- Misses       4116    29510   +25394
Flag Coverage Δ
#multiple ?
#single 41.78% <ø> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/sparse/scipy_sparse.py 10.14% <0%> (-89.86%) ⬇️
pandas/core/tools/numeric.py 10.14% <0%> (-89.86%) ⬇️
... and 128 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d124ea...0816572. Read the comment docs.

@codecov

codecov bot commented Jun 3, 2019

Codecov Report

Merging #26634 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #26634      +/-   ##
==========================================
- Coverage   91.72%   91.71%   -0.01%     
==========================================
  Files         178      178              
  Lines       50779    50779              
==========================================
- Hits        46578    46574       -4     
- Misses       4201     4205       +4
Flag Coverage Δ
#multiple 90.31% <ø> (ø) ⬆️
#single 41.19% <ø> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 96.88% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 157a4e3...fb010d5. Read the comment docs.

@luckydenis
Contributor Author

@WillAyd, I've made corrections based on your comments. Please take a look.

Member

@WillAyd WillAyd left a comment

Thanks for the updates - can you also add a whatsnew note for 0.25?

cc @gfyoung if you care to take a look

@WillAyd
Member

WillAyd commented Jun 4, 2019

@luckydenis could you also check what test_utf8_bom is doing in the test module? It looks to cover the same intention as the test added here, so I want to make sure we understand the difference and clarify accordingly.

@luckydenis
Contributor Author

luckydenis commented Jun 4, 2019

In this context, I'm not sure that they are different, but my PR fixes this error. Which tm method should be used to check it? The error was in the method that strips the BOM.

In [1]: import pandas as pd                                                             

In [2]: pd.read_csv('test.txt', engine='python', delimiter='\t')              
Out[2]: 
Empty DataFrame
Columns: [Project ID]
Index: []

In [3]: pd.read_csv('test.txt', engine='python', delimiter='\t').shape        
Out[3]: (0, 1)

In [4]: pd.read_csv('test.txt', delimiter='\t')                               
Out[4]: 
Empty DataFrame
Columns: [Project ID, Project Name, Product Name]
Index: []

In [5]: pd.read_csv('test.txt', delimiter='\t').shape
Out[5]: (0, 3)
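
As a side note on building input for such a check, the sketch below shows how a BOM-prefixed header can be constructed and how it looks after decoding. It uses only the standard library; the header text is an assumed example, since the real contents of test.txt are not shown in this thread. Decoding with `utf-8` leaves the BOM in the first field as U+FEFF, while `utf-8-sig` strips it.

```python
import codecs

# Assumed header row; the actual test.txt is not included in this thread.
header = "Project ID\tProject Name\tProduct Name\n"

# Prefix the encoded text with the UTF-8 byte order mark (EF BB BF).
data = codecs.BOM_UTF8 + header.encode("utf-8")

plain = data.decode("utf-8")      # BOM survives as the character U+FEFF
sig = data.decode("utf-8-sig")    # BOM is removed by the codec

first_plain = plain.split("\t")[0]
first_sig = sig.split("\t")[0]

print(repr(first_plain))
print(repr(first_sig))
```

A parser that receives the `utf-8` decoding has to remove the BOM from the first header field itself, which is the code path this PR discusses.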

Member

@WillAyd WillAyd left a comment

I think this is good. If you can add a whatsnew for 0.25, then lgtm!

@WillAyd WillAyd added this to the 0.25.0 milestone Jun 5, 2019
@jreback
Contributor

jreback commented Jun 6, 2019

@gfyoung ok with this?

@gfyoung
Member

gfyoung commented Jun 6, 2019

@jreback : Looks fine to me, just need the whatsnew as @WillAyd said

@luckydenis
Contributor Author

Yeah, I'll add a description in the whatsnew, just haven't had time.

@luckydenis
Contributor Author

@jreback, @WillAyd, @gfyoung I've added it, please take a look.

Member

@WillAyd WillAyd left a comment

lgtm @gfyoung

Member

@gfyoung gfyoung left a comment

@jreback if we have anything else

@jreback
Contributor

jreback commented Jun 10, 2019

lgtm. @luckydenis if you can merge master and ping on green to resolve the conflict.

@jreback jreback merged commit a137a9c into pandas-dev:master Jun 12, 2019
@jreback
Contributor

jreback commented Jun 12, 2019

thanks @luckydenis

@luckydenis luckydenis deleted the dev-26545 branch June 13, 2019 18:08
Labels
IO CSV read_csv, to_csv
Development

Successfully merging this pull request may close these issues.

same .tsv file, get different data-frame structure using engine 'python' and 'c'
6 participants