ENH: Add support for reading 102-format Stata dta files #58978

Merged (9 commits) on Jul 23, 2024

Conversation

cmjcharlton
Contributor

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This would complete support for reading all historic Stata dta format versions.

I would understand if you chose not to merge this as:

  • No formal documentation exists for this version, so I have had to infer the details from later formats and the Stata 1 user manual.
  • Unlike for every other format version, I have not been able to locate any sample data written in this version (and hence I haven't created a linked issue).

Having said that, I am reasonably confident that the changes are correct, and Stata is happy to open and view the test data that I created:

. dtaversion "stata-compat-102.dta"
  (file "stata-compat-102.dta" is
   .dta-format 102 from Stata 1)
. use "stata-compat-102.dta"
. describe

Contains data from stata-compat-102.dta
 Observations:             3                  
    Variables:             7                  
-------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------------------------------------------
index           long    %12.0g                
i8              int     %8.0g                 
i16             int     %8.0g                 
i32             long    %12.0g                
f               float   %9.0g                 
d               double  %10.0g                
dt              double  %10.0g                
-------------------------------------------------------------------------------------------------------------------
Sorted by:

. list

     +--------------------------------------------------+
     | index   i8     i16        i32     f    d      dt |
     |--------------------------------------------------|
  1. |     1   -1   -1025   -8388609   -.1   .1   14610 |
  2. |     2    0       0          0   -.2   .2   14611 |
  3. |     3    1    1025    8388609   -.3   .3   14612 |
     +--------------------------------------------------+
. dtaversion "stata4_102.dta"
  (file "stata4_102.dta" is
   .dta-format 102 from Stata 1)
. use "stata4_102.dta"
. describe

Contains data from stata4_102.dta
 Observations:            10                  
    Variables:             5                  
-------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------------------------------------------
fulllab         int     %8.0g      full_lbl   A fully labeled variable.
fulllab2        float   %9.0g      full_lbl   Another fully labeled variable.
incmplab        long    %12.0g     incp_lbl   Some values without labels.
misslab         int     %8.0g      miss_lbl   Some missing value labels.
floatlab        float   %9.0g      full_lbl   Floating point with labels.
-------------------------------------------------------------------------------------------------------------------
Sorted by:

. list

     +----------------------------------------------------+
     | fulllab   fulllab2   incmplab   misslab   floatlab |
     |----------------------------------------------------|
  1. |     one        ten        one       one        one |
  2. |     two       nine        two       two        two |
  3. |   three      eight      three     three      three |
  4. |    four      seven          4      four       four |
  5. |    five        six          5         .       five |
     |----------------------------------------------------|
  6. |     six       five          6         .        six |
  7. |   seven       four          7         .      seven |
  8. |   eight      three          8         .      eight |
  9. |    nine        two          9         .       nine |
 10. |     ten        one        ten         .        ten |
     +----------------------------------------------------+
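
For readers who want to check a file's format number without Stata, here is a minimal sketch of what `dtaversion` reports above. It is not part of this pull request, and `sniff_dta_version` is a hypothetical helper name. It assumes only that older binary .dta files store the format number in their first byte, while format 117 and later files begin with an XML-style `<stata_dta><header><release>NNN</release>` header.

```python
import re


def sniff_dta_version(head: bytes) -> int:
    """Return the .dta format number given the first bytes of a file.

    Hypothetical helper for illustration only; it does not validate
    the rest of the header.
    """
    if not head:
        raise ValueError("empty input")
    if head[:1] == b"<":
        # Formats 117+ start with an XML-style header containing the release.
        match = re.match(rb"<stata_dta><header><release>(\d{3})</release>", head)
        if match is None:
            raise ValueError("unrecognised XML-style .dta header")
        return int(match.group(1))
    # Older formats: the first byte is the format number itself.
    return head[0]
```

To try it on a real file, read a small prefix in binary mode, for example `sniff_dta_version(open("stata-compat-102.dta", "rb").read(64))`, which should agree with the `dtaversion` output if the assumption above holds.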

Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 12, 2024
@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, let us know and we can reopen. I don't think there should be much objection to supporting this.

@mroeschke mroeschke closed this Jul 22, 2024
@cmjcharlton
Contributor Author

I'm happy to continue working on this if you think that it might be useful. How would you like it referenced in the whatsnew file, given that there isn't currently an issue that it addresses?

@mroeschke mroeschke reopened this Jul 22, 2024
@mroeschke
Member

You can just reference this pull request

@cmjcharlton cmjcharlton marked this pull request as ready for review July 22, 2024 20:33
Comment on lines 2066 to 2067
ref = os.path.join(data_base, "stata-compat-118.dta")
old = os.path.join(data_base, f"stata-compat-{version}.dta")
Member

Could you make these 2 use datapath as well?

Contributor Author

I have now made this change. Note that the equivalent tests for other format versions don't use datapath for these either, but I haven't changed them, to keep this pull request restricted to 102-format-related changes.

@mroeschke mroeschke added IO Stata read_stata, to_stata and removed Stale labels Jul 22, 2024
@mroeschke mroeschke added this to the 3.0 milestone Jul 22, 2024
@mroeschke mroeschke merged commit 67a58cd into pandas-dev:main Jul 23, 2024
49 checks passed
@mroeschke
Member

Thanks @cmjcharlton

Labels
IO Stata read_stata, to_stata