Clarify how to cite pandas in scientific papers #24036


Closed
JoKeyser opened this issue Dec 1, 2018 · 31 comments

@JoKeyser

JoKeyser commented Dec 1, 2018

Dear developers,
first, thank you for your great work.

I'd like to reference my use of pandas in my scientific work, but I have been unable to find a recommendation on how to cite it.
I'm aware of https://pandas.pydata.org/talks.html, but the "papers" seem to refer to conference proceedings (slides) that lack a publisher, and they seem outdated.

Maybe something similar to SciPy's recommendations could be added?

@TomAugspurger
Contributor

The paper linked from
https://www.scipy.org/citing.html is the correct one. We could update the link from talks. That’ll be in the pandas-website repository.

@JoKeyser
Author

JoKeyser commented Dec 1, 2018

@TomAugspurger thanks for the fast response.

So the correct bibtex entry is:

@InProceedings{mckinney-proc-scipy-2010,
  author    = {Wes McKinney},
  title     = {Data Structures for Statistical Computing in Python},
  booktitle = {Proceedings of the 9th Python in Science Conference},
  pages     = {56--61},
  year      = {2010},
  editor    = {St{\'e}fan van der Walt and Jarrod Millman}
}

I guess the only issue then is/was the dead link on https://pandas.pydata.org/talks.html to http://jarrodmillman.com/scipy2010/pdfs/mckinney.pdf when it should link to http://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf

An editor asked for a publisher; I'll just reply that nobody has that role.

@mroeschke
Member

We should eventually include a citation reference in the documentation, probably in overview.rst.

mroeschke added the Docs label Dec 2, 2018
@JoKeyser
Author

JoKeyser commented Dec 3, 2018

@mroeschke yes, I think it's helpful to include it anywhere an author would look for it. Over in the R world, most packages come with their own citation command, which makes it very convenient (and thus likely) for them to be cited correctly.

Not sure whether such a discussion belongs here, but it seems to me that the work since 2010 could also be included in an "official" reference recommendation.

@TomAugspurger
Contributor

TomAugspurger commented Dec 3, 2018 via email

@ivalaginja
Contributor

Does pandas have a DOI for native software citation, like matplotlib? See here, where I can get a bibtex entry for native software citation, depending on the version I use or a concept DOI.

@JoKeyser
Author

JoKeyser commented Sep 2, 2019

Pandas could also get a Research Resource Identifier (RRID), see e.g. matplotlib (RRID:SCR_008624). The RRIDs of the software I used were requested in a review for the European Journal of Neuroscience.

@TomAugspurger
Contributor

I don't think pandas has either of those.

@ivalaginja
Contributor

Would the pandas team consider obtaining a DOI?

To establish citation principles that give everyone due credit and create traceable records of a unit of work such as a piece of software, the FORCE11 Software Citation Working Group published a paper presenting software citation principles. These include the need to cite the software product directly, in addition to software papers (see section 6.2):

6.2 Software papers. Currently, and for the foreseeable future, software papers are being published and cited, in addition to software itself being published and cited, as many community norms and practices are oriented towards citation of papers. [...] the software itself should be cited on the same basis as any other research product; authors should cite the appropriate set of software products. If a software paper exists and it contains results (performance, validation, etc.) that are important to the work, then the software paper should also be cited. We believe that a request from the software authors to cite a paper should typically be respected, and the paper cited in addition to the software.

A DOI makes native software citation possible.

Since pandas is one of the more frequently used packages in many sciences, giving people the opportunity to stick to this citation recommendation might spread this best practice in the community. It would be great to see this included and move towards this as a general standard in citing software. It sure adds to the maintenance overhead of the software, but I would recommend considering it.

@TomAugspurger
Contributor

TomAugspurger commented Sep 3, 2019 via email

@ivalaginja
Contributor

Yup, that's exactly it. While Zenodo is far from the only solution, it's definitely one of the most widely used ones, and the hook with GitHub is super simple.

Yes, each release will get its own DOI so that people can cite the software specifically in the state it was in when they used it. On top of that, each software package also gets a so-called "Concept DOI", which will always refer to the latest release, and the user can choose which one to cite.

The last thing would then be to add an encouragement to cite the software DOI to the citation instructions given for the software. I just realized that this repo doesn't have a CITATION.txt or similar in the first place, nor are there citation instructions in the readme, so it might be worth adding that in general - adding both a citation for the software itself, as well as any papers that should be cited.

@TomAugspurger
Contributor

TomAugspurger commented Sep 3, 2019

Cool, thanks. I'll see about adding the Zenodo GitHub integration.

One final question (since you seem to be familiar with this 😄): if we add a CITATION.txt or a badge to the README, does it have to be updated with each release? Or is there some kind of living "master" DOI that refers to pandas, in addition to the per-release DOIs?

@ivalaginja
Contributor

ivalaginja commented Sep 3, 2019

Yes, that would be the "concept DOI"! You can choose to have that in the CITATION.txt and/or in the badge, and then you don't have to worry about updating it :) You can find more on the DOI business here:
http://help.zenodo.org/#versioning

At some point you can also expect users to know, or at least to research, what they're doing, so sticking in the "overall" DOI (i.e. the Concept DOI) should be more than fine.

@TomAugspurger
Contributor

Perfect, thanks.

@jreback @jorisvandenbossche I've requested access to add the Zenodo GitHub application to pandas-dev.

@jreback
Contributor

jreback commented Sep 3, 2019

all sgtm

@matchings

Has pandas obtained a DOI or RRID yet? I would like to cite the software properly, in addition to the paper citation.

@matchings

ah, I have found the DOI here: https://zenodo.org/record/3630805#.XjI9CyOIYdU
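
A Zenodo record URL like the one above can be turned into a citable DOI. The following is a minimal sketch, assuming Zenodo's usual convention that a DOI it mints has the form `10.5281/zenodo.<record number>`; the function name `zenodo_record_to_doi` is made up for illustration, and records that carry an externally registered DOI would not follow this pattern.

```python
import re

# Assumption: DOIs minted by Zenodo use this prefix followed by the record number.
ZENODO_DOI_PREFIX = "10.5281/zenodo."


def zenodo_record_to_doi(url: str) -> str:
    """Derive the Zenodo DOI from a record URL (hypothetical helper)."""
    match = re.search(r"zenodo\.org/records?/(\d+)", url)
    if match is None:
        raise ValueError(f"not a Zenodo record URL: {url!r}")
    return ZENODO_DOI_PREFIX + match.group(1)


doi = zenodo_record_to_doi("https://zenodo.org/record/3630805#.XjI9CyOIYdU")
print(doi)                       # 10.5281/zenodo.3630805
print("https://doi.org/" + doi)  # resolvable form of the same DOI
```

Note that this yields the DOI of one specific release; the concept DOI discussed earlier in the thread is a separate identifier shown on the Zenodo record page.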

@ivalaginja
Contributor

Cool, so the only thing missing is citation instructions in the repository, or anywhere really. Maybe time to make a citation.txt file in the repo? @TomAugspurger

@TomAugspurger
Contributor

TomAugspurger commented Jan 30, 2020 via email

@ivalaginja
Contributor

Here?
https://pandas.pydata.org/index.html

I browsed through all the pages but couldn't find any citation instructions. Maybe I was searching in the wrong place?

@dvaitkus3

dvaitkus3 commented Feb 1, 2020

https://www.scipy.org/citing.html#pandas has a citation and links out to the SciPy 2010 conference proceedings, which include a BibTeX entry.

@ivalaginja
Contributor

So I guess now that pandas has a way to cite the native software itself, is there a way to get that page updated to include the software citation as well?

@JoKeyser
Author

JoKeyser commented Feb 13, 2020

It seems to me that, now that the DOI exists, citing it is more appropriate than the citation linked at https://www.scipy.org/citing.html#pandas. If this is correct, the latter link should be updated.
And to make it easy for people to cite, it would be nice to clarify that on https://pandas.pydata.org/ as well, provide a (bib) file in the repo, and maybe a <package>.__citation__ variable.
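
The `<package>.__citation__` idea could look roughly like the sketch below. This is purely hypothetical: pandas does not define `__citation__` or a `citation()` helper, and both names are placeholders for illustration. The BibTeX text is the proceedings entry from this thread, with the page range taken from the DOI-bearing entry posted further down.

```python
# Hypothetical module-level citation string, e.g. in a package's __init__.py.
# Nothing here is part of the real pandas API.
__citation__ = r"""@InProceedings{mckinney-proc-scipy-2010,
  author    = {Wes McKinney},
  title     = {Data Structures for Statistical Computing in Python},
  booktitle = {Proceedings of the 9th Python in Science Conference},
  pages     = {56--61},
  year      = {2010},
  editor    = {St{\'e}fan van der Walt and Jarrod Millman}
}"""


def citation() -> str:
    """Return the BibTeX entry users are asked to cite (hypothetical helper)."""
    return __citation__


# Print the entry so it can be pasted into a .bib file.
print(citation())
```

A helper like this would play the role of R's `citation()` command mentioned earlier in the thread: one documented place that returns the recommended reference.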

@ivalaginja
Contributor

Absolutely, the SciPy link should be updated. However, I would be hesitant to delete the pandas paper from any citation instructions, since scientists in particular usually get more credit for papers than for software. That's a decision the pandas team will have to make internally. And yes, I would update the "cite" section on https://pandas.pydata.org as well.
Personally, I recommend a plain .txt file in the repo that also includes the citations in bib format, but obviously any format will work as long as it is easy to find.

@jreback
Contributor

jreback commented Feb 13, 2020

happy to take a PR to add a .txt for citation purposes (likely numpy / scipy have an example of how this could look)

@ivalaginja
Contributor

Interestingly, neither of them does, as the scientific community often relies more on paper citations than on actually giving credit to the software and its writers themselves (find the problem there :P).
There is a fitting package called sherpa that has a nice example of a CITATION file in their repo on the master branch.

Full disclosure: I also talked to them about including the native software citation in addition to the bib entries for the papers, which is currently in this PR.

@mroeschke
Member

Note our citation page is here: https://pandas.io/about/citing.html

We could still update it with the DOI and possibly add a citation.txt

@ivalaginja
Contributor

I submitted a PR with #32388 to get things going.

@MPvHarmelen

> Note our citation page is here: https://pandas.io/about/citing.html
>
> We could still update it with the DOI and possibly add a citation.txt

The link is dead. This is the up-to-date URL: https://pandas.pydata.org/about/citing.html

jreback modified the milestones: 1.1, Contributions Welcome Jul 10, 2020
@mroeschke
Member

Given that we have https://pandas.pydata.org/about/citing.html, I think that's sufficient for now. Going to close, but if we want to reference this in other ways, we can open a new issue for it.

@buhtz

buhtz commented Aug 25, 2021

Keep in mind that this citation has two parts: McKinney's paper is "just" one part of the proceedings, so the complete entry is a bit more complex. In German we say "Sammelwerk" (collected volume) and "Sammelwerkbeitrag" (contribution to a collected volume). It means there is a book (with an editor) in which each chapter has a different author.

% This file was created with Citavi 6.8.0.0

@inproceedings{McKinney.2010,
 author = {McKinney, Wes},
 title = {Data Structures for Statistical Computing in Python},
 pages = {56--61},
 bookpagination = {page},
 publisher = {SciPy},
 series = {Proceedings of the Python in Science Conference},
 editor = {{van der Walt}, St{\'e}fan and Millman, Jarrod},
 booktitle = {Proceedings of the 9th Python in Science Conference},
 year = {2010},
 doi = {10.25080/Majora-92bf1922-00a},
 booksubtitle = {SciPy 2010},
 eventtitle = {Python in Science Conference},
 venue = {Austin, Texas},
 eventdate = {2010-06-28/2010-07-03}
}
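
The two-part structure described above (a collected volume with editors, plus a contribution with its own author) can be made explicit in code. The sketch below is only an illustration: the class names `Proceedings` and `Contribution` and the method `to_bibtex` are invented here, and only a subset of the fields from the entry above is emitted.

```python
from dataclasses import dataclass


@dataclass
class Proceedings:
    """The collected volume ("Sammelwerk"): it has editors, not authors."""
    booktitle: str
    editor: str
    year: int


@dataclass
class Contribution:
    """One paper inside the volume ("Sammelwerkbeitrag")."""
    key: str
    author: str
    title: str
    pages: str
    doi: str
    container: Proceedings

    def to_bibtex(self) -> str:
        # Merge the contribution's fields with those of its container.
        fields = {
            "author": self.author,
            "title": self.title,
            "pages": self.pages,
            "editor": self.container.editor,
            "booktitle": self.container.booktitle,
            "year": str(self.container.year),
            "doi": self.doi,
        }
        body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
        return f"@inproceedings{{{self.key},\n{body}\n}}"


scipy2010 = Proceedings(
    booktitle="Proceedings of the 9th Python in Science Conference",
    editor=r"{van der Walt}, St{\'e}fan and Millman, Jarrod",
    year=2010,
)
paper = Contribution(
    key="McKinney.2010",
    author="McKinney, Wes",
    title="Data Structures for Statistical Computing in Python",
    pages="56--61",
    doi="10.25080/Majora-92bf1922-00a",
    container=scipy2010,
)
entry = paper.to_bibtex()
print(entry)
```

Keeping the volume separate from the contribution means several papers from the same proceedings can share one `Proceedings` object, which mirrors how reference managers treat the two levels.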


