Commit 5414988

qiyunzhu and wasade authored
Adding taxdump format and taxdump-to-tree converter (scikit-bio#1810)
* added taxdump format and taxdump-to-tree converter
* fixed code style issues
* Update skbio/io/format/taxdump.py
  Co-authored-by: Daniel McDonald <[email protected]>
* Update skbio/tree/_tree.py
  Co-authored-by: Daniel McDonald <[email protected]>
* restored .c files
* fixed typo in format title
* allowed custom taxdump scheme
* fixed another .c file
* fixed a typo
* mentioned in changelog
* Update CHANGELOG.md
  Co-authored-by: Daniel McDonald <[email protected]>
1 parent 01e1769 commit 5414988

8 files changed, +783 -0 lines changed


CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -1,5 +1,12 @@
 # scikit-bio changelog
 
+## Version 0.5.8-dev
+
+### Features
+
+* Added NCBI taxonomy database dump format (`taxdump`) ([#1810](https://github.com/biocore/scikit-bio/pull/1810)).
+* Added `TreeNode.from_taxdump` for converting taxdump into a tree ([#1810](https://github.com/biocore/scikit-bio/pull/1810)).
+
 ## Version 0.5.7
 
 ### Features

skbio/io/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -253,6 +253,7 @@ class of Python's standard library. The goal of a `sniffer` is twofold: to
 import_module('skbio.io.format.gff3')
 import_module('skbio.io.format.stockholm')
 import_module('skbio.io.format.binary_dm')
+import_module('skbio.io.format.taxdump')
 
 # This is meant to be a handy indicator to the user that they have done
 # something wrong.

skbio/io/format/taxdump.py

Lines changed: 326 additions & 0 deletions
@@ -0,0 +1,326 @@
"""
Taxdump format (:mod:`skbio.io.format.taxdump`)
===============================================

.. currentmodule:: skbio.io.format.taxdump

The NCBI Taxonomy database dump (``taxdump``) format stores information on
organism names, classifications and other properties. It is a tabular format
with a delimiter: ``<tab><pipe><tab>`` between columns, and a line end
``<tab><pipe>`` after all columns. The file name usually ends with .dmp.

Format Support
--------------
**Has Sniffer: No**

+------+------+---------------------------------------------------------------+
|Reader|Writer|                          Object Class                         |
+======+======+===============================================================+
|Yes   |No    |:mod:`pandas.DataFrame`                                        |
+------+------+---------------------------------------------------------------+

Format Specification
--------------------
**State: Experimental as of 0.5.8.**

The NCBI taxonomy database [1]_ [2]_ hosts organism names and classifications.
It has a web portal [3]_ and an FTP download server [4]_. It is also accessible
using E-utilities [5]_. The database is updated daily, and an archive is
generated every month. The data release has the file name ``taxdump``. It
consists of multiple .dmp files. These files serve different purposes, but they
follow a common format pattern:

- It is a tabular format.
- Column delimiter is ``<tab><pipe><tab>``.
- Line end is ``<tab><pipe>``.
- The first column is a numeric identifier, which usually represents a taxon
  (i.e., "TaxID"), but can also represent a genetic code, citation or other
  entry.

The two most important files of the data release are ``nodes.dmp`` and
``names.dmp``. They store the hierarchical structure of the classification
system (i.e., taxonomy) and the names of organisms, respectively. They can be
used to construct the taxonomy tree of organisms.

The column definitions of each .dmp file type are taken from [6]_ and [7]_.

``nodes.dmp``
^^^^^^^^^^^^^
+----------------+-------------------------------------+
|Name            |Description                          |
+================+=====================================+
|tax_id          |node id in GenBank taxonomy database |
+----------------+-------------------------------------+
|parent tax_id   |parent node id in GenBank taxonomy   |
|                |database                             |
+----------------+-------------------------------------+
|rank            |rank of this node (superkingdom,     |
|                |kingdom, ...)                        |
+----------------+-------------------------------------+
|embl code       |locus-name prefix; not unique        |
+----------------+-------------------------------------+
|division id     |see division.dmp file                |
+----------------+-------------------------------------+
|inherited div   |1 if node inherits division from     |
|flag (1 or 0)   |parent                               |
+----------------+-------------------------------------+
|genetic code id |see gencode.dmp file                 |
+----------------+-------------------------------------+
|inherited GC    |1 if node inherits genetic code from |
|flag (1 or 0)   |parent                               |
+----------------+-------------------------------------+
|mitochondrial   |see gencode.dmp file                 |
|genetic code id |                                     |
+----------------+-------------------------------------+
|inherited MGC   |1 if node inherits mitochondrial     |
|flag (1 or 0)   |gencode from parent                  |
+----------------+-------------------------------------+
|GenBank hidden  |1 if name is suppressed in GenBank   |
|flag (1 or 0)   |entry lineage                        |
+----------------+-------------------------------------+
|hidden subtree  |1 if this subtree has no sequence    |
|root flag       |data yet                             |
|(1 or 0)        |                                     |
+----------------+-------------------------------------+
|comments        |free-text comments and citations     |
+----------------+-------------------------------------+

Since 2018, NCBI has released "new taxonomy files" [8]_ (``new_taxdump``). The
new ``nodes.dmp`` format is compatible with the classical format, with five
extra columns appended after the columns listed above:

+----------------+-------------------------------------+
|Name            |Description                          |
+================+=====================================+
|plastid genetic |see gencode.dmp file                 |
|code id         |                                     |
+----------------+-------------------------------------+
|inherited PGC   |1 if node inherits plastid gencode   |
|flag (1 or 0)   |from parent                          |
+----------------+-------------------------------------+
|specified_      |1 if species in the node's lineage   |
|species         |has formal name                      |
+----------------+-------------------------------------+
|hydrogenosome   |see gencode.dmp file                 |
|genetic code id |                                     |
+----------------+-------------------------------------+
|inherited HGC   |1 if node inherits hydrogenosome     |
|flag (1 or 0)   |gencode from parent                  |
+----------------+-------------------------------------+

``names.dmp``
^^^^^^^^^^^^^
+----------------+-------------------------------------+
|Name            |Description                          |
+================+=====================================+
|tax_id          |the id of node associated with this  |
|                |name                                 |
+----------------+-------------------------------------+
|name_txt        |name itself                          |
+----------------+-------------------------------------+
|unique name     |the unique variant of this name if   |
|                |name not unique                      |
+----------------+-------------------------------------+
|name class      |(synonym, common name, ...)          |
+----------------+-------------------------------------+

``division.dmp``
^^^^^^^^^^^^^^^^
+----------------+-------------------------------------+
|Name            |Description                          |
+================+=====================================+
|division id     |taxonomy database division id        |
+----------------+-------------------------------------+
|division cde    |GenBank division code (three         |
|                |characters)                          |
+----------------+-------------------------------------+
|division name   |e.g. BCT, PLN, VRT, MAM, PRI...      |
+----------------+-------------------------------------+
|comments        |                                     |
+----------------+-------------------------------------+

``gencode.dmp``
^^^^^^^^^^^^^^^
+----------------+-------------------------------------+
|Name            |Description                          |
+================+=====================================+
|genetic code id |GenBank genetic code id              |
+----------------+-------------------------------------+
|abbreviation    |genetic code name abbreviation       |
+----------------+-------------------------------------+
|name            |genetic code name                    |
+----------------+-------------------------------------+
|cde             |translation table for this genetic   |
|                |code                                 |
+----------------+-------------------------------------+
|starts          |start codons for this genetic code   |
+----------------+-------------------------------------+

Other types of .dmp files are currently not supported by scikit-bio. However,
users may supply custom column definitions when using this reader. See below
for details.

Format Parameters
-----------------
The following format parameters are available in ``taxdump`` format:

- ``scheme``: The column definition scheme name of the input .dmp file.
  Available options are listed below. Alternatively, one can provide a custom
  scheme as a dictionary that maps column names to data types.

  1. ``nodes``: The classical ``nodes.dmp`` scheme. It is also compatible with
     the new ``nodes.dmp`` format, in which case only the columns defined by
     the classical format will be read.

  2. ``nodes_new``: The new ``nodes.dmp`` scheme.

  3. ``nodes_slim``: Only the first three columns: tax_id, parent_tax_id and
     rank, which are the minimum information required for constructing the
     taxonomy tree. It can be applied to both classical and new ``nodes.dmp``
     files. It can also handle custom files that contain only these three
     columns.

  4. ``names``: The ``names.dmp`` scheme.

  5. ``division``: The ``division.dmp`` scheme.

  6. ``gencode``: The ``gencode.dmp`` scheme.

.. note:: scikit-bio reads columns from left to right, up to the number of
   columns defined in the scheme. Extra columns are dropped.

Examples
--------

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\\n'.join([
...     '1\\t|\\t1\\t|\\tno rank\\t|',
...     '2\\t|\\t131567\\t|\\tsuperkingdom\\t|',
...     '6\\t|\\t335928\\t|\\tgenus\\t|'
... ])
>>> fh = StringIO(fs)

Read the file into a ``pd.DataFrame`` and specify that the "nodes_slim" scheme
should be used:

>>> df = skbio.io.read(fh, format="taxdump", into=pd.DataFrame,
...                    scheme="nodes_slim")
>>> df  # doctest: +NORMALIZE_WHITESPACE
        parent_tax_id          rank
tax_id
1                   1       no rank
2              131567  superkingdom
6              335928         genus

References
----------
.. [1] Federhen, S. (2012). The NCBI taxonomy database. Nucleic acids
   research, 40(D1), D136-D143.
.. [2] Schoch, C. L., Ciufo, S., Domrachev, M., Hotton, C. L., Kannan, S.,
   Khovanskaya, R., ... & Karsch-Mizrachi, I. (2020). NCBI Taxonomy: a
   comprehensive update on curation, resources and tools. Database, 2020.
.. [3] https://www.ncbi.nlm.nih.gov/taxonomy
.. [4] https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
.. [5] Kans, J. (2022). Entrez direct: E-utilities on the UNIX command line.
   In Entrez Programming Utilities Help [Internet]. National Center for
   Biotechnology Information (US).
.. [6] https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_readme.txt
.. [7] https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/taxdump_readme.txt
.. [8] https://ncbiinsights.ncbi.nlm.nih.gov/2018/02/22/new-taxonomy-files-
   available-with-lineage-type-and-host-information/
"""

# ----------------------------------------------------------------------------
# Copyright (c) 2013--, scikit-bio development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file COPYING.txt, distributed with this software.
# ----------------------------------------------------------------------------

import pandas as pd

from skbio.io import create_format


taxdump = create_format('taxdump')

# Built-in column schemes: ordered mappings of column name to data type.
_taxdump_column_schemes = {
    'nodes_slim': {
        'tax_id': int,
        'parent_tax_id': int,
        'rank': str
    },
    'nodes': {
        'tax_id': int,
        'parent_tax_id': int,
        'rank': str,
        'embl_code': str,
        'division_id': int,
        'inherited_div_flag': bool,
        'genetic_code_id': int,
        'inherited_GC_flag': bool,
        'mitochondrial_genetic_code_id': int,
        'inherited_MGC_flag': bool,
        'GenBank_hidden_flag': bool,
        'hidden_subtree_root_flag': bool,
        'comments': str
    },
    'names': {
        'tax_id': int,
        'name_txt': str,
        'unique_name': str,
        'name_class': str
    },
    'division': {
        'division_id': int,
        'division_cde': str,
        'division_name': str,
        'comments': str
    },
    'gencode': {
        'genetic_code_id': int,
        'abbreviation': str,
        'name': str,
        'cde': str,
        'starts': str
    }
}

# The new nodes.dmp scheme extends the classical one with five columns.
_taxdump_column_schemes['nodes_new'] = dict(
    _taxdump_column_schemes['nodes'], **{
        'plastid_genetic_code_id': int,
        'inherited_PGC_flag': bool,
        'specified_species': bool,
        'hydrogenosome_genetic_code_id': int,
        'inherited_HGC_flag': bool
    })


@taxdump.reader(pd.DataFrame, monkey_patch=False)
def _taxdump_to_data_frame(fh, scheme):
    '''Read a taxdump file into a data frame.

    Parameters
    ----------
    fh : file handle
        Input taxdump file
    scheme : str or dict
        Name of a built-in column scheme, or a custom mapping of column
        names to data types

    Returns
    -------
    pd.DataFrame
        Parsed table
    '''
    if isinstance(scheme, str):
        if scheme not in _taxdump_column_schemes:
            raise ValueError(f'Invalid taxdump column scheme: "{scheme}".')
        scheme = _taxdump_column_schemes[scheme]
    names = list(scheme.keys())
    try:
        # Columns are delimited by "<tab>|<tab>"; each line ends with
        # "<tab>|". Only the leftmost len(names) columns are kept.
        return pd.read_csv(
            fh, sep='\t\\|(?:\t|$)', engine='python', index_col=0,
            names=names, dtype=scheme, usecols=range(len(names)))
    except ValueError:
        raise ValueError('Invalid taxdump file format.')
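
The separator pattern that the reader passes to `pd.read_csv` can be examined on its own. The sketch below is not part of this commit: it uses only the standard library `re` module, and the helper name `split_taxdump_line` is hypothetical. It illustrates how the same pattern consumes both the `<tab><pipe><tab>` column delimiter and the `<tab><pipe>` line end, and how cropping to the number of scheme columns discards the rest.

```python
import re

# Same pattern as the reader's `sep` argument: a tab and a pipe, followed by
# either another tab (column delimiter) or the end of the line (line end).
SEP = re.compile(r'\t\|(?:\t|$)')


def split_taxdump_line(line, n_columns):
    """Split one taxdump line and keep its first n_columns fields.

    Hypothetical helper for illustration only; scikit-bio delegates this
    work to pandas.
    """
    fields = SEP.split(line.rstrip('\n'))
    # The trailing "<tab><pipe>" leaves one empty field at the end; drop it.
    if fields and fields[-1] == '':
        fields.pop()
    return fields[:n_columns]


line = '2\t|\t131567\t|\tsuperkingdom\t|\t\t|\t0\t|'
print(split_taxdump_line(line, 3))
# ['2', '131567', 'superkingdom']
```

Empty fields, such as an absent `embl code`, survive the split as empty strings, so column positions stay aligned with the scheme.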
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
1 | root | | scientific name |
2 | Bacteria | Bacteria <bacteria> | scientific name |
2 | eubacteria | | genbank common name |
543 | Enterobacteriaceae | | scientific name |
548 | Klebsiella aerogenes | | scientific name |
561 | Escherichia | | scientific name |
562 | "Bacillus coli" Migula 1895 | | authority |
562 | Escherichia coli | | scientific name |
562 | Escherichia/Shigella coli | | equivalent name |
570 | Donovania | | synonym |
570 | Klebsiella | | scientific name |
620 | Shigella | | scientific name |
622 | Shigella dysenteriae | | scientific name |
766 | Rickettsiales | | scientific name |
1224 | Proteobacteria | | scientific name |
1236 | Gammaproteobacteria | | scientific name |
28211 | Alphaproteobacteria | | scientific name |
91347 | Enterobacterales | | scientific name |
118884 | unclassified Gammaproteobacteria | | scientific name |
126792 | Plasmid pPY113 | | scientific name |
131567 | cellular organisms | | scientific name |
585056 | Escherichia coli UMN026 | | scientific name |
1038927 | Escherichia coli O104:H4 | | scientific name |
2580236 | synthetic Escherichia coli Syn61 | | scientific name |
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1 | 1 | no rank | | 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | |
2 | 131567 | superkingdom | | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | |
543 | 91347 | family | | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
548 | 570 | species | KA | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
561 | 543 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
562 | 561 | species | EC | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
570 | 543 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
620 | 543 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
622 | 620 | species | SD | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
766 | 28211 | order | | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
1224 | 2 | phylum | | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
1236 | 1224 | class | | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
28211 | 1224 | class | | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
91347 | 1236 | order | | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
118884 | 1236 | no rank | | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
126792 | 36549 | species | PP | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
131567 | 1 | no rank | | 8 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | |
585056 | 562 | no rank | | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
1038927 | 562 | no rank | | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
2580236 | 488338 | species | SE | 7 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
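
The first three columns of this nodes file (`tax_id`, `parent_tax_id`, `rank`) are what the `nodes_slim` scheme keeps, and they are enough to recover the hierarchy. The following is a minimal sketch in plain Python, assuming a hand-copied subset of the rows above. It is not the `TreeNode.from_taxdump` implementation, and the names `parent`, `children` and `lineage` are illustrative only.

```python
from collections import defaultdict

# (tax_id, parent_tax_id, rank) triples copied from the test data above.
nodes = [
    (1, 1, 'no rank'),
    (2, 131567, 'superkingdom'),
    (543, 91347, 'family'),
    (561, 543, 'genus'),
    (562, 561, 'species'),
    (1224, 2, 'phylum'),
    (1236, 1224, 'class'),
    (91347, 1236, 'order'),
    (131567, 1, 'no rank'),
]

# Child-to-parent lookup.
parent = {tax_id: parent_id for tax_id, parent_id, _ in nodes}

# Parent-to-children lookup; the root lists itself as its own parent, so it
# is skipped here to avoid a self-loop.
children = defaultdict(list)
for tax_id, parent_id, _ in nodes:
    if tax_id != parent_id:
        children[parent_id].append(tax_id)


def lineage(tax_id):
    """Return the TaxIDs from the root down to tax_id."""
    path = [tax_id]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path[::-1]


print(lineage(562))
# [1, 131567, 2, 1224, 1236, 91347, 543, 561, 562]
```

Rows whose parent is missing from the file (for example TaxID 126792, whose parent 36549 is not listed) would raise a `KeyError` in this sketch; a real converter has to decide how to treat such orphans.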
