
Commit bbca141

Merge pull request #25 from wellcometrust/feature/ivyleavedtoadflax/multitask_2
Implement multitask training
2 parents: fbc37d1 + fceed1b

21 files changed: +662 −351 lines

CHANGELOG.md (9 additions, 0 deletions)

@@ -1,5 +1,14 @@
 # Changelog
 
+## 2020.3.3 - Pre-release
+
+NOTE: This version includes changes to both the way that model artefacts are packaged and saved, and the way that data are loaded and parsed from tsv files. This results in a significantly faster training time (c.14 hours -> c.0.5 hours), but older models will no longer be compatible. For compatibility you must use multitask models > 2020.3.19, splitting models > 2020.3.6, and parsing models > 2020.3.8. These models currently perform less well than previous versions, but performance is expected to improve with more data and experimentation, predominantly around sequence length.
+
+* Adds support for multitask models as in the original Rodrigues paper.
+* Combines artefacts into a single `indices.pickle` rather than the several previous pickles. Now the model just requires the embedding, `indices.pickle`, and `weights.h5`.
+* Updates `load_tsv` to better handle quoting.
+
 ## 2020.3.2 - Pre-release
 
 * Adds parse command that can be called with `python -m deep_reference_parser parse`
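The consolidation of several pickles into one `indices.pickle` means the model's lookup dictionaries can now be loaded with a single read. A minimal sketch of that pattern (the key names inside the combined dict are assumptions for illustration, not confirmed by this changelog):

```python
import pickle

def load_indices(path):
    """Load the combined index dictionaries from a single pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical layout: one dict bundling the formerly separate pickles.
indices = {"word2ind": {"the": 1}, "char2ind": {"t": 1}, "label2ind": {"b-r": 1}}
with open("indices.pickle", "wb") as f:
    pickle.dump(indices, f)

loaded = load_indices("indices.pickle")
print(sorted(loaded.keys()))  # → ['char2ind', 'label2ind', 'word2ind']
```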

Makefile (4 additions, 1 deletion)

@@ -83,7 +83,10 @@ datasets = data/splitting/2019.12.0_splitting_train.tsv \
 	data/splitting/2019.12.0_splitting_valid.tsv \
 	data/parsing/2020.3.2_parsing_train.tsv \
 	data/parsing/2020.3.2_parsing_test.tsv \
-	data/parsing/2020.3.2_parsing_valid.tsv
+	data/parsing/2020.3.2_parsing_valid.tsv \
+	data/multitask/2020.3.19_multitask_train.tsv \
+	data/multitask/2020.3.19_multitask_test.tsv \
+	data/multitask/2020.3.19_multitask_valid.tsv
 
 
 rodrigues_datasets = data/rodrigues/clean_train.txt \

README.md (82 additions, 54 deletions)

@@ -2,63 +2,87 @@
 
 # Deep Reference Parser
 
-Deep Reference Parser is a Bi-direction Long Short Term Memory (BiLSTM) Deep Neural Network with a stacked Conditional Random Field (CRF) for identifying references from text. It is designed to be used in the [Reach](https://github.com/wellcometrust/reach) tool to replace a number of existing machine learning models which find references, and extract the constituent parts (e.g. author, year, publication, volume, etc).
+Deep Reference Parser is a deep learning model for recognising references in free text. In this context we mean references to other works, for example an academic paper or a book. Given an arbitrary block of text (nominally a section containing references), the model will extract the limits of the individual references, and identify key information like authors, year published, and title.
 
-The BiLSTM model is based on Rodrigues et al. (2018), and like this project, the intention is to implement a MultiTask model which will complete three tasks simultaneously: reference span detection (splitting), reference component detection (parsing), and reference type classification (classification) in a single neural network and stacked CRF.
+The model itself is a Bi-directional Long Short Term Memory (BiLSTM) Deep Neural Network with a stacked Conditional Random Field (CRF). It is designed to be used in the [Reach](https://github.com/wellcometrust/reach) application to replace a number of existing machine learning models which find references, and extract the constituent parts.
+
+The BiLSTM model is based on [Rodrigues et al. (2018)](https://github.com/dhlab-epfl/LinkedBooksDeepReferenceParsing), who developed a model to find (split) references, parse them into constituent parts, and classify them according to the type of reference (e.g. primary reference, secondary reference, etc.). This implementation covers the first two tasks and is intended for use in the medical field. Three models are implemented here: individual splitting and parsing models, and a combined multitask model which both splits and parses. We have not yet attempted to include reference type classification, but this may be done in the future.
 
 ### Current status:
 
 |Component|Individual|MultiTask|
 |---|---|---|
-|Spans (splitting)|✔️ Implemented|❌ Not Implemented|
-|Components (parsing)|✔️ Implemented|❌ Not Implemented|
+|Spans (splitting)|✔️ Implemented|✔️ Implemented|
+|Components (parsing)|✔️ Implemented|✔️ Implemented|
 |Type (classification)|❌ Not Implemented|❌ Not Implemented|
 
 ### The model
 
 The model itself is based on the work of [Rodrigues et al. (2018)](https://github.com/dhlab-epfl/LinkedBooksDeepReferenceParsing), although the implementation here differs significantly. The main differences are:
 
-* We use a combination of the training data used by Rodrigues et al. (2018) in addition to data that we have labelled ourselves. No Rodrigues et al. data are included in the test and validation sets.
-* We also use a new word embedding that has been trained on documents relevant to the medicine.
+* We use a combination of the training data used by Rodrigues et al. (2018) in addition to data that we have annotated ourselves. No Rodrigues et al. data are included in the test and validation sets.
+* We also use a new word embedding that has been trained on documents relevant to the field of medicine.
 * Whereas Rodrigues et al. split documents on lines, and sent the lines to the model, we combine the lines of the document together, and then send larger chunks to the model, giving it more context to work with when training and predicting.
-* Whilst the model makes predictions at the token level, it outputs references by naively splitting on these tokens ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/tokens_to_references.py)).
+* Whilst the splitter model makes predictions at the token level, it outputs references by naively splitting on these tokens ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/tokens_to_references.py)).
 * Hyperparameters are passed to the model in a config (.ini) file. This is to keep track of experiments, but also because it is difficult to save the model with the CRF architecture, so it is necessary to rebuild (not re-train!) the model object each time you want to use it. Storing the hyperparameters in a config file makes this easier.
-* The package ships with a [config file](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/configs/2019.12.0.ini) which defines the latest, highest performing model. The config file defines where to find the various objects required to build the model (dictionaries, weights, embeddings), and will automatically fetch them when run, if they are not found locally.
+* The package ships with a [config file](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/configs/2020.3.19_multitask.ini) which defines the latest, highest performing model. The config file defines where to find the various objects required to build the model (index dictionaries, weights, embeddings), and will automatically fetch them when run, if they are not found locally.
 * The model includes a command line interface inspired by [SpaCy](https://github.com/explosion/spaCy); functions can be called from the command line with `python -m deep_reference_parser` ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/predict.py)).
-* Python version updated to 3.7, along with dependencies (although more to do)
+* Python version updated to 3.7, along with dependencies (although more to do).
 
 ### Performance
 
 On the validation set.
 
-#### Span detection (splitting)
+#### Finding reference spans (splitting)
 
-|token|f1|support|
-|---|---|---|
-|b-r|0.9364|2472|
-|e-r|0.9312|2424|
-|i-r|0.9833|92398|
-|o|0.9561|32666|
-|weighted avg|0.9746|129959|
+Current model version: *2020.3.6_splitting*
 
-#### Components (parsing)
+|token|f1|
+|---|---|
+|b-r|0.8146|
+|e-r|0.7075|
+|i-r|0.9623|
+|o|0.8463|
+|weighted avg|0.9326|
 
-|token|f1|support|
-|---|---|---|
-|author|0.9467|2818|
-|title|0.8994|4931|
-|year|0.8774|418|
-|o|0.9592|13685|
-|weighted avg|0.9425|21852|
+#### Identifying reference components (parsing)
+
+Current model version: *2020.3.8_parsing*
+
+|token|f1|
+|---|---|
+|author|0.9053|
+|title|0.8607|
+|year|0.8639|
+|o|0.9340|
+|weighted avg|0.9124|
+
+#### Multitask model (splitting and parsing)
+
+Current model version: *2020.3.19_multitask*
+
+|token|f1|
+|---|---|
+|author|0.9102|
+|title|0.8809|
+|year|0.7469|
+|o|0.8892|
+|parsing weighted avg|0.8869|
+|b-r|0.8254|
+|e-r|0.7908|
+|i-r|0.9563|
+|o|0.7560|
+|weighted avg|0.9240|
 
 #### Computing requirements
 
 Models are trained on AWS instances using CPU only.
 
 |Model|Time Taken|Instance type|Instance cost (p/h)|Total cost|
 |---|---|---|---|---|
-|Span detection|16:02:00|m4.4xlarge|$0.88|$14.11|
-|Components|11:02:59|m4.4xlarge|$0.88|$9.72|
+|Span detection|00:26:41|m4.4xlarge|$0.88|$0.39|
+|Components|00:17:22|m4.4xlarge|$0.88|$0.25|
+|MultiTask|00:19:56|m4.4xlarge|$0.88|$0.29|
 
 ## tl;dr: Just get me to the references!
 

@@ -77,15 +101,20 @@ cat > references.txt <<EOF
 EOF
 
 
-# Run the splitter model. This will take a little time while the weights and
+# Run the MultiTask model. This will take a little time while the weights and
 # embeddings are downloaded. The weights are about 300MB, and the embeddings
 # 950MB.
 
-python -m deep_reference_parser split "$(cat references.txt)"
+python -m deep_reference_parser split_parse -t "$(cat references.txt)"
 
 # For parsing:
 
 python -m deep_reference_parser parse "$(cat references.txt)"
+
+# For splitting:
+
+python -m deep_reference_parser split "$(cat references.txt)"
+
 ```
 
 ## The longer guide
@@ -106,22 +135,24 @@ A [config file](https://github.com/wellcometrust/deep_reference_parser/blob/mast
 
 ```
 [DEFAULT]
-version = 2019.12.0
+version = 2020.3.19_multitask
+description = Same as 2020.3.13 but with adam rather than rmsprop
+deep_reference_parser_version = b61de984f95be36445287c40af4e65a403637692
 
 [data]
 test_proportion = 0.25
 valid_proportion = 0.25
 data_path = data/
 respect_line_endings = 0
 respect_doc_endings = 1
-line_limit = 250
-policy_train = data/2019.12.0_train.tsv
-policy_test = data/2019.12.0_test.tsv
-policy_valid = data/2019.12.0_valid.tsv
+line_limit = 150
+policy_train = data/multitask/2020.3.19_multitask_train.tsv
+policy_test = data/multitask/2020.3.19_multitask_test.tsv
+policy_valid = data/multitask/2020.3.19_multitask_valid.tsv
 s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
 
 [build]
-output_path = models/2020.2.0/
+output_path = models/multitask/2020.3.19_multitask/
 output = crf
 word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
 pretrained_embedding = 0

@@ -133,13 +164,10 @@ char_embedding_type = BILSTM
 optimizer = rmsprop
 
 [train]
-epochs = 10
+epochs = 60
 batch_size = 100
 early_stopping_patience = 5
 metric = val_f1
-
-[evaluate]
-out_file = evaluation_data.tsv
 ```
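Because the CRF model object must be rebuilt from these hyperparameters each time, they are read back from the .ini file at load time. A minimal sketch of reading values like those above with the standard library's `configparser` (the embedded config text here is an abridged copy of the example, not the package's own loading code):

```python
import configparser

# Parse a fragment of the config shown above; section and option names
# are taken directly from the example .ini file.
config = configparser.ConfigParser()
config.read_string("""
[build]
lstm_hidden = 400
word_embedding_size = 300
dropout = 0.5

[train]
epochs = 60
batch_size = 100
""")

lstm_hidden = config.getint("build", "lstm_hidden")
dropout = config.getfloat("build", "dropout")
epochs = config.getint("train", "epochs")
print(lstm_hidden, dropout, epochs)  # → 400 0.5 60
```

In the real package these values would be read from the shipped config file path rather than an inline string.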

### Getting help
@@ -198,21 +226,21 @@ Data must be prepared in the following tab separated format (tsv). We use [prodi
 You must provide the train/test/validation data splits in this format in pre-prepared files that are defined in the config file.
 
 ```
-References	o
-1	o
-The	b-r
-potency	i-r
-of	i-r
-history	i-r
-was	i-r
-on	i-r
-display	i-r
-at	i-r
-a	i-r
-workshop	i-r
-held	i-r
-in	i-r
-February	i-r
+References	o	o
+1	o	o
+The	b-r	title
+potency	i-r	title
+of	i-r	title
+history	i-r	title
+was	i-r	title
+on	i-r	title
+display	i-r	title
+at	i-r	title
+a	i-r	title
+workshop	i-r	title
+held	i-r	title
+in	i-r	title
+February	i-r	title
 ```
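The multitask format above carries one token per row with a splitting label and a parsing label in separate tab-delimited columns. A simplified stand-in for reading it (the package's own `load_tsv` is not shown in this diff, so the function name and return shape here are illustrative assumptions):

```python
import csv

def read_multitask_tsv(path):
    """Read token / span-label / component-label triples from a tsv file.

    Quoting is disabled so that literal quote characters in reference
    text are treated as ordinary tokens rather than csv delimiters.
    """
    tokens, span_labels, component_labels = [], [], []
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) == 3:  # skip blank separator lines
                tokens.append(row[0])
                span_labels.append(row[1])
                component_labels.append(row[2])
    return tokens, span_labels, component_labels

# Example using a few of the rows shown above.
with open("example.tsv", "w") as f:
    f.write("References\to\to\nThe\tb-r\ttitle\npotency\ti-r\ttitle\n")

toks, spans, comps = read_multitask_tsv("example.tsv")
print(toks)  # → ['References', 'The', 'potency']
```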

### Making predictions

deep_reference_parser/__main__.py (2 additions, 0 deletions)

@@ -12,11 +12,13 @@
 from .train import train
 from .split import split
 from .parse import parse
+from .split_parse import split_parse
 
 commands = {
     "split": split,
     "parse": parse,
     "train": train,
+    "split_parse": split_parse,
 }
 
 if len(sys.argv) == 1:
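The `commands` dict above is a simple name-to-function dispatch table, so registering the new `split_parse` subcommand only requires adding one entry. A self-contained sketch of that pattern (the subcommand bodies here are hypothetical stand-ins, not the package's real implementations):

```python
# Hypothetical stand-ins for the real subcommand entry points.
def split(args):
    return "split:" + ",".join(args)

def parse(args):
    return "parse:" + ",".join(args)

def split_parse(args):
    return "split_parse:" + ",".join(args)

# Dispatch table mirroring the structure in __main__.py.
commands = {"split": split, "parse": parse, "split_parse": split_parse}

def main(argv):
    """Look the subcommand up by name and pass it the remaining arguments."""
    if len(argv) < 2 or argv[1] not in commands:
        return "usage: python -m deep_reference_parser <command>"
    return commands[argv[1]](argv[2:])

print(main(["prog", "split_parse", "ref.txt"]))  # → split_parse:ref.txt
```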

deep_reference_parser/__version__.py (4 additions, 3 deletions)

@@ -1,9 +1,10 @@
 __name__ = "deep_reference_parser"
-__version__ = "2020.3.2"
+__version__ = "2020.3.3"
 __description__ = "Deep learning model for finding and parsing references"
 __url__ = "https://github.com/wellcometrust/deep_reference_parser"
 __author__ = "Wellcome Trust DataLabs Team"
 __author_email__ = "[email protected]"
 __license__ = "MIT"
-__splitter_model_version__ = "2019.12.0_splitting"
-__parser_model_version__ = "2020.3.2_parsing"
+__splitter_model_version__ = "2020.3.6_splitting"
+__parser_model_version__ = "2020.3.8_parsing"
+__splitparser_model_version__ = "2020.3.19_multitask"

deep_reference_parser/common.py (7 additions, 7 deletions)

@@ -5,8 +5,12 @@
 from logging import getLogger
 from urllib import parse, request
 
+from .__version__ import (
+    __parser_model_version__,
+    __splitparser_model_version__,
+    __splitter_model_version__,
+)
 from .logger import logger
-from .__version__ import __splitter_model_version__, __parser_model_version__
 
 
 def get_path(path):

@@ -15,6 +19,7 @@ def get_path(path):
 
 SPLITTER_CFG = get_path(f"configs/{__splitter_model_version__}.ini")
 PARSER_CFG = get_path(f"configs/{__parser_model_version__}.ini")
+MULTITASK_CFG = get_path(f"configs/{__splitparser_model_version__}.ini")
 
 
 def download_model_artefact(artefact, s3_slug):

@@ -47,13 +52,8 @@ def download_model_artefacts(model_dir, s3_slug, artefacts=None):
     if not artefacts:
 
         artefacts = [
-            "char2ind.pickle",
-            "ind2label.pickle",
-            "ind2word.pickle",
-            "label2ind.pickle",
-            "maxes.pickle",
+            "indices.pickle",
+            "maxes.pickle",
             "weights.h5",
-            "word2ind.pickle",
         ]
 
     for artefact in artefacts:
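The artefact list above feeds a fetch-if-missing loop: each file is downloaded from the `s3_slug` URL only when it is not already present locally. A simplified sketch of that behaviour (the real `download_model_artefact` body is not shown in this diff, so this is illustrative, not the package's implementation):

```python
import os
from urllib import request

def download_model_artefact(artefact, s3_slug):
    """Fetch an artefact from s3_slug unless it already exists locally.

    Simplified sketch of the fetch-if-missing pattern; the real function
    in deep_reference_parser/common.py is not reproduced here.
    """
    if os.path.exists(artefact):
        return artefact  # already cached locally, nothing to download
    request.urlretrieve(s3_slug + artefact, artefact)
    return artefact

# Demonstrate the local-cache short-circuit without touching the network.
open("weights.h5", "w").close()
print(download_model_artefact("weights.h5", "https://example.com/"))  # → weights.h5
```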

deep_reference_parser/configs/2019.12.0_splitting.ini (0 additions, 35 deletions)

This file was deleted.

deep_reference_parser/configs/2020.3.19_multitask.ini (37 additions, 0 deletions)

@@ -0,0 +1,37 @@
+[DEFAULT]
+version = 2020.3.19_multitask
+description = Same as 2020.3.13 but with adam rather than rmsprop
+deep_reference_parser_version = b61de984f95be36445287c40af4e65a403637692
+
+[data]
+# Note that test and valid proportion are only used for data creation steps,
+# not when running the train command.
+test_proportion = 0.25
+valid_proportion = 0.25
+data_path = data/
+respect_line_endings = 0
+respect_doc_endings = 1
+line_limit = 150
+policy_train = data/multitask/2020.3.19_multitask_train.tsv
+policy_test = data/multitask/2020.3.19_multitask_test.tsv
+policy_valid = data/multitask/2020.3.19_multitask_valid.tsv
+s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
+
+[build]
+output_path = models/multitask/2020.3.19_multitask/
+output = crf
+word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
+pretrained_embedding = 0
+dropout = 0.5
+lstm_hidden = 400
+word_embedding_size = 300
+char_embedding_size = 100
+char_embedding_type = BILSTM
+optimizer = rmsprop
+
+[train]
+epochs = 60
+batch_size = 100
+early_stopping_patience = 5
+metric = val_f1
+