Commit 3e48684
new: Update parser and splitter model
* Bumps parser model to 2020.3.8
* Bumps splitter model to 2020.3.6. Note that this model does not perform better than the one it replaces, but the previous model is not compatible with breaking API changes introduced in version 2020.3.2 of deep_reference_parser. It should be relatively easy to experiment with the splitter model to reach a higher score; in any case, this standalone splitter model is largely superseded by the multitask model and is provided here mainly for comparison.
1 parent 182a9cb commit 3e48684

6 files changed: +80 / -81 lines

deep_reference_parser/__version__.py (2 additions, 2 deletions)

@@ -5,5 +5,5 @@
 __author__ = "Wellcome Trust DataLabs Team"
 __author_email__ = "[email protected]"
 __license__ = "MIT"
-__splitter_model_version__ = "2019.12.0_splitting"
-__parser_model_version__ = "2020.3.2_parsing"
+__splitter_model_version__ = "2020.3.6_splitting"
+__parser_model_version__ = "2020.3.8_parsing"
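The version strings above also act as path stems: the configs in this commit place each model under models/&lt;task&gt;/&lt;version&gt;/ (see the output_path keys below). A minimal sketch of that mapping, assuming this layout; the model_path helper is hypothetical, not part of the package:

```python
import posixpath

def model_path(version_string):
    """Map a version string such as "2020.3.6_splitting" to its model directory.

    Assumes the models/<task>/<version>/ layout seen in the configs'
    output_path keys; the task name is the suffix after the last underscore.
    """
    _, task = version_string.rsplit("_", 1)
    return posixpath.join("models", task, version_string) + "/"
```

For example, model_path("2020.3.8_parsing") yields "models/parsing/2020.3.8_parsing/", matching the output_path in the new parsing config.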

deep_reference_parser/common.py (1 addition, 5 deletions)

@@ -47,13 +47,9 @@ def download_model_artefacts(model_dir, s3_slug, artefacts=None):
     if not artefacts:

         artefacts = [
-            "char2ind.pickle",
-            "ind2label.pickle",
-            "ind2word.pickle",
-            "label2ind.pickle",
+            "indices.pickle",
             "maxes.pickle",
             "weights.h5",
-            "word2ind.pickle",
         ]

     for artefact in artefacts:
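The hunk above collapses four separate index pickles into a single indices.pickle. A minimal sketch of how this helper plausibly behaves, with the download call abstracted behind a fetch parameter (the fetch hook and the skip-if-present check are assumptions; the diff only shows the default artefact list):

```python
import os

def download_model_artefacts(model_dir, s3_slug, artefacts=None, fetch=None):
    """Sketch of the artefact-download helper from the diff above.

    `fetch(url, target)` stands in for the real S3/HTTP download call,
    which is an assumption here, not the package's actual implementation.
    """
    if not artefacts:
        # Default artefact list as of this commit.
        artefacts = [
            "indices.pickle",
            "maxes.pickle",
            "weights.h5",
        ]
    for artefact in artefacts:
        target = os.path.join(model_dir, artefact)
        # Only download artefacts that are not already on disk.
        if not os.path.exists(target) and fetch is not None:
            fetch(s3_slug + artefact, target)
    return artefacts
```

Callers that rely on the old per-mapping pickles (char2ind, word2ind, etc.) would need to pass them explicitly via the artefacts argument after this change.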

deep_reference_parser/configs/2019.12.0_splitting.ini (deleted, 35 lines)

deep_reference_parser/configs/2020.3.2_parsing.ini (deleted, 39 lines)

deep_reference_parser/configs/2020.3.6_splitting.ini (new file, 39 additions)

@@ -0,0 +1,39 @@
+[DEFAULT]
+version = 2020.3.6_splitting
+description = Splitting model trained on a combination of Reach and Rodrigues
+    data. The Rodrigues data have been concatenated into a single continuous
+    document and then cut into sequences of length=line_length, so that the
+    Rodrigues data and Reach data have the same lengths without need for much
+    padding or truncating.
+deep_reference_parser_version = e489f7efa31072b95175be8f728f1fcf03a4cabb
+
+[data]
+test_proportion = 0.25
+valid_proportion = 0.25
+data_path = data/
+respect_line_endings = 0
+respect_doc_endings = 1
+line_limit = 250
+policy_train = data/splitting/2020.3.6_splitting_train.tsv
+policy_test = data/splitting/2020.3.6_splitting_test.tsv
+policy_valid = data/splitting/2020.3.6_splitting_valid.tsv
+s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
+
+[build]
+output_path = models/splitting/2020.3.6_splitting/
+output = crf
+word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
+pretrained_embedding = 0
+dropout = 0.5
+lstm_hidden = 400
+word_embedding_size = 300
+char_embedding_size = 100
+char_embedding_type = BILSTM
+optimizer = rmsprop
+
+[train]
+epochs = 30
+batch_size = 100
+early_stopping_patience = 5
+metric = val_f1
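These INI files can be read with Python's standard-library configparser, which also handles the indented multi-line description and propagates [DEFAULT] keys into every section. A minimal sketch, inlining a few keys from the 2020.3.6_splitting config above rather than reading the real file from disk:

```python
import configparser

# Inlined fragment of the splitting config for illustration.
cfg_text = """
[DEFAULT]
version = 2020.3.6_splitting

[build]
lstm_hidden = 400
word_embedding_size = 300
dropout = 0.5

[train]
epochs = 30
batch_size = 100
"""

cfg = configparser.ConfigParser()
cfg.read_string(cfg_text)

# Values come back as strings; use the typed getters to cast.
lstm_hidden = cfg.getint("build", "lstm_hidden")
dropout = cfg.getfloat("build", "dropout")
epochs = cfg.getint("train", "epochs")

# [DEFAULT] keys are visible from every section.
version = cfg.get("train", "version")
```

In the package itself, cfg.read(path) would be used on the installed config file instead of read_string.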
deep_reference_parser/configs/2020.3.8_parsing.ini (new file, 38 additions)

@@ -0,0 +1,38 @@
+[DEFAULT]
+version = 2020.3.8_parsing
+description = Parsing model trained on a combination of Reach and Rodrigues
+    data. The Rodrigues data have been concatenated into a single continuous
+    document and then cut into sequences of length=line_length, so that the
+    Rodrigues data and Reach data have the same lengths without need for much
+    padding or truncating.
+deep_reference_parser_version = e489f7efa31072b95175be8f728f1fcf03a4cabb
+
+[data]
+test_proportion = 0.25
+valid_proportion = 0.25
+data_path = data/
+respect_line_endings = 0
+respect_doc_endings = 1
+line_limit = 100
+policy_train = data/parsing/2020.3.8_parsing_train.tsv
+policy_test = data/parsing/2020.3.8_parsing_test.tsv
+policy_valid = data/parsing/2020.3.8_parsing_valid.tsv
+s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
+
+[build]
+output_path = models/parsing/2020.3.8_parsing/
+output = crf
+word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
+pretrained_embedding = 0
+dropout = 0.5
+lstm_hidden = 400
+word_embedding_size = 300
+char_embedding_size = 100
+char_embedding_type = BILSTM
+optimizer = rmsprop
+
+[train]
+epochs = 30
+batch_size = 100
+early_stopping_patience = 5
+metric = val_f1
