Commit Graph

7027 Commits

Author SHA1 Message Date
svlandeg
e91485dfc4 add discard_oversize parameter, move optimizer to training subsection 2020-06-03 10:04:16 +02:00
svlandeg
03c58b488c prevent infinite loop, custom warning 2020-06-03 10:00:21 +02:00
svlandeg
6504b7f161 Merge remote-tracking branch 'upstream/develop' into feature/pretrain-config 2020-06-03 08:30:16 +02:00
svlandeg
c5ac382f0a fix name clash 2020-06-02 22:24:57 +02:00
svlandeg
2bf5111ecf additional test with discard_oversize=False 2020-06-02 22:09:37 +02:00
svlandeg
aa6271b16c extending algorithm to deal better with edge cases 2020-06-02 22:05:08 +02:00
svlandeg
f2e162fc60 it's only oversized if the tolerance level is also exceeded 2020-06-02 19:59:04 +02:00
svlandeg
ef834b4cd7 fix comments 2020-06-02 19:50:44 +02:00
svlandeg
6208d322d3 slightly more challenging unit test 2020-06-02 19:47:30 +02:00
svlandeg
6651fafd5c using overflow buffer for examples within the tolerance margin 2020-06-02 19:43:39 +02:00
svlandeg
85b0597ed5 add test for minibatch util 2020-06-02 18:26:21 +02:00
svlandeg
5b350a6c99 bugfix of the bugfix 2020-06-02 17:49:33 +02:00
Adriane Boyd
75f08ad62d Remove unnecessary check 2020-06-02 17:41:25 +02:00
Adriane Boyd
bbc1836581 Add rudimentary version checks on model load 2020-06-02 17:33:48 +02:00
svlandeg
fdfd822936 rewrite minibatch_by_words function 2020-06-02 15:22:54 +02:00
svlandeg
ec52e7f886 add oversize examples before StopIteration returns 2020-06-02 13:21:55 +02:00
svlandeg
e0f9f448f1 remove Tensorizer 2020-06-01 23:38:48 +02:00
Leo
925e938570 Spanish tokenizer exception and examples improvement (#5531)
* Spanish tokenizer exception additions. Added Spanish question examples

* erased slang tokenization examples
2020-06-01 18:18:34 +02:00
Matthew Honnibal
67af3a32b0 Merge pull request #5527 from adrianeboyd/bugfix/tagger-sp-tag-map
Preserve _SP when filtering tag map in Tagger
2020-06-01 12:00:21 +02:00
Leo
c21c308ecb Fix issue #5524: replace <U+009C> 'STRING TERMINATOR' with <U+0153> 'LATIN SMALL LIGATURE OE' (#5526) 2020-05-31 22:08:12 +02:00
Adriane Boyd
a005ccd6d7 Preserve _SP when filtering tag map in Tagger
To allow "SP" as a tag (for Chinese OntoNotes), preserve "_SP" if
present as the reference `SPACE` POS in the tag map in
`Tagger.begin_training()`.
2020-05-31 19:57:54 +02:00
Ines Montani
b5ae2edcba Merge pull request #5516 from explosion/feature/improve-model-version-deps 2020-05-31 12:54:01 +02:00
Ines Montani
dc186afdc5 Add warning 2020-05-30 15:34:54 +02:00
Ines Montani
b7aff6020c Make functions more general purpose and update docstrings and tests 2020-05-30 15:18:53 +02:00
Ines Montani
a7e370bcbf Don't override spaCy version 2020-05-30 15:03:18 +02:00
Ines Montani
e47e5a4b10 Use more sophisticated version parsing logic 2020-05-30 15:01:58 +02:00
svlandeg
15134ef611 fix deserialization order 2020-05-30 12:53:32 +02:00
Matthew Honnibal
64adda3202 Revert "Remove peeking from Parser.begin_training (#5456)"
This reverts commit 9393253b66.

The model shouldn't need to see all examples, and actually in v3 there's
no equivalent step. All examples are provided to the component, for the
component to do stuff like figuring out the labels. The model just needs
to do stuff like shape inference.
2020-05-29 23:21:55 +02:00
Matthew Honnibal
85f1acfaa0 Merge pull request #5517 from adrianeboyd/bugfix/morph-repr
Remove MorphAnalysis __str__ and __repr__
2020-05-29 19:20:56 +02:00
svlandeg
291483157d prevent loading a pretrained Tok2Vec layer AND pretrained components 2020-05-29 17:38:33 +02:00
Adriane Boyd
e1b7cbd197 Remove MorphAnalysis __str__ and __repr__ 2020-05-29 14:33:47 +02:00
Ines Montani
4fd087572a WIP: improve model version deps 2020-05-28 12:51:37 +02:00
Matthw Honnibal
58750b06f8 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-05-27 22:18:36 +02:00
Matthew Honnibal
aecd1437cc Merge pull request #5508 from adrianeboyd/bugfix/tag-map-sp-tag
Prefer _SP over SP for default tag map space attrs
2020-05-27 20:39:40 +02:00
Adriane Boyd
25de2a2191 Improve vector name loading from model meta 2020-05-27 14:48:54 +02:00
adrianeboyd
aad0610a85 Map NR to PROPN (#5512) 2020-05-26 22:30:53 +02:00
Adriane Boyd
b6b5908f5e Prefer _SP over SP for default tag map space attrs
If `_SP` is already in the tag map, use the mapping from `_SP` instead
of `SP` so that `SP` can be a valid non-space tag. (Chinese has a
non-space tag `SP` which was overriding the mapping of `_SP` to
`SPACE`.)
2020-05-26 14:57:13 +02:00
Adriane Boyd
1eed101be9 Fix Polish lemmatizer for deserialized models
Restructure Polish lemmatizer not to depend on lookups data in
`__init__` since the lemmatizer is initialized before the lookups data
is loaded from a saved model. The lookups tables are accessed first in
`__call__` instead once the data is available.
2020-05-26 09:56:12 +02:00
Ines Montani
24ef6680fa Merge pull request #5499 from adrianeboyd/chore/bump-version-deps-v2.3.0 2020-05-25 13:25:45 +02:00
Adriane Boyd
3f727bc539 Switch to v2.3.0.dev0 2020-05-25 12:57:20 +02:00
Adriane Boyd
736f3cb5af Bump version and deps for v2.3.0
* spacy to v2.3.0
* thinc to v7.4.1
* spacy-lookups-data to v0.3.2
2020-05-25 12:03:49 +02:00
Adriane Boyd
e06ca7ea24 Switch to new add API in PhraseMatcher unpickle 2020-05-25 11:22:47 +02:00
Ines Montani
1a15896ba9 unicode -> str consistency [ci skip] 2020-05-24 18:51:10 +02:00
Ines Montani
5d3806e059 unicode -> str consistency 2020-05-24 17:20:58 +02:00
Ines Montani
387c7aba15 Update test 2020-05-24 14:55:16 +02:00
Ines Montani
f9786d765e Simplify is_package check 2020-05-24 14:48:56 +02:00
Matthw Honnibal
2d9de8684d Support use_pytorch_for_gpu_memory config 2020-05-22 23:10:40 +02:00
Ines Montani
4465cad6c5 Rename spacy.analysis to spacy.pipe_analysis 2020-05-22 17:42:06 +02:00
Ines Montani
25d6ed3fb8 Merge pull request #5489 from explosion/feature/connected-components 2020-05-22 17:40:11 +02:00
Ines Montani
841c05b47b Merge pull request #5490 from explosion/fix/remove-jsonschema 2020-05-22 17:39:54 +02:00
Ines Montani
569a65b60e Auto-format 2020-05-22 16:55:42 +02:00
Ines Montani
d844528c5f Add test for is_compatible_model 2020-05-22 16:55:15 +02:00
Ines Montani
12b7be1d98 Remove jsonschema from dependencies 2020-05-22 16:49:26 +02:00
Matthew Honnibal
f7f6df7275 Move to spacy.analysis 2020-05-22 16:43:18 +02:00
Matthew Honnibal
78d79d94ce Guess set_annotations=True in nlp.update
During `nlp.update`, components can be passed a boolean set_annotations
to indicate whether they should assign annotations to the `Doc`. This
needs to be set if downstream components expect to use the
annotations during training, e.g. if we wanted to use tagger features in
the parser.

Components can specify their assignments and requirements, so we can
figure out which components have these inter-dependencies. After
figuring this out, we can guess whether to pass set_annotations=True.

We could also pass set_annotations=True always, or even just have this
as the only behaviour. The downside of this is that it would require the
`Doc` objects to be created afresh to avoid problematic modifications.
One approach would be to make a fresh copy of the `Doc` objects within
`nlp.update()`, so that we can write to the objects without any
problems. If we do that, we can drop this logic and also drop the
`set_annotations` mechanism. I would be fine with that approach,
although it runs the risk of introducing some performance overhead, and
we'll have to take care to copy all extension attributes etc.
2020-05-22 15:55:45 +02:00
Ines Montani
6728747f71 Merge pull request #5486 from explosion/fix/compat-py2 2020-05-22 15:47:21 +02:00
Ines Montani
6e6db6afb6 Better model compatibility and validation 2020-05-22 15:42:46 +02:00
Matthew Honnibal
f6078d866a Merge pull request #5121 from adrianeboyd/bugfix/revert-token-match
Revert token_match priority changes from #4374 and extend token match options
2020-05-22 14:42:51 +02:00
Ines Montani
c685ee734a Fix compat for v2.x branch 2020-05-22 14:22:36 +02:00
Adriane Boyd
e4a1b5dab1 Rename to url_match
Rename to `url_match` and update docs.
2020-05-22 12:41:03 +02:00
Adriane Boyd
730fa493a4 Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-22 12:18:00 +02:00
Adriane Boyd
71fe61fdcd Disallow merging 0-length spans 2020-05-22 10:14:34 +02:00
Matthew Honnibal
93c4d13588 Merge pull request #5264 from lfiedler/issue-5230
Fix ResourceWarnings during unittest
2020-05-22 00:31:07 +02:00
Matthew Honnibal
e1cb7e838b Merge pull request #5481 from explosion/feature/blank-shortcut-v2
Add blank:{lang} shortcut support to util.load_model
2020-05-22 00:08:23 +02:00
Ines Montani
2250380816 Merge pull request #5482 from explosion/fix/backwards-compat-super 2020-05-21 21:51:46 +02:00
Ines Montani
891fa59009 Use backwards-compatible super() 2020-05-21 20:52:48 +02:00
Matthew Honnibal
5ce02c1b17 Merge pull request #5470 from svlandeg/bugfix/noun-chunks
Bugfix in noun chunks
2020-05-21 20:51:31 +02:00
Matthw Honnibal
25b51f4fc8 Set version to v3.0.0.dev9 2020-05-21 20:47:52 +02:00
Matthw Honnibal
bc94fdabd0 Fix begin_training 2020-05-21 20:46:21 +02:00
Matthw Honnibal
d507ac28d8 Fix shape inference 2020-05-21 20:46:10 +02:00
Ines Montani
cb02bff0eb Add blank:{lang} shortcut to util.load_model 2020-05-21 20:24:07 +02:00
Matthw Honnibal
df87c32a40 Pass smaller doc sample into model initialize 2020-05-21 20:17:24 +02:00
Ines Montani
581bda9f98 Update senter test and auto-format 2020-05-21 20:17:14 +02:00
Ines Montani
0f1beb5ff2 Tidy up and avoid absolute spacy imports in core 2020-05-21 20:05:03 +02:00
svlandeg
51715b9f72 span / noun chunk has +1 because end is exclusive 2020-05-21 19:56:56 +02:00
Adriane Boyd
132b2a6898 Merge remote-tracking branch 'upstream/master-tmp' into HEAD 2020-05-21 19:50:30 +02:00
Adriane Boyd
17ee9ab53a Fix _SP/POS=SPACE in strings serialization tests 2020-05-21 19:49:08 +02:00
Ines Montani
245f91df78 Fix merge issues 2020-05-21 19:42:13 +02:00
Matthw Honnibal
3b5cfec1fc Tweak memory management in train_from_config 2020-05-21 19:32:04 +02:00
Matthw Honnibal
f075655deb Fix shape inference in begin_training 2020-05-21 19:26:29 +02:00
svlandeg
84d5b7ad0a Merge remote-tracking branch 'upstream/master' into bugfix/noun-chunks
# Conflicts:
#	spacy/lang/el/syntax_iterators.py
#	spacy/lang/en/syntax_iterators.py
#	spacy/lang/fa/syntax_iterators.py
#	spacy/lang/fr/syntax_iterators.py
#	spacy/lang/id/syntax_iterators.py
#	spacy/lang/nb/syntax_iterators.py
#	spacy/lang/sv/syntax_iterators.py
2020-05-21 19:19:50 +02:00
svlandeg
f7d10da555 avoid unnecessary loop to check overlapping noun chunks 2020-05-21 19:15:57 +02:00
Ines Montani
631e20d0c6 Fix test and schemas 2020-05-21 19:01:02 +02:00
Ines Montani
d34fc0915e Remove serialization getter 2020-05-21 18:48:21 +02:00
Ines Montani
f44897e4c6 Update warning IDs 2020-05-21 18:39:11 +02:00
Ines Montani
24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
Ines Montani
c6ec19c844 Add missing declaration 2020-05-21 17:30:05 +02:00
Matthew Honnibal
884d9b060d Merge pull request #5466 from adrianeboyd/feature/omit-extra-lexeme-info
Add option to omit extra lexeme tables in CLI
2020-05-21 16:40:02 +02:00
Matthew Honnibal
e6c4c1a507 Merge pull request #5468 from adrianeboyd/feature/cli-conllu-misc-ner
Improve handling of NER in CoNLL-U MISC
2020-05-21 16:39:46 +02:00
Matthew Honnibal
26cd6a0229 Merge pull request #5462 from adrianeboyd/feature/lemmatizer-all-upos
Extend lemmatizer rules for all UPOS tags
2020-05-21 16:05:31 +02:00
Matthew Honnibal
cad9b290a2 Merge branch 'master' into feature/omit-extra-lexeme-info 2020-05-21 16:04:24 +02:00
Matthew Honnibal
1f572ce89b Merge pull request #5473 from explosion/fix/travis-tests
Fix Python 2.7 compat
2020-05-21 15:56:16 +02:00
Ines Montani
a9cb2882cb Rename argument: doc_or_span/obj -> doclike (#5463)
* doc_or_span -> obj

* Revert "doc_or_span -> obj"

This reverts commit 78bb9ff5e0.

* obj -> doclike

* Refer to correct object
2020-05-21 15:17:39 +02:00
Ines Montani
bea863acd2 Fix naming conflict and formatting 2020-05-21 14:24:38 +02:00
Ines Montani
bd6353715a Merge branch 'master' into fix/travis-tests 2020-05-21 14:23:04 +02:00
Ines Montani
d8f3190c0a Tidy up and auto-format 2020-05-21 14:14:01 +02:00
Ines Montani
56de520afd Try to fix tests on Travis (2.7) 2020-05-21 14:04:57 +02:00
adrianeboyd
d45602bc11 Merge branch 'master' into feature/omit-extra-lexeme-info 2020-05-21 10:26:01 +02:00
svlandeg
b221bcf1ba fixing all languages 2020-05-21 00:17:28 +02:00
svlandeg
b509a3e7fc fix: use actual range in 'seen' instead of subtree 2020-05-20 23:06:39 +02:00