Matthw Honnibal
3b5cfec1fc
Tweak memory management in train_from_config
2020-05-21 19:32:04 +02:00
Matthw Honnibal
f075655deb
Fix shape inference in begin_training
2020-05-21 19:26:29 +02:00
svlandeg
84d5b7ad0a
Merge remote-tracking branch 'upstream/master' into bugfix/noun-chunks
# Conflicts:
# spacy/lang/el/syntax_iterators.py
# spacy/lang/en/syntax_iterators.py
# spacy/lang/fa/syntax_iterators.py
# spacy/lang/fr/syntax_iterators.py
# spacy/lang/id/syntax_iterators.py
# spacy/lang/nb/syntax_iterators.py
# spacy/lang/sv/syntax_iterators.py
2020-05-21 19:19:50 +02:00
svlandeg
f7d10da555
avoid unnecessary loop to check overlapping noun chunks
2020-05-21 19:15:57 +02:00
Matthw Honnibal
1729165e90
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-05-21 19:11:08 +02:00
Ines Montani
631e20d0c6
Fix test and schemas
2020-05-21 19:01:02 +02:00
Ines Montani
d34fc0915e
Remove serialization getter
2020-05-21 18:48:21 +02:00
Ines Montani
f44897e4c6
Update warning IDs
2020-05-21 18:39:11 +02:00
Ines Montani
24f72c669c
Merge branch 'develop' into master-tmp
2020-05-21 18:39:06 +02:00
Ines Montani
c6ec19c844
Add missing declaration
2020-05-21 17:30:05 +02:00
Matthew Honnibal
884d9b060d
Merge pull request #5466 from adrianeboyd/feature/omit-extra-lexeme-info
Add option to omit extra lexeme tables in CLI
2020-05-21 16:40:02 +02:00
Matthew Honnibal
e6c4c1a507
Merge pull request #5468 from adrianeboyd/feature/cli-conllu-misc-ner
Improve handling of NER in CoNLL-U MISC
2020-05-21 16:39:46 +02:00
Matthew Honnibal
26cd6a0229
Merge pull request #5462 from adrianeboyd/feature/lemmatizer-all-upos
Extend lemmatizer rules for all UPOS tags
2020-05-21 16:05:31 +02:00
Matthew Honnibal
cad9b290a2
Merge branch 'master' into feature/omit-extra-lexeme-info
2020-05-21 16:04:24 +02:00
Matthew Honnibal
1f572ce89b
Merge pull request #5473 from explosion/fix/travis-tests
Fix Python 2.7 compat
2020-05-21 15:56:16 +02:00
Matthew Honnibal
7902ebc63c
Rename argument: doc_or_span/obj -> doclike (#5463)
* doc_or_span -> obj
* Revert "doc_or_span -> obj"
This reverts commit 78bb9ff5e0.
* obj -> doclike
* Refer to correct object
2020-05-21 15:17:54 +02:00
Ines Montani
a9cb2882cb
Rename argument: doc_or_span/obj -> doclike (#5463)
* doc_or_span -> obj
* Revert "doc_or_span -> obj"
This reverts commit 78bb9ff5e0.
* obj -> doclike
* Refer to correct object
2020-05-21 15:17:39 +02:00
Ines Montani
bea863acd2
Fix naming conflict and formatting
2020-05-21 14:24:38 +02:00
Ines Montani
bd6353715a
Merge branch 'master' into fix/travis-tests
2020-05-21 14:23:04 +02:00
Ines Montani
e2fe83e35d
Refer to correct object
2020-05-21 14:20:29 +02:00
Ines Montani
b1f45c9da3
obj -> doclike
2020-05-21 14:19:58 +02:00
Ines Montani
69fb4bedf2
Revert "doc_or_span -> obj"
This reverts commit 78bb9ff5e0.
2020-05-21 14:14:28 +02:00
Ines Montani
d8f3190c0a
Tidy up and auto-format
2020-05-21 14:14:01 +02:00
Ines Montani
56de520afd
Try to fix tests on Travis (2.7)
2020-05-21 14:04:57 +02:00
Kevin Lu
a3b7ae4f98
Update universe.json
2020-05-21 13:59:09 +02:00
Ines Montani
f2a131bd9a
Merge pull request #5461 from kevinlu1248/master
2020-05-21 13:53:10 +02:00
adrianeboyd
d45602bc11
Merge branch 'master' into feature/omit-extra-lexeme-info
2020-05-21 10:26:01 +02:00
svlandeg
b221bcf1ba
fixing all languages
2020-05-21 00:17:28 +02:00
svlandeg
b509a3e7fc
fix: use actual range in 'seen' instead of subtree
2020-05-20 23:06:39 +02:00
svlandeg
36a94c409a
failing test to reproduce overlapping spans problem
2020-05-20 23:06:03 +02:00
adrianeboyd
49ef06d793
Add option for base model in init-model CLI (#5467)
Intended for languages like Chinese with a custom tokenizer.
2020-05-20 18:49:11 +02:00
Adriane Boyd
4b229bfc22
Improve handling of NER in CoNLL-U MISC
2020-05-20 18:48:51 +02:00
Matthew Honnibal
609c0ba557
Fix accidentally quadratic runtime in Example.split_sents (#5464)
* Tidy up train-from-config a bit
* Fix accidentally quadratic perf in TokenAnnotation.brackets
When we're reading in the gold data, we had a nested loop where
we looped over the brackets for each token, looking for brackets
that start on that word. This is accidentally quadratic, because
we have one bracket per word (for the POS tags). So we had
an O(N**2) behaviour here that ended up being pretty slow.
To solve this I'm indexing the brackets by their starting word
on the TokenAnnotations object, and having a property to provide
the previous view.
* Fixes
2020-05-20 18:48:18 +02:00
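The fix described in 609c0ba557 can be illustrated with a minimal sketch. It assumes brackets are `(start, end, label)` tuples; the function names and sample data are hypothetical and are not taken from the spaCy source. The first function repeats the nested-loop pattern the message describes, and the second builds an index keyed by starting word so each lookup is a dictionary access.

```python
from collections import defaultdict


def brackets_by_start_nested(n_words, brackets):
    """Nested loop: for every word, scan every bracket (O(N * B) work)."""
    result = []
    for i in range(n_words):
        starting_here = [(end, label) for (start, end, label) in brackets if start == i]
        result.append(starting_here)
    return result


def index_brackets_by_start(brackets):
    """Single pass over the brackets (O(B) work), keyed by starting word."""
    index = defaultdict(list)
    for start, end, label in brackets:
        index[start].append((end, label))
    return index


# Hypothetical data: one POS-style bracket per word plus one phrase bracket.
words = ["The", "cat", "sat"]
brackets = [(0, 0, "DT"), (1, 1, "NN"), (2, 2, "VBD"), (0, 1, "NP")]

slow = brackets_by_start_nested(len(words), brackets)
index = index_brackets_by_start(brackets)
fast = [index.get(i, []) for i in range(len(words))]
```

Both approaches produce the same per-word lists. With roughly one bracket per word, the nested version does work proportional to N**2, while the indexed version stays linear.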
Kevin Lu
c7c4cd5fe1
Changed pyate code example in universe.json
2020-05-20 09:11:32 -07:00
Adriane Boyd
daaa7bf451
Add option to omit extra lexeme tables in CLI
2020-05-20 15:51:44 +02:00
Adriane Boyd
8cba0e41d8
Return lowercase form as default except for PROPN
2020-05-20 15:35:08 +02:00
adrianeboyd
9393253b66
Remove peeking from Parser.begin_training (#5456)
Inspect all instances in `Parser.begin_training` rather than only the
first 1000.
2020-05-20 15:18:06 +02:00
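Why peeking at a fixed prefix can be a problem is easy to show with a toy sketch. It assumes the inspection collects a set of labels from `(text, labels)` pairs; that data shape, the function names and the labels are hypothetical, since the commit message does not say what `Parser.begin_training` gathers from the instances.

```python
from itertools import islice


def labels_from_peek(examples, limit=1000):
    """Collect labels from only the first `limit` examples."""
    return {label for _, labels in islice(examples, limit) for label in labels}


def labels_from_all(examples):
    """Collect labels from every example."""
    return {label for _, labels in examples for label in labels}


# Hypothetical data: a label that first appears after the 1000th example.
examples = [("text", ["nsubj"])] * 1000 + [("text", ["dobj"])]

peeked = labels_from_peek(iter(examples))
full = labels_from_all(iter(examples))
```

Here `peeked` misses the label that only occurs late in the data, while `full` includes it, at the cost of a complete pass over the instances.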
Ines Montani
78bb9ff5e0
doc_or_span -> obj
2020-05-20 14:56:52 +02:00
Matthw Honnibal
60e8da4813
Tidy up train-from-config a bit
2020-05-20 12:56:27 +02:00
Matthw Honnibal
fda7355508
Fix train-from-config
2020-05-20 12:30:21 +02:00
Matthw Honnibal
24efd54a42
Merge from develop
2020-05-20 12:27:31 +02:00
Sofie Van Landeghem
7f5715a081
Various fixes to NEL functionality, Example class etc (#5460)
* setting KB in the EL constructor, similar to how the model is passed on
* removing wikipedia example files - moved to projects
* throw an error when nlp.update is called with 2 positional arguments
* rewriting the config logic in create pipe to accommodate other objects (e.g. KB) in the config
* update config files with new parameters
* avoid training pipeline components that don't have a model (like sentencizer)
* various small fixes + UX improvements
* small fixes
* set thinc to 8.0.0a9 everywhere
* remove outdated comment
2020-05-20 11:41:12 +02:00
Adriane Boyd
4fa9670537
Extend lemmatizer rules for all UPOS tags
2020-05-20 10:15:43 +02:00
Kevin Lu
291b9ad7b9
Update CONTRIBUTOR_AGREEMENT.md
2020-05-19 20:29:53 -07:00
Kevin Lu
9a1a535215
Create kevinlu1248.md
2020-05-19 20:25:45 -07:00
Kevin Lu
a23b3a5a50
Update CONTRIBUTOR_AGREEMENT.md
2020-05-19 20:24:24 -07:00
Kevin Lu
0a5b140235
Update universe.json
2020-05-19 20:12:21 -07:00
Matthew Honnibal
664a3603b0
Set version to v3.0.0.dev8
2020-05-19 17:15:39 +02:00
adrianeboyd
40e65d6f63
Fix most_similar for vectors with unused rows (#5348)
* Fix most_similar for vectors with unused rows
Address issues related to the unused rows in the vector table and
`most_similar`:
* Update `most_similar()` to search only through rows that are in use
according to `key2row`.
* Raise an error when `most_similar(n=n)` is larger than the number of
vectors in the table.
* Set and restore `_unset` correctly when vectors are added or
deserialized so that new vectors are added in the correct row.
* Set data and keys to the same length in `Vocab.prune_vectors()` to
avoid spurious entries in `key2row`.
* Fix regression test using `most_similar`
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-19 16:41:26 +02:00
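Two of the points in 40e65d6f63, searching only rows that are in use according to `key2row` and raising an error when `n` exceeds the number of vectors, can be sketched in plain Python. This is a simplified toy and not the spaCy implementation: the function signature, the list-of-lists table and the cosine helper are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, data, key2row, n=1):
    """Return the n keys whose in-use rows are most similar to `query`."""
    if n > len(key2row):
        raise ValueError(f"n={n} is larger than the {len(key2row)} vectors in use")
    # Only rows referenced by key2row are scored; unused rows are never visited.
    scored = [(cosine(query, data[row]), key) for key, row in key2row.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:n]]


# Hypothetical table with four rows, of which only the first two are in use.
data = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
key2row = {"cat": 0, "dog": 1}

best = most_similar([0.9, 0.1], data, key2row, n=1)
```

Requesting `n=3` from this table raises `ValueError`, because only two rows are in use even though the table has four.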
Matthew Honnibal
a2830c3ef5
Use thinc 8.0.0a9
2020-05-19 16:23:11 +02:00