12055 Commits

Author SHA1 Message Date
svlandeg
fba219f737 remove unnecessary itertools call 2020-03-16 08:31:36 +01:00
Alan Chan
1ae01684cf Fill in contributor agreement 2020-03-15 03:45:20 +08:00
Alan Chan
2124be100d Tweak run-on sentence 2020-03-15 03:45:20 +08:00
Alan Chan
7c3a4ce933 Missing word in api/cli doc 2020-03-15 03:45:20 +08:00
Alan Chan
36e3532475 Remove unfinished sentence 2020-03-15 03:45:17 +08:00
nihil
9cde7eb08c add spacy_syllables to universe + sign contributor agreement 2020-03-13 18:09:42 +01:00
svlandeg
59000ee21d fix serialization of empty doc + unit test 2020-03-13 16:07:56 +01:00
Mark Abraham
a0ffa346c0 Fix broken link in docs 2020-03-13 14:07:26 +01:00
Adriane Boyd
423849f94a Fix sents comparison in test util
Due to changes to `Span` (#5005), spans from different documents are now
never equal. Check `Token.is_sent_start` values instead.
2020-03-13 09:25:23 +01:00
Ines Montani
353f8486f5 Merge branch 'master' into spacy.io 2020-03-12 14:45:33 +01:00
Matthew Honnibal
26a90f011b Set version to v2.2.4 2020-03-12 11:30:41 +01:00
Ines Montani
c669435c62
Merge pull request #5125 from renaud/patch-1
small typo in code sample
2020-03-12 11:19:12 +01:00
Ines Montani
4130fef4ec
Merge pull request #5127 from svlandeg/docs/empty-doc
is_XXX is True if doc is empty
2020-03-12 11:18:10 +01:00
Ines Montani
3497b2973d
Merge pull request #5130 from merrcury/patch-1
DOC : Update LICENSE Year
2020-03-12 11:17:38 +01:00
Himanshu Garg
27d1300bdb
Create merrcury.md 2020-03-10 15:11:07 +05:30
Himanshu Garg
ba47d5a5cb
Update LICENSE Year 2020-03-10 15:03:29 +05:30
svlandeg
c4d030dbf6 remove accidental commit 2020-03-09 18:10:54 +01:00
svlandeg
1724a4f75b additional information if doc is empty 2020-03-09 18:08:18 +01:00
Renaud Richardet
eccf6b1686
small typo in code sample 2020-03-09 14:49:11 +01:00
Adriane Boyd
0c31f03ec5 Update docs [ci skip] 2020-03-09 13:41:17 +01:00
Adriane Boyd
1139247532 Revert changes to token_match priority from #4374
* Revert changes to priority of `token_match` so that it has priority
over all other tokenizer patterns

* Add lookahead and potentially slow lookbehind back to the default URL
pattern

* Expand character classes in URL pattern to improve matching around
lookaheads and lookbehinds related to #4882

* Revert changes to Hungarian tokenizer

* Revert (xfail) several URL tests to their status before #4374

* Update `tokenizer.explain()` and docs accordingly
2020-03-09 12:09:41 +01:00
Ines Montani
1d6aec805d Fix formatting and update docs for v2.2.4 2020-03-09 11:17:20 +01:00
Ines Montani
5f68004264 Port over gitignore changes from develop
Prevents stale files when switching branches
2020-03-09 11:05:00 +01:00
Mark Abraham
0345135167
Tokenizer to_disk and from_disk now ensure paths (#5116)
* Tokenizer to_disk and from_disk now ensure strings are converted to paths

Fixes #5115

* Sign contributor agreement
2020-03-08 13:25:56 +01:00
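A minimal sketch of the idea behind the commit above, assuming the fix amounts to normalizing string arguments to `pathlib.Path` before any file access. The helpers `ensure_path`, `to_disk` and `from_disk` below are hypothetical stand-ins written for illustration, not the tokenizer's actual methods:

```python
import tempfile
from pathlib import Path


def ensure_path(path):
    """Return a Path for string input; pass any other value through unchanged."""
    if isinstance(path, str):
        return Path(path)
    return path


def to_disk(data, path):
    """Write bytes to `path`, which may be given as a str or a Path (hypothetical helper)."""
    path = ensure_path(path)
    with path.open("wb") as file_:
        file_.write(data)


def from_disk(path):
    """Read bytes back from `path`, which may be given as a str or a Path (hypothetical helper)."""
    path = ensure_path(path)
    with path.open("rb") as file_:
        return file_.read()


if __name__ == "__main__":
    target = str(Path(tempfile.mkdtemp()) / "tokenizer.bin")
    to_disk(b"rules", target)        # str path works
    print(from_disk(Path(target)))   # Path works as well
```

Without the conversion, calling `.open()` on a plain string would raise an `AttributeError`, which is the kind of failure a fix like this is likely guarding against.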
Yohei Tamura
31755630a7
fix typo (#5106) 2020-03-08 13:24:38 +01:00
adrianeboyd
9dd98a4b27
Improve Makefile (#5105)
* Explicitly upgrade pip

* Include spacy-lookups-data in pex
2020-03-08 13:24:19 +01:00
Sofie Van Landeghem
5847be6022
Tok2Vec: extract-embed-encode (#5102)
* avoid changing original config

* fix elif structure, batch with just int crashes otherwise

* tok2vec example with doc2feats, encode and embed architectures

* further clean up MultiHashEmbed

* further generalize Tok2Vec to work with extract-embed-encode parts

* avoid initializing the charembed layer with Docs (for now ?)

* small fixes for bilstm config (still does not run)

* rename to core layer

* move new configs

* walk model to set nI instead of using core ref

* fix senter overfitting test to be more similar to the training data (avoid flakey behaviour)
2020-03-08 13:23:18 +01:00
adrianeboyd
993758c58f
Remove unnecessary iterator in Language.pipe (#5101)
Remove iterator over `raw_texts` with `itertools.tee()` in
`Language.pipe` that is never consumed and consumes memory
unnecessarily.
2020-03-08 13:22:25 +01:00
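Why an unconsumed tee branch costs memory can be shown with the standard library alone. This is a sketch, not the `Language.pipe` code: `count_buffered` is a hypothetical helper that consumes one branch of `itertools.tee()` and then counts how many items the untouched branch can still produce. Since the source generator is already exhausted at that point, those items can only come from the buffer that `tee` keeps internally.

```python
import itertools


def count_buffered(n_items):
    """Consume one tee branch and count the items still held for the other branch."""
    source = (i for i in range(n_items))
    consumed, unused = itertools.tee(source)
    for _ in consumed:
        pass  # drains the source; tee must keep every item for `unused`
    # The source is exhausted, so whatever `unused` yields was buffered by tee.
    return sum(1 for _ in unused)


if __name__ == "__main__":
    print(count_buffered(1000))
```

Every item that passes through the consumed branch stays buffered for as long as the other branch exists, so dropping a tee that is never read avoids holding the whole input stream in memory.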
Ines Montani
cd79c7bd26
Merge pull request #5110 from dhpollack/dhp/fix-minor-svg-error
fix typo in svg file - caused documentation build error
2020-03-06 15:32:43 +01:00
Sofie Van Landeghem
1a2b8fc264
set vector of merged entity (#5085)
* merge_entities sets the vector in the vocab for the merged token

* add unit test

* import unicode_literals

* move code to _merge function

* only set vector if vocab has non-zero vectors
2020-03-06 14:45:28 +01:00
adrianeboyd
c95ce96c44
Update sentence recognizer (#5109)
* Update sentence recognizer

* rename `sentrec` to `senter`
* use `spacy.HashEmbedCNN.v1` by default
* update to follow `Tagger` modifications
* remove component methods that can be inherited from `Tagger`
* add simple initialization and overfitting pipeline tests

* Update serialization test for senter
2020-03-06 14:45:02 +01:00
Sofie Van Landeghem
6ac9fc0619
Unit test for NEL functionality (#5114)
* empty begin_training for sentencizer

* overfitting unit test for entity linker

* fixed NEL IO by storing the entity_vector_length in the cfg
2020-03-06 14:42:23 +01:00
David Pollack
80004930ed fix typo in svg file 2020-03-05 17:04:33 +01:00
Matthew Honnibal
3440a72ecb
Update Makefile (#5099) 2020-03-04 19:28:16 +01:00
Ines Montani
31faab3647
Merge pull request #5097 from mirfan899/master
Basque language support added.
2020-03-04 17:20:23 +01:00
Ines Montani
3adc511cb0
Merge pull request #5070 from explosion/refactor/simplify-warnings
Simplify warnings
2020-03-04 17:11:18 +01:00
Ines Montani
b0cfab317f Merge branch 'develop' into refactor/simplify-warnings 2020-03-04 16:38:55 +01:00
Ines Montani
99d8ee506f
Merge pull request #5100 from adrianeboyd/feature/bump-srsly-1.0.2
Require srsly >=1.0.2
2020-03-04 16:32:52 +01:00
Adriane Boyd
4d655b1d45 Require srsly >=1.0.2 2020-03-04 13:50:37 +01:00
Muhammad Irfan
224a7f8e94 examples 2020-03-04 15:49:06 +05:00
Muhammad Irfan
03376c9d9b Basque language added and tested. 2020-03-04 11:58:56 +05:00
adrianeboyd
9be90dbca3
Improve token head verification (#5079)
* Improve token head verification

Improve the verification for valid token heads when heads are set:

* in `Token.head`: heads come from the same document
* in `Doc.from_array()`: head indices are within the bounds of the
document

* Improve error message
2020-03-03 21:44:51 +01:00
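The bounds check described for `Doc.from_array()` can be sketched in plain Python. This assumes heads are given as offsets relative to each token's own index, so token `i` with offset `off` points at token `i + off`. The function `find_invalid_heads` is a hypothetical illustration, not the library's implementation:

```python
def find_invalid_heads(head_offsets):
    """Return indices of tokens whose head would fall outside the document.

    Assumes `head_offsets[i]` is relative to token i (0 means the token is its own head).
    """
    n_tokens = len(head_offsets)
    return [
        i
        for i, offset in enumerate(head_offsets)
        if not 0 <= i + offset < n_tokens
    ]


if __name__ == "__main__":
    print(find_invalid_heads([1, 0, -1]))  # all heads inside a 3-token document
    print(find_invalid_heads([1, 0, 1]))   # last token points past the end
```

Collecting all offending indices, rather than stopping at the first, makes it easier to build an error message that names every bad head at once.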
adrianeboyd
8c20dae6f7
Fix model-final/model-best meta from train CLI (#5093)
* Fix model-final/model-best meta

* include speed and accuracy from final iteration
* combine with speeds from base model if necessary

* Include token_acc metric for all components
2020-03-03 21:43:25 +01:00
Sofie Van Landeghem
a0998868ff
prevent updating cfg if the Model was already defined (#5078) 2020-03-03 13:58:56 +01:00
Sofie Van Landeghem
d307e9ca58
take care of global vectors in multiprocessing (#5081)
* restore load_nlp.VECTORS in the child process

* add unit test

* fix test

* remove unnecessary import

* add utf8 encoding

* import unicode_literals
2020-03-03 13:58:22 +01:00
adrianeboyd
d078b47c81
Break out of infinite loop as intended (#5077) 2020-03-03 12:29:05 +01:00
adrianeboyd
697bec764d
Normalize IS_SENT_START to SENT_START for Matcher (#5080) 2020-03-03 12:22:39 +01:00
adrianeboyd
2281c4708c
Restore empty tokenizer properties (#5026)
* Restore empty tokenizer properties

* Check for types in tokenizer.from_bytes()

* Add test for setting empty tokenizer rules
2020-03-02 11:55:02 +01:00
Sofie Van Landeghem
c6b12ab02a
Bugfix/get doc (#5049)
* new (broken) unit test

* fixing get_doc method
2020-03-02 11:49:28 +01:00
Ines Montani
648f61d077
Tidy up compiler flags and imports (#5071) 2020-03-02 11:48:10 +01:00