Commit Graph

10981 Commits

Author SHA1 Message Date
Sofie Van Landeghem
2d249a9502 KB extensions and better parsing of WikiData (#4375)
* fix overflow error on windows

* more documentation & logging fixes

* md fix

* 3 different limit parameters to play with execution time

* bug fixes directory locations

* small fixes

* exclude dev test articles from prior probabilities stats

* small fixes

* filtering wikidata entities, removing numeric and meta items

* adding aliases from wikidata also to the KB

* fix adding WD aliases

* adding also new aliases to previously added entities

* fixing commas

* small doc fixes

* adding subclassof filtering

* append alias functionality in KB

* prevent appending the same entity-alias pair

* fix for appending WD aliases

* remove date filter

* remove unnecessary import

* small corrections and reformatting

* remove WD aliases for now (too slow)

* removing numeric entities from training and evaluation

* small fixes

* shortcut during prediction if there is only one candidate

* add counts and fscore logging, remove FP NER from evaluation

* fix entity_linker.predict to take docs instead of single sentences

* remove enumeration sentences from the WP dataset

* entity_linker.update to process full doc instead of single sentence

* spelling corrections and dump locations in readme

* NLP IO fix

* reading KB is unnecessary at the end of the pipeline

* small logging fix

* remove empty files
2019-10-14 12:28:53 +02:00
Peter Gilles
428887b8f2 Initial commit: New language Luxembourgish (lb) (#4424)
* new language: Luxembourgish (lb)

* update

* update

* Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md

* Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md

* Update norm_exceptions.py

* Delete README.md

* moved test_lemma.py

* deactivated 'lemma_lookup = LOOKUP'

* update

* Update conftest.py

* update

* tests updated

* import unicode_literals

* Update spacy/tests/lang/lb/test_text.py

Co-Authored-By: Ines Montani <ines@ines.io>

* Create PeterGilles.md
2019-10-14 12:27:50 +02:00
adrianeboyd
98a961a60e Fix PhraseMatcher.remove for overlapping patterns (#4437) 2019-10-14 12:19:51 +02:00
Ines Montani
f8f68bb062 Auto-format [ci skip] 2019-10-10 17:08:39 +02:00
adrianeboyd
d2d2baaf76 Revert training example edit from #4327 (#4403)
I think the original annotation was correct and this change also
unfortunately introduced a cycle into the dependency tree.
2019-10-10 17:00:26 +02:00
adrianeboyd
6f54e59fe7 Fix util.filter_spans() to prefer first span in overlapping sam… (#4414)
* Update util.filter_spans() to prefer earlier spans

* Add filter_spans test for first same-length span

* Update entity relation example to refer to util.filter_spans()
2019-10-10 17:00:03 +02:00
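The tie-breaking rule this commit describes (prefer the earlier span when overlapping spans have the same length) can be illustrated with a minimal sketch. It works on plain `(start, end)` tuples rather than spaCy `Span` objects, assumes that longer spans otherwise win, and the name `filter_spans_sketch` is hypothetical; it is not the library's actual code.

```python
def filter_spans_sketch(spans):
    """Keep non-overlapping (start, end) spans (end exclusive).

    Assumed policy: longer spans win; among same-length spans,
    the one that starts earlier wins.
    """
    # Sort by length (descending), then by start (ascending).
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        # Checking both boundaries is enough: every span kept so far is at
        # least as long as this one, so any overlap covers one of its ends.
        if start not in seen_tokens and end - 1 not in seen_tokens:
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    # Return the surviving spans in document order.
    return sorted(kept)
```

Sorting on `(length, -start)` in reverse keeps both preferences in one sort key, so the loop only has to check whether a span's boundary tokens are already taken.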
Sofie Van Landeghem
da6e0de34f fix attrs field in the matcher (#4423)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing

* ensure attrs is NULL when nr_attr == 0 + several fixes to prevent OOB
2019-10-10 15:20:59 +02:00
Sofie Van Landeghem
5efae495f1 Error when removing a matcher rule that doesn't exist (#4420)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing
2019-10-10 14:01:53 +02:00
Matthew Honnibal
fa95c030a5 Unify matcher get_ent_id and get_pattern_key (#4415)
This is basically stabbing blindly at the ghost match problem, but it at
least seems like there was a bug previously here --- so this should
hopefully be an improvement, even if it doesn't fix the ghost match
problem.
2019-10-09 15:26:31 +02:00
Ines Montani
77643de2ca Downgrade importlib_metadata requirement 2019-10-08 23:43:24 +02:00
Ines Montani
5cbe21700b Only show label scheme if not empty [ci skip] 2019-10-08 15:52:59 +02:00
Ines Montani
8f76d6c9ef Update transformer model details [ci skip] 2019-10-08 15:39:38 +02:00
Ines Montani
dd30d3ec99 Add setuptools as runtime dependency 2019-10-08 12:46:59 +02:00
Ines Montani
c4f95c1569 Update formatting and docstrings [ci skip] 2019-10-08 12:25:23 +02:00
Matthew Honnibal
ddd6fda59c Add registry for model creation functions ('architectures') (#4395)
* Add architecture registry

* Add test for arch registry

* Add error for model architectures
2019-10-08 12:21:03 +02:00
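As a rough illustration of what a registry for model creation functions can look like, here is a minimal sketch in plain Python. The class `ArchitectureRegistry`, its methods, and the name `"simple_model"` are all hypothetical and are not taken from the spaCy code base; the sketch only shows the general pattern of registering a function under a string name and raising an error for unknown names.

```python
class ArchitectureRegistry:
    """Map string names to model creation functions (hypothetical sketch)."""

    def __init__(self):
        self._funcs = {}

    def register(self, name):
        # Used as a decorator: @registry.register("some_name")
        def decorator(func):
            self._funcs[name] = func
            return func
        return decorator

    def get(self, name):
        # Fail with a descriptive error if the architecture is unknown.
        if name not in self._funcs:
            available = ", ".join(sorted(self._funcs)) or "none"
            raise KeyError(
                "Unknown architecture '%s'. Available: %s" % (name, available)
            )
        return self._funcs[name]


registry = ArchitectureRegistry()


@registry.register("simple_model")
def build_simple_model(width):
    # Stand-in for a real model constructor; returns a plain dict.
    return {"width": width}
```

A caller would then look up a constructor by name, for example `registry.get("simple_model")(width=64)`, which keeps the choice of architecture configurable by string.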
tamuhey
650cbfe82d multiprocessing pipe (#1303) (#4371)
* refactor: separate formatting docs and golds in Language.update

* fix return typo

* add pipe test

* unpickleable object cannot be assigned to p.map

* passed test pipe

* passed test!

* pipe terminate

* try pipe

* passed test

* fix ch

* add comments

* fix len(texts)

* add comment

* add comment

* fix: multiprocessing of pipe is not supported in Python 2

* test: use assert_docs_equal

* fix: is_python3 -> is_python2

* fix: change _pipe arg to use functools.partial

* test: add vector modification test

* test: add sample ner_pipe and user_data pipe

* add warnings test

* test: fix user warnings

* test: fix warnings capture

* fix: remove islice import

* test: remove warnings test

* test: add stream test

* test: rename

* fix: multiproc stream

* fix: stream pipe

* add comment

* mp.Pipe seems to be usable with relatively small data

* test: skip stream test in python2

* sort imports

* test: add reason to skiptest

* fix: use pipe for docs communication

* add comments

* add comment
2019-10-08 12:20:55 +02:00
adrianeboyd
14841d0aa6 Fix PhraseMatcher callback and add tests (#4399)
* Fix callback lookup in PhraseMatcher (string key rather than hash key)
* Add callback tests for Matcher and PhraseMatcher
2019-10-08 12:07:02 +02:00
Matthew Honnibal
fd4a5341b0 Fix ner_jsonl2json converter (fix #4389) (#4394) 2019-10-08 00:52:45 +02:00
Matthew Honnibal
29f9fec267 Improve spacy pretrain (#4393)
* Support bilstm_depth arg in spacy pretrain

* Add option to ignore zero vectors in get_cossim_loss

* Use cosine loss in Cloze multitask
2019-10-07 23:34:58 +02:00
Ines Montani
9cd6ca3e4d Improve usage of pkg_resources and handling of entry points (#4387)
* Only import pkg_resources where it's needed

Apparently it's really slow

* Use importlib_metadata for entry points

* Revert "Only import pkg_resources where it's needed"

This reverts commit 5ed8c03afa.

* Revert "Revert "Only import pkg_resources where it's needed""

This reverts commit 8b30b57957.

* Revert "Use importlib_metadata for entry points"

This reverts commit 9f071f5c40.

* Revert "Revert "Use importlib_metadata for entry points""

This reverts commit 02e12a17ec.

* Skip test that weirdly hangs

* Fix hanging test by using global
2019-10-07 17:22:09 +02:00
adrianeboyd
d53a8d9313 Consider batch_size when sorting similar vectors (#4388) 2019-10-07 13:38:35 +02:00
adrianeboyd
a3509f67d4 Extend unicode character block for Sinhala (#4378)
* Extend unicode character block for Sinhala

* Add sentencizer tests for more languages
2019-10-07 13:17:03 +02:00
Ines Montani
573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
adrianeboyd
cbc2cee2c8 Improve URL_PATTERN and handling in tokenizer (#4374)
* Move prefix and suffix detection for URL_PATTERN

Move prefix and suffix detection for `URL_PATTERN` into the tokenizer.
Remove associated lookahead and lookbehind from `URL_PATTERN`.

Fix tokenization for Hungarian given the modified handling of prefixes
and suffixes.

* Match a wider range of URI schemes
2019-10-05 13:00:09 +02:00
Ines Montani
e65dffd80b Clarify serialization of extension attributes (closes #4377) [ci skip] 2019-10-05 11:58:00 +02:00
Ines Montani
fec9433044 Make PhraseMatcher.vocab consistent with Matcher.vocab (closes #4373) 2019-10-04 12:18:41 +02:00
Ines Montani
e7ddc6f662 Add conda install for lookups [ci skip] 2019-10-03 17:52:53 +02:00
Matthew Honnibal
37ef874d8b Set version to v2.2.1 2019-10-03 14:50:39 +02:00
Sofie Van Landeghem
4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ines Montani
ce1d441de5 Add docs for Vectors.most_similar [ci skip] 2019-10-03 14:29:47 +02:00
Ben Taylor
1db79a33cb most_similar() returns the k most similar vectors (#4364)
* most_similar returns n most similar vectors

* updated most_similar comment

* add bintay contributor agreement

* sign bintay contributor agreement

* fix most_similar documentation typo

* fixed error in prune_vectors

* updated prune_vectors test
2019-10-03 14:09:44 +02:00
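The idea of returning the k most similar vectors can be sketched in plain Python. This is an illustration only: it assumes cosine similarity as the measure and a dict of plain lists as the vector table, and the function name `most_similar_sketch` is hypothetical. It does not reproduce the signature or return format of `Vectors.most_similar`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_sketch(query, vectors, k=1):
    """Return the k (key, score) pairs with the highest cosine similarity to query."""
    scored = [(key, cosine(query, vec)) for key, vec in vectors.items()]
    # Highest score first; sorted() is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Sorting the full table is fine for a sketch; a large vector table would more likely be scored in batches and reduced with a partial sort.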
Ines Montani
4159936720 Update README.md [ci skip] 2019-10-02 19:15:22 +02:00
Ines Montani
e4782feae9 Update README.md [ci skip] 2019-10-02 18:49:55 +02:00
Ines Montani
80cf385f65 Update v2-2.md [ci skip] 2019-10-02 16:58:21 +02:00
Ines Montani
f8e606c303 Update README.md [ci skip] 2019-10-02 16:47:10 +02:00
Ines Montani
12a941d841 Update binder version [ci skip] 2019-10-02 16:47:01 +02:00
Matthew Honnibal
2eb31012e7 Set version to v2.2.0 2019-10-02 14:40:06 +02:00
Matthew Honnibal
796072e560 Set version to v2.2.0.dev19 2019-10-02 12:51:29 +02:00
Sofie Van Landeghem
9d3ce7cba2 Ensure training doesn't crash with empty batches (#4360)
* unit test for previously resolved unflatten issue

* prevent a batch of empty docs from causing problems
2019-10-02 12:50:47 +02:00
Ines Montani
52b5912dbf Tidy up [ci skip] 2019-10-02 12:05:59 +02:00
adrianeboyd
d82241218a Make the default NER labels less model-specific [ci skip] (#4361) 2019-10-02 12:05:17 +02:00
adrianeboyd
dda86118bd Update Ukrainian lemmatizer with new lookups (#4359)
* Update Ukrainian lemmatizer with new lookups

* Add missing import


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-02 12:04:06 +02:00
Ines Montani
b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Ines Montani
208629615d Auto-format 2019-10-02 10:37:04 +02:00
Ines Montani
867e93aae2 Add Streamlit example [ci skip] 2019-10-02 01:21:20 +02:00
Matthew Honnibal
38b6e69389 Merge branch 'master' of https://github.com/explosion/spaCy 2019-10-01 22:28:25 +02:00
Matthew Honnibal
d4b63bb6dd Set version to v2.2.0 2019-10-01 22:28:13 +02:00
Ines Montani
9885b5ae68 Update spacy_lookups_data version [ci skip] 2019-10-01 22:21:21 +02:00
Ines Montani
475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
Matthew Honnibal
667f294627 Merge branch 'master' of https://github.com/explosion/spaCy 2019-10-01 21:37:25 +02:00