Commit Graph

15306 Commits

Author SHA1 Message Date
Meenal Jhajharia
2613f0e98f
benepar usage example has deprecated imports 2021-08-28 16:35:58 +05:30
Sofie Van Landeghem
689535c264 config is not Optional (#9024) 2021-08-27 11:53:54 +02:00
Sofie Van Landeghem
1e974de837
config is not Optional (#9024) 2021-08-27 11:44:31 +02:00
github-actions[bot]
fb9c31fbda
Auto-format code with black (#9065)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-08-27 11:42:27 +02:00
Sofie Van Landeghem
8c1d86ea92 Document use-case of freezing tok2vec (#8992)
* update error msg

* add sentence to docs

* expand note on frozen components
2021-08-26 09:53:29 +02:00
Sofie Van Landeghem
31c0a75e6d fix docs for Span constructor arguments (#9023) 2021-08-26 09:52:59 +02:00
Sofie Van Landeghem
4d39430b82
Document use-case of freezing tok2vec (#8992)
* update error msg

* add sentence to docs

* expand note on frozen components
2021-08-26 09:50:35 +02:00
Sofie Van Landeghem
94fb840443
fix docs for Span constructor arguments (#9023) 2021-08-25 16:06:22 +02:00
David Strouk
31e9b126a0
Fix verbs list in lang/fr/tokenizer_exceptions.py (#9033) 2021-08-25 15:55:09 +02:00
Ines Montani
4cd052e81d
Include component factories in third-party dependencies resolver (#9009)
* Include component factories in third-party dependencies resolver

* Increment catalogue and update test
2021-08-25 14:58:01 +02:00
svlandeg
fb8c2f794a Merge remote-tracking branch 'upstream/master' into spacy.io 2021-08-20 14:49:51 +02:00
Sofie Van Landeghem
e1f88de729
bump to 3.1.2 (#9008) 2021-08-20 12:41:09 +02:00
Sofie Van Landeghem
4d52d7051c
Fix spancat training on nested entities (#9007)
* overfitting test on non-overlapping entities

* add failing overfitting test for overlapping entities

* failing test for list comprehension

* remove test that was put in separate PR

* bugfix

* cleanup
2021-08-20 12:37:50 +02:00
Paul O'Leary McCann
9cc3dc2b67
Add glossary entry for _SP (#8983) 2021-08-20 12:04:02 +02:00
Sofie Van Landeghem
de025beb5f
Warn and document spangroup.doc weakref (#8980)
* test for error after Doc has been garbage collected

* warn about using a SpanGroup when the Doc has been garbage collected

* add warning to the docs

* rephrase slightly

* raise error instead of warning

* update

* move warning to doc property
2021-08-20 11:06:19 +02:00
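
Note on the commit above: the message describes a span group that holds only a weak reference to its document and raises an error once the document has been garbage collected. The following is a minimal plain-Python sketch of that general pattern. The `Doc` and `SpanGroup` classes, the `doc_is_alive` helper, and the error text are hypothetical stand-ins for illustration, not spaCy's actual implementation.

```python
import gc
import weakref


class Doc:
    """Hypothetical stand-in for a document object."""

    def __init__(self, text):
        self.text = text


class SpanGroup:
    """Hypothetical stand-in that keeps only a weak reference to its Doc."""

    def __init__(self, doc, name):
        self.name = name
        self._doc_ref = weakref.ref(doc)  # does not keep the Doc alive

    @property
    def doc(self):
        doc = self._doc_ref()  # returns None once the Doc has been collected
        if doc is None:
            raise RuntimeError("The Doc for this SpanGroup has been garbage collected")
        return doc


def doc_is_alive(group):
    """Return True if the group's Doc can still be reached."""
    try:
        group.doc
    except RuntimeError:
        return False
    return True


doc = Doc("hello world")
group = SpanGroup(doc, "demo")
alive_before = doc_is_alive(group)

del doc       # drop the only strong reference
gc.collect()  # make collection explicit on interpreters without refcounting
alive_after = doc_is_alive(group)
```

Raising from the property, rather than returning `None`, surfaces the problem at the point of use, which matches the "raise error instead of warning" step in the message.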
Paul O'Leary McCann
0e4da8ed70 Fix type annotation in docs 2021-08-20 15:35:41 +09:00
Paul O'Leary McCann
37fe847af4 Fix type annotation in docs 2021-08-20 15:34:22 +09:00
Ines Montani
8444aa75e2 Fix universe.json [ci skip] 2021-08-20 11:26:46 +10:00
Ines Montani
f2b61b77a5 Fix universe.json [ci skip] 2021-08-20 11:26:29 +10:00
Ines Montani
f2d19e6dc2 Merge pull request #9003 from bbieniek/add-spacy-api-v3 [ci skip] 2021-08-20 11:23:50 +10:00
Ines Montani
894e16f5ca
Merge pull request #9003 from bbieniek/add-spacy-api-v3 [ci skip] 2021-08-20 11:23:30 +10:00
Baltazar
4d85cb88a5 added contribution license 2021-08-19 21:45:18 +02:00
Baltazar
71e65fe943 added spacy api v3 docker 2021-08-19 21:29:25 +02:00
Adriane Boyd
c5de9b463a
Update custom tokenizer APIs and pickling (#8972)
* Fix incorrect pickling of Japanese and Korean pipelines, which led to
the entire pipeline being reset if pickled

* Enable pickling of Vietnamese tokenizer

* Update tokenizer APIs for Chinese, Japanese, Korean, Thai, and
Vietnamese so that only the `Vocab` is required for initialization
2021-08-19 14:37:47 +02:00
Adriane Boyd
6722dc3dc5
Fix allow_overlap default for spancat scoring (#8970)
* Remove irrelevant default options
2021-08-18 09:56:56 +02:00
Steele Farnsworth
b18cb1cd2a
Refactor dependencymatcher.pyx to use list comps and enumerate. (#8956)
* Refactor to use list comps and enumerate.

Replace loops that append to a list with list comprehensions where this does not change the behavior; replace range(len(...)) loops with enumerate. Correct one typo in a comment. Replace a call to set() with a set literal.

* Undo double assignment.

Expand `tokens_to_key[j] = k = self._get_matcher_key(key, i, j)` to two statements.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Sign contributors agreement

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-18 09:55:45 +02:00
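
Note on the commit above: the refactoring patterns named in the message (append loops to list comprehensions, `range(len(...))` to `enumerate`, `set()` to a set literal, and splitting a chained assignment) can be illustrated on toy data. This is a sketch of the general patterns only; `get_key` and all variable names are hypothetical and do not come from `dependencymatcher.pyx`.

```python
def get_key(i, j):
    """Hypothetical helper standing in for a key lookup."""
    return i * 10 + j


items = ["a", "b", "c"]

# Before: a loop that appends to a list, indexed with range(len(...)).
pairs_loop = []
for i in range(len(items)):
    pairs_loop.append((i, items[i].upper()))

# After: a list comprehension with enumerate; the result is the same.
pairs_comp = [(i, item.upper()) for i, item in enumerate(items)]

# Before: a set() call on a list. After: a set literal.
seen_call = set(["x", "y"])
seen_literal = {"x", "y"}

# Chained (double) assignment in one statement ...
tokens_to_key_a = {}
tokens_to_key_a[0] = k_a = get_key(1, 2)

# ... and the same effect expanded to two statements.
tokens_to_key_b = {}
k_b = get_key(1, 2)
tokens_to_key_b[0] = k_b
```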
Ines Montani
d94ddd5686
Auto-detect package dependencies in spacy package (#8948)
* Auto-detect package dependencies in spacy package

* Add simple get_third_party_dependencies test

* Import packages_distributions explicitly

* Inline packages_distributions

* Fix docstring [ci skip]

* Relax catalogue requirement

* Move importlib_metadata to spacy.compat with note

* Include license information [ci skip]
2021-08-17 14:05:13 +02:00
Sofie Van Landeghem
0a6b68848f
Fix making span_group (#8975)
* fix _make_span_group

* fix imports
2021-08-17 10:36:34 +02:00
Ines Montani
593a22cf2d
Add development docs for Language and code conventions (#8745)
* WIP: add dev docs for Language / config [ci skip]

* Add section on initialization [ci skip]

* Fix wording [ci skip]

* Add code conventions WIP [ci skip]

* Update code convention docs [ci skip]

* Update contributing guide and conventions [ci skip]

* Update Code Conventions.md [ci skip]

* Clarify sourced components + vectors

* Apply suggestions from code review [ci skip]

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update wording and add link [ci skip]

* restructure slightly + extended index

* remove paragraph that breaks flow and is repeated in more detail later

* fix anchors

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-08-17 09:38:15 +02:00
Paul O'Leary McCann
4ed5d9ad5a Add notes on preparing training data to docs (#8964)
* Add training data section

Not entirely sure this is in the right location on the page - maybe it
should be after quickstart?

* Add pointer from binary format to training data section

* Minor cleanup

* Add to ToC, fix filename

* Update website/docs/usage/training.md

Co-authored-by: Ines Montani <ines@ines.io>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move the training data section further down the page

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-16 17:39:19 +02:00
Paul O'Leary McCann
9391998c77
Add notes on preparing training data to docs (#8964)
* Add training data section

Not entirely sure this is in the right location on the page - maybe it
should be after quickstart?

* Add pointer from binary format to training data section

* Minor cleanup

* Add to ToC, fix filename

* Update website/docs/usage/training.md

Co-authored-by: Ines Montani <ines@ines.io>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move the training data section further down the page

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-16 17:37:21 +02:00
Ines Montani
d65e03adae Merge pull request #8951 from HLasse/master 2021-08-16 11:41:53 +10:00
Ines Montani
a894fe0440
Merge pull request #8951 from HLasse/master 2021-08-16 11:41:32 +10:00
Lasse
839ea0f987 change tags formatting to match 2021-08-13 14:40:08 +02:00
Lasse
70ab596f61 Merge branch 'master' of https://github.com/HLasse/spaCy 2021-08-13 14:35:21 +02:00
Lasse
195e4e48c3 add textdescriptives to universe 2021-08-13 14:35:18 +02:00
github-actions[bot]
92071326d8
Auto-format code with black (#8950)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-08-13 11:48:38 +02:00
Adriane Boyd
8448c7dbc5
Update da trf recommendation (#8921)
Update the da trf recommendation to the same model used in the
pretrained pipelines.
2021-08-12 13:54:02 +02:00
Ines Montani
647abe186c Merge pull request #8938 from explosion/docs/prodigy-v1-11-project [ci skip]
Update Prodigy project template for v1.11
2021-08-12 21:17:14 +10:00
Ines Montani
6260f044cc
Merge pull request #8938 from explosion/docs/prodigy-v1-11-project [ci skip]
Update Prodigy project template for v1.11
2021-08-12 21:16:49 +10:00
Adriane Boyd
b278f31ee6
Document scorers in registry and components from #8766 (#8929)
* Document scorers in registry and components from #8766

* Update spacy/pipeline/lemmatizer.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/dependencyparser.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Reformat

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-12 12:50:03 +02:00
Edward
944ad6b1d4
Add new parameter for saving every n epoch in pretraining (#8912)
* Add parameter for saving every n epoch

* Add new parameter in schemas

* Add new parameter in default_config

* Adjust schemas

* format code
2021-08-12 11:14:48 +02:00
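
Note on the commit above: the idea of writing a checkpoint only every n epochs can be sketched as a simple modulo check. The function and the parameter name `save_every` are hypothetical, chosen for illustration; the commit message does not state the actual parameter name or whether a final checkpoint is always written.

```python
def epochs_to_save(max_epochs, save_every=None):
    """Return the epoch indices after which a checkpoint would be written.

    `save_every` is a hypothetical setting: None means save after every epoch,
    an integer n means save only when the epoch index is a multiple of n.
    """
    if save_every is None:
        return list(range(max_epochs))
    if save_every < 1:
        raise ValueError("save_every must be a positive integer")
    return [epoch for epoch in range(max_epochs) if epoch % save_every == 0]
```

For example, `epochs_to_save(7, save_every=3)` selects epochs 0, 3 and 6.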
Ines Montani
4f769ff913 Update Prodigy project template for v1.11 [ci skip] 2021-08-12 13:46:20 +10:00
Paul O'Leary McCann
e227d24d43
Allow passing in array vars for speedup (#8882)
* Allow passing in array vars for speedup

This fixes #8845. Not sure about the docstring changes here...

* Update docs

Types maybe need more detail? Maybe not?

* Run prettier on docs

* Update spacy/tokens/span.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-10 15:13:53 +02:00
Adriane Boyd
f99d6d5e39
Refactor scoring methods to use registered functions (#8766)
* Add scorer option to components

Add an optional `scorer` parameter to all pipeline components. If a
scoring function is provided, it overrides the default scoring method
for that component.

* Add registered scorers for all components

* Add `scorers` registry
* Move all scoring methods outside of components as independent
  functions and register
* Use the registered scoring methods as defaults in configs and inits

Additional:

* The scoring methods no longer have access to the full component, so
  use settings from `cfg` as default scorer options to handle settings
  such as `labels`, `threshold`, and `positive_label`
* The `attribute_ruler` scoring method no longer has access to the
  patterns, so all scoring methods are called
* Bug fix: `spancat` scoring method is updated to set `allow_overlap` to
  score overlapping spans correctly

* Update Russian lemmatizer to use direct score method

* Check type of cfg in Pipe.score

* Fix check

* Update spacy/pipeline/sentencizer.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove validate_examples from scoring functions

* Use Pipe.labels instead of Pipe.cfg["labels"]

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-10 15:13:39 +02:00
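
Note on the commit above: the design described in the message (scoring functions registered independently of components, an optional `scorer` argument that overrides the default, and settings from `cfg` passed in as default options) can be sketched in plain Python. The `SCORERS` dictionary, the `register_scorer` decorator, the scorer names, and the `Component` class are all hypothetical and only approximate the idea; they are not spaCy's registry API.

```python
SCORERS = {}  # hypothetical registry: name -> scoring function


def register_scorer(name):
    """Register a scoring function under a string name."""
    def decorator(func):
        SCORERS[name] = func
        return func
    return decorator


@register_scorer("demo.accuracy.v1")
def accuracy_scorer(examples, **kwargs):
    # Each example is a (predicted, gold) pair in this toy setup.
    correct = sum(1 for pred, gold in examples if pred == gold)
    return {"acc": correct / len(examples) if examples else 0.0}


@register_scorer("demo.always_zero.v1")
def zero_scorer(examples, **kwargs):
    return {"acc": 0.0}


class Component:
    """Hypothetical component that delegates scoring to a function."""

    def __init__(self, scorer=None):
        # An explicitly passed scorer overrides the registered default.
        self.scorer = scorer if scorer is not None else SCORERS["demo.accuracy.v1"]
        self.cfg = {"labels": ["A", "B"]}

    def score(self, examples, **kwargs):
        # The scorer cannot see the component, so settings from cfg are
        # passed along as default options; explicit kwargs take precedence.
        options = dict(self.cfg)
        options.update(kwargs)
        return self.scorer(examples, **options)
```

With this layout, `Component().score(...)` uses the registered default, while `Component(scorer=SCORERS["demo.always_zero.v1"])` swaps in a different scoring function without subclassing.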
fgaim
ee011ca963
Update Tigrinya ትግርኛ language support (#8900)
* Add missing punctuation for Tigrinya and Amharic

* Fix numeral and ordinal numbers for Tigrinya

 - Amharic was used in many cases
 - Also fixed some typos

* Update Tigrinya stop-words

* Contributor agreement for fgaim

* Fix typo in "ti" lang test

* Remove multi-word entries from numbers and ordinals
2021-08-10 13:55:08 +02:00
Paul O'Leary McCann
6029cfc391
Add scores to output in spancat (#8855)
* Add scores to output in spancat

This exposes the scores as an attribute on the SpanGroup. Includes a
basic test.

* Add basic doc note

* Vectorize score calcs

* Add "annotation format" section

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Clean up doc section

* Ran prettier on docs

* Get arrays off the gpu before iterating over them

* Remove int() calls

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-08-10 13:47:49 +02:00
Dimitar Ganev
733ffe439d
Improve the stop words and the tokenizer exceptions in Bulgarian language. (#8862)
* Add more stop words and improve the readability

* Add and categorize the tokenizer exceptions for `bg` lang

* Create syrull.md

* Add references for the additional stop words and tokenizer exc abbrs
2021-08-10 13:44:23 +02:00
Adriane Boyd
415dee587c
Merge pull request #8911 from adrianeboyd/chore/update-develop-from-master-v3.1-1
Update develop from master
2021-08-09 15:41:36 +02:00
Ines Montani
c581848cbb Merge pull request #8910 from DuyguA/patch-1 [ci skip]
updated unv json for new book
2021-08-09 23:13:17 +10:00