Commit Graph

851 Commits

Author SHA1 Message Date
Ines Montani
06f1ecb308 Update v3 docs 2020-07-03 16:48:21 +02:00
Ines Montani
cdf9ee1716 Add stub for Example API docs [ci skip] 2020-07-03 15:46:10 +02:00
Ines Montani
fa8e097c04 Update convert docs [ci skip] 2020-07-03 15:42:04 +02:00
Jan Jessewitsch
e4dcac4a4b
Merging multiple docs into one (#5032)
* Add static method to Doc to allow merging of multiple docs.

* Add error description for the error that occurs if docs with different
vocabs (from different languages) are merged in Doc.from_docs().

* Add test for Doc.from_docs() implementation.

* Fix using numpy's concatenate in Doc.from_docs.

* Replace typing's type annotations in from_docs.

* Simply remove type annotations in from_docs.

* Add documentation for Doc.from_docs to api.

* Simplify from_docs, its test and the api doc for codebase consistency.

* Fix merging of Doc objects that end with whitespaces (Achieved by simply not setting the SPACY attribute on whitespace tokens). Remove two unnecessary imports of attributes.

* Add merging of user data from Doc objects in from_docs. Add user data test case to corresponding test. Add applicable warning messages.

* Fix incorrect setting of tokens idx by using concatenated spaces (again). Add test case to corresponding test.

* Add MORPH to attrs

* Update warnings calls

* Remove out-dated error from merge

* Rename space_delimiter to ensure_whitespace

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-07-03 11:32:42 +02:00
Adriane Boyd
a723fa02a1
DocBin: add version number, missing attributes and strings (#5685)
* Add version number to DocBin

Add a version number to DocBin for future use.

* Add POS to all attributes in DocBin

* Add morph string to strings in DocBin

* Update DocBin API

* Add string for ENT_KB_ID in DocBin
2020-07-02 17:41:50 +02:00
Ines Montani
b5268955d7 Update matcher usage examples [ci skip] 2020-07-02 15:39:45 +02:00
Ines Montani
a4cfe9fc33 Remove inline notes on v2 changes [ci skip] 2020-07-01 22:29:22 +02:00
Ines Montani
fe4cfd0632 Start updating website for v3 [ci skip] 2020-07-01 21:26:39 +02:00
Ines Montani
26df4efa94 Add new in v3.0 2020-07-01 13:02:17 +02:00
Ines Montani
18a900abc2 Fix markup 2020-07-01 13:02:07 +02:00
Ines Montani
414dc7ace1 Merge branch 'spacy.io' into spacy.io-develop 2020-07-01 11:47:47 +02:00
Álvaro Abella Bascarán
7111b9de2e Fix in docs: pipe(docs) instead of pipe(texts) (#5680)
Very minor fix in docs, specifically in this part:

```
 matcher = PhraseMatcher(nlp.vocab)
>   for doc in matcher.pipe(texts, batch_size=50):
>       pass
```

`texts` suggests the input is an iterable of strings. I replaced it for `docs`.
2020-06-30 20:01:12 +02:00
Álvaro Abella Bascarán
ff0dbe5c64
Fix in docs: pipe(docs) instead of pipe(texts) (#5680)
Very minor fix in docs, specifically in this part:

```
 matcher = PhraseMatcher(nlp.vocab)
>   for doc in matcher.pipe(texts, batch_size=50):
>       pass
```

`texts` suggests the input is an iterable of strings. I replaced it for `docs`.
2020-06-30 20:00:50 +02:00
Matthias Hertel
305221f3e5 Website: fixed the token span in the text about the rule-based matching example (#5669)
* fixed token span in pattern matcher example

* contributor agreement
2020-06-30 19:58:55 +02:00
Matthias Hertel
8b0f749606
Website: fixed the token span in the text about the rule-based matching example (#5669)
* fixed token span in pattern matcher example

* contributor agreement
2020-06-30 19:58:23 +02:00
Adriane Boyd
d777d9cc38 Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:13:01 +02:00
Adriane Boyd
c4d0209472
Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:12:29 +02:00
Adriane Boyd
a2660bd9c6 Fix backslashes in warnings config diff (#5640)
Fix backslashes in warnings config diff in v2.3 migration section.
2020-06-24 10:26:57 +02:00
Adriane Boyd
fd4287c178
Fix backslashes in warnings config diff (#5640)
Fix backslashes in warnings config diff in v2.3 migration section.
2020-06-24 10:26:12 +02:00
Adriane Boyd
4f73ced914 Extend what's new in v2.3 with vocab / is_oov (#5635) 2020-06-23 16:50:43 +02:00
Adriane Boyd
7ce451c211
Extend what's new in v2.3 with vocab / is_oov (#5635) 2020-06-23 16:48:59 +02:00
Adriane Boyd
fcdecefacf Add warnings example in v2.3 migration guide (#5627) 2020-06-22 14:38:06 +02:00
Adriane Boyd
bc1cb30b21
Add warnings example in v2.3 migration guide (#5627) 2020-06-22 14:37:24 +02:00
Ines Montani
52728d8fa3 Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
Adriane Boyd
66889de166 Warning for sudachipy 0.4.5 (#5611) 2020-06-19 13:45:23 +02:00
Adriane Boyd
931d80de72
Warning for sudachipy 0.4.5 (#5611) 2020-06-19 12:43:41 +02:00
Ines Montani
6d712f3e06
Merge pull request #5599 from adrianeboyd/docs/v2.3.0-minor 2020-06-16 13:49:25 -07:00
Adriane Boyd
02369f91d3 Fix spacy convert argument 2020-06-16 20:41:17 +02:00
Adriane Boyd
f0fd77648f Change example title to Dr.
Change example title to Dr. so the current model does exclude the title
in the initial example.
2020-06-16 20:36:21 +02:00
Adriane Boyd
a6abdfbc3c Fix numpy.zeros() dtype for Doc.from_array 2020-06-16 20:35:45 +02:00
Adriane Boyd
9aff317ca7 Update POS in tagging example 2020-06-16 20:26:57 +02:00
Adriane Boyd
457babfa0c Update alignment example for new gold.align 2020-06-16 20:22:03 +02:00
Ines Montani
44af53bdd9 Add pkuseg warnings and auto-format [ci skip] 2020-06-16 17:13:35 +02:00
Ines Montani
a9e5b840ee Fix typos and auto-format [ci skip] 2020-06-16 16:38:45 +02:00
Adriane Boyd
d5110ffbf2
Documentation updates for v2.3.0 (#5593)
* Update website models for v2.3.0

* Add docs for Chinese word segmentation

* Tighten up Chinese docs section

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Auto-format and update version

* Update matcher.md

* Update languages and sorting

* Typo in landing page

* Infobox about token_match behavior

* Add meta and basic docs for Japanese

* POS -> TAG in models table

* Add info about lookups for normalization

* Updates to API docs for v2.3

* Update adding norm exceptions for adding languages

* Add --omit-extra-lookups to CLI API docs

* Add initial draft of "What's New in v2.3"

* Add new in v2.3 tags to Chinese and Japanese sections

* Add tokenizer to migration section

* Add new in v2.3 flags to init-model

* Typo

* More what's new in v2.3

Co-authored-by: Ines Montani <ines@ines.io>
2020-06-16 15:37:35 +02:00
Sofie Van Landeghem
c0f4a1e43b
train is from-config by default (#5575)
* verbose and tag_map options

* adding init_tok2vec option and only changing the tok2vec that is specified

* adding omit_extra_lookups and verifying textcat config

* wip

* pretrain bugfix

* add replace and resume options

* train_textcat fix

* raw text functionality

* improve UX when KeyError or when input data can't be parsed

* avoid unnecessary access to goldparse in TextCat pipe

* save performance information in nlp.meta

* add noise_level to config

* move nn_parser's defaults to config file

* multitask in config - doesn't work yet

* scorer offering both F and AUC options, need to be specified in config

* add textcat verification code from old train script

* small fixes to config files

* clean up

* set default config for ner/parser to allow create_pipe to work as before

* two more test fixes

* small fixes

* cleanup

* fix NER pickling + additional unit test

* create_pipe as before
2020-06-12 02:02:07 +02:00
Sofie Van Landeghem
4d1ba6feb4
add tag variant for 2.3 (#5542) 2020-06-04 19:16:33 +02:00
Ines Montani
810fce3bb1 Merge branch 'develop' into master-tmp 2020-06-03 14:36:59 +02:00
svlandeg
5f0a91cf37 fix conv-depth parameter 2020-05-29 09:56:29 +02:00
Ines Montani
262d306eaa unicode -> str consistency 2020-05-24 17:23:00 +02:00
Ines Montani
5d3806e059 unicode -> str consistency 2020-05-24 17:20:58 +02:00
Jannis
aa53ce6996
Documentation Typo Fix (#5492)
* Fix typo

Change 'realize' to 'realise'

* Add contributer agreement
2020-05-22 19:50:26 +02:00
Matthew Honnibal
f6078d866a
Merge pull request #5121 from adrianeboyd/bugfix/revert-token-match
Revert token_match priority changes from #4374 and extend token match options
2020-05-22 14:42:51 +02:00
Ines Montani
65c7e82de2 Auto-format and remove 2.3 feature [ci skip] 2020-05-22 13:50:30 +02:00
Adriane Boyd
e4a1b5dab1 Rename to url_match
Rename to `url_match` and update docs.
2020-05-22 12:41:03 +02:00
Adriane Boyd
730fa493a4 Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-22 12:18:00 +02:00
Ines Montani
24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
Sofie Van Landeghem
0d94737857
Feature toggle_pipes (#5378)
* make disable_pipes deprecated in favour of the new toggle_pipes

* rewrite disable_pipes statements

* update documentation

* remove bin/wiki_entity_linking folder

* one more fix

* remove deprecated link to documentation

* few more doc fixes

* add note about name change to the docs

* restore original disable_pipes

* small fixes

* fix typo

* fix error number to W096

* rename to select_pipes

* also make changes to the documentation

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-18 22:27:10 +02:00
Ines Montani
f333c2a011
Merge pull request #5386 from svlandeg/fix/nel-docs 2020-05-10 12:00:09 +02:00
adrianeboyd
4a15b559ba
Clarify Token.pos as UPOS (#5419) 2020-05-08 10:36:25 +02:00
adrianeboyd
a2345618f1
Fix Token API docs from #5375 (#5418) 2020-05-08 10:25:02 +02:00
Adriane Boyd
565e0eef73 Add tokenizer option for token match with affixes
To fix the slow tokenizer URL (#4374) and allow `token_match` to take
priority over prefixes and suffixes by default, introduce a new
tokenizer option for a token match pattern that's applied after prefixes
and suffixes but before infixes.
2020-05-05 10:35:33 +02:00
Adriane Boyd
792c8af8cf Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-05 09:25:57 +02:00
svlandeg
ebaed7dcfa Few more updates to the EL documentation 2020-04-30 10:17:06 +02:00
adrianeboyd
bdff76dede
Various updates/additions to CLI scripts (#5362)
* `debug-data`: determine coverage of provided vectors

* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization

* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file

* `train`:

  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration

  * if training is aborted, exit with a non-0 exit status
2020-04-29 12:56:46 +02:00
Sofie Van Landeghem
cfdaf99b80
Fix passing of component configuration (#5374)
* add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument

* add fix and test for Issue 5137
2020-04-29 12:56:17 +02:00
Sofie Van Landeghem
f67343295d
Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
2020-04-29 12:53:53 +02:00
adrianeboyd
a6e521cd79
Add is_sent_end token property (#5375)
Reconstruction of the original PR #4697 by @MiniLau.

Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
2020-04-29 12:53:16 +02:00
adrianeboyd
90ce34db42
Add cuda101 and cuda102 options to setup (#5377)
* Add cuda101 and cuda102 options to setup

* Update cudaNNN options in docs
2020-04-29 12:51:12 +02:00
adrianeboyd
792aa7b6ab
Remove references to textcat spans (#5360)
Remove references to unimplemented `TextCategorizer` span labels in
`GoldParse` and `Doc`.
2020-04-27 18:01:12 +02:00
adrianeboyd
90c754024f
Update nlp.vectors to nlp.vocab.vectors (#5357) 2020-04-27 10:53:05 +02:00
Mike
481574cbc8
[minor doc change] embedding vis. link is broken in website/docs/usage/examples.md (#5325)
* The embedding vis. link is broken

The first link seems to be reasonable for now unless someone has an updated embedding vis they want to share?

* contributor agreement

* Update Mlawrence95.md

* Update website/docs/usage/examples.md

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-21 20:35:12 +02:00
laszabine
fb73d4943a
Amend documentation to Language.evaluate (#5319)
* Specified usage of arguments to Language.evaluate

* Created contributor agreement
2020-04-16 20:00:18 +02:00
Sofie Van Landeghem
a3965ec13d
tag-map-path since 2.2.4 instead of 2.2.3 (#5289) 2020-04-14 14:53:47 +02:00
Marek Grzenkowicz
6a8a52650f
[Closes #5292] Fix typo in option name "--n-save_every" (#5293)
* Sign contributor agreement for chopeen

* Fix typo in option name and close #5292
2020-04-11 23:35:01 +02:00
Sofie Van Landeghem
1137420840
Small doc fixes (#5250)
* fix link

* torchtext instead tochtext
2020-04-03 13:01:43 +02:00
Nikhil Saldanha
d1ddfa1cb7 update docs for EntityRecognizer.predict
return type was wrongly written as a tuple, changed to syntax.StateClass
2020-03-28 18:13:02 +01:00
Sofie Van Landeghem
9b412516e7
Fixing pickling of the parser (#5218)
* fix __reduce__ for pickling parser

* setting the move object as 'state' during pickling

* unskip test_issue4725 - works again
2020-03-27 19:35:26 +01:00
Ines Montani
46568f40a7 Merge branch 'master' into tmp/sync 2020-03-26 13:38:14 +01:00
Tiljander
e53232533b
Describing priority rules for overlapping matches (#5197)
* Describing priority rules for overlapping matches

* Create Tiljander.md

* Describing priority rules for overlapping matches

* Update website/docs/api/entityruler.md

Co-Authored-By: Ines Montani <ines@ines.io>

Co-authored-by: Ines Montani <ines@ines.io>
2020-03-26 13:13:22 +01:00
adrianeboyd
d88a377bed
Remove Vectors.from_glove (#5209) 2020-03-26 10:45:47 +01:00
Ines Montani
17bd9ed84f
Merge pull request #5153 from pinealan/fix/website-docs
Fix website typos and weird sentences
2020-03-16 15:03:01 +01:00
Alan Chan
2124be100d Tweak run-on sentence 2020-03-15 03:45:20 +08:00
Alan Chan
7c3a4ce933 Missing word in api/cli doc 2020-03-15 03:45:20 +08:00
Alan Chan
36e3532475 Remove unfinished sentence 2020-03-15 03:45:17 +08:00
Mark Abraham
a0ffa346c0 Fix broken link in docs 2020-03-13 14:07:26 +01:00
Ines Montani
c669435c62
Merge pull request #5125 from renaud/patch-1
small typo in code sample
2020-03-12 11:19:12 +01:00
svlandeg
1724a4f75b additional information if doc is empty 2020-03-09 18:08:18 +01:00
Renaud Richardet
eccf6b1686
small typo in code sample 2020-03-09 14:49:11 +01:00
Adriane Boyd
0c31f03ec5 Update docs [ci skip] 2020-03-09 13:41:17 +01:00
Adriane Boyd
1139247532 Revert changes to token_match priority from #4374
* Revert changes to priority of `token_match` so that it has priority
over all other tokenizer patterns

* Add lookahead and potentially slow lookbehind back to the default URL
pattern

* Expand character classes in URL pattern to improve matching around
lookaheads and lookbehinds related to #4882

* Revert changes to Hungarian tokenizer

* Revert (xfail) several URL tests to their status before #4374

* Update `tokenizer.explain()` and docs accordingly
2020-03-09 12:09:41 +01:00
Ines Montani
1d6aec805d Fix formatting and update docs for v2.2.4 2020-03-09 11:17:20 +01:00
Ines Montani
acb4e3c7ba
Merge pull request #5039 from adrianeboyd/typo/website-token-api-shape
Fix formatting in Token API
2020-02-25 14:57:25 +01:00
Sofie Van Landeghem
479bd8d09f
add lemma option to displacy 'dep' visualiser (#5041)
* add lemma option to displacy 'dep' visualiser

* more compact list comprehension

* add option to doc

* fix test and add lemmas to util.get_doc

* fix capital

* remove lemma from get_doc

* cleanup
2020-02-22 14:11:51 +01:00
Adriane Boyd
3853d385fa Fix formatting in Token API 2020-02-20 13:41:24 +01:00
Ines Montani
de11ea753a Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
Kabir Khan
f6ed07b85c
Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort end_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set

* Run make_doc optimistically if using phrase matcher patterns.

* remove unused coveragerc I was testing with

* format

* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.

* Removing old add_patterns function

* Fixing spacing

* Make sure token_patterns loaded as well, before generator was being emptied in from_disk
2020-02-16 18:17:47 +01:00
Julin S
479e81bafc
fix link (#4977) 2020-02-10 20:31:26 -05:00
Ines Montani
9c08d9baa3 Remove old sections [ci skip] (closes #4961) 2020-02-03 13:10:46 +01:00
Ines Montani
abd5c06374 Adjust formatting [ci skip] 2020-02-03 13:00:02 +01:00
Martin A. Kayser
02a44c5be2
Adding a note on retrieving the string rep of the match_id (#4904)
Stolen from here: https://stackoverflow.com/questions/47638877/using-phrasematcher-in-spacy-to-find-multiple-match-types
2020-02-03 12:58:58 +01:00
adrianeboyd
7ad000fce7 Update docs for train CLI --use_gpu option (#4927) 2020-01-20 17:02:47 +01:00
Preston Badeer
b216ff43c9 Update vectors-similarity.md (#4889)
These links are broken on the website, due to quotes around the URLs.
2020-01-08 16:49:40 +01:00
Geoffrey Gordon Ashbrook
53929138d7 remove extra word typo (#4875)
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani
400257a802 Update index.md [ci skip] 2020-01-04 01:52:18 +01:00
Ivan Echevarria
ef13e0c038 Add n_process to Language.pipe documentation (#4842) [ci skip]
* Add n_process to documentation

* Auto-format and add default [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani
158b98a3ef Merge branch 'master' into develop 2019-12-21 18:55:03 +01:00
Ines Montani
1b838d1313 Divide models into core and starters [ci skip] 2019-12-21 14:10:22 +01:00
Sofie Van Landeghem
8ebbb85117 Documentation for PhraseMatcher constructor (#4826)
* add max_length as argument for init PhraseMatcher

* improve error message too
2019-12-20 23:00:04 +01:00