Commit Graph

13925 Commits

Author SHA1 Message Date
Adriane Boyd
3f61f5eb54
Use int8_t instead of char in Matcher (#6413)
* Use signed char instead of char in Matcher

Remove unused char* utf8_t typedef

* Use int8_t instead of signed char
2020-11-23 10:26:47 +01:00
Adriane Boyd
4284605683
Remove Beam cleanup (#6414)
Beam cleanup is handled through the Beam finalization method.
2020-11-23 10:01:46 +01:00
Adriane Boyd
a8c2dad466
Add all vectors to vocab before pruning (#6408)
Add all vectors to the vocab before pruning to correct the selection of
vectors to prioritize.
2020-11-23 10:00:59 +01:00
Adriane Boyd
13f0676f04
Updates for python 3.9 (#6338)
* Update blis and thinc version ranges

* Update thinc version range

* Update setup.cfg for python 3.9

* Adjust blis and thinc ranges
* Add python 3.9 classifier

* Update CI for python 3.9

* Add --prefer-binary to CI sdist install

* Update CI python 3.7 mac image

* Add --prefer-binary to Travis CI

* Update install instructions in README

* Specify blis versions separately for < / >= 3.6

* Update --prefer-binary in README

* Test cleaner sdist install

* Also upgrade pip

(This is kind of unnecessary given --prefer-binary but may avoid other
issues related to sdist installs in the future.)

* Compile with -j 2

* Remove wheel from setup_requires

* Update to have separate CI uninstall step

* Remove wheel from pyproject.toml

* Recommend upgrading setuptools in addition to pip
2020-11-23 09:45:18 +01:00
Yusuke Mori
e3ac90b035
Avoid a SyntaxError in self-attentive-parser (#6428)
* Avoid a SyntaxError in self-attentive-parser

Fix a usage of quotation marks in the example of spaCy Universe self-attentive-parser

* Create forest1988.md

Fill in the spaCy contributor agreement
2020-11-22 21:59:37 +01:00
svlandeg
218abaa69a typo 2020-11-20 22:36:49 +01:00
svlandeg
e861e928df more small corrections 2020-11-20 22:29:58 +01:00
svlandeg
5ac0867427 final fixes 2020-11-20 22:18:53 +01:00
svlandeg
331ec83493 edits and updates to implementing REL component docs 2020-11-20 21:41:52 +01:00
svlandeg
4a3e611abc small fixes and formatting 2020-11-20 15:55:05 +01:00
svlandeg
124f49feb6 update REL model code 2020-11-20 15:25:20 +01:00
svlandeg
636be3c791 Merge remote-tracking branch 'upstream/develop' into feature/trf-docs 2020-11-19 14:15:35 +01:00
Sofie Van Landeghem
165993d8e5
fix typo in transformer docs (#6404) 2020-11-19 14:11:38 +01:00
M. Revuelta Espinosa
51232ffb9e
Update universe.json (include PatternOmatic) (#6399)
Request to include PatternOmatic in spaCy Universe

Adds @revuel to contributors
2020-11-19 13:15:50 +01:00
Adriane Boyd
3cf6479467 Fix JSON in #6395 2020-11-17 15:25:41 +01:00
Sam Edwardes
78913a4f95
Added spaCyTextBlob to universe.json (#6395) 2020-11-17 14:38:34 +01:00
Adriane Boyd
96726ec1f6
Fix DocBin init in training example (#6396) 2020-11-17 14:36:44 +01:00
Adriane Boyd
ed32fa80cd Update source install instructions
* Use `pip install` instead of `python setup.py install`
* For developers recommend:
  * `python setup.py build_ext --inplace -j N`
  * `python setup.py develop`
2020-11-16 10:13:51 +01:00
svlandeg
99d0412b6e add link to REL project 2020-11-15 18:35:56 +01:00
svlandeg
73fc1ed963 remove labels from morphologizer constructor 2020-11-11 21:48:50 +01:00
svlandeg
d5a920325f remove labels from constructor 2020-11-11 21:34:12 +01:00
svlandeg
fcd79e0655 remove set_morphology from docs 2020-11-11 21:32:34 +01:00
Adriane Boyd
320a8b1481
Add ent_id_ to strings serialized with Doc (#6353) 2020-11-10 20:16:07 +08:00
Adriane Boyd
a7e7d6c6c9
Ignore misaligned in Morphologizer.get_loss (#6363)
Fix bug where `Morphologizer.get_loss` treated misaligned annotation as
`EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH`
mappings.
2020-11-10 20:15:09 +08:00
Sofie Van Landeghem
a0c899a0ff
Fix textcat + transformer architecture (#6371)
* add pooling to textcat TransformerListener

* maybe_get_dim in case it's null
2020-11-10 20:14:47 +08:00
Ines Montani
3ca5c7082d Use pip install . in quickstart [ci skip] 2020-11-10 17:27:49 +08:00
Ines Montani
de6453940e
Merge pull request #6305 from svlandeg/feature/score-docs [ci skip] 2020-11-10 02:52:11 +01:00
Ines Montani
d490428089 Update README.md [ci skip] 2020-11-10 09:51:20 +08:00
Ines Montani
4d337eedf2
Merge pull request #6322 from medspacy/master 2020-11-10 02:47:29 +01:00
Ines Montani
d7950c5ada
Merge pull request #6297 from adrianeboyd/docs/nightly-conda-install [ci skip] 2020-11-10 02:45:52 +01:00
Ines Montani
448bfbdc30 Remove conda from nightly install widget [ci skip] 2020-11-10 09:44:52 +08:00
svlandeg
789fb3d124 add docs for upstream argument of TransformerListener 2020-11-09 21:42:58 +01:00
Ines Montani
363ac73c72 Update docs [ci skip] 2020-11-09 12:43:26 +08:00
Adriane Boyd
90550552a0
CI updates for python 3.5 (#6354)
* Update pip in CI

* Use --prefer-binary

* Use `--prefer-binary`
* Delete all installed packages before testing source install

* sdist install with --only-binary :all:
2020-11-06 13:35:51 +01:00
Daniel Vasic
20d72de986
Added Multext-East V5 tagset for Croatian language (#6248)
* Added Multext-East V5 tagset for Croatian language

* Create danielvasic.md

* Update danielvasic.md

* Update danielvasic.md

* Add tag map to CroatianDefaults

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-11-05 12:19:22 +01:00
Robert Šípek
6069efe57d
Add tag map to cs language (#6284) 2020-11-05 10:13:11 +01:00
Adriane Boyd
8644ee3e3f
Update TIGER link and tag description (#6344) 2020-11-05 09:33:00 +01:00
Vu Ha
6d465ec52c
add oprd to the list of accepted deps for noun chunking (#6302)
* add oprd to the list of accepted deps for noun chunking

* add SCA
2020-11-05 09:17:35 +01:00
Adriane Boyd
31de700b0f
Fix on_match callback and remove empty patterns (#6312)
For the `DependencyMatcher`:

* Fix on_match callback so that it is called once per matched pattern
* Fix results so that patterns with empty match lists are not returned
2020-11-05 09:16:26 +01:00
Sofie Van Landeghem
8ef056cf98
fix embed_size in Entity Linker architecture (#6343) 2020-11-04 22:20:13 +01:00
Ines Montani
019a1dd5e8 Fix v3 overview [ci skip] 2020-11-03 18:10:06 +01:00
Adriane Boyd
1c4df8fd09
Replace pytokenizations with internal alignment (#6293)
* Replace pytokenizations with internal alignment

Replace pytokenizations with internal alignment algorithm that is
restricted to only allow differences in whitespace and capitalization.

* Rename `spacy.training.align` to `spacy.training.alignment` to contain
the `Alignment` dataclass
* Implement `get_alignments` in `spacy.training.align`

* Refactor trailing whitespace handling

* Remove unnecessary exception for empty docs

Allow a non-empty whitespace-only doc to be aligned with an empty doc

* Remove empty docs exceptions completely
2020-11-03 16:24:38 +01:00
Adriane Boyd
a4b32b9552
Handle missing reference values in scorer (#6286)
* Handle missing reference values in scorer

Handle missing values in reference doc during scoring where it is
possible to detect an unset state for the attribute. If no reference
docs contain annotation, `None` is returned instead of a score. `spacy
evaluate` displays `-` for missing scores and the missing scores are
saved as `None`/`null` in the metrics.

Attributes without unset states:

* `token.head`: relies on `token.dep` to recognize unset values
* `doc.cats`: unable to handle missing annotation

Additional changes:

* add optional `has_annotation` check to `score_scans` to replace
`doc.sents` hack
* update `score_token_attr_per_feat` to handle missing and empty morph
representations
* fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START`
vs. `SENT_START`

* Fix import

* Update return types
2020-11-03 15:47:18 +01:00
Alec Chapman
204c7c8a00 fix thumbnail link to be github raw url 2020-11-01 07:53:48 -07:00
Adriane Boyd
5d2cb86c34
Fix on_match callback for DependencyMatcher (#6313)
Fix `DependencyMatcher` so that the callback is called only once per
match.
2020-10-31 12:20:27 +01:00
Adriane Boyd
45c9a68828
Identify final Matcher pattern node by quantifier (#6317)
Modify the internal pattern representation in `Matcher` patterns to
identify the final ID state using a unique quantifier rather than a
combination of other attributes.

It was insufficient to identify the final ID node based on an
uninitialized `quantifier` (coincidentally being the same as the `ZERO`)
with `nr_attr` as 0. (In addition, it was potentially bug-prone that
`nr_attr` was set to 0 even though attrs were allocated.)

In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr`
is 0 and the quantifier is ZERO, so the previous methods for
incrementing to the ID node at the end of the pattern weren't able to
distinguish the final ID node from the `{"OP": "!"}` pattern.
2020-10-31 12:18:48 +01:00
Sofie Van Landeghem
2918923541
fix resolving of dot notation (#6326) 2020-10-31 12:17:06 +01:00
Alec Chapman
73d22d96ff add medspacy to universe and fix example w/ cov-bsv 2020-10-29 07:53:56 -06:00
Duygu Altinok
0e55f806dd
Turkish tokenization improvements (#6268)
* added single and paired orth variants

* added token match

* added long text tokenization test

* inverted init

* normalized lemmas to lowercase

* more abbrevs

* tests for ordinals and abbrevs

* separated period abbvrevs to another list

* fiex typo

* added ordinal and abbrev tests

* added number tests for dates

* minor refinement

* added inflected abbrevs regex

* added percentage and inflection

* cosmetics

* added token match

* added url inflection tests

* excluded url tokens from custom pattern

* removed url match import
2020-10-29 09:43:17 +01:00
Adriane Boyd
8cc5ed6771 Add Macedonian to website languages 2020-10-29 08:49:56 +01:00