Commit Graph

8279 Commits

Author SHA1 Message Date
Ines Montani
05a2812ae0 Merge branch 'develop' into pr/6444 2020-12-09 11:04:03 +11:00
Sofie Van Landeghem
cfc72c2995
Bugfix multi-label textcat reproducibility (#6481)
* add test for multi-label textcat reproducibility

* remove positive_label

* fix lengths dtype

* fix comments

* remove comment that we should not have forgotten :-)
2020-12-09 06:29:15 +08:00
Sofie Van Landeghem
de108ed3e8
Add specific error when StaticVectors can't read the vectors data (#6450) 2020-12-09 06:16:07 +08:00
Ines Montani
8921364579
Merge pull request #6521 from explosion/feature/config-stdin
Allow reading config from stdin in spacy train
2020-12-08 22:07:43 +11:00
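The stdin support merged above can be pictured with a minimal sketch. It is not spaCy's code: `read_config_text` is a hypothetical helper, and treating the path `-` as "read from stdin" is an assumed CLI convention that the commit message does not spell out. The standard-library `ConfigParser` stands in for the real config loader.

```python
import io
import sys
from configparser import ConfigParser


def read_config_text(path, stream=None):
    """Return config text from `path`, or from a stream when path is "-" (assumed convention)."""
    if str(path) == "-":
        # Fall back to the process's stdin when no stream is injected.
        source = stream if stream is not None else sys.stdin
        return source.read()
    with open(path, encoding="utf8") as file_:
        return file_.read()


# Simulate piped input with an in-memory stream instead of a real pipe.
fake_stdin = io.StringIO("[training]\nseed = 0\n")
parser = ConfigParser()
parser.read_string(read_config_text("-", stream=fake_stdin))
seed = parser.getint("training", "seed")
```

Passing the stream as a parameter keeps the helper testable without touching the real `sys.stdin`.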
Ines Montani
6c7a930ee8 Fix variable 2020-12-08 20:44:59 +11:00
Ines Montani
94a5a9814f Update argument handling and documentation 2020-12-08 20:41:18 +11:00
Ines Montani
d25b1606d6 Allow reading config from stdin in spacy train 2020-12-08 18:01:40 +11:00
Ines Montani
6cfa66ed1c
Make training.loop return nlp object and path (#6520) 2020-12-08 14:55:55 +08:00
Sofie Van Landeghem
2c27093c5f
require_cpu functionality (#6336)
* add require_cpu from Thinc 8.0.0rc2

* add docs

* fix test if cupy is not installed
2020-12-08 14:42:40 +08:00
Sofie Van Landeghem
f98a04434a
pretrain architectures (#6451)
* define new architectures for the pretraining objective

* add loss function as attr of the model

* cleanup

* cleanup

* shorten name

* fix typo

* remove unused error
2020-12-08 14:41:03 +08:00
Adriane Boyd
29b058ebdc
Fix spacy when retokenizing cases with affixes (#6475)
Preserve `token.spacy` corresponding to the span end token in the
original doc rather than adjusting for the current offset.

* If not modifying in place, this checks in the original document
(`doc.c` rather than `tokens`).
* If modifying in place, the document has not been modified past the
current span start position so the value at the current span end
position is valid.
2020-12-08 14:25:56 +08:00
Adriane Boyd
4448680750
Fix alignment for 1-to-1 tokens and lowercasing (#6476)
* When checking for token alignments, check not only that the tokens are
identical but that the character positions are both at the start of a
token.

  It's possible for the tokens to be identical even though the two
tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs.
`["a", "''", "'"]`, where the middle tokens are identical but should not
be aligned on the token level at character position 2 since it's the
start of one token but the middle of another.

* Use the lowercased version of the token texts to create the
character-to-token alignment because lowercasing can change the string
length (e.g., for `İ`, see the not-a-bug bug report:
https://bugs.python.org/issue34723)
2020-12-08 14:25:16 +08:00
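Both points in the commit above can be checked with a small plain-Python sketch. It is an illustration only, not spaCy's alignment code; `token_starts` is a hypothetical helper introduced here.

```python
def token_starts(tokens):
    """Return the character offsets at which each token starts (hypothetical helper)."""
    starts = set()
    offset = 0
    for token in tokens:
        starts.add(offset)
        offset += len(token)
    return starts


a = ["a'", "''"]
b = ["a", "''", "'"]

# Both tokenizations cover the same four-character string.
same_text = "".join(a) == "".join(b)

starts_a = token_starts(a)  # offsets 0 and 2
starts_b = token_starts(b)  # offsets 0, 1 and 3

# The middle tokens are identical strings, but offset 2 starts a token
# only in `a`; in `b` it falls inside the token that begins at offset 1.
identical_middle = a[1] == b[1]
shared_starts = starts_a & starts_b

# Lowercasing can change the string length: U+0130 (capital I with dot above)
# lowercases to "i" followed by a combining dot above (U+0307).
dotted_i = "\u0130"
length_before = len(dotted_i)
length_after = len(dotted_i.lower())
```

Here `shared_starts` contains only offset 0, so the identical middle tokens are not a one-to-one alignment, and `length_after` is larger than `length_before`, which is why the character offsets have to be computed on one consistently cased version of the text.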
Ines Montani
ee2ec52f48
Merge pull request #6409 from svlandeg/feature/trf-docs 2020-12-08 06:32:10 +01:00
Ines Montani
82e88f0e3b
Merge pull request #6379 from svlandeg/fix/labels-constructor 2020-12-08 06:29:56 +01:00
Adriane Boyd
78085fab1f
Check for spacy-nightly package in download (#6502)
Also check for spacy-nightly in download so that `--no-deps` isn't set
for normal nightly installs.
2020-12-04 09:40:03 +01:00
Ines Montani
63f83e7034
Merge pull request #6470 from adrianeboyd/feature/license-in-package 2020-12-04 03:55:54 +01:00
Sofie Van Landeghem
d6c616a125
Fixes in test suite (#6457)
* fix slow test for textcat readers

* cleanup test_issue5551

* add explicit score weight

* cleanup
2020-12-02 12:57:08 +01:00
Adriane Boyd
31ec9a906e
Clean up 3rd party license info (#6478)
Move scikit-learn license from `Scorer` to
`licenses/3rd_party_licenses.txt`.
2020-12-02 10:15:23 +01:00
Adriane Boyd
591cd48aa8 Remove config.cfg from MANIFEST 2020-12-01 12:58:02 +01:00
Adriane Boyd
b0dd13e0ba Support LICENSE in spacy package
If present, include the file `input_dir/LICENSE` at the top level of the
packaged model.
2020-11-30 13:43:58 +01:00
Sofie Van Landeghem
079f6ea474
avoid resolving the full config (#6465) 2020-11-30 09:34:29 +08:00
Ines Montani
9beba7164f Make jinja2 top-level import
No problem anymore since it's now an official dependency
2020-11-27 15:17:14 +08:00
Adriane Boyd
26296ab223
Add error message if DocBin zlib decompress fails (#6394)
Add a better error message if DocBin zlib decompress fails, indicating
that the data is not in `DocBin` format.
2020-11-27 14:39:49 +08:00
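The pattern described in the commit above, turning a low-level decompression failure into a message about the expected format, might look roughly like this. It is a sketch using only the standard library; `decompress_payload` and the wording of the message are assumptions, not the actual `DocBin` code.

```python
import zlib


def decompress_payload(data: bytes) -> bytes:
    """Decompress zlib data, raising a more descriptive error on failure (hypothetical helper)."""
    try:
        return zlib.decompress(data)
    except zlib.error as err:
        # Re-raise with a hint about the expected format, keeping the original cause.
        raise ValueError(
            "Could not decompress the data; it does not appear to be in the expected binary format."
        ) from err


# Round trip on valid data.
payload = zlib.compress(b"hello")
restored = decompress_payload(payload)

# Invalid input now yields the descriptive error instead of a bare zlib.error.
try:
    decompress_payload(b"this is not zlib data")
    error_message = None
except ValueError as err:
    error_message = str(err)
```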
Adriane Boyd
cf693f0eae Fix token_match in tokenizer 2020-11-25 11:49:34 +01:00
Adriane Boyd
724831b066 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master
* Update Macedonian for v3
* Update Turkish for v3
2020-11-25 11:49:34 +01:00
Adriane Boyd
573f5c863f
Fix tag map clobbering in spacy train (#6437)
Fix bug from #5768 where the tag map is clobbered if a custom tag map
isn't provided.
2020-11-24 13:13:16 +01:00
Adriane Boyd
ce18fc6588 Set version to v2.3.3 2020-11-24 10:03:45 +01:00
Adriane Boyd
cd61d264ef Set version to v2.3.3.dev0 2020-11-23 13:51:59 +01:00
Sofie Van Landeghem
2af31a8c8d
Bugfix textcat reproducibility on GPU (#6411)
* add seed argument to ParametricAttention layer

* bump thinc to 7.4.3

* set thinc version range

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-11-23 12:29:35 +01:00
Adriane Boyd
3f61f5eb54
Use int8_t instead of char in Matcher (#6413)
* Use signed char instead of char in Matcher

Remove unused char* utf8_t typedef

* Use int8_t instead of signed char
2020-11-23 10:26:47 +01:00
Adriane Boyd
4284605683
Remove Beam cleanup (#6414)
Beam cleanup is handled through the Beam finalization method.
2020-11-23 10:01:46 +01:00
Adriane Boyd
a8c2dad466
Add all vectors to vocab before pruning (#6408)
Add all vectors to the vocab before pruning to correct the selection of
vectors to prioritize.
2020-11-23 10:00:59 +01:00
svlandeg
636be3c791 Merge remote-tracking branch 'upstream/develop' into feature/trf-docs 2020-11-19 14:15:35 +01:00
svlandeg
73fc1ed963 remove labels from morphologizer constructor 2020-11-11 21:48:50 +01:00
svlandeg
d5a920325f remove labels from constructor 2020-11-11 21:34:12 +01:00
Adriane Boyd
320a8b1481
Add ent_id_ to strings serialized with Doc (#6353) 2020-11-10 20:16:07 +08:00
Adriane Boyd
a7e7d6c6c9
Ignore misaligned in Morphologizer.get_loss (#6363)
Fix bug where `Morphologizer.get_loss` treated misaligned annotation as
`EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH`
mappings.
2020-11-10 20:15:09 +08:00
Sofie Van Landeghem
a0c899a0ff
Fix textcat + transformer architecture (#6371)
* add pooling to textcat TransformerListener

* maybe_get_dim in case it's null
2020-11-10 20:14:47 +08:00
Ines Montani
de6453940e
Merge pull request #6305 from svlandeg/feature/score-docs [ci skip] 2020-11-10 02:52:11 +01:00
Ines Montani
d7950c5ada
Merge pull request #6297 from adrianeboyd/docs/nightly-conda-install [ci skip] 2020-11-10 02:45:52 +01:00
svlandeg
789fb3d124 add docs for upstream argument of TransformerListener 2020-11-09 21:42:58 +01:00
Ines Montani
363ac73c72 Update docs [ci skip] 2020-11-09 12:43:26 +08:00
Daniel Vasic
20d72de986
Added Multext-East V5 tagset for Croatian language (#6248)
* Added Multext-East V5 tagset for Croatian language

* Create danielvasic.md

* Update danielvasic.md

* Update danielvasic.md

* Add tag map to CroatianDefaults

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-11-05 12:19:22 +01:00
Robert Šípek
6069efe57d
Add tag map to cs language (#6284) 2020-11-05 10:13:11 +01:00
Vu Ha
6d465ec52c
add oprd to the list of accepted deps for noun chunking (#6302)
* add oprd to the list of accepted deps for noun chunking

* add SCA
2020-11-05 09:17:35 +01:00
Adriane Boyd
31de700b0f
Fix on_match callback and remove empty patterns (#6312)
For the `DependencyMatcher`:

* Fix on_match callback so that it is called once per matched pattern
* Fix results so that patterns with empty match lists are not returned
2020-11-05 09:16:26 +01:00
Sofie Van Landeghem
8ef056cf98
fix embed_size in Entity Linker architecture (#6343) 2020-11-04 22:20:13 +01:00
Adriane Boyd
1c4df8fd09
Replace pytokenizations with internal alignment (#6293)
* Replace pytokenizations with internal alignment

Replace pytokenizations with internal alignment algorithm that is
restricted to only allow differences in whitespace and capitalization.

* Rename `spacy.training.align` to `spacy.training.alignment` to contain
the `Alignment` dataclass
* Implement `get_alignments` in `spacy.training.align`

* Refactor trailing whitespace handling

* Remove unnecessary exception for empty docs

Allow a non-empty whitespace-only doc to be aligned with an empty doc

* Remove empty docs exceptions completely
2020-11-03 16:24:38 +01:00
Adriane Boyd
a4b32b9552
Handle missing reference values in scorer (#6286)
* Handle missing reference values in scorer

Handle missing values in reference doc during scoring where it is
possible to detect an unset state for the attribute. If no reference
docs contain annotation, `None` is returned instead of a score. `spacy
evaluate` displays `-` for missing scores and the missing scores are
saved as `None`/`null` in the metrics.

Attributes without unset states:

* `token.head`: relies on `token.dep` to recognize unset values
* `doc.cats`: unable to handle missing annotation

Additional changes:

* add optional `has_annotation` check to `score_spans` to replace
`doc.sents` hack
* update `score_token_attr_per_feat` to handle missing and empty morph
representations
* fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START`
vs. `SENT_START`

* Fix import

* Update return types
2020-11-03 15:47:18 +01:00
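The behaviour described in the commit above (skip unset reference values, return `None` when nothing is annotated, display `-`, and serialize as `null`) can be sketched in plain Python. The names `accuracy_or_none` and `display` are hypothetical and this is not the `Scorer` implementation; `None` stands in for an unset reference value.

```python
import json
from typing import Optional, Sequence


def accuracy_or_none(
    gold: Sequence[Optional[str]], pred: Sequence[str]
) -> Optional[float]:
    """Accuracy over positions that have a reference value; None if there are none."""
    pairs = [(g, p) for g, p in zip(gold, pred) if g is not None]
    if not pairs:
        # No reference annotation at all: report a missing score, not 0.0.
        return None
    return sum(g == p for g, p in pairs) / len(pairs)


def display(score: Optional[float]) -> str:
    """Render a score for a results table, using "-" for a missing score."""
    return "-" if score is None else f"{score:.2f}"


partly_annotated = accuracy_or_none(["A", None, "B"], ["A", "X", "C"])
unannotated = accuracy_or_none([None, None], ["A", "B"])

table_cells = [display(partly_annotated), display(unannotated)]
metrics_json = json.dumps({"tag_acc": unannotated})
```

Returning `None` rather than `0.0` keeps "no reference annotation" distinguishable from "every prediction was wrong".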
Adriane Boyd
5d2cb86c34
Fix on_match callback for DependencyMatcher (#6313)
Fix `DependencyMatcher` so that the callback is called only once per
match.
2020-10-31 12:20:27 +01:00