spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-23 12:36:46 +03:00

Author	SHA1	Message	Date
Ines Montani	6c7a930ee8	Fix variable	2020-12-08 20:44:59 +11:00
Ines Montani	94a5a9814f	Update argument handling and documentation	2020-12-08 20:41:18 +11:00
Adriane Boyd	6c221d4841	Fix subsequent pipe detection in EntityRuler Fix subsequent pipe detection to detect the position of the current object by comparing the component itself rather than from the factory name.	2020-12-08 10:01:30 +01:00
Ines Montani	b87793a89a	Merge pull request #6523 from adrianeboyd/bugfix/remove-use-chars Remove non-working --use-chars from train CLI	2020-12-08 09:30:48 +01:00
Adriane Boyd	5ceac425ee	Remove non-working --use-chars from train CLI Remove the non-working `--use-chars` option from the train CLI. The implementation of the option across component types and the CLI settings could be fixed, but the `CharacterEmbed` model does not work on GPU in v2 so it's better to remove it.	2020-12-08 08:30:00 +01:00
Ines Montani	ef59ce783b	Adjust install instructions [ci skip]	2020-12-08 18:06:50 +11:00
Ines Montani	d25b1606d6	Allow reading config from stdin in spacy train	2020-12-08 18:01:40 +11:00
Ines Montani	6cfa66ed1c	Make training.loop return nlp object and path (#6520)	2020-12-08 14:55:55 +08:00
Sofie Van Landeghem	2c27093c5f	require_cpu functionality (#6336) * add require_cpu from Thinc 8.0.0rc2 * add docs * fix test if cupy is not installed	2020-12-08 14:42:40 +08:00
Ines Montani	d8e01ca931	Merge pull request #6391 from adrianeboyd/docs/install-guide	2020-12-08 07:42:16 +01:00
Sofie Van Landeghem	f98a04434a	pretrain architectures (#6451) * define new architectures for the pretraining objective * add loss function as attr of the model * cleanup * cleanup * shorten name * fix typo * remove unused error	2020-12-08 14:41:03 +08:00
Adriane Boyd	dcecc75270	Improve blis and numpy build dependencies (#6455 ) * Fix blis build dependencies * Add blis with python_version constraints to pyproject.toml * Add blis to setup_requires * Remove --only-binary from CI * Reduce number of builds to speed up CI * Add hack to install wheel for python 3.5 in linux * Remove os spec from CI * Remove detailed numpy build constraints * Remove detailed numpy build constraints from `pyproject.toml` because it is too difficult to maintain for many architectures * These constraints are more a reflection of what is available on pypi as binary wheels rather than any real build requirements that it is necessary for users to follow when building from source * Users building their own binary packages will need to enforce the constraints that make sense in their environments, e.g., the `conda` compatible numpy pins * Keep the build constraints in `build-constraints.txt` for use with our builds * Our builds with wheelwright are built against the earliest compatible binary versions of numpy on pypi * These constraints are documented within the distribution * Revert "Remove os spec from CI" This reverts commit `7489476688`.	2020-12-08 14:29:34 +08:00
Adriane Boyd	29b058ebdc	Fix spacy when retokenizing cases with affixes (#6475 ) Preserve `token.spacy` corresponding to the span end token in the original doc rather than adjusting for the current offset. * If not modifying in place, this checks in the original document (`doc.c` rather than `tokens`). * If modifying in place, the document has not been modified past the current span start position so the value at the current span end position is valid.	2020-12-08 14:25:56 +08:00
Adriane Boyd	4448680750	Fix alignment for 1-to-1 tokens and lowercasing (#6476 ) * When checking for token alignments, check not only that the tokens are identical but that the character positions are both at the start of a token. It's possible for the tokens to be identical even though the two tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs. `["a", "''", "'"]`, where the middle tokens are identical but should not be aligned on the token level at character position 2 since it's the start of one token but the middle of another. * Use the lowercased version of the token texts to create the character-to-token alignment because lowercasing can change the string length (e.g., for `İ`, see the not-a-bug bug report: https://bugs.python.org/issue34723)	2020-12-08 14:25:16 +08:00
Adriane Boyd	e931d3f72b	Move max_length to nlp.make_doc() (#6512) Move max_length check to `nlp.make_doc()` so that it's also checked for `nlp.pipe()`.	2020-12-08 14:24:02 +08:00
Ines Montani	ee2ec52f48	Merge pull request #6409 from svlandeg/feature/trf-docs	2020-12-08 06:32:10 +01:00
Ines Montani	c2b196c2c1	Merge pull request #6419 from svlandeg/feature/rel-docs	2020-12-08 06:30:41 +01:00
Ines Montani	82e88f0e3b	Merge pull request #6379 from svlandeg/fix/labels-constructor	2020-12-08 06:29:56 +01:00
Sofie Van Landeghem	52fa46dd58	tested EL scripts with 2.3.4 (#6517)	2020-12-07 20:46:38 +01:00
Adriane Boyd	d70950605c	Warn on empty POS for the rule-based lemmatizer Add a warning to the rule-based lemmatizer for any tokens without POS annotation.	2020-12-04 11:46:15 +01:00
Adriane Boyd	78085fab1f	Check for spacy-nightly package in download (#6502) Also check for spacy-nightly in download so that `--no-deps` isn't set for normal nightly installs.	2020-12-04 09:40:03 +01:00
Ines Montani	63f83e7034	Merge pull request #6470 from adrianeboyd/feature/license-in-package	2020-12-04 03:55:54 +01:00
Sofie Van Landeghem	d6c616a125	Fixes in test suite (#6457) * fix slow test for textcat readers * cleanup test_issue5551 * add explicit score weight * cleanup	2020-12-02 12:57:08 +01:00
Adriane Boyd	31ec9a906e	Clean up 3rd party license info (#6478) Move scikit-learn license from `Scorer` to `licenses/3rd_party_licenses.txt`.	2020-12-02 10:15:23 +01:00
Adriane Boyd	591cd48aa8	Remove config.cfg from MANIFEST	2020-12-01 12:58:02 +01:00
Adriane Boyd	b0dd13e0ba	Support LICENSE in spacy package If present, include the file `input_dir/LICENSE` at the top level of the packaged model.	2020-11-30 13:43:58 +01:00
Adriane Boyd	1442d2f213	Improve simple training example in v3 migration (#6438) * Create the examples once * Use the examples in the initialization * Provide the batch size * Fix `begin_training` migration example	2020-11-30 09:39:45 +08:00
Adriane Boyd	53c0fb7431	Only set NORM on Token in retokenizer (#6464) * Only set NORM on Token in retokenizer Instead of setting `NORM` on both the token and lexeme, set `NORM` only on the token. The retokenizer tries to set all possible attributes with `Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate which attributes are available for each. `NORM` is the only attribute that's stored on both and for most cases it doesn't make sense to set the global norms based on an individual retokenization. For lexeme-only attributes like `IS_STOP` there's no way to avoid the global side effects, but I think that `NORM` would be better only on the token. * Fix test	2020-11-30 09:35:42 +08:00
Adriane Boyd	03ae77e603	Add SPACY as a Matcher attribute (#6463)	2020-11-30 09:34:50 +08:00
Sofie Van Landeghem	079f6ea474	avoid resolving the full config (#6465)	2020-11-30 09:34:29 +08:00
Ines Montani	9beba7164f	Make jinja2 top-level import No problem anymore since it's now an official dependency	2020-11-27 15:17:14 +08:00
Ines Montani	d21d2c2e59	Don't multiply accuracy by 100	2020-11-27 15:15:51 +08:00
Adriane Boyd	26296ab223	Add error message if DocBin zlib decompress fails (#6394) Add a better error message if DocBin zlib decompress fails, indicating that the data is not in `DocBin` format.	2020-11-27 14:39:49 +08:00
Adriane Boyd	3a5cc5f8b4	Set version to v2.3.4	2020-11-26 08:48:52 +01:00
Adriane Boyd	e0f5646a4a	Restore cleanup_beam method (#6446)	2020-11-25 13:21:48 +01:00
Adriane Boyd	40c583a41b	Remove --prefer-binary and --only-binary from CI	2020-11-25 12:24:11 +01:00
Adriane Boyd	cf693f0eae	Fix token_match in tokenizer	2020-11-25 11:49:34 +01:00
Adriane Boyd	724831b066	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master * Update Macedonian for v3 * Update Turkish for v3	2020-11-25 11:49:34 +01:00
Jacob Bortell	fe9009911a	Update rule-based-matching.md (#6421) * Update rule-based-matching.md Clarified case-sensitivity of dictionary-referencing attributes (POS/TAG/DEP/etc). Clarified "Type" column header to "Value Type" * Update rule-based-matching.md Improved clarity of wording	2020-11-24 16:20:19 +01:00
Jacob Bortell	992723dfac	Add jabortell to the contributors (#6422) * Add jabortell to the contributors * Update jabortell.md Added tick to applicable statement	2020-11-24 16:15:31 +01:00
Adriane Boyd	6f133877aa	Update source install instructions * Don't recommend an editable install in the default source instructions. * Use `pip install --no-build-isolation` for editable installs. * Remove reference to `virtualenv`.	2020-11-24 14:44:13 +01:00
Adriane Boyd	afd744bc05	Update Travis CI pip install steps (#6440)	2020-11-24 14:10:16 +01:00
Adriane Boyd	573f5c863f	Fix tag map clobbering in spacy train (#6437) Fix bug from #5768 where the tag map is clobbered if a custom tag map isn't provided.	2020-11-24 13:13:16 +01:00
Adriane Boyd	ce18fc6588	Set version to v2.3.3	2020-11-24 10:03:45 +01:00
Adriane Boyd	cd61d264ef	Set version to v2.3.3.dev0	2020-11-23 13:51:59 +01:00
Sofie Van Landeghem	2af31a8c8d	Bugfix textcat reproducibility on GPU (#6411) * add seed argument to ParametricAttention layer * bump thinc to 7.4.3 * set thinc version range Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-23 12:29:35 +01:00
Adriane Boyd	cdca44ac11	Dynamically include numpy headers (#6418) * Dynamically include numpy headers * Add `build-constraints.txt` with numpy version pins for building wheels with `pip` and `wheelwright` * Update `setup.py` to add current numpy include directory * Assume `cython` and `numpy` are installed for `setup.py` * Remove included numpy headers * Fix typo in requirements.txt * Use script in CI	2020-11-23 11:15:11 +01:00
Adriane Boyd	3f61f5eb54	Use int8_t instead of char in Matcher (#6413) * Use signed char instead of char in Matcher Remove unused char* utf8_t typedef * Use int8_t instead of signed char	2020-11-23 10:26:47 +01:00
Adriane Boyd	4284605683	Remove Beam cleanup (#6414) Beam cleanup is handled through the Beam finalization method.	2020-11-23 10:01:46 +01:00
Adriane Boyd	a8c2dad466	Add all vectors to vocab before pruning (#6408) Add all vectors to the vocab before pruning to correct the selection of vectors to prioritize.	2020-11-23 10:00:59 +01:00
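
Note on commit 4448680750 (#6476): its message gives two reasons for the alignment fix. Identical token texts can start at different character positions, and lowercasing can change string length. The Python sketch below illustrates both points under stated assumptions. `token_starts` and `one_to_one` are hypothetical helpers written for this note, not spaCy's alignment code, and the real implementation may differ.

```python
# Illustration only: hypothetical helpers, not spaCy's implementation.

def token_starts(tokens):
    """Return the character offset at which each token starts,
    measured on the lowercased token texts."""
    starts = []
    offset = 0
    for tok in tokens:
        starts.append(offset)
        offset += len(tok.lower())
    return starts


def one_to_one(tokens_a, tokens_b):
    """Return (start, token) pairs from tokens_a that have an identical
    (lowercased) token in tokens_b starting at the same character offset."""
    starts_b = {
        start: tok.lower()
        for start, tok in zip(token_starts(tokens_b), tokens_b)
    }
    return [
        (start, tok)
        for start, tok in zip(token_starts(tokens_a), tokens_a)
        if starts_b.get(start) == tok.lower()
    ]


# The example from the commit message: both tokenizations contain "''",
# but it starts at offset 2 in one and offset 1 in the other, so no
# token pair qualifies as a one-to-one alignment.
print(one_to_one(["a'", "''"], ["a", "''", "'"]))  # []

# Lowercasing can change string length: "İ" (U+0130) lowercases to
# "i" followed by a combining dot above (U+0307).
print(len("İ"), len("İ".lower()))  # 1 2
```

Measuring offsets on the lowercased texts, as the commit message describes, keeps both tokenizations on the same character scale even when a character such as `İ` grows on lowercasing.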

14129 Commits