spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-26 18:06:29 +03:00

Author	SHA1	Message	Date
Ines Montani	bc089b693c	Update tests	2021-01-29 19:38:09 +11:00
Ines Montani	01ecfbcc45	Merge branch 'develop' into feature/replace-listeners	2021-01-29 15:57:32 +11:00
Ines Montani	911dfcccfc	Add option to replace listeners for sourced components	2021-01-29 15:57:04 +11:00
Adriane Boyd	fcce3600ed	Forbid OP matching 2+ tokens in DependencyMatcher (#6824) Instead of silently using only the first token in each matched span: * Forbid `OP: ?/*/+` through `DependencyMatcher` validation * As a fail-safe, add warning if a token match that's not exactly one token long is found by a token pattern.	2021-01-29 08:52:01 +08:00
Sofie Van Landeghem	24a697abb8	avoid empty aliases and improve UX and docs (#6840)	2021-01-29 08:51:40 +08:00
Sofie Van Landeghem	837a4f53c2	Error handling in nlp.pipe (#6817) * add error handler for pipe methods * add unit tests * remove pipe method that are the same as their base class * have Language keep track of a default error handler * cleanup * formatting * small refactor * add documentation	2021-01-29 08:51:21 +08:00
Sofie Van Landeghem	6b68ad027b	Fix beam NER resizing (#6834) * move label check to sub methods * add tests	2021-01-27 23:39:14 +11:00
Ines Montani	5ed51c9dd2	Merge pull request #6828 from explosion/master-tmp	2021-01-27 23:05:46 +11:00
Adriane Boyd	d17afb4826	Add Spanish rule-based lemmatizer (#6833) * Initial Spanish lemmatizer * Handle merged verb+pron(s) multi-word tokens * Use VERB for AUX rule lookup * Add morph to lemma cache key * Fix aux lookups, minor refactoring * Improve verb+pron handling * Move verb+pron handling into its own method * Check for exceptions (primarily for se) * Collect pronouns in the same (not reversed) order * Only add modified possible lemmas	2021-01-27 19:21:35 +08:00
Ines Montani	80ba9eaf7d	Fix test	2021-01-27 21:29:02 +11:00
Ines Montani	230e651ad6	Merge branch 'develop' into master-tmp	2021-01-27 13:26:29 +11:00
Matthew Honnibal	68b1c2984d	Test labels are added implicitly	2021-01-27 12:52:29 +11:00
Dhruv Naik	e7db07a0b9	Fix Span.char_span bug (#6816) * Create dhruvrnaik.md * add test for issue #6815 * bugfix for issue #6815 * update dhruvrnaik.md * add span.vector test for #6815	2021-01-26 15:50:37 +08:00
Adriane Boyd	2263bc7b28	Update develop from master for v3.0.0rc5 (#6811) * Fix `spacy.util.minibatch` when the size iterator is finished (#6745) * Skip 0-length matches (#6759) Add hack to prevent matcher from returning 0-length matches. * support IS_SENT_START in PhraseMatcher (#6771) * support IS_SENT_START in PhraseMatcher * add unit test and friendlier error * use IDS.get instead * ensure span.text works for an empty span (#6772) * Remove unicode_literals Co-authored-by: Santiago Castro <bryant@montevideo.com.uy> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-26 14:52:45 +11:00
Matthew Honnibal	f049df1715	Revert "Set annotations in update" (#6810 ) * Revert "Set annotations in update (#6767)" This reverts commit `e680efc7cc`. * Fix version * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/transition_parser.pyx * Update spacy/pipeline/transition_parser.pyx * Update website/docs/api/multilabel_textcategorizer.md * Update website/docs/api/tok2vec.md * Update website/docs/usage/layers-architectures.md * Update website/docs/usage/layers-architectures.md * Update website/docs/api/transformer.md * Update website/docs/api/textcategorizer.md * Update website/docs/api/tagger.md * Update spacy/pipeline/entity_linker.py * Update website/docs/api/sentencerecognizer.md * Update website/docs/api/pipe.md * Update website/docs/api/morphologizer.md * Update website/docs/api/entityrecognizer.md * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/multitask.pyx * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/textcat.py * Update spacy/pipeline/textcat.py * Update spacy/pipeline/textcat.py * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/trainable_pipe.pyx * Update spacy/pipeline/trainable_pipe.pyx * Update spacy/pipeline/transition_parser.pyx * Update spacy/pipeline/transition_parser.pyx * Update website/docs/api/entitylinker.md * Update website/docs/api/dependencyparser.md * Update spacy/pipeline/trainable_pipe.pyx	2021-01-25 22:18:45 +08:00
muratjumashev	87168eb81f	Add tests	2021-01-24 20:56:16 +06:00
Sofie Van Landeghem	5ace559201	ensure span.text works for an empty span (#6772)	2021-01-21 23:18:46 +08:00
Sofie Van Landeghem	d93cd3b7c0	remove artificially duplicated test [ci skip]	2021-01-21 10:53:16 +01:00
Sofie Van Landeghem	fdf8c77630	support IS_SENT_START in PhraseMatcher (#6771) * support IS_SENT_START in PhraseMatcher * add unit test and friendlier error * use IDS.get instead	2021-01-21 09:59:17 +01:00
Sofie Van Landeghem	e680efc7cc	Set annotations in update (#6767) * bump to 3.0.0rc4 * do set_annotations in component update calls * update docs and remove set_annotations flag * fix EL test	2021-01-20 11:49:25 +11:00
Adriane Boyd	bc7d83d4be	Skip 0-length matches (#6759) Add hack to prevent matcher from returning 0-length matches.	2021-01-19 07:38:11 +08:00
Sofie Van Landeghem	fed8f48965	raise NotImplementedError when noun_chunks iterator is not implemented (#6711) * raise NotImplementedError when noun_chunks iterator is not implemented * bring back, fix and document span.noun_chunks * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-01-17 19:56:05 +08:00
Adriane Boyd	bf0cdae8d4	Add token_splitter component (#6726) * Add long_token_splitter component Add a `long_token_splitter` component for use with transformer pipelines. This component splits up long tokens like URLs into smaller tokens. This is particularly relevant for pretrained pipelines with `strided_spans`, since the user can't change the length of the span `window` and may not wish to preprocess the input texts. The `long_token_splitter` splits tokens that are at least `long_token_length` tokens long into smaller tokens of `split_length` size. Notes: * Since this is intended for use as the first component in a pipeline, the token splitter does not try to preserve any token annotation. * API docs to come when the API is stable. * Adjust API, add test * Fix name in factory	2021-01-17 19:54:41 +08:00
Adriane Boyd	9328dd5625	Handle unset token.morph in Morphologizer (#6704) * Handle unset token.morph in Morphologizer Handle unset `token.morph` in `Morphologizer.initialize` and `Morphologizer.get_loss`. If both `token.morph` and `token.pos` are unset, treat the annotation as missing rather than empty. * Add token.has_morph()	2021-01-15 17:20:10 +01:00
Ines Montani	f9e4ac1283	Fix test	2021-01-15 12:51:02 +11:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Ines Montani	57369909c0	Merge pull request #6727 from adrianeboyd/chore/update-develop-from-master-rc3	2021-01-15 11:44:28 +11:00
Adriane Boyd	0c936004d1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3	2021-01-14 11:49:58 +01:00
Matthew Honnibal	92310a5e26	Merge branch 'develop' into feature/missing-dep	2021-01-14 17:39:01 +11:00
Adriane Boyd	9957ed7897	Override language defaults for null token and URL match (#6705) * Override language defaults for null token and URL match When the serialized `token_match` or `url_match` is `None`, override the language defaults to preserve `None` on deserialization. * Fix fixtures in tests	2021-01-14 17:31:29 +11:00
Matthew Honnibal	f277bfdf0f	Add SpanGroup and Graph container types to represent arbitrary annotations (#6696 ) * Draft out initial Spans data structure * Initial span group commit * Basic span group support on Doc * Basic test for span group * Compile span_group.pyx * Draft addition of SpanGroup to DocBin * Add deserialization for SpanGroup * Add tests for serializing SpanGroup * Fix serialization of SpanGroup * Add EdgeC and GraphC structs * Add draft Graph data structure * Compile graph * More work on Graph * Update GraphC * Upd graph * Fix walk functions * Let Graph take nodes and edges on construction * Fix walking and getting * Add graph tests * Fix import * Add module with the SpanGroups dict thingy * Update test * Rename 'span_groups' attribute * Try to fix c++11 compilation * Fix test * Update DocBin * Try to fix compilation * Try to fix graph * Improve SpanGroup docstrings * Add doc.spans to documentation * Fix serialization * Tidy up and add docs * Update docs [ci skip] * Add SpanGroup.has_overlap * WIP updated Graph API * Start testing new Graph API * Update Graph tests * Update Graph * Add docstring Co-authored-by: Ines Montani <ines@ines.io>	2021-01-14 17:30:41 +11:00
svlandeg	fec9b81aa2	Merge remote-tracking branch 'upstream/develop' into feature/missing-dep	2021-01-13 17:46:12 +01:00
svlandeg	ed53bb979d	cleanup	2021-01-13 14:20:05 +01:00
svlandeg	86a4e316b8	fix sent_starts	2021-01-13 13:47:25 +01:00
Ines Montani	31a92b28ae	Merge pull request #6715 from adrianeboyd/feature/before-after-init-callbacks Add initialize.before_init and after_init callbacks	2021-01-13 12:17:00 +11:00
Ines Montani	97d5a7ba99	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2021-01-13 12:03:02 +11:00
Ines Montani	8d6448ccf7	Add config resolver test	2021-01-13 12:02:59 +11:00
svlandeg	232e953b14	pytest.approx with absolute eps	2021-01-12 20:32:57 +01:00
svlandeg	5b598bd1d5	formatting	2021-01-12 17:28:41 +01:00
svlandeg	a581d82f33	introduce token.has_head and refer to MISSING_DEP_ (WIP)	2021-01-12 17:17:06 +01:00
Adriane Boyd	a45d89f09a	Add initialize.before_init and after_init callbacks Add `initialize.before_init` and `initialize.after_init` callbacks to the config. The `initialize.before_init` callback is a place to implement one-time tokenizer customizations that are then saved with the model.	2021-01-12 13:07:44 +01:00
Adriane Boyd	ad43cbb042	Sync missing and misaligned values in Tagger loss (#6689) Use `None` for both missing and misaligned annotation in `Tagger.get_loss`, reverting to the default missing value in the loss function.	2021-01-10 11:30:37 +11:00
svlandeg	dd12c6c8fd	allow missing information in deps and heads annotations	2021-01-07 19:10:32 +01:00
Sofie Van Landeghem	75d9019343	Fix types of Tok2Vec encoding architectures (#6442) * fix TorchBiLSTMEncoder documentation * ensure the types of the encoding Tok2vec layers are correct * update references from v1 to v2 for the new architectures	2021-01-07 16:39:27 +11:00
ophelielacroix	e3222fdec9	Add (noun chunks) syntax iterators for Danish (#6246) * add syntax iterators for danish * add test noun chunks for danish syntax iterators * add contributor agreement * update da syntax iterators to remove nested chunks * add tests for da noun chunks * Fix test * add missing import * fix example * Prevent overlapping noun chunks Prevent overlapping noun chunks by tracking the end index of the previous noun chunk span. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-07 16:33:00 +11:00
Sofie Van Landeghem	8c1a23209f	Getting scores out of beam_parser (#6684) * clean up of ner tests * beam_parser tests * implement get_beam_parses and scored_parses for the dep parser * we don't have to add the parse if there are no arcs	2021-01-07 16:28:27 +11:00
Sofie Van Landeghem	402dbc5bae	Getting scores out of beam_ner (#6575) * small fixes and formatting * bring test_issue4313 up-to-date, currently fails * formatting * add get_beam_parses method back * add scored_ents function * delete tag map	2021-01-06 12:02:32 +01:00
Sofie Van Landeghem	6f7e7d88b9	remove cause without apostrophe from norm exceptions (#6636)	2021-01-06 12:30:30 +08:00
Adriane Boyd	bf9096437e	Set default lemmas in retokenizer (#6667) Instead of unsetting lemmas on retokenized tokens, set the default lemmas to: * merge: concatenate any existing lemmas with `SPACY` preserved * split: use the new `ORTH` values if lemmas were previously set, otherwise leave unset	2021-01-06 12:29:44 +08:00
Adriane Boyd	0041dfbc7f	Use special matcher for exceptions with spaces (#6668) Use the special cases phrase matcher for exceptions that include space characters so that exceptions including spaces are supported.	2021-01-06 12:05:10 +08:00
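
The token splitter commit (bf0cdae8d4) in the log above describes breaking long tokens such as URLs into fixed-size pieces. The following is a minimal, standalone sketch of that idea, not spaCy's implementation: it works on plain strings rather than `Doc` objects, it assumes the length threshold is measured in characters, and the function name `split_long_tokens` and the parameter `min_length` (standing in for the commit's `long_token_length`) are hypothetical names chosen for illustration.

```python
from typing import List


def split_long_tokens(
    tokens: List[str], min_length: int = 25, split_length: int = 10
) -> List[str]:
    """Split every token of at least `min_length` characters into
    consecutive pieces of at most `split_length` characters.

    Hypothetical helper for illustration only; thresholds are assumed
    to be character counts.
    """
    if split_length < 1:
        raise ValueError("split_length must be at least 1")
    result: List[str] = []
    for token in tokens:
        if len(token) >= min_length:
            # Slice the long token into fixed-size chunks; the final
            # chunk may be shorter than split_length.
            result.extend(
                token[i : i + split_length]
                for i in range(0, len(token), split_length)
            )
        else:
            # Short tokens pass through unchanged.
            result.append(token)
    return result


pieces = split_long_tokens(
    ["see", "https://example.com/a/b/c"], min_length=10, split_length=8
)
print(pieces)
```

Joining the pieces of a split token reproduces the original text, so no characters are lost. As the commit notes, a component like this is meant to run first in a pipeline and does not try to preserve token annotation.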


2146 Commits