spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-10-04 02:46:40 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	63b7accd74	💫 Make span.as_doc() return a copy, not a view. Closes #1537 (#3107 ) Initially span.as_doc() was designed to return a view of the span's contents, as a Doc object. This was a nice idea, but it fails due to the token.idx property, which refers to the character offset within the string. In a span, the idx of the first token might not be 0. Because this data is different, we can't have a view --- it'll be inconsistent. This patch changes span.as_doc() to instead return a copy. The docs are updated accordingly. Closes #1537 * Update test for span.as_doc() * Make span.as_doc() return a copy. Closes #1537 * Document change to Span.as_doc()	2018-12-30 15:17:46 +01:00
Matthew Honnibal	72e4d3782a	Resize doc.tensor when merging spans. Closes #1963 (#3106 ) The doc.retokenize() context manager wasn't resizing doc.tensor, leading to a mismatch between the number of tokens in the doc and the number of rows in the tensor. We fix this by deleting rows from the tensor. Merged spans are represented by the vector of their last token. * Add test for resizing doc.tensor when merging * Add test for resizing doc.tensor when merging. Closes #1963 * Update get_lca_matrix test for develop * Fix retokenize if tensor unset	2018-12-30 15:17:17 +01:00
Matthew Honnibal	3d64eb4a74	Update get_lca_matrix test for develop	2018-12-30 14:28:07 +01:00
Matthew Honnibal	ac9e3a4a8b	Add test for #1773	2018-12-30 13:16:05 +01:00
Matthew Honnibal	ee4d06fb1b	Prevent exceptions from setting POS but not TAG. Closes #1773	2018-12-30 13:16:05 +01:00
Kirill Bulygin	b665a32b95	Enabling `tests/lang/ru/test_lemmatizer.py`, fixing a `unicode` issue (#3084 ) ## Description See #3079. Here I'm merging into `develop` instead of `master`. ### Types of change Bug fix. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-12-30 12:10:26 +01:00
Álvaro Abella Bascarán	9bc4cc1352	Fix issue 2396 (#3089 ) * Test on #2396: bug in Doc.get_lca_matrix() * reimplementation of Doc.get_lca_matrix(), (closes #2396) * reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix() * tests Span.get_lca_matrix() as well as Doc.get_lca_matrix() * implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix * use memory view instead of np.ndarray in _get_lca_matrix (faster) * fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview * cleaner conditional, add comment	2018-12-29 18:05:52 +01:00
Matthew Honnibal	76e3e695af	Allow single string attributes in doc.to_array() Previously inputs like doc.to_array('ORTH') didn't work. Closes #3064	2018-12-29 16:24:40 +01:00
Matthew Honnibal	174e85439b	Fix behaviour of Matcher's ? quantifier for v2.1 (#3105 ) * Add failing test for matcher bug #3009 * Deduplicate matches from Matcher * Update matcher ? quantifier test * Fix bug with ? quantifier in Matcher The ? quantifier indicates a token may occur zero or one times. If the token pattern fit, the matcher would fail to consider valid matches where the token pattern did not fit. Consider a simple regex like: .?b If we have the string 'b', the .? part will fit --- but then the 'b' in the pattern will not fit, leaving us with no match. The same bug left us with too few matches in some cases. For instance, consider: .?.? If we have a string of length two, like 'ab', we actually have three possible matches here: [a, b, ab]. We were only recovering 'ab'. This should now be fixed. Note that the fix also uncovered another bug, where we weren't deduplicating the matches. There are actually two ways we might match 'a' and two ways we might match 'b': as the second token of the pattern, or as the first token of the pattern. This ambiguity is spurious, so we need to deduplicate. Closes #2464 and #3009 * Fix Python2	2018-12-29 16:18:09 +01:00
Matthew Honnibal	e808bdd076	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-29 13:54:15 +01:00
Jari Bakken	ba8a840f84	spacy.cli.evaluate: fix TypeError (#3101 )	2018-12-28 11:14:28 +01:00
Jari Bakken	0546135fba	Set vectors.name when updating meta.json during training (#3100 ) * Set vectors.name when updating meta.json during training * add vectors name to meta in `spacy package`	2018-12-27 19:55:40 +01:00
Jari Bakken	cc95167b6d	cli.convert: fix typo in converter arguments (#3099 )	2018-12-27 18:08:41 +01:00
Jari Bakken	e172f2478e	Add three missing tags from the `nb` tag map (#3085 ) * Contributors agreement for jarib * Add tags from the UD/NORNE dataset that are missing in the nb tag map. Relates to #3082.	2018-12-27 14:48:40 +01:00
Matthew Honnibal	978d8be8f9	Set version to v2.1.0a5	2018-12-21 00:26:39 +01:00
Matthew Honnibal	d3f03b1668	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-21 00:25:57 +01:00
Ines Montani	bb9ad37e05	Improve entry points and allow custom language classes via entry points (#3080 ) * Remove check for overwritten factory This needs to be handled differently – on first initialization, a new factory will be added and any subsequent initializations will trigger this warning, even if it's a new entry point that doesn't overwrite a built-in. * Add helper to only load specific entry point Useful for loading languages via entry points, so that they can be lazy-loaded. Otherwise, all entry point languages would have to be loaded upfront. * Check entry points for custom languages	2018-12-20 23:58:43 +01:00
Matthew Honnibal	f6ac00fab3	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-20 18:45:44 +01:00
Matthew Honnibal	d8d27f9129	Set version to v2.1.0a5.dev0	2018-12-20 18:45:34 +01:00
Ines Montani	ca244f5f84	Small fixes to displaCy (#3076 ) ## Description - [x] fix auto-detection of Jupyter notebooks (even if `jupyter=True` isn't set) - [x] add `displacy.set_render_wrapper` method to define a custom function called around the HTML markup generated in all calls to `displacy.render` (can be used to allow custom integrations, callbacks and page formatting) - [x] add option to customise host for web server - [x] show warning if `displacy.serve` is called from within Jupyter notebooks - [x] move error message to `spacy.errors.Errors`. ### Types of change enhancement ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-12-20 17:32:04 +01:00
Matthew Honnibal	f57bea8ab6	💫 Prevent parser from predicting unseen classes (#3075 ) The output weights often return negative scores for classes, especially via the bias terms. This means that when we add a new class, we can't rely on just zeroing the weights, or we'll end up with positive predictions for those labels. To solve this, we use nan values as the initial weights for new labels. This prevents them from ever coming out on top. During backprop, we replace the nan values with the minimum assigned score, so that we're still able to learn these classes.	2018-12-20 16:12:22 +01:00
Matthew Honnibal	9ec9f89b99	💫 Raise better error when using uninitialized pipeline component (#3074 ) After creating a component, the `.model` attribute is left with the value `True`, to indicate it should be created later during `from_disk()`, `from_bytes()` or `begin_training()`. This had led to confusing errors if you try to use the component without initializing the model. To fix this, we add a method `require_model()` to the `Pipe` base class. The `require_model()` method needs to be called at the start of the `.predict()` and `.update()` methods of the components. It raises a `ValueError` if the model is not initialized. An error message has been added to `spacy.errors`.	2018-12-20 15:54:53 +01:00
Matthew Honnibal	c315e08e6e	Fix formatting of meta.json after spacy package	2018-12-19 14:36:08 +01:00
Matthew Honnibal	e24f94ce39	Fix handling of preset entities. closes #2779	2018-12-19 02:13:31 +01:00
Matthew Honnibal	faa8656582	Port parser fix for large label sets from master	2018-12-19 02:11:26 +01:00
Matthew Honnibal	99a84e4d0e	Make ParserModel.resize_output idempotent	2018-12-19 02:10:36 +01:00
Matthew Honnibal	0f83b98afa	Remove unused code from spacy pretrain	2018-12-18 19:19:26 +01:00
Ken	5f0c5fbfa4	issue #3012 : add test (#3021 ) * issue #3012: add test * add contributor agreement * Make test work without models and fix typos ten.pos_ instead of ten.orth_ and comparison against "10" instead of integer 10	2018-12-18 15:02:49 +01:00
Ines Montani	77a47b2b20	Auto-format	2018-12-18 15:02:11 +01:00
Kirill Bulygin	2fb004832f	Fix the first `nlp` call for `ja` (closes #2901 ) (#3065 ) * Fix the first `nlp` call for `ja` (closes #2901) * Add unicode declaration, formatting and use relative import	2018-12-18 15:01:06 +01:00
Ines Montani	ae880ef912	Tidy up merge conflict leftovers	2018-12-18 13:58:30 +01:00
Ines Montani	61d09c481b	Merge branch 'master' into develop	2018-12-18 13:48:10 +01:00
Brixjohn	52f3c95004	Added alpha support for Tagalog language (#3062 ) I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language Filipino. I have heavily based the format on the EN and ES languages. I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to their Tagalog counterparts, added some tokenizer exceptions, and kept the tag map the same as the English language. While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases. * Added alpha support for Tagalog language * Edited contributor template * Included SCA; Reverted templates * Fixed SCA template * Fixed changes in SCA template	2018-12-18 13:08:38 +01:00
Matthew Honnibal	92f4b9c8ea	set max batch size to 1000	2018-12-17 23:15:39 +00:00
Matthew Honnibal	3c4a2edf4a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-17 23:08:40 +00:00
Matthew Honnibal	95fc0176d1	Pass tagger options in begin_training	2018-12-17 23:08:31 +00:00
Matthew Honnibal	7c504b6ddb	Try to implement more losses for pretraining * Try to implement cosine loss This one seems to be correct? Still unsure, but it performs okay * Try to implement the von Mises-Fisher loss This one's definitely not right yet.	2018-12-17 14:48:27 +00:00
Matthew Honnibal	ab4b61fb6e	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-16 20:11:43 +01:00
Matthew Honnibal	9ef30b0cde	Accept 'text' in matcher as an alternative to ORTH	2018-12-16 20:10:43 +01:00
Amandine Périnet	361554f629	Lemmatization of Adjectives - French : adding rules and vocabulary (#3045 ) * modifying FR lemmatisation for Adjectives * adding contributor agreement for amperinet * correcting some errors in vocabulary files	2018-12-16 18:11:07 +01:00
Sofie	c6ad557cea	French regular expressions instead of extensive exceptions list (on develop) (#3046 ) (resolves #2679 ) * merge changes of PR 3023 into develop branch instead of master * further deletions from exception list according to PR 3023	2018-12-16 18:04:55 +01:00
Ines Montani	7bbdffd36e	Remove pre-set lemma for "cause" (resolves #2165 )	2018-12-14 12:51:18 +01:00
Shooter23	6ae8e49bff	Fix docstring for is_right_punct(). (#3044 )	2018-12-14 10:11:11 +01:00
Matthew Honnibal	ab9494b2a3	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-12 21:08:50 +00:00
Matthew Honnibal	fb56028476	Remove b1 and b2 decay	2018-12-12 12:37:07 +01:00
Matthew Honnibal	df15279e88	Reduce batch size during pretrain	2018-12-10 15:30:23 +00:00
Matthew Honnibal	83ac227bd3	💫 Better support for semi-supervised learning (#3035 ) The new spacy pretrain command implemented BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting Support semi-supervised learning in spacy train One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing. Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning. Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective. Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage: python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-text ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze Implement rehearsal methods for pipeline components The new --raw-text argument and nlp.rehearse() method also give us a good place to implement the idea in the pseudo-rehearsal blog post in the parser. 
This works as follows: Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model. Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details. Implement rehearsal updates for tagger Implement rehearsal updates for text categoriz	2018-12-10 16:25:33 +01:00
Matthew Honnibal	449b889454	Fix KeyError in Vectors.most_similar. Fixes #2648	2018-12-10 16:19:18 +01:00
Matthew Honnibal	90aec6d2f6	Fix vectors for reserved words. Closes #2871	2018-12-10 16:09:49 +01:00
Matthew Honnibal	16fd8dce1d	Add get_string_id helper to spacy.strings	2018-12-10 16:09:26 +01:00
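
Note on 174e85439b (Matcher `?` quantifier): the behaviour described in that commit message can be illustrated with a small toy matcher. This is a minimal sketch in plain Python, not spaCy's Matcher implementation; `find_matches`, `any_token` and `is_b` are hypothetical names introduced only for illustration. It shows the two points the message makes: an optional pattern token must also be tried as "skipped" even when it fits, and the resulting spans need de-duplicating because the same span can be reached along more than one path.

```python
def find_matches(pattern, tokens):
    """Return sorted, de-duplicated (start, end) spans matching `pattern`.

    `pattern` is a list of (predicate, optional) pairs; optional=True plays
    the role of the `?` quantifier (zero or one occurrence).
    """
    matches = set()  # a set drops spans that are reached via different paths

    def walk(p_idx, t_idx, start):
        if p_idx == len(pattern):
            if t_idx > start:  # ignore empty matches
                matches.add((start, t_idx))
            return
        predicate, optional = pattern[p_idx]
        # Branch 1: the pattern token consumes the current input token.
        if t_idx < len(tokens) and predicate(tokens[t_idx]):
            walk(p_idx + 1, t_idx + 1, start)
        # Branch 2: an optional pattern token is skipped, even if it fit.
        if optional:
            walk(p_idx + 1, t_idx, start)

    for start in range(len(tokens)):
        walk(0, start, start)
    return sorted(matches)


def any_token(tok):
    return True


def is_b(tok):
    return tok == "b"


# `.?b` against "b": skipping the optional token lets `b` match.
print(find_matches([(any_token, True), (is_b, False)], ["b"]))
# `.?.?` against "ab": spans covering a, ab and b.
print(find_matches([(any_token, True), (any_token, True)], ["a", "b"]))
```

Spans are half-open token index pairs, so `(0, 2)` covers both tokens of "ab". Without the second branch, the first pattern finds nothing on "b" and the second finds only the two-token span.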

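Note on 9ec9f89b99 (uninitialized pipeline component): the guard described in that commit message might look roughly like the sketch below. It is plain Python and does not reproduce spaCy's `Pipe` class; `SketchPipe`, its error text and its stand-in model are assumptions made for illustration. Only the idea comes from the message: `.model` starts out as the placeholder `True`, and `require_model()` is called at the start of `predict()` so that a `ValueError` is raised early.

```python
class SketchPipe:
    """Hypothetical stand-in for a pipeline component base class."""

    def __init__(self):
        # True is a placeholder meaning "build or load the model later".
        self.model = True

    def require_model(self):
        # A bare placeholder (None or a bool) counts as uninitialized.
        if self.model is None or isinstance(self.model, bool):
            raise ValueError(
                f"{type(self).__name__} has no model yet; initialize or "
                "load it before calling predict() or update()."
            )

    def begin_training(self):
        # Stand-in for real model construction: "predicts" doc lengths.
        self.model = lambda docs: [len(doc) for doc in docs]

    def predict(self, docs):
        self.require_model()  # fail early with a clear message
        return self.model(docs)


pipe = SketchPipe()
try:
    pipe.predict([["a", "b"]])
except ValueError as err:
    print("refused:", err)

pipe.begin_training()
print(pipe.predict([["a", "b"], ["c"]]))
```

Calling the guard first means the failure names the real problem, rather than surfacing later as an unrelated error from trying to call the placeholder value.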

5523 Commits