Initially span.as_doc() was designed to return a view of the span's contents, as a Doc object. This was a nice idea, but it fails due to the token.idx property, which refers to the character offset within the string. In a span, the idx of the first token might not be 0. Because this data is different, we can't have a view --- it'll be inconsistent.
This patch changes span.as_doc() to instead return a copy. The docs are updated accordingly. Closes#1537
* Update test for span.as_doc()
* Make span.as_doc() return a copy. Closes#1537
* Document change to Span.as_doc()
The doc.retokenize() context manager wasn't resizing doc.tensor, leading to a mismatch between the number of tokens in the doc and the number of rows in the tensor. We fix this by deleting rows from the tensor. Merged spans are represented by the vector of their last token.
* Add test for resizing doc.tensor when merging
* Add test for resizing doc.tensor when merging. Closes#1963
* Update get_lca_matrix test for develop
* Fix retokenize if tensor unset
<!--- Provide a general summary of your changes in the title. -->
## Description
See #3079. Here I'm merging into `develop` instead of `master`.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
Bug fix.
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Test on #2396: bug in Doc.get_lca_matrix()
* reimplementation of Doc.get_lca_matrix(), (closes#2396)
* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()
* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()
* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix
* use memory view instead of np.ndarray in _get_lca_matrix (faster)
* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview
* cleaner conditional, add comment
* Add failing test for matcher bug #3009
* Deduplicate matches from Matcher
* Update matcher ? quantifier test
* Fix bug with ? quantifier in Matcher
The ? quantifier indicates a token may occur zero or one times. If the
token pattern fit, the matcher would fail to consider valid matches
where the token pattern did not fit. Consider a simple regex like:
.?b
If we have the string 'b', the .? part will fit --- but then the 'b' in
the pattern will not fit, leaving us with no match. The same bug left us
with too few matches in some cases. For instance, consider:
.?.?
If we have a string of length two, like 'ab', we actually have three
possible matches here: [a, b, ab]. We were only recovering 'ab'. This
should now be fixed. Note that the fix also uncovered another bug, where
we weren't deduplicating the matches. There are actually two ways we
might match 'a' and two ways we might match 'b': as the second token of the pattern,
or as the first token of the pattern. This ambiguity is spurious, so we
need to deduplicate.
Closes#2464 and #3009
* Fix Python2
* Remove check for overwritten factory
This needs to be handled differently – on first initialization, a new factory will be added and any subsequent initializations will trigger this warning, even if it's a new entry point that doesn't overwrite a built-in.
* Add helper to only load specific entry point
Useful for loading languages via entry points, so that they can be lazy-loaded. Otherwise, all entry point languages would have to be loaded upfront.
* Check entry points for custom languages
## Description
- [x] fix auto-detection of Jupyter notebooks (even if `jupyter=True` isn't set)
- [x] add `displacy.set_render_wrapper` method to define a custom function called around the HTML markup generated in all calls to `displacy.render` (can be used to allow custom integrations, callbacks and page formatting)
- [x] add option to customise host for web server
- [x] show warning if `displacy.serve` is called from within Jupyter notebooks
- [x] move error message to `spacy.errors.Errors`.
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
The output weights often return negative scores for classes, especially
via the bias terms. This means that when we add a new class, we can't
rely on just zeroing the weights, or we'll end up with positive
predictions for those labels.
To solve this, we use nan values as the initial weights for new labels.
This prevents them from ever coming out on top. During backprop, we
replace the nan values with the minimum assigned score, so that we're
still able to learn these classes.
After creating a component, the `.model` attribute is left with the value `True`, to indicate it should be created later during `from_disk()`, `from_bytes()` or `begin_training()`. This had led to confusing errors if you try to use the component without initializing the model.
To fix this, we add a method `require_model()` to the `Pipe` base class. The `require_model()` method needs to be called at the start of the `.predict()` and `.update()` methods of the components. It raises a `ValueError` if the model is not initialized. An error message has been added to `spacy.errors`.
* issue #3012: add test
* add contributor aggreement
* Make test work without models and fix typos
ten.pos_ instead of ten.orth_ and comparison against "10" instead of integer 10
I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language Filipino. I have heavily based the format to the EN and ES languages.
I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to its Tagalog counterpart, added some tokenizer exceptions, and kept the tag map the same as the English language.
While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases.
* Added alpha support for Tagalog language
* Edited contributor template
* Included SCA; Reverted templates
* Fixed SCA template
* Fixed changes in SCA template
* Try to implement cosine loss
This one seems to be correct? Still unsure, but it performs okay
* Try to implement the von Mises-Fisher loss
This one's definitely not right yet.
The new spacy pretrain command implemented BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
Support semi-supervised learning in spacy train
One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing.
Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning.
Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective.
Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage:
python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-textt ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze
Implement rehearsal methods for pipeline components
The new --raw-text argument and nlp.rehearse() method also gives us a good place to implement the the idea in the pseudo-rehearsal blog post in the parser. This works as follows:
Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model.
Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details.
Implement rehearsal updates for tagger
Implement rehearsal updates for text categoriz