* Added the same punctuation rules as danish language.
* Added abbreviations and also the possibility to have capitalized abbreviations on some. Added a few specific cases too
* Added test for long texts in swedish
* Added morph rules, infixes and suffixes to __init__.py for swedish
* Added some tests for prefixes, infixes and suffixes
* Added tests for lemma
* Renamed files to follow convention
* [sv] Removed ambigious abbreviations
* Added more tests for tokenizer exceptions
* Added test for problem with punctuation in issue #2578
* Contributor agreement
* Removed faulty lemmatization of 'jag' ('I') as it was lemmatized to 'jaga' ('hunt')
Tamil language support to spaCy
Description
Hereby, creating new PR to add support for Tamil language in spaCy
added stop words, examples and numerical attributes
<--Working on other language data-->
Types of change
Enhancement
Checklist
[ x] I have submitted the spaCy Contributor Agreement.
[x ] I ran the tests, and all new and existing tests passed.
[ x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Update call to `mkdir()` to create the parents
- Update the call to `output_dir.mkdir()` to also create the parents if needed
* don't automatically create parents but fail fast if cannot create directory
* add signed contributors agreement for retnuh
* adding adverbs and irregular cases for empty words
* adding adverbs and irregular cases for empty words
* adding adverbs and irregular cases for empty words
* updating contributor agreement for amperinet
* modifying French lookup that contained wrong lemmas
* correcting wrong line breaks on hyphen
* adding contributor agreement for amperinet@
* correcting a typo
Somehow, 4.1.x seems to cause test failure due to get_marker – possibly needs to be investigated for spacy-models/tests and likely not relevant on develop anymore
This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](https://github.com/explosion/spaCy/pull/3089) (sorry!).
## Description
The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed.
### Types of change
Bug fix
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
If doc.from_array() was called with say, only entity information, this
would cause doc.is_tagged to be set to False, even if tags were set.
This caused tags to be dropped from serialisation. The same was true for
doc.is_parsed.
Closes#3012.
Initially span.as_doc() was designed to return a view of the span's contents, as a Doc object. This was a nice idea, but it fails due to the token.idx property, which refers to the character offset within the string. In a span, the idx of the first token might not be 0. Because this data is different, we can't have a view --- it'll be inconsistent.
This patch changes span.as_doc() to instead return a copy. The docs are updated accordingly. Closes#1537
* Update test for span.as_doc()
* Make span.as_doc() return a copy. Closes#1537
* Document change to Span.as_doc()
The doc.retokenize() context manager wasn't resizing doc.tensor, leading to a mismatch between the number of tokens in the doc and the number of rows in the tensor. We fix this by deleting rows from the tensor. Merged spans are represented by the vector of their last token.
* Add test for resizing doc.tensor when merging
* Add test for resizing doc.tensor when merging. Closes#1963
* Update get_lca_matrix test for develop
* Fix retokenize if tensor unset
<!--- Provide a general summary of your changes in the title. -->
## Description
See #3079. Here I'm merging into `develop` instead of `master`.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
Bug fix.
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Test on #2396: bug in Doc.get_lca_matrix()
* reimplementation of Doc.get_lca_matrix(), (closes#2396)
* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()
* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()
* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix
* use memory view instead of np.ndarray in _get_lca_matrix (faster)
* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview
* cleaner conditional, add comment
* Test on #2396: bug in Doc.get_lca_matrix()
* reimplementation of Doc.get_lca_matrix(), (closes#2396)
* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()
* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()
* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix
* use memory view instead of np.ndarray in _get_lca_matrix (faster)
* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview
* cleaner conditional, add comment
* Add failing test for matcher bug #3009
* Deduplicate matches from Matcher
* Update matcher ? quantifier test
* Fix bug with ? quantifier in Matcher
The ? quantifier indicates a token may occur zero or one times. If the
token pattern fit, the matcher would fail to consider valid matches
where the token pattern did not fit. Consider a simple regex like:
.?b
If we have the string 'b', the .? part will fit --- but then the 'b' in
the pattern will not fit, leaving us with no match. The same bug left us
with too few matches in some cases. For instance, consider:
.?.?
If we have a string of length two, like 'ab', we actually have three
possible matches here: [a, b, ab]. We were only recovering 'ab'. This
should now be fixed. Note that the fix also uncovered another bug, where
we weren't deduplicating the matches. There are actually two ways we
might match 'a' and two ways we might match 'b': as the second token of the pattern,
or as the first token of the pattern. This ambiguity is spurious, so we
need to deduplicate.
Closes#2464 and #3009
* Fix Python2
* Remove check for overwritten factory
This needs to be handled differently – on first initialization, a new factory will be added and any subsequent initializations will trigger this warning, even if it's a new entry point that doesn't overwrite a built-in.
* Add helper to only load specific entry point
Useful for loading languages via entry points, so that they can be lazy-loaded. Otherwise, all entry point languages would have to be loaded upfront.
* Check entry points for custom languages
## Description
- [x] fix auto-detection of Jupyter notebooks (even if `jupyter=True` isn't set)
- [x] add `displacy.set_render_wrapper` method to define a custom function called around the HTML markup generated in all calls to `displacy.render` (can be used to allow custom integrations, callbacks and page formatting)
- [x] add option to customise host for web server
- [x] show warning if `displacy.serve` is called from within Jupyter notebooks
- [x] move error message to `spacy.errors.Errors`.
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
The output weights often return negative scores for classes, especially
via the bias terms. This means that when we add a new class, we can't
rely on just zeroing the weights, or we'll end up with positive
predictions for those labels.
To solve this, we use nan values as the initial weights for new labels.
This prevents them from ever coming out on top. During backprop, we
replace the nan values with the minimum assigned score, so that we're
still able to learn these classes.
After creating a component, the `.model` attribute is left with the value `True`, to indicate it should be created later during `from_disk()`, `from_bytes()` or `begin_training()`. This had led to confusing errors if you try to use the component without initializing the model.
To fix this, we add a method `require_model()` to the `Pipe` base class. The `require_model()` method needs to be called at the start of the `.predict()` and `.update()` methods of the components. It raises a `ValueError` if the model is not initialized. An error message has been added to `spacy.errors`.