* replace unicode categories with raw list of code points
* simplifying ranges
* fixing variable length quotes
* removing redundant regular expression
* small cleanup of regexp notations
* quotes and alpha as ranges instead of alterations
* removed most regexp dependencies and features
* exponential backtracking - unit tests
* rewrote expression with pathological backtracking
* disabling double hyphen tests for now
* test additional variants of repeating punctuation
* remove regex and redundant backslashes from load_reddit script
* small typo fixes
* disable double punctuation test for russian
* clean up old comments
* format block code
* final cleanup
* naming consistency
* french strings as unicode for python 2 support
* french regular expression case insensitive
* running UD eval
* printing timing of tokenizer: tokens per second
* timing of default English model
* structured output and parameterization to compare different runs
* additional flag to allow evaluation without parsing info
* printing verbose log of errors for manual inspection
* printing over- and undersegmented cases (and combo's)
* add under and oversegmented numbers to Score and structured output
* print high-freq over/under segmented words and word shapes
* printing examples as part of the structured output
* print the results to file
* batch run of different models and treebanks per language
* cleaning up code
* commandline script to process all languages in spaCy & UD
* heuristic to remove blinded corpora and option to run one single best per language
* pathlib instead of os for file paths
* Update matcher engine for regex and extensions
Add support for matching over arbitrary Python predicate functions, and
arbitrary Python attribute getters. This will allow matching over regex
patterns, and allow supporting extension attributes.
The results of the Python predicate functions are cached, so that we don't
call the same predicate function twice for the same token. The extension
attributes are fetched into an array for each token in the doc. This
should minimise the performance impact of the new features.
We still need to wire up these features to the patterns, and test it
all.
* Work on wiring up extra attributes in matcher
* Work on tests for extra matcher attrs
* Add support for extension attrs to matcher
* Test extension attribute matching
* Work on implementing predicate-based match patterns
* Get predicates working for set membership
* Add test for set membership
* Make extensions+predicates work
* Test matcher extensions
* Cache predicate results better in Matcher
* Remove print statement in matcher test
* Use srsly to get key for predicates
If doc.from_array() was called with say, only entity information, this
would cause doc.is_tagged to be set to False, even if tags were set.
This caused tags to be dropped from serialisation. The same was true for
doc.is_parsed.
Closes#3012.
Initially span.as_doc() was designed to return a view of the span's contents, as a Doc object. This was a nice idea, but it fails due to the token.idx property, which refers to the character offset within the string. In a span, the idx of the first token might not be 0. Because this data is different, we can't have a view --- it'll be inconsistent.
This patch changes span.as_doc() to instead return a copy. The docs are updated accordingly. Closes#1537
* Update test for span.as_doc()
* Make span.as_doc() return a copy. Closes#1537
* Document change to Span.as_doc()
The doc.retokenize() context manager wasn't resizing doc.tensor, leading to a mismatch between the number of tokens in the doc and the number of rows in the tensor. We fix this by deleting rows from the tensor. Merged spans are represented by the vector of their last token.
* Add test for resizing doc.tensor when merging
* Add test for resizing doc.tensor when merging. Closes#1963
* Update get_lca_matrix test for develop
* Fix retokenize if tensor unset
<!--- Provide a general summary of your changes in the title. -->
## Description
See #3079. Here I'm merging into `develop` instead of `master`.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
Bug fix.
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Test on #2396: bug in Doc.get_lca_matrix()
* reimplementation of Doc.get_lca_matrix(), (closes#2396)
* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()
* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()
* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix
* use memory view instead of np.ndarray in _get_lca_matrix (faster)
* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview
* cleaner conditional, add comment
* Add failing test for matcher bug #3009
* Deduplicate matches from Matcher
* Update matcher ? quantifier test
* Fix bug with ? quantifier in Matcher
The ? quantifier indicates a token may occur zero or one times. If the
token pattern fit, the matcher would fail to consider valid matches
where the token pattern did not fit. Consider a simple regex like:
.?b
If we have the string 'b', the .? part will fit --- but then the 'b' in
the pattern will not fit, leaving us with no match. The same bug left us
with too few matches in some cases. For instance, consider:
.?.?
If we have a string of length two, like 'ab', we actually have three
possible matches here: [a, b, ab]. We were only recovering 'ab'. This
should now be fixed. Note that the fix also uncovered another bug, where
we weren't deduplicating the matches. There are actually two ways we
might match 'a' and two ways we might match 'b': as the second token of the pattern,
or as the first token of the pattern. This ambiguity is spurious, so we
need to deduplicate.
Closes#2464 and #3009
* Fix Python2
* Remove check for overwritten factory
This needs to be handled differently – on first initialization, a new factory will be added and any subsequent initializations will trigger this warning, even if it's a new entry point that doesn't overwrite a built-in.
* Add helper to only load specific entry point
Useful for loading languages via entry points, so that they can be lazy-loaded. Otherwise, all entry point languages would have to be loaded upfront.
* Check entry points for custom languages
## Description
- [x] fix auto-detection of Jupyter notebooks (even if `jupyter=True` isn't set)
- [x] add `displacy.set_render_wrapper` method to define a custom function called around the HTML markup generated in all calls to `displacy.render` (can be used to allow custom integrations, callbacks and page formatting)
- [x] add option to customise host for web server
- [x] show warning if `displacy.serve` is called from within Jupyter notebooks
- [x] move error message to `spacy.errors.Errors`.
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
The output weights often return negative scores for classes, especially
via the bias terms. This means that when we add a new class, we can't
rely on just zeroing the weights, or we'll end up with positive
predictions for those labels.
To solve this, we use nan values as the initial weights for new labels.
This prevents them from ever coming out on top. During backprop, we
replace the nan values with the minimum assigned score, so that we're
still able to learn these classes.
After creating a component, the `.model` attribute is left with the value `True`, to indicate it should be created later during `from_disk()`, `from_bytes()` or `begin_training()`. This had led to confusing errors if you try to use the component without initializing the model.
To fix this, we add a method `require_model()` to the `Pipe` base class. The `require_model()` method needs to be called at the start of the `.predict()` and `.update()` methods of the components. It raises a `ValueError` if the model is not initialized. An error message has been added to `spacy.errors`.
* issue #3012: add test
* add contributor aggreement
* Make test work without models and fix typos
ten.pos_ instead of ten.orth_ and comparison against "10" instead of integer 10
I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language Filipino. I have heavily based the format to the EN and ES languages.
I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to its Tagalog counterpart, added some tokenizer exceptions, and kept the tag map the same as the English language.
While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases.
* Added alpha support for Tagalog language
* Edited contributor template
* Included SCA; Reverted templates
* Fixed SCA template
* Fixed changes in SCA template