* Don't add duplicate patterns (fix #8216)
* Refactor EntityRuler init
This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.
* Make EntityRuler.clear reset matchers
Includes a new test for this.
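The intent is roughly the following sketch (not the exact test added in the repo, but what the behavior should look like after the change):
```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
assert len(nlp("Apple is opening a store").ents) == 1

# clear() should also reset the internal Matcher/PhraseMatcher state,
# not just the stored pattern list, so nothing matches afterwards.
ruler.clear()
assert len(ruler.patterns) == 0
assert len(nlp("Apple is opening a store").ents) == 0
```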
* Tidy PhraseMatcher instantiation
Since `attr` can now safely be `None`, the guarding `if` is no longer
required here.
Also renamed the `_validate` attr. Maybe it's not needed?
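Roughly, the simplification looks like this sketch (local variable names made up; the only assumption is the one stated above, that `attr=None` is now accepted):
```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
phrase_matcher_attr = None  # may also be e.g. "LOWER"

# Before: a guard was needed because a None attr could not be passed through
if phrase_matcher_attr is not None:
    matcher = PhraseMatcher(nlp.vocab, attr=phrase_matcher_attr)
else:
    matcher = PhraseMatcher(nlp.vocab)

# After: attr=None is handled safely, so the guard is no longer required
matcher = PhraseMatcher(nlp.vocab, attr=phrase_matcher_attr)
```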
* Fix NER test
* Add test to make sure patterns aren't increasing
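A regression test for this could look roughly like the following sketch, assuming the ruler resets itself when re-initialized (which is what the refactor above is preparing for):
```python
import spacy

nlp = spacy.blank("en")
patterns = [{"label": "ORG", "pattern": "Apple"}]
ruler = nlp.add_pipe("entity_ruler")

# Initializing twice with the same patterns must not duplicate them
ruler.initialize(lambda: [], patterns=patterns)
ruler.initialize(lambda: [], patterns=patterns)
assert len(ruler.patterns) == len(patterns)
```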
* Move test to regression tests
`make_clean_doc` is not needed and was removed.
`logsumexp` may be needed if I misunderstood the loss calculation, so I
left it in for now with a note.
* "y" etc.
Many changes described in pull request
* Update spacy/lang/fr/stop_words.py
* Update spacy/lang/fr/stop_words.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
The attributes `PROB`, `CLUSTER` and `SENT_END` are not supported by
`Lexeme.get_struct_attr`, so they should not be included through `attrs.IDS`
as supported attributes in `Doc.to_array` and other methods.
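In other words, `Doc.to_array` should only be asked for attributes it can actually fill in, e.g.:
```python
import spacy

nlp = spacy.blank("en")
doc = nlp("This is a test.")

# Attribute names are resolved through spacy.attrs; PROB, CLUSTER and
# SENT_END are excluded because Lexeme.get_struct_attr cannot provide them.
arr = doc.to_array(["ORTH", "IS_ALPHA", "IS_PUNCT"])
print(arr.shape)  # (5, 3): one row per token, one column per attribute
```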
* Show warning if entity_ruler runs without patterns
* Show warning if matcher runs without patterns
* fix wording
* unit test for warning once (WIP)
* warn W036 only once
* cleanup
* create filter_warning helper
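A minimal sketch of the warn-once idea using the standard `warnings` filters (the helper name comes from the commit title above, but its exact signature in spaCy may differ):
```python
import warnings

def filter_warning(action: str, error_msg: str) -> None:
    """Set a warnings filter for a specific message, e.g. "once" so that
    a warning like W036 is only shown the first time it is raised."""
    warnings.filterwarnings(action, message=error_msg)

filter_warning("once", r"\[W036\]")
for _ in range(3):
    warnings.warn("[W036] The component has no patterns defined.")
# Only the first of the three identical warnings is shown.
```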
* Add all symbols in Unicode Currency Symbols block
In #8102 it came up that the rupee symbol was treated differently from the
dollar / euro / yen symbols. This adds many symbols that were not already
included.
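With the full Unicode Currency Symbols block included, "₹" should now be flagged the same way as the symbols that were already covered, e.g.:
```python
import spacy

nlp = spacy.blank("en")
for symbol in ["$", "€", "¥", "₹"]:
    lex = nlp.vocab[symbol]
    print(symbol, lex.is_currency)
# After the change, "₹" should report is_currency == True like the others
```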
* Fix test
* Fix training test
The intent of this was that it would be a pipeline component that used
entities as input, but that's now covered by the `get_mentions` function
as a pipeline arg.
This is closer to the traditional evaluation method. That uses an
average of three scores; this just uses the B-cubed metric for now
(nothing special about B-cubed, it was just one of the options).
The scoring implementation comes from the coval project. It relies on
scipy, which is one issue, and is rather involved, which is another.
Besides being comparable with traditional evaluations, this scoring is
relatively fast.
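For reference, the core of the B-cubed idea is small even though the coval implementation is more involved; a rough, self-contained sketch (ignoring the handling of unmatched mentions, which real implementations deal with properly):
```python
def b_cubed(gold_clusters, pred_clusters):
    """Sketch of B-cubed precision/recall/F1 over two clusterings, given as
    lists of sets of mention ids. Mentions missing from either side are
    simply ignored here."""
    gold = {m: frozenset(c) for c in gold_clusters for m in c}
    pred = {m: frozenset(c) for c in pred_clusters for m in c}
    mentions = gold.keys() & pred.keys()
    if not mentions:
        return 0.0, 0.0, 0.0
    p = sum(len(gold[m] & pred[m]) / len(pred[m]) for m in mentions) / len(mentions)
    r = sum(len(gold[m] & pred[m]) / len(gold[m]) for m in mentions) / len(mentions)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Gold clusters {0, 1} and {2, 3}; predicted clusters {0, 1, 2} and {3}
print(b_cubed([{0, 1}, {2, 3}], [{0, 1, 2}, {3}]))
```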
The behavior of `spacy.Corpus.v1` is unexpected enough for `max_length
!= 0` that `0` is a better default for users creating a new config with
the quickstart.
With a nonzero `max_length`, documents are skipped, sometimes the entire
corpus is skipped, and sometimes documents are (quite unexpectedly for
the average user) split into sentences.
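The corresponding block in a quickstart-generated config would then look something like this sketch (reader and path interpolation as in the standard template):
```ini
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
# 0 disables the length filter entirely, so no documents are skipped or
# unexpectedly split into sentences during training
max_length = 0
```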
* unit test for pickling KB
* add pickling test for NEL
* KB to_bytes and from_bytes
* NEL to_bytes and from_bytes
* xfail pickle tests for now
* fix docs
* cleanup
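Taken together, the serialization commits enable a round trip roughly like this sketch (using the documented `KnowledgeBase` API; the entity, alias and vector length are made-up example values):
```python
from spacy.lang.en import English
from spacy.kb import KnowledgeBase

nlp = English()
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
kb.add_alias(alias="Russ Cochran", entities=["Q2146908"], probabilities=[0.8])

# Round-trip the KB through bytes and check the contents survive
kb2 = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb2.from_bytes(kb.to_bytes())
assert kb2.contains_entity("Q2146908")
assert kb2.contains_alias("Russ Cochran")
```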