Commit Graph

4 Commits

Author SHA1 Message Date
Adriane Boyd
4f0843e0ec Sort spans before processing 2020-07-30 14:45:43 +02:00
Adriane Boyd
ca33e891e2 Extend AttributeRuler functionality
* Add option to initialize with a dict of AttributeRuler patterns

* Instead of silently discarding overlapping matches (the default
behavior for the retokenizer if only the attrs differ), split the
matches into disjoint sets and retokenize each set separately. This
allows, for instance, one pattern to set the POS and another pattern to
set the lemma. (If two matches modify the same attribute, it looks like
the attrs are applied in the order they were added, but it may not be
deterministic?)

* Improve types
2020-07-30 11:17:33 +02:00
Adriane Boyd
a6edbcb7d2 Fix default name 2020-07-30 09:38:58 +02:00
Adriane Boyd
63f5951f8b Add AttributeRuler for token attribute exceptions
Add the `AttributeRuler` to handle exceptions for token-level
attributes. The `AttributeRuler` uses `Matcher` patterns to identify
target spans and applies the specified attributes to the token at the
provided index in the matched span. A negative index can be used to
index from the end of the matched span. The retokenizer is used to
"merge" the individual tokens and assign them the provided attributes.

Helper functions can import existing tag maps and morph rules to the
corresponding `Matcher` patterns.

There is an additional minor bug fix for `MORPH` attributes in the
retokenizer to correctly normalize the values and to handle `MORPH`
alongside `_` in an attrs dict.
2020-07-30 09:10:59 +02:00