//- 💫 DOCS > USAGE > RULE-BASED MATCHING
p
| spaCy features a rule-matching engine, the #[+api("matcher") #[code Matcher]],
| that operates over tokens, similar
| to regular expressions. The rules can refer to token annotations (e.g.
| the token #[code text] or #[code tag_]), and flags (e.g. #[code IS_PUNCT]).
| The rule matcher also lets you pass in a custom callback
| to act on matches; for example, to merge entities and apply custom labels.
| You can also associate patterns with entity IDs, to allow some basic
| entity linking or disambiguation. To match large terminology lists,
| you can use the #[+api("phrasematcher") #[code PhraseMatcher]], which
| accepts #[code Doc] objects as match patterns.
+h(3, "adding-patterns") Adding patterns
p
| Let's say we want to enable spaCy to find a combination of three tokens:
+list("numbers")
+item
| A token whose #[strong lowercase form matches "hello"], e.g. "Hello"
| or "HELLO".
+item
| A token whose #[strong #[code is_punct] flag is set to #[code True]],
| i.e. any punctuation.
+item
| A token whose #[strong lowercase form matches "world"], e.g. "World"
| or "WORLD".
+code.
[{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
p
| First, we initialise the #[code Matcher] with a vocab. The matcher must
| always share the same vocab with the documents it will operate on. We
| can now call #[+api("matcher#add") #[code matcher.add()]] with an ID and
| our custom pattern. The second argument lets you pass in an optional
| callback function to invoke on a successful match. For now, we set it
| to #[code None].
+code-exec.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# add match ID "HelloWorld" with no callback and one pattern
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher.add('HelloWorld', None, pattern)
doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id] # get string representation
    span = doc[start:end] # the matched span
    print(match_id, string_id, start, end, span.text)
p
| The matcher returns a list of #[code (match_id, start, end)] tuples; in
| this case, #[code [(15578876784678163569, 0, 2)]], which maps to the
| span #[code doc[0:2]] of our original document. The #[code match_id]
| is the #[+a("/usage/spacy-101#vocab") hash value] of the string ID
| "HelloWorld". To get the string value, you can look up the ID
| in the #[+api("stringstore") #[code StringStore]].
+code.
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id] # 'HelloWorld'
    span = doc[start:end] # the matched span
p
| Optionally, we could also choose to add more than one pattern, for
| example to also match sequences without punctuation between "hello" and
| "world":
+code.
matcher.add('HelloWorld', None,
            [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
            [{'LOWER': 'hello'}, {'LOWER': 'world'}])
p
| By default, the matcher will only return the matches and
| #[strong not do anything else], like merge entities or assign labels.
| This is all up to you and can be defined individually for each pattern,
| by passing in a callback function as the #[code on_match] argument on
| #[code add()]. This is useful, because it lets you write entirely custom
| and #[strong pattern-specific logic]. For example, you might want to
| merge #[em some] patterns into one token, while adding entity labels for
| other pattern types. You shouldn't have to create different matchers for
| each of those processes.
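p
| As a minimal sketch of this idea (the patterns, rule names and callback
| bodies below are purely illustrative), you can register two rules on the
| same matcher, each with its own #[code on_match] callback doing something
| different:
+code.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
def add_ent(matcher, doc, i, matches):
    # label the matched span as an entity, using the match ID as the label
    match_id, start, end = matches[i]
    doc.ents += ((match_id, start, end),)
def log_match(matcher, doc, i, matches):
    # only report the match, without modifying the doc
    match_id, start, end = matches[i]
    print('Matched:', doc[start:end].text)
matcher = Matcher(nlp.vocab)
matcher.add('TOPIC', add_ent, [{'LOWER': 'machine'}, {'LOWER': 'learning'}])
matcher.add('GREETING', log_match, [{'LOWER': 'hello'}, {'LOWER': 'world'}])
doc = nlp(u"Hello world! This text is about machine learning.")
matcher(doc)
print([(ent.text, ent.label_) for ent in doc.ents])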
+h(4, "adding-patterns-attributes") Available token attributes
p
| The available token pattern keys are uppercase versions of the
| #[+api("token#attributes") #[code Token] attributes]. The most relevant
| ones for rule-based matching are:
+table(["Attribute", "Description"])
+row
+cell #[code ORTH]
+cell The exact verbatim text of a token.
+row
+cell.u-nowrap #[code LOWER]
+cell The lowercase form of the token text.
+row
+cell #[code LENGTH]
+cell The length of the token text.
+row
+cell.u-nowrap #[code IS_ALPHA], #[code IS_ASCII], #[code IS_DIGIT]
+cell
| Token text consists of alphanumeric characters, ASCII characters,
| digits.
+row
+cell.u-nowrap #[code IS_LOWER], #[code IS_UPPER], #[code IS_TITLE]
+cell Token text is in lowercase, uppercase, titlecase.
+row
+cell.u-nowrap #[code IS_PUNCT], #[code IS_SPACE], #[code IS_STOP]
+cell Token is punctuation, whitespace, stop word.
+row
+cell.u-nowrap #[code LIKE_NUM], #[code LIKE_URL], #[code LIKE_EMAIL]
+cell Token text resembles a number, URL, email.
+row
+cell.u-nowrap
| #[code POS], #[code TAG], #[code DEP], #[code LEMMA],
| #[code SHAPE]
+cell
| The token's simple and extended part-of-speech tag, dependency
| label, lemma, shape.
+row
+cell.u-nowrap #[code ENT_TYPE]
+cell The token's entity label.
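p
| A single token pattern can also combine several of these keys. As a
| purely illustrative example, the following pattern matches a token with
| the lemma "love" tagged as a verb, followed by a noun, e.g. "loved dogs"
| or "love cats":
+code.
[{'LEMMA': 'love', 'POS': 'VERB'}, {'POS': 'NOUN'}]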
+h(4, "adding-patterns-wildcard") Using wildcard token patterns
+tag-new(2)
p
| While the token attributes offer many options to write highly specific
| patterns, you can also use an empty dictionary, #[code {}], as a wildcard
| representing #[strong any token]. This is useful if you know the context
| of what you're trying to match, but very little about the specific token
| and its characters. For example, let's say you're trying to extract
| people's user names from your data. All you know is that they are listed
| as "User name: {username}". The name itself may contain any character,
| but no whitespace, so you'll know it will be handled as one token.
+code.
[{'ORTH': 'User'}, {'ORTH': 'name'}, {'ORTH': ':'}, {}]
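p
| As a quick sketch of how you might use this pattern (assuming an
| #[code nlp] object loaded as in the examples above), the wildcard token,
| i.e. the last token of each match, is the user name itself:
+code.
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': 'User'}, {'ORTH': 'name'}, {'ORTH': ':'}, {}]
matcher.add('USERNAME', None, pattern)
doc = nlp(u"User name: Bowie. User name: iggy_pop77.")
for match_id, start, end in matcher(doc):
    print(doc[end - 1].text) # the wildcard token, i.e. the user name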
+h(4, "quantifiers") Using operators and quantifiers
p
| The matcher also lets you use quantifiers, specified as the #[code 'OP']
| key. Quantifiers let you define sequences of tokens to be matched, e.g.
| one or more punctuation marks, or specify optional tokens. Note that there
| are no nested or scoped quantifiers; instead, you can build those
| behaviours with #[code on_match] callbacks.
+table([ "OP", "Description"])
+row
+cell #[code !]
+cell Negate the pattern, by requiring it to match exactly 0 times.
+row
+cell #[code ?]
+cell Make the pattern optional, by allowing it to match 0 or 1 times.
+row
+cell #[code +]
+cell Require the pattern to match 1 or more times.
+row
+cell #[code *]
+cell Allow the pattern to match zero or more times.
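p
| As a brief illustration (a variation on the "hello world" example above),
| the following pattern makes the punctuation token optional and allows one
| or more "world" tokens, so it would match "Hello world", "Hello, world"
| and "Hello, world world":
+code.
[{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '?'}, {'LOWER': 'world', 'OP': '+'}]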
p
| In versions before v2.1.0, the semantics of the #[code +] and #[code *] operators
| behaved inconsistently. They were usually interpreted
| "greedily", i.e. longer matches were returned where possible. However, if
| you specified two #[code +] or #[code *] patterns in a row and their
| matches overlapped, the first operator would behave non-greedily. This quirk
| in the semantics is corrected in spaCy v2.1.0.
+h(3, "adding-phrase-patterns") Adding phrase patterns
p
| If you need to match large terminology lists, you can also use the
| #[+api("phrasematcher") #[code PhraseMatcher]] and create
| #[+api("doc") #[code Doc]] objects instead of token patterns, which is
| much more efficient overall. The #[code Doc] patterns can contain single
| or multiple tokens.
+code-exec.
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
patterns = [nlp(text) for text in terminology_list]
matcher.add('TerminologyList', None, *patterns)
doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
u"converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
p
| Since spaCy is used for processing both the patterns and the text to be
| matched, you won't have to worry about specific tokenization; for
| example, you can simply pass in #[code nlp(u"Washington, D.C.")] and
| won't have to write a complex token pattern covering the exact
| tokenization of the term.
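p
| As a quick check (using the same #[code nlp] object as above), you can
| inspect how spaCy tokenizes a pattern text; the #[code Doc] you pass to
| #[code matcher.add] covers exactly these tokens:
+code.
pattern = nlp(u"Washington, D.C.")
print([token.text for token in pattern])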
+h(3, "on_match") Adding #[code on_match] rules
p
| To move on to a more realistic example, let's say you're working with a
| large corpus of blog articles, and you want to match all mentions of
| "Google I/O" (which spaCy tokenizes as #[code ['Google', 'I', '/', 'O']]).
| To be safe, you only match on the uppercase versions, in case someone has
| written it as "Google i/o". You also add a second pattern with an added
| #[code {IS_DIGIT: True}] token; this will make sure you also match on
| "Google I/O 2017". If your pattern matches, spaCy should execute your
| custom callback function #[code add_event_ent].
+code-exec.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# Get the ID of the 'EVENT' entity type. This is required to set an entity.
EVENT = nlp.vocab.strings['EVENT']
def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entities. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = (EVENT, start, end)
    doc.ents += (entity,)
    print(doc[start:end].text, entity)
matcher.add('GoogleIO', add_event_ent,
            [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}],
            [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}, {'IS_DIGIT': True}])
doc = nlp(u"This is a text about Google I/O 2015.")
matches = matcher(doc)
+aside("Tip: Visualizing matches")
| When working with entities, you can use #[+api("top-level#displacy") displaCy]
| to quickly generate a NER visualization from your updated #[code Doc],
| which can be exported as an HTML file:
+code.o-no-block.
from spacy import displacy
html = displacy.render(doc, style='ent', page=True,
                       options={'ents': ['EVENT']})
| For more info and examples, see the usage guide on
| #[+a("/usage/visualizers") visualizing spaCy].
p
| We can now call the matcher on our documents. The patterns will be
| matched in the order they occur in the text. The matcher will then
| iterate over the matches, look up the callback for the match ID
| that was matched, and invoke it.
+code.
doc = nlp(YOUR_TEXT_HERE)
matcher(doc)
p
| When the callback is invoked, it is
| passed four arguments: the matcher itself, the document, the position of
| the current match, and the total list of matches. This allows you to
| write callbacks that consider the entire set of matched phrases, so that
| you can resolve overlaps and other conflicts in whatever way you prefer.
+table(["Argument", "Type", "Description"])
+row
+cell #[code matcher]
+cell #[code Matcher]
+cell The matcher instance.
+row
+cell #[code doc]
+cell #[code Doc]
+cell The document the matcher was used on.
+row
+cell #[code i]
+cell int
+cell Index of the current match (#[code matches[i]]).
+row
+cell #[code matches]
+cell list
+cell
| A list of #[code (match_id, start, end)] tuples, describing the
| matches. A match tuple describes a span #[code doc[start:end]].
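p
| For instance, because the callback receives the full list of matches, it
| can look at all of them to resolve overlaps. The sketch below (purely
| illustrative, not part of spaCy's API) only reports a match if no longer
| match starts at the same token:
+code.
def report_longest(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    for other_id, other_start, other_end in matches:
        if other_start == start and other_end > end:
            return # a longer match starting here exists, so skip this one
    print('Longest match:', doc[start:end].text)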
+h(3, "matcher-pipeline") Using custom pipeline components
p
| Let's say your data also contains some annoying pre-processing artefacts,
| like leftover HTML line breaks (e.g. #[code &lt;br&gt;] or
| #[code &lt;BR/&gt;]). To make your text easier to analyse, you want to
| merge those into one token and flag them, to make sure you
| can ignore them later. Ideally, this should all be done automatically
| as you process the text. You can achieve this by adding a
| #[+a("/usage/processing-pipelines#custom-components") custom pipeline component]
| that's called on each #[code Doc] object, merges the leftover HTML spans
| and sets an attribute #[code bad_html] on the token.
+code-exec.
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token
# we're using a class because the component needs to be initialised with
# the shared vocab via the nlp object
class BadHTMLMerger(object):
    def __init__(self, nlp):
        # register a new token extension to flag bad HTML
        Token.set_extension('bad_html', default=False)
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('BAD_HTML', None,
                         [{'ORTH': '&lt;'}, {'LOWER': 'br'}, {'ORTH': '&gt;'}],
                         [{'ORTH': '&lt;'}, {'LOWER': 'br/'}, {'ORTH': '&gt;'}])
    def __call__(self, doc):
        # this method is invoked when the component is called on a Doc
        matches = self.matcher(doc)
        spans = [] # collect the matched spans here
        for match_id, start, end in matches:
            spans.append(doc[start:end])
        for span in spans:
            span.merge() # merge
            for token in span:
                token._.bad_html = True # mark token as bad HTML
        return doc
nlp = spacy.load('en_core_web_sm')
html_merger = BadHTMLMerger(nlp)
nlp.add_pipe(html_merger, last=True) # add component to the pipeline
doc = nlp(u"Hello&lt;br&gt;world! &lt;br/&gt; This is a test.")
for token in doc:
    print(token.text, token._.bad_html)
p
| Instead of hard-coding the patterns into the component, you could also
| make it take a path to a JSON file containing the patterns. This lets
| you reuse the component with different patterns, depending on your
| application:
+code.
html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json')
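p
| One way to implement this (the #[code path] argument and the JSON format
| shown here are just one possible convention, not part of spaCy's API) is
| to load the file in the component's #[code __init__] and register each
| entry with the matcher:
+code.
import json
from spacy.matcher import Matcher
class BadHTMLMerger(object):
    def __init__(self, nlp, path):
        self.matcher = Matcher(nlp.vocab)
        with open(path) as f:
            # expects a dict mapping match IDs to lists of token patterns
            patterns = json.load(f)
        for match_id, rules in patterns.items():
            self.matcher.add(match_id, None, *rules)
    # the __call__ method stays the same as in the example above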
+infobox
| For more details and examples of how to
| #[strong create custom pipeline components] and
| #[strong extension attributes], see the
| #[+a("/usage/processing-pipelines") usage guide].
+h(3, "regex") Using regular expressions
p
| In some cases, only matching tokens and token attributes isn't enough;
| for example, you might want to match different spellings of a word,
| without having to add a new pattern for each spelling. A simple solution
| is to match a regular expression on the #[code Doc]'s #[code text] and
| use the #[+api("doc#char_span") #[code Doc.char_span]] method to
| create a #[code Span] from the character indices of the match:
+code-exec.
import spacy
import re
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely')
for match in re.finditer(DEFINITELY_PATTERN, doc.text):
    start, end = match.span() # get matched indices
    span = doc.char_span(start, end) # create Span from indices
    print(span.text)
p
| You can also use the regular expression with spaCy's #[code Matcher] by
| converting it to a token flag. To ensure efficiency, the
| #[code Matcher] can only access the C-level data. This means that it can
| either use built-in token attributes or #[strong binary flags].
| #[+api("vocab#add_flag") #[code Vocab.add_flag]] returns a flag ID which
| you can use as a key of a token match pattern. Tokens that match the
| regular expression will return #[code True] for the #[code IS_DEFINITELY]
| flag.
+code-exec.
import spacy
from spacy.matcher import Matcher
import re
nlp = spacy.load('en_core_web_sm')
definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)
matcher = Matcher(nlp.vocab)
matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])
doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
p
| Providing the regular expressions as binary flags also lets you use them
| in combination with other token patterns; for example, to match the
| word "definitely" in various spellings, followed by a case-insensitive
| "not" and an adjective:
+code.
[{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}]
+h(3, "example1") Example: Using linguistic annotations
p
| Let's say you're analysing user comments and you want to find out what
| people are saying about Facebook. You want to start off by finding
| adjectives following "Facebook is" or "Facebook was". This is obviously
| a very rudimentary solution, but it'll be fast, and a great way to get an
| idea for what's in your data. Your pattern could look like this:
+code.
[{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]
p
| This translates to a token whose lowercase form matches "facebook"
| (like Facebook, facebook or FACEBOOK), followed by a token with the lemma
| "be" (for example, is, was, or 's), followed by an #[strong optional] adverb,
| followed by an adjective. Using the linguistic annotations here is
| especially useful, because you can tell spaCy to match "Facebook's
| annoying", but #[strong not] "Facebook's annoying ads". The optional
| adverb makes sure you won't miss adjectives with intensifiers, like
| "pretty awful" or "very nice".
p
| To get a quick overview of the results, you could collect all sentences
| containing a match and render them with the
| #[+a("/usage/visualizers") displaCy visualizer].
| In the callback function, you'll have access to the #[code start] and
| #[code end] of each match, as well as the parent #[code Doc]. This lets
| you determine the sentence containing the match,
| #[code doc[start : end].sent], and calculate the start and end of the
| matched span within the sentence. Using displaCy in
| #[+a("/usage/visualizers#manual-usage") "manual" mode] lets you
| pass in a list of dictionaries containing the text and entities to render.
+code-exec.
import spacy
from spacy import displacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
matched_sents = [] # collect data of matched sentences to be visualized
def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end] # matched span
    sent = span.sent # sentence containing matched span
    # append mock entity for match in displaCy style to matched_sents
    # get the match span by offsetting the start and end of the span with the
    # start and end of the sentence in the doc
    match_ents = [{'start': span.start_char - sent.start_char,
                   'end': span.end_char - sent.start_char,
                   'label': 'MATCH'}]
    matched_sents.append({'text': sent.text, 'ents': match_ents})
pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'}]
matcher.add('FacebookIs', collect_sents, pattern) # add pattern
doc = nlp(u"I'd say that Facebook is evil. Facebook is pretty cool, right?")
matches = matcher(doc)
# visualize the sentences containing a match with displaCy
# set manual=True to make displaCy render straight from a dictionary
# (if you're not running the code within a Jupyter environment, you can
# remove jupyter=True and use displacy.serve instead)
displacy.render(matched_sents, style='ent', manual=True, jupyter=True)
+h(3, "example2") Example: Phone numbers
p
| Phone numbers can have many different formats and matching them is often
| tricky. During tokenization, spaCy will leave sequences of numbers intact
| and only split on whitespace and punctuation. This means that your match
| pattern will have to look out for number sequences of a certain length,
| surrounded by specific punctuation depending on the
| #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers") national conventions].
p
| The #[code IS_DIGIT] flag is not very helpful here, because it doesn't
| tell us anything about the length. However, you can use the #[code SHAPE]
| flag, with each #[code d] representing a digit:
+code.
[{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
 {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]
p
| This will match phone numbers of the format #[strong (123) 4567 8901] or
| #[strong (123) 4567-8901]. To also match formats like #[strong (123) 456 789],
| you can add a second pattern using #[code 'ddd'] in place of #[code 'dddd'].
| By hard-coding some values, you can match only certain, country-specific
| numbers. For example, here's a pattern to match the most common formats of
| #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#Germany") international German numbers]:
+code.
[{'ORTH': '+'}, {'ORTH': '49'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'dddd'},
 {'ORTH': ')', 'OP': '?'}, {'SHAPE': 'dddddd'}]
p
| Depending on the formats your application needs to match, creating an
| extensive set of rules like this is often better than training a model.
| It'll produce more predictable results, be much easier to modify and
| extend, and won't require any training data, only a set of
| test cases.
+code-exec.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'ddd'},
           {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}]
matcher.add('PHONE_NUMBER', None, pattern)
doc = nlp(u"Call me at (123) 456 789 or (123) 456 789!")
print([t.text for t in doc])
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
+h(3, "example3") Example: Hashtags and emoji on social media
p
| Social media posts, especially tweets, can be difficult to work with.
| They're very short and often contain various emoji and hashtags. By only
| looking at the plain text, you'll lose a lot of valuable semantic
| information.
p
| Let's say you've extracted a large sample of social media posts on a
| specific topic, for example posts mentioning a brand name or product.
| As the first step of your data exploration, you want to filter out posts
| containing certain emoji and use them to assign a general sentiment
| score, based on whether the expressed emotion is positive or negative,
| e.g. #[span.o-icon.o-icon--inline 😀] or #[span.o-icon.o-icon--inline 😞].
| You also want to find, merge and label hashtags like
| #[code #MondayMotivation], to be able to ignore or analyse them later.
+aside("Note on sentiment analysis")
| Ultimately, sentiment analysis is not always #[em that] easy. In
| addition to the emoji, you'll also want to take specific words into
| account and check the #[code subtree] for intensifiers like "very", to
| increase the sentiment score. At some point, you might also want to train
| a sentiment model. However, the approach described in this example is
| very useful for #[strong bootstrapping rules to collect training data].
| It's also an incredibly fast way to gather first insights into your data:
| with about 1 million tweets, you'd be looking at a processing time of
| #[strong under 1 minute].
p
| By default, spaCy's tokenizer will split emoji into separate tokens. This
| means that you can create a pattern for one or more emoji tokens.
| Valid hashtags usually consist of a #[code #], plus a sequence of
| ASCII characters with no whitespace, making them easy to match as well.
+code-exec.
from spacy.lang.en import English
from spacy.matcher import Matcher
nlp = English() # we only want the tokenizer, so no need to load a model
matcher = Matcher(nlp.vocab)
pos_emoji = [u'😀', u'😃', u'😂', u'🤣', u'😊', u'😍'] # positive emoji
neg_emoji = [u'😞', u'😠', u'😩', u'😢', u'😭', u'😒'] # negative emoji
# add patterns to match one or more emoji tokens
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
neg_patterns = [[{'ORTH': emoji}] for emoji in neg_emoji]
# function to label the sentiment
def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == 'HAPPY': # don't forget to get string!
        doc.sentiment += 0.1 # add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == 'SAD':
        doc.sentiment -= 0.1 # subtract 0.1 for negative sentiment
matcher.add('HAPPY', label_sentiment, *pos_patterns) # add positive pattern
matcher.add('SAD', label_sentiment, *neg_patterns) # add negative pattern
# add pattern for valid hashtag, i.e. '#' plus any ASCII token
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])
doc = nlp(u"Hello world 😀 #MondayMotivation")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = doc.vocab.strings[match_id] # look up string ID
    span = doc[start:end]
    print(string_id, span.text)
p
| Because the #[code on_match] callback receives the ID of each match, you
| can use the same function to handle the sentiment assignment for both
| the positive and negative pattern. To keep it simple, we'll either add
| or subtract #[code 0.1] points; this way, the score will also reflect
| combinations of emoji, even positive #[em and] negative ones.
p
| With a library like
| #[+a("https://github.com/bcongdon/python-emojipedia") Emojipedia],
| we can also retrieve a short description for each emoji; for example,
| #[span.o-icon.o-icon--inline 😍]'s official title is "Smiling Face With
| Heart-Eyes". Assigning it to a
| #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute]
| on the emoji span will make it available as #[code span._.emoji_desc].
+code.
from emojipedia import Emojipedia # installation: pip install emojipedia
from spacy.tokens import Span # get the global Span object
Span.set_extension('emoji_desc', default=None) # register the custom attribute
def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == 'HAPPY': # don't forget to get string!
        doc.sentiment += 0.1 # add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == 'SAD':
        doc.sentiment -= 0.1 # subtract 0.1 for negative sentiment
    span = doc[start : end]
    emoji = Emojipedia.search(span[0].text) # get data for emoji
    span._.emoji_desc = emoji.title # assign emoji description
p
| To label the hashtags, we can use a
| #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute]
| set on the respective token:
+code-exec.
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# add pattern for valid hashtag, i.e. '#' plus any ASCII token
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])
# register token extension
Token.set_extension('is_hashtag', default=False)
doc = nlp(u"Hello world 😀 #MondayMotivation")
matches = matcher(doc)
hashtags = []
for match_id, start, end in matches:
    if doc.vocab.strings[match_id] == 'HASHTAG':
        hashtags.append(doc[start:end])
for span in hashtags:
    span.merge()
    for token in span:
        token._.is_hashtag = True
for token in doc:
    print(token.text, token._.is_hashtag)
p
| To process a stream of social media posts, we can use
| #[+api("language#pipe") #[code Language.pipe()]], which will return a
| stream of #[code Doc] objects that we can pass to
| #[+api("matcher#pipe") #[code Matcher.pipe()]].
+code.
docs = nlp.pipe(LOTS_OF_TWEETS)
matches = matcher.pipe(docs)