//- 💫 DOCS > USAGE > RULE-BASED MATCHING

p
    |  spaCy features a rule-matching engine, the #[+api("matcher") #[code Matcher]],
    |  that operates over tokens, similar
    |  to regular expressions. The rules can refer to token annotations (e.g.
    |  the token #[code text] or #[code tag_]) and flags (e.g. #[code IS_PUNCT]).
    |  The rule matcher also lets you pass in a custom callback
    |  to act on matches – for example, to merge entities and apply custom labels.
    |  You can also associate patterns with entity IDs, to allow some basic
    |  entity linking or disambiguation. To match large terminology lists,
    |  you can use the #[+api("phrasematcher") #[code PhraseMatcher]], which
    |  accepts #[code Doc] objects as match patterns.

+h(3, "adding-patterns") Adding patterns

p
    |  Let's say we want to enable spaCy to find a combination of three tokens:

+list("numbers")
    +item
        |  A token whose #[strong lowercase form matches "hello"], e.g. "Hello"
        |  or "HELLO".
    +item
        |  A token whose #[strong #[code is_punct] flag is set to #[code True]],
        |  i.e. any punctuation.
    +item
        |  A token whose #[strong lowercase form matches "world"], e.g. "World"
        |  or "WORLD".

+code.
    [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]

p
    |  First, we initialise the #[code Matcher] with a vocab. The matcher must
    |  always share the same vocab with the documents it will operate on. We
    |  can now call #[+api("matcher#add") #[code matcher.add()]] with an ID and
    |  our custom pattern. The second argument lets you pass in an optional
    |  callback function to invoke on a successful match. For now, we set it
    |  to #[code None].

+code-exec.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    # add match ID "HelloWorld" with no callback and one pattern
    pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
    matcher.add('HelloWorld', None, pattern)

    doc = nlp(u'Hello, world! Hello world!')
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # get string representation
        span = doc[start:end]  # the matched span
        print(match_id, string_id, start, end, span.text)

p
    |  The matcher returns a list of #[code (match_id, start, end)] tuples – in
    |  this case, #[code [('15578876784678163569', 0, 2)]], which maps to the
    |  span #[code doc[0:2]] of our original document. The #[code match_id]
    |  is the #[+a("/usage/spacy-101#vocab") hash value] of the string ID
    |  "HelloWorld". To get the string value, you can look up the ID
    |  in the #[+api("stringstore") #[code StringStore]].

+code.
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
        span = doc[start:end]                    # the matched span

p
    |  Optionally, we could also choose to add more than one pattern, for
    |  example to also match sequences without punctuation between "hello" and
    |  "world":

+code.
    matcher.add('HelloWorld', None,
                [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
                [{'LOWER': 'hello'}, {'LOWER': 'world'}])

p
    |  By default, the matcher will only return the matches and
    |  #[strong not do anything else], like merge entities or assign labels.
    |  This is all up to you and can be defined individually for each pattern,
    |  by passing in a callback function as the #[code on_match] argument on
    |  #[code add()]. This is useful, because it lets you write entirely custom
    |  and #[strong pattern-specific logic]. For example, you might want to
    |  merge #[em some] patterns into one token, while adding entity labels for
    |  other pattern types. You shouldn't have to create different matchers for
    |  each of those processes.

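p
    |  As a rough sketch of that idea (the labels, patterns and example text
    |  below are invented for illustration), you can register two patterns with
    |  two different #[code on_match] callbacks on the same matcher – one that
    |  adds an entity, and one that only reports the match:

+code.
    from spacy.lang.en import English
    from spacy.matcher import Matcher

    nlp = English()  # tokenizer only, to keep the sketch self-contained
    matcher = Matcher(nlp.vocab)
    GPE = nlp.vocab.strings.add('GPE')  # make sure the label string is in the store

    def label_city(matcher, doc, i, matches):
        # callback for the city pattern: append an entity to doc.ents
        match_id, start, end = matches[i]
        doc.ents += ((GPE, start, end),)

    def report_greeting(matcher, doc, i, matches):
        # callback for the greeting pattern: just report the match
        match_id, start, end = matches[i]
        print('Greeting found:', doc[start:end].text)

    matcher.add('CITY', label_city, [{'LOWER': 'new'}, {'LOWER': 'york'}])
    matcher.add('GREETING', report_greeting, [{'LOWER': 'hello'}])

    doc = nlp(u"Hello! I'm flying to New York.")
    matcher(doc)
    print([(ent.text, ent.label_) for ent in doc.ents])
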
+h(4, "adding-patterns-attributes") Available token attributes

p
    |  The available token pattern keys are uppercase versions of the
    |  #[+api("token#attributes") #[code Token] attributes]. The most relevant
    |  ones for rule-based matching are:

+table(["Attribute", "Description"])
    +row
        +cell #[code ORTH]
        +cell The exact verbatim text of a token.

    +row
        +cell.u-nowrap #[code LOWER]
        +cell The lowercase form of the token text.

    +row
        +cell #[code LENGTH]
        +cell The length of the token text.

    +row
        +cell.u-nowrap #[code IS_ALPHA], #[code IS_ASCII], #[code IS_DIGIT]
        +cell
            |  Token text consists of alphabetic characters, ASCII characters,
            |  digits.

    +row
        +cell.u-nowrap #[code IS_LOWER], #[code IS_UPPER], #[code IS_TITLE]
        +cell Token text is in lowercase, uppercase, titlecase.

    +row
        +cell.u-nowrap #[code IS_PUNCT], #[code IS_SPACE], #[code IS_STOP]
        +cell Token is punctuation, whitespace, stop word.

    +row
        +cell.u-nowrap #[code LIKE_NUM], #[code LIKE_URL], #[code LIKE_EMAIL]
        +cell Token text resembles a number, URL, email.

    +row
        +cell.u-nowrap
            |  #[code POS], #[code TAG], #[code DEP], #[code LEMMA],
            |  #[code SHAPE]
        +cell
            |  The token's simple and extended part-of-speech tag, dependency
            |  label, lemma, shape.

    +row
        +cell.u-nowrap #[code ENT_TYPE]
        +cell The token's entity label.

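p
    |  Note that several attribute keys can be combined in a single token
    |  description, in which case all of them have to match on the same token.
    |  For example, the following pattern (invented here purely for
    |  illustration) matches a proper noun whose lowercase form is "apple",
    |  followed by a form of "be", followed by an adjective:

+code.
    # illustrative sketch: LOWER and POS combined in one token description
    [{'LOWER': 'apple', 'POS': 'PROPN'}, {'LEMMA': 'be'}, {'POS': 'ADJ'}]
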
+h(4, "adding-patterns-wildcard") Using wildcard token patterns
    +tag-new(2)

p
    |  While the token attributes offer many options to write highly specific
    |  patterns, you can also use an empty dictionary, #[code {}], as a wildcard
    |  representing #[strong any token]. This is useful if you know the context
    |  of what you're trying to match, but very little about the specific token
    |  and its characters. For example, let's say you're trying to extract
    |  people's user names from your data. All you know is that they are listed
    |  as "User name: {username}". The name itself may contain any character,
    |  but no whitespace – so you'll know it will be handled as one token.

+code.
    [{'ORTH': 'User'}, {'ORTH': 'name'}, {'ORTH': ':'}, {}]

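p
    |  As a quick sketch of how this could look in practice (the example text
    |  and match key are made up for illustration), the wildcard token becomes
    |  part of the matched span, so the user name is simply its last token:

+code.
    from spacy.lang.en import English
    from spacy.matcher import Matcher

    nlp = English()  # only the tokenizer is needed here
    matcher = Matcher(nlp.vocab)
    matcher.add('USERNAME', None,
                [{'ORTH': 'User'}, {'ORTH': 'name'}, {'ORTH': ':'}, {}])

    doc = nlp(u"User name: Alice and User name: bob_99")
    for match_id, start, end in matcher(doc):
        print(doc[end - 1].text)  # 'Alice', 'bob_99'
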
+h(4, "quantifiers") Using operators and quantifiers

p
    |  The matcher also lets you use quantifiers, specified as the #[code 'OP']
    |  key. Quantifiers let you define sequences of tokens to be matched, e.g.
    |  one or more punctuation marks, or specify optional tokens. Note that there
    |  are no nested or scoped quantifiers – instead, you can build those
    |  behaviours with #[code on_match] callbacks.

+table(["OP", "Description"])
    +row
        +cell #[code !]
        +cell Negate the pattern, by requiring it to match exactly 0 times.

    +row
        +cell #[code ?]
        +cell Make the pattern optional, by allowing it to match 0 or 1 times.

    +row
        +cell #[code +]
        +cell Require the pattern to match 1 or more times.

    +row
        +cell #[code *]
        +cell Allow the pattern to match zero or more times.

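p
    |  For example, here's a small sketch of the #[code +] operator (the
    |  example text and match key are invented for illustration): the pattern
    |  matches "hello", one or more punctuation tokens, then "world", so runs
    |  of punctuation of any length are covered by a single pattern:

+code.
    from spacy.lang.en import English
    from spacy.matcher import Matcher

    nlp = English()
    matcher = Matcher(nlp.vocab)
    pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '+'}, {'LOWER': 'world'}]
    matcher.add('HelloWorld', None, pattern)

    doc = nlp(u"Hello, world! Hello!!! world!")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)  # e.g. 'Hello, world' and 'Hello!!! world'
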
p
    |  In versions before v2.1.0, the semantics of the #[code +] and #[code *]
    |  operators behaved inconsistently. They were usually interpreted
    |  "greedily", i.e. longer matches were returned where possible. However, if
    |  you specified two #[code +] or #[code *] patterns in a row and their
    |  matches overlapped, the first operator would behave non-greedily. This
    |  quirk in the semantics is corrected in spaCy v2.1.0.

+h(3, "adding-phrase-patterns") Adding phrase patterns

p
    |  If you need to match large terminology lists, you can also use the
    |  #[+api("phrasematcher") #[code PhraseMatcher]] and create
    |  #[+api("doc") #[code Doc]] objects instead of token patterns, which is
    |  much more efficient overall. The #[code Doc] patterns can contain single
    |  or multiple tokens.

+code-exec.
    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load('en_core_web_sm')
    matcher = PhraseMatcher(nlp.vocab)
    terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
    patterns = [nlp(text) for text in terminology_list]
    matcher.add('TerminologyList', None, *patterns)

    doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
              u"converse in the Oval Office inside the White House in Washington, D.C.")
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)

p
    |  Since spaCy is used for processing both the patterns and the text to be
    |  matched, you won't have to worry about specific tokenization – for
    |  example, you can simply pass in #[code nlp(u"Washington, D.C.")] and
    |  won't have to write a complex token pattern covering the exact
    |  tokenization of the term.

+h(3, "on_match") Adding #[code on_match] rules

p
    |  To move on to a more realistic example, let's say you're working with a
    |  large corpus of blog articles, and you want to match all mentions of
    |  "Google I/O" (which spaCy tokenizes as #[code ['Google', 'I', '/', 'O']]).
    |  To be safe, you only match on the uppercase versions, in case someone has
    |  written it as "Google i/o". You also add a second pattern with an added
    |  #[code {IS_DIGIT: True}] token – this will make sure you also match on
    |  "Google I/O 2017". If your pattern matches, spaCy should execute your
    |  custom callback function #[code add_event_ent].

+code-exec.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)

    # Get the ID of the 'EVENT' entity type. This is required to set an entity.
    EVENT = nlp.vocab.strings['EVENT']

    def add_event_ent(matcher, doc, i, matches):
        # Get the current match and create tuple of entity label, start and end.
        # Append entity to the doc's entities. (Don't overwrite doc.ents!)
        match_id, start, end = matches[i]
        entity = (EVENT, start, end)
        doc.ents += (entity,)
        print(doc[start:end].text, entity)

    matcher.add('GoogleIO', add_event_ent,
                [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}],
                [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}, {'IS_DIGIT': True}])

    doc = nlp(u"This is a text about Google I/O 2015.")
    matches = matcher(doc)

+aside("Tip: Visualizing matches")
    |  When working with entities, you can use #[+api("top-level#displacy") displaCy]
    |  to quickly generate a NER visualization from your updated #[code Doc],
    |  which can be exported as an HTML file:

    +code.o-no-block.
        from spacy import displacy
        html = displacy.render(doc, style='ent', page=True,
                               options={'ents': ['EVENT']})

    |  For more info and examples, see the usage guide on
    |  #[+a("/usage/visualizers") visualizing spaCy].

p
    |  We can now call the matcher on our documents. The patterns will be
    |  matched in the order they occur in the text. The matcher will then
    |  iterate over the matches, look up the callback for the match ID
    |  that was matched, and invoke it.

+code.
    doc = nlp(YOUR_TEXT_HERE)
    matcher(doc)

p
    |  When the callback is invoked, it is
    |  passed four arguments: the matcher itself, the document, the position of
    |  the current match, and the total list of matches. This allows you to
    |  write callbacks that consider the entire set of matched phrases, so that
    |  you can resolve overlaps and other conflicts in whatever way you prefer.

+table(["Argument", "Type", "Description"])
    +row
        +cell #[code matcher]
        +cell #[code Matcher]
        +cell The matcher instance.

    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The document the matcher was used on.

    +row
        +cell #[code i]
        +cell int
        +cell Index of the current match (#[code matches[i]]).

    +row
        +cell #[code matches]
        +cell list
        +cell
            |  A list of #[code (match_id, start, end)] tuples, describing the
            |  matches. A match tuple describes a span #[code doc[start:end]].

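p
    |  For example, here's a minimal sketch (the function name is invented for
    |  illustration) of a callback that ignores a match if another match in the
    |  list fully contains it, and only acts on the longest ones:

+code.
    def keep_longest(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        for other_id, other_start, other_end in matches:
            # another match fully contains this one and is strictly longer
            if (other_start <= start and end <= other_end
                    and (other_end - other_start) > (end - start)):
                return  # skip this match
        print('Keeping:', doc[start:end].text)
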
+h(3, "matcher-pipeline") Using custom pipeline components

p
    |  Let's say your data also contains some annoying pre-processing artefacts,
    |  like leftover HTML line breaks (e.g. #[code <br>] or
    |  #[code <BR/>]). To make your text easier to analyse, you want to
    |  merge those into one token and flag them, to make sure you
    |  can ignore them later. Ideally, this should all be done automatically
    |  as you process the text. You can achieve this by adding a
    |  #[+a("/usage/processing-pipelines#custom-components") custom pipeline component]
    |  that's called on each #[code Doc] object, merges the leftover HTML spans
    |  and sets an attribute #[code bad_html] on the token.

+code-exec.
    import spacy
    from spacy.matcher import Matcher
    from spacy.tokens import Token

    # we're using a class because the component needs to be initialised with
    # the shared vocab via the nlp object
    class BadHTMLMerger(object):
        def __init__(self, nlp):
            # register a new token extension to flag bad HTML
            Token.set_extension('bad_html', default=False)
            self.matcher = Matcher(nlp.vocab)
            self.matcher.add('BAD_HTML', None,
                [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
                [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])

        def __call__(self, doc):
            # this method is invoked when the component is called on a Doc
            matches = self.matcher(doc)
            spans = []  # collect the matched spans here
            for match_id, start, end in matches:
                spans.append(doc[start:end])
            for span in spans:
                span.merge()  # merge the span into a single token
                for token in span:
                    token._.bad_html = True  # mark token as bad HTML
            return doc

    nlp = spacy.load('en_core_web_sm')
    html_merger = BadHTMLMerger(nlp)
    nlp.add_pipe(html_merger, last=True)  # add component to the pipeline
    doc = nlp(u"Hello<br>world! <br/> This is a test.")
    for token in doc:
        print(token.text, token._.bad_html)

p
    |  Instead of hard-coding the patterns into the component, you could also
    |  make it take a path to a JSON file containing the patterns. This lets
    |  you reuse the component with different patterns, depending on your
    |  application:

+code.
    html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json')

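p
    |  A rough sketch of what that constructor could look like, assuming the
    |  JSON file simply contains a list of token patterns (the exact file
    |  layout is up to you and only illustrative here):

+code.
    import json
    from spacy.matcher import Matcher
    from spacy.tokens import Token

    class BadHTMLMerger(object):
        def __init__(self, nlp, path):
            Token.set_extension('bad_html', default=False)
            self.matcher = Matcher(nlp.vocab)
            with open(path) as file_:
                # e.g. [[{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}], ...]
                patterns = json.load(file_)
            self.matcher.add('BAD_HTML', None, *patterns)
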
+infobox
    |  For more details and examples of how to
    |  #[strong create custom pipeline components] and
    |  #[strong extension attributes], see the
    |  #[+a("/usage/processing-pipelines") usage guide].

+h(3, "regex") Using regular expressions

p
    |  In some cases, only matching tokens and token attributes isn't enough –
    |  for example, you might want to match different spellings of a word,
    |  without having to add a new pattern for each spelling. A simple solution
    |  is to match a regular expression on the #[code Doc]'s #[code text] and
    |  use the #[+api("doc#char_span") #[code Doc.char_span]] method to
    |  create a #[code Span] from the character indices of the match:

+code-exec.
    import spacy
    import re

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')

    DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely')

    for match in re.finditer(DEFINITELY_PATTERN, doc.text):
        start, end = match.span()         # get matched indices
        span = doc.char_span(start, end)  # create Span from indices
        print(span.text)

p
    |  You can also use the regular expression with spaCy's #[code Matcher] by
    |  converting it to a token flag. To ensure efficiency, the
    |  #[code Matcher] can only access the C-level data. This means that it can
    |  either use built-in token attributes or #[strong binary flags].
    |  #[+api("vocab#add_flag") #[code Vocab.add_flag]] returns a flag ID which
    |  you can use as a key of a token match pattern. Tokens that match the
    |  regular expression will return #[code True] for the #[code IS_DEFINITELY]
    |  flag.

+code-exec.
    import spacy
    from spacy.matcher import Matcher
    import re

    nlp = spacy.load('en_core_web_sm')
    definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
    IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)

    matcher = Matcher(nlp.vocab)
    matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])

    doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)

p
    |  Providing the regular expressions as binary flags also lets you use them
    |  in combination with other token patterns – for example, to match the
    |  word "definitely" in various spellings, followed by a case-insensitive
    |  "not" and an adjective:

+code.
    [{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}]

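p
    |  Putting the pieces together, here's a short sketch of that combined
    |  pattern in use (the example text and match key are invented for
    |  illustration, and the adjective tag comes from the statistical model):

+code.
    import spacy
    from spacy.matcher import Matcher
    import re

    nlp = spacy.load('en_core_web_sm')
    definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
    IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)

    matcher = Matcher(nlp.vocab)
    matcher.add('DEFINITELY_NOT_ADJ', None,
                [{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}])

    doc = nlp(u"This is definately not funny.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)  # e.g. 'definately not funny'
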
+h(3, "example1") Example: Using linguistic annotations

p
    |  Let's say you're analysing user comments and you want to find out what
    |  people are saying about Facebook. You want to start off by finding
    |  adjectives following "Facebook is" or "Facebook was". This is obviously
    |  a very rudimentary solution, but it'll be fast, and a great way to get an
    |  idea for what's in your data. Your pattern could look like this:

+code.
    [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]

p
    |  This translates to a token whose lowercase form matches "facebook"
    |  (like Facebook, facebook or FACEBOOK), followed by a token with the lemma
    |  "be" (for example, is, was, or 's), followed by an #[strong optional] adverb,
    |  followed by an adjective. Using the linguistic annotations here is
    |  especially useful, because you can tell spaCy to match "Facebook's
    |  annoying", but #[strong not] "Facebook's annoying ads". The optional
    |  adverb makes sure you won't miss adjectives with intensifiers, like
    |  "pretty awful" or "very nice".

p
    |  To get a quick overview of the results, you could collect all sentences
    |  containing a match and render them with the
    |  #[+a("/usage/visualizers") displaCy visualizer].
    |  In the callback function, you'll have access to the #[code start] and
    |  #[code end] of each match, as well as the parent #[code Doc]. This lets
    |  you determine the sentence containing the match,
    |  #[code doc[start : end].sent], and calculate the start and end of the
    |  matched span within the sentence. Using displaCy in
    |  #[+a("/usage/visualizers#manual-usage") "manual" mode] lets you
    |  pass in a list of dictionaries containing the text and entities to render.

+code-exec.
    import spacy
    from spacy import displacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    matched_sents = []  # collect data of matched sentences to be visualized

    def collect_sents(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        span = doc[start : end]  # matched span
        sent = span.sent  # sentence containing matched span
        # append mock entity for match in displaCy style to matched_sents
        # get the match span by offsetting the start and end of the span with the
        # start and end of the sentence in the doc
        match_ents = [{'start': span.start_char - sent.start_char,
                       'end': span.end_char - sent.start_char,
                       'label': 'MATCH'}]
        matched_sents.append({'text': sent.text, 'ents': match_ents})

    pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
               {'POS': 'ADJ'}]
    matcher.add('FacebookIs', collect_sents, pattern)  # add pattern
    doc = nlp(u"I'd say that Facebook is evil. – Facebook is pretty cool, right?")
    matches = matcher(doc)

    # serve visualization of sentences containing match with displaCy
    # set manual=True to make displaCy render straight from a dictionary
    # (if you're not running the code within a Jupyter environment, you can
    # remove jupyter=True and use displacy.serve instead)
    displacy.render(matched_sents, style='ent', manual=True, jupyter=True)

+h(3, "example2") Example: Phone numbers

p
    |  Phone numbers can have many different formats and matching them is often
    |  tricky. During tokenization, spaCy will leave sequences of numbers intact
    |  and only split on whitespace and punctuation. This means that your match
    |  pattern will have to look out for number sequences of a certain length,
    |  surrounded by specific punctuation – depending on the
    |  #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers") national conventions].

p
    |  The #[code IS_DIGIT] flag is not very helpful here, because it doesn't
    |  tell us anything about the length. However, you can use the #[code SHAPE]
    |  flag, with each #[code d] representing a digit:

+code.
    [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
     {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]

p
    |  This will match phone numbers of the format #[strong (123) 4567 8901] or
    |  #[strong (123) 4567-8901]. To also match formats like #[strong (123) 456 789],
    |  you can add a second pattern using #[code 'ddd'] in place of #[code 'dddd'].
    |  By hard-coding some values, you can match only certain, country-specific
    |  numbers. For example, here's a pattern to match the most common formats of
    |  #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#Germany") international German numbers]:

+code.
    [{'ORTH': '+'}, {'ORTH': '49'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'dddd'},
     {'ORTH': ')', 'OP': '?'}, {'SHAPE': 'dddddd'}]

p
    |  Depending on the formats your application needs to match, creating an
    |  extensive set of rules like this is often better than training a model.
    |  It'll produce more predictable results, is much easier to modify and
    |  extend, and doesn't require any training data – only a set of
    |  test cases.

+code-exec.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'ddd'},
               {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}]
    matcher.add('PHONE_NUMBER', None, pattern)

    doc = nlp(u"Call me at (123) 456 789 or (123) 456 789!")
    print([t.text for t in doc])
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)

+h(3, "example3") Example: Hashtags and emoji on social media

p
    |  Social media posts, especially tweets, can be difficult to work with.
    |  They're very short and often contain various emoji and hashtags. By only
    |  looking at the plain text, you'll lose a lot of valuable semantic
    |  information.

p
    |  Let's say you've extracted a large sample of social media posts on a
    |  specific topic, for example posts mentioning a brand name or product.
    |  As the first step of your data exploration, you want to filter out posts
    |  containing certain emoji and use them to assign a general sentiment
    |  score, based on whether the expressed emotion is positive or negative,
    |  e.g. #[span.o-icon.o-icon--inline 😀] or #[span.o-icon.o-icon--inline 😞].
    |  You also want to find, merge and label hashtags like
    |  #[code #MondayMotivation], to be able to ignore or analyse them later.

+aside("Note on sentiment analysis")
    |  Ultimately, sentiment analysis is not always #[em that] easy. In
    |  addition to the emoji, you'll also want to take specific words into
    |  account and check the #[code subtree] for intensifiers like "very", to
    |  increase the sentiment score. At some point, you might also want to train
    |  a sentiment model. However, the approach described in this example is
    |  very useful for #[strong bootstrapping rules to collect training data].
    |  It's also an incredibly fast way to gather first insights into your data
    |  – with about 1 million tweets, you'd be looking at a processing time of
    |  #[strong under 1 minute].

p
    |  By default, spaCy's tokenizer will split emoji into separate tokens. This
    |  means that you can create a pattern for one or more emoji tokens.
    |  Valid hashtags usually consist of a #[code #], plus a sequence of
    |  ASCII characters with no whitespace, making them easy to match as well.

+code-exec.
    from spacy.lang.en import English
    from spacy.matcher import Matcher

    nlp = English()  # we only want the tokenizer, so no need to load a model
    matcher = Matcher(nlp.vocab)

    pos_emoji = [u'😀', u'😃', u'😂', u'🤣', u'😊', u'😍']  # positive emoji
    neg_emoji = [u'😞', u'😠', u'😩', u'😢', u'😭', u'😒']  # negative emoji

    # add patterns to match one or more emoji tokens
    pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
    neg_patterns = [[{'ORTH': emoji}] for emoji in neg_emoji]

    # function to label the sentiment
    def label_sentiment(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        if doc.vocab.strings[match_id] == 'HAPPY':  # don't forget to get string!
            doc.sentiment += 0.1  # add 0.1 for positive sentiment
        elif doc.vocab.strings[match_id] == 'SAD':
            doc.sentiment -= 0.1  # subtract 0.1 for negative sentiment

    matcher.add('HAPPY', label_sentiment, *pos_patterns)  # add positive pattern
    matcher.add('SAD', label_sentiment, *neg_patterns)  # add negative pattern

    # add pattern for valid hashtag, i.e. '#' plus any ASCII token
    matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

    doc = nlp(u"Hello world 😀 #MondayMotivation")
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = doc.vocab.strings[match_id]  # look up string ID
        span = doc[start:end]
        print(string_id, span.text)

p
    |  Because the #[code on_match] callback receives the ID of each match, you
    |  can use the same function to handle the sentiment assignment for both
    |  the positive and negative pattern. To keep it simple, we'll either add
    |  or subtract #[code 0.1] points – this way, the score will also reflect
    |  combinations of emoji, even positive #[em and] negative ones.

p
    |  With a library like
    |  #[+a("https://github.com/bcongdon/python-emojipedia") Emojipedia],
    |  we can also retrieve a short description for each emoji – for example,
    |  #[span.o-icon.o-icon--inline 😍]'s official title is "Smiling Face With
    |  Heart-Eyes". Assigning it to a
    |  #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute]
    |  on the emoji span will make it available as #[code span._.emoji_desc].

+code.
    from emojipedia import Emojipedia  # installation: pip install emojipedia
    from spacy.tokens import Span  # get the global Span object

    Span.set_extension('emoji_desc', default=None)  # register the custom attribute

    def label_sentiment(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        if doc.vocab.strings[match_id] == 'HAPPY':  # don't forget to get string!
            doc.sentiment += 0.1  # add 0.1 for positive sentiment
        elif doc.vocab.strings[match_id] == 'SAD':
            doc.sentiment -= 0.1  # subtract 0.1 for negative sentiment
        span = doc[start : end]
        emoji = Emojipedia.search(span[0].text)  # get data for emoji
        span._.emoji_desc = emoji.title  # assign emoji description

p
    |  To label the hashtags, we can use a
    |  #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute]
    |  set on the respective token:

+code-exec.
    import spacy
    from spacy.matcher import Matcher
    from spacy.tokens import Token

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)

    # add pattern for valid hashtag, i.e. '#' plus any ASCII token
    matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

    # register token extension
    Token.set_extension('is_hashtag', default=False)

    doc = nlp(u"Hello world 😀 #MondayMotivation")
    matches = matcher(doc)
    hashtags = []
    for match_id, start, end in matches:
        if doc.vocab.strings[match_id] == 'HASHTAG':
            hashtags.append(doc[start:end])
    for span in hashtags:
        span.merge()
        for token in span:
            token._.is_hashtag = True

    for token in doc:
        print(token.text, token._.is_hashtag)

p
    |  To process a stream of social media posts, we can use
    |  #[+api("language#pipe") #[code Language.pipe()]], which will return a
    |  stream of #[code Doc] objects that we can pass to
    |  #[+api("matcher#pipe") #[code Matcher.pipe()]].

+code.
    docs = nlp.pipe(LOTS_OF_TWEETS)
    matches = matcher.pipe(docs)