Mirror of https://github.com/explosion/spaCy.git

Merge branch 'master' into spacy.io

Commit f653e1bbea
@@ -788,11 +788,11 @@ token pattern covering the exact tokenization of the term.
To create the patterns, each phrase has to be processed with the `nlp` object.
If you have a model loaded, doing this in a loop or list comprehension can
easily become inefficient and slow. If you **only need the tokenization and
lexical attributes**, you can run [`nlp.make_doc`](/api/language#make_doc)
instead, which will only run the tokenizer. For an additional speed boost, you
can also use the [`nlp.tokenizer.pipe`](/api/tokenizer#pipe) method, which will
process the texts as a stream.
```diff
- patterns = [nlp(term) for term in LOTS_OF_TERMS]
+ patterns = [nlp.make_doc(term) for term in LOTS_OF_TERMS]
```
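For context outside the diff, here is a minimal runnable sketch of the three
approaches the paragraph above compares. The `en_core_web_sm` pipeline and the
contents of `LOTS_OF_TERMS` are stand-ins, and `matcher.add` uses the spaCy v3
signature (v2 used `add(key, None, *patterns)`):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")  # stand-in; any loaded pipeline works

# Stand-in for a large terminology list
LOTS_OF_TERMS = ["machine learning", "deep learning", "natural language processing"]

# Slow: runs the full pipeline (tagger, parser, ...) on every term
# patterns = [nlp(term) for term in LOTS_OF_TERMS]

# Faster: nlp.make_doc only runs the tokenizer
patterns = [nlp.make_doc(term) for term in LOTS_OF_TERMS]

# Faster still: tokenize all the terms as a stream
patterns = list(nlp.tokenizer.pipe(LOTS_OF_TERMS))

matcher = PhraseMatcher(nlp.vocab)
matcher.add("TerminologyList", patterns)  # v3 signature; v2: add(key, None, *patterns)
```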
@@ -825,6 +825,20 @@ for match_id, start, end in matcher(doc):

```python
for match_id, start, end in matcher(doc):
    print("Matched based on lowercase token text:", doc[start:end])
```
<Infobox title="Important note on creating patterns" variant="warning">

The examples here use [`nlp.make_doc`](/api/language#make_doc) to create `Doc`
object patterns as efficiently as possible and without running any of the other
pipeline components. If the token attributes you want to match on are set by a
pipeline component, **make sure that the pipeline component runs** when you
create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc`
objects need to have part-of-speech tags set by the `tagger`. You can either
call the `nlp` object on your pattern texts instead of `nlp.make_doc`, or use
[`nlp.disable_pipes`](/api/language#disable_pipes) to disable components
selectively.

</Infobox>
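As a hedged illustration of the note above: to match on `LEMMA`, create the
pattern `Doc`s with the full `nlp` object so the components that set lemmas
actually run. The pipeline name and phrases below are assumptions, and
`disable_pipes` is the API this page references (spaCy v3 renamed it to
`nlp.select_pipes`):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")  # assumed pipeline with a tagger/lemmatizer

# Matching on LEMMA means the pattern Docs need lemmas too, so call nlp(),
# not nlp.make_doc(), when creating the patterns.
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
patterns = [nlp(text) for text in ["buy a house", "sell a house"]]
matcher.add("REAL_ESTATE", patterns)  # v3 signature; v2: add(key, None, *patterns)

doc = nlp("She bought a house last year.")
for match_id, start, end in matcher(doc):
    print("Matched on lemma:", doc[start:end])

# To skip components you don't need while creating patterns (v2-era API):
with nlp.disable_pipes("parser", "ner"):
    patterns = [nlp(text) for text in ["buy a house"]]
```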
Another possible use case is matching number tokens like IP addresses based on
their shape. This means that you won't have to worry about how those strings
will be tokenized and you'll be able to find tokens and combinations of tokens
based on their shape.
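A sketch of what shape-based matching can look like, assuming a `PhraseMatcher`
with `attr="SHAPE"` and made-up IP addresses; since `SHAPE` is a lexical
attribute set by the tokenizer, `nlp.make_doc` is sufficient for the patterns:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")  # assumed pipeline; only tokenization is needed

# attr="SHAPE" makes the PhraseMatcher compare token shapes (e.g. "ddd.d.d.d")
# instead of verbatim text, so any IP address with a matching shape is found.
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
matcher.add("IP", [nlp.make_doc("127.0.0.1"), nlp.make_doc("127.127.0.0")])

doc = nlp("The router's address is usually 192.168.1.1, sometimes 192.168.2.1.")
for match_id, start, end in matcher(doc):
    print("Matched based on token shape:", doc[start:end])
```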