Draft docs page on memory management
This commit is contained in: parent 114b4894fb, commit dff18d3584
website/docs/usage/memory-management.mdx (new file, 121 lines added)

---
title: Memory Management
teaser: Managing Memory for Persistent Services
version: 3.8
menu:
  - ['Memory Zones', 'memoryzones']
---

spaCy maintains a few internal caches that improve speed but cause memory
usage to increase slightly over time. If you're running a batch process that
doesn't need to be long-lived, the increase in memory usage generally isn't a
problem. However, if you're running spaCy inside a web service, you'll often
want spaCy's memory usage to stay consistent.

Transformer models can also run into memory problems, especially when used on
a GPU.

## Memory zones {id="memoryzones"}

You can tell spaCy to free data from its internal caches (especially the
`Vocab`) using the `Language.memory_zone()` context manager. Enter the
context manager and process your text within it, and spaCy will reset its
internal caches (freeing up the associated memory) at the end of the block.
spaCy objects created inside the memory zone must not be accessed once the
memory zone is finished.

```python {title="Memory Zone example"}
from collections import Counter

def count_words(nlp, texts):
    counts = Counter()
    with nlp.memory_zone():
        for doc in nlp.pipe(texts):
            for token in doc:
                counts[token.text] += 1
    return counts
```

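A minimal usage sketch for the function above, assuming the small English
pipeline `en_core_web_sm` is installed; the example texts are arbitrary:

```python {title="Memory Zone usage sketch"}
import spacy

nlp = spacy.load("en_core_web_sm")
# count_words processes the texts inside a memory zone and returns only
# plain Python data (a Counter of strings), so it is safe to use afterwards.
word_counts = count_words(nlp, ["The cat sat on the mat.", "The cat slept."])
print(word_counts.most_common(3))
```
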
<Infobox title="Important note" variant="warning">

Exiting the memory zone invalidates all `Doc`, `Token`, `Span` and `Lexeme`
objects that were created within it. If you access these objects after the
memory zone exits, you may encounter a segmentation fault due to invalid
memory access.

</Infobox>

spaCy needs the memory zone context manager because the processing pipeline
can't keep track of which `Doc` objects are referring to data in the shared
`Vocab` cache. For instance, when spaCy encounters a new word, a new `Lexeme`
entry is stored in the `Vocab`, and the `Doc` object points to this shared
data. When the `Doc` goes out of scope, the `Vocab` has no way of knowing that
this `Lexeme` is no longer in use. The memory zone solves this problem by
allowing the programmer to tell the processing pipeline that all data created
between two points is no longer in use. It is up to the developer to honour
this agreement. If you access objects that are supposed to no longer be in
use, you may encounter a segmentation fault due to invalid memory access.

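To illustrate the contract, here is a minimal sketch of the difference between
letting a `Doc` escape a memory zone and copying the data you need into plain
Python objects before the zone ends (the pipeline name `en_core_web_sm` is
just an example):

```python {title="Copying data out of a memory zone"}
import spacy

nlp = spacy.load("en_core_web_sm")

with nlp.memory_zone():
    doc = nlp("This Doc is backed by temporary cache entries.")
# Unsafe: `doc` has escaped the zone, so reading it here (e.g. doc.text)
# may touch memory that has already been freed.

with nlp.memory_zone():
    # Safe: copy what you need into plain Python objects before the zone ends.
    texts = [token.text for token in nlp("Copy out plain strings instead.")]
print(texts)
```
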
A common use case for memory zones will be within a web service. The
processing pipeline can be loaded once, either as a context variable or a
global, and each request can be handled within a memory zone:

```python {title="Memory Zone FastAPI"}
from fastapi import FastAPI, APIRouter, Depends, Request
import spacy
from spacy.language import Language

router = APIRouter()


def make_app():
    app = FastAPI()
    app.state.NLP = spacy.load("en_core_web_sm")
    app.include_router(router)
    return app


def get_nlp(request: Request) -> Language:
    return request.app.state.NLP


@router.post("/parse")
def parse_texts(
    *, text_batch: list[str], nlp: Language = Depends(get_nlp)
) -> list[dict]:
    with nlp.memory_zone():
        # Put the spaCy call within a separate function, so we can't
        # leak the Doc objects outside the scope of the memory zone.
        output = _process_text(nlp, text_batch)
    return output


def _process_text(nlp: Language, texts: list[str]) -> list[dict]:
    """Call spaCy, and transform the output into our own data
    structures. This function is called from inside a memory
    zone, so must not return the spaCy objects.
    """
    docs = list(nlp.pipe(texts))
    return [
        {
            "tokens": [{"text": t.text} for t in doc],
            "entities": [
                {"start": e.start, "end": e.end, "label": e.label_} for e in doc.ents
            ],
        }
        for doc in docs
    ]


app = make_app()
```

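For local testing, a service like this could be run with an ASGI server. A
minimal sketch using `uvicorn` (an assumption; any ASGI server works), placed
in the same module as the code above:

```python {title="Serving the app (sketch)"}
import uvicorn

if __name__ == "__main__":
    # Run the FastAPI app defined above on localhost for local testing.
    uvicorn.run(app, host="127.0.0.1", port=8000)
```
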
## Clearing transformer tensors and other Doc attributes

The `transformer` and `tok2vec` components set intermediate values onto the
`Doc` object during parsing. This can cause GPU memory to be exhausted if
many `Doc` objects are kept in memory together.

To resolve this, you can add the `doc_cleaner` component to your pipeline. By
default this will clean up the `trf_data` extension attribute and the
`doc.tensor` attribute. You can have it clean up other intermediate extension
attributes you use in custom pipeline components as well.

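A minimal sketch of adding the component, assuming a transformer pipeline such
as `en_core_web_trf` is installed; the `_.my_cache` attribute in the
commented-out variant is a hypothetical custom extension used only for
illustration:

```python {title="doc_cleaner sketch"}
import spacy

nlp = spacy.load("en_core_web_trf")

# By default, doc_cleaner resets doc.tensor and the doc._.trf_data extension
# attribute after the other pipeline components have run.
nlp.add_pipe("doc_cleaner", last=True)

# To also reset a custom extension attribute, list it in the config instead:
# nlp.add_pipe(
#     "doc_cleaner",
#     config={"attrs": {"tensor": None, "_.trf_data": None, "_.my_cache": None}},
#     last=True,
# )
```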