Mirror of https://github.com/explosion/spaCy.git (synced 2025-10-26 13:41:21 +03:00)
The parser training uses a trick for long documents: we use the oracle to cut the document into sections, so that we can have batch items that start in the middle of a document. For instance, if we have one document of 600 words, we might make 6 states, starting at words 0, 100, 200, 300, 400 and 500.

The problem is that in v3 I screwed this up and didn't stop parsing. So instead of a batch of [100, 100, 100, 100, 100, 100], we'd have a batch of [600, 500, 400, 300, 200, 100]. Oops.

The implementation here could probably be improved; it's annoying to have this extra variable in the state, but this'll do. This makes v3 parser training 5-10 times faster, depending on document lengths. This problem wasn't in v2.
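The trick above can be sketched as follows. This is a minimal illustration, not spaCy's actual implementation; the function name and the `max_length` parameter are hypothetical, chosen to match the 600-word example in the commit message.

```python
# Hypothetical sketch of the batching trick described above: a long document
# is split into several parser states that each cover only a bounded window,
# instead of each state parsing all the way to the end of the document.

def cut_states(doc_length, max_length=100):
    """Return (start, stop) word spans for batch states over a long document.

    Each state begins at a multiple of max_length and, crucially, also
    *stops* after max_length words. The v3 bug was omitting the stop, so
    every state parsed from its start point to the end of the document.
    """
    return [(start, min(start + max_length, doc_length))
            for start in range(0, doc_length, max_length)]

# A 600-word document becomes six 100-word batch items:
spans = cut_states(600)
print([stop - start for start, stop in spans])  # [100, 100, 100, 100, 100, 100]

# The buggy version had no stop condition, so each state ran to the end:
buggy = [600 - start for start in range(0, 600, 100)]
print(buggy)  # [600, 500, 400, 300, 200, 100]
```

The sum of the buggy batch is 2100 words of parsing versus 600 for the fixed version, which is where the reported 5-10x speedup on long documents comes from.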
Files in `spacy/pipeline`:

- _parser_internals
- __init__.py
- attributeruler.py
- dep_parser.pyx
- entity_linker.py
- entityruler.py
- functions.py
- lemmatizer.py
- morphologizer.pyx
- multitask.pyx
- ner.pyx
- pipe.pxd
- pipe.pyx
- sentencizer.pyx
- senter.pyx
- simple_ner.py
- tagger.pyx
- textcat.py
- tok2vec.py
- transition_parser.pxd
- transition_parser.pyx