spaCy/spacy/tokens
Jan Jessewitsch e4dcac4a4b
Merging multiple docs into one (#5032)
* Add static method to Doc to allow merging of multiple docs.

* Add error description for the error that occurs if docs with different
vocabs (from different languages) are merged in Doc.from_docs().

* Add test for Doc.from_docs() implementation.

* Fix using numpy's concatenate in Doc.from_docs.

* Replace typing's type annotations in from_docs.

* Simply remove type annotations in from_docs.

* Add documentation for Doc.from_docs to api.

* Simplify from_docs, its test and the api doc for codebase consistency.

* Fix merging of Doc objects that end with whitespace (achieved by simply not setting the SPACY attribute on whitespace tokens). Remove two unnecessary imports of attributes.

* Add merging of user data from Doc objects in from_docs. Add user data test case to corresponding test. Add applicable warning messages.

* Fix incorrect setting of token idx values by using concatenated spaces (again). Add test case to corresponding test.

* Add MORPH to attrs

* Update warnings calls

* Remove outdated error from merge

* Rename space_delimiter to ensure_whitespace

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-07-03 11:32:42 +02:00
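
The merge behaviour described in the commit message above (joining docs, optionally ensuring whitespace between them, and deriving token idx values from the concatenated spaces) can be illustrated with a small pure-Python sketch. This is not the code in doc.pyx: `SimpleDoc` and `merge_docs` are hypothetical names, a doc is reduced to a list of (word, trailing space) pairs, and only the `ensure_whitespace` parameter name is taken from the commit.

```python
from typing import List, Tuple

# Hypothetical stand-in for a Doc: a list of (word, has_trailing_space) pairs.
SimpleDoc = List[Tuple[str, bool]]


def merge_docs(docs: List[SimpleDoc], ensure_whitespace: bool = True):
    """Concatenate simplified docs and recompute character offsets.

    Illustrative only; it mirrors the ideas in the commit message,
    not the actual Doc.from_docs implementation.
    """
    words: List[str] = []
    spaces: List[bool] = []
    for i, doc in enumerate(docs):
        if not doc:
            continue
        doc_words = [word for word, _ in doc]
        doc_spaces = [space for _, space in doc]
        is_last = i == len(docs) - 1
        # Separate this doc from the next one with a space, unless its
        # final token is itself whitespace (assumption: such tokens never
        # get a trailing space, as the commit message suggests).
        if ensure_whitespace and not is_last and not doc_words[-1].isspace():
            doc_spaces[-1] = True
        words.extend(doc_words)
        spaces.extend(doc_spaces)

    # Derive every token's idx from the concatenated words and spaces,
    # so offsets stay consistent with the merged text.
    idx: List[int] = []
    offset = 0
    for word, space in zip(words, spaces):
        idx.append(offset)
        offset += len(word) + (1 if space else 0)

    text = "".join(word + (" " if space else "") for word, space in zip(words, spaces))
    return words, spaces, idx, text


first = [("Hello", True), ("world", False), ("!", False)]
second = [("Second", True), ("doc", False), (".", False)]
words, spaces, idx, text = merge_docs([first, second])
print(text)  # Hello world! Second doc.
print(idx)   # [0, 6, 11, 13, 20, 23]
```

Computing idx from the merged spaces, rather than reusing each doc's original offsets, keeps every offset pointing at the right character in the merged text. In spaCy itself the corresponding call is `Doc.from_docs(docs, ensure_whitespace=True)`, which additionally requires that all docs share the same vocab.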
__init__.pxd * Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx 2015-07-13 20:20:58 +02:00
__init__.py Modify morphology to support arbitrary features (#4932) 2020-01-23 22:01:54 +01:00
_retokenize.pyx Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
_serialize.py DocBin: add version number, missing attributes and strings (#5685) 2020-07-02 17:41:50 +02:00
doc.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
doc.pyx Merging multiple docs into one (#5032) 2020-07-03 11:32:42 +02:00
morphanalysis.pxd Modify morphology to support arbitrary features (#4932) 2020-01-23 22:01:54 +01:00
morphanalysis.pyx refactor fixes (#5664) 2020-06-29 14:33:00 +02:00
span.pxd annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
span.pyx Remove deprecated methods 2020-07-01 22:33:39 +02:00
token.pxd Tidy up compiler flags and imports (#5071) 2020-03-02 11:48:10 +01:00
token.pyx Improve spacy.gold (no GoldParse, no json format!) (#5555) 2020-06-26 19:34:12 +02:00
underscore.py Merge branch 'master' into tmp/sync 2020-03-26 13:38:14 +01:00