Grivaz
aeba99ab0d
Introduces a bulk merge function, in order to solve issue #653 ( #2696 )
...
* Fix comment
* Introduce bulk merge to increase performance on many span merges
* Sign contributor agreement
* Implement pull request suggestions
2018-09-10 16:41:42 +02:00
ines
3c30d1763c
Merge branch 'master' into develop
2018-07-21 15:34:18 +02:00
Matthew Honnibal
e0caf3ae8c
Fix msgpack for new version
2018-07-20 17:32:00 +02:00
Matthew Honnibal
9db77fd914
Fix deserialization for msgpack
2018-07-20 14:11:09 +02:00
Ines Montani
cae4457c38
💫 Add .similarity warnings for no vectors and option to exclude warnings ( #2197 )
...
* Add logic to filter out warning IDs via environment variable
Usage: SPACY_WARNING_EXCLUDE=W001,W007
* Add warnings for empty vectors
* Add warning if no word vectors are used in .similarity methods
For example, if only tensors are available in small models – should hopefully clear up some confusion around this
* Capture warnings in tests
* Rename SPACY_WARNING_EXCLUDE to SPACY_WARNING_IGNORE
2018-05-21 01:22:38 +02:00
Mr Roboto
6f5ccda19c
Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False ( #2230 )
...
* Fixes issue #2228
* Adds a new contributor
2018-05-01 13:40:22 +02:00
ines
1c6d77610c
Add remove_extension method on Doc, Token and Span ( closes #2242 )
2018-04-28 23:33:09 +02:00
Suraj Rajan
5957f15227
Fixed typos for #2222,#2223 ( #2233 ) ( closes #2222 , closes #2223 )
2018-04-18 14:55:26 -07:00
Xiaoquan Kong
e2f13ec722
bugfix: Doc.noun_chunks
call Doc.noun_chunks_iterator
without checking ( closes #2194 )
2018-04-08 23:44:05 +02:00
ines
e5f47cd82d
Update errors
2018-04-03 21:40:29 +02:00
ines
62b4b527d7
Don't raise error if set_extension has getter and setter ( closes #2177 )
...
Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.
2018-04-03 18:30:17 +02:00
ines
ee3082ad29
Fix whitespace
2018-04-03 18:29:53 +02:00
Ines Montani
3141e04822
💫 New system for error messages and warnings ( #2163 )
...
* Add spacy.errors module
* Update deprecation and user warnings
* Replace errors and asserts with new error message system
* Remove redundant asserts
* Fix whitespace
* Add messages for print/util.prints statements
* Fix typo
* Fix typos
* Move CLI messages to spacy.cli._messages
* Add decorator to display error code with message
An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.
* Remove unused link in spacy.about
* Update errors for invalid pipeline components
* Improve error for unknown factories
* Add displaCy warnings
* Update formatting consistency
* Move error message to spacy.errors
* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
Matthew Honnibal
abf8b16d71
Add doc.retokenize() context manager ( #2172 )
...
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.
The idea is to do merging and splitting like this:
with doc.retokenize() as retokenizer:
for start, end, label in matches:
retokenizer.merge(doc[start : end], attrs={'ent_type': label})
The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.
A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.
The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards incompatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.
We can later start making deprecation warnings on direct calls to doc.merge(),
to migrate people to use of the retokenize context manager.
2018-04-03 14:10:35 +02:00
Matthew Honnibal
0b375d50c8
Fix ent_iob tags in doc.merge to avoid inconsistent sequences
2018-03-28 18:39:03 +02:00
Matthew Honnibal
e807f88410
Resolve merge when cherry-picking ent iob patches from develop
2018-03-28 18:38:13 +02:00
Matthew Honnibal
99fbc7db33
Improve error message when entity sequence is inconsistent
2018-03-28 18:36:53 +02:00
ines
9e83513004
Add position of invalid token to error message
2018-03-27 23:56:59 +02:00
ines
693971dd8f
Improve error message if token text is empty string (see #2101 )
2018-03-27 22:25:40 +02:00
ines
0c829e6605
Fix whitespace
2018-03-27 22:20:59 +02:00
Thomas Opsomer
515e25910e
fix sent_start in serialization
2018-01-28 19:50:42 +01:00
Matthew Honnibal
56164ab688
Set l_edge and r_edge correctly for non-projective parses. Fixes #1799
2018-01-22 20:18:04 +01:00
Matthew Honnibal
ccb51a9f36
Make .similarity() return 1.0 if all orth attrs match
2018-01-15 16:29:48 +01:00
Matthew Honnibal
ab7c45b12d
Fix error message and handling of doc.sents
2018-01-15 15:21:11 +01:00
Matthew Honnibal
e10e9ad2c5
Improve efficiency of Doc.to_array
2017-11-23 12:33:27 +00:00
Matthew Honnibal
fa62427300
Remove lookup-based lemmatization
2017-11-23 12:32:22 +00:00
ines
1c218397f6
Ensure path in Doc.to_disk/from_disk (resolves ##1521)
...
Also add Doc serialization tests with both Path and string path options
2017-11-09 02:29:03 +01:00
Matthew Honnibal
144a93c2a5
Back-off to tensor for similarity if no vectors
2017-11-03 20:56:33 +01:00
Matthew Honnibal
62ed58935a
Add Doc.extend_tensor() method
2017-11-03 11:20:31 +01:00
ines
9659391944
Update deprecated methods and add warnings
2017-11-01 16:49:42 +01:00
ines
705a4e3e4a
Fix formatting
2017-11-01 16:44:08 +01:00
Matthew Honnibal
7e7116cdf7
Fix Doc.to_array when only one string attr provided
2017-11-01 13:26:43 +01:00
ines
544a407b93
Tidy up Doc, Token and Span and add missing docs
2017-10-27 17:07:26 +02:00
ines
6a0483b7aa
Tidy up and document Doc, Token and Span
2017-10-27 15:41:45 +02:00
Matthew Honnibal
ccd2ab1a62
Merge pull request #1443 from ramananbalakrishnan/develop-get-lca-matrix
...
Add LCA matrix for spans and docs
2017-10-24 11:22:46 +02:00
Ramanan Balakrishnan
d2fe56a577
Add LCA matrix for spans and docs
2017-10-20 23:58:00 +05:30
Ramanan Balakrishnan
0726946563
cleanup to_array implementation using fixes on master
2017-10-20 17:09:37 +05:30
Ramanan Balakrishnan
b3ab124fc5
Support strings for attribute list in doc.to_array
2017-10-20 11:46:57 +05:30
Ramanan Balakrishnan
7b9b1be44c
Support single value for attribute list in doc.to_array
2017-10-19 17:00:41 +05:30
Matthew Honnibal
394633efce
Make doc pickling support hooks
2017-10-17 19:44:09 +02:00
Matthew Honnibal
cdb0c426d8
Improve deserialization of user_data, esp. for Underscore
2017-10-17 19:29:20 +02:00
Matthew Honnibal
32a8564c79
Fix doc pickling
2017-10-17 18:20:24 +02:00
Matthew Honnibal
92c1eb2d6f
Fix Doc pickling. This also removes need for Binder class
2017-10-17 16:11:13 +02:00
Matthew Honnibal
a002264fec
Remove caching of Token in Doc, as caused cycle.
2017-10-16 19:34:21 +02:00
ines
e0ff145a8b
Merge branch 'develop' into feature/dot-underscore
2017-10-11 11:57:05 +02:00
Matthew Honnibal
3b527fa52b
Call morphology.assign_untagged when pushing token to Doc
2017-10-11 03:23:57 +02:00
Matthew Honnibal
e0a9b02b67
Merge Span._ and Span.as_doc methods
2017-10-09 22:00:15 -05:00
Matthew Honnibal
e938bce320
Adjust parsing transition system to allow preset sentence segments.
2017-10-08 23:53:34 +02:00
Matthew Honnibal
668a0ea640
Pass extensions into Underscore class
2017-10-07 18:56:01 +02:00
ines
2480f8f521
Add missing return in Doc.from_disk() ( closes #1330 )
2017-09-18 15:32:00 +02:00
Matthew Honnibal
03b5b9727a
Fix Doc.vector for empty doc objects
2017-08-22 19:52:19 +02:00
Matthew Honnibal
0551b7b03a
Fix doc.vector
2017-08-22 19:46:52 +02:00
Matthew Honnibal
8b7ac77c23
Allow span label to be string in Doc.char_span
2017-08-19 16:18:09 +02:00
Matthew Honnibal
80236116a6
Add Doc.char_span method, to get a span by character offset
2017-08-19 12:21:09 +02:00
Matthew Honnibal
a6a2159969
Add slot for text categories to Doc
2017-07-22 00:34:15 +02:00
Matthew Honnibal
2a3bd5ee90
Fix fetching of noun chunk iterator
2017-06-04 15:53:05 -05:00
Matthew Honnibal
92ae36f84e
Improve way noun chunks iterator is looked up
2017-06-04 21:53:39 +02:00
Matthew Honnibal
675f448313
Fix vector linkage on Doc
2017-06-04 14:25:30 -05:00
ines
459a1e8470
Fix whitespace
2017-06-03 11:31:18 +02:00
ines
5109bba910
Port over fix from #1070
2017-06-03 11:31:11 +02:00
Matthew Honnibal
498ad85309
Try using tensor for vector/similarity methdos
2017-05-30 23:35:17 +02:00
Matthew Honnibal
4ddff020c3
Fix compile error
2017-05-28 23:30:40 +02:00
Matthew Honnibal
6d3caeadd2
Fix type check for long
2017-05-28 23:22:45 +02:00
Matthew Honnibal
7996d21717
Fixes for new StringStore
2017-05-28 11:09:27 -05:00
Matthew Honnibal
fe11564b8e
Finish stringstore change. Also xfail vectors tests
2017-05-28 15:10:22 +02:00
Matthew Honnibal
84e66ca6d4
WIP on stringstore change. 27 failures
2017-05-28 14:06:40 +02:00
ines
66088851dc
Add Doc.to_disk() and Doc.from_disk() methods
2017-05-24 11:58:17 +02:00
Matthew Honnibal
d44b1eafc4
Fix conflict artefacts
2017-05-23 18:47:11 +02:00
Matthew Honnibal
d68dd1f251
Add SENT_START attribute, for custom sentence boundary detection
2017-05-23 18:37:58 +02:00
ines
23f9a3ccc8
Update docstrings and API docs for Doc
2017-05-19 18:47:39 +02:00
ines
8455cb1327
Update docstring for Doc.__getitem__
2017-05-19 00:30:51 +02:00
ines
b687ad109d
Update docstrings and API docs for Doc class
2017-05-18 23:59:44 +02:00
ines
b87066ff10
Update docstrings and API docs for Doc class
2017-05-18 22:17:41 +02:00
ines
9d85cda8e4
Fix models error message and use about.__docs_models__ (see #1051 )
2017-05-13 13:05:47 +02:00
ines
6b942763f0
Tidy up imports
2017-05-13 13:04:40 +02:00
ines
b9dea345e5
Remove old import
2017-05-13 12:32:11 +02:00
ines
293ee359c5
Fix formatting
2017-05-13 12:32:06 +02:00
Matthew Honnibal
ee1d35bdb0
Fix merge conflict
2017-05-13 03:20:19 +02:00
Matthew Honnibal
b2540d2379
Merge Kengz's tree_print patch
2017-05-13 03:18:49 +02:00
Matthew Honnibal
4efb391994
Fix serializer
2017-05-09 18:45:18 +02:00
Matthew Honnibal
1166b0c491
Implement Doc.to_bytes and Doc.from_bytes methods
2017-05-09 18:11:34 +02:00
Matthew Honnibal
9e167b7bb6
Strip serializer from code
2017-05-09 17:28:50 +02:00
ines
0739ae7b76
Tidy up and fix formatting and imports
2017-04-15 13:05:15 +02:00
ines
e71a1f4bd0
Fix download commands in error messages (see #946 )
2017-04-01 10:20:57 +02:00
Matthew Honnibal
51882ee2b8
Fix check for setting ent_id in merge
2017-03-31 19:32:01 +02:00
Matthew Honnibal
9720103428
Improve attribute handlign in doc.merge(). Still unsatisfying
2017-03-31 13:59:58 +02:00
Matthew Honnibal
0fefdfcbda
Merge pull request #935 from ericzhao28/master
...
Add option to use label=ent_type in doc.merge arguments (Bug fix for issue #862 )
2017-03-30 02:51:24 +02:00
Eric Zhao
aafdf6ffb8
Add option to use label karg to determine ent_type in doc.merge
2017-03-28 23:35:03 -07:00
Roman Inflianskas
66e1109b53
Add support for Universal Dependencies v2.0
2017-03-03 13:17:34 +01:00
Matvey Ezhov
32a22291bc
Small Doc.count_by
documentation update
...
Current example doesn't work
2017-01-31 19:18:45 +03:00
Matthew Honnibal
6c665b81df
Fix redundant == TAG in from_array conditional
2017-01-31 00:46:21 +11:00
Matthew Honnibal
44e2b0100d
Support TAG attribute in doc.from_array
2017-01-10 22:47:07 +01:00
kengz
73a38bd4d1
Merge remote-tracking branch 'upstream/master'
2016-12-30 12:19:59 -05:00
kengz
da44183ae1
move parse_tree logic to a new tokens/printers.py file
2016-12-30 12:19:18 -05:00
Pokey Rule
3e3bda142d
Add noun_chunks to Span
2016-11-24 10:47:20 +00:00
Matthew Honnibal
1fb09c3dc1
Fix morphology tagger
2016-11-04 19:19:09 +01:00
Matthew Honnibal
f292f7f0e6
Fix Issue #599 , by considering empty documents to be parsed and tagged. Implementation is a bit dodgy.
2016-11-02 23:48:43 +01:00
Matthew Honnibal
e7af6b937f
Fix syntax error while fixing doc strings
2016-11-01 13:27:32 +01:00
Matthew Honnibal
b86f8af0c1
Fix doc strings
2016-11-01 12:25:36 +01:00
Matthew Honnibal
4ca31b4d87
Fix clobbering of 'missing' named ent values after assigning ents.
2016-10-26 13:13:56 +02:00