Commit Graph

428 Commits

Author SHA1 Message Date
Matthew Honnibal
a1be01185c Fix array out of bounds error in Span 2018-02-28 12:27:09 +01:00
Thomas Opsomer
8df9e52829 lemma property to return hash instead of unicode 2018-02-27 19:50:01 +01:00
Matthew Honnibal
cf0e320f2b Add doc.is_sentenced attribute, re #1959 2018-02-18 14:16:55 +01:00
Matthew Honnibal
1e5aeb4eec
Merge pull request #1987 from thomasopsomer/span-sent
Make span.sent work when only manual / custom sbd
2018-02-18 14:05:37 +01:00
Thomas Opsomer
deab391cbf correct check on sent_start & raise if no boundaries 2018-02-15 16:58:30 +01:00
Thomas Opsomer
b902731313 Find span sentence when only sentence boundaries (no parser) 2018-02-14 22:18:54 +01:00
4altinok
ca8728035d added new lex feat to token 2018-02-11 18:55:48 +01:00
Thomas Opsomer
515e25910e fix sent_start in serialization 2018-01-28 19:50:42 +01:00
Matthew Honnibal
56164ab688 Set l_edge and r_edge correctly for non-projective parses. Fixes #1799 2018-01-22 20:18:04 +01:00
Matthew Honnibal
ccb51a9f36 Make .similarity() return 1.0 if all orth attrs match 2018-01-15 16:29:48 +01:00
Matthew Honnibal
b904d81e9a Fix rich comparison against None objects. Closes #1757 2018-01-15 15:51:25 +01:00
Matthew Honnibal
ab7c45b12d Fix error message and handling of doc.sents 2018-01-15 15:21:11 +01:00
Matthew Honnibal
465a6f6452 Add missing Span.vocab property. Closes #1633 2018-01-14 15:06:30 +01:00
Matthew Honnibal
0cb090e526 Fix infinite recursion in token.sent_start. Closes #1640 2018-01-14 15:02:15 +01:00
Matthew Honnibal
5cbe913b6f Don't raise deprecation warning in property. Closes #1813, #1712 2018-01-14 14:55:58 +01:00
Matthew Honnibal
e10e9ad2c5 Improve efficiency of Doc.to_array 2017-11-23 12:33:27 +00:00
Matthew Honnibal
fa62427300 Remove lookup-based lemmatization 2017-11-23 12:32:22 +00:00
Matthew Honnibal
fb26b2cb12 Use lookup lemmatizer if lemma unset 2017-11-23 12:31:58 +00:00
Burton DeWilde
a5c6869b2d Fix bug where span.orth_ != span.text (see #1612) 2017-11-20 12:05:43 -06:00
Motoki Wu
a52e195a0a Fixes Issue #1207 where noun_chunks of Span gives an error.
Make sure to reference `self.doc` when getting the noun chunks.

Same fix as 9750a0128c
2017-11-17 17:16:20 -08:00
ines
1c218397f6 Ensure path in Doc.to_disk/from_disk (resolves ##1521)
Also add Doc serialization tests with both Path and string path options
2017-11-09 02:29:03 +01:00
Matthew Honnibal
144a93c2a5 Back-off to tensor for similarity if no vectors 2017-11-03 20:56:33 +01:00
Matthew Honnibal
62ed58935a Add Doc.extend_tensor() method 2017-11-03 11:20:31 +01:00
ines
9659391944 Update deprecated methods and add warnings 2017-11-01 16:49:42 +01:00
ines
705a4e3e4a Fix formatting 2017-11-01 16:44:08 +01:00
Matthew Honnibal
9e0ebee81c Add Token.is_sent_start property, so can deprecate Token.sent_start 2017-11-01 13:27:14 +01:00
Matthew Honnibal
7e7116cdf7 Fix Doc.to_array when only one string attr provided 2017-11-01 13:26:43 +01:00
Matthew Honnibal
301fb2bb60 Implement Span.n_lefts and Span.n_rights 2017-11-01 13:25:12 +01:00
Matthew Honnibal
86eba61fae Fix token.vector when vectors are missing 2017-11-01 00:47:35 +01:00
ines
d96e72f656 Tidy up rest 2017-10-27 21:07:59 +02:00
ines
d2df81d907 Fix not implemented Span getters 2017-10-27 18:09:28 +02:00
ines
544a407b93 Tidy up Doc, Token and Span and add missing docs 2017-10-27 17:07:26 +02:00
ines
6a0483b7aa Tidy up and document Doc, Token and Span 2017-10-27 15:41:45 +02:00
ines
1a559d4c95 Remove old, unused file 2017-10-27 15:34:35 +02:00
ines
ea4a41c8fb Tidy up util and helpers 2017-10-27 14:39:09 +02:00
Matthew Honnibal
b66b8f028b Fix #1375 -- out-of-bounds on token.nbor() 2017-10-24 12:10:39 +02:00
Matthew Honnibal
ccd2ab1a62 Merge pull request #1443 from ramananbalakrishnan/develop-get-lca-matrix
Add LCA matrix for spans and docs
2017-10-24 11:22:46 +02:00
Matthew Honnibal
fdf25d10ba Merge pull request #1440 from ramananbalakrishnan/develop
Support single value for attribute list in doc.to_array
2017-10-24 10:23:12 +02:00
ines
a31f048b4d Fix formatting 2017-10-23 10:38:06 +02:00
Ramanan Balakrishnan
d2fe56a577
Add LCA matrix for spans and docs 2017-10-20 23:58:00 +05:30
Ramanan Balakrishnan
0726946563
cleanup to_array implementation using fixes on master 2017-10-20 17:09:37 +05:30
Ramanan Balakrishnan
b3ab124fc5
Support strings for attribute list in doc.to_array 2017-10-20 11:46:57 +05:30
Ramanan Balakrishnan
7b9b1be44c
Support single value for attribute list in doc.to_array 2017-10-19 17:00:41 +05:30
Matthew Honnibal
394633efce Make doc pickling support hooks 2017-10-17 19:44:09 +02:00
Matthew Honnibal
cdb0c426d8 Improve deserialization of user_data, esp. for Underscore 2017-10-17 19:29:20 +02:00
Matthew Honnibal
32a8564c79 Fix doc pickling 2017-10-17 18:20:24 +02:00
Matthew Honnibal
92c1eb2d6f Fix Doc pickling. This also removes need for Binder class 2017-10-17 16:11:13 +02:00
Matthew Honnibal
a002264fec Remove caching of Token in Doc, as caused cycle. 2017-10-16 19:34:21 +02:00
Matthew Honnibal
59c216196c Allow weakrefs on Doc objects 2017-10-16 19:22:11 +02:00
ines
e0ff145a8b Merge branch 'develop' into feature/dot-underscore 2017-10-11 11:57:05 +02:00
Matthew Honnibal
3b527fa52b Call morphology.assign_untagged when pushing token to Doc 2017-10-11 03:23:57 +02:00
Matthew Honnibal
e0a9b02b67 Merge Span._ and Span.as_doc methods 2017-10-09 22:00:15 -05:00
ines
3fc4fe61d2 Fix typo 2017-10-10 04:15:14 +02:00
ines
59c4f27499 Add get, set and has methods to Underscore 2017-10-10 04:14:35 +02:00
Matthew Honnibal
51d18937af Partially apply doc/span/token into method
We want methods to act like they're "bound" to the object, so that you can make your method conditional on the `doc`, `span` or `token` instance --- like, well, a method. We therefore partially apply the function, which works like this:

```
def partial(unbound_method, constant_arg):
    def bound_method(*args, **kwargs):
        return unbound_method(constant_arg, *args, **kwargs)
    return bound_method
2017-10-10 02:21:28 +02:00
Matthew Honnibal
e938bce320 Adjust parsing transition system to allow preset sentence segments. 2017-10-08 23:53:34 +02:00
Matthew Honnibal
080afd4924 Add ternary value setting to Token.sent_start 2017-10-08 23:51:58 +02:00
Matthew Honnibal
7ae67ec6a1 Add Span.as_doc method 2017-10-08 23:50:20 +02:00
Matthew Honnibal
668a0ea640 Pass extensions into Underscore class 2017-10-07 18:56:01 +02:00
Matthew Honnibal
1289129fd9 Add Underscore class 2017-10-07 18:00:14 +02:00
Matthew Honnibal
9bfd585a11 Fix parameter name in .pxd file 2017-09-26 07:28:50 -05:00
ines
2480f8f521 Add missing return in Doc.from_disk() (closes #1330) 2017-09-18 15:32:00 +02:00
Matthew Honnibal
03b5b9727a Fix Doc.vector for empty doc objects 2017-08-22 19:52:19 +02:00
Matthew Honnibal
0551b7b03a Fix doc.vector 2017-08-22 19:46:52 +02:00
Matthew Honnibal
d55d6e1cfa Fix comparison of Token from different docs. Closes #1257 2017-08-19 16:39:32 +02:00
Matthew Honnibal
dea229c634 Fix Span.to_array method 2017-08-19 16:24:28 +02:00
Matthew Honnibal
8b7ac77c23 Allow span label to be string in Doc.char_span 2017-08-19 16:18:09 +02:00
Matthew Honnibal
80236116a6 Add Doc.char_span method, to get a span by character offset 2017-08-19 12:21:09 +02:00
Matthew Honnibal
482bba1722 Add Span.to_array method 2017-08-19 12:20:45 +02:00
Matthew Honnibal
a6a2159969 Add slot for text categories to Doc 2017-07-22 00:34:15 +02:00
Matthew Honnibal
2a3bd5ee90 Fix fetching of noun chunk iterator 2017-06-04 15:53:05 -05:00
Matthew Honnibal
92ae36f84e Improve way noun chunks iterator is looked up 2017-06-04 21:53:39 +02:00
Matthew Honnibal
675f448313 Fix vector linkage on Doc 2017-06-04 14:25:30 -05:00
Matthew Honnibal
f4662e9218 Fix vector linkage for token 2017-06-04 14:19:58 -05:00
ines
459a1e8470 Fix whitespace 2017-06-03 11:31:18 +02:00
ines
5109bba910 Port over fix from #1070 2017-06-03 11:31:11 +02:00
Matthew Honnibal
498ad85309 Try using tensor for vector/similarity methdos 2017-05-30 23:35:17 +02:00
Matthew Honnibal
4ddff020c3 Fix compile error 2017-05-28 23:30:40 +02:00
Matthew Honnibal
6d3caeadd2 Fix type check for long 2017-05-28 23:22:45 +02:00
Matthew Honnibal
7996d21717 Fixes for new StringStore 2017-05-28 11:09:27 -05:00
Matthew Honnibal
fe11564b8e Finish stringstore change. Also xfail vectors tests 2017-05-28 15:10:22 +02:00
Matthew Honnibal
84e66ca6d4 WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
Matthew Honnibal
39293ab2ee Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-28 11:46:57 +02:00
Matthew Honnibal
2445707f3c Re-delegate vectors to vocab 2017-05-28 11:46:10 +02:00
ines
66088851dc Add Doc.to_disk() and Doc.from_disk() methods 2017-05-24 11:58:17 +02:00
Matthew Honnibal
d44b1eafc4 Fix conflict artefacts 2017-05-23 18:47:11 +02:00
Matthew Honnibal
01e59e4e6e * Add Token.sent_start property, re Issue #235 2017-05-23 18:41:11 +02:00
Matthew Honnibal
d68dd1f251 Add SENT_START attribute, for custom sentence boundary detection 2017-05-23 18:37:58 +02:00
ines
7ed8a92ed1 Update docstrings and API docs for Token 2017-05-20 15:13:33 +02:00
ines
a804045597 Use is_ancestor instead of deprecated is_ancestor_of 2017-05-19 20:23:40 +02:00
ines
e9e62b01b0 Update docstrings and API docs for Token 2017-05-19 18:47:56 +02:00
ines
62ceec4fc6 Update docstrings and API docs for Span 2017-05-19 18:47:46 +02:00
ines
23f9a3ccc8 Update docstrings and API docs for Doc 2017-05-19 18:47:39 +02:00
ines
0791f0aae6 Update docstrings and API docs for Span class 2017-05-19 00:31:31 +02:00
ines
8455cb1327 Update docstring for Doc.__getitem__ 2017-05-19 00:30:51 +02:00
ines
b687ad109d Update docstrings and API docs for Doc class 2017-05-18 23:59:44 +02:00
ines
593361ee3c Update docstrings for Span class 2017-05-18 22:17:41 +02:00
ines
b87066ff10 Update docstrings and API docs for Doc class 2017-05-18 22:17:41 +02:00
Matthew Honnibal
4b9d69f428 Merge branch 'v2' into develop
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module

Currently the two parsers live side-by-side, until we figure out how to
organize them.
2017-05-14 01:10:23 +02:00
ines
9d85cda8e4 Fix models error message and use about.__docs_models__ (see #1051) 2017-05-13 13:05:47 +02:00
ines
6b942763f0 Tidy up imports 2017-05-13 13:04:40 +02:00
ines
6129016e15 Replace deepcopy 2017-05-13 12:32:37 +02:00
ines
df68bf45ce Set defaults for light and flat kwargs 2017-05-13 12:32:23 +02:00
ines
b9dea345e5 Remove old import 2017-05-13 12:32:11 +02:00
ines
293ee359c5 Fix formatting 2017-05-13 12:32:06 +02:00
Matthew Honnibal
ee1d35bdb0 Fix merge conflict 2017-05-13 03:20:19 +02:00
Matthew Honnibal
b2540d2379 Merge Kengz's tree_print patch 2017-05-13 03:18:49 +02:00
Matthew Honnibal
4efb391994 Fix serializer 2017-05-09 18:45:18 +02:00
Matthew Honnibal
1166b0c491 Implement Doc.to_bytes and Doc.from_bytes methods 2017-05-09 18:11:34 +02:00
Matthew Honnibal
9e167b7bb6 Strip serializer from code 2017-05-09 17:28:50 +02:00
Matthew Honnibal
62ecdea9f2 Add binder class for document serialization 2017-05-09 17:21:00 +02:00
Matthew Honnibal
6782eedf9b Tmp GPU code 2017-05-07 11:04:24 -05:00
Matthew Honnibal
4d98511db7 Make Span hashable. Closes #1019 2017-04-26 19:01:05 +02:00
Matthew Honnibal
6a4221a6de Allow lemma to be set from Python. Re #973 2017-04-16 18:07:53 +02:00
ines
0739ae7b76 Tidy up and fix formatting and imports 2017-04-15 13:05:15 +02:00
ines
3b667a24d4 Remove whitespace 2017-04-01 10:21:08 +02:00
ines
e71a1f4bd0 Fix download commands in error messages (see #946) 2017-04-01 10:20:57 +02:00
Matthew Honnibal
51882ee2b8 Fix check for setting ent_id in merge 2017-03-31 19:32:01 +02:00
Matthew Honnibal
fc3900e5b2 Allow ent_id to be set in Token 2017-03-31 14:00:14 +02:00
Matthew Honnibal
9720103428 Improve attribute handlign in doc.merge(). Still unsatisfying 2017-03-31 13:59:58 +02:00
Matthew Honnibal
0fefdfcbda Merge pull request #935 from ericzhao28/master
Add option to use label=ent_type in doc.merge arguments (Bug fix for issue #862)
2017-03-30 02:51:24 +02:00
Eric Zhao
aafdf6ffb8 Add option to use label karg to determine ent_type in doc.merge 2017-03-28 23:35:03 -07:00
Matthew Honnibal
28bb546939 Merge pull request #883 from ericzhao28/master
Add `lower_` and `upper_` properties to `Span` class
2017-03-16 23:35:47 +01:00
ines
66c1f194f9 Use consistent unicode declarations 2017-03-12 13:07:28 +01:00
Em
9c809efc25 Removed mapStr 2017-03-11 16:23:26 -08:00
Em
426d17167f Added string manipulation for spans 2017-03-10 16:50:02 -08:00
Roman Inflianskas
66e1109b53 Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
Matvey Ezhov
32a22291bc Small Doc.count_by documentation update
Current example doesn't work
2017-01-31 19:18:45 +03:00
Matthew Honnibal
6c665b81df Fix redundant == TAG in from_array conditional 2017-01-31 00:46:21 +11:00
Matthew Honnibal
e7f8e13cf3 Make Token hashable. Fixes #743 2017-01-16 13:27:57 +01:00
Matthew Honnibal
12cd27b821 Amend 8ae8b443f: Handle comparison with None tokens. 2017-01-11 13:03:32 +01:00
Matthew Honnibal
44e2b0100d Support TAG attribute in doc.from_array 2017-01-10 22:47:07 +01:00
Matthew Honnibal
8ae8b443f1 Add richcmp method to Token. Closes #631 2017-01-09 19:30:31 +01:00
kengz
73a38bd4d1 Merge remote-tracking branch 'upstream/master' 2016-12-30 12:19:59 -05:00
kengz
da44183ae1 move parse_tree logic to a new tokens/printers.py file 2016-12-30 12:19:18 -05:00
Matthew Honnibal
404019ad2f Fix issue #672: ent_iob_ was a string, not unicode, due to missing unicode_literals statement. 2016-12-18 22:33:53 +01:00
Matthew Honnibal
f6e356aada Add (and test) Span.sentiment attribute. By default we average token.span, but can override with custom hook. Re Issue #667 2016-12-02 11:05:50 +01:00
Matthew Honnibal
87613edf8f Add set_struct_attr staticmethod to token 2016-11-25 12:41:47 +01:00
Matthew Honnibal
fb69aa648f Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-11-25 11:35:44 +01:00
Matthew Honnibal
9a03a3f85e Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr. 2016-11-25 11:35:17 +01:00
Pokey Rule
3e3bda142d Add noun_chunks to Span 2016-11-24 10:47:20 +00:00
tiago
b38cfd0ef9 now span.merge returns token like it says on documentation 2016-11-09 14:58:19 +00:00
Matthew Honnibal
1fb09c3dc1 Fix morphology tagger 2016-11-04 19:19:09 +01:00
Matthew Honnibal
293c79c09a Fix #595: Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly. 2016-11-04 00:29:07 +01:00
Matthew Honnibal
f292f7f0e6 Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy. 2016-11-02 23:48:43 +01:00
Matthew Honnibal
05a8b752a2 Fix Issue #600: Missing setters for Token attribute. 2016-11-02 23:28:59 +01:00
Matthew Honnibal
11664b9f20 Fix variable error in token 2016-11-01 13:28:00 +01:00
Matthew Honnibal
8c4d1b46ce Fix variable error in Span 2016-11-01 13:27:44 +01:00
Matthew Honnibal
e7af6b937f Fix syntax error while fixing doc strings 2016-11-01 13:27:32 +01:00
Matthew Honnibal
b86f8af0c1 Fix doc strings 2016-11-01 12:25:36 +01:00