spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-16 07:45:56 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	92c1eb2d6f	Fix Doc pickling. This also removes need for Binder class	2017-10-17 16:11:13 +02:00
Matthew Honnibal	a002264fec	Remove caching of Token in Doc, as caused cycle.	2017-10-16 19:34:21 +02:00
Matthew Honnibal	59c216196c	Allow weakrefs on Doc objects	2017-10-16 19:22:11 +02:00
ines	e0ff145a8b	Merge branch 'develop' into feature/dot-underscore	2017-10-11 11:57:05 +02:00
Matthew Honnibal	3b527fa52b	Call morphology.assign_untagged when pushing token to Doc	2017-10-11 03:23:57 +02:00
Matthew Honnibal	e0a9b02b67	Merge Span._ and Span.as_doc methods	2017-10-09 22:00:15 -05:00
ines	3fc4fe61d2	Fix typo	2017-10-10 04:15:14 +02:00
ines	59c4f27499	Add get, set and has methods to Underscore	2017-10-10 04:14:35 +02:00
Matthew Honnibal	51d18937af	Partially apply doc/span/token into method We want methods to act like they're "bound" to the object, so that you can make your method conditional on the `doc`, `span` or `token` instance --- like, well, a method. We therefore partially apply the function, which works like this: ``` def partial(unbound_method, constant_arg): def bound_method(args, kwargs): return unbound_method(constant_arg, args, **kwargs) return bound_method	2017-10-10 02:21:28 +02:00
Matthew Honnibal	e938bce320	Adjust parsing transition system to allow preset sentence segments.	2017-10-08 23:53:34 +02:00
Matthew Honnibal	080afd4924	Add ternary value setting to Token.sent_start	2017-10-08 23:51:58 +02:00
Matthew Honnibal	7ae67ec6a1	Add Span.as_doc method	2017-10-08 23:50:20 +02:00
Matthew Honnibal	668a0ea640	Pass extensions into Underscore class	2017-10-07 18:56:01 +02:00
Matthew Honnibal	1289129fd9	Add Underscore class	2017-10-07 18:00:14 +02:00
Matthew Honnibal	9bfd585a11	Fix parameter name in .pxd file	2017-09-26 07:28:50 -05:00
ines	2480f8f521	Add missing return in Doc.from_disk() (closes #1330 )	2017-09-18 15:32:00 +02:00
Matthew Honnibal	03b5b9727a	Fix Doc.vector for empty doc objects	2017-08-22 19:52:19 +02:00
Matthew Honnibal	0551b7b03a	Fix doc.vector	2017-08-22 19:46:52 +02:00
Matthew Honnibal	d55d6e1cfa	Fix comparison of Token from different docs. Closes #1257	2017-08-19 16:39:32 +02:00
Matthew Honnibal	dea229c634	Fix Span.to_array method	2017-08-19 16:24:28 +02:00
Matthew Honnibal	8b7ac77c23	Allow span label to be string in Doc.char_span	2017-08-19 16:18:09 +02:00
Matthew Honnibal	80236116a6	Add Doc.char_span method, to get a span by character offset	2017-08-19 12:21:09 +02:00
Matthew Honnibal	482bba1722	Add Span.to_array method	2017-08-19 12:20:45 +02:00
Matthew Honnibal	a6a2159969	Add slot for text categories to Doc	2017-07-22 00:34:15 +02:00
Matthew Honnibal	2a3bd5ee90	Fix fetching of noun chunk iterator	2017-06-04 15:53:05 -05:00
Matthew Honnibal	92ae36f84e	Improve way noun chunks iterator is looked up	2017-06-04 21:53:39 +02:00
Matthew Honnibal	675f448313	Fix vector linkage on Doc	2017-06-04 14:25:30 -05:00
Matthew Honnibal	f4662e9218	Fix vector linkage for token	2017-06-04 14:19:58 -05:00
ines	459a1e8470	Fix whitespace	2017-06-03 11:31:18 +02:00
ines	5109bba910	Port over fix from #1070	2017-06-03 11:31:11 +02:00
Matthew Honnibal	498ad85309	Try using tensor for vector/similarity methdos	2017-05-30 23:35:17 +02:00
Matthew Honnibal	4ddff020c3	Fix compile error	2017-05-28 23:30:40 +02:00
Matthew Honnibal	6d3caeadd2	Fix type check for long	2017-05-28 23:22:45 +02:00
Matthew Honnibal	7996d21717	Fixes for new StringStore	2017-05-28 11:09:27 -05:00
Matthew Honnibal	fe11564b8e	Finish stringstore change. Also xfail vectors tests	2017-05-28 15:10:22 +02:00
Matthew Honnibal	84e66ca6d4	WIP on stringstore change. 27 failures	2017-05-28 14:06:40 +02:00
Matthew Honnibal	39293ab2ee	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-28 11:46:57 +02:00
Matthew Honnibal	2445707f3c	Re-delegate vectors to vocab	2017-05-28 11:46:10 +02:00
ines	66088851dc	Add Doc.to_disk() and Doc.from_disk() methods	2017-05-24 11:58:17 +02:00
Matthew Honnibal	d44b1eafc4	Fix conflict artefacts	2017-05-23 18:47:11 +02:00
Matthew Honnibal	01e59e4e6e	* Add Token.sent_start property, re Issue #235	2017-05-23 18:41:11 +02:00
Matthew Honnibal	d68dd1f251	Add SENT_START attribute, for custom sentence boundary detection	2017-05-23 18:37:58 +02:00
ines	7ed8a92ed1	Update docstrings and API docs for Token	2017-05-20 15:13:33 +02:00
ines	a804045597	Use is_ancestor instead of deprecated is_ancestor_of	2017-05-19 20:23:40 +02:00
ines	e9e62b01b0	Update docstrings and API docs for Token	2017-05-19 18:47:56 +02:00
ines	62ceec4fc6	Update docstrings and API docs for Span	2017-05-19 18:47:46 +02:00
ines	23f9a3ccc8	Update docstrings and API docs for Doc	2017-05-19 18:47:39 +02:00
ines	0791f0aae6	Update docstrings and API docs for Span class	2017-05-19 00:31:31 +02:00
ines	8455cb1327	Update docstring for Doc.__getitem__	2017-05-19 00:30:51 +02:00
ines	b687ad109d	Update docstrings and API docs for Doc class	2017-05-18 23:59:44 +02:00
ines	593361ee3c	Update docstrings for Span class	2017-05-18 22:17:41 +02:00
ines	b87066ff10	Update docstrings and API docs for Doc class	2017-05-18 22:17:41 +02:00
Matthew Honnibal	4b9d69f428	Merge branch 'v2' into develop * Move v2 parser into nn_parser.pyx * New TokenVectorEncoder class in pipeline.pyx * New spacy/_ml.py module Currently the two parsers live side-by-side, until we figure out how to organize them.	2017-05-14 01:10:23 +02:00
ines	9d85cda8e4	Fix models error message and use about.__docs_models__ (see #1051 )	2017-05-13 13:05:47 +02:00
ines	6b942763f0	Tidy up imports	2017-05-13 13:04:40 +02:00
ines	6129016e15	Replace deepcopy	2017-05-13 12:32:37 +02:00
ines	df68bf45ce	Set defaults for light and flat kwargs	2017-05-13 12:32:23 +02:00
ines	b9dea345e5	Remove old import	2017-05-13 12:32:11 +02:00
ines	293ee359c5	Fix formatting	2017-05-13 12:32:06 +02:00
Matthew Honnibal	ee1d35bdb0	Fix merge conflict	2017-05-13 03:20:19 +02:00
Matthew Honnibal	b2540d2379	Merge Kengz's tree_print patch	2017-05-13 03:18:49 +02:00
Matthew Honnibal	4efb391994	Fix serializer	2017-05-09 18:45:18 +02:00
Matthew Honnibal	1166b0c491	Implement Doc.to_bytes and Doc.from_bytes methods	2017-05-09 18:11:34 +02:00
Matthew Honnibal	9e167b7bb6	Strip serializer from code	2017-05-09 17:28:50 +02:00
Matthew Honnibal	62ecdea9f2	Add binder class for document serialization	2017-05-09 17:21:00 +02:00
Matthew Honnibal	6782eedf9b	Tmp GPU code	2017-05-07 11:04:24 -05:00
Matthew Honnibal	4d98511db7	Make Span hashable. Closes #1019	2017-04-26 19:01:05 +02:00
Matthew Honnibal	6a4221a6de	Allow lemma to be set from Python. Re #973	2017-04-16 18:07:53 +02:00
ines	0739ae7b76	Tidy up and fix formatting and imports	2017-04-15 13:05:15 +02:00
ines	3b667a24d4	Remove whitespace	2017-04-01 10:21:08 +02:00
ines	e71a1f4bd0	Fix download commands in error messages (see #946 )	2017-04-01 10:20:57 +02:00
Matthew Honnibal	51882ee2b8	Fix check for setting ent_id in merge	2017-03-31 19:32:01 +02:00
Matthew Honnibal	fc3900e5b2	Allow ent_id to be set in Token	2017-03-31 14:00:14 +02:00
Matthew Honnibal	9720103428	Improve attribute handlign in doc.merge(). Still unsatisfying	2017-03-31 13:59:58 +02:00
Matthew Honnibal	0fefdfcbda	Merge pull request #935 from ericzhao28/master Add option to use label=ent_type in doc.merge arguments (Bug fix for issue #862)	2017-03-30 02:51:24 +02:00
Eric Zhao	aafdf6ffb8	Add option to use label karg to determine ent_type in doc.merge	2017-03-28 23:35:03 -07:00
Matthew Honnibal	28bb546939	Merge pull request #883 from ericzhao28/master Add `lower_` and `upper_` properties to `Span` class	2017-03-16 23:35:47 +01:00
ines	66c1f194f9	Use consistent unicode declarations	2017-03-12 13:07:28 +01:00
Em	9c809efc25	Removed mapStr	2017-03-11 16:23:26 -08:00
Em	426d17167f	Added string manipulation for spans	2017-03-10 16:50:02 -08:00
Roman Inflianskas	66e1109b53	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
Matvey Ezhov	32a22291bc	Small `Doc.count_by` documentation update Current example doesn't work	2017-01-31 19:18:45 +03:00
Matthew Honnibal	6c665b81df	Fix redundant == TAG in from_array conditional	2017-01-31 00:46:21 +11:00
Matthew Honnibal	e7f8e13cf3	Make Token hashable. Fixes #743	2017-01-16 13:27:57 +01:00
Matthew Honnibal	12cd27b821	Amend 8ae8b443f: Handle comparison with None tokens.	2017-01-11 13:03:32 +01:00
Matthew Honnibal	44e2b0100d	Support TAG attribute in doc.from_array	2017-01-10 22:47:07 +01:00
Matthew Honnibal	8ae8b443f1	Add richcmp method to Token. Closes #631	2017-01-09 19:30:31 +01:00
kengz	73a38bd4d1	Merge remote-tracking branch 'upstream/master'	2016-12-30 12:19:59 -05:00
kengz	da44183ae1	move parse_tree logic to a new tokens/printers.py file	2016-12-30 12:19:18 -05:00
Matthew Honnibal	404019ad2f	Fix issue #672 : ent_iob_ was a string, not unicode, due to missing unicode_literals statement.	2016-12-18 22:33:53 +01:00
Matthew Honnibal	f6e356aada	Add (and test) Span.sentiment attribute. By default we average token.span, but can override with custom hook. Re Issue #667	2016-12-02 11:05:50 +01:00
Matthew Honnibal	87613edf8f	Add set_struct_attr staticmethod to token	2016-11-25 12:41:47 +01:00
Matthew Honnibal	fb69aa648f	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-11-25 11:35:44 +01:00
Matthew Honnibal	9a03a3f85e	Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr.	2016-11-25 11:35:17 +01:00
Pokey Rule	3e3bda142d	Add noun_chunks to Span	2016-11-24 10:47:20 +00:00
tiago	b38cfd0ef9	now span.merge returns token like it says on documentation	2016-11-09 14:58:19 +00:00
Matthew Honnibal	1fb09c3dc1	Fix morphology tagger	2016-11-04 19:19:09 +01:00
Matthew Honnibal	293c79c09a	Fix #595 : Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly.	2016-11-04 00:29:07 +01:00
Matthew Honnibal	f292f7f0e6	Fix Issue #599 , by considering empty documents to be parsed and tagged. Implementation is a bit dodgy.	2016-11-02 23:48:43 +01:00
Matthew Honnibal	05a8b752a2	Fix Issue #600 : Missing setters for Token attribute.	2016-11-02 23:28:59 +01:00
Matthew Honnibal	11664b9f20	Fix variable error in token	2016-11-01 13:28:00 +01:00
Matthew Honnibal	8c4d1b46ce	Fix variable error in Span	2016-11-01 13:27:44 +01:00
Matthew Honnibal	e7af6b937f	Fix syntax error while fixing doc strings	2016-11-01 13:27:32 +01:00
Matthew Honnibal	b86f8af0c1	Fix doc strings	2016-11-01 12:25:36 +01:00
Matthew Honnibal	4ca31b4d87	Fix clobbering of 'missing' named ent values after assigning ents.	2016-10-26 13:13:56 +02:00
Matthew Honnibal	15c9b59f0e	Fix Issue #461 : O tag was being clobbered by doc.ents.__set__	2016-10-23 15:50:26 +02:00
Matthew Honnibal	2c3a67b693	Fix calculation of vector norm, re Issue #522 . Need to consolidate the calculations into a helper function.	2016-10-23 14:49:31 +02:00
Matthew Honnibal	e80944276f	Fix Span.vector_norm	2016-10-20 21:58:56 +02:00
Matthew Honnibal	3588a18fb8	Fix hook names in doc	2016-10-19 21:15:16 +02:00
Matthew Honnibal	5d5742b773	Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc.	2016-10-19 20:54:22 +02:00
Matthew Honnibal	9b60186266	Fix doc class	2016-10-17 15:23:47 +02:00
Matthew Honnibal	7fd98fc91c	Remove deprecation shim around str/bytes in Token.	2016-10-17 14:02:47 +02:00
Matthew Honnibal	b67697a97b	Improve API for doc.merge() and span.merge(), to use keyword arguments.	2016-10-17 14:02:13 +02:00
Matthew Honnibal	fbb7f3f15c	Add user_data attribute to Doc object.	2016-10-17 11:43:22 +02:00
Matthew Honnibal	c1abc8f6ed	Fix deprecation stuff in Token: Remove the shim for the str/unicode semantics, and raise for has_repvec and repvec	2016-10-17 11:18:41 +02:00
Matthew Honnibal	09ab447a18	Remove tensor property from token.	2016-10-17 02:45:09 +02:00
Matthew Honnibal	5d10e2005c	Defer some attributes to Doc, via getters_for_tokens attribute.	2016-10-17 02:44:49 +02:00
Matthew Honnibal	8829984efb	Remove tensor attribute from Span and Token.	2016-10-17 02:44:04 +02:00
Matthew Honnibal	d15a88c66a	Defer some attributes to Doc via getters_for_spans	2016-10-17 02:43:35 +02:00
Matthew Honnibal	62230dd13a	Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring	2016-10-17 02:42:51 +02:00
Matthew Honnibal	ae11ea8240	Add getters_for_tokens and getters_for_spans attributes to Doc object.	2016-10-17 02:42:05 +02:00
Matthew Honnibal	311a985fe0	Add input error handling in Doc	2016-10-16 18:16:42 +02:00
Matthew Honnibal	06322ba99d	Add words and spaces keyword arguments to Doc.	2016-10-16 18:13:03 +02:00
Matthew Honnibal	f3be9d0a9a	Add tensor field to Lexeme, Token, Doc and Span, so that users have a place to hang neural network outputs	2016-10-14 03:24:13 +02:00
Matthew Honnibal	ca32a1ab01	Revert "Work on Issue #285 : intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good." This reverts commit `8423e8627f`.	2016-09-30 20:20:22 +02:00
Matthew Honnibal	6736977d82	Revert "Changes to Doc and Token for new string store scheme" This reverts commit `99de44d864`.	2016-09-30 20:11:15 +02:00
Matthew Honnibal	99de44d864	Changes to Doc and Token for new string store scheme	2016-09-30 20:00:21 +02:00
Matthew Honnibal	8423e8627f	Work on Issue #285 : intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good.	2016-09-30 10:14:47 +02:00
Matthew Honnibal	d3dc5718b2	Fix syntax error in Doc	2016-09-28 11:39:49 +02:00
Matthew Honnibal	1b520e7bab	Improve docstrings for Doc object	2016-09-28 11:15:13 +02:00
Matthew Honnibal	fc4a7ad794	Test and fix Issue #411 : IndexError when .sents property is used on empty string.	2016-09-27 18:49:14 +02:00
Matthew Honnibal	15e42a1ba9	Allow entities to be set by Span, or by 4-tuple (with entity ID)	2016-09-24 01:17:43 +02:00
Matthew Honnibal	e48df859b5	Fix typedef import in span.pyx	2016-09-23 16:02:28 +02:00
Matthew Honnibal	4de13606fd	Fix token.pyx	2016-09-23 15:07:07 +02:00
Matthew Honnibal	b4de419e19	Import hash_t typedef in token.pyx	2016-09-23 14:22:06 +02:00
Matthew Honnibal	c1a2e96604	Clean up notes at end of token.pyx	2016-09-21 20:45:51 +02:00
Matthew Honnibal	58e83fe34b	Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match.	2016-09-21 14:54:55 +02:00
Matthew Honnibal	2735b6247b	Fix orths_and_spaces in Doc.__init__	2016-09-21 14:52:05 +02:00
Matthew Honnibal	cdc10e9a1c	* Fix Issue #375 : noun phrase iteration results in index error if noun phrases are merged during the loop. Fix by accumulating the spans inside the noun_chunks property, allowing the Span index tricks to work.	2016-05-20 10:14:06 +02:00
Matthew Honnibal	5d86c30f0b	* Fix Issue #367 : Missing has_vector property on Doc and Span objects	2016-05-09 12:36:14 +02:00
Matthew Honnibal	8c0888d6cb	* Fix error in span.sent	2016-05-06 00:28:05 +02:00
Matthew Honnibal	26095f9722	* Add span.sent property, re Issue #366	2016-05-06 00:17:38 +02:00
Matthew Honnibal	76021cb853	* Fix bug in Doc.text, introduced by `a862edc`	2016-05-04 11:02:16 +02:00
Matthew Honnibal	29a114e645	* Don't assign 0-valued tags in Doc.from_array	2016-05-02 16:07:50 +02:00
Matthew Honnibal	276fbe9996	* Fix assignment of iterator on Doc object	2016-05-02 15:26:24 +02:00
Matthew Honnibal	508fd1f6dc	* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.	2016-05-02 14:25:10 +02:00
Matthew Honnibal	6df3858dbc	* Fix Issue #323 : Incorrect semantics of Token.__str__ built-in. Add flag to allow users to switch the old semantics back on, to ease transition.	2016-04-12 13:17:59 +10:00
Matthew Honnibal	872695759d	Merge pull request #306 from wbwseeker/german_noun_chunks add German noun chunk functionality	2016-04-08 00:54:24 +10:00
Matthew Honnibal	26622f0ffc	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2016-03-29 14:31:52 +11:00
Matthew Honnibal	ad119c074f	* Fix incorrect whitespacing in Doc.text. This change is potentially breaking, to anyone who was relying on the previous incorrect semantics.	2016-03-29 13:02:42 +11:00

1 2 3 4 5 ...

382 Commits