spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-13 13:17:06 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	dfa752d064	Increment version	2016-10-19 23:19:13 +02:00
Matthew Honnibal	3588a18fb8	Fix hook names in doc	2016-10-19 21:15:16 +02:00
Matthew Honnibal	5d5742b773	Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc.	2016-10-19 20:54:22 +02:00
Matthew Honnibal	ed5e178817	Add sentiment property on lexeme object	2016-10-19 20:52:52 +02:00
Matthew Honnibal	d4aaf2752c	Fix issue #535: Pipeline elements added even when data not installed.	2016-10-19 19:55:19 +02:00
Matthew Honnibal	04d1c959da	Fix version	2016-10-19 03:45:37 +02:00
Matthew Honnibal	d35aa7344e	Change version ID to make PyPI happy	2016-10-19 03:24:39 +02:00
Matthew Honnibal	89d2a5c8b3	Increment build version.	2016-10-19 03:05:17 +02:00
Matthew Honnibal	622b0a9674	Tweak download script	2016-10-19 00:52:16 +02:00
Matthew Honnibal	5a5c7192a5	Fix download.py for GloVe vectors.	2016-10-19 00:47:44 +02:00
Matthew Honnibal	edc45c19d6	Update download script	2016-10-19 00:41:14 +02:00
Matthew Honnibal	2bbb050500	Fix default of serializer_freqs	2016-10-18 19:55:41 +02:00
Matthew Honnibal	1b651db9c5	Fix parser creation in Language class.	2016-10-18 19:36:44 +02:00
Matthew Honnibal	45a6f9b9c7	Fix loading of tagger.	2016-10-18 19:33:04 +02:00
Matthew Honnibal	76c815f40d	Fix spacy.load	2016-10-18 19:23:31 +02:00
Matthew Honnibal	8c8f5c62c6	Add LANG attribute to English and German	2016-10-18 18:52:48 +02:00
Matthew Honnibal	05e2a589a4	Fix None label in matcher	2016-10-18 18:05:21 +02:00
Matthew Honnibal	c3a8a1cf51	Update serializer test.	2016-10-18 16:18:46 +02:00
Matthew Honnibal	7d5212f131	Refactor defaults	2016-10-18 16:18:25 +02:00
Matthew Honnibal	a45a9d5092	Remove stray .tensor attribute from Lexeme	2016-10-18 01:16:32 +02:00
Matthew Honnibal	9258db788a	Revert "Have the matcher return character offsets, to handle the match better." This reverts commit `049c937540`.	2016-10-17 16:49:51 +02:00
Matthew Honnibal	7d446e5094	Revert "Update matcher test, to reflect character offset return instead of token offset." This reverts commit `f8d3e3bcfe`.	2016-10-17 16:49:49 +02:00
Matthew Honnibal	4bf2c53c13	Revert "Hack on matcher tests, for new implementation." This reverts commit `dbe60644ab`.	2016-10-17 16:49:48 +02:00
Matthew Honnibal	2fd97c71cc	Revert "Don't try to pickle matcher." This reverts commit `97bd0c9d00`.	2016-10-17 16:49:43 +02:00
Matthew Honnibal	97bd0c9d00	Don't try to pickle matcher.	2016-10-17 16:38:40 +02:00
Matthew Honnibal	dbe60644ab	Hack on matcher tests, for new implementation.	2016-10-17 16:12:22 +02:00
Matthew Honnibal	f8d3e3bcfe	Update matcher test, to reflect character offset return instead of token offset.	2016-10-17 16:00:10 +02:00
Matthew Honnibal	049c937540	Have the matcher return character offsets, to handle the match better.	2016-10-17 15:58:57 +02:00
Matthew Honnibal	9b60186266	Fix doc class	2016-10-17 15:23:47 +02:00
Matthew Honnibal	6cbdc94959	Lots of updates to Matcher, to make entity handling sane.	2016-10-17 15:23:31 +02:00
Matthew Honnibal	7fd98fc91c	Remove deprecation shim around str/bytes in Token.	2016-10-17 14:02:47 +02:00
Matthew Honnibal	b67697a97b	Improve API for doc.merge() and span.merge(), to use keyword arguments.	2016-10-17 14:02:13 +02:00
Matthew Honnibal	fbb7f3f15c	Add user_data attribute to Doc object.	2016-10-17 11:43:22 +02:00
Matthew Honnibal	c1abc8f6ed	Fix deprecation stuff in Token: Remove the shim for the str/unicode semantics, and raise for has_repvec and repvec	2016-10-17 11:18:41 +02:00
Matthew Honnibal	4ba9eadf3d	Merge branch 'v1.0.0-rc1' of ssh://github.com/explosion/spaCy into v1.0.0-rc1	2016-10-17 02:45:44 +02:00
Matthew Honnibal	09ab447a18	Remove tensor property from token.	2016-10-17 02:45:09 +02:00
Matthew Honnibal	5d10e2005c	Defer some attributes to Doc, via getters_for_tokens attribute.	2016-10-17 02:44:49 +02:00
Matthew Honnibal	8829984efb	Remove tensor attribute from Span and Token.	2016-10-17 02:44:04 +02:00
Matthew Honnibal	d15a88c66a	Defer some attributes to Doc via getters_for_spans	2016-10-17 02:43:35 +02:00
Matthew Honnibal	62230dd13a	Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring	2016-10-17 02:42:51 +02:00
Matthew Honnibal	ae11ea8240	Add getters_for_tokens and getters_for_spans attributes to Doc object.	2016-10-17 02:42:05 +02:00
Matthew Honnibal	be48a7b4f3	Fix conftest for website tests.	2016-10-17 01:54:26 +02:00
Matthew Honnibal	8951bf6989	Update matcher tests	2016-10-17 01:53:24 +02:00
Matthew Honnibal	0cf4aff470	Set default path in EN/DE tests.	2016-10-17 01:52:49 +02:00
Matthew Honnibal	cd71b6b0a9	Remove test of parser pickle	2016-10-17 01:52:10 +02:00
Matthew Honnibal	5bc101006e	Add cfg field to Tagger	2016-10-17 01:03:41 +02:00
Matthew Honnibal	517f090cbf	Use GoldParse in tagger.update	2016-10-17 00:55:15 +02:00
Matthew Honnibal	59038f7efa	Restore support for prior data format -- specifically, the labels field of the config.	2016-10-17 00:53:26 +02:00
Matthew Honnibal	7887ab3b36	Fix default use of feature_templates in parser	2016-10-16 21:41:56 +02:00
Matthew Honnibal	f787cd29fe	Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor.	2016-10-16 21:34:57 +02:00
Matthew Honnibal	311a985fe0	Add input error handling in Doc	2016-10-16 18:16:42 +02:00
Matthew Honnibal	06322ba99d	Add words and spaces keyword arguments to Doc.	2016-10-16 18:13:03 +02:00
Matthew Honnibal	ca51f3b77e	Use DependencyParser and EntityRecognizer in the Language class.	2016-10-16 17:58:12 +02:00
Matthew Honnibal	195d998a12	Fix GoldParse argument to tagger.update	2016-10-16 17:05:09 +02:00
Matthew Honnibal	274a4d4272	Fix queue Python property in StateClass	2016-10-16 17:04:41 +02:00
Matthew Honnibal	e8c8aa08ce	Make action_name optional in StepwiseState	2016-10-16 17:04:16 +02:00
Matthew Honnibal	4bb73b1a93	Fix parser labels in pipeline	2016-10-16 17:03:22 +02:00
Matthew Honnibal	a81c5a7abf	Fix name of labels keyword to 'actions'.	2016-10-16 12:00:27 +02:00
Matthew Honnibal	a079677984	Fix omission of O action when creating blank entity recognizer	2016-10-16 11:43:25 +02:00
Matthew Honnibal	5444d38cc6	Update test for biluo tags	2016-10-16 11:42:45 +02:00
Matthew Honnibal	4fc56d4a31	Rename 'labels' to 'actions' in parser options	2016-10-16 11:42:26 +02:00
Matthew Honnibal	8a6b35d266	Delay binding in MakeDoc	2016-10-16 11:41:55 +02:00
Matthew Honnibal	52b48b415e	Fix GoldParse class	2016-10-16 11:41:36 +02:00
Matthew Honnibal	3259a63779	Whitespace	2016-10-16 01:47:28 +02:00
Matthew Honnibal	509b30834f	Add a pipeline module, to collect and wrap processes for annotation	2016-10-16 01:47:12 +02:00
Matthew Honnibal	0317cea0ad	Fix GoldParse	2016-10-15 23:55:07 +02:00
Matthew Honnibal	1c62573a41	Fix spacy.train	2016-10-15 23:53:46 +02:00
Matthew Honnibal	a48aa15384	Improve the API for the GoldParse class.	2016-10-15 23:53:29 +02:00
Matthew Honnibal	e07fe92b27	Draft a refactored init for the GoldParse class	2016-10-15 22:09:52 +02:00
Matthew Honnibal	47afef7d6b	Add init.py for gold tests	2016-10-15 21:51:28 +02:00
Matthew Honnibal	86ae665c78	Add function for entity->biluo transformation	2016-10-15 21:51:04 +02:00
Matthew Honnibal	2163fd238f	Add tests for entity->biluo transformation	2016-10-15 21:50:43 +02:00
Matthew Honnibal	5e923b9bfa	Return None in match_best_version if no path exists.	2016-10-15 14:47:29 +02:00
Matthew Honnibal	2516382106	Fix loading of English in span test	2016-10-15 14:44:37 +02:00
Matthew Honnibal	dda2fc6bef	Add empty data directory	2016-10-15 14:25:25 +02:00
Matthew Honnibal	049197e0ae	Update tests, somewhat messily.	2016-10-15 14:14:04 +02:00
Matthew Honnibal	1e1a1d9517	Update matcher test	2016-10-15 14:13:41 +02:00
Matthew Honnibal	9cc9ce0f14	Load with default path=False in tests.	2016-10-15 14:13:23 +02:00
Matthew Honnibal	08e9134760	Change default value of path to True	2016-10-15 14:12:54 +02:00
Matthew Honnibal	788657f062	Ensure words are added to vocab before test, so that the lexicon is updated correctly.	2016-10-15 14:12:18 +02:00
Matthew Honnibal	4a1a2bce68	Update version in about.py	2016-10-15 13:44:27 +02:00
Matthew Honnibal	6d8cb515ac	Break the tokenization stage out of the pipeline into a function 'make_doc'. This allows all pipeline methods to have the same signature.	2016-10-14 17:38:29 +02:00
Matthew Honnibal	2cc515b2ed	Add add_flag method to Vocab, re Issue #504.	2016-10-14 12:15:38 +02:00
Matthew Honnibal	f3be9d0a9a	Add tensor field to Lexeme, Token, Doc and Span, so that users have a place to hang neural network outputs	2016-10-14 03:24:13 +02:00
Matthew Honnibal	9b55d97a8f	Update train method	2016-10-13 03:24:53 +02:00
Matthew Honnibal	645d99523a	Move merge_sents method into spacy.gold	2016-10-13 03:24:29 +02:00
Matthew Honnibal	41f88ce938	Fix dep model loading in parser	2016-10-12 20:26:38 +02:00
Matthew Honnibal	d9ae2d68af	Load features by string-name for backwards compatibility.	2016-10-12 20:15:11 +02:00
Matthew Honnibal	a42fbcf946	Require model for test_is_properties	2016-10-12 19:35:18 +02:00
Matthew Honnibal	20c948361b	Use local path in test_lemmatizer	2016-10-12 19:35:00 +02:00
Matthew Honnibal	1318d0bc65	Test with the non-loaded versions of the English and German pipelines.	2016-10-12 19:13:31 +02:00
Matthew Honnibal	0e2bedc373	Fix default labels for parser and NER	2016-10-12 19:12:40 +02:00
Matthew Honnibal	3a03c668c3	Fix message in ParserStateError	2016-10-12 14:44:31 +02:00
Matthew Honnibal	6bf505e865	Fix error on ParserStateError	2016-10-12 14:35:55 +02:00
Matthew Honnibal	ba5e048502	Add docstring for Trainer class.	2016-10-12 14:26:02 +02:00
Matthew Honnibal	847a4a4182	Refactor Language, dropping Language.blank() method.	2016-10-12 13:45:58 +02:00
Matthew Honnibal	ea23b64cc8	Refactor training, with new spacy.train module. Defaults still a little awkward.	2016-10-09 12:24:24 +02:00
Matthew Honnibal	ca32a1ab01	Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good." This reverts commit `8423e8627f`.	2016-09-30 20:20:22 +02:00
Matthew Honnibal	90baa9c7e6	Revert "Changes to matcher.pyx for new StringStore scheme" This reverts commit `3ff09614e0`.	2016-09-30 20:20:13 +02:00
Matthew Honnibal	1b6b129c04	Revert "Changes to morphology.pyx for new StringStore scheme" This reverts commit `95f8cfd745`.	2016-09-30 20:20:02 +02:00
Matthew Honnibal	1d70db58aa	Revert "Changes to iterators.pyx for new StringStore scheme" This reverts commit `4f794b215a`.	2016-09-30 20:19:53 +02:00
Matthew Honnibal	de01e427fd	Revert "Changes to strings.pyx for new StringStore scheme" This reverts commit `22d4752d64`.	2016-09-30 20:19:42 +02:00
Matthew Honnibal	9e09b39b9f	Revert "Changes to transition systems for new StringStore scheme" This reverts commit `0442e0ab1e`.	2016-09-30 20:11:49 +02:00
Matthew Honnibal	e3285f6f30	Revert "Fix report of ParserStateError" This reverts commit `78f19baafa`.	2016-09-30 20:11:33 +02:00
Matthew Honnibal	6736977d82	Revert "Changes to Doc and Token for new string store scheme" This reverts commit `99de44d864`.	2016-09-30 20:11:15 +02:00
Matthew Honnibal	bd7fe6420c	Revert "Changes to test for new string-store" This reverts commit `21e90d7d0b`.	2016-09-30 20:11:01 +02:00
Matthew Honnibal	1f1cd5013f	Revert "Changes to vocab for new stringstore scheme" This reverts commit `a51149a717`.	2016-09-30 20:10:30 +02:00
Matthew Honnibal	1e7d0af127	Revert "Changes to Lexeme for new string store scheme" This reverts commit `717741b6cf`.	2016-09-30 20:10:13 +02:00
Matthew Honnibal	ba51cb8325	Revert "Changes to tagger for new string store scheme" This reverts commit `f5a6aac906`.	2016-09-30 20:09:53 +02:00
Matthew Honnibal	23b7244842	Make sure symbols are unicode strings	2016-09-30 20:02:19 +02:00
Matthew Honnibal	f5a6aac906	Changes to tagger for new string store scheme	2016-09-30 20:01:51 +02:00
Matthew Honnibal	717741b6cf	Changes to Lexeme for new string store scheme	2016-09-30 20:01:36 +02:00
Matthew Honnibal	a51149a717	Changes to vocab for new stringstore scheme	2016-09-30 20:01:19 +02:00
Matthew Honnibal	21e90d7d0b	Changes to test for new string-store	2016-09-30 20:00:58 +02:00
Matthew Honnibal	99de44d864	Changes to Doc and Token for new string store scheme	2016-09-30 20:00:21 +02:00
Matthew Honnibal	78f19baafa	Fix report of ParserStateError	2016-09-30 19:59:22 +02:00
Matthew Honnibal	0442e0ab1e	Changes to transition systems for new StringStore scheme	2016-09-30 19:58:51 +02:00
Matthew Honnibal	22d4752d64	Changes to strings.pyx for new StringStore scheme	2016-09-30 19:58:09 +02:00
Matthew Honnibal	4f794b215a	Changes to iterators.pyx for new StringStore scheme	2016-09-30 19:57:49 +02:00
Matthew Honnibal	95f8cfd745	Changes to morphology.pyx for new StringStore scheme	2016-09-30 19:57:10 +02:00
Matthew Honnibal	3ff09614e0	Changes to matcher.pyx for new StringStore scheme	2016-09-30 19:56:48 +02:00
Matthew Honnibal	eceeaefe53	Fix defaults for Parser and Entity, adding a blank= argument.	2016-09-30 19:56:06 +02:00
Matthew Honnibal	8423e8627f	Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good.	2016-09-30 10:14:47 +02:00
Matthew Honnibal	d3dc5718b2	Fix syntax error in Doc	2016-09-28 11:39:49 +02:00
Matthew Honnibal	1b520e7bab	Improve docstrings for Doc object	2016-09-28 11:15:13 +02:00
Matthew Honnibal	81a47c01d8	Fix test for empty sentence string.	2016-09-27 19:21:22 +02:00
Matthew Honnibal	4cbf0d3bb6	Handle errors when no valid actions are available, pointing users to the issue tracker.	2016-09-27 19:19:53 +02:00
Matthew Honnibal	430473bd98	Raise errors when no actions are available, re Issue #429	2016-09-27 19:09:37 +02:00
Matthew Honnibal	fc4a7ad794	Test and fix Issue #411: IndexError when .sents property is used on empty string.	2016-09-27 18:49:14 +02:00
Matthew Honnibal	3d370b7d45	Add test for Issue #445, fixed in `3cb4d455d`, with improved lemmatizer logic	2016-09-27 18:39:46 +02:00
Matthew Honnibal	a2f3510d6d	Fix lemmatizer	2016-09-27 17:47:05 +02:00
Matthew Honnibal	07776d8096	Fix pos name conflict in lemmatize	2016-09-27 17:35:58 +02:00
Matthew Honnibal	35cd953f9e	Fix pos name conflict with morphology	2016-09-27 14:16:22 +02:00
Matthew Honnibal	8e7df3c4ca	Expect the parser data, if parser.load() is called.	2016-09-27 14:02:12 +02:00
Matthew Honnibal	bb4f201ad2	Pass morphological features from tag map into the lemmatizer.	2016-09-27 14:01:43 +02:00
Matthew Honnibal	40509e8bca	Tweak the new is_base_form logic, because we can expect the 'pos' key in the morphology we're passed.	2016-09-27 14:01:16 +02:00
Matthew Honnibal	9c8ac91d72	Add test for Issue #435	2016-09-27 13:52:38 +02:00
Matthew Honnibal	3cb4d455d2	Pass lemmatizer morphological features, so that rules are sensitive to base/inflected distinction, which is how the WordNet data is designed. See Issue #435	2016-09-27 13:52:11 +02:00
Matthew Honnibal	e233328d38	Fix Issue #371: Lexeme objects were unhashable.	2016-09-27 13:22:30 +02:00
Matthew Honnibal	e382e48d9f	Temporarily patch handling of default templates for tagger. Need to move these to language_data.	2016-09-27 13:21:28 +02:00
Matthew Honnibal	a44763af0e	Fix Issue #469: Incorrectly cased root label in noun chunk iterator	2016-09-27 13:13:01 +02:00
Matthew Honnibal	b14b9b096b	Return None if /deps directory not present, instead of trying to load the parser.	2016-09-26 18:48:03 +02:00
Matthew Honnibal	e07b9665f7	Don't expect parser model	2016-09-26 18:09:33 +02:00
Matthew Honnibal	ee6fa106da	Fix parser features	2016-09-26 17:57:32 +02:00
Matthew Honnibal	e607e4b598	Fix parser loading	2016-09-26 17:51:11 +02:00
Matthew Honnibal	0b2d7ae9d6	Fix Entity creation	2016-09-26 15:41:22 +02:00
Matthew Honnibal	2debc4e0a2	Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class.	2016-09-26 11:57:54 +02:00
Matthew Honnibal	722199acb8	Add spacy.blank() method, that doesn't load data. Don't try to load data if path is falsey	2016-09-26 11:07:46 +02:00
Matthew Honnibal	e56653f848	Add language data for German	2016-09-25 15:44:45 +02:00
Matthew Honnibal	7db956133e	Move tokenizer data for German into spacy.de.language_data	2016-09-25 15:37:33 +02:00
Matthew Honnibal	95aaea0d3f	Refactor so that the tokenizer data is read from Python data, rather than from disk	2016-09-25 14:49:53 +02:00
Matthew Honnibal	d7e9acdcdf	Add English language data, so that the tokenizer doesn't require the data download	2016-09-25 14:49:00 +02:00
Matthew Honnibal	82b8cc5efb	Whitespace	2016-09-24 22:17:01 +02:00
Matthew Honnibal	fd58f7655a	Python 3 compatible basestring	2016-09-24 22:16:43 +02:00
Matthew Honnibal	082e95b19e	Python 3 compatible basestring	2016-09-24 22:09:21 +02:00
Matthew Honnibal	f19af6cb2c	Python 3 compatible basestring	2016-09-24 22:08:43 +02:00
Matthew Honnibal	3ed4cdfe32	Handle pathlib.Path objects in CFile	2016-09-24 22:01:46 +02:00
Matthew Honnibal	df88690177	Fix encoding of path variable	2016-09-24 21:13:15 +02:00
Matthew Honnibal	af847e07fc	Fix usage of pathlib for Python3 -- turning paths to strings.	2016-09-24 21:05:27 +02:00
Matthew Honnibal	453683aaf0	Fix spacy/vocab.pyx	2016-09-24 20:50:31 +02:00
Matthew Honnibal	fd65cf6cbb	Finish refactoring data loading	2016-09-24 20:26:17 +02:00
Matthew Honnibal	83e364188c	Mostly finished loading refactoring. Design is in place, but doesn't work yet.	2016-09-24 15:42:01 +02:00
Matthew Honnibal	9dc8043a7e	Refactor Language to use new Defaults class, and work on revised data loading. We're getting rid of sputnik's weird file-system wrapper, and using pathlib.	2016-09-24 14:08:53 +02:00
Matthew Honnibal	b00f683a0c	Fix matcher test	2016-09-24 11:20:58 +02:00
Matthew Honnibal	eaf4065480	Expose the _patterns private member	2016-09-24 11:20:42 +02:00
Matthew Honnibal	15e42a1ba9	Allow entities to be set by Span, or by 4-tuple (with entity ID)	2016-09-24 01:17:43 +02:00
Matthew Honnibal	60fdf4d5f1	Remove commented out debugging code	2016-09-24 01:17:18 +02:00
Matthew Honnibal	939a791a52	Update tests	2016-09-24 01:17:03 +02:00
Matthew Honnibal	55f1f7edaf	Don't automatically write new entities into the Doc in the Matcher. This fixes a long-standing wart, but introduces a backwards incompatibility.	2016-09-24 01:16:45 +02:00
Matthew Honnibal	e48df859b5	Fix typedef import in span.pyx	2016-09-23 16:02:28 +02:00
Matthew Honnibal	4de13606fd	Fix token.pyx	2016-09-23 15:07:07 +02:00
Matthew Honnibal	b4de419e19	Import hash_t typedef in token.pyx	2016-09-23 14:22:06 +02:00
Matthew Honnibal	c1a2e96604	Clean up notes at end of token.pyx	2016-09-21 20:45:51 +02:00
Matthew Honnibal	f6e587b1c7	Fix matcher tests	2016-09-21 20:45:20 +02:00
Matthew Honnibal	58e83fe34b	Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match.	2016-09-21 14:54:55 +02:00
Matthew Honnibal	2735b6247b	Fix orths_and_spaces in Doc.__init__	2016-09-21 14:52:05 +02:00
Matthew Honnibal	070af4af9d	Revert "* Working neural net, but features hacky. Switching to extractor." This reverts commit `7c2f1a673b`.	2016-09-21 12:26:14 +02:00
Matthew Honnibal	6b202ec43f	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-09-21 12:08:25 +02:00
Mahmoud Lababidi	4c9ccc3b8b	Add parameter to download() for application to not exit if a Model exists. The default behavior is unchanged.	2016-09-14 10:04:09 -04:00
Adam Ever Hadani	f1c0762443	exit code 0 for when downloading a model that already was downloaded	2016-07-13 16:22:14 -07:00
Matthew Honnibal	7c2f1a673b	* Working neural net, but features hacky. Switching to extractor.	2016-05-26 19:06:10 +02:00
Matthew Honnibal	cdc10e9a1c	* Fix Issue #375: noun phrase iteration results in index error if noun phrases are merged during the loop. Fix by accumulating the spans inside the noun_chunks property, allowing the Span index tricks to work.	2016-05-20 10:14:06 +02:00
Matthew Honnibal	13fad36e49	* Cosmetic change to english noun chunks iterator -- use enumerate instead of range loop	2016-05-20 10:11:05 +02:00
Matthew Honnibal	02276cc444	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-17 16:56:22 +02:00
Matthew Honnibal	4d7f5468bb	* Change Language class to use a .pipeline attribute, instead of having the pipeline hard coded	2016-05-17 16:55:42 +02:00
Daylen Yang	5405e7dd73	Fix get_lang_class parsing (take 2)	2016-05-16 16:40:31 -07:00
Matthew Honnibal	b240104f40	Revert "Fix get_lang_class parsing"	2016-05-17 08:04:26 +10:00
Daylen Yang	1692c2df3c	Fix get_lang_class parsing. We want the get_lang_class to return "en" for both "en" and "en_glove_cc_300_1m_vectors". Changed the split rule to "_" so that this happens.	2016-05-16 14:38:20 -07:00
Matthew Honnibal	17137f5c0c	* Fix issue #372: mistake in Lexeme rich comparison	2016-05-12 12:58:57 +02:00
Matthew Honnibal	cc8bf62208	* Fix Issue #360: Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens.	2016-05-09 13:23:47 +02:00
Matthew Honnibal	c61ee8f9fa	* Increment version	2016-05-09 13:20:00 +02:00
Matthew Honnibal	5d86c30f0b	* Fix Issue #367: Missing has_vector property on Doc and Span objects	2016-05-09 12:36:14 +02:00
Wolfgang Seeker	7b78239436	add fix for German noun chunk iterator (issue #365)	2016-05-06 01:41:26 +02:00
Matthew Honnibal	8c0888d6cb	* Fix error in span.sent	2016-05-06 00:28:05 +02:00
Matthew Honnibal	bb94022975	* Fix Issue #365: Error introduced during noun phrase chunking, due to use of corrected PRON/PROPN/etc tags.	2016-05-06 00:21:05 +02:00
Matthew Honnibal	41342ca79b	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-06 00:17:58 +02:00
Matthew Honnibal	26095f9722	* Add span.sent property, re Issue #366	2016-05-06 00:17:38 +02:00
Wolfgang Seeker	dbf8f5f3ec	fix bug in StateC.set_break()	2016-05-05 15:15:34 +02:00
Wolfgang Seeker	3c44b5dc1a	call deprojectivization after parsing	2016-05-05 15:10:36 +02:00
Matthew Honnibal	472f576b82	* Deprojectivize German parses	2016-05-05 15:01:10 +02:00
Matthew Honnibal	9bbd6cf031	* Work on Chinese support	2016-05-05 11:39:12 +02:00
Matthew Honnibal	a6a25166ba	* Remove print from test	2016-05-05 11:10:59 +02:00
Matthew Honnibal	e31df66d26	* Fix Issue #361: Lexemes didn't have rich comparison.	2016-05-05 01:32:26 +02:00
Matthew Honnibal	7441ca30ee	* Add tests for Issue #361: Lexeme rich comparison	2016-05-05 01:31:58 +02:00
Matthew Honnibal	72564213e3	* Add test for Issue #309	2016-05-04 16:00:28 +02:00
Matthew Honnibal	76f1d871da	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-04 15:54:00 +02:00
Matthew Honnibal	519366f677	* Fix Issue #351: Indices off when leading whitespace	2016-05-04 15:53:36 +02:00
Matthew Honnibal	b4bfc6ae55	* Add test for Issue #351: Indices off when leading whitespace	2016-05-04 15:53:17 +02:00
Matthew Honnibal	76021cb853	* Fix bug in Doc.text, introduced by `a862edc`	2016-05-04 11:02:16 +02:00
Wolfgang Seeker	e4ea2bea01	fix whitespace	2016-05-04 07:40:38 +02:00
Wolfgang Seeker	5bf2fd1f78	make the code less cryptic	2016-05-03 17:19:05 +02:00
Wolfgang Seeker	a06fca9fdf	German noun chunk iterator now doesn't return tokens more than once	2016-05-03 16:58:59 +02:00
Wolfgang Seeker	7825b75548	add tests for German noun chunker	2016-05-03 15:01:28 +02:00
Wolfgang Seeker	7b246c13cb	reformulate noun chunk tests for English	2016-05-03 14:24:35 +02:00
Wolfgang Seeker	1786331cd8	add model sanity test	2016-05-03 12:51:47 +02:00
Matthew Honnibal	1f1532142f	* Fix cost calculation on non-monotonic oracle	2016-05-03 00:21:08 +02:00
Matthew Honnibal	377a624046	Merge pull request #358 from wbwseeker/german_lemmatizer_dummy: German lemmatizer dummy	2016-05-03 07:38:26 +10:00
Wolfgang Seeker	92bfbebeec	remove unnecessary imports	2016-05-02 17:33:22 +02:00
Wolfgang Seeker	857454ffa0	fix indentation -.-	2016-05-02 17:10:41 +02:00
Matthew Honnibal	308a28c26c	* Whitespace	2016-05-02 16:08:11 +02:00
Matthew Honnibal	29a114e645	* Don't assign 0-valued tags in Doc.from_array	2016-05-02 16:07:50 +02:00
Matthew Honnibal	c1c11a8ae0	* Fix formatting on serializer tests	2016-05-02 16:07:21 +02:00
Wolfgang Seeker	dae6bc05eb	define German dummy lemmatizer until morphology is done	2016-05-02 16:04:53 +02:00
Matthew Honnibal	6e1f1c4b9e	Merge pull request #357 from wbwseeker/german_ner: German ner	2016-05-02 23:39:34 +10:00
Wolfgang Seeker	b6b96b233c	don't require read_json_file to expect particular annotations	2016-05-02 15:29:30 +02:00
Matthew Honnibal	902a389d85	* Fix merge conflict in test_parse	2016-05-02 15:28:07 +02:00
Matthew Honnibal	276fbe9996	* Fix assignment of iterator on Doc object	2016-05-02 15:26:24 +02:00
Matthew Honnibal	02c23cc1d0	* Fix sentence boundary test	2016-05-02 15:26:07 +02:00
Matthew Honnibal	d2f469b809	* Fix parsing tests, so that labels are added if they're missing, and so that the branching test values are correct	2016-05-02 15:25:27 +02:00
Wolfgang Seeker	b11cbb06c6	remove old tests for sentence boundary detection	2016-05-02 14:36:35 +02:00
Matthew Honnibal	508fd1f6dc	* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.	2016-05-02 14:25:10 +02:00
Matthew Honnibal	e526be5602	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-02 13:08:08 +02:00
Wolfgang Seeker	fa961ea694	add tests for serialization bug	2016-05-02 11:01:56 +02:00
Matthew Honnibal	97b2bba249	* Merge updated/simplified Break approach	2016-04-25 19:44:42 +00:00
Matthew Honnibal	77609588b6	* Fix assignment of root label to words left as root implicitly, after parsing ends.	2016-04-25 19:41:59 +00:00
Matthew Honnibal	7c2d2deaa7	* Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322	2016-04-25 19:41:59 +00:00
Wolfgang Seeker	c2f76a4024	Merge branch 'master' into german_ner	2016-04-25 13:21:23 +02:00
Wolfgang Seeker	1003e7ccec	remove debug output from tests	2016-04-25 12:12:40 +02:00
Wolfgang Seeker	f57f843e85	fix bug in updating tree structure when introducing additional roots	2016-04-25 12:01:19 +02:00
Matthew Honnibal	478a8d1829	* Register Chinese language in spacy/__init__.py	2016-04-24 18:45:16 +02:00
Matthew Honnibal	8569dbc2d0	* Add initial stuff for Chinese parsing	2016-04-24 18:44:24 +02:00
Wolfgang Seeker	4d7f393fae	don't require json-files to have syntactic annotation	2016-04-22 16:32:27 +02:00
Wolfgang Seeker	b6477fc4f4	adjusted tests to Travis Setup	2016-04-21 17:15:10 +02:00
Wolfgang Seeker	736ffcb9a2	remove whitespace	2016-04-21 16:55:55 +02:00
Wolfgang Seeker	6c7301cc6d	the parser now introduces sentence boundaries properly when predicting dependents with root labels	2016-04-21 16:50:53 +02:00
Wolfgang Seeker	12024b0b0a	bugfix: introducing multiple roots now updates original head's properties; adjust tests to rely less on statistical model	2016-04-20 16:42:41 +02:00
Matthew Honnibal	67ce96c9c9	* Make patterns argument to Matcher class optional	2016-04-17 21:32:24 +02:00
Matthew Honnibal	8b4677d34d	* Add missing keyword arguments to spacy.load() function	2016-04-17 21:31:50 +02:00
Matthew Honnibal	2add5206aa	* Fix description of matcher test	2016-04-17 15:40:21 +02:00
Matthew Honnibal	2b419d5b8c	* Update test for Issue #242	2016-04-17 15:34:23 +02:00
Matthew Honnibal	f12b043308	* Add test for Issue #242: Overlapping matches not well recognised.	2016-04-17 15:19:17 +02:00
Wolfgang Seeker	b98cc3266d	bugfix: iterators now reset properly when called a second time	2016-04-15 17:49:16 +02:00
Wolfgang Seeker	e6945c4d0e	bugfix: uppercase attr values before looking them up	2016-04-15 15:46:31 +02:00
Matthew Honnibal	c0909afe22	Merge pull request #312 from wbwseeker/space_head_bug: add restrictions to L-arc and R-arc to prevent space heads	2016-04-15 20:36:03 +10:00
Wolfgang Seeker	289b10f441	remove some comments	2016-04-14 15:37:51 +02:00
Matthew Honnibal	6f82065761	* Fix infixed commas in tokenizer, re Issue #326. Need to benchmark on empirical data, to make sure this doesn't break other cases.	2016-04-14 11:36:03 +02:00
Matthew Honnibal	0f957dd586	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2016-04-14 10:37:56 +02:00
Matthew Honnibal	108aca0e50	* Make Matcher use attrs from the attrs.pyx file, rather than having an incomplete function doing the mapping.	2016-04-14 10:37:39 +02:00
Matthew Honnibal	61d20de35d	* Fix language.py docstring	2016-04-14 10:36:57 +02:00
Wolfgang Seeker	d99a9cbce9	different handling of space tokens: space tokens are now always attached to the previous non-space token. There are two exceptions: leading space tokens are attached to the first following non-space token; in input that consists exclusively of space tokens, the last space token is the head of all others.	2016-04-13 15:28:28 +02:00
Matthew Honnibal	04d0209be9	* Recognise multiple infixes in a token.	2016-04-13 18:38:26 +10:00
Henning Peters	a473d6e937	fix tests (use english model)	2016-04-12 16:41:57 +02:00
Henning Peters	f2d011c034	avoid polluting spacy namespace with lang classes	2016-04-12 16:31:16 +02:00
Henning Peters	ff690f76ba	fix loading non-german models	2016-04-12 16:00:56 +02:00
Henning Peters	6215272786	remove ujson as default non-dev dependency (still works as fallback if installed), because ujson doesn't ship wheels	2016-04-12 11:28:07 +02:00
Matthew Honnibal	6df3858dbc	* Fix Issue #323: Incorrect semantics of Token.__str__ built-in. Add flag to allow users to switch the old semantics back on, to ease transition.	2016-04-12 13:17:59 +10:00
Wolfgang Seeker	d328e0b4a8	Merge branch 'master' into space_head_bug	2016-04-11 12:11:01 +02:00
Wolfgang Seeker	80bea62842	bugfix in unit test	2016-04-08 16:46:44 +02:00
Wolfgang Seeker	be4903a1b2	update version numbers	2016-04-08 13:54:05 +02:00
Wolfgang Seeker	1fe911cdb0	bugfix	2016-04-07 18:19:51 +02:00
Matthew Honnibal	872695759d	Merge pull request #306 from wbwseeker/german_noun_chunks: add German noun chunk functionality	2016-04-08 00:54:24 +10:00
Henning Peters	470cdf5bf9	remove deprecated LOCAL_DATA_DIR	2016-04-05 11:25:54 +02:00
Matthew Honnibal	26622f0ffc	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2016-03-29 14:31:52 +11:00
Matthew Honnibal	b1fe41b45d	* Extend infix test, commenting on limitation of tokenizer w.r.t. infixes at the moment.	2016-03-29 14:31:05 +11:00
Matthew Honnibal	9c73983bdd	* Add test for hyphenation problem in Issue #302	2016-03-29 14:27:13 +11:00
Matthew Honnibal	ad119c074f	* Fix incorrect whitespacing in Doc.text. This change is potentially breaking, to anyone who was relying on the previous incorrect semantics.	2016-03-29 13:02:42 +11:00
Matthew Honnibal	8c7a1908ee	Merge pull request #307 from scoder/faster_string_store: remove internal redundancy and overhead from StringStore	2016-03-29 12:59:52 +11:00
Wolfgang Seeker	7195b6742d	add restrictions to L-arc and R-arc to prevent space heads	2016-03-28 10:40:52 +02:00
Matthew Honnibal	8c77a994c6	Merge pull request #305 from henningpeters/master: multiple langs in download script	2016-03-26 21:54:59 +11:00
Henning Peters	c90d4a6f17	relative imports in __init__.py	2016-03-26 11:44:53 +01:00
Henning Peters	db095a162c	fix	2016-03-25 18:59:47 +01:00
Henning Peters	b8f63071eb	add lang registration facility	2016-03-25 18:54:45 +01:00
Matthew Honnibal	4a37fdcee1	Merge pull request #287 from wbwseeker/deproj_sentbnd_bug: add function to Token for setting head and dep (and dep_)	2016-03-25 09:47:45 +11:00
Stefan Behnel	f18805ee1c	make StringStore.__contains__() return True for the empty string (which is also contained in iteration)	2016-03-24 15:42:12 +01:00
Stefan Behnel	f2cfbfc412	remove internal redundancy and overhead from StringStore	2016-03-24 15:25:27 +01:00
Wolfgang Seeker	d65ef41d08	make error messages language independent	2016-03-24 11:47:09 +01:00
Henning Peters	963570aa49	Merge branch 'master' of github.com:spacy-io/spaCy	2016-03-24 11:19:47 +01:00
Henning Peters	a7d7ea3afa	first idea for supporting multiple langs in download script	2016-03-24 11:19:43 +01:00
Wolfgang Seeker	5080077097	revert init_model.py back to pre-german state (because it makes more sense); simplify token.n_rights and token.n_lefts	2016-03-21 16:10:25 +01:00
Wolfgang Seeker	5e2e8e951a	add baseclass DocIterator for iterators over documents; add classes for English and German noun chunks; the respective iterators are set for the document when created by the parser, as they depend on the annotation scheme of the parsing model	2016-03-16 15:53:35 +01:00
Matthew Honnibal	80134eb12d	Merge branch 'master' of https://github.com/spacy-io/spaCy	2016-03-15 19:14:50 +00:00
Wolfgang Seeker	2ae253ef5b	changed head.__set__ to make it simpler	2016-03-14 13:43:48 +01:00
Henning Peters	c12d3dd200	add __init__.py to empty package dirs	2016-03-14 11:28:03 +01:00
Henning Peters	54f3447b5f	cleanup	2016-03-14 01:46:33 +01:00
Wolfgang Seeker	46e3f979f1	add function for setting head and label to token; change PseudoProjectivity.deprojectivize to use these functions	2016-03-11 17:31:06 +01:00
Wolfgang Seeker	03fb498dbe	introduce lang field for LexemeC to hold language id; put noun_chunk logic into iterators.py for each language separately	2016-03-10 13:01:34 +01:00
Wolfgang Seeker	bc9c62e279	replace Language functions with corresponding orth functions; implement punctuation functions in orth	2016-03-09 18:07:37 +01:00
Wolfgang Seeker	d9312bc9ea	add new files npchunks.{pyx,pxd} to hold noun phrase chunk generators	2016-03-09 16:18:48 +01:00
Matthew Honnibal	1508528c8c	* Increment version	2016-03-08 15:58:45 +00:00
Matthew Honnibal	963fe5258e	* Add missing __contains__ method to vocab	2016-03-08 15:49:10 +00:00
Matthew Honnibal	478aa21cb0	* Remove broken __reduce__ method on vocab	2016-03-08 15:48:21 +00:00
Matthew Honnibal	20235bde00	Merge pull request #282 from henningpeters/switch_vectors: initial proposal for ability to switch vectors	2016-03-09 01:39:41 +11:00
Henning Peters	eb7ae61b1c	cleanup api	2016-03-08 12:59:18 +01:00
Henning Peters	b740f20191	hash_string() should not depend on python's internal unicode representation, also fixes https://github.com/spacy-io/sense2vec/issues/5 for py2	2016-03-06 09:19:27 +01:00
Henning Peters	aa4d964c14	cleanup api	2016-03-05 17:51:32 +01:00
Henning Peters	931c07a609	initial proposal for separate vector package	2016-03-04 11:09:06 +01:00
Wolfgang Seeker	7adbd7a785	replace Counter with normal dict	2016-03-03 21:36:27 +01:00
Wolfgang Seeker	1ae487a4f6	add backwards compatibility with python 2.6	2016-03-03 21:18:12 +01:00
Wolfgang Seeker	9d1e6de4a0	make a proper list from zip iterator	2016-03-03 19:51:01 +01:00
Wolfgang Seeker	49f9d1c085	change test_nonproj.py to not use zip inside numpy.asarray	2016-03-03 19:42:09 +01:00
Wolfgang Seeker	72b8df0684	turned PseudoProjectivity into a normal python class	2016-03-03 19:05:08 +01:00
Matthew Honnibal	fcaa0ad7ce	Merge pull request #280 from wbwseeker/german_parser: German parser	2016-03-04 03:27:42 +11:00
Wolfgang Seeker	690c5acabf	adjust train.py to train both english and german models	2016-03-03 15:21:00 +01:00
Wolfgang Seeker	3448cb40a4	integrated pseudo-projective parsing into parser - nonproj.pyx holds a class PseudoProjectivity which currently holds all functionality to implement Nivre & Nilsson 2005's pseudo-projective parsing using the HEAD decoration scheme - changed lefts/rights in Token to account for possible non-projective structures	2016-03-01 10:09:08 +01:00
Wolfgang Seeker	56b7210e82	moved nonproj.py to syntax/nonproj.pyx	2016-02-25 15:08:49 +01:00
Henning Peters	f3df736e0a	remove unidecode-related test	2016-02-24 18:22:22 +01:00
Wolfgang Seeker	4b2297d5d4	add class PseudoProjective for pseudo-projective parsing; PseudoProjective() implements the algorithm from Nivre & Nilsson 2005 using their HEAD decoration scheme.	2016-02-24 11:26:25 +01:00
Henning Peters	12d58a7099	remove text-unidecode dependency	2016-02-24 08:01:59 +01:00
Wolfgang Seeker	8d531c958b	replace tests for non-projectivity - add functions to find non-projective edges - add test file for non-projectivity functions	2016-02-22 14:40:40 +01:00
Matthew Honnibal	141639ea3a	* Fix bug in tokenizer that caused new tokens to be added for affixes	2016-02-21 23:17:47 +00:00
Wolfgang Seeker	eae35e9b27	add tokenizer files for German, add/change code to train German pos tagger - add files to specify rules for German tokenization - change generate_specials.py to generate from an external file (abbrev.de.tab) - copy gazetteer.json from lang_data/en/ - init_model.py - change doc freq threshold to 0 - add train_german_tagger.py - expects conll09-formatted input	2016-02-18 13:24:20 +01:00
Henning Peters	9cc4f8d5b3	avoid shadowing __name__	2016-02-15 01:33:39 +01:00
Henning Peters	4c9e3c7911	upgrade sputnik, enforce data api via model version constraints	2016-02-14 16:03:17 +01:00
Henning Peters	9d8966a2c0	Update test_tokenizer.py	2016-02-10 19:24:37 +01:00
Henning Peters	3b5f1e753b	py26 compatibility	2016-02-10 14:32:54 +01:00
Henning Peters	ee1f1ac300	mark test_sentence_space() as model test	2016-02-10 07:49:11 +01:00
Matthew Honnibal	5d96b3ef4f	* Increment version	2016-02-07 13:48:58 +01:00
Matthew Honnibal	1b83cb9dfa	* Fix Issue #251 : Incorrect right edge calculation on left-clobber low in the tree	2016-02-07 00:00:42 +01:00
Matthew Honnibal	c6623889c1	* Add test for Issue #251 : Incorrect right edges, caused by bad update to r_edge in del_arc, triggered from non-monotonic left-arc	2016-02-06 23:47:51 +01:00
Matthew Honnibal	a95974ad3f	* Fix oov probability	2016-02-06 15:13:55 +01:00
Matthew Honnibal	af8514cb0c	* Refine the way the is_parsed attribute is set by from_array	2016-02-06 14:44:35 +01:00
Matthew Honnibal	161b01d4c0	* Tweak usage example for multi-processing	2016-02-06 14:44:11 +01:00
Matthew Honnibal	7f24229f10	* Don't try to pickle the tokenizer	2016-02-06 14:09:05 +01:00
Matthew Honnibal	dcb401f3e1	* Remove broken Vocab pickling	2016-02-06 14:08:47 +01:00
Matthew Honnibal	e66d45bf66	* Restore previous patch to Span.root, as it seems it wasn't the cause of the problem.	2016-02-06 13:37:41 +01:00
Matthew Honnibal	4412a70dc5	* Initialize StateC._empty_token to 0, to avoid undefined behaviour.	2016-02-06 13:34:38 +01:00
Matthew Honnibal	1b41f868d2	* Check for errors in parser, and parallelise the left-over batch	2016-02-06 10:06:30 +01:00
Matthew Honnibal	031b00cb91	* Fix Span.root calculation	2016-02-05 20:12:09 +01:00
Matthew Honnibal	165ca28b80	* Set is_parsed flag in Parser.pipe	2016-02-05 19:51:44 +01:00
Matthew Honnibal	bdd579db0a	* Set is_parsed flag in Parser.pipe	2016-02-05 19:50:11 +01:00
Matthew Honnibal	7119e77fb6	* Fix Matcher.pipe	2016-02-05 19:46:02 +01:00
Matthew Honnibal	1cf0100bf6	* Add test for multithreading	2016-02-05 19:38:22 +01:00
Matthew Honnibal	b04c9aad71	* Fix off-by-one in Parser.pipe	2016-02-05 19:37:50 +01:00
Matthew Honnibal	e5c447e237	* Questionable fix to problem in Span.root	2016-02-05 19:18:35 +01:00
Matthew Honnibal	1ef84a0557	* Merge master into rethinc2	2016-02-05 12:55:59 +01:00
Matthew Honnibal	4cf34fc170	Merge branch 'rethinc2' of ssh://github.com/honnibal/spaCy into rethinc2	2016-02-05 12:48:28 +01:00
Matthew Honnibal	249dccbe95	* Fix Language.pipe	2016-02-05 12:47:57 +01:00
Matthew Honnibal	c0e63feccc	* xfail pickle tests	2016-02-05 12:46:58 +01:00
Matthew Honnibal	6aa92b70f1	* Fix merge problem in span	2016-02-05 12:46:11 +01:00
Matthew Honnibal	048dfe35aa	* cimport cython.parallel	2016-02-05 12:20:42 +01:00

2073 Commits