spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-29 03:16:31 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	f111b228e0	Fix re-parsing of previously parsed text. If a Doc object had been previously parsed, it was possible for invalid parses to be added. There were two problems: 1) The parse was only being partially erased. 2) The RightArc action was able to create a 1-cycle. This patch fixes both errors, and avoids resetting the parse if one is present. In theory this might allow a better parse to be predicted by running the parser twice. Closes #1253.	2017-10-20 16:27:36 +02:00
Matthew Honnibal	9010a1a060	Create vectors correctly	2017-10-20 14:19:46 +02:00
Matthew Honnibal	33229b1c9e	Remove print statement	2017-10-20 14:19:29 +02:00
Matthew Honnibal	cfae54c507	Make change to Vectors.__init__	2017-10-20 14:19:04 +02:00
Matthew Honnibal	ebecaddb76	Make 'data_or_width' two keyword args in Vectors.__init__. Previously the data and width options were one argument in Vectors, which meant you couldn't say vectors = Vectors(strings, width=300). It's better to have two keywords.	2017-10-20 14:17:15 +02:00
Matthew Honnibal	49895fbef6	Rename 'SP' special tag to '_SP'. Renaming the tag with an underscore lets us add it to the tag map without worrying that we'll change the sequence of tags, which throws off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag, the "VERB" tag is pushed to a different class ID, and the model is all messed up.	2017-10-20 14:01:12 +02:00
Matthew Honnibal	506cf2eb13	Remove cpdef enum, to avoid too much code generation	2017-10-20 14:00:23 +02:00
Matthew Honnibal	6218af0105	Remove cpdef enum, to avoid too much code generation	2017-10-20 13:59:57 +02:00
Matthew Honnibal	92ac9316b5	Fix initialization of vectors, to address serialization problem	2017-10-20 13:59:24 +02:00
Ramanan Balakrishnan	0726946563	cleanup to_array implementation using fixes on master	2017-10-20 17:09:37 +05:30
ines	108f1f786e	Update symbols and document missing token attributes (see #1439)	2017-10-20 13:08:44 +02:00
ines	4acab77a8a	Add missing symbol for LAW entities (resolves #1427)	2017-10-20 13:07:57 +02:00
Ramanan Balakrishnan	b3ab124fc5	Support strings for attribute list in doc.to_array	2017-10-20 11:46:57 +05:30
Ramanan Balakrishnan	7b9b1be44c	Support single value for attribute list in doc.to_array	2017-10-19 17:00:41 +05:30
Matthew Honnibal	61bc203f3f	Merge pull request #1438 from explosion/feature/fast-parser 💫 Improve runtime CPU efficiency of parser/NER	2017-10-19 02:42:21 +02:00
Matthew Honnibal	15e5a04a8d	Clean up more depth=0 conditional code	2017-10-19 01:48:43 +02:00
Matthew Honnibal	906c50ac59	Fix loop typing, that caused error on windows	2017-10-19 01:48:39 +02:00
ines	24512420b1	Show error if data_path does not exist or is None (see #1102)	2017-10-19 00:53:49 +02:00
ines	bf415fd778	Add test for serializing extension attrs (see #1085)	2017-10-19 00:53:08 +02:00
Matthew Honnibal	960788aaa2	Eliminate dead code in parser, and raise errors for obsolete options	2017-10-19 00:42:34 +02:00
Matthew Honnibal	bbfd7d8d5d	Clean up parser multi-threading	2017-10-19 00:25:21 +02:00
Matthew Honnibal	f018f2030c	Try optimized parser forward loop	2017-10-18 21:48:00 +02:00
Matthew Honnibal	65bf5e85bd	Improve piping in language.pipe	2017-10-18 21:46:12 +02:00
Matthew Honnibal	633a75c7e0	Break parser batches into sub-batches, sorted by length.	2017-10-18 21:45:01 +02:00
Ines Montani	f0d577e460	Merge pull request #1425 from explosion/feature/hindi-tokenizer 💫 Basic Hindi tokenization support	2017-10-18 13:34:52 +02:00
Matthew Honnibal	394633efce	Make doc pickling support hooks	2017-10-17 19:44:09 +02:00
Matthew Honnibal	fe844148f6	Test pickling hooks	2017-10-17 19:43:52 +02:00
Matthew Honnibal	cdb0c426d8	Improve deserialization of user_data, esp. for Underscore	2017-10-17 19:29:20 +02:00
Matthew Honnibal	374819edf8	Test user_data deserialization, re #1085	2017-10-17 19:28:54 +02:00
Matthew Honnibal	e35a83d142	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-17 18:22:06 +02:00
Matthew Honnibal	f45973848c	Rename 'tokens' variable 'doc' in tokenizer	2017-10-17 18:21:41 +02:00
Matthew Honnibal	839de87ca9	Make lambda func a named function, for pickling	2017-10-17 18:21:20 +02:00
Matthew Honnibal	9baa8fe7ec	Convert closure to functools.partial, to promote pickling	2017-10-17 18:20:52 +02:00
Matthew Honnibal	32a8564c79	Fix doc pickling	2017-10-17 18:20:24 +02:00
Matthew Honnibal	8ca97f32a3	Fix doc pickling test	2017-10-17 18:19:57 +02:00
Matthew Honnibal	9ce7d6af87	Make lex attr functions top-level functions, to promote pickling	2017-10-17 18:19:18 +02:00
Matthew Honnibal	1cc85a89ef	Allow reasonably efficient pickling of Language class, using to_bytes() and from_bytes().	2017-10-17 18:18:49 +02:00
Matthew Honnibal	0d57b9748a	Serialize lex_attr_getters with dill, for better pickle support	2017-10-17 18:17:45 +02:00
Matthew Honnibal	45d1dd90b1	Add tests for pickling doc	2017-10-17 17:20:58 +02:00
Ines Montani	afa67de7ee	Merge pull request #1428 from roanuz/develop Fix trailing whitespace and Language.from_disk overwrites	2017-10-17 16:29:15 +02:00
Matthew Honnibal	92c1eb2d6f	Fix Doc pickling. This also removes need for Binder class	2017-10-17 16:11:13 +02:00
Matthew Honnibal	ed8da9b11f	Add missing return statement in SentenceSegmenter	2017-10-17 15:32:56 +02:00
Ines Montani	aab299c8ae	Merge pull request #1429 from vishnunekkanti/develop fix syntax error in zh	2017-10-17 14:45:02 +02:00
Anto Binish Kaspar	534240648e	Fix trailing whitespace on morphology features	2017-10-17 17:15:58 +05:30
Anto Binish Kaspar	8f5b60c168	Fix Language.from_disk overwrites the meta.json file.	2017-10-17 17:15:32 +05:30
ines	8ca344712d	Add Language.has_pipe method	2017-10-17 11:20:07 +02:00
ines	485c4f6df5	Add Hungarian examples (see #1107)	2017-10-17 02:37:45 +02:00
Matthew Honnibal	19531bad4c	Merge branch 'develop' into feature/streaming-data-memory-growth	2017-10-16 21:44:11 +02:00
Matthew Honnibal	df488274b1	Fix deserialization of vectors	2017-10-16 20:55:00 +02:00
Matthew Honnibal	4018486d31	Merge remote-tracking branch 'origin/develop' into feature/streaming-data-memory-growth	2017-10-16 20:49:48 +02:00
Matthew Honnibal	4174477161	Fix equality check in test	2017-10-16 19:50:35 +02:00
Matthew Honnibal	2bc06e4b22	Bump rolling buffer size to 10k	2017-10-16 19:38:29 +02:00
Matthew Honnibal	66e2eb8f39	Clean up remnant of frozen in StringStore	2017-10-16 19:34:41 +02:00
Matthew Honnibal	a002264fec	Remove caching of Token in Doc, as caused cycle.	2017-10-16 19:34:21 +02:00
Matthew Honnibal	3e037054c8	Remove obsolete is_frozen functionality from StringStore	2017-10-16 19:23:10 +02:00
Matthew Honnibal	5c14f3f033	Create a rolling buffer for the StringStore in Language.pipe()	2017-10-16 19:22:40 +02:00
Matthew Honnibal	59c216196c	Allow weakrefs on Doc objects	2017-10-16 19:22:11 +02:00
ines	d5418553eb	Fix whitespace	2017-10-16 18:30:04 +02:00
ines	6ceadcdb5c	Make sure from_disk passes string to numpy (see #1421). If path is a WindowsPath, numpy does not recognise it as a path and as a result, doesn't open the file. https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L369	2017-10-16 18:29:56 +02:00
Matthew Honnibal	010a7309ff	Merge pull request #1402 from explosion/feature/fix-matcher-operators 💫 Fix Matcher variable-length operators	2017-10-16 17:53:19 +02:00
Matthew Honnibal	c29927d2e7	Fix matcher test	2017-10-16 17:22:18 +02:00
Vishnu Kumar Nekkanti	d3c54cf39a	fixed SyntaxError while checking for jieba	2017-10-16 18:51:33 +05:30
Matthew Honnibal	a928ae2f35	Merge branch 'develop' into feature/fix-matcher-operators	2017-10-16 13:38:36 +02:00
Matthew Honnibal	56aa42cc5d	Fix and document matcher operator 'shadowing' behaviour	2017-10-16 13:38:20 +02:00
Matthew Honnibal	748d525801	Add more matcher operator tests	2017-10-16 13:38:01 +02:00
Matthew Honnibal	0433181658	Document operator semantics in Matcher docstring	2017-10-16 12:06:33 +02:00
ines	266e7180a7	Add Language class, stop words and basic stemmer that sets NORM	2017-10-14 14:59:52 +02:00
ines	e85e1d571b	Update base punctuation	2017-10-14 14:59:23 +02:00
ines	9d6c8eaa49	Update base norm exceptions with more unicode characters, e.g. unicode variations of punctuation used in Chinese	2017-10-14 14:58:52 +02:00
ines	3516aa0cea	Port over changes from #1389	2017-10-14 13:32:55 +02:00
ines	cd6a29dce7	Port over changes from #1294	2017-10-14 13:28:46 +02:00
ines	38c756fd85	Port over changes from #1287	2017-10-14 13:16:21 +02:00
ines	612224c10d	Port over changes from #1157	2017-10-14 13:11:39 +02:00
ines	9b3f8f9ec3	Fix formatting and add comment on languages	2017-10-14 13:11:18 +02:00
ines	a4d974d97b	Port over URL pattern changes from #1411	2017-10-14 12:58:07 +02:00
ines	09aed58140	Port over changes from #1333 and add comments	2017-10-14 12:52:59 +02:00
Matthew Honnibal	cf6da9301a	Update lemmatizer test	2017-10-12 22:50:52 +02:00
Matthew Honnibal	9b90d235d1	Fix tag check in lemmatizer	2017-10-12 22:50:43 +02:00
Matthew Honnibal	dc01acd821	Escape encoding in validate function	2017-10-12 22:23:21 +02:00
Matthew Honnibal	27b927259a	Add locale_escape compat function	2017-10-12 22:22:04 +02:00
ines	9c6de3dcfa	Merge branch 'develop' into feature/cli-validate	2017-10-12 21:44:28 +02:00
Matthew Honnibal	462caf835a	Fix SBD test	2017-10-12 21:18:22 +02:00
ines	fff1028391	Add validate CLI command	2017-10-12 20:05:06 +02:00
Matthew Honnibal	908f44c3fe	Disable history features by default	2017-10-12 14:56:11 +02:00
Matthew Honnibal	a955843684	Increase default number of epochs	2017-10-12 13:13:01 +02:00
Matthew Honnibal	cecfcc7711	Set default hyper params back to 'slow' settings	2017-10-12 13:12:26 +02:00
Ines Montani	37aa523a8e	Merge pull request #1408 from explosion/feature/dot-underscore 💫 Custom attributes via Doc._, Token._ and Span._	2017-10-11 18:35:56 +02:00
ines	8ce6f96180	Don't make copies of language data components	2017-10-11 15:34:55 +02:00
ines	51519251c2	Fix underscore method test	2017-10-11 13:34:19 +02:00
ines	c6ae49e8bf	Fix formatting	2017-10-11 13:34:11 +02:00
ines	453c47ca24	Add German lemmatizer tests	2017-10-11 13:27:26 +02:00
ines	15fe0fd82d	Fix tests	2017-10-11 13:27:18 +02:00
ines	6dd14dc342	Add lookup lemmas to tokens without POS tags	2017-10-11 13:27:10 +02:00
ines	9620c1a640	Add lemma_lookup to Language defaults	2017-10-11 13:26:05 +02:00
ines	9fd471372a	Add lookup lemmatizer to lemmatizer as lookup() method	2017-10-11 13:25:51 +02:00
ines	e0ff145a8b	Merge branch 'develop' into feature/dot-underscore	2017-10-11 11:57:05 +02:00
ines	c1d6d43c83	Merge branch 'develop' into feature/lemmatizer	2017-10-11 11:56:35 +02:00
Matthew Honnibal	17c467e0ab	Avoid clobbering existing lemmas	2017-10-11 03:33:06 -05:00
Matthew Honnibal	807e109f2b	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-11 02:47:59 -05:00
Matthew Honnibal	6e552c9d83	Prune number of non-projective labels more aggressively	2017-10-11 02:46:44 -05:00
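
The reasoning in commit 49895fbef6 (renaming 'SP' to '_SP') can be illustrated with a small sketch. It assumes class IDs are assigned by sorting the tag names, which the log does not state; `assign_tag_ids` and the four-tag inventory are hypothetical and not taken from spaCy. Under that assumption, 'SP' sorts before "VERB" and shifts its ID, while '_SP' sorts after every uppercase tag (in ASCII, '_' comes after 'A' to 'Z') and leaves the existing IDs alone.

```python
def assign_tag_ids(tags):
    """Hypothetical helper: map each tag to its index in sorted order."""
    return {tag: i for i, tag in enumerate(sorted(tags))}


# Illustrative tag inventory, not the real tag map.
base_tags = ["ADJ", "NOUN", "PUNCT", "VERB"]

base_ids = assign_tag_ids(base_tags)
ids_with_sp = assign_tag_ids(base_tags + ["SP"])
ids_with_underscore_sp = assign_tag_ids(base_tags + ["_SP"])

# 'SP' lands before 'VERB', so VERB's class ID changes.
verb_shifted = ids_with_sp["VERB"] != base_ids["VERB"]

# '_SP' lands after all uppercase tags, so existing IDs are unchanged.
ids_preserved = all(
    ids_with_underscore_sp[tag] == base_ids[tag] for tag in base_tags
)
```

A model trained against the original IDs would still line up with its output classes in the '_SP' case, which is the property the commit message is after.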

4160 Commits
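
The API change described in commit ebecaddb76 (splitting 'data_or_width' into two keyword arguments) can be sketched as follows. `ToyVectors` is a hypothetical stand-in written for illustration; only the call shape `Vectors(strings, width=300)` comes from the commit message, and the real `Vectors.__init__` signature and behaviour may differ.

```python
class ToyVectors:
    """Hypothetical stand-in for illustration; not spaCy's Vectors class."""

    def __init__(self, strings, data=None, width=None):
        if data is not None and width is not None:
            raise ValueError("Pass either data or width, not both")
        if data is None:
            # No table supplied: start empty and remember only the row width.
            data = []
            width = 0 if width is None else width
        else:
            # Table supplied: infer the width from the first row.
            width = len(data[0]) if data else 0
        self.strings = list(strings)
        self.data = data
        self.width = width


# With separate keywords, the intent of each call is explicit.
empty = ToyVectors(["apple", "pear"], width=300)
filled = ToyVectors(["apple"], data=[[0.1, 0.2, 0.3]])
```

With a single overloaded argument, the constructor would have to inspect the value's type to decide whether it was a table or a width; two keywords make that distinction at the call site.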