spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 19:39:13 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	490ad3eaf0	Check that empty strings are handled. Closes #1242	2017-10-21 00:52:14 +02:00
Matthew Honnibal	8f8bccecb9	Patch deserialisation for invalid loads, to avoid model failure	2017-10-21 00:51:42 +02:00
Ramanan Balakrishnan	d2fe56a577	Add LCA matrix for spans and docs	2017-10-20 23:58:00 +05:30
Matthew Honnibal	d8391b1c4d	Fix #1434: Matcher failed on ending ? if no token	2017-10-20 16:49:36 +02:00
Matthew Honnibal	fec53f09f7	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-20 16:28:34 +02:00
Matthew Honnibal	f111b228e0	Fix re-parsing of previously parsed text. If a Doc object had been previously parsed, it was possible for invalid parses to be added. There were two problems: 1) the parse was only being partially erased; 2) the RightArc action was able to create a 1-cycle. This patch fixes both errors, and avoids resetting the parse if one is present. In theory this might allow a better parse to be predicted by running the parser twice. Closes #1253.	2017-10-20 16:27:36 +02:00
Matthew Honnibal	1036798155	Make parser consistent if maxout==1	2017-10-20 16:24:16 +02:00
Matthew Honnibal	3faf9189a2	Make parser hidden shape consistent even if maxout==1	2017-10-20 16:23:31 +02:00
Matthew Honnibal	9010a1a060	Create vectors correctly	2017-10-20 14:19:46 +02:00
Matthew Honnibal	33229b1c9e	Remove print statement	2017-10-20 14:19:29 +02:00
Matthew Honnibal	cfae54c507	Make change to Vectors.__init__	2017-10-20 14:19:04 +02:00
Matthew Honnibal	ebecaddb76	Make 'data_or_width' two keyword args in Vectors.__init__. Previously the data and width options were one argument in Vectors, which meant you couldn't say vectors = Vectors(strings, width=300). It's better to have two keywords.	2017-10-20 14:17:15 +02:00
Matthew Honnibal	49895fbef6	Rename 'SP' special tag to '_SP'. Renaming the tag with an underscore lets us add it to the tag map without worrying that we'll change the sequence of tags, which throws off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag, the "VERB" tag is pushed to a different class ID, and the model is all messed up.	2017-10-20 14:01:12 +02:00
Matthew Honnibal	506cf2eb13	Remove cpdef enum, to avoid too much code generation	2017-10-20 14:00:23 +02:00
Matthew Honnibal	6218af0105	Remove cpdef enum, to avoid too much code generation	2017-10-20 13:59:57 +02:00
Matthew Honnibal	92ac9316b5	Fix initialization of vectors, to address serialization problem	2017-10-20 13:59:24 +02:00
Ramanan Balakrishnan	0726946563	cleanup to_array implementation using fixes on master	2017-10-20 17:09:37 +05:30
ines	108f1f786e	Update symbols and document missing token attributes (see #1439)	2017-10-20 13:08:44 +02:00
ines	4acab77a8a	Add missing symbol for LAW entities (resolves #1427)	2017-10-20 13:07:57 +02:00
Matthew Honnibal	b101736555	Fix precomputed layer	2017-10-20 12:14:52 +02:00
Ramanan Balakrishnan	b3ab124fc5	Support strings for attribute list in doc.to_array	2017-10-20 11:46:57 +05:30
Matthew Honnibal	64658e02e5	Implement fancier initialisation for precomputed layer	2017-10-20 03:07:45 +02:00
Matthew Honnibal	827cd8a883	Fix support of maxout pieces in parser	2017-10-20 03:07:17 +02:00
Matthew Honnibal	a8850b4282	Remove redundant PrecomputableMaxouts class	2017-10-19 20:27:34 +02:00
Matthew Honnibal	a17a1b60c7	Clean up redundant PrecomputableMaxouts class	2017-10-19 20:26:37 +02:00
Matthew Honnibal	b00d0a2c97	Fix bias in parser	2017-10-19 18:42:11 +02:00
Matthew Honnibal	b54b4b8a97	Make parser_maxout_pieces hyper-param work	2017-10-19 13:45:18 +02:00
Matthew Honnibal	03a215c5fd	Make PrecomputableAffines work	2017-10-19 13:44:49 +02:00
Ramanan Balakrishnan	7b9b1be44c	Support single value for attribute list in doc.to_array	2017-10-19 17:00:41 +05:30
Matthew Honnibal	61bc203f3f	Merge pull request #1438 from explosion/feature/fast-parser: 💫 Improve runtime CPU efficiency of parser/NER	2017-10-19 02:42:21 +02:00
Matthew Honnibal	15e5a04a8d	Clean up more depth=0 conditional code	2017-10-19 01:48:43 +02:00
Matthew Honnibal	906c50ac59	Fix loop typing that caused an error on Windows	2017-10-19 01:48:39 +02:00
ines	24512420b1	Show error if data_path does not exist or is None (see #1102)	2017-10-19 00:53:49 +02:00
ines	bf415fd778	Add test for serializing extension attrs (see #1085)	2017-10-19 00:53:08 +02:00
Matthew Honnibal	960788aaa2	Eliminate dead code in parser, and raise errors for obsolete options	2017-10-19 00:42:34 +02:00
Matthew Honnibal	bbfd7d8d5d	Clean up parser multi-threading	2017-10-19 00:25:21 +02:00
Matthew Honnibal	f018f2030c	Try optimized parser forward loop	2017-10-18 21:48:00 +02:00
Matthew Honnibal	65bf5e85bd	Improve piping in language.pipe	2017-10-18 21:46:12 +02:00
Matthew Honnibal	633a75c7e0	Break parser batches into sub-batches, sorted by length.	2017-10-18 21:45:01 +02:00
Ines Montani	f0d577e460	Merge pull request #1425 from explosion/feature/hindi-tokenizer: 💫 Basic Hindi tokenization support	2017-10-18 13:34:52 +02:00
Matthew Honnibal	394633efce	Make doc pickling support hooks	2017-10-17 19:44:09 +02:00
Matthew Honnibal	fe844148f6	Test pickling hooks	2017-10-17 19:43:52 +02:00
Matthew Honnibal	cdb0c426d8	Improve deserialization of user_data, esp. for Underscore	2017-10-17 19:29:20 +02:00
Matthew Honnibal	374819edf8	Test user_data deserialization, re #1085	2017-10-17 19:28:54 +02:00
Matthew Honnibal	e35a83d142	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-17 18:22:06 +02:00
Matthew Honnibal	f45973848c	Rename 'tokens' variable 'doc' in tokenizer	2017-10-17 18:21:41 +02:00
Matthew Honnibal	839de87ca9	Make lambda func a named function, for pickling	2017-10-17 18:21:20 +02:00
Matthew Honnibal	9baa8fe7ec	Convert closure to functools.partial, to promote pickling	2017-10-17 18:20:52 +02:00
Matthew Honnibal	32a8564c79	Fix doc pickling	2017-10-17 18:20:24 +02:00
Matthew Honnibal	8ca97f32a3	Fix doc pickling test	2017-10-17 18:19:57 +02:00
Matthew Honnibal	9ce7d6af87	Make lex attr functions top-level functions, to promote pickling	2017-10-17 18:19:18 +02:00
Matthew Honnibal	1cc85a89ef	Allow reasonably efficient pickling of Language class, using to_bytes() and from_bytes().	2017-10-17 18:18:49 +02:00
Matthew Honnibal	0d57b9748a	Serialize lex_attr_getters with dill, for better pickle support	2017-10-17 18:17:45 +02:00
Matthew Honnibal	45d1dd90b1	Add tests for pickling doc	2017-10-17 17:20:58 +02:00
Ines Montani	afa67de7ee	Merge pull request #1428 from roanuz/develop: Fix trailing whitespace and Language.from_disk overwrites	2017-10-17 16:29:15 +02:00
Matthew Honnibal	92c1eb2d6f	Fix Doc pickling. This also removes need for Binder class	2017-10-17 16:11:13 +02:00
Matthew Honnibal	ed8da9b11f	Add missing return statement in SentenceSegmenter	2017-10-17 15:32:56 +02:00
Ines Montani	aab299c8ae	Merge pull request #1429 from vishnunekkanti/develop: fix syntax error in zh	2017-10-17 14:45:02 +02:00
Anto Binish Kaspar	534240648e	Fix trailing whitespace on morphology features	2017-10-17 17:15:58 +05:30
Anto Binish Kaspar	8f5b60c168	Fix Language.from_disk overwrites the meta.json file.	2017-10-17 17:15:32 +05:30
ines	8ca344712d	Add Language.has_pipe method	2017-10-17 11:20:07 +02:00
ines	485c4f6df5	Add Hungarian examples (see #1107)	2017-10-17 02:37:45 +02:00
Matthew Honnibal	19531bad4c	Merge branch 'develop' into feature/streaming-data-memory-growth	2017-10-16 21:44:11 +02:00
Matthew Honnibal	df488274b1	Fix deserialization of vectors	2017-10-16 20:55:00 +02:00
Matthew Honnibal	4018486d31	Merge remote-tracking branch 'origin/develop' into feature/streaming-data-memory-growth	2017-10-16 20:49:48 +02:00
Matthew Honnibal	4174477161	Fix equality check in test	2017-10-16 19:50:35 +02:00
Matthew Honnibal	2bc06e4b22	Bump rolling buffer size to 10k	2017-10-16 19:38:29 +02:00
Matthew Honnibal	66e2eb8f39	Clean up remnant of frozen in StringStore	2017-10-16 19:34:41 +02:00
Matthew Honnibal	a002264fec	Remove caching of Token in Doc, as it caused a cycle.	2017-10-16 19:34:21 +02:00
Matthew Honnibal	3e037054c8	Remove obsolete is_frozen functionality from StringStore	2017-10-16 19:23:10 +02:00
Matthew Honnibal	5c14f3f033	Create a rolling buffer for the StringStore in Language.pipe()	2017-10-16 19:22:40 +02:00
Matthew Honnibal	59c216196c	Allow weakrefs on Doc objects	2017-10-16 19:22:11 +02:00
ines	d5418553eb	Fix whitespace	2017-10-16 18:30:04 +02:00
ines	6ceadcdb5c	Make sure from_disk passes string to numpy (see #1421). If path is a WindowsPath, numpy does not recognise it as a path and, as a result, doesn't open the file. https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L369	2017-10-16 18:29:56 +02:00
Matthew Honnibal	010a7309ff	Merge pull request #1402 from explosion/feature/fix-matcher-operators: 💫 Fix Matcher variable-length operators	2017-10-16 17:53:19 +02:00
Matthew Honnibal	c29927d2e7	Fix matcher test	2017-10-16 17:22:18 +02:00
Vishnu Kumar Nekkanti	d3c54cf39a	fixed SyntaxError while checking for jieba	2017-10-16 18:51:33 +05:30
Matthew Honnibal	a928ae2f35	Merge branch 'develop' into feature/fix-matcher-operators	2017-10-16 13:38:36 +02:00
Matthew Honnibal	56aa42cc5d	Fix and document matcher operator 'shadowing' behaviour	2017-10-16 13:38:20 +02:00
Matthew Honnibal	748d525801	Add more matcher operator tests	2017-10-16 13:38:01 +02:00
Matthew Honnibal	0433181658	Document operator semantics in Matcher docstring	2017-10-16 12:06:33 +02:00
ines	266e7180a7	Add Language class, stop words and basic stemmer that sets NORM	2017-10-14 14:59:52 +02:00
ines	e85e1d571b	Update base punctuation	2017-10-14 14:59:23 +02:00
ines	9d6c8eaa49	Update base norm exceptions with more Unicode characters, e.g. Unicode variations of punctuation used in Chinese	2017-10-14 14:58:52 +02:00
ines	3516aa0cea	Port over changes from #1389	2017-10-14 13:32:55 +02:00
ines	cd6a29dce7	Port over changes from #1294	2017-10-14 13:28:46 +02:00
ines	38c756fd85	Port over changes from #1287	2017-10-14 13:16:21 +02:00
ines	612224c10d	Port over changes from #1157	2017-10-14 13:11:39 +02:00
ines	9b3f8f9ec3	Fix formatting and add comment on languages	2017-10-14 13:11:18 +02:00
ines	a4d974d97b	Port over URL pattern changes from #1411	2017-10-14 12:58:07 +02:00
ines	09aed58140	Port over changes from #1333 and add comments	2017-10-14 12:52:59 +02:00
Matthew Honnibal	cf6da9301a	Update lemmatizer test	2017-10-12 22:50:52 +02:00
Matthew Honnibal	9b90d235d1	Fix tag check in lemmatizer	2017-10-12 22:50:43 +02:00
Matthew Honnibal	dc01acd821	Escape encoding in validate function	2017-10-12 22:23:21 +02:00
Matthew Honnibal	27b927259a	Add locale_escape compat function	2017-10-12 22:22:04 +02:00
ines	9c6de3dcfa	Merge branch 'develop' into feature/cli-validate	2017-10-12 21:44:28 +02:00
Matthew Honnibal	462caf835a	Fix SBD test	2017-10-12 21:18:22 +02:00
ines	fff1028391	Add validate CLI command	2017-10-12 20:05:06 +02:00
Matthew Honnibal	908f44c3fe	Disable history features by default	2017-10-12 14:56:11 +02:00
Matthew Honnibal	a955843684	Increase default number of epochs	2017-10-12 13:13:01 +02:00
Matthew Honnibal	cecfcc7711	Set default hyper params back to 'slow' settings	2017-10-12 13:12:26 +02:00
Ines Montani	37aa523a8e	Merge pull request #1408 from explosion/feature/dot-underscore: 💫 Custom attributes via Doc._, Token._ and Span._	2017-10-11 18:35:56 +02:00
ines	8ce6f96180	Don't make copies of language data components	2017-10-11 15:34:55 +02:00
ines	51519251c2	Fix underscore method test	2017-10-11 13:34:19 +02:00
ines	c6ae49e8bf	Fix formatting	2017-10-11 13:34:11 +02:00
ines	453c47ca24	Add German lemmatizer tests	2017-10-11 13:27:26 +02:00
ines	15fe0fd82d	Fix tests	2017-10-11 13:27:18 +02:00
ines	6dd14dc342	Add lookup lemmas to tokens without POS tags	2017-10-11 13:27:10 +02:00
ines	9620c1a640	Add lemma_lookup to Language defaults	2017-10-11 13:26:05 +02:00
ines	9fd471372a	Add lookup lemmatizer to lemmatizer as lookup() method	2017-10-11 13:25:51 +02:00
ines	e0ff145a8b	Merge branch 'develop' into feature/dot-underscore	2017-10-11 11:57:05 +02:00
ines	c1d6d43c83	Merge branch 'develop' into feature/lemmatizer	2017-10-11 11:56:35 +02:00
Matthew Honnibal	17c467e0ab	Avoid clobbering existing lemmas	2017-10-11 03:33:06 -05:00
Matthew Honnibal	807e109f2b	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-11 02:47:59 -05:00
Matthew Honnibal	6e552c9d83	Prune number of non-projective labels more aggressively	2017-10-11 02:46:44 -05:00
Matthew Honnibal	76fe24f44d	Improve embedding defaults	2017-10-11 09:44:17 +02:00
Matthew Honnibal	188f620046	Improve parser defaults	2017-10-11 09:43:48 +02:00
Matthew Honnibal	acba2e1051	Fix metadata in training	2017-10-11 08:55:52 +02:00
Matthew Honnibal	74c2c6a58c	Add default name and lang to meta	2017-10-11 08:49:12 +02:00
Matthew Honnibal	3814a161e6	Avoid clobbering preset lemmas	2017-10-11 08:41:03 +02:00
Matthew Honnibal	fd47f8e89f	Fix failing test	2017-10-11 08:38:34 +02:00
Matthew Honnibal	462b2e26b4	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-11 08:23:04 +02:00
Matthew Honnibal	a6ac4699eb	Allow Morphology class to set up tokens. Add Morphology.assign_untagged() C-method, and call it from Doc.push_back() when a token is created. This gives a place to allow the Morphology class to initialize token data.	2017-10-11 03:24:14 +02:00
Matthew Honnibal	3b527fa52b	Call morphology.assign_untagged when pushing token to Doc	2017-10-11 03:23:57 +02:00
Matthew Honnibal	c15d8278cb	Avoid lemmatizing inappropriate tags in English lemmatizer	2017-10-11 03:23:23 +02:00
Matthew Honnibal	d528b6e36d	Add assign_untagged method in Morphology	2017-10-11 03:22:49 +02:00
Matthew Honnibal	2c118ab3a6	Add tests for Doc creation	2017-10-11 03:21:23 +02:00
ines	820bf85075	Move LookupLemmatizer to spacy.lemmatizer	2017-10-11 02:25:13 +02:00
ines	417d45f5d0	Add lemmatizer data as variable on language data. Don't create lookup lemmatizer within Language class and just pass in the data so it can be set on Token creation.	2017-10-11 02:24:58 +02:00
ines	0c2343d73a	Tidy up language data	2017-10-11 02:22:49 +02:00
Matthew Honnibal	d84136b4a9	Update add label test	2017-10-10 22:57:41 +02:00
Matthew Honnibal	3065f12ef2	Make add parser label work for hidden_depth=0	2017-10-10 22:57:31 +02:00
ines	bfd58dd0fc	Merge branch 'develop' into feature/dot-underscore	2017-10-10 22:03:51 +02:00
Matthew Honnibal	73bca3d382	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-10 12:51:37 -05:00
Matthew Honnibal	5156074df1	Make loading code more consistent in train command	2017-10-10 12:51:20 -05:00
Matthew Honnibal	d70fba6807	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-10 19:33:10 +02:00
Matthew Honnibal	8143618497	Set prefix length back to 1	2017-10-10 19:32:54 +02:00
Matthew Honnibal	97c9b5db8b	Patch spacy.train for new pipeline management	2017-10-09 23:41:16 -05:00
Matthew Honnibal	a635240398	Add conll_ner2json converter	2017-10-09 22:03:26 -05:00
Matthew Honnibal	e0a9b02b67	Merge Span._ and Span.as_doc methods	2017-10-09 22:00:15 -05:00
Matthew Honnibal	dce8afb9cf	Set prefix length to 3	2017-10-09 21:55:55 -05:00
Matthew Honnibal	8265b90c83	Update parser defaults	2017-10-09 21:55:20 -05:00
Matthew Honnibal	dd2b0601d1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-09 21:30:46 -05:00
Matthew Honnibal	09d61ada5e	Merge pull request #1396 from explosion/feature/pipeline-management: 💫 Improve pipeline and factory management	2017-10-10 04:29:54 +02:00
ines	67350fa496	Use better logic for auto-generating component name. Instances don't have __name__, so we try __class__.__name__ as well, before giving up and defaulting to repr(component).	2017-10-10 04:23:05 +02:00
ines	3fc4fe61d2	Fix typo	2017-10-10 04:15:14 +02:00
ines	59c4f27499	Add get, set and has methods to Underscore	2017-10-10 04:14:35 +02:00
Matthew Honnibal	19136fd155	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-10 03:58:30 +02:00
Matthew Honnibal	8978212ee5	Patch serialization bug raised in #1105	2017-10-10 03:58:12 +02:00
Matthew Honnibal	f0f2739ae3	Add test for serialization issue raised in #1105	2017-10-10 03:57:58 +02:00


4225 Commits
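
Note on commit 67350fa496: its message describes a fallback order for auto-generating a pipeline component's name (try __name__, then __class__.__name__, then repr(component)). The following is a minimal sketch of that order only, not the code from the commit; the function name component_name and the example components are hypothetical.

```python
def component_name(component):
    """Guess a readable name for a pipeline component (hypothetical helper).

    Follows the order given in the commit message:
    __name__, then __class__.__name__, then repr(component).
    """
    # Plain functions and classes carry their own __name__.
    name = getattr(component, "__name__", None)
    if name is not None:
        return name
    # Instances usually have no __name__, so fall back to the class name.
    cls = getattr(component, "__class__", None)
    cls_name = getattr(cls, "__name__", None)
    if cls_name is not None:
        return cls_name
    # Last resort: the object's repr.
    return repr(component)


# Hypothetical example components, for illustration only.
def my_component(doc):
    return doc


class MyComponent(object):
    def __call__(self, doc):
        return doc
```

With these definitions, a function yields its own name and an instance yields its class name. In Python 3 nearly every object exposes __class__.__name__, so the repr branch is rarely reached in practice.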