spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-13 13:44:21 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	19531bad4c	Merge branch 'develop' into feature/streaming-data-memory-growth	2017-10-16 21:44:11 +02:00
Matthew Honnibal	df488274b1	Fix deserialization of vectors	2017-10-16 20:55:00 +02:00
Matthew Honnibal	4018486d31	Merge remote-tracking branch 'origin/develop' into feature/streaming-data-memory-growth	2017-10-16 20:49:48 +02:00
Matthew Honnibal	4174477161	Fix equality check in test	2017-10-16 19:50:35 +02:00
Matthew Honnibal	2bc06e4b22	Bump rolling buffer size to 10k	2017-10-16 19:38:29 +02:00
Matthew Honnibal	66e2eb8f39	Clean up remnant of frozen in StringStore	2017-10-16 19:34:41 +02:00
Matthew Honnibal	a002264fec	Remove caching of Token in Doc, as caused cycle.	2017-10-16 19:34:21 +02:00
Matthew Honnibal	3e037054c8	Remove obsolete is_frozen functionality from StringStore	2017-10-16 19:23:10 +02:00
Matthew Honnibal	5c14f3f033	Create a rolling buffer for the StringStore in Language.pipe()	2017-10-16 19:22:40 +02:00
Matthew Honnibal	59c216196c	Allow weakrefs on Doc objects	2017-10-16 19:22:11 +02:00
ines	d5418553eb	Fix whitespace	2017-10-16 18:30:04 +02:00
ines	6ceadcdb5c	Make sure from_disk passes string to numpy (see #1421 ) If path is a WindowsPath, numpy does not recognise it as a path and as a result, doesn't open the file. https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L369	2017-10-16 18:29:56 +02:00
Matthew Honnibal	010a7309ff	Merge pull request #1402 from explosion/feature/fix-matcher-operators 💫 Fix Matcher variable-length operators	2017-10-16 17:53:19 +02:00
Matthew Honnibal	c29927d2e7	Fix matcher test	2017-10-16 17:22:18 +02:00
Vishnu Kumar Nekkanti	d3c54cf39a	fixed SyntaxError while checking for jieba	2017-10-16 18:51:33 +05:30
Matthew Honnibal	a928ae2f35	Merge branch 'develop' into feature/fix-matcher-operators	2017-10-16 13:38:36 +02:00
Matthew Honnibal	56aa42cc5d	Fix and document matcher operator 'shadowing' behaviour	2017-10-16 13:38:20 +02:00
Matthew Honnibal	748d525801	Add more matcher operator tests	2017-10-16 13:38:01 +02:00
Matthew Honnibal	0433181658	Document operator semantics in Matcher docstring	2017-10-16 12:06:33 +02:00
ines	266e7180a7	Add Language class, stop words and basic stemmer that sets NORM	2017-10-14 14:59:52 +02:00
ines	e85e1d571b	Update base punctuation	2017-10-14 14:59:23 +02:00
ines	9d6c8eaa49	Update base norm exceptions with more unicode characters e.g. unicode variations of punctuation used in Chinese	2017-10-14 14:58:52 +02:00
ines	3516aa0cea	Port over changes from #1389	2017-10-14 13:32:55 +02:00
ines	cd6a29dce7	Port over changes from #1294	2017-10-14 13:28:46 +02:00
ines	38c756fd85	Port over changes from #1287	2017-10-14 13:16:21 +02:00
ines	612224c10d	Port over changes from #1157	2017-10-14 13:11:39 +02:00
ines	9b3f8f9ec3	Fix formatting and add comment on languages	2017-10-14 13:11:18 +02:00
ines	a4d974d97b	Port over URL pattern changes from #1411	2017-10-14 12:58:07 +02:00
ines	09aed58140	Port over changes from #1333 and add comments	2017-10-14 12:52:59 +02:00
Matthew Honnibal	cf6da9301a	Update lemmatizer test	2017-10-12 22:50:52 +02:00
Matthew Honnibal	9b90d235d1	Fix tag check in lemmatizer	2017-10-12 22:50:43 +02:00
Matthew Honnibal	dc01acd821	Escape encoding in validate function	2017-10-12 22:23:21 +02:00
Matthew Honnibal	27b927259a	Add locale_escape compat function	2017-10-12 22:22:04 +02:00
ines	9c6de3dcfa	Merge branch 'develop' into feature/cli-validate	2017-10-12 21:44:28 +02:00
Matthew Honnibal	462caf835a	Fix SBD test	2017-10-12 21:18:22 +02:00
ines	fff1028391	Add validate CLI command	2017-10-12 20:05:06 +02:00
Matthew Honnibal	908f44c3fe	Disable history features by default	2017-10-12 14:56:11 +02:00
Matthew Honnibal	a955843684	Increase default number of epochs	2017-10-12 13:13:01 +02:00
Matthew Honnibal	cecfcc7711	Set default hyper params back to 'slow' settings	2017-10-12 13:12:26 +02:00
Ines Montani	37aa523a8e	Merge pull request #1408 from explosion/feature/dot-underscore 💫 Custom attributes via Doc._, Token._ and Span._	2017-10-11 18:35:56 +02:00
ines	8ce6f96180	Don't make copies of language data components	2017-10-11 15:34:55 +02:00
ines	51519251c2	Fix underscore method test	2017-10-11 13:34:19 +02:00
ines	c6ae49e8bf	Fix formatting	2017-10-11 13:34:11 +02:00
ines	453c47ca24	Add German lemmatizer tests	2017-10-11 13:27:26 +02:00
ines	15fe0fd82d	Fix tests	2017-10-11 13:27:18 +02:00
ines	6dd14dc342	Add lookup lemmas to tokens without POS tags	2017-10-11 13:27:10 +02:00
ines	9620c1a640	Add lemma_lookup to Language defaults	2017-10-11 13:26:05 +02:00
ines	9fd471372a	Add lookup lemmatizer to lemmatizer as lookup() method	2017-10-11 13:25:51 +02:00
ines	e0ff145a8b	Merge branch 'develop' into feature/dot-underscore	2017-10-11 11:57:05 +02:00
ines	c1d6d43c83	Merge branch 'develop' into feature/lemmatizer	2017-10-11 11:56:35 +02:00
Matthew Honnibal	17c467e0ab	Avoid clobbering existing lemmas	2017-10-11 03:33:06 -05:00
Matthew Honnibal	807e109f2b	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-11 02:47:59 -05:00
Matthew Honnibal	6e552c9d83	Prune number of non-projective labels more aggressively	2017-10-11 02:46:44 -05:00
Matthew Honnibal	76fe24f44d	Improve embedding defaults	2017-10-11 09:44:17 +02:00
Matthew Honnibal	188f620046	Improve parser defaults	2017-10-11 09:43:48 +02:00
Matthew Honnibal	acba2e1051	Fix metadata in training	2017-10-11 08:55:52 +02:00
Matthew Honnibal	74c2c6a58c	Add default name and lang to meta	2017-10-11 08:49:12 +02:00
Matthew Honnibal	3814a161e6	Avoid clobbering preset lemmas	2017-10-11 08:41:03 +02:00
Matthew Honnibal	fd47f8e89f	Fix failing test	2017-10-11 08:38:34 +02:00
Matthew Honnibal	462b2e26b4	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-11 08:23:04 +02:00
Matthew Honnibal	a6ac4699eb	Allow Morphology class to setup tokens Add Morphology.assign_untagged() C-method, and call it from Doc.push_back() when a token is created. This gives a place to allow the Morphology class to initialize token data.	2017-10-11 03:24:14 +02:00
Matthew Honnibal	3b527fa52b	Call morphology.assign_untagged when pushing token to Doc	2017-10-11 03:23:57 +02:00
Matthew Honnibal	c15d8278cb	Avoid lemmatizing inappropriate tags in English lemmatizer	2017-10-11 03:23:23 +02:00
Matthew Honnibal	d528b6e36d	Add assign_untagged method in Morphology	2017-10-11 03:22:49 +02:00
Matthew Honnibal	2c118ab3a6	Add tests for Doc creation	2017-10-11 03:21:23 +02:00
ines	820bf85075	Move LookupLemmatizer to spacy.lemmatizer	2017-10-11 02:25:13 +02:00
ines	417d45f5d0	Add lemmatizer data as variable on language data Don't create lookup lemmatizer within Language class and just pass in the data so it can be set on Token creation	2017-10-11 02:24:58 +02:00
ines	0c2343d73a	Tidy up language data	2017-10-11 02:22:49 +02:00
Matthew Honnibal	d84136b4a9	Update add label test	2017-10-10 22:57:41 +02:00
Matthew Honnibal	3065f12ef2	Make add parser label work for hidden_depth=0	2017-10-10 22:57:31 +02:00
ines	bfd58dd0fc	Merge branch 'develop' into feature/dot-underscore	2017-10-10 22:03:51 +02:00
Matthew Honnibal	73bca3d382	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-10 12:51:37 -05:00
Matthew Honnibal	5156074df1	Make loading code more consistent in train command	2017-10-10 12:51:20 -05:00
Matthew Honnibal	d70fba6807	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-10 19:33:10 +02:00
Matthew Honnibal	8143618497	Set prefix length back to 1	2017-10-10 19:32:54 +02:00
Matthew Honnibal	97c9b5db8b	Patch spacy.train for new pipeline management	2017-10-09 23:41:16 -05:00
Matthew Honnibal	a635240398	Add conll_ner2json converter	2017-10-09 22:03:26 -05:00
Matthew Honnibal	e0a9b02b67	Merge Span._ and Span.as_doc methods	2017-10-09 22:00:15 -05:00
Matthew Honnibal	dce8afb9cf	Set prefix length to 3	2017-10-09 21:55:55 -05:00
Matthew Honnibal	8265b90c83	Update parser defaults	2017-10-09 21:55:20 -05:00
Matthew Honnibal	dd2b0601d1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-09 21:30:46 -05:00
Matthew Honnibal	09d61ada5e	Merge pull request #1396 from explosion/feature/pipeline-management 💫 Improve pipeline and factory management	2017-10-10 04:29:54 +02:00
ines	67350fa496	Use better logic for auto-generating component name Instances don't have __name__, so we try __class__.__name__ as well, before giving up and defaulting to repr(component).	2017-10-10 04:23:05 +02:00
ines	3fc4fe61d2	Fix typo	2017-10-10 04:15:14 +02:00
ines	59c4f27499	Add get, set and has methods to Underscore	2017-10-10 04:14:35 +02:00
Matthew Honnibal	19136fd155	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-10 03:58:30 +02:00
Matthew Honnibal	8978212ee5	Patch serialization bug raised in #1105	2017-10-10 03:58:12 +02:00
Matthew Honnibal	f0f2739ae3	Add test for serialization issue raised in #1105	2017-10-10 03:57:58 +02:00
Matthew Honnibal	735d18654d	Add NER converter for CoNLL 2003 data	2017-10-09 20:06:28 -05:00
Matthew Honnibal	51d18937af	Partially apply doc/span/token into method We want methods to act like they're "bound" to the object, so that you can make your method conditional on the `doc`, `span` or `token` instance --- like, well, a method. We therefore partially apply the function, which works like this: ``` def partial(unbound_method, constant_arg): def bound_method(*args, **kwargs): return unbound_method(constant_arg, *args, **kwargs) return bound_method ```	2017-10-10 02:21:28 +02:00
Matthew Honnibal	808d8740d6	Remove print statement	2017-10-09 08:45:20 -05:00
Matthew Honnibal	0f41b25f60	Add speed benchmarks to metadata	2017-10-09 08:05:37 -05:00
ines	de374dc72a	Merge branch 'feature/pipeline-management' into feature/dot-underscore	2017-10-09 14:37:51 +02:00
Matthew Honnibal	2534cd57d7	Add bandaid solution to the 'shadowing' problem in #864	2017-10-09 08:59:35 +02:00
Matthew Honnibal	d8a2506023	Merge pull request #1401 from explosion/feature/add-parser-action 💫 Allow labels to be added to pre-trained parser and NER modes	2017-10-09 04:57:51 +02:00
Matthew Honnibal	689349e32f	Merge pull request #1400 from explosion/feature/sentence-parsing 💫 Force parser to respect preset sentence boundaries	2017-10-09 04:31:43 +02:00
Matthew Honnibal	e79fc41ff8	Merge pull request #1391 from explosion/feature/multilabel-textcat 💫 Fix multi-label support for text classification	2017-10-09 04:22:31 +02:00
Matthew Honnibal	fad2b8315f	Merge branch 'develop' into feature/add-parser-action	2017-10-09 04:13:04 +02:00
Matthew Honnibal	6c79841c0d	Fix tests for history features	2017-10-09 04:12:24 +02:00
Matthew Honnibal	dde87e6b0d	Add tests for adding parser actions	2017-10-09 03:42:35 +02:00
Matthew Honnibal	b2b8506f2c	Remove whitespace	2017-10-09 03:35:57 +02:00
Matthew Honnibal	d43a83e37a	Allow parser.add_label for pretrained models	2017-10-09 03:35:40 +02:00
Matthew Honnibal	81a64119db	Fix string-to-unicode problem	2017-10-09 00:59:49 +02:00
Matthew Honnibal	02c2af7119	Fix test	2017-10-09 00:29:37 +02:00
Matthew Honnibal	4cc84b0234	Prohibit Break when sent_start < 0	2017-10-09 00:02:45 +02:00
Matthew Honnibal	5a67efeccc	Add tests for sentence segmentation presetting	2017-10-09 00:02:23 +02:00
Matthew Honnibal	e938bce320	Adjust parsing transition system to allow preset sentence segments.	2017-10-08 23:53:34 +02:00
Matthew Honnibal	080afd4924	Add ternary value setting to Token.sent_start	2017-10-08 23:51:58 +02:00
Matthew Honnibal	7ae67ec6a1	Add Span.as_doc method	2017-10-08 23:50:20 +02:00
Matthew Honnibal	20309fb9db	Make history features default to zero	2017-10-08 20:32:14 +02:00
Matthew Honnibal	e74c8d2fad	Merge remote-tracking branch 'origin/develop' into feature/sentence-parsing	2017-10-08 20:20:41 +02:00
Matthew Honnibal	18063803de	Make TokenC.sent_start an int, to allow ternary value	2017-10-08 19:58:54 +02:00
Matthew Honnibal	be4f0b6460	Update defaults	2017-10-08 02:08:12 -05:00
Matthew Honnibal	42b401d08b	Change default hidden depth to 1	2017-10-07 21:05:21 -05:00
Matthew Honnibal	9d66a915da	Update training defaults	2017-10-07 21:02:38 -05:00
Matthew Honnibal	d163115e91	Add non-linearity after history features	2017-10-07 21:00:43 -05:00
Matthew Honnibal	92c5d78b42	Unhack NER.add_action	2017-10-07 19:02:40 +02:00
Matthew Honnibal	f2b590f672	Increment version	2017-10-07 19:01:01 +02:00
Matthew Honnibal	9bd8191739	Add tests for Underscore	2017-10-07 18:56:19 +02:00
Matthew Honnibal	668a0ea640	Pass extensions into Underscore class	2017-10-07 18:56:01 +02:00
Matthew Honnibal	1289129fd9	Add Underscore class	2017-10-07 18:00:14 +02:00
Matthew Honnibal	eb0595bea9	Merge pull request #1392 from explosion/feature/parser-history-model 💫 Parser history features	2017-10-07 15:07:02 +02:00
Matthew Honnibal	3d22ccf495	Update default hyper-parameters	2017-10-07 07:16:41 -05:00
Matthew Honnibal	09442d25ec	Merge remote-tracking branch 'origin/develop' into feature/parser-history-model	2017-10-07 07:05:04 -05:00
Matthew Honnibal	3b67eabfea	Allow empty dictionaries to match any token in Matcher Often patterns need to match "any token". A clean way to denote this is with the empty dict {}: this sets no constraints on the token, so should always match. The problem was that having attributes length==0 was used as an end-of-array signal, so the matcher didn't handle this case correctly. This patch compiles empty token spec dicts into a constraint NULL_ATTR==0. The NULL_ATTR attribute, 0, is always set to 0 on the lexeme -- so this always matches.	2017-10-07 03:36:15 +02:00
ines	0adadcb3f0	Fix beam parse model test	2017-10-07 02:15:15 +02:00
ines	b38a8f4a94	Fix and update pipe methods tests	2017-10-07 02:06:23 +02:00
Matthew Honnibal	0384f08218	Trigger nonproj.deprojectivize as a postprocess	2017-10-07 02:00:47 +02:00
Matthew Honnibal	3a65a0c970	Start adding tests for new pipeline management	2017-10-07 01:48:23 +02:00
ines	e43530269c	Update docstrings	2017-10-07 01:04:50 +02:00
ines	61a503a611	Fix parser test	2017-10-07 00:38:51 +02:00
ines	b39409173e	Add disable option and True/False/None values for pipeline	2017-10-07 00:29:08 +02:00
ines	2586b61b15	Fix formatting, tidy up and remove unused imports	2017-10-07 00:26:05 +02:00
ines	212c8f0711	Implement new Language methods and pipeline API	2017-10-07 00:25:54 +02:00
Matthew Honnibal	8be46d766e	Remove print statement	2017-10-06 16:19:02 -05:00
Matthew Honnibal	8e731009fe	Fix parser config serialization	2017-10-06 13:50:52 -05:00
Matthew Honnibal	f4c9a98166	Fix spacy evaluate command on non-GPU	2017-10-06 13:17:47 -05:00
Matthew Honnibal	16ba6aa8a6	Fix parser config serialization	2017-10-06 13:17:31 -05:00
Matthew Honnibal	c66399d8ae	Fix depth definition with history features	2017-10-06 06:20:05 -05:00
Matthew Honnibal	5c750a9c2f	Reserve 0 for 'missing' in history features	2017-10-06 06:10:13 -05:00
Matthew Honnibal	fbba7c517e	Pass dropout through to embed tables	2017-10-06 06:09:18 -05:00
Matthew Honnibal	21d11936fe	Fix significant train/test skew error in history feats	2017-10-06 06:08:50 -05:00
Matthew Honnibal	555d8c8bff	Fix beam history features	2017-10-05 22:21:50 -05:00
Matthew Honnibal	3db0a32fd6	Fix dropout for history features	2017-10-05 22:21:30 -05:00
Matthew Honnibal	b0618def8d	Add support for 2-token state option	2017-10-05 21:54:12 -05:00
Matthew Honnibal	363aa47b40	Clean up dead parsing code	2017-10-05 21:53:49 -05:00
Matthew Honnibal	ca12764772	Enable history features for beam parser	2017-10-05 21:53:29 -05:00
Matthew Honnibal	fc06b0a333	Fix training when hist_size==0	2017-10-05 21:52:28 -05:00
Matthew Honnibal	e25ffcb11f	Move history size under feature flags	2017-10-05 19:38:13 -05:00
Matthew Honnibal	563f46f026	Fix multi-label support for text classification The TextCategorizer class is supposed to support multi-label text classification, and allow training data to contain missing values. For this to work, the gradient of the loss should be 0 when labels are missing. Instead, there was no way to actually denote "missing" in the GoldParse class, and so the TextCategorizer class treated the label set within gold.cats as complete. To fix this, we change GoldParse.cats to be a dict instead of a list. The GoldParse.cats dict should map to floats, with 1. denoting 'present' and 0. denoting 'absent'. Gradients are zeroed for categories absent from the gold.cats dict. A nice bonus is that you can also set values between 0 and 1 for partial membership. You can also set numeric values, if you're using a text classification model that uses an appropriate loss function. Unfortunately this is a breaking change; although the functionality was only recently introduced and hasn't been properly documented yet. I've updated the example script accordingly.	2017-10-05 18:43:02 -05:00
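
Note on 563f46f026: the message distinguishes a label whose value is 0. (known to be absent from the text) from a label whose key is missing from the gold.cats dict (unknown, so it contributes no gradient). The sketch below illustrates that distinction only. It is not spaCy's implementation: the function name `masked_gradient`, the label names, and the squared-error loss are all assumptions chosen for illustration.

```python
def masked_gradient(scores, cats):
    """Gradient of 0.5 * (score - target)**2 per label, zeroed for missing labels.

    scores: dict mapping label -> predicted score
    cats:   dict mapping label -> float target (1.0 present, 0.0 absent);
            labels with no key in cats are treated as missing.
    """
    grad = {}
    for label, score in scores.items():
        if label in cats:
            # Known target: ordinary squared-error gradient.
            grad[label] = score - cats[label]
        else:
            # Missing annotation: no learning signal for this label.
            grad[label] = 0.0
    return grad


# Hypothetical labels; TECH has no annotation, so its gradient is zero.
scores = {"SPORTS": 0.75, "POLITICS": 0.25, "TECH": 0.5}
cats = {"SPORTS": 1.0, "POLITICS": 0.0}
print(masked_gradient(scores, cats))
# {'SPORTS': -0.25, 'POLITICS': 0.25, 'TECH': 0.0}
```

A target between 0 and 1 (partial membership, as the message mentions) passes through the same arithmetic unchanged.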

4163 Commits
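
Note on 51d18937af: the partial-application helper quoted in that message can be written out as a runnable sketch. The `partial` function follows the commit message; the `describe` function and its arguments are hypothetical stand-ins for a user-defined method and a `doc`, `span` or `token` instance.

```python
def partial(unbound_method, constant_arg):
    """Return a function that always passes constant_arg as the first argument."""
    def bound_method(*args, **kwargs):
        return unbound_method(constant_arg, *args, **kwargs)
    return bound_method


# Hypothetical "method": its first parameter plays the role of the object.
def describe(obj, prefix, suffix="!"):
    return prefix + str(obj) + suffix


bound = partial(describe, "doc")
print(bound("hello ", suffix="?"))  # hello doc?
```

The standard library's `functools.partial(describe, "doc")` behaves the same way for this use.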