spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-13 13:44:21 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	a350be0601	Fix vector-name loading fix	2018-04-04 01:31:25 +02:00
Matthew Honnibal	21047bde52	Fix syntax error in italian lemmatizer	2018-04-03 23:13:22 +02:00
Matthew Honnibal	81f4005f3d	Fix loading models with pretrained vectors	2018-04-03 23:11:48 +02:00
ines	3463ded7cf	Check if spaCy has compiled correctly and show error message	2018-04-03 22:18:47 +02:00
Matthew Honnibal	96b612873b	Add hyper-parameter to control whether parser makes a beam update	2018-04-03 22:02:56 +02:00
ines	e5f47cd82d	Update errors	2018-04-03 21:40:29 +02:00
Matthew Honnibal	f7e6313b43	Increment version to v2.0.11.dev0	2018-04-03 20:58:47 +02:00
ines	10462816bc	Fix tests for Python 2	2018-04-03 18:51:31 +02:00
ines	62b4b527d7	Don't raise error if set_extension has getter and setter (closes #2177). Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.	2018-04-03 18:30:17 +02:00
ines	ee3082ad29	Fix whitespace	2018-04-03 18:29:53 +02:00
Ines Montani	3141e04822	💫 New system for error messages and warnings (#2163) * Add spacy.errors module * Update deprecation and user warnings * Replace errors and asserts with new error message system * Remove redundant asserts * Fix whitespace * Add messages for print/util.prints statements * Fix typo * Fix typos * Move CLI messages to spacy.cli._messages * Add decorator to display error code with message. An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc. * Remove unused link in spacy.about * Update errors for invalid pipeline components * Improve error for unknown factories * Add displaCy warnings * Update formatting consistency * Move error message to spacy.errors * Update errors and check if doc returned by component is None	2018-04-03 15:50:31 +02:00
Matthew Honnibal	abf8b16d71	Add doc.retokenize() context manager (#2172). This patch takes a step towards #1487 by introducing the doc.retokenize() context manager, to handle merging spans, and soon splitting tokens. The idea is to do merging and splitting like this: `with doc.retokenize() as retokenizer: for start, end, label in matches: retokenizer.merge(doc[start : end], attrs={'ent_type': label})`. The retokenizer accumulates the merge requests, and applies them together at the end of the block. This will allow retokenization to be more efficient, and much less error prone. A retokenizer.split() function will then be added, to handle splitting a single token into multiple tokens. These methods take `Span` and `Token` objects; if the user wants to go directly from offsets, they can append to the .merges and .splits lists on the retokenizer. The doc.merge() method's behaviour remains unchanged, so this patch should be 100% backwards compatible (modulo bugs). Internally, doc.merge() fixes up the arguments (to handle the various deprecated styles), opens the retokenizer, and makes the single merge. We can later start making deprecation warnings on direct calls to doc.merge(), to migrate people to use of the retokenize context manager.	2018-04-03 14:10:35 +02:00
Suraj Rajan	1cdbb7c97c	[2032] - Changed python set to cpp stl set (#2170). Description: Changed python set to cpp stl set (#2032). CPP stl set works better due to the logarithmic run time of its methods. Finding minimum in the cpp set is done in constant time as opposed to the worst case linear runtime of python set. Operations such as find, count, insert, delete are also done in either constant or logarithmic time thus making cpp set a better option to manage vectors. Reference: http://www.cplusplus.com/reference/set/set/ Types of change: Enhancement for `Vectors` for faster initialising of word vectors (fasttext)	2018-03-31 13:28:25 +02:00
Matthew Honnibal	f3b7c5e537	Fix syntax error	2018-03-29 21:50:32 +02:00
Matthew Honnibal	23afa6429f	Add input length error, to address #1826	2018-03-29 21:45:26 +02:00
Ines Montani	a609a1ca29	Merge pull request #2152 from explosion/feature/tidy-up-dependencies 💫 Tidy up dependencies	2018-03-29 14:35:09 +02:00
Viet Trung Tran	ea2af94cd9	Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155) * support for Vietnamese * Contributor Agreement for adding Vietnamese support on spaCy	2018-03-29 12:19:51 +02:00
ines	e6979bdbbd	Merge branch 'feature/tidy-up-dependencies' of https://github.com/explosion/spaCy into feature/tidy-up-dependencies	2018-03-29 00:19:37 +02:00
ines	83146458a2	Fix urllib for Python 3	2018-03-29 00:19:33 +02:00
Matthew Honnibal	8308bbc617	Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts	2018-03-29 00:14:55 +02:00
Matthew Honnibal	b5098079d8	Fix error on urllib	2018-03-29 00:08:16 +02:00
Ines Montani	0de599b16b	Merge pull request #2159 from explosion/feature/fix-merged-entity-iob (resolves #1554, resolves #1752) 💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents	2018-03-28 23:10:00 +02:00
Ines Montani	98e9cda677	Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660) 💫 Fix loading of multiple vector models	2018-03-28 23:08:24 +02:00
Matthew Honnibal	a7c5ae2beb	Avoid forcing a name on empty vectors, and remove print statement	2018-03-28 21:08:58 +02:00
ines	3eb67bbe4b	Allow entity types with dashes (resolves #1967)	2018-03-28 20:51:26 +02:00
Matthew Honnibal	cf5fcf0546	Update serialization test	2018-03-28 20:12:53 +02:00
Matthew Honnibal	4555e3e251	Don't assume pretrained_vectors cfg set in build_tagger	2018-03-28 20:12:45 +02:00
Matthew Honnibal	0b375d50c8	Fix ent_iob tags in doc.merge to avoid inconsistent sequences	2018-03-28 18:39:03 +02:00
Matthew Honnibal	95fa89c4b8	Update doc.ents test	2018-03-28 18:39:03 +02:00
Matthew Honnibal	e807f88410	Resolve merge when cherry-picking ent iob patches from develop	2018-03-28 18:38:13 +02:00
Matthew Honnibal	99fbc7db33	Improve error message when entity sequence is inconsistent	2018-03-28 18:36:53 +02:00
Matthew Honnibal	cbd2794be0	Add test for ent_iob during span merge	2018-03-28 18:36:53 +02:00
Matthew Honnibal	f8dd905a24	Warn and fallback if vectors have no name	2018-03-28 18:24:53 +02:00
Matthew Honnibal	fd9e259414	Add test for #1660	2018-03-28 18:22:51 +02:00
Matthew Honnibal	bc4afa9881	Remove print statement	2018-03-28 17:48:37 +02:00
Matthew Honnibal	79dc241caa	Set pretrained_vectors in parser cfg	2018-03-28 17:35:07 +02:00
Matthew Honnibal	17c3e7efa2	Add message noting vectors	2018-03-28 16:33:43 +02:00
Matthew Honnibal	9bf6e93b3e	Set pretrained_vectors in begin_training	2018-03-28 16:32:41 +02:00
Matthew Honnibal	95a9615221	Fix loading of multiple pre-trained vectors. This patch addresses #1660, which was caused by keying all pre-trained vectors with the same ID when telling Thinc how to refer to them. This meant that if multiple models were loaded that had pre-trained vectors, errors or incorrect behaviour resulted. The vectors class now includes a .name attribute, which defaults to: {nlp.meta['lang']_nlp.meta['name']}.vectors. The vectors name is set in the cfg of the pipeline components under the key pretrained_vectors. This replaces the previous cfg key pretrained_dims. In order to make existing models compatible with this change, we check for the pretrained_dims key when loading models in from_disk and from_bytes, and add the cfg key pretrained_vectors if we find it.	2018-03-28 16:02:59 +02:00
ines	7fbc9e5874	Replace requests with urllib	2018-03-28 12:46:07 +02:00
ines	da1f200362	Add compat helpers for urllib	2018-03-28 12:45:53 +02:00
ines	ac88c72c9a	Fix ftfy workaround and remove old import	2018-03-28 12:14:28 +02:00
ines	ce6071ca89	Remove ftfy dependency and update docs	2018-03-28 12:09:42 +02:00
Matthew Honnibal	070b6c6495	Remove dependency on ftfy	2018-03-28 12:07:02 +02:00
ines	6d2c85f428	Drop six and related hacks as a dependency	2018-03-28 10:45:25 +02:00
ines	9e83513004	Add position of invalid token to error message	2018-03-27 23:56:59 +02:00
ines	11c4735ccf	Fix issue in Italian lemmatizer data (resolves #2050)	2018-03-27 23:55:22 +02:00
ines	693971dd8f	Improve error message if token text is empty string (see #2101 )	2018-03-27 22:25:40 +02:00
ines	0c829e6605	Fix whitespace	2018-03-27 22:20:59 +02:00
Matthew Honnibal	d4680e4d83	Merge branch 'master' of https://github.com/explosion/spaCy	2018-03-27 13:36:37 +02:00
Matthew Honnibal	63a267b34d	Fix #2073: Token.set_extension not working	2018-03-27 13:36:20 +02:00
Ines Montani	68226109f4	Merge pull request #2142 from jimregan/polish-more-tokens more exceptions	2018-03-24 19:06:44 +01:00
Matthew Honnibal	d566e673bf	Set version to v2.0.10	2018-03-24 18:09:03 +01:00
Matthew Honnibal	0d3bf0d4eb	Merge branch 'master' of https://github.com/explosion/spaCy	2018-03-24 17:31:49 +01:00
dejanmarich	ccd1c04c63	Update stop_words.py Added more words	2018-03-24 17:31:24 +01:00
ines	f1446b0257	Port over Turkish changes	2018-03-24 17:31:07 +01:00
DuyguA	cd604878a4	quick typo fix	2018-03-24 17:26:35 +01:00
Matthew Honnibal	406548b976	Support .gz and .tar.gz files in spacy init-model	2018-03-24 17:18:32 +01:00
Jim O'Regan	efe037e8be	more exceptions	2018-03-24 00:05:27 +00:00
Matthew Honnibal	e3be3d65b3	Version as v2.0.10.dev0	2018-03-15 17:31:22 +01:00
ines	f3f8bfc367	Add built-in factories for merge_entities and merge_noun_chunks. Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).	2018-03-15 17:16:54 +01:00
alldefector	f4e5904fc2	Fix Spanish noun_chunks failure caused by typo	2018-03-14 17:03:17 +01:00
Thomas Opsomer	fbf48b3f9f	lemma property to return hash instead of unicode	2018-03-14 17:03:00 +01:00
Matthew Honnibal	8cefc58abc	Fix Vectors pickling	2018-03-14 16:59:37 +01:00
Matthew Honnibal	307aefe131	Increment version to v2.0.9	2018-02-22 17:07:53 +01:00
Ines Montani	14e7e0f12a	Merge pull request #2000 from jimregan/polish-tag-map Polish tag map	2018-02-18 19:05:58 +01:00
Jim O'Regan	664407de5d	missing PrepCase attribute	2018-02-18 14:46:12 +00:00
Jim O'Regan	95f0673fbc	fix typo/missing here too	2018-02-18 14:38:27 +00:00
Matthew Honnibal	cf0e320f2b	Add doc.is_sentenced attribute, re #1959	2018-02-18 14:16:55 +01:00
Matthew Honnibal	1e5aeb4eec	Merge pull request #1987 from thomasopsomer/span-sent Make span.sent work when only manual / custom sbd	2018-02-18 14:05:37 +01:00
Matthew Honnibal	1cf774bdc1	Add output options return_matches and as_tuples to Matcher	2018-02-18 14:00:45 +01:00
Matthew Honnibal	dd9b0945af	Fix inconsistencies in the symbols table	2018-02-18 13:51:31 +01:00
Matthew Honnibal	66496ac8e1	Set version to v2.1.0.dev0	2018-02-18 13:48:39 +01:00
Matthew Honnibal	eb3040ce46	Merge pull request #1891 from fucking-signup/master Fix issue #1889	2018-02-18 13:47:47 +01:00
ines	6bba1db4cc	Drop six and related hacks as a dependency	2018-02-18 13:29:56 +01:00
Matthew Honnibal	b30b09192a	Merge pull request #1665 from jimregan/animacy typo in "inan", add "nhum"	2018-02-18 13:26:53 +01:00
Matthew Honnibal	1b3c98e01b	Set version to v2.0.8	2018-02-18 12:16:31 +01:00
Matthew Honnibal	f9f46e5a07	Revert matcher fixes from GregDubbin	2018-02-18 10:59:28 +01:00
Matthew Honnibal	86405e4ad1	Fix CLI for multitask objectives	2018-02-18 10:59:11 +01:00
Matthew Honnibal	a34749b2bf	Add multitask objectives options to train CLI	2018-02-17 22:03:54 +01:00
Matthew Honnibal	8f06903e09	Fix multitask objectives	2018-02-17 18:41:36 +01:00
Matthew Honnibal	d1246c95fb	Fix model loading when using multitask objectives	2018-02-17 18:11:36 +01:00
Matthew Honnibal	262d0a3148	Fix overwriting of lexical attributes when loading vectors during training	2018-02-17 18:11:11 +01:00
Matthew Honnibal	c0caf7cf27	Fix LANG symbol	2018-02-17 18:10:50 +01:00
Matthew Honnibal	0bf2f6be29	Add missing symbol for LANG attr. Fixes inconsistent numeric ID	2018-02-17 17:37:02 +01:00
Matthew Honnibal	97a228a4ce	Increment to v2.0.8.dev0	2018-02-17 16:54:36 +01:00
Aaron Marquez	ea571e8325	Merge branch 'master' into issue-1959	2018-02-16 15:14:09 -08:00
Matthew Honnibal	7d5c720fc3	Fix multitask objective when no pipeline provided	2018-02-15 23:50:21 +01:00
Aaron Marquez	f0d3672e17	Changed loading EN model	2018-02-15 14:28:38 -08:00
Aaron Marquez	3765d84d57	Fix issue #1959	2018-02-15 12:51:49 -08:00
Aaron Marquez	7ba4111554	Add test for issue-1959	2018-02-15 12:46:22 -08:00
Matthew Honnibal	59b7cf9db8	Add get_beam_parse method in ArcEager, for Prodigy	2018-02-15 21:03:16 +01:00
Matthew Honnibal	3e541de440	Merge branch 'master' of https://github.com/explosion/spaCy	2018-02-15 21:02:55 +01:00
Thomas Opsomer	5d24a81c0b	add test for span.sent when doc not parsed	2018-02-15 16:59:16 +01:00
Thomas Opsomer	deab391cbf	correct check on sent_start & raise if no boundaries	2018-02-15 16:58:30 +01:00
Matthew Honnibal	4cb861e080	Merge pull request #1968 from DuyguA/is_currency New lexical feature is_currency	2018-02-15 12:13:36 +01:00
Thomas Opsomer	b902731313	Find span sentence when only sentence boundaries (no parser)	2018-02-14 22:18:54 +01:00
Claudiu-Vlad Ursache	e28de12cbd	Ensure files opened in `from_disk` are closed. Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).	2018-02-13 20:49:43 +01:00
Johannes Dollinger	012e874d09	Add contributor agreement for emulbreh	2018-02-13 13:40:33 +01:00
Johannes Dollinger	bf94c13382	Don't fix random seeds on import	2018-02-13 12:42:23 +01:00
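
Commit abf8b16d71 in the log above describes a retokenizer that accumulates merge requests inside a `with` block and applies them together when the block ends. The pattern can be sketched in plain Python, without spaCy. This is only an illustration under stated assumptions: `ToyRetokenizer` and the module-level `retokenize()` function are hypothetical names, tokens are plain strings in a list, merges are given as non-overlapping `(start, end)` offsets, and none of this is spaCy's actual implementation.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical stand-in that only collects merge requests."""

    def __init__(self):
        # Queued (start, end) token offsets; end is exclusive.
        self.merges = []

    def merge(self, start, end):
        # Nothing is modified yet; the request is just recorded.
        self.merges.append((start, end))


@contextmanager
def retokenize(tokens):
    """Yield a collector, then apply all queued merges to `tokens` in place."""
    retokenizer = ToyRetokenizer()
    yield retokenizer
    # Apply right-to-left so the offsets of earlier spans stay valid
    # after later spans have been collapsed (assumes no overlaps).
    for start, end in sorted(retokenizer.merges, reverse=True):
        tokens[start:end] = [" ".join(tokens[start:end])]


tokens = ["I", "like", "New", "York", "and", "San", "Francisco"]
with retokenize(tokens) as retokenizer:
    retokenizer.merge(2, 4)
    retokenizer.merge(5, 7)
print(tokens)
```

Deferring the work to the end of the block means every offset passed to `merge()` refers to the original tokenization, which is one plausible reading of why the commit message calls the approach less error prone.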


4894 Commits