spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-11 20:28:20 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	f1d77eb140	💫 Improve handling of missing NER tags (closes #2603 ) (#3341 ) * Improve handling of missing NER tags GoldParse can accept missing NER tags, if entities is provided in BILUO format (rather than as spans). Missing tags can be provided as None values. Fix bug that occurred when first tag was a None value. Closes #2603. * Document specification of missing NER tags.	2019-02-27 12:06:32 +01:00
Ines Montani	f25bd9f5e4	Add gold.spans_from_biluo_tags helper (#3227 )	2019-02-06 21:50:26 +11:00
Matthew Honnibal	a338c6f8f6	Fix JSON segmentation bug that affected French Fix a bug in the JSON streaming code that GoldCorpus uses. Escaped slashes were being handled incorrectly. This bug caused low scores for French in the early v2.1.0 alphas, because most of the data was not being read in. Fittingly, the document that triggered the bug was a Wikipedia article about Perl. Parsing perl remains difficult!	2018-12-08 10:41:24 +01:00
Ines Montani	5b2741f751	Remove unused cytoolz / itertools imports	2018-12-03 02:12:07 +01:00
Ines Montani	f37863093a	💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003 ) Remove hacks and wrappers, keep code in sync across our libraries and move spaCy a few steps closer to only depending on packages with binary wheels 🎉 See here: https://github.com/explosion/srsly Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy have steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place. At the same time, we noticed that having a lot of small dependencies was making maintenance harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel. srsly currently includes forks of the following packages: ujson msgpack msgpack-numpy cloudpickle * WIP: replace json/ujson with srsly * Replace ujson in examples Use regular json instead of srsly to make code easier to read and follow * Update requirements * Fix imports * Fix typos * Replace msgpack with srsly * Fix warning	2018-12-03 01:28:22 +01:00
Matthew Honnibal	9536ee787c	Add comma deletion to data noising	2018-12-01 13:42:18 +00:00
Ines Montani	37c7c85a86	💫 New JSON helpers, training data internals & CLI rewrite (#2932 ) * Support nowrap setting in util.prints * Tidy up and fix whitespace * Simplify script and use read_jsonl helper * Add JSON schemas (see #2928) * Deprecate Doc.print_tree Will be replaced with Doc.to_json, which will produce a unified format * Add Doc.to_json() method (see #2928) Converts Doc objects to JSON using the same unified format as the training data. Method also supports serializing selected custom attributes in the doc._. space. * Remove outdated test * Add write_json and write_jsonl helpers * WIP: Update spacy train * Tidy up spacy train * WIP: Use wasabi for formatting * Add GoldParse helpers for JSON format * WIP: add debug-data command * Fix typo * Add missing import * Update wasabi pin * Add missing import * 💫 Refactor CLI (#2943) To be merged into #2932. ## Description - [x] refactor CLI To use [`wasabi`](https://github.com/ines/wasabi) - [x] use [`black`](https://github.com/ambv/black) for auto-formatting - [x] add `flake8` config - [x] move all messy UD-related scripts to `cli.ud` - [x] make converters function that take the opened file and return the converted data (instead of having them handle the IO) ### Types of change enhancement ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Update wasabi pin * Delete old test * Update errors * Fix typo * Tidy up and format remaining code * Fix formatting * Improve formatting of messages * Auto-format remaining code * Add tok2vec stuff to spacy.train * Fix typo * Update wasabi pin * Fix path checks for when train() is called as function * Reformat and tidy up pretrain script * Update argument annotations * Raise error if model language doesn't match lang * Document new train command	2018-11-30 20:16:14 +01:00
Matthew Honnibal	61e435610e	💫 Feature/improve pretraining (#2971 ) * Improve spacy pretrain script * Implement BERT-style 'masked language model' objective. Much better results. * Improve logging. * Add length cap for documents, to avoid memory errors. * Require thinc 7.0.0.dev1 * Require thinc 7.0.0.dev1 * Add argument for using pretrained vectors * Fix defaults * Fix syntax error * Improve spacy pretrain script * Implement BERT-style 'masked language model' objective. Much better results. * Improve logging. * Add length cap for documents, to avoid memory errors. * Require thinc 7.0.0.dev1 * Require thinc 7.0.0.dev1 * Add argument for using pretrained vectors * Fix defaults * Fix syntax error * Tweak pretraining script * Fix data limits in spacy.gold * Fix pretrain script	2018-11-28 18:04:58 +01:00
Matthew Honnibal	0fdb25b958	Fix msgpack error	2018-11-27 19:35:55 +01:00
Matthew Honnibal	b9ef8ac616	Fix GoldParse class when no entities	2018-09-27 15:14:27 +02:00
Matthew Honnibal	3b6b018904	Fix loading of gold morphology	2018-09-26 21:01:48 +02:00
Matthew Honnibal	fb0abddd9e	Call morph morphology in GoldParse	2018-09-25 21:34:53 +02:00
Matthew Honnibal	834dfb0e9d	Add morph attribute to GoldParse	2018-09-25 21:32:05 +02:00
Matthew Honnibal	66a3f2ba21	Lower-case text before alignment	2018-08-16 00:42:36 +02:00
Matthew Honnibal	a9fb6d5511	Fix docs2jsonl function	2018-08-14 14:03:48 +02:00
Matthew Honnibal	2a5a61683e	Add function to get train format from Doc objects Our JSON training format is annoying to work with, and we've wanted to retire it for some time. In the meantime, we can at least add some missing functions to make it easier to live with. This patch adds a function that generates the JSON format from a list of Doc objects, one per paragraph. This should be a convenient way to handle a lot of data conversions: whatever format you have the source information in, you can use it to setup a Doc object. This approach should offer better future-proofing as well. Hopefully, we can steadily rewrite code that is sensitive to the current data-format, so that it instead goes through this function. Then when we change the data format, we won't have such a problem.	2018-08-14 13:13:10 +02:00
Matthew Honnibal	4336397ecb	Update develop from master	2018-08-14 03:04:28 +02:00
Xiaoquan Kong	f0c9652ed1	New Feature: display more detail when Error E067 (#2639 ) * Fix off-by-one error * Add verbose option * Update verbose option * Update documents for verbose option	2018-08-07 10:45:29 +02:00
ines	4a62486340	Merge branch 'master' into develop	2018-05-30 13:01:01 +02:00
Maciej	c7d53348d7	Fix bug in CLI iob and ner converter (#2392 ) (fixes #2385 ) * issue_2385 add tests for iob_to_biluo converter function * issue_2385 fix and modify iob_to_biluo function to accept either iob or biluo tags in cli.converter * issue_2385 add test to fix b char bug * add contributor agreement * fill contributor agreement	2018-05-30 12:28:44 +02:00
Matthew Honnibal	8661218fe8	Refactor parser (#2308 ) * Work on refactoring greedy parser * Compile updated parser * Fix refactored parser * Update test * Fix refactored parser * Fix refactored parser * Readd beam search after refactor * Fix beam search after refactor * Fix parser * Fix beam parsing * Support oracle segmentation in ud-train CLI command * Avoid relying on final gold check in beam search * Add a keyword argument sink to GoldParse * Bug fixes to beam search after refactor * Avoid importing fused token symbol in ud-run-test, until that's added * Avoid importing fused token symbol in ud-run-test, until that's added * Don't modify Token in global scope * Fix error in beam gradient calculation * Default to beam_update_prob 1 * Set a more aggressive threshold on the max violn update * Disable some tests to figure out why CI fails * Disable some tests to figure out why CI fails * Add some diagnostics to travis.yml to try to figure out why build fails * Tell Thinc to link against system blas on Travis * Point thinc to libblas on Travis * Try running sudo=true for travis * Unhack travis.sh * Restore beam_density argument for parser beam * Require thinc 6.11.1.dev16 * Revert hacks to tests * Revert hacks to travis.yml * Update thinc requirement * Fix parser model loading * Fix size limits in training data * Add missing name attribute for parser * Fix appveyor for Windows	2018-05-15 22:17:29 +02:00
Matthew Honnibal	bf19f22340	Allow gold.sent_starts to be set from Python	2018-05-07 15:51:34 +02:00
Jens Dahl Møllerhøj	b9290397fb	rename SP to _SP (#2289 )	2018-05-03 18:33:49 +02:00
Matthew Honnibal	2c4a6d66fa	Merge master into develop. Big merge, many conflicts -- need to review	2018-04-29 14:49:26 +02:00
Ines Montani	3141e04822	💫 New system for error messages and warnings (#2163 ) * Add spacy.errors module * Update deprecation and user warnings * Replace errors and asserts with new error message system * Remove redundant asserts * Fix whitespace * Add messages for print/util.prints statements * Fix typo * Fix typos * Move CLI messages to spacy.cli._messages * Add decorator to display error code with message An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc. * Remove unused link in spacy.about * Update errors for invalid pipeline components * Improve error for unknown factories * Add displaCy warnings * Update formatting consistency * Move error message to spacy.errors * Update errors and check if doc returned by component is None	2018-04-03 15:50:31 +02:00
Matthew Honnibal	1f7229f40f	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" This reverts commit `c9ba3d3c2d`, reversing changes made to `92c26a35d4`.	2018-03-27 19:23:02 +02:00
ines	c699aec089	Add offsets_from_biluo_tags helper and tests (see #1626 )	2017-11-26 16:38:01 +01:00
Matthew Honnibal	86ddf692a1	Fix bug in limit calculation on dev data	2017-11-14 01:37:10 +01:00
Matthew Honnibal	1cab703bba	Move minibatch function to util	2017-11-06 23:45:36 +01:00
ines	d96e72f656	Tidy up rest	2017-10-27 21:07:59 +02:00
ines	a6135336f5	Tidy up gold	2017-10-27 17:02:55 +02:00
Matthew Honnibal	6e552c9d83	Prune number of non-projective labels more aggressively	2017-10-11 02:46:44 -05:00
Matthew Honnibal	563f46f026	Fix multi-label support for text classification The TextCategorizer class is supposed to support multi-label text classification, and allow training data to contain missing values. For this to work, the gradient of the loss should be 0 when labels are missing. Instead, there was no way to actually denote "missing" in the GoldParse class, and so the TextCategorizer class treated the label set within gold.cats as complete. To fix this, we change GoldParse.cats to be a dict instead of a list. The GoldParse.cats dict should map to floats, with 1. denoting 'present' and 0. denoting 'absent'. Gradients are zeroed for categories absent from the gold.cats dict. A nice bonus is that you can also set values between 0 and 1 for partial membership. You can also set numeric values, if you're using a text classification model that uses an appropriate loss function. Unfortunately this is a breaking change; although the functionality was only recently introduced and hasn't been properly documented yet. I've updated the example script accordingly.	2017-10-05 18:43:02 -05:00
Matthew Honnibal	ba23d63c35	Fix minibatch function, for fixed batch size	2017-09-14 13:37:41 +02:00
Matthew Honnibal	4bb6bc3f9e	Add support for sent_start to GoldParse	2017-08-25 20:03:14 -05:00
Matthew Honnibal	84b7ed49e4	Ensure updates aren't made if no gold available	2017-08-20 14:41:38 +02:00
Matthew Honnibal	ec63f4fe7b	Add option to control how missing entities are handled when getting NER tags	2017-07-29 21:58:37 +02:00
Matthew Honnibal	9bae0ddc50	Fix minibatching	2017-07-22 20:14:49 +02:00
Matthew Honnibal	ed6c85fa3c	Fix loading of text categories in GoldParse	2017-07-22 20:04:03 +02:00
Matthew Honnibal	7ea50182a5	Add support for text-classification labels to GoldParse	2017-07-20 00:17:47 +02:00
Matthew Honnibal	ebb6c49cd5	Make alignment case-insensitive for gold	2017-06-04 20:26:42 -05:00
Matthew Honnibal	fc4dd62e84	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-04 20:19:05 -05:00
Matthew Honnibal	a053b1218e	Fix item counting during training	2017-06-04 20:18:20 -05:00
Matthew Honnibal	9bc4a26213	Add option of data augmentation noise	2017-06-04 20:16:57 -05:00
Matthew Honnibal	f6955a459c	Fix prev commit	2017-06-03 14:38:37 -05:00
Matthew Honnibal	468ca6c760	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-03 14:33:51 -05:00
Matthew Honnibal	c647a0d33e	Fix training counter for gold preprocessing	2017-06-03 14:33:39 -05:00
Matthew Honnibal	e62f46d39f	Clarify gold.pyx slightly	2017-06-03 13:28:52 -05:00
Matthew Honnibal	be4a640f0c	Fix arc eager label costs for uint64	2017-05-30 20:37:58 +02:00
Matthew Honnibal	84e66ca6d4	WIP on stringstore change. 27 failures	2017-05-28 14:06:40 +02:00
Matthew Honnibal	d06f235fc9	Fix conflict on convert.py	2017-05-26 11:33:29 -05:00
Matthew Honnibal	2e587c6417	Export iob_to_biluo utility	2017-05-26 11:32:55 -05:00
Matthew Honnibal	daac3e3573	Always shuffle gold data, and support length cap	2017-05-26 11:30:52 -05:00
Matthew Honnibal	3a6e59cc53	Add minibatch function in spacy.gold	2017-05-25 17:15:09 -05:00
Matthew Honnibal	3959d778ac	Revert "Revert "WIP on improving parser efficiency"" This reverts commit `532afef4a8`.	2017-05-23 03:06:53 -05:00
Matthew Honnibal	532afef4a8	Revert "WIP on improving parser efficiency" This reverts commit `bdaac7ab44`.	2017-05-23 03:05:25 -05:00
Matthew Honnibal	bdaac7ab44	WIP on improving parser efficiency	2017-05-23 02:59:31 -05:00
Matthew Honnibal	c9760b2104	Support sentence limits in GoldCorpus	2017-05-22 10:40:46 -05:00
ines	54f04a9fe0	Update API docs with changes in spacy.gold and spacy.language	2017-05-22 12:29:30 +02:00
Matthew Honnibal	2a5eb9f61e	Make nonproj methods top-level functions, instead of class methods	2017-05-22 04:51:08 -05:00
Matthew Honnibal	025d9bbc37	Fix handling of non-projective deps	2017-05-22 04:51:08 -05:00
Matthew Honnibal	f13d6c7359	Support gold preprocessing and single gold files	2017-05-22 04:51:08 -05:00
Matthew Honnibal	5db89053aa	Merge docstrings	2017-05-21 13:46:23 -05:00
Matthew Honnibal	432b3499b3	Fix memory leak	2017-05-21 13:38:46 -05:00
Matthew Honnibal	4803b3b69e	Add GoldCorpus class, to manage data streaming	2017-05-21 09:06:17 -05:00
ines	075f5ff87a	Update docstrings and API docs for GoldParse	2017-05-21 13:53:46 +02:00
Matthew Honnibal	fc8d3a112c	Add util.env_opt support: Can set hyper params through environment variables.	2017-05-18 04:36:53 -05:00
Matthew Honnibal	793430aa7a	Get spaCy train command working with neural network * Integrate models into pipeline * Add basic serialization (maybe incorrect) * Fix pickle on vocab	2017-05-17 12:04:50 +02:00
Matthew Honnibal	89a4f262fc	Fix training methods	2017-04-16 13:00:37 -05:00
ines	e1efd589c3	Fix json imports and use ujson	2017-04-15 12:13:34 +02:00
ines	958b12dec8	Use pathlib instead of os.path	2017-04-15 12:13:00 +02:00
ines	d24589aa72	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
ines	561f2a3eb4	Use consistent formatting for docstrings	2017-04-15 11:59:21 +02:00
Raphaël Bournhonesque	f332bf05be	Remove unused import statements	2017-03-21 21:08:54 +01:00
Matthew Honnibal	2611ac2a89	Fix scorer bug for NER, related to ambiguity between missing annotations and misaligned tokens	2017-03-16 09:38:28 -05:00
Matthew Honnibal	3d4e389d23	Whitespace	2017-03-15 09:29:42 -05:00
Matthew Honnibal	159e8c46e1	Merge old training fixes with newer state	2016-11-25 09:16:36 -06:00
Matthew Honnibal	cc7e607a8a	Fix gold.pyx for 1.0	2016-11-25 08:57:59 -06:00
Matthew Honnibal	b86f8af0c1	Fix doc strings	2016-11-01 12:25:36 +01:00
Matthew Honnibal	f5fe4f595b	Fix json loading, for Python 3.	2016-10-20 21:23:26 +02:00
Matthew Honnibal	52b48b415e	Fix GoldParse class	2016-10-16 11:41:36 +02:00
Matthew Honnibal	0317cea0ad	Fix GoldParse	2016-10-15 23:55:07 +02:00
Matthew Honnibal	a48aa15384	Improve the API for the GoldParse class.	2016-10-15 23:53:29 +02:00
Matthew Honnibal	e07fe92b27	Draft a refactored init for the GoldParse class	2016-10-15 22:09:52 +02:00
Matthew Honnibal	86ae665c78	Add function for entity->biluo transformation	2016-10-15 21:51:04 +02:00
Matthew Honnibal	645d99523a	Move merge_sents method into spacy.gold	2016-10-13 03:24:29 +02:00
Matthew Honnibal	ea23b64cc8	Refactor training, with new spacy.train module. Defaults still a little awkward.	2016-10-09 12:24:24 +02:00
Wolfgang Seeker	b6b96b233c	don't require read_json_file to expect particular annotations	2016-05-02 15:29:30 +02:00
Wolfgang Seeker	4d7f393fae	don't require json-files to have syntactic annotation	2016-04-22 16:32:27 +02:00
Henning Peters	6215272786	remove ujson as default non-dev dependency (still works as fallback if installed), because ujson doesn't ship wheels	2016-04-12 11:28:07 +02:00
Wolfgang Seeker	690c5acabf	adjust train.py to train both english and german models	2016-03-03 15:21:00 +01:00
Wolfgang Seeker	3448cb40a4	integrated pseudo-projective parsing into parser - nonproj.pyx holds a class PseudoProjectivity which currently holds all functionality to implement Nivre & Nilsson 2005's pseudo-projective parsing using the HEAD decoration scheme - changed lefts/rights in Token to account for possible non-projective structures	2016-03-01 10:09:08 +01:00
Wolfgang Seeker	4b2297d5d4	add class PseudoProjective for pseudo-projective parsing PseudoProjective() implements the algorithm from Nivre & Nilsson 2005 using their HEAD decoration scheme.	2016-02-24 11:26:25 +01:00
Wolfgang Seeker	8d531c958b	replace tests for non-projectivity - add functions to find non-projective edges - add test file for non-projectivity functions	2016-02-22 14:40:40 +01:00
Matthew Honnibal	83dccf0fd7	* Use io module instead of deprecated codecs module	2015-10-10 14:13:01 +11:00
alvations	8caedba42a	caught more codecs.open -> io.open	2015-09-30 20:20:09 +02:00
Matthew Honnibal	7606d9936f	* Python3 correction for GoldParse	2015-07-28 14:44:53 +02:00
Matthew Honnibal	f4809e562f	* Allow json to be used as a fallback if ujson is not available	2015-07-25 18:11:36 +02:00
Matthew Honnibal	2ae0b439b2	* Fix space check in gold.pyx	2015-07-14 00:10:27 +02:00
Matthew Honnibal	89a91ad726	* Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity	2015-07-09 13:30:41 +02:00
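
Commit 563f46f026 describes GoldParse.cats as a dict that maps labels to floats, with gradients zeroed for categories absent from the dict. The idea can be sketched in plain Python as follows. This is only an illustration, not spaCy's implementation: the function name `masked_gradient`, the label names, and the squared-error loss are all assumptions made for the example.

```python
def masked_gradient(scores, gold_cats, labels):
    """Per-label gradient of an assumed squared-error loss.

    Labels missing from gold_cats get a zero gradient, so unannotated
    categories do not push the model in either direction.
    """
    grads = []
    for label, score in zip(labels, scores):
        if label in gold_cats:
            # Annotated label: gradient is (prediction - target).
            grads.append(score - gold_cats[label])
        else:
            # Missing label: contributes nothing to the update.
            grads.append(0.0)
    return grads


# Hypothetical label set and model scores, for illustration only.
labels = ["SPORTS", "POLITICS", "TECH"]
scores = [0.9, 0.2, 0.6]
# 1.0 denotes 'present', 0.0 denotes 'absent'; TECH is left unannotated.
gold_cats = {"SPORTS": 1.0, "POLITICS": 0.0}

grads = masked_gradient(scores, gold_cats, labels)
```

Here the entry for TECH is exactly zero, while SPORTS and POLITICS receive ordinary error signals. Because targets are floats, a value such as 0.5 would express partial membership, as the commit message notes.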


165 Commits
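
Several commits in this history (2e587c6417, c7d53348d7) concern converting IOB entity tags to the BILUO scheme. The conversion can be sketched roughly as below. The function `iob_to_biluo_sketch` is a hypothetical stand-in written for this example, not the library's `iob_to_biluo`: it assumes well-formed `B-`/`I-`/`O` input and does not accept tags that are already BILUO, which the fix in c7d53348d7 is described as supporting.

```python
def iob_to_biluo_sketch(tags):
    """Convert IOB tags to BILUO (assumes well-formed IOB input)."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append("O")
            continue
        prefix, label = tag.split("-", 1)
        # The entity continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A B tag with no continuation is a single-token (Unit) entity.
            biluo.append(("B-" if continues else "U-") + label)
        elif prefix == "I":
            # An I tag with no continuation is the Last token of the entity.
            biluo.append(("I-" if continues else "L-") + label)
        else:
            raise ValueError("Unexpected IOB tag: " + tag)
    return biluo


converted = iob_to_biluo_sketch(["B-PER", "I-PER", "O", "B-LOC"])
# converted == ["B-PER", "L-PER", "O", "U-LOC"]
```

The look-ahead to the next tag is what distinguishes BILUO from IOB: the final token of each entity and single-token entities get their own prefixes.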