spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-14 13:47:13 +03:00

Author	SHA1	Message	Date
Ines Montani	ce7eec846b	Move CLI-specific Markdown helper to CLI	2018-12-01 04:55:48 +01:00
Ines Montani	40ae499f32	Remove unused helper function. Now imported from wasabi	2018-12-01 04:54:46 +01:00
Ines Montani	37c7c85a86	💫 New JSON helpers, training data internals & CLI rewrite (#2932) * Support nowrap setting in util.prints * Tidy up and fix whitespace * Simplify script and use read_jsonl helper * Add JSON schemas (see #2928) * Deprecate Doc.print_tree. Will be replaced with Doc.to_json, which will produce a unified format * Add Doc.to_json() method (see #2928). Converts Doc objects to JSON using the same unified format as the training data. Method also supports serializing selected custom attributes in the doc._. space. * Remove outdated test * Add write_json and write_jsonl helpers * WIP: Update spacy train * Tidy up spacy train * WIP: Use wasabi for formatting * Add GoldParse helpers for JSON format * WIP: add debug-data command * Fix typo * Add missing import * Update wasabi pin * Add missing import * 💫 Refactor CLI (#2943) To be merged into #2932. ## Description - [x] refactor CLI to use [`wasabi`](https://github.com/ines/wasabi) - [x] use [`black`](https://github.com/ambv/black) for auto-formatting - [x] add `flake8` config - [x] move all messy UD-related scripts to `cli.ud` - [x] make converters functions that take the opened file and return the converted data (instead of having them handle the IO) ### Types of change enhancement ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Update wasabi pin * Delete old test * Update errors * Fix typo * Tidy up and format remaining code * Fix formatting * Improve formatting of messages * Auto-format remaining code * Add tok2vec stuff to spacy.train * Fix typo * Update wasabi pin * Fix path checks for when train() is called as function * Reformat and tidy up pretrain script * Update argument annotations * Raise error if model language doesn't match lang * Document new train command	2018-11-30 20:16:14 +01:00
Ines Montani	eddeb36c96	💫 Tidy up and auto-format .py files (#2983) ## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` and actually get meaningful results. At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-30 17:03:03 +01:00
Matthew Honnibal	2c37e0ccf6	💫 Use Blis for matrix multiplications (#2966) Our epic matrix multiplication odyssey is drawing to a close... I've now finally got the Blis linear algebra routines in a self-contained Python package, with wheels for Windows, Linux and OSX. The only missing platform at the moment is Windows Python 2.7. The result is at https://github.com/explosion/cython-blis Thinc v7.0.0 will make the change to Blis. I've put a Thinc v7.0.0.dev0 up on PyPI so that we can test these changes with the CI, and even get them out to spacy-nightly, before Thinc v7.0.0 is released. This PR also updates the other dependencies to be in line with the current versions master is using. I've also resolved the msgpack deprecation problems, and gotten spaCy and Thinc up to date with the latest Cython. The point of switching to Blis is to have control of how our matrix multiplications are executed across platforms. When we were using numpy for this, a different library would be used on pip and conda, OSX would use Accelerate, etc. This would open up different bugs and performance problems, especially when multi-threading was introduced. With the change to Blis, we now strictly single-thread the matrix multiplications. This will make it much easier to use multiprocessing to parallelise the runtime, since we won't have nested parallelism problems to deal with. * Use blis * Use -2 arg to Cython * Update dependencies * Fix requirements * Update setup dependencies * Fix requirement typo * Fix msgpack errors * Remove Python27 test from Appveyor, until Blis works there * Auto-format setup.py * Fix murmurhash version	2018-11-27 00:44:04 +01:00
Matthew Honnibal	3e7b214e57	Make pretrain script work with stream from stdin	2018-11-15 22:44:07 +00:00
Matthew Honnibal	6430b1fe64	Restore encoding arg on msgpack-numpy	2018-09-27 15:58:21 +02:00
Matthew Honnibal	8809dc4514	Remove deprecated encoding argument to msgpack	2018-09-27 12:56:23 +02:00
Matthew Honnibal	5afd98dff5	Add a stepping function, for changing batch sizes or learning rates	2018-09-14 18:37:16 +02:00
ines	3c30d1763c	Merge branch 'master' into develop	2018-07-21 15:34:18 +02:00
Matthew Honnibal	e0caf3ae8c	Fix msgpack for new version	2018-07-20 17:32:00 +02:00
Ines Montani	e7b075565d	💫 Rule-based NER component (#2513) * Add helper function for reading in JSONL * Add rule-based NER component * Fix whitespace * Add component to factories * Add tests * Add option to disable indent on json_dumps compat. Otherwise, reading JSONL back in line by line won't work * Fix error code	2018-07-18 19:43:16 +02:00
ines	3c3a175018	Merge branch 'master' into develop	2018-05-28 18:37:09 +02:00
ansgar-t	9732988951	escape html in displacy.render (#2378) (closes #2361) ## Description Fix for issue #2361: replace &, <, >, " with &amp;, &lt;, &gt;, &quot; before rendering SVG ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. (As discussed in the comments to #2361) - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-05-28 18:36:41 +02:00
Ines Montani	862da5e793	Support pipeline factories via entry points (#2348)	2018-05-22 18:29:45 +02:00
ines	5401c55c75	Merge branch 'master' into develop	2018-05-20 16:49:40 +02:00
ines	5768df4f09	Add SimpleFrozenDict util to use as default function argument	2018-05-20 15:13:37 +02:00
Matthew Honnibal	2c4a6d66fa	Merge master into develop. Big merge, many conflicts -- need to review	2018-04-29 14:49:26 +02:00
Ines Montani	3141e04822	💫 New system for error messages and warnings (#2163) * Add spacy.errors module * Update deprecation and user warnings * Replace errors and asserts with new error message system * Remove redundant asserts * Fix whitespace * Add messages for print/util.prints statements * Fix typo * Fix typos * Move CLI messages to spacy.cli._messages * Add decorator to display error code with message. An implementation like this is nice because it only modifies the string when it's retrieved from the containing class, so we don't have to worry about manipulating tracebacks etc. * Remove unused link in spacy.about * Update errors for invalid pipeline components * Improve error for unknown factories * Add displaCy warnings * Update formatting consistency * Move error message to spacy.errors * Update errors and check if doc returned by component is None	2018-04-03 15:50:31 +02:00
Matthew Honnibal	8308bbc617	Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts	2018-03-29 00:14:55 +02:00
Matthew Honnibal	1f7229f40f	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" This reverts commit `c9ba3d3c2d`, reversing changes made to `92c26a35d4`.	2018-03-27 19:23:02 +02:00
Matthew Honnibal	8b7a74570f	Revert "Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"" This reverts commit `f41e626844`.	2018-03-27 19:22:52 +02:00
Matthew Honnibal	f41e626844	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" This reverts commit `c9ba3d3c2d`, reversing changes made to `f57bfbccdc`.	2018-03-27 19:22:25 +02:00
Matthew Honnibal	c9ba3d3c2d	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-03-27 18:59:08 +02:00
Matthew Honnibal	92c26a35d4	Update get_cuda_stream	2018-03-27 16:42:00 +00:00
Matthew Honnibal	bede11b67c	Improve label management in parser and NER (#2108) This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly. Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important, to ensure that the label-to-class mapping is stable. We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense. To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort. Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training. To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make. Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths. This is a squash merge, as I made a lot of very small commits. Individual commit messages below.
* Simplify label management for TransitionSystem and its subclasses * Fix serialization for new label handling format in parser * Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir * Set actions in transition system * Require thinc 6.11.1.dev4 * Fix error in parser init * Add unicode declaration * Fix unicode declaration * Update textcat test * Try to get model training on less memory * Print json loc for now * Try rapidjson to reduce memory use * Remove rapidjson requirement * Try rapidjson for reduced mem usage * Handle None heads when projectivising * Stream json docs * Fix train script * Handle projectivity in GoldParse * Fix projectivity handling * Add minibatch_by_words util from ud_train * Minibatch by number of words in spacy.cli.train * Move minibatch_by_words util to spacy.util * Fix label handling * More hacking at label management in parser * Fix encoding in msgpack serialization in GoldParse * Adjust batch sizes in parser training * Fix minibatch_by_words * Add merge_subtokens function to pipeline.pyx * Register merge_subtokens factory * Restore use of msgpack tmp directory * Use minibatch-by-words in train * Handle retokenization in scorer * Change back-off approach for missing labels. Use 'dep' label * Update NER for new label management * Set NER tags for over-segmented words * Fix label alignment in gold * Fix label back-off for infrequent labels * Fix int type in labels dict key * Fix int type in labels dict key * Update feature definition for 8 feature set * Update ud-train script for new label stuff * Fix json streamer * Print the line number if conll eval fails * Update children and sentence boundaries after deprojectivisation * Export set_children_from_heads from doc.pxd * Render parses during UD training * Remove print statement * Require thinc 6.11.1.dev6. 
Try adding wheel as install_requires * Set different dev version, to flush pip cache * Update thinc version * Update GoldCorpus docs * Remove print statements * Fix formatting and links [ci skip]	2018-03-19 02:58:08 +01:00
Matthew Honnibal	31b156d60b	Fix itershuffle	2018-03-10 22:32:59 +01:00
Johannes Dollinger	012e874d09	Add contributor agreement for emulbreh	2018-02-13 13:40:33 +01:00
Johannes Dollinger	bf94c13382	Don't fix random seeds on import	2018-02-13 12:42:23 +01:00
ines	35653bef3a	Add missing import (fixes #1546)	2017-11-10 19:05:18 +01:00
Matthew Honnibal	726f689da4	Fix missing import	2017-11-07 13:20:12 +01:00
ines	8fb48b9b91	Update and document new util functions	2017-11-07 00:22:43 +01:00
Matthew Honnibal	1cab703bba	Move minibatch function to util	2017-11-06 23:45:36 +01:00
ines	39e0586192	Add deprecated helper. Uses warning to show DeprecationWarning and custom stack trace	2017-11-01 16:32:36 +01:00
Matthew Honnibal	a7bf38bf31	Remove misleading comment on util.get_cuda_stream()	2017-11-01 13:57:25 +01:00
ines	ea4a41c8fb	Tidy up util and helpers	2017-10-27 14:39:09 +02:00
Matthew Honnibal	9baa8fe7ec	Convert closure to functools.partial, to promote pickling	2017-10-17 18:20:52 +02:00
Matthew Honnibal	df488274b1	Fix deserialization of vectors	2017-10-16 20:55:00 +02:00
ines	d5418553eb	Fix whitespace	2017-10-16 18:30:04 +02:00
ines	6ceadcdb5c	Make sure from_disk passes string to numpy (see #1421). If path is a WindowsPath, numpy does not recognise it as a path and as a result, doesn't open the file. https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L369	2017-10-16 18:29:56 +02:00
ines	b39409173e	Add disable option and True/False/None values for pipeline	2017-10-07 00:29:08 +02:00
ines	212c8f0711	Implement new Language methods and pipeline API	2017-10-07 00:25:54 +02:00
Matthew Honnibal	f24c2e3a8a	Fix evaluate for non-GPU	2017-10-03 22:47:31 +02:00
ines	8dbe49ecb8	Always compare lowercase package names. Otherwise, is_package will return False if model name contains uppercase characters. See this issue: https://support.prodi.gy/t/saving-a-trained-ner-model-as-a-loadable-module/46/6	2017-09-29 20:55:17 +02:00
ines	153c2589d4	Revert "Always compare lowercase package names" This reverts commit `7d77dc490f`.	2017-09-29 20:53:36 +02:00
ines	7d77dc490f	Always compare lowercase package names. Otherwise, is_package will return False if model name contains uppercase characters. See this issue: https://support.prodi.gy/t/saving-a-trained-ner-model-as-a-loadable-module/46/6	2017-09-29 20:52:28 +02:00
Matthew Honnibal	ffda38356a	Add util function to enable GPU	2017-09-20 19:16:35 -05:00
ines	68f66aebf8	Use pkg_resources instead of pip for is_package (resolves #1293)	2017-09-16 20:27:59 +02:00
Matthew Honnibal	30e35d9666	Fix syntax error	2017-08-30 17:35:39 -05:00
ines	173089a45a	Add more validation for model meta	2017-08-29 11:21:46 +02:00
Matthew Honnibal	ed95009b5c	Fix data loading on Python 2	2017-08-18 21:57:06 +02:00
Dan O'Huiginn	ebf5a3ce59	Allow loading with Python < 3.6. Don't rely on recent Python features to load models. Fixes issue #1271	2017-08-17 15:15:47 +00:00
ines	ea167e14db	Fix model package loading from link	2017-06-05 13:10:49 +02:00
ines	dd6dc4c120	Update spacy.load() helper functions	2017-06-05 13:02:31 +02:00
ines	7db1a0e83e	Make sure printed values are always strings	2017-06-04 21:27:20 +02:00
ines	070e026ed9	Ensure path on read_json	2017-06-04 20:44:37 +02:00
ines	e1e73936b1	Raise correct error	2017-06-04 20:44:27 +02:00
ines	4c2bbc3ccc	Add add_lookups util function	2017-06-03 19:44:47 +02:00
ines	924c58bde3	Fix serialization of optional elements	2017-06-02 18:18:17 +02:00
Matthew Honnibal	1d18cedae8	Fiddle with msgpack bytes vs unicode	2017-06-01 10:48:43 -05:00
Matthew Honnibal	3ff7d7fcef	Merge for updated requirements	2017-06-01 04:57:47 -05:00
Matthew Honnibal	ae8010b526	Move weight serialization to Thinc	2017-06-01 02:56:12 -05:00
Matthew Honnibal	c8a58cfcf8	Fix Python2/3 load bug	2017-05-31 15:21:44 -05:00
Matthew Honnibal	8dfb9546f0	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-31 07:21:14 -05:00
Matthew Honnibal	92f9e5cc9a	Silence env_opt, and fix serialization for GPU	2017-05-31 07:14:11 -05:00
Matthew Honnibal	33e5ec737f	Fix to/from disk methods	2017-05-31 13:43:10 +02:00
Matthew Honnibal	2a061e2777	Fix serialisation, for reals this time	2017-05-29 17:52:08 -05:00
Matthew Honnibal	35d981241f	Fix model deserialization	2017-05-29 14:46:31 -05:00
Matthew Honnibal	5b29f227ae	Fix serialization	2017-05-29 14:35:53 -05:00
Matthew Honnibal	1e6df0a2a1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-29 14:30:12 -05:00
ines	08382f21e3	Pass model meta to nlp object in load_model	2017-05-29 20:44:11 +02:00
Matthew Honnibal	f1acdaab55	Fix serialization of weight offsets	2017-05-29 13:23:11 -05:00
Matthew Honnibal	c044e9c21c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-29 08:41:02 -05:00
Matthew Honnibal	aa4c33914b	Work on serialization	2017-05-29 08:40:45 -05:00
ines	567485a818	Fix and document model loading with pipeline and overrides	2017-05-29 14:10:10 +02:00
Matthew Honnibal	deac7eb01c	Fix for serialization	2017-05-29 13:54:18 +02:00
Matthew Honnibal	04c32aa091	Fix for serialization	2017-05-29 13:53:32 +02:00
Matthew Honnibal	a1960c2d09	Fix for serialization	2017-05-29 13:47:42 +02:00
Matthew Honnibal	7b06bb896e	Fix for serialization	2017-05-29 13:42:55 +02:00
Matthew Honnibal	f4aafca222	Merge changes to test_misc	2017-05-29 12:26:02 +02:00
Matthew Honnibal	ff26aa6c37	Work on to/from bytes/disk serialization methods	2017-05-29 11:45:45 +02:00
ines	df920ba0e7	Add tests for displaCy and util functions and fix util typo	2017-05-29 10:51:19 +02:00
Matthew Honnibal	c91b121aeb	Move serialization functions to util	2017-05-29 10:13:42 +02:00
Matthew Honnibal	6dad4117ad	Work on serialization for models	2017-05-29 01:37:57 +02:00
ines	c1983621fb	Update util functions for model loading	2017-05-28 00:22:40 +02:00
ines	c8543c8237	Fix formatting and docstrings and remove deprecated function	2017-05-28 00:22:40 +02:00
ines	51882c4984	Fix formatting	2017-05-26 12:37:45 +02:00
Matthew Honnibal	80cf42e33b	Fix compounding and decaying utils	2017-05-25 17:15:39 -05:00
Matthew Honnibal	b9cea9cd93	Add compounding and decaying functions	2017-05-25 16:16:10 -05:00
ines	b5fb43fdd8	Allow sys.exit status as exits keyword arg in util.prints()	2017-05-22 12:29:15 +02:00
Matthew Honnibal	5db89053aa	Merge docstrings	2017-05-21 13:46:23 -05:00
Matthew Honnibal	0731971bfc	Add itershuffle utility function. Maybe belongs in thinc	2017-05-21 09:05:05 -05:00
ines	3871157d84	Update spacy.util documentation	2017-05-21 01:12:09 +02:00
Matthew Honnibal	238be0f16a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-18 08:32:22 -05:00
Matthew Honnibal	c214c0decb	Improve env_opt reporting	2017-05-18 08:32:03 -05:00
ines	489d2fb4ba	Add is_in_jupyter() helper for displaCy (see #1058)	2017-05-18 14:13:14 +02:00
ines	abf0188b0a	Move cupy and CudaStream to compat	2017-05-18 14:12:45 +02:00
Matthew Honnibal	fc8d3a112c	Add util.env_opt support: Can set hyper params through environment variables.	2017-05-18 04:36:53 -05:00
Matthew Honnibal	1d7c18e58a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-15 21:53:47 +02:00
Matthew Honnibal	a9edb3aa1d	Improve integration of NN parser, to support unified training API	2017-05-15 21:53:27 +02:00
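
The label ordering scheme described in commit bede11b67c (new labels get negative "frequencies" so that sorting reproduces insertion order) can be illustrated with a small sketch. This is not spaCy's actual code: the helper names `add_label` and `ordered_labels` are hypothetical, and sorting in descending order is an assumption made here so that frequent labels come first and new labels follow in the order they were added.

```python
def add_label(freqs, label):
    """Record a new label with the next negative pseudo-frequency.

    Hypothetical helper: the first new label gets -1, the next -2, etc.
    """
    if label in freqs:
        return
    # Lowest value seen so far, capped at 0 so real counts are ignored.
    lowest = min(min(freqs.values(), default=0), 0)
    freqs[label] = lowest - 1


def ordered_labels(freqs):
    """Return labels sorted by (frequency, label), highest first.

    Assumption: descending order, so labels seen in training data come
    first and new labels keep their insertion order (-1, -2, -3, ...).
    """
    pairs = sorted(((freq, label) for label, freq in freqs.items()), reverse=True)
    return [label for _, label in pairs]


# Two labels with real counts, then three labels added afterwards.
freqs = {"nsubj": 120, "dobj": 80}
for new_label in ["new_a", "new_b", "new_c"]:
    add_label(freqs, new_label)

print(freqs)
print(ordered_labels(freqs))
```

Because every new label receives a distinct negative value, the resulting order does not depend on dictionary iteration order, which is the property the commit message says is needed to keep class IDs stable.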

269 Commits
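
The escaping described in commit 9732988951 (replacing &, <, >, " with their HTML entities before rendering SVG) can be sketched as follows. This is a minimal illustration, not the code from that commit; the function name `escape_html` is chosen here for the example.

```python
def escape_html(text):
    """Replace &, <, > and " with HTML entities.

    The ampersand must be handled first; otherwise the ampersands
    introduced by the other replacements would be escaped again.
    """
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    return text


print(escape_html('a < b & "c"'))
```

Python's standard library offers `html.escape`, which covers the same characters and, by default, also escapes single quotes.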