spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-09 10:11:24 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	6936ca1664	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-10 09:44:07 +01:00
Matthew Honnibal	4405b5c875	Fix resizing edge-case for NER	2018-12-10 06:25:17 +00:00
Matthew Honnibal	0994dc50d8	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-10 05:35:01 +00:00
Matthew Honnibal	24f2e9bc07	Tweak training params	2018-12-09 17:08:58 +00:00
Matthew Honnibal	16c5861d29	Fix NER space constraints Allow entities to end on spaces, to avoid stumping the oracle when we're inside an entity, and there's a space just before a correct entity.	2018-12-09 08:06:45 +01:00
Matthew Honnibal	1b1a1af193	Fix printing in spacy train	2018-12-09 06:03:49 +01:00
Matthew Honnibal	d2ac618af1	Set cbb_maxout_pieces=3	2018-12-08 23:27:29 +01:00
Matthew Honnibal	cb16b78b0d	Set dropout rate to 0.2	2018-12-08 19:59:11 +01:00
Matthew Honnibal	2c2db0c492	💫 Allow Span to take text label (#3031) Fixes #3027. * Allow Span.__init__ to take unicode values for the `label` argument. * Allow `Span.label_` to be writeable. - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-12-08 13:08:41 +01:00
Matthew Honnibal	11a29af751	Set cupy.random seed in fix_random_seed helper	2018-12-08 12:37:38 +01:00
Ines Montani	ffdd5e964f	Small CLI improvements (#3030) * Add todo * Auto-format * Update wasabi pin * Format training results with wasabi * Remove loading animation from model saving Currently behaves weirdly * Inline messages * Remove unnecessary path2str Already taken care of by printer * Inline messages in CLI * Remove unused function * Move loading indicator into loading function * Check for invalid whitespace entities	2018-12-08 11:49:43 +01:00
Matthew Honnibal	8aa7882762	Make NORM a token attribute (#3029) See #3028. The solution in this patch is pretty debatable. What we do is give the TokenC struct a .norm field, by repurposing the previously idle .sense attribute. It's nice to repurpose a previous field because it means the TokenC doesn't change size, so even if someone's using the internals very deeply, nothing will break. The weird thing here is that the TokenC and the LexemeC both have an attribute named NORM. This arguably assists in backwards compatibility. On the other hand, maybe it's really bad! We're changing the semantics of the attribute subtly, so maybe it's better if someone calling lex.norm gets a breakage, and instead is told to write lex.default_norm? Overall I believe this patch makes the NORM feature work the way we sort of expected it to work. Certainly it's much more like how the docs describe it, and more in line with how we've been directing people to use the norm attribute. We'll also be able to use token.norm to do stuff like spelling correction, which is pretty cool.	2018-12-08 10:49:10 +01:00
Matthew Honnibal	a338c6f8f6	Fix JSON segmentation bug that affected French Fix a bug in the JSON streaming code that GoldCorpus uses. Escaped slashes were being handled incorrectly. This bug caused low scores for French in the early v2.1.0 alphas, because most of the data was not being read in. Fittingly, the document that triggered the bug was a Wikipedia article about Perl. Parsing perl remains difficult!	2018-12-08 10:41:24 +01:00
Matthew Honnibal	b2bfd1e1c8	Move dropout and batch sizes out of global scope in train cmd	2018-12-07 20:54:35 +01:00
Matthew Honnibal	40e0da9cc1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-07 00:12:22 +00:00
Matthew Honnibal	1e6725e9b7	Try to prevent spaces from being tagged as entities	2018-12-07 00:12:12 +00:00
Matthew Honnibal	427c0693c8	Fix missing comma in init-model command	2018-12-06 22:48:31 +01:00
Matthew Honnibal	d896fbca62	Fix batch size in parser.pipe	2018-12-06 21:45:56 +01:00
Matthew Honnibal	bb3304a4f1	Fix pickle tests	2018-12-06 20:46:36 +01:00
Matthew Honnibal	e619f45287	Fix pickle tests	2018-12-06 20:43:47 +01:00
Matthew Honnibal	0a60726215	Remove cytoolz usage in CLI	2018-12-06 20:37:00 +01:00
Matthew Honnibal	c0af627f32	Fix dill usage in vocab	2018-12-06 18:53:16 +01:00
Matthew Honnibal	9520489225	Fix removal of dill (for srsly)	2018-12-06 18:46:09 +01:00
Matthew Honnibal	711f108532	Fix cytoolz import cytoolz	2018-12-06 16:04:12 +01:00
Matthew Honnibal	cabaadd793	Fix build error from bad import Thinc v7.0.0.dev6 moved FeatureExtracter around and didn't add a compatibility import.	2018-12-06 15:12:39 +01:00
Matthew Honnibal	ea00dbaaa4	Remove usage of itertools.islice	2018-12-03 02:43:03 +01:00
Matthew Honnibal	c7b33b24f1	Fix conflict	2018-12-03 02:20:20 +01:00
Matthew Honnibal	2402ef498b	Remove unused import	2018-12-03 02:19:23 +01:00
Matthew Honnibal	1c71fdb805	Remove cytoolz usage from spaCy	2018-12-03 02:19:12 +01:00
Ines Montani	5b2741f751	Remove unused cytoolz / itertools imports	2018-12-03 02:12:07 +01:00
Matthew Honnibal	a7b085ae46	Set version back to 2.1.0a4	2018-12-03 02:03:26 +01:00
Matthew Honnibal	8e9a4d2f5e	Increment version to 2.1.0a5	2018-12-03 01:59:50 +01:00
Ines Montani	f37863093a	💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003) Remove hacks and wrappers, keep code in sync across our libraries and move spaCy a few steps closer to only depending on packages with binary wheels 🎉 See here: https://github.com/explosion/srsly Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy have steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place. At the same time, we noticed that having a lot of small dependencies was making maintenance harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel. srsly currently includes forks of the following packages: ujson msgpack msgpack-numpy cloudpickle * WIP: replace json/ujson with srsly * Replace ujson in examples Use regular json instead of srsly to make code easier to read and follow * Update requirements * Fix imports * Fix typos * Replace msgpack with srsly * Fix warning	2018-12-03 01:28:22 +01:00
Matthew Honnibal	40a273245c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-01 14:43:29 +01:00
Matthew Honnibal	d9d339186b	Fix dropout and batch-size defaults	2018-12-01 13:42:35 +00:00
Matthew Honnibal	9536ee787c	Add comma deletion to data noising	2018-12-01 13:42:18 +00:00
Matthew Honnibal	21ee1c7a17	Improve parser multi-task objective	2018-12-01 13:41:24 +00:00
Matthew Honnibal	fe7d6f36b1	Fix parser default	2018-12-01 13:41:04 +00:00
Matthew Honnibal	a31d557f2d	Set version to v2.1.0a4	2018-12-01 14:40:03 +01:00
Ines Montani	5c966d0874	Simplify function	2018-12-01 04:59:12 +01:00
Ines Montani	ce7eec846b	Move CLI-specific Markdown helper to CLI	2018-12-01 04:55:48 +01:00
Ines Montani	40ae499f32	Remove unused helper function Now imported from wasabi	2018-12-01 04:54:46 +01:00
Matthew Honnibal	3139b020b5	Fix train script	2018-11-30 22:17:08 +00:00
Matthew Honnibal	4aa1002546	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-30 20:58:51 +00:00
Matthew Honnibal	6bd1cc57ee	Increase length limit for pretrain	2018-11-30 20:58:18 +00:00
Gavriel Loria	919729d38c	replace user-facing references to "sbd" with "sentencizer" (#2985) ## Description Fixes #2693 Previously, the tokens `sbd` and `sentencizer` would create the same nlp pipe. Internally, both would be called `sbd`. This setup became problematic because it was hard for a user relying on the `sentencizer` pipe name to realize that their pipe's name would be `sbd` for all functions other than creating a pipe. This PR intends to change the API and API documentation to fully support `sentencizer` and drop any user-facing references to `sbd`. ### Types of change end-user API bug ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-30 21:22:40 +01:00
Ines Montani	37c7c85a86	💫 New JSON helpers, training data internals & CLI rewrite (#2932) * Support nowrap setting in util.prints * Tidy up and fix whitespace * Simplify script and use read_jsonl helper * Add JSON schemas (see #2928) * Deprecate Doc.print_tree Will be replaced with Doc.to_json, which will produce a unified format * Add Doc.to_json() method (see #2928) Converts Doc objects to JSON using the same unified format as the training data. Method also supports serializing selected custom attributes in the doc._. space. * Remove outdated test * Add write_json and write_jsonl helpers * WIP: Update spacy train * Tidy up spacy train * WIP: Use wasabi for formatting * Add GoldParse helpers for JSON format * WIP: add debug-data command * Fix typo * Add missing import * Update wasabi pin * Add missing import * 💫 Refactor CLI (#2943) To be merged into #2932. ## Description - [x] refactor CLI To use [`wasabi`](https://github.com/ines/wasabi) - [x] use [`black`](https://github.com/ambv/black) for auto-formatting - [x] add `flake8` config - [x] move all messy UD-related scripts to `cli.ud` - [x] make converters functions that take the opened file and return the converted data (instead of having them handle the IO) ### Types of change enhancement ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Update wasabi pin * Delete old test * Update errors * Fix typo * Tidy up and format remaining code * Fix formatting * Improve formatting of messages * Auto-format remaining code * Add tok2vec stuff to spacy.train * Fix typo * Update wasabi pin * Fix path checks for when train() is called as function * Reformat and tidy up pretrain script * Update argument annotations * Raise error if model language doesn't match lang * Document new train command	2018-11-30 20:16:14 +01:00
Matthew Honnibal	0369db75c1	Fix support for parser multi-task objectives	2018-11-30 19:53:59 +01:00
Ines Montani	323fc26880	Tidy up and format remaining files	2018-11-30 17:43:08 +01:00
Matthew Honnibal	1b240f2119	Fix default token_vector_width	2018-11-30 16:40:11 +00:00
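
Commit a338c6f8f6 above describes a JSON streaming bug in which escaped slashes were handled incorrectly, so most of a file was silently skipped. The log does not show the actual fix, so the following is only a minimal sketch of the general technique: splitting a stream of concatenated JSON objects while tracking string and escape state. The function name `split_json_objects` is hypothetical, and this is not spaCy's implementation. It assumes the failure was of the familiar kind where a backslash inside a string confuses the scanner about where the string ends.

```python
import json


def split_json_objects(text):
    """Yield each top-level JSON object found in `text` as a parsed dict.

    Hypothetical helper for illustration only; not spaCy's code.
    """
    depth = 0          # current nesting depth of { ... }
    in_string = False  # True while the scanner is inside a string literal
    escaped = False    # True if the previous character was an unconsumed backslash
    start = None       # index where the current top-level object began
    for i, char in enumerate(text):
        if in_string:
            if escaped:
                # This character is consumed by the preceding backslash,
                # so it cannot close the string or start a new escape.
                escaped = False
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
            continue
        if char == '"':
            in_string = True
        elif char == "{":
            if depth == 0:
                start = i
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                yield json.loads(text[start : i + 1])
                start = None


# Two objects back to back: the first string ends in an escaped backslash,
# the second contains escaped quotes and a brace inside the string.
stream = r'{"text": "C:\\"}' + "\n" + r'{"text": "a \"quoted\" brace }"}'
docs = list(split_json_objects(stream))
```

The important detail is that a backslash only escapes the single character that follows it. A scanner that merely checks whether the previous character was a backslash would treat the closing quote after `\\` as escaped and never leave the string, which swallows the rest of the input.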

5391 Commits