spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-10-04 02:46:40 +03:00

Author	SHA1	Message	Date
svlandeg	d133ffaff9	correct size, not counting dummy elements in the vector	2019-03-22 11:36:45 +01:00
svlandeg	33f8a0fe2e	check and unit test in case prior probs exceed 1	2019-03-22 11:36:45 +01:00
svlandeg	b55baaa1dc	avoid value 0 in preshmap and helpful user warnings	2019-03-22 11:36:45 +01:00
svlandeg	20a7b7b1c0	raising error when adding alias for unknown entity + unit test	2019-03-22 11:36:45 +01:00
svlandeg	8843f9279c	use StringStore	2019-03-22 11:36:45 +01:00
svlandeg	51560bf0ed	bugfix adding aliases	2019-03-22 11:36:45 +01:00
svlandeg	c4ba942765	get candidates by alias	2019-03-22 11:36:45 +01:00
svlandeg	151b855cc8	adding and retrieving aliases	2019-03-22 11:36:45 +01:00
svlandeg	cf34113250	very minimal KB functionality working	2019-03-22 11:36:44 +01:00
svlandeg	af281c5466	adding aliases per entity in the KB	2019-03-22 11:36:44 +01:00
svlandeg	f77b99c103	fix compile errors	2019-03-22 11:36:44 +01:00
svlandeg	27483f9080	add pyx and separate method to add aliases	2019-03-22 11:36:44 +01:00
svlandeg	feb71e15fd	hash the entity name	2019-03-22 11:36:44 +01:00
svlandeg	839dafa104	documented some comments and todos	2019-03-22 11:36:44 +01:00
svlandeg	7f37737878	kb snippet, draft by Matt (wip)	2019-03-22 11:36:44 +01:00
svlandeg	735fc2a735	annotate kb_id through ents in doc	2019-03-22 11:36:44 +01:00
svlandeg	d849eb2455	adding kb_id as field to token, el as nlp pipeline component	2019-03-22 11:34:46 +01:00
Matthew Honnibal	d811c97da1	Fix test that caused pytest to choke on Python3	2019-03-22 10:28:51 +01:00
Matthew Honnibal	a2ad9832e5	Add failing test for #3356	2019-03-22 02:42:37 +01:00
Matthew Honnibal	c66bd61e88	Fix lemmas	2019-03-21 14:22:12 +01:00
Matthew Honnibal	04395ffa49	Bring English tag_map in line with UD Treebank. I wrote a small script to read the UD English training data and check that our tag map and morph rules were resulting in the best POS map. This hadn't been done for some time, and there have been various changes to the UD schema since it was last done. After these changes we should see much better agreement between our POS assignments and the UD POS tags.	2019-03-21 13:53:44 +01:00
Matthew Honnibal	c7f26abe5f	Merge pull request #3434 from Bharat123rox/narrow-unicode Raise Error for a narrow unicode build of Python	2019-03-20 12:19:52 +01:00
Matthew Honnibal	1c8ff59185	Merge pull request #3441 from explosion/fix/cli-ud-scripts 💫 Move UD scripts to bin	2019-03-20 12:19:15 +01:00
Matthew Honnibal	72889a16d5	Fix similarity calculation if vectors are on GPU (#3440)	2019-03-20 12:09:59 +01:00
Matthew Honnibal	1612990e88	Implement cosine loss for spacy pretrain. Make default	2019-03-20 11:06:58 +00:00
Ines Montani	ae5b4d0e84	Fix formatting (hopefully also restarts build properly)	2019-03-20 09:55:45 +01:00
Ines Montani	6abc1ddb26	Update __main__.py	2019-03-20 09:43:26 +01:00
Bharat123Rox	f2547f02d6	Made changes suggested by @ines	2019-03-20 07:43:19 +05:30
Ines Montani	7400c7f8a7	Move UD scripts to bin	2019-03-20 01:19:34 +01:00
Ines Montani	685fff40cf	Revert "Add --always-link flag to cli.download (see #3435)" This reverts commit `583a566843`.	2019-03-20 01:03:40 +01:00
Matthew Honnibal	6cfbb2d34e	Merge branch 'master' of https://github.com/explosion/spaCy	2019-03-20 00:59:54 +01:00
Matthew Honnibal	5a53e9358a	Set version to 2.1.1	2019-03-20 00:59:45 +01:00
Ines Montani	583a566843	Add --always-link flag to cli.download (see #3435)	2019-03-19 22:03:27 +01:00
Bharat123Rox	6db1ddd9c7	Raise ValueError for narrow unicode build	2019-03-19 23:02:58 +05:30
Mehdi Hamoumi	9211f30ee3	Tiny correction in french lookup dictionary (#3427)	2019-03-19 13:00:19 +01:00
Ines Montani	f0c1efcb00	Set version to 2.1.0	2019-03-17 22:42:58 +01:00
Matthew Honnibal	47e110375d	Fix jsonl to json conversion (#3419) * Fix spacy.gold.docs_to_json function * Fix jsonl2json converter	2019-03-17 22:12:54 +01:00
Matthew Honnibal	0a4b074184	Improve beam search defaults	2019-03-17 21:47:45 +01:00
Ines Montani	226db621d0	Strip out .dev versions in spacy validate [ci skip]	2019-03-17 12:16:53 +01:00
Matthew Honnibal	c6be9964ec	Set version to v2.1.0.dev1	2019-03-16 21:47:41 +01:00
Matthew Honnibal	61617c64d5	Revert changes to optimizer default hyper-params (WIP) (#3415) While developing v2.1, I ran a bunch of hyper-parameter search experiments to find settings that performed well for spaCy's NER and parser. I ended up changing the default Adam settings from beta1=0.9, beta2=0.999, eps=1e-8 to beta1=0.8, beta2=0.8, eps=1e-5. This was giving a small improvement in accuracy (like, 0.4%). Months later, I ran the models with Prodigy, which uses beam-search decoding even when the model has been trained with a greedy objective. The new models performed terribly... So, wtf? After a couple of days debugging, I figured out that the new optimizer settings were causing the model to converge to solutions where the top-scoring class often had a score of like, -80. The variance on the weights had gone up enormously. I guess I needed to update the L2 regularisation as well? Anyway. Let's just revert the change: if the optimizer is finding such extreme solutions, that seems bad, and not nearly worth the small improvement in accuracy. Currently training a slate of models, to verify the accuracy change is minimal. Once the training is complete, we can merge this.	2019-03-16 21:39:02 +01:00
Matthew Honnibal	62afa64a8d	Expose batch size and length caps on CLI for pretrain (#3417) Add and document CLI options for batch size, max doc length, min doc length for `spacy pretrain`. Also improve CLI output. Closes #3216	2019-03-16 21:38:45 +01:00
Matthew Honnibal	58d562d9b0	Merge pull request #3416 from explosion/feature/improve-beam Improve beam search support	2019-03-16 18:42:18 +01:00
Ines Montani	2c5dd4d602	Update Vectors.find docs [ci skip]	2019-03-16 17:10:57 +01:00
Ines Montani	0f8739c7cb	Update train.py	2019-03-16 16:04:15 +01:00
Ines Montani	e7aa25d9b1	Fix beam width integration	2019-03-16 16:02:47 +01:00
Ines Montani	c94742ff64	Only add beam width if customised	2019-03-16 15:55:31 +01:00
Ines Montani	7a354761c7	Auto-format	2019-03-16 15:55:13 +01:00
Matthew Honnibal	daa8c3787a	Add eval_beam_widths argument to spacy train	2019-03-16 15:02:39 +01:00
Ines Montani	2eecd756fa	Update package name	2019-03-16 14:43:53 +01:00
Ines Montani	f55a52a2dd	Set version to v2.1.0.dev0	2019-03-16 13:47:03 +01:00
Ryan Ford	00842d7f1b	Merging conversion scripts for conll formats (#3405) * merging conllu/conll and conllubio scripts * tabs to spaces * removing conllubio2json from converters/__init__.py * Move not-really-CLI tests to misc * Add converter test using no-ud data * Fix test I broke * removing include_biluo parameter * fixing read_conllx * remove include_biluo from convert.py	2019-03-15 18:14:46 +01:00
Ines Montani	bec8db91e6	Add actual deprecation warning for n_threads (resolves #3410)	2019-03-15 16:38:44 +01:00
Ines Montani	cb5dbfa63a	Tidy up references to n_threads and fix default	2019-03-15 16:24:26 +01:00
Ines Montani	852e1f105c	Tidy up docstrings	2019-03-15 16:23:17 +01:00
Matthew Honnibal	b13b2aeb54	Use hash_state in beam	2019-03-15 15:22:58 +01:00
Matthew Honnibal	693c8934e8	Normalize over all actions in parser, not just valid ones	2019-03-15 15:22:16 +01:00
Matthew Honnibal	b94b2b1168	Export hash_state from beam_utils	2019-03-15 15:20:28 +01:00
Matthew Honnibal	ad56641324	Fix Language.evaluate	2019-03-15 15:20:09 +01:00
Matthew Honnibal	f762c36e61	Evaluate accuracy at multiple beam widths	2019-03-15 15:19:49 +01:00
Matthew Honnibal	0703f5986b	Remove hack from beam	2019-03-15 00:48:39 +01:00
Sofie	c45ed32c74	label in span not writable anymore (#3408) * label in span not writable anymore * more explicit unit test and error message for readonly label * bit more explanation (view) * error msg tailored to specific case * fix None case	2019-03-15 00:46:45 +01:00
Ines Montani	8ac197d443	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-03-12 15:22:11 +01:00
Matthew Honnibal	6aab2d8533	Set version to v2.1.0a13	2019-03-12 15:14:06 +01:00
Ines Montani	8ee6514ab8	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-03-12 15:11:39 +01:00
Ines Montani	479b5cff43	Auto-format [ci skip]	2019-03-12 13:35:34 +01:00
Matthew Honnibal	1179de0860	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-03-12 13:33:22 +01:00
Matthew Honnibal	8a4121cbc2	Fix bug introduced by component_cfg	2019-03-12 13:32:56 +01:00
Ines Montani	2912ddc9a6	Don't set extension attribute in Japanese (closes #3398)	2019-03-12 13:30:33 +01:00
Matthew Honnibal	062934aa12	Set version to v2.1.0a12	2019-03-11 22:26:19 +01:00
Ines Montani	886e5966c0	Update test_displacy.py	2019-03-11 19:03:52 +01:00
Ines Montani	4bd2688eac	💫 Fix displaCy support for RTL languages (#3393) Closes #2091. With the new `vocab.writing_system` property introduced in #3390 (exposed via the language defaults), I was able to finally fix this (I think!). Based on the `Doc`, displaCy now detects whether it's an RTL or LTR language and adjusts the visualization accordingly. Wherever possible, I've also added `direction` and `lang` attributes. Entity visualization now looks like this: [screenshot: https://user-images.githubusercontent.com/13643239/54136866-d97afd80-441c-11e9-8c27-3d46994cc833.png] And dependencies like this (ignore the most likely incorrect tags and dependencies): [screenshot: https://user-images.githubusercontent.com/13643239/54137771-8b66f980-441e-11e9-8460-0682b95eef2a.png] Types of change: enhancement, bug fix	2019-03-11 18:52:50 +01:00
Ines Montani	cdd418b93e	Auto-format [ci skip]	2019-03-11 17:10:50 +01:00
Matthew Honnibal	b0b990e405	Fix token.conjuncts (closes #795) (#3392) * Implement conjuncts method * Add span.conjuncts property * Un-xfail token.conjuncts tests * Update docs for token.conjuncts and span.conjuncts * Fix merge error in token.conjuncts	2019-03-11 17:05:45 +01:00
Matthew Honnibal	e2b9b523ce	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-03-11 15:59:28 +01:00
Ines Montani	47e9c274ef	Tidy up property code style (#3391) Use decorator if properties only have a getter and existing syntax if there's getter and setter	2019-03-11 15:59:09 +01:00
Matthew Honnibal	db79a704bf	Add xfail tests for token.conjuncts	2019-03-11 15:46:52 +01:00
Ines Montani	c3df4d1108	Move displaCy tests to own file	2019-03-11 15:28:34 +01:00
Ines Montani	c5a407e95a	Fix code style	2019-03-11 15:28:22 +01:00
Matthew Honnibal	39a4741e26	Add support for vocab.writing_system property (#3390) * Add xfail test for vocab.writing_system * Add vocab.writing_system property * Set Language.Defaults.writing_system * Set default writing system * Remove xfail on test_vocab_writing_system	2019-03-11 15:23:20 +01:00
Matthew Honnibal	05ef0a5abb	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-03-11 14:33:15 +01:00
Ines Montani	ee4f312e89	Add writing_system to ArabicDefaults (experimental)	2019-03-11 14:22:23 +01:00
Ines Montani	ebcf2bb1c3	Add Doc.lang and Doc.lang_	2019-03-11 14:21:40 +01:00
Ines Montani	ef80cfde6f	Fix pickling of Japanese (closes #3191)	2019-03-11 13:34:23 +01:00
Ines Montani	c399162a82	Tidy up	2019-03-11 13:34:14 +01:00
Ines Montani	7c05ca01e8	💫 Support mutable default values for extension attributes (#3389) * Support mutable default values in extensions * Update documentation	2019-03-11 12:50:44 +01:00
Matthew Honnibal	4e8a07c7d3	Set version to v2.1.0a11	2019-03-11 10:45:06 +01:00
Matthew Honnibal	80b94313b6	💫 Fix interaction of lemmatizer and tokenizer exceptions (#3388) Closes #2203. Closes #3268. Lemmas set from outside the `Morphology` class were being overwritten. The result was especially confusing when deserialising, as it meant some lemmas could change when storing and retrieving a `Doc` object. This PR applies two fixes: 1) When we go to set the lemma in the `Morphology` class, first check whether a lemma is already set. If so, don't overwrite. 2) When we load with `doc.from_array()`, take care to apply the `TAG` field first. This allows other fields to overwrite the `TAG` implied properties, if they're provided explicitly (e.g. the `LEMMA`).	2019-03-11 01:31:21 +01:00
Matthew Honnibal	04ca710da7	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-03-11 01:07:34 +01:00
Matthew Honnibal	5d25ee52fb	Fix English tag map	2019-03-11 01:06:02 +01:00
Ines Montani	8f45ff3dc2	Adjust formatting [ci skip]	2019-03-11 00:47:41 +01:00
Matthew Honnibal	7503e1e505	Improve English tag map. Re #593, #3311	2019-03-10 23:50:00 +01:00
Matthew Honnibal	98acf5ffe4	💫 Allow passing of config parameters to specific pipeline components (#3386) * Add component_cfg kwarg to begin_training * Document component_cfg arg to begin_training * Update docs and auto-format * Support component_cfg across Language * Format * Update docs and docstrings [ci skip] * Fix begin_training	2019-03-10 23:36:47 +01:00
Ines Montani	c998cde7e2	Auto-format [ci skip]	2019-03-10 19:22:59 +01:00
Ines Montani	7ba3a5d95c	💫 Make serialization methods consistent (#3385) * Make serialization methods consistent exclude keyword argument instead of random named keyword arguments and deprecation handling * Update docs and add section on serialization fields	2019-03-10 19:16:45 +01:00
Ines Montani	67e38690d4	Un-xfail passing tests and tidy up	2019-03-10 18:42:16 +01:00
Matthew Honnibal	27dd820753	Fix vocab deserialization when loading already present lexemes (#3383) * Fix vocab deserialization bug. Closes #2153 * Un-xfail test for #2153	2019-03-10 17:21:19 +01:00
Matthew Honnibal	d6eaa71afc	Handle scalar values in doc.from_array()	2019-03-10 16:54:03 +01:00
Matthew Honnibal	61e5ce02a4	Add xfailing test for #2153	2019-03-10 16:36:29 +01:00
Matthew Honnibal	7461e5e055	Fix batch bug in issue #3344	2019-03-10 16:01:34 +01:00

5885 Commits
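
Note on commit 61617c64d5: the message compares two sets of Adam hyper-parameters (beta1=0.9, beta2=0.999, eps=1e-8 versus beta1=0.8, beta2=0.8, eps=1e-5). The sketch below shows where those three values enter a textbook Adam update for a single scalar parameter. It is an illustration under stated assumptions, not spaCy's or thinc's optimizer code: the function names `adam_step` and `run`, the learning rate, and the scalar setup are all hypothetical.

```python
import math

# Hyper-parameter sets quoted in the commit message.
ORIGINAL = dict(beta1=0.9, beta2=0.999, eps=1e-8)
TUNED = dict(beta1=0.8, beta2=0.8, eps=1e-5)


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update for a scalar parameter (hypothetical helper)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # eps guards the division when the squared-gradient average is tiny.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def run(settings, grads, lr=0.001):
    """Apply adam_step over a sequence of gradients, starting from 0.0."""
    param, m, v = 0.0, 0.0, 0.0
    for t, grad in enumerate(grads, start=1):
        param, m, v = adam_step(param, grad, m, v, t, lr=lr, **settings)
    return param


if __name__ == "__main__":
    grads = [1.0, -0.5, 2.0, 0.1, -3.0]  # made-up gradients for illustration
    print("original:", run(ORIGINAL, grads))
    print("tuned:   ", run(TUNED, grads))
```

With beta2=0.8 the squared-gradient average spans roughly 1/(1-0.8) = 5 recent steps, against roughly 1000 steps for beta2=0.999, so the per-step scaling reacts much faster to recent gradients. Whether that accounts for the behaviour described in the commit is not something this sketch establishes.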