spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-01-10 17:26:42 +03:00

Author	SHA1	Message	Date
Motoki Wu	9c064e6ad9	Add resume logic to spacy pretrain (#3652 ) * Added ability to resume training * Add to README * Remove duplicate entry	2019-06-12 13:29:23 +02:00
Ramanan Balakrishnan	eb12703d10	minor fix to broken link in documentation (#3819 ) [ci skip]	2019-06-04 11:15:35 +02:00
Ines Montani	0c74506c9c	Fix typos in docs (closes #3802 ) [ci skip]	2019-06-01 11:35:01 +02:00
Nipun Sadvilkar	1f13005751	Incorrect Token attribute ent_iob_ description (#3800 ) * Incorrect Token attribute ent_iob_ description * Add spaCy contributor agreement	2019-05-31 16:50:45 +02:00
Ramanan Balakrishnan	26c37c5a4d	fix all references to BILUO annotation format (#3797 )	2019-05-31 12:19:19 +02:00
mak	89379a7fa4	Corrected example model URL in requirements.txt (#3786 ) The URL used to show how to add a model to the requirements.txt had the old release path (excl. explosion).	2019-05-29 10:51:55 +02:00
Ines Montani	7634812172	Document Language.evaluate	2019-05-24 14:06:36 +02:00
Ines Montani	45e6855550	Update Language.update docs	2019-05-24 14:06:26 +02:00
Ines Montani	b78a8dc1d2	Update Scorer and add API docs	2019-05-24 14:06:04 +02:00
Ines Montani	321c9f5acc	Fix lex_id docs (closes #3743 )	2019-05-16 23:15:58 +02:00
Ines Montani	f96af8526a	Merge branch 'spacy.io' [ci skip]	2019-05-11 23:03:56 +02:00
Ines Montani	7534f7cb44	Fix return value of Language.update (closes #3692 )	2019-05-11 18:40:19 +02:00
devforfu	21af12eb53	Make "text" key in JSONL format optional when "tokens" key is provided (#3721 ) * Fix issue with forcing text key when it is not required * Extending the docs to reflect the new behavior	2019-05-11 15:41:29 +02:00
Ines Montani	6cfa1e1f47	Fix DependencyParser.predict docs (resolves #3561 )	2019-05-11 15:37:54 +02:00
Ines Montani	25f5592d57	Improve Token.prob and Lexeme.prob docs (resolves #3701 )	2019-05-11 15:23:41 +02:00
Aaron Kub	719a15f23d	fixing regex matcher examples (#3708 ) (#3719 )	2019-05-10 14:23:52 +02:00
Ines Montani	65b55f1aaa	Add version tag to `--base-model` argument (closes #3720 )	2019-05-10 14:06:47 +02:00
Ines Montani	505c9e0e19	Add util.filter_spans helper (#3686 )	2019-05-08 02:33:40 +02:00
张晓飞	ba1ff00370	update response after calling add_pipe (#3661 ) * update response after calling add_pipe. The print_info component is appended last, so it needs to be shown at the end of the pipeline * Create henry860916.md	2019-05-01 12:02:18 +02:00
Ramiro Gómez	8ee4100f8f	Remove dangling M (#3657 ) I assume this is a typo. Sorry if it has a meaning that I'm not aware of.	2019-04-29 19:44:43 +02:00
Amit Chaudhary	167d63af31	Fix broken link to Dive Into Python 3 website (#3656 ) * Fix broken link to Dive Into Python 3 website * Sign spaCy Contributor Agreement	2019-04-29 19:44:00 +02:00
Ivan Tham	fa94f83697	Improve redundant variable name (#3643 ) * Improve redundant variable name * Apply suggestions from code review Co-Authored-By: pickfire <pickfire@riseup.net>	2019-04-26 16:50:14 +02:00
Ines Montani	ec0d840ab5	Document early stopping	2019-04-22 14:31:32 +02:00
Ines Montani	1d567913f9	Update spacy evaluate example	2019-04-22 14:28:42 +02:00
Ines Montani	7917ce2f73	Make flag shortcut consistent and document	2019-04-22 14:23:44 +02:00
Ines Montani	52658c80d5	Allow jupyter=False to override Jupyter mode (closes #3598 )	2019-04-22 14:18:32 +02:00
Motoki Wu	8e2cef49f3	Add save after `--save-every` batches for `spacy pretrain` (#3510 ) When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches. ## Description To test... Save this file to `sample_sents.jsonl` ``` {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} ``` Then run `--save-every 2` when pretraining. ```bash spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2 ``` And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name. At the end of training, you should see these files (`ls here/`): ```bash config.json model2.bin model5.bin model8.bin log.jsonl model2.temp.bin model5.temp.bin model8.temp.bin model0.bin model3.bin model6.bin model9.bin model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin model1.bin model4.bin model7.bin model1.temp.bin model4.temp.bin model7.temp.bin ``` ### Types of change This is a new feature to `spacy pretrain`. 🌵 Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error). ``` Processing matcher.pyx [Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx' Traceback (most recent call last): File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module> run(args.root) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run process(base, filename, db) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp") File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd func(args) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx raise Exception("Cython failed") Exception: Cython failed Traceback (most recent call last): File "setup.py", line 276, in <module> setup_package() File "setup.py", line 209, in setup_package generate_cython(root, "spacy") File "setup.py", line 132, in generate_cython raise RuntimeError("Running cythonize failed") RuntimeError: Running cythonize failed ``` Edit: Fixed! after deleting all `.cpp` files: `find spacy -name "*.cpp" | xargs rm` ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-04-22 14:10:16 +02:00
Ines Montani	0dce4585b1	Add course to 101	2019-04-19 15:59:51 +02:00
Ines Montani	2efc87c382	Remove unused image	2019-04-19 15:48:12 +02:00
Ines Montani	38395d9518	Merge branch 'spacy.io'	2019-04-19 15:26:20 +02:00
Ines Montani	7ac5bb0a7b	Update landing and feature overview	2019-04-19 15:23:08 +02:00
fizban99	f2f2df6e78	entity types for colors should be in uppercase (#3599 ) although the text indicates the entity types should be in lowercase, the sample code shows uppercase, which is the correct format.	2019-04-17 11:22:56 +02:00
Ines Montani	5289dd1356	Fix formatting	2019-04-13 17:58:26 +02:00
Ines Montani	9e7deeaf48	Remove Datacamp	2019-04-13 17:46:32 +02:00
Santiago Castro	86e4b68aa9	Fix website docs for Vectors.from_glove (#3565 ) * Fix website docs for Vectors.from_glove * Add myself as a contributor	2019-04-10 15:23:27 +02:00
Bharat Raghunathan	72820896d4	Fix typo in web docs cli.md (#3559 )	2019-04-09 11:40:03 +02:00
pierremonico	0d26bfe677	Removes duplicate in table (#3550 ) * Removes duplicate in table Just fixing typos. * Remove newline Co-authored-by: Ines Montani <ines@ines.io>	2019-04-08 10:30:42 +02:00
Ines Montani	2f0f439c54	Remove non-existent example (closes #3533 )	2019-04-03 09:59:17 +02:00
Samuel Kane	06a1846379	fix(util): fix decaying function output (#3495 ) * fix(util): fix decaying function output * fix(util): better test and adhere to code standards * fix(util): correct variable name, pytestify test, update website text	2019-03-28 13:24:47 +01:00
Bharat Raghunathan	1db3e47509	DOC: Update tokenizer docs to include default value for batch_size in pipe (#3492 )	2019-03-28 12:48:02 +01:00
Ines Montani	200d8bdb3c	Merge branch 'spacy.io' [ci skip]	2019-03-23 16:46:34 +01:00
Ines Montani	1e5b917d75	Fix formatting [ci skip]	2019-03-23 16:45:50 +01:00
Matthew Honnibal	6c783f8045	Bug fixes and options for TextCategorizer (#3472 ) * Fix code for bag-of-words feature extraction The _ml.py module had a redundant copy of a function to extract unigram bag-of-words features, except one had a bug that set values to 0. Another function allowed extraction of bigram features. Replace all three with a new function that supports arbitrary ngram sizes and also allows control of which attribute is used (e.g. ORTH, LOWER, etc). * Support 'bow' architecture for TextCategorizer This allows efficient ngram bag-of-words models, which are better when the classifier needs to run quickly, especially when the texts are long. Pass architecture="bow" to use it. The extra arguments ngram_size and attr are also available, e.g. ngram_size=2 means unigram and bigram features will be extracted. * Fix size limits in train_textcat example * Explain architectures better in docs	2019-03-23 16:44:44 +01:00
Ines Montani	06bf130890	💫 Add better and serializable sentencizer (#3471 ) * Add better serializable sentencizer component * Replace default factory * Add tests * Tidy up * Pass test * Update docs	2019-03-23 15:45:02 +01:00
Ines Montani	b532386a60	Fix typo [ci skip]	2019-03-22 18:36:17 +01:00
Ines Montani	5073ce63fd	Merge branch 'spacy.io' [ci skip]	2019-03-22 15:17:11 +01:00
Ines Montani	0712efc6b3	Update version requirements [ci skip]	2019-03-21 10:23:54 +01:00
Ines Montani	dac8f8ff99	Update Span.__init__ docs (see #3445 ) [ci skip]	2019-03-20 17:24:17 +01:00
Ines Montani	d4eed4a84f	Add note on unicode build to troubleshooting guide (see #3421 ) [ci skip]	2019-03-19 10:27:02 +01:00
Ines Montani	08284f3a11	💫 v2.1.0 launch updates (only merge on launch!) (#3414 ) * Update README.md * Use production docsearch [ci skip] * Add option to exclude pages from search	2019-03-18 16:07:26 +01:00
Ines Montani	a611b32fbf	Update model docs [ci skip]	2019-03-17 11:48:18 +01:00
Matthew Honnibal	62afa64a8d	Expose batch size and length caps on CLI for pretrain (#3417 ) Add and document CLI options for batch size, max doc length, min doc length for `spacy pretrain`. Also improve CLI output. Closes #3216 ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-16 21:38:45 +01:00
Ines Montani	2c5dd4d602	Update Vectors.find docs [ci skip]	2019-03-16 17:10:57 +01:00
Ines Montani	cbcba699dd	Fix missing ids	2019-03-14 17:56:53 +01:00
Ines Montani	4cfe4aa224	Fix small issues in the docs [ci skip]	2019-03-12 22:57:15 +01:00
Ines Montani	ba7eb2d131	Update section [ci skip]	2019-03-12 16:18:34 +01:00
Ines Montani	cecc31b765	Don't auto-slugify accordion links [ci skip]	2019-03-12 15:30:49 +01:00
Ines Montani	72fb324d95	Add vector training script to bin [ci skip]	2019-03-12 12:07:56 +01:00
Ines Montani	3abf0e6b9f	Replace dev-resources links with real examples	2019-03-12 12:07:40 +01:00
Ines Montani	59c0620487	Auto-format	2019-03-12 12:07:11 +01:00
Ines Montani	cdd418b93e	Auto-format [ci skip]	2019-03-11 17:10:50 +01:00
Matthew Honnibal	b0b990e405	Fix token.conjuncts (closes #795 ) (#3392 ) * Implement conjuncts method * Add span.conjuncts property * Un-xfail token.conjuncts tests * Update docs for token.conjuncts and span.conjuncts * Fix merge error in token.conjuncts	2019-03-11 17:05:45 +01:00
Ines Montani	25cb764e64	Document new API [ci skip]	2019-03-11 15:23:53 +01:00
Ines Montani	ebcf2bb1c3	Add Doc.lang and Doc.lang_	2019-03-11 14:21:40 +01:00
Ines Montani	7c05ca01e8	💫 Support mutable default values for extension attributes (#3389 ) * Support mutable default values in extensions * Update documentation	2019-03-11 12:50:44 +01:00
Matthew Honnibal	98acf5ffe4	💫 Allow passing of config parameters to specific pipeline components (#3386 ) * Add component_cfg kwarg to begin_training * Document component_cfg arg to begin_training * Update docs and auto-format * Support component_cfg across Language * Format * Update docs and docstrings [ci skip] * Fix begin_training	2019-03-10 23:36:47 +01:00
Ines Montani	8dbf1e9037	Also fix #3387 on develop	2019-03-10 23:36:28 +01:00
Ines Montani	7ba3a5d95c	💫 Make serialization methods consistent (#3385 ) * Make serialization methods consistent exclude keyword argument instead of random named keyword arguments and deprecation handling * Update docs and add section on serialization fields	2019-03-10 19:16:45 +01:00
Ines Montani	9a8f169e5c	Update v2-1.md	2019-03-10 18:58:51 +01:00
Ines Montani	0426689db8	💫 Improve Doc.to_json and add Doc.is_nered (#3381 ) * Use default return instead of else * Add Doc.is_nered to indicate if entities have been set * Add properties in Doc.to_json if they were set, not if they're available This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.	2019-03-10 15:24:34 +01:00
Ines Montani	76764fcf59	💫 Improve converters and training data file formats (#3374 ) * Populate converter argument info automatically * Add conversion option for msgpack * Update docs * Allow reading training data from JSONL	2019-03-08 23:15:23 +01:00
Ines Montani	296446a1c8	Tidy up and improve docs and docstrings (#3370 ) ## Description * tidy up and adjust Cython code to code style * improve docstrings and make calling `help()` nicer * add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects * fix various typos and inconsistencies in docs ### Types of change enhancement, docs ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-08 11:42:26 +01:00
Ines Montani	fa7314b221	Clarify train_path and dev_path format (see #3366 ) [ci skip]	2019-03-07 12:23:27 +01:00
Ines Montani	e9babd9973	Update hyperparameters section (see #3352 )	2019-03-06 14:40:30 +01:00
Ines Montani	48a206a95f	Fix displaCy visualizations in docs (closes #3357 ) [ci skip]	2019-03-06 13:20:44 +01:00
Ines Montani	5eadf61327	Update pretraining docs on file format (closes #3354 )	2019-03-04 16:30:13 +00:00
Ines Montani	1d4ba7678f	Auto-format [ci skip]	2019-02-27 12:07:35 +01:00
Matthew Honnibal	f1d77eb140	💫 Improve handling of missing NER tags (closes #2603 ) (#3341 ) * Improve handling of missing NER tags GoldParse can accept missing NER tags, if entities is provided in BILUO format (rather than as spans). Missing tags can be provided as None values. Fix bug that occurred when first tag was a None value. Closes #2603. * Document specification of missing NER tags.	2019-02-27 12:06:32 +01:00
Ines Montani	c478a2ccb6	Update backwards incompat [ci skip]	2019-02-27 11:56:56 +01:00
Matthew Honnibal	4a3371acd5	Make doc[0].is_sent_start == True (closes #2869 ) (#3340 ) * Make doc[0] have sent_start True. Closes #2869 * Document that doc[0].is_sent_start defaults True.	2019-02-27 11:17:17 +01:00
Ines Montani	1b6238101a	Add table explaining training metrics [closes #2644 ]	2019-02-25 10:03:43 +01:00
Ines Montani	d0b3af9222	Fix remaining inaccuracies in API docs (closes #2329 )	2019-02-24 22:21:25 +01:00
Ines Montani	62b558ab72	💫 Support lexical attributes in retokenizer attrs (closes #2390 ) (#3325 ) * Fix formatting and whitespace * Add support for lexical attributes (closes #2390) * Document lexical attribute setting during retokenization * Assign variable outside of nested loop	2019-02-24 21:13:51 +01:00
Ines Montani	aa52305461	Improve pipeline model and meta example [ci skip]	2019-02-24 18:45:39 +01:00
Ines Montani	df19e2bff6	💫 Allow setting of custom attributes during retokenization (closes #3314 ) (#3324 ) ## Description This PR adds the ability to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attributes that have both a getter and a setter implemented. ```python Token.set_extension('is_musician', default=False) doc = nlp("I like David Bowie.") with doc.retokenize() as retokenizer: attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}} retokenizer.merge(doc[2:4], attrs=attrs) assert doc[2].text == "David Bowie" assert doc[2].lemma_ == "David Bowie" assert doc[2]._.is_musician ``` ### Types of change enhancement ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-24 18:38:47 +01:00
Ines Montani	403b9cd58b	Add docs on adding to existing tokenizer rules [ci skip]	2019-02-24 18:35:19 +01:00
Ines Montani	1ea1bc98e7	Document regex utilities [ci skip]	2019-02-24 18:34:10 +01:00
Ines Montani	46ec5cdccc	Update TextCategorizer docs	2019-02-24 13:11:57 +01:00
Ines Montani	c03cb1cc63	Improve built-in component API docs	2019-02-24 13:11:49 +01:00
Ines Montani	383e2e1f12	Update Python versions [ci skip]	2019-02-24 11:49:45 +01:00
Ines Montani	b624cb4b89	Update v2-1.md	2019-02-24 11:49:27 +01:00
Ines Montani	250e88ef55	Fix docs example (see #2728 )	2019-02-21 14:22:06 +01:00
Ines Montani	0fc908d7a5	Add note on merging speed in v2.1 (see #3300 ) [ci skip]	2019-02-21 12:34:18 +01:00
Ines Montani	236aa94ded	Update v2-1.md	2019-02-21 12:33:56 +01:00
Sofie	9a478b6db8	Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293 ) * splitting up latin unicode interval * removing hyphen as infix for French * adding failing test for issue 1235 * test for issue #3002 which now works * partial fix for issue #2070 * keep the hyphen as infix for French (as it was) * restore french expressions with hyphen as infix (as it was) * added succeeding unit test for Issue #2656 * Fix issue #2822 with custom Italian exception * Fix issue #2926 by allowing numbers right before infix / * remove duplicate * remove xfail for Issue #2179 fixed by Matt * adjust documentation and remove reference to regex lib	2019-02-20 22:10:13 +01:00
Ines Montani	57ae71ea95	Add docs on serializing the pipeline (see #3289 ) [ci skip]	2019-02-18 14:13:29 +01:00
Ines Montani	38e4422c0d	Improve matcher example (resolves #3287 )	2019-02-18 13:26:37 +01:00
Ines Montani	660cfe44c5	Fix formatting	2019-02-18 13:26:22 +01:00
Ines Montani	212ff359ef	Fix links [ci skip]	2019-02-17 22:25:50 +01:00
Ines Montani	04b4df0ec9	Remove n_threads	2019-02-17 22:25:42 +01:00
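
Commit 6c783f8045 in the log above says that `ngram_size=2` means unigram and bigram features are extracted for the "bow" TextCategorizer architecture. As a rough illustration of that idea only, here is a minimal pure-Python sketch. It is not spaCy's implementation: the function name `ngram_bow` and its `lower` flag are hypothetical, with `lower` loosely standing in for choosing LOWER rather than ORTH as the attribute.

```python
from collections import Counter


def ngram_bow(tokens, ngram_size=1, lower=False):
    """Count every n-gram of length 1..ngram_size in a list of token strings.

    Hypothetical helper for illustration; it is not part of spaCy.
    """
    if lower:
        # Loosely mimics using the LOWER attribute instead of ORTH.
        tokens = [t.lower() for t in tokens]
    counts = Counter()
    for n in range(1, ngram_size + 1):
        # Slide a window of width n over the tokens.
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


features = ngram_bow(["The", "cat", "saw", "the", "cat"], ngram_size=2, lower=True)
print(features[("the",)])        # count of the unigram "the"
print(features[("the", "cat")])  # count of the bigram "the cat"
```

With `ngram_size=1` the same function yields plain unigram counts, so one code path covers every n-gram size.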


591 Commits
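
Commit 8e2cef49f3 in the log above describes `--save-every`: intermediate saves during an epoch get `.temp` in the file name, and each epoch ends with a regular save. The sketch below only models which model files such a run might leave behind. It is not the `spacy pretrain` code; the function `saved_files` is hypothetical, and it assumes that the batch counter restarts each epoch and that each intermediate save overwrites that epoch's single `.temp` file.

```python
def saved_files(n_epochs, batches_per_epoch, save_every):
    """Return the sorted model file names a run would leave behind.

    Hypothetical model of the behaviour described in the commit message.
    Assumption: batches are counted per epoch, and every intermediate save
    reuses the same `.temp` name for that epoch.
    """
    files = set()
    for epoch in range(n_epochs):
        for batch in range(1, batches_per_epoch + 1):
            if batch % save_every == 0:
                # Intermediate checkpoint written during the epoch.
                files.add(f"model{epoch}.temp.bin")
        # Regular checkpoint written at the end of the epoch.
        files.add(f"model{epoch}.bin")
    return sorted(files)


# Mirrors the commit's test run: 16 one-sentence batches, 10 iterations.
files = saved_files(n_epochs=10, batches_per_epoch=16, save_every=2)
print(len(files))
print(files[:2])
```

Under these assumptions the result matches the model files in the commit's `ls here/` listing, which additionally contains `config.json` and `log.jsonl`.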