spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-05 14:59:59 +03:00

Author	SHA1	Message	Date
svlandeg	9abbd0899f	separate entity encoder to get 64D descriptions	2019-06-05 00:09:46 +02:00
svlandeg	fb37cdb2d3	implementing el pipe in pipes.pyx (not tested yet)	2019-06-03 21:32:54 +02:00
svlandeg	d83a1e3052	Merge branch 'master' into feature/nel-wiki	2019-06-03 09:35:10 +02:00
Germán	86eb817b74	Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803) * (#3803) Spanish like_num returning false for number-like token * (#3803) Spanish like_num now returning True for number-like token	2019-06-02 12:22:57 +02:00
Ines Montani	09e78b52cf	Improve E024 text for incorrect GoldParse (closes #3558)	2019-06-01 14:37:27 +02:00
Ramanan Balakrishnan	26c37c5a4d	fix all references to BILUO annotation format (#3797)	2019-05-31 12:19:19 +02:00
Ines Montani	a7fd42d937	Make jsonschema dependency optional (#3784)	2019-05-30 14:34:58 +02:00
Ujwal Narayan	ed7be3f64c	Update norm_exceptions.py (#3778) * Update norm_exceptions.py Extended the Currency set to include Franc, Indian Rupee, Bangladeshi Taka, Korean Won, Mexican Dollar, and Egyptian Pound * Fix formatting [ci skip]	2019-05-27 11:52:52 +02:00
estr4ng7d	604acb6ace	Marathi Language Support (#3767) * Adding Marathi language details and folder to it * Adding few changes and running tests * Adding few changes and running tests * Update __init__.py mh -> mr * Rename spacy/lang/mh/__init__.py to spacy/lang/mr/__init__.py * mh -> mr	2019-05-24 14:29:42 +02:00
Ines Montani	7634812172	Document Language.evaluate	2019-05-24 14:06:36 +02:00
Ines Montani	45e6855550	Update Language.update docs	2019-05-24 14:06:26 +02:00
Ines Montani	b78a8dc1d2	Update Scorer and add API docs	2019-05-24 14:06:04 +02:00
Ujwal Narayan	4d550a3055	Enhancing Kannada language Resources (#3755) * Updated stop_words.py Added more stopwords * Create ujwal-narayan.md Enhancing Kannada language resources	2019-05-20 12:56:10 +02:00
svlandeg	dd691d0053	debugging	2019-05-17 17:44:11 +02:00
BreakBB	ed18a6efbd	Add check for callable to 'Language.replace_pipe' to fix #3737 (#3741)	2019-05-14 16:59:31 +02:00
Ines Montani	8baff1c7c0	💫 Improve introspection of custom extension attributes (#3729) * Add custom __dir__ to Underscore (see #3707) * Make sure custom extension methods keep their docstrings (see #3707) * Improve tests * Prepend note on partial to docstring (see #3707) * Remove print statement * Handle cases where docstring is None	2019-05-12 00:53:11 +02:00
Matthew Honnibal	3aceeeaaeb	Set version to v2.1.4	2019-05-11 22:57:53 +02:00
Ines Montani	aea1c93a05	Replace cytoolz.partition_all with util.minibatch	2019-05-11 21:12:09 +02:00
Ines Montani	0bf6441863	Fix .iob converter (closes #3620)	2019-05-11 19:15:26 +02:00
Matthew Honnibal	a5159ddcf5	Set version to v2.1.4.dev1	2019-05-11 19:03:51 +02:00
Ines Montani	6b3a79ac96	Call rmtree and copytree with strings (closes #3713)	2019-05-11 15:48:35 +02:00
devforfu	21af12eb53	Make "text" key in JSONL format optional when "tokens" key is provided (#3721) * Fix issue with forcing text key when it is not required * Extending the docs to reflect the new behavior	2019-05-11 15:41:29 +02:00
Luca Dorigo	82d034f976	Update glossary.py to match information found in documentation (#3704) (closes #3679) * Update glossary.py to match information found in documentation I used regexes to add any dependency tag that was in the documentation but not in the glossary. Solves #3679 👍 * Adds forgotten colon	2019-05-10 14:23:20 +02:00
Wannaphong Phatthiyaphaibun	5a14a13f64	fix thai bug (#3693) fix tokenize for pythainlp	2019-05-10 14:21:34 +02:00
Ines Montani	505c9e0e19	Add util.filter_spans helper (#3686)	2019-05-08 02:33:40 +02:00
F0rge1cE	dd1e6b0bc6	Fix offset bug in loading pre-trained word2vec. (#3689) * Fix offset bug in loading pre-trained word2vec. * add contributor agreement	2019-05-06 23:00:38 +02:00
Ines Montani	78cb807a9a	Auto-format [ci skip]	2019-05-06 16:58:29 +02:00
Brad Jascob	955b95cb8b	Fix inconsistant lemmatizer issue #3484 (#3646) * Fix inconsistant lemmatizer issue #3484 * Remove test case	2019-05-04 18:16:03 +02:00
svlandeg	1ae41daaa9	allow small rounding errors	2019-05-01 23:05:40 +02:00
Dobita21	f95ecedd83	Add Thai lex_attrs (#3655) * test sPacy commit to git fri 04052019 10:54 * change Data format from my format to master format * ทัทั้งนี้ ---> ทั้งนี้ * delete stop_word translate from Eng * Adjust formatting and readability * add Thai norm_exception * Add Dobita21 SCA * editรึ : หรือ, * Update Dobita21.md * Auto-format * Integrate norms into language defaults * add acronym and some norm exception words * add lex_attrs * Add lexical attribute getters into the language defaults * fix LEX_ATTRS Co-authored-by: Donut <dobita21@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-05-01 12:03:14 +02:00
BreakBB	8952004dfc	Update French example sents and add two German stop words (#3662) * Update french example sentences * Add 'anderem' and 'ihren' to German stop words	2019-05-01 12:01:35 +02:00
svlandeg	60b54ae8ce	bulk entity writing and experiment with regex wikidata reader to speed up processing	2019-05-01 00:00:38 +02:00
svlandeg	19e8f339cb	deduce entity freq from WP corpus and serialize vocab in WP test	2019-04-29 17:37:29 +02:00
svlandeg	387263d618	simplify chains	2019-04-29 13:58:07 +02:00
svlandeg	54d0cea062	unit test for KB serialization	2019-04-24 23:52:34 +02:00
svlandeg	3e0cb69065	KB aliases to and from file	2019-04-24 20:24:24 +02:00
svlandeg	ad6c5e581c	writing and reading number of entries to/from header	2019-04-24 15:31:44 +02:00
svlandeg	6e3223f234	bulk loading in proper order of entity indices	2019-04-24 11:26:38 +02:00
svlandeg	694fea597a	dumping all entryC entries + (inefficient) reading back in	2019-04-23 18:36:50 +02:00
svlandeg	8e70a564f1	custom reader and writer for _EntryC fields (first stab at it - not complete)	2019-04-23 16:33:40 +02:00
Dobita21	721e1fc86c	update norm_exceptions (#3627) * test sPacy commit to git fri 04052019 10:54 * change Data format from my format to master format * ทัทั้งนี้ ---> ทั้งนี้ * delete stop_word translate from Eng * Adjust formatting and readability * add Thai norm_exception * Add Dobita21 SCA * editรึ : หรือ, * Update Dobita21.md * Auto-format * Integrate norms into language defaults * add acronym and some norm exception words	2019-04-23 12:48:03 +02:00
Ines Montani	e0f487f904	Rename early_stopping_iter to n_early_stopping	2019-04-22 14:31:25 +02:00
Ines Montani	9767427669	Auto-format	2019-04-22 14:31:11 +02:00
Ines Montani	7917ce2f73	Make flag shortcut consistent and document	2019-04-22 14:23:44 +02:00
Ines Montani	52658c80d5	Allow jupyter=False to override Jupyter mode (closes #3598)	2019-04-22 14:18:32 +02:00
Motoki Wu	8e2cef49f3	Add save after `--save-every` batches for `spacy pretrain` (#3510) When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches. ## Description To test... Save this file to `sample_sents.jsonl` ``` {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} ``` Then run `--save-every 2` when pretraining. ```bash spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2 ``` And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name. At the end of the training, you should see these files (`ls here/`): ```bash config.json model2.bin model5.bin model8.bin log.jsonl model2.temp.bin model5.temp.bin model8.temp.bin model0.bin model3.bin model6.bin model9.bin model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin model1.bin model4.bin model7.bin model1.temp.bin model4.temp.bin model7.temp.bin ``` ### Types of change This is a new feature to `spacy pretrain`. 🌵 Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error). ``` Processing matcher.pyx [Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx' Traceback (most recent call last): File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module> run(args.root) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run process(base, filename, db) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp") File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd func(args) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx raise Exception("Cython failed") Exception: Cython failed Traceback (most recent call last): File "setup.py", line 276, in <module> setup_package() File "setup.py", line 209, in setup_package generate_cython(root, "spacy") File "setup.py", line 132, in generate_cython raise RuntimeError("Running cythonize failed") RuntimeError: Running cythonize failed ``` Edit: Fixed! after deleting all `.cpp` files: `find spacy -name "*.cpp" | xargs rm` ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-04-22 14:10:16 +02:00
Dobita21	189c90743c	Add Thai norm_exceptions (#3612) * test sPacy commit to git fri 04052019 10:54 * change Data format from my format to master format * ทัทั้งนี้ ---> ทั้งนี้ * delete stop_word translate from Eng * Adjust formatting and readability * add Thai norm_exception * Add Dobita21 SCA * editรึ : หรือ, * Update Dobita21.md * Auto-format * Integrate norms into language defaults	2019-04-20 12:16:03 +02:00
svlandeg	10ee8dfea2	poc with few entities and collecting aliases from the WP links	2019-04-18 14:12:17 +02:00
Matthew Honnibal	83511972d3	Set version to v2.1.4.dev0	2019-04-16 14:17:26 +02:00
Matthew Honnibal	8b5ae0733e	Merge branch 'master' of https://github.com/explosion/spaCy	2019-04-16 12:29:46 +02:00

5964 Commits