spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-01-10 17:26:42 +03:00

Author	SHA1	Message	Date
Knut O. Hellan	a54f0cfc2b	Norwegian tweaks (#3894) * Norwegian fix Add support for alternative past tense verb form (vaska). * Norwegian months Add all Norwegian months to tokenizer exceptions. * More Norwegian abbreviations Add more Norwegian abbreviations to tokenizer_exceptions. * Contributor agreement khellan Add signed contributor agreement for khellan (Knut O. Hellan).	2019-07-08 10:28:47 +02:00
Rokas Ramanauskas	61ce126d4c	Lithuanian language support (#3895) * initial LT lang support * Added more stopwords. Started setting up some basic test environment (not complete) * Initial morph rules for LT lang * Closes #1 Adds tokenizer exceptions for Lithuanian * Closes #5 Punctuation rules. Closes #6 Lexical Attributes * test: add native examples to basic tests * feat: add tag map for lt lang * fix: remove undefined tag attribute 'Definite' * feat: add lemmatizer for lt lang * refactor: add new instances to lt lang morph rules; use tags from tag map * refactor: add morph rules to lt lang defaults * refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup * refactor: add capitalized words to lt lang lemmatizer * refactor: add more num words to lt lang lex attrs * refactor: update lt lang stop word set * refactor: add new instances to lt lang tokenizer exceptions * refactor: remove comments from lt lang init file * refactor: use function instead of lambda in lt lex lang getter * refactor: remove conversion to dict in lt init when dict is already provided * chore: rename lt 'test_basic' to 'test_text' * feat: add more lt text tests * feat: add lemmatizer tests * refactor: remove unused imports, add newline to end of file * chore: add contributor agreement * chore: change 'en' to 'lt' in lt example description * fix: add missing encoding info * style: add newline to end of file * refactor: use python2 compatible syntax * style: reformat code using black	2019-07-08 10:25:22 +02:00
svlandeg	0ea52c86b8	remove redundancy	2019-07-03 15:02:10 +02:00
svlandeg	668b17ea4a	deuglify kb deserializer	2019-07-03 15:00:42 +02:00
svlandeg	8840d4b1b3	fix for context encoder optimizer	2019-07-03 13:35:36 +02:00
svlandeg	2d2dea9924	experiment with adding NER types to the feature vector	2019-06-29 14:52:36 +02:00
svlandeg	c664f58246	adding prior probability as feature in the model	2019-06-28 16:22:58 +02:00
svlandeg	1c80b85241	fix tests	2019-06-28 08:59:23 +02:00
svlandeg	68a0662019	context encoder with Tok2Vec + linking model instead of cosine	2019-06-28 08:29:31 +02:00
Ines Montani	4f1dae1c6b	Update languages and examples (see #1107)	2019-06-26 16:19:17 +02:00
svlandeg	dbc53b9870	rename to KBEntryC	2019-06-26 15:55:26 +02:00
Ines Montani	37f744ca00	Auto-format [ci skip]	2019-06-26 14:48:09 +02:00
Ines Montani	6ccdf37574	Exclude user_data when copying doc in displaCy (closes #3882)	2019-06-26 14:37:05 +02:00
svlandeg	1de61f68d6	improve speed of prediction loop	2019-06-26 13:53:10 +02:00
svlandeg	bee23cd8af	try Tok2Vec instead of SpacyVectors	2019-06-25 16:09:22 +02:00
svlandeg	8608685543	ensure Span.as_doc keeps the entity links + unit test	2019-06-25 15:28:51 +02:00
svlandeg	58a5b40ef6	clean up duplicate code	2019-06-24 15:19:58 +02:00
svlandeg	ddc73b11a9	fix unicode literals	2019-06-24 12:58:18 +02:00
svlandeg	f4af47ce4a	Merge branch 'feature/nel-wiki' of https://github.com/svlandeg/spaCy into feature/nel-wiki	2019-06-24 10:57:07 +02:00
svlandeg	b58bace84b	small fixes	2019-06-24 10:55:04 +02:00
Ines Montani	c833d9b314	Add "v.s." to English tokenizer exceptions (see #3868)	2019-06-20 17:48:45 +02:00
Ines Montani	ae2c208735	Auto-format [ci skip]	2019-06-20 10:36:38 +02:00
Ines Montani	872121955c	Update error code	2019-06-20 10:35:51 +02:00
Ines Montani	e1be80e3ec	Merge branch 'master' into pr/3864	2019-06-20 10:35:37 +02:00
Björn Böing	ebf5a04d6c	Update pretrain docs and add unsupported loss_func error (#3860) * Add error to `get_vectors_loss` for unsupported loss function of `pretrain` * Add missing "--loss-func" argument to pretrain docs. Update pretrain plac annotations to match docs. * Add missing quotation marks	2019-06-20 10:30:44 +02:00
svlandeg	b76a43bee4	unicode strings	2019-06-19 13:26:33 +02:00
svlandeg	0b0959b363	UTF8 encoding	2019-06-19 13:11:39 +02:00
svlandeg	cc9ae28a52	custom error and warning messages	2019-06-19 12:35:26 +02:00
svlandeg	791327e3c5	Merge remote-tracking branch 'upstream/master' into feature/nel-wiki	2019-06-19 09:44:05 +02:00
svlandeg	a31648d28b	further code cleanup	2019-06-19 09:15:43 +02:00
svlandeg	478305cd3f	small tweaks and documentation	2019-06-18 18:38:09 +02:00
svlandeg	0d177c1146	clean up code, remove old code, move to bin	2019-06-18 13:20:40 +02:00
svlandeg	ffae7d3555	sentence encoder only (removing article/mention encoder)	2019-06-18 00:05:47 +02:00
Kabir Khan	1e19f34e29	Add optional `id` property to EntityRuler patterns (#3591) * Adding support for entity_id in EntityRuler pipeline component * Adding spaCy contributor agreement * Updating EntityRuler to use string.format instead of f strings * Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity. * Fixing tests * Remove custom extension entity_id and use built in ent_id token attribute. * Changing entity_id to ent_id for consistent naming * entity_ids => ent_ids * Removing kb, cleaning up tests, making util functions private, use rsplit instead of split	2019-06-16 13:29:04 +02:00
Suraj Rajan	46c78d0a41	Dependency tree pattern matcher (#3465) * Functional dependency tree pattern matcher * Tests fail due to inconsistent behaviour * Renamed dependencymatcher and added optimizations	2019-06-16 13:25:32 +02:00
BreakBB	d8573ee715	Update error raising for CLI pretrain to fix #3840 (#3843) * Add check for empty input file to CLI pretrain * Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key * Skip empty values for correct pretrain keys and log a counter as warning * Add tests for CLI pretrain core function make_docs. * Add a short hint for the `tokens` key to the CLI pretrain docs * Add success message to CLI pretrain * Update model loading to fix the tests * Skip empty values and do not create docs out of it	2019-06-16 13:22:57 +02:00
svlandeg	b312f2d0e7	redo training data to be independent of KB and entity-level instead of doc-level	2019-06-14 15:55:26 +02:00
Azagh3l	5accfbb938	Update exemples.py (#3838) Added missing hyphen and accent.	2019-06-14 09:31:05 +02:00
svlandeg	78dd3e11da	write entity linking pipe to file and keep vocab consistent between kb and nlp	2019-06-13 16:25:39 +02:00
svlandeg	b12001f368	small fixes	2019-06-12 22:05:53 +02:00
Ines Montani	f35ce09776	Add regression test for #3839	2019-06-12 13:38:30 +02:00
Ines Montani	aae9034492	Tidy up [ci skip]	2019-06-12 13:38:23 +02:00
svlandeg	6521cfa132	speeding up training	2019-06-12 13:37:05 +02:00
Motoki Wu	9c064e6ad9	Add resume logic to spacy pretrain (#3652) * Added ability to resume training * Add to README * Remove duplicate entry	2019-06-12 13:29:23 +02:00
svlandeg	fe1ed432ef	eval on dev set, varying combos of prior and context scores	2019-06-11 11:40:58 +02:00
Azagh3l	eb3e4263ee	Update lex_attrs.py (#3835) Corrected typos, added French (from France) versions of some numbers.	2019-06-11 10:59:16 +02:00
svlandeg	83dc7b46fd	first tests with EL pipe	2019-06-10 21:25:26 +02:00
Matthew Honnibal	7f71cf0b02	Merge branch 'master' of https://github.com/explosion/spaCy	2019-06-07 20:41:00 +02:00
Matthew Honnibal	a931d72459	Add merge_subtokens as parser post-process. Re #3830	2019-06-07 20:40:41 +02:00
svlandeg	7de1ee69b8	training loop in proper pipe format	2019-06-07 15:55:10 +02:00
svlandeg	0486ccabfd	introduce goldparse.links	2019-06-07 13:54:45 +02:00
svlandeg	a5c061f506	storing NEL training data in GoldParse objects	2019-06-07 12:58:42 +02:00
svlandeg	61f0e2af65	code cleanup	2019-06-06 20:22:14 +02:00
svlandeg	d8b435ceff	pretraining description vectors and storing them in the KB	2019-06-06 19:51:27 +02:00
svlandeg	5c723c32c3	entity vectors in the KB + serialization of them	2019-06-05 18:29:18 +02:00
svlandeg	9abbd0899f	separate entity encoder to get 64D descriptions	2019-06-05 00:09:46 +02:00
svlandeg	fb37cdb2d3	implementing el pipe in pipes.pyx (not tested yet)	2019-06-03 21:32:54 +02:00
intrafind	2bba2a3536	Fix for #3811 (#3815) Corrected type of seed parameter.	2019-06-03 18:32:47 +02:00
svlandeg	d83a1e3052	Merge branch 'master' into feature/nel-wiki	2019-06-03 09:35:10 +02:00
Germán	86eb817b74	Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803) * (#3803) Spanish like_num returning false for number-like token * (#3803) Spanish like_num now returning True for number-like token	2019-06-02 12:22:57 +02:00
Ines Montani	09e78b52cf	Improve E024 text for incorrect GoldParse (closes #3558)	2019-06-01 14:37:27 +02:00
Ramanan Balakrishnan	26c37c5a4d	fix all references to BILUO annotation format (#3797)	2019-05-31 12:19:19 +02:00
Ines Montani	a7fd42d937	Make jsonschema dependency optional (#3784)	2019-05-30 14:34:58 +02:00
Ujwal Narayan	ed7be3f64c	Update norm_exceptions.py (#3778) * Update norm_exceptions.py Extended the Currency set to include Franc, Indian Rupee, Bangladeshi Taka, Korean Won, Mexican Dollar, and Egyptian Pound * Fix formatting [ci skip]	2019-05-27 11:52:52 +02:00
estr4ng7d	604acb6ace	Marathi Language Support (#3767) * Adding Marathi language details and folder to it * Adding few changes and running tests * Adding few changes and running tests * Update __init__.py mh -> mr * Rename spacy/lang/mh/__init__.py to spacy/lang/mr/__init__.py * mh -> mr	2019-05-24 14:29:42 +02:00
Ines Montani	7634812172	Document Language.evaluate	2019-05-24 14:06:36 +02:00
Ines Montani	45e6855550	Update Language.update docs	2019-05-24 14:06:26 +02:00
Ines Montani	b78a8dc1d2	Update Scorer and add API docs	2019-05-24 14:06:04 +02:00
Ujwal Narayan	4d550a3055	Enhancing Kannada language Resources (#3755) * Updated stop_words.py Added more stopwords * Create ujwal-narayan.md Enhancing Kannada language resources	2019-05-20 12:56:10 +02:00
svlandeg	dd691d0053	debugging	2019-05-17 17:44:11 +02:00
BreakBB	ed18a6efbd	Add check for callable to 'Language.replace_pipe' to fix #3737 (#3741)	2019-05-14 16:59:31 +02:00
Ines Montani	8baff1c7c0	💫 Improve introspection of custom extension attributes (#3729) * Add custom __dir__ to Underscore (see #3707) * Make sure custom extension methods keep their docstrings (see #3707) * Improve tests * Prepend note on partial to docstring (see #3707) * Remove print statement * Handle cases where docstring is None	2019-05-12 00:53:11 +02:00
Matthew Honnibal	3aceeeaaeb	Set version to v2.1.4	2019-05-11 22:57:53 +02:00
Ines Montani	aea1c93a05	Replace cytoolz.partition_all with util.minibatch	2019-05-11 21:12:09 +02:00
Ines Montani	0bf6441863	Fix .iob converter (closes #3620)	2019-05-11 19:15:26 +02:00
Matthew Honnibal	a5159ddcf5	Set version to v2.1.4.dev1	2019-05-11 19:03:51 +02:00
Ines Montani	6b3a79ac96	Call rmtree and copytree with strings (closes #3713)	2019-05-11 15:48:35 +02:00
devforfu	21af12eb53	Make "text" key in JSONL format optional when "tokens" key is provided (#3721) * Fix issue with forcing text key when it is not required * Extending the docs to reflect the new behavior	2019-05-11 15:41:29 +02:00
Luca Dorigo	82d034f976	Update glossary.py to match information found in documentation (#3704) (closes #3679) * Update glossary.py to match information found in documentation I used regexes to add any dependency tag that was in the documentation but not in the glossary. Solves #3679 👍 * Adds forgotten colon	2019-05-10 14:23:20 +02:00
Wannaphong Phatthiyaphaibun	5a14a13f64	fix thai bug (#3693) fix tokenize for pythainlp	2019-05-10 14:21:34 +02:00
Ines Montani	505c9e0e19	Add util.filter_spans helper (#3686)	2019-05-08 02:33:40 +02:00
F0rge1cE	dd1e6b0bc6	Fix offset bug in loading pre-trained word2vec. (#3689) * Fix offset bug in loading pre-trained word2vec. * add contributor agreement	2019-05-06 23:00:38 +02:00
Ines Montani	78cb807a9a	Auto-format [ci skip]	2019-05-06 16:58:29 +02:00
Brad Jascob	955b95cb8b	Fix inconsistent lemmatizer issue #3484 (#3646) * Fix inconsistent lemmatizer issue #3484 * Remove test case	2019-05-04 18:16:03 +02:00
svlandeg	1ae41daaa9	allow small rounding errors	2019-05-01 23:05:40 +02:00
Dobita21	f95ecedd83	Add Thai lex_attrs (#3655) * test spaCy commit to git fri 04052019 10:54 * change Data format from my format to master format * ทัทั้งนี้ ---> ทั้งนี้ * delete stop_word translate from Eng * Adjust formatting and readability * add Thai norm_exception * Add Dobita21 SCA * editรึ : หรือ, * Update Dobita21.md * Auto-format * Integrate norms into language defaults * add acronym and some norm exception words * add lex_attrs * Add lexical attribute getters into the language defaults * fix LEX_ATTRS Co-authored-by: Donut <dobita21@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-05-01 12:03:14 +02:00
BreakBB	8952004dfc	Update French example sents and add two German stop words (#3662) * Update French example sentences * Add 'anderem' and 'ihren' to German stop words	2019-05-01 12:01:35 +02:00
svlandeg	60b54ae8ce	bulk entity writing and experiment with regex wikidata reader to speed up processing	2019-05-01 00:00:38 +02:00
svlandeg	19e8f339cb	deduce entity freq from WP corpus and serialize vocab in WP test	2019-04-29 17:37:29 +02:00
svlandeg	387263d618	simplify chains	2019-04-29 13:58:07 +02:00
svlandeg	54d0cea062	unit test for KB serialization	2019-04-24 23:52:34 +02:00
svlandeg	3e0cb69065	KB aliases to and from file	2019-04-24 20:24:24 +02:00
svlandeg	ad6c5e581c	writing and reading number of entries to/from header	2019-04-24 15:31:44 +02:00
svlandeg	6e3223f234	bulk loading in proper order of entity indices	2019-04-24 11:26:38 +02:00
svlandeg	694fea597a	dumping all entryC entries + (inefficient) reading back in	2019-04-23 18:36:50 +02:00
svlandeg	8e70a564f1	custom reader and writer for _EntryC fields (first stab at it - not complete)	2019-04-23 16:33:40 +02:00
Dobita21	721e1fc86c	update norm_exceptions (#3627) * test spaCy commit to git fri 04052019 10:54 * change Data format from my format to master format * ทัทั้งนี้ ---> ทั้งนี้ * delete stop_word translate from Eng * Adjust formatting and readability * add Thai norm_exception * Add Dobita21 SCA * editรึ : หรือ, * Update Dobita21.md * Auto-format * Integrate norms into language defaults * add acronym and some norm exception words	2019-04-23 12:48:03 +02:00
Ines Montani	e0f487f904	Rename early_stopping_iter to n_early_stopping	2019-04-22 14:31:25 +02:00
Ines Montani	9767427669	Auto-format	2019-04-22 14:31:11 +02:00
Ines Montani	7917ce2f73	Make flag shortcut consistent and document	2019-04-22 14:23:44 +02:00

6070 Commits