spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-08 05:15:04 +03:00

Author	SHA1	Message	Date
Ines Montani	05a2df6616	Remove not implemented file validation [ci skip]	2019-09-12 15:26:02 +02:00
Ines Montani	72274e83f2	Ensure accordion label is left-aligned [ci skip]	2019-09-12 15:24:17 +02:00
Ines Montani	c0a4cab178	Update "Adding languages" docs [ci skip]	2019-09-12 14:53:06 +02:00
Ines Montani	10257f3131	Document Lookups [ci skip]	2019-09-12 14:00:14 +02:00
Ines Montani	32404e613c	Create directory if it doesn't exist	2019-09-12 14:00:01 +02:00
Ines Montani	aa4ff0baa1	Auto-format [ci skip]	2019-09-12 13:05:53 +02:00
Ines Montani	625ce2db8e	Update Language docs [ci skip]	2019-09-12 13:03:38 +02:00
Ines Montani	cb41a33d14	Update displaCy API docs [ci skip]	2019-09-12 12:59:20 +02:00
Ines Montani	e7c20ad1d2	Update colors entry points docs [ci skip]	2019-09-12 12:59:10 +02:00
Ines Montani	7b59a919e6	Update entry points docs [ci skip]	2019-09-12 12:52:06 +02:00
Ines Montani	655b434553	Merge branch 'master' into develop	2019-09-12 11:39:18 +02:00
Sofie Van Landeghem	0b4b4f1819	Documentation for Entity Linking (#4065) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bools instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustments to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts	2019-09-12 11:38:34 +02:00
Ines Montani	4d4b3b0783	Add "labels" to Language.meta	2019-09-12 11:34:25 +02:00
Ines Montani	ac0e27a825	💫 Add Language.pipe_labels (#4276) * Add Language.pipe_labels * Update spacy/language.py Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>	2019-09-12 10:56:28 +02:00
tamuhey	71909cdf22	Fix iss4278 (#4279) * fix: len(tuple) == 2 * (#4278) add fail test * add contributor's agreement	2019-09-12 10:44:49 +02:00
Ines Montani	8ebc3711dc	Fix bug in Parser.labels and add test (#4275)	2019-09-11 18:29:35 +02:00
Matthew Honnibal	7fbb559045	Set version to v2.2.0.dev6	2019-09-11 18:07:20 +02:00
Matthew Honnibal	f7a096b462	Update morphology	2019-09-11 18:06:43 +02:00
Matthew Honnibal	f8ce9dde0f	Set version to v2.2.0.dev5	2019-09-11 17:41:21 +02:00
Adriane Boyd	b097b0b83d	Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher	2019-09-11 15:23:03 +02:00
Matthew Honnibal	c47c0269b1	Update morphology features	2019-09-11 15:16:53 +02:00
Ines Montani	af25323653	Tidy up and auto-format	2019-09-11 14:00:36 +02:00
Matthew Honnibal	af93997993	Fix conllu converter	2019-09-11 13:28:07 +02:00
Matthew Honnibal	178d010b25	Set version to 2.2.0.dev4	2019-09-11 12:28:37 +02:00
Ines Montani	e82a8d0d7a	Merge branch 'master' into develop	2019-09-11 11:52:38 +02:00
Ines Montani	8f9f48b04c	Add GreekLemmatizer.lookup (resolves #4272)	2019-09-11 11:44:40 +02:00
Ines Montani	6279d74c65	Tidy up and auto-format	2019-09-11 11:38:22 +02:00
Adriane Boyd	104cb93d8b	Remove reinitialized PreshMaps on cache flush	2019-09-10 23:15:14 +02:00
Adriane Boyd	cf7047bbdf	Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher	2019-09-10 22:30:41 +02:00
Matthew Honnibal	7b858ba606	Update from master	2019-09-10 20:14:08 +02:00
Matthew Honnibal	c181a94e75	Require thinc 7.1.1	2019-09-10 20:12:24 +02:00
Ines Montani	669a7d37ce	Exclude vocab when testing to_bytes	2019-09-10 19:45:16 +02:00
Matthew Honnibal	28741ff5db	Require preshed v3.0.0	2019-09-10 19:13:07 +02:00
adrianeboyd	e367864e59	Update Ukrainian create_lemmatizer kwargs (#4266) Allow Ukrainian create_lemmatizer to accept lookups kwarg.	2019-09-10 11:14:46 +02:00
Adriane Boyd	d277b6bc68	Improve cache flushing in tokenizer * Separate cache and specials memory (temporarily) * Flush cache when adding special cases * Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()` are necessary due to this bug: https://github.com/explosion/preshed/issues/21	2019-09-10 09:55:28 +02:00
Adriane Boyd	ae52c5eb52	Fix offset and whitespace in Matcher special cases * Fix offset bugs when merging and splitting tokens * Set final whitespace on final token in inserted special case	2019-09-10 09:48:34 +02:00
Adriane Boyd	11ba042aca	Update error code number	2019-09-10 09:09:46 +02:00
Adriane Boyd	cfc318b76c	Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher	2019-09-10 09:07:44 +02:00
adrianeboyd	c32126359a	Allow period as suffix following punctuation (#4248) Addresses rare cases (such as `_MATH_.`, see #1061) where the final period was not recognized as a suffix following punctuation.	2019-09-09 19:19:22 +02:00
Ines Montani	3e8f136ba7	💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Fix serialization for lookups * Fix lookups * Fix lookups * Fix lookups * Try to fix serialization * Try to fix serialization * Try to fix serialization * Try to fix serialization * Give up on serialization test * Xfail more serialization tests for 3.5 * Fix lookups for 2.7	2019-09-09 19:17:55 +02:00
Sofie Van Landeghem	482c7cd1b9	pulling tqdm imports in functions to avoid bug (tmp fix) (#4263)	2019-09-09 16:32:11 +02:00
Mihai Gliga	25aecd504f	adding Romanian tag_map (#4257) * adding Romanian tag_map * added SCA file * forgotten import	2019-09-09 11:53:09 +02:00
Adriane Boyd	5eeaffe14f	Reload special cases when necessary Reload special cases when affixes or token_match are modified. Skip reloading during initialization.	2019-09-08 22:40:08 +02:00
Adriane Boyd	64f86b7e97	Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher	2019-09-08 21:30:01 +02:00
Adriane Boyd	d1679819ab	Really remove accidentally added test	2019-09-08 20:58:22 +02:00
Matthew Honnibal	1653b818c5	Update Lithuanian tag map	2019-09-08 20:57:58 +02:00
adrianeboyd	3780e2ff50	Flush tokenizer cache when necessary (#4258) Flush tokenizer cache when affixes, token_match, or special cases are modified. Fixes #4238, same issue as in #1250.	2019-09-08 20:52:46 +02:00
Adriane Boyd	e4cba2f1ee	Remove accidentally added test case	2019-09-08 20:48:05 +02:00
Adriane Boyd	5861308910	Generalize handling of tokenizer special cases Handle tokenizer special cases more generally by using the Matcher internally to match special cases after the affix/token_match tokenization is complete. Instead of only matching special cases while processing balanced or nearly balanced prefixes and suffixes, this recognizes special cases in a wider range of contexts: * Allows arbitrary numbers of prefixes/affixes around special cases * Allows special cases separated by infixes Existing tests/settings that couldn't be preserved as before: * The emoticon '")' is no longer a supported special case * The emoticon ':)' in "example:)" is a false positive again When merged with #4258 (or the relevant cache bugfix), the affix and token_match properties should be modified to flush and reload all special cases to use the updated internal tokenization with the Matcher.	2019-09-08 20:35:16 +02:00
Matthew Honnibal	da8830d909	Set version to v2.2.0.dev3	2019-09-08 18:22:03 +02:00


10791 Commits