spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-23 12:36:46 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	333b1a308b	Adapt parser and NER for transformers (#5449) * Draft layer for BILUO actions * Fixes to biluo layer * WIP on BILUO layer * Add tests for BILUO layer * Format * Fix transitions * Update test * Link in the simple_ner * Update BILUO tagger * Update __init__ * Import simple_ner * Update test * Import * Add files * Add config * Fix label passing for BILUO and tagger * Fix label handling for simple_ner component * Update simple NER test * Update config * Hack train script * Update BILUO layer * Fix SimpleNER component * Update train_from_config * Add biluo_to_iob helper * Add IOB layer * Add IOBTagger model * Update biluo layer * Update SimpleNER tagger * Update BILUO * Read random seed in train-from-config * Update use of normal_init * Fix normalization of gradient in SimpleNER * Update IOBTagger * Remove print * Tweak masking in BILUO * Add dropout in SimpleNER * Update thinc * Tidy up simple_ner * Fix biluo model * Unhack train-from-config * Update setup.cfg and requirements * Add tb_framework.py for parser model * Try to avoid memory leak in BILUO * Move ParserModel into spacy.ml, avoid need for subclass. * Use updated parser model * Remove incorrect call to model.initialize in PrecomputableAffine * Update parser model * Avoid divide by zero in tagger * Add extra dropout layer in tagger * Refine minibatch_by_words function to avoid oom * Fix parser model after refactor * Try to avoid div-by-zero in SimpleNER * Fix infinite loop in minibatch_by_words * Use SequenceCategoricalCrossentropy in Tagger * Fix parser model when hidden layer * Remove extra dropout from tagger * Add extra nan check in tagger * Fix thinc version * Update tests and imports * Fix test * Update test * Update tests * Fix tests * Fix test Co-authored-by: Ines Montani <ines@ines.io>	2020-05-18 22:23:33 +02:00
Ines Montani	3100c97e69	Merge pull request #5441 from svlandeg/fix/updating	2020-05-18 10:53:41 +02:00
Ines Montani	e8ff4c1e6a	Pin flake8 version	2020-05-18 10:50:21 +02:00
Ines Montani	a41e28ceba	Merge pull request #5436 from ilivans/fix_errors_with_codes	2020-05-18 10:45:56 +02:00
Ilkyu Ju	72a25c9cef	Very minor issues in Korean example sentences (#5446) * Add contributor agreement * Improve ko translation of example sentences I fixed unnatural translations and word spacing errors. * Update osori.md	2020-05-17 13:43:34 +02:00
svlandeg	6fb6a8518c	bump to 3.0.0.dev7 and thinc to 8.0.0a8	2020-05-15 13:25:54 +02:00
svlandeg	047f3d7d94	remove ops argument for Adam	2020-05-15 13:25:00 +02:00
svlandeg	79d4f196e5	pin flake8 to 3.5.0	2020-05-15 11:53:01 +02:00
svlandeg	e0fda2bd81	throw warning when model_cfg is None	2020-05-15 11:02:10 +02:00
adrianeboyd	908dea3939	Skip duplicate lexeme rank setting (#5401) Skip duplicate lexeme rank setting within `_fix_pretrained_vectors_name()`.	2020-05-14 18:26:12 +02:00
adrianeboyd	f49e2810e6	Add Polish lemmatizer (#5413) * Add Polish lemmatizer Contributed by @ryszardtuora * Add missing import	2020-05-14 18:23:19 +02:00
adrianeboyd	e63880e081	Use Token.sent_start for Span.sent (#5439) Use `Token.sent_start` for sentence boundaries in `Span.sent` so that `Doc.sents` and `Span.sent` return the same sentence boundaries.	2020-05-14 18:22:51 +02:00
adrianeboyd	780b869345	Fix syntax iterators for Persian (#5437)	2020-05-14 16:51:03 +02:00
Ilia Ivanov	ee8fe37474	Add ilivans' contributor agreement	2020-05-14 15:59:06 +02:00
Ilia Ivanov	712d9d4820	fixup! Fix ErrorsWithCodes().__class__ return value	2020-05-14 15:45:58 +02:00
Ilia Ivanov	a987e9e45d	Fix ErrorsWithCodes().__class__ return value	2020-05-14 14:14:15 +02:00
Vishnu Priya VR	9ce059dd06	Limiting noun_chunks for specific languages (#5396) * Limiting noun_chunks for specific languages * Limiting noun_chunks for specific languages Contributor Agreement * Addressing review comments * Removed unused fixtures and imports * Add fa_tokenizer in test suite * Use fa_tokenizer in test * Undo extraneous reformatting Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>	2020-05-14 12:58:06 +02:00
Sofie Van Landeghem	b04738903e	prevent None in gold fields (#5425) * set gold fields to empty list instead of keeping them as None * add unit test	2020-05-13 22:08:50 +02:00
adrianeboyd	113e7981d0	Check that row is within bounds when adding vector (#5430) Check that row is within bounds for the vector data array when adding a vector. Don't add vectors with rank OOV_RANK in `init-model` (change is due to shift from OOV as 0 to OOV as OOV_RANK).	2020-05-13 22:08:28 +02:00
adrianeboyd	07639dd6ac	Remove TAG from da/sv tokenizer exceptions (#5428) Remove `TAG` value from Danish and Swedish tokenizer exceptions because it may not be included in a tag map (and these settings are problematic as tokenizer exceptions anyway).	2020-05-13 10:25:54 +02:00
adrianeboyd	24e7108f80	Modify array type to accommodate OOV_RANK (#5429) Modify indices array type in `Vocab.prune_vectors` to accommodate OOV_RANK index as max(uint64).	2020-05-13 10:25:05 +02:00
svlandeg	102c8c7e2f	fix fan_in renaming	2020-05-12 13:56:10 +02:00
svlandeg	9fe1e23512	update to thinc 8.0.0a6	2020-05-12 13:51:25 +02:00
Ines Montani	f333c2a011	Merge pull request #5386 from svlandeg/fix/nel-docs	2020-05-10 12:00:09 +02:00
adrianeboyd	440b81bddc	Improve exceptions for 'd (would/had) in English (#5379) Instead of treating `'d` in contractions like `I'd` as `would` in all cases in the tokenizer exceptions, leave the tagging and lemmatization up to later components.	2020-05-08 15:10:57 +02:00
Travis Hoppe	d4cc18b746	Added author information for NLPre (#5414) * Add author links for NLPre and update category * Add contributor statement	2020-05-08 11:28:54 +02:00
adrianeboyd	c963e269ba	Add method to update / reset pkuseg user dict (#5404)	2020-05-08 11:21:46 +02:00
adrianeboyd	4a15b559ba	Clarify Token.pos as UPOS (#5419)	2020-05-08 10:36:25 +02:00
adrianeboyd	a2345618f1	Fix Token API docs from #5375 (#5418)	2020-05-08 10:25:02 +02:00
Samuel Rodríguez Medina	5e55bfa821	Fixed tests for Swedish that were written in Danish. (#5395)	2020-05-05 14:06:27 +02:00
Adriane Boyd	565e0eef73	Add tokenizer option for token match with affixes To fix the slow tokenizer URL (#4374) and allow `token_match` to take priority over prefixes and suffixes by default, introduce a new tokenizer option for a token match pattern that's applied after prefixes and suffixes but before infixes.	2020-05-05 10:35:33 +02:00
Adriane Boyd	792c8af8cf	Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match	2020-05-05 09:25:57 +02:00
Matthew Honnibal	eb117e2fce	Add load_config_from_str helper	2020-05-02 14:09:21 +02:00
adrianeboyd	c045a9c7f6	Fix logic in train CLI timing eval on GPU (#5387) Run CPU timing in first iteration only	2020-05-01 12:05:33 +02:00
Samuel Rodríguez Medina	148b036e0c	Spanish like num improvement (#5381) * Add tests for Spanish like_num. * Add missing numbers in Spanish lexical attributes for like_num. * Modify Spanish test function name. * Add contributor agreement.	2020-04-30 11:13:23 +02:00
svlandeg	ebaed7dcfa	Few more updates to the EL documentation	2020-04-30 10:17:06 +02:00
Samuel Rodríguez Medina	8602daba85	Swedish like_num (#5371) * Sign contributor agreement. * Add like_num functionality to Swedish. * Update spacy/tests/lang/sv/test_lex_attrs.py Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update contributor agreement Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-29 21:25:22 +02:00
adrianeboyd	74da669326	Fix problems with lower and whitespace in variants (#5361) * Initialize lower flag explicitly * Handle whitespace words from GoldParse correctly when creating raw text with orth variants * Return the text with original casing if anything goes wrong	2020-04-29 13:01:25 +02:00
adrianeboyd	3f43c73d37	Normalize TokenC.sent_start values for Matcher (#5346) Normalize TokenC.sent_start values to booleans for the `Matcher`.	2020-04-29 12:57:30 +02:00
adrianeboyd	bdff76dede	Various updates/additions to CLI scripts (#5362) * `debug-data`: determine coverage of provided vectors * `evaluate`: support `blank:lg` model to make it possible to just evaluate tokenization * `init-model`: add option to truncate vectors to N most frequent vectors from word2vec file * `train`: * if training on GPU, only run evaluation/timing on CPU in the first iteration * if training is aborted, exit with a non-0 exit status	2020-04-29 12:56:46 +02:00
Sofie Van Landeghem	cfdaf99b80	Fix passing of component configuration (#5374) * add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument * add fix and test for Issue 5137	2020-04-29 12:56:17 +02:00
Ines Montani	efec28ce70	Merge pull request #5367 from adrianeboyd/feature/simplify-warnings-v2	2020-04-29 12:55:37 +02:00
Ines Montani	63885c1836	Remove u string and auto-format [ci skip]	2020-04-29 12:54:57 +02:00
Ines Montani	962bf12a20	Merge pull request #5312 from odaxiom/fix/website-documentation-spacy-lookup	2020-04-29 12:54:31 +02:00
Sofie Van Landeghem	f67343295d	Update NEL examples and documentation (#5370) * simplify creation of KB by skipping dim reduction * small fixes to train EL example script * add KB creation and NEL training example scripts to example section * update descriptions of example scripts in the documentation * moving wiki_entity_linking folder from bin to projects * remove test for wiki NEL functionality that is being moved	2020-04-29 12:53:53 +02:00
adrianeboyd	a6e521cd79	Add is_sent_end token property (#5375) Reconstruction of the original PR #4697 by @MiniLau. Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema because the Matcher is only going to be able to support `IS_SENT_START`.	2020-04-29 12:53:16 +02:00
Ines Montani	a77754120d	Merge pull request #5177 from nlptechbook/patch-5	2020-04-29 12:52:21 +02:00
Ines Montani	eac47971f1	Merge pull request #5258 from mirfan899/master	2020-04-29 12:51:55 +02:00
Sofie Van Landeghem	1bf2082ac4	update is_new_osx function (#5376)	2020-04-29 12:51:49 +02:00
Ines Montani	1cbb272a6b	Update website/meta/universe.json	2020-04-29 12:51:44 +02:00

11768 Commits