spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 19:39:13 +03:00

Author	SHA1	Message	Date
Sofie Van Landeghem	cd70c3cb79	Fixing pretrain (#7342) * initialize NLP with train corpus * add more pretraining tests * more tests * function to fetch tok2vec layer for pretraining * clarify parameter name * test different objectives * formatting * fix check for static vectors when using vectors objective * clarify docs * logger statement * fix init_tok2vec and proc.initialize order * test training after pretraining * add init_config tests for pretraining * pop pretraining block to avoid config validation errors * custom errors	2021-03-09 14:01:13 +11:00
Ines Montani	c08b3f294c	Support env vars and CLI overrides for project.yml	2021-02-10 13:45:27 +11:00
svlandeg	d5ff0fecf8	add docs	2020-12-30 14:01:13 +01:00
svlandeg	c74ab6a313	fix imports	2020-12-30 12:40:12 +01:00
svlandeg	712a78b74a	add simple unit test	2020-12-30 12:35:26 +01:00
Adriane Boyd	1ddf2f39c7	Switch converters to generator functions (#6547) * Switch converters to generator functions To reduce the memory usage when converting large corpora, refactor the convert methods to be generator functions. * Update tests	2020-12-15 16:47:16 +08:00
Ines Montani	9d32e839d3	Merge branch 'develop' into feature/init-config-cpu-gpu	2020-12-10 08:50:53 +11:00
Ines Montani	b85bd63eca	Fix test	2020-12-09 11:24:01 +11:00
Ines Montani	febf71af28	Fix test	2020-12-09 11:23:07 +11:00
svlandeg	8f8a7f1733	returning config in init_config	2020-12-08 17:37:20 +01:00
Ines Montani	23c63eefaf	Tidy up env vars [ci skip]	2020-09-30 15:15:11 +02:00
Ines Montani	fa47f87924	Tidy up and auto-format	2020-09-29 21:39:28 +02:00
Ines Montani	822ea4ef61	Refactor CLI	2020-09-28 15:09:59 +02:00
Ines Montani	ca3c997062	Improve CLI config validation with latest Thinc	2020-09-26 13:13:57 +02:00
Ines Montani	60a317520a	Merge pull request #6109 from svlandeg/feature/2rename	2020-09-23 09:47:12 +02:00
Ines Montani	5e3b796b12	Validate section refs in debug config	2020-09-22 12:24:39 +02:00
svlandeg	e1b8090b9b	few more fixes	2020-09-22 12:01:06 +02:00
svlandeg	b556a10808	rename converts in_to_out	2020-09-22 11:50:19 +02:00
Ines Montani	758ead8a47	Sync overrides with CLI overrides	2020-09-21 12:50:13 +02:00
Ines Montani	5497acf49a	Support config overrides via environment variables	2020-09-21 11:25:10 +02:00
Ines Montani	1114219ae3	Tidy up and auto-format	2020-09-21 10:59:07 +02:00
Matthew Honnibal	e8378b57bc	Fix test	2020-09-14 21:21:13 +02:00
Matthew Honnibal	54c40223a1	Improve v3 pretrain command (#6040) * Starts to run * Update pretrain script * Update corpus * Update pretrain schema * Remove outdated test * Make JsonlTexts produce Example objects.	2020-09-13 14:05:05 +02:00
svlandeg	115147804a	string_to_list to parse comma-separated string into a list	2020-09-12 14:43:22 +02:00
Sofie Van Landeghem	8e7557656f	Renaming gold & annotation_setter (#6042) * version bump to 3.0.0a16 * rename "gold" folder to "training" * rename 'annotation_setter' to 'set_extra_annotations' * formatting	2020-09-09 10:31:03 +02:00
Ines Montani	2bc31e15c9	Tidy up and auto-format [ci skip]	2020-08-29 13:01:10 +02:00
svlandeg	9a8255ffd5	two tests because of different exit type	2020-08-28 10:50:26 +02:00
svlandeg	73baaf330a	update error type	2020-08-28 10:46:21 +02:00
Ines Montani	dd84577a98	Update CLI utils, project.yml schema and add test	2020-08-25 11:54:53 +02:00
Matthew Honnibal	e559867605	Allow spacy project to push and pull to/from remote storage (#5949) * Add utils for working with remote storage * WIP add remote_cache for project * WIP add push and pull commands * Use pathy in remote_cache * Updarte util * Update remote_cache * Update util * Update project assets * Update pull script * Update push script * Fix type annotation in util * Work on remote storage * Remove site and env hash * Fix imports * Fix type annotation * Require pathy * Require pathy * Fix import * Add a util to handle project variable substitution * Import push and pull commands * Fix pull command * Fix push command * Fix tarfile in remote_storage * Improve printing * Fiddle with status messages * Set version to v3.0.0a9 * Draft docs for spacy project remote storages * Update docs [ci skip] * Use Thinc config to simplify and unify template variables * Auto-format * Don't import Pathy globally for now Causes slow and annoying Google Cloud warning * Tidy up test * Tidy up and update tests * Update to latest Thinc * Update docs * variables -> vars * Update docs [ci skip] * Update docs [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2020-08-23 18:32:09 +02:00
Ines Montani	e2f2ef3a5a	Update init config and recommendations - As much as I dislike YAML, it seemed like a better format here because it allows us to add comments if we want to explain the different recommendations - Don't include the generated JS in the repo by default and build it on the fly when running or deploying the site. This ensures it's always up to date. - Simplify jinja_to_js script and use fewer dependencies	2020-08-19 13:33:15 +02:00
Ines Montani	a570c304df	Update quickstart, template and docs	2020-08-15 14:50:29 +02:00
Ines Montani	88b0a96801	Update for new Thinc and adjust config	2020-08-13 17:38:30 +02:00
Ines Montani	06a97a8766	Support --opt=value format in CLI config overrides	2020-07-28 13:43:15 +02:00
Ines Montani	5cfc3edcaa	Update CLI tests	2020-07-10 18:21:01 +02:00
Ines Montani	37c3bb35e2	Auto-format	2020-07-04 16:25:34 +02:00
Adriane Boyd	abad56db7d	Add conllu2docs converter (#5704) Add conllu2docs converter adapted from conllu2json converter	2020-07-03 12:54:32 +02:00
Matthew Honnibal	8c29268749	Improve spacy.gold (no GoldParse, no json format!) (#5555) * Update errors * Remove beam for now (maybe) Remove beam_utils Update setup.py Remove beam * Remove GoldParse WIP on removing goldparse Get ArcEager compiling after GoldParse excise Update setup.py Get spacy.syntax compiling after removing GoldParse Rename NewExample -> Example and clean up Clean html files Start updating tests Update Morphologizer * fix error numbers * fix merge conflict * informative error when calling to_array with wrong field * fix error catching * fixing language and scoring tests * start testing get_aligned * additional tests for new get_aligned function * Draft create_gold_state for arc_eager oracle * Fix import * Fix import * Remove TokenAnnotation code from nonproj * fixing NER one-to-many alignment * Fix many-to-one IOB codes * fix test for misaligned * attempt to fix cases with weird spaces * fix spaces * test_gold_biluo_different_tokenization works * allow None as BILUO annotation * fixed some tests + WIP roundtrip unit test * add spaces to json output format * minibatch utiltiy can deal with strings, docs or examples * fix augment (needs further testing) * various fixes in scripts - needs to be further tested * fix test_cli * cleanup * correct silly typo * add support for MORPH in to/from_array, fix morphologizer overfitting test * fix tagger * fix entity linker * ensure test keeps working with non-linked entities * pipe() takes docs, not examples * small bug fix * textcat bugfix * throw informative error when running the components with the wrong type of objects * fix parser tests to work with example (most still failing) * fix BiluoPushDown parsing entities * small fixes * bugfix tok2vec * fix renames and simple_ner labels * various small fixes * prevent writing dummy values like deps because that could interfer with sent_start values * fix the fix * implement split_sent with aligned SENT_START attribute * test for split sentences with various 
alignment issues, works * Return ArcEagerGoldParse from ArcEager * Update parser and NER gold stuff * Draft new GoldCorpus class * add links to to_dict * clean up * fix test checking for variants * Fix oracles * Start updating converters * Move converters under spacy.gold * Move things around * Fix naming * Fix name * Update converter to produce DocBin * Update converters * Allow DocBin to take list of Doc objects. * Make spacy convert output docbin * Fix import * Fix docbin * Fix compile in ArcEager * Fix import * Serialize all attrs by default * Update converter * Remove jsonl converter * Add json2docs converter * Draft Corpus class for DocBin * Work on train script * Update Corpus * Update DocBin * Allocate Doc before starting to add words * Make doc.from_array several times faster * Update train.py * Fix Corpus * Fix parser model * Start debugging arc_eager oracle * Update header * Fix parser declaration * Xfail some tests * Skip tests that cause crashes * Skip test causing segfault * Remove GoldCorpus * Update imports * Update after removing GoldCorpus * Fix module name of corpus * Fix mimport * Work on parser oracle * Update arc_eager oracle * Restore ArcEager.get_cost function * Update transition system * Update test_arc_eager_oracle * Remove beam test * Update test * Unskip * Unskip tests * add links to to_dict * clean up * fix test checking for variants * Allow DocBin to take list of Doc objects. 
* Fix compile in ArcEager * Serialize all attrs by default Move converters under spacy.gold Move things around Fix naming Fix name Update converter to produce DocBin Update converters Make spacy convert output docbin Fix import Fix docbin Fix import Update converter Remove jsonl converter Add json2docs converter * Allocate Doc before starting to add words * Make doc.from_array several times faster * Start updating converters * Work on train script * Draft Corpus class for DocBin Update Corpus Fix Corpus * Update DocBin Add missing strings when serializing * Update train.py * Fix parser model * Start debugging arc_eager oracle * Update header * Fix parser declaration * Xfail some tests Skip tests that cause crashes Skip test causing segfault * Remove GoldCorpus Update imports Update after removing GoldCorpus Fix module name of corpus Fix mimport * Work on parser oracle Update arc_eager oracle Restore ArcEager.get_cost function Update transition system * Update tests Remove beam test Update test Unskip Unskip tests * Add get_aligned_parse method in Example Fix Example.get_aligned_parse * Add kwargs to Corpus.dev_dataset to match train_dataset * Update nonproj * Use get_aligned_parse in ArcEager * Add another arc-eager oracle test * Remove Example.doc property Remove Example.doc Remove Example.doc Remove Example.doc Remove Example.doc * Update ArcEager oracle Fix Break oracle * Debugging * Fix Corpus * Fix eg.doc * Format * small fixes * limit arg for Corpus * fix test_roundtrip_docs_to_docbin * fix test_make_orth_variants * fix add_label test * Update tests * avoid writing temp dir in json2docs, fixing 4402 test * Update test * Add missing costs to NER oracle * Update test * Work on Example.get_aligned_ner method * Clean up debugging * Xfail tests * Remove prints * Remove print * Xfail some tests * Replace unseen labels for parser * Update test * Update test * Xfail test * Fix Corpus * fix imports * fix docs_to_json * various small fixes * cleanup * Support 
gold_preproc in Corpus * Support gold_preproc * Pass gold_preproc setting into corpus * Remove debugging * Fix gold_preproc * Fix json2docs converter * Fix convert command * Fix flake8 * Fix import * fix output_dir (converted to Path by typer) * fix var * bugfix: update states after creating golds to avoid out of bounds indexing * Improve efficiency of ArEager oracle * pull merge_sent into iob2docs to avoid Doc creation for each line * fix asserts * bugfix excl Span.end in iob2docs * Support max_length in Corpus * Fix arc_eager oracle * Filter out uannotated sentences in NER * Remove debugging in parser * Simplify NER alignment * Fix conversion of NER data * Fix NER init_gold_batch * Tweak efficiency of precomputable affine * Update onto-json default * Update gold test for NER * Fix parser test * Update test * Add NER data test * Fix convert for single file * Fix test * Hack scorer to avoid evaluating non-nered data * Fix handling of NER data in Example * Output unlabelled spans from O biluo tags in iob_utils * Fix unset variable * Return kept examples from init_gold_batch * Return examples from init_gold_batch * Dont return Example from init_gold_batch * Set spaces on gold doc after conversion * Add test * Fix spaces reading * Improve NER alignment * Improve handling of missing values in NER * Restore the 'cutting' in parser training * Add assertion * Print epochs * Restore random cuts in parser/ner training * Implement Doc.copy * Implement Example.copy * Copy examples at the start of Language.update * Don't unset example docs * Tweak parser model slightly * attempt to fix _guess_spaces * _add_entities_to_doc first, so that links don't get overwritten * fixing get_aligned_ner for one-to-many * fix indexing into x_text * small fix biluo_tags_from_offsets * Add onto-ner config * Simplify NER alignment * Fix NER scoring for partially annotated documents * fix indexing into x_text * fix test_cli failing tests by ignoring spans in doc.ents with empty label * Fix limit 
* Improve NER alignment * Fix count_train * Remove print statement * fix tests, we're not having nothing but None * fix clumsy fingers * Fix tests * Fix doc.ents * Remove empty docs in Corpus and improve limit * Update config Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2020-06-26 19:34:12 +02:00
Adriane Boyd	4b229bfc22	Improve handling of NER in CoNLL-U MISC	2020-05-20 18:48:51 +02:00
Ines Montani	e3f40a6a0f	Tidy up and auto-format	2020-02-18 15:38:18 +01:00
adrianeboyd	a365359b36	Add convert CLI option to merge CoNLL-U subtokens (#4722) * Add convert CLI option to merge CoNLL-U subtokens Add `-T` option to convert CLI that merges CoNLL-U subtokens into one token in the converted data. Each CoNLL-U sentence is read into a `Doc` and the `Retokenizer` is used to merge subtokens with features as follows: * `orth` is the merged token orth (should correspond to raw text and `# text`) * `tag` is all subtoken tags concatenated with `_`, e.g. `ADP_DET` * `pos` is the POS of the syntactic root of the span (as determined by the Retokenizer) * `morph` is all morphological features merged * `lemma` is all subtoken lemmas concatenated with ` `, e.g. `de o` * with `-m` all morphological features are combined with the tag using the separator `__`, e.g. `ADP_DET__Definite=Def\|Gender=Masc\|Number=Sing\|PronType=Art` * `dep` is the dependency relation for the syntactic root of the span (as determined by the Retokenizer) Concatenated tags will be mapped to the UD POS of the syntactic root (e.g., `ADP`) and the morphological features will be the combined features. In many cases, the original UD subtokens can be reconstructed from the available features given a language-specific lookup table, e.g., Portuguese `do / ADP_DET / Definite=Def\|Gender=Masc\|Number=Sing\|PronType=Art` is `de / ADP`, `o / DET / Definite=Def\|Gender=Masc\|Number=Sing\|PronType=Art` or lookup rules for forms containing open class words like Spanish `hablarlo / VERB_PRON / Case=Acc\|Gender=Masc\|Number=Sing\|Person=3\|PrepCase=Npr\|PronType=Prs\|VerbForm=Inf`. * Clean up imports	2020-01-29 17:44:25 +01:00
Ines Montani	db55577c45	Drop Python 2.7 and 3.5 (#4828) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Revert lookups.md * Revert top-level.md * Small adjustments and docs [ci skip]	2019-12-22 01:53:56 +01:00
adrianeboyd	eb9b1858c4	Add NER map option to convert CLI (#4763) Instead of a hard-coded NER tag simplification function that was only intended for NorNE, map NER tags in CoNLL-U converter using a dict provided as JSON as a command-line option. Map NER entity types or new tag or to "" for 'O', e.g.: ``` {"PER": "PERSON", "BAD": ""} => B-PER -> B-PERSON B-BAD -> O ```	2019-12-11 18:20:49 +01:00
adrianeboyd	9efd3ccbef	Update conllu2json MISC column handling (#4715) Update converter to handle various things in MISC column: * `SpaceAfter=No` and set raw text accordingly * plain NER tag * name=NER (for NorNE)	2019-11-26 16:10:08 +01:00
Ines Montani	3d8fd4b461	Revert #4334	2019-09-29 17:32:12 +02:00
Ines Montani	c9cd516d96	Move tests out of package (#4334) * Move tests out of package * Fix typo	2019-09-28 18:05:00 +02:00
Ines Montani	cd90752193	Tidy up and auto-format [ci skip]	2019-08-31 13:39:06 +02:00
adrianeboyd	82159b5c19	Updates/bugfixes for NER/IOB converters (#4186) * Updates/bugfixes for NER/IOB converters * Converter formats `ner` and `iob` use autodetect to choose a converter if possible * `iob2json` is reverted to handle sentence-per-line data like `word1\|pos1\|ent1 word2\|pos2\|ent2` * Fix bug in `merge_sentences()` so the second sentence in each batch isn't skipped * `conll_ner2json` is made more general so it can handle more formats with whitespace-separated columns * Supports all formats where the first column is the token and the final column is the IOB tag; if present, the second column is the POS tag * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O` separates documents * Add option for segmenting sentences (new flag `-s`) * Parser-based sentence segmentation with a provided model, otherwise with sentencizer (new option `-b` to specify model) * Can group sentences into documents with `n_sents` as long as sentence segmentation is available * Only applies automatic segmentation when there are no existing delimiters in the data * Provide info about settings applied during conversion with warnings and suggestions if settings conflict or might not be not optimal. * Add tests for common formats * Add '(default)' back to docs for -c auto * Add document count back to output * Revert changes to converter output message * Use explicit tabs in convert CLI test data * Adjust/add messages for n_sents=1 default * Add sample NER data to training examples * Update README * Add links in docs to example NER data * Define msg within converters	2019-08-29 12:04:01 +02:00
BreakBB	d8573ee715	Update error raising for CLI pretrain to fix #3840 (#3843) * Add check for empty input file to CLI pretrain * Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key * Skip empty values for correct pretrain keys and log a counter as warning * Add tests for CLI pretrain core function make_docs. * Add a short hint for the `tokens` key to the CLI pretrain docs * Add success message to CLI pretrain * Update model loading to fix the tests * Skip empty values and do not create docs out of it	2019-06-16 13:22:57 +02:00
Ryan Ford	00842d7f1b	Merging conversion scripts for conll formats (#3405) * merging conllu/conll and conllubio scripts * tabs to spaces * removing conllubio2json from converters/__init__.py * Move not-really-CLI tests to misc * Add converter test using no-ud data * Fix test I broke * removing include_biluo parameter * fixing read_conllx * remove include_biluo from convert.py	2019-03-15 18:14:46 +01:00
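
The NER map behaviour described in commit eb9b1858c4 (a JSON dict that renames entity types, with "" meaning 'O') can be illustrated with a small standalone sketch. This is not the converter code from the repository: the function name `map_ner_tags` and its signature are hypothetical, and the only behaviour taken from the commit message is the mapping `{"PER": "PERSON", "BAD": ""}` turning `B-PER` into `B-PERSON` and `B-BAD` into `O`. Leaving unmapped labels unchanged is an assumption.

```python
import json
from typing import Dict, List


def map_ner_tags(tags: List[str], ner_map: Dict[str, str]) -> List[str]:
    """Hypothetical helper: rewrite IOB tags using a label map.

    A label mapped to "" becomes the outside tag "O"; labels missing
    from the map are kept as they are (an assumption, not confirmed
    by the commit message).
    """
    mapped = []
    for tag in tags:
        # "O" and any tag without a prefix-label separator pass through.
        if tag == "O" or "-" not in tag:
            mapped.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        new_label = ner_map.get(label, label)
        mapped.append(f"{prefix}-{new_label}" if new_label else "O")
    return mapped


# The map arrives as a JSON string, as it would from a command-line option.
ner_map = json.loads('{"PER": "PERSON", "BAD": ""}')
result = map_ner_tags(["B-PER", "I-PER", "B-BAD", "O", "B-ORG"], ner_map)
print(result)
```

Passing the map as JSON keeps the command line free of language-specific syntax, which matches the commit's move away from a hard-coded simplification function.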

53 Commits