Commit Graph

60 Commits

Author SHA1 Message Date
Ines Montani
5d3806e059 unicode -> str consistency 2020-05-24 17:20:58 +02:00
Adriane Boyd
4b229bfc22 Improve handling of NER in CoNLL-U MISC 2020-05-20 18:48:51 +02:00
Ines Montani
1278161f47 Tidy up and fix issues 2020-02-18 15:17:03 +01:00
Ines Montani
de11ea753a Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
Tyler Couto
9fa9d7f2cb
Fix for Issue 4665 - conllu2json (#4953)
* Fix for Issue 4665 - conllu2json

- Allowing HEAD to be an underscore

* Added contributor agreement
2020-02-03 13:01:48 +01:00
adrianeboyd
a365359b36
Add convert CLI option to merge CoNLL-U subtokens (#4722)
* Add convert CLI option to merge CoNLL-U subtokens

Add `-T` option to convert CLI that merges CoNLL-U subtokens into one
token in the converted data. Each CoNLL-U sentence is read into a `Doc`
and the `Retokenizer` is used to merge subtokens with features as
follows:

* `orth` is the merged token orth (should correspond to raw text and `# text`)

* `tag` is all subtoken tags concatenated with `_`, e.g. `ADP_DET`

* `pos` is the POS of the syntactic root of the span (as determined by
the Retokenizer)

* `morph` is all morphological features merged

* `lemma` is all subtoken lemmas concatenated with ` `, e.g. `de o`

* with `-m` all morphological features are combined with the tag using
the separator `__`, e.g.
`ADP_DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art`

* `dep` is the dependency relation for the syntactic root of the span
(as determined by the Retokenizer)

Concatenated tags will be mapped to the UD POS of the syntactic root
(e.g., `ADP`) and the morphological features will be the combined
features.

In many cases, the original UD subtokens can be reconstructed from the
available features given a language-specific lookup table, e.g., Portuguese
`do / ADP_DET / Definite=Def|Gender=Masc|Number=Sing|PronType=Art` is
`de / ADP`, `o / DET / Definite=Def|Gender=Masc|Number=Sing|PronType=Art`,
or lookup rules for forms containing open class words like Spanish
`hablarlo / VERB_PRON / Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|VerbForm=Inf`.

* Clean up imports
2020-01-29 17:44:25 +01:00
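
The subtoken merge described in #4722 can be sketched with spaCy's retokenizer roughly as follows. This is a minimal illustration, not the converter's actual code; the Portuguese tokens and attribute values are assumptions based on the scheme above.

```python
import spacy
from spacy.tokens import Doc

# Minimal sketch (not the converter's code): merge a CoNLL-U subtoken pair
# into a single token with the retokenizer, using the attribute scheme above.
# In the real converter the merged token's text comes from the raw "# text"
# line (e.g. Portuguese "do"); here it is simply the span text.
nlp = spacy.blank("pt")
doc = Doc(nlp.vocab, words=["de", "o", "livro"], spaces=[True, True, False])

with doc.retokenize() as retokenizer:
    retokenizer.merge(
        doc[0:2],
        attrs={
            "TAG": "ADP_DET",   # subtoken tags concatenated with "_"
            "POS": "ADP",       # POS of the span's syntactic root
            "LEMMA": "de o",    # subtoken lemmas concatenated with " "
        },
    )

print([(t.text, t.tag_, t.pos_, t.lemma_) for t in doc])
```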
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani
158b98a3ef Merge branch 'master' into develop 2019-12-21 18:55:03 +01:00
adrianeboyd
eb9b1858c4 Add NER map option to convert CLI (#4763)
Instead of a hard-coded NER tag simplification function that was only
intended for NorNE, map NER tags in the CoNLL-U converter using a dict
provided as JSON via a command-line option.

Map NER entity types to new tags, or to "" for 'O', e.g.:

```
{"PER": "PERSON", "BAD": ""}

=>

B-PER -> B-PERSON
B-BAD -> O
```
2019-12-11 18:20:49 +01:00
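
A minimal sketch of how such a JSON-provided mapping can be applied to IOB-style tags; the `map_ner_tag` helper below is hypothetical, not the converter's actual function.

```python
# Hypothetical helper showing how a mapping like {"PER": "PERSON", "BAD": ""}
# is applied to IOB-style NER tags: mapped types are renamed, and types
# mapped to "" become "O".
def map_ner_tag(tag, ner_map):
    if tag == "O" or "-" not in tag:
        return tag
    iob, ent_type = tag.split("-", 1)
    new_type = ner_map.get(ent_type, ent_type)
    return f"{iob}-{new_type}" if new_type else "O"

ner_map = {"PER": "PERSON", "BAD": ""}
print([map_ner_tag(t, ner_map) for t in ["B-PER", "I-PER", "B-BAD", "O"]])
# ['B-PERSON', 'I-PERSON', 'O', 'O']
```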
adrianeboyd
68f711b409 Fix conllu2json n_sents and raw text (#4728)
Update conllu2json converter to include raw text in final batch.
2019-11-29 10:22:03 +01:00
adrianeboyd
9efd3ccbef Update conllu2json MISC column handling (#4715)
Update converter to handle various things in MISC column:

* `SpaceAfter=No` and set raw text accordingly
* plain NER tag
* name=NER (for NorNE)
2019-11-26 16:10:08 +01:00
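
An illustrative sketch of parsing a MISC value for these cases, assuming `|`-separated MISC parts as in CoNLL-U; the helper is hypothetical, not the converter's actual code.

```python
# Illustrative MISC parsing: detect SpaceAfter=No and extract a NER tag given
# either as a plain value ("B-PER") or as "name=B-PER" (NorNE style).
def parse_misc(misc):
    space_after = True
    ner_tag = None
    for part in misc.split("|"):
        if part == "SpaceAfter=No":
            space_after = False
        elif part.startswith("name="):
            ner_tag = part.split("=", 1)[1]
        elif part and "=" not in part and part != "_":
            ner_tag = part  # plain NER tag
    return space_after, ner_tag

print(parse_misc("SpaceAfter=No|name=B-PER"))  # (False, 'B-PER')
print(parse_misc("B-LOC"))                     # (True, 'B-LOC')
```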
adrianeboyd
9aab0a55e1 Fix conllu2json converter to output all sentences (#4716)
Make sure that the last batch of sentences is output if n_sents > 1.
2019-11-26 16:05:17 +01:00
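
The fix amounts to flushing the final, possibly smaller batch. A minimal sketch of that batching pattern (not the converter's actual code):

```python
# Group sentences into documents of n_sents, making sure the last, possibly
# smaller batch is also yielded rather than silently dropped.
def group_sentences(sentences, n_sents):
    batch = []
    for sent in sentences:
        batch.append(sent)
        if len(batch) == n_sents:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

print(list(group_sentences(["s1", "s2", "s3", "s4", "s5"], 2)))
# [['s1', 's2'], ['s3', 's4'], ['s5']]
```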
adrianeboyd
392c4880d9 Restructure Example with merged sents as default (#4632)
* Switch to train_dataset() function in train CLI

* Fixes for pipe() methods in pipeline components

* Don't clobber `examples` variable with `as_example` in pipe() methods
* Remove unnecessary traversals of `examples`

* Update Parser.pipe() for Examples

* Add `as_examples` kwarg to `pipe()` with implementation to return
`Example`s

* Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from
`Pipe`)

* Fixes to Example implementation in spacy.gold

* Move `make_projective` from an attribute of Example to an argument of
`Example.get_gold_parses()`

* Heads of 0 are not treated as unset

* Unset heads are set to self rather than `None` (which causes problems
while projectivizing)

* Check for `Doc` (not just not `None`) when creating GoldParses for
pre-merged example

* Don't clobber `examples` variable in `iter_gold_docs()`

* Add/modify gold tests for handling projectivity

* In JSON roundtrip compare results from `dev_dataset` rather than
`train_dataset` to avoid projectivization (and other potential
modifications)

* Add test for projective train vs. nonprojective dev versions of the
same `Doc`

* Handle ignore_misaligned as arg rather than attr

Move `ignore_misaligned` from an attribute of `Example` to an argument
to `Example.get_gold_parses()`, which makes it parallel to
`make_projective`.

Add test with old and new align that checks whether `ignore_misaligned`
errors are raised as expected (only for new align).

* Remove unused attrs from gold.pxd

Remove `ignore_misaligned` and `make_projective` from `gold.pxd`

* Restructure Example with merged sents as default

An `Example` now includes a single `TokenAnnotation` that contains all
the information from one `Doc` (= JSON `paragraph`). If required, the
individual sentences can be returned as a list of examples with
`Example.split_sents()`, with no raw text available.

* Input/output a single `Example.token_annotation`

* Add `sent_starts` to `TokenAnnotation` to handle sentence boundaries

* Replace `Example.merge_sents()` with `Example.split_sents()`

* Modify components to use a single `Example.token_annotation`

  * Pipeline components
  * conllu2json converter

* Rework/rename `add_token_annotation()` and `add_doc_annotation()` to
`set_token_annotation()` and `set_doc_annotation()`, functions that set
rather than appending/extending.

* Rename `morphology` to `morphs` in `TokenAnnotation` and `GoldParse`

* Add getters to `TokenAnnotation` to supply default values when a given
attribute is not available

* `Example.get_gold_parses()` in `spacy.gold._make_golds()` is only
applied to single examples, so the `GoldParse` is saved in the provided
`Example` rather than in a new `Example` with no other internal
annotation

* Update tests for API changes and `merge_sents()` vs. `split_sents()`

* Refer to Example.goldparse in iter_gold_docs()

Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold`,
because a `None` `GoldParse` is generated with `ignore_misaligned` and
generating it on the fly can raise an unwanted AlignmentError.

* Fix make_orth_variants()

Fix bug in make_orth_variants() related to conversion from multiple to
one TokenAnnotation per Example.

* Add basic test for make_orth_variants()

* Replace try/except with conditionals

* Replace default morph value with set
2019-11-25 16:03:28 +01:00
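
A purely illustrative sketch of the split-by-`sent_starts` idea described above, using a stand-in dataclass rather than spaCy's actual `Example`/`TokenAnnotation` internals:

```python
from dataclasses import dataclass
from typing import List

# Stand-in for a token-level annotation covering a whole paragraph, with
# sentence boundaries in sent_starts, and a split_sents() that slices it
# back into per-sentence annotations.
@dataclass
class TokenAnnotationSketch:
    words: List[str]
    sent_starts: List[bool]

    def split_sents(self):
        sents, start = [], 0
        for i, is_start in enumerate(self.sent_starts):
            if is_start and i > 0:
                sents.append(TokenAnnotationSketch(
                    self.words[start:i], self.sent_starts[start:i]))
                start = i
        sents.append(TokenAnnotationSketch(
            self.words[start:], self.sent_starts[start:]))
        return sents

ann = TokenAnnotationSketch(["Hi", ".", "Bye", "."],
                            [True, False, True, False])
print([s.words for s in ann.split_sents()])  # [['Hi', '.'], ['Bye', '.']]
```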
adrianeboyd
bdfb696677 Fix conllu2json converter to output all sentences (#4656)
Make sure that the last batch of sentences is output if n_sents > 1.
2019-11-15 17:08:32 +01:00
Sofie Van Landeghem
e48a09df4e Example class for training data (#4543)
* OrigAnnot class instead of gold.orig_annot list of zipped tuples

* from_orig to replace from_annot_tuples

* rename to RawAnnot

* some unit tests for GoldParse creation and internal format

* removing orig_annot and switching to lists instead of tuple

* rewriting tuples to use RawAnnot (+ debug statements, WIP)

* fix pop() changing the data

* small fixes

* pop-append fixes

* return RawAnnot for existing GoldParse to have uniform interface

* clean up imports

* fix merge_sents

* add unit test for 4402 with new structure (not working yet)

* introduce DocAnnot

* typo fixes

* add unit test for merge_sents

* rename from_orig to from_raw

* fixing unit tests

* fix nn parser

* read_annots to produce text, doc_annot pairs

* _make_golds fix

* rename golds_to_gold_annots

* small fixes

* fix encoding

* have golds_to_gold_annots use DocAnnot

* missed a spot

* merge_sents as function in DocAnnot

* allow specifying only part of the token-level annotations

* refactor with Example class + underlying dicts

* pipeline components to work with Example objects (wip)

* input checking

* fix yielding

* fix calls to update

* small fixes

* fix scorer unit test with new format

* fix kwargs order

* fixes for ud and conllu scripts

* fix reading data for conllu script

* add in proper errors (not fixed numbering yet to avoid merge conflicts)

* fixing few more small bugs

* fix EL script
2019-11-11 17:35:27 +01:00
Ines Montani
cc05d9dad6 Auto-format [ci skip] 2019-10-24 16:21:08 +02:00
adrianeboyd
8d3de90bc4 Suppress convert output if writing to stdout (#4472) 2019-10-18 18:12:59 +02:00
Matthew Honnibal
fd4a5341b0 Fix ner_jsonl2json converter (fix #4389) (#4394) 2019-10-08 00:52:45 +02:00
Matthew Honnibal
af93997993 Fix conllu converter 2019-09-11 13:28:07 +02:00
Ines Montani
6279d74c65 Tidy up and auto-format 2019-09-11 11:38:22 +02:00
Ines Montani
cd90752193 Tidy up and auto-format [ci skip] 2019-08-31 13:39:06 +02:00
adrianeboyd
82159b5c19 Updates/bugfixes for NER/IOB converters (#4186)
* Updates/bugfixes for NER/IOB converters

* Converter formats `ner` and `iob` use autodetect to choose a converter if
  possible

* `iob2json` is reverted to handle sentence-per-line data like
  `word1|pos1|ent1 word2|pos2|ent2`

  * Fix bug in `merge_sentences()` so the second sentence in each batch isn't
    skipped

* `conll_ner2json` is made more general so it can handle more formats with
  whitespace-separated columns

  * Supports all formats where the first column is the token and the final
    column is the IOB tag; if present, the second column is the POS tag

  * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
    separates documents

  * Add option for segmenting sentences (new flag `-s`)

  * Parser-based sentence segmentation with a provided model, otherwise with
    sentencizer (new option `-b` to specify model)

  * Can group sentences into documents with `n_sents` as long as sentence
    segmentation is available

  * Only applies automatic segmentation when there are no existing delimiters
    in the data

* Provide info about settings applied during conversion with warnings and
  suggestions if settings conflict or might not be optimal.

* Add tests for common formats

* Add '(default)' back to docs for -c auto

* Add document count back to output

* Revert changes to converter output message

* Use explicit tabs in convert CLI test data

* Adjust/add messages for n_sents=1 default

* Add sample NER data to training examples

* Update README

* Add links in docs to example NER data

* Define msg within converters
2019-08-29 12:04:01 +02:00
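
A minimal illustrative reader for the generalized format described above (first column is the token, last column is the IOB tag, blank lines separate sentences, `-DOCSTART-` separates documents). This is a sketch, not `conll_ner2json` itself.

```python
# Parse whitespace-separated CoNLL NER data into documents of sentences of
# (token, IOB tag) pairs.
def read_conll_ner(lines):
    docs, sents, sent = [], [], []
    for line in lines:
        line = line.strip()
        if line.startswith("-DOCSTART-"):
            if sent:
                sents.append(sent)
                sent = []
            if sents:
                docs.append(sents)
                sents = []
        elif not line:
            if sent:
                sents.append(sent)
                sent = []
        else:
            cols = line.split()
            sent.append((cols[0], cols[-1]))  # (token, IOB tag)
    if sent:
        sents.append(sent)
    if sents:
        docs.append(sents)
    return docs

sample = ["-DOCSTART- -X- O O", "", "EU NNP B-ORG", "rejects VBZ O", ""]
print(read_conll_ner(sample))  # [[[('EU', 'B-ORG'), ('rejects', 'O')]]]
```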
Ines Montani
aea1c93a05 Replace cytoolz.partition_all with util.minibatch 2019-05-11 21:12:09 +02:00
Ines Montani
0bf6441863 Fix .iob converter (closes #3620) 2019-05-11 19:15:26 +02:00
Matthew Honnibal
47e110375d Fix jsonl to json conversion (#3419)
* Fix spacy.gold.docs_to_json function

* Fix jsonl2json converter
2019-03-17 22:12:54 +01:00
Ryan Ford
00842d7f1b Merging conversion scripts for conll formats (#3405)
* merging conllu/conll and conllubio scripts

* tabs to spaces

* removing conllubio2json from converters/__init__.py

* Move not-really-CLI tests to misc

* Add converter test using no-ud data

* Fix test I broke

* removing include_biluo parameter

* fixing read_conllx

* remove include_biluo from convert.py
2019-03-15 18:14:46 +01:00
Ines Montani
5d0b60999d Merge branch 'master' into develop 2019-02-07 20:54:07 +01:00
Gavriel Loria
9a5003d5c8 iob converter: add 'exception' for error 'too many values' (#3159)
* added contributor agreement

* issue #3128 throw exception on bad IOB/2 formatting

* Update spacy/cli/converters/iob2json.py with ValueError

Co-Authored-By: gavrieltal <gtloria@protonmail.com>
2019-01-16 13:44:16 +01:00
Ines Montani
ae880ef912 Tidy up merge conflict leftovers 2018-12-18 13:58:30 +01:00
Ines Montani
61d09c481b Merge branch 'master' into develop 2018-12-18 13:48:10 +01:00
Ines Montani
ffdd5e964f
Small CLI improvements (#3030)
* Add todo

* Auto-format

* Update wasabi pin

* Format training results with wasabi

* Remove loading animation from model saving

Currently behaves weirdly

* Inline messages

* Remove unnecessary path2str

Already taken care of by printer

* Inline messages in CLI

* Remove unused function

* Move loading indicator into loading function

* Check for invalid whitespace entities
2018-12-08 11:49:43 +01:00
Matthew Honnibal
0a60726215 Remove cytoolz usage in CLI 2018-12-06 20:37:00 +01:00
Gavriel Loria
9c8c4287bf Accept iob2 and allow generic whitespace (#2999)
* accept non-pipe whitespace as delimiter; allow iob2 filename

* added small documentation note for IOB2 allowance

* added contributor agreement
2018-12-06 15:50:25 +01:00
Ines Montani
5b2741f751 Remove unused cytoolz / itertools imports 2018-12-03 02:12:07 +01:00
Ines Montani
f37863093a 💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003)
Remove hacks and wrappers, keep code in sync across our libraries and move spaCy a few steps closer to only depending on packages with binary wheels 🎉

See here: https://github.com/explosion/srsly

    Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy have steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place.

    At the same time, we noticed that having a lot of small dependencies was making maintenance harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel.

    srsly currently includes forks of the following packages:

        ujson
        msgpack
        msgpack-numpy
        cloudpickle



* WIP: replace json/ujson with srsly

* Replace ujson in examples

Use regular json instead of srsly to make code easier to read and follow

* Update requirements

* Fix imports

* Fix typos

* Replace msgpack with srsly

* Fix warning
2018-12-03 01:28:22 +01:00
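
Typical usage of the consolidated srsly API that replaces the direct ujson/msgpack calls (file names here are illustrative):

```python
import srsly

# One package covers JSON, JSONL and msgpack serialization.
data = {"text": "This is a sentence.", "ents": [[0, 4, "MISC"]]}

srsly.write_json("example.json", data)       # JSON to disk
loaded = srsly.read_json("example.json")     # and back

srsly.write_jsonl("examples.jsonl", [data])  # one JSON object per line
packed = srsly.msgpack_dumps(data)           # msgpack bytes in memory
unpacked = srsly.msgpack_loads(packed)
```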
Ines Montani
37c7c85a86 💫 New JSON helpers, training data internals & CLI rewrite (#2932)
* Support nowrap setting in util.prints

* Tidy up and fix whitespace

* Simplify script and use read_jsonl helper

* Add JSON schemas (see #2928)

* Deprecate Doc.print_tree

Will be replaced with Doc.to_json, which will produce a unified format

* Add Doc.to_json() method (see #2928)

Converts Doc objects to JSON using the same unified format as the training data. Method also supports serializing selected custom attributes in the doc._. space.

* Remove outdated test

* Add write_json and write_jsonl helpers

* WIP: Update spacy train

* Tidy up spacy train

* WIP: Use wasabi for formatting

* Add GoldParse helpers for JSON format

* WIP: add debug-data command

* Fix typo

* Add missing import

* Update wasabi pin

* Add missing import

* 💫 Refactor CLI (#2943)

To be merged into #2932.

## Description
- [x] refactor CLI To use [`wasabi`](https://github.com/ines/wasabi)
- [x] use [`black`](https://github.com/ambv/black) for auto-formatting
- [x] add `flake8` config
- [x] move all messy UD-related scripts to `cli.ud`
- [x] make converters function that take the opened file and return the converted data (instead of having them handle the IO)

### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Update wasabi pin

* Delete old test

* Update errors

* Fix typo

* Tidy up and format remaining code

* Fix formatting

* Improve formatting of messages

* Auto-format remaining code

* Add tok2vec stuff to spacy.train

* Fix typo

* Update wasabi pin

* Fix path checks for when train() is called as function

* Reformat and tidy up pretrain script

* Update argument annotations

* Raise error if model language doesn't match lang

* Document new train command
2018-11-30 20:16:14 +01:00
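
An example of the `Doc.to_json()` method introduced here; the blank English pipeline is just for illustration.

```python
import spacy

# Doc.to_json() serializes a Doc to the same unified JSON format used for
# the training data; annotations that aren't set are simply omitted.
nlp = spacy.blank("en")
doc = nlp("This is a sentence.")
print(doc.to_json())
# e.g. {"text": "This is a sentence.", "tokens": [...], ...}
```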
Matthew Honnibal
6ea981c839 Add converter for jsonl NER data 2018-08-14 14:04:32 +02:00
Kaisa (Katarzyna) Korsak
e531a827db Changed conllu2json to be able to extract NER tags (#2594)
* extract ner tags from conllu file if available

* fixed a bug in regex
2018-07-25 22:21:31 +02:00
Ole Henrik Skogstrøm
6e2930a4a2 Conll(u)-bio converter (#2525)
* Started simple conllxbiluo converter

* Fix missing BIO to BILUO conversion
2018-07-18 18:55:42 +02:00
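
A minimal sketch of the BIO-to-BILUO conversion mentioned above (illustrative only, not the converter's actual code):

```python
# Single-token entities become U-, and the last token of a multi-token
# entity becomes L-.
def bio_to_biluo(tags):
    biluo = list(tags)
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        ends_entity = not next_tag.startswith("I-")
        if tag.startswith("B-") and ends_entity:
            biluo[i] = "U-" + tag[2:]
        elif tag.startswith("I-") and ends_entity:
            biluo[i] = "L-" + tag[2:]
    return biluo

print(bio_to_biluo(["B-PER", "I-PER", "O", "B-LOC"]))
# ['B-PER', 'L-PER', 'O', 'U-LOC']
```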
Ines Montani
3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
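
A sketch of the lazily applied error-code pattern described above (illustrative, not spaCy's actual `Errors` implementation):

```python
# The code is prepended only when the attribute is looked up, so the stored
# message strings themselves stay untouched.
def add_codes(err_cls):
    class ErrorsWithCodes(object):
        def __getattribute__(self, code):
            msg = getattr(err_cls, code)
            return f"[{code}] {msg}"
    return ErrorsWithCodes()

@add_codes
class Errors(object):
    E001 = "No component '{name}' found in pipeline."

print(Errors.E001.format(name="ner"))
# [E001] No component 'ner' found in pipeline.
```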
Matthew Honnibal
eca41f0cf6 Fix filename conversion for conllu 2017-11-01 21:26:49 +01:00
Matthew Honnibal
e237472cdc Fix tag and filename conversion for conllu 2017-11-01 21:25:33 +01:00
ines
d941fc3667 Tidy up CLI 2017-10-27 14:38:39 +02:00
Matthew Honnibal
a635240398 Add conll_ner2json converter 2017-10-09 22:03:26 -05:00
Matthew Honnibal
735d18654d Add NER converter for CoNLL 2003 data 2017-10-09 20:06:28 -05:00
Matthew Honnibal
f942903429 Improve sentence merging in iob2json 2017-10-02 17:02:10 +02:00
Matthew Honnibal
31681d20e0 Fix concatenation in iob2json converter 2017-10-02 16:50:26 +02:00
György Orosz
62dbf9025c Fixed conllu converter 2017-06-09 22:53:56 +02:00
Matthew Honnibal
5cf47b847b Handle iob with no tag in converter 2017-05-28 08:11:39 -05:00
Matthew Honnibal
5a87bcf35f Fix converters 2017-05-26 11:32:34 -05:00