Commit Graph

6676 Commits

Author SHA1 Message Date
adrianeboyd
923a453449
Modifications/updates to Portuguese tokenization (#5203)
Modifications to Portuguese tokenization for UD_Portuguese-Bosque.
Instead of splitting contractions as exceptions, they are kept as merged
tokens.
2020-03-25 11:27:53 +01:00
adrianeboyd
4117a5c705
Improve French tokenization (#5202)
Improve French tokenization for UD_French-Sequoia.
2020-03-25 11:27:42 +01:00
Ines Montani
a3d09ffe61
Merge pull request #5201 from adrianeboyd/feature/ud-tokenization-nb-v2
Improved tokenization for UD_Norwegian-Bokmaal
2020-03-25 11:27:31 +01:00
Adriane Boyd
09d442f5ad Merge remote-tracking branch 'upstream/master' into feature/ud-tokenization-da 2020-03-25 09:41:52 +01:00
Adriane Boyd
cba2d1d972 Disable failing abbreviation test
UD_Danish-DDT has (as far as I can tell) hallucinated periods after
abbreviations, so the changes are an artifact of the corpus and not due
to anything meaningful about Danish tokenization.
2020-03-25 09:39:26 +01:00
Adriane Boyd
79737adb90 Improved tokenization for UD_Norwegian-Bokmaal 2020-03-25 08:54:02 +01:00
Ines Montani
5f2afa0479
Merge pull request #5185 from adrianeboyd/bugfix/de-punctuation-style
Improve German tokenizer settings style
2020-03-24 16:38:32 +01:00
Adriane Boyd
2897a73559 Improve German tokenizer settings style 2020-03-23 19:23:47 +01:00
Baciccin
3b53617a69 Add Ligurian language 2020-03-19 21:37:01 -07:00
Ines Montani
c68f20b398
Merge pull request #5146 from adrianeboyd/bugfix/assert-docs-equal-sents
Fix sents comparison in test util
2020-03-16 14:59:32 +01:00
Adriane Boyd
423849f94a Fix sents comparison in test util
Due to changes to `Span` (#5005), spans from different documents are now
never equal. Check `Token.is_sent_start` values instead.
2020-03-13 09:25:23 +01:00
Matthew Honnibal
26a90f011b Set version to v2.2.4 2020-03-12 11:30:41 +01:00
svlandeg
c4d030dbf6 remove accidental commit 2020-03-09 18:10:54 +01:00
svlandeg
1724a4f75b additional information if doc is empty 2020-03-09 18:08:18 +01:00
Ines Montani
1d6aec805d Fix formatting and update docs for v2.2.4 2020-03-09 11:17:20 +01:00
Mark Abraham
0345135167
Tokenizer to_disk and from_disk now ensure paths (#5116)
* Tokenizer to_disk and from_disk now ensure strings are converted to paths

Fixes #5115

* Sign contributor agreement
2020-03-08 13:25:56 +01:00
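A minimal sketch of the idea in the commit above (#5116): string arguments are converted to `pathlib.Path` objects before any file I/O. The helper name `ensure_path` and its behaviour here are illustrative assumptions, not a copy of the project's code.

```python
from pathlib import Path


def ensure_path(path):
    """Hypothetical helper: convert str input to a Path, pass anything else through."""
    if isinstance(path, str):
        return Path(path)
    return path


# Both call styles end up with a Path object that supports `/` joins.
from_str = ensure_path("models/tokenizer")
from_path = ensure_path(Path("models") / "tokenizer")
```

Normalizing once at the entry point means the rest of a `to_disk`/`from_disk` style method can rely on `Path` methods without checking the argument type again.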
adrianeboyd
993758c58f
Remove unnecessary iterator in Language.pipe (#5101)
Remove iterator over `raw_texts` with `itertools.tee()` in
`Language.pipe` that is never consumed and consumes memory
unnecessarily.
2020-03-08 13:22:25 +01:00
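To illustrate why an unconsumed branch costs memory, here is a small sketch assuming the commit above (#5101) refers to Python's `itertools.tee`. The function names are hypothetical and the processing step (`upper()`) is a placeholder; this is not the body of `Language.pipe`.

```python
import itertools


def pipe_with_unused_tee(texts):
    # tee() keeps every item read by `used` in a shared buffer until
    # `_unused` reads it too. `_unused` is never read, so the buffer
    # grows to hold the whole input while this generator is alive.
    used, _unused = itertools.tee(texts)
    for text in used:
        yield text.upper()


def pipe_direct(texts):
    # Same output without the extra branch: items can be released
    # as soon as they have been processed.
    for text in texts:
        yield text.upper()


with_tee = list(pipe_with_unused_tee(iter(["a", "b", "c"])))
direct = list(pipe_direct(iter(["a", "b", "c"])))
```

Both versions yield the same results; only the second avoids holding on to already processed input.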
Sofie Van Landeghem
1a2b8fc264
set vector of merged entity (#5085)
* merge_entities sets the vector in the vocab for the merged token

* add unit test

* import unicode_literals

* move code to _merge function

* only set vector if vocab has non-zero vectors
2020-03-06 14:45:28 +01:00
Muhammad Irfan
224a7f8e94 examples 2020-03-04 15:49:06 +05:00
Muhammad Irfan
03376c9d9b Basque language added and tested. 2020-03-04 11:58:56 +05:00
adrianeboyd
9be90dbca3
Improve token head verification (#5079)
* Improve token head verification

Improve the verification for valid token heads when heads are set:

* in `Token.head`: heads come from the same document
* in `Doc.from_array()`: head indices are within the bounds of the
document

* Improve error message
2020-03-03 21:44:51 +01:00
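The bounds check described in the commit above (#5079) can be sketched as follows. This is an illustration only: the function name is hypothetical, and it assumes heads are stored as offsets relative to each token's own position, with 0 meaning the token is its own head.

```python
def check_head_offsets(offsets):
    """Raise ValueError if any token's head falls outside the document.

    `offsets[i]` is assumed to be the head position relative to token i.
    """
    n = len(offsets)
    for i, offset in enumerate(offsets):
        head = i + offset
        if not 0 <= head < n:
            raise ValueError(
                f"Token {i} has head index {head}, "
                f"which is outside a document of length {n}"
            )


# Token 0 points to token 1, token 1 is its own head, token 2 points to token 1.
check_head_offsets([1, 0, -1])
```

Failing early with a descriptive message is easier to debug than an out-of-range head surfacing later during tree traversal.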
adrianeboyd
8c20dae6f7
Fix model-final/model-best meta from train CLI (#5093)
* Fix model-final/model-best meta

* include speed and accuracy from final iteration
* combine with speeds from base model if necessary

* Include token_acc metric for all components
2020-03-03 21:43:25 +01:00
Sofie Van Landeghem
a0998868ff
prevent updating cfg if the Model was already defined (#5078) 2020-03-03 13:58:56 +01:00
Sofie Van Landeghem
d307e9ca58
take care of global vectors in multiprocessing (#5081)
* restore load_nlp.VECTORS in the child process

* add unit test

* fix test

* remove unnecessary import

* add utf8 encoding

* import unicode_literals
2020-03-03 13:58:22 +01:00
adrianeboyd
d078b47c81
Break out of infinite loop as intended (#5077) 2020-03-03 12:29:05 +01:00
adrianeboyd
697bec764d
Normalize IS_SENT_START to SENT_START for Matcher (#5080) 2020-03-03 12:22:39 +01:00
adrianeboyd
2281c4708c
Restore empty tokenizer properties (#5026)
* Restore empty tokenizer properties

* Check for types in tokenizer.from_bytes()

* Add test for setting empty tokenizer rules
2020-03-02 11:55:02 +01:00
Sofie Van Landeghem
c6b12ab02a
Bugfix/get doc (#5049)
* new (broken) unit test

* fixing get_doc method
2020-03-02 11:49:28 +01:00
adrianeboyd
65d7bab10f
Initialize all values in a2b/b2a in new align (#5063) 2020-02-27 18:43:00 +01:00
Adriane Boyd
9f740a9891 Add a few more Danish tokenizer exceptions 2020-02-26 14:59:03 +01:00
Ines Montani
1c212215cd
Merge pull request #5064 from adrianeboyd/feature/german-tokenization
Improve German tokenization
2020-02-26 13:41:44 +01:00
Adriane Boyd
d1f703d78d Improve German tokenization
Improve German tokenization with respect to Tiger.
2020-02-26 13:06:52 +01:00
Ines Montani
ed9358420e Merge branch 'master' into pr/5060 2020-02-26 12:51:29 +01:00
adrianeboyd
ff184b7a9c
Add tag_map argument to CLI debug-data and train (#4750) (#5038)
Add an argument for a path to a JSON-formatted tag map, which is used to
update and extend the default language tag map.
2020-02-26 12:10:38 +01:00
svlandeg
18ff97589d update spacy to 2.2.4.dev0 2020-02-26 10:50:05 +01:00
Ines Montani
d50152b917
Merge pull request #5019 from questoph/master
Optimizing tokenization for Luxembourgish (dealing with apostrophe infixes)
2020-02-25 14:48:50 +01:00
Ines Montani
4440a072d2
Merge pull request #5006 from svlandeg/bugfix/multiproc-underscore
load Underscore state when multiprocessing
2020-02-25 14:46:02 +01:00
svlandeg
b49a3afd0c use clean_underscore fixture 2020-02-23 15:49:20 +01:00
Tom Keefe
ddf63b97a8
make idx available via to_array (#5030) 2020-02-22 14:13:06 +01:00
Sofie Van Landeghem
44f4142ce4
add two abbreviations and some additional unit tests (#5040) 2020-02-22 14:12:32 +01:00
Sofie Van Landeghem
479bd8d09f
add lemma option to displacy 'dep' visualiser (#5041)
* add lemma option to displacy 'dep' visualiser

* more compact list comprehension

* add option to doc

* fix test and add lemmas to util.get_doc

* fix capital

* remove lemma from get_doc

* cleanup
2020-02-22 14:11:51 +01:00
adrianeboyd
2164e71ea8
Improved Romanian tokenization for UD RRT (#5036)
Modifications to Romanian tokenization to improve tokenization for
UD_Romanian-RRT.
2020-02-19 16:15:59 +01:00
Jan Jessewitsch
c7e4fe9c5c
Fix/Improve German stop words (#5024)
* Fix German stop words

Two stop words ("einige" and "einigen") were stuck together.
Remove three nouns that may serve as stop words in a specific context (e.g. religious or news) but are not suitable for general use.

* Create Jan-711.md
2020-02-17 18:59:22 +01:00
Kabir Khan
f6ed07b85c
Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort ent_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set

* Run make_doc optimistically if using phrase matcher patterns.

* remove unused coveragerc I was testing with

* format

* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.

* Removing old add_patterns function

* Fixing spacing

* Make sure token_patterns are loaded as well; previously the generator was being emptied in from_disk
2020-02-16 18:17:47 +01:00
Sofie Van Landeghem
72c964bcf4
define pretrained_dims which is used by build_text_classifier (#5004) 2020-02-16 17:21:17 +01:00
adrianeboyd
3b22eb651b
Sync Span __eq__ and __hash__ (#5005)
* Sync Span __eq__ and __hash__

Use the same tuple for `__eq__` and `__hash__`, including all attributes
except `vector` and `vector_norm`.

* Update entity comparison in tests

Update `assert_docs_equal()` test util to compare `Span` properties for
ents rather than `Span` objects.
2020-02-16 17:20:36 +01:00
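The pattern named in the commit above (#5005), deriving `__eq__` and `__hash__` from one shared tuple, can be sketched in plain Python. The class `SimpleSpan` and its fields are hypothetical stand-ins chosen for illustration; they are not the attributes the project's `Span` actually compares.

```python
class SimpleSpan:
    """Hypothetical stand-in for a span; not the library's Span class."""

    def __init__(self, doc_id, start, end, label=""):
        self.doc_id = doc_id
        self.start = start
        self.end = end
        self.label = label

    def _key(self):
        # Single source of truth for both equality and hashing.
        return (self.doc_id, self.start, self.end, self.label)

    def __eq__(self, other):
        if not isinstance(other, SimpleSpan):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a = SimpleSpan(doc_id=1, start=0, end=2, label="ORG")
b = SimpleSpan(doc_id=1, start=0, end=2, label="ORG")
c = SimpleSpan(doc_id=2, start=0, end=2, label="ORG")
unique = {a, b, c}  # a and b collapse into one entry
```

Because the document identity is part of the key in this sketch, spans from different documents never compare equal, which matches the consequence noted in the later test-util fix (#5146).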
adrianeboyd
0c47a53b5e
Use int only in key2row for better performance (#4990)
Cast all keys and rows to `int` in `vectors.key2row` for more efficient
access and serialization.
2020-02-16 17:19:41 +01:00
adrianeboyd
5b102963bf
Require HEAD for is_parsed in Doc.from_array() (#5011)
Modify flag settings so that `DEP` is not sufficient to set `is_parsed`
and only run `set_children_from_heads()` if `HEAD` is provided.

Then the combination `[SENT_START, DEP]` will set deps and not clobber
sent starts with a lot of one-word sentences.
2020-02-16 17:17:09 +01:00
Sofie Van Landeghem
2572460175
add tok2vec parameters to train script to facilitate init_tok2vec (#5021) 2020-02-16 17:16:41 +01:00
Sofie Van Landeghem
a27c77ce62
add message when cli train script throws exception (#5009)
* add message when cli train script throws exception

* fix formatting
2020-02-15 15:50:17 +01:00