Matthew Honnibal
c47c0269b1
Update morphology features
2019-09-11 15:16:53 +02:00
Ines Montani
af25323653
Tidy up and auto-format
2019-09-11 14:00:36 +02:00
Matthew Honnibal
af93997993
Fix conllu converter
2019-09-11 13:28:07 +02:00
Matthew Honnibal
178d010b25
Set version to 2.2.0.dev4
2019-09-11 12:28:37 +02:00
Ines Montani
e82a8d0d7a
Merge branch 'master' into develop
2019-09-11 11:52:38 +02:00
Ines Montani
8f9f48b04c
Add GreekLemmatizer.lookup (resolves #4272)
2019-09-11 11:44:40 +02:00
Ines Montani
6279d74c65
Tidy up and auto-format
2019-09-11 11:38:22 +02:00
Matthew Honnibal
7b858ba606
Update from master
2019-09-10 20:14:08 +02:00
Matthew Honnibal
c181a94e75
Require thinc 7.1.1
2019-09-10 20:12:24 +02:00
Ines Montani
669a7d37ce
Exclude vocab when testing to_bytes
2019-09-10 19:45:16 +02:00
Matthew Honnibal
28741ff5db
Require preshed v3.0.0
2019-09-10 19:13:07 +02:00
adrianeboyd
e367864e59
Update Ukrainian create_lemmatizer kwargs (#4266)
...
Allow Ukrainian create_lemmatizer to accept lookups kwarg.
2019-09-10 11:14:46 +02:00
adrianeboyd
c32126359a
Allow period as suffix following punctuation (#4248)
...
Addresses rare cases (such as `_MATH_.`, see #1061) where the final
period was not recognized as a suffix following punctuation.
2019-09-09 19:19:22 +02:00
Ines Montani
3e8f136ba7
💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178)
...
* Improve load_language_data helper
* WIP: Add Lookups implementation
* Start moving lemma data over to JSON
* WIP: move data over for more languages
* Convert more languages
* Fix lemmatizer fixtures in tests
* Finish conversion
* Auto-format JSON files
* Fix test for now
* Make sure tables are stored on instance
* Update docstrings
* Update docstrings and errors
* Update test
* Add Lookups.__len__
* Add serialization methods
* Add Lookups.remove_table
* Use msgpack for serialization to disk
* Fix file exists check
* Try using OrderedDict for everything
* Update .flake8 [ci skip]
* Try fixing serialization
* Update test_lookups.py
* Update test_serialize_vocab_strings.py
* Fix serialization for lookups
* Fix lookups
* Fix lookups
* Fix lookups
* Try to fix serialization
* Try to fix serialization
* Try to fix serialization
* Try to fix serialization
* Give up on serialization test
* Xfail more serialization tests for 3.5
* Fix lookups for 2.7
2019-09-09 19:17:55 +02:00
Sofie Van Landeghem
482c7cd1b9
pulling tqdm imports in functions to avoid bug (tmp fix) (#4263)
2019-09-09 16:32:11 +02:00
Mihai Gliga
25aecd504f
adding Romanian tag_map (#4257)
...
* adding Romanian tag_map
* added SCA file
* forgotten import
2019-09-09 11:53:09 +02:00
Matthew Honnibal
1653b818c5
Update Lithuanian tag map
2019-09-08 20:57:58 +02:00
adrianeboyd
3780e2ff50
Flush tokenizer cache when necessary (#4258)
...
Flush tokenizer cache when affixes, token_match, or special cases are
modified.
Fixes #4238, same issue as in #1250.
2019-09-08 20:52:46 +02:00
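The cache-flushing behaviour described in the commit above can be illustrated with a minimal sketch. This is not spaCy's tokenizer code: `CachingTokenizer`, its `split` method, and the suffix-stripping logic are hypothetical names invented for illustration. The sketch only shows why a cache of previously split chunks has to be cleared when the rules that produced it change.

```python
class CachingTokenizer:
    """Hypothetical toy tokenizer that caches how each chunk was split."""

    def __init__(self, suffixes=(".",)):
        self._suffixes = tuple(suffixes)
        self._cache = {}

    @property
    def suffixes(self):
        return self._suffixes

    @suffixes.setter
    def suffixes(self, value):
        # Cached splits were computed with the old suffix rules,
        # so they are discarded whenever the rules are replaced.
        self._suffixes = tuple(value)
        self._cache.clear()

    def split(self, chunk):
        # Return the cached split if this chunk was seen before.
        if chunk in self._cache:
            return self._cache[chunk]
        rest, trailing = chunk, []
        stripped = True
        while stripped and rest:
            stripped = False
            for suffix in self._suffixes:
                # Peel one known suffix off the end, keeping a non-empty stem.
                if suffix and rest.endswith(suffix) and len(rest) > len(suffix):
                    trailing.insert(0, suffix)
                    rest = rest[: -len(suffix)]
                    stripped = True
                    break
        result = [rest] + trailing
        self._cache[chunk] = result
        return result


tok = CachingTokenizer(suffixes=["."])
before = tok.split("end.!")      # "!" is not yet a suffix
tok.suffixes = [".", "!"]        # modifying the rules flushes the cache
after = tok.split("end.!")
```

Without the `clear()` call in the setter, the second `split` would return the stale result computed under the old rules.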
Matthew Honnibal
da8830d909
Set version to v2.2.0.dev3
2019-09-08 18:22:03 +02:00
Matthew Honnibal
1a65c5b7af
Update develop from master
2019-09-08 18:21:41 +02:00
Matthew Honnibal
aec6174ae6
Fix lemmatizer
2019-09-08 18:09:53 +02:00
Matthew Honnibal
fde4f8ac8e
Create lookups if not passed in
2019-09-08 18:08:09 +02:00
Pavle Vidanović
d03401f532
Lemmatizer lookup dictionary for Serbian and basic tag set adde… (#4251)
...
* Serbian stopwords added. (cyrillic alphabet)
* spaCy Contribution agreement included.
* Test initialize updated
* Serbian language code update. --bugfix
* Tokenizer exceptions added. Init file updated.
* Norm exceptions and lexical attributes added.
* Examples added.
* Tests added.
* sr_lang examples update.
* Tokenizer exceptions updated. (Serbian)
* Lemmatizer created. Licence included.
* Test updated.
* Tag map basic added.
* tag_map.py file removed since it uses default spacy tags.
2019-09-08 14:19:15 +02:00
Ivan Šarić
b01025dd06
adds Croatian lemma_lookup.json, license file and corresponding tests (#4252)
2019-09-08 13:40:45 +02:00
adrianeboyd
aec755d3a3
Modify retokenizer to use span root attributes (#4219)
...
* Modify retokenizer to use span root attributes
* tag/pos/morph are set to root tag/pos/morph
* lemma and norm are reset and end up as orth (not ideal, but better
than orth of first token)
* Also handle individual merge case
* Add test
* Attempt to handle ent_iob and ent_type in merges
* Fix check for whether B-ENT should become I-ENT
* Move IOB consistency check to after attrs
Move all IOB consistency checks after attrs are set and simplify to
check entire document, modifying I to B at the beginning of the document
or if the entity type of the previous token isn't the same.
* Move IOB consistency check for single merge
Move IOB consistency check after the token array is compressed for the
single merge case.
* Update spacy/tokens/_retokenize.pyx
Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
* Remove single vs. multiple merge distinction
Remove original single-instance `_merge()` and use `_bulk_merge()` (now
renamed `_merge()`) for all merges.
* Add out-of-bound check in previous entity check
2019-09-08 13:04:49 +02:00
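The IOB consistency rule described in the retokenizer commit above (an `I` tag becomes `B` at the beginning of the document, or when the previous token's entity type differs) can be sketched as follows. This is a plain-Python illustration, not the code in `spacy/tokens/_retokenize.pyx`: the function name `fix_iob_tags` and the `(ent_iob, ent_type)` tuple representation are assumptions made for the example, and tokens outside any entity are assumed to carry an empty entity type.

```python
def fix_iob_tags(tokens):
    """Return a copy of `tokens` with inconsistent I tags changed to B.

    `tokens` is a list of (ent_iob, ent_type) pairs, where ent_iob is
    "B", "I" or "O" and ent_type is "" for tokens outside an entity.
    This representation is hypothetical and chosen for illustration.
    """
    fixed = []
    for i, (iob, ent_type) in enumerate(tokens):
        # An I tag cannot continue an entity at the start of the document
        # or after a token whose entity type is different.
        if iob == "I" and (i == 0 or tokens[i - 1][1] != ent_type):
            iob = "B"
        fixed.append((iob, ent_type))
    return fixed


merged = [("I", "ORG"), ("I", "ORG"), ("O", ""), ("I", "PER"), ("I", "ORG")]
repaired = fix_iob_tags(merged)
```

Running the check over the whole document after all attributes are set, as the commit message describes, means a single pass like this is enough regardless of how many spans were merged.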
Sofie Van Landeghem
53a9ca45c9
Docs: bufsize instead of buffsize (#4247)
2019-09-06 11:11:54 +02:00
Sofie Van Landeghem
6b012cebff
Make pos/tag distinction more clear in docs (#4246)
...
* make distinction between tag and pos more prominent in docs
* out of the 101
2019-09-06 10:31:21 +02:00
Bae Yong-Ju
a55f5a744f
Fix ValueError exception on empty Korean text. (#4245)
2019-09-06 10:29:40 +02:00
Ines Montani
232a029de6
Send referrer for internal links [ci skip]
2019-09-05 10:41:46 +02:00
Matthew Honnibal
d039ed2267
Merge pull request #4237 from adrianeboyd/feature/gold-train-orth-variants
...
Add guillemets/chevrons to German orth variants
2019-09-04 23:10:49 +02:00
Matthew Honnibal
b94c34ec8f
Merge pull request #4239 from adrianeboyd/bugfix/tokenizer-cache-test-1061
...
Add regression test for #1061 back to test suite
2019-09-04 23:10:12 +02:00
Adriane Boyd
0f28418446
Add regression test for #1061 back to test suite
2019-09-04 20:42:24 +02:00
Adriane Boyd
c39c13f26b
Add guillemets/chevrons to German orth variants
...
Add guillemets/chevrons to German orth variants for both German/Austrian
and Swiss conventions.
2019-09-04 20:05:08 +02:00
Ines Montani
2f31f96fce
Update languages.json [ci skip]
2019-09-04 18:15:42 +02:00
Ines Montani
2245e95e2d
Update languages.json [ci skip]
2019-09-04 17:11:40 +02:00
Matthew Honnibal
17c039406b
Merge pull request #4232 from adrianeboyd/bugfix/entityruler-ner-4229
...
Fix handling of preset entities in NER
2019-09-04 15:02:31 +02:00
Adriane Boyd
6b0fec76fd
Fix handling of preset entities in NER
...
* Fix check of valid ent_type for B
* Add valid L as preset-I followed by not-I
2019-09-04 13:42:42 +02:00
Ines Montani
419ae59c79
Make flaky test test_issue_1971_4 more explicit
2019-08-31 14:08:05 +02:00
Ines Montani
dad5621166
Tidy up and auto-format [ci skip]
2019-08-31 13:39:31 +02:00
Ines Montani
cd90752193
Tidy up and auto-format [ci skip]
2019-08-31 13:39:06 +02:00
Ines Montani
bcd1b12f43
Add contributor agreement [ci skip]
2019-08-30 17:02:43 +02:00
Matthew Honnibal
67c3d03905
Revert morphology serialisation
2019-08-30 13:13:07 +02:00
Matthew Honnibal
efcb51ddc8
Merge pull request #4217 from adrianeboyd/bugfix/morph-en-serialization
...
Morphology tag_map-related bugfixes
2019-08-30 12:46:29 +02:00
Adriane Boyd
893f11a9e3
Serialize tag_map directly
...
Fix Aspect_prof typo
2019-08-30 11:30:03 +02:00
Adriane Boyd
02babf9317
English tag map without unsupported features/values
2019-08-30 11:29:19 +02:00
Matthew Honnibal
516650f58f
Merge pull request #4207 from svlandeg/bugfix/serialize-tok-exc
...
Bugfix for serializing tokenizer rules/exceptions
2019-08-30 11:04:58 +02:00
Matthew Honnibal
f3c3ce7f1e
Update vocab
2019-08-29 21:19:54 +02:00
Matthew Honnibal
fc0a3c8c38
Add morphology serialization
2019-08-29 21:17:34 +02:00
Matthew Honnibal
c94fc9edb9
Fix noise addition
2019-08-29 15:39:32 +02:00
Matthew Honnibal
32842a3cd4
Disable whitespace corruption
2019-08-29 15:01:58 +02:00