mirror of https://github.com/explosion/spaCy.git synced 2025-01-18 05:24:12 +03:00
Commit Graph

10540 Commits

Author SHA1 Message Date
Adriane Boyd
e7e7c942c7 Merge branch 'feature/ud-script-update' into bugfix/tokenizer-special-cases-matcher 2019-09-16 14:24:33 +02:00
Adriane Boyd
33946d2ef8 Use special Matcher only for cases with affixes
* Reinsert specials cache checks during normal tokenization for special
  cases as much as possible
  * Additionally include specials cache checks while splitting on infixes
  * Since the special Matcher needs consistent affix-only tokenization
    for the special cases themselves, introduce the argument
    `with_special_cases` in order to do tokenization with or without
    specials cache checks
* After normal tokenization, postprocess with special cases Matcher for
  special cases containing affixes
2019-09-16 14:16:30 +02:00
Adriane Boyd
6585418804 Update UD bin scripts
* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
  with tokens but not lemmas like Telugu)
2019-09-15 20:42:53 +02:00
Ines Montani
04d36d2471 Remove unused link [ci skip] 2019-09-14 16:41:19 +02:00
Ines Montani
76d26a3d5e Update site.json [ci skip] 2019-09-14 16:32:24 +02:00
adrianeboyd
6942a6a69b Extend default punct for sentencizer ()
Most of these characters are for languages / writing systems that aren't
supported by spaCy, but I don't think it causes problems to include
them. In the UD evals, Hindi and Urdu improve a lot as expected (from
0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil
improves in combination with .

The punctuation list is converted to a set internally because of its
increased length.

Sentence final punctuation generated with:

```
unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
```

See: https://stackoverflow.com/a/9508766/461847

Fixes .
2019-09-14 15:25:48 +02:00
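The set conversion mentioned in the commit above can be illustrated with a minimal sketch. This is not spaCy's actual `Sentencizer` code; the abridged character list and method name here are assumptions for the example.

```python
# Abridged stand-in for the extended default punctuation list, which
# now includes Devanagari danda, Arabic question mark, CJK full stop, etc.
DEFAULT_PUNCT_CHARS = ["!", ".", "?", "\u0964", "\u061f", "\u3002"]


class Sentencizer:
    def __init__(self, punct_chars=None):
        # Stored as a set: with a long character list, membership checks
        # on every token would otherwise scan a list linearly.
        self.punct_chars = set(punct_chars) if punct_chars else set(DEFAULT_PUNCT_CHARS)

    def is_sentence_final(self, text):
        return text in self.punct_chars


s = Sentencizer()
print(s.is_sentence_final("\u0964"))  # Devanagari danda
```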
adrianeboyd
bee7961927 Add Kannada, Tamil, and Telugu unicode blocks ()
Add Kannada, Tamil, and Telugu unicode blocks to uncased character
classes so that period is recognized as a suffix during tokenization.

(I'm sure a few symbols in the code blocks should not be ALPHA, but this
is mainly relevant for suffix detection and seems to be an improvement
in practice.)
2019-09-14 14:23:06 +02:00
Euan Dowers
a6830d60e8 Changes to wiki_entity_linker ()
* Changes to wiki_entity_linker

* No more f-strings

* Make some requested changes

* Add back option to get descriptions from wd not wp

* Fix logs

* Address comments and clean evaluation

* Remove type hints

* Refactor evaluation, add back metrics by label

* Address comments

* Log training performance as well as dev
2019-09-13 17:03:57 +02:00
Sofie Van Landeghem
2ae5db580e dim bugfix when incl_prior is False () 2019-09-13 16:30:05 +02:00
Ines Montani
228bbf506d Improve label properties on pipes 2019-09-12 18:02:44 +02:00
Sofie Van Landeghem
9be4d1c105 Allow copying of user_data in as_doc ()
* Allow copying the user_data with as_doc + unit test

* add option to docs

* add typing

* import fix

* workaround to avoid bool clashing ...

* bint instead of bool
2019-09-12 17:08:14 +02:00
Ines Montani
0760c41393 Change st_ctime to st_mtime 2019-09-12 15:35:01 +02:00
Ines Montani
4d4b3b0783 Add "labels" to Language.meta 2019-09-12 11:34:25 +02:00
Ines Montani
ac0e27a825
💫 Add Language.pipe_labels ()
* Add Language.pipe_labels

* Update spacy/language.py

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
2019-09-12 10:56:28 +02:00
tamuhey
71909cdf22 Fix iss4278 ()
* fix: len(tuple) == 2

* () add fail test

* add contributor's agreement
2019-09-12 10:44:49 +02:00
Ines Montani
8ebc3711dc Fix bug in Parser.labels and add test () 2019-09-11 18:29:35 +02:00
Adriane Boyd
b097b0b83d Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher 2019-09-11 15:23:03 +02:00
Matthew Honnibal
af93997993 Fix conllu converter 2019-09-11 13:28:07 +02:00
Ines Montani
8f9f48b04c Add GreekLemmatizer.lookup (resolves ) 2019-09-11 11:44:40 +02:00
Ines Montani
6279d74c65 Tidy up and auto-format 2019-09-11 11:38:22 +02:00
Adriane Boyd
104cb93d8b Remove reinitialized PreshMaps on cache flush 2019-09-10 23:15:14 +02:00
Adriane Boyd
cf7047bbdf Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher 2019-09-10 22:30:41 +02:00
Ines Montani
669a7d37ce Exclude vocab when testing to_bytes 2019-09-10 19:45:16 +02:00
adrianeboyd
e367864e59 Update Ukrainian create_lemmatizer kwargs ()
Allow Ukrainian create_lemmatizer to accept lookups kwarg.
2019-09-10 11:14:46 +02:00
Adriane Boyd
d277b6bc68 Improve cache flushing in tokenizer
* Separate cache and specials memory (temporarily)
* Flush cache when adding special cases
* Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()`
are necessary due to this bug:
https://github.com/explosion/preshed/issues/21
2019-09-10 09:55:28 +02:00
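The flushing behaviour described above can be sketched with plain dicts in place of `PreshMap` (the real code is Cython; the class shape here is an assumption for illustration only).

```python
class Tokenizer:
    def __init__(self):
        self._cache = {}      # general tokenization cache
        self._specials = {}   # special-case cache, kept separate (temporarily)

    def add_special_case(self, string, token_attrs):
        self._specials[string] = token_attrs
        # Adding a special case can invalidate previously cached
        # tokenizations, so flush the general cache by reassigning it
        # (mirroring the repeated `self._cache = PreshMap()` workaround).
        self._cache = {}


tok = Tokenizer()
tok._cache["hello"] = ["hello"]
tok.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't"}])
print(tok._cache)  # flushed
```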
Adriane Boyd
ae52c5eb52 Fix offset and whitespace in Matcher special cases
* Fix offset bugs when merging and splitting tokens
* Set final whitespace on final token in inserted special case
2019-09-10 09:48:34 +02:00
Adriane Boyd
11ba042aca Update error code number 2019-09-10 09:09:46 +02:00
Adriane Boyd
cfc318b76c Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher 2019-09-10 09:07:44 +02:00
adrianeboyd
c32126359a Allow period as suffix following punctuation ()
Addresses rare cases (such as `_MATH_.`, see ) where the final
period was not recognized as a suffix following punctuation.
2019-09-09 19:19:22 +02:00
Ines Montani
3e8f136ba7 💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data ()
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Fix serialization for lookups

* Fix lookups

* Fix lookups

* Fix lookups

* Try to fix serialization

* Try to fix serialization

* Try to fix serialization

* Try to fix serialization

* Give up on serialization test

* Xfail more serialization tests for 3.5

* Fix lookups for 2.7
2019-09-09 19:17:55 +02:00
Sofie Van Landeghem
482c7cd1b9 pulling tqdm imports in functions to avoid bug (tmp fix) () 2019-09-09 16:32:11 +02:00
Mihai Gliga
25aecd504f adding Romanian tag_map ()
* adding Romanian tag_map

* added SCA file

* forgotten import
2019-09-09 11:53:09 +02:00
Adriane Boyd
5eeaffe14f Reload special cases when necessary
Reload special cases when affixes or token_match are modified. Skip
reloading during initialization.
2019-09-08 22:40:08 +02:00
Adriane Boyd
64f86b7e97 Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher 2019-09-08 21:30:01 +02:00
Adriane Boyd
d1679819ab Really remove accidentally added test 2019-09-08 20:58:22 +02:00
adrianeboyd
3780e2ff50 Flush tokenizer cache when necessary ()
Flush tokenizer cache when affixes, token_match, or special cases are
modified.

Fixes , same issue as in .
2019-09-08 20:52:46 +02:00
Adriane Boyd
e4cba2f1ee Remove accidentally added test case 2019-09-08 20:48:05 +02:00
Adriane Boyd
5861308910 Generalize handling of tokenizer special cases
Handle tokenizer special cases more generally by using the Matcher
internally to match special cases after the affix/token_match
tokenization is complete.

Instead of only matching special cases while processing balanced or
nearly balanced prefixes and suffixes, this recognizes special cases in
a wider range of contexts:

* Allows arbitrary numbers of prefixes/affixes around special cases
* Allows special cases separated by infixes

Existing tests/settings that couldn't be preserved as before:

* The emoticon '")' is no longer a supported special case
* The emoticon ':)' in "example:)" is a false positive again

When merged with  (or the relevant cache bugfix), the affix and
token_match properties should be modified to flush and reload all
special cases to use the updated internal tokenization with the Matcher.
2019-09-08 20:35:16 +02:00
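The postprocessing idea in the commit above can be roughly illustrated in pure Python. The real implementation uses the `Matcher` over `Doc` objects; this sketch's function name and string-keyed mapping are assumptions, and it only shows the core substitution step, not merging across infixes.

```python
def apply_special_cases(token_texts, special_cases):
    """After affix/infix tokenization, replace any token whose text is a
    known special case with that case's sub-tokens."""
    out = []
    for text in token_texts:
        # special_cases maps a surface string to its desired sub-token
        # texts, e.g. {"don't": ["do", "n't"]}
        out.extend(special_cases.get(text, [text]))
    return out


print(apply_special_cases(["I", "don't", "know"], {"don't": ["do", "n't"]}))
```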
Pavle Vidanović
d03401f532 Lemmatizer lookup dictionary for Serbian and basic tag set adde… ()
* Serbian stopwords added. (Cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix

* Tokenizer exceptions added. Init file updated.

* Norm exceptions and lexical attributes added.

* Examples added.

* Tests added.

* sr_lang examples update.

* Tokenizer exceptions updated. (Serbian)

* Lemmatizer created. Licence included.

* Test updated.

* Tag map basic added.

* tag_map.py file removed since it uses default spacy tags.
2019-09-08 14:19:15 +02:00
Ivan Šarić
b01025dd06 adds Croatian lemma_lookup.json, license file and corresponding tests () 2019-09-08 13:40:45 +02:00
adrianeboyd
aec755d3a3 Modify retokenizer to use span root attributes ()
* Modify retokenizer to use span root attributes

* tag/pos/morph are set to root tag/pos/morph

* lemma and norm are reset and end up as orth (not ideal, but better
than orth of first token)

* Also handle individual merge case

* Add test

* Attempt to handle ent_iob and ent_type in merges

* Fix check for whether B-ENT should become I-ENT

* Move IOB consistency check to after attrs

Move all IOB consistency checks after attrs are set and simplify to
check entire document, modifying I to B at the beginning of the document
or if the entity type of the previous token isn't the same.

* Move IOB consistency check for single merge

Move IOB consistency check after the token array is compressed for the
single merge case.

* Update spacy/tokens/_retokenize.pyx

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>

* Remove single vs. multiple merge distinction

Remove original single-instance `_merge()` and use `_bulk_merge()` (now
renamed `_merge()`) for all merges.

* Add out-of-bound check in previous entity check
2019-09-08 13:04:49 +02:00
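The attribute rules from the commit above can be sketched as plain functions. These are illustrations only, working on dicts/lists rather than spaCy's `Doc`/`Span`; the function names and the `ws` field for trailing whitespace are assumptions.

```python
def merged_attrs(span_tokens, root_index):
    """tag/pos come from the span's root; lemma and norm are reset to the
    merged surface form (orth), as the commit notes."""
    root = span_tokens[root_index]
    orth = "".join(t["text"] + t.get("ws", "") for t in span_tokens).rstrip()
    return {"text": orth, "tag": root["tag"], "pos": root["pos"],
            "lemma": orth, "norm": orth}


def fix_iob(ent_iobs, ent_types):
    """Whole-document consistency pass: an I- at the start of the document,
    or following a token outside the entity or of a different type,
    becomes B-."""
    fixed = list(ent_iobs)
    for i, iob in enumerate(ent_iobs):
        if iob == "I" and (i == 0
                           or ent_iobs[i - 1] == "O"
                           or ent_types[i - 1] != ent_types[i]):
            fixed[i] = "B"
    return fixed


tokens = [{"text": "New", "ws": " ", "tag": "NNP", "pos": "PROPN"},
          {"text": "York", "ws": "", "tag": "NNP", "pos": "PROPN"}]
print(merged_attrs(tokens, 1))
print(fix_iob(["I", "I", "O", "I"], ["PER", "PER", "", "ORG"]))
```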
Sofie Van Landeghem
53a9ca45c9 Docs: bufsize instead of buffsize () 2019-09-06 11:11:54 +02:00
Sofie Van Landeghem
6b012cebff Make pos/tag distinction more clear in docs ()
* make distinction between tag and pos more prominent in docs

* out of the 101
2019-09-06 10:31:21 +02:00
Bae Yong-Ju
a55f5a744f Fix ValueError exception on empty Korean text. () 2019-09-06 10:29:40 +02:00
Ines Montani
232a029de6 Send referrer for internal links [ci skip] 2019-09-05 10:41:46 +02:00
Matthew Honnibal
b94c34ec8f
Merge pull request from adrianeboyd/bugfix/tokenizer-cache-test-1061
Add regression test for  back to test suite
2019-09-04 23:10:12 +02:00
Adriane Boyd
0f28418446 Add regression test for back to test suite 2019-09-04 20:42:24 +02:00
Ines Montani
2f31f96fce Update languages.json [ci skip] 2019-09-04 18:15:42 +02:00
Ines Montani
2245e95e2d Update languages.json [ci skip] 2019-09-04 17:11:40 +02:00
Matthew Honnibal
17c039406b
Merge pull request from adrianeboyd/bugfix/entityruler-ner-4229
Fix handling of preset entities in NER
2019-09-04 15:02:31 +02:00