Commit Graph

6428 Commits

Author SHA1 Message Date
Ines Montani
2e5ab5b59c Make except more explicit 2019-09-18 19:57:08 +02:00
Ines Montani
1f648ecb76 Auto-format 2019-09-18 19:56:55 +02:00
Ines Montani
0f7fe5e7a7 Auto-format and fix typo and consistency 2019-09-18 19:18:30 +02:00
Matthew Honnibal
e53b86751f DocPallet -> DocBin 2019-09-18 15:15:37 +02:00
Matthew Honnibal
fa9a283128 Fix name 2019-09-18 13:40:03 +02:00
Matthew Honnibal
88a23cf49a Fix name 2019-09-18 13:38:29 +02:00
Matthew Honnibal
3507943b15 Add docstring for DocPallet 2019-09-18 13:25:47 +02:00
Matthew Honnibal
1c8de6b2e5 Rename DocBox->DocPallet 2019-09-18 13:13:51 +02:00
Ines Montani
691e0088cf Remove duplicate tok2vec property (closes #4302) 2019-09-17 11:22:03 +02:00
Ines Montani
a84025d70b Remove --no-deps from default pip args on download
Add warning if user is executing spaCy without having it installed and add --no-deps to prevent the package from being redownloaded
2019-09-16 23:32:41 +02:00
Matthew Honnibal
84c65f9455 Merge branch 'master' into develop 2019-09-16 22:12:20 +02:00
Matthew Honnibal
47055d5988 Fix type declarations in _merge method 2019-09-16 22:10:13 +02:00
Sofie Van Landeghem
03ac29f437 Ensure that doc.ents preserves kb_id annotations (#4294)
* bugfix: ensure doc.ents preserves kb_id annotations

* fix backward compatibility

* additional test
2019-09-16 15:18:37 +02:00
Ines Montani
139428c20f Set unique vector names in tests 2019-09-16 15:16:54 +02:00
Ines Montani
bf06d9d537 Allow passing vectors_name to Vocab 2019-09-16 15:16:41 +02:00
Ines Montani
cb6c68a573 Pass vectors name correctly in prune_vectors 2019-09-16 15:16:29 +02:00
Ines Montani
3ba5238282 Make "unnamed vectors" warning a real warning 2019-09-16 15:16:12 +02:00
adrianeboyd
b5d999e510 Add textcat to train CLI (#4226)
* Add doc.cats to spacy.gold at the paragraph level

Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.

* `spacy.gold.docs_to_json()` writes `docs.cats`

* `GoldCorpus` reads in cats in each `GoldParse`

* Update instances of gold_tuples to handle cats

Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.

* Add textcat to train CLI

* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
  * For binary exclusive classes with provided label: F1 for label
  * For 2+ exclusive classes: F1 macro average
  * For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI

* Fix handling of unset arguments and config params

Fix handling of unset arguments and model confiug parameters in Scorer
initialization.

* Temporarily add sklearn requirement

* Remove sklearn version number

* Improve Scorer handling of models without textcats

* Fixing Scorer handling of models without textcats

* Update Scorer output for python 2.7

* Modify inf in Scorer for python 2.7

* Auto-format

Also make small adjustments to make auto-formatting with black easier and produce nicer results

* Move error message to Errors

* Update documentation

* Add cats to annotation JSON format [ci skip]

* Fix tpl flag and docs [ci skip]

* Switch to internal roc_auc_score

Switch to internal `roc_auc_score()` adapted from scikit-learn.

* Add AUCROCScore tests and improve errors/warnings

* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors

* Make reduced roc_auc_score functions private

Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.

* Check that data corresponds with multilabel flag

Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.

* Add textcat score to early stopping check

* Add more checks to debug-data for textcat

* Add example training data for textcat

* Add more checks to textcat train CLI

* Check configuration when extending base model
* Fix typos

* Update textcat example data

* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-15 22:31:31 +02:00
Ines Montani
bab9976d9a
💫 Adjust Table API and add docs (#4289)
* Adjust Table API and add docs

* Add attributes and update description [ci skip]

* Use strings.get_string_id instead of hash_string

* Fix table method calls

* Make orth arg in Lemmatizer.lookup optional

Fall back to string, which is now handled by Table.__contains__ out-of-the-box

* Fix method name

* Auto-format
2019-09-15 22:08:13 +02:00
Ines Montani
88a9d87f6f Fix test 2019-09-15 18:04:44 +02:00
Ines Montani
23e28e2844 Merge branch 'master' into develop 2019-09-15 17:57:09 +02:00
Ines Montani
c7e4ea7154 Update examples and languages.json [ci skip] 2019-09-15 17:56:40 +02:00
Ines Montani
aa3c59a2f3 Include Norwegian NER entity types in glossary [ci skip]
See https://github.com/ltgoslo/norne
2019-09-15 17:16:21 +02:00
Ines Montani
7194845234 Skip tests properly instead of xfailing them 2019-09-15 17:00:17 +02:00
Ines Montani
16c2522791 Merge branch 'master' into develop 2019-09-14 16:42:01 +02:00
adrianeboyd
6942a6a69b Extend default punct for sentencizer (#4290)
Most of these characters are for languages / writing systems that aren't
supported by spacy, but I don't think it causes problems to include
them. In the UD evals, Hindi and Urdu improve a lot as expected (from
0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil
improves in combination with #4288.

The punctuation list is converted to a set internally because of its
increased length.

Sentence final punctuation generated with:

```
unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
```

See: https://stackoverflow.com/a/9508766/461847

Fixes #4269.
2019-09-14 15:25:48 +02:00
adrianeboyd
bee7961927 Add Kannada, Tamil, and Telugu unicode blocks (#4288)
Add Kannada, Tamil, and Telugu unicode blocks to uncased character
classes so that period is recognized as a suffix during tokenization.

(I'm sure a few symbols in the code blocks should not be ALPHA, but this
is mainly relevant for suffix detection and seems to be an improvement
in practice.)
2019-09-14 14:23:06 +02:00
Ines Montani
3126dd0904 Tidy up and auto-format [ci skip] 2019-09-14 12:58:06 +02:00
Ines Montani
27106d6528 Merge branch 'master' into develop 2019-09-13 17:07:17 +02:00
Sofie Van Landeghem
2ae5db580e dim bugfix when incl_prior is False (#4285) 2019-09-13 16:30:05 +02:00
Paul O'Leary McCann
29a9e636eb Fix half-width space handling in JA (#4284) (closes #4262)
Before this patch, half-width spaces between words were simply lost in
Japanese text. This wasn't immediately noticeable because much Japanese
text never uses spaces at all.
2019-09-13 16:28:12 +02:00
Ines Montani
3c3658ef9f Merge branch 'master' into develop 2019-09-12 18:03:01 +02:00
Ines Montani
228bbf506d Improve label properties on pipes 2019-09-12 18:02:44 +02:00
Paul O'Leary McCann
7d8df69158 Bloom-filter backed Lookup Tables (#4268)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Lookups / Tables now work

This implements the stubs in the Lookups/Table classes. Currently this
is in Cython but with no type declarations, so that could be improved.

* Add lookups to setup.py

* Actually add lookups pyx

The previous commit added the old py file...

* Lookups work-in-progress

* Move from pyx back to py

* Add string based lookups, fix serialization

* Update tests, language/lemmatizer to work with string lookups

There are some outstanding issues here:

- a pickling-related test fails due to the bloom filter
- some custom lemmatizers (fr/nl at least) have issues

More generally, there's a question of how to deal with the case where
you have a string but want to use the lookup table. Currently the table
allows access by string or id, but that's getting pretty awkward.

* Change lemmatizer lookup method to pass (orth, string)

* Fix token lookup

* Fix French lookup

* Fix lt lemmatizer test

* Fix Dutch lemmatizer

* Fix lemmatizer lookup test

This was using a normal dict instead of a Table, so checks for the
string instead of an integer key failed.

* Make uk/nl/ru lemmatizer lookup methods consistent

The mentioned tokenizers all have their own implementation of the
`lookup` method, which accesses a `Lookups` table. The way that was
called in `token.pyx` was changed so this should be updated to have the
same arguments as `lookup` in `lemmatizer.py` (specificially (orth/id,
string)).

Prior to this change tests weren't failing, but there would probably be
issues with normal use of a model. More tests should proably be added.

Additionally, the language-specific `lookup` implementations seem like
they might not be needed, since they handle things like lower-casing
that aren't actually language specific.

* Make recently added Greek method compatible

* Remove redundant class/method

Leftovers from a merge not cleaned up adequately.
2019-09-12 17:26:11 +02:00
Sofie Van Landeghem
9be4d1c105 Allow copying of user_data in as_doc (#4282)
* Allow copying the user_data with as_doc + unit test

* add option to docs

* add typing

* import fix

* workaround to avoid bool clashing ...

* bint instead of bool
2019-09-12 17:08:14 +02:00
Matthew Honnibal
7d782aa97b Add more docstrings for MorphAnalysis 2019-09-12 16:48:30 +02:00
Ines Montani
b544dcb3c5 Document debug-data [ci skip] 2019-09-12 15:26:20 +02:00
Ines Montani
05a2df6616 Remove not implemented file validation [ci skip] 2019-09-12 15:26:02 +02:00
Ines Montani
10257f3131 Document Lookups [ci skip] 2019-09-12 14:00:14 +02:00
Ines Montani
32404e613c Create directory if it doesn't exist 2019-09-12 14:00:01 +02:00
Ines Montani
625ce2db8e Update Language docs [ci skip] 2019-09-12 13:03:38 +02:00
Ines Montani
655b434553 Merge branch 'master' into develop 2019-09-12 11:39:18 +02:00
Sofie Van Landeghem
0b4b4f1819 Documentation for Entity Linking (#4065)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bool's instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustements to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* typo fix

* add candidate API to kb documentation

* update API sidebar with EntityLinker and KnowledgeBase

* remove EL from 101 docs

* remove entity linker from 101 pipelines / rephrase

* custom el model instead of existing model

* set version to 2.2 for EL functionality

* update documentation for 2 CLI scripts
2019-09-12 11:38:34 +02:00
Ines Montani
4d4b3b0783 Add "labels" to Language.meta 2019-09-12 11:34:25 +02:00
Ines Montani
ac0e27a825
💫 Add Language.pipe_labels (#4276)
* Add Language.pipe_labels

* Update spacy/language.py

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
2019-09-12 10:56:28 +02:00
tamuhey
71909cdf22 Fix iss4278 (#4279)
* fix: len(tuple) == 2

* (#4278) add fail test

* add contributor's aggreement
2019-09-12 10:44:49 +02:00
Ines Montani
8ebc3711dc Fix bug in Parser.labels and add test (#4275) 2019-09-11 18:29:35 +02:00
Matthew Honnibal
7fbb559045 Set version to v2.2.0.dev6 2019-09-11 18:07:20 +02:00
Matthew Honnibal
f7a096b462 Update morphology 2019-09-11 18:06:43 +02:00
Matthew Honnibal
f8ce9dde0f Set version to v2.2.0.dev5 2019-09-11 17:41:21 +02:00
Matthew Honnibal
c47c0269b1 Update morphology features 2019-09-11 15:16:53 +02:00
Ines Montani
af25323653 Tidy up and auto-format 2019-09-11 14:00:36 +02:00
Matthew Honnibal
af93997993 Fix conllu converter 2019-09-11 13:28:07 +02:00
Matthew Honnibal
178d010b25 Set version to 2.2.0.dev4 2019-09-11 12:28:37 +02:00
Ines Montani
e82a8d0d7a Merge branch 'master' into develop 2019-09-11 11:52:38 +02:00
Ines Montani
8f9f48b04c Add GreekLemmatizer.lookup (resolves #4272) 2019-09-11 11:44:40 +02:00
Ines Montani
6279d74c65 Tidy up and auto-format 2019-09-11 11:38:22 +02:00
Matthew Honnibal
7b858ba606 Update from master 2019-09-10 20:14:08 +02:00
Ines Montani
669a7d37ce Exclude vocab when testing to_bytes 2019-09-10 19:45:16 +02:00
adrianeboyd
e367864e59 Update Ukrainian create_lemmatizer kwargs (#4266)
Allow Ukrainian create_lemmatizer to accept lookups kwarg.
2019-09-10 11:14:46 +02:00
adrianeboyd
c32126359a Allow period as suffix following punctuation (#4248)
Addresses rare cases (such as `_MATH_.`, see #1061) where the final
period was not recognized as a suffix following punctuation.
2019-09-09 19:19:22 +02:00
Ines Montani
3e8f136ba7 💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Fix serialization for lookups

* Fix lookups

* Fix lookups

* Fix lookups

* Try to fix serialization

* Try to fix serialization

* Try to fix serialization

* Try to fix serialization

* Give up on serialization test

* Xfail more serialization tests for 3.5

* Fix lookups for 2.7
2019-09-09 19:17:55 +02:00
Sofie Van Landeghem
482c7cd1b9 pulling tqdm imports in functions to avoid bug (tmp fix) (#4263) 2019-09-09 16:32:11 +02:00
Mihai Gliga
25aecd504f adding Romanian tag_map (#4257)
* adding Romanian tag_map

* added SCA file

* forgotten import
2019-09-09 11:53:09 +02:00
Matthew Honnibal
1653b818c5 Update Lithuanian tag map 2019-09-08 20:57:58 +02:00
adrianeboyd
3780e2ff50 Flush tokenizer cache when necessary (#4258)
Flush tokenizer cache when affixes, token_match, or special cases are
modified.

Fixes #4238, same issue as in #1250.
2019-09-08 20:52:46 +02:00
Matthew Honnibal
da8830d909 Set version to v2.2.0.dev3 2019-09-08 18:22:03 +02:00
Matthew Honnibal
1a65c5b7af Update develop from master 2019-09-08 18:21:41 +02:00
Matthew Honnibal
aec6174ae6 Fix lemmatizer 2019-09-08 18:09:53 +02:00
Matthew Honnibal
fde4f8ac8e Create lookups if not passed in 2019-09-08 18:08:09 +02:00
Pavle Vidanović
d03401f532 Lemmatizer lookup dictionary for Serbian and basic tag set adde… (#4251)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix

* Tokenizer exceptions added. Init file updated.

* Norm exceptions and lexical attributes added.

* Examples added.

* Tests added.

* sr_lang examples update.

* Tokenizer exceptions updated. (Serbian)

* Lemmatizer created. Licence included.

* Test updated.

* Tag map basic added.

* tag_map.py file removed since it uses default spacy tags.
2019-09-08 14:19:15 +02:00
Ivan Šarić
b01025dd06 adds Croatian lemma_lookup.json, license file and corresponding tests (#4252) 2019-09-08 13:40:45 +02:00
adrianeboyd
aec755d3a3 Modify retokenizer to use span root attributes (#4219)
* Modify retokenizer to use span root attributes

* tag/pos/morph are set to root tag/pos/morph

* lemma and norm are reset and end up as orth (not ideal, but better
than orth of first token)

* Also handle individual merge case

* Add test

* Attempt to handle ent_iob and ent_type in merges

* Fix check for whether B-ENT should become I-ENT

* Move IOB consistency check to after attrs

Move all IOB consistency checks after attrs are set and simplify to
check entire document, modifying I to B at the beginning of the document
or if the entity type of the previous token isn't the same.

* Move IOB consistency check for single merge

Move IOB consistency check after the token array is compressed for the
single merge case.

* Update spacy/tokens/_retokenize.pyx

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>

* Remove single vs. multiple merge distinction

Remove original single-instance `_merge()` and use `_bulk_merge()` (now
renamed `_merge()`) for all merges.

* Add out-of-bound check in previous entity check
2019-09-08 13:04:49 +02:00
Bae Yong-Ju
a55f5a744f Fix ValueError exception on empty Korean text. (#4245) 2019-09-06 10:29:40 +02:00
Adriane Boyd
0f28418446 Add regression test for #1061 back to test suite 2019-09-04 20:42:24 +02:00
Adriane Boyd
c39c13f26b Add guillemets/chevrons to German orth variants
Add guillemets/chevrons to German orth variants for both German/Austrian
and Swiss conventions.
2019-09-04 20:05:08 +02:00
Adriane Boyd
6b0fec76fd Fix handling of preset entities in NER
* Fix check of valid ent_type for B
* Add valid L as preset-I followed by not-I
2019-09-04 13:42:42 +02:00
Ines Montani
419ae59c79 Make flaky test test_issue_1971_4 more explicit 2019-08-31 14:08:05 +02:00
Ines Montani
cd90752193 Tidy up and auto-format [ci skip] 2019-08-31 13:39:06 +02:00
Matthew Honnibal
67c3d03905 Revert morphology serialisation 2019-08-30 13:13:07 +02:00
Adriane Boyd
893f11a9e3 Serialize tag_map directly
Fix Aspect_prof typo
2019-08-30 11:30:03 +02:00
Adriane Boyd
02babf9317 English tag map without unsupported features/values 2019-08-30 11:29:19 +02:00
Matthew Honnibal
516650f58f
Merge pull request #4207 from svlandeg/bugfix/serialize-tok-exc
Bugfix for serializing tokenizer rules/exceptions
2019-08-30 11:04:58 +02:00
Matthew Honnibal
f3c3ce7f1e Update vocab 2019-08-29 21:19:54 +02:00
Matthew Honnibal
fc0a3c8c38 Add morphology serialization 2019-08-29 21:17:34 +02:00
Matthew Honnibal
c94fc9edb9 Fix noise addition 2019-08-29 15:39:32 +02:00
Matthew Honnibal
32842a3cd4 Disable whitespace corruption 2019-08-29 15:01:58 +02:00
Matthew Honnibal
3c1c0ec18e Add tests for NER oracle with whitespace 2019-08-29 14:33:39 +02:00
Matthew Honnibal
6511e1d8d3 Fix NER gold-standard around whitespace 2019-08-29 14:33:07 +02:00
adrianeboyd
82159b5c19 Updates/bugfixes for NER/IOB converters (#4186)
* Updates/bugfixes for NER/IOB converters

* Converter formats `ner` and `iob` use autodetect to choose a converter if
  possible

* `iob2json` is reverted to handle sentence-per-line data like
  `word1|pos1|ent1 word2|pos2|ent2`

  * Fix bug in `merge_sentences()` so the second sentence in each batch isn't
    skipped

* `conll_ner2json` is made more general so it can handle more formats with
  whitespace-separated columns

  * Supports all formats where the first column is the token and the final
    column is the IOB tag; if present, the second column is the POS tag

  * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
    separates documents

  * Add option for segmenting sentences (new flag `-s`)

  * Parser-based sentence segmentation with a provided model, otherwise with
    sentencizer (new option `-b` to specify model)

  * Can group sentences into documents with `n_sents` as long as sentence
    segmentation is available

  * Only applies automatic segmentation when there are no existing delimiters
    in the data

* Provide info about settings applied during conversion with warnings and
  suggestions if settings conflict or might not be not optimal.

* Add tests for common formats

* Add '(default)' back to docs for -c auto

* Add document count back to output

* Revert changes to converter output message

* Use explicit tabs in convert CLI test data

* Adjust/add messages for n_sents=1 default

* Add sample NER data to training examples

* Update README

* Add links in docs to example NER data

* Define msg within converters
2019-08-29 12:04:01 +02:00
adrianeboyd
5feb342f5e Add more token attributes to token pattern schema (#4210)
Add token attributes with tests to token pattern schema.
2019-08-29 12:02:26 +02:00
Adriane Boyd
f3906950d3 Add separate noise vs orth level to train CLI 2019-08-29 09:10:35 +02:00
Matthew Honnibal
7d6d438566 Set version to v2.2.0.dev2 2019-08-28 18:30:43 +02:00
Matthew Honnibal
bc5ce49859 Fix 'noise_level' in train cmd 2019-08-28 17:55:38 +02:00
Matthew Honnibal
782056d117 Fix morph rules 2019-08-28 16:59:45 +02:00
Matthew Honnibal
6b2ea883ed
Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants
Add train_docs() option to add orth variants
2019-08-28 16:54:06 +02:00
svlandeg
c54aabc3cd fix loading custom tokenizer rules/exceptions from file 2019-08-28 14:17:44 +02:00
svlandeg
7bec0ebbcb failing unit test for Issue 4190 2019-08-28 14:16:34 +02:00
Adriane Boyd
0a26e94d02 Modify raw to match orth variant annotation tuples
If raw is available, attempt to modify raw to match the orth variants.
If raw/words can't be aligned, abort and return unmodified
raw/annotation.
2019-08-28 13:38:54 +02:00
Adriane Boyd
47af3f676e Single and paired orth variants for German 2019-08-28 09:19:18 +02:00