14575 Commits

Author SHA1 Message Date
Ines Montani
fe87ccc8d1 Update languages.json [ci skip] 2019-09-14 16:23:50 +02:00
Ines Montani
5c8b5e68ec Fix docs consistency [ci skip] 2019-09-14 16:23:37 +02:00
Ines Montani
bbf7337eaf Update adding languages docs [ci skip] 2019-09-14 15:32:15 +02:00
adrianeboyd
6942a6a69b Extend default punct for sentencizer (#4290)
Most of these characters are for languages / writing systems that aren't
supported by spaCy, but I don't think it causes problems to include
them. In the UD evals, Hindi and Urdu improve a lot as expected (from
0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil
improves in combination with #4288.

The punctuation list is converted to a set internally because of its
increased length.

Sentence final punctuation generated with:

```
unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
```

See: https://stackoverflow.com/a/9508766/461847

Fixes #4269.
2019-09-14 15:25:48 +02:00
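Note on 6942a6a69b: the message says the punctuation list is converted to a set internally because it has grown. A minimal sketch of that idea follows, in plain Python and without spaCy. The names `DEFAULT_PUNCT` and `split_sentences` are hypothetical, the character list is a small illustrative subset, and the splitting rule is an assumption, not the sentencizer's actual code.

```python
# Illustrative subset of sentence-final punctuation (not the list from the commit):
# ASCII marks, Devanagari danda, Arabic full stop, Arabic question mark.
DEFAULT_PUNCT = ["!", ".", "?", "\u0964", "\u06d4", "\u061f"]


def split_sentences(tokens, punct_chars=DEFAULT_PUNCT):
    """Group tokens into sentences, starting a new sentence at the first
    non-punctuation token that follows sentence-final punctuation."""
    # Converting once to a set makes each membership test O(1) on average,
    # which matters more as the list of characters grows.
    punct = set(punct_chars)
    sentences, current = [], []
    seen_final_punct = False
    for token in tokens:
        is_punct = token in punct
        if seen_final_punct and not is_punct:
            sentences.append(current)
            current = []
            seen_final_punct = False
        current.append(token)
        if is_punct:
            seen_final_punct = True
    if current:
        sentences.append(current)
    return sentences
```

With the danda in the set, a token sequence ending each sentence in `"\u0964"` is split where an ASCII-only list would return one long sentence.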
adrianeboyd
bee7961927 Add Kannada, Tamil, and Telugu unicode blocks (#4288)
Add Kannada, Tamil, and Telugu unicode blocks to uncased character
classes so that period is recognized as a suffix during tokenization.

(I'm sure a few symbols in the code blocks should not be ALPHA, but this
is mainly relevant for suffix detection and seems to be an improvement
in practice.)
2019-09-14 14:23:06 +02:00
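Note on bee7961927: the effect described above can be sketched with a plain regular expression. The block ranges are the standard Unicode ranges for Tamil, Telugu, and Kannada; the helper names `make_suffix_re` and `split_suffix` and the single suffix rule are hypothetical simplifications, not spaCy's character classes or tokenizer rules.

```python
import re

# Unicode block ranges.
TAMIL = "\u0B80-\u0BFF"
TELUGU = "\u0C00-\u0C7F"
KANNADA = "\u0C80-\u0CFF"
LATIN_BASIC = "A-Za-z"


def make_suffix_re(alpha):
    """Build a pattern that matches a token-final period only when it
    directly follows a character from the given "alphabetic" class."""
    return re.compile(r"(?<=[{a}])\.$".format(a=alpha))


def split_suffix(token, suffix_re):
    """Split a trailing period off the token if the pattern matches."""
    match = suffix_re.search(token)
    if match is None:
        return [token]
    return [token[: match.start()], token[match.start():]]


# Tamil word for "Tamil" followed by a period; it ends in a combining sign
# (U+0BCD), which the whole-block range also covers.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd."
before = make_suffix_re(LATIN_BASIC)
after = make_suffix_re(LATIN_BASIC + TAMIL + TELUGU + KANNADA)
```

Under these assumptions, `split_suffix(word, before)` leaves the period attached, while `split_suffix(word, after)` separates it. Covering whole blocks also admits signs and digits that are not strictly alphabetic, which is the trade-off the commit message acknowledges.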
Ines Montani
3126dd0904 Tidy up and auto-format [ci skip] 2019-09-14 12:58:06 +02:00
Ines Montani
bcbb9f5119 Update README.md [ci skip] 2019-09-14 12:57:45 +02:00
Ines Montani
27106d6528 Merge branch 'master' into develop 2019-09-13 17:07:17 +02:00
Euan Dowers
a6830d60e8 Changes to wiki_entity_linker (#4235)
* Changes to wiki_entity_linker

* No more f-strings

* Make some requested changes

* Add back option to get descriptions from Wikidata (wd) rather than Wikipedia (wp)

* Fix logs

* Address comments and clean evaluation

* Remove type hints

* Refactor evaluation, add back metrics by label

* Address comments

* Log training performance as well as dev
2019-09-13 17:03:57 +02:00
Sofie Van Landeghem
2ae5db580e dim bugfix when incl_prior is False (#4285) 2019-09-13 16:30:05 +02:00
Paul O'Leary McCann
29a9e636eb Fix half-width space handling in JA (#4284) (closes #4262)
Before this patch, half-width spaces between words were simply lost in
Japanese text. This wasn't immediately noticeable because much Japanese
text never uses spaces at all.
2019-09-13 16:28:12 +02:00
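Note on 29a9e636eb: the commit message does not describe how the fix works. One way to keep spaces when a segmenter returns only word strings is to align the words against the original text and record whether each is followed by a space. The function `align_spaces` below is a hypothetical sketch of that approach, not the code from the patch.

```python
def align_spaces(text, words):
    """Return (word, followed_by_space) pairs by locating each word in text.

    Assumes the words occur in order in the text and that words are
    separated by at most one half-width space.
    """
    result = []
    pos = 0
    for word in words:
        start = text.find(word, pos)
        if start < 0:
            raise ValueError("word not found in text: {!r}".format(word))
        pos = start + len(word)
        # Record a trailing half-width (ASCII) space so it is not lost.
        has_space = pos < len(text) and text[pos] == " "
        result.append((word, has_space))
    return result


def rebuild(pairs):
    """Reassemble the text from (word, followed_by_space) pairs."""
    return "".join(word + (" " if space else "") for word, space in pairs)
```

For text with no spaces at all every flag is `False`, which matches the observation that the bug was easy to miss in typical Japanese text.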
Ines Montani
3c3658ef9f Merge branch 'master' into develop 2019-09-12 18:03:01 +02:00
Ines Montani
228bbf506d Improve label properties on pipes 2019-09-12 18:02:44 +02:00
Ines Montani
03809b82b7 Support label schemes in model directory 2019-09-12 18:01:46 +02:00
Paul O'Leary McCann
7d8df69158 Bloom-filter backed Lookup Tables (#4268)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Lookups / Tables now work

This implements the stubs in the Lookups/Table classes. Currently this
is in Cython but with no type declarations, so that could be improved.

* Add lookups to setup.py

* Actually add lookups pyx

The previous commit added the old py file...

* Lookups work-in-progress

* Move from pyx back to py

* Add string based lookups, fix serialization

* Update tests, language/lemmatizer to work with string lookups

There are some outstanding issues here:

- a pickling-related test fails due to the bloom filter
- some custom lemmatizers (fr/nl at least) have issues

More generally, there's a question of how to deal with the case where
you have a string but want to use the lookup table. Currently the table
allows access by string or id, but that's getting pretty awkward.

* Change lemmatizer lookup method to pass (orth, string)

* Fix token lookup

* Fix French lookup

* Fix lt lemmatizer test

* Fix Dutch lemmatizer

* Fix lemmatizer lookup test

This was using a normal dict instead of a Table, so checks for the
string instead of an integer key failed.

* Make uk/nl/ru lemmatizer lookup methods consistent

The mentioned lemmatizers all have their own implementation of the
`lookup` method, which accesses a `Lookups` table. The way that was
called in `token.pyx` was changed so this should be updated to have the
same arguments as `lookup` in `lemmatizer.py` (specifically (orth/id,
string)).

Prior to this change tests weren't failing, but there would probably be
issues with normal use of a model. More tests should probably be added.

Additionally, the language-specific `lookup` implementations seem like
they might not be needed, since they handle things like lower-casing
that aren't actually language specific.

* Make recently added Greek method compatible

* Remove redundant class/method

Leftovers from a merge not cleaned up adequately.
2019-09-12 17:26:11 +02:00
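Note on 7d8df69158: the general idea of a Bloom-filter backed table can be sketched in plain Python. The classes below are hypothetical illustrations keyed by strings only; the filter size, hash scheme, and method names are assumptions and do not reproduce spaCy's `Lookups` or `Table` code.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=1024, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray(size)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive several positions by hashing the key with different seeds.
        for seed in range(self.hash_count):
            data = "{}:{}".format(seed, key).encode("utf8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class Table(OrderedDict):
    """Ordered mapping that consults a Bloom filter before the dict."""

    def __init__(self, name, data=None):
        super().__init__()
        self.name = name
        self.bloom = BloomFilter()
        for key, value in (data or {}).items():
            self[key] = value

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.bloom.add(key)

    def __contains__(self, key):
        # A filter miss is definitive; a hit is confirmed against the dict.
        return key in self.bloom and super().__contains__(key)

    def lookup(self, key, default=None):
        return self[key] if key in self else default
```

A possible usage: `Table("lemma_lookup", {"going": "go"}).lookup("going")` returns `"go"`, and a missing key falls back to the default. Confirming filter hits against the dict keeps results exact even when the filter yields a false positive.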
Sofie Van Landeghem
9be4d1c105 Allow copying of user_data in as_doc (#4282)
* Allow copying the user_data with as_doc + unit test

* add option to docs

* add typing

* import fix

* workaround to avoid bool clashing ...

* bint instead of bool
2019-09-12 17:08:14 +02:00
Matthew Honnibal
7d782aa97b Add more docstrings for MorphAnalysis 2019-09-12 16:48:30 +02:00
Ines Montani
ff51fba96a Update lemmatizer docs [ci skip] 2019-09-12 16:26:33 +02:00
Ines Montani
25b2b3ff45 Remove LEMMA from exception examples [ci skip] 2019-09-12 16:26:27 +02:00
Ines Montani
82c16b7943 Remove u-strings and fix formatting [ci skip] 2019-09-12 16:11:15 +02:00
Ines Montani
7e3ac2cd41 Merge branch 'master' into develop 2019-09-12 15:35:25 +02:00
Ines Montani
0760c41393 Change st_ctime to st_mtime 2019-09-12 15:35:01 +02:00
Ines Montani
38037d6816 Update landing [ci skip] 2019-09-12 15:33:39 +02:00
Ines Montani
a31e9e1cd5 Update training docs [ci skip] 2019-09-12 15:32:39 +02:00
Ines Montani
b544dcb3c5 Document debug-data [ci skip] 2019-09-12 15:26:20 +02:00
Ines Montani
05a2df6616 Remove not implemented file validation [ci skip] 2019-09-12 15:26:02 +02:00
Ines Montani
72274e83f2 Ensure accordion label is left-aligned [ci skip] 2019-09-12 15:24:17 +02:00
Ines Montani
c0a4cab178 Update "Adding languages" docs [ci skip] 2019-09-12 14:53:06 +02:00
Ines Montani
10257f3131 Document Lookups [ci skip] 2019-09-12 14:00:14 +02:00
Ines Montani
32404e613c Create directory if it doesn't exist 2019-09-12 14:00:01 +02:00
Ines Montani
aa4ff0baa1 Auto-format [ci skip] 2019-09-12 13:05:53 +02:00
Ines Montani
625ce2db8e Update Language docs [ci skip] 2019-09-12 13:03:38 +02:00
Ines Montani
cb41a33d14 Update displaCy API docs [ci skip] 2019-09-12 12:59:20 +02:00
Ines Montani
e7c20ad1d2 Update colors entry points docs [ci skip] 2019-09-12 12:59:10 +02:00
Ines Montani
7b59a919e6 Update entry points docs [ci skip] 2019-09-12 12:52:06 +02:00
Ines Montani
655b434553 Merge branch 'master' into develop 2019-09-12 11:39:18 +02:00
Sofie Van Landeghem
0b4b4f1819 Documentation for Entity Linking (#4065)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bools instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustments to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* typo fix

* add candidate API to kb documentation

* update API sidebar with EntityLinker and KnowledgeBase

* remove EL from 101 docs

* remove entity linker from 101 pipelines / rephrase

* custom el model instead of existing model

* set version to 2.2 for EL functionality

* update documentation for 2 CLI scripts
2019-09-12 11:38:34 +02:00
Ines Montani
4d4b3b0783 Add "labels" to Language.meta 2019-09-12 11:34:25 +02:00
Ines Montani
ac0e27a825 💫 Add Language.pipe_labels (#4276)
* Add Language.pipe_labels

* Update spacy/language.py

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
2019-09-12 10:56:28 +02:00
tamuhey
71909cdf22 Fix iss4278 (#4279)
* fix: len(tuple) == 2

* (#4278) add fail test

* add contributor's agreement
2019-09-12 10:44:49 +02:00
Ines Montani
8ebc3711dc Fix bug in Parser.labels and add test (#4275) 2019-09-11 18:29:35 +02:00
Matthew Honnibal
7fbb559045 Set version to v2.2.0.dev6 2019-09-11 18:07:20 +02:00
Matthew Honnibal
f7a096b462 Update morphology 2019-09-11 18:06:43 +02:00
Matthew Honnibal
f8ce9dde0f Set version to v2.2.0.dev5 2019-09-11 17:41:21 +02:00
Matthew Honnibal
c47c0269b1 Update morphology features 2019-09-11 15:16:53 +02:00
Ines Montani
af25323653 Tidy up and auto-format 2019-09-11 14:00:36 +02:00
Matthew Honnibal
af93997993 Fix conllu converter 2019-09-11 13:28:07 +02:00
Matthew Honnibal
178d010b25 Set version to 2.2.0.dev4 2019-09-11 12:28:37 +02:00
Ines Montani
e82a8d0d7a Merge branch 'master' into develop 2019-09-11 11:52:38 +02:00
Ines Montani
8f9f48b04c Add GreekLemmatizer.lookup (resolves #4272) 2019-09-11 11:44:40 +02:00