mirror of https://github.com/explosion/spaCy.git synced 2025-04-01 15:54:16 +03:00
Commit Graph

6575 Commits

Author SHA1 Message Date
adrianeboyd
2d8c6e1124 Iterate over lr_edges until sents are correct ()
Iterate over lr_edges until all heads are within the current sentence.
Instead of iterating over them for a fixed number of iterations, check
whether the sentence boundaries are correct for the heads and stop when
all are correct. Stop after a maximum of 10 iterations, providing a
warning in this case since the sentence boundaries may not be correct.
2019-11-25 13:06:36 +01:00
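The loop described in this commit (repeat until every head is inside its own sentence, give up with a warning after 10 passes) can be sketched roughly as below. This is a toy under stated assumptions, not the parser's code: `heads_within_sentences`, `merge_step` and `repair_until_consistent` are hypothetical names, and the repair rule (dropping a boundary that separates a token from its head) is only one plausible choice.

```python
import warnings

MAX_ITERATIONS = 10  # cap taken from the commit message


def heads_within_sentences(heads, sent_starts):
    """Return True if every token's head lies in the token's own sentence."""
    sent_id = []
    current = -1
    for is_start in sent_starts:
        if is_start:
            current += 1
        sent_id.append(current)
    return all(sent_id[i] == sent_id[h] for i, h in enumerate(heads))


def merge_step(heads, sent_starts):
    """Hypothetical repair: drop the first boundary between a token and its head."""
    starts = list(sent_starts)
    for i, h in enumerate(heads):
        lo, hi = min(i, h), max(i, h)
        for j in range(lo + 1, hi + 1):
            if starts[j]:
                starts[j] = False
                return starts
    return starts


def repair_until_consistent(heads, sent_starts, repair_step=merge_step):
    """Apply repair_step until heads and boundaries agree, or the cap is hit."""
    for _ in range(MAX_ITERATIONS):
        if heads_within_sentences(heads, sent_starts):
            return sent_starts
        sent_starts = repair_step(heads, sent_starts)
    if not heads_within_sentences(heads, sent_starts):
        warnings.warn("Sentence boundaries may be incorrect after 10 iterations")
    return sent_starts
```

The check runs before each repair pass, so input that is already consistent is returned unchanged.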
Matt Maybeno
c9f1e99787 Agnostic vocab array fix ()
* Use get_array_module instead of numpy

* add contributor agreement
2019-11-23 14:59:52 +01:00
adrianeboyd
46250f60ac Add missing tags to el/es/pt tag maps ()
* Add missing tags to pt tag map

* Add missing tags to es tag map

* Add missing tags to el tag map

* Add missing symbol in el tag map
2019-11-23 14:57:21 +01:00
Paul O'Leary McCann
f0e3e606a6 Replace mecab-python3 with fugashi for Japanese ()
* Switch from mecab-python3 to fugashi

mecab-python3 has been the best MeCab binding for a long time but it's
not very actively maintained, and since it's based on old SWIG code
distributed with MeCab there's a limit to how effectively it can be
maintained.

Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not
based on the old SWIG code it's easier to keep it current and make small
deviations from the MeCab C/C++ API where that makes sense.

* Change mecab-python3 to fugashi in setup.cfg

* Change "mecab tags" to "unidic tags"

The tags come from MeCab, but the tag schema is specified by Unidic, so
it's more proper to refer to it that way.

* Update conftest

* Add fugashi link to external deps list for Japanese
2019-11-23 14:31:04 +01:00
Ines Montani
a0fb1acb10 Update version [ci skip] 2019-11-21 18:19:37 +01:00
Ines Montani
b570d5d2ed Increment version [ci skip] 2019-11-21 17:02:32 +01:00
Matthew Honnibal
50f89cb85d Make vectors.find() return keys in correct order ()
* Make vectors.find() return keys in correct order

* Update spacy/vectors.pyx
2019-11-21 16:58:32 +01:00
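To illustrate what "keys in correct order" means for a row lookup, here is a minimal sketch over a plain dict. `find_keys` and the `key2row` mapping are assumptions made for the example; this is not the code in `spacy/vectors.pyx`.

```python
def find_keys(key2row, rows):
    """Return one key per requested row, in the same order as `rows`."""
    row2key = {}
    for key, row in key2row.items():
        row2key.setdefault(row, key)  # keep the first key seen for a shared row
    return [row2key[row] for row in rows]
```

Indexing the inverted mapping by the requested rows keeps the output aligned with the input, whatever order the table itself is stored in.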
Ines Montani
5d4eede1e4 Fix test util imports 2019-11-21 16:28:29 +01:00
GuiGel
8f7ab70870 Bugfix/fix entity ruler from disk ()
* fix EntityRuler from_disk bug

* add contributor file

* Test EntityRuler PhraseMatcher deserialization ()

* newline at end of file

* fix copy paste error

* serializing the EntityRuler by itself

* Add unicode declarations for Python 2 and auto-format
2019-11-21 16:26:37 +01:00
adrianeboyd
054df5d90a Add error for non-string labels ()
Add error when attempting to add non-string labels to `Tagger` or
`TextCategorizer`.
2019-11-21 16:24:10 +01:00
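A check of this kind might look like the following. `LabelStore` is a hypothetical stand-in for a component with an `add_label` method; the error type and message are assumptions, not the ones spaCy raises.

```python
class LabelStore:
    """Hypothetical stand-in for a component that keeps a list of labels."""

    def __init__(self):
        self.labels = []

    def add_label(self, label):
        """Add a label; return 1 if it was new, 0 if it already existed."""
        if not isinstance(label, str):
            # Fail early instead of storing a label that breaks later steps.
            raise ValueError(
                f"Labels must be strings, got {type(label).__name__}"
            )
        if label in self.labels:
            return 0
        self.labels.append(label)
        return 1
```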
adrianeboyd
d7f32b285c Detect more empty matches in tokenizer.explain() ()
* Detect more empty matches in tokenizer.explain()

* Include a few languages in explain non-slow tests

Mark a few languages in tokenizer.explain() tests as not slow so they're
run by default.
2019-11-20 16:31:29 +01:00
Ines Montani
5bf9ab5b03 Tidy up and auto-format 2019-11-20 13:16:33 +01:00
Ines Montani
7f3b00164a Re-add slow marker 2019-11-20 13:15:59 +01:00
Ines Montani
6e303de717 Auto-format 2019-11-20 13:15:24 +01:00
Ines Montani
2e7c896fe5 Update Tokenizer.explain tests 2019-11-20 13:14:11 +01:00
adrianeboyd
2c876eb672 Add tokenizer explain() debugging method ()
* Expose tokenizer rules as a property

Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)

Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

Update Hungarian punctuation definitions so that `_units` does not match
an empty string.

* Use _load_special_tokenization consistently

Use `_load_special_tokenization()` and have it handle `None` checks.

* Fix precedence of `token_match` vs. special cases

Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.

* Add `make_debug_doc()` to the Tokenizer

Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.

Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

This reverts commit f0a577f7a5.

* Rework `make_debug_doc()` as `explain()`

Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.

* Handle cases with bad tokenizer patterns

Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
2019-11-20 13:07:25 +01:00
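The following sketch shows the general shape of an `explain()`-style function that returns `(pattern_string, token_string)` tuples and does not hang when a pattern matches the empty string. It is a simplified toy, not the `Tokenizer.explain()` implementation: it handles only prefixes, suffixes and special cases, and the pattern labels (`PREFIX`, `SUFFIX`, `SPECIAL`, `TOKEN`) are chosen for illustration.

```python
import re


def explain(text, prefix_re, suffix_re, special_cases=None):
    """Return (pattern_name, token_text) tuples for a whitespace-split text."""
    special_cases = special_cases or {}
    explained = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        while chunk:
            if chunk in special_cases:  # special cases take precedence
                break
            match = prefix_re.match(chunk)
            if match and match.end() > 0:  # ignore empty prefix matches
                prefixes.append(("PREFIX", match.group()))
                chunk = chunk[match.end():]
                continue
            match = suffix_re.search(chunk)
            if match and match.start() < len(chunk):  # ignore empty suffix matches
                suffixes.append(("SUFFIX", match.group()))
                chunk = chunk[:match.start()]
                continue
            break
        explained.extend(prefixes)
        if chunk in special_cases:
            explained.extend(("SPECIAL", piece) for piece in special_cases[chunk])
        elif chunk:
            explained.append(("TOKEN", chunk))
        explained.extend(reversed(suffixes))
    return explained
```

Without the two guards, a pattern such as `[("]?` would match zero characters on every pass and the `while` loop would never terminate.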
Matthew Honnibal
a3c43a1692 Support no hidden layer in parser and NER ()
* Support no hidden layers for parser

* Fix parser model for depth 1

* Fix parser for hidden depth=0

* Add option of non-blocking to CUDA stream
2019-11-19 15:54:34 +01:00
Matthew Honnibal
4b123952aa Add option for improved NER feature extraction ()
* Support option of three NER features

* Expose nr_feature parser model setting

* Give feature tokens better name

* Test nr_feature=3 for NER

* Format
2019-11-19 15:03:14 +01:00
Elijah Rippeth
5ad5c4b44a Add initial Korean support ()
* add hangul and jamo char classes.

* add initial Korean lexical attributes.

* add contributor agreement
2019-11-18 12:56:07 +01:00
Ines Montani
74b951fe61 Fix xpassing tests ()
* Ignore internal warnings

* Un-xfail passing tests

* Skip instead of xfail
2019-11-16 20:20:53 +01:00
Ines Montani
3bd15055ce Fix bug in Language.evaluate for components without .pipe () 2019-11-16 20:20:37 +01:00
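One common way to support components that lack a `.pipe` method is to fall back to calling the component once per document. The helper below is a hypothetical sketch of that pattern, not the code in `Language.evaluate`.

```python
def apply_component(component, docs):
    """Run a pipeline component over docs, batching only if it supports it."""
    if hasattr(component, "pipe"):
        return component.pipe(docs)
    # Plain callables have no .pipe, so apply them one document at a time.
    return (component(doc) for doc in docs)
```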
adrianeboyd
bdfb696677 Fix conllu2json converter to output all sentences ()
Make sure that the last batch of sentences is output if n_sents > 1.
2019-11-15 17:08:32 +01:00
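The class of bug fixed here (a final, partially filled batch that is never emitted) can be illustrated with a small batching helper. `batch_sentences` is a hypothetical name and this is not the converter's code.

```python
def batch_sentences(sentences, n_sents):
    """Group sentences into documents of n_sents, keeping the final partial group."""
    docs = []
    batch = []
    for sentence in sentences:
        batch.append(sentence)
        if len(batch) >= n_sents:
            docs.append(batch)
            batch = []
    if batch:  # flush the remainder instead of dropping it
        docs.append(batch)
    return docs
```

With five sentences and `n_sents=2`, omitting the final `if batch:` block would silently lose the fifth sentence.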
Ines Montani
d64cfce546 Remove unnecessary newline replace 2019-11-15 16:19:01 +01:00
Christoph Purschke
433748e867 Fix basic language support for Luxembourgish (by adding punctuation.py) ()
* Update __init__.py

* Create punctuation.py

* Update tokenizer_exceptions.py

* Create questoph.md

* Update questoph.md

* Update test_text.py

* Update test_text.py

* Update test_text.py

* Update test_text.py
2019-11-15 16:16:47 +01:00
adrianeboyd
91f89f9693 Fix realloc in retokenizer.split() ()
Always realloc to a size larger than `doc.max_length` in
`retokenizer.split()` (or cymem will throw errors).
2019-11-11 16:26:46 +01:00
adrianeboyd
0b9a5f4074 Rework Chinese language initialization and tokenization ()
* Rework Chinese language initialization

* Create a `ChineseTokenizer` class
  * Modify jieba post-processing to handle whitespace correctly
  * Modify non-jieba character tokenization to handle whitespace correctly

* Add a `create_tokenizer()` method to `ChineseDefaults`

* Load lexical attributes

* Update Chinese tag_map for UD v2

* Add very basic Chinese tests

* Test tokenization with and without jieba

* Test `like_num` attribute

* Fix try_jieba_import()

* Fix zh code formatting
2019-11-11 14:23:21 +01:00
adrianeboyd
4d85f67eee Minor updates to language example sentences ()
* Add punctuation to Spanish example sentences

* Combine multilanguage examples for lang xx

* Add punctuation to nb examples
2019-11-07 22:34:58 +01:00
Priscilla de Abreu Lopes
39e79fcc86 Bugfix/dep matcher issue 4590 ()
* add contributor agreement for prilopes

* add test for issue 

* fix on_match params for DependencyMatcher ()
2019-11-07 12:01:06 +01:00
Ines Montani
09cec3e41b Replace function registries with catalogue ()
* Replace function registries with catalogue

* Update __init__.py

* Fix test

* Revert unrelated flag [ci skip]
2019-11-07 11:45:22 +01:00
Matthew Honnibal
4e43c0ba93 Fix multiprocessing for as_tuples=True () 2019-11-04 20:29:03 +01:00
Ines Montani
cf4ec88b38 Use latest wasabi 2019-11-04 02:38:45 +01:00
Ines Montani
6ec119d976 Add error in debug-data if no dev docs are available (see ) 2019-11-02 16:08:11 +01:00
adrianeboyd
56ad3a3988 Add LAS per dependency to Scorer () 2019-10-31 21:18:16 +01:00
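Per-label attachment scores can be computed by grouping labelled arcs by dependency label and scoring each group separately. The sketch below assumes arcs are `(child_index, head_index, label)` tuples and reports precision, recall and F-score per label; the function name and output layout are illustrative and not the `Scorer` API.

```python
from collections import defaultdict


def las_per_dep(gold_arcs, pred_arcs):
    """Score labelled arcs separately for each dependency label."""
    gold_by_label = defaultdict(set)
    pred_by_label = defaultdict(set)
    for arc in gold_arcs:
        gold_by_label[arc[2]].add(arc)
    for arc in pred_arcs:
        pred_by_label[arc[2]].add(arc)
    scores = {}
    for label in set(gold_by_label) | set(pred_by_label):
        gold, pred = gold_by_label[label], pred_by_label[label]
        tp = len(gold & pred)  # arcs with matching child, head and label
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores
```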
Matthew Honnibal
de98d66f87 Set version to v2.2.2 2019-10-31 15:53:31 +01:00
Matthew Honnibal
55f2241d72 Merge branch 'master' of https://github.com/explosion/spaCy 2019-10-31 15:37:52 +01:00
Ines Montani
df4c9ae3dc Fix formatting [ci skip] 2019-10-31 15:10:25 +01:00
Ines Montani
59358d9b71 Remove box-decoration-break from entities in displacy () 2019-10-31 15:09:43 +01:00
Matthew Honnibal
8b9954d1b7 Set version to v2.2.2.dev5 2019-10-31 15:06:19 +01:00
Ines Montani
2c107f02a4 Auto-format [ci skip] 2019-10-31 15:01:56 +01:00
Matthew Honnibal
e82306937e Put Tok2Vec refactor behind feature flag ()
* Add back pre-2.2.2 tok2vec

* Add simple tok2vec tests

* Add simple tok2vec tests

* Reformat

* Fix CharacterEmbed in new tok2vec

* Fix legacy tok2vec

* Resolve circular imports

* Fix test for Python 2
2019-10-31 15:01:15 +01:00
Ines Montani
5e9849b60f Auto-format [ci skip] 2019-10-30 19:27:18 +01:00
Ines Montani
afe4a428f7 Fix pipeline analysis on remove pipe ()
Validate *after* component is removed, not before
2019-10-30 19:04:17 +01:00
Matthew Honnibal
6b874ef096 Set version to v2.2.2.dev4 2019-10-30 17:36:20 +01:00
Ines Montani
85f2b04c45 Support span._. in component decorator attrs ()
* Support span._. in component decorator attrs

* Adjust error [ci skip]
2019-10-30 17:19:36 +01:00
Matthew Honnibal
c2f5f9f572 Set version to v2.2.2.dev3 2019-10-29 16:37:58 +01:00
Sofie Van Landeghem
33ba9ff464 set encodings explicitly to utf8 () 2019-10-29 13:16:55 +01:00
Matthew Honnibal
9e210fa7fd Fix tok2vec structure after model registry refactor ()
The model registry refactor of the Tok2Vec function broke loading models
trained with the previous function, because the model tree was slightly
different. Specifically, to build the embedding layer, the new function
wrote:

    concatenate(norm, prefix, suffix, shape)

In the previous implementation, I had used the operator overloading
shortcut:

    ( norm | prefix | suffix | shape )

This actually gets mapped to a binary association, giving something
like:

    concatenate(norm, concatenate(prefix, concatenate(suffix, shape)))

This is a different tree, so the layers iterate differently and we
loaded the weights wrongly.
2019-10-28 23:59:03 +01:00
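The difference between the two trees can be reproduced with a toy layer class. Everything here is a hypothetical stand-in for the real model classes: `Layer`, its `walk()` method and the two-child `|` operator are assumptions. Python evaluates `a | b | c` left to right, so this toy nests to the left; the exact nesting in the library may differ, but in either case the result is not the flat tree.

```python
class Layer:
    """Toy layer: a name plus child layers. `|` builds a two-child concatenate."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def __or__(self, other):
        return Layer("concatenate", [self, other])

    def walk(self):
        """Yield layer names depth-first, as a weight loader might visit them."""
        yield self.name
        for child in self.children:
            yield from child.walk()


def concatenate(*layers):
    """Build one flat concatenate node over all the given layers."""
    return Layer("concatenate", layers)


norm, prefix, suffix, shape = (
    Layer(name) for name in ("norm", "prefix", "suffix", "shape")
)
flat = concatenate(norm, prefix, suffix, shape)
nested = norm | prefix | suffix | shape

print(list(flat.walk()))
print(list(nested.walk()))
```

The flat tree has one `concatenate` node and the chained form has three, so a loader that assigns saved weights by walking the tree would pair them with the wrong layers.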
Matthew Honnibal
bade60fe64 Set version to v2.2.2.dev1 2019-10-28 19:09:34 +01:00
Matthew Honnibal
b1505380ff Fix training with vectors 2019-10-28 18:06:38 +01:00
Matthew Honnibal
a927b3a21e Put new alignment behind flag for v2.2.2 release ()
* Xfail new tokenization test

* Put new alignment behind feature flag

* Move USE_ALIGN to top of the file [ci skip]


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-28 16:12:32 +01:00