Commit Graph

11612 Commits

Author SHA1 Message Date
Adriane Boyd
e4acb28658
Fix norm in retokenizer split ()
Parallel to behavior in merge, reset norm on original token in
retokenizer split.
2020-09-22 21:53:33 +02:00
Adriane Boyd
9b4979407d
Fix overlapping German noun chunks ()
Add a similar fix as in  to prevent the German noun chunks iterator
from producing overlapping spans.
2020-09-22 21:52:42 +02:00
Adriane Boyd
4625029370
Add pin for pyrsistent<0.17.0 ()
Add pin for pyrsistent<0.17.0 since pyrsistent>=0.17.1 is only
compatible with python3.5+.
2020-09-22 19:04:49 +02:00
Marek Grzenkowicz
a26f864ed3
Clarify how to choose pretrained weights files (closes ) [ci skip] ()
2020-09-08 21:13:50 +02:00
Ines Montani
33d9c64977
Fix outbound link and update package lock [ci skip]
2020-09-04 14:44:38 +02:00
Ines Montani
ba6cf9821f
Replace docs analytics [ci skip]
2020-09-04 14:28:28 +02:00
holubvl3
0a27fca557
Create examples.py ()
* Create examples.py

* Create tag_map.py

* Delete tag_map.py

* Update examples.py

formatting: add empty line

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-09-04 11:00:14 +02:00
Brad Jascob
2160aafec6
Updates spaCy Universe for amrlib ()
* Updates spaCy Universe for amrlib

* Updates to doc based on feedback
2020-09-04 10:03:35 +02:00
Marek Grzenkowicz
92d7832a86
Fix off-by-one error for best iteration calculation (closes ) ()
2020-09-02 15:15:45 +02:00
Sofie Van Landeghem
f7a25d69f7
Bugfix in merge_entities ()
* failing test

* bugfix
2020-09-01 21:57:52 +02:00
Juan Gutiérrez
9002bea29f
Update suffixes example ()
* Update suffixes example

The current example will throw `TypeError: can only concatenate list (not "tuple") to list`

* Signing Contributor Agreement
2020-08-31 12:44:56 +02:00
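
The `TypeError` quoted in this commit is the error Python raises when a tuple is added to a list. The following is a minimal sketch of the problem and one way around it. The pattern values and the `combine_suffixes` helper are hypothetical stand-ins for illustration, not spaCy's actual defaults or the fix applied in the commit.

```python
# Hypothetical stand-ins: the real suffix patterns live in spaCy's language defaults.
default_suffixes = [r"\.", r","]   # assumed here to be a list
extra_suffixes = (r"-+$",)         # a tuple of additional patterns


def combine_suffixes(defaults, extra):
    """Concatenate two pattern collections regardless of list/tuple type."""
    return list(defaults) + list(extra)


error_message = None
try:
    default_suffixes + extra_suffixes  # list + tuple is not allowed
except TypeError as err:
    error_message = str(err)

# Converting both sides to the same type avoids the error.
combined = combine_suffixes(default_suffixes, extra_suffixes)
```

Converting both operands to lists (or both to tuples) before concatenating works whichever type the defaults happen to be.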
Adriane Boyd
caf23462eb
Add 3rd party licenses ()
2020-08-26 15:23:59 +02:00
Adriane Boyd
7d7b65ffd4
Fix raw strings in URL pattern ()
Add missing raw string specifiers.
2020-08-26 04:00:49 +02:00
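
A short sketch of why a missing raw string specifier matters in a regular expression. The pattern below is a hypothetical fragment chosen for illustration; it is not spaCy's actual URL pattern.

```python
import re

# Hypothetical pattern fragment for illustration; not spaCy's actual URL pattern.
# With the r prefix, backslashes reach the regex engine unchanged.
raw_pattern = r"\bwww\.\w+\.com\b"

# Without the r prefix, Python turns "\b" into a backspace character before the
# regex engine sees it, so the intended word-boundary check is lost.
non_raw_pattern = "\bwww\\.\\w+\\.com\b"

text = "see www.example.com for details"
raw_match = re.search(raw_pattern, text)
non_raw_match = re.search(non_raw_pattern, text)
```

Here `raw_match` finds `www.example.com`, while `non_raw_match` is `None` because the text contains no literal backspace characters.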
Hiroshi Matsuda
332803eda9
fix ja leading spaces ()
* change condition for space after

* add NAUGHTY_STRINGS test example
2020-08-25 14:16:24 +02:00
Shashank
450720aca2
Added support for Sanskrit language ()
* Added support for Sanskrit language

* Added tests for lexical attribute like_num
2020-08-25 10:56:29 +02:00
idoshr
b10c7bc56e
Hebrew like num ()
* Update stop_words.py

Hebrew STOP WORDS

* Update stop_words.py

* contributor

* contributor

* add some common domain extensions
support human number 1K/1M....

* support human number 1K/1M....

* hebrew number tokenize
1K/1M implement in EN

* test human tokenize fix

* test

* heb like num
revert human number change

* heb like num
2020-08-24 14:30:05 +02:00
Sofie Van Landeghem
56eabcb2f2
Adding num_like test for Czech ()
* Create lex_attrs.py

Hello,

I am missing a CZECH language in spaCy. So I would like to help to push it a little. This file is based on other lex_attrs.py files, just with translation to Czech.

* Update __init__.py

Updated for use with new Czech Lex_attrs file

* Update stop_words.py

* Create test_text.py

* add like_num testing for czech

Co-authored-by: holubvl3 <47881982+holubvl3@users.noreply.github.com>
Co-authored-by: holubvl3 <vilemrousi@gmail.com>
Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>
2020-08-21 17:06:33 +02:00
holubvl3
a341b4ef09
Adding support for Czech language ()
* Create lex_attrs.py

Hello,

I am missing a CZECH language in spaCy. So I would like to help to push it a little. This file is based on other lex_attrs.py files, just with translation to Czech.

* Update __init__.py

Updated for use with new Czech Lex_attrs file

* Update stop_words.py

* Create test_text.py

Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>
2020-08-21 16:17:53 +02:00
Ines Montani
99d2a25687
Make sure sys.argv exists ()
* Make sure sys.argv exists (resolves )

* Fix typo
2020-08-20 16:30:11 +02:00
Sofie Van Landeghem
071c09ff35
add coding ()
2020-08-20 11:08:38 +02:00
Attila Szász
669dc70822
Create tilusnet.md ()
2020-08-12 22:46:08 +02:00
Adam Bittlingmayer
7b33b2854f
Add Armenian sentence-final verchaket, Greek question mark and Arabic question mark to default punct ()
* Add Armenian sentence-final verchaket

* Add Greek and Arabic question marks, and contributor agreement

* Check box
2020-08-12 15:36:14 +02:00
graue70
49e690bde1
Fix typos in comments ()
* Fix typo in comment

* Fix typo

* Add spaCy Contributor Agreement
2020-08-12 15:35:25 +02:00
Adriane Boyd
4193402c47
Add warning when Matcher subpattern is discarded ()
* Add a warning when a subpattern is not processed and discarded

* Normalize subpattern attribute/operator keys to upper case like
top-level attributes
2020-08-05 14:56:14 +02:00
Bram Vanroy
9e45d064bb
Update universe details spacy_conll ()
2020-08-05 14:34:12 +02:00
Adriane Boyd
c62fd878a3
Allow Doc.char_span to snap to token boundaries ()
* Allow Doc.char_span to snap to token boundaries

Add a `mode` option to allow `Doc.char_span` to snap to token
boundaries. The `mode` options:

* `strict`: character offsets must match token boundaries (default, same as
before)
* `inside`: all tokens completely within the character span
* `outside`: all tokens at least partially covered by the character span

Add a new helper function `token_by_char` that returns the token
corresponding to a character position in the text. Update
`token_by_start` and `token_by_end` to use `token_by_char` for more
efficient searching.

* Remove unused import

* Rename mode to alignment_mode

Rename `mode` to `alignment_mode` with the options
`strict`/`contract`/`expand`. Any unrecognized modes are silently
converted to `strict`.
2020-08-04 13:36:32 +02:00
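
The three alignment modes described in this commit can be sketched in plain Python. This is a simplified illustration over a list of token character offsets, not spaCy's implementation. The function name `snap_char_span` and its return format (a pair of token indices, end exclusive, or `None`) are assumptions made for the example.

```python
def snap_char_span(token_offsets, start, end, alignment_mode="strict"):
    """Map a character span onto token indices.

    token_offsets: list of (start_char, end_char) pairs, one per token.
    Returns (first_token, last_token + 1), or None if no span can be formed.
    """
    # Per the commit message, unrecognized modes fall back to "strict".
    if alignment_mode not in ("strict", "contract", "expand"):
        alignment_mode = "strict"

    if alignment_mode == "expand":
        # All tokens at least partially covered by the character span.
        selected = [i for i, (s, e) in enumerate(token_offsets)
                    if s < end and e > start]
    else:
        # All tokens completely within the character span.
        selected = [i for i, (s, e) in enumerate(token_offsets)
                    if start <= s and e <= end]

    if not selected:
        return None
    first, last = selected[0], selected[-1]
    if alignment_mode == "strict":
        # Character offsets must fall exactly on token boundaries.
        if token_offsets[first][0] != start or token_offsets[last][1] != end:
            return None
    return first, last + 1


# Token offsets for the text "New York City".
offsets = [(0, 3), (4, 8), (9, 13)]
strict_exact = snap_char_span(offsets, 0, 8)               # aligned span
strict_misaligned = snap_char_span(offsets, 1, 8)          # starts mid-token
contracted = snap_char_span(offsets, 1, 8, "contract")
expanded = snap_char_span(offsets, 1, 8, "expand")
```

For the misaligned span starting at character 1, `strict` yields no span, `contract` keeps only "York", and `expand` grows the span to cover "New York".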
Adriane Boyd
b841248589
Add Span index boundary checks ()
* Add Span index boundary checks

* Return Span-specific IndexError in all cases

* Simplify and fix if/else
2020-08-04 13:35:25 +02:00
Adriane Boyd
cd59979ab4
Fix span boundary handling in Spanish noun_chunks ()
2020-08-03 13:53:15 +02:00
Adriane Boyd
ac14ce7c30
Prefer earlier spans in EntityRuler ()
Similar to , update the sorting in EntityRuler to prefer the first
span in overlapping spans.
2020-07-31 16:09:32 +02:00
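
A minimal sketch of what preferring the first span among overlapping spans can look like. It assumes spans are `(start, end)` token offsets with an exclusive end, and that longer spans are tried before shorter ones. Both the `filter_overlapping` helper and the longest-first ordering are assumptions for illustration; the commit message only states that the earlier span is preferred.

```python
def filter_overlapping(spans):
    """Keep a non-overlapping subset of (start, end) spans.

    Assumed ordering: longer spans first; among spans of equal length,
    the one that starts earlier wins.
    """
    ordered = sorted(
        spans,
        key=lambda span: (span[1] - span[0], -span[0]),
        reverse=True,
    )
    seen_tokens = set()
    kept = []
    for start, end in ordered:
        # Skip any span that shares a token with an already accepted span.
        if not any(i in seen_tokens for i in range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)


# (0, 2) and (1, 3) have the same length and overlap on token 1.
result = filter_overlapping([(1, 3), (0, 2), (4, 5)])
```

With this ordering the earlier span `(0, 2)` is kept and `(1, 3)` is dropped, while the unrelated span `(4, 5)` is unaffected.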
holubvl3
d16c0f2c3a
Create holubvl3 ()
* Create holubvl3

* Rename holubvl3 to holubvl3.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-07-30 17:40:31 +02:00
Rahul Gupta
f76fae0e8d
English: adds ordinal numbers ()
2020-07-29 20:22:47 +02:00
Gustavo Zadrozny Leyendecker
90b958fd01
Fix on EntityRendered to support break lines (after last entity) (closes )
2020-07-29 18:48:39 +02:00
oculusrepairo
03ab518f28
Update examples.py ()
* Update examples.py

adding factual sentences to the list

* Add missing comma separators

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-07-29 10:28:56 +02:00
graue70
b97dbab998
Fix typo in unit tests ()
2020-07-27 20:18:48 +02:00
Adriane Boyd
2880d8a555
Normalize spelling for spaCy ()
2020-07-27 10:09:33 +02:00
Martino Mensio
2f6b8132ef
Sentence transformers added to spaCy universe ()
* fix details for spacy-universal-sentence-encoder

* added sentence-transformers
2020-07-27 09:44:33 +02:00
Nipun Sadvilkar
a66ad89fcb
✏️ typo in pysbd code example ()
2020-07-27 09:43:39 +02:00
Li Zhe
a69eb445dc
fix the wrong hash url in adding-languages.md file ()
* fix the wrong hash url in adding-languages.md file

change the  url hash path to #language-data

* filled in the spaCy Contributor Agreement 

filled in the spaCy Contributor Agreement
2020-07-25 13:13:38 +02:00
Adriane Boyd
19dc42776a
Remove hard-coded GPU ID from pretrain ()
2020-07-24 09:26:26 +02:00
Joshua Olson
6d4d5c074c
Mark Japanese documents as tagged. ()
Mark the document as tagged before returning it to the user from the JapaneseTokenizer.
Fixes 
2020-07-23 08:57:01 +02:00
Adriane Boyd
038ff1a811
Improve warnings around normalization tables ()
Provide more customized normalization table warnings when training a new
model. Only suggest installing `spacy-lookups-data` if it's not already
installed and it includes a table for this language (currently checked
in a hard-coded list).
2020-07-22 16:04:58 +02:00
Adriane Boyd
bf24f7f672
Update invalid tag maps ()
* Remove copy of (old?) PTB tag map for: bn, eu
* Remove unsupported features from: hy, pl, ro, ru
2020-07-22 16:02:51 +02:00
Alec Chapman
a8978ca285
Add VA COVID-19 NLP project to spaCy Universe ()
* Update universe.json

Add cov-bsv to "resources"

* Update universe.json

* add contributor agreement
2020-07-19 13:35:31 +02:00
Adriane Boyd
597bcc629e
Improve tag map initialization and updating ()
* Improve tag map initialization and updating

Generalize tag map initialization and updating so that a provided tag
map can be loaded correctly in the CLI.

* normalize provided tag map as necessary
* use the same method for initializing and overwriting the tag map

* Reinitialize cache after loading new tag map

Reinitialize the cache with the right size after loading a new tag map.
2020-07-19 11:13:39 +02:00
Adriane Boyd
7e14272096
Lower upper pin for cupy to 8.0.0 ()
2020-07-19 11:10:11 +02:00
Adriane Boyd
cd5af72c9a
Update pkuseg version ()
* Update pkuseg version in Chinese tokenizer warnings
* Update pkuseg version in `Makefile`
* Remove warning about python3.8 wheels in docs
2020-07-19 11:09:49 +02:00
Ines Montani
6f4e4aceb3
Add Plausible [ci skip]
2020-07-18 23:50:29 +02:00
Adriane Boyd
5228920e2f
Clarify warning W030 for misaligned BILUO tags ()
2020-07-14 14:09:48 +02:00
Adriane Boyd
7ea2cc7650
Set version to 2.3.2 ()
2020-07-13 14:55:56 +02:00
Mark Neumann
27a1cd3c63
fix meta serialization in train ()
Co-authored-by: Mark Neumann <markng@allenai.org>
2020-07-12 22:06:46 +02:00