mirror of https://github.com/explosion/spaCy.git synced 2025-04-04 17:24:16 +03:00
Commit Graph

11660 Commits

Author SHA1 Message Date
Adriane Boyd
4625029370
Add pin for pyrsistent<0.17.0 ()
Add pin for pyrsistent<0.17.0 since pyrsistent>=0.17.1 is only
compatible with python3.5+.
2020-09-22 19:04:49 +02:00
Marek Grzenkowicz
a26f864ed3
Clarify how to choose pretrained weights files (closes ) [ci skip] () 2020-09-08 21:13:50 +02:00
Ines Montani
33d9c64977 Fix outbound link and update package lock [ci skip] 2020-09-04 14:44:38 +02:00
Ines Montani
ba6cf9821f Replace docs analytics [ci skip] 2020-09-04 14:28:28 +02:00
holubvl3
0a27fca557
Create examples.py ()
* Create examples.py

* Create tag_map.py

* Delete tag_map.py

* Update examples.py

formatting: add empty line

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-09-04 11:00:14 +02:00
Brad Jascob
2160aafec6
Updates spaCy Universe for amrlib ()
* Updates spaCy Universe for amrlib

* Updates to doc based on feedback
2020-09-04 10:03:35 +02:00
Marek Grzenkowicz
92d7832a86
Fix off-by-one error for best iteration calculation (closes ) () 2020-09-02 15:15:45 +02:00
Sofie Van Landeghem
f7a25d69f7
Bugfix in merge_entities ()
* failing test

* bugfix
2020-09-01 21:57:52 +02:00
Juan Gutiérrez
9002bea29f
Update suffixes example ()
* Update suffixes example

The current example will throw `TypeError: can only concatenate list (not "tuple") to list`

* Signing Contributor Agreement
2020-08-31 12:44:56 +02:00
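
The `TypeError` quoted in the commit message above is what Python raises whenever a list is concatenated with a tuple. A minimal sketch of the failure and one possible fix follows; the pattern values and function names are hypothetical stand-ins, not spaCy's actual defaults or the exact code from the docs example.

```python
# Hypothetical stand-ins: the defaults are a list, the custom additions a tuple.
default_suffixes = [r"\.$", r",$"]
extra_suffixes = (r"-+$",)


def combine_broken():
    # list + tuple is not defined in Python, so this raises TypeError.
    return default_suffixes + extra_suffixes


def combine_fixed():
    # Converting both sides to the same sequence type makes concatenation valid.
    return list(default_suffixes) + list(extra_suffixes)


try:
    combine_broken()
except TypeError as err:
    message = str(err)

combined = combine_fixed()
```

Converting to `list` on both sides is only one option; converting both to `tuple` would work equally well.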
Adriane Boyd
caf23462eb
Add 3rd party licenses () 2020-08-26 15:23:59 +02:00
Adriane Boyd
7d7b65ffd4
Fix raw strings in URL pattern ()
Add missing raw string specifiers.
2020-08-26 04:00:49 +02:00
Hiroshi Matsuda
332803eda9
fix ja leading spaces ()
* change condition for space after

* add NAUGHTY_STRINGS test example
2020-08-25 14:16:24 +02:00
Shashank
450720aca2
Added support for Sanskrit language ()
* Added support for Sanskrit language

* Added tests for lexical attribute like_num
2020-08-25 10:56:29 +02:00
idoshr
b10c7bc56e
Hebrew like num ()
* Update stop_words.py

Hebrew STOP WORDS

* Update stop_words.py

* contributor

* contributor

* add some common domain extensions
support human number 1K/1M....

* support human number 1K/1M....

* hebrew number tokenize
1K/1M implement in EN

* test human tokenize fix

* test

* heb like num
revert human number change

* heb like num
2020-08-24 14:30:05 +02:00
Sofie Van Landeghem
56eabcb2f2
Adding num_like test for Czech ()
* Create lex_attrs.py

Hello,

I am missing the Czech language in spaCy, so I would like to help push it along a little. This file is based on the other lex_attrs.py files, just translated to Czech.

* Update __init__.py

Updated for use with new Czech Lex_attrs file

* Update stop_words.py

* Create test_text.py

* add like_num testing for czech

Co-authored-by: holubvl3 <47881982+holubvl3@users.noreply.github.com>
Co-authored-by: holubvl3 <vilemrousi@gmail.com>
Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>
2020-08-21 17:06:33 +02:00
holubvl3
a341b4ef09
Adding support for Czech language ()
* Create lex_attrs.py

Hello,

I am missing the Czech language in spaCy, so I would like to help push it along a little. This file is based on the other lex_attrs.py files, just translated to Czech.

* Update __init__.py

Updated for use with new Czech Lex_attrs file

* Update stop_words.py

* Create test_text.py

Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>
2020-08-21 16:17:53 +02:00
Ines Montani
99d2a25687
Make sure sys.argv exists ()
* Make sure sys.argv exists (resolves )

* Fix typo
2020-08-20 16:30:11 +02:00
Sofie Van Landeghem
071c09ff35
add coding () 2020-08-20 11:08:38 +02:00
Attila Szász
669dc70822
Create tilusnet.md () 2020-08-12 22:46:08 +02:00
Adam Bittlingmayer
7b33b2854f
Add Armenian sentence-final verchaket, Greek question mark and Arabic question mark to default punct ()
* Add Armenian sentence-final verchaket

* Add Greek and Arabic question marks, and contributor agreement

* Check box
2020-08-12 15:36:14 +02:00
graue70
49e690bde1
Fix typos in comments ()
* Fix typo in comment

* Fix typo

* Add spaCy Contributor Agreement
2020-08-12 15:35:25 +02:00
Adriane Boyd
4193402c47
Add warning when Matcher subpattern is discarded ()
* Add a warning when a subpattern is not processed and discarded

* Normalize subpattern attribute/operator keys to upper case like
top-level attributes
2020-08-05 14:56:14 +02:00
Bram Vanroy
9e45d064bb
Update universe details spacy_conll () 2020-08-05 14:34:12 +02:00
Adriane Boyd
c62fd878a3
Allow Doc.char_span to snap to token boundaries ()
* Allow Doc.char_span to snap to token boundaries

Add a `mode` option to allow `Doc.char_span` to snap to token
boundaries. The `mode` options:

* `strict`: character offsets must match token boundaries (default, same as
before)
* `inside`: all tokens completely within the character span
* `outside`: all tokens at least partially covered by the character span

Add a new helper function `token_by_char` that returns the token
corresponding to a character position in the text. Update
`token_by_start` and `token_by_end` to use `token_by_char` for more
efficient searching.

* Remove unused import

* Rename mode to alignment_mode

Rename `mode` to `alignment_mode` with the options
`strict`/`contract`/`expand`. Any unrecognized modes are silently
converted to `strict`.
2020-08-04 13:36:32 +02:00
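
The three `alignment_mode` options described in the commit above can be illustrated with a small pure-Python sketch. It does not use spaCy and is not the library's implementation: the function name `char_span_tokens` and the representation of tokens as `(start_char, end_char)` pairs are assumptions made for illustration only.

```python
def char_span_tokens(token_offsets, start, end, alignment_mode="strict"):
    """Return (first_token, last_token_exclusive) for a character span, or None.

    token_offsets is a list of (start_char, end_char) pairs, one per token.
    """
    if alignment_mode not in ("strict", "contract", "expand"):
        # Per the commit message, unrecognized modes fall back to strict.
        alignment_mode = "strict"

    if alignment_mode == "strict":
        # Character offsets must coincide with token boundaries.
        starts = [s for s, _ in token_offsets]
        ends = [e for _, e in token_offsets]
        if start < end and start in starts and end in ends:
            return starts.index(start), ends.index(end) + 1
        return None

    if alignment_mode == "contract":
        # Keep only tokens that lie completely inside the character span.
        picked = [i for i, (s, e) in enumerate(token_offsets)
                  if s >= start and e <= end]
    else:
        # "expand": keep every token at least partially covered by the span.
        picked = [i for i, (s, e) in enumerate(token_offsets)
                  if s < end and e > start]

    if not picked:
        return None
    return picked[0], picked[-1] + 1


# Token offsets for the text "Hello big world".
offsets = [(0, 5), (6, 9), (10, 15)]
```

For the character span 2 to 12, which cuts through "Hello" and "world", `strict` finds no match, `contract` keeps only "big", and `expand` covers all three tokens.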
Adriane Boyd
b841248589
Add Span index boundary checks ()
* Add Span index boundary checks

* Return Span-specific IndexError in all cases

* Simplify and fix if/else
2020-08-04 13:35:25 +02:00
Adriane Boyd
cd59979ab4
Fix span boundary handling in Spanish noun_chunks () 2020-08-03 13:53:15 +02:00
Adriane Boyd
ac14ce7c30
Prefer earlier spans in EntityRuler ()
Similar to , update the sorting in EntityRuler to prefer the first
span in overlapping spans.
2020-07-31 16:09:32 +02:00
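
A minimal sketch of what preferring the first of several overlapping spans can look like. The commit message states only that earlier spans are preferred; the longest-first ordering, the `(start, end)` tuple representation, and the name `filter_overlapping` are assumptions for illustration, not the EntityRuler code itself.

```python
def filter_overlapping(spans):
    """Keep non-overlapping (start, end) token spans.

    Assumed ordering: longer spans win, and ties go to the earlier span.
    """
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    seen = set()  # token indices already claimed by a kept span
    for start, end in ordered:
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)
```

With two equally long overlapping candidates such as `(0, 2)` and `(1, 3)`, the tie-break on start position keeps `(0, 2)` regardless of the order in which the spans were supplied.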
holubvl3
d16c0f2c3a
Create holubvl3 ()
* Create holubvl3

* Rename holubvl3 to holubvl3.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-07-30 17:40:31 +02:00
Rahul Gupta
f76fae0e8d
English: adds ordinal numbers () 2020-07-29 20:22:47 +02:00
Gustavo Zadrozny Leyendecker
90b958fd01
Fix on EntityRendered to support break lines (after last entity) (closes ) 2020-07-29 18:48:39 +02:00
oculusrepairo
03ab518f28
Update examples.py ()
* Update examples.py

adding factual sentences to the list

* Add missing comma separators

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-07-29 10:28:56 +02:00
graue70
b97dbab998
Fix typo in unit tests () 2020-07-27 20:18:48 +02:00
Adriane Boyd
2880d8a555
Normalize spelling for spaCy () 2020-07-27 10:09:33 +02:00
Martino Mensio
2f6b8132ef
Sentence transformers added to spaCy universe ()
* fix details for spacy-universal-sentence-encoder

* added sentence-transformers
2020-07-27 09:44:33 +02:00
Nipun Sadvilkar
a66ad89fcb
✏️ typo in pysbd code example () 2020-07-27 09:43:39 +02:00
Li Zhe
a69eb445dc
fix the wrong hash url in adding-languages.md file ()
* fix the wrong hash url in adding-languages.md file

change the  url hash path to #language-data

* filled in the spaCy Contributor Agreement 

filled in the spaCy Contributor Agreement
2020-07-25 13:13:38 +02:00
Adriane Boyd
19dc42776a
Remove hard-coded GPU ID from pretrain () 2020-07-24 09:26:26 +02:00
Joshua Olson
6d4d5c074c
Mark Japanese documents as tagged. ()
Mark the document as tagged before returning it to the user from the JapaneseTokenizer.
Fixes 
2020-07-23 08:57:01 +02:00
Adriane Boyd
038ff1a811
Improve warnings around normalization tables ()
Provide more customized normalization table warnings when training a new
model. Only suggest installing `spacy-lookups-data` if it's not already
installed and it includes a table for this language (currently checked
in a hard-coded list).
2020-07-22 16:04:58 +02:00
Adriane Boyd
bf24f7f672
Update invalid tag maps ()
* Remove copy of (old?) PTB tag map for: bn, eu
* Remove unsupported features from: hy, pl, ro, ru
2020-07-22 16:02:51 +02:00
Alec Chapman
a8978ca285
Add VA COVID-19 NLP project to spaCy Universe ()
* Update universe.json

Add cov-bsv to "resources"

* Update universe.json

* add contributor agreement
2020-07-19 13:35:31 +02:00
Adriane Boyd
597bcc629e
Improve tag map initialization and updating ()
* Improve tag map initialization and updating

Generalize tag map initialization and updating so that a provided tag
map can be loaded correctly in the CLI.

* normalize provided tag map as necessary
* use the same method for initializing and overwriting the tag map

* Reinitialize cache after loading new tag map

Reinitialize the cache with the right size after loading a new tag map.
2020-07-19 11:13:39 +02:00
Adriane Boyd
7e14272096
Lower upper pin for cupy to 8.0.0 () 2020-07-19 11:10:11 +02:00
Adriane Boyd
cd5af72c9a
Update pkuseg version ()
* Update pkuseg version in Chinese tokenizer warnings
* Update pkuseg version in `Makefile`
* Remove warning about python3.8 wheels in docs
2020-07-19 11:09:49 +02:00
Ines Montani
6f4e4aceb3 Add Plausible [ci skip] 2020-07-18 23:50:29 +02:00
Adriane Boyd
5228920e2f
Clarify warning W030 for misaligned BILUO tags () 2020-07-14 14:09:48 +02:00
Adriane Boyd
7ea2cc7650
Set version to 2.3.2 () 2020-07-13 14:55:56 +02:00
Mark Neumann
27a1cd3c63
fix meta serialization in train ()
Co-authored-by: Mark Neumann <markng@allenai.org>
2020-07-12 22:06:46 +02:00
Adriane Boyd
0a62098c5f
Fix lemmatizer is_base_form for python2.7 ()
* Fix lemmatizer init args for python2.7

* Move English is_base_form to a class method

* Skip test pickling PhraseMatcher for python2
2020-07-09 22:11:24 +02:00
Adriane Boyd
923affd091
Remove is_base_form from French lemmatizer ()
Remove English-specific is_base_form from French lemmatizer.
2020-07-09 22:11:13 +02:00