Commit Graph

11723 Commits

Author SHA1 Message Date
Antonio Miras
b4bd8f347a
spaCy Universe: New project; SpacyDotNet (#6702)
* Universe: SpacyDotNet a .NET Core spaCy wrapper

* Signed contributor agreement

Co-authored-by: Antonio Miras <antonio@amiras.net>
2021-01-13 12:47:30 +11:00
Alex Combessie
9cc880014c
Remove questionable French stopwords (#6310)
* Remove questionable French stopwords

* Create alexcombessie.md
2021-01-08 11:36:22 +11:00
Cristiana S Parada
7a0222f260
Update stop_words.py in Portuguese (a,o,e) (#6345)
* Update stop_words.py

Added three aditional stopwords: "a" and "o" that means "the", and "e" that means "and"

* Create cristianasp.md

* zero edit to push CI

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-01-08 11:35:38 +11:00
Lorena Ciutacu
f11002f1f1
add new Romanian stopwords (#6621)
* add contributor agreement

* update ro stopwords list

* add new stopwords
2021-01-08 11:34:47 +11:00
ophelielacroix
e3222fdec9
Add (noun chunks) syntax iterators for Danish (#6246)
* add syntax iterators for danish

* add test noun chunks for danish syntax iterators

* add contributor agreement

* update da syntax iterators to remove nested chunks

* add tests for da noun chunks

* Fix test

* add missing import
* fix example

* Prevent overlapping noun chunks

Prevent overlapping noun chunks by tracking the end index of the
previous noun chunk span.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-01-07 16:33:00 +11:00
Sofie Van Landeghem
6f7e7d88b9
remove cause without apostrophe from norm exceptions (#6636) 2021-01-06 12:30:30 +08:00
Sofie Van Landeghem
87562e470d
fix backticks in docs (#6635) 2020-12-27 22:12:37 +01:00
Sofie Van Landeghem
8df5b7f513
fix documentation of 'path' in tokenizer.to_disk (#6634) 2020-12-27 22:01:06 +01:00
Yosi
cf52510631
Add Amharic አማርኛ Language support (#6583)
* Add Amharic to space

* clean up

* Add some PRON_LEMMA

* add Tigrinya support

* remove text_noun_chunks

* Tigrinya Support

* added some more details for ti

* fix unit test

* add amharic char range

* changes from review

* amharic and tigrinya share same unicode block

* get rid of _amharic/_tigrinya in char_classes

Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>
2020-12-22 16:50:34 +01:00
Tim Gates
292c1d6a73
docs: fix simple typo, speficied -> specified (#6611)
There is a small typo in spacy/cli/info.py.

Should read `specified` rather than `speficied`.
2020-12-22 09:14:10 +01:00
Gareth Sparks
efc229c3f4
Doc.char_span arg: alignment_mode (#6591)
Currently labeled "mode", actually "alignment_mode"
2020-12-18 09:54:56 +01:00
Ines Montani
7c9a2f298c
Merge pull request #6578 from jenojp/master [ci skip] 2020-12-16 17:31:55 +11:00
Ines Montani
d8aa113d16
Merge pull request #6566 from rafguns/cite-zenodo [ci skip] 2020-12-16 16:40:50 +11:00
Ines Montani
4feef6bf9f
Update citation 2020-12-16 15:59:57 +11:00
Jeno Pizarro
a6fe35a0f9
Update universe.json 2020-12-15 21:53:20 -05:00
Jeno Pizarro
343a44abe9 Merge branch 'master' of https://github.com/explosion/spaCy 2020-12-15 21:49:46 -05:00
Thomas Bird
f6e4378942
Add SCA for @thomasbird (#6576) 2020-12-15 20:59:47 +01:00
Raf Guns
ec876c9713 Merge branch 'master' of https://github.com/explosion/spaCy into cite-zenodo 2020-12-14 22:03:58 +01:00
Raf Guns
db2a34d610 Update CITATION to Zenodo 2020-12-14 22:01:24 +01:00
Raf Guns
a90ca0e1fb Add contributor agreement 2020-12-14 22:01:14 +01:00
Ines Montani
1d4b1dea25 Update contributing guide and issue template [ci skip] 2020-12-11 13:39:26 +11:00
Ines Montani
37c5d7e826
Merge pull request #6542 from adrianeboyd/chore/prepare-v2.3.5
Set version to v2.3.5
2020-12-11 10:33:18 +11:00
Ines Montani
fb43a30a71
Merge pull request #6545 from svlandeg/feature/discussions [ci skip] 2020-12-11 10:20:35 +11:00
Ines Montani
76cfd89dea Update site.json 2020-12-11 10:19:42 +11:00
Ines Montani
c9b67b02f8 Update issue templates 2020-12-11 10:05:47 +11:00
Ines Montani
43a69eecb7 Update site.json 2020-12-11 10:05:21 +11:00
Ines Montani
73896fcbc8 Update README.md 2020-12-11 10:05:19 +11:00
Ines Montani
25186fa431
Merge pull request #6543 from adrianeboyd/docs/install-v2
Docs and extras updates for v2.3.5
2020-12-11 09:53:53 +11:00
svlandeg
4afcd9567e refer to GH discussions 2020-12-10 20:56:12 +01:00
svlandeg
d156b423ae remove gitter and reddit links 2020-12-10 20:41:02 +01:00
svlandeg
5afa567767 replace gitter with discussions in 101 2020-12-10 20:17:36 +01:00
svlandeg
ae1ccf2b04 update link to discussion forum 2020-12-10 20:02:49 +01:00
svlandeg
52cdb12d26 add GH discussions to readme 2020-12-10 19:58:43 +01:00
Adriane Boyd
27bb75e2a0 Docs and extras updates for v2.3.5
* Update install instructions for updated packages

* Add `cuda110` and `cuda111` extras, remove upper `cupy` pins (only
compatible with `thinc>=7.4.4`)
2020-12-10 15:34:34 +01:00
Adriane Boyd
7b277661f6 Set version to v2.3.5 2020-12-10 13:32:10 +01:00
Koichi Yasuoka
0afb54ac93
JapaneseTokenizer.pipe added (#6515)
* JapaneseTokenizer.pipe added

For [spacymoji](https://spacy.io/universe/project/spacymoji)  with `Japanese()`.

* DummyTokenizer.pipe added instead
2020-12-08 20:02:23 +01:00
Adriane Boyd
df4891bed1
Remove blis python version constraints (#6522)
* Remove blis version constraints

After updating the blis sdist in v0.7.4, remove python version
constraints for blis build and install dependencies.

* Install sdist with --prefer-binary for python 3.5

* Fix duplicate sdist install steps

* Fix sdist install step types

* Fix blis pins in requirements.txt

* Remove wheel hack for python 3.5 from CI
2020-12-08 15:25:19 +01:00
Ines Montani
4e77349106
Merge pull request #6524 from adrianeboyd/bugfix/entity-ruler-subsequent
Fix subsequent pipe detection in EntityRuler
2020-12-08 22:17:28 +11:00
Adriane Boyd
6c221d4841 Fix subsequent pipe detection in EntityRuler
Fix subsequent pipe detection to detect the position of the current
object by comparing the component itself rather than from the factory
name.
2020-12-08 10:01:30 +01:00
Ines Montani
b87793a89a
Merge pull request #6523 from adrianeboyd/bugfix/remove-use-chars
Remove non-working --use-chars from train CLI
2020-12-08 09:30:48 +01:00
Adriane Boyd
5ceac425ee Remove non-working --use-chars from train CLI
Remove the non-working `--use-chars` option from the train CLI. The
implementation of the option across component types and the CLI settings
could be fixed, but the `CharacterEmbed` model does not work on GPU in
v2 so it's better to remove it.
2020-12-08 08:30:00 +01:00
Adriane Boyd
dcecc75270
Improve blis and numpy build dependencies (#6455)
* Fix blis build dependencies

* Add blis with python_version constraints to pyproject.toml
* Add blis to setup_requires

* Remove --only-binary from CI

* Reduce number of builds to speed up CI

* Add hack to install wheel for python 3.5 in linux

* Remove os spec from CI

* Remove detailed numpy build constraints

* Remove detailed numpy build constraints from `pyproject.toml` because
  it is too difficult to maintain for many architectures
  * These constraints are more a reflection of what is available on
    pypi as binary wheels rather than any real build requirements that
    it is necessary for users to follow when building from source
  * Users building their own binary packages will need to enforce the
    constraints that make sense in their environments, e.g., the `conda`
    compatible numpy pins

* Keep the build constraints in `build-constraints.txt` for use with our
  builds
  * Our builds with wheelwright are built against the earliest
    compatible binary versions of numpy on pypi
  * These constraints are documented within the distribution

* Revert "Remove os spec from CI"

This reverts commit 7489476688.
2020-12-08 14:29:34 +08:00
Adriane Boyd
e931d3f72b
Move max_length to nlp.make_doc() (#6512)
Move max_length check to `nlp.make_doc()` so that's it's also checked
for `nlp.pipe()`.
2020-12-08 14:24:02 +08:00
Sofie Van Landeghem
52fa46dd58
tested EL scripts with 2.3.4 (#6517) 2020-12-07 20:46:38 +01:00
Adriane Boyd
53c0fb7431
Only set NORM on Token in retokenizer (#6464)
* Only set NORM on Token in retokenizer

Instead of setting `NORM` on both the token and lexeme, set `NORM` only
on the token.

The retokenizer tries to set all possible attributes with
`Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate
which attributes are available for each. `NORM` is the only attribute
that's stored on both and for most cases it doesn't make sense to set
the global norms based on a individual retokenization. For lexeme-only
attributes like `IS_STOP` there's no way to avoid the global side
effects, but I think that `NORM` would be better only on the token.

* Fix test
2020-11-30 09:35:42 +08:00
Adriane Boyd
03ae77e603
Add SPACY as a Matcher attribute (#6463) 2020-11-30 09:34:50 +08:00
Adriane Boyd
3a5cc5f8b4 Set version to v2.3.4 2020-11-26 08:48:52 +01:00
Adriane Boyd
e0f5646a4a
Restore cleanup_beam method (#6446) 2020-11-25 13:21:48 +01:00
Jacob Bortell
fe9009911a Update rule-based-matching.md (#6421)
* Update rule-based-matching.md

Clarified case-sensititivy of dictionary-referencing attributes (POS/TAG/DEP/etc).

Clarified "Type" column header to "Value Type"

* Update rule-based-matching.md

Improved clarity of wording
2020-11-24 16:20:19 +01:00
Jacob Bortell
992723dfac
Add jabortell to the contributors (#6422)
* Add jabortell to the contributors

* Update jabortell.md

Added tick to applicable statement
2020-11-24 16:15:31 +01:00