Commit Graph

282 Commits

Author SHA1 Message Date
YohannesDatasci
beef184e53
Armenian language support (#5246)
* add Armenian language and test cases

* agreement submission
2020-04-03 13:02:18 +02:00
Michael Leichtfried
2b14997b68
Remove duplicated branch in if/else-if statement (#5234)
* Remove duplicated branch in if-elif-statement

* Add contributor agreement for leicmi
2020-04-02 14:47:42 +02:00
Jacob Lauritzen
0b76212831
Extend and fix Danish examples (#5227)
* Extend and fix Danish examples

This PR fixes two examples, adds additional examples translated from the English version, and adds punctuation.

The two changed examples are:
* "fortov" changed to "fortovet", which is more [used](https://www.google.com/search?client=firefox-b-d&sxsrf=ALeKk0143gEuPe4IbIUpzBBt-oU10OMVqA%3A1585549036477&ei=7I6BXuvJHMGOrwSqi46oCQ&q=l%C3%B8behjul+p%C3%A5+fortov&oq=l%C3%B8behjul+p%C3%A5+fortov&gs_lcp=CgZwc3ktYWIQAzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQR1DT8xZY0_MWYK_0FmgAcAZ4AIABAIgBAJIBAJgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwjr7964xsHoAhVBx4sKHaqFA5UQ4dUDCAo&uact=5) and more natural. The Swedish and Norwegian examples also use this version of the word.
* "stor by" changed to "storby". In Danish we have a specific noun to describe a large, metropolitan city which is different from just describing a city as "large". In this sentence it would be much more natural to describe London as a "storby". Google even corrects a search for "London stor by" to "London storby".

* Sign contrib agreement
2020-04-02 10:42:35 +02:00
Nikhil Saldanha
4f27a24f5b
Add Kannada examples (#5162)
* Add example sentences for Kannada

* sign contributor agreement
2020-03-29 13:54:42 +02:00
Tom Milligan
e904958115
Limit to cupy-cuda v8, so as not to pull in v9 automatically. (#5194) 2020-03-29 13:52:08 +02:00
Tiljander
e53232533b
Describing priority rules for overlapping matches (#5197)
* Describing priority rules for overlapping matches

* Create Tiljander.md

* Describing priority rules for overlapping matches

* Update website/docs/api/entityruler.md

Co-authored-by: Ines Montani <ines@ines.io>
2020-03-26 13:13:22 +01:00
Ines Montani
3fc2309c48
Merge pull request #5174 from Baciccin/master
Add Ligurian language
2020-03-24 16:33:59 +01:00
Philip Gillißen
128acb9ee1
Update guerda.md 2020-03-24 10:42:30 +01:00
Philip Gillißen
5d067bcc5e
Add SCA for guerda 2020-03-24 10:42:10 +01:00
Baciccin
3b53617a69 Add Ligurian language 2020-03-19 21:37:01 -07:00
Ines Montani
17bd9ed84f
Merge pull request #5153 from pinealan/fix/website-docs
Fix website typos and weird sentences
2020-03-16 15:03:01 +01:00
Alan Chan
1ae01684cf Fill in contributor agreement 2020-03-15 03:45:20 +08:00
nihil
9cde7eb08c add spacy_syllables to universe + sign contributor agreement 2020-03-13 18:09:42 +01:00
Himanshu Garg
27d1300bdb
Create merrcury.md 2020-03-10 15:11:07 +05:30
Mark Abraham
0345135167
Tokenizer to_disk and from_disk now ensure paths (#5116)
* Tokenizer to_disk and from_disk now ensure strings are converted to paths

Fixes #5115

* Sign contributor agreement
2020-03-08 13:25:56 +01:00
David Pollack
80004930ed fix typo in svg file 2020-03-05 17:04:33 +01:00
Tom Keefe
ddf63b97a8
make idx available via to_array (#5030) 2020-02-22 14:13:06 +01:00
Jan Jessewitsch
c7e4fe9c5c
Fix/Improve German stop words (#5024)
* Fix German stop words

Two stop words ("einige" and "einigen") are sticking together.
Remove three nouns that may serve as stop words in a specific context (e.g. religious or news) but are not applicable for general use.

* Create Jan-711.md
2020-02-17 18:59:22 +01:00
Filip Bednárik
d4f4060bf3
Add Slovak language tools implementation (#4943)
* Add correct stopwords for Slovak language

* Add SNK Tags

* Disable formatting lint for TAGS

* Add example sentences for Slovak language

* Add Slovak numerals in base form

* Add lex_attrs to sk init

* Add contributor agreement
2020-02-03 13:03:59 +01:00
Tyler Couto
9fa9d7f2cb
Fix for Issue 4665 - conllu2json (#4953)
* Fix for Issue 4665 - conllu2json

- Allowing HEAD to be an underscore

* Added contributor agreement
2020-02-03 13:01:48 +01:00
Paco Nathan
49fefb6139 Submitting PyTextRank for inclusion in the spaCy uniVerse (#4942)
* submitting PyTextRank for consideration of including in the spaCy uniVerse

* including SCA
2020-01-28 11:37:54 +01:00
Anastasiia Iurshina
1830a12578 Fixes typos (#4843)
* Fixes typos

* Fixes typo

* Contributor agreement
2019-12-29 14:24:13 +01:00
Ivan Echevarria
ef13e0c038 Add n_process to Language.pipe documentation (#4842) [ci skip]
* Add n_process to documentation

* Auto-format and add default [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Al Johri
fd4a7bd2b7 sign contributor agreement for AlJohri (#4839) [ci skip] 2019-12-29 14:17:28 +01:00
Olamilekan Wahab
a741de7cf6 Adding support for Yoruba Language (#4614)
* Adding Support for Yoruba

* test text

* Updated test string.

* Fixing encoding declaration.

* Adding encoding to stop_words.py

* Added contributor agreement and removed iranlowo.

* Added removed test files and removed iranlowo to keep project bare.

* Returned CONTRIBUTING.md to default state.

* Added deleted conftest entries

* Tidy up and auto-format

* Revert CONTRIBUTING.md

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-21 14:11:50 +01:00
Nicolai Bjerre Pedersen
de5453cdcb Fix link to user hooks in docs (#4778)
* Fix link to user hooks in docs

* Update mr_bjerre.md

Mistake in contributor agreement

* Apparently hard to get it right (wrong name of sca)
2019-12-06 19:17:12 +01:00
Antti Ajanki
e626a011cc Improvements to the Finnish language data (#4738)
* Enable lex_attrs on Finnish

* Copy the Danish tokenizer rules to Finnish

Specifically, don't break hyphenated compound words

* Contributor agreement

* A new file for Finnish tokenizer rules instead of including the Danish ones
2019-12-03 12:55:28 +01:00
Matt Maybeno
c9f1e99787 Agnostic vocab array fix (#4680)
* Use get_array_module instead of numpy

* add contributor agreement
2019-11-23 14:59:52 +01:00
GuiGel
8f7ab70870 Bugfix/fix entity ruler from disk (#4670)
* fix EntityRuler from_disk bug

* add contributor file

* Test EntityRuler PhraseMatcher deserialization (#4651)

* newline at end of file

* fix copy paste error

* serializing the EntityRuler by itself

* Add unicode declarations for Python 2 and auto-format
2019-11-21 16:26:37 +01:00
Elijah Rippeth
5ad5c4b44a Add initial Korean support (#4660)
* add hangul and jamo char classes.

* add initial Korean lexical attributes.

* add contributor agreement
2019-11-18 12:56:07 +01:00
Christoph Purschke
433748e867 Fix basic language support for Luxembourgish (by adding punctuation.py) (#4648)
* Update __init__.py

* Create punctuation.py

* Update tokenizer_exceptions.py

* Create questoph.md

* Update questoph.md

* Update test_text.py

* Update test_text.py

* Update test_text.py

* Update test_text.py
2019-11-15 16:16:47 +01:00
Priscilla de Abreu Lopes
39e79fcc86 Bugfix/dep matcher issue 4590 (#4601)
* add contributor agreement for prilopes

* add test for issue #4590

* fix on_match params for DependencyMatcher (#4590)
2019-11-07 12:01:06 +01:00
Neel Kamath
6c036ab57d Add "spaCy Server" to spaCy Universe (#4553)
* Add "spaCy Server" to spaCy Universe

* Accept the spaCy Contributor Agreement
2019-10-30 13:20:46 +01:00
Ines Montani
1185702993 Port over contributor agreement from spacy-lookups-data [ci skip] 2019-10-25 13:06:10 +02:00
Zhuoru Lin
10d88b09bb Bugfix/fix wikidata train entity linker (#4509)
* Fix labels_discard Nonetype iteration error

* Contributor agreement for Zhuoru Lin

* Enhance EntityLinker.predict() to handle labels_discard is None case.
2019-10-24 12:52:59 +02:00
gustavengstrom
050e2445a8 Adding noun_chunks to the Swedish language model (sv) (#4422)
* Create syntax_iterators.py

Replica of spacy/lang/fr/syntax_iterators.py

* Added import statements for SYNTAX_ITERATORS

* Create gustavengstrom.md

* Added "dobj" to list of labels in noun_chunks method and a test_noun_chunks method to the Swedish language model.

* Delete README-checkpoint.md


Co-authored-by: Gustav <gustav@davcon.se>
Co-authored-by: Ines Montani <ines@ines.io>
2019-10-21 12:57:06 +02:00
Pepe Berba
7772d5d3c5 Update vocab.get_vector docs to include features on Fasttext ngram (#4464)
* Update `vocab.get_vector`

* Added contrib agreement
2019-10-20 01:28:18 +02:00
Peter Gilles
428887b8f2 Initial commit: New language Luxembourgish (lb) (#4424)
* new language: Luxembourgish (lb)

* update

* update

* Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md

* Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md

* Update norm_exceptions.py

* Delete README.md

* moved test_lemma.py

* deactivated 'lemma_lookup = LOOKUP'

* update

* Update conftest.py

* update

* tests updated

* import unicode_literals

* Update spacy/tests/lang/lb/test_text.py

Co-Authored-By: Ines Montani <ines@ines.io>

* Create PeterGilles.md
2019-10-14 12:27:50 +02:00
Ben Taylor
1db79a33cb most_similar() return the k most similar vectors (#4364)
* most_similar return n-most similar vectors

* updated most_similar comment

* add bintay contributor agreement

* sign bintay contributor agreement

* fix most_similar documentation typo

* fixed error in prune_vectors

* updated prune_vectors test
2019-10-03 14:09:44 +02:00
Rahul Soni
ed620daa5c Fix example sentences in Hindi for grammatical errors (#4343)
* Fix grammar for Hindi

* Fix grammar for Hindi

* Submit contributor agreement
2019-09-30 23:32:49 +02:00
EarlGreyT
1e9e2d8aa1 fix typo in first token (#4327)
* fix typo in first token

The head of 'in' is review which has an offset of 4 and not 44

* added contributor agreement
2019-09-27 14:49:36 +02:00
Jaydeep Borkar
6a06a3fa6a Update stop_words.py and add name in contributors (#4325)
* Update stop_words.py and add name in contributors

* add jaydeepborkar.md in contributors directory

* Reset template [ci skip]


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-27 11:57:27 +02:00
Em Zhan
aafa091541 Fix typo in documentation (#4322)
* Fix typo 'probj' instead of 'pobj'

* Add spaCy contributor agreement for zqianem
2019-09-25 19:42:18 +02:00
Sean Löfgren
31c683d87d add return_matches and as_tuples back to Matcher.pipe (#4303)
* add contributor agreement [ci skip]

* add return_matches and as_tuples back to Matcher.pipe
2019-09-18 22:00:33 +02:00
Moshe Hazoom
72463b062f Improve speed of _merge method (#4300)
* make merge more efficient

* fix offsets

* merge works with relative indices

* remove printing

* Add the SCA

* fix SCA date

* more cythonize _retokenize.pyx

* more cythonize _retokenize.pyx

* fix only declaration in _retokenize.pyx

* switch back to absolute head

* switch back to absolute head

* fix comment

* merge from origin repo
2019-09-18 21:34:34 +02:00
tamuhey
71909cdf22 Fix iss4278 (#4279)
* fix: len(tuple) == 2

* (#4278) add fail test

* add contributor's agreement
2019-09-12 10:44:49 +02:00
Mihai Gliga
25aecd504f adding Romanian tag_map (#4257)
* adding Romanian tag_map

* added SCA file

* forgotten import
2019-09-09 11:53:09 +02:00
Ines Montani
bcd1b12f43 Add contributor agreement [ci skip] 2019-08-30 17:02:43 +02:00
Andrei-Marius Avram
199589228e Added RONEC to spaCy Universe (#4151)
* Added RONEC to spaCy Universe

* Added contributor file

* Corrected date from .github/contributors/avramandrei.md

* Convert tabs to spaces

* Remove duplicate keys

Can only have one GitHub link unfortunately

* Also add models category

* Adjust ID

This is used to generate the URL, so a simpler string is better
2019-08-20 14:46:07 +02:00
Ivan Šarić
434f6fa6c1 Issue #1107 - adds examples.py for Croatian language (#4143)
* adds contributor agreement for isaric

* adds examples.py for Croatian language
2019-08-18 23:04:41 +02:00