8972 Commits

Author SHA1 Message Date
Tim
40d33261d6 Fixed typo in example of html visualizer (#3387)
* Fixed typo

* Add contributor agreement for tmetzl
2019-03-10 23:36:13 +01:00
Ines Montani
db03558288 Fix flake8 2019-03-09 02:59:29 +01:00
Ines Montani
40def86fdf Try running flake8 first 2019-03-09 02:56:20 +01:00
Ines Montani
9531213846 Remove other CI 2019-03-09 02:56:08 +01:00
Ines Montani
5a2e2b9db7 Update README.rst 2019-03-09 02:13:34 +01:00
Ines Montani
14a9b9753e Update README.rst 2019-03-09 01:42:17 +01:00
Ines Montani
47bf549f95 Update azure-pipelines.yml 2019-03-09 01:36:22 +01:00
Ines Montani
b7f9cbdc83 Fix undefined names 2019-03-09 01:35:36 +01:00
Ines Montani
78aa663f79 Fix flake8 2019-03-09 01:30:40 +01:00
Ines Montani
400c9eecb6 Re-add flake8 to CI 2019-03-09 01:20:42 +01:00
Ines Montani
3d08bf9514 Update azure-pipelines.yml
Try to work around "conflict with the backend dependencies: wheel==0.33.1 is incompatible with wheel<0.33.0,>=0.32.0"
2019-03-09 00:48:08 +01:00
Ines Montani
5bb8f123ca Update azure-pipelines.yml 2019-03-09 00:43:36 +01:00
Ines Montani
70da2097b4 Update azure-pipelines.yml 2019-03-09 00:35:41 +01:00
Ines Montani
7342348fc7 Update azure-pipelines.yml 2019-03-09 00:26:54 +01:00
Ines Montani
db9512f9e1 Update azure-pipelines.yml 2019-03-09 00:24:40 +01:00
Ines Montani
70511ba965 Update .gitignore 2019-03-09 00:24:20 +01:00
Ines Montani
086b267bdb Update azure-pipelines.yml for Azure Pipelines 2019-03-09 00:20:44 +01:00
Ines Montani
ec93b42353 Set up CI with Azure Pipelines 2019-03-09 00:19:03 +01:00
Adrien Ball
88909a9adb Fix egg fragments in direct download (#3369)
## Description
The egg fragment in the URL must be of the form `#egg=package_name==version` instead of `#egg=package_name-version`.
One of the consequences of specifying wrong egg fragments is that `pip` does not recognize the package and its version properly, and thus it re-downloads the package systematically.

I'm not sure how this should be tested properly. 
Here is what I had before the fix when running the same direct download twice:
```
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 1.6MB/s
  Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments.
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 919kB/s
  Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments.
Requirement already satisfied (use --upgrade to upgrade): en-core-web-sm from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0 in ./venv3/lib/python3.6/site-packages
```

And after the fix:
```
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 1.1MB/s
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Requirement already satisfied: en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 in ./venv3/lib/python3.6/site-packages (2.0.0)
```

### Types of change
This is an enhancement as it avoids unnecessary downloads of (potentially big) spacy models, when they have already been downloaded.

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-07 21:07:19 +01:00
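The egg fragment change described in the commit above can be sketched as a small URL builder. This is only an illustration, not spaCy's actual download code: the function name `direct_download_url` and its parameters are hypothetical, and the URL layout is copied from the pip output quoted in the commit message.

```python
def direct_download_url(base_url, model_name, version):
    """Build a direct-download URL whose egg fragment pip can parse.

    Hypothetical helper: the name and signature are assumptions made
    for this sketch and are not taken from the spaCy code base.
    """
    filename = "{name}-{version}".format(name=model_name, version=version)
    # Use `name==version` in the fragment (not `name-version`), so that pip
    # can recognize both the package name and the installed version.
    return "{base}/{filename}/{filename}.tar.gz#egg={name}=={version}".format(
        base=base_url, filename=filename, name=model_name, version=version
    )


url = direct_download_url(
    "https://github.com/explosion/spacy-models/releases/download",
    "en_core_web_sm",
    "2.0.0",
)
print(url)
```

With the `==` form, pip can match the requirement against what is already installed, which is why the second run in the commit message reports "Requirement already satisfied" without downloading again.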
Daniel King
5f40229397 Don't use numpy directly for similarity (#3362)
* Don't use numpy directly for similarity

* Contributor agreement
2019-03-06 22:58:38 +00:00
Julia Makogon
f1c3108d52 Fixing pymorphy2 dependency issue (#3329) (closes #3327)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement

* pymorphy2 initialization split for ru and uk (#3327)

* stop-words fixed

* Unit-tests updated
2019-02-25 15:48:17 +01:00
Michael Liberman
386cec1979 - Json fix in comment (#3294) 2019-02-19 18:01:35 +01:00
Roshni Biswas
e09f1347fa updates for Bengali language (#3286)
* Update morph_rules.py

* contributor agreement for roshni-b

* created example sentences
2019-02-18 10:02:28 +01:00
Marc Puig
51268e9f21 Typo error fixed (#3284) 2019-02-17 17:51:02 +01:00
Roshni Biswas
e26d923726 Update morph_rules.py (#3283) 2019-02-17 12:21:47 +01:00
Grivaz
39815513e2 Add split one token into several (resolves #2838) (#3253)
* Add split one token into several (resolves #2838)

* Improve error message for token splitting

* Make retokenizer.split() tests use a Token object

Change retokenizer.split() to use a Token object, instead of an index.

* Pass Token into retokenize.split()

Tweak retokenize.split() API so that we pass the `Token` object, not the index.

* Fix token.idx in retokenize.split()

* Test that token.idx is correct after split

* Fix token.idx for split tokens

* Fix retokenize.split()

* Fix retokenize.split

* Fix retokenize.split() test
2019-02-15 01:27:13 +11:00
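Several items in the commit above concern keeping `token.idx` (the character offset of a token) correct after a split. A minimal sketch of that bookkeeping follows, assuming the pieces are contiguous in the text with no whitespace between them. The helper `split_offsets` is hypothetical and is not the implementation used by `retokenizer.split()`.

```python
def split_offsets(token_idx, orths):
    """Return the character offset of each piece of a split token.

    Hypothetical helper for illustration. Assumes the pieces are
    adjacent in the original text, e.g. "NewYork" -> ["New", "York"].
    """
    offsets = []
    position = token_idx
    for orth in orths:
        offsets.append(position)
        # The next piece starts right after the current one ends.
        position += len(orth)
    return offsets


# A token "NewYork" starting at character 10, split into two pieces.
print(split_offsets(10, ["New", "York"]))
```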
Ines Montani
11d6b874db Update stop_words.py 2019-02-14 12:25:19 +01:00
Akhilesh
a78db10941 add kannada support (#3264)
* add kannada support

* add few more stop words

* add support for Kannada Language
2019-02-12 18:28:39 +01:00
Ines Montani
5dd39d8697 Update universe.json 2019-02-12 18:05:51 +01:00
Abhijit Balaji
75a40f56fc added spacy-langdetect to universe.json (#3266) 2019-02-12 18:04:38 +01:00
Ines Montani
7a985cba24 Fix typo (closes #3232) [ci skip] 2019-02-08 17:29:18 +01:00
Björn Lennartsson
647f0140c7 Fixed tag map for Swedish Talbanken (#3186) 2019-02-08 14:28:59 +11:00
Stanisław Giziński
1448ad100c Improved polish tokenizer and stop words. (#2974)
* Improved stop words list

* Removed some wrong stop words from list

* Improved stop words list

* Removed some wrong stop words from list

* Improved Polish Tokenizer (#38)

* Add tests for polish tokenizer

* Add polish tokenizer exceptions

* Don't split any words containing hyphens

* Fix test case with wrong model answer

* Remove commented out line of code until better solution is found

* Add source srx' license

* Rename exception_list.py to match spaCy conventions

* Add a brief explanation of where the exception list comes from

* Add newline after each exception

* Rename COPYING.txt to LICENSE

* Delete old files

* Add header to the license

* Agreements signed

* Stanisław Giziński agreement

* Krzysztof Kowalczyk - signed agreement

* Mateusz Olko agreement

* Add DoomCoder's contributor agreement

* Improve like number checking in polish lang


* like num tests added

* all from SI system added

* Final licence and removed splitting exceptions

* Added polish stop words to LEX_ATTRA

* Add encoding info to pl tokenizer exceptions
2019-02-08 14:27:21 +11:00
Julia Makogon
b41d64825a Ukrainian language added. Small fixes in Russian (#3241)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement
2019-02-07 21:05:11 +01:00
Ines Montani
4684195822 Rename contributer_agreement.md to .github/contributors/lauraBaakman.md 2019-02-07 20:55:53 +01:00
Laura Baakman
04aa041c9e Update Example input JSON file to adhere to specification. (#3243)
* Example file does not adhere to json input spec.

According to the [json input spec](https://spacy.io/api/annotation#json-input) the `id` needs to be an `int`, not a string. Using a string as `id` results in a `TypeError` when calling `spacy.gold.read_json_file()`.

* Add spaCy Contributor Agreement.
2019-02-07 16:18:01 +01:00
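The `id` constraint mentioned in the commit above can be checked before handing a file to spaCy. The sketch below uses only the standard library. The helper `find_non_int_ids` is hypothetical, and the example documents are abbreviated to the `id` and `paragraphs` keys rather than being full training data.

```python
import json


def find_non_int_ids(docs):
    """Return the positions of documents whose `id` is not an int.

    Hypothetical helper for illustration; it is not part of spaCy.
    """
    bad = []
    for position, doc in enumerate(docs):
        doc_id = doc.get("id")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(doc_id, int) or isinstance(doc_id, bool):
            bad.append(position)
    return bad


# The second document uses a string id, which is the case the commit fixes.
example = json.loads('[{"id": 0, "paragraphs": []}, {"id": "1", "paragraphs": []}]')
print(find_non_int_ids(example))
```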
PierreMonico
114d64c4b5 Fix typo (#3223) 2019-02-04 11:37:29 +01:00
Amandine Périnet
d570e75dbb Improving the French lookup dictionary for ambiguous words (#3185)
* modifying FR lookup to remove ambiguity and adding lookup vocab to FR files

* modifying FR lookup to remove ambiguity and adding lookup vocab to FR files

* updating the contributor agreement for amperinet
2019-01-31 23:53:45 +01:00
Ines Montani
e9a6dbe4f3 Don't check for Jupyter in global scope and fix check (#3213)
Resolves #3208.

Prevent interactions with other libraries (pandas) that also access `get_ipython().config` and its parameters. See #3208 for details. I don't fully understand why this happens, but in spaCy, we can at least make sure we avoid calling into this method.

## Description

### Types of change

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-01-31 23:49:13 +01:00
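The idea in the commit above, checking for Jupyter inside a function instead of at import time and without touching `get_ipython().config`, might look roughly like the sketch below. It is an assumption about the general approach, not a copy of spaCy's code. It relies on IPython injecting `get_ipython` into the namespace and on the Jupyter kernel's shell class being named `ZMQInteractiveShell`.

```python
def is_in_jupyter():
    """Return True only when running inside a Jupyter kernel.

    Sketch under assumptions: the check runs lazily when called, and it
    only inspects the shell's class name rather than its config.
    """
    try:
        # `get_ipython` exists only when IPython has injected it.
        shell_name = get_ipython().__class__.__name__  # noqa: F821
    except NameError:
        # Plain Python interpreter: no IPython, so no Jupyter.
        return False
    return shell_name == "ZMQInteractiveShell"


print(is_in_jupyter())
```

Because the lookup happens only when the function is called, importing the module has no side effects on other libraries that also read the IPython configuration.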
Amandine Périnet
b34bc9d2e9 add small fix for French lemmatizer (#3206) 2019-01-31 23:44:10 +01:00
mak
8fc6aaf134 Updated main to make use of lang variable (#3220)
Updated main to make use of language variable when initializing spacy.
2019-01-31 23:43:22 +01:00
adrianeboyd
03d58f9feb Update TIGER/German dependency relations in documentation (#3204)
* Add missing dependency relations for TIGER/German

* Contributor agreement for adrianeboyd
2019-01-30 14:23:12 +01:00
Loghi
5ca8e2b269 Tamil (#3194)
* Tamil language support
* stop words, examples and numerical attribute support added

* Contributor agreement signed

* Create Loghijiaha.md

Added contributor agreement

* Update CONTRIBUTOR_AGREEMENT.md

Adjusted contributor_agreement.md

* Norm exceptions added
2019-01-27 06:02:04 +01:00
foufaster
8bd85fd9d5 Fix french lemmatization (#3180) 2019-01-27 06:01:30 +01:00
Jo
f9ca09caa0 Create PolyglotOpenstreetmap.md (#3198)
* Create PolyglotOpenstreetmap.md

* forgot to tick that box
2019-01-26 14:02:54 +01:00
foufaster
8b61b6a6b5 Create foufaster.md (#3179) 2019-01-21 15:45:54 +01:00
Paul Ganssle
021d04069a Build metadata modernization - pyproject.toml and python_requires (#3167)
* Added pyproject.toml

This adds the build requirements metadata to the repo, which can be used
with any build tools that implement PEP 517 and PEP 518 (e.g. pip, tox).
It is no longer necessary to have the build dependencies installed when
installing from source.

* Add python_requires for 2.7, 3.4+

This directive specifies in the build metadata which version of CPython
is supported by this version of spaCy, which pip will take into account
when determining what version to download. This will allow you to safely
drop old versions of Python without `pip install spaCy` breaking for those
versions.

* Add Python 3.7 to the trove classifiers
2019-01-16 17:42:09 +01:00
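The version rule named in the commit above (2.7, 3.4+) can be written out as a small check. In a real project the rule is declared through the `python_requires` argument of setuptools and evaluated by pip. The helper `is_supported` below is hypothetical and only mirrors the rule for illustration.

```python
import sys


def is_supported(version_info):
    """Return True for Python 2.7 and for Python 3.4 or newer 3.x releases.

    Hypothetical helper; pip applies the real `python_requires`
    specifier from the package metadata, not this function.
    """
    major, minor = version_info[0], version_info[1]
    if major == 2:
        return minor == 7
    return major == 3 and minor >= 4


# Check the interpreter that is running this sketch.
print(is_supported(sys.version_info))
```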
Bram Vanroy
11cee62644 Updated spacy_conll information (#3158) 2019-01-16 13:46:16 +01:00
Björn Lennartsson
b892b446cc Updates to Swedish Language (#3164)
* Added the same punctuation rules as danish language.

* Added abbreviations and also the possibility to have capitalized abbreviations on some. Added a few specific cases too

* Added test for long texts in swedish

* Added morph rules, infixes and suffixes to __init__.py for swedish

* Added some tests for prefixes, infixes and suffixes

* Added tests for lemma

* Renamed files to follow convention

* [sv] Removed ambiguous abbreviations

* Added more tests for tokenizer exceptions

* Added test for problem with punctuation in issue #2578

* Contributor agreement

* Removed faulty lemmatization of 'jag' ('I') as it was lemmatized to 'jaga' ('hunt')
2019-01-16 13:45:50 +01:00
Gavriel Loria
9a5003d5c8 iob converter: add 'exception' for error 'too many values' (#3159)
* added contributor agreement

* issue #3128 throw exception on bad IOB/2 formatting

* Update spacy/cli/converters/iob2json.py with ValueError

Co-Authored-By: gavrieltal <gtloria@protonmail.com>
2019-01-16 13:44:16 +01:00
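The behaviour described in the commit above, raising a `ValueError` on badly formatted IOB input instead of failing with an opaque "too many values" unpacking error, can be sketched as follows. The function `parse_iob_line` is hypothetical and is not the code in `spacy/cli/converters/iob2json.py`. It assumes whitespace-separated items of the form `word|tag|ner` or `word|ner`.

```python
def parse_iob_line(line):
    """Split one line of `word|tag|ner` or `word|ner` items into columns.

    Hypothetical helper for illustration only.
    """
    tokens = [item.split("|") for item in line.split()]
    if not tokens:
        return [], [], []
    n_fields = len(tokens[0])
    # Reject anything other than 2 or 3 fields, and mixed field counts.
    if n_fields not in (2, 3) or any(len(t) != n_fields for t in tokens):
        raise ValueError(
            "Expected each item as word|tag|ner or word|ner: {!r}".format(line)
        )
    if n_fields == 3:
        words, tags, iob = zip(*tokens)
    else:
        words, iob = zip(*tokens)
        tags = ["-"] * len(words)  # placeholder when no tag column is given
    return list(words), list(tags), list(iob)


print(parse_iob_line("I|PRP|O like|VBP|O London|NNP|B-GPE"))
```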