Commit Graph

1268 Commits

Author SHA1 Message Date
Ines Montani
82045aac8a Merge regression tests 2019-07-10 12:49:18 +02:00
Ines Montani
570ab1f481 Fix handling of old entity ruler files
Expected an `entity_ruler.jsonl` file in the top-level model directory, so the path passed to from_disk by default (model path plus componentn name), but with the suffix ".jsonl".
2019-07-10 12:14:12 +02:00
Ines Montani
874d914a44 Tidy up test 2019-07-10 12:13:23 +02:00
Ines Montani
6ba5ddbd5f
Merge pull request #3864 from svlandeg/feature/nel-wiki
Entity linking using Wikipedia & Wikidata
2019-07-10 11:25:41 +02:00
cedar101
58f06e6180 Korean support (#3901)
* start lang/ko

* add test codes

* using natto-py

* add test_ko_tokenizer_full_tags()

* spaCy contributor agreement

* external dependency for ko

* collections.namedtuple for python version < 3.5

* case fix

* tuple unpacking

* add jongseong(final consonant)

* apply mecab option

* Remove Pipfile for now


Co-authored-by: Ines Montani <ines@ines.io>
2019-07-09 22:23:16 +02:00
Ines Montani
f2ea3e3ea2
Merge branch 'master' into feature/nel-wiki 2019-07-09 21:57:47 +02:00
Joshua Smith
2eb925bd05 Added an argument to EntityRuler constructor to pass attrs to… (#3919)
* Perserve flags in EntityRuler

The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized.  This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.

* add signed contributor agreement

* flake8 cleanup

mostly blank line issues.

* mark test from the issue as needing a model

The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.

* Adds `phrase_matcher_attr` to allow args to PhraseMatcher

This is an added arg to pass to the `PhraseMatcher`. For example,
this allows creation of a case insensitive phrase matcher when the
`EntityRuler` is created.  References explosion/spaCy#3822

* remove unneeded model loading

The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existings fixtures)

* updated docstring for new argument

* updated docs to reflect new argument to the EntityRuler constructor

* change tempdir handling to be compatible with python 2.7

* return conflicted code to entityruler

Some stuff got cut out because of merge conflicts, this
returns that code for the phrase_matcher_attr.

* fixed typo in the code added back after conflicts

* flake8 compliance

When I deconflicted the branch there were some flake8 issues
introduced. This resolves the spacing problems.

* test changes:  attempts to fix flaky test in python3.5

These tests seem to be alittle flaky in 3.5 so I changed the check to avoid
the comparisons that seem to be fail sometimes.
2019-07-09 20:09:17 +02:00
Joshua Smith
e8420ab2b7 Added support for serializing overwrite and ent_id_sep (#3918)
* Perserve flags in EntityRuler

The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized.  This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.

* add signed contributor agreement

* flake8 cleanup

mostly blank line issues.

* mark test from the issue as needing a model

The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.

* remove unneeded model loading

The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existings fixtures)

* change tempdir handling to be compatible with python 2.7

* Adds code to handle item saved before this change.

This code chanes how the save files are handled and how the bytes
are stored as well.  This code adds check to dispatch correctly
if it encounters bytes or files saved in the old format (and tests
for those cases).

* use util function for tempdir management

Updated after PR comments: this code now uses the make_tempdir function from util
instead of doing it by hand.
2019-07-08 17:28:28 +02:00
Rokas Ramanauskas
61ce126d4c Lithuanian language support (#3895)
* initial LT lang support

* Added more stopwords. Started setting up some basic test environment (not complete)

* Initial morph rules for LT lang

* Closes #1 Adds tokenizer exceptions for Lithuanian

* Closes #5 Punctuation rules. Closes #6 Lexical Attributes

* test: add native examples to basic tests

* feat: add tag map for lt lang

* fix: remove undefined tag attribute 'Definite'

* feat: add lemmatizer for lt lang

* refactor: add new instances to lt lang morph rules; use tags from tag map

* refactor: add morph rules to lt lang defaults

* refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup

* refactor: add capitalized words to lt lang lemmatizer

* refactor: add more num words to lt lang lex attrs

* refactor: update lt lang stop word set

* refactor: add new instances to lt lang tokenizer exceptions

* refactor: remove comments form lt lang init file

* refactor: use function instead of lambda in lt lex lang getter

* refactor: remove conversion to dict in lt init when dict is already provided

* chore: rename lt 'test_basic' to 'test_text'

* feat: add more lt text tests

* feat: add lemmatizer tests

* refactor: remove unused imports, add newline to end of file

* chore: add contributor agreement

* chore: change 'en' to 'lt' in lt example description

* fix: add missing encoding info

* style: add newline to end of file

* refactor: use python2 compatible syntax

* style: reformat code using black
2019-07-08 10:25:22 +02:00
svlandeg
1c80b85241 fix tests 2019-06-28 08:59:23 +02:00
Ines Montani
6ccdf37574 Exclude user_data when copying doc in displaCy (closes #3882) 2019-06-26 14:37:05 +02:00
svlandeg
8608685543 ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
svlandeg
ddc73b11a9 fix unicode literals 2019-06-24 12:58:18 +02:00
svlandeg
b76a43bee4 unicode strings 2019-06-19 13:26:33 +02:00
svlandeg
0b0959b363 UTF8 encoding 2019-06-19 13:11:39 +02:00
svlandeg
791327e3c5 Merge remote-tracking branch 'upstream/master' into feature/nel-wiki 2019-06-19 09:44:05 +02:00
Kabir Khan
1e19f34e29 Add optional id property to EntityRuler patterns (#3591)
* Adding support for entity_id in EntityRuler pipeline component

* Adding Spacy Contributor aggreement

* Updating EntityRuler to use string.format instead of f strings

* Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity.

* Fixing tests

* Remove custom extension entity_id and use built in ent_id token attribute.

* Changing entity_id to ent_id for consistent naming

* entity_ids => ent_ids

* Removing kb, cleaning up tests, making util functions private, use rsplit instead of split
2019-06-16 13:29:04 +02:00
Suraj Rajan
46c78d0a41 Dependency tree pattern matcher (#3465)
* Functional dependency tree pattern matcher

* Tests fail due to inconsistent behaviour

* Renamed dependencymatcher and added optimizations
2019-06-16 13:25:32 +02:00
BreakBB
d8573ee715 Update error raising for CLI pretrain to fix #3840 (#3843)
* Add check for empty input file to CLI pretrain

* Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key

* Skip empty values for correct pretrain keys and log a counter as warning

* Add tests for CLI pretrain core function make_docs.

* Add a short hint for the `tokens` key to the CLI pretrain docs

* Add success message to CLI pretrain

* Update model loading to fix the tests

* Skip empty values and do not create docs out of it
2019-06-16 13:22:57 +02:00
Ines Montani
f35ce09776 Add regression test for #3839 2019-06-12 13:38:30 +02:00
Ines Montani
aae9034492 Tidy up [ci skip] 2019-06-12 13:38:23 +02:00
svlandeg
5c723c32c3 entity vectors in the KB + serialization of them 2019-06-05 18:29:18 +02:00
svlandeg
d83a1e3052 Merge branch 'master' into feature/nel-wiki 2019-06-03 09:35:10 +02:00
Germán
86eb817b74 Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803))
* (#3803) Spanish like_num returning false for number-like token

* (#3803) Spanish like_num now returning True for number-like token
2019-06-02 12:22:57 +02:00
BreakBB
ed18a6efbd Add check for callable to 'Language.replace_pipe' to fix #3737 (#3741) 2019-05-14 16:59:31 +02:00
Ines Montani
8baff1c7c0
💫 Improve introspection of custom extension attributes (#3729)
* Add custom __dir__ to Underscore (see #3707)

* Make sure custom extension methods keep their docstrings (see #3707)

* Improve tests

* Prepend note on partial to docstring (see #3707)

* Remove print statement

* Handle cases where docstring is None
2019-05-12 00:53:11 +02:00
Ines Montani
505c9e0e19 Add util.filter_spans helper (#3686) 2019-05-08 02:33:40 +02:00
svlandeg
19e8f339cb deduce entity freq from WP corpus and serialize vocab in WP test 2019-04-29 17:37:29 +02:00
svlandeg
387263d618 simplify chains 2019-04-29 13:58:07 +02:00
svlandeg
54d0cea062 unit test for KB serialization 2019-04-24 23:52:34 +02:00
BreakBB
5b8dbe4975 Fix symlink creation to show error message on failure (#3589) (resolves #3307))
* Fix symlink creation to show error message on failure. Update tests to reflect those changes.

* Fix test to succeed on non windows systems.
2019-04-16 11:58:31 +02:00
svlandeg
9a7d534b1b enable nogil for cython functions in kb.pxd 2019-04-10 17:25:10 +02:00
Ines Montani
4d198a7e92 Ensure match pattern error isn't raised on empty errors (closes #3549) 2019-04-09 12:50:43 +02:00
Ines Montani
145c0b7e88 Tidy up and auto-format 2019-04-09 11:40:19 +02:00
Ines Montani
5f005adf61 Add xfailing test for #3555 2019-04-09 11:07:14 +02:00
Ines Montani
4faf62d515
Merge pull request #3530 from svlandeg/fix/issue_3521
Allow English stopwords with any type of apostrophe
2019-04-03 14:14:03 +02:00
Yves Peirsman
951825532c Improved Dutch language resources and Dutch lemmatization (#3409)
* Improved Dutch language resources and Dutch lemmatization

* Fix conftest

* Update punctuation.py

* Auto-format

* Format and fix tests

* Remove unused test file

* Re-add deleted test

* removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains

* Cleaner lemmatization files
2019-04-03 14:13:26 +02:00
svlandeg
4ff786e113 addressed all comments by Ines 2019-04-03 13:50:33 +02:00
Ines Montani
6a4575a56c Don't make "settings" or "title" required in displaCy data (closes #3531) 2019-04-03 10:13:16 +02:00
svlandeg
85b4319f33 specify encoding in files 2019-04-02 15:05:31 +02:00
svlandeg
673c81bbb4 unicode string for python 2.7 2019-04-02 13:52:07 +02:00
svlandeg
eca9cc5417 fixing Issue #3521 by adding all hyphen variants for each stopword 2019-04-02 13:24:59 +02:00
svlandeg
e7062cf699 failing test for Issue #3521 2019-04-02 13:15:35 +02:00
svlandeg
1424b12b09 failing test for Issue #3449 2019-04-02 13:06:37 +02:00
Ines Montani
c23e234d65 Auto-format 2019-04-01 12:11:27 +02:00
Ines Montani
68900066e0
Merge pull request #3459 from svlandeg/feature/el-framework
Basic framework and APIs for entity linker
2019-03-29 14:02:22 +01:00
Hiromu Hota
914b9ff3d2 Tags are joined with a comma and padded with asterisks (#3491)
<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

Fix a bug in the test of JapaneseTokenizer.
This PR may require @polm's review.

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

Bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-28 16:17:31 +01:00
Samuel Kane
06a1846379 fix(util): fix decaying function output (#3495)
* fix(util): fix decaying function output

* fix(util): better test and adhere to code standards

* fix(util): correct variable name, pytestify test, update website text
2019-03-28 13:24:47 +01:00
Duygu Altinok
5a7bc6b39d Fix/irreg adverbs extension (#3499)
* extended list of irreg adverbs

* added test to exceptions

* fixed typo
2019-03-28 13:23:33 +01:00
Sofie
a4a6bfa4e1
Merge branch 'master' into feature/el-framework 2019-03-26 11:00:02 +01:00