Ines Montani
9b42e2d5dd
Experiment with escaping hyphens
2019-03-09 02:05:26 +01:00
Ines Montani
76764fcf59
💫 Improve converters and training data file formats ( #3374 )
...
* Populate converter argument info automatically
* Add conversion option for msgpack
* Update docs
* Allow reading training data from JSONL
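For readers unfamiliar with JSONL, the format stores one JSON object per line. The following is a minimal standard-library sketch of reading such input; the helper name `read_jsonl` and the record fields are hypothetical and are not taken from spaCy's converter code.

```python
import io
import json


def read_jsonl(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines between records
            continue
        yield json.loads(line)


# Hypothetical records, used only to illustrate the line-per-object layout.
sample = io.StringIO('{"text": "Hello world"}\n\n{"text": "Second example"}\n')
records = list(read_jsonl(sample))
print(len(records))
```

Reading line by line means a large file never has to be loaded into memory as a single JSON document.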
2019-03-08 23:15:23 +01:00
Ines Montani
296446a1c8
Tidy up and improve docs and docstrings ( #3370 )
...
## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs
### Types of change
enhancement, docs
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-08 11:42:26 +01:00
Ines Montani
daaeeb7a2b
Merge branch 'master' into develop
2019-03-07 22:07:31 +01:00
Adrien Ball
88909a9adb
Fix egg fragments in direct download ( #3369 )
...
## Description
The egg fragment in the URL must be of the form `#egg=package_name==version` instead of `#egg=package_name-version`.
One consequence of a wrong egg fragment is that `pip` does not recognize the package and its version, so it re-downloads the package every time.
I'm not sure how this should be tested properly.
Here is what I had before the fix when running the same direct download twice:
```
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
100% |████████████████████████████████| 37.4MB 1.6MB/s
Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments.
Installing collected packages: en-core-web-sm
Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
100% |████████████████████████████████| 37.4MB 919kB/s
Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments.
Requirement already satisfied (use --upgrade to upgrade): en-core-web-sm from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0 in ./venv3/lib/python3.6/site-packages
```
And after the fix:
```
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
100% |████████████████████████████████| 37.4MB 1.1MB/s
Installing collected packages: en-core-web-sm
Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Requirement already satisfied: en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 in ./venv3/lib/python3.6/site-packages (2.0.0)
```
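The URL shape seen in the logs above can be sketched as follows. This is only an illustration of the `#egg=package_name==version` form: the helper name `direct_download_url` is hypothetical, and the base URL is copied from the log output rather than from spaCy's source.

```python
# Base URL as it appears in the pip output above.
BASE_URL = "https://github.com/explosion/spacy-models/releases/download"


def direct_download_url(model_name, version):
    """Build a direct download URL with a `name==version` egg fragment."""
    filename = "{name}-{version}".format(name=model_name, version=version)
    # The fragment uses "==" between name and version, not a hyphen.
    egg = "{name}=={version}".format(name=model_name, version=version)
    return "{base}/{f}/{f}.tar.gz#egg={egg}".format(base=BASE_URL, f=filename, egg=egg)


url = direct_download_url("en_core_web_sm", "2.0.0")
print(url)
```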
### Types of change
This is an enhancement: it avoids unnecessary downloads of (potentially large) spaCy models that have already been downloaded.
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-07 21:07:19 +01:00
Ines Montani
96b91a8898
Fix noqa [ci skip]
2019-03-07 12:25:00 +01:00
Ines Montani
9d6ca18a10
Tidy up and only use self.vector once
2019-03-07 01:06:12 +01:00
Ines Montani
a8f1efd2f5
Merge branch 'master' into develop
2019-03-07 00:56:31 +01:00
Daniel King
5f40229397
Don't use numpy directly for similarity ( #3362 )
...
* Don't use numpy directly for similarity
* Contributor agreement
2019-03-06 22:58:38 +00:00
Ines Montani
6bd34e9d54
Expose Japanese stop words ( closes #3346 )
2019-03-06 14:21:15 +01:00
Ines Montani
85deb96278
Fix whitespace
2019-03-06 14:20:34 +01:00
Ines Montani
23f6ebf0f3
Add missing " ( closes #3343 )
2019-02-27 16:37:03 +01:00
Ines Montani
533b580c19
Add test for stray print statements in languages (see #3342 )
2019-02-27 16:04:30 +01:00
Ines Montani
48a2046d1c
Remove stray print statement ( closes #3342 )
2019-02-27 15:35:04 +01:00
Ines Montani
07d7c0a1af
Fix whitespace
2019-02-27 15:34:21 +01:00
Ines Montani
9b62639d19
Auto-format [ci skip]
2019-02-27 14:24:55 +01:00
Matthew Honnibal
656edcb984
Set version to v2.1.0a10
2019-02-27 12:26:13 +01:00
Matthew Honnibal
f1d77eb140
💫 Improve handling of missing NER tags ( closes #2603 ) ( #3341 )
...
* Improve handling of missing NER tags
GoldParse can accept missing NER tags, if entities is provided
in BILUO format (rather than as spans). Missing tags can be provided
as None values.
Fix bug that occurred when first tag was a None value. Closes #2603 .
* Document specification of missing NER tags.
2019-02-27 12:06:32 +01:00
Ines Montani
e359bdd0e3
Auto-format
2019-02-27 11:56:45 +01:00
Matthew Honnibal
4a3371acd5
Make doc[0].is_sent_start == True ( closes #2869 ) ( #3340 )
...
* Make doc[0] have sent_start True. Closes #2869
* Document that doc[0].is_sent_start defaults True.
2019-02-27 11:17:17 +01:00
Matthew Honnibal
2d3ce89b78
Improve matcher tests re issue #3328
2019-02-27 10:25:56 +01:00
Matthew Honnibal
8d6954e0e7
Fix matcher bug #3328
2019-02-27 10:25:39 +01:00
Ines Montani
aadf586789
Add xfailing test for #3331
2019-02-25 22:33:30 +01:00
Matthew Honnibal
3cdd3eb518
Set version to v2.1.0a9
2019-02-25 21:55:19 +01:00
Matthew Honnibal
b449be0f04
Add comment re issue #3170
2019-02-25 21:24:03 +01:00
Matthew Honnibal
9ccd6a3062
Fix head-outside-sentence bug. Fixes #3170
2019-02-25 21:21:44 +01:00
Matthew Honnibal
f2fae1f186
Add batch size argument to Language.evaluate(). Closes #3263
2019-02-25 19:30:33 +01:00
Ines Montani
f135d663f7
Update conftest.py
2019-02-25 15:55:29 +01:00
Ines Montani
76ce8b2662
Merge branch 'master' into develop
2019-02-25 15:54:55 +01:00
Julia Makogon
f1c3108d52
Fixing pymorphy2 dependency issue ( #3329 ) ( closes #3327 )
...
* Classes for Ukrainian; small fix in Russian.
* Contributor agreement
* pymorphy2 initialization split for ru and uk (#3327 )
* stop-words fixed
* Unit-tests updated
2019-02-25 15:48:17 +01:00
Ines Montani
1a735e0f1f
Add regression test for #3328
2019-02-25 10:12:58 +01:00
Ines Montani
dfbed07d3b
Remove unused temp errors
2019-02-24 22:26:08 +01:00
Ines Montani
62b558ab72
💫 Support lexical attributes in retokenizer attrs ( closes #2390 ) ( #3325 )
...
* Fix formatting and whitespace
* Add support for lexical attributes (closes #2390 )
* Document lexical attribute setting during retokenization
* Assign variable outside of nested loop
2019-02-24 21:13:51 +01:00
Ines Montani
a48deb4081
Merge regression tests
2019-02-24 21:03:39 +01:00
Ines Montani
8f6c193a4d
Delete _test_issue1622.py
2019-02-24 20:33:31 +01:00
Ines Montani
c8e967c78d
Try include previously segfaulting test
2019-02-24 20:32:46 +01:00
Ines Montani
328b589deb
Merge regression tests
2019-02-24 20:31:38 +01:00
Ines Montani
3bc53905cc
Remove print statements from test
2019-02-24 20:31:15 +01:00
Ines Montani
1ae0df3da9
Un-x-fail passing test
2019-02-24 20:24:15 +01:00
Ines Montani
399a5803d0
Tidy up tests [ci skip]
2019-02-24 19:02:16 +01:00
Ines Montani
2011563c51
Update docstrings [ci skip]
2019-02-24 18:39:59 +01:00
Ines Montani
df19e2bff6
💫 Allow setting of custom attributes during retokenization ( closes #3314 ) ( #3324 )
...
## Description
This PR adds the ability to override custom extension attributes during merging. This only works for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attributes that have both a getter *and* a setter implemented.
```python
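from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()  # blank English pipeline, assumed here so the example runs
# Register a writable custom attribute with a default value
Token.set_extension("is_musician", default=False)
doc = nlp("I like David Bowie.")
with doc.retokenize() as retokenizer:
    # Custom extension attributes are passed under the "_" key
    attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
    retokenizer.merge(doc[2:4], attrs=attrs)
assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician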
```
### Types of change
enhancement
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-24 18:38:47 +01:00
Ines Montani
1ea1bc98e7
Document regex utilities [ci skip]
2019-02-24 18:34:10 +01:00
Matthew Honnibal
1f7c56cd93
Fix parser.add_label()
2019-02-24 16:53:22 +01:00
Matthew Honnibal
893aa40d73
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2019-02-24 16:43:01 +01:00
Matthew Honnibal
5882d82915
Set version to v2.1.0a9.dev2
2019-02-24 16:42:06 +01:00
Matthew Honnibal
0367f864fe
Fix handling of added labels. Resolves #3189
2019-02-24 16:41:41 +01:00
Matthew Honnibal
d74dbde828
Fix order of actions when labels added to parser
...
When labels were added to the parser or NER, we weren't loading back the
classes in the correct order. Re issue #3189
2019-02-24 16:36:29 +01:00
Ines Montani
6de81ae310
Fix formatting of errors
2019-02-24 15:11:28 +01:00
Ines Montani
d8f69d592f
Tidy up retokenizer tests
2019-02-24 14:14:11 +01:00
Ines Montani
723e27cb8c
Tidy up tests
2019-02-24 14:11:23 +01:00
Ines Montani
2982f82934
Auto-format
2019-02-24 14:09:15 +01:00
Matthew Honnibal
909a9d9932
Set version to v2.1.0a9.dev1
2019-02-23 13:10:42 +01:00
Matthew Honnibal
6b0008afc6
Clean up TextCategorizer slightly
2019-02-23 12:28:06 +01:00
Matthew Honnibal
d13b9373bf
Improve initialization for mutually exclusive textcat
2019-02-23 12:27:45 +01:00
Matthew Honnibal
e9dd5943b9
Support exclusive_classes setting for textcat models
2019-02-23 11:57:16 +01:00
Matthew Honnibal
ce1e4eace2
Default to former TextCategorizer model
...
* Keep TextCategorizer default model same as v2.0
* Add option 'architecture' that allows "simple_cnn" to switch to
simpler model.
* Add option exclusive_classes, defaulting to False. If set to True,
the model treats classes as mutually exclusive, i.e. only one class can
be true per instance.
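The difference between mutually exclusive and independent classes can be illustrated with a small sketch. It assumes the common modelling choice of a softmax for exclusive classes and a per-class sigmoid otherwise; this is not a statement about how the TextCategorizer implements `exclusive_classes`, and the scores are made up.

```python
import math


def softmax(scores):
    """Mutually exclusive classes: probabilities compete and sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores):
    """Independent classes: each score is squashed on its own."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


raw_scores = [2.0, 1.0, -1.0]  # made-up scores for three labels
exclusive = softmax(raw_scores)
independent = sigmoid(raw_scores)
print(round(sum(exclusive), 6))
print([p > 0.5 for p in independent])
```

With the sigmoid, several labels can exceed 0.5 at once; with the softmax, raising one label's probability necessarily lowers the others.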
2019-02-23 11:55:16 +01:00
Matthew Honnibal
829c9091a4
Set version to v2.1.0a9.dev0
2019-02-21 17:13:34 +01:00
Matthew Honnibal
d396a69c7b
More fixes for issue #3112
2019-02-21 17:12:23 +01:00
Ines Montani
80bdcb99c5
Fix escaping of HTML in displacy ENT ( closes #2728 )
2019-02-21 14:30:39 +01:00
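As background for the escaping fix above: entity text that contains markup has to be escaped before it is placed into generated HTML, otherwise it is interpreted as tags. A minimal standard-library sketch, not displaCy's actual code, with a made-up input string:

```python
import html

# Made-up entity text containing characters that are special in HTML.
ent_text = "<b>Apple</b> & Co."
escaped = html.escape(ent_text)
print(escaped)
```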
Matthew Honnibal
7d529ebdfb
Set version to v2.1.0a8
2019-02-21 12:09:34 +01:00
Matthew Honnibal
f75be6e7be
Set version to v2.1.0a8.dev1
2019-02-21 11:57:06 +01:00
Matthew Honnibal
c5f947f194
Fix regex deprecation warnings
2019-02-21 11:56:47 +01:00
Matthew Honnibal
7f02464494
Set version to v2.1.0a8.dev0
2019-02-21 11:42:23 +01:00
Matthew Honnibal
f31dbec528
More fixes for #3112
2019-02-21 11:10:10 +01:00
Matthew Honnibal
80195bc2d1
Fix issue #3288 ( #3308 )
2019-02-21 09:48:53 +01:00
Matthew Honnibal
a137e8b418
Fix Pipe.to_bytes() when model uninitialized
...
Closes #3289
2019-02-21 09:42:02 +01:00
Matthew Honnibal
6574e4f2d3
Fix issue #3112 part 1
2019-02-21 09:27:38 +01:00
Matthew Honnibal
b21481eeca
Load token_match regex with .match, not .search
2019-02-21 09:09:03 +01:00
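The distinction between `.match` and `.search` in the commit above matters because the two methods anchor differently. A short sketch using Python's `re` module; the pattern is a hypothetical stand-in and not spaCy's `token_match` expression.

```python
import re

# Hypothetical pattern for illustration only: a simple URL-like token.
token_pattern = re.compile(r"https?://\S+")

text = "see:https://example.com"

# .search finds the pattern anywhere in the string.
found_anywhere = token_pattern.search(text) is not None
# .match only succeeds if the pattern matches at the start of the string.
found_at_start = token_pattern.match(text) is not None

print(found_anywhere, found_at_start)
```

For a check that decides whether a whole candidate string should be kept as one token, anchoring at the start avoids accepting strings that merely contain a matching substring.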
Sofie
9a478b6db8
Clean up of char classes, few tokenizer fixes and faster default French tokenizer ( #3293 )
...
* splitting up latin unicode interval
* removing hyphen as infix for French
* adding failing test for issue 1235
* test for issue #3002 which now works
* partial fix for issue #2070
* keep the hyphen as infix for French (as it was)
* restore french expressions with hyphen as infix (as it was)
* added succeeding unit test for Issue #2656
* Fix issue #2822 with custom Italian exception
* Fix issue #2926 by allowing numbers right before infix /
* remove duplicate
* remove xfail for Issue #2179 fixed by Matt
* adjust documentation and remove reference to regex lib
2019-02-20 22:10:13 +01:00
Matthew Honnibal
0d1ca15b13
💫 Fix bugs in matcher extensions. Closes #1971 ( #3301 )
...
* Fix matching on extension attrs and predicates
* Fix detection of match_id when using extension attributes. The match
ID is stored as the last entry in the pattern. We were checking for this
with nr_attr == 0, which didn't account for extension attributes.
* Fix handling of predicates. The wrong count was being passed through,
so even patterns that didn't have a predicate were being checked.
* Fix regex pattern
* Fix matcher set value test
2019-02-20 21:30:39 +01:00
Ines Montani
3b667787a9
Add xfailing test for #3289
2019-02-18 16:45:04 +01:00
Ines Montani
91f260f2c4
Add another test for #1971
2019-02-18 13:36:20 +01:00
Ines Montani
f30aac324c
Update test_issue1971.py
2019-02-18 13:36:15 +01:00
Ines Montani
8fa26ca97e
Fix tensor shape in test for #3288
2019-02-18 11:01:54 +01:00
Ines Montani
c32290557f
Add xfailing test for #3288
2019-02-18 10:59:31 +01:00
Ines Montani
3fdcdec6a0
Merge branch 'master' into develop
2019-02-18 10:03:32 +01:00
Roshni Biswas
e09f1347fa
updates for Bengali language ( #3286 )
...
* Update morph_rules.py
* contributor agreement for roshni-b
* created example sentences
2019-02-18 10:02:28 +01:00
Ines Montani
043e8186f3
Merge branch 'master' into develop
2019-02-17 17:51:17 +01:00
Marc Puig
51268e9f21
Typo error fixed ( #3284 )
2019-02-17 17:51:02 +01:00
Ines Montani
3af0b2dd1c
Add xfailing test for #1971 [ci skip]
2019-02-17 13:04:47 +01:00
Ines Montani
19a002bfd3
Merge branch 'master' into develop
2019-02-17 12:22:54 +01:00
Ines Montani
1e252b129c
Auto-format
2019-02-17 12:22:07 +01:00
Roshni Biswas
e26d923726
Update morph_rules.py ( #3283 )
2019-02-17 12:21:47 +01:00
Matthew Honnibal
7d4a52a4d0
Set version to v2.1.0a7
2019-02-16 17:48:34 +01:00
Matthew Honnibal
07617b6b7f
Set version to v2.1.0a7.dev12
2019-02-16 17:30:29 +01:00
Matthew Honnibal
1dc314bada
Set version to v2.1.0a7.dev11
2019-02-16 17:02:49 +01:00
Matthew Honnibal
2ef227c313
Set version to v2.1.0a7.dev1
2019-02-16 16:22:46 +01:00
Matthew Honnibal
22923b9cb1
Set version to v2.1.0a7.dev9
2019-02-16 15:47:19 +01:00
Matthew Honnibal
e0c91a4c8d
Set version to 2.1.0a7
2019-02-16 14:43:38 +01:00
Matthew Honnibal
92b6bd2977
Refinements to retokenize.split() function ( #3282 )
...
* Change retokenize.split() API for heads
* Pass lists as values for attrs in split
* Fix test_doc_split filename
* Add error for mismatched tokens after split
* Raise error if new tokens don't match text
* Fix doc test
* Fix error
* Move deps under attrs
* Fix split tests
* Fix retokenize.split
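The "new tokens must match the text" rule from the list above can be sketched as a plain consistency check: the proposed subtoken texts, joined together, have to reproduce the original token text. The helper name `check_split_orths`, the error type, and the message are hypothetical; spaCy's own check and error may differ.

```python
def check_split_orths(token_text, orths):
    """Raise ValueError if the subtoken texts do not rebuild the token text."""
    joined = "".join(orths)
    if joined != token_text:
        raise ValueError(
            "Subtokens {!r} join to {!r}, expected {!r}".format(orths, joined, token_text)
        )
    return True


# A valid split: the pieces concatenate back to the original text.
ok = check_split_orths("LosAngeles", ["Los", "Angeles"])

# An invalid split: a typo in one piece changes the text.
try:
    check_split_orths("LosAngeles", ["Los", "Angles"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(ok, mismatch_detected)
```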
2019-02-15 17:32:31 +01:00
Matthew Honnibal
2dbc61bc26
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2019-02-15 14:03:54 +01:00
Ines Montani
1aa57690dc
Add xfailing test for orth mismatch in retokenizer.split
2019-02-15 13:55:04 +01:00
Ines Montani
819768483f
Add xfailing test for out-of-bounds heads
2019-02-15 13:09:07 +01:00
Ines Montani
d8051e89ca
Tidy up tests
2019-02-15 12:56:51 +01:00
Matthew Honnibal
58aac58631
Set version to v2.1.0a7.dev8
2019-02-15 12:39:26 +01:00
Matthew Honnibal
5f1abe2cc7
Set version to v2.1.0a7.dev7
2019-02-15 10:30:53 +01:00
Matthew Honnibal
a66e8e0c8a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2019-02-15 10:30:22 +01:00
Ines Montani
c31a9dabd5
💫 Add en/em dash to prefixes and suffixes ( #3281 )
...
* Auto-format
* Add en/em dash to prefixes and suffixes
2019-02-15 10:29:59 +01:00
Ines Montani
5651a0d052
💫 Replace {Doc,Span}.merge with Doc.retokenize ( #3280 )
...
* Add deprecation warning to Doc.merge and Span.merge
* Replace {Doc,Span}.merge with Doc.retokenize
2019-02-15 10:29:44 +01:00