Commit Graph

1164 Commits

Author SHA1 Message Date
Ines Montani
e359bdd0e3 Auto-format 2019-02-27 11:56:45 +01:00
Matthew Honnibal
4a3371acd5
Make doc[0].is_sent_start == True (closes #2869) (#3340)
* Make doc[0] have sent_start True. Closes #2869

* Document that doc[0].is_sent_start defaults True.
2019-02-27 11:17:17 +01:00
Matthew Honnibal
2d3ce89b78 Improve matcher tests re issue #3328 2019-02-27 10:25:56 +01:00
Matthew Honnibal
8d6954e0e7 Fix matcher bug #3328 2019-02-27 10:25:39 +01:00
Ines Montani
aadf586789 Add xfailing test for #3331 2019-02-25 22:33:30 +01:00
Ines Montani
f135d663f7 Update conftest.py 2019-02-25 15:55:29 +01:00
Ines Montani
76ce8b2662 Merge branch 'master' into develop 2019-02-25 15:54:55 +01:00
Julia Makogon
f1c3108d52 Fixing pymorphy2 dependency issue (#3329) (closes #3327)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement

* pymorphy2 initialization split for ru and uk (#3327)

* stop-words fixed

* Unit-tests updated
2019-02-25 15:48:17 +01:00
Ines Montani
1a735e0f1f Add regression test for #3328 2019-02-25 10:12:58 +01:00
Ines Montani
62b558ab72 💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325)
* Fix formatting and whitespace

* Add support for lexical attributes (closes #2390)

* Document lexical attribute setting during retokenization

* Assign variable oputside of nested loop
2019-02-24 21:13:51 +01:00
Ines Montani
a48deb4081 Merge regression tests 2019-02-24 21:03:39 +01:00
Ines Montani
8f6c193a4d Delete _test_issue1622.py 2019-02-24 20:33:31 +01:00
Ines Montani
c8e967c78d Try include previously segfaulting test 2019-02-24 20:32:46 +01:00
Ines Montani
328b589deb Merge regression tests 2019-02-24 20:31:38 +01:00
Ines Montani
3bc53905cc Remove print statements from test 2019-02-24 20:31:15 +01:00
Ines Montani
1ae0df3da9 Un-x-fail passing test 2019-02-24 20:24:15 +01:00
Ines Montani
399a5803d0 Tidy up tests [ci skip] 2019-02-24 19:02:16 +01:00
Ines Montani
df19e2bff6
💫 Allow setting of custom attributes during retokenization (closes #3314) (#3324)
<!--- Provide a general summary of your changes in the title. -->

## Description

This PR adds the abilility to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attribute that have both a getter *and* a setter implemented.

```python
Token.set_extension('is_musician', default=False)

doc = nlp("I like David Bowie.")
with doc.retokenize() as retokenizer:
    attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
    retokenizer.merge(doc[2:4], attrs=attrs)

assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician
```

### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-24 18:38:47 +01:00
Ines Montani
d8f69d592f Tidy up retokenizer tests 2019-02-24 14:14:11 +01:00
Ines Montani
723e27cb8c Tidy up tests 2019-02-24 14:11:23 +01:00
Ines Montani
80bdcb99c5 Fix escaping of HTML in displacy ENT (closes #2728) 2019-02-21 14:30:39 +01:00
Matthew Honnibal
c5f947f194 Fix regex deprecation warnings 2019-02-21 11:56:47 +01:00
Matthew Honnibal
80195bc2d1
Fix issue #3288 (#3308) 2019-02-21 09:48:53 +01:00
Matthew Honnibal
a137e8b418 Fix Pipe.to_bytes() when model uninitialized
Closes #3289
2019-02-21 09:42:02 +01:00
Sofie
9a478b6db8 Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)
* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* remove duplicate

* remove xfail for Issue #2179 fixed by Matt

* adjust documentation and remove reference to regex lib
2019-02-20 22:10:13 +01:00
Matthew Honnibal
0d1ca15b13 💫 Fix bugs in matcher extensions. Closes #1971 (#3301)
* Fix matching on extension attrs and predicates

* Fix detection of match_id when using extension attributes. The match
ID is stored as the last entry in the pattern. We were checking for this
with nr_attr == 0, which didn't account for extension attributes.

* Fix handling of predicates. The wrong count was being passed through,
so even patterns that didn't have a predicate were being checked.

* Fix regex pattern

* Fix matcher set value test
2019-02-20 21:30:39 +01:00
Ines Montani
3b667787a9 Add xfailing test for #3289 2019-02-18 16:45:04 +01:00
Ines Montani
91f260f2c4 Add another test for #1971 2019-02-18 13:36:20 +01:00
Ines Montani
f30aac324c Update test_issue1971.py 2019-02-18 13:36:15 +01:00
Ines Montani
8fa26ca97e Fix tensor shape in test for #3288 2019-02-18 11:01:54 +01:00
Ines Montani
c32290557f Add xfailing test for #3288 2019-02-18 10:59:31 +01:00
Ines Montani
3af0b2dd1c Add xfailing test for #1971 [ci skip] 2019-02-17 13:04:47 +01:00
Ines Montani
1e252b129c Auto-format 2019-02-17 12:22:07 +01:00
Matthew Honnibal
92b6bd2977
Refinements to retokenize.split() function (#3282)
* Change retokenize.split() API for heads

* Pass lists as values for attrs in split

* Fix test_doc_split filename

* Add error for mismatched tokens after split

* Raise error if new tokens don't match text

* Fix doc test

* Fix error

* Move deps under attrs

* Fix split tests

* Fix retokenize.split
2019-02-15 17:32:31 +01:00
Ines Montani
1aa57690dc Add xfailing test for orth mismatch in retokenizer.split 2019-02-15 13:55:04 +01:00
Ines Montani
819768483f Add xfailing test for out-of-bounds heads 2019-02-15 13:09:07 +01:00
Ines Montani
d8051e89ca Tidy up tests 2019-02-15 12:56:51 +01:00
Ines Montani
c31a9dabd5 💫 Add en/em dash to prefixes and suffixes (#3281)
* Auto-format

* Add en/em dash to prefixes and suffixes
2019-02-15 10:29:59 +01:00
Ines Montani
5651a0d052 💫 Replace {Doc,Span}.merge with Doc.retokenize (#3280)
* Add deprecation warning to Doc.merge and Span.merge

* Replace {Doc,Span}.merge with Doc.retokenize
2019-02-15 10:29:44 +01:00
Ines Montani
f146121092 💫 Make handling of [Pipe].labels consistent (#3273)
* Make handling of [Pipe].labels consistent

* Un-xfail passing test

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: ines <ines@ines.io>

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: ines <ines@ines.io>

* Update spacy/tests/pipeline/test_pipe_methods.py

Co-Authored-By: ines <ines@ines.io>

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: ines <ines@ines.io>

* Move error message to spacy.errors

* Fix textcat labels and test

* Make EntityRuler.labels return tuple as well
2019-02-15 06:03:19 +11:00
Ines Montani
3d577b77c6 Auto-formatting 2019-02-14 19:56:38 +01:00
Ines Montani
e104e47c21 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-02-14 15:35:34 +01:00
Ines Montani
0cd01a8c5e Merge branch 'master' into develop 2019-02-14 15:35:20 +01:00
Ines Montani
2e31921d0a 💫 Add base Language classes for more languages (#3276)
* Add base classes for more languages

* Add test for language class initialization

Make sure language can be initialize – otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded
2019-02-15 01:31:19 +11:00
Grivaz
39815513e2 Add split one token into several (resolves #2838) (#3253)
* Add split one token into several (resolves #2838)

* Improve error message for token splitting

* Make retokenizer.split() tests use a Token object

Change retokenizer.split() to use a Token object, instead of an index.

* Pass Token into retokenize.split()

Tweak retokenize.split() API so that we pass the `Token` object, not the index.

* Fix token.idx in retokenize.split()

* Test that token.idx is correct after split

* Fix token.idx for split tokens

* Fix retokenize.split()

* Fix retokenize.split

* Fix retokenize.split() test
2019-02-15 01:27:13 +11:00
Ines Montani
743ecf728c Tidy up conftest 2019-02-14 13:27:13 +01:00
Ines Montani
4d2438f985 Tidy up and auto-format 2019-02-13 15:29:08 +01:00
Ines Montani
fbf9f1edf1 Also raise error in Span.__reduce__ 2019-02-13 13:22:05 +01:00
Ines Montani
2d0c3c73f4
Raise better error if token is pickled (resolves #2833) (#3267) 2019-02-13 11:27:04 +01:00
Ines Montani
b589b945db
Fix PhraseMatcher pickling and length (resolves #3248) (#3252) 2019-02-12 18:27:54 +01:00