Matthew Honnibal
dbc276e3b2
Fix 'toupper()' -> 'upper()'
2017-10-20 13:02:13 +02:00
Matthew Honnibal
7a46792376
Fix compile error
...
Closures not allowed in cpdef
2017-10-20 11:53:47 +02:00
Matthew Honnibal
658536b5ce
Fix to_array compile error
2017-10-20 11:35:10 +02:00
Matthew Honnibal
c0799430a7
Make small changes to Doc.to_array
...
* Change type-check logic to 'hasattr' (Python type-checking is brittle)
* Small 'house style' edits, mostly making code more terse.
2017-10-20 11:17:00 +02:00
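The argument handling described in the commits above (a single value or a list, strings or integer IDs, checked with 'hasattr' rather than type checks) could look roughly like the sketch below. The name `normalize_attrs` and the `ATTR_IDS` table with its values are hypothetical stand-ins for illustration, not the library's actual code or IDs.

```python
# Hypothetical attribute table; the names and numbers are placeholders.
ATTR_IDS = {"ORTH": 1, "LEMMA": 2, "POS": 3}


def normalize_attrs(attr_ids):
    """Return a list of integer attribute IDs from a single value or a sequence."""
    # Duck typing: a string (has 'upper') or a non-iterable such as an int
    # is treated as one attribute and wrapped in a list.
    if hasattr(attr_ids, "upper") or not hasattr(attr_ids, "__iter__"):
        attr_ids = [attr_ids]
    result = []
    for attr in attr_ids:
        # Strings are looked up by upper-cased name; integers pass through.
        if hasattr(attr, "upper"):
            attr = ATTR_IDS[attr.upper()]
        result.append(attr)
    return result
```

With this sketch, `normalize_attrs("orth")`, `normalize_attrs(3)` and `normalize_attrs(["lemma", 3])` all return plain lists of integers, so the caller no longer needs to care which form was passed in.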
Ramanan Balakrishnan
5941aa96a1
Support strings for attribute list in doc.to_array
2017-10-20 11:59:34 +05:30
Ramanan Balakrishnan
b47b4e2654
Support single value for attribute list in doc.to_scalar conversion
2017-10-18 14:43:47 +05:30
Matthew Honnibal
cd9378c8f1
Merge pull request #1423 from yuukos/master
...
Fixed Russian tokenizer
2017-10-16 11:45:53 +02:00
yuukos
92931a2efd
Merge branch 'russian_language'
2017-10-16 13:46:28 +07:00
yuukos
241d19a3e6
fixed Russian Tokenizer
...
- added trailing space flags for tokens
2017-10-16 13:37:05 +07:00
Paul O'Leary McCann
71ae8013ec
[ja] Use user_data instead of a wrapper class
...
Instead of using a JapaneseDoc wrapper class to store Mecab output,
stash it in `user_data`. -POLM
2017-10-16 00:24:34 +09:00
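The idea of stashing tokenizer output on the document so the tagger can reuse it might be sketched as follows. `FakeDoc`, `expensive_parse` and the `"parsed_tokens"` key are hypothetical names made up for this illustration; only the presence of a free-form `user_data` dictionary is taken from the commit message.

```python
class FakeDoc:
    """Hypothetical stand-in for a document object with a user_data dict."""

    def __init__(self, words):
        self.words = words
        self.user_data = {}


calls = []  # records every call to the (pretend) external tokenizer


def expensive_parse(text):
    # Placeholder for the external tokenizer; returns (word, tag) pairs.
    calls.append(text)
    return [(word, "X") for word in text.split()]


def tokenize(text):
    parsed = expensive_parse(text)
    doc = FakeDoc([word for word, _ in parsed])
    # Stash the full parse so later pipeline steps need not repeat it.
    doc.user_data["parsed_tokens"] = parsed
    return doc


def tag(doc):
    # Reuse the stashed output instead of calling the tokenizer again.
    return [tag_ for _, tag_ in doc.user_data["parsed_tokens"]]
```

Running `tokenize` followed by `tag` on the same document leaves a single entry in `calls`, which is the redundancy the commit set out to remove.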
Paul O'Leary McCann
43eedf73f2
[ja] Stash tokenizer output for speed
...
Before this commit, the Mecab tokenizer had to be called twice when
creating a Doc: once during tokenization and once during tagging. This
creates a JapaneseDoc wrapper class for Doc that stashes the parsed
tokenizer output to remove redundant processing. -POLM
2017-10-15 23:33:25 +09:00
yuukos
6fb9d75bd2
fixed test with creating tokenizer
2017-10-13 15:51:03 +07:00
yuukos
a229b6e0de
added tests for Russian language
...
added tests for creating a Russian Language instance and a Russian tokenizer
2017-10-13 14:04:37 +07:00
yuukos
622b6d6270
updated Russian tokenizer
...
moved the attempt to import pymorph into __init__
2017-10-13 13:57:29 +07:00
yuukos
f81dd284eb
updated spacy/__init__.py
...
registered russian language via set_lang_class
2017-10-12 22:28:34 +07:00
yuukos
7b9491679f
added russian language support
2017-10-12 22:24:20 +07:00
Raphaël Bournhonesque
3452d6ce52
Resolve issue #1078 by simplifying URL pattern
...
- avoid catastrophic backtracking
- reduce character range of host name, domain name and TLD identifier
2017-10-11 11:24:00 +02:00
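As a rough illustration of the two points in the commit above, the sketch below shows a deliberately simple URL pattern in which every host label is limited to a narrow character class and the dots between labels are mandatory, so the regex engine has only one way to split a host into labels. It is an assumption-laden example, not the pattern used in the project.

```python
import re

# Illustrative pattern only. Narrow character classes and mandatory
# separators keep the amount of backtracking small on non-matching input.
SIMPLE_URL = re.compile(
    r"^(?:https?://)?"               # optional scheme
    r"[a-z0-9-]+(?:\.[a-z0-9-]+)*"   # host labels separated by dots
    r"\.[a-z]{2,}"                   # TLD: letters only
    r"(?::\d{2,5})?"                 # optional port
    r"(?:/\S*)?$",                   # optional path
    re.IGNORECASE,
)


def looks_like_url(text):
    """Return True if the whole string matches the simplified pattern."""
    return SIMPLE_URL.match(text) is not None
```

A long run of characters with no dot, which is the kind of input that tends to trigger pathological behaviour in patterns with nested, overlapping quantifiers, is rejected here after a single linear pass.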
Matthew Honnibal
331d338b8b
Merge pull request #1246 from polm/ja-pos-tagger
...
[wip] Sample implementation of Japanese Tagger (ref #1214)
2017-10-09 04:00:53 +02:00
Orion Montoya
b0d271809d
Unit test for lemmatizer exceptions -- copied from regression test for #1387
2017-10-05 10:49:28 -04:00
Orion Montoya
ffb50d21a0
Lemmatizer honors exceptions: Fix #1387
2017-10-05 10:49:02 -04:00
Orion Montoya
e81a608173
Regression test for lemmatizer exceptions -- demonstrate issue #1387
2017-10-05 10:47:48 -04:00
Matthew Honnibal
eb72eae258
Merge pull request #1364 from Destygo/master
...
Fixed NER model loading bug
2017-09-29 12:29:43 +02:00
Vincent Genty
259ed027af
Fixed NER model loading bug
2017-09-26 15:46:04 +02:00
Ines Montani
361211fe26
Merge pull request #1342 from wannaphongcom/master
...
Add Thai language
2017-09-26 15:40:55 +02:00
Yam
923c4c2fb2
Update punctuation.py
...
add `……`
2017-09-22 09:50:46 +08:00
Wannaphong Phatthiyaphaibun
1abf472068
add th test
2017-09-21 12:56:58 +07:00
Wannaphong Phatthiyaphaibun
39bb5690f0
update th
2017-09-21 00:36:02 +07:00
Wannaphong Phatthiyaphaibun
44291f6697
add thai
2017-09-20 23:26:34 +07:00
Yam
978b24ccd4
Update punctuation.py
...
In Chinese, `~` and `——` are hyphens, and
`·` is an interpunct (separator mark)
2017-09-20 23:02:22 +08:00
Yu-chun Huang
188b439b25
Add Chinese punctuation
...
Add Chinese punctuation.
2017-09-19 16:58:42 +08:00
Yu-chun Huang
1f1f35dcd0
Add Chinese punctuation
...
Add Chinese punctuation.
2017-09-19 16:57:24 +08:00
Yu-chun Huang
7692b8c071
Update __init__.py
...
Set the "cut_all" parameter to False, or jieba will return ALL POSSIBLE word segmentations.
2017-09-12 16:23:47 +08:00
Matthew Honnibal
ddaff6ca56
Merge pull request #1287 from IamJeffG/feature/1226-more-complete-noun-chunks
...
Capture more noun chunks
2017-09-08 07:59:10 +02:00
Matthew Honnibal
45029a550e
Fix customized-tokenizer tests
2017-09-04 20:13:13 +02:00
Matthew Honnibal
34c585396a
Merge pull request #1294 from Vimos/master
...
Fix issue #1292 and add test case for the Assertion Error
2017-09-04 19:20:40 +02:00
Matthew Honnibal
c68f188eb0
Fix error on test
2017-09-04 18:59:36 +02:00
Matthew Honnibal
e8a26ebfab
Add efficiency note to new get_lca_matrix() method
2017-09-04 15:43:52 +02:00
Eric Zhao
d61c117081
Lowest common ancestor matrix for spans and docs
...
Added functionality for spans and docs to get lowest common ancestor
matrix by simply calling: doc.get_lca_matrix() or
doc[:3].get_lca_matrix().
Corresponding unit tests were also added under spacy/tests/doc and
spacy/tests/spans.
Designed to address: https://github.com/explosion/spaCy/issues/969 .
2017-09-03 12:22:19 -07:00
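For readers unfamiliar with the term, a lowest common ancestor matrix can be computed from a list of head indices as in the sketch below. This is a plain-Python illustration under stated assumptions (a root token points to itself, the tree has no cycles, and -1 marks pairs with no common ancestor); the function name `lca_matrix` is hypothetical, and the sketch is not the implementation added in the commit.

```python
def lca_matrix(heads):
    """Return matrix[i][j] = index of the lowest common ancestor of i and j.

    heads[i] is the index of token i's head; a root points to itself.
    """
    n = len(heads)

    def ancestors(i):
        # Path from token i up to its root, including i itself.
        path = [i]
        while heads[path[-1]] != path[-1]:
            path.append(heads[path[-1]])
        return path

    paths = [ancestors(i) for i in range(n)]
    matrix = [[-1] * n for _ in range(n)]
    for i in range(n):
        on_path = set(paths[i])
        for j in range(n):
            # The first node on j's upward path that also lies on i's path
            # is the lowest common ancestor; -1 if the paths never meet.
            matrix[i][j] = next((a for a in paths[j] if a in on_path), -1)
    return matrix
```

For `heads = [1, 1, 1, 2]` (token 1 is the root, token 3 hangs off token 2), the result is `[[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 2, 2], [1, 1, 2, 3]]`. The sketch walks every pair of tokens, so its cost grows at least quadratically with document length.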
Matthew Honnibal
9bffcaa73d
Update test to make it slightly more direct
...
The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.
2017-09-01 21:16:56 +02:00
Vimos Tan
a6d9fb5bb6
fix issue #1292
2017-08-30 14:49:14 +08:00
Paul O'Leary McCann
8b3e1f7b5b
Handle out-of-vocab words
...
Words outside the tokenizer dictionary's vocabulary weren't being handled
properly. This adds a fix and a test for that. -POLM
2017-08-29 23:58:42 +09:00
Jeffrey Gerard
884ba168a8
Capture more noun chunks
2017-08-23 21:18:53 -07:00
Paul O'Leary McCann
95050201ce
Add importorskip for Japanese fixture
2017-08-22 21:30:59 +09:00
Paul O'Leary McCann
bcf2b9b4f5
Update tagger & tokenizer tests
...
Tagger is now parametrized and has two sentences with more tag coverage.
The tokenizer tests are updated to reflect differences in tokenization
between IPAdic and Unidic. -POLM
2017-08-22 00:03:11 +09:00
Paul O'Leary McCann
adfd987316
Update the TAG_MAP
2017-08-22 00:02:55 +09:00
Paul O'Leary McCann
53e17296e9
Fix pronoun handling
...
Missed this case earlier.
連体詞 (adnominals) have three classes for UD purposes:
- その -> DET
- それ -> PRON
- 同じ -> ADJ
-POLM
2017-08-22 00:01:49 +09:00
Paul O'Leary McCann
c435f748d7
Put Mecab import in utility function
2017-08-22 00:01:28 +09:00
ines
dcff10abe9
Add regression test for #1281
2017-08-21 16:11:47 +02:00
ines
edc596d9a7
Add missing tokenizer exceptions (resolves #1281)
2017-08-21 16:11:36 +02:00
Paul O'Leary McCann
234a8a7591
Change default tag for 動詞,非自立可能
...
Example of this is いる in these sentences:
彼はそこにいる。# should be VERB
彼は底に立っている。# should be AUX
Unclear which case is more numerous - need to check a large corpus - but
in keeping with the other ambiguous tags, this is mapped to the
"dominant" or first part of the tag. -POLM
2017-08-21 00:21:45 +09:00