Matthew Honnibal
c0799430a7
Make small changes to Doc.to_array
...
* Change type-check logic to 'hasattr' (Python type-checking is brittle)
* Small 'house style' edits, mostly making code more terse.
2017-10-20 11:17:00 +02:00
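A minimal sketch of the `hasattr`-style check described in the commit above. The helper name `normalize_attr_ids` and the `ATTR_IDS` table are hypothetical, and the integer IDs are made up for illustration; none of this is taken from the library source. It shows how a single value, a string name, or a list of either could be normalized to a list of integer IDs without exact type checks:

```python
# Illustrative name-to-ID table; the names and numbers are assumptions of this sketch.
ATTR_IDS = {"ORTH": 1, "LEMMA": 2, "POS": 3}


def normalize_attr_ids(attrs):
    """Return a list of integer attribute IDs for a scalar or an iterable input."""
    # A string is iterable in Python 3, so it is treated as a single value first.
    # Anything else without __iter__ (for example an int) is also wrapped in a list.
    if isinstance(attrs, str) or not hasattr(attrs, "__iter__"):
        attrs = [attrs]
    # String names are looked up case-insensitively; integers pass through unchanged.
    return [ATTR_IDS[a.upper()] if isinstance(a, str) else a for a in attrs]
```

Checking for `__iter__` rather than for `list` lets tuples and other sequences work without being listed explicitly.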
Ramanan Balakrishnan
5941aa96a1
Support strings for attribute list in doc.to_array
2017-10-20 11:59:34 +05:30
Ramanan Balakrishnan
b47b4e2654
Support single value for attribute list in doc.to_scalar conversion
2017-10-18 14:43:47 +05:30
Matthew Honnibal
cd9378c8f1
Merge pull request #1423 from yuukos/master
...
Fixed Russian tokenizer
2017-10-16 11:45:53 +02:00
yuukos
92931a2efd
Merge branch 'russian_language'
2017-10-16 13:46:28 +07:00
yuukos
241d19a3e6
fixed Russian Tokenizer
...
- added trailing space flags for tokens
2017-10-16 13:37:05 +07:00
Paul O'Leary McCann
71ae8013ec
[ja] Use user_data instead of a wrapper class
...
Instead of using a JapaneseDoc wrapper class to store Mecab output,
stash it in `user_data`. -POLM
2017-10-16 00:24:34 +09:00
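The idea of stashing tokenizer output in `user_data` can be sketched as follows. `SketchDoc`, `CountingParser`, and the `"analysis"` key are hypothetical stand-ins chosen for this example, not the actual classes or key used in the commit; whitespace splitting replaces real morphological analysis:

```python
class SketchDoc:
    """Hypothetical stand-in for a Doc: a word list plus a free-form user_data dict."""

    def __init__(self, words):
        self.words = words
        self.user_data = {}


class CountingParser:
    """Hypothetical stand-in for an expensive analyzer; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def parse(self, text):
        self.calls += 1
        # Whitespace splitting and a dummy tag stand in for real analysis.
        return [(word, "TAG") for word in text.split()]


def tokenize(parser, text):
    """Run the analyzer once and stash its full output on the doc."""
    analysis = parser.parse(text)
    doc = SketchDoc([word for word, _ in analysis])
    doc.user_data["analysis"] = analysis  # key name is an assumption
    return doc


def tag(parser, doc):
    """Reuse the stashed analysis; re-parse only if nothing was stashed."""
    analysis = doc.user_data.get("analysis")
    if analysis is None:
        analysis = parser.parse(" ".join(doc.words))
    return [t for _, t in analysis]
```

With this layout, calling `tokenize` and then `tag` runs the analyzer once rather than twice.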
Paul O'Leary McCann
43eedf73f2
[ja] Stash tokenizer output for speed
...
Before this commit, the Mecab tokenizer had to be called twice when
creating a Doc: once during tokenization and once during tagging. This
creates a JapaneseDoc wrapper class for Doc that stashes the parsed
tokenizer output to remove redundant processing. -POLM
2017-10-15 23:33:25 +09:00
yuukos
6fb9d75bd2
fixed test with creating tokenizer
2017-10-13 15:51:03 +07:00
yuukos
a229b6e0de
added tests for Russian language
...
added tests for creating a Russian Language instance and a Russian tokenizer
2017-10-13 14:04:37 +07:00
yuukos
622b6d6270
updated Russian tokenizer
...
moved the attempt to import pymorph into __init__
2017-10-13 13:57:29 +07:00
yuukos
f81dd284eb
updated spacy/__init__.py
...
registered russian language via set_lang_class
2017-10-12 22:28:34 +07:00
yuukos
7b9491679f
added russian language support
2017-10-12 22:24:20 +07:00
Raphaël Bournhonesque
3452d6ce52
Resolve issue #1078 by simplifying URL pattern
...
- avoid catastrophic backtracking
- reduce character range of host name, domain name and TLD identifier
2017-10-11 11:24:00 +02:00
Matthew Honnibal
331d338b8b
Merge pull request #1246 from polm/ja-pos-tagger
...
[wip] Sample implementation of Japanese Tagger (ref #1214)
2017-10-09 04:00:53 +02:00
Orion Montoya
b0d271809d
Unit test for lemmatizer exceptions -- copied from regression test for #1387
2017-10-05 10:49:28 -04:00
Orion Montoya
ffb50d21a0
Lemmatizer honors exceptions: Fix #1387
2017-10-05 10:49:02 -04:00
Orion Montoya
e81a608173
Regression test for lemmatizer exceptions -- demonstrate issue #1387
2017-10-05 10:47:48 -04:00
Matthew Honnibal
eb72eae258
Merge pull request #1364 from Destygo/master
...
Fixed NER model loading bug
2017-09-29 12:29:43 +02:00
Vincent Genty
259ed027af
Fixed NER model loading bug
2017-09-26 15:46:04 +02:00
Ines Montani
361211fe26
Merge pull request #1342 from wannaphongcom/master
...
Add Thai language
2017-09-26 15:40:55 +02:00
Yam
923c4c2fb2
Update punctuation.py
...
add `……`
2017-09-22 09:50:46 +08:00
Wannaphong Phatthiyaphaibun
1abf472068
add th test
2017-09-21 12:56:58 +07:00
Wannaphong Phatthiyaphaibun
39bb5690f0
update th
2017-09-21 00:36:02 +07:00
Wannaphong Phatthiyaphaibun
44291f6697
add thai
2017-09-20 23:26:34 +07:00
Yam
978b24ccd4
Update punctuation.py
...
In Chinese, `~` and `——` are hyphens,
`·` is a separator mark (interpunct)
2017-09-20 23:02:22 +08:00
Yu-chun Huang
188b439b25
Add Chinese punctuation
...
Add Chinese punctuation.
2017-09-19 16:58:42 +08:00
Yu-chun Huang
1f1f35dcd0
Add Chinese punctuation
...
Add Chinese punctuation.
2017-09-19 16:57:24 +08:00
Yu-chun Huang
7692b8c071
Update __init__.py
...
Set the "cut_all" parameter to False, or jieba will return ALL POSSIBLE word segmentations.
2017-09-12 16:23:47 +08:00
Matthew Honnibal
ddaff6ca56
Merge pull request #1287 from IamJeffG/feature/1226-more-complete-noun-chunks
...
Capture more noun chunks
2017-09-08 07:59:10 +02:00
Matthew Honnibal
45029a550e
Fix customized-tokenizer tests
2017-09-04 20:13:13 +02:00
Matthew Honnibal
34c585396a
Merge pull request #1294 from Vimos/master
...
Fix issue #1292 and add test case for the Assertion Error
2017-09-04 19:20:40 +02:00
Matthew Honnibal
c68f188eb0
Fix error on test
2017-09-04 18:59:36 +02:00
Matthew Honnibal
e8a26ebfab
Add efficiency note to new get_lca_matrix() method
2017-09-04 15:43:52 +02:00
Eric Zhao
d61c117081
Lowest common ancestor matrix for spans and docs
...
Added functionality for spans and docs to get lowest common ancestor
matrix by simply calling: doc.get_lca_matrix() or
doc[:3].get_lca_matrix().
Corresponding unit tests were also added under spacy/tests/doc and
spacy/tests/spans.
Designed to address: https://github.com/explosion/spaCy/issues/969.
2017-09-03 12:22:19 -07:00
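For readers unfamiliar with the term, here is a minimal pure-Python sketch of a lowest common ancestor matrix. It is not the library's implementation: the function name `lca_matrix`, the input format (a list where `heads[i]` is the index of token `i`'s head and a root points to itself), and the use of `-1` for "no common ancestor" are all assumptions of this example.

```python
def lca_matrix(heads):
    """Return an n x n list of lists whose [i][j] entry is the LCA index of i and j."""
    n = len(heads)

    def path_to_root(i):
        # Collect i and its ancestors, walking up until a self-headed root.
        path = [i]
        seen = {i}
        while heads[path[-1]] != path[-1]:
            parent = heads[path[-1]]
            if parent in seen:  # guard against malformed (cyclic) input
                break
            path.append(parent)
            seen.add(parent)
        return path

    paths = [path_to_root(i) for i in range(n)]
    matrix = [[-1] * n for _ in range(n)]
    for i in range(n):
        ancestors_of_i = set(paths[i])
        for j in range(n):
            # Walking up from j, the first node also above (or equal to) i is the LCA.
            for node in paths[j]:
                if node in ancestors_of_i:
                    matrix[i][j] = node
                    break
    return matrix
```

This straightforward version rebuilds ancestor paths for every token and compares all pairs, so it is only suitable for short inputs.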
Matthew Honnibal
9bffcaa73d
Update test to make it slightly more direct
...
The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.
2017-09-01 21:16:56 +02:00
Vimos Tan
a6d9fb5bb6
fix issue #1292
2017-08-30 14:49:14 +08:00
Paul O'Leary McCann
8b3e1f7b5b
Handle out-of-vocab words
...
Wasn't handling words out of the tokenizer dictionary vocabulary
properly. This adds a fix and test for that. -POLM
2017-08-29 23:58:42 +09:00
Jeffrey Gerard
884ba168a8
Capture more noun chunks
2017-08-23 21:18:53 -07:00
Paul O'Leary McCann
95050201ce
Add importorskip for Japanese fixture
2017-08-22 21:30:59 +09:00
Paul O'Leary McCann
bcf2b9b4f5
Update tagger & tokenizer tests
...
Tagger is now parametrized and has two sentences with more tag coverage.
The tokenizer tests are updated to reflect differences in tokenization
between IPAdic and Unidic. -POLM
2017-08-22 00:03:11 +09:00
Paul O'Leary McCann
adfd987316
Update the TAG_MAP
2017-08-22 00:02:55 +09:00
Paul O'Leary McCann
53e17296e9
Fix pronoun handling
...
Missed this case earlier.
連体詞 have three classes for UD purposes:
- その -> DET
- それ -> PRON
- 同じ -> ADJ
-POLM
2017-08-22 00:01:49 +09:00
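The three cases listed in the commit above can be sketched as a lookup on the literal token. The names `SIMPLE_TAG_MAP`, `ADNOMINAL_POS`, and `resolve_pos` are hypothetical, the single entry in `SIMPLE_TAG_MAP` is only illustrative, and tokens outside the three listed examples are left unresolved (`None`) rather than guessed:

```python
# Illustrative one-to-one mapping for an unambiguous raw tag (assumed for this sketch).
SIMPLE_TAG_MAP = {"名詞": "NOUN"}

# The ambiguous raw tag and the three example tokens named in the commit message.
AMBIGUOUS_TAG = "連体詞"
ADNOMINAL_POS = {"その": "DET", "それ": "PRON", "同じ": "ADJ"}


def resolve_pos(raw_tag, surface):
    """Return a UD-style POS for a raw tag, consulting the token text if ambiguous."""
    if raw_tag == AMBIGUOUS_TAG:
        # The tag alone is not enough, so the literal token decides.
        return ADNOMINAL_POS.get(surface)
    return SIMPLE_TAG_MAP.get(raw_tag)
```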
Paul O'Leary McCann
c435f748d7
Put Mecab import in utility function
2017-08-22 00:01:28 +09:00
ines
dcff10abe9
Add regression test for #1281
2017-08-21 16:11:47 +02:00
ines
edc596d9a7
Add missing tokenizer exceptions (resolves #1281)
2017-08-21 16:11:36 +02:00
Paul O'Leary McCann
234a8a7591
Change default tag for 動詞,非自立可能
...
Example of this is いる in these sentences:
彼はそこにいる。# should be VERB
彼は底に立っている。# should be AUX
Unclear which case is more numerous - need to check a large corpus - but
in keeping with the other ambiguous tags, this is mapped to the
"dominant" or first part of the tag. -POLM
2017-08-21 00:21:45 +09:00
Paul O'Leary McCann
6e9e686568
Sample implementation of Japanese Tagger (ref #1214)
...
This is far from complete but it should be enough to check some things.
1. Mecab transition. Janome doesn't support Unidic, only IPAdic, but UD
tag mappings are based on Unidic. This switches out Janome for Mecab to
get around that.
2. Raw tag extension. A simple tag map can't meet the specifications for
UD tag mappings, so this adds an extra field to ambiguous cases. For
this demo it just deals with the simplest case, which only needs to look
at the literal token. (In reality it may be necessary to look at the
whole sentence, but that's another issue.)
3. General code structure. Seems nobody else has implemented a custom
Tagger yet, so still not sure this is the correct way to pass the
vocabulary around, for example.
Any feedback would be greatly appreciated. -POLM
2017-08-08 01:27:15 +09:00
Delirious Lettuce
d3b03f0544
Fix typos:
...
* `auxillary` -> `auxiliary`
* `consistute` -> `constitute`
* `earlist` -> `earliest`
* `prefered` -> `preferred`
* `direcory` -> `directory`
* `reuseable` -> `reusable`
* `idiosyncracies` -> `idiosyncrasies`
* `enviroment` -> `environment`
* `unecessary` -> `unnecessary`
* `yesteday` -> `yesterday`
* `resouces` -> `resources`
2017-08-06 21:31:39 -06:00
Matthew Honnibal
d51d55bba6
Increment version
2017-07-22 15:43:16 +02:00