5236 Commits

Author SHA1 Message Date
yuukos
241d19a3e6 fixed Russian Tokenizer
- added trailing space flags for tokens
2017-10-16 13:37:05 +07:00
yuukos
a229b6e0de added tests for Russian language
added tests for creating a Russian Language instance and the Russian tokenizer
2017-10-13 14:04:37 +07:00
yuukos
622b6d6270 updated Russian tokenizer
moved the attempt to import pymorph into __init__
2017-10-13 13:57:29 +07:00
yuukos
f81dd284eb updated spacy/__init__.py
registered the Russian language via set_lang_class
2017-10-12 22:28:34 +07:00
yuukos
7b9491679f added Russian language support 2017-10-12 22:24:20 +07:00
yuukos
2a78f4d634 updated .gitignore file
excluded PyCharm's .idea directory
2017-10-12 22:23:19 +07:00
Ines Montani
a06b84e7cc Merge pull request #1407 from hscspring/patch-6
Update training.jade
2017-10-11 14:25:38 +02:00
Ines Montani
ffc2fef13c Merge pull request #1411 from raphael0202/issue_1078
Resolve issue #1078 by simplifying URL pattern
2017-10-11 11:54:57 +02:00
Raphaël Bournhonesque
3452d6ce52 Resolve issue #1078 by simplifying URL pattern
- avoid catastrophic backtracking
- reduce character range of host name, domain name and TLD identifier
2017-10-11 11:24:00 +02:00
Yam
efe0800f91 Update training.jade
fix several changes
2017-10-09 21:39:15 -05:00
Matthew Honnibal
331d338b8b Merge pull request #1246 from polm/ja-pos-tagger
[wip] Sample implementation of Japanese Tagger (ref #1214)
2017-10-09 04:00:53 +02:00
Ines Montani
d33899b60b Merge pull request #1393 from yuukos/patch-1
Update adding-languages.jade
2017-10-06 18:03:31 +02:00
Ines Montani
e89689a31d Update CONTRIBUTORS.md 2017-10-06 18:02:40 +02:00
Alex
763b54cbc3 Update adding-languages.jade
Fixed misspellings
2017-10-06 16:30:44 +07:00
Matthew Honnibal
0e1adacaff Merge pull request #1390 from mdcclv/contributor-mdcclv
Contributor agreement for Orion Montoya @mdcclv
2017-10-06 02:39:08 +02:00
Orion Montoya
e04e11070f Contributor agreement for Orion Montoya @mdcclv 2017-10-05 17:45:45 -04:00
Ines Montani
e77d8886f7 Update CONTRIBUTORS.md 2017-10-05 22:22:04 +02:00
Matthew Honnibal
dea81f113d Merge pull request #1389 from mdcclv/lemmatizer_obey_exceptions
Lemmatizer obey exceptions
2017-10-05 22:11:21 +02:00
Orion Montoya
b0d271809d Unit test for lemmatizer exceptions -- copied from regression test for #1387 2017-10-05 10:49:28 -04:00
Orion Montoya
ffb50d21a0 Lemmatizer honors exceptions: Fix #1387 2017-10-05 10:49:02 -04:00
Orion Montoya
e81a608173 Regression test for lemmatizer exceptions -- demonstrate issue #1387 2017-10-05 10:47:48 -04:00
Ines Montani
678651ca98 Merge pull request #1386 from kokes/patch-1
Fixing links to SyntaxNet
2017-10-04 13:35:01 +02:00
Ondrej Kokes
a9362f1c73 Fixing links to SyntaxNet 2017-10-04 12:55:07 +02:00
Matthew Honnibal
eb72eae258 Merge pull request #1364 from Destygo/master
Fixed NER model loading bug
2017-09-29 12:29:43 +02:00
Ines Montani
58bfe30a12 Merge pull request #1362 from IamJeffG/docs/custom-tokenizer
Document Tokenizer(token_match) and clarify tokenizer_pseudo_code
2017-09-26 15:51:15 +02:00
Vincent Genty
259ed027af Fixed NER model loading bug 2017-09-26 15:46:04 +02:00
Ines Montani
361211fe26 Merge pull request #1342 from wannaphongcom/master
Add Thai language
2017-09-26 15:40:55 +02:00
Jeffrey Gerard
b6ebedd09c Document Tokenizer(token_match) and clarify tokenizer_pseudo_code
Closes #835

In the `tokenizer_pseudo_code` I put the `special_cases` kwarg
before `find_prefix` because this now matches the order the args
are used in the pseudocode, and it also matches spaCy's actual code.
2017-09-25 13:13:25 -07:00
Matthew Honnibal
2f8d535f65 Merge pull request #1351 from hscspring/patch-4
Update punctuation.py
2017-09-24 12:16:39 +02:00
Matthew Honnibal
9177313063 Merge pull request #1352 from hscspring/patch-5
Update customizing-tokenizer.jade
2017-09-22 16:11:49 +02:00
Matthew Honnibal
1dbc2285b8 Merge pull request #1350 from hscspring/patch-3
Update word-vectors-similarities.jade
2017-09-22 16:11:05 +02:00
Yam
54855f0eee Update customizing-tokenizer.jade 2017-09-22 12:15:48 +08:00
Yam
6f450306c3 Update customizing-tokenizer.jade
update some code:
- `me` -> `-PRON`
- `TAG` -> `POS`
- `create_tokenizer` function
2017-09-22 10:53:22 +08:00
Yam
923c4c2fb2 Update punctuation.py
add `……`
2017-09-22 09:50:46 +08:00
Yam
425c09488d Update word-vectors-similarities.jade
add
```
import spacy
nlp = spacy.load('en')
```
2017-09-22 08:56:34 +08:00
Wannaphong Phatthiyaphaibun
1abf472068 add th test 2017-09-21 12:56:58 +07:00
Matthew Honnibal
ea2732469b Merge pull request #1340 from hscspring/patch-1
Update punctuation.py
2017-09-20 23:57:00 +02:00
Wannaphong Phatthiyaphaibun
39bb5690f0 update th 2017-09-21 00:36:02 +07:00
Wannaphong Phatthiyaphaibun
44291f6697 add thai 2017-09-20 23:26:34 +07:00
Yam
978b24ccd4 Update punctuation.py
In Chinese, `~` and `——` are hyphens,
and `·` is an interpunct
2017-09-20 23:02:22 +08:00
Matthew Honnibal
aa728b33ca Merge pull request #1333 from galaxyh/master
Add Chinese punctuation
2017-09-19 15:09:30 +02:00
Yu-chun Huang
188b439b25 Add Chinese punctuation
Add Chinese punctuation.
2017-09-19 16:58:42 +08:00
Yu-chun Huang
1f1f35dcd0 Add Chinese punctuation
Add Chinese punctuation.
2017-09-19 16:57:24 +08:00
Ines Montani
4bee26188d Merge pull request #1323 from galaxyh/master
Set the "cut_all" parameter in jieba.cut() to False, or jieba will return ALL POSSIBLE word segmentations.
2017-09-14 15:23:41 +02:00
Yu-chun Huang
7692b8c071 Update __init__.py
Set the "cut_all" parameter to False, or jieba will return ALL POSSIBLE word segmentations.
2017-09-12 16:23:47 +08:00
Matthew Honnibal
ddaff6ca56 Merge pull request #1287 from IamJeffG/feature/1226-more-complete-noun-chunks
Capture more noun chunks
2017-09-08 07:59:10 +02:00
Matthew Honnibal
45029a550e Fix customized-tokenizer tests 2017-09-04 20:13:13 +02:00
Matthew Honnibal
34c585396a Merge pull request #1294 from Vimos/master
Fix issue #1292 and add test case for the Assertion Error
2017-09-04 19:20:40 +02:00
Matthew Honnibal
c68f188eb0 Fix error on test 2017-09-04 18:59:36 +02:00
Matthew Honnibal
33313c01ad Merge pull request #1298 from ericzhao28/master
Lowest common ancestor matrix for spans and docs
2017-09-04 18:57:54 +02:00