Orion Montoya
e81a608173
Regression test for lemmatizer exceptions -- demonstrate issue #1387
2017-10-05 10:47:48 -04:00
Matthew Honnibal
eb72eae258
Merge pull request #1364 from Destygo/master
...
Fixed NER model loading bug
2017-09-29 12:29:43 +02:00
Vincent Genty
259ed027af
Fixed NER model loading bug
2017-09-26 15:46:04 +02:00
Ines Montani
361211fe26
Merge pull request #1342 from wannaphongcom/master
...
Add Thai language
2017-09-26 15:40:55 +02:00
Yam
923c4c2fb2
Update punctuation.py
...
add `……`
2017-09-22 09:50:46 +08:00
Wannaphong Phatthiyaphaibun
1abf472068
add th test
2017-09-21 12:56:58 +07:00
Wannaphong Phatthiyaphaibun
39bb5690f0
update th
2017-09-21 00:36:02 +07:00
Wannaphong Phatthiyaphaibun
44291f6697
add thai
2017-09-20 23:26:34 +07:00
Yam
978b24ccd4
Update punctuation.py
...
In Chinese, `~` and `——` are used as hyphens,
and `·` is a separator symbol (interpunct)
2017-09-20 23:02:22 +08:00
Yu-chun Huang
188b439b25
Add Chinese punctuation
...
Add Chinese punctuation.
2017-09-19 16:58:42 +08:00
Yu-chun Huang
1f1f35dcd0
Add Chinese punctuation
...
Add Chinese punctuation.
2017-09-19 16:57:24 +08:00
Yu-chun Huang
7692b8c071
Update __init__.py
...
Set the "cut_all" parameter to False, or jieba will return ALL POSSIBLE word segmentations.
2017-09-12 16:23:47 +08:00
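A minimal sketch of why the "cut_all" setting in the commit above matters to a tokenizer. This is not jieba and does not reproduce its algorithm: `VOCAB`, `cut_full` and `cut_longest` are hypothetical stand-ins, and the greedy longest-match rule is only an assumption chosen to keep the example small. It shows that a "return every possible word" mode yields overlapping pieces that no longer tile the input text.

```python
# Toy dictionary (hypothetical; not jieba's data).
VOCAB = {"清华", "华大", "大学", "清华大学"}


def cut_full(text, vocab=VOCAB):
    """Return every dictionary word found anywhere in the text, overlaps included."""
    return [
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
        if text[i:j] in vocab
    ]


def cut_longest(text, vocab=VOCAB):
    """Return one segmentation: longest dictionary match at each position,
    falling back to a single character when nothing matches."""
    out = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out


text = "清华大学"
full = cut_full(text)        # overlapping candidates
single = cut_longest(text)   # pieces that concatenate back to the text
```

Only the second result can be turned into tokens with consistent character offsets, because its pieces join back into the original string.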
Matthew Honnibal
ddaff6ca56
Merge pull request #1287 from IamJeffG/feature/1226-more-complete-noun-chunks
...
Capture more noun chunks
2017-09-08 07:59:10 +02:00
Matthew Honnibal
45029a550e
Fix customized-tokenizer tests
2017-09-04 20:13:13 +02:00
Matthew Honnibal
34c585396a
Merge pull request #1294 from Vimos/master
...
Fix issue #1292 and add test case for the Assertion Error
2017-09-04 19:20:40 +02:00
Matthew Honnibal
c68f188eb0
Fix error on test
2017-09-04 18:59:36 +02:00
Matthew Honnibal
e8a26ebfab
Add efficiency note to new get_lca_matrix() method
2017-09-04 15:43:52 +02:00
Eric Zhao
d61c117081
Lowest common ancestor matrix for spans and docs
...
Added functionality for spans and docs to get lowest common ancestor
matrix by simply calling: doc.get_lca_matrix() or
doc[:3].get_lca_matrix().
Corresponding unit tests were also added under spacy/tests/doc and
spacy/tests/spans.
Designed to address https://github.com/explosion/spaCy/issues/969
2017-09-03 12:22:19 -07:00
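A rough sketch of what a lowest common ancestor matrix contains, for readers of the commit above. It is not spaCy's implementation: the `lca_matrix` helper, the `heads` list encoding (each token stores the index of its head, a root points to itself) and the use of -1 for "no common ancestor" are assumptions made for illustration.

```python
def lca_matrix(heads):
    """Build an n x n matrix whose [i][j] entry is the index of the lowest
    common ancestor of tokens i and j, or -1 if they share no ancestor.

    Assumes heads[i] is the index of token i's head, that a root points to
    itself, and that the heads form trees (no cycles).
    """
    n = len(heads)

    def path_to_root(i):
        # Token i followed by its ancestors, ending at the root.
        path = [i]
        while heads[path[-1]] != path[-1]:
            path.append(heads[path[-1]])
        return path

    paths = [path_to_root(i) for i in range(n)]
    matrix = [[-1] * n for _ in range(n)]
    for i in range(n):
        on_path = set(paths[i])
        for j in range(n):
            # Walk upwards from j; the first node also above (or equal to) i
            # is the lowest common ancestor.
            for node in paths[j]:
                if node in on_path:
                    matrix[i][j] = node
                    break
    return matrix


# "She ate the pizza": "ate" (index 1) is the root.
heads = [1, 1, 3, 1]
matrix = lca_matrix(heads)
```

In this example `matrix[2][3]` is 3 ("pizza" is the head of "the"), while `matrix[0][2]` is 1, since "She" and "the" only meet at the root. The nested loops make the cost grow at least quadratically with the number of tokens.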
Matthew Honnibal
9bffcaa73d
Update test to make it slightly more direct
...
The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.
2017-09-01 21:16:56 +02:00
Vimos Tan
a6d9fb5bb6
fix issue #1292
2017-08-30 14:49:14 +08:00
Jeffrey Gerard
884ba168a8
Capture more noun chunks
2017-08-23 21:18:53 -07:00
ines
dcff10abe9
Add regression test for #1281
2017-08-21 16:11:47 +02:00
ines
edc596d9a7
Add missing tokenizer exceptions (resolves #1281)
2017-08-21 16:11:36 +02:00
Delirious Lettuce
d3b03f0544
Fix typos:
...
* `auxillary` -> `auxiliary`
* `consistute` -> `constitute`
* `earlist` -> `earliest`
* `prefered` -> `preferred`
* `direcory` -> `directory`
* `reuseable` -> `reusable`
* `idiosyncracies` -> `idiosyncrasies`
* `enviroment` -> `environment`
* `unecessary` -> `unnecessary`
* `yesteday` -> `yesterday`
* `resouces` -> `resources`
2017-08-06 21:31:39 -06:00
Matthew Honnibal
d51d55bba6
Increment version
2017-07-22 15:43:16 +02:00
Matthew Honnibal
796b2f4c1b
Remove print statements in tests
2017-07-22 15:42:38 +02:00
Matthew Honnibal
4b2e5e59ed
Add flush_cache method to tokenizer, to fix #1061
...
The tokenizer caches output for common chunks, for efficiency. This
cache must be invalidated when the tokenizer rules change, e.g. when a new
special-case rule is introduced. That's what was causing #1061.
When the cache is flushed, we free the intermediate token chunks.
I *think* this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
2017-07-22 15:06:50 +02:00
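The stale-cache problem described in the commit above can be sketched in pure Python. The `CachingTokenizer` class, its punctuation rule and its method names are hypothetical and only mirror the idea; the real tokenizer manages its cache in compiled code, which is where the memory-freeing concern in the message comes from.

```python
class CachingTokenizer:
    """Toy whitespace tokenizer that caches the output for each chunk."""

    def __init__(self):
        self.special_cases = {}  # chunk -> list of token strings
        self._cache = {}         # chunk -> list of token strings

    def add_special_case(self, chunk, tokens):
        self.special_cases[chunk] = list(tokens)
        # Cached output may predate the new rule, so drop it.
        self.flush_cache()

    def flush_cache(self):
        self._cache.clear()

    def _split_chunk(self, chunk):
        if chunk in self.special_cases:
            return list(self.special_cases[chunk])
        # Toy rule: split off a single trailing punctuation character.
        if len(chunk) > 1 and chunk[-1] in ".,!?":
            return [chunk[:-1], chunk[-1]]
        return [chunk]

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self._cache:
                self._cache[chunk] = self._split_chunk(chunk)
            tokens.extend(self._cache[chunk])
        return tokens


tok = CachingTokenizer()
before = tok("gimme that.")
tok.add_special_case("gimme", ["gim", "me"])
after = tok("gimme that.")
```

Without the `flush_cache()` call inside `add_special_case`, the second call would keep returning the cached `["gimme"]` and the new rule would silently have no effect.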
Matthew Honnibal
23a55b40ca
Default to English noun chunks iterator if no lang set
2017-07-22 14:15:25 +02:00
Matthew Honnibal
9750a0128c
Fix Span.noun_chunks. Closes #1207
2017-07-22 14:14:57 +02:00
Matthew Honnibal
d9b85675d7
Rename regression test
2017-07-22 14:14:35 +02:00
Matthew Honnibal
dfbc7e49de
Add test for Issue #1207
2017-07-22 14:14:01 +02:00
Matthew Honnibal
0ae3807d7d
Fix gaps in Lexeme API. Closes #1031
2017-07-22 13:53:48 +02:00
Matthew Honnibal
83e1b5f1e3
Merge branch 'master' of https://github.com/explosion/spaCy
2017-07-22 13:45:35 +02:00
Matthew Honnibal
45f6961ae0
Add __version__ symbol in __init__.py
2017-07-22 13:45:21 +02:00
Matthew Honnibal
8b9c4c5e1c
Add missing SP symbol to tag map, re #1052
2017-07-22 13:44:17 +02:00
Ines Montani
9af04ea11f
Merge pull request #1161 from AlexisEidelman/patch-1
...
French NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:40:46 +02:00
Matthew Honnibal
44dd247e73
Merge branch 'master' of https://github.com/explosion/spaCy
2017-07-22 13:35:30 +02:00
Matthew Honnibal
94267ec50f
Fix merge conflict in printer
2017-07-22 13:35:15 +02:00
Ines Montani
c7708dc736
Merge pull request #1177 from swierh/master
...
Dutch NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:35:08 +02:00
Matthew Honnibal
5916d46ba8
Avoid use of deepcopy in printer
2017-07-22 13:34:01 +02:00
Ines Montani
9eca6503c1
Merge pull request #1157 from polm/master
...
Add basic Japanese Tokenizer Test
2017-07-10 13:07:11 +02:00
Paul O'Leary McCann
bc87b815cc
Add comment clarifying what LANGUAGES does
2017-07-09 16:28:55 +09:00
Paul O'Leary McCann
04e6a65188
Remove Japanese from LANGUAGES
...
LANGUAGES is a list of languages whose tokenizers get run through a
variety of generic tests. Since the generic tests don't check the JA
fixture, it blows up when it can't find janome. -POLM
2017-07-09 16:23:26 +09:00
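One alternative to removing a language from the generic test list, sketched under stated assumptions: keep a registry of optional dependencies and only run generic tests for languages whose package can be found. The `OPTIONAL_DEPS` mapping and both helper functions are hypothetical and are not part of the project's test suite. The standard-library `importlib.util.find_spec` is used here in place of pytest's `importorskip`, which covers the same need inside an individual test file.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical registry: language code -> optional package its tokenizer needs.
OPTIONAL_DEPS = {"ja": "janome", "zh": "jieba", "en": None}


def runnable_languages(deps=OPTIONAL_DEPS, available=has_module):
    """Languages whose tokenizer can be constructed in this environment."""
    return sorted(
        lang for lang, pkg in deps.items() if pkg is None or available(pkg)
    )
```

Passing a custom `available` callable makes the filter easy to exercise without installing or uninstalling anything.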
Swier
29720150f9
fix import of stop words in language data
2017-07-05 14:08:04 +02:00
Swier
f377c9c952
Rename stop_words.py to word_sets.py
2017-07-05 14:06:28 +02:00
Swier
5357874bf7
add Dutch numbers and ordinals
2017-07-05 14:03:30 +02:00
gispk47
669bd14213
Update __init__.py
...
Remove empty strings from the output of jieba.cut; otherwise the list of tokens can't be pushed and an assertion error is raised
2017-07-01 13:12:00 +08:00
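The fix in the commit above amounts to filtering the segmenter's output before building tokens. A minimal sketch, assuming a segmenter that can yield empty strings: `fake_cut` is a hypothetical stand-in so the example runs without jieba installed, and it makes no claim about when jieba itself produces empty pieces.

```python
def fake_cut(text):
    # Stand-in segmenter: splitting on single spaces yields "" for doubled spaces.
    return text.split(" ")


def segment(text, cut=fake_cut):
    """Return the segmenter's pieces with empty strings dropped."""
    return [word for word in cut(text) if word]


raw = fake_cut("我 爱  北京")
words = segment("我 爱  北京")
```

Dropping the empty pieces keeps every remaining word a non-empty string, which is what a token-building step would typically assert.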
Paul O'Leary McCann
c336193392
Parametrize and extend Japanese tokenizer tests
2017-06-29 00:09:40 +09:00
Paul O'Leary McCann
30a34ebb6e
Add importorskip for janome
2017-06-29 00:09:20 +09:00
Alexis
1b3a5d87ba
French NUM_WORDS and ORDINAL_WORDS
2017-06-28 14:11:20 +02:00