Paul O'Leary McCann
c435f748d7
Put Mecab import in utility function
2017-08-22 00:01:28 +09:00
ines
dcff10abe9
Add regression test for #1281
2017-08-21 16:11:47 +02:00
ines
edc596d9a7
Add missing tokenizer exceptions (resolves #1281)
2017-08-21 16:11:36 +02:00
Paul O'Leary McCann
234a8a7591
Change default tag for 動詞,非自立可能
...
Example of this is いる in these sentences:
彼はそこにいる。# should be VERB
彼は底に立っている。# should be AUX
Unclear which case is more numerous - need to check a large corpus - but
in keeping with the other ambiguous tags, this is mapped to the
"dominant" or first part of the tag. -POLM
2017-08-21 00:21:45 +09:00
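The default described in 234a8a7591 (map an ambiguous tag to its first, "dominant" part) can be sketched roughly as below. This is an illustration only, not the code from the commit: `COARSE_TAG_MAP` and `default_ud_tag` are hypothetical names, and the map holds just a few entries picked for the example.

```python
# Hypothetical coarse map from the first field of a Unidic tag to a UD tag.
# Only a handful of entries are listed, for illustration.
COARSE_TAG_MAP = {
    "動詞": "VERB",    # verb
    "助動詞": "AUX",   # auxiliary verb
    "名詞": "NOUN",    # noun
}


def default_ud_tag(unidic_tag):
    """Map a comma-separated Unidic tag to a UD tag using its first field."""
    first_field = unidic_tag.split(",")[0]
    # Fall back to "X", the UD tag for tokens with no other category.
    return COARSE_TAG_MAP.get(first_field, "X")


print(default_ud_tag("動詞,非自立可能"))
```

Under this default, いる gets VERB in both example sentences, so the AUX reading in 立っている is the case that remains unresolved.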
Paul O'Leary McCann
6e9e686568
Sample implementation of Japanese Tagger (ref #1214)
...
This is far from complete but it should be enough to check some things.
1. Mecab transition. Janome doesn't support Unidic, only IPAdic, but UD
tag mappings are based on Unidic. This swaps Janome out for Mecab to
get around that.
2. Raw tag extension. A simple tag map can't meet the specifications for
UD tag mappings, so this adds an extra field to ambiguous cases. For
this demo it just deals with the simplest case, which only needs to look
at the literal token. (In reality it may be necessary to look at the
whole sentence, but that's another issue.)
3. General code structure. Seems nobody else has implemented a custom
Tagger yet, so still not sure this is the correct way to pass the
vocabulary around, for example.
Any feedback would be greatly appreciated. -POLM
2017-08-08 01:27:15 +09:00
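A rough sketch of the "raw tag extension" idea from point 2 of 6e9e686568: an extra field is appended to an ambiguous raw tag so that a plain tag map can still be used afterwards. Everything here is assumed for illustration, including the names `TAG_MAP`, `AMBIGUOUS_TAGS`, `resolve_tag` and `ud_tag`, the choice of 連体詞 as the ambiguous tag, and the demonstrative rule; the commit's actual rules are not shown in this log.

```python
import re

# Hypothetical tag map. Ambiguous raw tags appear with an extra field
# appended, so each key still maps to exactly one UD tag.
TAG_MAP = {
    "名詞,普通名詞": "NOUN",
    "連体詞,DET": "DET",
    "連体詞,ADJ": "ADJ",
}

# Raw tags that a one-to-one map cannot resolve (assumed for this sketch).
AMBIGUOUS_TAGS = {"連体詞"}


def resolve_tag(surface, raw_tag):
    """Append a disambiguating field to an ambiguous raw tag."""
    if raw_tag not in AMBIGUOUS_TAGS:
        return raw_tag
    # Illustrative rule only: demonstratives such as この / その become DET.
    if re.match(r"^[こそあど]の$", surface):
        return raw_tag + ",DET"
    return raw_tag + ",ADJ"


def ud_tag(surface, raw_tag):
    """Look up the UD tag after extending the raw tag if needed."""
    return TAG_MAP[resolve_tag(surface, raw_tag)]
```

This only looks at the literal token, which matches the simplest case mentioned in the message; cases that depend on the whole sentence would need more context than this function receives.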
Delirious Lettuce
d3b03f0544
Fix typos:
...
* `auxillary` -> `auxiliary`
* `consistute` -> `constitute`
* `earlist` -> `earliest`
* `prefered` -> `preferred`
* `direcory` -> `directory`
* `reuseable` -> `reusable`
* `idiosyncracies` -> `idiosyncrasies`
* `enviroment` -> `environment`
* `unecessary` -> `unnecessary`
* `yesteday` -> `yesterday`
* `resouces` -> `resources`
2017-08-06 21:31:39 -06:00
Matthew Honnibal
d51d55bba6
Increment version
2017-07-22 15:43:16 +02:00
Matthew Honnibal
796b2f4c1b
Remove print statements in tests
2017-07-22 15:42:38 +02:00
Matthew Honnibal
4b2e5e59ed
Add flush_cache method to tokenizer, to fix #1061
...
The tokenizer caches output for common chunks, for efficiency. This
cache must be invalidated when the tokenizer rules change, e.g. when a new
special-case rule is introduced. That's what was causing #1061.
When the cache is flushed, we free the intermediate token chunks.
I *think* this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
2017-07-22 15:06:50 +02:00
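The cache problem fixed in 4b2e5e59ed can be illustrated with a toy tokenizer. This is a minimal sketch under assumptions: `CachingTokenizer` and its methods are hypothetical stand-ins written in plain Python, not the actual tokenizer, and the memory-freeing concern from the message does not arise here because Python manages the memory.

```python
class CachingTokenizer:
    """Toy whitespace tokenizer with a per-chunk cache (hypothetical)."""

    def __init__(self):
        self._special_cases = {}
        self._cache = {}

    def add_special_case(self, chunk, tokens):
        """Register a rule and drop cached output that may predate it."""
        self._special_cases[chunk] = list(tokens)
        self.flush_cache()

    def flush_cache(self):
        """Forget all cached chunk tokenizations."""
        self._cache.clear()

    def _tokenize_chunk(self, chunk):
        if chunk in self._special_cases:
            return list(self._special_cases[chunk])
        return [chunk]

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self._cache:
                self._cache[chunk] = self._tokenize_chunk(chunk)
            tokens.extend(self._cache[chunk])
        return tokens


tokenizer = CachingTokenizer()
before = tokenizer("don't go")  # caches "don't" as a single token
tokenizer.add_special_case("don't", ["do", "n't"])
after = tokenizer("don't go")   # the new rule applies because the cache was flushed
```

Without the `flush_cache()` call in `add_special_case`, the second call would keep returning the stale cached entry for "don't".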
Matthew Honnibal
23a55b40ca
Default to English noun chunks iterator if no lang set
2017-07-22 14:15:25 +02:00
Matthew Honnibal
9750a0128c
Fix Span.noun_chunks. Closes #1207
2017-07-22 14:14:57 +02:00
Matthew Honnibal
d9b85675d7
Rename regression test
2017-07-22 14:14:35 +02:00
Matthew Honnibal
dfbc7e49de
Add test for Issue #1207
2017-07-22 14:14:01 +02:00
Matthew Honnibal
0ae3807d7d
Fix gaps in Lexeme API. Closes #1031
2017-07-22 13:53:48 +02:00
Matthew Honnibal
83e1b5f1e3
Merge branch 'master' of https://github.com/explosion/spaCy
2017-07-22 13:45:35 +02:00
Matthew Honnibal
45f6961ae0
Add __version__ symbol in __init__.py
2017-07-22 13:45:21 +02:00
Matthew Honnibal
8b9c4c5e1c
Add missing SP symbol to tag map, re #1052
2017-07-22 13:44:17 +02:00
Ines Montani
9af04ea11f
Merge pull request #1161 from AlexisEidelman/patch-1
...
French NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:40:46 +02:00
Matthew Honnibal
44dd247e73
Merge branch 'master' of https://github.com/explosion/spaCy
2017-07-22 13:35:30 +02:00
Matthew Honnibal
94267ec50f
Fix merge conflict in printer
2017-07-22 13:35:15 +02:00
Ines Montani
c7708dc736
Merge pull request #1177 from swierh/master
...
Dutch NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:35:08 +02:00
Matthew Honnibal
5916d46ba8
Avoid use of deepcopy in printer
2017-07-22 13:34:01 +02:00
Ines Montani
9eca6503c1
Merge pull request #1157 from polm/master
...
Add basic Japanese Tokenizer Test
2017-07-10 13:07:11 +02:00
Paul O'Leary McCann
bc87b815cc
Add comment clarifying what LANGUAGES does
2017-07-09 16:28:55 +09:00
Paul O'Leary McCann
04e6a65188
Remove Japanese from LANGUAGES
...
LANGUAGES is a list of languages whose tokenizers get run through a
variety of generic tests. Since the generic tests don't check the JA
fixture, it blows up when it can't find janome. -POLM
2017-07-09 16:23:26 +09:00
Swier
29720150f9
fix import of stop words in language data
2017-07-05 14:08:04 +02:00
Swier
f377c9c952
Rename stop_words.py to word_sets.py
2017-07-05 14:06:28 +02:00
Swier
5357874bf7
add Dutch numbers and ordinals
2017-07-05 14:03:30 +02:00
Raphaël Bournhonesque
8592f3de47
Fix fuzzy unit tests
2017-07-01 15:03:32 +02:00
Raphaël Bournhonesque
f4748834d9
Use spacy hash_string function instead of md5
2017-07-01 13:17:26 +02:00
Raphaël Bournhonesque
c3d722d66f
Add a disclaimer about classes copied from the Jinja2 project
2017-07-01 13:09:56 +02:00
gispk47
669bd14213
Update __init__.py
...
Remove the empty strings returned by jieba.cut, which otherwise trigger an assertion error because the list of tokens can't be pushed
2017-07-01 13:12:00 +08:00
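The filtering described in 669bd14213 amounts to something like the sketch below. `non_empty_tokens` is a hypothetical helper, and the list stands in for segmenter output so the example runs without jieba installed.

```python
def non_empty_tokens(words):
    """Drop empty strings from a segmenter's output."""
    return [word for word in words if word]


# Stand-in for the output of a segmenter such as jieba.cut.
segmented = ["我", "", "爱", "北京", ""]
print(non_empty_tokens(segmented))
```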
Paul O'Leary McCann
c336193392
Parametrize and extend Japanese tokenizer tests
2017-06-29 00:09:40 +09:00
Paul O'Leary McCann
30a34ebb6e
Add importorskip for janome
2017-06-29 00:09:20 +09:00
Alexis
1b3a5d87ba
French NUM_WORDS and ORDINAL_WORDS
2017-06-28 14:11:20 +02:00
Paul O'Leary McCann
e56fea14eb
Add basic Japanese tokenizer test
2017-06-28 01:24:25 +09:00
Paul O'Leary McCann
84041a2bb5
Make create_tokenizer work with Japanese
2017-06-28 01:18:05 +09:00
Raphaël Bournhonesque
46637369aa
Add basic unit tests for Pattern
2017-06-11 18:34:38 +02:00
Raphaël Bournhonesque
1849a110e3
Improve logging
2017-06-11 18:31:19 +02:00
Raphaël Bournhonesque
4289a21703
Add 'ent' to node matching key
2017-06-11 18:30:53 +02:00
Raphaël Bournhonesque
d010f5a123
Fix node matching bug caused by lower function
2017-06-11 18:30:28 +02:00
Raphaël Bournhonesque
4ca8a396a2
Do not add the root token to the adjacency map
2017-06-11 18:30:01 +02:00
Raphaël Bournhonesque
d9c567371f
Move add_node and add_edge methods to the Tree base class
2017-06-11 18:29:28 +02:00
Raphaël Bournhonesque
8ff4f512a2
Check in PatternParser that the generated Pattern is valid
2017-06-11 18:28:36 +02:00
Raphaël Bournhonesque
e55199d454
Implementation of Pattern
2017-06-11 01:06:24 +02:00
György Orosz
fa26041da6
Fixed typo in cli/package.py
2017-06-07 16:19:08 +02:00
Ines Montani
e7ef51b382
Update tokenizer_exceptions.py
2017-06-02 19:00:01 +02:00
Ines Montani
81918155ef
Merge pull request #1096 from recognai/master
...
Spanish model features
2017-06-02 11:07:27 +02:00
Francisco Aranda
70a2180199
fix(spanish sentence segmentation): remove tokenizer exceptions that break sentence segmentation. Aligned with training corpus
2017-06-02 08:19:57 +02:00
Francisco Aranda
5b385e7d78
feat(spanish model): add the spanish noun chunker
2017-06-02 08:14:06 +02:00
Ines Montani
7f6be41f21
Fix typo in English tokenizer exceptions (resolves #1071)
2017-05-23 12:18:00 +02:00
Raphaël Bournhonesque
6381ebfb14
Use yield from syntax
2017-05-18 10:42:35 +02:00
Raphaël Bournhonesque
f37d078d6a
Fix issue #1069 with custom hook Doc.sents definition
2017-05-18 09:59:38 +02:00
ines
9003fd25e5
Fix error messages if model is required (resolves #1051)
...
Rename about.__docs__ to about.__docs_models__.
2017-05-13 13:14:02 +02:00
ines
24e973b17f
Rename about.__docs__ to about.__docs_models__
2017-05-13 13:09:00 +02:00
ines
6e1dbc608e
Fix parse_tree test
2017-05-13 12:34:20 +02:00
ines
573f0ba867
Replace deepcopy
2017-05-13 12:34:14 +02:00
ines
bd428c0a70
Set defaults for light and flat kwargs
2017-05-13 12:34:05 +02:00
ines
c5669450a0
Fix formatting
2017-05-13 12:33:57 +02:00
Matthew Honnibal
ad590feaa8
Fix test, which imported English incorrectly
2017-05-13 11:36:19 +02:00
Ines Montani
8d742ac8ff
Merge pull request #1055 from recognai/master
...
Enable pruning out rare words from clusters data
2017-05-13 03:22:56 +02:00
Matthew Honnibal
b2540d2379
Merge Kengz's tree_print patch
2017-05-13 03:18:49 +02:00
oeg
cdaefae60a
feature(populate_vocab): Enable pruning out rare words from clusters data
2017-05-12 16:15:19 +02:00
ines
b1f22c5a10
Fix formatting
2017-05-03 20:11:02 +02:00
ines
a04b5be1b2
Add glossary for annotation scheme (closes #1034)
...
Can be imported as explain from spacy.glossary, or called as
spacy.explain(term)
2017-05-03 17:02:17 +02:00
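The glossary lookup added in a04b5be1b2 might look roughly like this. The `GLOSSARY` entries are a made-up excerpt, and returning `None` for unknown terms is an assumption of this sketch rather than something the message states.

```python
# Hypothetical excerpt; the real glossary's contents are not shown in this log.
GLOSSARY = {
    "NOUN": "noun",
    "VERB": "verb",
    "AUX": "auxiliary",
}


def explain(term):
    """Return the description for an annotation label, or None if unknown."""
    return GLOSSARY.get(term)


print(explain("AUX"))
```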
Ines Montani
3ea23a3f4d
Fix formatting
2017-05-03 09:44:38 +02:00
Ines Montani
d730eb0c0d
Raise custom ImportError if importing janome fails
2017-05-03 09:43:29 +02:00
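The pattern behind d730eb0c0d (turn a bare import failure into a message that tells the user what to install) can be sketched generically. `import_optional` and the hint text are hypothetical; the commit's actual wording is not shown in this log.

```python
import importlib


def import_optional(module_name, install_hint):
    """Import a module, raising an ImportError with an install hint on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise ImportError(
            "This feature requires '{}'. {}".format(module_name, install_hint)
        )


# A module that is always available imports normally.
json_module = import_optional("json", "It ships with Python.")
```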
Ines Montani
949ad6594b
Add newline
2017-05-03 09:38:43 +02:00
Ines Montani
d12ca587ea
Add newline
2017-05-03 09:38:29 +02:00
Ines Montani
8676cd0135
Add newline
2017-05-03 09:38:07 +02:00
Yasuaki Uechi
c8f83aeb87
Add basic Japanese support
2017-05-03 13:56:21 +09:00
Matthew Honnibal
31ec9e1371
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-27 13:21:39 +02:00
Matthew Honnibal
2da16adcc2
Add dropout option for parser and NER
...
Dropout can now be specified in the `Parser.update()` method via
the `drop` keyword argument, e.g.
nlp.entity.update(doc, gold, drop=0.4)
This will randomly drop 40% of features, and multiply the value of the
others by 1. / 0.4. This may be useful for generalising from small data
sets.
This commit also patches the examples/training/train_new_entity_type.py
example, to use dropout and fix the output (previously it did not output
the learned entity).
2017-04-27 13:18:39 +02:00
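The feature dropout described in 2da16adcc2 can be sketched in plain Python. This is an illustration, not the parser's implementation: `dropout_features` is a hypothetical helper, and the scale factor follows the message's wording literally (1. / drop). Many dropout formulations rescale by 1 / (1 - drop) instead; which one the patched code uses is not confirmed here.

```python
import random


def dropout_features(values, drop, rng):
    """Zero each value with probability `drop` and rescale the survivors."""
    if drop <= 0.0:
        return list(values)
    # Scale factor taken literally from the commit message (1. / drop).
    scale = 1.0 / drop
    return [0.0 if rng.random() < drop else value * scale for value in values]


rng = random.Random(0)  # seeded so repeated runs drop the same features
features = [1.0] * 10
dropped = dropout_features(features, drop=0.4, rng=rng)
```

Each entry of `dropped` is either 0.0 or the original value times 2.5.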
Ines Montani
7da9cefd25
Merge pull request #1022 from luvogels/master
...
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Ines Montani
c9e592ae6c
Add newline
2017-04-27 11:15:41 +02:00
Ines Montani
5942adccc2
Add newline
2017-04-27 11:15:19 +02:00
Ines Montani
4cd9269aef
Add newline
2017-04-27 11:15:04 +02:00
Ines Montani
ccf13ecc21
Add newline
2017-04-27 11:14:42 +02:00
Ines Montani
03d2b0cc05
Add newline
2017-04-27 11:14:26 +02:00
luvogels
d12a0b6431
Hooked up tokenizer tests
2017-04-26 23:21:41 +02:00
Matthew Honnibal
f0e1606d27
Increment version
2017-04-26 20:25:41 +02:00
luvogels
b331929a7e
Merge branch 'master' of https://github.com/luvogels/spaCy
2017-04-26 19:15:48 +02:00
luvogels
8de59ce3b9
Added tokenizer tests
2017-04-26 19:10:18 +02:00
Matthew Honnibal
4d98511db7
Make Span hashable. Closes #1019
2017-04-26 19:01:05 +02:00
Matthew Honnibal
24c4c51f13
Try to make test999 less flakey
2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang
460094bf09
Update __init__.py
2017-04-26 18:27:55 +02:00
ines
527d51ac9a
Fetch shortcuts from GitHub and improve error handling
2017-04-26 18:00:28 +02:00
Matthew Honnibal
c4be9c36fe
Fix unicode header in tests
2017-04-24 10:09:01 +02:00
Matthew Honnibal
65f10b53e5
Fix test
2017-04-24 00:25:55 +02:00
Matthew Honnibal
70a43858e1
Fix flakey test
2017-04-24 00:06:30 +02:00
Matthew Honnibal
3973af2d15
Make training test less flakey
2017-04-23 22:59:34 +02:00
Matthew Honnibal
4f9657b42b
Fix reporting if no dev data with train
2017-04-23 22:27:10 +02:00
Matthew Honnibal
df2ac8b843
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-23 21:25:07 +02:00
Matthew Honnibal
d0e19267e8
Create directory if missing in save_to_directory
2017-04-23 21:24:43 +02:00
ines
42305bc519
Remove unnecessary test
2017-04-23 21:21:41 +02:00
ines
012ea594d1
Add file for misc tests
2017-04-23 21:06:51 +02:00
ines
83f66947dc
Rename test_download to test_cli
2017-04-23 21:06:50 +02:00
ines
401045433c
Simplify compat.fix_text
2017-04-23 21:06:50 +02:00
Matthew Honnibal
e033c86a64
Increment version
2017-04-23 21:03:43 +02:00
Matthew Honnibal
d2436dc17b
Update fix for Issue #999
2017-04-23 18:14:37 +02:00