Commit Graph

2844 Commits

Author SHA1 Message Date
Ines Montani
9af04ea11f Merge pull request #1161 from AlexisEidelman/patch-1
French NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:40:46 +02:00
Matthew Honnibal
44dd247e73 Merge branch 'master' of https://github.com/explosion/spaCy 2017-07-22 13:35:30 +02:00
Matthew Honnibal
94267ec50f Fix merge conflict in printer 2017-07-22 13:35:15 +02:00
Ines Montani
c7708dc736 Merge pull request #1177 from swierh/master
Dutch NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:35:08 +02:00
Matthew Honnibal
5916d46ba8 Avoid use of deepcopy in printer 2017-07-22 13:34:01 +02:00
Ines Montani
9eca6503c1 Merge pull request #1157 from polm/master
Add basic Japanese Tokenizer Test
2017-07-10 13:07:11 +02:00
Paul O'Leary McCann
bc87b815cc Add comment clarifying what LANGUAGES does 2017-07-09 16:28:55 +09:00
Paul O'Leary McCann
04e6a65188 Remove Japanese from LANGUAGES
LANGUAGES is a list of languages whose tokenizers get run through a
variety of generic tests. Since the generic tests don't check the JA
fixture, it blows up when it can't find janome. -POLM
2017-07-09 16:23:26 +09:00
Swier
29720150f9 fix import of stop words in language data 2017-07-05 14:08:04 +02:00
Swier
f377c9c952 Rename stop_words.py to word_sets.py 2017-07-05 14:06:28 +02:00
Swier
5357874bf7 add Dutch numbers and ordinals 2017-07-05 14:03:30 +02:00
gispk47
669bd14213 Update __init__.py
Remove the empty strings returned from jieba.cut; they cause an assert error because the list of tokens can't be pushed.
2017-07-01 13:12:00 +08:00
Paul O'Leary McCann
c336193392 Parametrize and extend Japanese tokenizer tests 2017-06-29 00:09:40 +09:00
Paul O'Leary McCann
30a34ebb6e Add importorskip for janome 2017-06-29 00:09:20 +09:00
Alexis
1b3a5d87ba French NUM_WORDS and ORDINAL_WORDS 2017-06-28 14:11:20 +02:00
Paul O'Leary McCann
e56fea14eb Add basic Japanese tokenizer test 2017-06-28 01:24:25 +09:00
Paul O'Leary McCann
84041a2bb5 Make create_tokenizer work with Japanese 2017-06-28 01:18:05 +09:00
György Orosz
fa26041da6 Fixed typo in cli/package.py 2017-06-07 16:19:08 +02:00
Ines Montani
e7ef51b382 Update tokenizer_exceptions.py 2017-06-02 19:00:01 +02:00
Ines Montani
81918155ef Merge pull request #1096 from recognai/master
Spanish model features
2017-06-02 11:07:27 +02:00
Francisco Aranda
70a2180199 fix(spanish sentence segmentation): remove tokenizer exceptions that break sentence segmentation. Aligned with training corpus 2017-06-02 08:19:57 +02:00
Francisco Aranda
5b385e7d78 feat(spanish model): add the spanish noun chunker 2017-06-02 08:14:06 +02:00
Ines Montani
7f6be41f21 Fix typo in English tokenizer exceptions (resolves #1071) 2017-05-23 12:18:00 +02:00
Raphaël Bournhonesque
6381ebfb14 Use yield from syntax 2017-05-18 10:42:35 +02:00
Raphaël Bournhonesque
f37d078d6a Fix issue #1069 with custom hook Doc.sents definition 2017-05-18 09:59:38 +02:00
ines
9003fd25e5 Fix error messages if model is required (resolves #1051)
Rename about.__docs__ to about.__docs_models__.
2017-05-13 13:14:02 +02:00
ines
24e973b17f Rename about.__docs__ to about.__docs_models__ 2017-05-13 13:09:00 +02:00
ines
6e1dbc608e Fix parse_tree test 2017-05-13 12:34:20 +02:00
ines
573f0ba867 Replace deepcopy 2017-05-13 12:34:14 +02:00
ines
bd428c0a70 Set defaults for light and flat kwargs 2017-05-13 12:34:05 +02:00
ines
c5669450a0 Fix formatting 2017-05-13 12:33:57 +02:00
Matthew Honnibal
ad590feaa8 Fix test, which imported English incorrectly 2017-05-13 11:36:19 +02:00
Ines Montani
8d742ac8ff Merge pull request #1055 from recognai/master
Enable pruning out rare words from clusters data
2017-05-13 03:22:56 +02:00
Matthew Honnibal
b2540d2379 Merge Kengz's tree_print patch 2017-05-13 03:18:49 +02:00
oeg
cdaefae60a feature(populate_vocab): Enable pruning out rare words from clusters data 2017-05-12 16:15:19 +02:00
ines
b1f22c5a10 Fix formatting 2017-05-03 20:11:02 +02:00
ines
a04b5be1b2 Add glossary for annotation scheme (closes #1034)
Can be imported as explain from spacy.glossary, or called as
spacy.explain(term)
2017-05-03 17:02:17 +02:00
Ines Montani
3ea23a3f4d Fix formatting 2017-05-03 09:44:38 +02:00
Ines Montani
d730eb0c0d Raise custom ImportError if importing janome fails 2017-05-03 09:43:29 +02:00
Ines Montani
949ad6594b Add newline 2017-05-03 09:38:43 +02:00
Ines Montani
d12ca587ea Add newline 2017-05-03 09:38:29 +02:00
Ines Montani
8676cd0135 Add newline 2017-05-03 09:38:07 +02:00
Yasuaki Uechi
c8f83aeb87 Add basic Japanese support 2017-05-03 13:56:21 +09:00
Matthew Honnibal
31ec9e1371 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-27 13:21:39 +02:00
Matthew Honnibal
2da16adcc2 Add dropout option for parser and NER
Dropout can now be specified in the `Parser.update()` method via
the `drop` keyword argument, e.g.

    nlp.entity.update(doc, gold, drop=0.4)

This will randomly drop 40% of features, and multiply the value of the
others by 1. / 0.4. This may be useful for generalising from small data
sets.

This commit also patches the examples/training/train_new_entity_type.py
example, to use dropout and fix the output (previously it did not output
the learned entity).
2017-04-27 13:18:39 +02:00
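
Note on commit 2da16adcc2: the feature dropout described in that message can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the parser's actual code: the function name `dropout_features` and its `rng` parameter are hypothetical, and the scale factor `1. / drop` follows the wording of the commit message literally. Conventional inverted dropout rescales by `1. / (1. - drop)` instead; which of the two the parser uses is not confirmed by this log.

```python
import random


def dropout_features(values, drop, rng=None):
    """Zero each feature value with probability `drop` and rescale the rest.

    Hypothetical helper for illustration only. The scale factor 1 / drop
    mirrors the commit message text and is an assumption about the real code.
    """
    rng = rng or random.Random()
    # Outside the open interval (0, 1) dropout is treated as disabled.
    if drop <= 0.0 or drop >= 1.0:
        return list(values)
    scale = 1.0 / drop
    # Each feature is dropped independently; survivors are multiplied by scale.
    return [0.0 if rng.random() < drop else value * scale for value in values]


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0, 4.0]
    print(dropout_features(features, drop=0.4, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing training runs with and without dropout.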
Ines Montani
7da9cefd25 Merge pull request #1022 from luvogels/master
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Ines Montani
c9e592ae6c Add newline 2017-04-27 11:15:41 +02:00
Ines Montani
5942adccc2 Add newline 2017-04-27 11:15:19 +02:00
Ines Montani
4cd9269aef Add newline 2017-04-27 11:15:04 +02:00
Ines Montani
ccf13ecc21 Add newline 2017-04-27 11:14:42 +02:00