Commit Graph

2907 Commits

Author SHA1 Message Date
Paul O'Leary McCann
6e9e686568 Sample implementation of Japanese Tagger (ref #1214)
This is far from complete but it should be enough to check some things.

1. MeCab transition. Janome doesn't support Unidic, only IPAdic, but UD
tag mappings are based on Unidic. This replaces Janome with MeCab to
get around that.

2. Raw tag extension. A simple tag map can't meet the specifications for
UD tag mappings, so this adds an extra field to ambiguous cases. For
this demo it just deals with the simplest case, which only needs to look
at the literal token. (In reality it may be necessary to look at the
whole sentence, but that's another issue.)

3. General code structure. Seems nobody else has implemented a custom
Tagger yet, so still not sure this is the correct way to pass the
vocabulary around, for example.

Any feedback would be greatly appreciated. -POLM
2017-08-08 01:27:15 +09:00
Delirious Lettuce
d3b03f0544 Fix typos:
  * `auxillary` -> `auxiliary`
  * `consistute` -> `constitute`
  * `earlist` -> `earliest`
  * `prefered` -> `preferred`
  * `direcory` -> `directory`
  * `reuseable` -> `reusable`
  * `idiosyncracies` -> `idiosyncrasies`
  * `enviroment` -> `environment`
  * `unecessary` -> `unnecessary`
  * `yesteday` -> `yesterday`
  * `resouces` -> `resources`
2017-08-06 21:31:39 -06:00
Matthew Honnibal
d51d55bba6 Increment version 2017-07-22 15:43:16 +02:00
Matthew Honnibal
796b2f4c1b Remove print statements in tests 2017-07-22 15:42:38 +02:00
Matthew Honnibal
4b2e5e59ed Add flush_cache method to tokenizer, to fix #1061
The tokenizer caches output for common chunks, for efficiency. This
cache must be invalidated when the tokenizer rules change, e.g. when a new
special-case rule is introduced. Failing to do so is what was causing #1061.

When the cache is flushed, we free the intermediate token chunks.
I *think* this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
2017-07-22 15:06:50 +02:00
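The cache invalidation described in 4b2e5e59ed can be illustrated with a minimal sketch. This is not spaCy's implementation: `CachingTokenizer`, `add_special_case` and the whitespace-based chunking are hypothetical stand-ins, and only the method name `flush_cache` is taken from the commit message.

```python
class CachingTokenizer:
    """Toy whitespace tokenizer with a per-chunk cache (hypothetical, not spaCy's code)."""

    def __init__(self):
        self._special_cases = {}  # chunk -> list of tokens
        self._cache = {}          # chunk -> cached tokenization

    def add_special_case(self, chunk, tokens):
        self._special_cases[chunk] = list(tokens)
        # The rules changed, so previously cached output may be stale.
        self.flush_cache()

    def flush_cache(self):
        self._cache.clear()

    def _tokenize_chunk(self, chunk):
        if chunk in self._special_cases:
            return list(self._special_cases[chunk])
        return [chunk]

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self._cache:
                self._cache[chunk] = self._tokenize_chunk(chunk)
            tokens.extend(self._cache[chunk])
        return tokens


tokenizer = CachingTokenizer()
before = tokenizer("don't panic")                   # caches "don't" as one token
tokenizer.add_special_case("don't", ["do", "n't"])  # flushes the cache
after = tokenizer("don't panic")                    # picks up the new rule
```

Without the `flush_cache()` call in `add_special_case`, the second call would still return the cached single-token result for `don't`.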
Matthew Honnibal
23a55b40ca Default to English noun chunks iterator if no lang set 2017-07-22 14:15:25 +02:00
Matthew Honnibal
9750a0128c Fix Span.noun_chunks. Closes #1207 2017-07-22 14:14:57 +02:00
Matthew Honnibal
d9b85675d7 Rename regression test 2017-07-22 14:14:35 +02:00
Matthew Honnibal
dfbc7e49de Add test for Issue #1207 2017-07-22 14:14:01 +02:00
Matthew Honnibal
0ae3807d7d Fix gaps in Lexeme API. Closes #1031 2017-07-22 13:53:48 +02:00
Matthew Honnibal
83e1b5f1e3 Merge branch 'master' of https://github.com/explosion/spaCy 2017-07-22 13:45:35 +02:00
Matthew Honnibal
45f6961ae0 Add __version__ symbol in __init__.py 2017-07-22 13:45:21 +02:00
Matthew Honnibal
8b9c4c5e1c Add missing SP symbol to tag map, re #1052 2017-07-22 13:44:17 +02:00
Ines Montani
9af04ea11f Merge pull request #1161 from AlexisEidelman/patch-1
French NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:40:46 +02:00
Matthew Honnibal
44dd247e73 Merge branch 'master' of https://github.com/explosion/spaCy 2017-07-22 13:35:30 +02:00
Matthew Honnibal
94267ec50f Fix merge conflict in printer 2017-07-22 13:35:15 +02:00
Ines Montani
c7708dc736 Merge pull request #1177 from swierh/master
Dutch NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:35:08 +02:00
Matthew Honnibal
5916d46ba8 Avoid use of deepcopy in printer 2017-07-22 13:34:01 +02:00
Ines Montani
9eca6503c1 Merge pull request #1157 from polm/master
Add basic Japanese Tokenizer Test
2017-07-10 13:07:11 +02:00
Paul O'Leary McCann
bc87b815cc Add comment clarifying what LANGUAGES does 2017-07-09 16:28:55 +09:00
Paul O'Leary McCann
04e6a65188 Remove Japanese from LANGUAGES
LANGUAGES is a list of languages whose tokenizers get run through a
variety of generic tests. Since the generic tests don't check the JA
fixture, it blows up when it can't find janome. -POLM
2017-07-09 16:23:26 +09:00
Swier
29720150f9 fix import of stop words in language data 2017-07-05 14:08:04 +02:00
Swier
f377c9c952 Rename stop_words.py to word_sets.py 2017-07-05 14:06:28 +02:00
Swier
5357874bf7 add Dutch numbers and ordinals 2017-07-05 14:03:30 +02:00
gispk47
669bd14213 Update __init__.py
Remove the empty strings returned by jieba.cut; they cause an assertion error because the list of tokens can't be pushed.
2017-07-01 13:12:00 +08:00
Paul O'Leary McCann
c336193392 Parametrize and extend Japanese tokenizer tests 2017-06-29 00:09:40 +09:00
Paul O'Leary McCann
30a34ebb6e Add importorskip for janome 2017-06-29 00:09:20 +09:00
Alexis
1b3a5d87ba French NUM_WORDS and ORDINAL_WORDS 2017-06-28 14:11:20 +02:00
Paul O'Leary McCann
e56fea14eb Add basic Japanese tokenizer test 2017-06-28 01:24:25 +09:00
Paul O'Leary McCann
84041a2bb5 Make create_tokenizer work with Japanese 2017-06-28 01:18:05 +09:00
György Orosz
fa26041da6 Fixed typo in cli/package.py 2017-06-07 16:19:08 +02:00
Ines Montani
e7ef51b382 Update tokenizer_exceptions.py 2017-06-02 19:00:01 +02:00
Ines Montani
81918155ef Merge pull request #1096 from recognai/master
Spanish model features
2017-06-02 11:07:27 +02:00
Francisco Aranda
70a2180199 fix(spanish sentence segmentation): remove tokenizer exceptions that break sentence segmentation. Aligned with training corpus 2017-06-02 08:19:57 +02:00
Francisco Aranda
5b385e7d78 feat(spanish model): add the spanish noun chunker 2017-06-02 08:14:06 +02:00
Ines Montani
7f6be41f21 Fix typo in English tokenizer exceptions (resolves #1071) 2017-05-23 12:18:00 +02:00
Raphaël Bournhonesque
6381ebfb14 Use yield from syntax 2017-05-18 10:42:35 +02:00
Raphaël Bournhonesque
f37d078d6a Fix issue #1069 with custom hook Doc.sents definition 2017-05-18 09:59:38 +02:00
ines
9003fd25e5 Fix error messages if model is required (resolves #1051)
Rename about.__docs__ to about.__docs_models__.
2017-05-13 13:14:02 +02:00
ines
24e973b17f Rename about.__docs__ to about.__docs_models__ 2017-05-13 13:09:00 +02:00
ines
6e1dbc608e Fix parse_tree test 2017-05-13 12:34:20 +02:00
ines
573f0ba867 Replace deepcopy 2017-05-13 12:34:14 +02:00
ines
bd428c0a70 Set defaults for light and flat kwargs 2017-05-13 12:34:05 +02:00
ines
c5669450a0 Fix formatting 2017-05-13 12:33:57 +02:00
Matthew Honnibal
ad590feaa8 Fix test, which imported English incorrectly 2017-05-13 11:36:19 +02:00
Ines Montani
8d742ac8ff Merge pull request #1055 from recognai/master
Enable pruning out rare words from clusters data
2017-05-13 03:22:56 +02:00
Matthew Honnibal
b2540d2379 Merge Kengz's tree_print patch 2017-05-13 03:18:49 +02:00
oeg
cdaefae60a feature(populate_vocab): Enable pruning out rare words from clusters data 2017-05-12 16:15:19 +02:00
ines
b1f22c5a10 Fix formatting 2017-05-03 20:11:02 +02:00
ines
a04b5be1b2 Add glossary for annotation scheme (closes #1034)
Can be imported as explain from spacy.glossary, or called as
spacy.explain(term)
2017-05-03 17:02:17 +02:00
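The lookup described in a04b5be1b2 amounts to a dictionary of annotation labels and their descriptions. A minimal self-contained sketch, assuming a small stand-in table (the `GLOSSARY` contents below are illustrative and the real glossary is assumed to be much larger):

```python
# Tiny stand-in for the glossary table; entries are illustrative only.
GLOSSARY = {
    "NOUN": "noun",
    "ADJ": "adjective",
    "nsubj": "nominal subject",
}


def explain(term):
    """Return a description for a tag or label, or None if it is unknown."""
    return GLOSSARY.get(term)


noun_description = explain("NOUN")
unknown_description = explain("NOT_A_TAG")
```

Returning `None` for unknown terms is a design choice of this sketch; it keeps the helper safe to call on arbitrary labels.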
Ines Montani
3ea23a3f4d Fix formatting 2017-05-03 09:44:38 +02:00
Ines Montani
d730eb0c0d Raise custom ImportError if importing janome fails 2017-05-03 09:43:29 +02:00
Ines Montani
949ad6594b Add newline 2017-05-03 09:38:43 +02:00
Ines Montani
d12ca587ea Add newline 2017-05-03 09:38:29 +02:00
Ines Montani
8676cd0135 Add newline 2017-05-03 09:38:07 +02:00
Yasuaki Uechi
c8f83aeb87 Add basic Japanese support 2017-05-03 13:56:21 +09:00
Matthew Honnibal
31ec9e1371 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-27 13:21:39 +02:00
Matthew Honnibal
2da16adcc2 Add dropout option for parser and NER
Dropout can now be specified in the `Parser.update()` method via
the `drop` keyword argument, e.g.

    nlp.entity.update(doc, gold, drop=0.4)

This will randomly drop 40% of features, and multiply the value of the
others by 1. / 0.4. This may be useful for generalising from small data
sets.

This commit also patches the examples/training/train_new_entity_type.py
example, to use dropout and fix the output (previously it did not output
the learned entity).
2017-04-27 13:18:39 +02:00
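The feature dropout described in 2da16adcc2 can be sketched in plain Python. This is not the parser's code: the `dropout` helper is hypothetical, and it rescales the surviving features by 1 / (1 - drop), the common "inverted dropout" convention. The commit message gives the scale as 1. / 0.4, so the factor used here is an assumption and may differ from what the commit actually does.

```python
import random


def dropout(features, drop, rng=random):
    """Zero each feature with probability `drop` and rescale the survivors.

    Assumption: survivors are scaled by 1 / (1 - drop) (inverted dropout),
    which keeps the expected value of each feature unchanged.
    """
    if not 0.0 <= drop < 1.0:
        raise ValueError("drop must be in [0, 1)")
    scale = 1.0 / (1.0 - drop)
    return [0.0 if rng.random() < drop else value * scale for value in features]


rng = random.Random(0)            # seeded for repeatability
features = [1.0] * 10
dropped = dropout(features, 0.4, rng)
unchanged = dropout([1.0, 2.0], 0.0)
```

Each value in `dropped` is either `0.0` or the original value times the scale factor; with `drop=0.0` the input passes through unchanged.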
Ines Montani
7da9cefd25 Merge pull request #1022 from luvogels/master
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Ines Montani
c9e592ae6c Add newline 2017-04-27 11:15:41 +02:00
Ines Montani
5942adccc2 Add newline 2017-04-27 11:15:19 +02:00
Ines Montani
4cd9269aef Add newline 2017-04-27 11:15:04 +02:00
Ines Montani
ccf13ecc21 Add newline 2017-04-27 11:14:42 +02:00
Ines Montani
03d2b0cc05 Add newline 2017-04-27 11:14:26 +02:00
luvogels
d12a0b6431 Hooked up tokenizer tests 2017-04-26 23:21:41 +02:00
Matthew Honnibal
f0e1606d27 Increment version 2017-04-26 20:25:41 +02:00
luvogels
b331929a7e Merge branch 'master' of https://github.com/luvogels/spaCy 2017-04-26 19:15:48 +02:00
luvogels
8de59ce3b9 Added tokenizer tests 2017-04-26 19:10:18 +02:00
Matthew Honnibal
4d98511db7 Make Span hashable. Closes #1019 2017-04-26 19:01:05 +02:00
Matthew Honnibal
24c4c51f13 Try to make test999 less flakey 2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang
460094bf09 Update __init__.py 2017-04-26 18:27:55 +02:00
ines
527d51ac9a Fetch shortcuts from GitHub and improve error handling 2017-04-26 18:00:28 +02:00
Matthew Honnibal
c4be9c36fe Fix unicode header in tests 2017-04-24 10:09:01 +02:00
Matthew Honnibal
65f10b53e5 Fix test 2017-04-24 00:25:55 +02:00
Matthew Honnibal
70a43858e1 Fix flakey test 2017-04-24 00:06:30 +02:00
Matthew Honnibal
3973af2d15 Make training test less flakey 2017-04-23 22:59:34 +02:00
Matthew Honnibal
4f9657b42b Fix reporting if no dev data with train 2017-04-23 22:27:10 +02:00
Matthew Honnibal
df2ac8b843 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-23 21:25:07 +02:00
Matthew Honnibal
d0e19267e8 Create directory if missing in save_to_directory 2017-04-23 21:24:43 +02:00
ines
42305bc519 Remove unnecessary test 2017-04-23 21:21:41 +02:00
ines
012ea594d1 Add file for misc tests 2017-04-23 21:06:51 +02:00
ines
83f66947dc Rename test_download to test_cli 2017-04-23 21:06:50 +02:00
ines
401045433c Simplify compat.fix_text 2017-04-23 21:06:50 +02:00
Matthew Honnibal
e033c86a64 Increment version 2017-04-23 21:03:43 +02:00
Matthew Honnibal
d2436dc17b Update fix for Issue #999 2017-04-23 18:14:37 +02:00
Matthew Honnibal
874a3cbb07 Add test for Issue #955 2017-04-23 17:57:01 +02:00
Matthew Honnibal
60703cede5 Ensure noun chunks can't be nested. Closes #955 2017-04-23 17:56:39 +02:00
Matthew Honnibal
c9ec24b257 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-23 17:07:46 +02:00
Matthew Honnibal
5d8af40445 Add test for Issue #999 2017-04-23 17:06:30 +02:00
Matthew Honnibal
4d2a659c52 Fix json dump for Python3 2017-04-23 17:05:53 +02:00
Matthew Honnibal
040751ad17 Remove xfail on Test #910 2017-04-23 16:28:55 +02:00
ines
3a9710f356 Pass dev_scores to print_progress correctly (resolves #1008)
Only read scores attribute if command is used with dev_data, otherwise
default dev_scores to empty dict.
2017-04-23 15:58:40 +02:00
Matthew Honnibal
1b12f342e4 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-20 17:03:11 +02:00
Matthew Honnibal
4eef200bab Persist the actions within spacy.parser.cfg 2017-04-20 17:02:44 +02:00
ines
25c70b4cc5 Move fix_text to spacy.compat (see #1002) 2017-04-20 15:47:17 +02:00
Ines Montani
60b5243bee Merge pull request #1002 from oroszgy/model_cli_fix
Fixes for the `model` CLI
2017-04-20 15:41:03 +02:00
Gyorgy Orosz
4a06a2572c Using ftfy for handling broken encoded strings. 2017-04-20 13:34:51 +02:00
Ines Montani
3800b29046 Merge pull request #1001 from recognai/master
Add SPACE to es tag map
2017-04-20 12:16:34 +02:00
oeg
f0bcd0babb fix(model): Add SPACE to es tag_map. Fixing error in morphology.pyx when SP tag is missing 2017-04-20 11:36:24 +02:00
Ben Eyal
e90e8a3f10 Enable test 2017-04-20 02:25:24 +03:00