Matthew Honnibal
4a59f6358c
Fix thinc imports
2017-10-03 19:21:26 +02:00
Matthew Honnibal
e514d6aa0a
Import thinc modules more explicitly, to avoid cycles
2017-10-03 18:49:25 +02:00
Matthew Honnibal
338e1fda0e
Unbreak merge artefact
2017-10-03 09:41:05 -05:00
Matthew Honnibal
1289187279
Fix circular import
2017-10-03 09:33:21 -05:00
Matthew Honnibal
a44c4c3a5b
Add timer to evaluate
2017-10-03 09:15:35 -05:00
Matthew Honnibal
96da86b3e5
Add support for verbose flag to Language
2017-10-03 09:14:57 -05:00
Matthew Honnibal
02586a5243
Add timing to spacy evaluate command
2017-10-03 09:14:34 -05:00
ines
e49cd7aeaf
Move import into load to avoid circular imports
2017-10-03 15:22:19 +02:00
ines
b0dfa059db
Update docs link in about.py
2017-10-03 15:19:55 +02:00
Matthew Honnibal
dc3c791947
Fix history size option
2017-10-03 13:41:23 +02:00
Matthew Honnibal
278a4c17c6
Fix history features
2017-10-03 13:27:10 +02:00
Matthew Honnibal
b770f4e108
Fix embed class in history features
2017-10-03 13:26:55 +02:00
Matthew Honnibal
b50a359e11
Add support for history features in parsing models
2017-10-03 12:44:01 +02:00
Matthew Honnibal
ee41e4fea7
Support history features in stateclass
2017-10-03 12:43:48 +02:00
Matthew Honnibal
6aa6a5bc25
Add a layer type for history features
2017-10-03 12:43:09 +02:00
Matthew Honnibal
8902df44de
Fix component disabling during training
2017-10-02 21:07:23 +02:00
Matthew Honnibal
c617d288d8
Update pipeline component names in spaCy train
2017-10-02 17:20:19 +02:00
Matthew Honnibal
f942903429
Improve sentence merging in iob2json
2017-10-02 17:02:10 +02:00
Matthew Honnibal
31681d20e0
Fix concatenation in iob2json converter
2017-10-02 16:50:26 +02:00
Matthew Honnibal
4896ce3320
Remove misleading comment
2017-10-02 00:09:14 +02:00
Matthew Honnibal
d90cc917fa
Merge vectors.pyx doc strings
2017-10-01 17:05:54 -05:00
Matthew Honnibal
b2a8b9be77
Fix inconsistency of Vectors class API
2017-10-01 17:00:34 -05:00
Matthew Honnibal
e38089d598
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-01 22:10:54 +02:00
Matthew Honnibal
97c409b602
Add docstrings for spacy.vectors
2017-10-01 22:10:33 +02:00
ines
b776f48e58
Fix typo
2017-10-01 21:58:45 +02:00
Matthew Honnibal
94df115a81
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-01 14:06:23 -05:00
Matthew Honnibal
2cf0f4622f
Fix loading of models with pre-trained vectors
2017-10-01 14:05:32 -05:00
Matthew Honnibal
69c7c642c2
Add spacy evaluate
2017-10-01 14:05:04 -05:00
ines
8dbe49ecb8
Always compare lowercase package names
...
Otherwise, is_package will return False if model name contains
uppercase characters. See this issue:
https://support.prodi.gy/t/saving-a-trained-ner-model-as-a-loadable-module/46/6
2017-09-29 20:55:17 +02:00
ines
153c2589d4
Revert "Always compare lowercase package names"
...
This reverts commit 7d77dc490f.
2017-09-29 20:53:36 +02:00
ines
fd1a9225d8
Handle conversion of pipeline components correctly
...
Allow both comma and comma + whitespace as separators
2017-09-29 20:52:56 +02:00
ines
7d77dc490f
Always compare lowercase package names
...
Otherwise, is_package will return False if model name contains
uppercase characters. See this issue:
https://support.prodi.gy/t/saving-a-trained-ner-model-as-a-loadable-module/46/6
2017-09-29 20:52:28 +02:00
Matthew Honnibal
cdb2d83e16
Pass dropout in parser
2017-09-28 18:47:13 -05:00
Matthew Honnibal
158e177cae
Fix default embed size
2017-09-28 08:25:23 -05:00
Matthew Honnibal
f6330d69e6
Default embed size to 7000
2017-09-28 08:07:41 -05:00
Matthew Honnibal
ac8481a7b0
Print NER loss
2017-09-28 08:05:31 -05:00
Matthew Honnibal
542ebfa498
Improve defaults
2017-09-27 18:54:37 -05:00
Matthew Honnibal
dcb86bdc43
Default batch size to 32
2017-09-27 11:48:19 -05:00
Matthew Honnibal
1a37a2c0a0
Update training defaults
2017-09-27 11:48:07 -05:00
Matthew Honnibal
13d7a97f3a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-09-27 11:44:37 -05:00
Matthew Honnibal
66c388ee01
Remove unhelpful multitask objectives
2017-09-27 11:44:16 -05:00
Matthew Honnibal
983201a83a
Fix hard-coded vector width
2017-09-27 11:43:58 -05:00
Ines Montani
959c46eabe
Merge pull request #1365 from wannaphongcom/develop
...
Add Thai language for spaCy v2
2017-09-26 23:43:05 +02:00
Matthew Honnibal
1ef4236f8e
Merge pull request #1343 from explosion/feature/phrasematcher
...
Update PhraseMatcher for spaCy 2
2017-09-26 20:44:23 +02:00
Wannaphong Phatthiyaphaibun
7b5263ffa4
fix thai test
2017-09-26 23:54:15 +07:00
ines
1ff62eaee7
Fix option shortcut to avoid conflict
2017-09-26 17:59:34 +02:00
Wannaphong Phatthiyaphaibun
3d5046c499
fix import in th
2017-09-26 22:41:20 +07:00
ines
7fdfb78141
Add version option to cli.train
2017-09-26 17:34:52 +02:00
Wannaphong Phatthiyaphaibun
a63f790b8c
fix thai tag_map
2017-09-26 22:28:57 +07:00
Wannaphong Phatthiyaphaibun
2ea27d07f4
fix tokenizer_exceptions in thai
2017-09-26 22:14:47 +07:00
Matthew Honnibal
41cc5c4c17
Merge branch 'develop' into feature/phrasematcher
2017-09-26 09:59:17 -05:00
Matthew Honnibal
c2e2f81773
Merge pull request #1355 from explosion/feature/noshare
...
Make pipeline components independent
2017-09-26 16:58:09 +02:00
Wannaphong Phatthiyaphaibun
a2bf4cc7bf
fix newline in file
2017-09-26 21:49:43 +07:00
ines
bb5c631402
Implement like_num getter for French (via #1161)
2017-09-26 16:47:45 +02:00
ines
15479b3bae
Add comment to like_num re: future work
2017-09-26 16:43:28 +02:00
ines
adda08fe14
Implement like_num getter for Dutch (via #1177)
2017-09-26 16:39:15 +02:00
ines
5ee10379db
Port over changes from #1340
2017-09-26 16:38:08 +02:00
Wannaphong Phatthiyaphaibun
5cba67146c
add thai in spacy2
2017-09-26 21:36:27 +07:00
ines
10d291f129
Port over change from #1351
2017-09-26 16:11:41 +02:00
Matthew Honnibal
3274b46a0d
Try to fix compile error on Windows
2017-09-26 09:05:53 -05:00
Matthew Honnibal
19c7c09bf7
Fix PhraseMatcher.__contains__
2017-09-26 08:35:53 -05:00
Matthew Honnibal
d02a41a8c9
Merge remote-tracking branch 'origin/develop' into feature/phrasematcher
2017-09-26 08:32:55 -05:00
Matthew Honnibal
698fc0d016
Remove merge artefact
2017-09-26 08:31:37 -05:00
Matthew Honnibal
defb68e94f
Update feature/noshare with recent develop changes
2017-09-26 08:15:14 -05:00
Matthew Honnibal
ca28590ddd
Use dep and ent multi-task objectives for parser
2017-09-26 08:13:52 -05:00
Matthew Honnibal
9bfd585a11
Fix parameter name in .pxd file
2017-09-26 07:28:50 -05:00
Matthew Honnibal
74f08e1ad5
Update test
2017-09-26 06:45:56 -05:00
Matthew Honnibal
5aaef3e7b8
Don't link vectors in vocab deserialize
2017-09-26 06:45:47 -05:00
Matthew Honnibal
18a27c7579
Fix typo in tensorizer serialization
2017-09-26 06:45:14 -05:00
Matthew Honnibal
5056743ad5
Fix parser serialization
2017-09-26 06:44:56 -05:00
Ines Montani
7123139b2b
Add __contains__ to PhraseMatcher
2017-09-26 13:13:27 +02:00
Ines Montani
50ad50f96a
Update matcher.pyx
2017-09-26 13:11:17 +02:00
Matthew Honnibal
e34e70673f
Allow tagger models to be built with pre-defined tok2vec layer
2017-09-26 05:51:52 -05:00
Matthew Honnibal
bf917225ab
Allow multi-task objectives during training
2017-09-26 05:42:52 -05:00
Matthew Honnibal
4ae9ea7684
Remove unused argument in Language
2017-09-26 05:41:35 -05:00
ines
edf7e4881d
Add meta.json option to cli.train and add relevant properties
...
Add accuracy scores to meta.json instead of accuracy.json and replace
all relevant properties like lang, pipeline, spacy_version in existing
meta.json. If not present, also add name and version placeholders to
make it packageable.
2017-09-25 19:00:47 +02:00
ines
d2d35b63b7
Fix formatting
2017-09-25 18:37:13 +02:00
Matthew Honnibal
8eb0b7b779
Add docstrings for Pipe API
2017-09-25 16:22:07 +02:00
Matthew Honnibal
39f390dba7
Add docstrings for Pipe API
2017-09-25 16:20:49 +02:00
Matthew Honnibal
8716ffe57d
Serialize vocab last
2017-09-24 05:01:45 -05:00
Matthew Honnibal
72bbcc0871
Handle lemmatization for unknown string IDs
2017-09-24 05:01:31 -05:00
Matthew Honnibal
204b58c864
Fix evaluation during training
2017-09-24 05:01:03 -05:00
Matthew Honnibal
dc3a623d00
Remove unused update_shared argument
2017-09-24 05:00:37 -05:00
Matthew Honnibal
63bd87508d
Don't use iterated convolutions
2017-09-23 04:39:17 -05:00
Matthew Honnibal
5a7fd0fd36
Fix vector linkage
2017-09-22 20:11:52 -05:00
Matthew Honnibal
4348c479fc
Merge pre-trained vectors and noshare patches
2017-09-22 20:07:28 -05:00
Matthew Honnibal
7dc61b3f43
Whitespace
2017-09-22 20:00:50 -05:00
Matthew Honnibal
e93d43a43a
Fix training with preset vectors
2017-09-22 20:00:40 -05:00
Matthew Honnibal
0795857dcb
Fix beam parsing
2017-09-23 02:59:53 +02:00
Matthew Honnibal
4bd6a12b1f
Fix Tok2Vec
2017-09-23 02:58:54 +02:00
Matthew Honnibal
386c1a5bd8
Fix tagger training
2017-09-23 02:58:06 +02:00
Matthew Honnibal
a2357cce3f
Set random seed in train script
2017-09-23 02:57:31 +02:00
Matthew Honnibal
05596159bf
Fix serialization when pre-trained vectors
2017-09-22 15:33:27 -05:00
Matthew Honnibal
980fb6e854
Refactor Tok2Vec
2017-09-22 09:38:36 -05:00
Matthew Honnibal
d9124f1aa3
Add link_vectors_to_models function
2017-09-22 09:38:22 -05:00
Matthew Honnibal
a186596307
Add 'reapply' combinator, for iterated CNN
2017-09-22 09:37:03 -05:00
Matthew Honnibal
40a4873b70
Fix serialization of model options
2017-09-21 13:07:26 -05:00
Matthew Honnibal
0a9016cade
Fix serialization during training
2017-09-21 13:06:45 -05:00
Matthew Honnibal
20193371f5
Don't share CNN, to reduce complexities
2017-09-21 14:59:48 +02:00
Matthew Honnibal
1d73dec8b1
Refactor train script
2017-09-20 19:17:10 -05:00
Matthew Honnibal
ffda38356a
Add util function to enable GPU
2017-09-20 19:16:35 -05:00
Matthew Honnibal
24e85c2048
Pass values for CNN maxout pieces option
2017-09-20 19:16:12 -05:00
Matthew Honnibal
b832f89ff8
Add resume_training function
2017-09-20 19:15:20 -05:00
Matthew Honnibal
f5144f04be
Add argument for CNN maxout pieces
2017-09-20 19:14:41 -05:00
Matthew Honnibal
842e21de9f
Fix int type error for Python 2
2017-09-20 23:55:30 +02:00
Matthew Honnibal
0c93c73e49
Add __reduce__ method for PhraseMatcher
2017-09-20 22:26:40 +02:00
Matthew Honnibal
cc408fc189
Make PhraseMatcher API like Matcher API
2017-09-20 22:20:35 +02:00
Matthew Honnibal
43ad250dd5
Update matcher tests
2017-09-20 21:54:49 +02:00
Matthew Honnibal
828cc91545
Fix PhraseMatcher for spaCy 2
2017-09-20 21:54:31 +02:00
Matthew Honnibal
78301b2d29
Avoid comparison to None in Tok2Vec
2017-09-20 00:19:34 +02:00
Matthew Honnibal
b36a38f63d
Fix serialization of pretrained_dims property
2017-09-19 23:42:27 +02:00
Matthew Honnibal
2489dcaccf
Fix serialization of parser
2017-09-19 23:42:12 +02:00
Matthew Honnibal
40837b275d
Fix tensorizer with pretrained vectors
2017-09-18 18:05:38 -05:00
Matthew Honnibal
a0c4b33d03
Support resuming a model during spacy train
2017-09-18 18:04:47 -05:00
Matthew Honnibal
c858927271
Copy vectors to GPU on begin training
2017-09-18 18:04:16 -05:00
Matthew Honnibal
3fa76c17d1
Refactor Tok2Vec
2017-09-18 15:00:05 -05:00
Matthew Honnibal
217e7891cd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-09-18 11:36:21 -05:00
Matthew Honnibal
7b3f391f80
Try dropping the Affine layer, conditionally
2017-09-18 11:35:59 -05:00
ines
2480f8f521
Add missing return in Doc.from_disk() (closes #1330)
2017-09-18 15:32:00 +02:00
Matthew Honnibal
2148ae605b
Don't use iterated convolutions
2017-09-17 17:36:04 -05:00
Matthew Honnibal
c013e5996f
Fix parser test
2017-09-17 13:13:20 -05:00
Matthew Honnibal
8f42f8d305
Remove unused 'preprocess' argument in Tok2Vec
2017-09-17 12:30:16 -05:00
Matthew Honnibal
039d609362
Remove hard-coded default vectors width
2017-09-17 12:29:39 -05:00
Matthew Honnibal
4f38a67a89
Make width default to 0 in vectors.pyx
2017-09-17 12:29:14 -05:00
Matthew Honnibal
16122f566e
Fix cpdef enum in attrs.pyx
2017-09-17 12:28:53 -05:00
Matthew Honnibal
b159e0eb50
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-09-17 05:47:50 -05:00
Matthew Honnibal
2b0efc77ae
Fix wiring of pre-trained vectors in parser loading
2017-09-17 05:47:34 -05:00
Matthew Honnibal
31c2e91c35
Fix wiring of pre-trained vectors in parser loading
2017-09-17 05:46:55 -05:00
Matthew Honnibal
8f913a74ca
Fix defaults and args to build_tagger_model
2017-09-17 05:46:36 -05:00
Matthew Honnibal
c003c561c3
Revert NER action loading change, for model compatibility
2017-09-17 05:46:03 -05:00
Matthew Honnibal
43210abacc
Resolve fine-tuning conflict
2017-09-17 05:30:04 -05:00
ines
ece30c28a8
Don't split hyphenated words in German
...
This way, the tokenizer matches the tokenization in German treebanks
2017-09-16 20:40:15 +02:00
ines
68f66aebf8
Use pkg_resources instead of pip for is_package (resolves #1293)
2017-09-16 20:27:59 +02:00
Matthew Honnibal
5ff2491f24
Pass option for pre-trained vectors in parser
2017-09-16 12:47:21 -05:00
Matthew Honnibal
8665a77f48
Fix feature error in NER
2017-09-16 12:46:57 -05:00
Matthew Honnibal
e37a50a436
Pass documents to tensorizer, not 'features'
2017-09-16 12:46:36 -05:00
Matthew Honnibal
84e637e2e6
Pass option for pretrained vectors in pipeline
2017-09-16 12:46:02 -05:00
Matthew Honnibal
2a93404da6
Support optional pre-trained vectors in tensorizer model
2017-09-16 12:45:37 -05:00
Matthew Honnibal
e0a2aa9289
Support having word vectors data on GPU
2017-09-16 12:45:09 -05:00
Matthew Honnibal
ebf8942564
Fix test for Python3
2017-09-16 16:22:38 +02:00
Matthew Honnibal
8c945310fb
Excuse emoji failure on narrow unicode builds
2017-09-16 16:21:13 +02:00
Matthew Honnibal
11f2a05ede
Fix code explosion from long enum in Python 3, Cython 0.24+
2017-09-16 12:20:04 +02:00
Matthew Honnibal
3fa5b40b5c
Add test for hash consistency
2017-09-16 11:21:35 +02:00
Matthew Honnibal
f730d07e4e
Fix prange error for Windows
2017-09-16 00:25:33 +02:00
Matthew Honnibal
4b2065430e
Merge branch 'feature/parser-history' into develop
2017-09-15 10:42:20 +02:00
Matthew Honnibal
2f08489694
Remove AddHistory layer -- didn't work as planned
2017-09-15 10:41:40 +02:00
Matthew Honnibal
8b481e0465
Remove redundant brackets
2017-09-15 10:38:08 +02:00
Matthew Honnibal
d84607f6bb
Vectorize update in AddHistory
2017-09-14 20:34:40 +02:00
Ines Montani
bd3da3d6fb
Port over change from #1323 and tidy up
2017-09-14 19:23:13 +02:00
Matthew Honnibal
18347ab69c
Implement AddHistory layer wrapper
2017-09-14 19:07:35 +02:00
Matthew Honnibal
d4ca6cef9e
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-09-14 17:00:07 +02:00
Matthew Honnibal
8c503487af
Fix lookup of missing NER actions
2017-09-14 16:59:45 +02:00
Matthew Honnibal
664c5af745
Revert padding in parser
2017-09-14 16:59:25 +02:00
Matthew Honnibal
8496d76224
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-09-14 09:21:20 -05:00
Matthew Honnibal
d1518027a9
Increment version
2017-09-14 16:18:46 +02:00
Matthew Honnibal
70da88a3a7
Update comment on Language.begin_training
2017-09-14 16:18:30 +02:00
Matthew Honnibal
c6395b057a
Improve parser feature extraction, for missing values
2017-09-14 16:18:02 +02:00
Matthew Honnibal
daf869ab3b
Fix add_action for NER, so labelled 'O' actions aren't added
2017-09-14 16:16:41 +02:00
Matthew Honnibal
9cb2aef587
Remove print statement
2017-09-14 13:38:28 +02:00
Matthew Honnibal
ba23d63c35
Fix minibatch function, for fixed batch size
2017-09-14 13:37:41 +02:00
Jim O'Regan
7de709483b
missed adding here
2017-09-11 10:51:21 +01:00
Jim O'Regan
b1b6123867
add ga_tokenizer
2017-09-11 10:31:41 +01:00
Jim O'Regan
9dfd301962
rearrange
2017-09-11 10:14:18 +01:00
Jim O'Regan
187be6d372
copy/paste error
2017-09-11 09:33:17 +01:00
Jim O'Regan
c283e9edfe
first stab at test
2017-09-11 08:57:48 +01:00
Jim O'Regan
1ee75ae337
Merge remote-tracking branch 'origin/develop' into develop-irish
2017-09-11 08:40:11 +01:00
Matthew Honnibal
456bb8a74c
Unxfail and close #1305
2017-09-06 19:14:17 +02:00
Matthew Honnibal
99e44fbdbb
Update regression test
2017-09-06 19:13:51 +02:00
Matthew Honnibal
5c3ff06924
Fix lemmatizer rules
2017-09-06 19:13:24 +02:00
Matthew Honnibal
dd9cab0faf
Fix type-check for int/long
2017-09-06 19:03:05 +02:00
Matthew Honnibal
497a9308a8
Xfail new lemmatizer test
2017-09-06 18:41:22 +02:00
Matthew Honnibal
dcbf866970
Merge parser changes
2017-09-06 18:41:05 +02:00
Matthew Honnibal
5384fff5ce
Add test for 1305: Incorrect lemmatization of VBZ for English
2017-09-06 18:40:18 +02:00
Matthew Honnibal
24ff6b0ad9
Fix parsing and tok2vec models
2017-09-06 05:50:58 -05:00
Matthew Honnibal
1b65115bc2
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-09-04 20:02:53 -05:00
Matthew Honnibal
33fa91feb7
Restore correctness of parser model
2017-09-04 21:19:30 +02:00
Matthew Honnibal
e88a42e460
Increment version
2017-09-04 21:14:39 +02:00
Matthew Honnibal
9d65d67985
Preserve model compatibility in parser, for now
2017-09-04 16:46:22 +02:00
Matthew Honnibal
d5fbf27335
Fix test
2017-09-04 16:45:11 +02:00
Matthew Honnibal
7fdafcc4c4
Fix config loading in tagger
2017-09-04 16:38:49 +02:00
Matthew Honnibal
058372d120
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-09-04 16:27:53 +02:00
Matthew Honnibal
16e25ce3b5
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-09-04 09:26:53 -05:00
Matthew Honnibal
9f512e657a
Fix drop_layer calculation
2017-09-04 09:26:38 -05:00
Matthew Honnibal
cb4839033c
Fix loader for EN tests
2017-09-04 15:19:18 +02:00
Matthew Honnibal
382ce566eb
Fix deserialization bug
2017-09-04 15:19:01 +02:00
Matthew Honnibal
bfddf50081
Fix #1296: Incorrect lemmatization of base form verbs
2017-09-04 15:18:41 +02:00
Matthew Honnibal
b29e6bff46
Improve lemmatization rule for am|VBP
2017-09-04 15:18:10 +02:00
Matthew Honnibal
644d6c9e1a
Improve lemmatization tests, re #1296
2017-09-04 15:17:44 +02:00
Matthew Honnibal
3cf3fa1704
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-09-02 12:46:11 -05:00
Matthew Honnibal
e920885676
Fix pickle during train
2017-09-02 12:46:01 -05:00
Matthew Honnibal
c0eaba8b28
Fix low-data textcat
2017-09-02 15:17:32 +02:00
Matthew Honnibal
9e378bdac5
Fix textcat serialization
2017-09-02 15:17:20 +02:00
Matthew Honnibal
e3ea6ee02b
Increment version
2017-09-02 15:17:01 +02:00
Matthew Honnibal
a3b69bcb3d
Add low_data mode in textcat
2017-09-02 14:56:30 +02:00
Matthew Honnibal
ead78c7b9b
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-09-02 12:55:25 +02:00
Matthew Honnibal
5e6a9e7dcc
Add rule-based SBD
2017-09-02 12:53:38 +02:00
Matthew Honnibal
a824cf8f9a
Adjust text classification model
2017-09-02 11:41:00 +02:00
Matthew Honnibal
ac040b99bb
Add support for pre-trained vectors in text classifier
2017-09-01 16:39:55 +02:00
Matthew Honnibal
7742a6d559
Add GloVe vectors reader
2017-09-01 16:39:22 +02:00
Matthew Honnibal
789e1a3980
Use 13 parser features, not 8
2017-08-31 14:13:00 -05:00
Matthew Honnibal
30e35d9666
Fix syntax error
2017-08-30 17:35:39 -05:00
Matthew Honnibal
4ceebde523
Fix gradient bug in parser
2017-08-30 17:32:56 -05:00
ines
173089a45a
Add more validation for model meta
2017-08-29 11:21:46 +02:00
Matthew Honnibal
2e28982e28
Merge pull request #1288 from geovedi/indonesian
...
Indonesian language support
2017-08-26 21:31:13 +02:00
ines
7e04b7f89c
Fix info text on pipeline in package cli
2017-08-26 18:30:59 +02:00
ines
40afa13a8a
Increment version
2017-08-26 18:30:49 +02:00
Matthew Honnibal
876f38c548
Merge pull request #1279 from oroszgy/model_cli_v2
...
Added vector loading to model cli
2017-08-26 15:57:50 +02:00
Matthew Honnibal
cfc055734e
Split % in units, for compatibility with corpus
2017-08-25 20:03:37 -05:00
Matthew Honnibal
4bb6bc3f9e
Add support for sent_start to GoldParse
2017-08-25 20:03:14 -05:00
Matthew Honnibal
44589fb38c
Fix Break oracle
2017-08-25 19:50:55 -05:00
Matthew Honnibal
6d4e8e14ca
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-25 12:37:16 -05:00
Matthew Honnibal
4ce5531389
Use layer norm instead of batch norm
2017-08-25 12:37:10 -05:00
Matthew Honnibal
20dd66ddc2
Constrain sentence boundaries to IS_PUNCT and IS_SPACE tokens
2017-08-25 19:35:47 +02:00
Jim Geovedi
58d8078971
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-08-25 09:21:49 +08:00
Matthew Honnibal
6ceb0f0518
Allow Lexeme.rank to be set
2017-08-24 21:43:00 +02:00
Matthew Honnibal
44a1fa80d3
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-23 13:02:16 +02:00
ines
bb1abbeba5
Only link model if download was successful
2017-08-23 12:36:31 +02:00
Matthew Honnibal
bb2541ffd3
Fix PROB attr for OOV words
2017-08-23 12:11:52 +02:00
Matthew Honnibal
1c5c256e58
Fix fine_tune when optimizer is None
2017-08-23 10:51:33 +02:00
Matthew Honnibal
9c580ad28a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-22 17:02:04 -05:00
Matthew Honnibal
a4633fff6f
Restore use of batch norm in model
2017-08-22 17:01:58 -05:00
Matthew Honnibal
03b5b9727a
Fix Doc.vector for empty doc objects
2017-08-22 19:52:19 +02:00
Matthew Honnibal
0551b7b03a
Fix doc.vector
2017-08-22 19:46:52 +02:00
Matthew Honnibal
83f8e98450
Fix retrieval of OOV vectors
2017-08-22 19:46:35 +02:00
Matthew Honnibal
df2745eb08
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-22 19:00:43 +02:00
Matthew Honnibal
5b329acbf2
Fix vectors_length property in vocab
2017-08-22 19:00:27 +02:00
Matthew Honnibal
1fe605dfe5
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-21 19:18:31 -05:00
Matthew Honnibal
18b64e79ec
Fix fine tuning
2017-08-21 19:18:26 -05:00
Matthew Honnibal
682346dd66
Restore optimized hidden_depth=0 for parser
2017-08-21 19:18:04 -05:00
Matthew Honnibal
a21d8f3f0b
Add predict paths to _ml models
2017-08-21 23:23:45 +02:00
Matthew Honnibal
cec76801dc
Add profile command to CLI
2017-08-21 23:23:05 +02:00
Matthew Honnibal
7be5f30f17
Add profile function
2017-08-21 23:22:49 +02:00
ines
a68dc891ea
Port over changes from #1281
2017-08-21 23:19:18 +02:00
Matthew Honnibal
5e50a65252
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-21 14:15:46 -05:00
Matthew Honnibal
80acbc5f1f
Fix fine-tune weight mixture
2017-08-21 14:15:29 -05:00
ines
d15775c3ad
Fix typos and commands in alpha docs
2017-08-21 13:40:11 +02:00
Gyorgy Orosz
b3576bfc86
Added vector loading to model cli
2017-08-20 23:16:12 +02:00
Matthew Honnibal
c10f63bf10
Initialize fine tuning to 0.5
2017-08-20 15:59:48 -05:00
Matthew Honnibal
62878e50db
Fix misalignment caused by filtering inputs at wrong point in parser
2017-08-20 15:59:28 -05:00
Matthew Honnibal
78a5f842e9
Fix update when update_shared=False
2017-08-20 15:58:34 -05:00
Matthew Honnibal
7a6edeea68
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-20 12:55:39 -05:00
Matthew Honnibal
f2f9229964
Fix name of update_shared flag
2017-08-20 18:19:06 +02:00
Matthew Honnibal
8a59718fd6
Fix fine-tuning
2017-08-20 18:17:35 +02:00
Matthew Honnibal
80a5146ec2
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-20 11:07:08 -05:00
Matthew Honnibal
84bb543e4d
Add gold_preproc flag to cli/train
2017-08-20 11:07:00 -05:00
Matthew Honnibal
3fe0d76e6d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-20 14:50:01 +02:00
Matthew Honnibal
c1d3ff517a
Track loss in tagger
2017-08-20 14:42:23 +02:00
Matthew Honnibal
8875590081
Add optimizer in Language.update if sgd=None
2017-08-20 14:42:07 +02:00
Matthew Honnibal
84b7ed49e4
Ensure updates aren't made if no gold available
2017-08-20 14:41:38 +02:00
Ines Montani
c2bbd393af
Merge pull request #1276 from oroszgy/model_cli_v2
...
Ported model cli from v1
2017-08-20 11:52:59 +02:00
Jim Geovedi
f77443ab68
reworked
2017-08-20 13:43:21 +07:00
Jim Geovedi
fbc62a09c7
added {pre,suf,in}fix tests
2017-08-20 13:43:00 +07:00
Jim Geovedi
713d7c0aa0
added indonesian lang test
2017-08-20 12:17:14 +07:00
Jim Geovedi
b7d83f37c8
indonesian abbr.
2017-08-20 12:16:50 +07:00
Jim Geovedi
7193c47f0b
direct lookup
2017-08-20 11:57:52 +07:00
Jim Geovedi
fdf802d505
added examples
2017-08-20 11:57:10 +07:00
Jim Geovedi
fa544e6c9a
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-08-20 11:49:40 +07:00
Matthew Honnibal
42fa84075f
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-19 22:42:50 +02:00
Matthew Honnibal
aefef6fd28
Prevent strings from being lost during from_disk and from_bytes
2017-08-19 22:42:17 +02:00
ines
281e7e58b3
Don't escape forward slashes on ujson.dumps
2017-08-19 22:32:16 +02:00
ines
2d126a00ae
Fix typo
2017-08-19 22:32:07 +02:00
Matthew Honnibal
41c2218c53
Fix test for vectors
2017-08-19 22:09:12 +02:00
Matthew Honnibal
b8e1603cc4
Fix load fail for missing vectors
2017-08-19 22:07:00 +02:00
Matthew Honnibal
a3c51a0355
Fix creation of pipeline
2017-08-19 21:58:57 +02:00
Gyorgy Orosz
e5344b83a3
Ported model cli from v1
2017-08-19 21:45:23 +02:00
Matthew Honnibal
6a94648373
Fix serialization
2017-08-19 21:27:35 +02:00
Matthew Honnibal
1157294434
Improve vector handling
2017-08-19 20:35:33 +02:00
Matthew Honnibal
ef87562741
Restore vectors test utils
2017-08-19 20:35:16 +02:00
Matthew Honnibal
1391f9da37
Restore vectors tests
2017-08-19 20:34:58 +02:00
Matthew Honnibal
8cfeeb4884
Increment version
2017-08-19 19:52:58 +02:00
Matthew Honnibal
93fb8b64e9
Fix vector loading
2017-08-19 19:52:25 +02:00
Matthew Honnibal
49a615e7d9
Create Vectors object in Vocab
2017-08-19 18:50:16 +02:00
Matthew Honnibal
3d049af563
Improve vectors to/from disk
2017-08-19 18:42:11 +02:00
Matthew Honnibal
d55d6e1cfa
Fix comparison of Token from different docs. Closes #1257
2017-08-19 16:39:32 +02:00
Matthew Honnibal
9b6a5df15e
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-19 16:24:57 +02:00
Matthew Honnibal
4fda02c7e6
Add test for new Span.to_array method
2017-08-19 16:24:38 +02:00
Matthew Honnibal
dea229c634
Fix Span.to_array method
2017-08-19 16:24:28 +02:00
Matthew Honnibal
c606b4a42c
Add test for Doc.char_span
2017-08-19 16:18:23 +02:00
Matthew Honnibal
8b7ac77c23
Allow span label to be string in Doc.char_span
2017-08-19 16:18:09 +02:00
Matthew Honnibal
7c47e38c12
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-19 09:03:15 -05:00
Matthew Honnibal
ab28f911b4
Fix parser learning rates
2017-08-19 09:02:57 -05:00
ines
1fe5e1a4d1
Add language example sentences (see #1107)
...
da, de, en, es, fr, he, it, nb, pl, pt, sv
2017-08-19 12:22:29 +02:00
Matthew Honnibal
97aabafb5f
Document as_tuples keyword arg of Language.pipe
2017-08-19 12:21:33 +02:00
Matthew Honnibal
80236116a6
Add Doc.char_span method, to get a span by character offset
2017-08-19 12:21:09 +02:00
Matthew Honnibal
482bba1722
Add Span.to_array method
2017-08-19 12:20:45 +02:00
Matthew Honnibal
19c495f451
Fix vectors deserialization
2017-08-19 04:33:03 +02:00
Matthew Honnibal
42d47c1e5c
Fix tagger serialization
2017-08-19 04:16:32 +02:00
Matthew Honnibal
2da96a0ec7
Fix beam test
2017-08-19 04:15:46 +02:00
Matthew Honnibal
a7309a217d
Update tagger serialization
2017-08-18 23:12:05 +02:00
Matthew Honnibal
bae59bf92f
Remove BiLSTM import
2017-08-18 22:46:59 +02:00
Matthew Honnibal
c307a0ffb8
Restore patches from nn-beam-parser to spacy/syntax
2017-08-18 22:38:59 +02:00
Matthew Honnibal
fe90dfc390
Restore changes from nn-beam-parser to spacy/_ml
2017-08-18 22:38:28 +02:00
Matthew Honnibal
de7e8703e3
Restore tests for beam parser
2017-08-18 22:27:42 +02:00
Matthew Honnibal
11c31d285c
Restore changes from nn-beam-parser
2017-08-18 22:26:12 +02:00
Matthew Honnibal
ce321b0322
Restore changes from nn-beam-parser to spacy/_ml
2017-08-18 22:24:46 +02:00
Matthew Honnibal
5f81d700ff
Restore patches from nn-beam-parser to spacy/syntax
2017-08-18 22:23:03 +02:00
Matthew Honnibal
ec482580b5
Restore changes to pipeline.pyx from nn-beam-parser branch
2017-08-18 22:02:35 +02:00
Matthew Honnibal
931509d96a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-18 21:57:15 +02:00
Matthew Honnibal
ed95009b5c
Fix data loading on Python 2
2017-08-18 21:57:06 +02:00
Matthew Honnibal
baf36d0588
Add compat function for importlib.util
2017-08-18 21:56:47 +02:00
Matthew Honnibal
263366729e
Don't import BiLSTM
2017-08-18 21:56:31 +02:00
Matthew Honnibal
28162290b3
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-18 14:55:40 -05:00
Matthew Honnibal
85794c1167
Restore state of _ml.py
2017-08-18 14:55:23 -05:00
Matthew Honnibal
d456d2efe1
Fix conflicts in nn_parser
2017-08-18 20:55:58 +02:00
Matthew Honnibal
1cec1efca7
Fix merge conflicts in nn_parser from beam stuff
2017-08-18 20:50:49 +02:00
Matthew Honnibal
69bcacdc09
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-18 20:47:13 +02:00
Matthew Honnibal
2993b54fff
Load vectors in vocab
2017-08-18 20:46:56 +02:00
Matthew Honnibal
a1ec41298c
Restore CFile loader
2017-08-18 20:46:16 +02:00
Matthew Honnibal
ed4fb991dc
Work on vectors loading
2017-08-18 20:45:48 +02:00
Matthew Honnibal
426f84937f
Resolve conflicts when merging new beam parsing stuff
2017-08-18 13:38:32 -05:00
Matthew Honnibal
5181e8bedb
Fix merge conflict in _ml
2017-08-18 13:35:51 -05:00
Matthew Honnibal
f75420ae79
Unhack beam parsing, moving it under options instead of global flags
2017-08-18 13:31:15 -05:00
Jim Geovedi
7ae45bffcf
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-08-18 10:14:46 +07:00
Dan O'Huiginn
ebf5a3ce59
Allow loading with python < 3.6
...
Don't rely on recent python features to load models
Fixes Issue #1271
2017-08-17 15:15:47 +00:00
Matthew Honnibal
0209a06b4e
Update beam parser
2017-08-16 18:25:49 -05:00
Matthew Honnibal
4b1e7bd6d8
Improve tensorizer model
2017-08-16 18:25:20 -05:00
Matthew Honnibal
a6d8d7c82e
Add is_gold_parse method to transition system
2017-08-16 18:24:09 -05:00
Matthew Honnibal
3533bb61cb
Add option of 8 feature parse state
2017-08-16 18:23:27 -05:00
Matthew Honnibal
1cb2f15d65
Clean up unused predict_confidences function
2017-08-16 18:22:26 -05:00
Matthew Honnibal
210f6d5175
Fix efficiency error in batch parse
2017-08-15 03:19:03 -05:00
Matthew Honnibal
23537a011d
Tweaks to beam parser
2017-08-15 03:15:28 -05:00
Matthew Honnibal
500e92553d
Fix memory error when copying scores in beam
2017-08-15 03:15:04 -05:00
Matthew Honnibal
a8e4064dd8
Fix tensor gradient in parser
2017-08-15 03:14:36 -05:00
Matthew Honnibal
e420e0366c
Remove use of hash function in beam parser
2017-08-15 03:13:57 -05:00
Matthew Honnibal
6259490347
Fix mixture weights in fine_tune
2017-08-14 17:55:18 -05:00
Matthew Honnibal
335fa8b05c
Fix gradient in fine_tune
2017-08-14 14:55:47 -05:00
Matthew Honnibal
d9f82f6b50
Increment version
2017-08-14 14:55:26 +02:00
ines
a29f132ffd
Change python -m spacy to spacy
...
Reflects latest change to entry point or auto-alias
2017-08-14 13:04:48 +02:00
ines
65bf80302c
Increment version
2017-08-14 13:04:30 +02:00
Matthew Honnibal
52c180ecf5
Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
...
This reverts commit ea8de11ad5, reversing
changes made to 08e443e083.
2017-08-14 13:00:23 +02:00
Matthew Honnibal
dbbfe595a5
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-14 12:09:28 +02:00
Matthew Honnibal
ac6c25f762
Check SGD is not None in update
2017-08-14 12:09:18 +02:00
Matthew Honnibal
0ae045256d
Fix beam training
2017-08-13 18:02:05 -05:00
Matthew Honnibal
6a42cc16ff
Fix beam parser, improve efficiency of non-beam
2017-08-13 12:37:26 +02:00
Matthew Honnibal
4363b4aa4a
Fix redundant tokvecs updates during update
2017-08-13 12:36:55 +02:00
Matthew Honnibal
12de263813
Bug fixes to beam parsing. Learns small sample
2017-08-13 09:33:39 +02:00
Matthew Honnibal
4ae0d5e1e6
Set defaults for convert command
2017-08-13 09:03:38 +02:00
Matthew Honnibal
92ebab6073
Update beam-update tests
2017-08-13 08:56:02 +02:00
Matthew Honnibal
17874fe491
Disable beam parsing
2017-08-12 19:35:40 -05:00
Matthew Honnibal
69f21867b5
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-12 19:25:56 -05:00
Matthew Honnibal
3e30712b62
Improve defaults
2017-08-12 19:24:17 -05:00
Matthew Honnibal
28e930aae0
Fixes for beam parsing. Not working
2017-08-12 19:22:52 -05:00
Matthew Honnibal
c96d769836
Fix beam parse. Not sure if working
2017-08-12 18:21:54 -05:00
Matthew Honnibal
24b45b45c6
Add test for beam update
2017-08-12 17:15:28 -05:00
Matthew Honnibal
4638f4b869
Fix beam update
2017-08-12 17:15:16 -05:00
Matthew Honnibal
d4308d2363
Initialize State offset to 0
2017-08-12 17:14:39 -05:00
Matthew Honnibal
b353e4d843
Work on parser beam training
2017-08-12 14:47:45 -05:00
ines
d4f2baf7dd
Add create_meta option to package command
...
Re-create meta.json in model directory, even if it exists. Especially
useful when updating existing spaCy models or training with Prodigy.
Ensures user won't end up with multiple "en_core_web_sm" models, and
offers easy way to change the model's name and settings without having
to edit the meta.json file.
2017-08-12 21:44:18 +02:00
Matthew Honnibal
4ab0c8c8e9
Try different drop_layer structure in Tok2Vec
2017-08-12 08:56:57 -05:00
Matthew Honnibal
cd5ecedf6a
Try drop_layer in parser
2017-08-12 08:56:33 -05:00
Matthew Honnibal
8870d491f1
Remove redundant pickling during training
2017-08-12 08:55:53 -05:00
Matthew Honnibal
680043ebca
Improve efficiency of tagger.set_annotations for GPU
2017-08-12 08:54:21 -05:00
Matthew Honnibal
ebe0f7f641
Pass embed size correctly in tagger, and cache embeddings for efficiency
2017-08-12 05:45:20 -05:00
Matthew Honnibal
1a59db1c86
Fix dropout and learn rate in parser
2017-08-12 05:44:39 -05:00
Matthew Honnibal
d01dc3704a
Adjust parser model
2017-08-09 20:06:33 -05:00
Matthew Honnibal
f37528ef58
Pass embed size for parser fine-tune. Use SELU
2017-08-09 17:52:53 -05:00
Matthew Honnibal
f93f2bed58
Revert use of layer normalization in Tok2Vec
2017-08-09 17:47:03 -05:00
Matthew Honnibal
20944dd8aa
Fix conflict in parser fine-tuning
2017-08-09 16:43:05 -05:00
Matthew Honnibal
ac2de6dced
Switch to ReLU layers in Tok2Vec
2017-08-09 16:41:25 -05:00
Matthew Honnibal
bbace204be
Gate parser fine-tuning behind feature flag
2017-08-09 16:40:42 -05:00
Matthew Honnibal
a59a1deac4
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-09 16:23:19 -05:00
Matthew Honnibal
bcce6f7de0
Fix parser fine tuning
2017-08-09 16:23:12 -05:00
ines
28e2fec23b
Fix autolinking failure on fresh model install (resolves #1138)
...
On fresh install via subprocess, pip.get_installed_distributions()
won't show new model, so is_package check in link command fails.
Solution for now is to get model package path explicitly and pass it to
link command.
2017-08-09 11:52:38 +02:00
Jim Geovedi
c62b49b7cc
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-08-09 09:17:46 +07:00
Matthew Honnibal
dbdd8afc4b
Fix parser fine-tune training
2017-08-08 15:46:07 -05:00
Matthew Honnibal
88bf1cf87c
Update parser for fine tuning
2017-08-08 15:34:17 -05:00
Jim O'Regan
c069b4acb5
fix in UD submitted; map either way
2017-08-08 19:22:14 +01:00
Jim O'Regan
76c22dec4d
UD Irish tag mapping
2017-08-08 19:04:52 +01:00
Jim O'Regan
95921d7d4c
Merge branch 'develop' into develop-irish
2017-08-08 17:21:27 +01:00
Matthew Honnibal
5d837c3776
Add mix weights on fine_tune
2017-08-07 06:32:59 -05:00
Matthew Honnibal
42bd26f6f3
Give parser its own tok2vec weights
2017-08-06 18:33:46 +02:00
Matthew Honnibal
3ed203de25
Use LayerNorm and SELU in Tok2Vec
2017-08-06 18:33:18 +02:00
Matthew Honnibal
78498a072d
Return Transition for missing actions in lookup_action
2017-08-06 14:16:36 +02:00
Matthew Honnibal
4a5cc89138
Fix tagger 'fine_tune', to keep private CNN weights
2017-08-06 14:15:48 +02:00
Matthew Honnibal
3cb8f06881
Fix NeuralLabeller
2017-08-06 14:15:14 +02:00
Matthew Honnibal
0acce0521b
Fix Language.update for pipeline
2017-08-06 14:13:03 +02:00
Matthew Honnibal
bfffdeabb2
Fix parser batch-size bug introduced during cleanup
2017-08-06 14:10:48 +02:00
Matthew Honnibal
0eec7c9e9b
Fix Language.evaluate
2017-08-06 02:18:31 +02:00
Matthew Honnibal
0a566dc320
Add update_tensors flag to Language.update. Experimental, re #1182
2017-08-06 02:18:12 +02:00
Matthew Honnibal
cc19ea0e7c
Add update_tensors flag to Language.update. Experimental, re #1182
2017-08-06 02:17:10 +02:00
Matthew Honnibal
4cfb7a54e7
Fix tagger
2017-08-06 01:53:31 +02:00
Matthew Honnibal
e9ab800e15
Fix tagging model
2017-08-06 01:50:08 +02:00
Matthew Honnibal
468c138ab3
WIP: Add fine-tuning logic to tagger model, re #1182
2017-08-06 01:13:23 +02:00
Matthew Honnibal
7f876a7a82
Clean up some unused code in parser
2017-08-06 00:00:21 +02:00
Matthew Honnibal
ae1ad81069
Increment version
2017-08-05 18:09:32 +02:00
Jim Geovedi
cc4772cac2
reworks
2017-08-03 13:08:38 +07:00
Jim Geovedi
37f19f5ed2
added more currencies based on corpus data
2017-08-03 13:03:25 +07:00
Jim Geovedi
30fd068d42
hashtag prefix should be handled somewhere else
2017-08-03 13:03:02 +07:00
Jim Geovedi
4705ae19ba
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-08-03 12:40:19 +07:00
Jim Geovedi
ba07e23c87
added USD in currency rules
2017-08-02 22:42:47 +07:00
Matthew Honnibal
5c323daa1a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-08-01 22:10:37 +02:00
Matthew Honnibal
2e00361522
Fix update when 0 docs
2017-08-01 22:10:17 +02:00
Matthew Honnibal
8fce187de4
Fix ArcEager for missing values
2017-08-01 22:10:05 +02:00
ines
78e262140f
Add workaround for displaCy server on Python 2/3 (resolves #1227)
...
Make sure status and headers are bytes on Python 2 and strings on
Python 3
2017-08-01 01:11:35 +02:00
Jim Geovedi
2572a9ddf0
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-07-30 21:24:16 +07:00
Jim Geovedi
bb08d696f9
added hashtag rule and fixed currency rules
2017-07-30 21:23:28 +07:00
Jim Geovedi
e9af79a803
added u-\d+ rules (sports team)
2017-07-30 21:23:01 +07:00
Matthew Honnibal
27abc56e98
Add method to get beam entities
2017-07-29 21:59:02 +02:00
Matthew Honnibal
ec63f4fe7b
Add option to control how missing entities are handled when getting NER tags
2017-07-29 21:58:37 +02:00
Jim Geovedi
e5adc26c72
simplified rules
2017-07-29 18:21:32 +07:00
Jim Geovedi
783f7d8b86
added test set for Indonesian language
2017-07-29 18:21:07 +07:00
Jim Geovedi
4d04898dea
updated regexp
2017-07-29 17:44:57 +07:00
Jim Geovedi
7d96d477ea
updated like_num
2017-07-29 17:44:46 +07:00
Jim Geovedi
3cca4ed798
added lex attrs rules
2017-07-29 17:22:21 +07:00
Jim Geovedi
8b814c63f1
more exceptions
2017-07-27 19:46:30 +07:00
Jim Geovedi
6c725e8dcf
updated lemma
2017-07-27 19:46:21 +07:00
Jim Geovedi
c194f7ae26
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-07-27 10:55:34 +07:00
Jim Geovedi
547973b92a
wip syntax iterators
2017-07-27 10:51:34 +07:00
Jim Geovedi
bbc75da38d
enable syntax iterator and lemma lookup
2017-07-27 10:51:15 +07:00
Jim Geovedi
24a8c8bf28
added wip lemma dict
2017-07-26 21:39:54 +07:00
Jim Geovedi
63f14ba46b
added hyphen-suffix rules
2017-07-26 19:28:57 +07:00
Jim Geovedi
f288964441
removed -el from suffix rules
2017-07-26 19:28:38 +07:00
Jim Geovedi
6eee7a7411
updated tokenizer exceptions
2017-07-26 19:13:47 +07:00
Jim Geovedi
edec51b1b1
update punctuation rules
2017-07-26 19:13:36 +07:00
Jim Geovedi
62443d495a
enable token match
2017-07-26 19:13:14 +07:00
Jim Geovedi
c97f5ae0bb
updated tokenizer exceptions
2017-07-26 19:12:52 +07:00
Matthew Honnibal
aff325b7e0
Increment version
2017-07-25 19:41:20 +02:00
Matthew Honnibal
6780132821
Fix tagger loading
2017-07-25 19:41:11 +02:00
Matthew Honnibal
fd20a4af55
Increment version
2017-07-25 18:58:34 +02:00
Matthew Honnibal
523b0df2c9
Update text classification model
2017-07-25 18:57:59 +02:00
Matthew Honnibal
7c7fac9337
Add spacy.blank() loading function
2017-07-25 18:56:37 +02:00
Jim Geovedi
73f6ac9d9b
added hyphen
2017-07-24 15:56:31 +07:00
Jim Geovedi
68454c40bf
added missing import
2017-07-24 14:12:34 +07:00
Jim Geovedi
eaf9cbd708
curse of copy & paste
2017-07-24 14:11:51 +07:00
Jim Geovedi
7aad6718bc
enable tokenizer exceptions
2017-07-24 14:11:10 +07:00
Jim Geovedi
ad56c9179a
added tokenizer exceptions list
2017-07-24 14:10:16 +07:00
Jim Geovedi
c1f3fe99fe
updated punctuation rules
2017-07-24 13:57:21 +07:00
Jim Geovedi
37fa2c8c80
punctuation rules
2017-07-24 06:17:18 +07:00
Jim Geovedi
082e94ac1c
added infix rules
2017-07-24 06:17:07 +07:00
Jim Geovedi
d0ec484725
reverted
2017-07-24 06:16:29 +07:00
Jim Geovedi
0e590c711f
added prefix & suffix rules
2017-07-23 23:46:40 +07:00
Jim Geovedi
ba922e30e8
added ampere hour unit
2017-07-23 23:46:18 +07:00
Jim Geovedi
3b17eba27b
added frequency units
2017-07-23 23:10:52 +07:00
Jim Geovedi
d5fd32a572
added known currencies
2017-07-23 22:56:48 +07:00
Jim Geovedi
f6f15678fb
added lex_attrs
2017-07-23 22:55:22 +07:00
Jim Geovedi
bed8162d00
added tokenizer_exceptions
2017-07-23 22:55:05 +07:00
Jim Geovedi
b80c35bc9a
added norm_exceptions
2017-07-23 22:54:49 +07:00
Jim Geovedi
b5de329ea3
added norm_exceptions
2017-07-23 22:54:19 +07:00
Jim Geovedi
082e9ade46
fixed typo
2017-07-23 21:30:34 +07:00
Jim Geovedi
e2efeb186e
added stopwords
2017-07-23 20:52:37 +07:00
Jim Geovedi
da98676839
use template
2017-07-23 20:51:31 +07:00
Jim Geovedi
c2b4dd7809
start working on Indonesian language
2017-07-23 20:50:56 +07:00
Matthew Honnibal
5771bd1ff8
Increment version
2017-07-23 14:18:38 +02:00
Matthew Honnibal
c4a81a47a4
Fix deserialization
2017-07-23 14:11:07 +02:00
Matthew Honnibal
2df563ad24
Remove optimization for textcat that caused loading problem
2017-07-23 14:10:51 +02:00
Matthew Honnibal
4fe77bced2
Add cfg attr to pipeline components
2017-07-23 00:52:47 +02:00
Matthew Honnibal
d8aa721664
Compute Language.meta with a property
2017-07-23 00:50:18 +02:00
Matthew Honnibal
a88a7deffe
Fix save/load of textcat config
2017-07-23 00:33:43 +02:00
Matthew Honnibal
9bae0ddc50
Fix minibatching
2017-07-22 20:14:49 +02:00
Matthew Honnibal
ded0df5e2f
Expose hyper-param as keyword arg
2017-07-22 20:14:37 +02:00
Matthew Honnibal
f5de8deeec
Increment version
2017-07-22 20:04:53 +02:00
Matthew Honnibal
b55714d5d1
Make gold_tuples arg optional in begin_training
2017-07-22 20:04:43 +02:00
Matthew Honnibal
ed6c85fa3c
Fix loading of text categories in GoldParse
2017-07-22 20:04:03 +02:00
Matthew Honnibal
6ffec9dfea
Update _ml, for textcat model
2017-07-22 20:03:40 +02:00
Matthew Honnibal
d6a5c2c85a
Add test for NER
2017-07-22 01:48:58 +02:00
Matthew Honnibal
28244df4da
Add test for beam parsing
2017-07-22 01:48:35 +02:00
Matthew Honnibal
c86445bdfd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-07-22 01:14:28 +02:00
Matthew Honnibal
b3a749610e
Fix name of TextCategorizer
2017-07-22 01:14:07 +02:00
Matthew Honnibal
2424493970
Remove unnecessary import of Mock
2017-07-22 01:13:54 +02:00
Matthew Honnibal
baa3d81c35
Add text categorizer to Language
2017-07-22 01:13:36 +02:00
Matthew Honnibal
a6a2159969
Add slot for text categories to Doc
2017-07-22 00:34:15 +02:00
Matthew Honnibal
374ab3ecfb
Increment alpha version
2017-07-22 00:32:49 +02:00
Matthew Honnibal
289f23df51
Test beam parsing
2017-07-20 15:03:10 +02:00
Matthew Honnibal
3da1063b36
Add beam decoding to parser, to allow NER uncertainties
2017-07-20 15:02:55 +02:00
Matthew Honnibal
0ca5832427
Improve negative example handling in NER oracle
2017-07-20 00:18:49 +02:00
Matthew Honnibal
a231b56d40
Add text-classification hook to pipeline
2017-07-20 00:18:15 +02:00
Matthew Honnibal
7ea50182a5
Add support for text-classification labels to GoldParse
2017-07-20 00:17:47 +02:00
Matthew Honnibal
727481377e
Add text-classifier thinc models
2017-07-20 00:17:17 +02:00
Matthew Honnibal
f014138c11
Fix parser tests
2017-07-20 00:16:52 +02:00
mollerhoj
85144835da
Add Tag_map for Danish
2017-07-03 15:52:55 +02:00
mollerhoj
64c732918a
Add Morph_rules. (TODO: Not working?)
2017-07-03 15:52:55 +02:00
mollerhoj
3b2cb107a3
Add like_num functionality to Danish
2017-07-03 15:49:51 +02:00
mollerhoj
e8f40ceed8
Add short names of months to tokenizer_exceptions
2017-07-03 15:49:51 +02:00
mollerhoj
e840077601
Add some basic tests for Danish
2017-07-03 15:49:51 +02:00
mollerhoj
23025d3b05
Clean up a couple of strange English stopwords
2017-07-03 15:41:59 +02:00
mollerhoj
dc5be7d2f3
Cleanup list of Danish stopwords
2017-07-03 15:40:58 +02:00
Ines Montani
c91642efd5
Port over changes from #1168
2017-07-01 11:43:54 +02:00
Jim O'Regan
70f4d26c10
bounds checks
2017-06-28 10:59:46 +01:00
Jim O'Regan
1ba38b2036
some helpers; the Irish part of UD only has 2500 sentences so this will need source of morphology
2017-06-28 00:42:00 +01:00
Jim O'Regan
559e03605a
b'
2017-06-27 22:42:16 +01:00
Jim Regan
d81ceb0cd5
Merge branch 'develop' into polish
2017-06-26 22:42:27 +01:00
Jim O'Regan
2f84c73585
a start
2017-06-26 22:40:04 +01:00
Jim O'Regan
28d7f0a672
reference
2017-06-26 22:38:28 +01:00
Jim O'Regan
e12defdd9c
missed a couple
2017-06-26 22:24:14 +01:00
Jim O'Regan
c1e4e0f3bf
just now discovered that you can do multiwords
2017-06-26 22:19:39 +01:00
Jim O'Regan
5e5f94c1c0
fix dup
2017-06-26 21:57:00 +01:00
Jim O'Regan
a8dff9133e
add POS
2017-06-26 21:53:41 +01:00
Jim O'Regan
e9213f54de
missed one
2017-06-26 21:29:21 +01:00
Jim O'Regan
1eb7cc3017
attempt a port from #1147
2017-06-26 21:24:55 +01:00
Matthew Honnibal
91e52543ef
Merge pull request #1118 from Gregory-Howard/patch-2
...
Update _tokenizer_exceptions_list (adding cities)
2017-06-20 11:16:07 +02:00
Matthew Honnibal
8ea785e01a
Merge pull request #1119 from oroszgy/patch-3
...
Fixed conllu converter
2017-06-20 11:14:41 +02:00
Tpt
7745b3ae04
Adds noun chunks to French syntax iterators
2017-06-12 15:29:58 +02:00
Tpt
57e8254f63
Adds function to extract French noun chunks
2017-06-12 15:20:49 +02:00
György Orosz
62dbf9025c
Fixed conllu converter
2017-06-09 22:53:56 +02:00
Grégory Howard
cd974b32b7
Update _tokenizer_exceptions_list (adding cities)
2017-06-09 17:58:18 +02:00
ines
34a2eecb17
Add simple "naughty strings" test (see #1107)
2017-06-06 17:43:51 +02:00
ines
045574a936
Update package name and increment version
2017-06-05 20:41:30 +02:00
Matthew Honnibal
1f5874a927
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-05 20:20:00 +02:00
ines
03db56f48c
Detect spaCy version and add package title
...
Package title allows customised package names (like spacy-nightly)
2017-06-05 20:11:02 +02:00
Matthew Honnibal
c0d90f52f7
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-05 19:20:13 +02:00
ines
cc9c5dc7a3
Fix noun chunks test
2017-06-05 16:39:04 +02:00
Matthew Honnibal
836bfa2d0f
Add factory for experimental SimilarityHook component
2017-06-05 15:40:22 +02:00
Matthew Honnibal
d59fa32df1
Add experimental SimilarityHook component
2017-06-05 15:40:03 +02:00
Matthew Honnibal
5489b49203
Remove print statement
2017-06-05 13:20:41 +02:00
Matthew Honnibal
fc4204a12a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-05 13:13:23 +02:00
Matthew Honnibal
2479cde446
Support disable keyword in Language.__init__
2017-06-05 13:13:07 +02:00
ines
ea167e14db
Fix model package loading from link
2017-06-05 13:10:49 +02:00
ines
dd6dc4c120
Update spacy.load() helper functions
2017-06-05 13:02:31 +02:00
Matthew Honnibal
b4cdd05466
Add vectors.pyx in setup
2017-06-05 12:45:29 +02:00
Matthew Honnibal
280d419529
Add pickle method for vectors
2017-06-05 12:36:04 +02:00
Matthew Honnibal
30369d580f
Start testing Vectors class
2017-06-05 12:32:49 +02:00
Matthew Honnibal
eb7cbb62c2
Flesh out Vectors class
2017-06-05 12:32:08 +02:00
ines
51d7414e94
Make sure sents are a list
2017-06-05 12:30:13 +02:00
Matthew Honnibal
ebb6c49cd5
Make alignment case-insensitive for gold
2017-06-04 20:26:42 -05:00
Matthew Honnibal
fc4dd62e84
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 20:19:05 -05:00
Matthew Honnibal
8f8f90b46b
Disable labeller if not parsing
2017-06-04 20:18:54 -05:00
Matthew Honnibal
c52fde40f4
Improve train CLI
2017-06-04 20:18:37 -05:00
Matthew Honnibal
a053b1218e
Fix item counting during training
2017-06-04 20:18:20 -05:00
Matthew Honnibal
b3b5521625
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 20:17:18 -05:00
Matthew Honnibal
9bc4a26213
Add option of data augmentation noise
2017-06-04 20:16:57 -05:00
Matthew Honnibal
7b2ede783d
Add SP tag to tag map if missing
2017-06-04 20:16:30 -05:00
ines
a0f4592f0a
Update tests
2017-06-05 02:26:13 +02:00
ines
3e105bcd36
Update tests
2017-06-05 02:09:27 +02:00
Matthew Honnibal
516798e9fc
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-05 01:35:21 +02:00
Matthew Honnibal
193bf913c0
Set is_tagged=True after tagging
2017-06-05 01:35:07 +02:00
ines
078232932c
Fix tokenizer fixture scope
2017-06-05 01:06:34 +02:00
Matthew Honnibal
58be0e1f6f
Update tests
2017-06-04 16:35:06 -05:00
Matthew Honnibal
b78cc318c3
Fix loading of morphology exceptions
2017-06-04 16:34:32 -05:00
Matthew Honnibal
bb98d45a63
Fix tests
2017-06-04 16:00:44 -05:00
Matthew Honnibal
55d0621532
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 15:53:25 -05:00
Matthew Honnibal
5b9f116aca
Update tests
2017-06-04 15:53:17 -05:00
Matthew Honnibal
2a3bd5ee90
Fix fetching of noun chunk iterator
2017-06-04 15:53:05 -05:00
Matthew Honnibal
3680c51b8f
Avoid clobbering preset POS tags
2017-06-04 15:52:42 -05:00
Matthew Honnibal
939e8ed567
Add lookup properties for components in Language
2017-06-04 15:52:09 -05:00
Matthew Honnibal
e28f90b672
Fix syntax iterators
2017-06-04 15:51:50 -05:00
ines
8a29308d0b
Remove unused imports
2017-06-04 22:39:29 +02:00
Ines Montani
112c5787eb
Merge pull request #1101 from oroszgy/hu_tokenizer_fix
...
More robust Hungarian tokenizer.
2017-06-04 22:37:51 +02:00
ines
96867a24ae
Fix typo
2017-06-04 22:36:40 +02:00
ines
f432bb4b48
Fix fixture scopes
2017-06-04 22:34:31 +02:00
Matthew Honnibal
6d0356e6cc
Whitespace
2017-06-04 14:55:24 -05:00
Matthew Honnibal
8a683a4494
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 21:53:56 +02:00
Matthew Honnibal
92ae36f84e
Improve way noun chunks iterator is looked up
2017-06-04 21:53:39 +02:00
ines
9254a3dd78
Import and add Spanish syntax iterators
2017-06-04 21:42:15 +02:00
ines
7db1a0e83e
Make sure printed values are always strings
2017-06-04 21:27:20 +02:00
Matthew Honnibal
51e1541ddb
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 14:26:29 -05:00
Matthew Honnibal
add9a33782
Return False for vocab.has_vector
2017-06-04 14:26:14 -05:00
Matthew Honnibal
675f448313
Fix vector linkage on Doc
2017-06-04 14:25:30 -05:00
Matthew Honnibal
f4662e9218
Fix vector linkage for token
2017-06-04 14:19:58 -05:00
ines
070e026ed9
Ensure path on read_json
2017-06-04 20:44:37 +02:00
ines
e1e73936b1
Raise correct error
2017-06-04 20:44:27 +02:00
ines
848e47669e
Fix typo
2017-06-04 20:44:15 +02:00
ines
c4614c02a2
Fix dev resources URL
2017-06-04 15:45:50 +02:00
ines
a66cf24ee8
xfail tokenizer serialization tests for now
...
Tests pass locally, but not on Travis – needs more investigation
2017-06-04 13:58:20 +02:00
ines
7b7d46b64e
Fix typo and success message
2017-06-04 13:45:50 +02:00
ines
90d117f378
Update version
2017-06-04 13:41:16 +02:00
Matthew Honnibal
7ca215bc26
Resolve lex_attr_getters conflict
2017-06-03 16:12:01 -05:00
Matthew Honnibal
21eef90dbc
Support specifying which GPU
2017-06-03 16:10:23 -05:00
Matthew Honnibal
d0e42f9275
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-03 15:30:32 -05:00
Matthew Honnibal
8a17b99b1c
Use NORM attribute, not LOWER
2017-06-03 15:30:16 -05:00
ines
4c643d74c5
Add norm exceptions to other Language classes
2017-06-03 22:29:21 +02:00
ines
fa7e576c57
Change order of exception dicts
2017-06-03 21:52:06 +02:00
Matthew Honnibal
3f5c85d8de
Reorder setting of lex attrs, to avoid clobbering
2017-06-03 14:47:55 -05:00
Matthew Honnibal
aeb7520133
Make norm use lower-case
2017-06-03 14:47:38 -05:00
Matthew Honnibal
de3954843e
Populate norm exceptions with lower-case
2017-06-03 14:47:12 -05:00
Matthew Honnibal
f6955a459c
Fix prev commit
2017-06-03 14:38:37 -05:00
Matthew Honnibal
468ca6c760
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-03 14:33:51 -05:00
Matthew Honnibal
c647a0d33e
Fix training counter for gold preprocessing
2017-06-03 14:33:39 -05:00
ines
e47eef5e03
Update German tokenizer exceptions and tests
2017-06-03 21:07:44 +02:00
ines
d77c2cc8bb
Add tests for English norm exceptions
2017-06-03 20:59:50 +02:00
ines
0d6fa8b241
Add German norm exceptions
2017-06-03 20:54:18 +02:00
ines
5bd311c77e
Fix update of norm exceptions
2017-06-03 20:54:09 +02:00
Matthew Honnibal
94e063ae2a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-03 13:31:40 -05:00
Matthew Honnibal
fea1144e6d
Set max batch size in evaluate
2017-06-03 13:31:33 -05:00
Matthew Honnibal
805495af27
Fix off-by-one in number of tags
2017-06-03 13:29:23 -05:00
Matthew Honnibal
e62f46d39f
Clarify gold.pyx slightly
2017-06-03 13:28:52 -05:00
Matthew Honnibal
43353b5413
Improve train CLI script
2017-06-03 13:28:20 -05:00
ines
746653880c
Add English norm exceptions to lex_attrs
2017-06-03 20:27:28 +02:00
ines
095eeeb12f
Update English tokenizer exceptions and add norms
2017-06-03 20:27:16 +02:00
ines
e5d426406a
Add base norm exceptions
2017-06-03 20:27:05 +02:00
ines
4c2bbc3ccc
Add add_lookups util function
2017-06-03 19:44:47 +02:00
ines
05fe6758a7
Set lexeme attributes for tokenizer special cases
2017-06-03 19:44:39 +02:00
ines
3152ee5ca2
Update serialization tests for tokenizer
2017-06-03 17:05:28 +02:00
ines
7c919aeb09
Make sure serializers and deserializers are ordered
2017-06-03 17:05:09 +02:00
ines
1ebd0d3f27
Add assert_packed_msg_equal util function
2017-06-03 17:04:30 +02:00
ines
de974f7bef
Add serializer tests for tokenizer
2017-06-03 13:26:34 +02:00
ines
0153b66a86
Return self in Tokenizer.from_bytes
2017-06-03 13:26:13 +02:00
ines
82154a1861
Add letter spacing to arrow label
2017-06-03 13:25:41 +02:00
ines
32c6f05de9
Adjust spacing and sizing in compact mode
2017-06-03 13:25:32 +02:00
ines
cc8c8617a4
Shut down displaCy server on KeyboardInterrupt
2017-06-03 13:24:56 +02:00
ines
70fbba7d08
Clone Doc to never merge punctuation on original Doc
2017-06-03 13:24:43 +02:00
ines
459a1e8470
Fix whitespace
2017-06-03 11:31:18 +02:00
ines
5109bba910
Port over fix from #1070
2017-06-03 11:31:11 +02:00
ines
d21459f87d
Update serializer tests
2017-06-02 21:42:26 +02:00
ines
6669583f4e
Use OrderedDict
2017-06-02 21:07:56 +02:00
ines
2f1025a94c
Port over Spanish changes from #1096
2017-06-02 19:09:58 +02:00
ines
d86e7cde93
Add entity recognizer to parser serialization tests
2017-06-02 18:40:06 +02:00
ines
0051c05964
Add tests for serializing parser
2017-06-02 18:37:19 +02:00
ines
fdd0923be4
Translate model=True in exclude to lower_model and upper_model
2017-06-02 18:37:07 +02:00
ines
cef547a9f0
Add serialization tests for tensorizer
2017-06-02 18:18:30 +02:00
ines
924c58bde3
Fix serialization of optional elements
2017-06-02 18:18:17 +02:00
ines
f74a45c1fe
Remove unnecessary argument
2017-06-02 18:17:46 +02:00
ines
43b4d63f85
Add serialization tests for tagger
2017-06-02 17:29:34 +02:00
ines
1b593bbd6d
Fix encoding on tagger serialization
2017-06-02 17:29:21 +02:00
Matthew Honnibal
5f4d328e2c
Fix serialization of tag_map in NeuralTagger
2017-06-02 10:18:37 -05:00
Matthew Honnibal
ed6f575e06
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-02 04:26:39 -05:00
ines
acd65c00f6
Add serialization tests for StringStore and Vocab
2017-06-02 10:57:42 +02:00
ines
41a6adf1f6
Initialise Vocab length correctly
2017-06-02 10:57:25 +02:00
ines
53b82f972a
Add strings to Vocab in init, instead of StringStore
2017-06-02 10:57:06 +02:00
ines
023f38bdd4
Fix return value of Vocab.from_bytes
2017-06-02 10:56:40 +02:00
ines
9692c98f57
Add test utils for temp file and temp dir
2017-06-02 10:56:09 +02:00
Matthew Honnibal
c650bc481c
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-01 13:03:57 -05:00
Matthew Honnibal
307d615c5f
Fix serialization for tagger when tag_map has changed
2017-06-01 12:18:36 -05:00
Matthew Honnibal
1d18cedae8
Fiddle with msgpack bytes vs unicode
2017-06-01 10:48:43 -05:00
ines
7a2380f617
Rename "nn_tagger" to "tagger"
2017-06-01 17:37:53 +02:00
ines
e5ae6ccf4e
Fix typo
2017-06-01 16:46:15 +02:00
ines
a3e4f91f4a
Only load vocab if it exists
2017-06-01 14:38:35 +02:00
Matthew Honnibal
d310b0aab3
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-01 04:58:03 -05:00
Matthew Honnibal
3ff7d7fcef
Merge for updated requirements
2017-06-01 04:57:47 -05:00
Matthew Honnibal
5eae3b9a1e
Fix to/from disk in tagger
2017-06-01 04:55:49 -05:00
ines
d5c8d2f5fd
Update about.py and increment version
2017-06-01 11:52:24 +02:00
Matthew Honnibal
4c97371051
Fixes for thinc 6.7
2017-06-01 04:22:16 -05:00
Matthew Honnibal
53d00a0371
Move weight serialization to Thinc
2017-06-01 03:04:36 -05:00
Matthew Honnibal
ae8010b526
Move weight serialization to Thinc
2017-06-01 02:56:12 -05:00
Gyorgy Orosz
f0c3b09242
More robust Hungarian tokenizer.
2017-05-31 22:28:40 +02:00
Matthew Honnibal
c8a58cfcf8
Fix Python2/3 load bug
2017-05-31 15:21:44 -05:00
Matthew Honnibal
99982684b0
Fix normalize_string_keys function
2017-05-31 14:08:16 -05:00
Matthew Honnibal
67ade63fc4
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 08:28:42 -05:00
Matthew Honnibal
490b38e6bb
Fix reference to thinc copy_array util
2017-05-31 08:25:21 -05:00
Matthew Honnibal
9805e0e369
Fix vocab pickling
2017-05-31 08:25:01 -05:00
Matthew Honnibal
6c51cd77b4
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 15:06:56 +02:00
Matthew Honnibal
8dfb9546f0
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 07:21:14 -05:00
Matthew Honnibal
480ef8bfc8
Add compat function to normalize dict keys
2017-05-31 07:14:29 -05:00
Matthew Honnibal
92f9e5cc9a
Silence env_opt, and fix serialization for GPU
2017-05-31 07:14:11 -05:00
Matthew Honnibal
0561df2a9d
Fix tokenizer serialization
2017-05-31 14:12:38 +02:00
Matthew Honnibal
4a398c15b7
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 13:44:16 +02:00
Matthew Honnibal
097ab9c6e4
Fix transition system to/from disk
2017-05-31 13:44:00 +02:00
Matthew Honnibal
b1469d3360
Fix string serialisation
2017-05-31 13:43:44 +02:00
Matthew Honnibal
e9419072e7
Fix tokenizer serialisation
2017-05-31 13:43:31 +02:00
Matthew Honnibal
33e5ec737f
Fix to/from disk methods
2017-05-31 13:43:10 +02:00
ines
5e1c361270
Update tests README with info on model tests
2017-05-31 12:22:58 +02:00
Matthew Honnibal
fe28602f2e
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 11:43:56 +02:00
Matthew Honnibal
66af019d5d
Fix serialization of tokenizer
2017-05-31 11:43:40 +02:00
Ines Montani
e6cf3c7e1c
Merge pull request #1093 from oroszgy/hu_emoji_fix
...
Fixed emoji handling for Hungarian
2017-05-31 11:33:24 +02:00
Matthew Honnibal
e98eff275d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 10:29:15 +02:00
Matthew Honnibal
53a3824334
Fix mistake in ner feature
2017-05-31 03:01:02 +02:00
Matthew Honnibal
8a693c2605
Write binary file during training
2017-05-31 02:59:18 +02:00
Matthew Honnibal
498ad85309
Try using tensor for vector/similarity methods
2017-05-30 23:35:17 +02:00
Matthew Honnibal
a131981f3b
Work on vectors
2017-05-30 23:34:50 +02:00
Matthew Honnibal
6937e311a4
Update doc tests
2017-05-30 23:34:23 +02:00
Matthew Honnibal
cc911feab2
Fix bug in NER state
2017-05-30 22:12:19 +02:00
Gyorgy Orosz
8c0b4b850e
Fixed emoji handling for Hungarian
2017-05-30 21:34:46 +02:00
Matthew Honnibal
be4a640f0c
Fix arc eager label costs for uint64
2017-05-30 20:37:58 +02:00
Matthew Honnibal
b127645afc
Fix test_misc merge conflict
2017-05-29 18:31:44 -05:00
Matthew Honnibal
e0e8eae7c7
Tweak package test
2017-05-29 18:30:42 -05:00
Matthew Honnibal
11840ff5dd
Store tag map before normalizing props
2017-05-29 17:53:48 -05:00
Matthew Honnibal
b92a89f87b
Make it easier to reference embedding tables
2017-05-29 17:53:29 -05:00
Matthew Honnibal
293d1b425b
Serialize in consistent order
2017-05-29 17:53:06 -05:00
Matthew Honnibal
9bf22a94aa
Fix tag set serialisation
2017-05-29 17:52:36 -05:00
Matthew Honnibal
2a061e2777
Fix serialisation, for reals this time
2017-05-29 17:52:08 -05:00
ines
20a7003c0d
Update model fixtures and reorganise tests
2017-05-29 22:14:31 +02:00
ines
795fe43a4d
Add load_test_model function with importorskip()
...
Loads model only if it can be imported, i.e. if it's installed as a
package.
2017-05-29 22:11:31 +02:00
ines
ad3c8b3ad9
Fix formatting
2017-05-29 22:10:50 +02:00
ines
6e3937efc5
Check for arguments of model markers to specify models to test
...
Lets user set --models --en for only English models
2017-05-29 22:10:16 +02:00
Matthew Honnibal
35d981241f
Fix model deserialization
2017-05-29 14:46:31 -05:00
Matthew Honnibal
5b29f227ae
Fix serialization
2017-05-29 14:35:53 -05:00
Matthew Honnibal
1e6df0a2a1
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-29 14:30:12 -05:00
ines
08382f21e3
Pass model meta to nlp object in load_model
2017-05-29 20:44:11 +02:00
ines
6145fe6a93
Catch all kwargs on Language
2017-05-29 20:43:48 +02:00
ines
0d7d50fe22
Add __version__ to __init__.py
2017-05-29 20:43:24 +02:00
Matthew Honnibal
6522ea6c8b
More serialization fixes. Still broken
2017-05-29 13:23:47 -05:00
Matthew Honnibal
9c9ee24411
Fix broken lambda scoping in Python 2
2017-05-29 13:23:28 -05:00
Matthew Honnibal
f1acdaab55
Fix serialization of weight offsets
2017-05-29 13:23:11 -05:00
Matthew Honnibal
c044e9c21c
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-29 08:41:02 -05:00
Matthew Honnibal
aa4c33914b
Work on serialization
2017-05-29 08:40:45 -05:00
ines
9e83a17e95
Use new model templates
2017-05-29 15:27:24 +02:00
ines
567485a818
Fix and document model loading with pipeline and overrides
2017-05-29 14:10:10 +02:00
Matthew Honnibal
deac7eb01c
Fix for serialization
2017-05-29 13:54:18 +02:00
Matthew Honnibal
04c32aa091
Fix for serialization
2017-05-29 13:53:32 +02:00
Matthew Honnibal
a1960c2d09
Fix for serialization
2017-05-29 13:47:42 +02:00
Matthew Honnibal
7b06bb896e
Fix for serialization
2017-05-29 13:42:55 +02:00
Matthew Honnibal
74235587ef
Fix to serialization
2017-05-29 13:40:31 +02:00
Matthew Honnibal
59f355d525
Fixes for serialization
2017-05-29 13:38:20 +02:00
Matthew Honnibal
920887f4e4
Specify order of vocab deserialization
2017-05-29 13:04:40 +02:00
Matthew Honnibal
f4aafca222
Merge changes to test_misc
2017-05-29 12:26:02 +02:00
Matthew Honnibal
a318f0cae1
Add to/from disk/bytes methods for tokenizer
2017-05-29 12:24:41 +02:00
Matthew Honnibal
ff26aa6c37
Work on to/from bytes/disk serialization methods
2017-05-29 11:45:45 +02:00
ines
df920ba0e7
Add tests for displaCy and util functions and fix util typo
2017-05-29 10:51:19 +02:00
ines
c5714d4fb2
xfail matcher test for now until setting norm via Span.merge works
2017-05-29 10:51:02 +02:00
Matthew Honnibal
6b019b0540
Update to/from bytes methods
2017-05-29 10:14:20 +02:00
Matthew Honnibal
c91b121aeb
Move serialization functions to util
2017-05-29 10:13:42 +02:00
Matthew Honnibal
1fa2bfb600
Add model_to_bytes and model_from_bytes helpers. Probably belong in thinc.
2017-05-29 09:27:04 +02:00
Matthew Honnibal
6dad4117ad
Work on serialization for models
2017-05-29 01:37:57 +02:00
ines
7b1ddcc04d
Add test for vocab serialization
2017-05-29 01:09:52 +02:00
ines
00b2094dc3
Fix typos, long integers and tests
2017-05-29 01:09:52 +02:00
ines
804dbb8d25
Add StringStore test for API docs
2017-05-29 01:09:52 +02:00
Matthew Honnibal
6cd5730ee7
Fix lex struct setters for strings
2017-05-29 01:05:09 +02:00
Matthew Honnibal
2edd96ce47
Draft Vocab to/from disk/bytes
2017-05-28 23:34:12 +02:00
Matthew Honnibal
4ddff020c3
Fix compile error
2017-05-28 23:30:40 +02:00
Matthew Honnibal
6d3caeadd2
Fix type check for long
2017-05-28 23:22:45 +02:00
Matthew Honnibal
92dbf28c1e
Hack a fixture in the vectors tests, for xfail
2017-05-28 20:28:32 +02:00
Matthew Honnibal
9239f06ed3
Fix german noun chunks iterator
2017-05-28 20:13:03 +02:00
Matthew Honnibal
fd9b6722a9
Fix noun chunks iterator for new stringstore
2017-05-28 20:12:10 +02:00
ines
414193e9ba
Update docs to reflect StringStore changes
2017-05-28 18:19:11 +02:00
Matthew Honnibal
7996d21717
Fixes for new StringStore
2017-05-28 11:09:27 -05:00
Matthew Honnibal
8a24c60c1e
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-28 08:12:05 -05:00
Matthew Honnibal
bc97bc292c
Fix __call__ method
2017-05-28 08:11:58 -05:00
Matthew Honnibal
5cf47b847b
Handle iob with no tag in converter
2017-05-28 08:11:39 -05:00
Matthew Honnibal
fe11564b8e
Finish stringstore change. Also xfail vectors tests
2017-05-28 15:10:22 +02:00
Matthew Honnibal
b007a2b0d3
Update stringstore tests
2017-05-28 14:08:09 +02:00
Matthew Honnibal
84e66ca6d4
WIP on stringstore change. 27 failures
2017-05-28 14:06:40 +02:00
Matthew Honnibal
fe4a746300
Accommodate symbols in new string scheme
2017-05-28 13:03:16 +02:00
Matthew Honnibal
f51e6a6c16
Adjust lexeme sizing for attr_t being 64 bit
2017-05-28 12:51:09 +02:00
Matthew Honnibal
a5606c3eda
Work on changing StringStore to return hashes.
2017-05-28 12:36:27 +02:00
Matthew Honnibal
39293ab2ee
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-28 11:46:57 +02:00
Matthew Honnibal
dd052572d4
Update arc eager for SBD changes
2017-05-28 11:46:51 +02:00
Matthew Honnibal
3ea98e2043
Remove vector member from lexeme
2017-05-28 11:46:24 +02:00
Matthew Honnibal
2445707f3c
Re-delegate vectors to vocab
2017-05-28 11:46:10 +02:00
Matthew Honnibal
6863d01361
Remove vectors from lexeme
2017-05-28 11:45:48 +02:00
Matthew Honnibal
15f6efc127
Remove vectors from vocab
2017-05-28 11:45:32 +02:00
Matthew Honnibal
c1263a844b
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-27 18:32:57 -05:00
Matthew Honnibal
9e711c3476
Divide d_loss by batch size
2017-05-27 18:32:46 -05:00
Matthew Honnibal
b082f76494
Randomize pipeline order during training
2017-05-27 18:32:21 -05:00
Matthew Honnibal
a1d4c97fb7
Improve correctness of minibatching
2017-05-27 17:59:00 -05:00
ines
84189c1cab
Add 'xx' language ID for multi-language support
...
Allows models to specify their language ID as 'xx'.
2017-05-28 00:58:59 +02:00
ines
33e332e67c
Remove unused export
2017-05-28 00:57:59 +02:00
ines
c1983621fb
Update util functions for model loading
2017-05-28 00:22:40 +02:00
ines
c8543c8237
Fix formatting and docstrings and remove deprecated function
2017-05-28 00:22:40 +02:00
Matthew Honnibal
49235017bf
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-27 16:34:28 -05:00
Matthew Honnibal
7ebd26b8aa
Use ordered dict to specify transitions
2017-05-27 15:52:20 -05:00
Matthew Honnibal
3eea5383a1
Add move_names property to parser
2017-05-27 15:51:55 -05:00
Matthew Honnibal
8de9829f09
Don't overwrite model in initialization, when loading
2017-05-27 15:50:40 -05:00
Matthew Honnibal
99316fa631
Use ordered dict to specify actions
2017-05-27 15:50:21 -05:00
Matthew Honnibal
655ca58c16
Clarifying change to StateC.clone
2017-05-27 15:49:37 -05:00
Matthew Honnibal
5e4312feed
Evaluate loaded class, to ensure save/load works
2017-05-27 15:47:02 -05:00
Matthew Honnibal
34bbad8e0e
Add __reduce__ methods on parser subclasses. Fixes pickling.
2017-05-27 15:46:06 -05:00
Matthew Honnibal
7cc9c3e9a6
Fix convert CLI
2017-05-27 15:44:42 -05:00
ines
1203959625
Add pipeline setting to meta.json generator
2017-05-27 20:02:01 +02:00
ines
086a06e7d7
Fix CLI docstrings and add command as first argument
...
Workaround for Plac
2017-05-27 20:01:46 +02:00
ines
a8e58e04ef
Add symbols class to punctuation rules to handle emoji (see #1088)
...
Currently doesn't work for Hungarian, because of conflicts with the
custom punctuation rules. Also doesn't take multi-character emoji like
👩🏽💻 into account.
2017-05-27 17:57:10 +02:00
Matthew Honnibal
dc07d72d80
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-27 08:20:40 -05:00
Matthew Honnibal
de13fe0305
Remove length cap on sentences
2017-05-27 08:20:32 -05:00
Matthew Honnibal
73a643d32a
Don't randomise pipeline for training, and don't update if no gradient
2017-05-27 08:20:13 -05:00
Matthew Honnibal
3d22fcaf0b
Return None from parser if there are no annotations
2017-05-26 14:02:59 -05:00
Matthew Honnibal
d06f235fc9
Fix conflict on convert.py
2017-05-26 11:33:29 -05:00
Matthew Honnibal
2e587c6417
Export iob_to_biluo utility
2017-05-26 11:32:55 -05:00
Matthew Honnibal
2b3b937a04
Fix converter CLI
2017-05-26 11:32:41 -05:00
Matthew Honnibal
5a87bcf35f
Fix converters
2017-05-26 11:32:34 -05:00
Matthew Honnibal
8af3100143
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-26 11:31:41 -05:00
Matthew Honnibal
3d5a536eaa
Improve efficiency of parser batching
2017-05-26 11:31:23 -05:00
Matthew Honnibal
daac3e3573
Always shuffle gold data, and support length cap
2017-05-26 11:30:52 -05:00
Matthew Honnibal
d65f99a720
Improve model saving in train script
2017-05-26 05:52:09 -05:00
ines
51882c4984
Fix formatting
2017-05-26 12:37:45 +02:00
ines
353f0ef8d7
Use disable argument (list) for serialization
2017-05-26 12:33:54 +02:00
Matthew Honnibal
22d7b448a5
Fix convert command
2017-05-25 19:47:12 -05:00
Matthew Honnibal
dbf2a4cf57
Update all models on each epoch
2017-05-25 19:46:56 -05:00
Matthew Honnibal
faff1c23fb
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-25 17:16:10 -05:00
Matthew Honnibal
82b11b0320
Remove print statement
2017-05-25 17:15:59 -05:00
Matthew Honnibal
80cf42e33b
Fix compounding and decaying utils
2017-05-25 17:15:39 -05:00
Matthew Honnibal
df8015f05d
Tweaks to train script
2017-05-25 17:15:24 -05:00
Matthew Honnibal
3a6e59cc53
Add minibatch function in spacy.gold
2017-05-25 17:15:09 -05:00
Matthew Honnibal
702fe74a4d
Clean up spacy.cli.train
2017-05-25 16:16:30 -05:00
Matthew Honnibal
b9cea9cd93
Add compounding and decaying functions
2017-05-25 16:16:10 -05:00
Matthew Honnibal
2cb7cc2db7
Remove commented code from parser
2017-05-25 14:55:09 -05:00
Matthew Honnibal
f403c2cd5f
Add env opts for optimizer
2017-05-25 11:19:26 -05:00
Matthew Honnibal
c245ff6b27
Rebatch parser inputs, with mid-sentence states
2017-05-25 11:18:59 -05:00
Matthew Honnibal
679efe79c8
Make parser update less hacky
2017-05-25 06:49:00 -05:00
Matthew Honnibal
8500d9b1da
Only train one task per iter, holding grads
2017-05-25 06:47:42 -05:00
Matthew Honnibal
b27c587800
Fix pieces argument to PrecomputedMaxout
2017-05-25 06:46:59 -05:00
Matthew Honnibal
e1cb5be0c7
Adjust dropout, depth and multi-task in parser
2017-05-24 20:11:41 -05:00
Matthew Honnibal
e6cc927ab1
Rearrange multi-task learning
2017-05-24 20:10:54 -05:00
Matthew Honnibal
135a13790c
Disable gold preprocessing
2017-05-24 20:10:20 -05:00
Matthew Honnibal
467bbeadb8
Add hidden layers for tagger
2017-05-24 20:09:51 -05:00
ines
66088851dc
Add Doc.to_disk() and Doc.from_disk() methods
2017-05-24 11:58:17 +02:00
Matthew Honnibal
620df0414f
Fix dropout in parser
2017-05-23 15:20:45 -05:00
Matthew Honnibal
5b67bcbee0
Increase default embed size to 7500
2017-05-23 15:20:16 -05:00
Matthew Honnibal
48eef94f92
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 18:47:32 +02:00
Matthew Honnibal
d44b1eafc4
Fix conflict artefacts
2017-05-23 18:47:11 +02:00
Matthew Honnibal
01e59e4e6e
* Add Token.sent_start property, re Issue #235
2017-05-23 18:41:11 +02:00
Matthew Honnibal
4917cbb484
Include sent_start test
2017-05-23 18:40:37 +02:00
Matthew Honnibal
d68dd1f251
Add SENT_START attribute, for custom sentence boundary detection
2017-05-23 18:37:58 +02:00
Matthew Honnibal
8026c183d0
Add hacky logic to accelerate depth=0 case in parser
2017-05-23 11:06:49 -05:00
Matthew Honnibal
e7d3159d91
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 05:58:17 -05:00
Matthew Honnibal
a8b6d11c5b
Support optional maxout layer
2017-05-23 05:58:07 -05:00
Matthew Honnibal
c55b8fa7c5
Fix bugs in parse_batch
2017-05-23 05:57:52 -05:00
ines
fb0ff0272f
xfail neural parser tests for now and remove test for deprecated method
2017-05-23 12:40:37 +02:00
Matthew Honnibal
964707d795
Restore support for deeper networks in parser
2017-05-23 05:31:13 -05:00
Matthew Honnibal
e27262f431
Go back to previous matcher signature, with on_match positional
2017-05-23 04:37:40 -05:00
Matthew Honnibal
5418bcf5d7
Resolve conflict on test
2017-05-23 04:37:16 -05:00
ines
e6acd3bbf2
Fix matcher tests and matcher docs
2017-05-23 11:36:02 +02:00
ines
d0c6d4f76d
Fix formatting
2017-05-23 11:32:00 +02:00
Matthew Honnibal
f0bcc0bd8d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 04:29:28 -05:00
Matthew Honnibal
9adfe9e8fc
Don't hold gradient updates in language -- let the parser decide how to batch the updates.
2017-05-23 04:29:10 -05:00
Matthew Honnibal
6b918cc58e
Support making updates periodically during training
2017-05-23 04:23:29 -05:00
Matthew Honnibal
3f725ff7b3
Roll back changes to parser update
2017-05-23 04:23:05 -05:00
Matthew Honnibal
3959d778ac
Revert "Revert "WIP on improving parser efficiency""
...
This reverts commit 532afef4a8.
2017-05-23 03:06:53 -05:00
Matthew Honnibal
532afef4a8
Revert "WIP on improving parser efficiency"
...
This reverts commit bdaac7ab44.
2017-05-23 03:05:25 -05:00
Matthew Honnibal
bdaac7ab44
WIP on improving parser efficiency
2017-05-23 02:59:31 -05:00
Matthew Honnibal
8a9e318deb
Put the parsing loop in a nogil prange block
2017-05-22 17:58:12 -05:00
ines
a23f487b06
Tidy up displaCy and add "manual" option
...
Also don't require title in EntityRenderer
2017-05-22 18:48:20 +02:00
Matthew Honnibal
0264447c4d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 10:41:56 -05:00
Matthew Honnibal
6e8dce2c05
Fix train command line args
2017-05-22 10:41:39 -05:00
Matthew Honnibal
a7ee63c0ac
Fix labeller loss for unseen labels
2017-05-22 10:41:20 -05:00
Matthew Honnibal
c9760b2104
Support sentence limits in GoldCorpus
2017-05-22 10:40:46 -05:00
Matthew Honnibal
e2136232f9
Exclude states with no matching gold annotations from parsing
2017-05-22 10:30:12 -05:00
Matthew Honnibal
83ffd16474
Fix offset calculation for other negative values
2017-05-22 08:00:53 -05:00
ines
b3c7ee0148
Fix tests and use the new Matcher API
2017-05-22 13:54:20 +02:00
Matthew Honnibal
f00f821496
Fix pseudoprojectivity->nonproj
2017-05-22 06:14:42 -05:00
Matthew Honnibal
ae8cf70dc1
Fix CLI train signature
2017-05-22 06:13:39 -05:00
Matthew Honnibal
187f370734
Update tests for matcher changes
2017-05-22 12:59:50 +02:00
Matthew Honnibal
5d59e74cf6
PseudoProjectivity->nonproj
2017-05-22 05:49:53 -05:00
Matthew Honnibal
7e2cdc0c81
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 12:39:34 +02:00
Matthew Honnibal
70a8c531cd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 05:39:18 -05:00
Matthew Honnibal
2f78413a02
PseudoProjectivity->nonproj
2017-05-22 05:39:03 -05:00
Matthew Honnibal
89ebc5c3cd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 12:38:15 +02:00
Matthew Honnibal
d8bb5bb959
Implement StringStore serialization, and update tests
2017-05-22 12:38:00 +02:00
ines
54f04a9fe0
Update API docs with changes in spacy.gold and spacy.language
2017-05-22 12:29:30 +02:00
ines
b5fb43fdd8
Allow sys.exit status as exits keyword arg in util.prints()
2017-05-22 12:29:15 +02:00
ines
fc3ec733ea
Reduce complexity in CLI
...
Remove now redundant model command and move plac annotations to cli
files
2017-05-22 12:28:58 +02:00
Matthew Honnibal
b45b4aa392
PseudoProjectivity --> nonproj
2017-05-22 05:17:44 -05:00
Matthew Honnibal
aae97f00e9
Fix nonproj import
2017-05-22 05:15:06 -05:00
Matthew Honnibal
9262fc4829
Fix syntax error
2017-05-22 05:14:59 -05:00
Matthew Honnibal
93a042253b
Make GoldParse attributes writeable
2017-05-22 04:51:08 -05:00
Matthew Honnibal
2a5eb9f61e
Make nonproj methods top-level functions, instead of class methods
2017-05-22 04:51:08 -05:00
Matthew Honnibal
c998776c25
Make single array for features, to reduce GPU copies
2017-05-22 04:51:08 -05:00
Matthew Honnibal
bc2294d7f1
Add support for fiddly hyper-parameters to train func
2017-05-22 04:51:08 -05:00
Matthew Honnibal
80e19a2399
Simplify CLI implementation for subcommands. Remove model command.
2017-05-22 04:51:08 -05:00
Matthew Honnibal
33e2222839
Remove unused code in deprojectivize
2017-05-22 04:51:08 -05:00
Matthew Honnibal
4e0988605a
Pass through non-projective=True
2017-05-22 04:51:08 -05:00
Matthew Honnibal
025d9bbc37
Fix handling of non-projective deps
2017-05-22 04:51:08 -05:00
Matthew Honnibal
5738d373d5
Add deprojectivize to pipeline
2017-05-22 04:51:08 -05:00
Matthew Honnibal
1b5fa68996
Do pseudo-projective pre-processing for parser
2017-05-22 04:51:08 -05:00
Matthew Honnibal
1d5d9838a2
Fix action collection for parser
2017-05-22 04:51:08 -05:00
Matthew Honnibal
8d1e64be69
Add experimental NeuralLabeller
2017-05-22 04:51:08 -05:00
Matthew Honnibal
9b1b0742fd
Fix prediction for tok2vec
2017-05-22 04:51:08 -05:00
Matthew Honnibal
f13d6c7359
Support gold preprocessing and single gold files
2017-05-22 04:51:08 -05:00
Matthew Honnibal
e14533757b
Use averaged params for evaluation
2017-05-22 04:51:08 -05:00
Matthew Honnibal
7811d97339
Refactor CLI
2017-05-22 04:51:08 -05:00
Matthew Honnibal
5db89053aa
Merge docstrings
2017-05-21 13:46:23 -05:00
Matthew Honnibal
432b3499b3
Fix memory leak
2017-05-21 13:38:46 -05:00
Matthew Honnibal
59fbfb3829
Remove train.py -- functions now in GoldCorpus and Language
2017-05-21 09:08:27 -05:00
Matthew Honnibal
8904814c0e
Add missing import
2017-05-21 09:07:56 -05:00
Matthew Honnibal
baf3ef0ddc
Remove import of removed train_config script
2017-05-21 09:07:34 -05:00
Matthew Honnibal
4c9202249d
Refactor training, to fix memory leak
2017-05-21 09:07:06 -05:00
Matthew Honnibal
4803b3b69e
Add GoldCorpus class, to manage data streaming
2017-05-21 09:06:17 -05:00
Matthew Honnibal
180e5afede
Fix tokvecs flattening in pipeline
2017-05-21 09:05:34 -05:00
Matthew Honnibal
0731971bfc
Add itershuffle utility function. Maybe belongs in thinc
2017-05-21 09:05:05 -05:00
ines
2c5cfe8bbf
Update docstrings and API docs for StringStore
2017-05-21 14:18:58 +02:00
ines
251346b59f
Fix typos and formatting
2017-05-21 14:18:46 +02:00
ines
075f5ff87a
Update docstrings and API docs for GoldParse
2017-05-21 13:53:46 +02:00
ines
99b631617d
Reformat docstrings
2017-05-21 13:32:15 +02:00
ines
885e82c9b0
Update docstrings and remove deprecated load classmethod
2017-05-21 13:27:52 +02:00
ines
c5a653fa48
Update docstrings and API docs for Tokenizer
2017-05-21 13:18:14 +02:00
ines
f216422ac5
Remove deprecated load classmethod
2017-05-21 13:18:01 +02:00
ines
d82ae9a585
Change "function" to "callable" in docs
2017-05-21 13:17:40 +02:00
ines
3871157d84
Update spacy.util documentation
2017-05-21 01:12:09 +02:00
ines
0c6c65aa3c
Improve messaging if model linking fails after download
2017-05-21 00:28:37 +02:00
Matthew Honnibal
3b7c108246
Pass tokvecs through as a list, instead of concatenated. Also fix padding
2017-05-20 13:23:32 -05:00
ines
924e8506de
Move Defaults subclass to module scope (necessary for pickling)
2017-05-20 19:02:27 +02:00
Matthew Honnibal
d52b65aec2
Revert "Move to contiguous buffer for token_ids and d_vectors"
...
This reverts commit 3ff8c35a79.
2017-05-20 11:26:23 -05:00
ines
27de0834b2
Update docstrings and API docs for Lexeme
2017-05-20 15:13:42 +02:00
ines
7ed8a92ed1
Update docstrings and API docs for Token
2017-05-20 15:13:33 +02:00
ines
4ed6a36622
Update docstrings and API docs for Matcher
2017-05-20 14:43:10 +02:00
ines
39f36539f6
Update docstrings and API docs for Matcher
2017-05-20 14:32:34 +02:00
ines
c00ff257be
Update docstrings and API docs for Matcher
2017-05-20 14:26:10 +02:00
ines
790435e51c
Update docstrings
2017-05-20 14:05:07 +02:00
ines
f0cc642bb9
Update docstrings and API docs for Vocab
2017-05-20 14:00:41 +02:00
Matthew Honnibal
ce9234f593
Update Matcher API
2017-05-20 13:54:53 +02:00
Matthew Honnibal
b272890a8c
Try to move parser to simpler PrecomputedAffine class. Currently broken -- maybe the previous change
2017-05-20 06:40:10 -05:00
ines
e39ad78267
Resolve model name properly in cli.info
...
Use util.resolve_model_path() to also allow package names and paths.
2017-05-20 12:24:40 +02:00
Matthew Honnibal
3ff8c35a79
Move to contiguous buffer for token_ids and d_vectors
2017-05-20 04:17:30 -05:00
Matthew Honnibal
8b04b0af9f
Remove freqs from transition_system
2017-05-20 02:20:48 -05:00
Matthew Honnibal
61fe55efba
Move EnglishDefaults class out of English
2017-05-20 02:18:19 -05:00
Matthew Honnibal
a1ba20e2b1
Fix over-run on parse_batch
2017-05-19 18:57:30 -05:00
ines
1d4d3d0ecd
Add TODO
2017-05-20 01:38:04 +02:00
Matthew Honnibal
7ee1827af0
Disable data caching in parser
2017-05-19 18:17:11 -05:00
Matthew Honnibal
e84de028b5
Remove 'rebatch' op, and remove min-batch cap
2017-05-19 18:16:36 -05:00
Matthew Honnibal
3376d4d6e8
Update the train script, fixing GPU memory leak
2017-05-19 18:15:50 -05:00
Matthew Honnibal
836fe1d880
Update neural net tests
2017-05-19 18:11:29 -05:00
ines
fe5d8819ea
Update Matcher docstrings and API docs
2017-05-19 21:47:06 +02:00
Matthew Honnibal
08766240c3
Add incomplete iob converter
2017-05-19 13:27:51 -05:00
Matthew Honnibal
c12ab47a56
Remove state argument in pipeline. Other changes
2017-05-19 13:26:36 -05:00
Matthew Honnibal
66ea9aebe7
Remove the state argument from Language
2017-05-19 13:25:42 -05:00
Matthew Honnibal
09a877886b
WIP on iob converter
2017-05-19 13:24:39 -05:00
ines
a804045597
Use is_ancestor instead of deprecated is_ancestor_of
2017-05-19 20:23:40 +02:00
Matthew Honnibal
8d5e6d9f4f
Rename no_ner arg to no_entities
2017-05-19 13:23:11 -05:00
ines
e9e62b01b0
Update docstrings and API docs for Token
2017-05-19 18:47:56 +02:00
ines
62ceec4fc6
Update docstrings and API docs for Span
2017-05-19 18:47:46 +02:00
ines
23f9a3ccc8
Update docstrings and API docs for Doc
2017-05-19 18:47:39 +02:00
ines
2c8c9dc0c9
Update docstrings and API docs for Language
2017-05-19 18:47:24 +02:00
ines
0791f0aae6
Update docstrings and API docs for Span class
2017-05-19 00:31:31 +02:00
ines
8455cb1327
Update docstring for Doc.__getitem__
2017-05-19 00:30:51 +02:00
ines
0fc05e54e4
Document TokenVectorEncoder
2017-05-19 00:00:02 +02:00
ines
b687ad109d
Update docstrings and API docs for Doc class
2017-05-18 23:59:44 +02:00
ines
d42bc16868
Update docstrings and API docs for Language class
2017-05-18 23:57:38 +02:00
ines
593361ee3c
Update docstrings for Span class
2017-05-18 22:17:41 +02:00
ines
b87066ff10
Update docstrings and API docs for Doc class
2017-05-18 22:17:41 +02:00
Matthew Honnibal
238be0f16a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-18 08:32:22 -05:00
Matthew Honnibal
c214c0decb
Improve env_opt reporting
2017-05-18 08:32:03 -05:00
Matthew Honnibal
bbb59e371c
Fix GPU evaluation
2017-05-18 08:31:15 -05:00
Matthew Honnibal
c2c825127a
Fix use_params and pipe methods
2017-05-18 08:30:59 -05:00
Matthew Honnibal
ca70b08661
Fix GPU training and evaluation
2017-05-18 08:30:33 -05:00
ines
489d2fb4ba
Add is_in_jupyter() helper for displaCy (see #1058)
2017-05-18 14:13:14 +02:00
ines
abf0188b0a
Move cupy and CudaStream to compat
2017-05-18 14:12:45 +02:00
ines
33decd85b6
Reorganise and explicitly state what's importable
2017-05-18 14:12:31 +02:00
Matthew Honnibal
a438cef8c5
Fix significant bug in feature calculation -- off by 1
2017-05-18 06:21:32 -05:00
Matthew Honnibal
fc8d3a112c
Add util.env_opt support: Can set hyper params through environment variables.
2017-05-18 04:36:53 -05:00
Matthew Honnibal
d2626fdb45
Fix name error in nn parser
2017-05-18 04:31:01 -05:00
Matthew Honnibal
b460533827
Bug fixes to pipeline
2017-05-18 04:29:51 -05:00
Matthew Honnibal
8815507f8e
Move SpanishDefaults out of Language class, for pickle
2017-05-18 04:28:51 -05:00
Matthew Honnibal
2713041571
Fix GPU usage in Language
2017-05-18 04:25:19 -05:00
Matthew Honnibal
711ad5edc4
Cache features in doc2feats
2017-05-18 04:22:20 -05:00
Matthew Honnibal
39ea38c4b1
Add option to use gpu to spacy train
2017-05-18 04:21:49 -05:00
Matthew Honnibal
a1d8e420b5
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-17 08:00:04 -05:00
Matthew Honnibal
edfea3a513
Fix progress bar
2017-05-17 14:59:37 +02:00
Matthew Honnibal
0b7fd67408
Fix style check in displacy
2017-05-17 07:57:24 -05:00
Matthew Honnibal
55dab77de8
Add conversion rule for .conll
2017-05-17 13:13:48 +02:00
Matthew Honnibal
692bd2a186
Bug fix to tagger: wasn't backpropagating to token vectors
2017-05-17 13:13:14 +02:00
Matthew Honnibal
877f83807f
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-17 12:09:29 +02:00
Matthew Honnibal
793430aa7a
Get spaCy train command working with neural network
...
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Matthew Honnibal
3bf4a28d8d
Use tag in CoNLL converter, not POS
2017-05-17 12:04:33 +02:00
ines
1a05078c79
Add language-specific syntax iterators to en and de
2017-05-17 12:04:03 +02:00
Matthew Honnibal
c9a5d5d24b
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-16 16:22:05 +02:00
Matthew Honnibal
8cf097ca88
Redesign training to integrate NN components
...
* Obsolete .parser, .entity etc names in favour of .pipeline
* Components no longer create models on initialization
* Models created by loading method (from_disk(), from_bytes() etc), or
.begin_training()
* Add .predict(), .set_annotations() methods in components
* Pass state through pipeline, to allow components to share information
more flexibly.
2017-05-16 16:17:30 +02:00
Matthew Honnibal
221b4c1ee8
Fix test for Python 3
2017-05-16 13:06:30 +02:00
Matthew Honnibal
5211645af3
Get data flowing through pipeline. Needs redesign
2017-05-16 11:21:59 +02:00
Matthew Honnibal
1d7c18e58a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-15 21:53:47 +02:00
Matthew Honnibal
a9edb3aa1d
Improve integration of NN parser, to support unified training API
2017-05-15 21:53:27 +02:00
ines
98354be150
Only get user_data if it exists on doc
2017-05-15 13:39:47 +02:00
ines
c33bdeb564
Use uppercase for entity types
2017-05-15 01:24:57 +02:00
ines
4aaa607b8d
Add xmlns:xlink so SVGs are rendered properly as individual files
2017-05-14 19:54:13 +02:00
ines
9dd13cd76a
Update docstrings
2017-05-14 19:30:47 +02:00
ines
a04550605a
Add Jupyter notebook support (see #1058)
2017-05-14 18:39:01 +02:00
ines
c31792aaec
Add displaCy visualisers (see #1058)
2017-05-14 17:50:23 +02:00
ines
b462076d80
Merge load_lang_class and get_lang_class
2017-05-14 01:31:10 +02:00
ines
36bebe7164
Update docstrings
2017-05-14 01:30:29 +02:00
Matthew Honnibal
4b9d69f428
Merge branch 'v2' into develop
...
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module
Currently the two parsers live side-by-side, until we figure out how to
organize them.
2017-05-14 01:10:23 +02:00
Matthew Honnibal
5cac951a16
Move new parser to nn_parser.pyx, and restore old parser, to make tests pass.
2017-05-14 00:55:01 +02:00
Matthew Honnibal
f8c02b4341
Remove cupy imports from parser, so it can work on CPU
2017-05-14 00:37:53 +02:00
Matthew Honnibal
613ba79e2e
Fiddle with sizings for parser
2017-05-13 17:20:23 -05:00
Matthew Honnibal
e6d71e1778
Small fixes to parser
2017-05-13 17:19:04 -05:00
Matthew Honnibal
188c0f6949
Clean up unused import
2017-05-13 17:18:27 -05:00
Matthew Honnibal
f85c8464f7
Draft support of regression loss in parser
2017-05-13 17:17:27 -05:00
ines
1694c24e52
Add docstrings, error messages and fix consistency
2017-05-13 21:22:49 +02:00
ines
ee7dcf65c9
Fix expand_exc to make sure it returns combined dict
2017-05-13 21:22:25 +02:00
ines
824d09bb74
Move resolve_load_name to deprecated
2017-05-13 21:21:47 +02:00
ines
a4a37a783e
Remove import from non-existing module
2017-05-13 16:00:09 +02:00
ines
5858857a78
Update languages list in conftest
2017-05-13 15:37:54 +02:00
ines
9d85cda8e4
Fix models error message and use about.__docs_models__ (see #1051)
2017-05-13 13:05:47 +02:00
ines
6b942763f0
Tidy up imports
2017-05-13 13:04:40 +02:00
ines
8c2a0c026d
Fix parse_tree test
2017-05-13 12:32:45 +02:00
ines
6129016e15
Replace deepcopy
2017-05-13 12:32:37 +02:00
ines
df68bf45ce
Set defaults for light and flat kwargs
2017-05-13 12:32:23 +02:00
ines
b9dea345e5
Remove old import
2017-05-13 12:32:11 +02:00
ines
293ee359c5
Fix formatting
2017-05-13 12:32:06 +02:00
ines
4eefb288e3
Port over PR #1055
2017-05-13 03:25:32 +02:00
Matthew Honnibal
ee1d35bdb0
Fix merge conflict
2017-05-13 03:20:19 +02:00
Matthew Honnibal
b2540d2379
Merge Kengz's tree_print patch
2017-05-13 03:18:49 +02:00
Matthew Honnibal
827b5af697
Update draft of parser neural network model
...
Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.
Outline of the model:
We first predict context-sensitive vectors for each word in the input:
(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4
This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features.
To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this
by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a
representation that's one affine transform from this informative lexical information. This is obviously good for the
parser (which backprops to the convolutions too).
The parser model makes a state vector by concatenating the vector representations for its context tokens. Current
results suggest few context tokens works well. Maybe this is a bug.
The current context tokens:
* S0, S1, S2: Top three words on the stack
* B0, B1: First two words of the buffer
* S0L1, S0L2: Leftmost and second leftmost children of S0
* S0R1, S0R2: Rightmost and second rightmost children of S0
* S1L1, S1L2, S1R1, S1R2, B0L1, B0L2: Likewise for S1 and B0
This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately,
there's a way to structure the computation to save some expense (and make it more GPU friendly).
The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks
with a non-monotonic transition). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications
for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden
weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN
-- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model
is so big.)
This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity.
The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved
to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier.
We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle
in CUDA to train.
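The pre-computation trick described above can be sketched in plain Python. This is only an illustration under assumed toy sizes, not the parser's actual code: the function names (`naive_hidden`, `precompute_hidden`, `state_hidden`), the sizes T, H, F and N, and the random data are all hypothetical, and padding for missing context tokens is ignored.

```python
import random

def matvec(vec, mat):
    """Multiply a row vector of length K by a (K, H) matrix given as nested lists."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def naive_hidden(tokvecs, context_ids, weights):
    """Per-state cost: concatenate F token vectors, then do one (F*T) @ (F*T, H) product."""
    concat = [x for tok_id in context_ids for x in tokvecs[tok_id]]
    full = [row for block in weights for row in block]  # stack F blocks into (F*T, H)
    return matvec(concat, full)

def precompute_hidden(tokvecs, weights):
    """Once per batch: cache[i][f] = tokvecs[i] @ weights[f] for every token and feature slot."""
    return [[matvec(vec, block) for block in weights] for vec in tokvecs]

def state_hidden(cache, context_ids):
    """Per-state cost after pre-computation: F lookups and F vector additions."""
    hidden = [0.0] * len(cache[0][0])
    for slot, tok_id in enumerate(context_ids):
        hidden = [h + c for h, c in zip(hidden, cache[tok_id][slot])]
    return hidden

random.seed(0)
T, H, F, N = 4, 3, 13, 6  # toy sizes: token width, hidden width, feature slots, tokens
tokvecs = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(N)]
weights = [[[random.uniform(-1, 1) for _ in range(H)] for _ in range(T)] for _ in range(F)]

cache = precompute_hidden(tokvecs, weights)
context_ids = [random.randrange(N) for _ in range(F)]  # one token index per feature slot
naive = naive_hidden(tokvecs, context_ids, weights)
fast = state_hidden(cache, context_ids)
```

Both paths give the same hidden vector up to floating-point rounding, because the product of a concatenated vector with a stacked matrix is the sum of the per-slot products.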
Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to
be 0 cost. This is defined as:
(exp(score) / Z) - (exp(score) / gZ)
Where Z is the sum of exp(score) over all classes, and gZ is the sum of exp(score) over the gold classes. I'm very interested in regressing on the cost directly,
but so far this isn't working well.
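As a rough sketch of the expression above, the gradient of the multilabel log loss with respect to the scores can be computed as follows. This assumes the second term applies only to gold (zero-cost) classes; the function name `multilabel_log_loss_grad` and the `is_gold` mask are hypothetical and not taken from the parser code.

```python
import math

def multilabel_log_loss_grad(scores, is_gold):
    """Return exp(score)/Z - exp(score)/gZ per class, with the second term only on gold classes."""
    top = max(scores)  # subtract the max for numerical stability; the ratios are unchanged
    exps = [math.exp(s - top) for s in scores]
    Z = sum(exps)                                     # normaliser over all classes
    gZ = sum(e for e, g in zip(exps, is_gold) if g)   # normaliser over gold classes only
    if gZ == 0.0:
        raise ValueError("at least one class must be gold")
    return [e / Z - (e / gZ if g else 0.0) for e, g in zip(exps, is_gold)]

# Two zero-cost actions and one costly action.
grad = multilabel_log_loss_grad([1.0, 2.0, 0.5], [True, True, False])
```

With a single gold class this reduces to the usual softmax-minus-one-hot gradient. With several gold classes, the gradient pushes probability mass towards the gold set as a whole without forcing a choice among its members.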
Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit
greatly from the pre-computation trick.
2017-05-12 16:09:15 -05:00
ines
c4857bc7db
Remove unused argument
2017-05-12 15:37:54 +02:00
ines
c13b3fa052
Add LEX_ATTRS
2017-05-12 15:37:45 +02:00
ines
bca2ea9c72
Update Portuguese lexical attributes
2017-05-12 15:37:39 +02:00
ines
2f870123bf
Fix formatting
2017-05-12 15:37:20 +02:00
ines
ca65993d59
Add basic Polish Language class
2017-05-12 09:25:37 +02:00
ines
48177c4f92
Add missing tokenizer exceptions
2017-05-12 09:25:24 +02:00
ines
bb8be3d194
Add Danish language data
2017-05-10 21:15:12 +02:00
Matthew Honnibal
4efb391994
Fix serializer
2017-05-09 18:45:18 +02:00
Matthew Honnibal
b16ae75824
Remove serializer hacks from pipeline classes
2017-05-09 18:16:40 +02:00
Matthew Honnibal
7253b4e649
Remove old serialization tests
2017-05-09 18:12:58 +02:00
Matthew Honnibal
f9327343ce
Start updating serializer test
2017-05-09 18:12:03 +02:00
Matthew Honnibal
1166b0c491
Implement Doc.to_bytes and Doc.from_bytes methods
2017-05-09 18:11:34 +02:00
Matthew Honnibal
9e167b7bb6
Strip serializer from code
2017-05-09 17:28:50 +02:00
Matthew Honnibal
b53f7dfdc3
Remove spacy.serialize
2017-05-09 17:22:06 +02:00
Matthew Honnibal
62ecdea9f2
Add binder class for document serialization
2017-05-09 17:21:00 +02:00
ines
a0b00624bb
Make sure like_email returns bool
2017-05-09 11:37:29 +02:00
ines
ea60932e1b
Fix formatting
2017-05-09 11:08:14 +02:00
ines
2c3bdd09b1
Add English test for like_num
2017-05-09 11:06:34 +02:00
ines
22375eafb0
Fix and merge attrs and lex_attrs tests
2017-05-09 11:06:25 +02:00
ines
02d0ac5cab
Remove redundant function and fix formatting
2017-05-09 11:06:04 +02:00
ines
b5ca50607e
Reorganise entity rules
2017-05-09 01:37:10 +02:00
ines
564939391a
Remove spacy.orth
2017-05-09 01:21:47 +02:00
ines
12c3d5fbba
Fix formatting
2017-05-09 01:15:28 +02:00
ines
2829a024ef
Re-add basic like_num check to global lex_attrs
2017-05-09 01:15:23 +02:00
ines
88adeee548
Add English lex_attrs overrides
2017-05-09 01:09:52 +02:00
ines
8f3fbbb147
Fix typos
2017-05-09 01:09:37 +02:00
ines
ea5fa46475
Import LEX_ATTRS from lang.lex_attrs
2017-05-09 00:58:10 +02:00
ines
2216e5f326
Reorganise lex_attrs and add dict
2017-05-09 00:57:54 +02:00
ines
e666f14d20
Add global lex_attrs
2017-05-09 00:41:53 +02:00
ines
41972c43fe
Use consistent regex imports
2017-05-09 00:34:31 +02:00
ines
7b83977020
Remove unused munge package
2017-05-09 00:16:16 +02:00
ines
c714841cc8
Move language-specific tests to tests/lang
2017-05-09 00:02:37 +02:00
ines
bd57b611cc
Update conftest to lazy load languages
2017-05-09 00:02:21 +02:00
ines
9f0fd5963f
Reorganise Hungarian punctuation rules
2017-05-09 00:01:59 +02:00
ines
fc0d793360
Reorganise Bengali punctuation rules
2017-05-09 00:01:52 +02:00
ines
e895d1afd7
Reorganise French punctuation rules
2017-05-09 00:00:54 +02:00
ines
014bda0ae3
Reorganise global punctuation rules
2017-05-09 00:00:46 +02:00
ines
a91278cb32
Rename _URL_PATTERN to URL_PATTERN
2017-05-09 00:00:00 +02:00
ines
604f299cf6
Add char classes to global language data
2017-05-08 23:59:33 +02:00
ines
f6f5d78cb9
Fix formatting
2017-05-08 23:59:17 +02:00
ines
6eb6306843
Fix language data imports
2017-05-08 23:58:31 +02:00
ines
3c0f85de8e
Remove imports in /lang/__init__.py
2017-05-08 23:58:07 +02:00
ines
86d9c29f30
Reorder util functions
2017-05-08 23:51:15 +02:00
ines
9a0d2fdef1
Add load_lang_class() util function
2017-05-08 23:50:45 +02:00
ines
614aa09582
Tidy up Bengali tokenizer exceptions
2017-05-08 22:29:49 +02:00
ines
73b577cb01
Fix relative imports
2017-05-08 22:29:04 +02:00
ines
ae99990f63
Fix formatting
2017-05-08 22:23:48 +02:00
ines
f46ffe3e89
Move language data to /lang module
2017-05-08 20:00:40 +02:00
ines
41a322c733
Fix LEMMA in exceptions and morph rules
2017-05-08 19:57:36 +02:00
ines
2edc0aee12
Update warning message
2017-05-08 19:53:36 +02:00
ines
6025cdb992
Fix string interpolation in times
2017-05-08 16:38:16 +02:00
ines
b9ba58ba5c
Add function to resolve load name
...
Warn if old 'path' keyword argument is used.
2017-05-08 16:33:37 +02:00
ines
e6f1a5d0a1
Add unicode declaration
2017-05-08 16:22:17 +02:00
ines
be5541bd16
Fix import and tokenizer exceptions
2017-05-08 16:20:14 +02:00
ines
2324788970
Remove bad tests
2017-05-08 16:15:27 +02:00
ines
b88c4193e7
Add missing symbol
2017-05-08 16:15:20 +02:00
ines
9a5b2bdd4c
Don't set morph rules without tag map
2017-05-08 16:15:12 +02:00
ines
4930f0fa8f
Explicitly import TOKEN_MATCH
2017-05-08 16:11:54 +02:00
ines
50b7ec03ca
Fix typo
2017-05-08 16:11:45 +02:00
ines
3ca611fe48
Fix wildcard imports
2017-05-08 15:56:29 +02:00
ines
c2469b8135
Remove __all__ export
2017-05-08 15:56:22 +02:00
ines
14a9c3ee7a
Fix wildcard import
2017-05-08 15:56:13 +02:00
ines
deed623864
Remove comment
2017-05-08 15:56:05 +02:00
ines
e7f95c37ee
Merge base tokenizer exceptions
2017-05-08 15:55:52 +02:00
ines
24606d364c
Remove redundant language_data.py files in languages
...
Originally intended to collect all components of a language, but just
made things messy. Now each component is in charge of exporting itself
properly.
2017-05-08 15:55:29 +02:00
ines
a627d3e3b0
Reorganise Chinese language data
2017-05-08 15:54:36 +02:00
ines
7b86ee093a
Reorganise Swedish language data
2017-05-08 15:54:29 +02:00
ines
50510fa947
Reorganise Portuguese language data
2017-05-08 15:52:01 +02:00
ines
279895ea83
Reorganise Dutch language data
2017-05-08 15:51:39 +02:00
ines
04ef5025bd
Reorganise Norwegian language data
2017-05-08 15:51:22 +02:00
ines
5edbc725d8
Reorganise Japanese language data
2017-05-08 15:50:46 +02:00
ines
51a389d3bb
Reorganise Italian language data
2017-05-08 15:50:17 +02:00
ines
1bbfa14436
Reorganise Hungarian language data
2017-05-08 15:49:56 +02:00
ines
a77c9fc60d
Reorganise Hebrew language data
2017-05-08 15:49:28 +02:00
ines
7f05e977fa
Reorganise French language data
2017-05-08 15:49:05 +02:00
ines
0207ffdd52
Reorganise Finnish language data
2017-05-08 15:48:31 +02:00
ines
8e483ec950
Reorganise Spanish language data
2017-05-08 15:48:04 +02:00
ines
c7c21b980f
Reorganise English language data
2017-05-08 15:47:25 +02:00
ines
1bf9d5ec8b
Reorganise German language data
2017-05-08 15:44:26 +02:00
ines
7b3a983f96
Reorganise Bengali language data
2017-05-08 15:43:50 +02:00
ines
607ba458e7
Fix whitespace
2017-05-08 15:42:31 +02:00
ines
60db497525
Add update_exc and expand_exc to util
...
Doesn't require separate language data util anymore
2017-05-08 15:42:12 +02:00
Matthew Honnibal
b44f7e259c
Clean up unused parser code
2017-05-08 15:42:04 +02:00
ines
6e5bd4f228
Remove unused functions from deprecated
2017-05-08 15:40:16 +02:00
Matthew Honnibal
17efb1c001
Change width
2017-05-08 08:40:13 -05:00
ines
f68e420bc0
Add PRON_LEMMA and DET_LEMMA to deprecated
...
Will be replaced with proper values across the language data later.
2017-05-08 15:35:30 +02:00
ines
bd6a7cf4f6
Simplify deprecated model downloading
...
Only relevant for spaCy < v1.7.0.
2017-05-08 15:32:10 +02:00
ines
95edd9e896
Let parse_package_meta take full path
2017-05-08 15:30:48 +02:00
ines
326746eb15
Add util function to resolve arg to model path
...
1. check if in data dir or shortcut link
2. check if installed as a pip package
3. check if string is path to model
4. check if Path or Path-like object
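
The four checks above can be sketched in plain Python. This is only an illustration of the lookup order, not the code from this commit: `resolve_model_path` and its `data_dir` parameter are hypothetical names, and the package check uses the standard library's `importlib.util.find_spec` as a stand-in for whatever the real util does.

```python
import importlib.util
from pathlib import Path


def resolve_model_path(name, data_dir):
    """Hypothetical helper: return a Path for `name`, or None if unresolved."""
    data_dir = Path(data_dir)
    if isinstance(name, str):
        # 1. A directory (or shortcut link) inside the data directory.
        candidate = data_dir / name
        if candidate.exists():
            return candidate
        # 2. An installed package with that import name.
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError, AttributeError):
            spec = None
        if spec is not None and spec.origin:
            return Path(spec.origin).parent
        # 3. A string that is itself a path to a model directory.
        if Path(name).exists():
            return Path(name)
        return None
    # 4. A Path or Path-like object.
    if hasattr(name, "__fspath__"):
        path = Path(name)
        return path if path.exists() else None
    return None
```

Returning `None` instead of raising keeps the sketch short; a caller would decide how to report an unresolved name.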
2017-05-08 15:29:47 +02:00
Matthew Honnibal
bef89ef23d
Mergery
2017-05-08 08:29:36 -05:00
ines
a7801e7342
Update spacy.load()
...
The path argument is now deprecated, and name can take either a model
name or a path. Implement lazy loading by importing the module and
reading the Language class name off __all__.
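
A minimal sketch of the lazy-loading idea, assuming the language class is the first name listed in the module's `__all__` (the message does not say which entry is used). `load_language_class` and the `demo_lang_en` stand-in module are hypothetical and exist only so the example runs without spaCy installed.

```python
import importlib
import sys
import types


def load_language_class(module_name):
    """Hypothetical helper: import a module only when asked, then return
    the class whose name is listed first in its __all__."""
    module = importlib.import_module(module_name)
    class_name = module.__all__[0]  # assumption: first entry is the Language class
    return getattr(module, class_name)


class English:
    """Stand-in for a Language subclass."""
    lang = "en"


# Register a fake language module so import_module can find it.
_demo = types.ModuleType("demo_lang_en")
_demo.English = English
_demo.__all__ = ["English"]
sys.modules["demo_lang_en"] = _demo

cls = load_language_class("demo_lang_en")
```

Because the import happens inside the function, no language module is loaded until a caller asks for it by name.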
2017-05-08 15:27:25 +02:00
Matthew Honnibal
50ddc9fc45
Fix infinite loop bug
2017-05-08 07:54:26 -05:00
Matthew Honnibal
94e86ae00a
Predict tags with encoder
2017-05-08 07:53:45 -05:00
Matthew Honnibal
56073a11ef
Don't use tags when calculating token vectors
2017-05-08 07:52:24 -05:00
Matthew Honnibal
a66a4a4d0f
Replace einsums
2017-05-08 14:46:50 +02:00
Matthew Honnibal
8d2eab74da
Use PretrainableMaxouts
2017-05-08 14:24:55 +02:00
Matthew Honnibal
807cb2e370
Add PretrainableMaxouts
2017-05-08 14:24:43 +02:00
Matthew Honnibal
2e2268a442
Precomputable hidden now working
2017-05-08 11:36:37 +02:00
ines
94697e9afc
Fix typo
2017-05-08 02:00:37 +02:00
ines
0ee2a22b67
Merge branch 'pr/1024' into develop
2017-05-08 01:12:44 +02:00
ines
c4492d260a
Fix kwargs
2017-05-08 01:05:24 +02:00
Matthew Honnibal
10682d35ab
Get pre-computed version working
2017-05-08 00:38:35 +02:00
ines
b5a726c5cd
Tidy up deprecated.py
2017-05-07 23:29:22 +02:00
ines
59c3b9d4dd
Tidy up CLI and fix print functions
2017-05-07 23:25:29 +02:00
ines
311704674d
Add path2str compat function
2017-05-07 23:24:56 +02:00
ines
e34069db9f
Move is_package and get_model_package_path to util
2017-05-07 23:24:51 +02:00
ines
957ba676b4
Add model files base path to about.py
2017-05-07 23:22:35 +02:00
ines
8d8dd9ceb2
Don't set default value for model
2017-05-07 23:22:21 +02:00
Matthew Honnibal
35458987e8
Checkpoint -- nearly finished reimpl
2017-05-07 23:05:01 +02:00
Matthew Honnibal
4441866f55
Checkpoint -- nearly finished reimpl
2017-05-07 22:47:06 +02:00
Matthew Honnibal
6782eedf9b
Tmp GPU code
2017-05-07 11:04:24 -05:00
Matthew Honnibal
e420e5a809
Tmp
2017-05-07 07:31:09 -05:00
Matthew Honnibal
12039e80ca
Switch to single matmul for state layer
2017-05-07 14:26:34 +02:00
Matthew Honnibal
700979fb3c
CPU/GPU compat
2017-05-07 04:01:11 +02:00
Matthew Honnibal
f99f5b75dc
working residual net
2017-05-07 03:57:26 +02:00
Matthew Honnibal
bdf2dba9fb
WIP on refactor, with hidden pre-computing
2017-05-07 02:02:43 +02:00
Matthew Honnibal
b439e04f8d
Learning smoothly
2017-05-06 20:38:12 +02:00
Matthew Honnibal
08bee76790
Learns things
2017-05-06 18:24:38 +02:00
Matthew Honnibal
04ae1c01f1
Learns things
2017-05-06 18:21:02 +02:00
Matthew Honnibal
bcf4cd0a5f
Learns things
2017-05-06 17:37:36 +02:00
Matthew Honnibal
8e48b58cd6
Gradients look correct
2017-05-06 16:47:15 +02:00
Matthew Honnibal
7e04260d38
Data running through, likely errors in model
2017-05-06 14:22:20 +02:00
Matthew Honnibal
fa7c1990b6
Restore tok2vec function
2017-05-05 20:12:03 +02:00
Matthew Honnibal
efe9630e1c
Bug fixes
2017-05-05 20:09:50 +02:00
Matthew Honnibal
ef4fa594aa
Draft of NN parser, to be tested
2017-05-05 19:20:39 +02:00
Matthew Honnibal
7d1df50aec
Draft up Parser model
2017-05-04 13:31:40 +02:00
Matthew Honnibal
ccaf26206b
Pseudocode for parser
2017-05-04 12:17:59 +02:00
ines
b1f22c5a10
Fix formatting
2017-05-03 20:11:02 +02:00
ines
a04b5be1b2
Add glossary for annotation scheme (closes #1034)
...
Can be imported as explain from spacy.glossary, or called as
spacy.explain(term)
2017-05-03 17:02:17 +02:00
Gregory Howard
929f2792a7
Renaming cls in module. cls is now a class
2017-05-03 15:41:07 +02:00
Gregory Howard
0e8c41ea4f
Adding method lemmatizer for every class
2017-05-03 12:14:42 +02:00
Gregory Howard
32ca07989e
Adding export for Japanese
2017-05-03 11:07:29 +02:00
Grégory Howard
f9d7144224
Merge branch 'master' into master
2017-05-03 11:04:51 +02:00
Gregory Howard
f2ab7d77b4
Lazy imports language
2017-05-03 11:01:42 +02:00
Ines Montani
3ea23a3f4d
Fix formatting
2017-05-03 09:44:38 +02:00
Ines Montani
d730eb0c0d
Raise custom ImportError if importing janome fails
2017-05-03 09:43:29 +02:00
Ines Montani
949ad6594b
Add newline
2017-05-03 09:38:43 +02:00
Ines Montani
d12ca587ea
Add newline
2017-05-03 09:38:29 +02:00
Ines Montani
8676cd0135
Add newline
2017-05-03 09:38:07 +02:00
Yasuaki Uechi
c8f83aeb87
Add basic Japanese support
2017-05-03 13:56:21 +09:00
Gregory Howard
c0afcd22bb
Merge remote-tracking branch 'remotes/upstream/master'
2017-04-27 14:42:54 +02:00
Matthew Honnibal
31ec9e1371
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-27 13:21:39 +02:00
Matthew Honnibal
2da16adcc2
Add dropout option for parser and NER
...
Dropout can now be specified in the `Parser.update()` method via
the `drop` keyword argument, e.g.
nlp.entity.update(doc, gold, drop=0.4)
This will randomly drop 40% of features, and multiply the value of the
others by 1. / 0.4. This may be useful for generalising from small data
sets.
This commit also patches the examples/training/train_new_entity_type.py
example, to use dropout and fix the output (previously it did not output
the learned entity).
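
For illustration, feature dropout can be sketched in its conventional "inverted" form, where surviving values are rescaled by 1 / (1 - drop) so their expected value is unchanged. Note that the message above states a factor of 1. / 0.4, so the parser's actual scaling may differ from this sketch. `dropout_features` is a hypothetical name, not part of spaCy.

```python
import random


def dropout_features(values, drop, rng=random):
    """Hypothetical helper: zero each value with probability `drop`
    and rescale the survivors."""
    if not 0.0 <= drop < 1.0:
        raise ValueError("drop must be in [0, 1)")
    # Assumption: conventional inverted-dropout factor, not the 1. / 0.4
    # quoted in the commit message.
    scale = 1.0 / (1.0 - drop)
    return [0.0 if rng.random() < drop else v * scale for v in values]


rng = random.Random(0)  # seeded so repeated runs give the same mask
features = [1.0, 2.0, 3.0, 4.0, 5.0]
dropped = dropout_features(features, drop=0.4, rng=rng)
```

With `drop=0.0` the input comes back unchanged, which is a quick sanity check for any implementation of this kind.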
2017-04-27 13:18:39 +02:00
Gregory Howard
92f368f83b
Removing extra spaces
2017-04-27 12:02:14 +02:00
Gregory Howard
13b6957c8e
Adding unit test for tokenization in French (with title)
2017-04-27 11:53:44 +02:00
Gregory Howard
8ff4682255
correcting tokenizer exception.
...
Adding tests for lemmatization
2017-04-27 11:52:14 +02:00
Ines Montani
7da9cefd25
Merge pull request #1022 from luvogels/master
...
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Ines Montani
c9e592ae6c
Add newline
2017-04-27 11:15:41 +02:00
Ines Montani
5942adccc2
Add newline
2017-04-27 11:15:19 +02:00
Ines Montani
4cd9269aef
Add newline
2017-04-27 11:15:04 +02:00
Ines Montani
ccf13ecc21
Add newline
2017-04-27 11:14:42 +02:00
Ines Montani
03d2b0cc05
Add newline
2017-04-27 11:14:26 +02:00
Gregory Howard
44cb486849
Adding unit test for tokenization in French (with title)
2017-04-27 10:59:38 +02:00
Gregory Howard
ad8129cb45
Improvement of rules: now title-insensitive and using the same declaration format
2017-04-27 10:23:56 +02:00
luvogels
d12a0b6431
Hooked up tokenizer tests
2017-04-26 23:21:41 +02:00
Matthew Honnibal
f0e1606d27
Increment version
2017-04-26 20:25:41 +02:00
luvogels
b331929a7e
Merge branch 'master' of https://github.com/luvogels/spaCy
2017-04-26 19:15:48 +02:00
luvogels
8de59ce3b9
Added tokenizer tests
2017-04-26 19:10:18 +02:00
Matthew Honnibal
4d98511db7
Make Span hashable. Closes #1019
2017-04-26 19:01:05 +02:00
Matthew Honnibal
24c4c51f13
Try to make test999 less flakey
2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang
460094bf09
Update __init__.py
2017-04-26 18:27:55 +02:00
ines
527d51ac9a
Fetch shortcuts from GitHub and improve error handling
2017-04-26 18:00:28 +02:00
Gregory Howard
ed5f094451
Adding insensitive lemmatisation test
2017-04-25 18:07:02 +02:00
ghoward
26e31afc18
Renaming tests
2017-04-25 17:46:01 +02:00
ghoward
c085c2d391
Adding some unit tests
2017-04-25 17:44:16 +02:00
ghoward
55c6910f90
Lookup table for languages in spaCy.
...
Need to find another name for lemmatizerlookup; I was not inspired.
Trying to use the new files for the fr language.
2017-04-24 16:39:00 +02:00
Matthew Honnibal
c4be9c36fe
Fix unicode header in tests
2017-04-24 10:09:01 +02:00
Matthew Honnibal
65f10b53e5
Fix test
2017-04-24 00:25:55 +02:00
Matthew Honnibal
70a43858e1
Fix flakey test
2017-04-24 00:06:30 +02:00
Matthew Honnibal
3973af2d15
Make training test less flakey
2017-04-23 22:59:34 +02:00
Matthew Honnibal
4f9657b42b
Fix reporting if no dev data with train
2017-04-23 22:27:10 +02:00
Matthew Honnibal
df2ac8b843
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-23 21:25:07 +02:00
Matthew Honnibal
d0e19267e8
Create directory if missing in save_to_directory
2017-04-23 21:24:43 +02:00
ines
42305bc519
Remove unnecessary test
2017-04-23 21:21:41 +02:00
ines
012ea594d1
Add file for misc tests
2017-04-23 21:06:51 +02:00
ines
83f66947dc
Rename test_download to test_cli
2017-04-23 21:06:50 +02:00
ines
401045433c
Simplify compat.fix_text
2017-04-23 21:06:50 +02:00
Matthew Honnibal
e033c86a64
Increment version
2017-04-23 21:03:43 +02:00
Matthew Honnibal
d2436dc17b
Update fix for Issue #999
2017-04-23 18:14:37 +02:00
Matthew Honnibal
874a3cbb07
Add test for Issue #955
2017-04-23 17:57:01 +02:00
Matthew Honnibal
60703cede5
Ensure noun chunks can't be nested. Closes #955
2017-04-23 17:56:39 +02:00
Matthew Honnibal
c9ec24b257
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-23 17:07:46 +02:00
Matthew Honnibal
5d8af40445
Add test for Issue #999
2017-04-23 17:06:30 +02:00
Matthew Honnibal
4d2a659c52
Fix json dump for Python3
2017-04-23 17:05:53 +02:00
Matthew Honnibal
040751ad17
Remove xfail on Test #910
2017-04-23 16:28:55 +02:00
ines
3a9710f356
Pass dev_scores to print_progress correctly (resolves #1008)
...
Only read scores attribute if command is used with dev_data, otherwise
default dev_scores to empty dict.
2017-04-23 15:58:40 +02:00
Matthew Honnibal
1b12f342e4
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-20 17:03:11 +02:00
Matthew Honnibal
4eef200bab
Persist the actions within spacy.parser.cfg
2017-04-20 17:02:44 +02:00
ines
25c70b4cc5
Move fix_text to spacy.compat (see #1002)
2017-04-20 15:47:17 +02:00
Ines Montani
60b5243bee
Merge pull request #1002 from oroszgy/model_cli_fix
...
Fixes for the `model` CLI
2017-04-20 15:41:03 +02:00
Gyorgy Orosz
4a06a2572c
Using ftfy to handle strings with broken encoding.
2017-04-20 13:34:51 +02:00
Ines Montani
3800b29046
Merge pull request #1001 from recognai/master
...
Add SPACE to es tag map
2017-04-20 12:16:34 +02:00