Matthew Honnibal
d0e42f9275
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-03 15:30:32 -05:00
Matthew Honnibal
8a17b99b1c
Use NORM attribute, not LOWER
2017-06-03 15:30:16 -05:00
ines
4c643d74c5
Add norm exceptions to other Language classes
2017-06-03 22:29:21 +02:00
ines
fa7e576c57
Change order of exception dicts
2017-06-03 21:52:06 +02:00
Matthew Honnibal
3f5c85d8de
Reorder setting of lex attrs, to avoid clobbering
2017-06-03 14:47:55 -05:00
Matthew Honnibal
aeb7520133
Make norm use lower-case
2017-06-03 14:47:38 -05:00
Matthew Honnibal
de3954843e
Populate norm exceptions with lower-case
2017-06-03 14:47:12 -05:00
Matthew Honnibal
f6955a459c
Fix prev commit
2017-06-03 14:38:37 -05:00
Matthew Honnibal
468ca6c760
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-03 14:33:51 -05:00
Matthew Honnibal
c647a0d33e
Fix training counter for gold preprocessing
2017-06-03 14:33:39 -05:00
ines
e47eef5e03
Update German tokenizer exceptions and tests
2017-06-03 21:07:44 +02:00
ines
d77c2cc8bb
Add tests for English norm exceptions
2017-06-03 20:59:50 +02:00
ines
0d6fa8b241
Add German norm exceptions
2017-06-03 20:54:18 +02:00
ines
5bd311c77e
Fix update of norm exceptions
2017-06-03 20:54:09 +02:00
Matthew Honnibal
94e063ae2a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-03 13:31:40 -05:00
Matthew Honnibal
fea1144e6d
Set max batch size in evaluate
2017-06-03 13:31:33 -05:00
Matthew Honnibal
805495af27
Fix off-by-one in number of tags
2017-06-03 13:29:23 -05:00
Matthew Honnibal
e62f46d39f
Clarify gold.pyx slightly
2017-06-03 13:28:52 -05:00
Matthew Honnibal
43353b5413
Improve train CLI script
2017-06-03 13:28:20 -05:00
ines
746653880c
Add English norm exceptions to lex_attrs
2017-06-03 20:27:28 +02:00
ines
095eeeb12f
Update English tokenizer exceptions and add norms
2017-06-03 20:27:16 +02:00
ines
e5d426406a
Add base norm exceptions
2017-06-03 20:27:05 +02:00
ines
4c2bbc3ccc
Add add_lookups util function
2017-06-03 19:44:47 +02:00
ines
05fe6758a7
Set lexeme attributes for tokenizer special cases
2017-06-03 19:44:39 +02:00
ines
3152ee5ca2
Update serialization tests for tokenizer
2017-06-03 17:05:28 +02:00
ines
7c919aeb09
Make sure serializers and deserializers are ordered
2017-06-03 17:05:09 +02:00
ines
1ebd0d3f27
Add assert_packed_msg_equal util function
2017-06-03 17:04:30 +02:00
ines
de974f7bef
Add serializer tests for tokenizer
2017-06-03 13:26:34 +02:00
ines
0153b66a86
Return self in Tokenizer.from_bytes
2017-06-03 13:26:13 +02:00
ines
82154a1861
Add letter spacing to arrow label
2017-06-03 13:25:41 +02:00
ines
32c6f05de9
Adjust spacing and sizing in compact mode
2017-06-03 13:25:32 +02:00
ines
cc8c8617a4
Shut down displaCy server on KeyboardInterrupt
2017-06-03 13:24:56 +02:00
ines
70fbba7d08
Clone Doc to never merge punctuation on original Doc
2017-06-03 13:24:43 +02:00
ines
459a1e8470
Fix whitespace
2017-06-03 11:31:18 +02:00
ines
5109bba910
Port over fix from #1070
2017-06-03 11:31:11 +02:00
ines
d21459f87d
Update serializer tests
2017-06-02 21:42:26 +02:00
ines
6669583f4e
Use OrderedDict
2017-06-02 21:07:56 +02:00
ines
2f1025a94c
Port over Spanish changes from #1096
2017-06-02 19:09:58 +02:00
ines
d86e7cde93
Add entity recognizer to parser serialization tests
2017-06-02 18:40:06 +02:00
ines
0051c05964
Add tests for serializing parser
2017-06-02 18:37:19 +02:00
ines
fdd0923be4
Translate model=True in exclude to lower_model and upper_model
2017-06-02 18:37:07 +02:00
ines
cef547a9f0
Add serialization tests for tensorizer
2017-06-02 18:18:30 +02:00
ines
924c58bde3
Fix serialization of optional elements
2017-06-02 18:18:17 +02:00
ines
f74a45c1fe
Remove unnecessary argument
2017-06-02 18:17:46 +02:00
ines
43b4d63f85
Add serialization tests for tagger
2017-06-02 17:29:34 +02:00
ines
1b593bbd6d
Fix encoding on tagger serialization
2017-06-02 17:29:21 +02:00
Matthew Honnibal
5f4d328e2c
Fix serialization of tag_map in NeuralTagger
2017-06-02 10:18:37 -05:00
Matthew Honnibal
ed6f575e06
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-02 04:26:39 -05:00
ines
acd65c00f6
Add serialization tests for StringStore and Vocab
2017-06-02 10:57:42 +02:00
ines
41a6adf1f6
Initialise Vocab length correctly
2017-06-02 10:57:25 +02:00
ines
53b82f972a
Add strings to Vocab in init, instead of StringStore
2017-06-02 10:57:06 +02:00
ines
023f38bdd4
Fix return value of Vocab.from_bytes
2017-06-02 10:56:40 +02:00
ines
9692c98f57
Add test utils for temp file and temp dir
2017-06-02 10:56:09 +02:00
Matthew Honnibal
c650bc481c
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-01 13:03:57 -05:00
Matthew Honnibal
307d615c5f
Fix serialization for tagger when tag_map has changed
2017-06-01 12:18:36 -05:00
Matthew Honnibal
1d18cedae8
Fiddle with msgpack bytes vs unicode
2017-06-01 10:48:43 -05:00
ines
7a2380f617
Rename "nn_tagger" to "tagger"
2017-06-01 17:37:53 +02:00
ines
e5ae6ccf4e
Fix typo
2017-06-01 16:46:15 +02:00
ines
a3e4f91f4a
Only load vocab if it exists
2017-06-01 14:38:35 +02:00
Matthew Honnibal
d310b0aab3
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-01 04:58:03 -05:00
Matthew Honnibal
3ff7d7fcef
Merge for updated requirements
2017-06-01 04:57:47 -05:00
Matthew Honnibal
5eae3b9a1e
Fix to/from disk in tagger
2017-06-01 04:55:49 -05:00
ines
d5c8d2f5fd
Update about.py and increment version
2017-06-01 11:52:24 +02:00
Matthew Honnibal
4c97371051
Fixes for thinc 6.7
2017-06-01 04:22:16 -05:00
Matthew Honnibal
53d00a0371
Move weight serialization to Thinc
2017-06-01 03:04:36 -05:00
Matthew Honnibal
ae8010b526
Move weight serialization to Thinc
2017-06-01 02:56:12 -05:00
Gyorgy Orosz
f0c3b09242
More robust Hungarian tokenizer.
2017-05-31 22:28:40 +02:00
Matthew Honnibal
c8a58cfcf8
Fix Python2/3 load bug
2017-05-31 15:21:44 -05:00
Matthew Honnibal
99982684b0
Fix normalize_string_keys function
2017-05-31 14:08:16 -05:00
Matthew Honnibal
67ade63fc4
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 08:28:42 -05:00
Matthew Honnibal
490b38e6bb
Fix reference to thinc copy_array util
2017-05-31 08:25:21 -05:00
Matthew Honnibal
9805e0e369
Fix vocab pickling
2017-05-31 08:25:01 -05:00
Matthew Honnibal
6c51cd77b4
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 15:06:56 +02:00
Matthew Honnibal
8dfb9546f0
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 07:21:14 -05:00
Matthew Honnibal
480ef8bfc8
Add compat function to normalize dict keys
2017-05-31 07:14:29 -05:00
Matthew Honnibal
92f9e5cc9a
Silence env_opt, and fix serialization for GPU
2017-05-31 07:14:11 -05:00
Matthew Honnibal
0561df2a9d
Fix tokenizer serialization
2017-05-31 14:12:38 +02:00
Matthew Honnibal
4a398c15b7
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 13:44:16 +02:00
Matthew Honnibal
097ab9c6e4
Fix transition system to/from disk
2017-05-31 13:44:00 +02:00
Matthew Honnibal
b1469d3360
Fix string serialisation
2017-05-31 13:43:44 +02:00
Matthew Honnibal
e9419072e7
Fix tokenizer serialisation
2017-05-31 13:43:31 +02:00
Matthew Honnibal
33e5ec737f
Fix to/from disk methods
2017-05-31 13:43:10 +02:00
ines
5e1c361270
Update tests README with info on model tests
2017-05-31 12:22:58 +02:00
Matthew Honnibal
fe28602f2e
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 11:43:56 +02:00
Matthew Honnibal
66af019d5d
Fix serialization of tokenizer
2017-05-31 11:43:40 +02:00
Ines Montani
e6cf3c7e1c
Merge pull request #1093 from oroszgy/hu_emoji_fix
...
Fixed emoji handling for Hungarian
2017-05-31 11:33:24 +02:00
Matthew Honnibal
e98eff275d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 10:29:15 +02:00
Matthew Honnibal
53a3824334
Fix mistake in ner feature
2017-05-31 03:01:02 +02:00
Matthew Honnibal
8a693c2605
Write binary file during training
2017-05-31 02:59:18 +02:00
Matthew Honnibal
498ad85309
Try using tensor for vector/similarity methods
2017-05-30 23:35:17 +02:00
Matthew Honnibal
a131981f3b
Work on vectors
2017-05-30 23:34:50 +02:00
Matthew Honnibal
6937e311a4
Update doc tests
2017-05-30 23:34:23 +02:00
Matthew Honnibal
cc911feab2
Fix bug in NER state
2017-05-30 22:12:19 +02:00
Gyorgy Orosz
8c0b4b850e
Fixed emoji handling for Hungarian
2017-05-30 21:34:46 +02:00
Matthew Honnibal
be4a640f0c
Fix arc eager label costs for uint64
2017-05-30 20:37:58 +02:00
Matthew Honnibal
b127645afc
Fix test_misc merge conflict
2017-05-29 18:31:44 -05:00
Matthew Honnibal
e0e8eae7c7
Tweak package test
2017-05-29 18:30:42 -05:00
Matthew Honnibal
11840ff5dd
Store tag map before normalizing props
2017-05-29 17:53:48 -05:00
Matthew Honnibal
b92a89f87b
Make it easier to reference embedding tables
2017-05-29 17:53:29 -05:00
Matthew Honnibal
293d1b425b
Serialize in consistent order
2017-05-29 17:53:06 -05:00
Matthew Honnibal
9bf22a94aa
Fix tag set serialisation
2017-05-29 17:52:36 -05:00
Matthew Honnibal
2a061e2777
Fix serialisation, for reals this time
2017-05-29 17:52:08 -05:00
ines
20a7003c0d
Update model fixtures and reorganise tests
2017-05-29 22:14:31 +02:00
ines
795fe43a4d
Add load_test_model function with importorskip()
...
Loads model only if it can be imported, i.e. if it's installed as a
package.
2017-05-29 22:11:31 +02:00
ines
ad3c8b3ad9
Fix formatting
2017-05-29 22:10:50 +02:00
ines
6e3937efc5
Check for arguments of model markers to specify models to test
...
Lets user set --models --en for only English models
2017-05-29 22:10:16 +02:00
Matthew Honnibal
35d981241f
Fix model deserialization
2017-05-29 14:46:31 -05:00
Matthew Honnibal
5b29f227ae
Fix serialization
2017-05-29 14:35:53 -05:00
Matthew Honnibal
1e6df0a2a1
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-29 14:30:12 -05:00
ines
08382f21e3
Pass model meta to nlp object in load_model
2017-05-29 20:44:11 +02:00
ines
6145fe6a93
Catch all kwargs on Language
2017-05-29 20:43:48 +02:00
ines
0d7d50fe22
Add __version__ to __init__.py
2017-05-29 20:43:24 +02:00
Matthew Honnibal
6522ea6c8b
More serialization fixes. Still broken
2017-05-29 13:23:47 -05:00
Matthew Honnibal
9c9ee24411
Fix broken lambda scoping in Python 2
2017-05-29 13:23:28 -05:00
Matthew Honnibal
f1acdaab55
Fix serialization of weight offsets
2017-05-29 13:23:11 -05:00
Matthew Honnibal
c044e9c21c
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-29 08:41:02 -05:00
Matthew Honnibal
aa4c33914b
Work on serialization
2017-05-29 08:40:45 -05:00
ines
9e83a17e95
Use new model templates
2017-05-29 15:27:24 +02:00
ines
567485a818
Fix and document model loading with pipeline and overrides
2017-05-29 14:10:10 +02:00
Matthew Honnibal
deac7eb01c
Fix for serialization
2017-05-29 13:54:18 +02:00
Matthew Honnibal
04c32aa091
Fix for serialization
2017-05-29 13:53:32 +02:00
Matthew Honnibal
a1960c2d09
Fix for serialization
2017-05-29 13:47:42 +02:00
Matthew Honnibal
7b06bb896e
Fix for serialization
2017-05-29 13:42:55 +02:00
Matthew Honnibal
74235587ef
Fix to serialization
2017-05-29 13:40:31 +02:00
Matthew Honnibal
59f355d525
Fixes for serialization
2017-05-29 13:38:20 +02:00
Matthew Honnibal
920887f4e4
Specify order of vocab deserialization
2017-05-29 13:04:40 +02:00
Matthew Honnibal
f4aafca222
Merge changes to test_misc
2017-05-29 12:26:02 +02:00
Matthew Honnibal
a318f0cae1
Add to/from disk/bytes methods for tokenizer
2017-05-29 12:24:41 +02:00
Matthew Honnibal
ff26aa6c37
Work on to/from bytes/disk serialization methods
2017-05-29 11:45:45 +02:00
ines
df920ba0e7
Add tests for displaCy and util functions and fix util typo
2017-05-29 10:51:19 +02:00
ines
c5714d4fb2
xfail matcher test for now until setting norm via Span.merge works
2017-05-29 10:51:02 +02:00
Matthew Honnibal
6b019b0540
Update to/from bytes methods
2017-05-29 10:14:20 +02:00
Matthew Honnibal
c91b121aeb
Move serialization functions to util
2017-05-29 10:13:42 +02:00
Matthew Honnibal
1fa2bfb600
Add model_to_bytes and model_from_bytes helpers. Probably belong in thinc.
2017-05-29 09:27:04 +02:00
Matthew Honnibal
6dad4117ad
Work on serialization for models
2017-05-29 01:37:57 +02:00
ines
7b1ddcc04d
Add test for vocab serialization
2017-05-29 01:09:52 +02:00
ines
00b2094dc3
Fix typos, long integers and tests
2017-05-29 01:09:52 +02:00
ines
804dbb8d25
Add StringStore test for API docs
2017-05-29 01:09:52 +02:00
Matthew Honnibal
6cd5730ee7
Fix lex struct setters for strings
2017-05-29 01:05:09 +02:00
Matthew Honnibal
2edd96ce47
Draft Vocab to/from disk/bytes
2017-05-28 23:34:12 +02:00
Matthew Honnibal
4ddff020c3
Fix compile error
2017-05-28 23:30:40 +02:00
Matthew Honnibal
6d3caeadd2
Fix type check for long
2017-05-28 23:22:45 +02:00
Matthew Honnibal
92dbf28c1e
Hack a fixture in the vectors tests, for xfail
2017-05-28 20:28:32 +02:00
Matthew Honnibal
9239f06ed3
Fix German noun chunks iterator
2017-05-28 20:13:03 +02:00
Matthew Honnibal
fd9b6722a9
Fix noun chunks iterator for new stringstore
2017-05-28 20:12:10 +02:00
ines
414193e9ba
Update docs to reflect StringStore changes
2017-05-28 18:19:11 +02:00
Matthew Honnibal
7996d21717
Fixes for new StringStore
2017-05-28 11:09:27 -05:00
Matthew Honnibal
8a24c60c1e
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-28 08:12:05 -05:00
Matthew Honnibal
bc97bc292c
Fix __call__ method
2017-05-28 08:11:58 -05:00
Matthew Honnibal
5cf47b847b
Handle iob with no tag in converter
2017-05-28 08:11:39 -05:00
Matthew Honnibal
fe11564b8e
Finish stringstore change. Also xfail vectors tests
2017-05-28 15:10:22 +02:00
Matthew Honnibal
b007a2b0d3
Update stringstore tests
2017-05-28 14:08:09 +02:00
Matthew Honnibal
84e66ca6d4
WIP on stringstore change. 27 failures
2017-05-28 14:06:40 +02:00
Matthew Honnibal
fe4a746300
Accommodate symbols in new string scheme
2017-05-28 13:03:16 +02:00
Matthew Honnibal
f51e6a6c16
Adjust lexeme sizing for attr_t being 64 bit
2017-05-28 12:51:09 +02:00
Matthew Honnibal
a5606c3eda
Work on changing StringStore to return hashes.
2017-05-28 12:36:27 +02:00
Matthew Honnibal
39293ab2ee
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-28 11:46:57 +02:00
Matthew Honnibal
dd052572d4
Update arc eager for SBD changes
2017-05-28 11:46:51 +02:00
Matthew Honnibal
3ea98e2043
Remove vector member from lexeme
2017-05-28 11:46:24 +02:00
Matthew Honnibal
2445707f3c
Re-delegate vectors to vocab
2017-05-28 11:46:10 +02:00
Matthew Honnibal
6863d01361
Remove vectors from lexeme
2017-05-28 11:45:48 +02:00
Matthew Honnibal
15f6efc127
Remove vectors from vocab
2017-05-28 11:45:32 +02:00
Matthew Honnibal
c1263a844b
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-27 18:32:57 -05:00
Matthew Honnibal
9e711c3476
Divide d_loss by batch size
2017-05-27 18:32:46 -05:00
Matthew Honnibal
b082f76494
Randomize pipeline order during training
2017-05-27 18:32:21 -05:00
Matthew Honnibal
a1d4c97fb7
Improve correctness of minibatching
2017-05-27 17:59:00 -05:00
ines
84189c1cab
Add 'xx' language ID for multi-language support
...
Allows models to specify their language ID as 'xx'.
2017-05-28 00:58:59 +02:00
ines
33e332e67c
Remove unused export
2017-05-28 00:57:59 +02:00
ines
c1983621fb
Update util functions for model loading
2017-05-28 00:22:40 +02:00
ines
c8543c8237
Fix formatting and docstrings and remove deprecated function
2017-05-28 00:22:40 +02:00
Matthew Honnibal
49235017bf
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-27 16:34:28 -05:00
Matthew Honnibal
7ebd26b8aa
Use ordered dict to specify transitions
2017-05-27 15:52:20 -05:00
Matthew Honnibal
3eea5383a1
Add move_names property to parser
2017-05-27 15:51:55 -05:00
Matthew Honnibal
8de9829f09
Don't overwrite model in initialization, when loading
2017-05-27 15:50:40 -05:00
Matthew Honnibal
99316fa631
Use ordered dict to specify actions
2017-05-27 15:50:21 -05:00
Matthew Honnibal
655ca58c16
Clarifying change to StateC.clone
2017-05-27 15:49:37 -05:00
Matthew Honnibal
5e4312feed
Evaluate loaded class, to ensure save/load works
2017-05-27 15:47:02 -05:00
Matthew Honnibal
34bbad8e0e
Add __reduce__ methods on parser subclasses. Fixes pickling.
2017-05-27 15:46:06 -05:00
Matthew Honnibal
7cc9c3e9a6
Fix convert CLI
2017-05-27 15:44:42 -05:00
ines
1203959625
Add pipeline setting to meta.json generator
2017-05-27 20:02:01 +02:00
ines
086a06e7d7
Fix CLI docstrings and add command as first argument
...
Workaround for Plac
2017-05-27 20:01:46 +02:00
ines
a8e58e04ef
Add symbols class to punctuation rules to handle emoji (see #1088)
...
Currently doesn't work for Hungarian, because of conflicts with the
custom punctuation rules. Also doesn't take multi-character emoji like
👩🏽💻 into account.
2017-05-27 17:57:10 +02:00
Matthew Honnibal
dc07d72d80
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-27 08:20:40 -05:00
Matthew Honnibal
de13fe0305
Remove length cap on sentences
2017-05-27 08:20:32 -05:00
Matthew Honnibal
73a643d32a
Don't randomise pipeline for training, and don't update if no gradient
2017-05-27 08:20:13 -05:00
Matthew Honnibal
3d22fcaf0b
Return None from parser if there are no annotations
2017-05-26 14:02:59 -05:00
Matthew Honnibal
d06f235fc9
Fix conflict on convert.py
2017-05-26 11:33:29 -05:00
Matthew Honnibal
2e587c6417
Export iob_to_biluo utility
2017-05-26 11:32:55 -05:00
Matthew Honnibal
2b3b937a04
Fix converter CLI
2017-05-26 11:32:41 -05:00
Matthew Honnibal
5a87bcf35f
Fix converters
2017-05-26 11:32:34 -05:00
Matthew Honnibal
8af3100143
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-26 11:31:41 -05:00
Matthew Honnibal
3d5a536eaa
Improve efficiency of parser batching
2017-05-26 11:31:23 -05:00
Matthew Honnibal
daac3e3573
Always shuffle gold data, and support length cap
2017-05-26 11:30:52 -05:00
Matthew Honnibal
d65f99a720
Improve model saving in train script
2017-05-26 05:52:09 -05:00
ines
51882c4984
Fix formatting
2017-05-26 12:37:45 +02:00
ines
353f0ef8d7
Use disable argument (list) for serialization
2017-05-26 12:33:54 +02:00
Matthew Honnibal
22d7b448a5
Fix convert command
2017-05-25 19:47:12 -05:00
Matthew Honnibal
dbf2a4cf57
Update all models on each epoch
2017-05-25 19:46:56 -05:00
Matthew Honnibal
faff1c23fb
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-25 17:16:10 -05:00
Matthew Honnibal
82b11b0320
Remove print statement
2017-05-25 17:15:59 -05:00
Matthew Honnibal
80cf42e33b
Fix compounding and decaying utils
2017-05-25 17:15:39 -05:00
Matthew Honnibal
df8015f05d
Tweaks to train script
2017-05-25 17:15:24 -05:00
Matthew Honnibal
3a6e59cc53
Add minibatch function in spacy.gold
2017-05-25 17:15:09 -05:00
Matthew Honnibal
702fe74a4d
Clean up spacy.cli.train
2017-05-25 16:16:30 -05:00
Matthew Honnibal
b9cea9cd93
Add compounding and decaying functions
2017-05-25 16:16:10 -05:00
Matthew Honnibal
2cb7cc2db7
Remove commented code from parser
2017-05-25 14:55:09 -05:00
Matthew Honnibal
f403c2cd5f
Add env opts for optimizer
2017-05-25 11:19:26 -05:00
Matthew Honnibal
c245ff6b27
Rebatch parser inputs, with mid-sentence states
2017-05-25 11:18:59 -05:00
Matthew Honnibal
679efe79c8
Make parser update less hacky
2017-05-25 06:49:00 -05:00
Matthew Honnibal
8500d9b1da
Only train one task per iter, holding grads
2017-05-25 06:47:42 -05:00
Matthew Honnibal
b27c587800
Fix pieces argument to PrecomputedMaxout
2017-05-25 06:46:59 -05:00
Matthew Honnibal
e1cb5be0c7
Adjust dropout, depth and multi-task in parser
2017-05-24 20:11:41 -05:00
Matthew Honnibal
e6cc927ab1
Rearrange multi-task learning
2017-05-24 20:10:54 -05:00
Matthew Honnibal
135a13790c
Disable gold preprocessing
2017-05-24 20:10:20 -05:00
Matthew Honnibal
467bbeadb8
Add hidden layers for tagger
2017-05-24 20:09:51 -05:00
ines
66088851dc
Add Doc.to_disk() and Doc.from_disk() methods
2017-05-24 11:58:17 +02:00
Matthew Honnibal
620df0414f
Fix dropout in parser
2017-05-23 15:20:45 -05:00
Matthew Honnibal
5b67bcbee0
Increase default embed size to 7500
2017-05-23 15:20:16 -05:00
Matthew Honnibal
48eef94f92
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 18:47:32 +02:00
Matthew Honnibal
d44b1eafc4
Fix conflict artefacts
2017-05-23 18:47:11 +02:00
Matthew Honnibal
01e59e4e6e
Add Token.sent_start property, re Issue #235
2017-05-23 18:41:11 +02:00
Matthew Honnibal
4917cbb484
Include sent_start test
2017-05-23 18:40:37 +02:00
Matthew Honnibal
d68dd1f251
Add SENT_START attribute, for custom sentence boundary detection
2017-05-23 18:37:58 +02:00
Matthew Honnibal
8026c183d0
Add hacky logic to accelerate depth=0 case in parser
2017-05-23 11:06:49 -05:00
Matthew Honnibal
e7d3159d91
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 05:58:17 -05:00
Matthew Honnibal
a8b6d11c5b
Support optional maxout layer
2017-05-23 05:58:07 -05:00
Matthew Honnibal
c55b8fa7c5
Fix bugs in parse_batch
2017-05-23 05:57:52 -05:00
ines
fb0ff0272f
xfail neural parser tests for now and remove test for deprecated method
2017-05-23 12:40:37 +02:00
Matthew Honnibal
964707d795
Restore support for deeper networks in parser
2017-05-23 05:31:13 -05:00
Matthew Honnibal
e27262f431
Go back to previous matcher signature, with on_match positional
2017-05-23 04:37:40 -05:00
Matthew Honnibal
5418bcf5d7
Resolve conflict on test
2017-05-23 04:37:16 -05:00
ines
e6acd3bbf2
Fix matcher tests and matcher docs
2017-05-23 11:36:02 +02:00
ines
d0c6d4f76d
Fix formatting
2017-05-23 11:32:00 +02:00
Matthew Honnibal
f0bcc0bd8d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 04:29:28 -05:00
Matthew Honnibal
9adfe9e8fc
Don't hold gradient updates in language -- let the parser decide how to batch the updates.
2017-05-23 04:29:10 -05:00
Matthew Honnibal
6b918cc58e
Support making updates periodically during training
2017-05-23 04:23:29 -05:00
Matthew Honnibal
3f725ff7b3
Roll back changes to parser update
2017-05-23 04:23:05 -05:00
Matthew Honnibal
3959d778ac
Revert "Revert "WIP on improving parser efficiency""
...
This reverts commit 532afef4a8.
2017-05-23 03:06:53 -05:00
Matthew Honnibal
532afef4a8
Revert "WIP on improving parser efficiency"
...
This reverts commit bdaac7ab44.
2017-05-23 03:05:25 -05:00
Matthew Honnibal
bdaac7ab44
WIP on improving parser efficiency
2017-05-23 02:59:31 -05:00
Matthew Honnibal
8a9e318deb
Put the parsing loop in a nogil prange block
2017-05-22 17:58:12 -05:00
ines
a23f487b06
Tidy up displaCy and add "manual" option
...
Also don't require title in EntityRenderer
2017-05-22 18:48:20 +02:00
Matthew Honnibal
0264447c4d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 10:41:56 -05:00
Matthew Honnibal
6e8dce2c05
Fix train command line args
2017-05-22 10:41:39 -05:00
Matthew Honnibal
a7ee63c0ac
Fix labeller loss for unseen labels
2017-05-22 10:41:20 -05:00
Matthew Honnibal
c9760b2104
Support sentence limits in GoldCorpus
2017-05-22 10:40:46 -05:00
Matthew Honnibal
e2136232f9
Exclude states with no matching gold annotations from parsing
2017-05-22 10:30:12 -05:00
Matthew Honnibal
83ffd16474
Fix offset calculation for other negative values
2017-05-22 08:00:53 -05:00
ines
b3c7ee0148
Fix tests and use the new Matcher API
2017-05-22 13:54:20 +02:00
Matthew Honnibal
f00f821496
Fix pseudoprojectivity->nonproj
2017-05-22 06:14:42 -05:00
Matthew Honnibal
ae8cf70dc1
Fix CLI train signature
2017-05-22 06:13:39 -05:00
Matthew Honnibal
187f370734
Update tests for matcher changes
2017-05-22 12:59:50 +02:00
Matthew Honnibal
5d59e74cf6
PseudoProjectivity->nonproj
2017-05-22 05:49:53 -05:00
Matthew Honnibal
7e2cdc0c81
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 12:39:34 +02:00
Matthew Honnibal
70a8c531cd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 05:39:18 -05:00
Matthew Honnibal
2f78413a02
PseudoProjectivity->nonproj
2017-05-22 05:39:03 -05:00
Matthew Honnibal
89ebc5c3cd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 12:38:15 +02:00
Matthew Honnibal
d8bb5bb959
Implement StringStore serialization, and update tests
2017-05-22 12:38:00 +02:00
ines
54f04a9fe0
Update API docs with changes in spacy.gold and spacy.language
2017-05-22 12:29:30 +02:00
ines
b5fb43fdd8
Allow sys.exit status as exits keyword arg in util.prints()
2017-05-22 12:29:15 +02:00
ines
fc3ec733ea
Reduce complexity in CLI
...
Remove now redundant model command and move plac annotations to cli
files
2017-05-22 12:28:58 +02:00
Matthew Honnibal
b45b4aa392
PseudoProjectivity --> nonproj
2017-05-22 05:17:44 -05:00
Matthew Honnibal
aae97f00e9
Fix nonproj import
2017-05-22 05:15:06 -05:00
Matthew Honnibal
9262fc4829
Fix syntax error
2017-05-22 05:14:59 -05:00
Matthew Honnibal
93a042253b
Make GoldParse attributes writeable
2017-05-22 04:51:08 -05:00
Matthew Honnibal
2a5eb9f61e
Make nonproj methods top-level functions, instead of class methods
2017-05-22 04:51:08 -05:00
Matthew Honnibal
c998776c25
Make single array for features, to reduce GPU copies
2017-05-22 04:51:08 -05:00
Matthew Honnibal
bc2294d7f1
Add support for fiddly hyper-parameters to train func
2017-05-22 04:51:08 -05:00
Matthew Honnibal
80e19a2399
Simplify CLI implementation for subcommands. Remove model command.
2017-05-22 04:51:08 -05:00
Matthew Honnibal
33e2222839
Remove unused code in deprojectivize
2017-05-22 04:51:08 -05:00
Matthew Honnibal
4e0988605a
Pass through non-projective=True
2017-05-22 04:51:08 -05:00
Matthew Honnibal
025d9bbc37
Fix handling of non-projective deps
2017-05-22 04:51:08 -05:00
Matthew Honnibal
5738d373d5
Add deprojectivize to pipeline
2017-05-22 04:51:08 -05:00
Matthew Honnibal
1b5fa68996
Do pseudo-projective pre-processing for parser
2017-05-22 04:51:08 -05:00
Matthew Honnibal
1d5d9838a2
Fix action collection for parser
2017-05-22 04:51:08 -05:00
Matthew Honnibal
8d1e64be69
Add experimental NeuralLabeller
2017-05-22 04:51:08 -05:00
Matthew Honnibal
9b1b0742fd
Fix prediction for tok2vec
2017-05-22 04:51:08 -05:00
Matthew Honnibal
f13d6c7359
Support gold preprocessing and single gold files
2017-05-22 04:51:08 -05:00
Matthew Honnibal
e14533757b
Use averaged params for evaluation
2017-05-22 04:51:08 -05:00
Matthew Honnibal
7811d97339
Refactor CLI
2017-05-22 04:51:08 -05:00
Matthew Honnibal
5db89053aa
Merge docstrings
2017-05-21 13:46:23 -05:00
Matthew Honnibal
432b3499b3
Fix memory leak
2017-05-21 13:38:46 -05:00
Matthew Honnibal
59fbfb3829
Remove train.py -- functions now in GoldCorpus and Language
2017-05-21 09:08:27 -05:00
Matthew Honnibal
8904814c0e
Add missing import
2017-05-21 09:07:56 -05:00
Matthew Honnibal
baf3ef0ddc
Remove import of removed train_config script
2017-05-21 09:07:34 -05:00
Matthew Honnibal
4c9202249d
Refactor training, to fix memory leak
2017-05-21 09:07:06 -05:00
Matthew Honnibal
4803b3b69e
Add GoldCorpus class, to manage data streaming
2017-05-21 09:06:17 -05:00
Matthew Honnibal
180e5afede
Fix tokvecs flattening in pipeline
2017-05-21 09:05:34 -05:00
Matthew Honnibal
0731971bfc
Add itershuffle utility function. Maybe belongs in thinc
2017-05-21 09:05:05 -05:00
ines
2c5cfe8bbf
Update docstrings and API docs for StringStore
2017-05-21 14:18:58 +02:00
ines
251346b59f
Fix typos and formatting
2017-05-21 14:18:46 +02:00
ines
075f5ff87a
Update docstrings and API docs for GoldParse
2017-05-21 13:53:46 +02:00
ines
99b631617d
Reformat docstrings
2017-05-21 13:32:15 +02:00
ines
885e82c9b0
Update docstrings and remove deprecated load classmethod
2017-05-21 13:27:52 +02:00
ines
c5a653fa48
Update docstrings and API docs for Tokenizer
2017-05-21 13:18:14 +02:00
ines
f216422ac5
Remove deprecated load classmethod
2017-05-21 13:18:01 +02:00
ines
d82ae9a585
Change "function" to "callable" in docs
2017-05-21 13:17:40 +02:00
ines
3871157d84
Update spacy.util documentation
2017-05-21 01:12:09 +02:00
ines
0c6c65aa3c
Improve messaging if model linking fails after download
2017-05-21 00:28:37 +02:00
Matthew Honnibal
3b7c108246
Pass tokvecs through as a list, instead of concatenated. Also fix padding
2017-05-20 13:23:32 -05:00
ines
924e8506de
Move Defaults subclass to module scope (necessary for pickling)
2017-05-20 19:02:27 +02:00
Matthew Honnibal
d52b65aec2
Revert "Move to contiguous buffer for token_ids and d_vectors"
...
This reverts commit 3ff8c35a79.
2017-05-20 11:26:23 -05:00
ines
27de0834b2
Update docstrings and API docs for Lexeme
2017-05-20 15:13:42 +02:00
ines
7ed8a92ed1
Update docstrings and API docs for Token
2017-05-20 15:13:33 +02:00
ines
4ed6a36622
Update docstrings and API docs for Matcher
2017-05-20 14:43:10 +02:00
ines
39f36539f6
Update docstrings and API docs for Matcher
2017-05-20 14:32:34 +02:00
ines
c00ff257be
Update docstrings and API docs for Matcher
2017-05-20 14:26:10 +02:00
ines
790435e51c
Update docstrings
2017-05-20 14:05:07 +02:00
ines
f0cc642bb9
Update docstrings and API docs for Vocab
2017-05-20 14:00:41 +02:00
Matthew Honnibal
ce9234f593
Update Matcher API
2017-05-20 13:54:53 +02:00
Matthew Honnibal
b272890a8c
Try to move parser to simpler PrecomputedAffine class. Currently broken -- maybe the previous change
2017-05-20 06:40:10 -05:00
ines
e39ad78267
Resolve model name properly in cli.info
...
Use util.resolve_model_path() to also allow package names and paths.
2017-05-20 12:24:40 +02:00
Matthew Honnibal
3ff8c35a79
Move to contiguous buffer for token_ids and d_vectors
2017-05-20 04:17:30 -05:00
Matthew Honnibal
8b04b0af9f
Remove freqs from transition_system
2017-05-20 02:20:48 -05:00
Matthew Honnibal
61fe55efba
Move EnglishDefaults class out of English
2017-05-20 02:18:19 -05:00
Matthew Honnibal
a1ba20e2b1
Fix over-run on parse_batch
2017-05-19 18:57:30 -05:00
ines
1d4d3d0ecd
Add TODO
2017-05-20 01:38:04 +02:00
Matthew Honnibal
7ee1827af0
Disable data caching in parser
2017-05-19 18:17:11 -05:00
Matthew Honnibal
e84de028b5
Remove 'rebatch' op, and remove min-batch cap
2017-05-19 18:16:36 -05:00
Matthew Honnibal
3376d4d6e8
Update the train script, fixing GPU memory leak
2017-05-19 18:15:50 -05:00
Matthew Honnibal
836fe1d880
Update neural net tests
2017-05-19 18:11:29 -05:00
ines
fe5d8819ea
Update Matcher docstrings and API docs
2017-05-19 21:47:06 +02:00
Matthew Honnibal
08766240c3
Add incomplete iob converter
2017-05-19 13:27:51 -05:00
Matthew Honnibal
c12ab47a56
Remove state argument in pipeline. Other changes
2017-05-19 13:26:36 -05:00
Matthew Honnibal
66ea9aebe7
Remove the state argument from Language
2017-05-19 13:25:42 -05:00
Matthew Honnibal
09a877886b
WIP on iob converter
2017-05-19 13:24:39 -05:00
ines
a804045597
Use is_ancestor instead of deprecated is_ancestor_of
2017-05-19 20:23:40 +02:00
Matthew Honnibal
8d5e6d9f4f
Rename no_ner arg to no_entities
2017-05-19 13:23:11 -05:00
ines
e9e62b01b0
Update docstrings and API docs for Token
2017-05-19 18:47:56 +02:00
ines
62ceec4fc6
Update docstrings and API docs for Span
2017-05-19 18:47:46 +02:00
ines
23f9a3ccc8
Update docstrings and API docs for Doc
2017-05-19 18:47:39 +02:00
ines
2c8c9dc0c9
Update docstrings and API docs for Language
2017-05-19 18:47:24 +02:00
ines
0791f0aae6
Update docstrings and API docs for Span class
2017-05-19 00:31:31 +02:00
ines
8455cb1327
Update docstring for Doc.__getitem__
2017-05-19 00:30:51 +02:00
ines
0fc05e54e4
Document TokenVectorEncoder
2017-05-19 00:00:02 +02:00
ines
b687ad109d
Update docstrings and API docs for Doc class
2017-05-18 23:59:44 +02:00
ines
d42bc16868
Update docstrings and API docs for Language class
2017-05-18 23:57:38 +02:00
ines
593361ee3c
Update docstrings for Span class
2017-05-18 22:17:41 +02:00
ines
b87066ff10
Update docstrings and API docs for Doc class
2017-05-18 22:17:41 +02:00
Matthew Honnibal
238be0f16a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-18 08:32:22 -05:00
Matthew Honnibal
c214c0decb
Improve env_opt reporting
2017-05-18 08:32:03 -05:00
Matthew Honnibal
bbb59e371c
Fix GPU evaluation
2017-05-18 08:31:15 -05:00
Matthew Honnibal
c2c825127a
Fix use_params and pipe methods
2017-05-18 08:30:59 -05:00
Matthew Honnibal
ca70b08661
Fix GPU training and evaluation
2017-05-18 08:30:33 -05:00
ines
489d2fb4ba
Add is_in_jupyter() helper for displaCy (see #1058)
2017-05-18 14:13:14 +02:00
ines
abf0188b0a
Move cupy and CudaStream to compat
2017-05-18 14:12:45 +02:00
ines
33decd85b6
Reorganise and explicitly state what's importable
2017-05-18 14:12:31 +02:00
Matthew Honnibal
a438cef8c5
Fix significant bug in feature calculation -- off by 1
2017-05-18 06:21:32 -05:00
Matthew Honnibal
fc8d3a112c
Add util.env_opt support: Can set hyper params through environment variables.
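
As an illustration of setting hyperparameters through environment variables, here is a minimal sketch. It is not spaCy's actual `util.env_opt` code: the helper name `read_env_opt`, its signature, and the casting rules are assumptions made for this example.

```python
import os


def read_env_opt(name, default, environ=os.environ):
    """Hypothetical helper: read a hyperparameter from the environment.

    Falls back to `default` when the variable is unset, and otherwise
    casts the raw string to the type of `default`.
    """
    raw = environ.get(name)
    if raw is None:
        return default
    if default is None:
        # No type information available, so return the raw string.
        return raw
    if isinstance(default, bool):
        # bool("false") would be True, so booleans are parsed explicitly.
        return raw.lower() in ("1", "true", "yes")
    return type(default)(raw)
```

With this sketch, `read_env_opt("LEARN_RATE", 0.001)` returns `0.001` unless a `LEARN_RATE` variable is set, in which case its value is parsed as a float.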
2017-05-18 04:36:53 -05:00
Matthew Honnibal
d2626fdb45
Fix name error in nn parser
2017-05-18 04:31:01 -05:00
Matthew Honnibal
b460533827
Bug fixes to pipeline
2017-05-18 04:29:51 -05:00
Matthew Honnibal
8815507f8e
Move SpanishDefaults out of Language class, for pickle
2017-05-18 04:28:51 -05:00
Matthew Honnibal
2713041571
Fix GPU usage in Language
2017-05-18 04:25:19 -05:00
Matthew Honnibal
711ad5edc4
Cache features in doc2feats
2017-05-18 04:22:20 -05:00
Matthew Honnibal
39ea38c4b1
Add option to use gpu to spacy train
2017-05-18 04:21:49 -05:00
Matthew Honnibal
a1d8e420b5
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-17 08:00:04 -05:00
Matthew Honnibal
edfea3a513
Fix progress bar
2017-05-17 14:59:37 +02:00
Matthew Honnibal
0b7fd67408
Fix style check in displacy
2017-05-17 07:57:24 -05:00
Matthew Honnibal
55dab77de8
Add conversion rule for .conll
2017-05-17 13:13:48 +02:00
Matthew Honnibal
692bd2a186
Bug fix to tagger: wasn't backpropagating to token vectors
2017-05-17 13:13:14 +02:00
Matthew Honnibal
877f83807f
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-17 12:09:29 +02:00
Matthew Honnibal
793430aa7a
Get spaCy train command working with neural network
...
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Matthew Honnibal
3bf4a28d8d
Use tag in CoNLL converter, not POS
2017-05-17 12:04:33 +02:00
ines
1a05078c79
Add language-specific syntax iterators to en and de
2017-05-17 12:04:03 +02:00
Matthew Honnibal
c9a5d5d24b
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-16 16:22:05 +02:00
Matthew Honnibal
8cf097ca88
Redesign training to integrate NN components
...
* Obsolete .parser, .entity etc names in favour of .pipeline
* Components no longer create models on initialization
* Models created by loading method (from_disk(), from_bytes() etc), or
.begin_training()
* Add .predict(), .set_annotations() methods in components
* Pass state through pipeline, to allow components to share information
more flexibly.
2017-05-16 16:17:30 +02:00
Matthew Honnibal
221b4c1ee8
Fix test for Python 3
2017-05-16 13:06:30 +02:00
Matthew Honnibal
5211645af3
Get data flowing through pipeline. Needs redesign
2017-05-16 11:21:59 +02:00
Matthew Honnibal
1d7c18e58a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-15 21:53:47 +02:00
Matthew Honnibal
a9edb3aa1d
Improve integration of NN parser, to support unified training API
2017-05-15 21:53:27 +02:00
ines
98354be150
Only get user_data if it exists on doc
2017-05-15 13:39:47 +02:00
ines
c33bdeb564
Use uppercase for entity types
2017-05-15 01:24:57 +02:00
ines
4aaa607b8d
Add xmlns:xlink so SVGs are rendered properly as individual files
2017-05-14 19:54:13 +02:00
ines
9dd13cd76a
Update docstrings
2017-05-14 19:30:47 +02:00
ines
a04550605a
Add Jupyter notebook support (see #1058)
2017-05-14 18:39:01 +02:00
ines
c31792aaec
Add displaCy visualisers (see #1058)
2017-05-14 17:50:23 +02:00
ines
b462076d80
Merge load_lang_class and get_lang_class
2017-05-14 01:31:10 +02:00
ines
36bebe7164
Update docstrings
2017-05-14 01:30:29 +02:00
Matthew Honnibal
4b9d69f428
Merge branch 'v2' into develop
...
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module
Currently the two parsers live side-by-side, until we figure out how to
organize them.
2017-05-14 01:10:23 +02:00
Matthew Honnibal
5cac951a16
Move new parser to nn_parser.pyx, and restore old parser, to make tests pass.
2017-05-14 00:55:01 +02:00
Matthew Honnibal
f8c02b4341
Remove cupy imports from parser, so it can work on CPU
2017-05-14 00:37:53 +02:00
Matthew Honnibal
613ba79e2e
Fiddle with sizings for parser
2017-05-13 17:20:23 -05:00
Matthew Honnibal
e6d71e1778
Small fixes to parser
2017-05-13 17:19:04 -05:00
Matthew Honnibal
188c0f6949
Clean up unused import
2017-05-13 17:18:27 -05:00
Matthew Honnibal
f85c8464f7
Draft support of regression loss in parser
2017-05-13 17:17:27 -05:00
ines
1694c24e52
Add docstrings, error messages and fix consistency
2017-05-13 21:22:49 +02:00
ines
ee7dcf65c9
Fix expand_exc to make sure it returns combined dict
2017-05-13 21:22:25 +02:00
ines
824d09bb74
Move resolve_load_name to deprecated
2017-05-13 21:21:47 +02:00
ines
a4a37a783e
Remove import from non-existing module
2017-05-13 16:00:09 +02:00
ines
5858857a78
Update languages list in conftest
2017-05-13 15:37:54 +02:00
ines
9d85cda8e4
Fix models error message and use about.__docs_models__ (see #1051)
2017-05-13 13:05:47 +02:00
ines
6b942763f0
Tidy up imports
2017-05-13 13:04:40 +02:00
ines
8c2a0c026d
Fix parse_tree test
2017-05-13 12:32:45 +02:00
ines
6129016e15
Replace deepcopy
2017-05-13 12:32:37 +02:00
ines
df68bf45ce
Set defaults for light and flat kwargs
2017-05-13 12:32:23 +02:00
ines
b9dea345e5
Remove old import
2017-05-13 12:32:11 +02:00
ines
293ee359c5
Fix formatting
2017-05-13 12:32:06 +02:00
ines
4eefb288e3
Port over PR #1055
2017-05-13 03:25:32 +02:00
Matthew Honnibal
ee1d35bdb0
Fix merge conflict
2017-05-13 03:20:19 +02:00
Matthew Honnibal
b2540d2379
Merge Kengz's tree_print patch
2017-05-13 03:18:49 +02:00
Matthew Honnibal
827b5af697
Update draft of parser neural network model
...
Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.
Outline of the model:
We first predict context-sensitive vectors for each word in the input:
(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4
This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features.
To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this
by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a
representation that's one affine transform from this informative lexical information. This is obviously good for the
parser (which backprops to the convolutions too).
The parser model makes a state vector by concatenating the vector representations for its context tokens. Current
results suggest few context tokens work well. Maybe this is a bug.
The current context tokens:
* S0, S1, S2: Top three words on the stack
* B0, B1: First two words of the buffer
* S0L1, S0L2: Leftmost and second leftmost children of S0
* S0R1, S0R2: Rightmost and second rightmost children of S0
* S1L1, S1L2, S1R2, S1R, B0L1, B0L2: Likewise for S1 and B0
This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately,
there's a way to structure the computation to save some expense (and make it more GPU friendly).
The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks
with a non-monotonic transition). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications
for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden
weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN
-- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model
is so big.)
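
The pre-computation described above can be sketched in NumPy as follows. This is an illustrative reconstruction, not the code from this commit: the array names, the toy sizes, and the omission of the batch dimension and of missing context tokens are all assumptions. It checks that summing cached per-slot rows gives the same hidden-layer input as multiplying the concatenated state vector.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, H, F = 6, 4, 5, 13  # words, token width, hidden width, feature slots (toy sizes)

tokvecs = rng.normal(size=(N, T))  # stand-in for the CNN's token vectors
W = rng.normal(size=(F, T, H))     # one (T, H) weight block per feature slot


def naive_hidden(token_ids):
    # Concatenate the F context-token vectors, then do one (F*T,) @ (F*T, H) product.
    state_vector = tokvecs[token_ids].reshape(F * T)
    return state_vector @ W.reshape(F * T, H)


# Pre-computation: a single (N, T) @ (T, F*H) multiplication, done once per sentence.
W_wide = W.transpose(1, 0, 2).reshape(T, F * H)
cached = (tokvecs @ W_wide).reshape(N, F, H)


def precomputed_hidden(token_ids):
    # For each slot f, look up the cached row of the word in that slot, then sum.
    return cached[token_ids, np.arange(F)].sum(axis=0)


token_ids = rng.integers(0, N, size=F)  # which word fills each of the F slots
assert np.allclose(naive_hidden(token_ids), precomputed_hidden(token_ids))
```

In this sketch each parser state costs only F row lookups and a sum, rather than a fresh matrix multiplication.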
This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity.
The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved
to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier.
We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle
in CUDA to train.
Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to
be 0 cost. This is defined as:
(exp(score) / Z) - (exp(score) / gZ)
Where Z is the sum of exp(score) over all classes, and gZ is the sum of exp(score) over the gold classes. I'm very interested in regressing on the cost directly,
but so far this isn't working well.
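
A small NumPy sketch of the expression above, read as the gradient with respect to each class score. The reading is an assumption: Z sums exp(score) over all classes, gZ sums it over the gold (zero-cost) classes, and the second term is applied only to gold classes. The function name is hypothetical, and at least one gold class is assumed.

```python
import numpy as np


def multilabel_log_loss_grad(scores, is_gold):
    """Hypothetical helper: exp(score)/Z - exp(score)/gZ, second term on gold classes only."""
    scores = np.asarray(scores, dtype=float)
    is_gold = np.asarray(is_gold, dtype=bool)
    exp_scores = np.exp(scores - scores.max())  # shift by the max for numerical stability
    Z = exp_scores.sum()                        # normaliser over all classes
    gZ = exp_scores[is_gold].sum()              # normaliser over gold classes only
    return exp_scores / Z - np.where(is_gold, exp_scores / gZ, 0.0)
```

Under this reading the gradient sums to zero. With exactly one gold class it reduces to the usual softmax-minus-one-hot gradient.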
Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit
greatly from the pre-computation trick.
2017-05-12 16:09:15 -05:00
ines
c4857bc7db
Remove unused argument
2017-05-12 15:37:54 +02:00
ines
c13b3fa052
Add LEX_ATTRS
2017-05-12 15:37:45 +02:00
ines
bca2ea9c72
Update Portuguese lexical attributes
2017-05-12 15:37:39 +02:00
ines
2f870123bf
Fix formatting
2017-05-12 15:37:20 +02:00
ines
ca65993d59
Add basic Polish Language class
2017-05-12 09:25:37 +02:00
ines
48177c4f92
Add missing tokenizer exceptions
2017-05-12 09:25:24 +02:00
ines
bb8be3d194
Add Danish language data
2017-05-10 21:15:12 +02:00
Matthew Honnibal
4efb391994
Fix serializer
2017-05-09 18:45:18 +02:00
Matthew Honnibal
b16ae75824
Remove serializer hacks from pipeline classes
2017-05-09 18:16:40 +02:00
Matthew Honnibal
7253b4e649
Remove old serialization tests
2017-05-09 18:12:58 +02:00
Matthew Honnibal
f9327343ce
Start updating serializer test
2017-05-09 18:12:03 +02:00
Matthew Honnibal
1166b0c491
Implement Doc.to_bytes and Doc.from_bytes methods
2017-05-09 18:11:34 +02:00
Matthew Honnibal
9e167b7bb6
Strip serializer from code
2017-05-09 17:28:50 +02:00
Matthew Honnibal
b53f7dfdc3
Remove spacy.serialize
2017-05-09 17:22:06 +02:00
Matthew Honnibal
62ecdea9f2
Add binder class for document serialization
2017-05-09 17:21:00 +02:00
ines
a0b00624bb
Make sure like_email returns bool
2017-05-09 11:37:29 +02:00
ines
ea60932e1b
Fix formatting
2017-05-09 11:08:14 +02:00
ines
2c3bdd09b1
Add English test for like_num
2017-05-09 11:06:34 +02:00
ines
22375eafb0
Fix and merge attrs and lex_attrs tests
2017-05-09 11:06:25 +02:00
ines
02d0ac5cab
Remove redundant function and fix formatting
2017-05-09 11:06:04 +02:00
ines
b5ca50607e
Reorganise entity rules
2017-05-09 01:37:10 +02:00
ines
564939391a
Remove spacy.orth
2017-05-09 01:21:47 +02:00
ines
12c3d5fbba
Fix formatting
2017-05-09 01:15:28 +02:00
ines
2829a024ef
Re-add basic like_num check to global lex_attrs
2017-05-09 01:15:23 +02:00
ines
88adeee548
Add English lex_attrs overrides
2017-05-09 01:09:52 +02:00
ines
8f3fbbb147
Fix typos
2017-05-09 01:09:37 +02:00
ines
ea5fa46475
Import LEX_ATTRS from lang.lex_attrs
2017-05-09 00:58:10 +02:00
ines
2216e5f326
Reorganise lex_attrs and add dict
2017-05-09 00:57:54 +02:00
ines
e666f14d20
Add global lex_attrs
2017-05-09 00:41:53 +02:00
ines
41972c43fe
Use consistent regex imports
2017-05-09 00:34:31 +02:00
ines
7b83977020
Remove unused munge package
2017-05-09 00:16:16 +02:00
ines
c714841cc8
Move language-specific tests to tests/lang
2017-05-09 00:02:37 +02:00
ines
bd57b611cc
Update conftest to lazy load languages
2017-05-09 00:02:21 +02:00
ines
9f0fd5963f
Reorganise Hungarian punctuation rules
2017-05-09 00:01:59 +02:00
ines
fc0d793360
Reorganise Bengali punctuation rules
2017-05-09 00:01:52 +02:00
ines
e895d1afd7
Reorganise French punctuation rules
2017-05-09 00:00:54 +02:00
ines
014bda0ae3
Reorganise global punctuation rules
2017-05-09 00:00:46 +02:00
ines
a91278cb32
Rename _URL_PATTERN to URL_PATTERN
2017-05-09 00:00:00 +02:00
ines
604f299cf6
Add char classes to global language data
2017-05-08 23:59:33 +02:00
ines
f6f5d78cb9
Fix formatting
2017-05-08 23:59:17 +02:00
ines
6eb6306843
Fix language data imports
2017-05-08 23:58:31 +02:00
ines
3c0f85de8e
Remove imports in /lang/__init__.py
2017-05-08 23:58:07 +02:00
ines
86d9c29f30
Reorder util functions
2017-05-08 23:51:15 +02:00
ines
9a0d2fdef1
Add load_lang_class() util function
2017-05-08 23:50:45 +02:00
ines
614aa09582
Tidy up Bengali tokenizer exceptions
2017-05-08 22:29:49 +02:00
ines
73b577cb01
Fix relative imports
2017-05-08 22:29:04 +02:00
ines
ae99990f63
Fix formatting
2017-05-08 22:23:48 +02:00
ines
f46ffe3e89
Move language data to /lang module
2017-05-08 20:00:40 +02:00
ines
41a322c733
Fix LEMMA in exceptions and morph rules
2017-05-08 19:57:36 +02:00
ines
2edc0aee12
Update warning message
2017-05-08 19:53:36 +02:00
ines
6025cdb992
Fix string interpolation in times
2017-05-08 16:38:16 +02:00
ines
b9ba58ba5c
Add function to resolve load name
...
Warn if old 'path' keyword argument is used.
2017-05-08 16:33:37 +02:00
ines
e6f1a5d0a1
Add unicode declaration
2017-05-08 16:22:17 +02:00
ines
be5541bd16
Fix import and tokenizer exceptions
2017-05-08 16:20:14 +02:00
ines
2324788970
Remove bad tests
2017-05-08 16:15:27 +02:00
ines
b88c4193e7
Add missing symbol
2017-05-08 16:15:20 +02:00
ines
9a5b2bdd4c
Don't set morph rules without tag map
2017-05-08 16:15:12 +02:00
ines
4930f0fa8f
Explicitly import TOKEN_MATCH
2017-05-08 16:11:54 +02:00
ines
50b7ec03ca
Fix typo
2017-05-08 16:11:45 +02:00
ines
3ca611fe48
Fix wildcard imports
2017-05-08 15:56:29 +02:00
ines
c2469b8135
Remove __all__ export
2017-05-08 15:56:22 +02:00
ines
14a9c3ee7a
Fix wildcard import
2017-05-08 15:56:13 +02:00
ines
deed623864
Remove comment
2017-05-08 15:56:05 +02:00
ines
e7f95c37ee
Merge base tokenizer exceptions
2017-05-08 15:55:52 +02:00
ines
24606d364c
Remove redundant language_data.py files in languages
...
Originally intended to collect all components of a language, but just
made things messy. Now each component is in charge of exporting itself
properly.
2017-05-08 15:55:29 +02:00
ines
a627d3e3b0
Reorganise Chinese language data
2017-05-08 15:54:36 +02:00
ines
7b86ee093a
Reorganise Swedish language data
2017-05-08 15:54:29 +02:00
ines
50510fa947
Reorganise Portuguese language data
2017-05-08 15:52:01 +02:00
ines
279895ea83
Reorganise Dutch language data
2017-05-08 15:51:39 +02:00
ines
04ef5025bd
Reorganise Norwegian language data
2017-05-08 15:51:22 +02:00
ines
5edbc725d8
Reorganise Japanese language data
2017-05-08 15:50:46 +02:00
ines
51a389d3bb
Reorganise Italian language data
2017-05-08 15:50:17 +02:00
ines
1bbfa14436
Reorganise Hungarian language data
2017-05-08 15:49:56 +02:00
ines
a77c9fc60d
Reorganise Hebrew language data
2017-05-08 15:49:28 +02:00
ines
7f05e977fa
Reorganise French language data
2017-05-08 15:49:05 +02:00
ines
0207ffdd52
Reorganise Finnish language data
2017-05-08 15:48:31 +02:00
ines
8e483ec950
Reorganise Spanish language data
2017-05-08 15:48:04 +02:00
ines
c7c21b980f
Reorganise English language data
2017-05-08 15:47:25 +02:00
ines
1bf9d5ec8b
Reorganise German language data
2017-05-08 15:44:26 +02:00
ines
7b3a983f96
Reorganise Bengali language data
2017-05-08 15:43:50 +02:00
ines
607ba458e7
Fix whitespace
2017-05-08 15:42:31 +02:00
ines
60db497525
Add update_exc and expand_exc to util
...
Doesn't require separate language data util anymore
2017-05-08 15:42:12 +02:00
Matthew Honnibal
b44f7e259c
Clean up unused parser code
2017-05-08 15:42:04 +02:00
ines
6e5bd4f228
Remove unused functions from deprecated
2017-05-08 15:40:16 +02:00
Matthew Honnibal
17efb1c001
Change width
2017-05-08 08:40:13 -05:00
ines
f68e420bc0
Add PRON_LEMMA and DET_LEMMA to deprecated
...
Will be replaced with proper values across the language data later.
2017-05-08 15:35:30 +02:00
ines
bd6a7cf4f6
Simplify deprecated model downloading
...
Only relevant for spaCy < v1.7.0.
2017-05-08 15:32:10 +02:00
ines
95edd9e896
Let parse_package_meta take full path
2017-05-08 15:30:48 +02:00
ines
326746eb15
Add util function to resolve arg to model path
...
1. check if in data dir or shortcut link
2. check if installed as a pip package
3. check if string is path to model
4. check if Path or Path-like object
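
The four checks above could be arranged roughly as in the sketch below. This is an assumption-laden illustration in modern Python, not the utility added by this commit: the name `resolve_model_path`, the `data_dir` parameter, and the error message are hypothetical.

```python
import importlib.util
import os
from pathlib import Path


def resolve_model_path(name, data_dir):
    """Hypothetical helper: return a Path for `name`, trying each rule in order."""
    if isinstance(name, str):
        # 1. An entry (or shortcut link) inside the data directory.
        candidate = Path(data_dir) / name
        if candidate.exists():
            return candidate
        # 2. An installed package importable under that name.
        spec = importlib.util.find_spec(name) if name.isidentifier() else None
        if spec is not None and spec.origin:
            return Path(spec.origin).parent
        # 3. A string that is itself a path to a model.
        if Path(name).exists():
            return Path(name)
    elif isinstance(name, os.PathLike):
        # 4. A Path or Path-like object is used as given.
        return Path(name)
    raise IOError("Can't find model: {}".format(name))
```

The order matters in this sketch: a shortcut link in the data directory takes precedence over an installed package with the same name.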
2017-05-08 15:29:47 +02:00
Matthew Honnibal
bef89ef23d
Mergery
2017-05-08 08:29:36 -05:00
ines
a7801e7342
Update spacy.load()
...
path argument is now deprecated and name can either take a model name
or path. Implement lazy loading by importing module and read Language
class name off __all__.
2017-05-08 15:27:25 +02:00
Matthew Honnibal
50ddc9fc45
Fix infinite loop bug
2017-05-08 07:54:26 -05:00
Matthew Honnibal
94e86ae00a
Predict tags with encoder
2017-05-08 07:53:45 -05:00
Matthew Honnibal
56073a11ef
Don't use tags when calculating token vectors
2017-05-08 07:52:24 -05:00
Matthew Honnibal
a66a4a4d0f
Replace einsums
2017-05-08 14:46:50 +02:00
Matthew Honnibal
8d2eab74da
Use PretrainableMaxouts
2017-05-08 14:24:55 +02:00
Matthew Honnibal
807cb2e370
Add PretrainableMaxouts
2017-05-08 14:24:43 +02:00
Matthew Honnibal
2e2268a442
Precomputable hidden now working
2017-05-08 11:36:37 +02:00
ines
94697e9afc
Fix typo
2017-05-08 02:00:37 +02:00
ines
0ee2a22b67
Merge branch 'pr/1024' into develop
2017-05-08 01:12:44 +02:00
ines
c4492d260a
Fix kwargs
2017-05-08 01:05:24 +02:00
Matthew Honnibal
10682d35ab
Get pre-computed version working
2017-05-08 00:38:35 +02:00
ines
b5a726c5cd
Tidy up deprecated.py
2017-05-07 23:29:22 +02:00
ines
59c3b9d4dd
Tidy up CLI and fix print functions
2017-05-07 23:25:29 +02:00
ines
311704674d
Add path2str compat function
2017-05-07 23:24:56 +02:00
ines
e34069db9f
Move is_package and get_model_package_path to util
2017-05-07 23:24:51 +02:00
ines
957ba676b4
Add model files base path to about.py
2017-05-07 23:22:35 +02:00
ines
8d8dd9ceb2
Don't set default value for model
2017-05-07 23:22:21 +02:00
Matthew Honnibal
35458987e8
Checkpoint -- nearly finished reimpl
2017-05-07 23:05:01 +02:00
Matthew Honnibal
4441866f55
Checkpoint -- nearly finished reimpl
2017-05-07 22:47:06 +02:00
Matthew Honnibal
6782eedf9b
Tmp GPU code
2017-05-07 11:04:24 -05:00
Matthew Honnibal
e420e5a809
Tmp
2017-05-07 07:31:09 -05:00
Matthew Honnibal
12039e80ca
Switch to single matmul for state layer
2017-05-07 14:26:34 +02:00
Matthew Honnibal
700979fb3c
CPU/GPU compat
2017-05-07 04:01:11 +02:00
Matthew Honnibal
f99f5b75dc
working residual net
2017-05-07 03:57:26 +02:00
Matthew Honnibal
bdf2dba9fb
WIP on refactor, with hidden pre-computing
2017-05-07 02:02:43 +02:00
Matthew Honnibal
b439e04f8d
Learning smoothly
2017-05-06 20:38:12 +02:00
Matthew Honnibal
08bee76790
Learns things
2017-05-06 18:24:38 +02:00
Matthew Honnibal
04ae1c01f1
Learns things
2017-05-06 18:21:02 +02:00
Matthew Honnibal
bcf4cd0a5f
Learns things
2017-05-06 17:37:36 +02:00
Matthew Honnibal
8e48b58cd6
Gradients look correct
2017-05-06 16:47:15 +02:00
Matthew Honnibal
7e04260d38
Data running through, likely errors in model
2017-05-06 14:22:20 +02:00
Matthew Honnibal
fa7c1990b6
Restore tok2vec function
2017-05-05 20:12:03 +02:00
Matthew Honnibal
efe9630e1c
Bug fixes
2017-05-05 20:09:50 +02:00
Matthew Honnibal
ef4fa594aa
Draft of NN parser, to be tested
2017-05-05 19:20:39 +02:00
Matthew Honnibal
7d1df50aec
Draft up Parser model
2017-05-04 13:31:40 +02:00
Matthew Honnibal
ccaf26206b
Pseudocode for parser
2017-05-04 12:17:59 +02:00
ines
b1f22c5a10
Fix formatting
2017-05-03 20:11:02 +02:00
ines
a04b5be1b2
Add glossary for annotation scheme (closes #1034)
...
Can be imported as explain from spacy.glossary, or called as
spacy.explain(term)
2017-05-03 17:02:17 +02:00
Gregory Howard
929f2792a7
Renaming cls in module. cls is now a class
2017-05-03 15:41:07 +02:00
Gregory Howard
0e8c41ea4f
Adding method lemmatizer for every class
2017-05-03 12:14:42 +02:00
Gregory Howard
32ca07989e
adding export japanese
2017-05-03 11:07:29 +02:00
Grégory Howard
f9d7144224
Merge branch 'master' into master
2017-05-03 11:04:51 +02:00
Gregory Howard
f2ab7d77b4
Lazy imports language
2017-05-03 11:01:42 +02:00
Ines Montani
3ea23a3f4d
Fix formatting
2017-05-03 09:44:38 +02:00
Ines Montani
d730eb0c0d
Raise custom ImportError if importing janome fails
2017-05-03 09:43:29 +02:00
Ines Montani
949ad6594b
Add newline
2017-05-03 09:38:43 +02:00
Ines Montani
d12ca587ea
Add newline
2017-05-03 09:38:29 +02:00
Ines Montani
8676cd0135
Add newline
2017-05-03 09:38:07 +02:00
Yasuaki Uechi
c8f83aeb87
Add basic japanese support
2017-05-03 13:56:21 +09:00
Gregory Howard
c0afcd22bb
Merge remote-tracking branch 'remotes/upstream/master'
2017-04-27 14:42:54 +02:00
Matthew Honnibal
31ec9e1371
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-27 13:21:39 +02:00
Matthew Honnibal
2da16adcc2
Add dropout option for parser and NER
...
Dropout can now be specified in the `Parser.update()` method via
the `drop` keyword argument, e.g.
nlp.entity.update(doc, gold, drop=0.4)
This will randomly drop 40% of features, and multiply the value of the
others by 1. / 0.4. This may be useful for generalising from small data
sets.
This commit also patches the examples/training/train_new_entity_type.py
example, to use dropout and fix the output (previously it did not output
the learned entity).
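
The feature dropout described in this message can be sketched as below. The sketch follows the wording of the message literally, scaling surviving values by 1. / drop. The function name and the NumPy formulation are assumptions, not the parser's own code. Note that the more common "inverted dropout" convention scales survivors by 1. / (1. - drop) instead.

```python
import numpy as np


def drop_features(values, drop, rng):
    """Hypothetical helper: zero each value with probability `drop`, rescale the rest."""
    values = np.asarray(values, dtype=float)
    if drop <= 0.0 or drop >= 1.0:
        # Out-of-range rates disable dropout in this sketch.
        return values.copy()
    keep = rng.uniform(0.0, 1.0, size=values.shape) >= drop
    # Scaling by 1. / drop mirrors the commit message, not the usual 1. / (1. - drop).
    return np.where(keep, values / drop, 0.0)
```

With `drop=0.4`, roughly 40% of the values become zero and the remainder are multiplied by 2.5.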
2017-04-27 13:18:39 +02:00
Gregory Howard
92f368f83b
Removing extra spaces
2017-04-27 12:02:14 +02:00
Gregory Howard
13b6957c8e
Adding unit test for tokenization in French (with title)
2017-04-27 11:53:44 +02:00
Gregory Howard
8ff4682255
correcting tokenizer exception.
...
Adding tests for lemmatization
2017-04-27 11:52:14 +02:00
Ines Montani
7da9cefd25
Merge pull request #1022 from luvogels/master
...
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Ines Montani
c9e592ae6c
Add newline
2017-04-27 11:15:41 +02:00
Ines Montani
5942adccc2
Add newline
2017-04-27 11:15:19 +02:00
Ines Montani
4cd9269aef
Add newline
2017-04-27 11:15:04 +02:00
Ines Montani
ccf13ecc21
Add newline
2017-04-27 11:14:42 +02:00
Ines Montani
03d2b0cc05
Add newline
2017-04-27 11:14:26 +02:00
Gregory Howard
44cb486849
Adding unit test for tokenization in French (with title)
2017-04-27 10:59:38 +02:00
Gregory Howard
ad8129cb45
Improvement of rules: now title-case insensitive, with the same declaration format
2017-04-27 10:23:56 +02:00
luvogels
d12a0b6431
Hooked up tokenizer tests
2017-04-26 23:21:41 +02:00
Matthew Honnibal
f0e1606d27
Increment version
2017-04-26 20:25:41 +02:00
luvogels
b331929a7e
Merge branch 'master' of https://github.com/luvogels/spaCy
2017-04-26 19:15:48 +02:00
luvogels
8de59ce3b9
Added tokenizer tests
2017-04-26 19:10:18 +02:00
Matthew Honnibal
4d98511db7
Make Span hashable. Closes #1019
2017-04-26 19:01:05 +02:00
Matthew Honnibal
24c4c51f13
Try to make test999 less flakey
2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang
460094bf09
Update __init__.py
2017-04-26 18:27:55 +02:00
ines
527d51ac9a
Fetch shortcuts from GitHub and improve error handling
2017-04-26 18:00:28 +02:00
Gregory Howard
ed5f094451
Adding insensitive lemmatisation test
2017-04-25 18:07:02 +02:00
ghoward
26e31afc18
Renaming tests
2017-04-25 17:46:01 +02:00
ghoward
c085c2d391
Adding some unit tests
2017-04-25 17:44:16 +02:00
ghoward
55c6910f90
Lookup table for languages in spaCy.
...
Need to find another name for lemmatizerlookup. I was not inspired.
Trying to use the new files in the fr language.
2017-04-24 16:39:00 +02:00
Matthew Honnibal
c4be9c36fe
Fix unicode header in tests
2017-04-24 10:09:01 +02:00
Matthew Honnibal
65f10b53e5
Fix test
2017-04-24 00:25:55 +02:00
Matthew Honnibal
70a43858e1
Fix flakey test
2017-04-24 00:06:30 +02:00
Matthew Honnibal
3973af2d15
Make training test less flakey
2017-04-23 22:59:34 +02:00
Matthew Honnibal
4f9657b42b
Fix reporting if no dev data with train
2017-04-23 22:27:10 +02:00
Matthew Honnibal
df2ac8b843
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-23 21:25:07 +02:00
Matthew Honnibal
d0e19267e8
Create directory if missing in save_to_directory
2017-04-23 21:24:43 +02:00
ines
42305bc519
Remove unnecessary test
2017-04-23 21:21:41 +02:00
ines
012ea594d1
Add file for misc tests
2017-04-23 21:06:51 +02:00
ines
83f66947dc
Rename test_download to test_cli
2017-04-23 21:06:50 +02:00
ines
401045433c
Simplify compat.fix_text
2017-04-23 21:06:50 +02:00
Matthew Honnibal
e033c86a64
Increment version
2017-04-23 21:03:43 +02:00
Matthew Honnibal
d2436dc17b
Update fix for Issue #999
2017-04-23 18:14:37 +02:00
Matthew Honnibal
874a3cbb07
Add test for Issue #955
2017-04-23 17:57:01 +02:00
Matthew Honnibal
60703cede5
Ensure noun chunks can't be nested. Closes #955
2017-04-23 17:56:39 +02:00
Matthew Honnibal
c9ec24b257
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-23 17:07:46 +02:00
Matthew Honnibal
5d8af40445
Add test for Issue #999
2017-04-23 17:06:30 +02:00
Matthew Honnibal
4d2a659c52
Fix json dump for Python3
2017-04-23 17:05:53 +02:00
Matthew Honnibal
040751ad17
Remove xfail on Test #910
2017-04-23 16:28:55 +02:00
ines
3a9710f356
Pass dev_scores to print_progress correctly (resolves #1008)
...
Only read scores attribute if command is used with dev_data, otherwise
default dev_scores to empty dict.
2017-04-23 15:58:40 +02:00
Matthew Honnibal
1b12f342e4
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-20 17:03:11 +02:00
Matthew Honnibal
4eef200bab
Persist the actions within spacy.parser.cfg
2017-04-20 17:02:44 +02:00
ines
25c70b4cc5
Move fix_text to spacy.compat (see #1002)
2017-04-20 15:47:17 +02:00
Ines Montani
60b5243bee
Merge pull request #1002 from oroszgy/model_cli_fix
...
Fixes for the `model` CLI
2017-04-20 15:41:03 +02:00
Gyorgy Orosz
4a06a2572c
Using ftfy for handling broken encoded strings.
2017-04-20 13:34:51 +02:00
Ines Montani
3800b29046
Merge pull request #1001 from recognai/master
...
Add SPACE to es tag map
2017-04-20 12:16:34 +02:00
oeg
f0bcd0babb
fix(model): Add SPACE to es tag_map. Fixing error in morphology.pyx when SP tag is missing
2017-04-20 11:36:24 +02:00
Ben Eyal
e90e8a3f10
Enable test
2017-04-20 02:25:24 +03:00
Ben Eyal
33af52599e
Redefine alphabetic characters
...
For caseless languages (Hebrew, Bengali) all characters are both lowercase and uppercase.
2017-04-20 02:25:02 +03:00
Ben Eyal
d8098a8be2
Use regex instead of re
2017-04-20 02:22:52 +03:00
oeg
daaa42dd25
Merge remote-tracking branch 'upstream/master'
2017-04-19 23:30:36 +02:00
oeg
936a297241
fix(model): Fix tag map for fixing issues with tag SPACE
2017-04-19 23:30:21 +02:00
luvogels
c7cec7e5e2
Update __init__.py
2017-04-19 21:06:30 +02:00