Matthew Honnibal
2cb7cc2db7
Remove commented code from parser
2017-05-25 14:55:09 -05:00
Matthew Honnibal
f403c2cd5f
Add env opts for optimizer
2017-05-25 11:19:26 -05:00
Matthew Honnibal
c245ff6b27
Rebatch parser inputs, with mid-sentence states
2017-05-25 11:18:59 -05:00
Matthew Honnibal
679efe79c8
Make parser update less hacky
2017-05-25 06:49:00 -05:00
Matthew Honnibal
8500d9b1da
Only train one task per iter, holding grads
2017-05-25 06:47:42 -05:00
Matthew Honnibal
b27c587800
Fix pieces argument to PrecomputedMaxout
2017-05-25 06:46:59 -05:00
Matthew Honnibal
e1cb5be0c7
Adjust dropout, depth and multi-task in parser
2017-05-24 20:11:41 -05:00
Matthew Honnibal
e6cc927ab1
Rearrange multi-task learning
2017-05-24 20:10:54 -05:00
Matthew Honnibal
135a13790c
Disable gold preprocessing
2017-05-24 20:10:20 -05:00
Matthew Honnibal
467bbeadb8
Add hidden layers for tagger
2017-05-24 20:09:51 -05:00
ines
66088851dc
Add Doc.to_disk() and Doc.from_disk() methods
2017-05-24 11:58:17 +02:00
Matthew Honnibal
620df0414f
Fix dropout in parser
2017-05-23 15:20:45 -05:00
Matthew Honnibal
5b67bcbee0
Increase default embed size to 7500
2017-05-23 15:20:16 -05:00
Matthew Honnibal
48eef94f92
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 18:47:32 +02:00
Matthew Honnibal
d44b1eafc4
Fix conflict artefacts
2017-05-23 18:47:11 +02:00
Matthew Honnibal
01e59e4e6e
* Add Token.sent_start property, re Issue #235
2017-05-23 18:41:11 +02:00
Matthew Honnibal
4917cbb484
Include sent_start test
2017-05-23 18:40:37 +02:00
Matthew Honnibal
d68dd1f251
Add SENT_START attribute, for custom sentence boundary detection
2017-05-23 18:37:58 +02:00
Matthew Honnibal
8026c183d0
Add hacky logic to accelerate depth=0 case in parser
2017-05-23 11:06:49 -05:00
Matthew Honnibal
e7d3159d91
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 05:58:17 -05:00
Matthew Honnibal
a8b6d11c5b
Support optional maxout layer
2017-05-23 05:58:07 -05:00
Matthew Honnibal
c55b8fa7c5
Fix bugs in parse_batch
2017-05-23 05:57:52 -05:00
ines
fb0ff0272f
xfail neural parser tests for now and remove test for deprecated method
2017-05-23 12:40:37 +02:00
Matthew Honnibal
964707d795
Restore support for deeper networks in parser
2017-05-23 05:31:13 -05:00
Matthew Honnibal
e27262f431
Go back to previous matcher signature, with on_match positional
2017-05-23 04:37:40 -05:00
Matthew Honnibal
5418bcf5d7
Resolve conflict on test
2017-05-23 04:37:16 -05:00
ines
e6acd3bbf2
Fix matcher tests and matcher docs
2017-05-23 11:36:02 +02:00
ines
d0c6d4f76d
Fix formatting
2017-05-23 11:32:00 +02:00
Matthew Honnibal
f0bcc0bd8d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 04:29:28 -05:00
Matthew Honnibal
9adfe9e8fc
Don't hold gradient updates in language -- let the parser decide how to batch the updates.
2017-05-23 04:29:10 -05:00
Matthew Honnibal
6b918cc58e
Support making updates periodically during training
2017-05-23 04:23:29 -05:00
Matthew Honnibal
3f725ff7b3
Roll back changes to parser update
2017-05-23 04:23:05 -05:00
Matthew Honnibal
3959d778ac
Revert "Revert "WIP on improving parser efficiency""
...
This reverts commit 532afef4a8
.
2017-05-23 03:06:53 -05:00
Matthew Honnibal
532afef4a8
Revert "WIP on improving parser efficiency"
...
This reverts commit bdaac7ab44
.
2017-05-23 03:05:25 -05:00
Matthew Honnibal
bdaac7ab44
WIP on improving parser efficiency
2017-05-23 02:59:31 -05:00
Matthew Honnibal
8a9e318deb
Put the parsing loop in a nogil prange block
2017-05-22 17:58:12 -05:00
ines
a23f487b06
Tidy up displaCy and add "manual" option
...
Also don't require title in EntityRenderer
2017-05-22 18:48:20 +02:00
Matthew Honnibal
0264447c4d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 10:41:56 -05:00
Matthew Honnibal
6e8dce2c05
Fix train command line args
2017-05-22 10:41:39 -05:00
Matthew Honnibal
a7ee63c0ac
Fix labeller loss for unseen labels
2017-05-22 10:41:20 -05:00
Matthew Honnibal
c9760b2104
Support sentence limits in GoldCorpus
2017-05-22 10:40:46 -05:00
Matthew Honnibal
e2136232f9
Exclude states with no matching gold annotations from parsing
2017-05-22 10:30:12 -05:00
Matthew Honnibal
83ffd16474
Fix offset calculation for other negative values
2017-05-22 08:00:53 -05:00
ines
b3c7ee0148
Fix tests and use the new Matcher API
2017-05-22 13:54:20 +02:00
Matthew Honnibal
f00f821496
Fix pseudoprojectivity->nonproj
2017-05-22 06:14:42 -05:00
Matthew Honnibal
ae8cf70dc1
Fix CLI train signature
2017-05-22 06:13:39 -05:00
Matthew Honnibal
187f370734
Update tests for matcher changes
2017-05-22 12:59:50 +02:00
Matthew Honnibal
5d59e74cf6
PseudoProjectivity->nonproj
2017-05-22 05:49:53 -05:00
Matthew Honnibal
7e2cdc0c81
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 12:39:34 +02:00
Matthew Honnibal
70a8c531cd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 05:39:18 -05:00
Matthew Honnibal
2f78413a02
PseudoProjectivity->nonproj
2017-05-22 05:39:03 -05:00
Matthew Honnibal
89ebc5c3cd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 12:38:15 +02:00
Matthew Honnibal
d8bb5bb959
Implement StringStore serialization, and update tests
2017-05-22 12:38:00 +02:00
ines
54f04a9fe0
Update API docs with changes in spacy.gold and spacy.language
2017-05-22 12:29:30 +02:00
ines
b5fb43fdd8
Allow sys.exit status as exits keyword arg in util.prints()
2017-05-22 12:29:15 +02:00
ines
fc3ec733ea
Reduce complexity in CLI
...
Remove now redundant model command and move plac annotations to cli
files
2017-05-22 12:28:58 +02:00
Matthew Honnibal
b45b4aa392
PseudoProjectivity --> nonproj
2017-05-22 05:17:44 -05:00
Matthew Honnibal
aae97f00e9
Fix nonproj import
2017-05-22 05:15:06 -05:00
Matthew Honnibal
9262fc4829
Fix syntax error
2017-05-22 05:14:59 -05:00
Matthew Honnibal
93a042253b
Make GoldParse attributes writeable
2017-05-22 04:51:08 -05:00
Matthew Honnibal
2a5eb9f61e
Make nonproj methods top-level functions, instead of class methods
2017-05-22 04:51:08 -05:00
Matthew Honnibal
c998776c25
Make single array for features, to reduce GPU copies
2017-05-22 04:51:08 -05:00
Matthew Honnibal
bc2294d7f1
Add support for fiddly hyper-parameters to train func
2017-05-22 04:51:08 -05:00
Matthew Honnibal
80e19a2399
Simplify CLI implementation for subcommands. Remove model command.
2017-05-22 04:51:08 -05:00
Matthew Honnibal
33e2222839
Remove unused code in deprojectivize
2017-05-22 04:51:08 -05:00
Matthew Honnibal
4e0988605a
Pass through non-projective=True
2017-05-22 04:51:08 -05:00
Matthew Honnibal
025d9bbc37
Fix handling of non-projective deps
2017-05-22 04:51:08 -05:00
Matthew Honnibal
5738d373d5
Add deprojectivize to pipeline
2017-05-22 04:51:08 -05:00
Matthew Honnibal
1b5fa68996
Do pseudo-projective pre-processing for parser
2017-05-22 04:51:08 -05:00
Matthew Honnibal
1d5d9838a2
Fix action collection for parser
2017-05-22 04:51:08 -05:00
Matthew Honnibal
8d1e64be69
Add experimental NeuralLabeller
2017-05-22 04:51:08 -05:00
Matthew Honnibal
9b1b0742fd
Fix prediction for tok2vec
2017-05-22 04:51:08 -05:00
Matthew Honnibal
f13d6c7359
Support gold preprocessing and single gold files
2017-05-22 04:51:08 -05:00
Matthew Honnibal
e14533757b
Use averaged params for evaluation
2017-05-22 04:51:08 -05:00
Matthew Honnibal
7811d97339
Refactor CLI
2017-05-22 04:51:08 -05:00
Matthew Honnibal
5db89053aa
Merge docstrings
2017-05-21 13:46:23 -05:00
Matthew Honnibal
432b3499b3
Fix memory leak
2017-05-21 13:38:46 -05:00
Matthew Honnibal
59fbfb3829
Remove train.py -- functions now in GoldCorpus and Language
2017-05-21 09:08:27 -05:00
Matthew Honnibal
8904814c0e
Add missing import
2017-05-21 09:07:56 -05:00
Matthew Honnibal
baf3ef0ddc
Remove import of removed train_config script
2017-05-21 09:07:34 -05:00
Matthew Honnibal
4c9202249d
Refactor training, to fix memory leak
2017-05-21 09:07:06 -05:00
Matthew Honnibal
4803b3b69e
Add GoldCorpus class, to manage data streaming
2017-05-21 09:06:17 -05:00
Matthew Honnibal
180e5afede
Fix tokvecs flattening in pipeline
2017-05-21 09:05:34 -05:00
Matthew Honnibal
0731971bfc
Add itershuffle utility function. Maybe belongs in thinc
2017-05-21 09:05:05 -05:00
ines
2c5cfe8bbf
Update docstrings and API docs for StringStore
2017-05-21 14:18:58 +02:00
ines
251346b59f
Fix typos and formatting
2017-05-21 14:18:46 +02:00
ines
075f5ff87a
Update docstrings and API docs for GoldParse
2017-05-21 13:53:46 +02:00
ines
99b631617d
Reformat docstrings
2017-05-21 13:32:15 +02:00
ines
885e82c9b0
Update docstrings and remove deprecated load classmethod
2017-05-21 13:27:52 +02:00
ines
c5a653fa48
Update docstrings and API docs for Tokenizer
2017-05-21 13:18:14 +02:00
ines
f216422ac5
Remove deprecated load classmethod
2017-05-21 13:18:01 +02:00
ines
d82ae9a585
Change "function" to "callable" in docs
2017-05-21 13:17:40 +02:00
ines
3871157d84
Update spacy.util documentation
2017-05-21 01:12:09 +02:00
ines
0c6c65aa3c
Improve messaging if model linking fails after download
2017-05-21 00:28:37 +02:00
Matthew Honnibal
3b7c108246
Pass tokvecs through as a list, instead of concatenated. Also fix padding
2017-05-20 13:23:32 -05:00
ines
924e8506de
Move Defaults subclass to module scope (necessary for pickling)
2017-05-20 19:02:27 +02:00
Matthew Honnibal
d52b65aec2
Revert "Move to contiguous buffer for token_ids and d_vectors"
...
This reverts commit 3ff8c35a79
.
2017-05-20 11:26:23 -05:00
ines
27de0834b2
Update docstrings and API docs for Lexeme
2017-05-20 15:13:42 +02:00
ines
7ed8a92ed1
Update docstrings and API docs for Token
2017-05-20 15:13:33 +02:00
ines
4ed6a36622
Update docstrings and API docs for Matcher
2017-05-20 14:43:10 +02:00
ines
39f36539f6
Update docstrings and API docs for Matcher
2017-05-20 14:32:34 +02:00
ines
c00ff257be
Update docstrings and API docs for Matcher
2017-05-20 14:26:10 +02:00
ines
790435e51c
Update docstrings
2017-05-20 14:05:07 +02:00
ines
f0cc642bb9
Update docstrings and API docs for Vocab
2017-05-20 14:00:41 +02:00
Matthew Honnibal
ce9234f593
Update Matcher API
2017-05-20 13:54:53 +02:00
Matthew Honnibal
b272890a8c
Try to move parser to simpler PrecomputedAffine class. Currently broken -- maybe the previous change
2017-05-20 06:40:10 -05:00
ines
e39ad78267
Resolve model name properly in cli.info
...
Use util.resolve_model_path() to also allow package names and paths.
2017-05-20 12:24:40 +02:00
Matthew Honnibal
3ff8c35a79
Move to contiguous buffer for token_ids and d_vectors
2017-05-20 04:17:30 -05:00
Matthew Honnibal
8b04b0af9f
Remove freqs from transition_system
2017-05-20 02:20:48 -05:00
Matthew Honnibal
61fe55efba
Move EnglishDefaults class out of English
2017-05-20 02:18:19 -05:00
Matthew Honnibal
a1ba20e2b1
Fix over-run on parse_batch
2017-05-19 18:57:30 -05:00
ines
1d4d3d0ecd
Add TODO
2017-05-20 01:38:04 +02:00
Matthew Honnibal
7ee1827af0
Disable data caching in parser
2017-05-19 18:17:11 -05:00
Matthew Honnibal
e84de028b5
Remove 'rebatch' op, and remove min-batch cap
2017-05-19 18:16:36 -05:00
Matthew Honnibal
3376d4d6e8
Update the train script, fixing GPU memory leak
2017-05-19 18:15:50 -05:00
Matthew Honnibal
836fe1d880
Update neural net tests
2017-05-19 18:11:29 -05:00
ines
fe5d8819ea
Update Matcher docstrings and API docs
2017-05-19 21:47:06 +02:00
Matthew Honnibal
08766240c3
Add incomplete iob converter
2017-05-19 13:27:51 -05:00
Matthew Honnibal
c12ab47a56
Remove state argument in pipeline. Other changes
2017-05-19 13:26:36 -05:00
Matthew Honnibal
66ea9aebe7
Remove the state argument from Language
2017-05-19 13:25:42 -05:00
Matthew Honnibal
09a877886b
WIP on iob converter
2017-05-19 13:24:39 -05:00
ines
a804045597
Use is_ancestor instead of deprecated is_ancestor_of
2017-05-19 20:23:40 +02:00
Matthew Honnibal
8d5e6d9f4f
Rename no_ner arg to no_entities
2017-05-19 13:23:11 -05:00
ines
e9e62b01b0
Update docstrings and API docs for Token
2017-05-19 18:47:56 +02:00
ines
62ceec4fc6
Update docstrings and API docs for Span
2017-05-19 18:47:46 +02:00
ines
23f9a3ccc8
Update docstrings and API docs for Doc
2017-05-19 18:47:39 +02:00
ines
2c8c9dc0c9
Update docstrings and API docs for Language
2017-05-19 18:47:24 +02:00
ines
0791f0aae6
Update docstrings and API docs for Span class
2017-05-19 00:31:31 +02:00
ines
8455cb1327
Update docstring for Doc.__getitem__
2017-05-19 00:30:51 +02:00
ines
0fc05e54e4
Document TokenVectorEncoder
2017-05-19 00:00:02 +02:00
ines
b687ad109d
Update docstrings and API docs for Doc class
2017-05-18 23:59:44 +02:00
ines
d42bc16868
Update docstrings and API docs for Language class
2017-05-18 23:57:38 +02:00
ines
593361ee3c
Update docstrings for Span class
2017-05-18 22:17:41 +02:00
ines
b87066ff10
Update docstrings and API docs for Doc class
2017-05-18 22:17:41 +02:00
Matthew Honnibal
238be0f16a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-18 08:32:22 -05:00
Matthew Honnibal
c214c0decb
Improve env_opt reporting
2017-05-18 08:32:03 -05:00
Matthew Honnibal
bbb59e371c
Fix GPU evaluation
2017-05-18 08:31:15 -05:00
Matthew Honnibal
c2c825127a
Fix use_params and pipe methods
2017-05-18 08:30:59 -05:00
Matthew Honnibal
ca70b08661
Fix GPU training and evaluation
2017-05-18 08:30:33 -05:00
ines
489d2fb4ba
Add is_in_jupyter() helper for displaCy (see #1058 )
2017-05-18 14:13:14 +02:00
ines
abf0188b0a
Move cupy and CudaStream to compat
2017-05-18 14:12:45 +02:00
ines
33decd85b6
Reorganise and explicitly state what's importable
2017-05-18 14:12:31 +02:00
Matthew Honnibal
a438cef8c5
Fix significant bug in feature calculation -- off by 1
2017-05-18 06:21:32 -05:00
Matthew Honnibal
fc8d3a112c
Add util.env_opt support: Can set hyper params through environment variables.
2017-05-18 04:36:53 -05:00
Matthew Honnibal
d2626fdb45
Fix name error in nn parser
2017-05-18 04:31:01 -05:00
Matthew Honnibal
b460533827
Bug fixes to pipeline
2017-05-18 04:29:51 -05:00
Matthew Honnibal
8815507f8e
Move SpanishDefaults out of Language class, for pickle
2017-05-18 04:28:51 -05:00
Matthew Honnibal
2713041571
Fix GPU usage in Language
2017-05-18 04:25:19 -05:00
Matthew Honnibal
711ad5edc4
Cache features in doc2feats
2017-05-18 04:22:20 -05:00
Matthew Honnibal
39ea38c4b1
Add option to use gpu to spacy train
2017-05-18 04:21:49 -05:00
Matthew Honnibal
a1d8e420b5
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-17 08:00:04 -05:00
Matthew Honnibal
edfea3a513
Fix progress bar
2017-05-17 14:59:37 +02:00
Matthew Honnibal
0b7fd67408
Fix style check in displacy
2017-05-17 07:57:24 -05:00
Matthew Honnibal
55dab77de8
Add conversion rule for .conll
2017-05-17 13:13:48 +02:00
Matthew Honnibal
692bd2a186
Bug fix to tagger: wasnt backproping to token vectors
2017-05-17 13:13:14 +02:00
Matthew Honnibal
877f83807f
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-17 12:09:29 +02:00
Matthew Honnibal
793430aa7a
Get spaCy train command working with neural network
...
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Matthew Honnibal
3bf4a28d8d
Use tag in CoNLL converter, not POS
2017-05-17 12:04:33 +02:00
ines
1a05078c79
Add language-specific syntax iterators to en and de
2017-05-17 12:04:03 +02:00
Matthew Honnibal
c9a5d5d24b
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-16 16:22:05 +02:00
Matthew Honnibal
8cf097ca88
Redesign training to integrate NN components
...
* Obsolete .parser, .entity etc names in favour of .pipeline
* Components no longer create models on initialization
* Models created by loading method (from_disk(), from_bytes() etc), or
.begin_training()
* Add .predict(), .set_annotations() methods in components
* Pass state through pipeline, to allow components to share information
more flexibly.
2017-05-16 16:17:30 +02:00
Matthew Honnibal
221b4c1ee8
Fix test for Python 3
2017-05-16 13:06:30 +02:00
Matthew Honnibal
5211645af3
Get data flowing through pipeline. Needs redesign
2017-05-16 11:21:59 +02:00
Matthew Honnibal
1d7c18e58a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-15 21:53:47 +02:00
Matthew Honnibal
a9edb3aa1d
Improve integration of NN parser, to support unified training API
2017-05-15 21:53:27 +02:00
ines
98354be150
Only get user_data if it exists on doc
2017-05-15 13:39:47 +02:00
ines
c33bdeb564
Use uppercase for entity types
2017-05-15 01:24:57 +02:00
ines
4aaa607b8d
Add xmlns:xlink so SVGs are rendered properly as individual files
2017-05-14 19:54:13 +02:00
ines
9dd13cd76a
Update docstrings
2017-05-14 19:30:47 +02:00
ines
a04550605a
Add Jupyter notebook support (see #1058 )
2017-05-14 18:39:01 +02:00
ines
c31792aaec
Add displaCy visualisers (see #1058 )
2017-05-14 17:50:23 +02:00
ines
b462076d80
Merge load_lang_class and get_lang_class
2017-05-14 01:31:10 +02:00
ines
36bebe7164
Update docstrings
2017-05-14 01:30:29 +02:00
Matthew Honnibal
4b9d69f428
Merge branch 'v2' into develop
...
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module
Currently the two parsers live side-by-side, until we figure out how to
organize them.
2017-05-14 01:10:23 +02:00
Matthew Honnibal
5cac951a16
Move new parser to nn_parser.pyx, and restore old parser, to make tests pass.
2017-05-14 00:55:01 +02:00
Matthew Honnibal
f8c02b4341
Remove cupy imports from parser, so it can work on CPU
2017-05-14 00:37:53 +02:00
Matthew Honnibal
613ba79e2e
Fiddle with sizings for parser
2017-05-13 17:20:23 -05:00
Matthew Honnibal
e6d71e1778
Small fixes to parser
2017-05-13 17:19:04 -05:00
Matthew Honnibal
188c0f6949
Clean up unused import
2017-05-13 17:18:27 -05:00
Matthew Honnibal
f85c8464f7
Draft support of regression loss in parser
2017-05-13 17:17:27 -05:00
ines
1694c24e52
Add docstrings, error messages and fix consistency
2017-05-13 21:22:49 +02:00
ines
ee7dcf65c9
Fix expand_exc to make sure it returns combined dict
2017-05-13 21:22:25 +02:00
ines
824d09bb74
Move resolve_load_name to deprecated
2017-05-13 21:21:47 +02:00
ines
a4a37a783e
Remove import from non-existing module
2017-05-13 16:00:09 +02:00
ines
5858857a78
Update languages list in conftest
2017-05-13 15:37:54 +02:00
ines
9d85cda8e4
Fix models error message and use about.__docs_models__ (see #1051 )
2017-05-13 13:05:47 +02:00
ines
6b942763f0
Tidy up imports
2017-05-13 13:04:40 +02:00
ines
8c2a0c026d
Fix parse_tree test
2017-05-13 12:32:45 +02:00
ines
6129016e15
Replace deepcopy
2017-05-13 12:32:37 +02:00
ines
df68bf45ce
Set defaults for light and flat kwargs
2017-05-13 12:32:23 +02:00
ines
b9dea345e5
Remove old import
2017-05-13 12:32:11 +02:00
ines
293ee359c5
Fix formatting
2017-05-13 12:32:06 +02:00
ines
4eefb288e3
Port over PR #1055
2017-05-13 03:25:32 +02:00
Matthew Honnibal
ee1d35bdb0
Fix merge conflict
2017-05-13 03:20:19 +02:00
Matthew Honnibal
b2540d2379
Merge Kengz's tree_print patch
2017-05-13 03:18:49 +02:00
Matthew Honnibal
827b5af697
Update draft of parser neural network model
...
Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.
Outline of the model:
We first predict context-sensitive vectors for each word in the input:
(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4
This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features.
To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this
by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a
representation that's one affine transform from this informative lexical information. This is obviously good for the
parser (which backprops to the convolutions too).
The parser model makes a state vector by concatenating the vector representations for its context tokens. Current
results suggest few context tokens works well. Maybe this is a bug.
The current context tokens:
* S0, S1, S2: Top three words on the stack
* B0, B1: First two words of the buffer
* S0L1, S0L2: Leftmost and second leftmost children of S0
* S0R1, S0R2: Rightmost and second rightmost children of S0
* S1L1, S1L2, S1R2, S1R, B0L1, B0L2: Likewise for S1 and B0
This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately,
there's a way to structure the computation to save some expense (and make it more GPU friendly).
The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks
with a non-monotonic transition). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications
for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden
weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN
-- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model
is so big.)
This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity.
The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved
to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier.
We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle
in CUDA to train.
Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to
be 0 cost. This is defined as:
(exp(score) / Z) - (exp(score) / gZ)
Where gZ is the sum of the scores assigned to gold classes. I'm very interested in regressing on the cost directly,
but so far this isn't working well.
Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit
greatly from the pre-computation trick.
2017-05-12 16:09:15 -05:00
ines
c4857bc7db
Remove unused argument
2017-05-12 15:37:54 +02:00
ines
c13b3fa052
Add LEX_ATTRS
2017-05-12 15:37:45 +02:00
ines
bca2ea9c72
Update Portuguese lexical attributes
2017-05-12 15:37:39 +02:00
ines
2f870123bf
Fix formatting
2017-05-12 15:37:20 +02:00
ines
ca65993d59
Add basic Polish Language class
2017-05-12 09:25:37 +02:00
ines
48177c4f92
Add missing tokenizer exceptions
2017-05-12 09:25:24 +02:00
ines
bb8be3d194
Add Danish language data
2017-05-10 21:15:12 +02:00
Matthew Honnibal
4efb391994
Fix serializer
2017-05-09 18:45:18 +02:00
Matthew Honnibal
b16ae75824
Remove serializer hacks from pipeline classes
2017-05-09 18:16:40 +02:00
Matthew Honnibal
7253b4e649
Remove old serialization tests
2017-05-09 18:12:58 +02:00
Matthew Honnibal
f9327343ce
Start updating serializer test
2017-05-09 18:12:03 +02:00
Matthew Honnibal
1166b0c491
Implement Doc.to_bytes and Doc.from_bytes methods
2017-05-09 18:11:34 +02:00
Matthew Honnibal
9e167b7bb6
Strip serializer from code
2017-05-09 17:28:50 +02:00
Matthew Honnibal
b53f7dfdc3
Remove spacy.serialize
2017-05-09 17:22:06 +02:00
Matthew Honnibal
62ecdea9f2
Add binder class for document serialization
2017-05-09 17:21:00 +02:00
ines
a0b00624bb
Make sure like_email returns bool
2017-05-09 11:37:29 +02:00
ines
ea60932e1b
Fix formatting
2017-05-09 11:08:14 +02:00
ines
2c3bdd09b1
Add English test for like_num
2017-05-09 11:06:34 +02:00
ines
22375eafb0
Fix and merge attrs and lex_attrs tests
2017-05-09 11:06:25 +02:00
ines
02d0ac5cab
Remove redundant function and fix formatting
2017-05-09 11:06:04 +02:00
ines
b5ca50607e
Reorganise entity rules
2017-05-09 01:37:10 +02:00
ines
564939391a
Remove spacy.orth
2017-05-09 01:21:47 +02:00
ines
12c3d5fbba
Fix formatting
2017-05-09 01:15:28 +02:00
ines
2829a024ef
Re-add basic like_num check to global lex_attrs
2017-05-09 01:15:23 +02:00
ines
88adeee548
Add English lex_attrs overrides
2017-05-09 01:09:52 +02:00
ines
8f3fbbb147
Fix typos
2017-05-09 01:09:37 +02:00
ines
ea5fa46475
Import LEX_ATTRS from lang.lex_attrs
2017-05-09 00:58:10 +02:00
ines
2216e5f326
Reorganise lex_attrs and add dict
2017-05-09 00:57:54 +02:00
ines
e666f14d20
Add global lex_attrs
2017-05-09 00:41:53 +02:00
ines
41972c43fe
Use consistent regex imports
2017-05-09 00:34:31 +02:00
ines
7b83977020
Remove unused munge package
2017-05-09 00:16:16 +02:00
ines
c714841cc8
Move language-specific tests to tests/lang
2017-05-09 00:02:37 +02:00
ines
bd57b611cc
Update conftest to lazy load languages
2017-05-09 00:02:21 +02:00
ines
9f0fd5963f
Reorganise Hungarian punctuation rules
2017-05-09 00:01:59 +02:00
ines
fc0d793360
Reorganise Bengali punctuation rules
2017-05-09 00:01:52 +02:00
ines
e895d1afd7
Reorganise French punctuation rules
2017-05-09 00:00:54 +02:00
ines
014bda0ae3
Reorganise global punctuation rules
2017-05-09 00:00:46 +02:00
ines
a91278cb32
Rename _URL_PATTERN to URL_PATTERN
2017-05-09 00:00:00 +02:00
ines
604f299cf6
Add char classes to global language data
2017-05-08 23:59:33 +02:00
ines
f6f5d78cb9
Fix formatting
2017-05-08 23:59:17 +02:00
ines
6eb6306843
Fix language data imports
2017-05-08 23:58:31 +02:00
ines
3c0f85de8e
Remove imports in /lang/__init__.py
2017-05-08 23:58:07 +02:00
ines
86d9c29f30
Reorder util functions
2017-05-08 23:51:15 +02:00
ines
9a0d2fdef1
Add load_lang_class() util function
2017-05-08 23:50:45 +02:00
ines
614aa09582
Tidy up Bengali tokenizer exceptions
2017-05-08 22:29:49 +02:00
ines
73b577cb01
Fix relative imports
2017-05-08 22:29:04 +02:00
ines
ae99990f63
Fix formatting
2017-05-08 22:23:48 +02:00
ines
f46ffe3e89
Move language data to /lang module
2017-05-08 20:00:40 +02:00
ines
41a322c733
Fix LEMMA in exceptions and morph rules
2017-05-08 19:57:36 +02:00
ines
2edc0aee12
Update warning message
2017-05-08 19:53:36 +02:00
ines
6025cdb992
Fix string interpolation in times
2017-05-08 16:38:16 +02:00
ines
b9ba58ba5c
Add function to resolve load name
...
Warn if old 'path' keyword argument is used.
2017-05-08 16:33:37 +02:00
ines
e6f1a5d0a1
Add unicode declaration
2017-05-08 16:22:17 +02:00
ines
be5541bd16
Fix import and tokenizer exceptions
2017-05-08 16:20:14 +02:00
ines
2324788970
Remove bad tests
2017-05-08 16:15:27 +02:00
ines
b88c4193e7
Add missing symbol
2017-05-08 16:15:20 +02:00
ines
9a5b2bdd4c
Don't set morph rules without tag map
2017-05-08 16:15:12 +02:00
ines
4930f0fa8f
Explicitly import TOKEN_MATCH
2017-05-08 16:11:54 +02:00
ines
50b7ec03ca
Fix typo
2017-05-08 16:11:45 +02:00
ines
3ca611fe48
Fix wildcard imports
2017-05-08 15:56:29 +02:00
ines
c2469b8135
Remove __all__ export
2017-05-08 15:56:22 +02:00
ines
14a9c3ee7a
Fix wildcard import
2017-05-08 15:56:13 +02:00
ines
deed623864
Remove comment
2017-05-08 15:56:05 +02:00
ines
e7f95c37ee
Merge base tokenizer exceptions
2017-05-08 15:55:52 +02:00
ines
24606d364c
Remove redundant language_data.py files in languages
...
Originally intended to collect all components of a language, but just
made things messy. Now each component is in charge of exporting itself
properly.
2017-05-08 15:55:29 +02:00
ines
a627d3e3b0
Reorganise Chinese language data
2017-05-08 15:54:36 +02:00
ines
7b86ee093a
Reorganise Swedish language data
2017-05-08 15:54:29 +02:00
ines
50510fa947
Reorganise Portuguese language data
2017-05-08 15:52:01 +02:00
ines
279895ea83
Reorganise Dutch language data
2017-05-08 15:51:39 +02:00
ines
04ef5025bd
Reorganise Norwegian language data
2017-05-08 15:51:22 +02:00
ines
5edbc725d8
Reorganise Japanese language data
2017-05-08 15:50:46 +02:00
ines
51a389d3bb
Reorganise Italian language data
2017-05-08 15:50:17 +02:00
ines
1bbfa14436
Reorganise Hungarian language data
2017-05-08 15:49:56 +02:00
ines
a77c9fc60d
Reorganise Hebrew language data
2017-05-08 15:49:28 +02:00
ines
7f05e977fa
Reorganise French language data
2017-05-08 15:49:05 +02:00
ines
0207ffdd52
Reorganise Finnish language data
2017-05-08 15:48:31 +02:00
ines
8e483ec950
Reorganise Spanish language data
2017-05-08 15:48:04 +02:00
ines
c7c21b980f
Reorganise English language data
2017-05-08 15:47:25 +02:00
ines
1bf9d5ec8b
Reorganise German language data
2017-05-08 15:44:26 +02:00
ines
7b3a983f96
Reorganise Bengali language data
2017-05-08 15:43:50 +02:00
ines
607ba458e7
Fix whitespace
2017-05-08 15:42:31 +02:00
ines
60db497525
Add update_exc and expand_exc to util
...
Doesn't require separate language data util anymore
2017-05-08 15:42:12 +02:00
Matthew Honnibal
b44f7e259c
Clean up unused parser code
2017-05-08 15:42:04 +02:00
ines
6e5bd4f228
Remove unused functions from deprecated
2017-05-08 15:40:16 +02:00
Matthew Honnibal
17efb1c001
Change width
2017-05-08 08:40:13 -05:00
ines
f68e420bc0
Add PRON_LEMMA and DET_LEMMA to deprecated
...
Will be replaced with proper values across the language data later.
2017-05-08 15:35:30 +02:00
ines
bd6a7cf4f6
Simplify deprecated model downloading
...
Only relevant for spaCy < v1.7.0.
2017-05-08 15:32:10 +02:00
ines
95edd9e896
Let parse_package_meta take full path
2017-05-08 15:30:48 +02:00
ines
326746eb15
Add util function to resolve arg to model path
...
1. check if in data dir or shortcut link
2. check if installed as a pip package
3. check if string is path to model
4. check if Path or Path-like object
2017-05-08 15:29:47 +02:00
Matthew Honnibal
bef89ef23d
Mergery
2017-05-08 08:29:36 -05:00
ines
a7801e7342
Update spacy.load()
...
path argument is now deprecated and name can either take a model name
or path. Implement lazy loading by importing module and read Language
class name off __all__.
2017-05-08 15:27:25 +02:00
Matthew Honnibal
50ddc9fc45
Fix infinite loop bug
2017-05-08 07:54:26 -05:00
Matthew Honnibal
94e86ae00a
Predict tags with encoder
2017-05-08 07:53:45 -05:00
Matthew Honnibal
56073a11ef
Don't use tags when calculating token vectors
2017-05-08 07:52:24 -05:00
Matthew Honnibal
a66a4a4d0f
Replace einsums
2017-05-08 14:46:50 +02:00
Matthew Honnibal
8d2eab74da
Use PretrainableMaxouts
2017-05-08 14:24:55 +02:00
Matthew Honnibal
807cb2e370
Add PretrainableMaxouts
2017-05-08 14:24:43 +02:00
Matthew Honnibal
2e2268a442
Precomputable hidden now working
2017-05-08 11:36:37 +02:00
ines
94697e9afc
Fix typo
2017-05-08 02:00:37 +02:00
ines
0ee2a22b67
Merge branch 'pr/1024' into develop
2017-05-08 01:12:44 +02:00
ines
c4492d260a
Fix kwargs
2017-05-08 01:05:24 +02:00
Matthew Honnibal
10682d35ab
Get pre-computed version working
2017-05-08 00:38:35 +02:00
ines
b5a726c5cd
Tidy up deprecated.py
2017-05-07 23:29:22 +02:00
ines
59c3b9d4dd
Tidy up CLI and fix print functions
2017-05-07 23:25:29 +02:00
ines
311704674d
Add path2str compat function
2017-05-07 23:24:56 +02:00
ines
e34069db9f
Move is_package and get_model_package_path to util
2017-05-07 23:24:51 +02:00
ines
957ba676b4
Add model files base path to about.py
2017-05-07 23:22:35 +02:00
ines
8d8dd9ceb2
Don't set default value for model
2017-05-07 23:22:21 +02:00
Matthew Honnibal
35458987e8
Checkpoint -- nearly finished reimpl
2017-05-07 23:05:01 +02:00
Matthew Honnibal
4441866f55
Checkpoint -- nearly finished reimpl
2017-05-07 22:47:06 +02:00
Matthew Honnibal
6782eedf9b
Tmp GPU code
2017-05-07 11:04:24 -05:00
Matthew Honnibal
e420e5a809
Tmp
2017-05-07 07:31:09 -05:00
Matthew Honnibal
12039e80ca
Switch to single matmul for state layer
2017-05-07 14:26:34 +02:00
Matthew Honnibal
700979fb3c
CPU/GPU compat
2017-05-07 04:01:11 +02:00
Matthew Honnibal
f99f5b75dc
working residual net
2017-05-07 03:57:26 +02:00
Matthew Honnibal
bdf2dba9fb
WIP on refactor, with hidde pre-computing
2017-05-07 02:02:43 +02:00
Matthew Honnibal
b439e04f8d
Learning smoothly
2017-05-06 20:38:12 +02:00
Matthew Honnibal
08bee76790
Learns things
2017-05-06 18:24:38 +02:00
Matthew Honnibal
04ae1c01f1
Learns things
2017-05-06 18:21:02 +02:00
Matthew Honnibal
bcf4cd0a5f
Learns things
2017-05-06 17:37:36 +02:00
Matthew Honnibal
8e48b58cd6
Gradients look correct
2017-05-06 16:47:15 +02:00
Matthew Honnibal
7e04260d38
Data running through, likely errors in model
2017-05-06 14:22:20 +02:00
Matthew Honnibal
fa7c1990b6
Restore tok2vec function
2017-05-05 20:12:03 +02:00
Matthew Honnibal
efe9630e1c
Bug fixes
2017-05-05 20:09:50 +02:00
Matthew Honnibal
ef4fa594aa
Draft of NN parser, to be tested
2017-05-05 19:20:39 +02:00
Matthew Honnibal
7d1df50aec
Draft up Parser model
2017-05-04 13:31:40 +02:00
Matthew Honnibal
ccaf26206b
Pseudocode for parser
2017-05-04 12:17:59 +02:00
ines
b1f22c5a10
Fix formatting
2017-05-03 20:11:02 +02:00
ines
a04b5be1b2
Add glossary for annotation scheme ( closes #1034 )
...
Can be imported as explain from spacy.glossary, or called as
spacy.explain(term)
2017-05-03 17:02:17 +02:00
Gregory Howard
929f2792a7
Rennaming cls in module. cls is now a class
2017-05-03 15:41:07 +02:00
Gregory Howard
0e8c41ea4f
Adding method lemmatizer for every class
2017-05-03 12:14:42 +02:00
Gregory Howard
32ca07989e
adding export japanese
2017-05-03 11:07:29 +02:00
Grégory Howard
f9d7144224
Merge branch 'master' into master
2017-05-03 11:04:51 +02:00
Gregory Howard
f2ab7d77b4
Lazy imports language
2017-05-03 11:01:42 +02:00
Ines Montani
3ea23a3f4d
Fix formatting
2017-05-03 09:44:38 +02:00
Ines Montani
d730eb0c0d
Raise custom ImportError if importing janome fails
2017-05-03 09:43:29 +02:00
Ines Montani
949ad6594b
Add newline
2017-05-03 09:38:43 +02:00
Ines Montani
d12ca587ea
Add newline
2017-05-03 09:38:29 +02:00
Ines Montani
8676cd0135
Add newline
2017-05-03 09:38:07 +02:00
Yasuaki Uechi
c8f83aeb87
Add basic japanese support
2017-05-03 13:56:21 +09:00
Gregory Howard
c0afcd22bb
Merge remote-tracking branch 'remotes/upstream/master'
2017-04-27 14:42:54 +02:00
Matthew Honnibal
31ec9e1371
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-27 13:21:39 +02:00
Matthew Honnibal
2da16adcc2
Add dropout optin for parser and NER
...
Dropout can now be specified in the `Parser.update()` method via
the `drop` keyword argument, e.g.
nlp.entity.update(doc, gold, drop=0.4)
This will randomly drop 40% of features, and multiply the value of the
others by 1. / 0.4. This may be useful for generalising from small data
sets.
This commit also patches the examples/training/train_new_entity_type.py
example, to use dropout and fix the output (previously it did not output
the learned entity).
2017-04-27 13:18:39 +02:00
Gregory Howard
92f368f83b
Removing extra spaces
2017-04-27 12:02:14 +02:00
Gregory Howard
13b6957c8e
Adding unitest for tokenization in french (with title)
2017-04-27 11:53:44 +02:00
Gregory Howard
8ff4682255
correcting tokenizer exception.
...
Adding tests for lemmatization
2017-04-27 11:52:14 +02:00
Ines Montani
7da9cefd25
Merge pull request #1022 from luvogels/master
...
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Ines Montani
c9e592ae6c
Add newline
2017-04-27 11:15:41 +02:00
Ines Montani
5942adccc2
Add newline
2017-04-27 11:15:19 +02:00
Ines Montani
4cd9269aef
Add newline
2017-04-27 11:15:04 +02:00
Ines Montani
ccf13ecc21
Add newline
2017-04-27 11:14:42 +02:00
Ines Montani
03d2b0cc05
Add newline
2017-04-27 11:14:26 +02:00
Gregory Howard
44cb486849
Adding unitest for tokenization in french (with title)
2017-04-27 10:59:38 +02:00
Gregory Howard
ad8129cb45
Improvement of rules now title insentive and have same declaration format
2017-04-27 10:23:56 +02:00
luvogels
d12a0b6431
Hooked up tokenizer tests
2017-04-26 23:21:41 +02:00
Matthew Honnibal
f0e1606d27
Increment version
2017-04-26 20:25:41 +02:00
luvogels
b331929a7e
Merge branch 'master' of https://github.com/luvogels/spaCy
2017-04-26 19:15:48 +02:00
luvogels
8de59ce3b9
Added tokenizer tests
2017-04-26 19:10:18 +02:00
Matthew Honnibal
4d98511db7
Make Span hashable. Closes #1019
2017-04-26 19:01:05 +02:00
Matthew Honnibal
24c4c51f13
Try to make test999 less flakey
2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang
460094bf09
Update __init__.py
2017-04-26 18:27:55 +02:00
ines
527d51ac9a
Fetch shortcuts from GitHub and improve error handling
2017-04-26 18:00:28 +02:00
Gregory Howard
ed5f094451
Adding insensitive lemmatisation test
2017-04-25 18:07:02 +02:00
ghoward
26e31afc18
renamming tests
2017-04-25 17:46:01 +02:00
ghoward
c085c2d391
Adding some unitests
2017-04-25 17:44:16 +02:00
ghoward
55c6910f90
Look_up table for languages in spacy.
...
Need to find an another name for lemmatizerlookup. I was not inspired.
Trying to uses new files in fr language.
2017-04-24 16:39:00 +02:00
Matthew Honnibal
c4be9c36fe
Fix unicode header in tests
2017-04-24 10:09:01 +02:00
Matthew Honnibal
65f10b53e5
Fix test
2017-04-24 00:25:55 +02:00
Matthew Honnibal
70a43858e1
Fix flakey test
2017-04-24 00:06:30 +02:00
Matthew Honnibal
3973af2d15
Make training test less flakey
2017-04-23 22:59:34 +02:00
Matthew Honnibal
4f9657b42b
Fix reporting if no dev data with train
2017-04-23 22:27:10 +02:00
Matthew Honnibal
df2ac8b843
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-23 21:25:07 +02:00
Matthew Honnibal
d0e19267e8
Create directory if missing in save_to_directory
2017-04-23 21:24:43 +02:00
ines
42305bc519
Remove unnecessary test
2017-04-23 21:21:41 +02:00
ines
012ea594d1
Add file for misc tests
2017-04-23 21:06:51 +02:00
ines
83f66947dc
Rename test_download to test_cli
2017-04-23 21:06:50 +02:00
ines
401045433c
Simplify compat.fix_text
2017-04-23 21:06:50 +02:00
Matthew Honnibal
e033c86a64
Increment version
2017-04-23 21:03:43 +02:00
Matthew Honnibal
d2436dc17b
Update fix for Issue #999
2017-04-23 18:14:37 +02:00
Matthew Honnibal
874a3cbb07
Add test for Issue #955
2017-04-23 17:57:01 +02:00
Matthew Honnibal
60703cede5
Ensure noun chunks can't be nested. Closes #955
2017-04-23 17:56:39 +02:00
Matthew Honnibal
c9ec24b257
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-23 17:07:46 +02:00
Matthew Honnibal
5d8af40445
Add test for Issue #999
2017-04-23 17:06:30 +02:00
Matthew Honnibal
4d2a659c52
Fix json dump for Python3
2017-04-23 17:05:53 +02:00
Matthew Honnibal
040751ad17
Remove xfail on Test #910
2017-04-23 16:28:55 +02:00
ines
3a9710f356
Pass dev_scores to print_progress correctly ( resolves #1008 )
...
Only read scores attribute if command is used with dev_data, otherwise
default dev_scores to empty dict.
2017-04-23 15:58:40 +02:00
Matthew Honnibal
1b12f342e4
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-20 17:03:11 +02:00
Matthew Honnibal
4eef200bab
Persist the actions within spacy.parser.cfg
2017-04-20 17:02:44 +02:00
ines
25c70b4cc5
Move fix_text to spacy.compat (see #1002 )
2017-04-20 15:47:17 +02:00
Ines Montani
60b5243bee
Merge pull request #1002 from oroszgy/model_cli_fix
...
Fixes for the `model` CLI
2017-04-20 15:41:03 +02:00
Gyorgy Orosz
4a06a2572c
Using ftfy for handling broken encoded strings.
2017-04-20 13:34:51 +02:00
Ines Montani
3800b29046
Merge pull request #1001 from recognai/master
...
Add SPACE to es tag map
2017-04-20 12:16:34 +02:00
oeg
f0bcd0babb
fix(model): Add SPACE to es tag_map. Fixing error in morphology.pyx when SP tag is missing
2017-04-20 11:36:24 +02:00
Ben Eyal
e90e8a3f10
Enable test
2017-04-20 02:25:24 +03:00
Ben Eyal
33af52599e
Redefine alphabetic characters
...
For caseless languages (Hebrew, Bengali) all characters are both lowercase and uppercase.
2017-04-20 02:25:02 +03:00
Ben Eyal
d8098a8be2
Use regex
instead of re
2017-04-20 02:22:52 +03:00
oeg
daaa42dd25
Merge remote-tracking branch 'upstream/master'
2017-04-19 23:30:36 +02:00
oeg
936a297241
fix(model): Fix tag map for fixing issues with tag SPACE
2017-04-19 23:30:21 +02:00
luvogels
c7cec7e5e2
Update __init__.py
2017-04-19 21:06:30 +02:00
luvogels
55e8cade36
Update __init__.py
2017-04-19 21:06:30 +02:00
luvogels
03abd0c8e6
Update __init__.py
2017-04-19 21:06:30 +02:00
Leif Uwe Vogelsang
538a8d6b12
Resolved merge conflict by incorporating both suggestions.
2017-04-19 21:06:07 +02:00
Leif Uwe Vogelsang
e821c48489
Norwegian language basics
2017-04-19 21:04:01 +02:00
Leif Uwe Vogelsang
3796c668d9
more norwegian
2017-04-19 21:01:32 +02:00
Leif Uwe Vogelsang
bc9557b21f
Norwegian language basics
2017-04-19 21:00:01 +02:00
ines
2bd89e7ade
Tidy up Hebrew tests and test for punctuation (see #995 )
2017-04-19 19:28:03 +02:00
ines
48da244058
Use spacy.compat.json_dumps for Python 2/3 compatibility ( resolves #991 )
2017-04-19 11:50:36 +02:00
ines
ddd5194088
Update Language docs and docstrings
2017-04-17 01:52:13 +02:00
ines
f62b740961
Use compat.json_dumps
2017-04-17 01:46:14 +02:00
ines
8e83f8e2fa
Update docstrings
2017-04-17 01:40:26 +02:00
ines
e2299dc389
Ensure path in save_to_directory
2017-04-17 01:40:14 +02:00
ines
82f5f1f98f
Replace str with compat.unicode_
2017-04-17 01:29:54 +02:00
ines
16a8521efa
Increment version
2017-04-16 22:38:38 +02:00
Matthew Honnibal
4efd6fb9d6
Fix training
2017-04-16 15:28:27 -05:00
Matthew Honnibal
17c9fffb9e
Fix naked except
2017-04-16 15:28:16 -05:00
ines
5610fdcc06
Get language name first if no model path exists
...
Makes sure spaCy fails early if no tokenizer exists, and allows
printing better error message.
2017-04-16 22:16:47 +02:00
ines
ad168ba88c
Set model name to empty string if path override exists
...
Required for parse_package_meta, which composes path of data_path and
model_name (needs to be fixed in the future)
2017-04-16 22:15:51 +02:00
ines
97647c46cd
Add docstring and todo note
2017-04-16 22:14:45 +02:00
ines
5c5f8c0a72
Check if full string is found in lang classes first
...
This allows users to set arbitrary strings. (Otherwise, custom lang
class "my_custom_class" would always load Burmese "my" tokenizer if one
was available.)
2017-04-16 22:14:38 +02:00
ines
13d30b6c01
xfail lemmatizer test that's causing problems (see #546 )
2017-04-16 21:18:39 +02:00
Matthew Honnibal
4931c56afc
Increment version
2017-04-16 13:59:38 -05:00
ines
6145b7c153
Remove redundant Path
2017-04-16 20:53:25 +02:00
Matthew Honnibal
fa89613444
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-16 13:42:56 -05:00
ines
1f9f867c70
Remove unused util function
2017-04-16 20:37:45 +02:00
ines
7670c745b6
Update spacy.load() and fix path checks
2017-04-16 20:37:45 +02:00
ines
d3759dfb32
Fix docstring
2017-04-16 20:37:45 +02:00
ines
ed7e19ad68
Remove unused import
2017-04-16 20:37:45 +02:00
ines
0084466a66
Remove unused utf8open util and replace os.path with ensure_path
2017-04-16 20:37:45 +02:00
Matthew Honnibal
89a4f262fc
Fix training methods
2017-04-16 13:00:37 -05:00
Matthew Honnibal
6a4221a6de
Allow lemma to be set from Python. Re #973
2017-04-16 18:07:53 +02:00
Matthew Honnibal
137b210bcf
Restore use of FTRL training
2017-04-16 18:02:42 +02:00
ines
d10bd0eaf9
Fix formatting
2017-04-16 13:42:34 +02:00
ines
8191e33cf1
Update link error message with info on permissions
2017-04-16 13:32:31 +02:00
ines
a3ddbc0444
Add note about --force flag to error message
2017-04-16 13:14:36 +02:00
ines
e3de035814
Add meta validation to check for required settings
...
Complain if no "lang", "name" or "version" is found (those settings are
used in directory / package names). Package will still build without,
but it'll inevitably fail somewhere down the line.
2017-04-16 13:13:17 +02:00
ines
a7574b7572
Add more options to read in meta data in package command
...
Add meta option to supply path to meta.json. If no meta path is set,
check if meta.json exists in input directory and use it. Otherwise,
prompt for details on the command line.
2017-04-16 13:06:02 +02:00
ines
13c8a42d2b
Fix typos
2017-04-16 13:03:58 +02:00
ines
31fa73293a
Move read_json out to own util function
2017-04-16 13:03:28 +02:00
Matthew Honnibal
45464d065e
Remove print statement
2017-04-15 16:11:43 +02:00
Matthew Honnibal
c76cb8af35
Fix training for new labels
2017-04-15 16:11:26 +02:00
Matthew Honnibal
4884b2c113
Refix StepwiseState
2017-04-15 16:00:28 +02:00
Matthew Honnibal
e6ee7e130f
Fix parse package meta
2017-04-15 13:38:53 +02:00
Matthew Honnibal
1a98e48b8e
Fix Stepwisestate'
2017-04-15 13:35:01 +02:00
ines
0739ae7b76
Tidy up and fix formatting and imports
2017-04-15 13:05:15 +02:00
ines
fefe6684cd
Fix symlink function to check for Windows
2017-04-15 12:17:27 +02:00
ines
35fb4febe2
Fix whitespace
2017-04-15 12:13:45 +02:00
ines
e1efd589c3
Fix json imports and use ujson
2017-04-15 12:13:34 +02:00
ines
958b12dec8
Use pathlib instead of os.path
2017-04-15 12:13:00 +02:00
ines
956dc36785
Move functions to deprecated
2017-04-15 12:12:31 +02:00
ines
c05ec4b89a
Add compat functions and remove old workarounds
...
Add ensure_path util function to handle checking instance of path
2017-04-15 12:11:16 +02:00
ines
26445ee304
Add compat module for Python2/3 and platform compatibility
2017-04-15 12:07:02 +02:00
ines
d24589aa72
Clean up imports, unused code, whitespace, docstrings
2017-04-15 12:05:47 +02:00
ines
561f2a3eb4
Use consistent formatting for docstrings
2017-04-15 11:59:21 +02:00
Matthew Honnibal
d13f0a7017
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-04-14 23:54:57 +02:00
Matthew Honnibal
354458484c
WIP on add_label bug during NER training
...
Currently when a new label is introduced to NER during training,
it causes the labels to be read in in an unexpected order. This
invalidates the model.
2017-04-14 23:52:17 +02:00
Matthew Honnibal
33ba5066eb
Refactor Language.end_training, making new save_to_directory method
2017-04-14 23:51:24 +02:00
ines
84341c2975
Only compile list of models if data_path exists
2017-04-14 16:48:02 +02:00
Gyorgy Orosz
dd3244c08a
Made json dump to produce unicode strings in py2
2017-04-13 23:30:47 +02:00
Gyorgy Orosz
a9469c8173
Fixed typo
2017-04-13 15:24:14 +02:00
ines
41037f0f07
Remove unused imports
2017-04-13 13:52:11 +02:00
ines
1b92c8d5d5
Use unicode paths on Windows/Python 2 and catch other errors ( resolves #970 )
...
try/except here is quite dirty, but it'll at least make sure users see
an error message that explains what's going on
2017-04-10 17:49:51 +02:00
Matthew Honnibal
49e2de900e
Add costs property to StepwiseState, to show which moves are gold.
2017-04-10 11:37:04 +02:00
Matthew Honnibal
e26577b202
Increment version
2017-04-07 18:45:06 +02:00
Matthew Honnibal
40bf7ecf27
Increment version
2017-04-07 18:44:20 +02:00
Matthew Honnibal
1dca7eeb03
Add unicode declaration on new regression test
2017-04-07 18:09:23 +02:00
ines
887827fc6a
Merge branch 'develop'
2017-04-07 17:36:23 +02:00
ines
444dd511c5
Fix xpassing URL test case
2017-04-07 17:36:05 +02:00
ines
bf0f15e762
Add / to tokenizer infixes ( resolves #891 )
2017-04-07 17:30:44 +02:00
ines
00b9011a49
Fix whitespace
2017-04-07 17:29:59 +02:00
ines
f9869e4dc5
Merge branch 'master' into develop
2017-04-07 17:23:40 +02:00
Matthew Honnibal
4a6204dbad
Merge remote-tracking branch 'origin/develop'
2017-04-07 17:20:09 +02:00
Matthew Honnibal
0513c43bf0
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-07 17:07:10 +02:00
Matthew Honnibal
cc36c308f4
Fix noun_chunk rules around coordination
...
Closes #693 .
2017-04-07 17:06:40 +02:00
Matthew Honnibal
ab846256cf
Merge pull request #966 from recognai/master
...
Prepare Spanish language for training models, including configuration, rich-UD tag map and tests
2017-04-07 16:12:29 +02:00
Matthew Honnibal
83dca920d4
Rename test #913 -> #957 , comment
...
Make test for #957 reference correct bug. Add comment.
Previous commit closes #957 .
2017-04-07 15:54:25 +02:00
Matthew Honnibal
be204ed714
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-07 15:50:14 +02:00
Matthew Honnibal
e7b1ee9efd
Switch to regex module for URL identification
...
The URL detection regex was failing on input such as 0.1.2.3, as this
input triggered excessive back-tracking in the builtin re module.
The solution was to switch to the regex module, which behaves better.
Closes #913 .
2017-04-07 15:47:36 +02:00
Matthew Honnibal
5887383fc0
Add test for Issue #913 : Hang from bad regex
2017-04-07 15:47:27 +02:00
ines
7ea1673072
Fix whitespace
2017-04-07 13:28:48 +02:00
ines
255650dbc2
Add connlu2json converter from explosion/spacy-dev-resources/#11
2017-04-07 13:05:12 +02:00
ines
789ce8a45e
Add convert command
2017-04-07 13:04:17 +02:00
ines
9952d3b08a
Fix whitespace
2017-04-07 13:02:05 +02:00
ines
47ddce6eb7
Remove unused variable
2017-04-07 13:01:48 +02:00
ines
dcf8ab0c47
Merge branch 'develop'
2017-04-07 12:00:09 +02:00
ines
75f9b4c6e2
Fix whitespace
2017-04-07 10:22:18 +02:00
oeg
c693d40791
feature(model): Add support for creating the Spanish model, including rich tagset, configuration, and basich tests
2017-04-06 18:48:45 +02:00
oeg
010293fb2f
fix(typo): Fixes typo in method calling PseudoProjectivity.deprojectivize, failing with new train cli
2017-04-06 17:33:15 +02:00
ines
808cd6cf7f
Add missing tags to verbs ( resolves #948 )
2017-04-03 18:12:52 +02:00
ines
ad8bf1829f
Import and combine Portuguese tokenizer exceptions (see #943 )
2017-04-01 10:37:42 +02:00
Ines Montani
f8b2d9c3b7
Merge pull request #943 from mamoit/master
...
Portuguese improvements
2017-04-01 10:32:00 +02:00
ines
3b667a24d4
Remove whitespace
2017-04-01 10:21:08 +02:00
ines
e71a1f4bd0
Fix download commands in error messages (see #946 )
2017-04-01 10:20:57 +02:00
ines
42382d5692
Fix download commands in error messages (see #946 )
2017-04-01 10:19:32 +02:00
ines
d4a59c254b
Remove whitespace
2017-04-01 10:19:01 +02:00
Matthew Honnibal
51882ee2b8
Fix check for setting ent_id in merge
2017-03-31 19:32:01 +02:00
Miguel Almeida
4fde64c4ea
Portuguese contractions and some abreviations
2017-03-31 15:52:55 +01:00
Miguel Almeida
465b240bcb
Review Portuguese stop words
...
Mainly to review typos and add missing masculines/feminines
2017-03-31 13:00:47 +01:00
Matthew Honnibal
fc3900e5b2
Allow ent_id to be set in Token
2017-03-31 14:00:14 +02:00
Matthew Honnibal
9720103428
Improve attribute handlign in doc.merge(). Still unsatisfying
2017-03-31 13:59:58 +02:00
Matthew Honnibal
cfff4e0f61
Improve test
2017-03-31 13:59:32 +02:00
Matthew Honnibal
1bb7b4ca71
Add comment
2017-03-31 13:59:19 +02:00
Matthew Honnibal
725249c59a
Add merge_phrase callback in matcher.pyx
2017-03-31 13:58:59 +02:00
Matthew Honnibal
e854f28304
Add test for Issue #758
...
Issue #758 occurs when no actions are available for a single token
doc after merging.
2017-03-31 13:26:25 +02:00
Miguel Almeida
c1d020b0a6
Remove "ista" from portuguese stop words
2017-03-31 12:26:13 +01:00
Miguel Almeida
17a1e7a119
Add Portuguese numbers and ordinals
2017-03-31 12:21:01 +01:00
Matthew Honnibal
47a3ef06a6
Unhack deprojetivization, moving it into pipeline
...
Previously the deprojectivize() call was attached to the transition
system, and only called for German. Instead it should be a separate
process, called after the parser. This makes it available for any
language. Closes #898 .
2017-03-31 12:31:50 +02:00
Joshua Reeter
564daf6dec
Issue #934 symlink should not convert paths as_posix under windows.
2017-03-30 23:47:45 -05:00
Bruno P. Kinoshita
c2d48974bc
Fix typos in Portuguese stop words
2017-03-30 21:59:18 +13:00
Matthew Honnibal
0fefdfcbda
Merge pull request #935 from ericzhao28/master
...
Add option to use label=ent_type in doc.merge arguments (Bug fix for issue #862 )
2017-03-30 02:51:24 +02:00
ines
4759fd437d
Merge branch 'master' into develop
2017-03-29 10:37:13 +02:00
ines
7e4befec88
Add Hebrew to init and setup.py
2017-03-29 10:34:57 +02:00
Grégory Howard
9c2996b27f
correction of package.py (encoding on open instead of write)
2017-03-29 09:11:02 +02:00
Eric Zhao
aafdf6ffb8
Add option to use label karg to determine ent_type in doc.merge
2017-03-28 23:35:03 -07:00
ines
7198cf1c8a
Remove unused import
2017-03-26 20:56:05 +02:00
ines
7ceaa1614b
Add experimental model init command
2017-03-26 20:51:40 +02:00
Matthew Honnibal
83ba6c247c
Fix init of Language without model
2017-03-26 16:46:00 +02:00
Matthew Honnibal
fa107f95f6
Remove unused train_config command
2017-03-26 09:28:59 -05:00
Matthew Honnibal
df83921f0a
Increment version
2017-03-26 09:27:32 -05:00
Matthew Honnibal
92ac3af21d
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-26 09:26:59 -05:00
Matthew Honnibal
a9b1f23c7d
Enable regression loss for parser
2017-03-26 09:26:30 -05:00
ines
c00d997924
Merge branch 'develop'
2017-03-26 15:57:00 +02:00
Matthew Honnibal
2efdbc08ff
Make training work with directories
2017-03-26 08:46:44 -05:00
ines
007a2492bd
Remove train_config command for now
2017-03-26 15:40:50 +02:00
ines
b297fab062
Update error message for missing commands
2017-03-26 15:40:02 +02:00
ines
7f95023fc0
Fix formatting
2017-03-26 15:37:37 +02:00
ines
5901c8f7f0
Update spacy train CLI documentation
2017-03-26 15:33:48 +02:00
Matthew Honnibal
9dcb58aaaf
Merge CLI changes
2017-03-26 07:30:45 -05:00
Matthew Honnibal
6b7f7a2060
Connect parser L1 option to train CLI
2017-03-26 07:24:07 -05:00
Matthew Honnibal
ed2b106f4d
Fix circular import in lemmatizer
2017-03-26 07:17:07 -05:00
Matthew Honnibal
dec5571bf3
Update train CLI
2017-03-26 07:16:52 -05:00
ines
53cf2f1c0e
Make dev data optional
2017-03-26 11:48:17 +02:00
Matthew Honnibal
5eac089fbe
Merge branch 'master' into develop
2017-03-26 04:45:43 -05:00
ines
0fc56e2544
Update flag and defaults
2017-03-26 11:42:11 +02:00
Matthew Honnibal
2f63806ddb
Update config when adding label. Re #910
2017-03-25 22:35:44 +01:00
Matthew Honnibal
b94286de30
Fix regression test
2017-03-25 22:35:07 +01:00
Matthew Honnibal
c748907a66
Fix errors in previous commit
2017-03-25 22:25:01 +01:00
Matthew Honnibal
4f400fa486
Prevent lemmatization of base nouns
...
Update lemmatizer's base-form check, for change in morphology class.
Closes #903 .
2017-03-25 21:51:12 +01:00
Matthew Honnibal
850d35dcb3
Make morphology use int attributes internally
...
The morphology class was calling the lemmatizer inconsistently,
which some string-valued attributes. This caused Issue #903 .
2017-03-25 21:49:10 +01:00
Matthew Honnibal
4454c1b23f
Block lemmatization of base-form adjectives
...
Fixes check that an adjective is a base form (as opposed to a
comparative or superlative), so that it's not lemmatized.
e.g. inner -!> inn. Closes #912 .
2017-03-25 21:29:57 +01:00
ines
97814f8da6
Update Windows Python 2 link workaround to use helper functions
2017-03-25 14:04:27 +01:00
ines
fdec758113
Add is_windows and is_python2 utility functions
2017-03-25 14:04:02 +01:00
Ines Montani
09837158e4
Merge pull request #921 from solresol/master
...
Possible solution to #909
2017-03-25 13:51:55 +01:00
Greg Baker
b7f714b498
Possible solution to #909
2017-03-25 21:36:38 +11:00
Ines Montani
97cb4d5e3c
Merge branch 'master' into master
2017-03-25 10:03:47 +01:00
Iddo Berger
da135bd823
add hebrew tokenizer
2017-03-24 18:27:44 +03:00
Matthew Honnibal
f40fbc3710
Add test for Issue #910 : Resuming entity training
2017-03-23 23:38:57 +01:00
Matthew Honnibal
9c9cd99144
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-23 11:11:24 +01:00
ines
0035fd9efe
Add spacy train work in progress
2017-03-23 11:08:41 +01:00
ines
d5ebf583a4
Fix formatting
2017-03-23 11:08:30 +01:00
ines
3f20efe165
Merge branch 'develop'
...
# Conflicts:
# spacy/util.py
2017-03-22 17:14:15 +01:00
Ines Montani
f86a3a92d5
Merge pull request #899 from raphael0202/duplicate_keys
...
Remove duplicate keys in [en|fi] language data dicts
2017-03-22 10:20:11 +01:00
Ines Montani
87a2c85e1b
Merge pull request #900 from raphael0202/unused_imports
...
Remove unused import statements
2017-03-22 10:10:43 +01:00
ines
ce065e5d65
Fix imports
2017-03-22 10:02:14 +01:00
Andrew Poliakov
07199c3e8b
Fix infinite recursion in spacy.info
2017-03-22 11:43:22 +03:00
Raphaël Bournhonesque
f332bf05be
Remove unused import statements
2017-03-21 21:08:54 +01:00
ines
c3a9f73896
Fix writing to file
2017-03-21 12:35:22 +01:00
ines
d74aa428ad
Fix path
2017-03-21 12:26:00 +01:00
ines
83a999ea83
Change default license from MIT to CC
2017-03-21 12:24:43 +01:00
ines
ae46647560
Fix brackets
2017-03-21 12:21:42 +01:00
ines
3e134b5b2b
Make sure paths in copytree and rmtree are strings
2017-03-21 12:15:33 +01:00
ines
cf0094187e
Fetch MANIFEST.in from GitHub as well
2017-03-21 11:32:38 +01:00
ines
09b24bc5a9
Add docs for package command
2017-03-21 11:19:21 +01:00
ines
3f4e3fda1d
Update command and fetch file templates from GitHub
...
While feature is still experimental, this allows files to be modified
without having to ship a new version of spaCy.
2017-03-21 11:17:36 +01:00
ines
5230ed5b98
Move directory check and overwriting/creating dirs to own function
2017-03-21 02:06:53 +01:00
ines
46bc3c36b0
Fix typo
2017-03-21 02:06:37 +01:00
ines
64e38f304e
Only import shutil
2017-03-21 02:06:29 +01:00
ines
448a916d0d
Add --force option to override directory
2017-03-21 02:05:34 +01:00
ines
8eb9a2b355
Fix formatting
2017-03-21 02:05:14 +01:00
ines
b2bcdec0f6
Update docstring
2017-03-20 22:50:55 +01:00
ines
bf240132d7
Add cli.package command to build model packages
2017-03-20 22:50:13 +01:00
ines
a54e3c2efe
Remove empty line
2017-03-20 22:49:36 +01:00
ines
5aea327a5b
Add util function to get raw user input
2017-03-20 22:48:56 +01:00
ines
a6c0361803
Handle raw_input vs input in Python 2 and 3
2017-03-20 22:48:32 +01:00
ines
adbcac6591
Fix spacing
2017-03-20 22:48:21 +01:00
Matthew Honnibal
692eb0603d
Fix high memory usage in download command
...
Due to PyPi issue #2984 , installing large packages via pip causes
a large spike in memory usage. The recommended fix is to disable
caching.
2017-03-20 18:24:44 +01:00
ines
f830213c4c
Remove compatibility check test
...
Will only cause problems when incrementing version and not updating
table. Also depends on external URL, which is bad.
2017-03-20 13:20:26 +01:00
Matthew Honnibal
f314d3d044
Increment version
2017-03-20 12:58:24 +01:00
Matthew Honnibal
b487b8735a
Decrease beam density, and fix Python 3 problem in beam
2017-03-20 12:56:05 +01:00
Ines Montani
b6ee241e26
Fix print statements
2017-03-20 11:46:37 +01:00
ines
b8f8d5d8bf
Make sure model_path is a Posix path
...
Otherwise, formatting the success message with model_path.as_posix()
fails when using a local path for linking (linking still works, but the
error message is confusing)
2017-03-19 11:57:13 +01:00
ines
fe0ff00fe1
Fix spacing
2017-03-19 11:55:37 +01:00
ines
5712da6095
Add regression test for #891
2017-03-19 11:48:01 +01:00
Raphaël Bournhonesque
7f579ae834
Remove duplicate keys in [en|fi] data dicts
2017-03-19 11:40:29 +01:00
ines
8de5108af6
Exclude common cache directories from mode list in cli.info
...
This means models called "cache" etc. won't show up in the list, but it
seems worth it.
2017-03-19 01:44:43 +01:00
Matthew Honnibal
6ee2ea1128
Increment version
2017-03-19 01:40:52 +01:00
Matthew Honnibal
797f286c38
Use import to find data package
2017-03-19 01:39:36 +01:00
Matthew Honnibal
5941fb9e92
Make spacy/data a package
2017-03-18 20:04:22 +01:00
Matthew Honnibal
bc10d06bc2
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-18 19:32:54 +01:00
Matthew Honnibal
583628c350
Import metadata into __init__
2017-03-18 19:30:03 +01:00
Matthew Honnibal
1754e0db9b
Call pip via subprocess, to make it use virtualenv
2017-03-18 19:29:36 +01:00
ines
1277abcde2
Remove print statement
2017-03-18 19:14:58 +01:00
Matthew Honnibal
dcec104643
Remove unused import
2017-03-18 18:57:45 +01:00
Matthew Honnibal
703eb7bdbd
Fix link module
2017-03-18 18:57:31 +01:00
Matthew Honnibal
f6c6c89546
Add empty data directory
2017-03-18 18:32:29 +01:00
ines
7d33104180
Use distutils.sysconfig.get_python_lib
...
site.getsitepackages seems to not work as expected in Python 2
2017-03-18 18:20:40 +01:00
Matthew Honnibal
1a53fcc685
Fix CLI for Python 2
2017-03-18 18:14:03 +01:00
ines
aefb898e37
Add title-case version of morph rules ( resolves #686 )
2017-03-18 17:27:11 +01:00
ines
64ec17abc1
Pass xpassing tests and add xfails for failures
2017-03-18 17:20:46 +01:00
ines
d0b85faf69
Pass regression test for #401 ( resolves #401 )
...
Fixed in new English models.
2017-03-18 17:06:49 +01:00
ines
be9daefbdd
Remove actual model downloading from tests
2017-03-18 17:01:10 +01:00
ines
850650221a
Use correct command in deprecated download command message
2017-03-18 17:01:01 +01:00
ines
0dd7710556
Make sure paths are paths
2017-03-18 16:48:52 +01:00
Matthew Honnibal
de0e6385b4
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-18 16:17:28 +01:00
Matthew Honnibal
fe442cac53
Fix #717 : Set correct lemma for contracted verbs
2017-03-18 16:16:10 +01:00
ines
ad934a9abd
Add regression test for #693
2017-03-18 16:12:30 +01:00
ines
f57c616830
Add regression test for #704 and test new model ( resolves #704 )
...
(using new English model)
2017-03-18 16:04:14 +01:00
Matthew Honnibal
413138de79
Fix #719 : Lemmatizer can no longer output empty string
2017-03-18 16:02:06 +01:00
ines
ab1451f997
Don't mark compatibility test as slow
2017-03-18 15:17:39 +01:00
ines
ec3e810662
Add directory cli and set up command line interface
2017-03-18 15:14:48 +01:00
ines
cd94ea1095
Use info module for spacy.info()
2017-03-18 13:01:26 +01:00
ines
e3e25c0a33
Add spacy.info module
...
Print info about spaCy installation, local setup and models. Allow
export in Markdown format to copy-paste into GitHub issues.
2017-03-18 13:01:16 +01:00
ines
0eafc0f2c6
Add util functions to print data as table or markdown list
2017-03-18 13:00:14 +01:00
ines
6b9b444065
Fix imports
2017-03-18 12:59:41 +01:00
ines
a035ebd32a
Use pathlib.Path instead of os.path
2017-03-18 12:59:21 +01:00
ines
9605cf39cc
Handle default path in Language classes
2017-03-18 12:58:45 +01:00
Matthew Honnibal
ac4b88cce9
Fix auto-linking in download command
2017-03-17 21:36:13 +01:00
ines
8a34c3e666
Fix shortcut name
2017-03-17 20:07:34 +01:00
Matthew Honnibal
6420f86f02
Merge changes to __init__.py
2017-03-17 19:51:45 +01:00
ines
e01fbacf81
Update resolve_model_name
2017-03-17 19:26:28 +01:00
ines
aedefef49d
Add function to resolve model names and link them
2017-03-17 18:47:05 +01:00
Matthew Honnibal
d013aba7b5
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-17 18:30:53 +01:00
Matthew Honnibal
854cfce7cf
Make vocabs more compatible across versions
...
Previously, symbols were inserted into the string-store
before strings were loaded. This meant that adding a symbol
would invalidate saved models. We now make sure that strings
are loaded faithfully, so that compatibility is maintained.
2017-03-17 18:29:04 +01:00
Matthew Honnibal
1cc841e600
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-17 08:18:11 -05:00
Matthew Honnibal
4bfc55b532
Auto-add words to vocab when loading vectors
...
When calling vocab.load_vectors_from_bin_loc, ensure that missing
entries are added to the vocab. Otherwise, loading vectors into an
empty vocab object resulted in no vectors being added.
2017-03-17 08:15:59 -05:00
ines
0e533ad0cc
Mark compatibility table test as slow (temporary)
...
Prevent Travis from running test test until models repo is published
2017-03-17 13:11:36 +01:00
ines
279b1d1965
Update version
2017-03-17 12:43:08 +01:00
ines
8af4b9e4df
Fix compatibility.json link
2017-03-17 12:43:03 +01:00
Matthew Honnibal
a630726b13
Fix typo in tests
2017-03-16 20:50:36 -05:00
Matthew Honnibal
f98b30583f
Fix tests
2017-03-16 19:48:00 -05:00
Matthew Honnibal
db51abf685
Fix tests
2017-03-16 18:53:47 -05:00
Matthew Honnibal
adb0b7e43b
Fix loading when no package found
2017-03-16 18:30:23 -05:00
Matthew Honnibal
5c66cffafd
Add tag map for Spanish
2017-03-16 18:05:15 -05:00
Matthew Honnibal
c4351e1165
Update base-form check in lemmatizer, for UD 2.0 morphology
2017-03-16 17:59:31 -05:00
Matthew Honnibal
1e10383e1b
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-16 17:41:13 -05:00
Matthew Honnibal
859315863a
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-16 17:40:07 -05:00
Matthew Honnibal
fea9fe08af
Merge pull request #866 from juanmirocks/master
...
Fix lemmatization of OOV words
2017-03-16 23:37:36 +01:00
Matthew Honnibal
ffd4a19383
Increment version
2017-03-16 17:35:57 -05:00
Matthew Honnibal
28bb546939
Merge pull request #883 from ericzhao28/master
...
Add `lower_` and `upper_` properties to `Span` class
2017-03-16 23:35:47 +01:00
ines
fd60961825
Fix spacing
2017-03-16 23:23:26 +01:00
Matthew Honnibal
890747d8ff
Fix trailing whitespace on morphology features
2017-03-16 17:07:37 -05:00
Matthew Honnibal
af41a9790c
Merge remote-tracking branch 'origin/develop-downloads'
2017-03-16 20:41:37 +01:00
Matthew Honnibal
303a56f173
Get absolute path for linking
2017-03-16 20:41:23 +01:00
ines
3d484c3faf
Don't print in parse_package_meta and accept on_erro callback instead
...
TODO: log warning for missing meta data in spacy.link, as this affects
the Language class returned by spacy.load()
2017-03-16 20:34:50 +01:00
ines
d8c984b65e
Don't exit if no model meta data is present
2017-03-16 20:33:33 +01:00
Matthew Honnibal
2524efc0ac
Merge remote-tracking branch 'origin/develop-downloads'
2017-03-16 20:20:41 +01:00
ines
8253581057
Link model automatically if not direct download
2017-03-16 19:54:51 +01:00
Matthew Honnibal
8843b84bd1
Merge remote-tracking branch 'origin/develop-downloads'
2017-03-16 12:00:42 -05:00
Matthew Honnibal
55f813bfbb
Don't reapply the model during training
2017-03-16 11:59:43 -05:00
Matthew Honnibal
c90dc7ac29
Clean up state initiatisation in transition system
2017-03-16 11:59:11 -05:00
Matthew Honnibal
a46933a8fe
Clean up FTRL parsing stuff.
2017-03-16 11:58:20 -05:00
ines
618ce3b425
Add .meta to Language object
...
Allows getting the current model's meta data, e.g.:
nlp = spacy.load('my-model')
print(nlp.meta)
2017-03-16 17:14:56 +01:00
ines
e348d4434c
Add spacy.info(model_name) to show model meta
...
Allows "previewing" model before loading and making sure it's linked
correctly.
2017-03-16 17:13:40 +01:00
ines
eea3b35e3f
Update model loading to support links
...
Remove match_best_version check, fetch model language from meta instead
of directory name, and don't make too many assumptions – if model is
downloaded via downloader, version should match anyway. (Otherwise,
users should be free to add and load whichever models they want.)
2017-03-16 17:13:08 +01:00
ines
5f3f04bd0a
Add util function to load and parse package meta.json
2017-03-16 17:10:05 +01:00
ines
7f920c2f75
Don't break text in when rendering print_msg
2017-03-16 17:09:50 +01:00
ines
16a63d9676
Add docstring
2017-03-16 17:09:11 +01:00
ines
68c04fa897
Move sys_exit() function to util
2017-03-16 17:08:58 +01:00
ines
ccd1a79988
Add spacy.link module to link model directories to shortcuts
2017-03-16 17:01:51 +01:00
Matthew Honnibal
2611ac2a89
Fix scorer bug for NER, related to ambiguity between missing annotations and misaligned tokens
2017-03-16 09:38:28 -05:00
ines
595d89698a
Add basestring
2017-03-16 10:01:14 +01:00
ines
7b2eca36e4
Revert "Fix formatting and remove unused code"
...
This reverts commit d7898d586f
.
2017-03-16 09:58:41 +01:00
ines
2f0db1dd36
Use small English model as default
2017-03-16 09:54:40 +01:00
Matthew Honnibal
3d0833c3df
Fix off-by-1 in parse features fill_context
2017-03-15 19:55:35 -05:00
Matthew Honnibal
4ef68c413f
Approximate cost in Break transition, to speed things up a bit.
2017-03-15 16:40:27 -05:00
Matthew Honnibal
8543db8a5b
Use ftrl optimizer in parser
2017-03-15 11:56:37 -05:00
ines
4cfc8ffbd2
Reformat pickle tests
2017-03-15 17:39:54 +01:00
ines
2a0fcf1354
Add tests for new download module
2017-03-15 17:39:43 +01:00
ines
71956c94db
Handle deprecated language-specific model downloading
2017-03-15 17:37:55 +01:00
ines
58b884b6d4
Refactor download script and about.py to use new download method
2017-03-15 17:37:18 +01:00
ines
f5d1a39a5b
Add util functions for printing and wrapping messages
2017-03-15 17:35:57 +01:00
ines
d7898d586f
Fix formatting and remove unused code
2017-03-15 17:35:41 +01:00
ines
b672e95045
Fix formatting
2017-03-15 17:35:04 +01:00
ines
0474e706a0
Remove unused deprecated functions for sputnik
2017-03-15 17:34:54 +01:00
ines
b13e7f79b4
Fix formatting and remove unused imports
2017-03-15 17:33:57 +01:00
ines
1101fd3855
Fix formatting and remove unused imports
2017-03-15 17:33:39 +01:00
ines
842782c128
Move fix_deprecated_glove_vectors_loading to deprecated.py
2017-03-15 17:33:29 +01:00
Matthew Honnibal
4cab8ac136
Update morph exceptions test
2017-03-15 09:31:34 -05:00
Matthew Honnibal
d719f8e77e
Use nogil in parser, and set L1 to 0.0 by default
2017-03-15 09:31:01 -05:00
Matthew Honnibal
c61c501406
Update beam-parser to allow parser to maintain nogil
2017-03-15 09:30:22 -05:00
Matthew Honnibal
3d4e389d23
Whitespace
2017-03-15 09:29:42 -05:00
Matthew Honnibal
7769bc31e3
Add beam-search classes
2017-03-15 09:27:41 -05:00
Matthew Honnibal
c79b3129e3
Fix setting of empty lexeme in initial parse state
2017-03-15 09:26:53 -05:00
Matthew Honnibal
d864708072
Add more morphology names in attrs.pyx
2017-03-15 09:26:16 -05:00
Matthew Honnibal
b382dc902c
Add morph rules in Language
2017-03-15 09:24:40 -05:00
Matthew Honnibal
8dbff4f5f4
Wire up English lemma and morph rules.
2017-03-15 09:23:22 -05:00
Matthew Honnibal
f70be44746
Use lemmatizer in code, not from downloaded model.
2017-03-15 04:52:50 -05:00
ines
42ba740dde
Revert "Merge branch 'debug'"
...
This reverts commit 89b79d1178
, reversing
changes made to 02bdf490a1
.
2017-03-13 20:11:52 +01:00
ines
4c5f51e49e
Update regression test
2017-03-13 15:16:11 +01:00
ines
02bdf490a1
Remove regression test to see if it caused pytest Travis error
2017-03-13 13:00:22 +01:00
ines
17018750ac
Add regression test for #717
2017-03-13 12:58:22 +01:00
ines
2883ebfca2
Remove print statement
2017-03-13 12:30:42 +01:00
ines
98c13d8aa9
Add regression test for #401
2017-03-13 12:28:41 +01:00
ines
444d665f9d
Add regression test for #686
2017-03-13 12:23:35 +01:00
ines
46b17e5b51
Add regression test for #719
2017-03-13 12:17:35 +01:00
ines
c8ae682ff9
Add regression test for #636
2017-03-13 12:08:31 +01:00
ines
337f9601f2
Add missing unicode declaration
2017-03-13 12:08:19 +01:00
ines
d70386ec6e
Update docstring in #886 regression test
2017-03-13 12:00:38 +01:00
ines
51ba3ef0a8
Add regression test for #886
2017-03-13 11:44:58 +01:00
ines
eec3f21c50
Add WordNet license
2017-03-12 13:58:24 +01:00
ines
f9e603903b
Rename stop_words.py to word_sets.py and include more sets
...
NUM_WORDS and ORDINAL_WORDS are currently not used, but the hard-coded
list should be removed from orth.pyx and replaced to use
language-specific functions. This will later allow other languages to
use their own functions to set those flags. (In English, this is easier
because it only needs to be checked against a set – in German for
example, this requires a more complex function, as most number words
are one word.)
2017-03-12 13:58:22 +01:00
ines
f24f9b4b7b
Remove unused code
2017-03-12 13:58:22 +01:00
ines
1da29a7146
Use new Lemmatizer data and remove file import
...
Since there's currently only an English lemmatizer, the global
Lemmatizer imports from spacy.en. This is unideal and still needs to be
fixed.
2017-03-12 13:58:22 +01:00
ines
0957737ee8
Add Python-formatted lemmatizer data and rules
2017-03-12 13:58:22 +01:00
ines
c89e30d1a3
Add test for English time exceptions ("1a.m." etc.)
2017-03-12 13:58:22 +01:00