spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-16 22:57:22 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	5cac951a16	Move new parser to nn_parser.pyx, and restore old parser, to make tests pass.	2017-05-14 00:55:01 +02:00
Matthew Honnibal	f8c02b4341	Remove cupy imports from parser, so it can work on CPU	2017-05-14 00:37:53 +02:00
Matthew Honnibal	613ba79e2e	Fiddle with sizings for parser	2017-05-13 17:20:23 -05:00
Matthew Honnibal	e6d71e1778	Small fixes to parser	2017-05-13 17:19:04 -05:00
Matthew Honnibal	188c0f6949	Clean up unused import	2017-05-13 17:18:27 -05:00
Matthew Honnibal	f85c8464f7	Draft support of regression loss in parser	2017-05-13 17:17:27 -05:00
ines	1465c6c221	Add API docs for util functions	2017-05-13 21:23:12 +02:00
ines	144161c58c	Update links to dev resources	2017-05-13 21:23:02 +02:00
ines	1694c24e52	Add docstrings, error messages and fix consistency	2017-05-13 21:22:49 +02:00
ines	ee7dcf65c9	Fix expand_exc to make sure it returns combined dict	2017-05-13 21:22:25 +02:00
ines	824d09bb74	Move resolve_load_name to deprecated	2017-05-13 21:21:47 +02:00
ines	0095d5322b	Update adding languages docs	2017-05-13 18:54:10 +02:00
ines	a4a37a783e	Remove import from non-existing module	2017-05-13 16:00:09 +02:00
ines	1d94c0e98a	Update table of contents	2017-05-13 15:42:51 +02:00
ines	a48e21755e	Add section on testing language tokenizers	2017-05-13 15:39:27 +02:00
ines	5858857a78	Update languages list in conftest	2017-05-13 15:37:54 +02:00
ines	326e677882	Fix syntax highlighting colour of keyword	2017-05-13 15:37:43 +02:00
ines	9f004394aa	Use thicker & round dotted lines in graphic	2017-05-13 15:37:28 +02:00
ines	2f54fefb5d	Update adding languages docs	2017-05-13 14:54:58 +02:00
ines	9003fd25e5	Fix error messages if model is required (resolves #1051) Rename about.__docs__ to about.__docs_models__.	2017-05-13 13:14:02 +02:00
ines	24e973b17f	Rename about.__docs__ to about.__docs_models__	2017-05-13 13:09:00 +02:00
ines	9d85cda8e4	Fix models error message and use about.__docs_models__ (see #1051)	2017-05-13 13:05:47 +02:00
ines	6b942763f0	Tidy up imports	2017-05-13 13:04:40 +02:00
ines	3665acc0de	Update adding languages docs	2017-05-13 12:39:36 +02:00
ines	6e1dbc608e	Fix parse_tree test	2017-05-13 12:34:20 +02:00
ines	573f0ba867	Replace deepcopy	2017-05-13 12:34:14 +02:00
ines	bd428c0a70	Set defaults for light and flat kwargs	2017-05-13 12:34:05 +02:00
ines	c5669450a0	Fix formatting	2017-05-13 12:33:57 +02:00
ines	8c2a0c026d	Fix parse_tree test	2017-05-13 12:32:45 +02:00
ines	6129016e15	Replace deepcopy	2017-05-13 12:32:37 +02:00
ines	df68bf45ce	Set defaults for light and flat kwargs	2017-05-13 12:32:23 +02:00
ines	b9dea345e5	Remove old import	2017-05-13 12:32:11 +02:00
ines	293ee359c5	Fix formatting	2017-05-13 12:32:06 +02:00
ines	2e4db1beb9	Fix formatting	2017-05-13 12:02:39 +02:00
Matthew Honnibal	ad590feaa8	Fix test, which imported English incorrectly	2017-05-13 11:36:19 +02:00
ines	3454f2aca8	Update showcase	2017-05-13 03:32:03 +02:00
ines	e506811a93	Update description	2017-05-13 03:27:50 +02:00
ines	4eefb288e3	Port over PR #1055	2017-05-13 03:25:32 +02:00
Ines Montani	9e292e822e	Merge pull request #1047 from pasupulaphani/patch-1 Update _data.json	2017-05-13 03:24:38 +02:00
Ines Montani	8d742ac8ff	Merge pull request #1055 from recognai/master Enable pruning out rare words from clusters data	2017-05-13 03:22:56 +02:00
Matthew Honnibal	ee1d35bdb0	Fix merge conflict	2017-05-13 03:20:19 +02:00
Matthew Honnibal	12158cc06f	Merge branch 'kengz-master'	2017-05-13 03:19:18 +02:00
Matthew Honnibal	b2540d2379	Merge Kengz's tree_print patch	2017-05-13 03:18:49 +02:00
ines	67726d1837	Update data model docs	2017-05-13 03:10:56 +02:00
ines	915b50c736	Update adding languages docs	2017-05-13 03:10:50 +02:00
ines	7f331eafcd	Add SVG object	2017-05-13 03:10:41 +02:00
ines	d5c83a5810	Fix image mixin to allow figure with no args	2017-05-13 03:10:35 +02:00
ines	a74376dca9	Add flow chart graphics	2017-05-13 03:10:21 +02:00
Matthew Honnibal	827b5af697	Update draft of parser neural network model	2017-05-12 16:09:15 -05:00
    Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.

    Outline of the model:

    We first predict context-sensitive vectors for each word in the input:

    (embed_lower | embed_prefix | embed_suffix | embed_shape)
    >> Maxout(token_width)
    >> convolution ** 4

    This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features. To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a representation that's one affine transform from this informative lexical information. This is obviously good for the parser (which backprops to the convolutions too).

    The parser model makes a state vector by concatenating the vector representations for its context tokens. Current results suggest few context tokens work well. Maybe this is a bug.

    The current context tokens:

    * S0, S1, S2: Top three words on the stack
    * B0, B1: First two words of the buffer
    * S0L1, S0L2: Leftmost and second leftmost children of S0
    * S0R1, S0R2: Rightmost and second rightmost children of S0
    * S1L1, S1L2, S1R1, S1R2, B0L1, B0L2: Likewise for S1 and B0

    This makes the state vector quite long: 13T, where T is the token vector width (128 is working well). Fortunately, there's a way to structure the computation to save some expense (and make it more GPU friendly).

    The parser typically visits 2N states for a sentence of length N (although it may visit more, if it back-tracks with a non-monotonic transition). A naive implementation would require 2N (B, 13T) @ (13T, H) matrix multiplications for a batch of size B. We can instead perform one (BN, T) @ (T, 13*H) multiplication, to pre-compute the hidden weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN -- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model is so big.)

    This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity. The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier. We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle in CUDA to train.

    Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to be 0 cost. This is defined as:

    (exp(score) / Z) - (exp(score) / gZ)

    Where gZ is the sum of the scores assigned to gold classes. I'm very interested in regressing on the cost directly, but so far this isn't working well.

    Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit greatly from the pre-computation trick.
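The pre-computation trick described in commit 827b5af697 can be illustrated with a minimal sketch. This is not the code from the commit: the sizes, the `matmul` helper, and the names `hidden_naive` and `hidden_precomputed` are all hypothetical, plain Python lists stand in for the GPU arrays so the example runs without dependencies, and it assumes every one of the F context slots is filled by a real token (no padding for an empty stack or buffer). It only checks that multiplying each token vector once against a wide (T, F*H) matrix and summing the cached rows gives the same hidden pre-activation as concatenating F token vectors and multiplying by an (F*T, H) matrix for every state.

```python
import random

random.seed(0)
N, T, F, H = 6, 4, 13, 5  # tokens, token width, feature slots, hidden width (illustrative sizes)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

# Token vectors, standing in for the CNN output: shape (N, T).
tokens = [[random.gauss(0, 1) for _ in range(T)] for _ in range(N)]
# One (T, H) weight block per feature slot: shape (F, T, H).
W = [[[random.gauss(0, 1) for _ in range(H)] for _ in range(T)] for _ in range(F)]

def hidden_naive(token_ids):
    """Per state: concatenate F token vectors, then (1, F*T) @ (F*T, H)."""
    state = [x for i in token_ids for x in tokens[i]]   # length F*T
    stacked = [row for block in W for row in block]     # (F*T, H)
    return matmul([state], stacked)[0]

# Once per sentence: (N, T) @ (T, F*H), then view the result as (N, F, H).
W_wide = [[W[f][t][h] for f in range(F) for h in range(H)] for t in range(T)]
flat = matmul(tokens, W_wide)
precomputed = [[row[f * H:(f + 1) * H] for f in range(F)] for row in flat]

def hidden_precomputed(token_ids):
    """Per state: sum the cached (H,) row of the token sitting in each slot."""
    out = [0.0] * H
    for f, i in enumerate(token_ids):
        for h in range(H):
            out[h] += precomputed[i][f][h]
    return out

# A made-up parser state: which token occupies each of the F slots.
state_ids = [random.randrange(N) for _ in range(F)]
a = hidden_naive(state_ids)
b = hidden_precomputed(state_ids)
assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))
```

In this sketch the per-state work drops from a matrix product to F vector additions, which is why the cached table can be moved to the CPU before transition-based parsing starts.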
oeg	cdaefae60a	feature(populate_vocab): Enable pruning out rare words from clusters data	2017-05-12 16:15:19 +02:00

8849 Commits