spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-28 02:46:35 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	079dad28a7	* Update for faster beam training	2015-06-04 19:32:32 +02:00
Matthew Honnibal	f8843906ad	Merge branch 'constituency' Add beam parsing and training from JSON files, with Levenshtein alignment.	2015-06-03 06:07:24 +02:00
Matthew Honnibal	ae653b850a	* Remove unused import from gold.pyx	2015-06-03 06:07:15 +02:00
Matthew Honnibal	a2627b6102	* Fix bug in refactored init_transition	2015-06-03 06:01:26 +02:00
Matthew Honnibal	dd0867645d	* Remove stray const from State header	2015-06-03 00:10:04 +02:00
Matthew Honnibal	6c47b10a6e	* Make optimization to children_in_buffer: stop searching when we would cross a bracket.	2015-06-02 21:05:24 +02:00
Matthew Honnibal	a513ec500f	* Have oracle functions take a struct instead of a Python object	2015-06-02 20:01:06 +02:00
Matthew Honnibal	d1b55310a1	* Refactor _advance_beam function	2015-06-02 18:38:41 +02:00
Matthew Honnibal	0786d9b3c7	* Refactor TransitionSystem, adding set_valid method	2015-06-02 18:38:07 +02:00
Matthew Honnibal	bd82a49994	* Add set_scores method to Model	2015-06-02 18:37:10 +02:00
Matthew Honnibal	a3964957f6	* Add profiling for _state.pyx	2015-06-02 18:36:27 +02:00
Matthew Honnibal	e822df0867	* Fix bugs in new greedy/beam parser	2015-06-02 02:01:33 +02:00
Matthew Honnibal	66dfa95847	* Revise greedy_parse/beam_parse ownership goof	2015-06-02 01:34:19 +02:00
Matthew Honnibal	75658b2ed3	* Remove use of new beam.loss property, to maintain compatibility with older versions of thinc for now.	2015-06-02 00:57:09 +02:00
Matthew Honnibal	7c29362d60	* Rename parser class in parser.pxd, now that beam parsing is supported	2015-06-02 00:53:49 +02:00
Matthew Honnibal	58d5ac0944	* Add beam search capabilities to Parser. Rename GreedyParser to Parser.	2015-06-02 00:28:02 +02:00
Matthew Honnibal	62424e6c76	* Remove unused regularize argument from _ml.Model	2015-06-02 00:27:07 +02:00
Matthew Honnibal	adeb57cb1e	* Fix long line	2015-06-01 23:07:00 +02:00
Matthew Honnibal	e09a08bd00	* Add copy_state function	2015-06-01 23:06:30 +02:00
Matthew Honnibal	c7876aa8b6	* Add get_valid method	2015-06-01 23:06:00 +02:00
Matthew Honnibal	d82f9d958d	* Remove regularization cruft from _ml, move score from .pxd file to .pyx	2015-05-31 18:48:05 +02:00
Matthew Honnibal	5e99ff94c8	* Edits to arc eager oracle. Couldn't figure out how the non-monotonic lines made sense. They seem covered by children_in_stack	2015-05-31 15:14:37 +02:00
Matthew Honnibal	6c5632b71c	* Roll back proposed change to Break transition while investigate effect	2015-05-31 06:49:52 +02:00
Matthew Honnibal	6bba793df3	* Disable the Zipf-reweighting thing while investigate effect	2015-05-31 06:48:43 +02:00
Matthew Honnibal	e77940565d	* Add length cap to distance feature	2015-05-31 05:25:30 +02:00
Matthew Honnibal	fd596351ba	* Fix valency features	2015-05-31 05:24:33 +02:00
Matthew Honnibal	87d6551d19	* Allow gold parse to cut non-projective arcs	2015-05-31 01:11:56 +02:00
Matthew Honnibal	c4f0914b4e	* Fix POS tag evaluation in scorer.py: do evaluate punctuation tags	2015-05-30 18:24:32 +02:00
Matthew Honnibal	9e39a206da	* Fix efficiency of JSON reading, by using ujson instead of stream	2015-05-30 17:54:52 +02:00
Matthew Honnibal	76300bbb1b	* Use updated JSON format, with sentences below paragraphs. Allows use of gold preprocessing flag.	2015-05-30 01:25:46 +02:00
Matthew Honnibal	b76bbbd12c	* Read json files recursively from a directory, instead of requiring a single .json file	2015-05-29 03:52:55 +02:00
Matthew Honnibal	8f31d3b864	* Relax constraint on Break transition for non-monotonic parsing.	2015-05-28 23:39:52 +02:00
Matthew Honnibal	6b2e5c4b8a	* Avoid NER scoring for sentences with some missing NER values.	2015-05-28 22:39:08 +02:00
Matthew Honnibal	d25d31442d	* Hackishly support broken NER annotations. Should fix this.	2015-05-27 19:14:31 +02:00
Matthew Honnibal	7a2725bca4	* Read input json in a streaming way	2015-05-27 19:13:11 +02:00
Matthew Honnibal	6a1c91675e	* Add file to read ENAMEX ner data	2015-05-27 17:36:23 +02:00
Matthew Honnibal	732fa7709a	* Edits to align_raw script, for use in prepare_treebank	2015-05-27 04:23:31 +02:00
Matthew Honnibal	4010b9b6d9	* Pass parameter for regularization in parser.pyx	2015-05-27 03:18:50 +02:00
Matthew Honnibal	4c6058baa7	* Fix evaluation of NER in scorer.py	2015-05-27 03:18:16 +02:00
Matthew Honnibal	6016ee83a6	* Fix reading of NER in gold.pyx	2015-05-27 03:17:50 +02:00
Matthew Honnibal	04bda8648d	* Pass parameter for regularization to model	2015-05-27 03:16:58 +02:00
Matthew Honnibal	f69fe6a635	* Fix heads problem in read_conll	2015-05-27 01:14:54 +02:00
Matthew Honnibal	0eec1d12af	* Add comment about zipf reweighting	2015-05-27 01:14:07 +02:00
Matthew Honnibal	4d37b66c55	* Make Zipf regularization a bit more efficient	2015-05-27 01:12:50 +02:00
Matthew Honnibal	7fc24821bc	* Experiment with Zipfian corruptions when calculating prediction	2015-05-26 22:17:15 +02:00
Matthew Honnibal	eba7b34f66	* Add flag to disable loading of word vectors	2015-05-25 01:02:42 +02:00
Matthew Honnibal	3593babd35	* Add functions for Levenshtein distance alignment	2015-05-24 21:50:48 +02:00
Matthew Honnibal	744f06abf5	* Add script to read OntoNotes source documents	2015-05-24 21:49:58 +02:00
Matthew Honnibal	fc75210941	* Move spacy.syntax.conll to spacy.gold	2015-05-24 21:35:02 +02:00
Matthew Honnibal	765b61cac4	* Update spacy.scorer, to use P/R/F to support tokenization errors	2015-05-24 20:07:18 +02:00
Matthew Honnibal	efe7a7d7d6	* Clean unused functions from spacy.syntax.conll	2015-05-24 20:06:46 +02:00
Matthew Honnibal	78487f3e66	* Update parser oracle for missing heads	2015-05-24 20:05:58 +02:00
Matthew Honnibal	1044a13413	* Begin refactoring scorer to use recall over gold dependencies	2015-05-24 17:40:15 +02:00
Matthew Honnibal	acd1245ad4	* Remove cruft from conll.pyx --- unused stuff about evaluation, which now lives in spacy.scorer	2015-05-24 17:35:49 +02:00
Matthew Honnibal	20f1d868a3	* Tmp commit. Working on whole document parsing	2015-05-24 02:49:56 +02:00
Matthew Honnibal	f2ee9c4feb	* Comment out constituency parsing stuff, so that code compiles	2015-05-20 16:55:05 +02:00
Matthew Honnibal	8ee7c541f1	* Update Constituent definition	2015-05-20 16:03:26 +02:00
Matthew Honnibal	9dfc9c039c	* Work on constituency parsing.	2015-05-20 16:02:51 +02:00
Matthew Honnibal	5a5710e711	* Fix Span.subtree property	2015-05-13 21:53:15 +02:00
Matthew Honnibal	badf030b6c	* Add parse navigation to Span objects	2015-05-13 21:45:19 +02:00
Matthew Honnibal	ca320afe86	* Add docstring for ents attribute	2015-05-13 21:20:47 +02:00
Matthew Honnibal	ba07b925a7	* Fix compile error in conll.pyx	2015-05-12 22:33:47 +02:00
Matthew Honnibal	f1e0272b18	* Disable c-parsing transitions	2015-05-12 22:33:25 +02:00
Matthew Honnibal	03a6626545	* Tmp commit	2015-05-12 20:27:56 +02:00
Matthew Honnibal	9568ebed08	* Fix off-by-one in head reading	2015-05-12 20:27:56 +02:00
Matthew Honnibal	69840d8cc3	* Tweak verbose output printing in scorer.py	2015-05-12 20:27:56 +02:00
Matthew Honnibal	0605af6838	* Fix head misalignment in read_conll, when periods are ignored	2015-05-12 20:27:56 +02:00
Matthew Honnibal	d2ac8d8007	* Add ctnt field to State, in preparation for constituency parsing	2015-05-12 20:27:56 +02:00
Matthew Honnibal	ab67693393	* Add read_json_file to conll.pyx	2015-05-12 20:27:55 +02:00
Matthew Honnibal	aff9359a8d	* Update ner.pyx to expect brackets from gold_tuples	2015-05-12 20:27:55 +02:00
Matthew Honnibal	0ad72a77ce	* Write JSON files, with both dependency and PSG parses	2015-05-12 20:27:55 +02:00
Matthew Honnibal	d48218f4b2	* Add left_edge and right_edge properties	2015-05-12 20:27:55 +02:00
Matthew Honnibal	53cf77e1c8	* Bug fix: when non-monotonically correct a dependency, make sure to delete the old one from the child list	2015-05-12 20:26:41 +02:00
Matthew Honnibal	a4e2af54f9	* Add support for l/r edge to add_dep, and move inlined methods into _state.pyx where possible	2015-05-12 20:26:41 +02:00
Matthew Honnibal	d634038eb6	* Add l_edge and r_edge props in TokenC for tracking the parse-yield of the token	2015-05-12 20:26:41 +02:00
Matthew Honnibal	03ebf70a66	* Inc version to 0.84	2015-05-12 02:38:51 +02:00
Matthew Honnibal	e73eaf2d05	* Replace some assertions with proper errors	2015-05-08 16:52:17 +02:00
Matthew Honnibal	fb8d50b3d5	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-04-30 12:45:15 +02:00
Matthew Honnibal	ed8e8c3bd0	* Whitespace	2015-04-29 14:22:47 +02:00
Matthew Honnibal	378c2a6435	* Fix POS model: make it use tag instead of pos in history features	2015-04-29 00:02:53 +02:00
Matthew Honnibal	763ef01575	* Fix two bugs in feature calculation	2015-04-28 23:25:09 +02:00
Matthew Honnibal	b3fd48c97b	* Fix missing root labels bug identified in Issue #57	2015-04-28 20:45:51 +02:00
Jordan Suchow	3a8d9b37a6	Remove trailing whitespace	2015-04-19 13:01:38 -07:00
Jordan Suchow	5f0f940a1f	Remove unused imports	2015-04-19 01:05:22 -07:00
Matthew Honnibal	cc4e395927	* Add some ad hoc regexes, for multi-word location prepositions	2015-04-17 04:44:24 +02:00
Matthew Honnibal	f7ffd94e6a	* Add Token.conjuncts property	2015-04-17 01:40:53 +02:00
Matthew Honnibal	684d0e5e85	* Download updated data	2015-04-16 04:29:15 +02:00
Matthew Honnibal	2ef170a991	* Fix Issue #54 : Error merging multi-word token when there's a mid-token match.	2015-04-16 04:28:06 +02:00
Matthew Honnibal	42617548af	* Disable merge_mwes by default	2015-04-16 04:20:31 +02:00
Matthew Honnibal	99dbf8a38c	* Fix error type in lookup_transition	2015-04-16 01:36:22 +02:00
Matthew Honnibal	77d0700caf	* Add on X way regexes	2015-04-16 01:35:46 +02:00
Matthew Honnibal	9f16848b60	* Add (N0w, N1w) unigram pair to NER features, prompted by failure to detect 'this weekend'	2015-04-15 06:01:18 +02:00
Matthew Honnibal	c6707778dd	* Fix Issue #51 : Handle non-ascii lemmas correctly	2015-04-13 22:28:59 +02:00
Matthew Honnibal	bf0aff5124	* Fix bug in Tokens.ents where entity wasn't being emitted if another started immediately after	2015-04-13 21:34:33 +02:00
Matthew Honnibal	2b84a90bbb	* Fix Issue #50 : Python 3 compatibility of v0.80	2015-04-13 05:59:43 +02:00
Matthew Honnibal	fbd48c571d	* Rearrange code in tokens.pyx	2015-04-13 05:41:25 +02:00
Matthew Honnibal	507048dc45	* Rename StandardError to Exception, for Python 3 compatibility	2015-04-12 07:28:34 +02:00
Matthew Honnibal	761a19113a	* Fix /tmp moving thing in download.py	2015-04-12 07:04:10 +02:00
Matthew Honnibal	248a2b4b0f	* Remove Spans class	2015-04-12 04:07:29 +02:00
Matthew Honnibal	1d05e6da00	* Add ne_iob and ne_type features to NER	2015-04-10 19:07:08 +02:00
Matthew Honnibal	4df8a3d90f	* Add ne_iob and ne_type attributes to context vector	2015-04-10 05:02:15 +02:00
Matthew Honnibal	8c354c432b	* Add ValueError condition to ner_tag reading	2015-04-10 04:59:59 +02:00
Matthew Honnibal	435cccf098	* Add read_conll03_file function to conll.pyx	2015-04-10 04:59:11 +02:00
Matthew Honnibal	99c9ecfc18	* Fix bug in prefix, suffix and word shape features in parser and NER	2015-04-10 03:53:33 +02:00
Matthew Honnibal	cff2b13fef	* Fix Issue #44 : Broken Token.string attribute when single word sentence	2015-04-07 06:08:25 +02:00
Matthew Honnibal	6640386b25	* Fix Issue #43 : TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way.	2015-04-07 06:00:57 +02:00
Matthew Honnibal	b64b2bd910	* Fix Issue #43 : TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way.	2015-04-07 06:00:30 +02:00
Matthew Honnibal	f9e510a893	* Whitespace	2015-04-07 04:53:59 +02:00
Matthew Honnibal	66c7ccf6cc	* Fix Spans.orth_	2015-04-07 04:53:40 +02:00
Matthew Honnibal	b8d34531c4	* Add support for units to English.__init__, by loading and applying regular expressions	2015-04-07 04:02:32 +02:00
Matthew Honnibal	0ea5af88b6	* Add multi-word expression RegexMatcher	2015-04-07 03:45:40 +02:00
Matthew Honnibal	2fee67cfa3	* Add regular expressions for English multi-word expressions	2015-04-07 03:45:18 +02:00
Matthew Honnibal	5a075ea3fc	* Ensure NER moves are available for single-word tokens	2015-04-05 22:30:58 +02:00
Matthew Honnibal	a60a366b2c	* Support 'punct' dep label in conll.pyx	2015-04-05 22:30:19 +02:00
Matthew Honnibal	021c972137	* Print parse if verbose in scorer	2015-04-05 22:29:30 +02:00
Matthew Honnibal	fbf19049cf	* Add ent_type_ property	2015-03-31 02:01:29 +02:00
Matthew Honnibal	e70b87efeb	* Add merge() method to Tokens, with fairly brittle/hacky implementation, but quite easy to test. Passing minimal tests. Still need to fix left/right deps in C data	2015-03-30 01:37:41 +02:00
Matthew Honnibal	557856e84c	* Allow regular expressions to specify labels for merged spans	2015-03-27 17:40:52 +01:00
Matthew Honnibal	a3af6b7c3d	* Left-Arc from Root, to allow non-monotonic reduce to compete with left-arc when the stack is not empty.	2015-03-27 17:39:16 +01:00
Matthew Honnibal	db5a43318c	* Improve print_state debug printer	2015-03-27 17:29:58 +01:00
Matthew Honnibal	1705eccbbe	* Remove whitespace	2015-03-27 15:22:39 +01:00
Matthew Honnibal	3feb52374c	* Break apart a condition, for ease of debug printing	2015-03-27 15:21:38 +01:00
Matthew Honnibal	b32f581acb	* Fix bug in ArcEager.get_labels	2015-03-27 15:21:06 +01:00
Matthew Honnibal	5f2a4ff36d	* Fix spans.lemma_	2015-03-26 16:45:38 +01:00
Matthew Honnibal	f4cc222ec3	* Fix NER scoring	2015-03-26 16:45:38 +01:00
Matthew Honnibal	1320bd19db	* Move Span class to own file	2015-03-26 16:45:38 +01:00
Matthew Honnibal	6f47a667cf	* Move Span class to own file	2015-03-26 16:45:38 +01:00
Matthew Honnibal	f02c39dfaf	* Compare to is not None, for more robustness	2015-03-26 16:44:48 +01:00
Matthew Honnibal	8f68b864c4	* Move Span/Spans to separate files. Currently duplicates lots of Tokens functionality. Should probably be integrated into Tokens	2015-03-26 16:44:48 +01:00
Matthew Honnibal	e854ba0a13	* Remove support for force_gold flag from GreedyParser, since it's not so useful, and it's clutter	2015-03-26 16:44:47 +01:00
Matthew Honnibal	6a6085f8b9	* Clean up GreedyParser.train function a bit	2015-03-26 16:44:47 +01:00
Matthew Honnibal	b3157927e6	* Clean up unused feature templates	2015-03-26 16:44:47 +01:00
Matthew Honnibal	411bf377d4	* Remove dependency on ner_util module	2015-03-26 16:44:47 +01:00
Matthew Honnibal	01c892f583	* Add comment to fill_context	2015-03-26 16:44:47 +01:00
Matthew Honnibal	2741179aff	* Important bug fix: Fill token N2w, which was being unfilled, after a bad edit while writing the NER features.	2015-03-26 16:44:47 +01:00
Matthew Honnibal	2b2dec95d3	* Add comment to set_parse	2015-03-26 16:44:47 +01:00
Matthew Honnibal	e770fade1e	* Don't set dependency labels in set_parse, as this may be used by the Entity recogniser instead. Need to clean this method up...	2015-03-26 16:44:47 +01:00
Matthew Honnibal	71648205d9	* Add support for debug feature set. Just use unigrams for this.	2015-03-26 16:44:47 +01:00
Matthew Honnibal	3b70b304b2	* Add words to gold_tuples from gold conll file	2015-03-26 16:44:47 +01:00
Matthew Honnibal	2e12dec76e	* Adjust scorer to account for tokenization mistakes	2015-03-26 16:44:47 +01:00
Matthew Honnibal	05d6065e2e	* Add assertion	2015-03-26 16:44:46 +01:00
Matthew Honnibal	377e9b29b1	* Whitespace	2015-03-26 16:44:46 +01:00
Matthew Honnibal	670959f40c	* Fix iteration order on Tokens.rights	2015-03-26 16:44:46 +01:00
Matthew Honnibal	231ce2dae5	* Assign ROOT label by default. May be papering over another bug.	2015-03-26 16:44:46 +01:00
Matthew Honnibal	9f4ad8fdfb	* Assign root words the ROOT label via the Break transition. Something is still wrong here...	2015-03-26 16:44:46 +01:00
Matthew Honnibal	f729164c01	* Fix bug in label assignment: ensure null-label transitions receive the label 0	2015-03-26 16:44:46 +01:00
Matthew Honnibal	7237c805c7	* Load tag for specials.json token	2015-03-26 16:44:46 +01:00
Matthew Honnibal	567388e38d	* Use values encoded by StringStore in POS tagging, rather than indices into a list of tags	2015-03-26 16:44:45 +01:00
Matthew Honnibal	3105c7f8ba	* Don't pass label_ids dict to Tokens, since we now use the StringStore to manage string-to-int mapping for labels	2015-03-26 16:44:45 +01:00
Matthew Honnibal	801bf14f4f	* Clean up handling of dep_strings and ent_strings, using StringStore to encode the label names.	2015-03-26 16:44:45 +01:00

743 Commits