spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 02:16:32 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	6b34a2f34b	* Move morphological analysis into its own module, morphology.pyx	2014-12-09 21:16:17 +11:00
Matthew Honnibal	b962fe73d7	* Make suffixes file use full-power regex, so that we can handle periods properly	2014-12-09 19:04:27 +11:00
Matthew Honnibal	accdbe989b	* Remove Tokens.extend method	2014-12-09 17:09:23 +11:00
Matthew Honnibal	495e1c7366	* Use fused type in Tokens.push_back, simplifying the use of the cache	2014-12-09 16:50:01 +11:00
Matthew Honnibal	302e09018b	* Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas	2014-12-09 14:48:01 +11:00
Matthew Honnibal	99bbbb6feb	* Work on morphological processing	2014-12-08 21:12:15 +11:00
Matthew Honnibal	7b68f911cf	* Add WordNet lemmatizer	2014-12-08 01:39:13 +11:00
Matthew Honnibal	c20dd79748	* Fiddle with const correctness and comments	2014-12-08 00:03:55 +11:00
Matthew Honnibal	b031c7c430	* Remove language-general context module	2014-12-07 23:53:01 +11:00
Matthew Honnibal	ef4398b204	* Rearrange POS stuff, so that language-specific stuff can live in language-specific modules	2014-12-07 23:52:41 +11:00
Matthew Honnibal	327383e38a	* Remove unused code in tagger.pyx	2014-12-07 22:16:17 +11:00
Matthew Honnibal	9f17467c2e	* Fix EMPTY_TOKEN	2014-12-07 22:07:41 +11:00
Matthew Honnibal	3819a88e1b	* Add support for tag dictionary, and fix error-code for predict method	2014-12-07 22:07:16 +11:00
Matthew Honnibal	f00afe12c4	* Load POS tagger in load() function if path exists	2014-12-07 22:05:57 +11:00
Matthew Honnibal	5fe5e6e66b	* Move context functions to header, inlining them.	2014-12-07 21:59:04 +11:00
Matthew Honnibal	5caabec789	* Link in tagger, to work on integrating POS tagging	2014-12-07 15:29:41 +11:00
Matthew Honnibal	0c7aeb9de7	* Begin revising tagger, focussing on POS tagging	2014-12-07 15:29:04 +11:00
Matthew Honnibal	f5c4f2eb52	* Revise context, focussing on POS tagging for now	2014-12-07 15:28:22 +11:00
Matthew Honnibal	e27b912ef9	* Remove need for confusing _data pointer to be stored on Tokens	2014-12-05 16:31:30 +11:00
Matthew Honnibal	1c9253701d	* Introduce a TokenC struct, to handle token indices, pos tags and sense tags	2014-12-05 15:56:14 +11:00
Matthew Honnibal	187372c7f3	* Allow the lexicon to create lexemes using an external memory pool, so that it can decide to make some lexemes temporary, rather than cached	2014-12-05 03:29:50 +11:00
Matthew Honnibal	75b8dfb348	* Remove upper_pc from lexeme.pyx	2014-12-04 22:14:34 +11:00
Matthew Honnibal	49f3780ff5	* Fiddle with lexeme attrs	2014-12-04 21:22:38 +11:00
Matthew Honnibal	564082e48e	* Hack Token class to take lex.dense in place of the old lex.norm. This needs to be fixed...	2014-12-04 20:51:29 +11:00
Matthew Honnibal	69bb022204	* Add as_array and count_by method	2014-12-04 20:46:55 +11:00
Matthew Honnibal	e1b1f45cc9	* Add STEM attribute to lexeme	2014-12-04 20:46:20 +11:00
Matthew Honnibal	d7952634ca	* Make the string-store serve const pointers to Utf8Str	2014-12-03 16:01:47 +11:00
Matthew Honnibal	7e04c22f8f	* const added to Lexicon interface. Seems to work.	2014-12-03 15:58:17 +11:00
Matthew Honnibal	d70d31aa45	* Introduce first attempt at const-ness	2014-12-03 15:44:25 +11:00
Matthew Honnibal	4560ada85b	* Add typedef for attr_t. Change flag_t to flags_t	2014-12-03 11:06:31 +11:00
Matthew Honnibal	e600f7b327	* Move String struct stuff into the utf8string module, from spacy.lang	2014-12-03 11:06:00 +11:00
Matthew Honnibal	e170faf5b0	* Hack Tokens to work without tagger.pyx	2014-12-03 11:05:15 +11:00
Matthew Honnibal	b463a7eb86	* Make flag-setting a language-specific thing	2014-12-03 11:04:32 +11:00
Matthew Honnibal	71b009e323	* Fix bug in refactored StringStore.__getitem__	2014-12-03 11:02:24 +11:00
Matthew Honnibal	14097311ae	* Make StringStore.__getitem__ accept unicode-typed keys.	2014-12-03 01:33:20 +11:00
Matthew Honnibal	522bb0346e	* Work on get_array method of Tokens	2014-12-02 23:48:05 +11:00
Matthew Honnibal	8c2938fe01	* Rename Lexicon._dict to Lexicon._map	2014-12-02 23:46:59 +11:00
Matthew Honnibal	33dfb4933c	* Remove taggers from Language class. Work on doc strings	2014-11-26 19:53:55 +11:00
Matthew Honnibal	80baa2e3db	* Work on beam parser	2014-11-20 19:49:33 +11:00
Matthew Honnibal	5c3016bac8	* Tmp commit of ner code	2014-11-14 18:27:47 +11:00
Matthew Honnibal	33c421bcf8	* More feature tweaks	2014-11-12 23:59:16 +11:00
Matthew Honnibal	41dedfb14e	* Add label features for NER parsing	2014-11-12 23:55:10 +11:00
Matthew Honnibal	cf55b48ba6	* Switch to predict label on shift. Big increase in accuracy.	2014-11-12 23:50:12 +11:00
Matthew Honnibal	8f84e8a78b	* Neaten oracle	2014-11-12 23:38:07 +11:00
Matthew Honnibal	7e0a9077dd	* Add context files	2014-11-12 23:22:36 +11:00
Matthew Honnibal	3b0b902384	* IOB-style parsing working. Accuracy down from BILOU, from 87-88 to 85-86	2014-11-12 23:21:09 +11:00
Matthew Honnibal	e6bb8aa3a9	* Move moves to bilou_moves. Refactor context, returning to the simpler giant-enum style	2014-11-12 00:54:50 +11:00
Matthew Honnibal	c788633429	* Add tokens_from_list method to Language	2014-11-11 23:43:14 +11:00
Matthew Honnibal	95282d4993	* Use the dynamic oracle 'follow' strategy	2014-11-11 21:11:17 +11:00
Matthew Honnibal	5aaf7a024d	* Move ner features to ner subdir	2014-11-11 21:09:03 +11:00
Matthew Honnibal	ff8989b63c	* Use greedy NER parser	2014-11-11 21:08:35 +11:00
Matthew Honnibal	0d943ab358	* Fixed greedy NER parsing. With static oracle, replicates accuracy from tagger.	2014-11-11 17:17:54 +11:00
Matthew Honnibal	399239760b	* Fix moves for new State struct	2014-11-10 22:16:05 +11:00
Matthew Honnibal	82247169f2	* Implement validation and oracle on pystate, for testing	2014-11-10 22:15:32 +11:00
Matthew Honnibal	3709ed9d6d	* Add curr field to State, to handle entity being built	2014-11-10 22:14:36 +11:00
Matthew Honnibal	af9ed18cf1	* Bug fixes to NER	2014-11-10 17:39:23 +11:00
Matthew Honnibal	9f2587f5ec	* Work on shift-reduce NER	2014-11-10 16:28:56 +11:00
Matthew Honnibal	f307eb2e36	* Refactor context extraction, and start breaking out gold standards into their own functions	2014-11-09 15:43:07 +11:00
Matthew Honnibal	602f993af9	* Moving tagger to accept multiple correct answers	2014-11-09 15:18:33 +11:00
Matthew Honnibal	f37d896a42	* Upd NER feats. With adadelta learner, getting 76.9 on NER	2014-11-07 04:43:54 +11:00
Matthew Honnibal	68d1cdad62	* When encoding POS/NER tags, accept '-' as a missing value	2014-11-07 04:42:31 +11:00
Matthew Honnibal	949a6245f9	* Increase default number of iterations from 5 to 10	2014-11-07 04:42:04 +11:00
Matthew Honnibal	3cab1d9a29	* Refine word_shape feature, by trimming the max sequence length	2014-11-07 04:41:29 +11:00
Matthew Honnibal	b4454cf036	* Add extra context tokens	2014-11-07 04:40:36 +11:00
Matthew Honnibal	50309e6e49	* Fix context vector, importing all features	2014-11-05 22:11:39 +11:00
Matthew Honnibal	07a23768de	* Play with NER feats a bit. Up to 82.00 training on MUC7.	2014-11-05 21:47:17 +11:00
Matthew Honnibal	4ecbe8c893	* Complete refactor of Tagger features, to use a generic list of context names.	2014-11-05 20:45:29 +11:00
Matthew Honnibal	0a8c84625d	* Moving feature context stuff to a generalized place	2014-11-05 19:55:10 +11:00
Matthew Honnibal	3733444101	* Generalize tagger code, in preparation for NER and supersense tagging.	2014-11-05 03:42:14 +11:00
Matthew Honnibal	abbe3e44b0	* Move spacy.pos tagger to spacy.tagger, and generalize it so that it can take on other tagging tasks, given a different set of feature templates.	2014-11-05 00:37:59 +11:00
Matthew Honnibal	954c970415	* Add __iter__ method to tokens	2014-11-04 01:07:08 +11:00
Matthew Honnibal	f07457a91f	* Remove POS alignment stuff. Now use training data based on raw text, instead of clumsy detokenization stuff	2014-11-04 01:06:43 +11:00
Matthew Honnibal	ae52f9f38c	* Remove vocab10k from tokens	2014-11-03 00:23:20 +11:00
Matthew Honnibal	32fb50dc35	* Remove non_sparse method --- features wanting this can do it easily enough.	2014-11-03 00:15:47 +11:00
Matthew Honnibal	b5ae1471db	* Fiddle with POS tag features	2014-11-03 00:15:03 +11:00
Matthew Honnibal	70ea862703	* Remove vocab10k field, and add flags for gazetteers	2014-11-03 00:13:51 +11:00
Matthew Honnibal	711ed0f636	* Whitespace	2014-11-02 14:22:32 +11:00
Matthew Honnibal	fcd9490d56	* Add pos_tag method to Language	2014-11-02 14:21:43 +11:00
Matthew Honnibal	829bb2bdbe	* Add mappings to Twitter POS tag corpus	2014-11-02 13:21:19 +11:00
Matthew Honnibal	437cd2217d	* Fix strings i/o, removing use of ujson library in favour of plain text file. Allows better control of codecs.	2014-11-02 13:20:37 +11:00
Matthew Honnibal	3352e89e21	* Use LIKE_URL and LIKE_NUMBER flag features. Seems to improve accuracy on onto web	2014-11-02 13:19:54 +11:00
Matthew Honnibal	8335706321	* Add LIKE_URL and LIKE_NUMBER flag features	2014-11-02 13:19:23 +11:00
Matthew Honnibal	5484fbea69	* Implement is_number	2014-11-01 19:13:24 +11:00
Matthew Honnibal	f685218e21	* Add is_urlish function	2014-11-01 17:39:34 +11:00
Matthew Honnibal	09a3e54176	* Delete print statements from stringstore	2014-10-31 17:45:26 +11:00
Matthew Honnibal	b186a66bae	* Rename Token.lex_pos to Token.postype, and Token.lex_supersense to Token.sensetype	2014-10-31 17:44:39 +11:00
Matthew Honnibal	a8ca078b24	* Restore lexemes field to lexicon	2014-10-31 17:43:25 +11:00
Matthew Honnibal	6c807aa45f	* Restore id attribute to lexeme, and rename pos field to postype, to store clustered tag dictionaries	2014-10-31 17:43:00 +11:00
Matthew Honnibal	aaf6953fe0	* Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web.	2014-10-31 17:42:15 +11:00
Matthew Honnibal	f67cb9a5a3	* Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web.	2014-10-31 17:42:04 +11:00
Matthew Honnibal	ea8f1e7053	* Tighten interfaces	2014-10-30 18:14:42 +11:00
Matthew Honnibal	ea85bf3a0a	* Tighten the interface to Language	2014-10-30 18:01:27 +11:00
Matthew Honnibal	c6fcd03692	* Small efficiency tweak to lexeme init	2014-10-30 17:56:11 +11:00
Matthew Honnibal	87c2418a89	* Fiddle with data types on Lexeme, to compress them to a much smaller size.	2014-10-30 15:42:15 +11:00
Matthew Honnibal	ac88893232	* Fix Token after lexeme changes	2014-10-30 15:30:52 +11:00
Matthew Honnibal	e6b87766fe	* Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme	2014-10-30 15:21:38 +11:00
Matthew Honnibal	889b7b48b4	* Fix POS tagger, so that it loads correctly. Lexemes are being read in.	2014-10-30 13:38:55 +11:00
Matthew Honnibal	67c8c8019f	* Update lexeme serialization, using a binary file format	2014-10-30 01:01:00 +11:00
Matthew Honnibal	13909a2e24	* Rewriting Lexeme serialization.	2014-10-29 23:19:38 +11:00
Matthew Honnibal	234d49bf4d	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
Matthew Honnibal	08ce602243	* Large refactor, particularly to Python API	2014-10-24 00:59:17 +11:00
Matthew Honnibal	7baef5b7ff	* Fix padding on tokens	2014-10-23 04:01:17 +11:00
Matthew Honnibal	96b835a3d4	* Upd for refactored Tokens class. Now gets 95.74, 185ms training on swbd_wsj_ewtb, eval on onto_web, Google POS tags.	2014-10-23 03:20:02 +11:00
Matthew Honnibal	e5e951ae67	* Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding.	2014-10-23 01:57:59 +11:00
Matthew Honnibal	ea1d4a81eb	* Refactoring get_atoms, improving tokens API	2014-10-22 13:10:56 +11:00
Matthew Honnibal	ad49e2482e	* Tagger now gets 97pc on wsj, parsing 19-21 in 500ms. Gets 92.7 on web text.	2014-10-22 12:57:06 +11:00
Matthew Honnibal	0a0e41f6c8	* Add prefix and suffix features	2014-10-22 12:56:09 +11:00
Matthew Honnibal	7018b53d3a	* Improve array features in tokens	2014-10-22 12:55:42 +11:00
Matthew Honnibal	43d5964e13	* Add function to read detokenization rules	2014-10-22 12:54:59 +11:00
Matthew Honnibal	224bdae996	* Add POS utilities	2014-10-22 10:17:57 +11:00
Matthew Honnibal	5ebe14f353	* Add greedy pos tagger	2014-10-22 10:17:26 +11:00
Matthew Honnibal	12742f4f83	* Add detokenize method and test	2014-10-18 18:07:29 +11:00
Matthew Honnibal	99f5e59286	* Have tokenizer emit tokens for whitespace other than single spaces	2014-10-14 20:25:57 +11:00
Matthew Honnibal	43743a5d63	* Work on efficiency	2014-10-14 18:22:41 +11:00
Matthew Honnibal	6fb42c4919	* Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang	2014-10-14 16:17:45 +11:00
Matthew Honnibal	2805068ca8	* Have tokens track tuples that record the start offset and pos tag as well as a lexeme pointer	2014-10-14 15:21:03 +11:00
Matthew Honnibal	65d3ead4fd	* Rename LexStr_casefix to LexStr_norm and LexInt_i to LexInt_id	2014-10-14 15:19:07 +11:00
Matthew Honnibal	868e558037	* Preparations in place to handle hyphenation etc	2014-10-10 20:23:23 +11:00
Matthew Honnibal	ff79dbac2e	* More slight cleaning for lang.pyx	2014-10-10 20:11:22 +11:00
Matthew Honnibal	3d82ed1e5e	* More slight cleaning for lang.pyx	2014-10-10 19:50:07 +11:00
Matthew Honnibal	02e948e7d5	* Remove counts stuff from Language class	2014-10-10 19:25:01 +11:00
Matthew Honnibal	71ee921055	* Slight cleaning of tokenizer code	2014-10-10 19:17:22 +11:00
Matthew Honnibal	59b41a9fd3	* Switch to new data model, tests passing	2014-10-10 08:11:31 +11:00
Matthew Honnibal	1b0e01d3d8	* Revising data model of lexeme. Compiles.	2014-10-09 19:53:30 +11:00
Matthew Honnibal	e40caae51f	* Update Lexicon class to expect a list of lexeme dict descriptions	2014-10-09 14:51:35 +11:00
Matthew Honnibal	51d75b244b	* Add serialize/deserialize functions for lexeme, transport to/from python dict.	2014-10-09 14:10:46 +11:00
Matthew Honnibal	d73d89a2de	* Add i attribute to lexeme, giving lexemes sequential IDs.	2014-10-09 13:50:05 +11:00
Matthew Honnibal	096ef2b199	* Rename external hashing lib, from trustyc to preshed	2014-09-26 18:40:03 +02:00
Matthew Honnibal	11a346fd5e	* Remove hashing modules, which are now taken over by external lib	2014-09-26 18:39:40 +02:00
Matthew Honnibal	93505276ed	* Add German tokenizer files	2014-09-25 18:29:13 +02:00
Matthew Honnibal	2e44fa7179	* Add util.py	2014-09-25 18:26:22 +02:00
Matthew Honnibal	b15619e170	* Use PointerHash instead of locally provided _hashing module	2014-09-25 18:23:35 +02:00
Matthew Honnibal	ed446c67ad	* Add typedefs file	2014-09-17 23:10:32 +02:00
Matthew Honnibal	316a57c4be	* Remove own memory classes, which have now been broken out into their own package	2014-09-17 23:10:07 +02:00
Matthew Honnibal	ac522e2553	* Switch from own memory class to cymem, in pip	2014-09-17 23:09:24 +02:00
Matthew Honnibal	6266cac593	* Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks	2014-09-17 20:02:26 +02:00
Matthew Honnibal	5a20dfc03e	* Add memory management code	2014-09-17 20:02:06 +02:00
Matthew Honnibal	0152831c89	* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token.	2014-09-16 18:01:46 +02:00
Matthew Honnibal	143e51ec73	* Refactor tokenization, splitting it into a clearer life-cycle.	2014-09-16 13:16:02 +02:00
Matthew Honnibal	c396581a0b	* Fiddle with the way strings are interned in lexeme	2014-09-15 06:34:45 +02:00
Matthew Honnibal	0bb547ab98	* Fix memory error in cache, where entry wasn't being null-terminated. Various other changes, some good for performance	2014-09-15 06:34:10 +02:00
Matthew Honnibal	7959141d36	* Add a few abbreviations, to get tests to pass	2014-09-15 06:32:18 +02:00
Matthew Honnibal	d235299260	* Few nips and tucks to hash table	2014-09-15 05:03:44 +02:00
Matthew Honnibal	e68a431e5e	* Pass only the tokens vector to _tokenize, instead of the whole python object.	2014-09-15 04:01:38 +02:00
Matthew Honnibal	08cef75ffd	* Switch to using a heap-allocated vector in tokens	2014-09-15 03:46:14 +02:00
Matthew Honnibal	f77b7098c0	* Upd Tokens to use vector, with bounds checking.	2014-09-15 03:22:40 +02:00
Matthew Honnibal	0f6bf2a2ee	* Fix niggling memory error, which was caused by bug in the way tokens resized their internal vector.	2014-09-15 02:08:39 +02:00
Matthew Honnibal	df24e3708c	* Move EnglishTokens stuff to Tokens	2014-09-15 01:31:44 +02:00
Matthew Honnibal	bd08cb09a2	* Remove short-circuiting of initial_size argument for PointerHash	2014-09-15 01:30:49 +02:00
Matthew Honnibal	f3393cf57c	* Improve interface for PointerHash	2014-09-13 17:29:58 +02:00

363 Commits