spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 19:39:13 +03:00

Author	SHA1	Message	Date
adrianeboyd	a5cd203284	Reduce stored lexemes data, move feats to lookups (#5238) * Reduce stored lexemes data, move feats to lookups * Move non-derivable lexemes features (`norm / cluster / prob`) to `spacy-lookups-data` as lookups * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups * Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in lookups only * Remove serialization of lexemes data as `vocab/lexemes.bin` * Remove `SerializedLexemeC` * Remove `Lexeme.to_bytes/from_bytes` * Modify normalization exception loading: * Always create `Vocab.lookups` table `lexeme_norm` for normalization exceptions * Load base exceptions from `lang.norm_exceptions`, but load language-specific exceptions from lookups * Set `lex_attr_getter[NORM]` including new lookups table in `BaseDefaults.create_vocab()` and when deserializing `Vocab` * Remove all cached lexemes when deserializing vocab to override existing normalizations with the new normalizations (as a replacement for the previous step that replaced all lexemes data with the deserialized data) * Skip English normalization test Skip English normalization test because the data is now in `spacy-lookups-data`. * Remove norm exceptions Moved to spacy-lookups-data. * Move norm exceptions test to spacy-lookups-data * Load extra lookups from spacy-lookups-data lazily Load extra lookups (currently for cluster and prob) lazily from the entry point `lg_extra` as `Vocab.lookups_extra`. * Skip creating lexeme cache on load To improve model loading times, do not create the full lexeme cache when loading. The lexemes will be created on demand when processing. * Identify numeric values in Lexeme.set_attrs() With the removal of a special case for `PROB`, also identify `float` to avoid trying to convert it with the `StringStore`. * Skip lexeme cache init in from_bytes * Unskip and update lookups tests for python3.6+ * Update vocab pickle to include lookups_extra * Update vocab serialization tests Check strings rather than lexemes since lexemes aren't initialized automatically, account for addition of "_SP". * Re-skip lookups test because of python3.5 * Skip PROB/float values in Lexeme.set_attrs * Convert is_oov from lexeme flag to lex in vectors Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether the lexeme has a vector. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:14 +02:00
adrianeboyd	a6e521cd79	Add is_sent_end token property (#5375) Reconstruction of the original PR #4697 by @MiniLau. Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema because the Matcher is only going to be able to support `IS_SENT_START`.	2020-04-29 12:53:16 +02:00
adrianeboyd	d359da9687	Replace Entity/MatchStruct with SpanC (#4459) * Replace MatchStruct with Entity Replace MatchStruct with Entity since the existing Entity struct is nearly identical. * Replace Entity with more general SpanC	2019-10-18 11:01:47 +02:00
Matthew Honnibal	bcd08f20af	Merge changes from master	2019-08-21 14:18:52 +02:00
svlandeg	dae8a21282	rename entity frequency	2019-07-19 17:40:28 +02:00
svlandeg	dbc53b9870	rename to KBEntryC	2019-06-26 15:55:26 +02:00
svlandeg	5c723c32c3	entity vectors in the KB + serialization of them	2019-06-05 18:29:18 +02:00
svlandeg	6e3223f234	bulk loading in proper order of entity indices	2019-04-24 11:26:38 +02:00
svlandeg	735fc2a735	annotate kb_id through ents in doc	2019-03-22 11:36:44 +01:00
svlandeg	d849eb2455	adding kb_id as field to token, el as nlp pipeline component	2019-03-22 11:34:46 +01:00
Matthew Honnibal	9a2d1cc6e0	Add length attribute to MorphAnalysisC	2019-03-08 00:08:57 +01:00
Matthew Honnibal	fed0371db7	Remove enums from morphology	2019-03-07 17:14:57 +01:00
Matthew Honnibal	b9ade7d4e0	Add MorphAnalysisC struct	2019-03-07 14:03:07 +01:00
Matthew Honnibal	3993f41cc4	Update morphology branch from develop	2019-03-07 00:14:43 +01:00
Matthew Honnibal	8aa7882762	Make NORM a token attribute (#3029) See #3028. The solution in this patch is pretty debatable. What we do is give the TokenC struct a .norm field, by repurposing the previously idle .sense attribute. It's nice to repurpose a previous field because it means the TokenC doesn't change size, so even if someone's using the internals very deeply, nothing will break. The weird thing here is that the TokenC and the LexemeC both have an attribute named NORM. This arguably assists in backwards compatibility. On the other hand, maybe it's really bad! We're changing the semantics of the attribute subtly, so maybe it's better if someone calling lex.norm gets a breakage, and instead is told to write lex.default_norm? Overall I believe this patch makes the NORM feature work the way we sort of expected it to work. Certainly it's much more like how the docs describe it, and more in line with how we've been directing people to use the norm attribute. We'll also be able to use token.norm to do stuff like spelling correction, which is pretty cool.	2018-12-08 10:49:10 +01:00
Matthew Honnibal	3bba8e9245	Update structs	2018-09-24 23:58:08 +02:00
Matthew Honnibal	18063803de	Make TokenC.sent_start an int, to allow ternary value	2017-10-08 19:58:54 +02:00
Matthew Honnibal	84e66ca6d4	WIP on stringstore change. 27 failures	2017-05-28 14:06:40 +02:00
Matthew Honnibal	f51e6a6c16	Adjust lexeme sizing for attr_t being 64 bit	2017-05-28 12:51:09 +02:00
Matthew Honnibal	3ea98e2043	Remove vector member from lexeme	2017-05-28 11:46:24 +02:00
Matthew Honnibal	793430aa7a	Get spaCy train command working with neural network * Integrate models into pipeline * Add basic serialization (maybe incorrect) * Fix pickle on vocab	2017-05-17 12:04:50 +02:00
Matthew Honnibal	58e83fe34b	Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match.	2016-09-21 14:54:55 +02:00
Wolfgang Seeker	03fb498dbe	introduce lang field for LexemeC to hold language id put noun_chunk logic into iterators.py for each language separately	2016-03-10 13:01:34 +01:00
Matthew Honnibal	9ec7b9c454	* Clean up unused Constituent struct.	2015-11-03 23:48:21 +11:00
Matthew Honnibal	1e99fcd413	* Rename .repvec to .vector in C API	2015-11-03 23:47:59 +11:00
Matthew Honnibal	7ac6cacc26	* Remove const qualifier on LexemeC.repvec	2015-09-15 14:42:51 +10:00
Matthew Honnibal	c2307fa9ee	* More work on language-generic parsing	2015-08-28 02:02:33 +02:00
Matthew Honnibal	1d7f2d3abc	* Hack on morphology structs	2015-08-26 19:18:36 +02:00
Matthew Honnibal	815bda201d	* Remove UniStr struct	2015-07-22 13:39:17 +02:00
Matthew Honnibal	128b6d9714	* Move Utf8Str struct to strings module, as that's the only place it's relevant	2015-07-20 12:06:41 +02:00
Matthew Honnibal	4dddc8a69b	* Fix type declarations for attr_t. Remove unused id_t.	2015-07-18 22:39:57 +02:00
Matthew Honnibal	95e57c2780	* Remove unnecessary key and id properties from Utf8String.	2015-07-17 01:40:18 +02:00
Matthew Honnibal	aa82caf8f5	* Add TokenC.spacy attr	2015-07-13 19:48:07 +02:00
Matthew Honnibal	1d3a592edf	* Remove the senses attr from LexemeC, to keep data compatibility	2015-07-08 19:24:44 +02:00
Matthew Honnibal	e23d1582a2	* Add supersense data to Lexeme objects. Add simple has_sense method to check the flag.	2015-07-01 18:50:37 +02:00
Matthew Honnibal	a7bf7b0626	* Rename sent_start to sent_end, to reflect its new usage in the Break transition	2015-06-23 05:39:43 +02:00
Matthew Honnibal	8ee7c541f1	* Update Constituent definition	2015-05-20 16:03:26 +02:00
Matthew Honnibal	03a6626545	* Tmp commit	2015-05-12 20:27:56 +02:00
Matthew Honnibal	d2ac8d8007	* Add ctnt field to State, in preparation for constituency parsing	2015-05-12 20:27:56 +02:00
Matthew Honnibal	d634038eb6	* Add l_edge and r_edge props in TokenC for tracking the parse-yield of the token	2015-05-12 20:26:41 +02:00
Jordan Suchow	3a8d9b37a6	Remove trailing whitespace	2015-04-19 13:01:38 -07:00
Matthew Honnibal	8057a95f20	* NER seems to be working, scoring 69 F. Need to add decision-history features --- currently only use current word, 2 words context. Need refactoring.	2015-03-26 16:44:44 +01:00
Matthew Honnibal	b3eda03c9c	* Tmp	2015-03-26 16:44:44 +01:00
Matthew Honnibal	135756ac3d	* Tmp commit of NER refactoring	2015-03-26 16:44:42 +01:00
Matthew Honnibal	b139aa92ba	* Start setting out how NER will be implemented in the data model	2015-03-26 16:44:41 +01:00
Matthew Honnibal	75f9b7d6bf	* Add L2 norm field to LexemeC struct	2015-02-07 08:43:17 -05:00
Matthew Honnibal	08ca5c8970	* Add sent_end flag to TokenC struct	2015-01-31 13:44:16 +11:00
Matthew Honnibal	12b034e3ef	* Move POS tag definitions to parts_of_speech.pxd	2015-01-25 16:31:07 +11:00
Matthew Honnibal	fda94271af	* Rename NORM1 and NORM2 attrs to lower and norm	2015-01-24 06:17:03 +11:00
Matthew Honnibal	5ed8b2b98f	* Rename sic to orth	2015-01-23 02:08:25 +11:00

57 Commits