svlandeg
5f002e9ced
annotate kb_id through ents in doc
2019-03-14 16:31:46 +01:00
svlandeg
173d45ec5f
adding kb_id as field to token, el as nlp pipeline component
2019-03-06 19:34:18 +01:00
Matthew Honnibal
8aa7882762
Make NORM a token attribute (#3029)
See #3028. The solution in this patch is pretty debatable.
What we do is give the TokenC struct a .norm field, by repurposing the previously idle .sense attribute. It's nice to repurpose a previous field because it means the TokenC doesn't change size, so even if someone's using the internals very deeply, nothing will break.
The weird thing here is that the TokenC and the LexemeC both have an attribute named NORM. This arguably assists in backwards compatibility. On the other hand, maybe it's really bad! We're changing the semantics of the attribute subtly, so maybe it's better if someone calling lex.norm gets a breakage, and instead is told to write lex.default_norm?
Overall I believe this patch makes the NORM feature work the way we sort of expected it to work. Certainly it's much more like how the docs describe it, and more in line with how we've been directing people to use the norm attribute. We'll also be able to use token.norm to do stuff like spelling correction, which is pretty cool.
2018-12-08 10:49:10 +01:00
Matthew Honnibal
18063803de
Make TokenC.sent_start an int, to allow ternary value
2017-10-08 19:58:54 +02:00
Matthew Honnibal
84e66ca6d4
WIP on stringstore change. 27 failures
2017-05-28 14:06:40 +02:00
Matthew Honnibal
f51e6a6c16
Adjust lexeme sizing for attr_t being 64 bit
2017-05-28 12:51:09 +02:00
Matthew Honnibal
3ea98e2043
Remove vector member from lexeme
2017-05-28 11:46:24 +02:00
Matthew Honnibal
793430aa7a
Get spaCy train command working with neural network
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Matthew Honnibal
58e83fe34b
Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match.
2016-09-21 14:54:55 +02:00
Wolfgang Seeker
03fb498dbe
introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Matthew Honnibal
9ec7b9c454
* Clean up unused Constituent struct.
2015-11-03 23:48:21 +11:00
Matthew Honnibal
1e99fcd413
* Rename .repvec to .vector in C API
2015-11-03 23:47:59 +11:00
Matthew Honnibal
7ac6cacc26
* Remove const qualifier on LexemeC.repvec
2015-09-15 14:42:51 +10:00
Matthew Honnibal
c2307fa9ee
* More work on language-generic parsing
2015-08-28 02:02:33 +02:00
Matthew Honnibal
1d7f2d3abc
* Hack on morphology structs
2015-08-26 19:18:36 +02:00
Matthew Honnibal
815bda201d
* Remove UniStr struct
2015-07-22 13:39:17 +02:00
Matthew Honnibal
128b6d9714
* Move Utf8Str struct to strings module, as that's the only place it's relevant
2015-07-20 12:06:41 +02:00
Matthew Honnibal
4dddc8a69b
* Fix type declarations for attr_t. Remove unused id_t.
2015-07-18 22:39:57 +02:00
Matthew Honnibal
95e57c2780
* Remove unnecessary key and id properties from Utf8String.
2015-07-17 01:40:18 +02:00
Matthew Honnibal
aa82caf8f5
* Add TokenC.spacy attr
2015-07-13 19:48:07 +02:00
Matthew Honnibal
1d3a592edf
* Remove the senses attr from LexemeC, to keep data compatibility
2015-07-08 19:24:44 +02:00
Matthew Honnibal
e23d1582a2
* Add supersense data to Lexeme objects. Add simple has_sense method to check the flag.
2015-07-01 18:50:37 +02:00
Matthew Honnibal
a7bf7b0626
* Rename sent_start to sent_end, to reflect its new usage in the Break transition
2015-06-23 05:39:43 +02:00
Matthew Honnibal
8ee7c541f1
* Update Constituent definition
2015-05-20 16:03:26 +02:00
Matthew Honnibal
03a6626545
* Tmp commit
2015-05-12 20:27:56 +02:00
Matthew Honnibal
d2ac8d8007
* Add ctnt field to State, in preparation for constituency parsing
2015-05-12 20:27:56 +02:00
Matthew Honnibal
d634038eb6
* Add l_edge and r_edge props in TokenC for tracking the parse-yield of the token
2015-05-12 20:26:41 +02:00
Jordan Suchow
3a8d9b37a6
Remove trailing whitespace
2015-04-19 13:01:38 -07:00
Matthew Honnibal
8057a95f20
* NER seems to be working, scoring 69 F. Need to add decision-history features --- currently only use current word, 2 words context. Need refactoring.
2015-03-26 16:44:44 +01:00
Matthew Honnibal
b3eda03c9c
* Tmp
2015-03-26 16:44:44 +01:00
Matthew Honnibal
135756ac3d
* Tmp commit of NER refactoring
2015-03-26 16:44:42 +01:00
Matthew Honnibal
b139aa92ba
* Start setting out how NER will be implemented in the data model
2015-03-26 16:44:41 +01:00
Matthew Honnibal
75f9b7d6bf
* Add L2 norm field to LexemeC struct
2015-02-07 08:43:17 -05:00
Matthew Honnibal
08ca5c8970
* Add sent_end flag to TokenC struct
2015-01-31 13:44:16 +11:00
Matthew Honnibal
12b034e3ef
* Move POS tag definitions to parts_of_speech.pxd
2015-01-25 16:31:07 +11:00
Matthew Honnibal
fda94271af
* Rename NORM1 and NORM2 attrs to lower and norm
2015-01-24 06:17:03 +11:00
Matthew Honnibal
5ed8b2b98f
* Rename sic to orth
2015-01-23 02:08:25 +11:00
Matthew Honnibal
45264e356b
* Rename vec to repvec
2015-01-22 02:04:24 +11:00
Matthew Honnibal
6c7e44140b
* Work on word vectors, and other stuff
2015-01-17 16:21:17 +11:00
Matthew Honnibal
46da3d74d2
* Tmp. Refactoring, introducing a Lexeme PyObject.
2015-01-12 11:23:44 +11:00
Matthew Honnibal
ce2edd6312
* Tmp commit. Refactoring to create a Python Lexeme class.
2015-01-12 10:26:22 +11:00
Matthew Honnibal
b8b65903fc
* Tmp
2014-12-24 17:42:00 +11:00
Matthew Honnibal
e1c1a4b868
* Tmp
2014-12-21 05:36:29 +11:00
Matthew Honnibal
780cbd68b1
* Move all struct definitions to structs.pxd, to avoid circular dependencies
2014-12-20 06:51:33 +11:00