Commit Graph

10400 Commits

Author SHA1 Message Date
Knut O. Hellan
a54f0cfc2b Norwegian tweaks (#3894)
* Norwegian fix

Add support for alternative past tense verb form (vaska).

* Norwegian months

Add all Norwegian months to tokenizer excpetions.

* More Norwegian abbreviations

Add more Norwegian abbreviations to tokenizer_exceptions.

* Contributor agreement khellan

Add signed contributor agreement for khellan (Knut O. Hellan).
2019-07-08 10:28:47 +02:00
Patrick Hogan
8c0586fd9c Update example and sign contributor agreement (#3916)
* Sign contributor agreement for askhogan

* Remove unneeded `seen_tokens` which is never used within the scope
2019-07-08 10:27:20 +02:00
Rokas Ramanauskas
61ce126d4c Lithuanian language support (#3895)
* initial LT lang support

* Added more stopwords. Started setting up some basic test environment (not complete)

* Initial morph rules for LT lang

* Closes #1 Adds tokenizer exceptions for Lithuanian

* Closes #5 Punctuation rules. Closes #6 Lexical Attributes

* test: add native examples to basic tests

* feat: add tag map for lt lang

* fix: remove undefined tag attribute 'Definite'

* feat: add lemmatizer for lt lang

* refactor: add new instances to lt lang morph rules; use tags from tag map

* refactor: add morph rules to lt lang defaults

* refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup

* refactor: add capitalized words to lt lang lemmatizer

* refactor: add more num words to lt lang lex attrs

* refactor: update lt lang stop word set

* refactor: add new instances to lt lang tokenizer exceptions

* refactor: remove comments form lt lang init file

* refactor: use function instead of lambda in lt lex lang getter

* refactor: remove conversion to dict in lt init when dict is already provided

* chore: rename lt 'test_basic' to 'test_text'

* feat: add more lt text tests

* feat: add lemmatizer tests

* refactor: remove unused imports, add newline to end of file

* chore: add contributor agreement

* chore: change 'en' to 'lt' in lt example description

* fix: add missing encoding info

* style: add newline to end of file

* refactor: use python2 compatible syntax

* style: reformat code using black
2019-07-08 10:25:22 +02:00
svlandeg
b7a0c9bf60 fixing the context/prior weight settings 2019-07-03 17:48:09 +02:00
svlandeg
0ea52c86b8 remove redundancy 2019-07-03 15:02:10 +02:00
svlandeg
668b17ea4a deuglify kb deserializer 2019-07-03 15:00:42 +02:00
svlandeg
8840d4b1b3 fix for context encoder optimizer 2019-07-03 13:35:36 +02:00
svlandeg
3420cbe496 small fixes 2019-07-03 10:25:51 +02:00
svlandeg
2d2dea9924 experiment with adding NER types to the feature vector 2019-06-29 14:52:36 +02:00
svlandeg
c664f58246 adding prior probability as feature in the model 2019-06-28 16:22:58 +02:00
svlandeg
1c80b85241 fix tests 2019-06-28 08:59:23 +02:00
svlandeg
68a0662019 context encoder with Tok2Vec + linking model instead of cosine 2019-06-28 08:29:31 +02:00
Ines Montani
4f1dae1c6b Update languages and examples (see #1107) 2019-06-26 16:19:17 +02:00
svlandeg
dbc53b9870 rename to KBEntryC 2019-06-26 15:55:26 +02:00
Ines Montani
37f744ca00 Auto-format [ci skip] 2019-06-26 14:48:09 +02:00
Ines Montani
d361e380b8 Fix matcher callback example (closes #3862) 2019-06-26 14:47:26 +02:00
Ines Montani
6ccdf37574 Exclude user_data when copying doc in displaCy (closes #3882) 2019-06-26 14:37:05 +02:00
svlandeg
1de61f68d6 improve speed of prediction loop 2019-06-26 13:53:10 +02:00
svlandeg
bee23cd8af try Tok2Vec instead of SpacyVectors 2019-06-25 16:09:22 +02:00
svlandeg
8608685543 ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
svlandeg
58a5b40ef6 clean up duplicate code 2019-06-24 15:19:58 +02:00
svlandeg
ddc73b11a9 fix unicode literals 2019-06-24 12:58:18 +02:00
Bram Vanroy
f22704621e Update CITATION (#3873)
As discussed in https://github.com/explosion/spaCy/pull/2167 the citation should look slightly different.
2019-06-24 11:03:16 +02:00
svlandeg
f4af47ce4a Merge branch 'feature/nel-wiki' of https://github.com/svlandeg/spaCy into feature/nel-wiki 2019-06-24 10:57:07 +02:00
svlandeg
b58bace84b small fixes 2019-06-24 10:55:04 +02:00
Ines Montani
c833d9b314 Add "v.s." to English tokenizer exceptions (see #3868) 2019-06-20 17:48:45 +02:00
Ines Montani
ae2c208735 Auto-format [ci skip] 2019-06-20 10:36:38 +02:00
Ines Montani
872121955c Update error code 2019-06-20 10:35:51 +02:00
Ines Montani
e1be80e3ec Merge branch 'master' into pr/3864 2019-06-20 10:35:37 +02:00
Guillaume Claret
d7a519a922 Typo (#3865)
* Typo

* Add contributor agreement
2019-06-20 10:31:19 +02:00
Björn Böing
ebf5a04d6c Update pretrain docs and add unsupported loss_func error (#3860)
* Add error to `get_vectors_loss` for unsupported loss function of `pretrain`

* Add missing "--loss-func" argument to pretrain docs. Update pretrain plac annotations to match docs.

* Add missing quotation marks
2019-06-20 10:30:44 +02:00
Alejandro Alcalde
4866a7ee9e Changed learning rate by its param name. (#3855)
* Changed learning rate by its param name.

I've been searching for a while how the parameter learning rate was named, with `beta1` and `beta2` its easy as they are marked as code, but learning rate wasn't. I think writing the actual parameter name would be helpful.

* Signing SCA
2019-06-20 10:29:20 +02:00
svlandeg
b76a43bee4 unicode strings 2019-06-19 13:26:33 +02:00
svlandeg
0b0959b363 UTF8 encoding 2019-06-19 13:11:39 +02:00
svlandeg
cc9ae28a52 custom error and warning messages 2019-06-19 12:35:26 +02:00
svlandeg
791327e3c5 Merge remote-tracking branch 'upstream/master' into feature/nel-wiki 2019-06-19 09:44:05 +02:00
svlandeg
a31648d28b further code cleanup 2019-06-19 09:15:43 +02:00
svlandeg
478305cd3f small tweaks and documentation 2019-06-18 18:38:09 +02:00
svlandeg
0d177c1146 clean up code, remove old code, move to bin 2019-06-18 13:20:40 +02:00
svlandeg
ffae7d3555 sentence encoder only (removing article/mention encoder) 2019-06-18 00:05:47 +02:00
svlandeg
6332af40de baseline performances: oracle KB, random and prior prob 2019-06-17 14:39:40 +02:00
svlandeg
24db1392b9 reprocessing all of wikipedia for training data 2019-06-16 21:14:45 +02:00
Ines Montani
81c12640ab Auto-format [ci skip] 2019-06-16 14:33:20 +02:00
Greg Werner
9041a72d7f Update tokenizer.md for construction example (#3790)
* Update tokenizer.md for construction example

Self contained example.  You should really say what nlp is so that the example will work as is

* Update CONTRIBUTOR_AGREEMENT.md

* Restore contributor agreement

* Adjust construction examples
2019-06-16 14:32:56 +02:00
Kabir Khan
1e19f34e29 Add optional id property to EntityRuler patterns (#3591)
* Adding support for entity_id in EntityRuler pipeline component

* Adding Spacy Contributor aggreement

* Updating EntityRuler to use string.format instead of f strings

* Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity.

* Fixing tests

* Remove custom extension entity_id and use built in ent_id token attribute.

* Changing entity_id to ent_id for consistent naming

* entity_ids => ent_ids

* Removing kb, cleaning up tests, making util functions private, use rsplit instead of split
2019-06-16 13:29:04 +02:00
Suraj Rajan
46c78d0a41 Dependency tree pattern matcher (#3465)
* Functional dependency tree pattern matcher

* Tests fail due to inconsistent behaviour

* Renamed dependencymatcher and added optimizations
2019-06-16 13:25:32 +02:00
Paul O'Leary McCann
3f52e12335 Change vector training to work with latest gensim (fix #3749) (#3757) 2019-06-16 13:24:06 +02:00
BreakBB
d8573ee715 Update error raising for CLI pretrain to fix #3840 (#3843)
* Add check for empty input file to CLI pretrain

* Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key

* Skip empty values for correct pretrain keys and log a counter as warning

* Add tests for CLI pretrain core function make_docs.

* Add a short hint for the `tokens` key to the CLI pretrain docs

* Add success message to CLI pretrain

* Update model loading to fix the tests

* Skip empty values and do not create docs out of it
2019-06-16 13:22:57 +02:00
svlandeg
81731907ba performance per entity type 2019-06-14 19:55:46 +02:00
svlandeg
b312f2d0e7 redo training data to be independent of KB and entity-level instead of doc-level 2019-06-14 15:55:26 +02:00