11384 Commits

Author SHA1 Message Date
Ines Montani
8137b24928
Merge pull request #5028 from explosion/refactor/remove-symlinks
Remove symlinks, data dir and related stuff
2020-02-19 00:20:23 +01:00
Ines Montani
a3335d36b8 Merge branch 'develop' into refactor/remove-symlinks 2020-02-18 17:22:20 +01:00
Ines Montani
a138acb220
Merge pull request #5027 from explosion/chore/sync-develop-master
Sync develop with master, tidy up, auto-format
2020-02-18 17:22:03 +01:00
Ines Montani
09cbeaef27 Remove symlinks, data dir and related stuff 2020-02-18 17:20:17 +01:00
Ines Montani
e3f40a6a0f Tidy up and auto-format 2020-02-18 15:38:18 +01:00
Ines Montani
1278161f47 Tidy up and fix issues 2020-02-18 15:17:03 +01:00
Ines Montani
de11ea753a Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
Ines Montani
80e95d02b1 Allow spacy attr in token pattern 2020-02-18 14:32:53 +01:00
Jan Jessewitsch
c7e4fe9c5c
Fix/Improve German stop words (#5024)
* Fix German stop words

Two stop words ("einige" and "einigen") were stuck together.
Remove three nouns that may serve as stop words in a specific context (e.g. religious or news) but are not applicable for general use.

* Create Jan-711.md
2020-02-17 18:59:22 +01:00
Kabir Khan
f6ed07b85c
Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort ent_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set

* Run make_doc optimistically if using phrase matcher patterns.

* remove unused coveragerc I was testing with

* format

* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.

* Removing old add_patterns function

* Fixing spacing

* Make sure token_patterns are loaded as well; previously the generator was being emptied in from_disk
2020-02-16 18:17:47 +01:00
Sofie Van Landeghem
72c964bcf4
define pretrained_dims which is used by build_text_classifier (#5004) 2020-02-16 17:21:17 +01:00
adrianeboyd
3b22eb651b
Sync Span __eq__ and __hash__ (#5005)
* Sync Span __eq__ and __hash__

Use the same tuple for `__eq__` and `__hash__`, including all attributes
except `vector` and `vector_norm`.

* Update entity comparison in tests

Update `assert_docs_equal()` test util to compare `Span` properties for
ents rather than `Span` objects.
2020-02-16 17:20:36 +01:00
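The idea of deriving `__eq__` and `__hash__` from one shared tuple can be sketched in plain Python. The `MiniSpan` class, its `_cmp_tuple` helper, and the attributes chosen for comparison are hypothetical stand-ins for illustration, not spaCy's actual `Span` implementation.

```python
class MiniSpan:
    """Hypothetical stand-in for a span; not spaCy's Span class."""

    def __init__(self, doc_id, start, end, label="", vector=None):
        self.doc_id = doc_id
        self.start = start
        self.end = end
        self.label = label
        self.vector = vector  # deliberately left out of comparison and hashing

    def _cmp_tuple(self):
        # Single source of truth shared by __eq__ and __hash__ (assumed attributes).
        return (self.doc_id, self.start, self.end, self.label)

    def __eq__(self, other):
        if not isinstance(other, MiniSpan):
            return NotImplemented
        return self._cmp_tuple() == other._cmp_tuple()

    def __hash__(self):
        return hash(self._cmp_tuple())


a = MiniSpan(1, 0, 2, "ORG", vector=[0.1, 0.2])
b = MiniSpan(1, 0, 2, "ORG", vector=[0.9, 0.9])  # differs only in vector
c = MiniSpan(1, 0, 2, "PERSON")                  # differs in label
```

Because both methods read the same tuple, objects that compare equal always hash equally, so `a` and `b` collapse to a single entry in a set or dict while `c` stays distinct.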
adrianeboyd
0c47a53b5e
Use int only in key2row for better performance (#4990)
Cast all keys and rows to `int` in `vectors.key2row` for more efficient
access and serialization.
2020-02-16 17:19:41 +01:00
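A minimal sketch of the cast described above, assuming the mapping is an ordinary dict. `normalize_key2row` and `FakeUInt64` are hypothetical names; `FakeUInt64` only stands in for a non-builtin integer type such as a NumPy scalar so the example runs without extra dependencies.

```python
class FakeUInt64(int):
    """Hypothetical stand-in for a non-builtin integer type (e.g. a NumPy scalar)."""


def normalize_key2row(key2row):
    """Return a new dict whose keys and row indices are plain Python ints."""
    return {int(key): int(row) for key, row in key2row.items()}


raw = {FakeUInt64(12345): FakeUInt64(0), FakeUInt64(67890): FakeUInt64(1)}
clean = normalize_key2row(raw)
```

After the cast, every key and value in `clean` is exactly of type `int`, which is the property the commit message relies on for cheaper lookup and serialization.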
adrianeboyd
5b102963bf
Require HEAD for is_parsed in Doc.from_array() (#5011)
Modify flag settings so that `DEP` is not sufficient to set `is_parsed`
and only run `set_children_from_heads()` if `HEAD` is provided.

Then the combination `[SENT_START, DEP]` will set deps and not clobber
sent starts with a lot of one-word sentences.
2020-02-16 17:17:09 +01:00
Sofie Van Landeghem
2572460175
add tok2vec parameters to train script to facilitate init_tok2vec (#5021) 2020-02-16 17:16:41 +01:00
Sofie Van Landeghem
a27c77ce62
add message when cli train script throws exception (#5009)
* add message when cli train script throws exception

* fix formatting
2020-02-15 15:50:17 +01:00
Christos Aridas
ff8e71f46d
Update streamlit app (#5017)
* Update streamlit app [ci skip]

* Add all labels by default

* Tidy up and auto-format

Co-authored-by: Ines Montani <ines@ines.io>
2020-02-15 15:49:09 +01:00
nlptechbook
979a3fd1f5
Update universe.json (#5022)
e-book is available from https://nostarch.com/NLPPython
2020-02-15 15:44:55 +01:00
questoph
5352fc8fc3 Update tokenizer_exceptions.py 2020-02-14 12:02:15 +01:00
questoph
d1f0b397b5 Update punctuation.py 2020-02-13 22:18:51 +01:00
svlandeg
2729d9164d cleanup 2020-02-12 22:59:37 +01:00
svlandeg
6bbd816569 formatting 2020-02-12 22:50:27 +01:00
svlandeg
34986c7bfd test versions of required libs across different places 2020-02-12 22:49:50 +01:00
svlandeg
2079948711 add build dependencies back to pyproject.toml 2020-02-12 22:49:21 +01:00
svlandeg
6e717c62ed avoid the tests interacting with each other through the global Underscore variable 2020-02-12 13:21:31 +01:00
svlandeg
7939c63886 use English instead of model 2020-02-12 12:26:27 +01:00
svlandeg
46628d8890 add some asserts 2020-02-12 12:12:52 +01:00
svlandeg
51d37033c8 remove old comment 2020-02-12 12:10:05 +01:00
svlandeg
65f5b48b5d add comment 2020-02-12 12:06:27 +01:00
svlandeg
05dedaa2cf add unit test 2020-02-12 12:00:13 +01:00
svlandeg
ecbb9c4b9f load Underscore state when multiprocessing 2020-02-12 11:50:42 +01:00
Ines Montani
2ed49404e3
Improve setup.py and call into Cython directly (#4952)
* Improve setup.py and call into Cython directly

* Add numpy to setup_requires

* Improve clean helper

* Update setup.cfg

* Try if it builds without pyproject.toml

* Update MANIFEST.in
2020-02-11 17:46:18 -05:00
adrianeboyd
99a543367d
Set GPU before loading any models in train CLI (#4989)
Set the GPU before loading any existing models in the train CLI so that
you can start with a base model and train on GPU.
2020-02-11 17:45:41 -05:00
adrianeboyd
842dfddbb9
Standardize Greek tag map setup (#4997)
* Rename `tag_map.py` to `tag_map_fine.py` to indicate that it's not the
default tag map
* Remove duplicate generic UD tag map and load `../tag_map.py` instead
2020-02-11 17:44:56 -05:00
Sofie Van Landeghem
1c01842588
add pyx and pxd files to the distribution (#5000) 2020-02-11 17:42:17 -05:00
Sofie Van Landeghem
9b84f987bd
fix grad_clip naming (#4967) 2020-02-10 20:33:16 -05:00
Antti Ajanki
e1f777b151
Improvements for Finnish tokenizer (#4985)
* don't split on a colon. Colon is used to attach suffixes for abbreviations
* tokenize on any of LIST_HYPHENS (except a single hyphen), not just on --
* simplify infix rules by merging similar rules
2020-02-10 20:32:43 -05:00
Sofie Van Landeghem
781e95cf53
Ensure doc.similarity returns a float (on develop) (#4969) 2020-02-10 20:31:49 -05:00
Julin S
479e81bafc
fix link (#4977) 2020-02-10 20:31:26 -05:00
adrianeboyd
5d8cb60e43
Update lower pin for srsly to 1.0.1 (#4976) 2020-02-10 20:30:54 -05:00
Ines Montani
9c08d9baa3 Remove old sections [ci skip] (closes #4961) 2020-02-03 13:10:46 +01:00
Filip Bednárik
d4f4060bf3
Add Slovak language tools implementation (#4943)
* Add correct stopwords for Slovak language

* Add SNK Tags

* Disable formatting lint for TAGS

* Add example sentences for Slovak language

* Add Slovak numerals in base form

* Add lex_attrs to sk init

* Add contributor agreement
2020-02-03 13:03:59 +01:00
Sofie Van Landeghem
cabd60fa1e
Small fixes to as_example (#4957)
* label in span not writable anymore

* Revert "label in span not writable anymore"

This reverts commit ab442338c8.

* fixing yield - remove redundant list
2020-02-03 13:02:12 +01:00
Tyler Couto
9fa9d7f2cb
Fix for Issue 4665 - conllu2json (#4953)
* Fix for Issue 4665 - conllu2json

- Allowing HEAD to be an underscore

* Added contributor agreement
2020-02-03 13:01:48 +01:00
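What "allowing HEAD to be an underscore" can look like when reading a CoNLL-U token line is sketched below. `parse_conllu_line` is a hypothetical helper, and mapping `_` to `None` is an assumption made for illustration; it is not the code in spaCy's `conllu2json` converter.

```python
def parse_conllu_line(line):
    """Split one CoNLL-U token line into a small dict.

    HEAD is returned as an int, or None when the column holds "_".
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")
    id_, form, lemma, upos, xpos, feats, head, deprel, deps, misc = cols
    return {
        "id": id_,
        "form": form,
        # An underscore means the head is unspecified; do not call int() on it.
        "head": None if head == "_" else int(head),
        "deprel": deprel,
    }


with_head = parse_conllu_line("1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_")
without_head = parse_conllu_line("1\tDogs\tdog\tNOUN\tNNS\t_\t_\t_\t_\t_")
```

The first line yields an integer head, while the second, which has `_` in the HEAD column, parses without raising a `ValueError` from `int("_")`.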
Ines Montani
abd5c06374 Adjust formatting [ci skip] 2020-02-03 13:00:02 +01:00
Martin A. Kayser
02a44c5be2
Adding a note on retrieving the string rep of the match_id (#4904)
Stolen from here: https://stackoverflow.com/questions/47638877/using-phrasematcher-in-spacy-to-find-multiple-match-types
2020-02-03 12:58:58 +01:00
Omri Mendels
6ff947e1f9
Added presidio-research to universe.json (#4950)
* Added presidio-research to universe.json

Added a reference to Presidio Research, the data-science toolbox for Microsoft Presidio.

* Updated url
2020-02-03 12:57:55 +01:00
Matthew Honnibal
71b93f33bb Set dev version 2020-01-30 15:41:45 +01:00
Matthew Honnibal
9df0b1360d Fix ml_datasets 2020-01-30 10:35:18 +01:00
Matthew Honnibal
ba6d78132d Fix dev version 2020-01-30 10:35:09 +01:00