Commit Graph

11377 Commits

Author SHA1 Message Date
Jan Jessewitsch
c7e4fe9c5c
Fix/Improve german stop words (#5024)
* Fix german stop words

Two stop words ("einige" and "einigen") were stuck together.
Remove three nouns that may serve as stop words in a specific context (e.g. religious or news) but are not applicable for general use.

* Create Jan-711.md
2020-02-17 18:59:22 +01:00
Kabir Khan
f6ed07b85c
Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort ent_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set

* Run make_doc optimistically if using phrase matcher patterns.

* remove unused coveragerc I was testing with

* format

* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.

* Removing old add_patterns function

* Fixing spacing

* Make sure token_patterns are loaded as well; previously the generator was being emptied in from_disk
2020-02-16 18:17:47 +01:00
Sofie Van Landeghem
72c964bcf4
define pretrained_dims which is used by build_text_classifier (#5004) 2020-02-16 17:21:17 +01:00
adrianeboyd
3b22eb651b
Sync Span __eq__ and __hash__ (#5005)
* Sync Span __eq__ and __hash__

Use the same tuple for `__eq__` and `__hash__`, including all attributes
except `vector` and `vector_norm`.

* Update entity comparison in tests

Update `assert_docs_equal()` test util to compare `Span` properties for
ents rather than `Span` objects.
2020-02-16 17:20:36 +01:00
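The pattern this commit describes, deriving `__eq__` and `__hash__` from one shared tuple, can be sketched in plain Python. `SimpleSpan`, its attributes, and the `_key` helper are hypothetical names chosen for illustration; this is not spaCy's `Span` implementation, and the attribute set is an assumption.

```python
class SimpleSpan:
    """Hypothetical stand-in for a span; NOT spaCy's Span class."""

    def __init__(self, doc_id, start, end, label, vector=None):
        self.doc_id = doc_id
        self.start = start
        self.end = end
        self.label = label
        self.vector = vector  # deliberately left out of the comparison key

    def _key(self):
        # Single source of truth used by both __eq__ and __hash__,
        # so equal objects always hash to the same value.
        return (self.doc_id, self.start, self.end, self.label)

    def __eq__(self, other):
        if not isinstance(other, SimpleSpan):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a = SimpleSpan(1, 0, 2, "ORG", vector=[0.1, 0.2])
b = SimpleSpan(1, 0, 2, "ORG", vector=[0.9, 0.9])
c = SimpleSpan(1, 0, 2, "PERSON")
```

Here `a` and `b` compare equal and collapse to one entry in a set even though their vectors differ, while `c` stays distinct because its label differs.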
adrianeboyd
0c47a53b5e
Use int only in key2row for better performance (#4990)
Cast all keys and rows to `int` in `vectors.key2row` for more efficient
access and serialization.
2020-02-16 17:19:41 +01:00
adrianeboyd
5b102963bf
Require HEAD for is_parsed in Doc.from_array() (#5011)
Modify flag settings so that `DEP` is not sufficient to set `is_parsed`
and only run `set_children_from_heads()` if `HEAD` is provided.

Then the combination `[SENT_START, DEP]` will set deps and not clobber
sent starts with a lot of one-word sentences.
2020-02-16 17:17:09 +01:00
Sofie Van Landeghem
2572460175
add tok2vec parameters to train script to facilitate init_tok2vec (#5021) 2020-02-16 17:16:41 +01:00
Sofie Van Landeghem
a27c77ce62
add message when cli train script throws exception (#5009)
* add message when cli train script throws exception

* fix formatting
2020-02-15 15:50:17 +01:00
Christos Aridas
ff8e71f46d
Update streamlit app (#5017)
* Update streamlit app [ci skip]

* Add all labels by default

* Tidy up and auto-format

Co-authored-by: Ines Montani <ines@ines.io>
2020-02-15 15:49:09 +01:00
nlptechbook
979a3fd1f5
Update universe.json (#5022)
e-book is available from https://nostarch.com/NLPPython
2020-02-15 15:44:55 +01:00
questoph
5352fc8fc3 Update tokenizer_exceptions.py 2020-02-14 12:02:15 +01:00
questoph
d1f0b397b5 Update punctuation.py 2020-02-13 22:18:51 +01:00
svlandeg
6e717c62ed avoid the tests interacting with each other through the global Underscore variable 2020-02-12 13:21:31 +01:00
svlandeg
7939c63886 use English instead of model 2020-02-12 12:26:27 +01:00
svlandeg
46628d8890 add some asserts 2020-02-12 12:12:52 +01:00
svlandeg
51d37033c8 remove old comment 2020-02-12 12:10:05 +01:00
svlandeg
65f5b48b5d add comment 2020-02-12 12:06:27 +01:00
svlandeg
05dedaa2cf add unit test 2020-02-12 12:00:13 +01:00
svlandeg
ecbb9c4b9f load Underscore state when multiprocessing 2020-02-12 11:50:42 +01:00
adrianeboyd
99a543367d
Set GPU before loading any models in train CLI (#4989)
Set the GPU before loading any existing models in the train CLI so that
you can start with a base model and train on GPU.
2020-02-11 17:45:41 -05:00
adrianeboyd
842dfddbb9
Standardize Greek tag map setup (#4997)
* Rename `tag_map.py` to `tag_map_fine.py` to indicate that it's not the
default tag map
* Remove duplicate generic UD tag map and load `../tag_map.py` instead
2020-02-11 17:44:56 -05:00
Sofie Van Landeghem
1c01842588
add pyx and pxd files to the distribution (#5000) 2020-02-11 17:42:17 -05:00
Antti Ajanki
e1f777b151
Improvements for Finnish tokenizer (#4985)
* don't split on a colon; a colon is used to attach suffixes to abbreviations
* tokenize on any of LIST_HYPHENS (except a single hyphen), not just on --
* simplify infix rules by merging similar rules
2020-02-10 20:32:43 -05:00
Julin S
479e81bafc
fix link (#4977) 2020-02-10 20:31:26 -05:00
adrianeboyd
5d8cb60e43
Update lower pin for srsly to 1.0.1 (#4976) 2020-02-10 20:30:54 -05:00
Ines Montani
9c08d9baa3 Remove old sections [ci skip] (closes #4961) 2020-02-03 13:10:46 +01:00
Filip Bednárik
d4f4060bf3
Add Slovak language tools implementation (#4943)
* Add correct stopwords for Slovak language

* Add SNK Tags

* Disable formatting lint for TAGS

* Add example sentences for Slovak language

* Add Slovak numerals in base form

* Add lex_attrs to sk init

* Add contributor agreement
2020-02-03 13:03:59 +01:00
Tyler Couto
9fa9d7f2cb
Fix for Issue 4665 - conllu2json (#4953)
* Fix for Issue 4665 - conllu2json

- Allowing HEAD to be an underscore

* Added contributor agreement
2020-02-03 13:01:48 +01:00
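A minimal sketch of what tolerating an underscore in the HEAD column can look like when reading CoNLL-U tokens. The commit only states that HEAD may be `_`; the fallback used here (treating `_` like a root and pointing the token at itself) and the function name `head_index` are assumptions, not the converter's confirmed behaviour.

```python
def head_index(token_id, head):
    """Return a 0-based head index for one CoNLL-U token.

    Assumption: an unannotated ("_") or root ("0") HEAD falls back
    to the token's own index instead of raising on int("_").
    """
    own_index = int(token_id) - 1
    if head in ("_", "0"):
        return own_index
    return int(head) - 1


# CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
line = "2\tdog\tdog\tNOUN\tNN\t_\t_\t_\t_\t_"
fields = line.split("\t")
token_id, head = fields[0], fields[6]
```

Calling `head_index(token_id, head)` on the sample line returns the token's own index rather than failing on the underscore.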
Ines Montani
abd5c06374 Adjust formatting [ci skip] 2020-02-03 13:00:02 +01:00
Martin A. Kayser
02a44c5be2
Adding a note on retrieving the string rep of the match_id (#4904)
Stolen from here: https://stackoverflow.com/questions/47638877/using-phrasematcher-in-spacy-to-find-multiple-match-types
2020-02-03 12:58:58 +01:00
Omri Mendels
6ff947e1f9
Added presidio-research to universe.json (#4950)
* Added presidio-research to universe.json

Added a reference to Presidio Research, the data-science toolbox for Microsoft Presidio.

* Updated url
2020-02-03 12:57:55 +01:00
Matthew Honnibal
d031440de2
Update setup.cfg 2020-01-29 17:35:46 +01:00
Paco Nathan
49fefb6139 Submitting PyTextRank for inclusion in the spaCy uniVerse (#4942)
* submitting PyTextRank for consideration for inclusion in the spaCy uniVerse

* including SCA
2020-01-28 11:37:54 +01:00
adrianeboyd
a938566b62 Fix Sentencizer.pipe() for empty doc (#4940) 2020-01-28 11:36:49 +01:00
adrianeboyd
7ad000fce7 Update docs for train CLI --use_gpu option (#4927) 2020-01-20 17:02:47 +01:00
Yohei Tamura
708a4d27eb fix nlp.evaluate (#4924) (#4925)
* new file:   test_issue4924.py

* modified:   spacy/gold.pyx

* modified:   test_issue4924.py for python2
2020-01-20 12:17:46 +01:00
Kabir Khan
b9afcd56e3 Fix ent_ids and labels properties when id attribute used in patterns (#4900)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort ent_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set
2020-01-16 02:01:31 +01:00
Sofie Van Landeghem
fbfc418745 run normal textcat train script with transformers (#4834)
* keep trf tok2vec and wordpiecer components during update

* also support transformer models for other example scripts
2020-01-16 02:01:23 +01:00
adrianeboyd
90c52128dc Improve train CLI with base model (#4911)
Improve train CLI with a provided base model so that you can:

* add a new component
* extend an existing component
* replace an existing component

When the final model and best model are saved, reenable any disabled
components and merge the meta information to include the full pipeline
and accuracy information for all components in the base model plus the
newly added components if needed.
2020-01-16 01:58:51 +01:00
Bram Vanroy
718704022a Changes to spacy_conll in universe (#4914)
* Update information on spacy_conll

* Typo fix
2020-01-16 01:56:39 +01:00
Matthew Honnibal
1785eebfe0
Merge pull request #4909 from svlandeg/bugfix/cnn_window
bugfix typo conv_window
2020-01-14 11:23:14 +01:00
svlandeg
ee828d5a9a bugfix typo conv_window 2020-01-14 09:02:58 +01:00
Sofie Van Landeghem
c70ccd543d Friendly error warning for NEL example script (#4881)
* make model positional arg and raise error if no vectors

* small doc fixes
2020-01-14 01:51:14 +01:00
adrianeboyd
d24bca62f6 Add CJK to character classes (#4884)
* Add CJK character class as uncased

* Incorporate Chinese URL test case

Un-xfail Chinese URL test instance
2020-01-08 16:50:19 +01:00
Preston Badeer
b216ff43c9 Update vectors-similarity.md (#4889)
These links are broken on the website, due to quotes around the URLs.
2020-01-08 16:49:40 +01:00
adrianeboyd
aef83e8070 Mark most Hungarian tokenizer test cases as slow (#4883)
* Mark most Hungarian tokenizer test cases as slow

Mark most Hungarian tokenizer test cases as slow to reduce the runtime
of the test suite in ordinary usage:

* for normal tests: run default tests plus 10% of the detailed tests
* for slow tests: run all tests

* Rework to mark individual tests as slow
2020-01-08 12:34:06 +01:00
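The split described in this commit (roughly 10% of the detailed cases in the normal run, everything in the slow run) can be sketched as below. How the 10% was actually selected is not stated in the commit; taking every tenth case, and the names `split_cases` and `keep_every`, are assumptions for illustration. In a pytest suite the second group would typically carry a marker such as `pytest.mark.slow`.

```python
def split_cases(cases, keep_every=10):
    """Partition test cases into (normal, slow) groups.

    Assumption: every `keep_every`-th case stays in the normal run;
    the remaining cases only run when slow tests are enabled.
    """
    normal, slow = [], []
    for i, case in enumerate(cases):
        if i % keep_every == 0:
            normal.append(case)
        else:
            slow.append(case)
    return normal, slow


# Hypothetical detailed tokenizer cases.
detailed = ["case{}".format(i) for i in range(50)]
normal, slow = split_cases(detailed)
```

With 50 detailed cases this leaves 5 in the normal run and 45 marked slow, and a full (slow) run still covers all 50.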
Sofie Van Landeghem
7b96a5e10f Reduce mem usage in training Entity Linker (#4811)
* move nlp processing for el pipe to batch training instead of preprocessing

* adding dev eval back in, and limit in articles instead of entities

* use pipe whenever possible

* few more small doc changes

* access dev data through generator

* tqdm description

* small fixes

* update documentation
2020-01-06 14:59:50 +01:00
Sofie Van Landeghem
6e9b61b49d add warning in debug_data for punctuation in entities (#4853) 2020-01-06 14:59:28 +01:00
adrianeboyd
d652ff215d Add trailing whitespace to multiline test text (#4877) 2020-01-06 14:58:59 +01:00
adrianeboyd
de69bc6509 Fix and improve URL pattern (#4882)
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
2020-01-06 14:58:30 +01:00
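The two constraints named in this commit, allowing more than three domain parts and requiring a lowercase TLD, can be illustrated with a deliberately simplified regular expression. This is a sketch only: `URL_SKETCH` and `looks_like_url` are hypothetical names, and the pattern is far less complete than the tokenizer's real URL pattern (no scheme, port, path, or non-ASCII handling).

```python
import re

# One or more dot-terminated labels (any case), then a lowercase-only TLD.
URL_SKETCH = re.compile(r"(?:[A-Za-z0-9-]+\.)+[a-z]{2,}")


def looks_like_url(token):
    """Return True if the whole token matches the simplified pattern."""
    return URL_SKETCH.fullmatch(token) is not None
```

Under this sketch `www.foo.co.uk` is accepted, while `this.That` is rejected because its final part is not all lowercase, leaving it free to be split on the period as an infix.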