Tyler Couto
9fa9d7f2cb
Fix for Issue 4665 - conllu2json ( #4953 )
...
* Fix for Issue 4665 - conllu2json
- Allowing HEAD to be an underscore
* Added contributor agreement
2020-02-03 13:01:48 +01:00
Ines Montani
abd5c06374
Adjust formatting [ci skip]
2020-02-03 13:00:02 +01:00
Martin A. Kayser
02a44c5be2
Adding a note on retrieving the string rep of the match_id ( #4904 )
...
Stolen from here: https://stackoverflow.com/questions/47638877/using-phrasematcher-in-spacy-to-find-multiple-match-types
2020-02-03 12:58:58 +01:00
Omri Mendels
6ff947e1f9
Added presidio-research to universe.json ( #4950 )
...
* Added presidio-research to universe.json
Added a reference to Presidio Research, the data-science toolbox for Microsoft Presidio.
* Updated url
2020-02-03 12:57:55 +01:00
Matthew Honnibal
d031440de2
Update setup.cfg
2020-01-29 17:35:46 +01:00
Paco Nathan
49fefb6139
Submitting PyTextRank for inclusion in the spaCy uniVerse ( #4942 )
...
* submitting PyTextRank to be considered for inclusion in the spaCy uniVerse
* including SCA
2020-01-28 11:37:54 +01:00
adrianeboyd
a938566b62
Fix Sentencizer.pipe() for empty doc ( #4940 )
2020-01-28 11:36:49 +01:00
adrianeboyd
7ad000fce7
Update docs for train CLI --use_gpu option ( #4927 )
2020-01-20 17:02:47 +01:00
Yohei Tamura
708a4d27eb
fix nlp.evaluate ( #4924 ) ( #4925 )
...
* new file: test_issue4924.py
* modified: spacy/gold.pyx
* modified: test_issue4924.py for python2
2020-01-20 12:17:46 +01:00
Kabir Khan
b9afcd56e3
Fix ent_ids and labels properties when id attribute used in patterns ( #4900 )
...
* Fix ent_ids and labels properties when id attribute used in patterns
* use set for labels
* sort ent_ids for comparison in entity_ruler tests
* fixing entity_ruler ent_ids test
* add to set
2020-01-16 02:01:31 +01:00
Sofie Van Landeghem
fbfc418745
run normal textcat train script with transformers ( #4834 )
...
* keep trf tok2vec and wordpiecer components during update
* also support transformer models for other example scripts
2020-01-16 02:01:23 +01:00
adrianeboyd
90c52128dc
Improve train CLI with base model ( #4911 )
...
Improve train CLI with a provided base model so that you can:
* add a new component
* extend an existing component
* replace an existing component
When the final model and best model are saved, reenable any disabled
components and merge the meta information to include the full pipeline
and accuracy information for all components in the base model plus the
newly added components if needed.
2020-01-16 01:58:51 +01:00
Bram Vanroy
718704022a
Changes to spacy_conll in universe ( #4914 )
...
* Update information on spacy_conll
* Typo fix
2020-01-16 01:56:39 +01:00
Matthew Honnibal
1785eebfe0
Merge pull request #4909 from svlandeg/bugfix/cnn_window
...
bugfix typo conv_window
2020-01-14 11:23:14 +01:00
svlandeg
ee828d5a9a
bugfix typo conv_window
2020-01-14 09:02:58 +01:00
Sofie Van Landeghem
c70ccd543d
Friendly error warning for NEL example script ( #4881 )
...
* make model positional arg and raise error if no vectors
* small doc fixes
2020-01-14 01:51:14 +01:00
adrianeboyd
d24bca62f6
Add CJK to character classes ( #4884 )
...
* Add CJK character class as uncased
* Incorporate Chinese URL test case
Un-xfail Chinese URL test instance
2020-01-08 16:50:19 +01:00
Preston Badeer
b216ff43c9
Update vectors-similarity.md ( #4889 )
...
These links are broken on the website, due to quotes around the URLs.
2020-01-08 16:49:40 +01:00
adrianeboyd
aef83e8070
Mark most Hungarian tokenizer test cases as slow ( #4883 )
...
* Mark most Hungarian tokenizer test cases as slow
Mark most Hungarian tokenizer test cases as slow to reduce the runtime
of the test suite in ordinary usage:
* for normal tests: run default tests plus 10% of the detailed tests
* for slow tests: run all tests
* Rework to mark individual tests as slow
2020-01-08 12:34:06 +01:00
Sofie Van Landeghem
7b96a5e10f
Reduce mem usage in training Entity Linker ( #4811 )
...
* move nlp processing for el pipe to batch training instead of preprocessing
* adding dev eval back in, and limit in articles instead of entities
* use pipe whenever possible
* few more small doc changes
* access dev data through generator
* tqdm description
* small fixes
* update documentation
2020-01-06 14:59:50 +01:00
Sofie Van Landeghem
6e9b61b49d
add warning in debug_data for punctuation in entities ( #4853 )
2020-01-06 14:59:28 +01:00
adrianeboyd
d652ff215d
Add trailing whitespace to multiline test text ( #4877 )
2020-01-06 14:58:59 +01:00
adrianeboyd
de69bc6509
Fix and improve URL pattern ( #4882 )
...
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
2020-01-06 14:58:30 +01:00
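Note on de69bc6509: the two rules described in that commit (domains with more than three labels are matched, and only lowercase TLDs are accepted) can be illustrated with a much smaller regular expression. This is a minimal sketch and not spaCy's actual URL pattern; the name `SIMPLE_URL` and the helper `looks_like_url` are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical, heavily simplified pattern. It is NOT the pattern used by spaCy.
# Domain labels may contain upper- or lowercase ASCII letters, digits and hyphens,
# but the final label (the TLD) must consist of lowercase letters only.
SIMPLE_URL = re.compile(
    r"^(?:https?://)?"        # optional scheme
    r"(?:[A-Za-z0-9-]+\.)+"   # one or more labels, each followed by a period
    r"[a-z]{2,}"              # lowercase-only TLD
    r"(?:/\S*)?$"             # optional path
)


def looks_like_url(text):
    """Return True if the whole string matches the simplified URL pattern."""
    return SIMPLE_URL.match(text) is not None
```

Under these assumptions, `www.foo.co.uk` is accepted because any number of labels may precede the TLD, while `this.That` is rejected because `That` is not a lowercase TLD, which leaves the period free to be split as an infix.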
Sofie Van Landeghem
a1b22e90cd
serialize ENT_ID ( #4852 )
...
* expand serialization test for custom token attribute
* add failing test for issue 4849
* define ENT_ID as attr and use in doc serialization
* fix few typos
2020-01-06 14:57:34 +01:00
Geoffrey Gordon Ashbrook
53929138d7
remove extra word typo ( #4875 )
...
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani
400257a802
Update index.md [ci skip]
2020-01-04 01:52:18 +01:00
Al Johri
1aa2d4dac9
stop rendering mathjax by default in displacy ( #4840 )
...
* stop rendering mathjax by default in displacy
* Replace f-string and add comment
Co-authored-by: Ines Montani <ines@ines.io>
2020-01-01 13:15:05 +01:00
Anastasiia Iurshina
db9257559c
Adds script shebang ( #4846 )
2019-12-29 14:25:05 +01:00
Anastasiia Iurshina
1830a12578
Fixes typos ( #4843 )
...
* Fixes typos
* Fixes typo
* Contributor agreement
2019-12-29 14:24:13 +01:00
Ivan Echevarria
ef13e0c038
Add n_process to Language.pipe documentation ( #4842 ) [ci skip]
...
* Add n_process to documentation
* Auto-format and add default [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Al Johri
fd4a7bd2b7
sign contributor agreement for AlJohri ( #4839 ) [ci skip]
2019-12-29 14:17:28 +01:00
Ines Montani
3431ac42de
Fix typo
2019-12-21 21:17:45 +01:00
Ines Montani
7c69d30de5
Tidy up and expect warning
2019-12-21 21:14:52 +01:00
Sofie Van Landeghem
732142bf28
facilitate larger training files ( #4827 )
...
* add warning for large file and change start var to long
* type for file_length
2019-12-21 21:12:19 +01:00
Ines Montani
cb4145adc7
Tidy up and auto-format
2019-12-21 19:04:17 +01:00
Olamilekan Wahab
a741de7cf6
Adding support for Yoruba Language ( #4614 )
...
* Adding Support for Yoruba
* test text
* Updated test string.
* Fixing encoding declaration.
* Adding encoding to stop_words.py
* Added contributor agreement and removed iranlowo.
* Added removed test files and removed iranlowo to keep project bare.
* Returned CONTRIBUTING.md to default state.
* Added deleted conftest entries
* Tidy up and auto-format
* Revert CONTRIBUTING.md
Co-authored-by: Ines Montani <ines@ines.io>
2019-12-21 14:11:50 +01:00
Ines Montani
1b838d1313
Divide models into core and starters [ci skip]
2019-12-21 14:10:22 +01:00
Ines Montani
0750d59e5a
Allow setting ner_missing_tag on docs_to_json
2019-12-21 13:47:21 +01:00
Sofie Van Landeghem
8ebbb85117
Documentation for PhraseMatcher constructor ( #4826 )
...
* add max_length as argument for init PhraseMatcher
* improve error message too
2019-12-20 23:00:04 +01:00
Sofie Van Landeghem
12158c1e3a
Restore tqdm imports ( #4804 )
...
* set 4.38.0 to minimal version with color bug fix
* set imports back to proper place
* add upper range for tqdm
2019-12-16 13:12:19 +01:00
Ines Montani
c466e02466
Update universe [ci skip]
2019-12-13 15:57:39 +01:00
Sofie Van Landeghem
557dcf5659
NEL requires sentences to be set ( #4801 )
2019-12-13 15:55:18 +01:00
tamuhey
1707e77c5e
add char_span to Span ( #4793 )
2019-12-13 15:54:58 +01:00
Sofie Van Landeghem
f9b541f9ef
More robust set entities method in KB ( #4794 )
...
* add unit test for setting entities with duplicate identifiers
* count the number of actual unique identifiers and throw duplicate warning
2019-12-13 10:45:29 +01:00
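Note on f9b541f9ef: the idea of counting unique identifiers and warning on duplicates can be sketched in plain Python. This is not the knowledge base implementation from the commit; the function `set_entities_sketch`, its arguments, and the choice to keep the first occurrence of a duplicate are all assumptions made for illustration.

```python
import warnings


def set_entities_sketch(entity_ids, freqs):
    """Hypothetical stand-in for a bulk entity setter.

    Stores each identifier once and emits a warning for every duplicate.
    Keeping the first occurrence is an assumption of this sketch.
    """
    if len(entity_ids) != len(freqs):
        raise ValueError("entity_ids and freqs must have the same length")
    store = {}
    for entity_id, freq in zip(entity_ids, freqs):
        if entity_id in store:
            warnings.warn("Duplicate entity identifier: {}".format(entity_id))
            continue
        store[entity_id] = freq
    return store
```

The size of the returned dictionary is the number of unique identifiers, which may be smaller than the length of the input list.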
Thiago Lages de Alencar
a067ded495
Update doc.md ( #4796 )
2019-12-11 18:21:40 +01:00
Sofie Van Landeghem
5355b0038f
Update EL example ( #4789 )
...
* update EL example script after sentence-central refactor
* version bump
* set incl_prior to False for quick demo purposes
* clean up
2019-12-11 18:19:42 +01:00
adrianeboyd
38e1bc19f4
Add destructors for states in TransitionSystem ( #4686 )
2019-12-10 13:23:27 +01:00
Matthew Honnibal
45efdb1ef7
Merge branch 'master' of https://github.com/explosion/spaCy
2019-12-10 00:54:18 +01:00
Matthew Honnibal
0a3175d46f
Require thinc v7.4.0.dev0
2019-12-10 00:47:51 +01:00
adrianeboyd
c208eb6e4d
Fix int value handling in Matcher ( #4749 )
...
Add `int` values (for `LENGTH`) in _get_attr_values() instead of
treating `int` like `dict`.
2019-12-06 19:22:57 +01:00