svlandeg
c4d030dbf6
remove accidental commit
2020-03-09 18:10:54 +01:00
svlandeg
1724a4f75b
additional information if doc is empty
2020-03-09 18:08:18 +01:00
Renaud Richardet
eccf6b1686
small typo in code sample
2020-03-09 14:49:11 +01:00
Adriane Boyd
0c31f03ec5
Update docs [ci skip]
2020-03-09 13:41:17 +01:00
Adriane Boyd
1139247532
Revert changes to token_match priority from #4374
* Revert changes to priority of `token_match` so that it has priority over all other tokenizer patterns
* Add lookahead and potentially slow lookbehind back to the default URL pattern
* Expand character classes in URL pattern to improve matching around lookaheads and lookbehinds related to #4882
* Revert changes to Hungarian tokenizer
* Revert (xfail) several URL tests to their status before #4374
* Update `tokenizer.explain()` and docs accordingly
2020-03-09 12:09:41 +01:00
Ines Montani
1d6aec805d
Fix formatting and update docs for v2.2.4
2020-03-09 11:17:20 +01:00
Ines Montani
5f68004264
Port over gitignore changes from develop
Prevents stale files when switching branches
2020-03-09 11:05:00 +01:00
Mark Abraham
0345135167
Tokenizer to_disk and from_disk now ensure paths (#5116)
* Tokenizer to_disk and from_disk now ensure strings are converted to paths
Fixes #5115
* Sign contributor agreement
2020-03-08 13:25:56 +01:00
Yohei Tamura
31755630a7
fix typo (#5106)
2020-03-08 13:24:38 +01:00
adrianeboyd
9dd98a4b27
Improve Makefile (#5105)
* Explicitly upgrade pip
* Include spacy-lookups-data in pex
2020-03-08 13:24:19 +01:00
adrianeboyd
993758c58f
Remove unnecessary iterator in Language.pipe (#5101)
Remove the iterator over `raw_texts` created with `itertools.tee()` in
`Language.pipe`, which is never consumed and uses memory
unnecessarily.
2020-03-08 13:22:25 +01:00
Ines Montani
cd79c7bd26
Merge pull request #5110 from dhpollack/dhp/fix-minor-svg-error
fix typo in svg file - caused documentation build error
2020-03-06 15:32:43 +01:00
Sofie Van Landeghem
1a2b8fc264
set vector of merged entity (#5085)
* merge_entities sets the vector in the vocab for the merged token
* add unit test
* import unicode_literals
* move code to _merge function
* only set vector if vocab has non-zero vectors
2020-03-06 14:45:28 +01:00
David Pollack
80004930ed
fix typo in svg file
2020-03-05 17:04:33 +01:00
Matthew Honnibal
3440a72ecb
Update Makefile (#5099)
2020-03-04 19:28:16 +01:00
Ines Montani
31faab3647
Merge pull request #5097 from mirfan899/master
Basque language support added.
2020-03-04 17:20:23 +01:00
Ines Montani
99d8ee506f
Merge pull request #5100 from adrianeboyd/feature/bump-srsly-1.0.2
Require srsly >=1.0.2
2020-03-04 16:32:52 +01:00
Adriane Boyd
4d655b1d45
Require srsly >=1.0.2
2020-03-04 13:50:37 +01:00
Muhammad Irfan
224a7f8e94
examples
2020-03-04 15:49:06 +05:00
Muhammad Irfan
03376c9d9b
Basque language added and tested.
2020-03-04 11:58:56 +05:00
adrianeboyd
9be90dbca3
Improve token head verification (#5079)
* Improve token head verification
Improve the verification for valid token heads when heads are set:
* in `Token.head`: heads come from the same document
* in `Doc.from_array()`: head indices are within the bounds of the document
* Improve error message
2020-03-03 21:44:51 +01:00
adrianeboyd
8c20dae6f7
Fix model-final/model-best meta from train CLI (#5093)
* Fix model-final/model-best meta
* include speed and accuracy from final iteration
* combine with speeds from base model if necessary
* Include token_acc metric for all components
2020-03-03 21:43:25 +01:00
Sofie Van Landeghem
a0998868ff
prevent updating cfg if the Model was already defined (#5078)
2020-03-03 13:58:56 +01:00
Sofie Van Landeghem
d307e9ca58
take care of global vectors in multiprocessing (#5081)
* restore load_nlp.VECTORS in the child process
* add unit test
* fix test
* remove unnecessary import
* add utf8 encoding
* import unicode_literals
2020-03-03 13:58:22 +01:00
adrianeboyd
d078b47c81
Break out of infinite loop as intended (#5077)
2020-03-03 12:29:05 +01:00
adrianeboyd
697bec764d
Normalize IS_SENT_START to SENT_START for Matcher (#5080)
2020-03-03 12:22:39 +01:00
adrianeboyd
2281c4708c
Restore empty tokenizer properties (#5026)
* Restore empty tokenizer properties
* Check for types in tokenizer.from_bytes()
* Add test for setting empty tokenizer rules
2020-03-02 11:55:02 +01:00
Sofie Van Landeghem
c6b12ab02a
Bugfix/get doc (#5049)
* new (broken) unit test
* fixing get_doc method
2020-03-02 11:49:28 +01:00
adrianeboyd
65d7bab10f
Initialize all values in a2b/b2a in new align (#5063)
2020-02-27 18:43:00 +01:00
Matthew Honnibal
b4e0d2bf50
Improve Makefile (#5067)
* Improve pex making
* Update gitignore
2020-02-26 20:59:10 +01:00
Adriane Boyd
9f740a9891
Add a few more Danish tokenizer exceptions
2020-02-26 14:59:03 +01:00
Ines Montani
1c212215cd
Merge pull request #5064 from adrianeboyd/feature/german-tokenization
Improve German tokenization
2020-02-26 13:41:44 +01:00
Ines Montani
56978f5cd8
Merge pull request #5060 from svlandeg/feature/update-thinc
update thinc
2020-02-26 13:40:23 +01:00
Adriane Boyd
d1f703d78d
Improve German tokenization
Improve German tokenization with respect to Tiger.
2020-02-26 13:06:52 +01:00
Ines Montani
54da6a2a07
Update pyproject.toml
2020-02-26 12:51:53 +01:00
Ines Montani
ed9358420e
Merge branch 'master' into pr/5060
2020-02-26 12:51:29 +01:00
adrianeboyd
ff184b7a9c
Add tag_map argument to CLI debug-data and train (#4750) (#5038)
Add an argument for a path to a JSON-formatted tag map, which is used to
update and extend the default language tag map.
2020-02-26 12:10:38 +01:00
svlandeg
18ff97589d
update spacy to 2.2.4.dev0
2020-02-26 10:50:05 +01:00
svlandeg
62406a9513
update from thinc 7.4.0.dev2 to 7.4.0
2020-02-26 10:30:35 +01:00
Ines Montani
c7e3c034d2
Merge pull request #5061 from explosion/fix/pyproject-toml-master
Update pyproject.toml
2020-02-25 20:22:26 +01:00
Ines Montani
dc36ec98a4
Update pyproject.toml
2020-02-25 16:46:14 +01:00
Ines Montani
acb4e3c7ba
Merge pull request #5039 from adrianeboyd/typo/website-token-api-shape
Fix formatting in Token API
2020-02-25 14:57:25 +01:00
Ines Montani
d50152b917
Merge pull request #5019 from questoph/master
Optimizing tokenization for Luxembourgish (dealing with apostrophe infixes)
2020-02-25 14:48:50 +01:00
Ines Montani
4440a072d2
Merge pull request #5006 from svlandeg/bugfix/multiproc-underscore
load Underscore state when multiprocessing
2020-02-25 14:46:02 +01:00
Ines Montani
38fc05986c
Merge pull request #5058 from bryant1410/patch-1
Add missing comma in a dependency specification
2020-02-25 14:44:29 +01:00
svlandeg
d848a68340
thinc 7.4.0.dev2
2020-02-25 12:07:42 +01:00
Santiago Castro
54d8665ff7
Add missing comma in a dependency specification
Conda is complaining that it can't parse that line otherwise.
2020-02-24 16:15:28 -05:00
svlandeg
b49a3afd0c
use clean_underscore fixture
2020-02-23 15:49:20 +01:00
Ines Montani
4890db6339
Auto-format and fix image [ci skip]
2020-02-23 13:56:50 +01:00
Tom Keefe
ddf63b97a8
make idx available via to_array (#5030)
2020-02-22 14:13:06 +01:00