Commit Graph

4952 Commits

Author SHA1 Message Date
Matthew Honnibal
7137ad8b0b Make label filtering clearer for projectivisation 2018-02-26 12:02:01 +01:00
Matthew Honnibal
b8d52cb285 Fix inconsistent label freq cutoff for projectivisation 2018-02-26 12:01:44 +01:00
Matthew Honnibal
7b66ec896a Revert "Revert "Improve parser oracle around sentence breaks.""
This reverts commit 36e481c584.
2018-02-26 10:57:37 +01:00
Matthew Honnibal
36e481c584 Revert "Improve parser oracle around sentence breaks."
This reverts commit 50817dc9ad.
2018-02-26 10:53:55 +01:00
Matthew Honnibal
5faae803c6 Add option to not use Janome for Japanese tokenization 2018-02-26 09:39:46 +01:00
Matthew Honnibal
9b406181cd Add Chinese.Defaults.use_jieba setting, for UD 2018-02-25 15:12:38 +01:00
Matthew Honnibal
9ccd0c643b Add Vietnamese 2018-02-25 15:00:46 +01:00
Matthew Honnibal
d4fdb97c87 Fix alignment for words with spaces 2018-02-25 14:55:00 +01:00
Matthew Honnibal
6d2c1ef52c Fix SP tag in generic tag map 2018-02-24 16:04:56 +01:00
Matthew Honnibal
5cc3bd1c1d Update alignment tests 2018-02-24 16:03:58 +01:00
Matthew Honnibal
6138439469 Fix many-to-one alignment 2018-02-24 16:03:50 +01:00
Matthew Honnibal
4890ee1732 Fix scoring of tokenization for punct 2018-02-24 10:32:32 +01:00
Matthew Honnibal
12b39f87da Move cython declarations in matcher.pyx 2018-02-24 10:32:18 +01:00
Matthew Honnibal
01d1b7abdf Support many-to-one alignment in GoldParse 2018-02-24 10:17:01 +01:00
Matthew Honnibal
7865746574 Support many-to-one alignment 2018-02-24 02:09:53 +01:00
Matthew Honnibal
458710b831 Poke matcher test for appveyor 2018-02-23 23:53:48 +01:00
Matthew Honnibal
968dabdde4 Fix bug in multi-task objective 2018-02-23 23:48:09 +01:00
Matthew Honnibal
2c9c8b8d72 Try commenting out emoji test in matcher 2018-02-23 23:34:35 +01:00
Matthew Honnibal
980ad68cbe Try to find test that fails on appveyor 2018-02-23 21:27:53 +01:00
Matthew Honnibal
39de8cd4d3 Try to find test failing on appveyor 2018-02-23 20:59:21 +01:00
Matthew Honnibal
4492a33a9d Fix sent_start multi-task objective when alignment fails 2018-02-23 16:50:59 +01:00
Matthew Honnibal
5fa44e93f1 Set unicode_literals in matcher 2018-02-23 16:48:54 +01:00
Matthew Honnibal
12264f9296 Add multi-task objective for sentence segmentation 2018-02-23 16:25:57 +01:00
Matthew Honnibal
e7deadb519 Set version to 2.1.0.dev1 2018-02-23 16:22:24 +01:00
Matthew Honnibal
7b575a119e Try to reduce memory usage of test_matcher 2018-02-23 15:34:37 +01:00
Matthew Honnibal
24563f4026 Fix data typing in align 2018-02-23 15:08:06 +01:00
Matthew Honnibal
7a5ba20692 Fix integer typing in _align 2018-02-23 14:51:24 +01:00
Matthew Honnibal
875411b875 Set unicode types in _align.pyx and test 2018-02-23 14:35:38 +01:00
Matthew Honnibal
51d9679aa3 Fix broken span.as_doc test 2018-02-23 14:22:24 +01:00
dejanmarich
71c261d58b Update stop_words.py
Added more words
2018-02-23 10:31:01 +01:00
Matthew Honnibal
3e6c1111b7 Remove obsolete test 2018-02-23 03:22:07 +01:00
Matthew Honnibal
a4fdec524a Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-gold 2018-02-22 21:44:28 +01:00
Matthew Honnibal
50817dc9ad Improve parser oracle around sentence breaks. 2018-02-22 19:22:26 +01:00
Matthew Honnibal
307aefe131 Increment version to v2.0.9 2018-02-22 17:07:53 +01:00
Feng Niu
1c60384bed return on empty doc 2018-02-21 15:39:04 -08:00
Feng Niu
7eb1cd100b unbound doc var 2018-02-21 15:05:37 -08:00
Feng Niu
8df75b229c fix unbound vars in es.syntax_iterators 2018-02-21 13:11:17 -08:00
alldefector
4244e285c2 Fix Spanish noun_chunks failure caused by typo 2018-02-21 12:43:21 -08:00
Matthew Honnibal
661873ee4c Randomize the rebatch size in parser 2018-02-21 21:02:07 +01:00
Matthew Honnibal
0872cf611d Don't lower-case lemmas of proper nouns 2018-02-21 16:01:16 +01:00
Matthew Honnibal
a0ddb803fd Make error when no label found more helpful 2018-02-21 16:00:59 +01:00
Matthew Honnibal
ea2fc5d45f Improve length and freq cutoffs in parser 2018-02-21 16:00:38 +01:00
Matthew Honnibal
e5757d4bf0 Add labels property to parser 2018-02-21 16:00:00 +01:00
Matthew Honnibal
eff4ae809a Fix nonproj label filter 2018-02-21 15:59:04 +01:00
Matthew Honnibal
e624405cda Temporarily remove cutoff when filtering labels in nonproj 2018-02-21 13:53:40 +01:00
Matthew Honnibal
f466f0186e Use new alignment implementation in GoldParse 2018-02-20 21:16:35 +01:00
Matthew Honnibal
c0734ba526 Make alignment work with strings 2018-02-20 17:51:49 +01:00
Matthew Honnibal
8180c84a98 Add tests for new Levenshtein alignment 2018-02-20 17:32:25 +01:00
Matthew Honnibal
930c980570 Add improved Levenshtein alignment implementation 2018-02-20 17:31:56 +01:00
Ines Montani
14e7e0f12a Merge pull request #2000 from jimregan/polish-tag-map
Polish tag map
2018-02-18 19:05:58 +01:00
Jim O'Regan
664407de5d missing PrepCase attribute 2018-02-18 14:46:12 +00:00
Jim O'Regan
95f0673fbc fix typo/missing here too 2018-02-18 14:38:27 +00:00
Matthew Honnibal
2bccad8815 Fix incorrect matcher test 2018-02-18 14:56:12 +01:00
Matthew Honnibal
530172d57a Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher 2018-02-18 14:40:42 +01:00
Matthew Honnibal
cf0e320f2b Add doc.is_sentenced attribute, re #1959 2018-02-18 14:16:55 +01:00
Matthew Honnibal
1e5aeb4eec Merge pull request #1987 from thomasopsomer/span-sent
Make span.sent work when only manual / custom sbd
2018-02-18 14:05:37 +01:00
Matthew Honnibal
1cf774bdc1 Add output options return_matches and as_tuples to Matcher 2018-02-18 14:00:45 +01:00
Matthew Honnibal
dd9b0945af Fix inconsistencies in the symbols table 2018-02-18 13:51:31 +01:00
Matthew Honnibal
66496ac8e1 Set version to v2.1.0.dev0 2018-02-18 13:48:39 +01:00
Matthew Honnibal
eb3040ce46 Merge pull request #1891 from fucking-signup/master
Fix issue #1889
2018-02-18 13:47:47 +01:00
Matthew Honnibal
3d7285870b Update matcher branch with v2.0.8 master 2018-02-18 13:42:58 +01:00
ines
6bba1db4cc Drop six and related hacks as a dependency 2018-02-18 13:29:56 +01:00
Matthew Honnibal
b30b09192a Merge pull request #1665 from jimregan/animacy
typo in "inan", add "nhum"
2018-02-18 13:26:53 +01:00
Matthew Honnibal
1b3c98e01b Set version to v2.0.8 2018-02-18 12:16:31 +01:00
Matthew Honnibal
f9f46e5a07 Revert matcher fixes from GregDubbin 2018-02-18 10:59:28 +01:00
Matthew Honnibal
86405e4ad1 Fix CLI for multitask objectives 2018-02-18 10:59:11 +01:00
Matthew Honnibal
a34749b2bf Add multitask objectives options to train CLI 2018-02-17 22:03:54 +01:00
Matthew Honnibal
8f06903e09 Fix multitask objectives 2018-02-17 18:41:36 +01:00
Matthew Honnibal
d1246c95fb Fix model loading when using multitask objectives 2018-02-17 18:11:36 +01:00
Matthew Honnibal
262d0a3148 Fix overwriting of lexical attributes when loading vectors during training 2018-02-17 18:11:11 +01:00
Matthew Honnibal
c0caf7cf27 Fix LANG symbol 2018-02-17 18:10:50 +01:00
Matthew Honnibal
0bf2f6be29 Add missing symbol for LANG attr. Fixes inconsistent numeric ID 2018-02-17 17:37:02 +01:00
Matthew Honnibal
97a228a4ce Increment to v2.0.8.dev0 2018-02-17 16:54:36 +01:00
Matthew Honnibal
f7dc64d2a3 Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher 2018-02-17 16:47:35 +01:00
Aaron Marquez
ea571e8325 Merge branch 'master' into issue-1959 2018-02-16 15:14:09 -08:00
Matthew Honnibal
7d5c720fc3 Fix multitask objective when no pipeline provided 2018-02-15 23:50:21 +01:00
Aaron Marquez
f0d3672e17 Changed loading EN model 2018-02-15 14:28:38 -08:00
Aaron Marquez
3765d84d57 Fix issue #1959 2018-02-15 12:51:49 -08:00
Aaron Marquez
7ba4111554 Add test for issue-1959 2018-02-15 12:46:22 -08:00
Matthew Honnibal
59b7cf9db8 Add get_beam_parse method in ArcEager, for Prodigy 2018-02-15 21:03:16 +01:00
Matthew Honnibal
3e541de440 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-15 21:02:55 +01:00
Thomas Opsomer
5d24a81c0b add test for span.sent when doc not parsed 2018-02-15 16:59:16 +01:00
Thomas Opsomer
deab391cbf correct check on sent_start & raise if no boundaries 2018-02-15 16:58:30 +01:00
Matthew Honnibal
afbd46adfb Remove length cap in PhraseMatcher 2018-02-15 16:10:54 +01:00
Matthew Honnibal
4533c7408d Update matcher tests 2018-02-15 15:39:47 +01:00
Matthew Honnibal
1c19605426 Move matcher2.pyx to matcher.pyx 2018-02-15 15:27:03 +01:00
Matthew Honnibal
9ebf2fe7c3 Make helper function to get longest matches 2018-02-15 15:26:15 +01:00
Matthew Honnibal
4cb861e080 Merge pull request #1968 from DuyguA/is_currency
New lexical feature is_currency
2018-02-15 12:13:36 +01:00
Thomas Opsomer
b902731313 Find span sentence when only sentence boundaries (no parser) 2018-02-14 22:18:54 +01:00
Matthew Honnibal
d19dc67886 Make get_action nogil, for efficiency 2018-02-14 12:16:36 +01:00
Matthew Honnibal
7885b92b45 Refactor matcher2, hopefully making it faster 2018-02-14 12:11:17 +01:00
Matthew Honnibal
00261eea27 Make tests refer to matcher2 2018-02-14 12:10:51 +01:00
Claudiu-Vlad Ursache
e28de12cbd Ensure files opened in from_disk are closed
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
Matthew Honnibal
262cbe356e Remove caching, as doesn't seem to help for now. 2018-02-13 17:15:20 +01:00
Matthew Honnibal
f43d53f2c5 Remove print statement 2018-02-13 17:15:07 +01:00
Matthew Honnibal
dcd8d89aef Update test for 850, making it work with matcher2 2018-02-13 16:35:20 +01:00
Matthew Honnibal
9bdfa5cd4f Remove re comparisons tests, as matcher behaves differently 2018-02-13 16:28:52 +01:00
Matthew Honnibal
6d7986b0f1 Fix matcher test 2018-02-13 16:28:06 +01:00
Matthew Honnibal
9efda9e9ab Add PhraseMatcher in matcher2.pyx 2018-02-13 16:27:46 +01:00
Johannes Dollinger
012e874d09 Add contributor agreement for emulbreh 2018-02-13 13:40:33 +01:00
Johannes Dollinger
bf94c13382 Don't fix random seeds on import 2018-02-13 12:42:23 +01:00
Matthew Honnibal
0004331895 Update notes on matcher2 2018-02-13 11:45:45 +01:00
Matthew Honnibal
b4cc39eb74 Fix zero-width quantifiers. Passes test_matcher 2018-02-13 11:45:32 +01:00
Matthew Honnibal
1b01685f47 Fix ZERO_PLUS operator 2018-02-12 12:28:03 +01:00
Matthew Honnibal
9115c3ba0a Add TODO in notes 2018-02-12 12:06:48 +01:00
Matthew Honnibal
b00326a7fe Move pattern_id out of TokenPattern 2018-02-12 12:05:54 +01:00
Matthew Honnibal
d34c732635 Add Python notes for rethinking matcher 2018-02-12 10:19:29 +01:00
Matthew Honnibal
d7c9b53120 Pass kwargs into pipeline components during begin_training 2018-02-12 10:18:39 +01:00
Matthew Honnibal
fae5c0dc18 Work on matcher2 2018-02-12 10:17:43 +01:00
4altinok
ca8728035d added new lex feat to token 2018-02-11 18:55:48 +01:00
4altinok
edd7202a06 added new symbol 2018-02-11 18:55:32 +01:00
4altinok
ed1ac2969e added new lexical feat to lexeme 2018-02-11 18:51:48 +01:00
4altinok
94fb0b75e3 code for is_currency 2018-02-11 18:51:32 +01:00
4altinok
3deef1497a removed 18 and replaced 18 with is_currency 2018-02-11 18:51:09 +01:00
4altinok
471d3c9e23 added lex test for is_currency 2018-02-11 18:50:50 +01:00
ines
c63e99da8a Fix typo in glossary (resolves #1964)
Co-Authored-By: SThomasP <sthomasp@users.noreply.github.com>
2018-02-10 11:58:41 +01:00
Lyndon White
6ee5dff51c Make python 3.4 compat module loading (fix #1733) 2018-02-09 23:03:35 +08:00
Matthew Honnibal
e361b4f82b Fix #1929: Incorrect NER when pre-set sentence boundaries. 2018-02-08 15:25:41 +01:00
Matthew Honnibal
fd9fd275c5 Make test for #1945 more precise 2018-02-07 02:06:11 +01:00
Matthew Honnibal
c087a14380 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-07 01:29:39 +01:00
Matthew Honnibal
76d89b2180 Add test for #1945: PhraseMatcher regression 2018-02-07 01:29:23 +01:00
Ines Montani
0954e15dda Merge pull request #1913 from ohenrik/nb_syntax_iterator
Norwegian Language (nb) - Added french syntax iterator with explanation
2018-02-06 04:59:07 +01:00
Ole Henrik Skogstrøm
251a7805fe Copied French syntax iterator to simplify future changes 2018-02-05 14:45:05 +01:00
Matthew Honnibal
2e7391e627 Merge pull request #1916 from tokestermw/bug/fix-not-passing-in-model-cfg-in-nlp
Bug/fix not passing in model cfg in nlp
2018-02-05 01:19:40 +01:00
Ali Zarezade
9df9da34a3 Fix init_model issue
Fixing issue #1928
2018-02-03 17:21:34 +03:30
Matthew Honnibal
ebe84e45e5 Increment version to 2.0.7 2018-02-02 03:39:16 +01:00
Matthew Honnibal
e4b1f57599 Increment version 2018-02-02 02:33:23 +01:00
Matthew Honnibal
069531c351 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-02 02:32:58 +01:00
Matthew Honnibal
f74a802d09 Test and fix #1919: Error resuming training 2018-02-02 02:32:40 +01:00
ines
f1d3deffac Add Russian example sentences (see #1107) 2018-02-01 20:09:40 +01:00
Matthew Honnibal
6b1126c312 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-01 02:57:52 +01:00
ines
3c1fb9d02d Make validate command fail more gracefully if version not found
Mostly relevant during development when working with .dev versions
2018-01-31 22:06:28 +01:00
Motoki Wu
54062b7326 added tests for issue #1915 2018-01-30 18:30:19 -08:00
Motoki Wu
f4a7d1a423 make sure to pass in **cfg to each component when training 2018-01-30 18:29:54 -08:00
ines
4046823699 Only check component in factories if string (see #1911) 2018-01-30 16:29:07 +01:00
ines
ce10d320c4 Fix component check in self.factories (see #1911) 2018-01-30 16:09:37 +01:00
Ole Henrik Skogstrøm
e40465487c Added french syntax iterator with explanation 2018-01-30 15:44:29 +01:00
ines
8901814248 Improve error handling if pipeline component is not callable (resolves #1911)
Also add help message if user accidentally calls nlp.add_pipe() with a string of a built-in component name.
2018-01-30 15:43:03 +01:00
Matthew Honnibal
a437ba87a3 Set release=True 2018-01-29 21:26:04 +01:00
Adam Binford
9238749aaf Removed test to avoid network requests 2018-01-29 14:48:20 -05:00
Adam Binford
1a2c2f7d7f Fixed auto linking after download and added simple test to check 2018-01-29 14:25:21 -05:00
Matthew Honnibal
cb7110c22e Merge pull request #1882 from ohenrik/nb_lemma_and_tag_map
Add norwegian bokmål ('nb') lemmatizer and tag_map
2018-01-29 18:18:50 +01:00
Matthew Honnibal
0c1e7f0c86 Merge pull request #1893 from azarezade/master
Add Persian language
2018-01-29 18:18:33 +01:00
Matthew Honnibal
cbdab75b36 Increment version 2018-01-28 23:46:22 +01:00
Matthew Honnibal
512e6adb08 Merge pull request #1896 from thomasopsomer/fix-sent
Fix sentence boundaries serialization (issue #1834)
2018-01-28 21:18:51 +01:00
Matthew Honnibal
f5b1ad4100 Limit parser model size, to hopefully reduce memory during CI tests 2018-01-28 21:00:32 +01:00
Thomas Opsomer
515e25910e fix sent_start in serialization 2018-01-28 19:50:42 +01:00
Thomas Opsomer
45d62561f7 add test for the issue 2018-01-28 19:49:56 +01:00
ines
6d978e5c35 Don't use deprecated Doc.merge call in displaCy
As reported here: https://stackoverflow.com/a/48464412/6400719
2018-01-27 11:25:05 +01:00
Ali Zarezade
bb6bd3d8ae add persian language 2018-01-27 13:27:26 +03:30