Commit Graph

1198 Commits

Author SHA1 Message Date
ines
6d2c85f428 Drop six and related hacks as a dependency 2018-03-28 10:45:25 +02:00
Matthew Honnibal
de9fd091ac Fix #2014: token.pos_ not writeable 2018-03-27 21:21:11 +02:00
Matthew Honnibal
1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal
d2118792e7 Merge changes from master 2018-03-27 13:38:41 +02:00
Matthew Honnibal
7d4687162f Update doc.ents test 2018-03-26 07:14:35 +02:00
Matthew Honnibal
938436455a Add test for ent_iob during span merge 2018-03-25 22:16:19 +02:00
Matthew Honnibal
bede11b67c
Improve label management in parser and NER (#2108)
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.

Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important, to ensure that the label-to-class mapping is stable.

We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.

To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.
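
A minimal sketch of the negative-frequency idea described above, not the actual spaCy code. The helper names `add_label` and `ordered_labels` are hypothetical, and sorting in descending order (so observed labels come first and new labels follow in the order they were added) is an assumption made for illustration.

```python
def add_label(freqs, label):
    """Register a new label with the next unused negative pseudo-frequency."""
    if label in freqs:
        return
    lowest = min(freqs.values(), default=0)
    # Real frequencies are positive, so clamp to 0 before stepping down:
    # the first new label gets -1, the next -2, and so on.
    freqs[label] = min(lowest, 0) - 1


def ordered_labels(freqs):
    """Sort by (frequency, label), highest first (assumed direction)."""
    pairs = sorted(((f, l) for l, f in freqs.items()), reverse=True)
    return [label for _, label in pairs]


# Labels seen in training data carry real (positive) frequencies.
freqs = {"nsubj": 120, "dobj": 80}
for new_label in ["new_a", "new_b", "new_c"]:
    add_label(freqs, new_label)

order = ordered_labels(freqs)
```

Because each new label receives a strictly smaller value than the previous one, re-sorting the dictionary after loading it from disk should reproduce the same order, and therefore the same class IDs.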

Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training.

To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.
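
The streaming count and cut-off could look roughly like the sketch below. It is illustrative only: `count_labels`, `apply_cutoff`, and `relabel` are hypothetical names, documents are reduced to plain lists of label strings, and mapping rare labels to `'dep'` follows the back-off mentioned in the commit list below, though how spaCy applies it internally is not shown here.

```python
from collections import Counter


def count_labels(doc_stream):
    """Accumulate label frequencies while consuming a stream of documents."""
    freqs = Counter()
    for labels in doc_stream:
        # Only the counts are retained; each document can be discarded.
        freqs.update(labels)
    return freqs


def apply_cutoff(freqs, min_freq, backoff="dep"):
    """Keep labels seen at least min_freq times, plus the back-off label."""
    kept = {label for label, n in freqs.items() if n >= min_freq}
    kept.add(backoff)
    return kept


def relabel(labels, kept, backoff="dep"):
    """Replace labels that fell below the cut-off with the back-off label."""
    return [label if label in kept else backoff for label in labels]


def stream():
    # Stand-in for a corpus reader that yields one document at a time.
    yield ["nsubj", "dobj", "nsubj"]
    yield ["nsubj", "amod"]
    yield ["dobj", "nsubj"]


freqs = count_labels(stream())
kept = apply_cutoff(freqs, min_freq=2)
relabelled = relabel(["nsubj", "amod"], kept)
```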

Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.
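
A rough sketch of shuffling by paths, assuming one file per document. To stay within the standard library it writes JSON rather than msgpack, and the function names `write_docs` and `stream_shuffled` are hypothetical; the GoldCorpus implementation itself may differ.

```python
import json
import random
import tempfile
from pathlib import Path


def write_docs(docs, directory):
    """Write each document to its own file and return the list of paths."""
    paths = []
    for i, doc in enumerate(docs):
        path = Path(directory) / f"{i:04d}.json"
        path.write_text(json.dumps(doc), encoding="utf8")
        paths.append(path)
    return paths


def stream_shuffled(paths, seed=None):
    """Shuffle the paths, then load documents lazily, one at a time."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    for path in paths:
        yield json.loads(path.read_text(encoding="utf8"))


docs = [{"id": i, "text": "document %d" % i} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = write_docs(docs, tmp_dir)
    shuffled = list(stream_shuffled(paths, seed=0))
```

Only the list of paths has to fit in memory for the shuffle; the documents themselves are read back one by one.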

This is a squash merge, as I made a lot of very small commits. Individual commit messages below.

* Simplify label management for TransitionSystem and its subclasses

* Fix serialization for new label handling format in parser

* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir

* Set actions in transition system

* Require thinc 6.11.1.dev4

* Fix error in parser init

* Add unicode declaration

* Fix unicode declaration

* Update textcat test

* Try to get model training on less memory

* Print json loc for now

* Try rapidjson to reduce memory use

* Remove rapidjson requirement

* Try rapidjson for reduced mem usage

* Handle None heads when projectivising

* Stream json docs

* Fix train script

* Handle projectivity in GoldParse

* Fix projectivity handling

* Add minibatch_by_words util from ud_train

* Minibatch by number of words in spacy.cli.train

* Move minibatch_by_words util to spacy.util

* Fix label handling

* More hacking at label management in parser

* Fix encoding in msgpack serialization in GoldParse

* Adjust batch sizes in parser training

* Fix minibatch_by_words

* Add merge_subtokens function to pipeline.pyx

* Register merge_subtokens factory

* Restore use of msgpack tmp directory

* Use minibatch-by-words in train

* Handle retokenization in scorer

* Change back-off approach for missing labels. Use 'dep' label

* Update NER for new label management

* Set NER tags for over-segmented words

* Fix label alignment in gold

* Fix label back-off for infrequent labels

* Fix int type in labels dict key

* Fix int type in labels dict key

* Update feature definition for 8 feature set

* Update ud-train script for new label stuff

* Fix json streamer

* Print the line number if conll eval fails

* Update children and sentence boundaries after deprojectivisation

* Export set_children_from_heads from doc.pxd

* Render parses during UD training

* Remove print statement

* Require thinc 6.11.1.dev6. Try adding wheel as install_requires

* Set different dev version, to flush pip cache

* Update thinc version

* Update GoldCorpus docs

* Remove print statements

* Fix formatting and links [ci skip]
2018-03-19 02:58:08 +01:00
Matthew Honnibal
ff42b726c1 Fix unicode declaration on test 2018-03-19 02:04:24 +01:00
Matthew Honnibal
7dc76c6ff6 Add test for textcat 2018-03-16 12:39:45 +01:00
ines
f3f8bfc367 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 17:16:54 +01:00
ines
d854f69fe3 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 00:18:51 +01:00
Matthew Honnibal
c2f4759257
Fix test for Python 2 2018-03-12 23:03:05 +01:00
Matthew Honnibal
53b3249e06 Add tests for arc eager oracle 2018-03-10 23:42:56 +01:00
Matthew Honnibal
5cc3bd1c1d Update alignment tests 2018-02-24 16:03:58 +01:00
Matthew Honnibal
7865746574 Support many-to-one alignment 2018-02-24 02:09:53 +01:00
Matthew Honnibal
458710b831 Poke matcher test for appveyor 2018-02-23 23:53:48 +01:00
Matthew Honnibal
2c9c8b8d72 Try commenting out emoji test in matcher 2018-02-23 23:34:35 +01:00
Matthew Honnibal
980ad68cbe Try to find test that fails on appveyor 2018-02-23 21:27:53 +01:00
Matthew Honnibal
39de8cd4d3 Try to find test failing on appveyor 2018-02-23 20:59:21 +01:00
Matthew Honnibal
7b575a119e Try to reduce memory usage of test_matcher 2018-02-23 15:34:37 +01:00
Matthew Honnibal
875411b875 Set unicode types in _align.pyx and test 2018-02-23 14:35:38 +01:00
Matthew Honnibal
51d9679aa3 Fix broken span.as_doc test 2018-02-23 14:22:24 +01:00
Matthew Honnibal
3e6c1111b7 Remove obsolete test 2018-02-23 03:22:07 +01:00
Matthew Honnibal
c0734ba526 Make alignment work with strings 2018-02-20 17:51:49 +01:00
Matthew Honnibal
8180c84a98 Add tests for new Levenshtein alignment 2018-02-20 17:32:25 +01:00
Matthew Honnibal
2bccad8815 Fix incorrect matcher test 2018-02-18 14:56:12 +01:00
Matthew Honnibal
530172d57a Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher 2018-02-18 14:40:42 +01:00
Matthew Honnibal
1e5aeb4eec
Merge pull request #1987 from thomasopsomer/span-sent
Make span.sent work when only manual / custom sbd
2018-02-18 14:05:37 +01:00
Matthew Honnibal
eb3040ce46
Merge pull request #1891 from fucking-signup/master
Fix issue #1889
2018-02-18 13:47:47 +01:00
Matthew Honnibal
3d7285870b Update matcher branch with v2.0.8 master 2018-02-18 13:42:58 +01:00
ines
6bba1db4cc Drop six and related hacks as a dependency 2018-02-18 13:29:56 +01:00
Matthew Honnibal
f9f46e5a07 Revert matcher fixes from GregDubbin 2018-02-18 10:59:28 +01:00
Matthew Honnibal
f7dc64d2a3 Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher 2018-02-17 16:47:35 +01:00
Aaron Marquez
f0d3672e17 Changed loading EN model 2018-02-15 14:28:38 -08:00
Aaron Marquez
7ba4111554 Add test for issue-1959 2018-02-15 12:46:22 -08:00
Thomas Opsomer
5d24a81c0b add test for span.sent when doc not parsed 2018-02-15 16:59:16 +01:00
Matthew Honnibal
4533c7408d Update matcher tests 2018-02-15 15:39:47 +01:00
Matthew Honnibal
4cb861e080
Merge pull request #1968 from DuyguA/is_currency
New lexical feature is_currency
2018-02-15 12:13:36 +01:00
Matthew Honnibal
00261eea27 Make tests refer to matcher2 2018-02-14 12:10:51 +01:00
Claudiu-Vlad Ursache
e28de12cbd
Ensure files opened in from_disk are closed
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
Matthew Honnibal
dcd8d89aef Update test for 850, making it work with matcher2 2018-02-13 16:35:20 +01:00
Matthew Honnibal
9bdfa5cd4f Remove re comparisons tests, as matcher behaves differently 2018-02-13 16:28:52 +01:00
Matthew Honnibal
6d7986b0f1 Fix matcher test 2018-02-13 16:28:06 +01:00
4altinok
471d3c9e23 added lex test for is_currency 2018-02-11 18:50:50 +01:00
Matthew Honnibal
fd9fd275c5 Make test for #1945 more precise 2018-02-07 02:06:11 +01:00
Matthew Honnibal
c087a14380 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-07 01:29:39 +01:00
Matthew Honnibal
76d89b2180 Add test for #1945: PhraseMatcher regression 2018-02-07 01:29:23 +01:00
Matthew Honnibal
2e7391e627
Merge pull request #1916 from tokestermw/bug/fix-not-passing-in-model-cfg-in-nlp
Bug/fix not passing in model cfg in nlp
2018-02-05 01:19:40 +01:00
Matthew Honnibal
f74a802d09 Test and fix #1919: Error resuming training 2018-02-02 02:32:40 +01:00
Motoki Wu
54062b7326 added tests for issue #1915 2018-01-30 18:30:19 -08:00
ines
8901814248 Improve error handling if pipeline component is not callable (resolves #1911)
Also add help message if user accidentally calls nlp.add_pipe() with a string of a built-in component name.
2018-01-30 15:43:03 +01:00
Matthew Honnibal
512e6adb08
Merge pull request #1896 from thomasopsomer/fix-sent
Fix sentence boundaries serialization (issue #1834)
2018-01-28 21:18:51 +01:00
Matthew Honnibal
f5b1ad4100 Limit parser model size, to hopefully reduce memory during CI tests 2018-01-28 21:00:32 +01:00
Thomas Opsomer
45d62561f7 add test for the issue 2018-01-28 19:49:56 +01:00
Kit
52ef51f36e
Add test for issue #1889 2018-01-25 22:56:48 +01:00
Matthew Honnibal
6a8cb905aa
Merge pull request #1876 from GregDubbin/master
Pattern matcher fixes
2018-01-24 16:38:11 +01:00
Matthew Honnibal
edb71a280e Add test for #1883: Unpickling Matcher 2018-01-24 15:42:33 +01:00
Matthew Honnibal
42a18ef903 Add test for #1868: Vocab.__contains__ with ints 2018-01-23 23:27:05 +01:00
greg
85ab99e692 Correct test examples 2018-01-23 15:00:14 -05:00
Matthew Honnibal
91e916cb67 Add comment to new test 2018-01-23 19:11:53 +01:00
Matthew Honnibal
fd187d71ad Add test for #1727 2018-01-23 19:11:01 +01:00
Matthew Honnibal
7e6dc283db Fix unicode import in test 2018-01-22 23:55:44 +01:00
greg
686735b94e Fix matcher import 2018-01-22 16:53:05 -05:00
Matthew Honnibal
4ce7d24fd5 Add test for #1799: Set left and right edges (and thus sentences) in non-projective parses. 2018-01-22 20:18:38 +01:00
greg
7072b395c9 Add greedy matcher tests 2018-01-16 15:46:13 -05:00
Matthew Honnibal
ccb51a9f36 Make .similarity() return 1.0 if all orth attrs match 2018-01-15 16:29:48 +01:00
Matthew Honnibal
82135d85b7 Fix test 2018-01-15 15:55:15 +01:00
Matthew Honnibal
4b09616b58 Add test for #1757: Comparison against None 2018-01-15 15:55:01 +01:00
Matthew Honnibal
9e413449f6 Fix unicode error in new test 2018-01-15 15:39:00 +01:00
Matthew Honnibal
6b215d2dd3 Add test for Issue #1537 2018-01-15 15:20:56 +01:00
ines
5babb7d6f6 Merge branch 'master' of https://github.com/explosion/spaCy 2018-01-14 17:31:09 +01:00
ines
793890cb4d Remove test for removed deprecation warning 2018-01-14 17:31:06 +01:00
Matthew Honnibal
1a1cca6052 Fix vectors.resize() on Py3. Closes #1539 2018-01-14 14:48:51 +01:00
Matthew Honnibal
0153220304 Make set_vector add word to vocab. Fixes #1807 2018-01-14 13:57:57 +01:00
Ines Montani
55754f0cee
Merge pull request #1836 from fucking-signup/master
Add tests for issue #1769
2018-01-13 00:23:35 +00:00
Kit
4ee97f20a0
Mark like_num tests as slow 2018-01-13 00:44:15 +01:00
Kit
855531537e
Rewrite tests for issue #1769 2018-01-12 23:49:51 +01:00
Kit
5b541cb5ec
Simplify tests for issue #1769 2018-01-12 23:34:27 +01:00
Kit
7a2adc4633
Remove some tests to see build status changes 2018-01-12 22:49:16 +01:00
Kit
0e62809a43
Rewrite tests for issue #1769 2018-01-12 22:26:06 +01:00
Ines Montani
36f426fe0a
Merge pull request #1808 from fucking-signup/master
Fix issue #1769
2018-01-12 21:12:02 +00:00
Kit
76f4eeca44
Remove tests to see build changes on Windows (Python 2.7) 2018-01-12 20:30:51 +01:00
Kit
7ec0956e8d
Add regression test (issue #1769) 2018-01-08 03:42:04 +01:00
Søren Lind Kristiansen
62de5da1ff Remove unused dummy variable 2018-01-05 09:57:24 +01:00
Søren Lind Kristiansen
10dab8eef8 Remove dummy variable from function calls 2018-01-05 09:37:05 +01:00
Kevin Humphreys
597df5bf83 add test 2018-01-03 13:00:05 -08:00
Ines Montani
ff9fc945ab
Merge pull request #1749 from sorenlind/da_ud_tokenization
Tune Danish tokenizer to more closely match Universal Dependencies
2017-12-22 16:00:49 +00:00
ines
26f313dabc Fix missing import 2017-12-22 16:21:44 +01:00
ines
8dc1c27841 Merge branch 'master' of https://github.com/explosion/spaCy 2017-12-22 16:01:00 +01:00
ines
b10ba848b8 xfail test that causes MemoryError on Python 2 on Windows
Need to investigate this further!
2017-12-22 16:00:58 +01:00
Ines Montani
a3dd167d7f
Merge branch 'master' into da_ud_tokenization 2017-12-20 21:05:34 +00:00
Ines Montani
d682a8803e
Merge pull request #1672 from cbilgili/master
Adds Turkish Lemmatization
2017-12-20 21:01:00 +00:00
Søren Lind Kristiansen
15d13efafd Tune Danish tokenizer to more closely match tokenization in Universal Dependencies. 2017-12-20 17:36:52 +01:00
Ines Montani
9c1ee65268
Add regression test for #1698 2017-12-12 10:36:11 +01:00
Isaac Sijaranamual
38021fbb00 Switch from python 3 only TemporaryDirectory to pytest's tmpdir 2017-12-11 00:16:04 +01:00
Isaac Sijaranamual
568130ce7c Adds regression test_issue1622 2017-12-10 23:00:48 +01:00
Matthew Honnibal
36b47e3fa6 Fix (and test) vector pickling 2017-12-07 09:53:30 +01:00
Canbey Bilgili
abe098b255 Adds Turkish Lemmatization 2017-12-01 17:04:32 +03:00
Vadim Mazaev
4ba7ddf651 Bugfixes 2017-11-30 12:29:38 +03:00
Matthew Honnibal
6bc0f4d29f
Merge pull request #1611 from fsonntag/master
Solving #1494
2017-11-29 23:11:23 +01:00
Matthew Honnibal
f9ed9ea529
Merge pull request #1624 from GreenRiverRUS/russian
Add support for Russian
2017-11-29 23:10:01 +01:00
ines
a31506e060 Fix off-by-one error in nlp.add_pipe(after=name) (fixes #1654) 2017-11-28 20:37:55 +01:00
ines
b62739fbfe Add regression test for #1654 2017-11-28 20:27:54 +01:00
ines
2e50dbb9d7 Simplify test 2017-11-28 20:27:27 +01:00
Felix Sonntag
724ae7dc55 Fixed issue of infix capturing prefixes 2017-11-28 17:17:12 +01:00
Søren Lind Kristiansen
0ffd27b0f6 Add several Danish alternative spellings 2017-11-27 13:35:41 +01:00
Vadim Mazaev
53e7c38637 Fixed tests that depend on pymorphy2 2017-11-26 21:04:44 +03:00
Vadim Mazaev
cacd859dcd Added tag map, fixed test failures, added more exceptions 2017-11-26 20:54:48 +03:00
Ines Montani
a7bb8f1b42
Merge pull request #1637 from sorenlind/da_tokenization
Improve Danish tokenization
2017-11-26 15:41:38 +00:00
ines
c699aec089 Add offsets_from_biluo_tags helper and tests (see #1626) 2017-11-26 16:38:01 +01:00
Søren Lind Kristiansen
6aa241bcec Add day of month tokenizer exceptions for Danish. 2017-11-24 15:03:24 +01:00
Søren Lind Kristiansen
0c276ed020 Add weekday abbreviations and remove ambiguous month abbreviations for Danish. 2017-11-24 14:43:29 +01:00
Søren Lind Kristiansen
056547e989 Add multiple tokenizer exceptions for Danish. 2017-11-24 11:51:26 +01:00
Søren Lind Kristiansen
8dc265ac0c Add test for tokenization of 'i.' for Danish. 2017-11-24 11:29:37 +01:00
Matthew Honnibal
30ba81f881
Merge pull request #1576 from ligser/master
Actually reset caches in pipe [wip]
2017-11-23 12:54:48 +01:00
ines
c90fe92e15 Fix displaCy test 2017-11-22 05:04:39 +01:00
ines
a6f33ac27d Fix displaCy test 2017-11-22 04:19:28 +01:00
Vadim Mazaev
81314f8659 Fixed tokenizer: added char classes; added first lemmatizer and
tokenizer tests
2017-11-21 22:23:59 +03:00
Burton DeWilde
635792997c Add regression test for #1612 2017-11-20 12:05:35 -06:00
ines
d70a64d78b Fix syntax error and formatting in test (see #1617) 2017-11-20 14:01:25 +01:00
ines
17849dee4b Fix French test (see #1617) 2017-11-20 13:59:59 +01:00
Felix Sonntag
8be3392302 Added regression test for 1494 2017-11-19 16:30:35 +01:00
Motoki Wu
b818afaa0e Added failing test for Issue #1207.
The noun chunk iterator should work for `Doc` but not for `Span`.
2017-11-17 17:04:27 -08:00
ines
a3d4dd1a5d Test adding of lots of pipeline components (see #1585)
Just to make sure that there's no error now or in the future with adding a large number of pipeline components.
2017-11-15 17:28:06 +01:00
Roman Domrachev
505c6a2f2f Completely cleanup tokenizer cache
The tokenizer cache can have keys other than strings.

This modification can slow down the tokenizer and needs to be measured.
2017-11-15 17:55:48 +03:00
Roman Domrachev
3e21680814 Use safer method to get string without hit 2017-11-14 22:58:46 +03:00
Roman Domrachev
4e378dc4a4 Remove all obsolete code and test only initial problem 2017-11-14 20:45:04 +03:00
Roman
47ce2347b0
Create test that fails when actual cleanup caused 2017-11-14 20:28:13 +03:00
Roman Domrachev
3d247d2bb8 Get back previous testcase 2017-11-14 18:01:37 +03:00
Roman Domrachev
a2745b0e84 StringStore now actually cleaned
Do not lose docs in ref tracking
2017-11-14 17:45:50 +03:00
Roman Domrachev
ee60a52ee7 Fix test imports and last batch cleanup 2017-11-11 11:32:16 +03:00
Roman Domrachev
3c600adf23 Try to fix StringStore clean up (see #1506) 2017-11-11 03:11:27 +03:00
ines
ee97fd3cb4 Add regression test for #1547 2017-11-11 00:14:03 +01:00
ines
2df27db671 Add unicode declaration 2017-11-11 00:13:56 +01:00
ines
1c218397f6 Ensure path in Doc.to_disk/from_disk (resolves #1521)
Also add Doc serialization tests with both Path and string path options
2017-11-09 02:29:03 +01:00
Matthew Honnibal
a5ea0fdf5a Fix #1518: vocab.vectors.resize() didn't work 2017-11-08 22:18:37 +01:00
Matthew Honnibal
4194bc5744 Xfail flakey serialization test 2017-11-08 13:55:13 +01:00
ines
42a0fbf291 Fix textcat simple train example 2017-11-07 01:25:54 +01:00
ines
5f43953536 Move test 2017-11-06 23:14:10 +01:00
Matthew Honnibal
1831dbd065 Add test of simple textcat workflow 2017-11-06 22:04:29 +01:00
Matthew Honnibal
2f7e9f390d Make test less flakey 2017-11-06 17:34:50 +01:00
Matthew Honnibal
407b08017e Make test less flakey 2017-11-06 17:31:40 +01:00
Matthew Honnibal
102f797933 Fix lemma ordering in test 2017-11-06 17:02:17 +01:00
Matthew Honnibal
63c6ae4191 Fix lemmatizer test 2017-11-06 11:57:06 +01:00
Matthew Honnibal
00435d8f0c Add extra beam parsing test 2017-11-05 14:39:57 +01:00
ines
5e7d98f72a Remove test for #1491 2017-11-03 22:10:57 +01:00
ines
718f1c50fb Add regression test for #1491 2017-11-03 21:11:20 +01:00
Matthew Honnibal
144a93c2a5 Back-off to tensor for similarity if no vectors 2017-11-03 20:56:33 +01:00
Matthew Honnibal
d6e831bf89 Fix lemmatizer tests 2017-11-03 19:46:34 +01:00
ines
eef930c73e Assert instead of print 2017-11-03 18:50:57 +01:00
ines
f0986df94b Add test for #1488 (passes on v2.0.0a18?) 2017-11-03 14:44:36 +01:00
Matthew Honnibal
711278b667 Make test less flakey 2017-11-03 14:36:08 +01:00
Matthew Honnibal
0a534ae96a Fix test for backprop d_pad 2017-11-03 14:04:16 +01:00
Matthew Honnibal
a22f96c3f1 Add test for backpropagating padding 2017-11-03 00:48:54 +01:00
ines
3af281a334 Update test model name 2017-11-01 23:02:00 +01:00
ines
8c2260e18c Move span tests to /doc 2017-11-01 16:56:35 +01:00
ines
260cb37224 Catch deprecation warning 2017-11-01 16:49:18 +01:00
ines
5914faafbb Fix .merge tests to not use deprecated API 2017-11-01 16:49:11 +01:00
Matthew Honnibal
9e0ebee81c Add Token.is_sent_start property, so can deprecate Token.sent_start 2017-11-01 13:27:14 +01:00
Matthew Honnibal
c047498f87 Fix vectors test 2017-11-01 13:24:47 +01:00
Matthew Honnibal
86eba61fae Fix token.vector when vectors are missing 2017-11-01 00:47:35 +01:00
Ines Montani
d11659463b
Merge pull request #1152 from jimregan/develop-irish
[WIP] attempt a port from #1147
2017-11-01 00:23:43 +01:00
Jim O'Regan
08b0bfd153 merge 2017-10-31 22:55:59 +00:00
Jim O'Regan
00ecfa5417 Ó, not O 2017-10-31 22:54:42 +00:00
Ines Montani
25b1d6cd91
Fix syntax error 2017-10-31 22:36:03 +01:00
Matthew Honnibal
92dc127569 Fix test for Python 3 2017-10-31 22:21:55 +01:00
Jim O'Regan
fe4b10346a replace example sentence until I get around to adding a punctuation.py 2017-10-31 20:24:53 +00:00
Matthew Honnibal
77d8f5de9a Revise and simplify Vectors class 2017-10-31 18:25:08 +01:00
Jim O'Regan
d4a8160c36 change quotes 2017-10-31 15:15:44 +00:00
Jim O'Regan
34ca59691b no idea what is wrong here 2017-10-31 14:50:13 +00:00
Jim O'Regan
41dd29e48e merge 2017-10-31 14:07:45 +00:00
Matthew Honnibal
cb5217012f Fix vector remapping 2017-10-31 11:40:46 +01:00
Matthew Honnibal
9c11ee4a1c WIP on vectors fixes 2017-10-31 11:22:56 +01:00
Matthew Honnibal
368fdb389a WIP on refactoring and fixing vectors 2017-10-31 02:00:26 +01:00
Explosion Bot
72aea8f105 Update vectors.add() to allow setting keys to rows 2017-10-30 10:03:08 +01:00
Matthew Honnibal
64e4ff7c4b Merge 'tidy-up' changes into branch. Resolve conflicts 2017-10-28 13:16:06 +02:00
Ines Montani
4033e70c71 Merge pull request #1461 from explosion/feature/disable-pipes
💫 Add Language.disable_pipes(), to temporarily edit pipeline and update code examples
2017-10-27 12:21:40 +02:00
Matthew Honnibal
b0f3ea2200 Fix names of pipeline components
NeuralDependencyParser --> DependencyParser
NeuralEntityRecognizer --> EntityRecognizer
TokenVectorEncoder     --> Tensorizer
NeuralLabeller         --> MultitaskObjective
2017-10-26 12:38:23 +02:00
ines
de1e5f35d5 Merge branch 'develop' into feature/disable-pipes 2017-10-25 16:33:12 +02:00
ines
c0b55ebdac Fix PhraseMatcher.__contains__ and add more tests 2017-10-25 16:31:11 +02:00
ines
657a4d91bc Merge branch 'develop' into feature/disable-pipes 2017-10-25 15:19:05 +02:00
ines
1a722dac31 Merge branch 'develop' into feature/disable-pipes 2017-10-25 15:18:18 +02:00
Matthew Honnibal
b5de768852 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-25 14:44:16 +02:00
Matthew Honnibal
094512fd47 Fix model-mark on regression test. 2017-10-25 14:44:00 +02:00
Matthew Honnibal
e70f80f29e Add Language.disable_pipes() 2017-10-25 13:46:41 +02:00
Ines Montani
d3bf488e16 Merge pull request #1171 from mollerhoj/support-danish
Improve basic support for Danish
2017-10-24 20:29:57 +02:00
Matthew Honnibal
908809d488 Update tests 2017-10-24 17:05:15 +02:00
Matthew Honnibal
30e67fa808 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-24 16:08:23 +02:00
Matthew Honnibal
63f0bde749 Add test for #1250: Tokenizer cache clobbered special-case attrs 2017-10-24 16:07:18 +02:00
ines
090aed940a Add test for currently failing span.as_doc case 2017-10-24 16:00:56 +02:00
ines
4ef81a9ebc Fix whitespace 2017-10-24 16:00:56 +02:00
Matthew Honnibal
4bea65a1a8 Fix Issue #1450: Off-by-1 in * and ? matches
Patterns that end in variable-length operators e.g. * and ? now end on
the correct token. Previously, they were off by 1: the next token was
pulled into the match, even if that's where the pattern failed.
2017-10-24 14:26:27 +02:00
Matthew Honnibal
391d5ef0d1 Normalize imports in regression test 2017-10-24 14:25:49 +02:00
Matthew Honnibal
b66b8f028b Fix #1375 -- out-of-bounds on token.nbor() 2017-10-24 12:10:39 +02:00
Matthew Honnibal
a68d89a4f3 Add failing test for bug #1375 -- no out-of-bounds error for token.nbor() 2017-10-24 12:05:25 +02:00
Ines Montani
facf77e541 Merge branch 'develop' into support-danish 2017-10-24 11:53:19 +02:00
Matthew Honnibal
ccd2ab1a62 Merge pull request #1443 from ramananbalakrishnan/develop-get-lca-matrix
Add LCA matrix for spans and docs
2017-10-24 11:22:46 +02:00
Matthew Honnibal
ef3e5a361b Merge pull request #1442 from explosion/feature/fix-sp
💫 Fix SP tag, tweak Vectors.__init__, fix Morphology
2017-10-24 10:24:07 +02:00
Matthew Honnibal
fdf25d10ba Merge pull request #1440 from ramananbalakrishnan/develop
Support single value for attribute list in doc.to_array
2017-10-24 10:23:12 +02:00
Matthew Honnibal
490ad3eaf0 Check that empty strings are handled. Closes #1242 2017-10-21 00:52:14 +02:00
Ramanan Balakrishnan
d2fe56a577
Add LCA matrix for spans and docs 2017-10-20 23:58:00 +05:30
Matthew Honnibal
d8391b1c4d Fix #1434: Matcher failed on ending ? if no token 2017-10-20 16:49:36 +02:00
Matthew Honnibal
f111b228e0 Fix re-parsing of previously parsed text
If a Doc object had been previously parsed, it was possible for
invalid parses to be added. There were two problems:

1) The parse was only being partially erased
2) The RightArc action was able to create a 1-cycle.

This patch fixes both errors, and avoids resetting the parse if one is
present. In theory this might allow a better parse to be predicted by
running the parser twice.

Closes #1253.
2017-10-20 16:27:36 +02:00
Matthew Honnibal
ebecaddb76 Make 'data_or_width' two keyword args in Vectors.__init__
Previously the data and width options were one argument in Vectors,
which meant you couldn't say vectors = Vectors(strings, width=300).
It's better to have two keywords.
2017-10-20 14:17:15 +02:00
Ramanan Balakrishnan
b3ab124fc5
Support strings for attribute list in doc.to_array 2017-10-20 11:46:57 +05:30
ines
bf415fd778 Add test for serializing extension attrs (see #1085) 2017-10-19 00:53:08 +02:00
Matthew Honnibal
fe844148f6 Test pickling hooks 2017-10-17 19:43:52 +02:00
Matthew Honnibal
374819edf8 Test user_data deserialization, re #1085 2017-10-17 19:28:54 +02:00
Matthew Honnibal
8ca97f32a3 Fix doc pickling test 2017-10-17 18:19:57 +02:00
Matthew Honnibal
45d1dd90b1 Add tests for pickling doc 2017-10-17 17:20:58 +02:00
Matthew Honnibal
4174477161 Fix equality check in test 2017-10-16 19:50:35 +02:00
Matthew Honnibal
010a7309ff Merge pull request #1402 from explosion/feature/fix-matcher-operators
💫 Fix Matcher variable-length operators
2017-10-16 17:53:19 +02:00
Matthew Honnibal
c29927d2e7 Fix matcher test 2017-10-16 17:22:18 +02:00
Matthew Honnibal
a928ae2f35 Merge branch 'develop' into feature/fix-matcher-operators 2017-10-16 13:38:36 +02:00
Matthew Honnibal
748d525801 Add more matcher operator tests 2017-10-16 13:38:01 +02:00
ines
3516aa0cea Port over changes from #1389 2017-10-14 13:32:55 +02:00
ines
cd6a29dce7 Port over changes from #1294 2017-10-14 13:28:46 +02:00
ines
38c756fd85 Port over changes from #1287 2017-10-14 13:16:21 +02:00
ines
612224c10d Port over changes from #1157 2017-10-14 13:11:39 +02:00
ines
9b3f8f9ec3 Fix formatting and add comment on languages 2017-10-14 13:11:18 +02:00
ines
a4d974d97b Port over URL pattern changes from #1411 2017-10-14 12:58:07 +02:00
Matthew Honnibal
cf6da9301a Update lemmatizer test 2017-10-12 22:50:52 +02:00
Matthew Honnibal
462caf835a Fix SBD test 2017-10-12 21:18:22 +02:00
Ines Montani
37aa523a8e Merge pull request #1408 from explosion/feature/dot-underscore
💫 Custom attributes via Doc._, Token._ and Span._
2017-10-11 18:35:56 +02:00
ines
51519251c2 Fix underscore method test 2017-10-11 13:34:19 +02:00
ines
c6ae49e8bf Fix formatting 2017-10-11 13:34:11 +02:00
ines
453c47ca24 Add German lemmatizer tests 2017-10-11 13:27:26 +02:00
ines
15fe0fd82d Fix tests 2017-10-11 13:27:18 +02:00
ines
e0ff145a8b Merge branch 'develop' into feature/dot-underscore 2017-10-11 11:57:05 +02:00
Matthew Honnibal
fd47f8e89f Fix failing test 2017-10-11 08:38:34 +02:00
Matthew Honnibal
462b2e26b4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-11 08:23:04 +02:00
Matthew Honnibal
2c118ab3a6 Add tests for Doc creation 2017-10-11 03:21:23 +02:00
Matthew Honnibal
d84136b4a9 Update add label test 2017-10-10 22:57:41 +02:00
Matthew Honnibal
e0a9b02b67 Merge Span._ and Span.as_doc methods 2017-10-09 22:00:15 -05:00
Matthew Honnibal
09d61ada5e Merge pull request #1396 from explosion/feature/pipeline-management
💫 Improve pipeline and factory management
2017-10-10 04:29:54 +02:00
Matthew Honnibal
f0f2739ae3 Add test for serialization issue raised in #1105 2017-10-10 03:57:58 +02:00
ines
de374dc72a Merge branch 'feature/pipeline-management' into feature/dot-underscore 2017-10-09 14:37:51 +02:00
Matthew Honnibal
2534cd57d7 Add bandaid solution to the 'shadowing' problem in #864 2017-10-09 08:59:35 +02:00
Matthew Honnibal
d8a2506023 Merge pull request #1401 from explosion/feature/add-parser-action
💫 Allow labels to be added to pre-trained parser and NER modes
2017-10-09 04:57:51 +02:00
Matthew Honnibal
689349e32f Merge pull request #1400 from explosion/feature/sentence-parsing
💫 Force parser to respect preset sentence boundaries
2017-10-09 04:31:43 +02:00
Matthew Honnibal
fad2b8315f Merge branch 'develop' into feature/add-parser-action 2017-10-09 04:13:04 +02:00
Matthew Honnibal
6c79841c0d Fix tests for history features 2017-10-09 04:12:24 +02:00
Matthew Honnibal
dde87e6b0d Add tests for adding parser actions 2017-10-09 03:42:35 +02:00
Matthew Honnibal
81a64119db Fix string-to-unicode problem 2017-10-09 00:59:49 +02:00
Matthew Honnibal
02c2af7119 Fix test 2017-10-09 00:29:37 +02:00
Matthew Honnibal
5a67efeccc Add tests for sentence segmentation presetting 2017-10-09 00:02:23 +02:00
Matthew Honnibal
9bd8191739 Add tests for Underscore 2017-10-07 18:56:19 +02:00
Matthew Honnibal
3b67eabfea Allow empty dictionaries to match any token in Matcher
Often patterns need to match "any token". A clean way to denote this
is with the empty dict {}: this sets no constraints on the token,
so should always match.

The problem was that having attributes length==0 was used as an
end-of-array signal, so the matcher didn't handle this case correctly.

This patch compiles empty token spec dicts into a constraint
NULL_ATTR==0. The NULL_ATTR attribute, 0, is always set to 0 on the
lexeme -- so this always matches.
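
The idea can be illustrated with a toy matcher in plain Python. This is a sketch under simplifying assumptions, not the Cython implementation: tokens are plain dicts, `NULL_ATTR` is a string key rather than an attribute ID, patterns have fixed length, and `compile_spec`, `token_matches`, and `find_matches` are hypothetical helpers.

```python
NULL_ATTR = "NULL_ATTR"  # stand-in for the attribute that is always 0


def compile_spec(spec):
    """Turn a token spec dict into a non-empty list of (attr, value) pairs."""
    constraints = sorted(spec.items())
    if not constraints:
        # An empty list would look like an end-of-array marker, so an
        # always-true constraint is emitted instead.
        constraints = [(NULL_ATTR, 0)]
    return constraints


def token_matches(token, constraints):
    """Check every constraint; a missing attribute reads as 0."""
    return all(token.get(attr, 0) == value for attr, value in constraints)


def find_matches(tokens, pattern):
    """Return (start, end) spans where the fixed-length pattern matches."""
    compiled = [compile_spec(spec) for spec in pattern]
    n = len(compiled)
    spans = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(token_matches(t, c) for t, c in zip(window, compiled)):
            spans.append((start, start + n))
    return spans


tokens = [{"LOWER": w} for w in "hello , world hello big world".split()]
pattern = [{"LOWER": "hello"}, {}, {"LOWER": "world"}]
spans = find_matches(tokens, pattern)
```

Here the middle `{}` places no constraint on the token, so both "hello , world" and "hello big world" are matched.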
2017-10-07 03:36:15 +02:00
ines
0adadcb3f0 Fix beam parse model test 2017-10-07 02:15:15 +02:00
ines
b38a8f4a94 Fix and update pipe methods tests 2017-10-07 02:06:23 +02:00
Matthew Honnibal
3a65a0c970 Start adding tests for new pipeline management 2017-10-07 01:48:23 +02:00
ines
61a503a611 Fix parser test 2017-10-07 00:38:51 +02:00
Matthew Honnibal
c6cd81f192 Wrap try/except around model saving 2017-10-05 08:14:24 -05:00
Matthew Honnibal
fd4baff475 Update tests 2017-10-05 08:12:27 -05:00
Matthew Honnibal
40edb65ee7 Make test work for Python 2.7 2017-10-04 16:36:50 +02:00
Matthew Honnibal
db05d4d582 Add test for #1380. Passes without fix? 2017-10-04 14:56:31 +02:00
Matthew Honnibal
4a59f6358c Fix thinc imports 2017-10-03 19:21:26 +02:00
Ines Montani
959c46eabe Merge pull request #1365 from wannaphongcom/develop
Add Thai language for spaCy v2
2017-09-26 23:43:05 +02:00
Wannaphong Phatthiyaphaibun
7b5263ffa4 fix thai test 2017-09-26 23:54:15 +07:00
Matthew Honnibal
41cc5c4c17 Merge branch 'develop' into feature/phrasematcher 2017-09-26 09:59:17 -05:00
Wannaphong Phatthiyaphaibun
5cba67146c add thai in spacy2 2017-09-26 21:36:27 +07:00
Matthew Honnibal
74f08e1ad5 Update test 2017-09-26 06:45:56 -05:00
Matthew Honnibal
20193371f5 Don't share CNN, to reduce complexities 2017-09-21 14:59:48 +02:00
Matthew Honnibal
cc408fc189 Make PhraseMatcher API like Matcher API 2017-09-20 22:20:35 +02:00
Matthew Honnibal
43ad250dd5 Update matcher tests 2017-09-20 21:54:49 +02:00
Matthew Honnibal
c013e5996f Fix parser test 2017-09-17 13:13:20 -05:00
ines
ece30c28a8 Don't split hyphenated words in German
This way, the tokenizer matches the tokenization in German treebanks
2017-09-16 20:40:15 +02:00
Matthew Honnibal
ebf8942564 Fix test for Python3 2017-09-16 16:22:38 +02:00
Matthew Honnibal
8c945310fb Excuse emoji failure on narrow unicode builds 2017-09-16 16:21:13 +02:00
Matthew Honnibal
3fa5b40b5c Add test for hash consistency 2017-09-16 11:21:35 +02:00
Jim O'Regan
7de709483b missed adding here 2017-09-11 10:51:21 +01:00
Jim O'Regan
b1b6123867 add ga_tokenizer 2017-09-11 10:31:41 +01:00
Jim O'Regan
187be6d372 copy/paste error 2017-09-11 09:33:17 +01:00
Jim O'Regan
c283e9edfe first stab at test 2017-09-11 08:57:48 +01:00
Matthew Honnibal
456bb8a74c Unxfail and close #1305 2017-09-06 19:14:17 +02:00
Matthew Honnibal
99e44fbdbb Update regression test 2017-09-06 19:13:51 +02:00
Matthew Honnibal
497a9308a8 Xfail new lemmatizer test 2017-09-06 18:41:22 +02:00
Matthew Honnibal
5384fff5ce Add test for 1305: Incorrect lemmatization of VBZ for English 2017-09-06 18:40:18 +02:00
Matthew Honnibal
d5fbf27335 Fix test 2017-09-04 16:45:11 +02:00
Matthew Honnibal
cb4839033c Fix loader for EN tests 2017-09-04 15:19:18 +02:00
Matthew Honnibal
644d6c9e1a Improve lemmatization tests, re #1296 2017-09-04 15:17:44 +02:00
Jim Geovedi
fbc62a09c7 added {pre,suf,in}fix tests 2017-08-20 13:43:00 +07:00
Jim Geovedi
713d7c0aa0 added indonesian lang test 2017-08-20 12:17:14 +07:00
Jim Geovedi
fa544e6c9a Merge remote-tracking branch 'upstream/develop' into indonesian 2017-08-20 11:49:40 +07:00
Matthew Honnibal
41c2218c53 Fix test for vectors 2017-08-19 22:09:12 +02:00
Matthew Honnibal
ef87562741 Restore vectors test utils 2017-08-19 20:35:16 +02:00
Matthew Honnibal
1391f9da37 Restore vectors tests 2017-08-19 20:34:58 +02:00
Matthew Honnibal
d55d6e1cfa Fix comparison of Token from different docs. Closes #1257 2017-08-19 16:39:32 +02:00
Matthew Honnibal
4fda02c7e6 Add test for new Span.to_array method 2017-08-19 16:24:38 +02:00
Matthew Honnibal
c606b4a42c Add test for Doc.char_span 2017-08-19 16:18:23 +02:00
Matthew Honnibal
42d47c1e5c Fix tagger serialization 2017-08-19 04:16:32 +02:00
Matthew Honnibal
2da96a0ec7 Fix beam test 2017-08-19 04:15:46 +02:00
Matthew Honnibal
a7309a217d Update tagger serialization 2017-08-18 23:12:05 +02:00
Matthew Honnibal
de7e8703e3 Restore tests for beam parser 2017-08-18 22:27:42 +02:00
Matthew Honnibal
52c180ecf5 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit ea8de11ad5, reversing
changes made to 08e443e083.
2017-08-14 13:00:23 +02:00
Matthew Honnibal
92ebab6073 Update beam-update tests 2017-08-13 08:56:02 +02:00
Matthew Honnibal
24b45b45c6 Add test for beam update 2017-08-12 17:15:28 -05:00
Matthew Honnibal
b353e4d843 Work on parser beam training 2017-08-12 14:47:45 -05:00
Jim Geovedi
cc4772cac2 reworks 2017-08-03 13:08:38 +07:00
Jim Geovedi
783f7d8b86 added test set for Indonesian language 2017-07-29 18:21:07 +07:00