ansgar-t
9732988951
escape html in displacy.render ( #2378 ) ( closes #2361 )
...
## Description
Fix for issue #2361 :
replace &, <, >, " with &amp; , &lt; , &gt; , &quot; in before rendering svg
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
(As discussed in the comments to #2361 )
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-05-28 18:36:41 +02:00
ines
330c039106
Merge branch 'master' into develop
2018-05-26 18:30:52 +02:00
Jani Monoses
ec62cadf4c
Updates to Romanian support ( #2354 )
...
* Add back Romanian in conftest
* Romanian lex_attr
* More tokenizer exceptions for Romanian
* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
Ines Montani
cae4457c38
💫 Add .similarity warnings for no vectors and option to exclude warnings ( #2197 )
...
* Add logic to filter out warning IDs via environment variable
Usage: SPACY_WARNING_EXCLUDE=W001,W007
* Add warnings for empty vectors
* Add warning if no word vectors are used in .similarity methods
For example, if only tensors are available in small models – should hopefully clear up some confusion around this
* Capture warnings in tests
* Rename SPACY_WARNING_EXCLUDE to SPACY_WARNING_IGNORE
2018-05-21 01:22:38 +02:00
Matthew Honnibal
b096b22c20
Merge pull request #2247 from skrcode/1480
...
1480 - Implement Fast-Text vectors with subword features
2018-05-21 01:16:21 +02:00
ines
5401c55c75
Merge branch 'master' into develop
2018-05-20 16:49:40 +02:00
ines
b59e3b157f
Don't require attrs argument in Doc.retokenize and allow both ints and unicode ( resolves #2304 )
2018-05-20 15:15:37 +02:00
Matthew Honnibal
8661218fe8
Refactor parser ( #2308 )
...
* Work on refactoring greedy parser
* Compile updated parser
* Fix refactored parser
* Update test
* Fix refactored parser
* Fix refactored parser
* Readd beam search after refactor
* Fix beam search after refactor
* Fix parser
* Fix beam parsing
* Support oracle segmentation in ud-train CLI command
* Avoid relying on final gold check in beam search
* Add a keyword argument sink to GoldParse
* Bug fixes to beam search after refactor
* Avoid importing fused token symbol in ud-run-test, untl that's added
* Avoid importing fused token symbol in ud-run-test, untl that's added
* Don't modify Token in global scope
* Fix error in beam gradient calculation
* Default to beam_update_prob 1
* Set a more aggressive threshold on the max violn update
* Disable some tests to figure out why CI fails
* Disable some tests to figure out why CI fails
* Add some diagnostics to travis.yml to try to figure out why build fails
* Tell Thinc to link against system blas on Travis
* Point thinc to libblas on Travis
* Try running sudo=true for travis
* Unhack travis.sh
* Restore beam_density argument for parser beam
* Require thinc 6.11.1.dev16
* Revert hacks to tests
* Revert hacks to travis.yml
* Update thinc requirement
* Fix parser model loading
* Fix size limits in training data
* Add missing name attribute for parser
* Fix appveyor for Windows
2018-05-15 22:17:29 +02:00
Matthew Honnibal
546dd99cdf
Merge master into develop -- mostly Arabic and website
2018-05-15 18:14:28 +02:00
Matthew Honnibal
581d318971
Fix conftest
2018-05-15 00:54:45 +02:00
Tahar Zanouda
00417794d3
Add Arabic language ( #2314 )
...
* added support for Arabic lang
* added Arabic language support
* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses
0e08e49e87
Lemmatizer ro ( #2319 )
...
* Add Romanian lemmatizer lookup table.
Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).
The original dataset is licensed under the Open Database License.
* Fix one blatant issue in the Romanian lemmatizer
* Romanian examples file
* Add ro_tokenizer in conftest
* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Douglas Knox
9b49a40f4e
Test and fix for Issue #2219 ( #2272 )
...
Test and fix for Issue #2219 : Token.similarity() failed if single letter
2018-05-03 18:40:46 +02:00
Paul O'Leary McCann
bd72fbf09c
Port Japanese mecab tokenizer from v1 ( #2036 )
...
* Port Japanese mecab tokenizer from v1
This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.
As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.
Things to check:
1. Is this the right way to use a token extension?
2. What's the right way to implement a JapaneseTagger? The approach in
#1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?
-POLM
* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
Matthew Honnibal
9d147e12c4
Merge remote-tracking branch 'origin/master' into develop
2018-05-01 18:18:51 +02:00
Matthew Honnibal
b43bfd3524
Fix arc-eager oracle tests
2018-05-01 16:16:14 +02:00
Matthew Honnibal
31ed64e9b0
Fix textcat test
2018-05-01 15:18:39 +02:00
Matthew Honnibal
adbb1f7533
Add better arc-eager oracle tests
2018-05-01 15:14:55 +02:00
Mr Roboto
6f5ccda19c
Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False ( #2230 )
...
* Fixes issue #2228
* Adds a new contributor
2018-05-01 13:40:22 +02:00
Matthew Honnibal
2c4a6d66fa
Merge master into develop. Big merge, many conflicts -- need to review
2018-04-29 14:49:26 +02:00
ines
1c6d77610c
Add remove_extension method on Doc, Token and Span ( closes #2242 )
2018-04-28 23:33:09 +02:00
ines
abdb853ebf
Simplify underscore tests
2018-04-28 23:30:33 +02:00
Suraj Krishnan Rajan
69d041148f
Implement Fast-Text vectors with subword features
2018-04-21 01:34:14 +05:30
Jens Dahl Møllerhøj
e5055e3cf6
Add Danish lemmatizer ( #2184 )
...
* add danish lemmatizer
* fill contributor agreement
2018-04-07 19:07:28 +02:00
ines
10462816bc
Fix tests for Python 2
2018-04-03 18:51:31 +02:00
ines
62b4b527d7
Don't raise error if set_extension has getter and setter ( closes #2177 )
...
Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.
2018-04-03 18:30:17 +02:00
Suraj Rajan
1cdbb7c97c
[2032] - Changed python set to cpp stl set ( #2170 )
...
Changed python set to cpp stl set #2032
## Description
Changed python set to cpp stl set. CPP stl set works better due to the logarithmic run time of its methods. Finding minimum in the cpp set is done in constant time as opposed to the worst case linear runtime of python set. Operations such as find,count,insert,delete are also done in either constant and logarithmic time thus making cpp set a better option to manage vectors.
Reference : http://www.cplusplus.com/reference/set/set/
### Types of change
Enhancement for `Vectors` for faster initialising of word vectors(fasttext)
2018-03-31 13:28:25 +02:00
Ines Montani
0de599b16b
Merge pull request #2159 from explosion/feature/fix-merged-entity-iob ( resolves #1554 , resolves #1752 )
...
💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents
2018-03-28 23:10:00 +02:00
Ines Montani
98e9cda677
Merge pull request #2158 from explosion/feature/fix-multiple-vectors ( resolves #1660 )
...
💫 Fix loading of multiple vector models
2018-03-28 23:08:24 +02:00
ines
3eb67bbe4b
Allow entity types with dashes ( resolves #1967 )
2018-03-28 20:51:26 +02:00
Matthew Honnibal
cf5fcf0546
Update serialization test
2018-03-28 20:12:53 +02:00
Matthew Honnibal
95fa89c4b8
Update doc.ents test
2018-03-28 18:39:03 +02:00
Matthew Honnibal
cbd2794be0
Add test for ent_iob during span merge
2018-03-28 18:36:53 +02:00
Matthew Honnibal
fd9e259414
Add test for #1660
2018-03-28 18:22:51 +02:00
Matthew Honnibal
95a9615221
Fix loading of multiple pre-trained vectors
...
This patch addresses #1660 , which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.
The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']_nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.
In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
ines
6d2c85f428
Drop six and related hacks as a dependency
2018-03-28 10:45:25 +02:00
Matthew Honnibal
de9fd091ac
Fix #2014 : token.pos_ not writeable
2018-03-27 21:21:11 +02:00
Matthew Honnibal
1f7229f40f
Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
...
This reverts commit c9ba3d3c2d
, reversing
changes made to 92c26a35d4
.
2018-03-27 19:23:02 +02:00
Matthew Honnibal
d2118792e7
Merge changes from master
2018-03-27 13:38:41 +02:00
Matthew Honnibal
7d4687162f
Update doc.ents test
2018-03-26 07:14:35 +02:00
Matthew Honnibal
938436455a
Add test for ent_iob during span merge
2018-03-25 22:16:19 +02:00
Matthew Honnibal
bede11b67c
Improve label management in parser and NER ( #2108 )
...
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.
Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used becaue ordering is very important, to ensure that the label-to-class mapping is stable.
We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.
To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.
Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training.
To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.
Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.
This is a squash merge, as I made a lot of very small commits. Individual commit messages below.
* Simplify label management for TransitionSystem and its subclasses
* Fix serialization for new label handling format in parser
* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir
* Set actions in transition system
* Require thinc 6.11.1.dev4
* Fix error in parser init
* Add unicode declaration
* Fix unicode declaration
* Update textcat test
* Try to get model training on less memory
* Print json loc for now
* Try rapidjson to reduce memory use
* Remove rapidjson requirement
* Try rapidjson for reduced mem usage
* Handle None heads when projectivising
* Stream json docs
* Fix train script
* Handle projectivity in GoldParse
* Fix projectivity handling
* Add minibatch_by_words util from ud_train
* Minibatch by number of words in spacy.cli.train
* Move minibatch_by_words util to spacy.util
* Fix label handling
* More hacking at label management in parser
* Fix encoding in msgpack serialization in GoldParse
* Adjust batch sizes in parser training
* Fix minibatch_by_words
* Add merge_subtokens function to pipeline.pyx
* Register merge_subtokens factory
* Restore use of msgpack tmp directory
* Use minibatch-by-words in train
* Handle retokenization in scorer
* Change back-off approach for missing labels. Use 'dep' label
* Update NER for new label management
* Set NER tags for over-segmented words
* Fix label alignment in gold
* Fix label back-off for infrequent labels
* Fix int type in labels dict key
* Fix int type in labels dict key
* Update feature definition for 8 feature set
* Update ud-train script for new label stuff
* Fix json streamer
* Print the line number if conll eval fails
* Update children and sentence boundaries after deprojectivisation
* Export set_children_from_heads from doc.pxd
* Render parses during UD training
* Remove print statement
* Require thinc 6.11.1.dev6. Try adding wheel as install_requires
* Set different dev version, to flush pip cache
* Update thinc version
* Update GoldCorpus docs
* Remove print statements
* Fix formatting and links [ci skip]
2018-03-19 02:58:08 +01:00
Matthew Honnibal
ff42b726c1
Fix unicode declaration on test
2018-03-19 02:04:24 +01:00
Matthew Honnibal
7dc76c6ff6
Add test for textcat
2018-03-16 12:39:45 +01:00
ines
f3f8bfc367
Add built-in factories for merge_entities and merge_noun_chunks
...
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 17:16:54 +01:00
ines
d854f69fe3
Add built-in factories for merge_entities and merge_noun_chunks
...
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 00:18:51 +01:00
Matthew Honnibal
c2f4759257
Fix test for Python 2
2018-03-12 23:03:05 +01:00
Matthew Honnibal
53b3249e06
Add tests for arc eager oracle
2018-03-10 23:42:56 +01:00
Matthew Honnibal
5cc3bd1c1d
Update alignment tests
2018-02-24 16:03:58 +01:00
Matthew Honnibal
7865746574
Support many-to-one alignment
2018-02-24 02:09:53 +01:00
Matthew Honnibal
458710b831
Poke matcher test for appveyor
2018-02-23 23:53:48 +01:00
Matthew Honnibal
2c9c8b8d72
Try comming out emoji test in matcher
2018-02-23 23:34:35 +01:00
Matthew Honnibal
980ad68cbe
Try to find test that fails on appveyor
2018-02-23 21:27:53 +01:00
Matthew Honnibal
39de8cd4d3
Try to find test failing on appveyor
2018-02-23 20:59:21 +01:00
Matthew Honnibal
7b575a119e
Try to reduce memory usage of test_matcher
2018-02-23 15:34:37 +01:00
Matthew Honnibal
875411b875
Set unicode types in _align.pyx and test
2018-02-23 14:35:38 +01:00
Matthew Honnibal
51d9679aa3
Fix broken span.as_doc test
2018-02-23 14:22:24 +01:00
Matthew Honnibal
3e6c1111b7
Remove obsolete test
2018-02-23 03:22:07 +01:00
Matthew Honnibal
c0734ba526
Make alignment work with strings
2018-02-20 17:51:49 +01:00
Matthew Honnibal
8180c84a98
Add tests for new Levenshtein alignment
2018-02-20 17:32:25 +01:00
Matthew Honnibal
2bccad8815
Fix incorrect matcher test
2018-02-18 14:56:12 +01:00
Matthew Honnibal
530172d57a
Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher
2018-02-18 14:40:42 +01:00
Matthew Honnibal
1e5aeb4eec
Merge pull request #1987 from thomasopsomer/span-sent
...
Make span.sent work when only manual / custom sbd
2018-02-18 14:05:37 +01:00
Matthew Honnibal
eb3040ce46
Merge pull request #1891 from fucking-signup/master
...
Fix issue #1889
2018-02-18 13:47:47 +01:00
Matthew Honnibal
3d7285870b
Update matcher branch with v2.0.8 master
2018-02-18 13:42:58 +01:00
ines
6bba1db4cc
Drop six and related hacks as a dependency
2018-02-18 13:29:56 +01:00
Matthew Honnibal
f9f46e5a07
Revert matcher fixes from GregDubbin
2018-02-18 10:59:28 +01:00
Matthew Honnibal
f7dc64d2a3
Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher
2018-02-17 16:47:35 +01:00
Aaron Marquez
f0d3672e17
Changed loading EN model
2018-02-15 14:28:38 -08:00
Aaron Marquez
7ba4111554
Add test for issue-1959
2018-02-15 12:46:22 -08:00
Thomas Opsomer
5d24a81c0b
add test for span.sent when doc not parsed
2018-02-15 16:59:16 +01:00
Matthew Honnibal
4533c7408d
Update matcher tests
2018-02-15 15:39:47 +01:00
Matthew Honnibal
4cb861e080
Merge pull request #1968 from DuyguA/is_currency
...
New lexical feature is_currency
2018-02-15 12:13:36 +01:00
Matthew Honnibal
00261eea27
Make tests refer to matcher2
2018-02-14 12:10:51 +01:00
Claudiu-Vlad Ursache
e28de12cbd
Ensure files opened in from_disk
are closed
...
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706 ).
2018-02-13 20:49:43 +01:00
Matthew Honnibal
dcd8d89aef
Update test for 850, making it work with matcher2
2018-02-13 16:35:20 +01:00
Matthew Honnibal
9bdfa5cd4f
Remove re comparisons tests, as matcher behaves differently
2018-02-13 16:28:52 +01:00
Matthew Honnibal
6d7986b0f1
Fix matcher test
2018-02-13 16:28:06 +01:00
4altinok
471d3c9e23
added lex test for is_currency
2018-02-11 18:50:50 +01:00
Matthew Honnibal
fd9fd275c5
Make test for #1945 more precise
2018-02-07 02:06:11 +01:00
Matthew Honnibal
c087a14380
Merge branch 'master' of https://github.com/explosion/spaCy
2018-02-07 01:29:39 +01:00
Matthew Honnibal
76d89b2180
Add test for #1945 : PhraseMatcher regression
2018-02-07 01:29:23 +01:00
Matthew Honnibal
2e7391e627
Merge pull request #1916 from tokestermw/bug/fix-not-passing-in-model-cfg-in-nlp
...
Bug/fix not passing in model cfg in nlp
2018-02-05 01:19:40 +01:00
Matthew Honnibal
f74a802d09
Test and fix #1919 : Error resuming training
2018-02-02 02:32:40 +01:00
Motoki Wu
54062b7326
added tests for issue #1915
2018-01-30 18:30:19 -08:00
ines
8901814248
Improve error handling if pipeline component is not callable ( resolves #1911 )
...
Also add help message if user accidentally calls nlp.add_pipe() with a string of a built-in component name.
2018-01-30 15:43:03 +01:00
Matthew Honnibal
512e6adb08
Merge pull request #1896 from thomasopsomer/fix-sent
...
Fix sentence boundaries serialization (issue #1834 )
2018-01-28 21:18:51 +01:00
Matthew Honnibal
f5b1ad4100
Limit parser model size, to hopefully reduce memory during CI tests
2018-01-28 21:00:32 +01:00
Thomas Opsomer
45d62561f7
add test for the issue
2018-01-28 19:49:56 +01:00
Kit
52ef51f36e
Add test for issue #1889
2018-01-25 22:56:48 +01:00
Matthew Honnibal
6a8cb905aa
Merge pull request #1876 from GregDubbin/master
...
Pattern matcher fixes
2018-01-24 16:38:11 +01:00
Matthew Honnibal
edb71a280e
Add test for #1883 : Unpickling Matcher
2018-01-24 15:42:33 +01:00
Matthew Honnibal
42a18ef903
Add test for #1868 : Vocab.__contains__ with ints
2018-01-23 23:27:05 +01:00
greg
85ab99e692
Correct test examples
2018-01-23 15:00:14 -05:00
Matthew Honnibal
91e916cb67
Add comment to new test
2018-01-23 19:11:53 +01:00
Matthew Honnibal
fd187d71ad
Add test for #1727
2018-01-23 19:11:01 +01:00
Matthew Honnibal
7e6dc283db
Fix unicode import in test
2018-01-22 23:55:44 +01:00
greg
686735b94e
Fix matcher import
2018-01-22 16:53:05 -05:00
Matthew Honnibal
4ce7d24fd5
Add test for #1799 : Set left and right edges (and thus sentences) in non-projective parses.
2018-01-22 20:18:38 +01:00
greg
7072b395c9
Add greedy matcher tests
2018-01-16 15:46:13 -05:00
Matthew Honnibal
ccb51a9f36
Make .similarity() return 1.0 if all orth attrs match
2018-01-15 16:29:48 +01:00
Matthew Honnibal
82135d85b7
Fix test
2018-01-15 15:55:15 +01:00
Matthew Honnibal
4b09616b58
Add test for #1757 : Comparison against None
2018-01-15 15:55:01 +01:00
Matthew Honnibal
9e413449f6
Fix unicode error in new test
2018-01-15 15:39:00 +01:00
Matthew Honnibal
6b215d2dd3
Add test for Issue #1537
2018-01-15 15:20:56 +01:00
ines
5babb7d6f6
Merge branch 'master' of https://github.com/explosion/spaCy
2018-01-14 17:31:09 +01:00
ines
793890cb4d
Remove test for removed deprecation warning
2018-01-14 17:31:06 +01:00
Matthew Honnibal
1a1cca6052
Fix vectors.resize() on Py3. Closes #1539
2018-01-14 14:48:51 +01:00
Matthew Honnibal
0153220304
Make set_vector add word to vocab. Fixes #1807
2018-01-14 13:57:57 +01:00
Ines Montani
55754f0cee
Merge pull request #1836 from fucking-signup/master
...
Add tests for issue #1769
2018-01-13 00:23:35 +00:00
Kit
4ee97f20a0
Mark like_num tests as slow
2018-01-13 00:44:15 +01:00
Kit
855531537e
Rewrite tests for issue #1769
2018-01-12 23:49:51 +01:00
Kit
5b541cb5ec
Simplify tests for issue #1769
2018-01-12 23:34:27 +01:00
Kit
7a2adc4633
Remove some tests to see build status changes
2018-01-12 22:49:16 +01:00
Kit
0e62809a43
Rewrite tests for issue #1769
2018-01-12 22:26:06 +01:00
Ines Montani
36f426fe0a
Merge pull request #1808 from fucking-signup/master
...
Fix issue #1769
2018-01-12 21:12:02 +00:00
Kit
76f4eeca44
Remove tests to see build changes on Windows (Python 2.7)
2018-01-12 20:30:51 +01:00
Kit
7ec0956e8d
Add regression test (issue #1769 )
2018-01-08 03:42:04 +01:00
Søren Lind Kristiansen
62de5da1ff
Remove unsused dummy variable
2018-01-05 09:57:24 +01:00
Søren Lind Kristiansen
10dab8eef8
Remove dummy variable from function calls
2018-01-05 09:37:05 +01:00
Kevin Humphreys
597df5bf83
add test
2018-01-03 13:00:05 -08:00
Ines Montani
ff9fc945ab
Merge pull request #1749 from sorenlind/da_ud_tokenization
...
Tune Danish tokenizer to more closely match Universal Dependencies
2017-12-22 16:00:49 +00:00
ines
26f313dabc
Fix missing import
2017-12-22 16:21:44 +01:00
ines
8dc1c27841
Merge branch 'master' of https://github.com/explosion/spaCy
2017-12-22 16:01:00 +01:00
ines
b10ba848b8
xfail test that causes MemoryError on Python 2 on Windows
...
Need to investigate this further!
2017-12-22 16:00:58 +01:00
Ines Montani
a3dd167d7f
Merge branch 'master' into da_ud_tokenization
2017-12-20 21:05:34 +00:00
Ines Montani
d682a8803e
Merge pull request #1672 from cbilgili/master
...
Adds Turkish Lemmatization
2017-12-20 21:01:00 +00:00
Søren Lind Kristiansen
15d13efafd
Tune Danish tokenizer to more closely match tokenization in Universal Dependencies.
2017-12-20 17:36:52 +01:00
Ines Montani
9c1ee65268
Add regression test for #1698
2017-12-12 10:36:11 +01:00
Isaac Sijaranamual
38021fbb00
Switch from python 3 only TemporaryDirectory to pytest's tmpdir
2017-12-11 00:16:04 +01:00
Isaac Sijaranamual
568130ce7c
Adds regression test_issue1622
2017-12-10 23:00:48 +01:00
Matthew Honnibal
36b47e3fa6
Fix (and test) vector pickling
2017-12-07 09:53:30 +01:00
Canbey Bilgili
abe098b255
Adds Turkish Lemmatization
2017-12-01 17:04:32 +03:00
Vadim Mazaev
4ba7ddf651
Bugfixies
2017-11-30 12:29:38 +03:00
Matthew Honnibal
6bc0f4d29f
Merge pull request #1611 from fsonntag/master
...
Solving #1494
2017-11-29 23:11:23 +01:00
Matthew Honnibal
f9ed9ea529
Merge pull request #1624 from GreenRiverRUS/russian
...
Add support for Russian
2017-11-29 23:10:01 +01:00
ines
a31506e060
Fix off-by-one error in nlp.add_pipe(after=name) ( fixes #1654 )
2017-11-28 20:37:55 +01:00
ines
b62739fbfe
Add regression test for #1654
2017-11-28 20:27:54 +01:00
ines
2e50dbb9d7
Simplify test
2017-11-28 20:27:27 +01:00
Felix Sonntag
724ae7dc55
Fixed issue of infix capturing prefixes
2017-11-28 17:17:12 +01:00
Søren Lind Kristiansen
0ffd27b0f6
Add several Danish alternative spellings
2017-11-27 13:35:41 +01:00
Vadim Mazaev
53e7c38637
Fixed tests depends on pymorphy2
2017-11-26 21:04:44 +03:00
Vadim Mazaev
cacd859dcd
Added tag map, fixed tests fails, added more exceptions
2017-11-26 20:54:48 +03:00
Ines Montani
a7bb8f1b42
Merge pull request #1637 from sorenlind/da_tokenization
...
Improve Danish tokenization
2017-11-26 15:41:38 +00:00
ines
c699aec089
Add offsets_from_biluo_tags helper and tests (see #1626 )
2017-11-26 16:38:01 +01:00
Søren Lind Kristiansen
6aa241bcec
Add day of month tokenizer exceptions for Danish.
2017-11-24 15:03:24 +01:00
Søren Lind Kristiansen
0c276ed020
Add weekday abbreviations and remove abiguous month abbreviations for Danish.
2017-11-24 14:43:29 +01:00
Søren Lind Kristiansen
056547e989
Add multiple tokenizer exceptions for Danish.
2017-11-24 11:51:26 +01:00
Søren Lind Kristiansen
8dc265ac0c
Add test for tokenization of 'i.' for Danish.
2017-11-24 11:29:37 +01:00
Matthew Honnibal
30ba81f881
Merge pull request #1576 from ligser/master
...
Actually reset caches in pipe [wip]
2017-11-23 12:54:48 +01:00
ines
c90fe92e15
Fix displaCy test
2017-11-22 05:04:39 +01:00
ines
a6f33ac27d
Fix displaCy test
2017-11-22 04:19:28 +01:00
Vadim Mazaev
81314f8659
Fixed tokenizer: added char classes; added first lemmatizer and
...
tokenizer tests
2017-11-21 22:23:59 +03:00
Burton DeWilde
635792997c
Add regression test for #1612
2017-11-20 12:05:35 -06:00
ines
d70a64d78b
Fix syntax error and formatting in test (see #1617 )
2017-11-20 14:01:25 +01:00
ines
17849dee4b
Fix French test (see #1617 )
2017-11-20 13:59:59 +01:00
Felix Sonntag
8be3392302
Added regression text for 1494
2017-11-19 16:30:35 +01:00
Motoki Wu
b818afaa0e
Added failing test for Issue #1207 .
...
The noun chunk iterator should work for `Doc` but not for `Span`.
2017-11-17 17:04:27 -08:00
ines
a3d4dd1a5d
Test adding of lots of pipeline components (see #1585 )
...
Just to make sure that there's no error now or in the future with adding a large number of pipeline components.
2017-11-15 17:28:06 +01:00
Roman Domrachev
505c6a2f2f
Completely cleanup tokenizer cache
...
Tokenizer cache can have be different keys than string
That modification can slow down tokenizer and need to be measured
2017-11-15 17:55:48 +03:00
Roman Domrachev
3e21680814
Use safer method to get string without hit
2017-11-14 22:58:46 +03:00
Roman Domrachev
4e378dc4a4
Remove all obsolete code and test only initial problem
2017-11-14 20:45:04 +03:00
Roman
47ce2347b0
Create test that fails when actual cleanup caused
2017-11-14 20:28:13 +03:00
Roman Domrachev
3d247d2bb8
Get back previous testcase
2017-11-14 18:01:37 +03:00
Roman Domrachev
a2745b0e84
StringStore now actually cleaned
...
Do not lose docs in ref tracking
2017-11-14 17:45:50 +03:00
Roman Domrachev
ee60a52ee7
Fix test imports and last batch cleanup
2017-11-11 11:32:16 +03:00
Roman Domrachev
3c600adf23
Try to fix StringStore clean up (see #1506 )
2017-11-11 03:11:27 +03:00
ines
ee97fd3cb4
Add regression test for #1547
2017-11-11 00:14:03 +01:00
ines
2df27db671
Add unicode declaration
2017-11-11 00:13:56 +01:00
ines
1c218397f6
Ensure path in Doc.to_disk/from_disk (resolves ##1521)
...
Also add Doc serialization tests with both Path and string path options
2017-11-09 02:29:03 +01:00
Matthew Honnibal
a5ea0fdf5a
Fix #1518 : vocab.vectors.resize() didn't work
2017-11-08 22:18:37 +01:00
Matthew Honnibal
4194bc5744
Xfail flakey serialization test
2017-11-08 13:55:13 +01:00
ines
42a0fbf291
Fix textcat simple train example
2017-11-07 01:25:54 +01:00
ines
5f43953536
Move test
2017-11-06 23:14:10 +01:00
Matthew Honnibal
1831dbd065
Add test of simple textcat workflow
2017-11-06 22:04:29 +01:00
Matthew Honnibal
2f7e9f390d
Make test less flakey
2017-11-06 17:34:50 +01:00
Matthew Honnibal
407b08017e
Make test less flakey
2017-11-06 17:31:40 +01:00
Matthew Honnibal
102f797933
Fix lemma ordering in test
2017-11-06 17:02:17 +01:00
Matthew Honnibal
63c6ae4191
Fix lemmatizer test
2017-11-06 11:57:06 +01:00
Matthew Honnibal
00435d8f0c
Add extra beam parsing test
2017-11-05 14:39:57 +01:00
ines
5e7d98f72a
Remove test for #1491
2017-11-03 22:10:57 +01:00
ines
718f1c50fb
Add regression test for #1491
2017-11-03 21:11:20 +01:00
Matthew Honnibal
144a93c2a5
Back-off to tensor for similarity if no vectors
2017-11-03 20:56:33 +01:00
Matthew Honnibal
d6e831bf89
Fix lemmatizer tests
2017-11-03 19:46:34 +01:00
ines
eef930c73e
Assert instead of print
2017-11-03 18:50:57 +01:00
ines
f0986df94b
Add test for #1488 (passes on v2.0.0a18?)
2017-11-03 14:44:36 +01:00
Matthew Honnibal
711278b667
Make test less flakey
2017-11-03 14:36:08 +01:00
Matthew Honnibal
0a534ae96a
Fix test for backprop d_pad
2017-11-03 14:04:16 +01:00
Matthew Honnibal
a22f96c3f1
Add test for backpropagating padding
2017-11-03 00:48:54 +01:00
ines
3af281a334
Update test model name
2017-11-01 23:02:00 +01:00
ines
8c2260e18c
Move span tests to /doc
2017-11-01 16:56:35 +01:00
ines
260cb37224
Catch deprecation warning
2017-11-01 16:49:18 +01:00
ines
5914faafbb
Fix .merge tests to not use deprecated API
2017-11-01 16:49:11 +01:00
Matthew Honnibal
9e0ebee81c
Add Token.is_sent_start property, so can deprecate Token.sent_start
2017-11-01 13:27:14 +01:00
Matthew Honnibal
c047498f87
Fix vectors test
2017-11-01 13:24:47 +01:00
Matthew Honnibal
86eba61fae
Fix token.vector when vectors are missing
2017-11-01 00:47:35 +01:00
Ines Montani
d11659463b
Merge pull request #1152 from jimregan/develop-irish
...
[WIP] attempt a port from #1147
2017-11-01 00:23:43 +01:00
Jim O'Regan
08b0bfd153
merge
2017-10-31 22:55:59 +00:00
Jim O'Regan
00ecfa5417
Ó, not O
2017-10-31 22:54:42 +00:00
Ines Montani
25b1d6cd91
Fix syntax error
2017-10-31 22:36:03 +01:00
Matthew Honnibal
92dc127569
Fix test for Python 3
2017-10-31 22:21:55 +01:00
Jim O'Regan
fe4b10346a
replace example sentence until I get around to adding a punctuation.py
2017-10-31 20:24:53 +00:00
Matthew Honnibal
77d8f5de9a
Revise and simplify Vectors class
2017-10-31 18:25:08 +01:00
Jim O'Regan
d4a8160c36
change quotes
2017-10-31 15:15:44 +00:00
Jim O'Regan
34ca59691b
no idea what is wrong here
2017-10-31 14:50:13 +00:00
Jim O'Regan
41dd29e48e
merge
2017-10-31 14:07:45 +00:00
Matthew Honnibal
cb5217012f
Fix vector remapping
2017-10-31 11:40:46 +01:00
Matthew Honnibal
9c11ee4a1c
WIP on vectors fixes
2017-10-31 11:22:56 +01:00
Matthew Honnibal
368fdb389a
WIP on refactoring and fixing vectors
2017-10-31 02:00:26 +01:00
Explosion Bot
72aea8f105
Update vectors.add() to allow setting keys to rows
2017-10-30 10:03:08 +01:00
Matthew Honnibal
64e4ff7c4b
Merge 'tidy-up' changes into branch. Resolve conflicts
2017-10-28 13:16:06 +02:00
Ines Montani
4033e70c71
Merge pull request #1461 from explosion/feature/disable-pipes
...
💫 Add Language.disable_pipes(), to temporarily edit pipeline and update code examples
2017-10-27 12:21:40 +02:00
Matthew Honnibal
b0f3ea2200
Fix names of pipeline components
...
NeuralDependencyParser --> DependencyParser
NeuralEntityRecognizer --> EntityRecognizer
TokenVectorEncoder --> Tensorizer
NeuralLabeller --> MultitaskObjective
2017-10-26 12:38:23 +02:00
ines
de1e5f35d5
Merge branch 'develop' into feature/disable-pipes
2017-10-25 16:33:12 +02:00
ines
c0b55ebdac
Fix PhraseMatcher.__contains__ and add more tests
2017-10-25 16:31:11 +02:00
ines
657a4d91bc
Merge branch 'develop' into feature/disable-pipes
2017-10-25 15:19:05 +02:00
ines
1a722dac31
Merge branch 'develop' into feature/disable-pipes
2017-10-25 15:18:18 +02:00
Matthew Honnibal
b5de768852
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-25 14:44:16 +02:00
Matthew Honnibal
094512fd47
Fix model-mark on regression test.
2017-10-25 14:44:00 +02:00
Matthew Honnibal
e70f80f29e
Add Language.disable_pipes()
2017-10-25 13:46:41 +02:00
Ines Montani
d3bf488e16
Merge pull request #1171 from mollerhoj/support-danish
...
Improve basic support for Danish
2017-10-24 20:29:57 +02:00
Matthew Honnibal
908809d488
Update tests
2017-10-24 17:05:15 +02:00
Matthew Honnibal
30e67fa808
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-24 16:08:23 +02:00
Matthew Honnibal
63f0bde749
Add test for #1250 : Tokenizer cache clobbered special-case attrs
2017-10-24 16:07:18 +02:00
ines
090aed940a
Add test for currently failing span.as_doc case
2017-10-24 16:00:56 +02:00
ines
4ef81a9ebc
Fix whitespace
2017-10-24 16:00:56 +02:00
Matthew Honnibal
4bea65a1a8
Fix Issue #1450 : Off-by-1 in * and ? matches
...
Patterns that end in variable-length operators e.g. * and ? now end on
the correct token. Previously, they were off by 1: the next token was
pulled into the match, even if that's where the pattern failed.
2017-10-24 14:26:27 +02:00
Matthew Honnibal
391d5ef0d1
Normalize imports in regression test
2017-10-24 14:25:49 +02:00
Matthew Honnibal
b66b8f028b
Fix #1375 -- out-of-bounds on token.nbor()
2017-10-24 12:10:39 +02:00
Matthew Honnibal
a68d89a4f3
Add failing test for bug #1375 -- no out-of-bounds error for token.nbor()
2017-10-24 12:05:25 +02:00
Ines Montani
facf77e541
Merge branch 'develop' into support-danish
2017-10-24 11:53:19 +02:00
Matthew Honnibal
ccd2ab1a62
Merge pull request #1443 from ramananbalakrishnan/develop-get-lca-matrix
...
Add LCA matrix for spans and docs
2017-10-24 11:22:46 +02:00
Matthew Honnibal
ef3e5a361b
Merge pull request #1442 from explosion/feature/fix-sp
...
💫 Fix SP tag, tweak Vectors.__init__, fix Morphology
2017-10-24 10:24:07 +02:00
Matthew Honnibal
fdf25d10ba
Merge pull request #1440 from ramananbalakrishnan/develop
...
Support single value for attribute list in doc.to_array
2017-10-24 10:23:12 +02:00
Matthew Honnibal
490ad3eaf0
Check that empty strings are handled. Closes #1242
2017-10-21 00:52:14 +02:00
Ramanan Balakrishnan
d2fe56a577
Add LCA matrix for spans and docs
2017-10-20 23:58:00 +05:30
Matthew Honnibal
d8391b1c4d
Fix #1434 : Matcher failed on ending ? if no token
2017-10-20 16:49:36 +02:00
Matthew Honnibal
f111b228e0
Fix re-parsing of previously parsed text
...
If a Doc object had been previously parsed, it was possible for
invalid parses to be added. There were two problems:
1) The parse was only being partially erased
2) The RightArc action was able to create a 1-cycle.
This patch fixes both errors, and avoids resetting the parse if one is
present. In theory this might allow a better parse to be predicted by
running the parser twice.
Closes #1253 .
2017-10-20 16:27:36 +02:00
Matthew Honnibal
ebecaddb76
Make 'data_or_width' two keyword args in Vectors.__init__
...
Previously the data and width options were one argument in Vectors,
which meant you couldn't say vectors = Vectors(strings, width=300).
It's better to have two keywords.
2017-10-20 14:17:15 +02:00
Ramanan Balakrishnan
b3ab124fc5
Support strings for attribute list in doc.to_array
2017-10-20 11:46:57 +05:30
ines
bf415fd778
Add test for serializing extension attrs (see #1085 )
2017-10-19 00:53:08 +02:00
Matthew Honnibal
fe844148f6
Test pickling hooks
2017-10-17 19:43:52 +02:00
Matthew Honnibal
374819edf8
Test user_data deserialization, re #1085
2017-10-17 19:28:54 +02:00
Matthew Honnibal
8ca97f32a3
Fix doc pickling test
2017-10-17 18:19:57 +02:00
Matthew Honnibal
45d1dd90b1
Add tests for pickling doc
2017-10-17 17:20:58 +02:00
Matthew Honnibal
4174477161
Fix equality check in test
2017-10-16 19:50:35 +02:00
Matthew Honnibal
010a7309ff
Merge pull request #1402 from explosion/feature/fix-matcher-operators
...
💫 Fix Matcher variable-length operators
2017-10-16 17:53:19 +02:00
Matthew Honnibal
c29927d2e7
Fix matcher test
2017-10-16 17:22:18 +02:00
Matthew Honnibal
a928ae2f35
Merge branch 'develop' into feature/fix-matcher-operators
2017-10-16 13:38:36 +02:00
Matthew Honnibal
748d525801
Add more matcher operator tests
2017-10-16 13:38:01 +02:00
ines
3516aa0cea
Port over changes from #1389
2017-10-14 13:32:55 +02:00
ines
cd6a29dce7
Port over changes from #1294
2017-10-14 13:28:46 +02:00
ines
38c756fd85
Port over changes from #1287
2017-10-14 13:16:21 +02:00
ines
612224c10d
Port over changes from #1157
2017-10-14 13:11:39 +02:00
ines
9b3f8f9ec3
Fix formatting and add comment on languages
2017-10-14 13:11:18 +02:00
ines
a4d974d97b
Port over URL pattern changes from #1411
2017-10-14 12:58:07 +02:00
Matthew Honnibal
cf6da9301a
Update lemmatizer test
2017-10-12 22:50:52 +02:00
Matthew Honnibal
462caf835a
Fix SBD test
2017-10-12 21:18:22 +02:00
Ines Montani
37aa523a8e
Merge pull request #1408 from explosion/feature/dot-underscore
...
💫 Custom attributes via Doc._, Token._ and Span._
2017-10-11 18:35:56 +02:00
ines
51519251c2
Fix underscore method test
2017-10-11 13:34:19 +02:00
ines
c6ae49e8bf
Fix formatting
2017-10-11 13:34:11 +02:00
ines
453c47ca24
Add German lemmatizer tests
2017-10-11 13:27:26 +02:00
ines
15fe0fd82d
Fix tests
2017-10-11 13:27:18 +02:00
ines
e0ff145a8b
Merge branch 'develop' into feature/dot-underscore
2017-10-11 11:57:05 +02:00
Matthew Honnibal
fd47f8e89f
Fix failing test
2017-10-11 08:38:34 +02:00
Matthew Honnibal
462b2e26b4
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-11 08:23:04 +02:00
Matthew Honnibal
2c118ab3a6
Add tests for Doc creation
2017-10-11 03:21:23 +02:00
Matthew Honnibal
d84136b4a9
Update add label test
2017-10-10 22:57:41 +02:00
Matthew Honnibal
e0a9b02b67
Merge Span._ and Span.as_doc methods
2017-10-09 22:00:15 -05:00
Matthew Honnibal
09d61ada5e
Merge pull request #1396 from explosion/feature/pipeline-management
...
💫 Improve pipeline and factory management
2017-10-10 04:29:54 +02:00
Matthew Honnibal
f0f2739ae3
Add test for serialization issue raised in #1105
2017-10-10 03:57:58 +02:00
ines
de374dc72a
Merge branch 'feature/pipeline-management' into feature/dot-underscore
2017-10-09 14:37:51 +02:00
Matthew Honnibal
2534cd57d7
Add bandaid solution to the 'shadowing' problem in #864
2017-10-09 08:59:35 +02:00
Matthew Honnibal
d8a2506023
Merge pull request #1401 from explosion/feature/add-parser-action
...
💫 Allow labels to be added to pre-trained parser and NER modes
2017-10-09 04:57:51 +02:00
Matthew Honnibal
689349e32f
Merge pull request #1400 from explosion/feature/sentence-parsing
...
💫 Force parser to respect preset sentence boundaries
2017-10-09 04:31:43 +02:00
Matthew Honnibal
fad2b8315f
Merge branch 'develop' into feature/add-parser-action
2017-10-09 04:13:04 +02:00
Matthew Honnibal
6c79841c0d
Fix tests for history features
2017-10-09 04:12:24 +02:00
Matthew Honnibal
dde87e6b0d
Add tests for adding parser actions
2017-10-09 03:42:35 +02:00
Matthew Honnibal
81a64119db
Fix string-to-unicode problem
2017-10-09 00:59:49 +02:00
Matthew Honnibal
02c2af7119
Fix test
2017-10-09 00:29:37 +02:00
Matthew Honnibal
5a67efeccc
Add tests for sentence segmentation presetting
2017-10-09 00:02:23 +02:00
Matthew Honnibal
9bd8191739
Add tests for Underscore
2017-10-07 18:56:19 +02:00
Matthew Honnibal
3b67eabfea
Allow empty dictionaries to match any token in Matcher
...
Often patterns need to match "any token". A clean way to denote this
is with the empty dict {}: this sets no constraints on the token,
so should always match.
The problem was that having attributes length==0 was used as an
end-of-array signal, so the matcher didn't handle this case correctly.
This patch compiles empty token spec dicts into a constraint
NULL_ATTR==0. The NULL_ATTR attribute, 0, is always set to 0 on the
lexeme -- so this always matches.
2017-10-07 03:36:15 +02:00
ines
0adadcb3f0
Fix beam parse model test
2017-10-07 02:15:15 +02:00
ines
b38a8f4a94
Fix and update pipe methods tests
2017-10-07 02:06:23 +02:00
Matthew Honnibal
3a65a0c970
Start adding tests for new pipeline management
2017-10-07 01:48:23 +02:00
ines
61a503a611
Fix parser test
2017-10-07 00:38:51 +02:00
Matthew Honnibal
c6cd81f192
Wrap try/except around model saving
2017-10-05 08:14:24 -05:00
Matthew Honnibal
fd4baff475
Update tests
2017-10-05 08:12:27 -05:00
Matthew Honnibal
40edb65ee7
Make test work for Python 2.7
2017-10-04 16:36:50 +02:00
Matthew Honnibal
db05d4d582
Add test for #1380 . Passes without fix?
2017-10-04 14:56:31 +02:00
Matthew Honnibal
4a59f6358c
Fix thinc imports
2017-10-03 19:21:26 +02:00
Ines Montani
959c46eabe
Merge pull request #1365 from wannaphongcom/develop
...
Add Thai language for spaCy v2
2017-09-26 23:43:05 +02:00
Wannaphong Phatthiyaphaibun
7b5263ffa4
fix thai test
2017-09-26 23:54:15 +07:00
Matthew Honnibal
41cc5c4c17
Merge branch 'develop' into feature/phrasematcher
2017-09-26 09:59:17 -05:00
Wannaphong Phatthiyaphaibun
5cba67146c
add thai in spacy2
2017-09-26 21:36:27 +07:00
Matthew Honnibal
74f08e1ad5
Update test
2017-09-26 06:45:56 -05:00
Matthew Honnibal
20193371f5
Don't share CNN, to reduce complexities
2017-09-21 14:59:48 +02:00
Matthew Honnibal
cc408fc189
Make PhraseMatcher API like Matcher API
2017-09-20 22:20:35 +02:00
Matthew Honnibal
43ad250dd5
Update matcher tests
2017-09-20 21:54:49 +02:00
Matthew Honnibal
c013e5996f
Fix parser test
2017-09-17 13:13:20 -05:00
ines
ece30c28a8
Don't split hyphenated words in German
...
This way, the tokenizer matches the tokenization in German treebanks
2017-09-16 20:40:15 +02:00
Matthew Honnibal
ebf8942564
Fix test for Python3
2017-09-16 16:22:38 +02:00
Matthew Honnibal
8c945310fb
Excuse emoji failure on narrow unicode builds
2017-09-16 16:21:13 +02:00
Matthew Honnibal
3fa5b40b5c
Add test for hash consistency
2017-09-16 11:21:35 +02:00
Jim O'Regan
7de709483b
missed adding here
2017-09-11 10:51:21 +01:00
Jim O'Regan
b1b6123867
add ga_tokenizer
2017-09-11 10:31:41 +01:00
Jim O'Regan
187be6d372
copy/paste error
2017-09-11 09:33:17 +01:00
Jim O'Regan
c283e9edfe
first stab at test
2017-09-11 08:57:48 +01:00
Matthew Honnibal
456bb8a74c
Unxfail and close #1305
2017-09-06 19:14:17 +02:00
Matthew Honnibal
99e44fbdbb
Update regression test
2017-09-06 19:13:51 +02:00
Matthew Honnibal
497a9308a8
Xfail new lemmatizer test
2017-09-06 18:41:22 +02:00
Matthew Honnibal
5384fff5ce
Add test for 1305: Incorrect lemmatization of VBZ for English
2017-09-06 18:40:18 +02:00
Matthew Honnibal
d5fbf27335
Fix test
2017-09-04 16:45:11 +02:00
Matthew Honnibal
cb4839033c
Fix loader for EN tests
2017-09-04 15:19:18 +02:00
Matthew Honnibal
644d6c9e1a
Improve lemmatization tests, re #1296
2017-09-04 15:17:44 +02:00
Jim Geovedi
fbc62a09c7
added {pre,suf,in}fix tests
2017-08-20 13:43:00 +07:00
Jim Geovedi
713d7c0aa0
added indonesian lang test
2017-08-20 12:17:14 +07:00
Jim Geovedi
fa544e6c9a
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-08-20 11:49:40 +07:00
Matthew Honnibal
41c2218c53
Fix test for vectors
2017-08-19 22:09:12 +02:00
Matthew Honnibal
ef87562741
Restore vectors test utils
2017-08-19 20:35:16 +02:00
Matthew Honnibal
1391f9da37
Restore vectors tests
2017-08-19 20:34:58 +02:00
Matthew Honnibal
d55d6e1cfa
Fix comparison of Token from different docs. Closes #1257
2017-08-19 16:39:32 +02:00
Matthew Honnibal
4fda02c7e6
Add test for new Span.to_array method
2017-08-19 16:24:38 +02:00
Matthew Honnibal
c606b4a42c
Add test for Doc.char_span
2017-08-19 16:18:23 +02:00
Matthew Honnibal
42d47c1e5c
Fix tagger serialization
2017-08-19 04:16:32 +02:00
Matthew Honnibal
2da96a0ec7
Fix beam test
2017-08-19 04:15:46 +02:00
Matthew Honnibal
a7309a217d
Update tagger serialization
2017-08-18 23:12:05 +02:00
Matthew Honnibal
de7e8703e3
Restore tests for beam parser
2017-08-18 22:27:42 +02:00
Matthew Honnibal
52c180ecf5
Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
...
This reverts commit ea8de11ad5
, reversing
changes made to 08e443e083
.
2017-08-14 13:00:23 +02:00
Matthew Honnibal
92ebab6073
Update beam-update tests
2017-08-13 08:56:02 +02:00
Matthew Honnibal
24b45b45c6
Add test for beam update
2017-08-12 17:15:28 -05:00
Matthew Honnibal
b353e4d843
Work on parser beam training
2017-08-12 14:47:45 -05:00
Jim Geovedi
cc4772cac2
reworks
2017-08-03 13:08:38 +07:00
Jim Geovedi
783f7d8b86
added test set for Indonesian language
2017-07-29 18:21:07 +07:00
Matthew Honnibal
d6a5c2c85a
Add test for NER
2017-07-22 01:48:58 +02:00
Matthew Honnibal
28244df4da
Add test for beam parsing
2017-07-22 01:48:35 +02:00
Matthew Honnibal
2424493970
Remove unnecessary import of Mock
2017-07-22 01:13:54 +02:00
Matthew Honnibal
289f23df51
Test beam parsing
2017-07-20 15:03:10 +02:00
Matthew Honnibal
f014138c11
Fix parser tests
2017-07-20 00:16:52 +02:00
mollerhoj
e840077601
Add some basic tests for Danish
2017-07-03 15:49:51 +02:00
ines
34a2eecb17
Add simple "naughty strings" test (see #1107 )
2017-06-06 17:43:51 +02:00
ines
cc9c5dc7a3
Fix noun chunks test
2017-06-05 16:39:04 +02:00
Matthew Honnibal
b4cdd05466
Add vectors.pyx in setup
2017-06-05 12:45:29 +02:00
Matthew Honnibal
30369d580f
Start testing Vectors class
2017-06-05 12:32:49 +02:00
ines
51d7414e94
Make sure sents are a list
2017-06-05 12:30:13 +02:00
ines
a0f4592f0a
Update tests
2017-06-05 02:26:13 +02:00
ines
3e105bcd36
Update tests
2017-06-05 02:09:27 +02:00
ines
078232932c
Fix tokenizer fixture scope
2017-06-05 01:06:34 +02:00
Matthew Honnibal
58be0e1f6f
Update tests
2017-06-04 16:35:06 -05:00
Matthew Honnibal
bb98d45a63
Fix tests
2017-06-04 16:00:44 -05:00
Matthew Honnibal
55d0621532
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 15:53:25 -05:00
Matthew Honnibal
5b9f116aca
Update tests
2017-06-04 15:53:17 -05:00
ines
8a29308d0b
Remove unused imports
2017-06-04 22:39:29 +02:00
Ines Montani
112c5787eb
Merge pull request #1101 from oroszgy/hu_tokenizer_fix
...
More robust Hungarian tokenizer.
2017-06-04 22:37:51 +02:00
ines
96867a24ae
Fix typo
2017-06-04 22:36:40 +02:00
ines
f432bb4b48
Fix fixture scopes
2017-06-04 22:34:31 +02:00
ines
a66cf24ee8
xfail tokenizer serialization tests for now
...
Tests pass locally, but not on Travis – needs more investigation
2017-06-04 13:58:20 +02:00
ines
e47eef5e03
Update German tokenizer exceptions and tests
2017-06-03 21:07:44 +02:00
ines
d77c2cc8bb
Add tests for English norm exceptions
2017-06-03 20:59:50 +02:00
ines
3152ee5ca2
Update serialization tests for tokenizer
2017-06-03 17:05:28 +02:00
ines
1ebd0d3f27
Add assert_packed_msg_equal util function
2017-06-03 17:04:30 +02:00
ines
de974f7bef
Add serializer tests for tokenizer
2017-06-03 13:26:34 +02:00
ines
d21459f87d
Update serializer tests
2017-06-02 21:42:26 +02:00
ines
d86e7cde93
Add entity recognizer to parser serialization tests
2017-06-02 18:40:06 +02:00
ines
0051c05964
Add tests for serializing parser
2017-06-02 18:37:19 +02:00
ines
cef547a9f0
Add serialization tests for tensorizer
2017-06-02 18:18:30 +02:00
ines
f74a45c1fe
Remove unnecessary argument
2017-06-02 18:17:46 +02:00
ines
43b4d63f85
Add serialization tests for tagger
2017-06-02 17:29:34 +02:00
ines
acd65c00f6
Add serialization tests for StringStore and Vocab
2017-06-02 10:57:42 +02:00
ines
9692c98f57
Add test utils for temp file and temp dir
2017-06-02 10:56:09 +02:00
Matthew Honnibal
4c97371051
Fixes for thinc 6.7
2017-06-01 04:22:16 -05:00
Gyorgy Orosz
f0c3b09242
More robust Hungarian tokenizer.
2017-05-31 22:28:40 +02:00
ines
5e1c361270
Update tests README with info on model tests
2017-05-31 12:22:58 +02:00
Ines Montani
e6cf3c7e1c
Merge pull request #1093 from oroszgy/hu_emoji_fix
...
Fixed emoji handling for Hungarian
2017-05-31 11:33:24 +02:00
Matthew Honnibal
6937e311a4
Update doc tests
2017-05-30 23:34:23 +02:00
Gyorgy Orosz
8c0b4b850e
Fixed emoji handling for Hungarian
2017-05-30 21:34:46 +02:00
Matthew Honnibal
b127645afc
Fix test_misc merge conflict
2017-05-29 18:31:44 -05:00
Matthew Honnibal
e0e8eae7c7
Tweak package test
2017-05-29 18:30:42 -05:00
ines
20a7003c0d
Update model fixtures and reorganise tests
2017-05-29 22:14:31 +02:00
ines
795fe43a4d
Add load_test_model function with importorskip()
...
Loads model only if it can be imported, i.e. if it's installed as a
package.
2017-05-29 22:11:31 +02:00
ines
6e3937efc5
Check for arguments of model markers to specify models to test
...
Lets user set --models --en for only English models
2017-05-29 22:10:16 +02:00
Matthew Honnibal
f4aafca222
Merge changes to test_misc
2017-05-29 12:26:02 +02:00
Matthew Honnibal
ff26aa6c37
Work on to/from bytes/disk serialization methods
2017-05-29 11:45:45 +02:00
ines
df920ba0e7
Add tests for displaCy and util functions and fix util typo
2017-05-29 10:51:19 +02:00
ines
c5714d4fb2
xfail matcher test for now until setting norm via Span.merge works
2017-05-29 10:51:02 +02:00
Matthew Honnibal
c91b121aeb
Move serialization functions to util
2017-05-29 10:13:42 +02:00
Matthew Honnibal
1fa2bfb600
Add model_to_bytes and model_from_bytes helpers. Probably belong in thinc.
2017-05-29 09:27:04 +02:00
Matthew Honnibal
6dad4117ad
Work on serialization for models
2017-05-29 01:37:57 +02:00
ines
7b1ddcc04d
Add test for vocab serialization
2017-05-29 01:09:52 +02:00
ines
00b2094dc3
Fix typos, long integers and tests
2017-05-29 01:09:52 +02:00
ines
804dbb8d25
Add StringStore test for API docs
2017-05-29 01:09:52 +02:00
Matthew Honnibal
92dbf28c1e
Hack a fixture in the vectors tests, for xfail
2017-05-28 20:28:32 +02:00
Matthew Honnibal
fe11564b8e
Finish stringstore change. Also xfail vectors tests
2017-05-28 15:10:22 +02:00
Matthew Honnibal
b007a2b0d3
Update stringstore tests
2017-05-28 14:08:09 +02:00
Matthew Honnibal
84e66ca6d4
WIP on stringstore change. 27 failures
2017-05-28 14:06:40 +02:00
Matthew Honnibal
fe4a746300
Accomodate symbols in new string scheme
2017-05-28 13:03:16 +02:00
Matthew Honnibal
a5606c3eda
Work on changing StringStore to return hashes.
2017-05-28 12:36:27 +02:00
ines
a8e58e04ef
Add symbols class to punctuation rules to handle emoji (see #1088 )
...
Currently doesn't work for Hungarian, because of conflicts with the
custom punctuation rules. Also doesn't take multi-character emoji like
👩🏽💻 into account.
2017-05-27 17:57:10 +02:00
Matthew Honnibal
4917cbb484
Include sent_start test
2017-05-23 18:40:37 +02:00
ines
fb0ff0272f
xfail neural parser tests for now and remove test for deprecated method
2017-05-23 12:40:37 +02:00
Matthew Honnibal
5418bcf5d7
Resolve conflict on test
2017-05-23 04:37:16 -05:00
ines
e6acd3bbf2
Fix matcher tests and matcher docs
2017-05-23 11:36:02 +02:00
ines
d0c6d4f76d
Fix formatting
2017-05-23 11:32:00 +02:00
Matthew Honnibal
3959d778ac
Revert "Revert "WIP on improving parser efficiency""
...
This reverts commit 532afef4a8
.
2017-05-23 03:06:53 -05:00
Matthew Honnibal
532afef4a8
Revert "WIP on improving parser efficiency"
...
This reverts commit bdaac7ab44
.
2017-05-23 03:05:25 -05:00
Matthew Honnibal
bdaac7ab44
WIP on improving parser efficiency
2017-05-23 02:59:31 -05:00
ines
b3c7ee0148
Fix tests and use the new Matcher API
2017-05-22 13:54:20 +02:00
Matthew Honnibal
187f370734
Update tests for matcher changes
2017-05-22 12:59:50 +02:00
Matthew Honnibal
7e2cdc0c81
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 12:39:34 +02:00
Matthew Honnibal
2f78413a02
PseudoProjectivity->nonproj
2017-05-22 05:39:03 -05:00
Matthew Honnibal
d8bb5bb959
Implement StringStore serialization, and update tests
2017-05-22 12:38:00 +02:00
Matthew Honnibal
5db89053aa
Merge docstrings
2017-05-21 13:46:23 -05:00
Matthew Honnibal
836fe1d880
Update neural net tests
2017-05-19 18:11:29 -05:00
ines
a804045597
Use is_ancestor instead of deprecated is_ancestor_of
2017-05-19 20:23:40 +02:00
Matthew Honnibal
793430aa7a
Get spaCy train command working with neural network
...
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Matthew Honnibal
c9a5d5d24b
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-16 16:22:05 +02:00
Matthew Honnibal
8cf097ca88
Redesign training to integrate NN components
...
* Obsolete .parser, .entity etc names in favour of .pipeline
* Components no longer create models on initialization
* Models created by loading method (from_disk(), from_bytes() etc), or
.begin_training()
* Add .predict(), .set_annotations() methods in components
* Pass state through pipeline, to allow components to share information
more flexibly.
2017-05-16 16:17:30 +02:00
Matthew Honnibal
221b4c1ee8
Fix test for Python 3
2017-05-16 13:06:30 +02:00
Matthew Honnibal
1d7c18e58a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-15 21:53:47 +02:00
Matthew Honnibal
a9edb3aa1d
Improve integration of NN parser, to support unified training API
2017-05-15 21:53:27 +02:00
ines
b462076d80
Merge load_lang_class and get_lang_class
2017-05-14 01:31:10 +02:00
ines
5858857a78
Update languages list in conftest
2017-05-13 15:37:54 +02:00
ines
8c2a0c026d
Fix parse_tree test
2017-05-13 12:32:45 +02:00
Matthew Honnibal
ee1d35bdb0
Fix merge conflict
2017-05-13 03:20:19 +02:00
Matthew Honnibal
b2540d2379
Merge Kengz's tree_print patch
2017-05-13 03:18:49 +02:00
Matthew Honnibal
7253b4e649
Remove old serialization tests
2017-05-09 18:12:58 +02:00
Matthew Honnibal
f9327343ce
Start updating serializer test
2017-05-09 18:12:03 +02:00
ines
2c3bdd09b1
Add English test for like_num
2017-05-09 11:06:34 +02:00
ines
22375eafb0
Fix and merge attrs and lex_attrs tests
2017-05-09 11:06:25 +02:00
ines
c714841cc8
Move language-specific tests to tests/lang
2017-05-09 00:02:37 +02:00
ines
bd57b611cc
Update conftest to lazy load languages
2017-05-09 00:02:21 +02:00
ines
3c0f85de8e
Remove imports in /lang/__init__.py
2017-05-08 23:58:07 +02:00
ines
be5541bd16
Fix import and tokenizer exceptions
2017-05-08 16:20:14 +02:00
ines
2324788970
Remove bad tests
2017-05-08 16:15:27 +02:00
Gregory Howard
c0afcd22bb
Merge remote-tracking branch 'remotes/upstream/master'
2017-04-27 14:42:54 +02:00
Gregory Howard
8ff4682255
correcting tokenizer exception.
...
Adding tests for lemmatization
2017-04-27 11:52:14 +02:00
Ines Montani
7da9cefd25
Merge pull request #1022 from luvogels/master
...
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Gregory Howard
44cb486849
Adding unitest for tokenization in french (with title)
2017-04-27 10:59:38 +02:00
luvogels
d12a0b6431
Hooked up tokenizer tests
2017-04-26 23:21:41 +02:00
luvogels
8de59ce3b9
Added tokenizer tests
2017-04-26 19:10:18 +02:00
Matthew Honnibal
4d98511db7
Make Span hashable. Closes #1019
2017-04-26 19:01:05 +02:00
Matthew Honnibal
24c4c51f13
Try to make test999 less flakey
2017-04-26 18:42:06 +02:00
Gregory Howard
ed5f094451
Adding insensitive lemmatisation test
2017-04-25 18:07:02 +02:00
ghoward
26e31afc18
renamming tests
2017-04-25 17:46:01 +02:00
ghoward
c085c2d391
Adding some unitests
2017-04-25 17:44:16 +02:00
Matthew Honnibal
c4be9c36fe
Fix unicode header in tests
2017-04-24 10:09:01 +02:00
Matthew Honnibal
65f10b53e5
Fix test
2017-04-24 00:25:55 +02:00
Matthew Honnibal
70a43858e1
Fix flakey test
2017-04-24 00:06:30 +02:00
Matthew Honnibal
3973af2d15
Make training test less flakey
2017-04-23 22:59:34 +02:00
ines
42305bc519
Remove unnecessary test
2017-04-23 21:21:41 +02:00
ines
012ea594d1
Add file for misc tests
2017-04-23 21:06:51 +02:00
ines
83f66947dc
Rename test_download to test_cli
2017-04-23 21:06:50 +02:00
Matthew Honnibal
874a3cbb07
Add test for Issue #955
2017-04-23 17:57:01 +02:00
Matthew Honnibal
5d8af40445
Add test for Issue #999
2017-04-23 17:06:30 +02:00
Matthew Honnibal
040751ad17
Remove xfail on Test #910
2017-04-23 16:28:55 +02:00
Ben Eyal
e90e8a3f10
Enable test
2017-04-20 02:25:24 +03:00
ines
2bd89e7ade
Tidy up Hebrew tests and test for punctuation (see #995 )
2017-04-19 19:28:03 +02:00
ines
13d30b6c01
xfail lemmatizer test that's causing problems (see #546 )
2017-04-16 21:18:39 +02:00
ines
0084466a66
Remove unused utf8open util and replace os.path with ensure_path
2017-04-16 20:37:45 +02:00
Matthew Honnibal
1dca7eeb03
Add unicode declaration on new regression test
2017-04-07 18:09:23 +02:00
ines
887827fc6a
Merge branch 'develop'
2017-04-07 17:36:23 +02:00
ines
444dd511c5
Fix xpassing URL test case
2017-04-07 17:36:05 +02:00
ines
bf0f15e762
Add / to tokenizer infixes ( resolves #891 )
2017-04-07 17:30:44 +02:00
ines
00b9011a49
Fix whitespace
2017-04-07 17:29:59 +02:00
Matthew Honnibal
0513c43bf0
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-07 17:07:10 +02:00
Matthew Honnibal
cc36c308f4
Fix noun_chunk rules around coordination
...
Closes #693 .
2017-04-07 17:06:40 +02:00
Matthew Honnibal
ab846256cf
Merge pull request #966 from recognai/master
...
Prepare Spanish language for training models, including configuration, rich-UD tag map and tests
2017-04-07 16:12:29 +02:00
Matthew Honnibal
83dca920d4
Rename test #913 -> #957 , comment
...
Make test for #957 reference correct bug. Add comment.
Previous commit closes #957 .
2017-04-07 15:54:25 +02:00
Matthew Honnibal
5887383fc0
Add test for Issue #913 : Hang from bad regex
2017-04-07 15:47:27 +02:00
oeg
c693d40791
feature(model): Add support for creating the Spanish model, including rich tagset, configuration, and basich tests
2017-04-06 18:48:45 +02:00
Matthew Honnibal
cfff4e0f61
Improve test
2017-03-31 13:59:32 +02:00
Matthew Honnibal
e854f28304
Add test for Issue #758
...
Issue #758 occurs when no actions are available for a single token
doc after merging.
2017-03-31 13:26:25 +02:00
Matthew Honnibal
0fefdfcbda
Merge pull request #935 from ericzhao28/master
...
Add option to use label=ent_type in doc.merge arguments (Bug fix for issue #862 )
2017-03-30 02:51:24 +02:00
Eric Zhao
aafdf6ffb8
Add option to use label karg to determine ent_type in doc.merge
2017-03-28 23:35:03 -07:00
Matthew Honnibal
b94286de30
Fix regression test
2017-03-25 22:35:07 +01:00
Matthew Honnibal
4f400fa486
Prevent lemmatization of base nouns
...
Update lemmatizer's base-form check, for change in morphology class.
Closes #903 .
2017-03-25 21:51:12 +01:00
Matthew Honnibal
4454c1b23f
Block lemmatization of base-form adjectives
...
Fixes check that an adjective is a base form (as opposed to a
comparative or superlative), so that it's not lemmatized.
e.g. inner -!> inn. Closes #912 .
2017-03-25 21:29:57 +01:00
Ines Montani
97cb4d5e3c
Merge branch 'master' into master
2017-03-25 10:03:47 +01:00
Iddo Berger
da135bd823
add hebrew tokenizer
2017-03-24 18:27:44 +03:00
Matthew Honnibal
f40fbc3710
Add test for Issue #910 : Resuming entity training
2017-03-23 23:38:57 +01:00
ines
f830213c4c
Remove compatibility check test
...
Will only cause problems when incrementing version and not updating
table. Also depends on external URL, which is bad.
2017-03-20 13:20:26 +01:00
Ines Montani
b6ee241e26
Fix print statements
2017-03-20 11:46:37 +01:00
ines
fe0ff00fe1
Fix spacing
2017-03-19 11:55:37 +01:00
ines
5712da6095
Add regression test for #891
2017-03-19 11:48:01 +01:00
ines
aefb898e37
Add title-case version of morph rules ( resolves #686 )
2017-03-18 17:27:11 +01:00
ines
64ec17abc1
Pass xpassing tests and add xfails for failures
2017-03-18 17:20:46 +01:00
ines
d0b85faf69
Pass regression test for #401 ( resolves #401 )
...
Fixed in new English models.
2017-03-18 17:06:49 +01:00
ines
be9daefbdd
Remove actual model downloading from tests
2017-03-18 17:01:10 +01:00
Matthew Honnibal
de0e6385b4
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-18 16:17:28 +01:00
Matthew Honnibal
fe442cac53
Fix #717 : Set correct lemma for contracted verbs
2017-03-18 16:16:10 +01:00
ines
ad934a9abd
Add regression test for #693
2017-03-18 16:12:30 +01:00
ines
f57c616830
Add regression test for #704 and test new model ( resolves #704 )
...
(using new English model)
2017-03-18 16:04:14 +01:00
Matthew Honnibal
413138de79
Fix #719 : Lemmatizer can no longer output empty string
2017-03-18 16:02:06 +01:00
ines
ab1451f997
Don't mark compatibility test as slow
2017-03-18 15:17:39 +01:00
ines
ec3e810662
Add directory cli and set up command line interface
2017-03-18 15:14:48 +01:00
Matthew Honnibal
6420f86f02
Merge changes to __init__.py
2017-03-17 19:51:45 +01:00
ines
0e533ad0cc
Mark compatibility table test as slow (temporary)
...
Prevent Travis from running test test until models repo is published
2017-03-17 13:11:36 +01:00
Matthew Honnibal
a630726b13
Fix typo in tests
2017-03-16 20:50:36 -05:00
Matthew Honnibal
f98b30583f
Fix tests
2017-03-16 19:48:00 -05:00
Matthew Honnibal
db51abf685
Fix tests
2017-03-16 18:53:47 -05:00
Matthew Honnibal
fea9fe08af
Merge pull request #866 from juanmirocks/master
...
Fix lemmatization of OOV words
2017-03-16 23:37:36 +01:00
Matthew Honnibal
28bb546939
Merge pull request #883 from ericzhao28/master
...
Add `lower_` and `upper_` properties to `Span` class
2017-03-16 23:35:47 +01:00
Matthew Honnibal
8843b84bd1
Merge remote-tracking branch 'origin/develop-downloads'
2017-03-16 12:00:42 -05:00
ines
4cfc8ffbd2
Reformat pickle tests
2017-03-15 17:39:54 +01:00
ines
2a0fcf1354
Add tests for new download module
2017-03-15 17:39:43 +01:00
Matthew Honnibal
4cab8ac136
Update morph exceptions test
2017-03-15 09:31:34 -05:00
ines
42ba740dde
Revert "Merge branch 'debug'"
...
This reverts commit 89b79d1178
, reversing
changes made to 02bdf490a1
.
2017-03-13 20:11:52 +01:00
ines
4c5f51e49e
Update regression test
2017-03-13 15:16:11 +01:00
ines
02bdf490a1
Remove regression test to see if it caused pytest Travis error
2017-03-13 13:00:22 +01:00
ines
17018750ac
Add regression test for #717
2017-03-13 12:58:22 +01:00
ines
2883ebfca2
Remove print statement
2017-03-13 12:30:42 +01:00
ines
98c13d8aa9
Add regression test for #401
2017-03-13 12:28:41 +01:00
ines
444d665f9d
Add regression test for #686
2017-03-13 12:23:35 +01:00
ines
46b17e5b51
Add regression test for #719
2017-03-13 12:17:35 +01:00
ines
c8ae682ff9
Add regression test for #636
2017-03-13 12:08:31 +01:00
ines
337f9601f2
Add missing unicode declaration
2017-03-13 12:08:19 +01:00
ines
d70386ec6e
Update docstring in #886 regression test
2017-03-13 12:00:38 +01:00
ines
51ba3ef0a8
Add regression test for #886
2017-03-13 11:44:58 +01:00
ines
1da29a7146
Use new Lemmatizer data and remove file import
...
Since there's currently only an English lemmatizer, the global
Lemmatizer imports from spacy.en. This is unideal and still needs to be
fixed.
2017-03-12 13:58:22 +01:00
ines
c89e30d1a3
Add test for English time exceptions ("1a.m." etc.)
2017-03-12 13:58:22 +01:00
ines
66c1f194f9
Use consistent unicode declarations
2017-03-12 13:07:28 +01:00
Em
9c809efc25
Removed mapStr
2017-03-11 16:23:26 -08:00
Matthew Honnibal
ea2592879f
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-11 11:13:37 -06:00
Em
426d17167f
Added string manipulation for spans
2017-03-10 16:50:02 -08:00
ines
10e29189ac
Adjust URL testcases and xfail problems (instead of comment)
2017-03-10 14:22:50 +01:00
Matthew Honnibal
ea53647362
Merge branch 'develop'
2017-03-10 02:49:39 -06:00
Dan Rapp
123d3f2d38
Fix error in test case parameterization
2017-03-09 12:18:21 -07:00
Dan Rapp
b9307dfcd7
Merge branch 'master' into rappdw/tokenizer_exceptions_url_fix
2017-03-09 11:42:14 -07:00
Dan Rapp
3b1df3808d
Issue #840 - URL pattenr too broad
2017-03-09 11:39:39 -07:00
Matthew Honnibal
5b0b968d13
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-03-08 15:03:10 +01:00
Matthew Honnibal
0ac3d27689
Fix handling of trailing whitespace
...
Fix off-by-one error that meant trailing spaces were being dropped.
Closes #792
2017-03-08 15:01:40 +01:00
ines
c2e3e651b8
Re-add regression test for #859
2017-03-08 14:36:09 +01:00
Matthew Honnibal
16670d3251
Xfail the vocab pickling for now
2017-03-07 21:43:28 +01:00
Matthew Honnibal
a89c3500f6
Fixes to hacky vocab pickling
2017-03-07 20:58:55 +01:00
Matthew Honnibal
3edb8ae207
Whitespace
2017-03-07 17:16:26 +01:00
Matthew Honnibal
5de7e712b7
Add support for pickling StringStore.
2017-03-07 17:15:18 +01:00
Matthew Honnibal
4e75e74247
Update regression test for variable-length pattern problem in the matcher.
2017-03-07 16:08:32 +01:00
Matthew Honnibal
6d67213b80
Add test for 850: Matcher fails on zero-or-more.
2017-03-07 15:55:28 +01:00
Aniruddha Adhikary
696215a3fb
add tests for Bengali
2017-03-05 11:25:12 +06:00
ines
8dff040032
Revert "Add regression test for #859 "
...
This reverts commit c4f16c66d1
.
2017-03-01 21:56:20 +01:00
Juan Miguel Cejuela
a8cfde46d3
#781 Fix test — colocalizes is lemmatized to colocaliz and colicalize
2017-03-01 21:43:08 +01:00
Juan Miguel Cejuela
a471114eb2
#781 add regression test, failing previous bug fix
2017-03-01 21:30:51 +01:00
ines
c4f16c66d1
Add regression test for #859
2017-03-01 16:07:27 +01:00
Matthew Honnibal
34bcc8706d
Merge branch 'french-tokenizer-exceptions'
2017-02-27 11:21:21 +01:00
Matthew Honnibal
0aaa546435
Fix test after updating the French tokenizer stuff
2017-02-27 11:20:47 +01:00
ines
376c5813a7
Remove print statements from test
2017-02-24 18:26:32 +01:00
ines
7c1260e98c
Add regression test
2017-02-24 18:22:49 +01:00
ines
51eb190ef4
Remove print statements from test
2017-02-24 17:41:12 +01:00
Matthew Honnibal
db5ada3995
Merge branch 'master' of https://github.com/explosion/spaCy
2017-02-24 14:28:12 +01:00
Matthew Honnibal
8f94897d07
Add 1 operator to matcher, and make sure open patterns are closed at end of document. Closes Issue #766
2017-02-24 14:27:02 +01:00
ines
67991b6e5f
Add more test cases to #775 regression test to cover #847
2017-02-18 14:10:44 +01:00
ines
44de3c7642
Reformat test and use text_file fixture
2017-02-16 23:49:19 +01:00
ines
3dd22e9c88
Mark vectors test as xfail (temporary)
2017-02-16 23:28:51 +01:00
ines
85d249d451
Revert "Revert "Merge pull request #836 from raphael0202/load_vectors ( closes #834 )""
...
This reverts commit ea05f78660
.
2017-02-16 23:26:25 +01:00
ines
ea05f78660
Revert "Merge pull request #836 from raphael0202/load_vectors ( closes #834 )"
...
This reverts commit 7d8c9eee7f
, reversing
changes made to f6b69babcc
.
2017-02-16 15:27:12 +01:00
Raphaël Bournhonesque
06a71d22df
Fix test failure by using unicode literals
2017-02-16 14:48:00 +01:00
Raphaël Bournhonesque
3ba109622c
Add regression test with non ' ' space character as token
2017-02-16 12:23:27 +01:00
ines
21f09d10d7
Revert "Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions""
...
This reverts commit f02a2f9322
.
2017-02-10 13:17:05 +01:00
ines
f02a2f9322
Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions"
...
This reverts commit b95afdf39c
, reversing
changes made to b0ccf32378
.
2017-02-09 17:07:21 +01:00
Raphaël Bournhonesque
309da78bf0
Merge branch 'master' into tokenizer_exceptions
2017-02-09 16:32:12 +01:00
Raphaël Bournhonesque
4ce0bbc6b6
Update unit tests
2017-02-09 16:30:43 +01:00
ines
654fe447b1
Add Swedish tokenizer tests (see #807 )
2017-02-05 11:47:07 +01:00
Michael Wallin
35100c8bdd
[issue 805] Add regression test and the required fixture
2017-02-04 16:21:34 +02:00
Michael Wallin
1a1952afa5
[finnish] Add initial tests for tokenizer
2017-02-04 13:54:10 +02:00
Ines Montani
afc6365388
Update regression test for #801 to match current expected behaviour
2017-02-02 16:23:05 +01:00
Ines Montani
13a4ab37e0
Add regression test for #801
2017-02-02 15:33:52 +01:00
Raphaël Bournhonesque
85f951ca99
Add tokenizer exceptions for French
2017-02-02 08:36:16 +01:00
Ines Montani
e4875834fe
Fix formatting
2017-01-31 15:19:33 +01:00
Ines Montani
c304834e45
Add missing import
2017-01-31 15:18:30 +01:00
Ines Montani
e6465b9ca3
Parametrize test cases and mark as xfail
2017-01-31 15:14:42 +01:00
latkins
e4c84321a5
Added regression test for Issue #792 .
2017-01-31 13:47:42 +00:00
Ines Montani
19501f3340
Add regression test for #775
2017-01-25 13:16:52 +01:00
Raphaël Bournhonesque
1be9c0e724
Add fr tokenization unit tests
2017-01-24 10:57:37 +01:00
Ines Montani
0967eb07be
Add regression test for #768
2017-01-23 21:25:46 +01:00
Ines Montani
5f6f48e734
Add regression test for #759
2017-01-20 15:11:48 +01:00
Ines Montani
d704cfa60d
Fix typo
2017-01-16 21:30:33 +01:00
Matthew Honnibal
2c60d0cb1e
Test #743 : Tokens unhashable.
2017-01-16 13:27:26 +01:00
Ines Montani
50878ef598
Exclude "were" and "Were" from tokenizer exceptions and add regression test ( resolves #744 )
2017-01-16 13:10:38 +01:00
Ines Montani
e053c7693b
Fix formatting
2017-01-16 13:09:52 +01:00
Ines Montani
116c675c3c
Merge pull request #742 from oroszgy/hu_tokenizer_fix
...
Improved Hungarian tokenizer
2017-01-14 23:52:44 +01:00
Gyorgy Orosz
92345b6a41
Further numeric test.
2017-01-14 22:44:19 +01:00
Gyorgy Orosz
b4df202bfa
Better error handling
2017-01-14 22:24:58 +01:00
Gyorgy Orosz
b03a46792c
Better error handling
2017-01-14 22:09:29 +01:00
Ines Montani
332ce2d758
Update README.md
2017-01-14 21:12:11 +01:00
Gyorgy Orosz
9505c6a72b
Passing all old tests.
2017-01-14 20:39:21 +01:00
Gyorgy Orosz
63037e79af
Fixed hyphen handling in the Hungarian tokenizer.
2017-01-14 16:30:11 +01:00
Gyorgy Orosz
f77c0284d6
Maintaining compatibility with other spacy tokenizers.
2017-01-14 16:19:15 +01:00
Gyorgy Orosz
1be5da1ac6
Fixed Hungarian tokenizer for numbers
2017-01-14 15:51:59 +01:00
Ines Montani
a89e269a5a
Fix test formatting and consistency
2017-01-14 13:41:19 +01:00
Ines Montani
3424e3a7e5
Update README.md
2017-01-13 15:54:54 +01:00
Ines Montani
49186b34a1
Mark lemmatizer tests as models since they use installed data
2017-01-13 15:12:07 +01:00
Ines Montani
138deb80a1
Modernise vector tests, use add_vecs_to_vocab and don't depend on models
2017-01-13 15:12:07 +01:00
Ines Montani
96f0caa28a
Fix test name for consistency
2017-01-13 15:12:07 +01:00
Ines Montani
dc2bb1259f
Add util function to add vectors to vocab
2017-01-13 15:12:07 +01:00
Ines Montani
db9b25663d
Reformat add_docs_equal and add docstring
2017-01-13 15:12:07 +01:00
Ines Montani
62ce0a0073
Add README.md to tests to explain organisation and conventions
2017-01-13 15:11:18 +01:00
Ines Montani
38d60f6b90
Modernise serializer I/O tests and don't depend on models where possible
2017-01-13 02:24:56 +01:00
Ines Montani
4bb5b89ee4
Add text_file_b fixture using BytesIO
2017-01-13 02:23:50 +01:00
Ines Montani
49febd8c62
Modernise noun chunks tests and don't depend on models
2017-01-13 02:01:00 +01:00
Ines Montani
3ee97b5686
Rename test_parser to test_noun_chunks
2017-01-13 01:36:33 +01:00
Ines Montani
a308703f47
Remove old tests
2017-01-13 01:34:48 +01:00
Ines Montani
12eb8edf26
Move parser tests from unit to parser
2017-01-13 01:34:38 +01:00
Ines Montani
138c53ff2e
Merge tokenizer tests
2017-01-13 01:34:14 +01:00
Ines Montani
01f36ca3ff
Move attrs tests from unit to root and modernise
2017-01-13 01:33:50 +01:00
Ines Montani
3610d27967
Move alignment tests from munge to gold and modernise
2017-01-13 01:33:31 +01:00
Ines Montani
094ff7396a
Reformat and rename Pragmatic Segmenter tests and mark xfails
2017-01-13 01:30:20 +01:00
Ines Montani
affcf1b19d
Modernise lemmatizer tests
2017-01-12 23:41:17 +01:00
Ines Montani
33d9cf87f9
Modernise tagger tests and fix xpassing test
2017-01-12 23:40:52 +01:00
Ines Montani
33e5f8dc2e
Create basic and extended test set for URLs
2017-01-12 23:40:02 +01:00
Ines Montani
5e4f5ebfc8
Modernise BILUO tests
2017-01-12 23:39:18 +01:00
Ines Montani
09acfbca01
Add Lemmatizer fixture
2017-01-12 23:38:55 +01:00
Ines Montani
514bfa2597
Add path fixture for spaCy data path
2017-01-12 23:38:47 +01:00
Ines Montani
e9e99a5670
Add regression test for #740
2017-01-12 22:57:38 +01:00
Ines Montani
6935d55409
Fix formatting
2017-01-12 22:56:20 +01:00
Ines Montani
5f0d196a31
Modernise and merge matcher tests
2017-01-12 22:23:11 +01:00
Ines Montani
d5d774413a
Update comments on EN and DE fixtures
2017-01-12 22:03:07 +01:00
Ines Montani
9b4bea1df9
Tidy up and rename regression tests and remove unnecessary imports
2017-01-12 22:00:37 +01:00
Ines Montani
5e1b6178e3
Fix formatting and consistency
2017-01-12 22:00:06 +01:00
Ines Montani
a3fd32455e
Remove redundant language loading integration tests
2017-01-12 21:59:48 +01:00
Ines Montani
61f1ca09c2
Modernise serializer codecs tests
2017-01-12 21:58:55 +01:00
Ines Montani
5dbc6e59f6
Modernise Huffman tests
2017-01-12 21:58:40 +01:00
Ines Montani
edeeeccea5
Modernise packer tests and don't depend on models where possible
2017-01-12 21:58:07 +01:00
Ines Montani
d084676cd0
Modernise and merge serialization tests
2017-01-12 21:57:19 +01:00
Ines Montani
442237787c
Add assert_docs_equal util to compare two docs
2017-01-12 21:56:52 +01:00
Ines Montani
eac3f700fb
Add fixture for entity recognizer
2017-01-12 21:56:32 +01:00
Ines Montani
b438cfddbc
Modernise matcher tests and split into two files
2017-01-12 17:51:46 +01:00
Ines Montani
27482ebed8
Move matcher tests for #188 and #242 to regression tests
...
Modernise tests and remove unnecessary imports
2017-01-12 17:33:57 +01:00
Ines Montani
0a4dc632bd
Update test to not create redundant Doc object
2017-01-12 17:33:18 +01:00
Ines Montani
a2526e66d8
Fix formatting, naming and unicode declaration
2017-01-12 16:51:13 +01:00
Ines Montani
052cdff07d
Modernise vector similarity tests
2017-01-12 16:51:13 +01:00
Ines Montani
bd20ec0a6a
Add get_cosine util function
2017-01-12 16:51:13 +01:00
Ines Montani
51ef75f629
Fix regression test for #615 and remove unnecessary imports
2017-01-12 16:51:12 +01:00
Ines Montani
aeb747e10c
Adjust formatting
2017-01-12 16:51:12 +01:00
Ines Montani
8e3e58a7e6
Modernise and merge lexeme vocab tests
2017-01-12 16:51:12 +01:00
Ines Montani
c3d4516fc2
Move test for #361 to regression tests
2017-01-12 16:51:12 +01:00
Ines Montani
7cb3d74426
Modernise span tests and don't depend on models
2017-01-12 15:30:49 +01:00
Ines Montani
92e3d8b3ee
Modernise vocab API tests and remove old xfailing tests
2017-01-12 15:27:46 +01:00
Ines Montani
7ea87684cd
Rename test_vocab.py to test_vocab_api.py
2017-01-12 15:12:21 +01:00
Ines Montani
0da2ee5c68
Merge flag features tests into orth tests in tests root
2017-01-12 15:12:00 +01:00
Ines Montani
03c136cfd3
Remove StringStore tests from vocab tests
2017-01-12 15:11:15 +01:00
Ines Montani
d7bd57abdf
Modernise add vectors vocab test
2017-01-12 15:09:49 +01:00
Ines Montani
89525ef345
Use consistent test names
2017-01-12 15:09:21 +01:00
Ines Montani
f8803808ce
Remove old unused tests and conftest files
2017-01-12 15:09:05 +01:00
Ines Montani
4d0bfebcd9
Move Pragmatic Segmenter test cases (currently unused) to parser tests
2017-01-12 15:08:02 +01:00
Ines Montani
26d018d874
Add tests for StringStore
2017-01-12 15:07:31 +01:00
Ines Montani
9b6784bab5
Add fixture for StringStore
2017-01-12 15:05:40 +01:00
Ines Montani
99d66d613a
Modernise tests for merging spans and don't depend on models
2017-01-12 12:26:26 +01:00
Ines Montani
fa8f67596d
Remove unused old test
2017-01-12 12:26:08 +01:00
Ines Montani
359f73a96b
Move test for #54 to regression tests
2017-01-12 12:25:51 +01:00
Ines Montani
3f3a46722c
Remove unused conftest
2017-01-12 12:25:24 +01:00
Ines Montani
c2406e92bc
Allow setting ents in get_doc
2017-01-12 12:25:10 +01:00
Ines Montani
c5914c6fe5
Fix and pass regression test for #736
2017-01-12 11:48:56 +01:00
Ines Montani
a6790b6694
Rename tags to pos in get_doc and allow adding tags to tokens
2017-01-12 11:18:36 +01:00
Ines Montani
1add8ace67
Merge lemmatizer tests
2017-01-12 11:16:53 +01:00
Ines Montani
3bc082abdf
Modernise morph exceptions test and don't depend on models
2017-01-12 11:14:29 +01:00
Ines Montani
ec7739b76e
Add regression test for #736
2017-01-12 11:12:44 +01:00
Ines Montani
6c1c564891
Move language-specific tests out of redundant tokenizer directories
2017-01-12 02:17:18 +01:00
Ines Montani
8fecedac3a
Tidy up
2017-01-12 02:16:37 +01:00
Ines Montani
ae7edd30e7
Move text file back to tokenizer tests directory
2017-01-12 02:10:23 +01:00
Ines Montani
ffcaba9017
Remove old and/or redundant tests
2017-01-12 02:10:18 +01:00
Ines Montani
19c4132097
Modernise space attachment parser tests and don't depend on models
2017-01-12 01:54:44 +01:00
Ines Montani
69778924c8
Modernise and merge parser tests and don't depend on models
2017-01-12 01:07:29 +01:00
Ines Montani
178c147612
Modernise nonprojectivity tests and don't depend on models
2017-01-12 01:06:36 +01:00
Ines Montani
1a3984742c
Modernise sentence boundary detection tests and don't depend on models (where possible)
2017-01-11 23:53:08 +01:00
Ines Montani
0cdb6ea61d
Remove old unused pickle test
2017-01-11 23:52:28 +01:00
Ines Montani
c9671329dc
Move test for #309 to regression tests
2017-01-11 23:52:13 +01:00
Ines Montani
d0e37b5670
Modernise parser tests and don't depend on models
2017-01-11 21:30:27 +01:00
Ines Montani
342cb41782
Add apply_transition_sequence util function to utils
2017-01-11 21:30:14 +01:00
Ines Montani
09807addff
Add en_parser fixture
2017-01-11 21:29:59 +01:00
Ines Montani
55d151aa61
Modernise Doc parse tree navigation tests and don't depend on models
2017-01-11 21:14:15 +01:00
Ines Montani
7262421bb2
Use consistent test names
2017-01-11 19:00:52 +01:00
Ines Montani
33800c9367
Rename "tokens" tests to "doc"
2017-01-11 18:59:01 +01:00
Ines Montani
3a9c6a9563
Remove old unused files
2017-01-11 18:58:38 +01:00
Ines Montani
8e962de39f
Remove old word vector tests
2017-01-11 18:55:08 +01:00
Ines Montani
e027936920
Modernise Doc noun chunks tests
2017-01-11 18:54:56 +01:00
Ines Montani
439f396acd
Modernise Doc array tests and don't depend on models
2017-01-11 18:54:46 +01:00
Ines Montani
05447be884
Modernise test for adding entities
2017-01-11 18:54:24 +01:00
Ines Montani
6e883f4c00
Modernise Doc API tests and don't depend on models
2017-01-11 18:05:36 +01:00
Ines Montani
8bf3bb5c44
Make words optional for get_doc
2017-01-11 18:05:10 +01:00
Ines Montani
928db7e419
Fix StringIO import for Python 3
2017-01-11 14:07:48 +01:00
Ines Montani
69998f216b
Rename test_tokens_api.py to test_doc_api.py
2017-01-11 13:58:56 +01:00
Ines Montani
d94dea1b18
Merge token tests into token API tests
2017-01-11 13:57:02 +01:00
Ines Montani
eb23424ab0
Modernise token API tests and don't depend on loading models
2017-01-11 13:56:54 +01:00
Ines Montani
c682b8ca90
Merge conftests into one cohesive file
2017-01-11 13:56:32 +01:00
Ines Montani
909f24d7df
Add test utils and get_doc helper function
...
Create Doc object from given vocab, words and annotations to allow
tests not to depend on loading the models.
2017-01-11 13:55:33 +01:00
Ines Montani
3e6e1f0251
Tidy up regression tests
2017-01-10 19:24:10 +01:00
Ines Montani
869963c3c4
Mark extensive prefix/suffix tests as slow
2017-01-10 15:57:35 +01:00
Ines Montani
487e020ebe
Add simple test for surrounding brackets
2017-01-10 15:57:26 +01:00
Ines Montani
0ba5cf51d2
Assert length first
2017-01-10 15:57:00 +01:00
Ines Montani
2185d31907
Adjust names and formatting
2017-01-10 15:56:35 +01:00
Ines Montani
e10d4ca964
Remove semi-redundant URLs and punctuation for faster testing
2017-01-10 15:54:25 +01:00
Ines Montani
3a3cb2c90c
Add unicode declaration
2017-01-10 15:53:15 +01:00
Matthew Honnibal
64f747cb65
Token comparison test
2017-01-09 19:12:00 +01:00
Matthew Honnibal
18c3c2d05c
Add tests for token comparison, re Issue #631
2017-01-09 19:09:59 +01:00
Matthew Honnibal
42cd598f57
Use correct fixtures in URL tokenizer
2017-01-09 14:10:40 +01:00
Ines Montani
aa876884f0
Revert "Revert "Merge remote-tracking branch 'origin/master'""
...
This reverts commit fb9d3bb022
.
2017-01-09 13:28:13 +01:00
Ines Montani
d5c72c40eb
Remove old tests for old website example code
2017-01-08 22:28:53 +01:00
Ines Montani
5d28664fc5
Don't test Hungarian for numbers and hyphens for now
...
Reinvestigate behaviour of case affixes given reorganised tokenizer
patterns.
2017-01-08 20:45:40 +01:00
Ines Montani
abb09782f9
Move sun.txt to original location and fix path to not break parser tests
2017-01-08 20:32:54 +01:00