Otto Sulin
4ec3f19e2b
fixed stop words -> to-do lex_attrs.py
2018-03-23 22:18:17 +02:00
Matthew Honnibal
85717f570c
Merge branch 'master' of https://github.com/explosion/spaCy
2018-03-23 20:30:42 +01:00
Matthew Honnibal
8902754f0b
Fix vector loading for ud_train
2018-03-23 20:30:00 +01:00
Xiaoquan Kong
a71b99d7ff
bugfix for global-variable-change-in-runtime related issue ( #2135 )
...
* Bugfix: setting pollution from spacy/cli/ud_train.py to whole package
* Add contributor agreement of howl-anderson
2018-03-23 11:36:38 +01:00
Matthew Honnibal
044397e269
Support .gz and .tar.gz files in spacy init-model
2018-03-21 14:33:23 +01:00
Matthew Honnibal
49fbe2dfee
Use thinc.openblas in spacy.syntax.nn_parser
2018-03-20 02:22:09 +01:00
DuyguA
f708d7443b
added contractions to stopwords #2020
2018-03-19 14:06:39 +01:00
Matthew Honnibal
bede11b67c
Improve label management in parser and NER ( #2108 )
...
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.
Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important: it ensures that the label-to-class mapping is stable.
We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.
To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.
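The negative-"frequency" trick can be sketched in a few lines of plain Python. This is only an illustration, not the parser's actual code: the helper names `add_new_label` and `class_order` are hypothetical, the dictionary covers a single action rather than the nested per-action structure, and the descending sort direction is an assumption chosen so that counted labels come first and new labels follow in the order they were added.

```python
def add_new_label(freqs, label):
    """Record a label not seen in training with the next negative "frequency"."""
    if label in freqs:
        return
    # Labels added after training are the ones holding negative values.
    n_new = sum(1 for freq in freqs.values() if freq < 0)
    freqs[label] = -(n_new + 1)


def class_order(freqs):
    """Sort by (frequency, label), highest first (assumed direction)."""
    ranked = sorted(((freq, label) for label, freq in freqs.items()), reverse=True)
    return [label for freq, label in ranked]


# Hypothetical label counts for one action.
freqs = {"nsubj": 120, "dobj": 80}
add_new_label(freqs, "zeta")   # stored as -1
add_new_label(freqs, "alpha")  # stored as -2
order = class_order(freqs)

add_new_label(freqs, "beta")   # stored as -3
order_after = class_order(freqs)
```

Because each new label gets a strictly smaller value than the previous one, adding a label never changes the position of the labels already present, regardless of how their names compare alphabetically.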
Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training.
To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.
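As a rough sketch of counting while streaming, the snippet below builds label frequencies from a generator and applies the cut-off afterwards. The function names, the `min_freq` parameter and the toy data are assumptions made for illustration; the `'dep'` back-off mirrors the "Use 'dep' label" commit listed further down, but the exact rule used by the parser is not shown here.

```python
from collections import Counter


def count_labels(doc_stream):
    """Accumulate label frequencies without holding the documents in memory."""
    freqs = Counter()
    for labels in doc_stream:
        freqs.update(labels)
    return freqs


def labels_above_cutoff(freqs, min_freq):
    """Keep only labels seen at least `min_freq` times."""
    return sorted(label for label, freq in freqs.items() if freq >= min_freq)


def back_off(label, kept, default="dep"):
    """Map a label that fell below the cut-off to a generic label (assumed rule)."""
    return label if label in kept else default


def read_docs():
    # Stand-in for a streamed training corpus: one list of labels per document.
    yield ["nsubj", "dobj"]
    yield ["nsubj", "amod"]
    yield ["nsubj", "dobj"]


freqs = count_labels(read_docs())
kept = labels_above_cutoff(freqs, min_freq=2)
rare = back_off("amod", kept)
```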
Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.
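The shuffle-by-path idea looks roughly like this. To stay within the standard library the sketch writes JSON files instead of msgpack, and `write_docs` and `stream_shuffled` are hypothetical helpers rather than GoldCorpus methods.

```python
import json
import random
import tempfile
from pathlib import Path


def write_docs(docs, directory):
    """Write one file per document and return the list of paths."""
    paths = []
    for i, doc in enumerate(docs):
        path = Path(directory) / "{}.json".format(i)
        with path.open("w", encoding="utf8") as file_:
            json.dump(doc, file_)
        paths.append(path)
    return paths


def stream_shuffled(paths, seed=None):
    """Shuffle the (small) list of paths, then load documents one at a time."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    for path in paths:
        with path.open(encoding="utf8") as file_:
            yield json.load(file_)


docs = [{"id": i, "words": ["doc", str(i)]} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = write_docs(docs, tmp_dir)
    epoch = list(stream_shuffled(paths, seed=0))
```

Only the path list is held in memory while shuffling; each document is read from disk when the iterator reaches it.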
This is a squash merge, as I made a lot of very small commits. Individual commit messages below.
* Simplify label management for TransitionSystem and its subclasses
* Fix serialization for new label handling format in parser
* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir
* Set actions in transition system
* Require thinc 6.11.1.dev4
* Fix error in parser init
* Add unicode declaration
* Fix unicode declaration
* Update textcat test
* Try to get model training on less memory
* Print json loc for now
* Try rapidjson to reduce memory use
* Remove rapidjson requirement
* Try rapidjson for reduced mem usage
* Handle None heads when projectivising
* Stream json docs
* Fix train script
* Handle projectivity in GoldParse
* Fix projectivity handling
* Add minibatch_by_words util from ud_train
* Minibatch by number of words in spacy.cli.train
* Move minibatch_by_words util to spacy.util
* Fix label handling
* More hacking at label management in parser
* Fix encoding in msgpack serialization in GoldParse
* Adjust batch sizes in parser training
* Fix minibatch_by_words
* Add merge_subtokens function to pipeline.pyx
* Register merge_subtokens factory
* Restore use of msgpack tmp directory
* Use minibatch-by-words in train
* Handle retokenization in scorer
* Change back-off approach for missing labels. Use 'dep' label
* Update NER for new label management
* Set NER tags for over-segmented words
* Fix label alignment in gold
* Fix label back-off for infrequent labels
* Fix int type in labels dict key
* Fix int type in labels dict key
* Update feature definition for 8 feature set
* Update ud-train script for new label stuff
* Fix json streamer
* Print the line number if conll eval fails
* Update children and sentence boundaries after deprojectivisation
* Export set_children_from_heads from doc.pxd
* Render parses during UD training
* Remove print statement
* Require thinc 6.11.1.dev6. Try adding wheel as install_requires
* Set different dev version, to flush pip cache
* Update thinc version
* Update GoldCorpus docs
* Remove print statements
* Fix formatting and links [ci skip]
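
Several of the commits above batch training examples by number of words rather than by number of documents. A minimal sketch of that idea follows; `batch_by_words` is a hypothetical stand-in and not the `minibatch_by_words` util itself, whose signature and edge-case handling may differ.

```python
def batch_by_words(items, size, count_words=len):
    """Yield batches, closing each one once it holds at least `size` words."""
    batch = []
    n_words = 0
    for item in items:
        batch.append(item)
        n_words += count_words(item)
        if n_words >= size:
            yield batch
            batch = []
            n_words = 0
    if batch:
        # Emit whatever is left over at the end of the stream.
        yield batch


# Toy "documents" represented as lists of tokens.
docs = [["a"] * 4, ["b"] * 3, ["c"] * 5, ["d"] * 2]
batches = list(batch_by_words(docs, size=6))
batch_lengths = [len(batch) for batch in batches]
```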
2018-03-19 02:58:08 +01:00
Matthew Honnibal
ff42b726c1
Fix unicode declaration on test
2018-03-19 02:04:24 +01:00
Matthew Honnibal
7dc76c6ff6
Add test for textcat
2018-03-16 12:39:45 +01:00
Matthew Honnibal
3cdee79a0c
Add depth argument for text classifier
2018-03-16 12:37:31 +01:00
Matthew Honnibal
13067095a1
Disable broken add-after-train in textcat
2018-03-16 12:33:33 +01:00
Matthew Honnibal
565ef8c4d8
Improve argument passing in textcat
2018-03-16 12:30:51 +01:00
Matthew Honnibal
eb2a3c5971
Remove unused function
2018-03-16 12:30:33 +01:00
Matthew Honnibal
307d6bf6d3
Fix parser for Thinc 6.11
2018-03-16 10:59:31 +01:00
Matthew Honnibal
9a389c4490
Fix parser for Thinc 6.11
2018-03-16 10:38:13 +01:00
Matthew Honnibal
648532d647
Don't assume blas methods are present
2018-03-16 02:48:20 +01:00
Matthew Honnibal
e85dd038fe
Merge remote-tracking branch 'origin/master' into feature/single-thread
2018-03-16 02:41:11 +01:00
Matthew Honnibal
e3be3d65b3
Version as v2.0.10.dev0
2018-03-15 17:31:22 +01:00
ines
f3f8bfc367
Add built-in factories for merge_entities and merge_noun_chunks
...
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 17:16:54 +01:00
Ines Montani
0d17377e8b
Merge pull request #2095 from DuyguA/quick-typo-fix ( resolves #2063 )
...
Quick typo fix
2018-03-15 00:29:56 +01:00
ines
d854f69fe3
Add built-in factories for merge_entities and merge_noun_chunks
...
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 00:18:51 +01:00
ines
9ad5df41fe
Fix whitespace
2018-03-15 00:11:18 +01:00
Matthew Honnibal
d7ce6527fb
Use increasing batch sizes in ud-train
2018-03-14 20:15:28 +01:00
alldefector
f4e5904fc2
Fix Spanish noun_chunks failure caused by typo
2018-03-14 17:03:17 +01:00
Thomas Opsomer
fbf48b3f9f
lemma property to return hash instead of unicode
2018-03-14 17:03:00 +01:00
Matthew Honnibal
8cefc58abc
Fix Vectors pickling
2018-03-14 16:59:37 +01:00
DuyguA
be4f6da16b
maybe not a good idea to remove also
2018-03-14 14:47:24 +01:00
DuyguA
1a513f71e3
removed also from lookup
2018-03-14 11:57:15 +01:00
DuyguA
cca66abf1e
quick typo fix
2018-03-14 11:34:22 +01:00
Matthew Honnibal
7b755414eb
Update call into thinc
2018-03-13 13:59:59 +01:00
Matthew Honnibal
e101f10ef0
Fix header
2018-03-13 02:12:16 +01:00
Matthew Honnibal
952c87409e
Use openblas.sgemm in parser
2018-03-13 02:12:01 +01:00
Matthew Honnibal
d55620041b
Switch parser to gemm from thinc.openblas
2018-03-13 02:10:58 +01:00
Matthew Honnibal
c2f4759257
Fix test for Python 2
2018-03-12 23:03:05 +01:00
Matthew Honnibal
9aeec9c242
Increment dev version
2018-03-11 01:58:21 +01:00
Matthew Honnibal
f49d71fa7c
Merge branch 'master' of https://github.com/explosion/spaCy
2018-03-11 01:27:17 +01:00
Matthew Honnibal
5dddb30e5b
Fix ud-train script
2018-03-11 01:26:45 +01:00
Matthew Honnibal
e42960bd14
Merge pull request #2012 from alldefector/patch-1
...
Fix Spanish noun_chunks failure caused by typo
2018-03-11 01:05:19 +01:00
Matthew Honnibal
2cab4d6517
Remove use of attr module in ud_train
2018-03-11 00:59:39 +01:00
Matthew Honnibal
fa9fd21620
Increment dev version
2018-03-11 00:41:54 +01:00
Matthew Honnibal
53b3249e06
Add tests for arc eager oracle
2018-03-10 23:42:56 +01:00
Matthew Honnibal
754ea1b2f7
Link in spaCy CoNLL commands
2018-03-10 23:42:15 +01:00
Matthew Honnibal
3478ea76d1
Add ud_train and ud_evaluate CLI commands
2018-03-10 23:41:55 +01:00
Matthew Honnibal
4b72c38556
Fix dropout bug in beam parser
2018-03-10 23:16:40 +01:00
Matthew Honnibal
9cc202d670
Fix Vectors pickling
2018-03-10 22:53:42 +01:00
Matthew Honnibal
3d6487c734
Support dropout in beam parse
2018-03-10 22:41:55 +01:00
Matthew Honnibal
31b156d60b
Fix itershuffle
2018-03-10 22:32:59 +01:00
Matthew Honnibal
b59765ca9f
Stream gold during spacy train
2018-03-10 22:32:45 +01:00
Matthew Honnibal
c3d168509a
Stream the gold data during training, to reduce memory
2018-03-10 22:32:32 +01:00
DuyguA
cba63196f9
fixed typo
2018-03-09 10:54:18 +01:00
DuyguA
7a780476af
added more abbreviations
2018-03-09 10:13:00 +01:00
DuyguA
cca87756d7
added Sti
2018-03-08 18:07:52 +01:00
DuyguA
3c994311c5
added abbrevs
2018-03-08 18:03:27 +01:00
DuyguA
56d6fb180e
added like_num to lex
2018-03-08 15:25:25 +01:00
DuyguA
26ee0590a3
added some commonly used cases
2018-03-08 12:43:58 +01:00
DuyguA
ae6473e4d5
removed some words with negation particle.
2018-03-08 12:20:32 +01:00
DuyguA
6ed59a2198
removed number words to be carried to the lexical
2018-03-08 12:19:23 +01:00
DuyguA
04784a44a6
made alphabetical order for Turkish characters
2018-03-08 12:11:32 +01:00
DuyguA
af33e022a5
added example sentences for Turkish
2018-03-08 12:06:03 +01:00
Matthew Honnibal
a1be01185c
Fix array out of bounds error in Span
2018-02-28 12:27:09 +01:00
Thomas Opsomer
8df9e52829
lemma property to return hash instead of unicode
2018-02-27 19:50:01 +01:00
Ines Montani
35634352fe
Merge pull request #2025 from dejanmarich/patch-1
...
Update stop_words.py for Croatian language
2018-02-26 18:22:32 +01:00
Matthew Honnibal
14f729c72a
Add subtok label to parser
2018-02-26 12:26:35 +01:00
Matthew Honnibal
7137ad8b0b
Make label filtering clearer for projectivisation
2018-02-26 12:02:01 +01:00
Matthew Honnibal
b8d52cb285
Fix inconsistent label freq cutoff for projectivisation
2018-02-26 12:01:44 +01:00
Matthew Honnibal
7b66ec896a
Revert "Revert "Improve parser oracle around sentence breaks.""
...
This reverts commit 36e481c584.
2018-02-26 10:57:37 +01:00
Matthew Honnibal
36e481c584
Revert "Improve parser oracle around sentence breaks."
...
This reverts commit 50817dc9ad.
2018-02-26 10:53:55 +01:00
Matthew Honnibal
5faae803c6
Add option to not use Janome for Japanese tokenization
2018-02-26 09:39:46 +01:00
Matthew Honnibal
9b406181cd
Add Chinese.Defaults.use_jieba setting, for UD
2018-02-25 15:12:38 +01:00
Matthew Honnibal
9ccd0c643b
Add Vietnamese
2018-02-25 15:00:46 +01:00
Matthew Honnibal
d4fdb97c87
Fix alignment for words with spaces
2018-02-25 14:55:00 +01:00
Matthew Honnibal
6d2c1ef52c
Fix SP tag in generic tag map
2018-02-24 16:04:56 +01:00
Matthew Honnibal
5cc3bd1c1d
Update alignment tests
2018-02-24 16:03:58 +01:00
Matthew Honnibal
6138439469
Fix many-to-one alignment
2018-02-24 16:03:50 +01:00
Matthew Honnibal
4890ee1732
Fix scoring of tokenization for punct
2018-02-24 10:32:32 +01:00
Matthew Honnibal
12b39f87da
Move cython declarations in matcher.pyx
2018-02-24 10:32:18 +01:00
Matthew Honnibal
01d1b7abdf
Support many-to-one alignment in GoldParse
2018-02-24 10:17:01 +01:00
Matthew Honnibal
7865746574
Support many-to-one alignment
2018-02-24 02:09:53 +01:00
Matthew Honnibal
458710b831
Poke matcher test for appveyor
2018-02-23 23:53:48 +01:00
Matthew Honnibal
968dabdde4
Fix bug in multi-task objective
2018-02-23 23:48:09 +01:00
Matthew Honnibal
2c9c8b8d72
Try commenting out emoji test in matcher
2018-02-23 23:34:35 +01:00
Matthew Honnibal
980ad68cbe
Try to find test that fails on appveyor
2018-02-23 21:27:53 +01:00
Matthew Honnibal
39de8cd4d3
Try to find test failing on appveyor
2018-02-23 20:59:21 +01:00
Matthew Honnibal
4492a33a9d
Fix sent_start multi-task objective when alignment fails
2018-02-23 16:50:59 +01:00
Matthew Honnibal
5fa44e93f1
Set unicode_literals in matcher
2018-02-23 16:48:54 +01:00
Matthew Honnibal
12264f9296
Add multi-task objective for sentence segmentation
2018-02-23 16:25:57 +01:00
Matthew Honnibal
e7deadb519
Set version to 2.1.0.dev1
2018-02-23 16:22:24 +01:00
Matthew Honnibal
7b575a119e
Try to reduce memory usage of test_matcher
2018-02-23 15:34:37 +01:00
Matthew Honnibal
24563f4026
Fix data typing in align
2018-02-23 15:08:06 +01:00
Matthew Honnibal
7a5ba20692
Fix integer typing in _align
2018-02-23 14:51:24 +01:00
Matthew Honnibal
875411b875
Set unicode types in _align.pyx and test
2018-02-23 14:35:38 +01:00
Matthew Honnibal
51d9679aa3
Fix broken span.as_doc test
2018-02-23 14:22:24 +01:00
dejanmarich
71c261d58b
Update stop_words.py
...
Added more words
2018-02-23 10:31:01 +01:00
Matthew Honnibal
3e6c1111b7
Remove obsolete test
2018-02-23 03:22:07 +01:00
Matthew Honnibal
a4fdec524a
Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-gold
2018-02-22 21:44:28 +01:00
Matthew Honnibal
50817dc9ad
Improve parser oracle around sentence breaks.
2018-02-22 19:22:26 +01:00
Matthew Honnibal
307aefe131
Increment version to v2.0.9
2018-02-22 17:07:53 +01:00
Feng Niu
1c60384bed
return on empty doc
2018-02-21 15:39:04 -08:00
Feng Niu
7eb1cd100b
unbound doc var
2018-02-21 15:05:37 -08:00
Feng Niu
8df75b229c
fix unbound vars in es.syntax_iterators
2018-02-21 13:11:17 -08:00
alldefector
4244e285c2
Fix Spanish noun_chunks failure caused by typo
2018-02-21 12:43:21 -08:00
Matthew Honnibal
661873ee4c
Randomize the rebatch size in parser
2018-02-21 21:02:07 +01:00
Matthew Honnibal
0872cf611d
Don't lower-case lemmas of proper nouns
2018-02-21 16:01:16 +01:00
Matthew Honnibal
a0ddb803fd
Make error when no label found more helpful
2018-02-21 16:00:59 +01:00
Matthew Honnibal
ea2fc5d45f
Improve length and freq cutoffs in parser
2018-02-21 16:00:38 +01:00
Matthew Honnibal
e5757d4bf0
Add labels property to parser
2018-02-21 16:00:00 +01:00
Matthew Honnibal
eff4ae809a
Fix nonproj label filter
2018-02-21 15:59:04 +01:00
Matthew Honnibal
e624405cda
Temporarily remove cutoff when filtering labels in nonproj
2018-02-21 13:53:40 +01:00
Matthew Honnibal
f466f0186e
Use new alignment implementation in GoldParse
2018-02-20 21:16:35 +01:00
Matthew Honnibal
c0734ba526
Make alignment work with strings
2018-02-20 17:51:49 +01:00
Matthew Honnibal
8180c84a98
Add tests for new Levenshtein alignment
2018-02-20 17:32:25 +01:00
Matthew Honnibal
930c980570
Add improved Levenshtein alignment implementation
2018-02-20 17:31:56 +01:00
Ines Montani
14e7e0f12a
Merge pull request #2000 from jimregan/polish-tag-map
...
Polish tag map
2018-02-18 19:05:58 +01:00
Jim O'Regan
664407de5d
missing PrepCase attribute
2018-02-18 14:46:12 +00:00
Jim O'Regan
95f0673fbc
fix typo/missing here too
2018-02-18 14:38:27 +00:00
Matthew Honnibal
2bccad8815
Fix incorrect matcher test
2018-02-18 14:56:12 +01:00
Matthew Honnibal
530172d57a
Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher
2018-02-18 14:40:42 +01:00
Matthew Honnibal
cf0e320f2b
Add doc.is_sentenced attribute, re #1959
2018-02-18 14:16:55 +01:00
Matthew Honnibal
1e5aeb4eec
Merge pull request #1987 from thomasopsomer/span-sent
...
Make span.sent work when only manual / custom sbd
2018-02-18 14:05:37 +01:00
Matthew Honnibal
1cf774bdc1
Add output options return_matches and as_tuples to Matcher
2018-02-18 14:00:45 +01:00
Matthew Honnibal
dd9b0945af
Fix inconsistencies in the symbols table
2018-02-18 13:51:31 +01:00
Matthew Honnibal
66496ac8e1
Set version to v2.1.0.dev0
2018-02-18 13:48:39 +01:00
Matthew Honnibal
eb3040ce46
Merge pull request #1891 from fucking-signup/master
...
Fix issue #1889
2018-02-18 13:47:47 +01:00
Matthew Honnibal
3d7285870b
Update matcher branch with v2.0.8 master
2018-02-18 13:42:58 +01:00
ines
6bba1db4cc
Drop six and related hacks as a dependency
2018-02-18 13:29:56 +01:00
Matthew Honnibal
b30b09192a
Merge pull request #1665 from jimregan/animacy
...
typo in "inan", add "nhum"
2018-02-18 13:26:53 +01:00
Matthew Honnibal
1b3c98e01b
Set version to v2.0.8
2018-02-18 12:16:31 +01:00
Matthew Honnibal
f9f46e5a07
Revert matcher fixes from GregDubbin
2018-02-18 10:59:28 +01:00
Matthew Honnibal
86405e4ad1
Fix CLI for multitask objectives
2018-02-18 10:59:11 +01:00
Matthew Honnibal
a34749b2bf
Add multitask objectives options to train CLI
2018-02-17 22:03:54 +01:00
Matthew Honnibal
8f06903e09
Fix multitask objectives
2018-02-17 18:41:36 +01:00
Matthew Honnibal
d1246c95fb
Fix model loading when using multitask objectives
2018-02-17 18:11:36 +01:00
Matthew Honnibal
262d0a3148
Fix overwriting of lexical attributes when loading vectors during training
2018-02-17 18:11:11 +01:00
Matthew Honnibal
c0caf7cf27
Fix LANG symbol
2018-02-17 18:10:50 +01:00
Matthew Honnibal
0bf2f6be29
Add missing symbol for LANG attr. Fixes inconsistent numeric ID
2018-02-17 17:37:02 +01:00
Matthew Honnibal
97a228a4ce
Increment to v2.0.8.dev0
2018-02-17 16:54:36 +01:00
Matthew Honnibal
f7dc64d2a3
Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher
2018-02-17 16:47:35 +01:00
Aaron Marquez
ea571e8325
Merge branch 'master' into issue-1959
2018-02-16 15:14:09 -08:00
Matthew Honnibal
7d5c720fc3
Fix multitask objective when no pipeline provided
2018-02-15 23:50:21 +01:00
Aaron Marquez
f0d3672e17
Changed loading EN model
2018-02-15 14:28:38 -08:00
Aaron Marquez
3765d84d57
Fix issue #1959
2018-02-15 12:51:49 -08:00
Aaron Marquez
7ba4111554
Add test for issue-1959
2018-02-15 12:46:22 -08:00
Matthew Honnibal
59b7cf9db8
Add get_beam_parse method in ArcEager, for Prodigy
2018-02-15 21:03:16 +01:00
Matthew Honnibal
3e541de440
Merge branch 'master' of https://github.com/explosion/spaCy
2018-02-15 21:02:55 +01:00
Thomas Opsomer
5d24a81c0b
add test for span.sent when doc not parsed
2018-02-15 16:59:16 +01:00
Thomas Opsomer
deab391cbf
correct check on sent_start & raise if no boundaries
2018-02-15 16:58:30 +01:00
Matthew Honnibal
afbd46adfb
Remove length cap in PhraseMatcher
2018-02-15 16:10:54 +01:00
Matthew Honnibal
4533c7408d
Update matcher tests
2018-02-15 15:39:47 +01:00
Matthew Honnibal
1c19605426
Move matcher2.pyx to matcher.pyx
2018-02-15 15:27:03 +01:00
Matthew Honnibal
9ebf2fe7c3
Make helper function to get longest matches
2018-02-15 15:26:15 +01:00
Matthew Honnibal
4cb861e080
Merge pull request #1968 from DuyguA/is_currency
...
New lexical feature is_currency
2018-02-15 12:13:36 +01:00
Thomas Opsomer
b902731313
Find span sentence when only sentence boundaries (no parser)
2018-02-14 22:18:54 +01:00
Matthew Honnibal
d19dc67886
Make get_action nogil, for efficiency
2018-02-14 12:16:36 +01:00
Matthew Honnibal
7885b92b45
Refactor matcher2, hopefully making it faster
2018-02-14 12:11:17 +01:00
Matthew Honnibal
00261eea27
Make tests refer to matcher2
2018-02-14 12:10:51 +01:00
Claudiu-Vlad Ursache
e28de12cbd
Ensure files opened in from_disk are closed
...
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
Matthew Honnibal
262cbe356e
Remove caching, as doesn't seem to help for now.
2018-02-13 17:15:20 +01:00
Matthew Honnibal
f43d53f2c5
Remove print statement
2018-02-13 17:15:07 +01:00
Matthew Honnibal
dcd8d89aef
Update test for 850, making it work with matcher2
2018-02-13 16:35:20 +01:00
Matthew Honnibal
9bdfa5cd4f
Remove re comparisons tests, as matcher behaves differently
2018-02-13 16:28:52 +01:00
Matthew Honnibal
6d7986b0f1
Fix matcher test
2018-02-13 16:28:06 +01:00
Matthew Honnibal
9efda9e9ab
Add PhraseMatcher in matcher2.pyx
2018-02-13 16:27:46 +01:00
Johannes Dollinger
012e874d09
Add contributor agreement for emulbreh
2018-02-13 13:40:33 +01:00
Johannes Dollinger
bf94c13382
Don't fix random seeds on import
2018-02-13 12:42:23 +01:00
Matthew Honnibal
0004331895
Update notes on matcher2
2018-02-13 11:45:45 +01:00
Matthew Honnibal
b4cc39eb74
Fix zero-width quantifiers. Passes test_matcher
2018-02-13 11:45:32 +01:00
Matthew Honnibal
1b01685f47
Fix ZERO_PLUS operator
2018-02-12 12:28:03 +01:00
Matthew Honnibal
9115c3ba0a
Add TODO in notes
2018-02-12 12:06:48 +01:00
Matthew Honnibal
b00326a7fe
Move pattern_id out of TokenPattern
2018-02-12 12:05:54 +01:00
Matthew Honnibal
d34c732635
Add Python notes for rethinking matcher
2018-02-12 10:19:29 +01:00
Matthew Honnibal
d7c9b53120
Pass kwargs into pipeline components during begin_training
2018-02-12 10:18:39 +01:00
Matthew Honnibal
fae5c0dc18
Work on matcher2
2018-02-12 10:17:43 +01:00
4altinok
ca8728035d
added new lex feat to token
2018-02-11 18:55:48 +01:00
4altinok
edd7202a06
added new symbol
2018-02-11 18:55:32 +01:00
4altinok
ed1ac2969e
added new lexical feat to lexeme
2018-02-11 18:51:48 +01:00
4altinok
94fb0b75e3
code for is_currency
2018-02-11 18:51:32 +01:00
4altinok
3deef1497a
removed 18 and replaced 18 with is_currency
2018-02-11 18:51:09 +01:00
4altinok
471d3c9e23
added lex test for is_currency
2018-02-11 18:50:50 +01:00
ines
c63e99da8a
Fix typo in glossary ( resolves #1964 )
...
Co-Authored-By: SThomasP <sthomasp@users.noreply.github.com>
2018-02-10 11:58:41 +01:00
Lyndon White
6ee5dff51c
Make python 3.4 compat module loading ( fix #1733 )
2018-02-09 23:03:35 +08:00
Matthew Honnibal
e361b4f82b
Fix #1929 : Incorrect NER when pre-set sentence boundaries.
2018-02-08 15:25:41 +01:00
Matthew Honnibal
fd9fd275c5
Make test for #1945 more precise
2018-02-07 02:06:11 +01:00
Matthew Honnibal
c087a14380
Merge branch 'master' of https://github.com/explosion/spaCy
2018-02-07 01:29:39 +01:00
Matthew Honnibal
76d89b2180
Add test for #1945 : PhraseMatcher regression
2018-02-07 01:29:23 +01:00
Ines Montani
0954e15dda
Merge pull request #1913 from ohenrik/nb_syntax_iterator
...
Norwegian Language (nb) - Added french syntax iterator with explanation
2018-02-06 04:59:07 +01:00
Ole Henrik Skogstrøm
251a7805fe
Copied French syntax iterator to simplify future changes
2018-02-05 14:45:05 +01:00
Matthew Honnibal
2e7391e627
Merge pull request #1916 from tokestermw/bug/fix-not-passing-in-model-cfg-in-nlp
...
Bug/fix not passing in model cfg in nlp
2018-02-05 01:19:40 +01:00
Ali Zarezade
9df9da34a3
Fix init_model issue
...
Fixing issue #1928
2018-02-03 17:21:34 +03:30
Matthew Honnibal
ebe84e45e5
Increment version to 2.0.7
2018-02-02 03:39:16 +01:00
Matthew Honnibal
e4b1f57599
Increment version
2018-02-02 02:33:23 +01:00
Matthew Honnibal
069531c351
Merge branch 'master' of https://github.com/explosion/spaCy
2018-02-02 02:32:58 +01:00
Matthew Honnibal
f74a802d09
Test and fix #1919 : Error resuming training
2018-02-02 02:32:40 +01:00
ines
f1d3deffac
Add Russian example sentences (see #1107 )
2018-02-01 20:09:40 +01:00
Matthew Honnibal
6b1126c312
Merge branch 'master' of https://github.com/explosion/spaCy
2018-02-01 02:57:52 +01:00
ines
3c1fb9d02d
Make validate command fail more gracefully if version not found
...
Mostly relevant during development when working with .dev versions
2018-01-31 22:06:28 +01:00
Motoki Wu
54062b7326
added tests for issue #1915
2018-01-30 18:30:19 -08:00
Motoki Wu
f4a7d1a423
make sure to pass in **cfg to each component when training
2018-01-30 18:29:54 -08:00
ines
4046823699
Only check component in factories if string (see #1911 )
2018-01-30 16:29:07 +01:00
ines
ce10d320c4
Fix component check in self.factories (see #1911 )
2018-01-30 16:09:37 +01:00
Ole Henrik Skogstrøm
e40465487c
Added French syntax iterator with explanation
2018-01-30 15:44:29 +01:00
ines
8901814248
Improve error handling if pipeline component is not callable ( resolves #1911 )
...
Also add help message if user accidentally calls nlp.add_pipe() with a string of a built-in component name.
2018-01-30 15:43:03 +01:00
Matthew Honnibal
a437ba87a3
Set release=True
2018-01-29 21:26:04 +01:00
Adam Binford
9238749aaf
Removed test to avoid network requests
2018-01-29 14:48:20 -05:00
Adam Binford
1a2c2f7d7f
Fixed auto linking after download and added simple test to check
2018-01-29 14:25:21 -05:00
Matthew Honnibal
cb7110c22e
Merge pull request #1882 from ohenrik/nb_lemma_and_tag_map
...
Add norwegian bokmål ('nb') lemmatizer and tag_map
2018-01-29 18:18:50 +01:00
Matthew Honnibal
0c1e7f0c86
Merge pull request #1893 from azarezade/master
...
Add Persian language
2018-01-29 18:18:33 +01:00
Matthew Honnibal
cbdab75b36
Increment version
2018-01-28 23:46:22 +01:00
Matthew Honnibal
512e6adb08
Merge pull request #1896 from thomasopsomer/fix-sent
...
Fix sentence boundaries serialization (issue #1834 )
2018-01-28 21:18:51 +01:00
Matthew Honnibal
f5b1ad4100
Limit parser model size, to hopefully reduce memory during CI tests
2018-01-28 21:00:32 +01:00
Thomas Opsomer
515e25910e
fix sent_start in serialization
2018-01-28 19:50:42 +01:00
Thomas Opsomer
45d62561f7
add test for the issue
2018-01-28 19:49:56 +01:00
ines
6d978e5c35
Don't use deprecated Doc.merge call in displaCy
...
As reported here: https://stackoverflow.com/a/48464412/6400719
2018-01-27 11:25:05 +01:00
Ali Zarezade
bb6bd3d8ae
add persian language
2018-01-27 13:27:26 +03:30
Ali Zarezade
d195675db5
add persian language
2018-01-27 13:21:38 +03:30
Kit
4b42267ba3
Fix issue #1889
2018-01-25 23:17:22 +01:00
Kit
52ef51f36e
Add test for issue #1889
2018-01-25 22:56:48 +01:00
Ole Henrik Skogstrøm
8e2c9f2475
Cleaned up nb tag_map comments
2018-01-25 11:09:28 +01:00
Ole Henrik Skogstrøm
1107e89fcf
Updated doc string on nb tag_map module
2018-01-25 11:08:28 +01:00
Matthew Honnibal
6a8cb905aa
Merge pull request #1876 from GregDubbin/master
...
Pattern matcher fixes
2018-01-24 16:38:11 +01:00
Matthew Honnibal
38b260e0c3
Merge pull request #1879 from azarezade/master
...
Add Persian character and symbols
2018-01-24 16:34:22 +01:00
Matthew Honnibal
edb71a280e
Add test for #1883 : Unpickling Matcher
2018-01-24 15:42:33 +01:00
Matthew Honnibal
2ad050e668
Fix unpickling of Matcher. Also store correct data in matcher._patterns
2018-01-24 15:42:11 +01:00
Ole Henrik Skogstrøm
4058a7d579
Fix æøå characters in lemmatizer
2018-01-24 14:03:14 +01:00
Ole Henrik Skogstrøm
42248f423f
Updated tag map
2018-01-24 13:50:33 +01:00
Ole Henrik Skogstrøm
74b430b49a
Correct Lemmatizer
2018-01-24 13:26:33 +01:00
Ole Henrik Skogstrøm
b9b3a40c78
Add norwegian lemmatizer and tag_map
2018-01-24 12:28:29 +01:00
Matthew Honnibal
42a18ef903
Add test for #1868 : Vocab.__contains__ with ints
2018-01-23 23:27:05 +01:00
Matthew Honnibal
43f381ce36
Make Vocab.__contains__ work with ints. Fixes #1868
2018-01-23 23:26:47 +01:00
greg
85ab99e692
Correct test examples
2018-01-23 15:00:14 -05:00
greg
f50bb1aafc
Restructure StateC to eliminate dependency on unordered_map
2018-01-23 14:40:03 -05:00
Matthew Honnibal
f3753c2453
Further model deserialization fixes re #1727
2018-01-23 19:16:05 +01:00
Matthew Honnibal
91e916cb67
Add comment to new test
2018-01-23 19:11:53 +01:00
Matthew Honnibal
fd187d71ad
Add test for #1727
2018-01-23 19:11:01 +01:00
Matthew Honnibal
85c942a6e3
Don't overwrite pretrained_dims setting from cfg. Fixes #1727
2018-01-23 19:10:49 +01:00
Ali Zarezade
42349471bc
add ٪ as punctuation
2018-01-23 18:11:33 +03:30
Ali Zarezade
2bda582135
Add Persian character and symbols
...
Add Persian characters and the following:
- ٪ used instead of %
- ؟ used instead of ?
- ﷼ used instead of $
- ، used instead of ,
- ؛ used instead of ;
2018-01-23 13:20:36 +03:30
Matthew Honnibal
7e6dc283db
Fix unicode import in test
2018-01-22 23:55:44 +01:00
greg
686735b94e
Fix matcher import
2018-01-22 16:53:05 -05:00
greg
3a491093ee
Import libcpp.map if libcpp.unordered_map doesn't exist
2018-01-22 16:46:25 -05:00
greg
d55992bdf0
Switch match dictionary to use final state pointer rather than ID
2018-01-22 15:36:47 -05:00
Matthew Honnibal
4ce7d24fd5
Add test for #1799 : Set left and right edges (and thus sentences) in non-projective parses.
2018-01-22 20:18:38 +01:00
Matthew Honnibal
56164ab688
Set l_edge and r_edge correctly for non-projective parses. Fixes #1799
2018-01-22 20:18:04 +01:00
Matthew Honnibal
964aa1b384
Merge branch 'master' of https://github.com/explosion/spaCy
2018-01-22 19:18:46 +01:00
Matthew Honnibal
29897ed1b3
Allow vector loading to work on 1d data files. Fixes #1831
2018-01-22 19:18:26 +01:00
greg
490bc82c27
Add comments clarifying matcher logic for '*'
2018-01-22 10:03:12 -05:00
Matthew Honnibal
fe4748fc38
Merge pull request #1870 from avadhpatel/master
...
Model Load Performance Improvement by more than 5x
2018-01-22 00:05:15 +01:00
Avadh Patel
a517df55c8
Small fix
...
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:45 -06:00
Avadh Patel
5b5029890d
Merge branch 'perfTuning' into perfTuningMaster
...
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:00 -06:00
Matthew Honnibal
203d2ea830
Allow multitask objectives to be added to the parser and NER more easily
2018-01-21 19:37:02 +01:00
Matthew Honnibal
4a7d524efb
Merge branch 'master' of https://github.com/explosion/spaCy
2018-01-21 19:22:03 +01:00
Matthew Honnibal
61a051f2c0
Fix MultitaskObjective
2018-01-21 19:21:34 +01:00
Avadh Patel
75903949da
Updated model building after suggestion from Matthew
...
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-18 06:51:57 -06:00
Avadh Patel
fe879da2a1
Do not train model if it's going to be loaded from disk
...
This saves significant time in loading a model from disk.
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:16:07 -06:00
Avadh Patel
2146faffee
Do not train model if it's going to be loaded from disk
...
This saves significant time in loading a model from disk.
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:04:22 -06:00
greg
7072b395c9
Add greedy matcher tests
2018-01-16 15:46:13 -05:00
greg
441f490c1c
Merge branch 'master' of github.com:GregDubbin/spaCy
2018-01-16 13:31:10 -05:00
greg
8bea62f26e
Correct bugs for greedy matching and introduce ADVANCE_PLUS action
2018-01-16 13:21:43 -05:00
Matthew Honnibal
ccb51a9f36
Make .similarity() return 1.0 if all orth attrs match
2018-01-15 16:29:48 +01:00
Matthew Honnibal
82135d85b7
Fix test
2018-01-15 15:55:15 +01:00
Matthew Honnibal
4b09616b58
Add test for #1757 : Comparison against None
2018-01-15 15:55:01 +01:00
Matthew Honnibal
b904d81e9a
Fix rich comparison against None objects. Closes #1757
2018-01-15 15:51:25 +01:00
Matthew Honnibal
9e413449f6
Fix unicode error in new test
2018-01-15 15:39:00 +01:00
Matthew Honnibal
ab7c45b12d
Fix error message and handling of doc.sents
2018-01-15 15:21:11 +01:00
Matthew Honnibal
6b215d2dd3
Add test for Issue #1537
2018-01-15 15:20:56 +01:00
ines
5babb7d6f6
Merge branch 'master' of https://github.com/explosion/spaCy
2018-01-14 17:31:09 +01:00
ines
793890cb4d
Remove test for removed deprecation warning
2018-01-14 17:31:06 +01:00
Matthew Honnibal
465a6f6452
Add missing Span.vocab property. Closes #1633
2018-01-14 15:06:30 +01:00
Matthew Honnibal
0cb090e526
Fix infinite recursion in token.sent_start. Closes #1640
2018-01-14 15:02:15 +01:00
Matthew Honnibal
5cbe913b6f
Don't raise deprecation warning in property. Closes #1813 , #1712
2018-01-14 14:55:58 +01:00
Matthew Honnibal
1a1cca6052
Fix vectors.resize() on Py3. Closes #1539
2018-01-14 14:48:51 +01:00
Matthew Honnibal
0153220304
Make set_vector add word to vocab. Fixes #1807
2018-01-14 13:57:57 +01:00
Ines Montani
55754f0cee
Merge pull request #1836 from fucking-signup/master
...
Add tests for issue #1769
2018-01-13 00:23:35 +00:00
Kit
4ee97f20a0
Mark like_num tests as slow
2018-01-13 00:44:15 +01:00
Kit
855531537e
Rewrite tests for issue #1769
2018-01-12 23:49:51 +01:00
Kit
5b541cb5ec
Simplify tests for issue #1769
2018-01-12 23:34:27 +01:00
Kit
7a2adc4633
Remove some tests to see build status changes
2018-01-12 22:49:16 +01:00
Kit
0e62809a43
Rewrite tests for issue #1769
2018-01-12 22:26:06 +01:00
Ines Montani
36f426fe0a
Merge pull request #1808 from fucking-signup/master
...
Fix issue #1769
2018-01-12 21:12:02 +00:00
Kit
76f4eeca44
Remove tests to see build changes on Windows (Python 2.7)
2018-01-12 20:30:51 +01:00
Matthew Honnibal
7ca49c2061
Merge branch 'master' into feature-improve-model-download
2018-01-10 18:21:55 +01:00
Kit
7ec0956e8d
Add regression test (issue #1769)
2018-01-08 03:42:04 +01:00
Kit
701e7cc6aa
Rename variable to keep code consistent
2018-01-08 03:38:44 +01:00
Kit
ed0db95183
Find lowercased forms of ordinal words, where possible
2018-01-08 03:28:50 +01:00
Kit
9bc524982e
Find lowercased forms of numeric words
2018-01-08 03:25:08 +01:00
Søren Lind Kristiansen
62de5da1ff
Remove unused dummy variable
2018-01-05 09:57:24 +01:00
Søren Lind Kristiansen
10dab8eef8
Remove dummy variable from function calls
2018-01-05 09:37:05 +01:00
Søren Lind Kristiansen
7f0ab145e9
Don't pass CLI command name as dummy argument
2018-01-04 21:33:47 +01:00
Ines Montani
6a008233b5
Merge pull request #1795 from textioHQ/issue1758 (resolves #1758)
...
english tokenizer: handle "would've"
2018-01-04 02:43:39 +00:00
Kevin Humphreys
597df5bf83
add test
2018-01-03 13:00:05 -08:00
Kevin Humphreys
7918fa4ef9
handle would've
2018-01-03 12:25:48 -08:00
ines
2c656f90fb
Exit with 1 if incompatible models found (see #1714)
2018-01-03 21:20:35 +01:00
ines
dacfaa2ca4
Ensure that download command exits properly (resolves #1714)
2018-01-03 21:03:36 +01:00
Søren Lind Kristiansen
a9ff6eadc9
Prefix dummy argument names with underscore
2018-01-03 20:48:12 +01:00
ines
1081e08efb
Fix formatting
2018-01-03 20:14:50 +01:00
ines
d8109964d6
Use --no-deps on model install
...
In general, it's nice for models to specify spaCy as a dependency. However, this tends to cause problems in conda environments, as pip will re-install spaCy and its dependencies (especially Thinc).
2018-01-03 17:40:37 +01:00
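A minimal sketch of the idea in the commit above, not spaCy's actual download code: the pip command for a model install can be assembled with `--no-deps` so that pip leaves the installed spaCy and Thinc alone. The helper name `build_pip_install_args` and the archive name are hypothetical; `--no-deps` is pip's own flag for skipping dependency installation.

```python
import sys


def build_pip_install_args(package, no_deps=True):
    """Return an argv list for installing `package` with pip (hypothetical helper)."""
    # Run pip through the current interpreter so the active environment is used.
    args = [sys.executable, "-m", "pip", "install"]
    if no_deps:
        # Skip dependency resolution: the installed spaCy and Thinc stay untouched.
        args.append("--no-deps")
    args.append(package)
    return args


# Placeholder archive name, for illustration only.
cmd = build_pip_install_args("example_model-1.0.0.tar.gz")
```

The resulting list can be handed to `subprocess.call(cmd)`. The trade-off is that a model's declared requirements are no longer checked at install time.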
ines
319d754309
Fix overwriting of existing symlinks
...
Check for is_symlink() to also overwrite invalid and outdated symlinks. Also show better error message if link path exists but is not symlink (i.e. file or directory).
2018-01-03 17:39:36 +01:00
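The check described in the commit above can be sketched as follows. This is an illustration under assumptions, not the code from the commit: the function name `replace_link` and the error text are hypothetical. The point is that `Path.is_symlink()` is true even for a dangling link, while `Path.exists()` follows the link and reports false when the target is gone.

```python
from pathlib import Path


def replace_link(link_path: Path, target: Path) -> None:
    """Point link_path at target, overwriting any existing symlink (hypothetical helper)."""
    if link_path.is_symlink():
        # Covers valid, outdated and broken links alike.
        link_path.unlink()
    elif link_path.exists():
        # A real file or directory is in the way: refuse rather than delete it.
        raise FileExistsError(
            "{} exists but is not a symlink (file or directory)".format(link_path)
        )
    link_path.symlink_to(target, target_is_directory=target.is_dir())
```

Checking `exists()` alone would miss a broken link, and the subsequent `symlink_to` call would then fail because the path is already taken.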
ines
8ba0dfd017
Make message on failed linking more clear
2018-01-03 17:38:09 +01:00
Søren Lind Kristiansen
d6327e8495
Fix handling case when vectors not specified
2018-01-03 12:20:49 +01:00
Søren Lind Kristiansen
bcc51d7d8b
Fix shifted positional arguments
2018-01-03 12:19:47 +01:00
zqhZY
f27859fa99
add ChineseDefaults class for pickling
2017-12-28 17:13:58 +08:00
Ines Montani
ff9fc945ab
Merge pull request #1749 from sorenlind/da_ud_tokenization
...
Tune Danish tokenizer to more closely match Universal Dependencies
2017-12-22 16:00:49 +00:00
ines
26f313dabc
Fix missing import
2017-12-22 16:21:44 +01:00
ines
8dc1c27841
Merge branch 'master' of https://github.com/explosion/spaCy
2017-12-22 16:01:00 +01:00
ines
b10ba848b8
xfail test that causes MemoryError on Python 2 on Windows
...
Need to investigate this further!
2017-12-22 16:00:58 +01:00
Søren Lind Kristiansen
bef735aef7
Fix Danish abbreviation 'm.h.t.'
2017-12-21 09:24:31 +01:00
Ines Montani
a3dd167d7f
Merge branch 'master' into da_ud_tokenization
2017-12-20 21:05:34 +00:00
Ines Montani
97f100f69f
Merge pull request #1742 from kimfalk/master
...
Two corrections in the da lan.
2017-12-20 21:02:00 +00:00
Ines Montani
d682a8803e
Merge pull request #1672 from cbilgili/master
...
Adds Turkish Lemmatization
2017-12-20 21:01:00 +00:00
Benjamin Peterson
9452134cd1
remove no-break spaces from Hindi example (fixes #1750)
2017-12-20 11:35:30 -08:00
Søren Lind Kristiansen
7a2f2f6f94
Fix formatting.
2017-12-20 18:37:37 +01:00
Søren Lind Kristiansen
15d13efafd
Tune Danish tokenizer to more closely match tokenization in Universal Dependencies.
2017-12-20 17:36:52 +01:00
Kim FalkJørgensen
648dc60755
Remove the incorrect exception 'm.h.t'
2017-12-20 10:02:39 +01:00
Kim FalkJørgensen
9c9f4ef84a
Fixing a translation error in examples.py
...
Adding an exception in the tokenizer_exceptions.py
2017-12-19 15:26:50 +01:00
ines
22dc744b48
Fix check for '@' in like_url (see #1715)
2017-12-16 13:48:43 +01:00
Ines Montani
9c1ee65268
Add regression test for #1698
2017-12-12 10:36:11 +01:00
Ines Montani
6455b574fc
Check for email address first
2017-12-12 10:25:13 +01:00
Bri-Will
d77361d76c
Update lex_attrs.py. Fix like_url from matching on e-mail
2017-12-11 14:13:28 -08:00
Søren Lind Kristiansen
5a9d377580
Remove abbreviation for positional plac argument
2017-12-11 11:08:29 +01:00
Isaac Sijaranamual
38021fbb00
Switch from python 3 only TemporaryDirectory to pytest's tmpdir
2017-12-11 00:16:04 +01:00
Isaac Sijaranamual
20ae0c459a
Fixes "Error saving model" #1622
2017-12-10 23:07:13 +01:00
Isaac Sijaranamual
568130ce7c
Adds regression test_issue1622
2017-12-10 23:00:48 +01:00
Isaac Sijaranamual
e188b61960
Make cli/train.py not eat exception
2017-12-10 22:53:08 +01:00
ines
020a7e5d52
Allow 'fine_grained' option in displaCy (see #1703)
...
Shows token.tag_ instead of token.pos_. Disabled by default, to not cause rendering issues for models with long fine-grained tags (e.g. merged morphological features).
2017-12-09 15:11:12 +01:00
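The behaviour of the option in the commit above can be illustrated without spaCy installed. This is a sketch only: `Token` is a stand-in namedtuple carrying the two attributes the commit names (`pos_` and `tag_`), and `token_labels` is a hypothetical helper, not displaCy's renderer.

```python
from collections import namedtuple

# Stand-in for a spaCy token: only the attributes mentioned in the commit.
Token = namedtuple("Token", ["text", "pos_", "tag_"])


def token_labels(tokens, fine_grained=False):
    """Pick the label shown under each word (hypothetical helper)."""
    # Default stays coarse-grained, so long fine-grained tags cannot break the layout.
    return [(t.text, t.tag_ if fine_grained else t.pos_) for t in tokens]


tokens = [Token("She", "PRON", "PRP"), Token("runs", "VERB", "VBZ")]
coarse = token_labels(tokens)
fine = token_labels(tokens, fine_grained=True)
```

Here `coarse` holds the `pos_` values and `fine` holds the `tag_` values for the same two tokens.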
Matthew Honnibal
3b17eb7c49
Merge branch 'master' of https://github.com/explosion/spaCy
2017-12-07 10:39:32 +01:00
Matthew Honnibal
a6b43729c6
Set version to v2.0.5
2017-12-07 10:39:14 +01:00
ines
5eaa61c2b8
Fix formatting
2017-12-07 10:23:09 +01:00
ines
24e80c51b8
Document init-model command
2017-12-07 10:14:37 +01:00
Matthew Honnibal
c91f451b0f
Fix imports and CLI in init-model
2017-12-07 10:03:07 +01:00
ines
82e80ff928
Rename model command to init_model and fix formatting
2017-12-07 09:59:23 +01:00
Ines Montani
2feeb428d6
Merge pull request #1646 from GreenRiverRUS/master
...
Added model command to create models from raw data
2017-12-07 08:54:26 +00:00
Matthew Honnibal
6373d2580d
Increment version to v2.0.5.dev0
2017-12-07 09:53:59 +01:00
Matthew Honnibal
36b47e3fa6
Fix (and test) vector pickling
2017-12-07 09:53:30 +01:00
Matthew Honnibal
05f41ff587
Set version to 2.0.4
2017-12-06 13:24:02 +01:00
Matthew Honnibal
04c38f7e87
Merge branch 'master' of https://github.com/explosion/spaCy
2017-12-06 12:15:52 +01:00
Matthew Honnibal
361944e512
If no rules are set, lemmatize by lookup
2017-12-06 12:12:11 +01:00
Matthew Honnibal
2ab0f2d186
Merge pull request #1664 from jimregan/italian-lemmatizer
...
BOM in Italian lemmatiser
2017-12-06 11:09:04 +01:00
Matthew Honnibal
3f247119d3
Merge pull request #1668 from sorenlind/da_morph
...
Add more Danish morph rules and clean up existing ones
2017-12-06 11:08:09 +01:00
Matthew Honnibal
b712de774e
Fix vectors pickling
2017-12-05 12:45:24 +01:00
Matthew Honnibal
04650e38c7
Set version to 2.0.4.dev0
2017-12-05 10:52:31 +01:00
Matthew Honnibal
07acb43a85
Merge branch 'master' of https://github.com/explosion/spaCy
2017-12-04 14:42:52 +01:00
Thomas Werkmeister
94eac75b7c
fix setup.py spacy req string for packaging
...
Requirement should be `spacy>=2.0.2` instead of `spacy2.0.2`
2017-12-03 04:16:28 -06:00
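Why the missing operator in the commit above matters: without `>=`, a string such as `spacy2.0.2` would most likely be read as a bare package name rather than as "spacy, version 2.0.2 or later". A small check along these lines would catch it. This is a sketch under assumptions; the pattern and the function name `split_requirement` are hypothetical, deliberately stricter than pip's own parsing, and not taken from the commit.

```python
import re

# name, comparison operator, version; the operator is mandatory in this sketch.
REQUIREMENT = re.compile(
    r"^([A-Za-z0-9_.-]+)\s*(==|>=|<=|~=|!=|<|>)\s*([0-9][0-9A-Za-z.*+!-]*)$"
)


def split_requirement(requirement):
    """Split 'name>=version' into its parts, rejecting strings with no operator."""
    match = REQUIREMENT.match(requirement)
    if match is None:
        raise ValueError("no version operator in {!r}".format(requirement))
    return match.groups()


parts = split_requirement("spacy>=2.0.2")
```

With this pattern, `spacy>=2.0.2` splits into a name, an operator and a version, while `spacy2.0.2` raises `ValueError`.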
ines
f2ea6d4713
Add Dutch example sentences (see #1107)
2017-12-01 23:36:05 +01:00
Canbey Bilgili
abe098b255
Adds Turkish Lemmatization
2017-12-01 17:04:32 +03:00
Søren Lind Kristiansen
d86b537a38
Enable morph rules for Danish
2017-11-30 15:58:02 +01:00
Søren Lind Kristiansen
13a988adc3
Remove 'Number[psor]'
2017-11-30 15:55:04 +01:00
Søren Lind Kristiansen
dd6fde18a9
Add more Danish morph rules and clean up existing ones
2017-11-30 11:17:19 +01:00
Vadim Mazaev
495eacf470
Merge branch 'model_command'
2017-11-30 12:30:26 +03:00
Vadim Mazaev
4ba7ddf651
Bugfixes
2017-11-30 12:29:38 +03:00
Jim O'Regan
a4ecdeadd4
aha
2017-11-29 23:43:25 +00:00