Matthew Honnibal
58be0e1f6f
Update tests
2017-06-04 16:35:06 -05:00
Matthew Honnibal
b78cc318c3
Fix loading of morphology exceptions
2017-06-04 16:34:32 -05:00
Matthew Honnibal
bb98d45a63
Fix tests
2017-06-04 16:00:44 -05:00
Matthew Honnibal
55d0621532
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 15:53:25 -05:00
Matthew Honnibal
5b9f116aca
Update tests
2017-06-04 15:53:17 -05:00
Matthew Honnibal
2a3bd5ee90
Fix fetching of noun chunk iterator
2017-06-04 15:53:05 -05:00
Matthew Honnibal
3680c51b8f
Avoid clobbering preset POS tags
2017-06-04 15:52:42 -05:00
Matthew Honnibal
939e8ed567
Add lookup properties for components in Language
2017-06-04 15:52:09 -05:00
Matthew Honnibal
e28f90b672
Fix syntax iterators
2017-06-04 15:51:50 -05:00
ines
8a29308d0b
Remove unused imports
2017-06-04 22:39:29 +02:00
Ines Montani
112c5787eb
Merge pull request #1101 from oroszgy/hu_tokenizer_fix
...
More robust Hungarian tokenizer.
2017-06-04 22:37:51 +02:00
ines
96867a24ae
Fix typo
2017-06-04 22:36:40 +02:00
ines
f432bb4b48
Fix fixture scopes
2017-06-04 22:34:31 +02:00
Matthew Honnibal
6d0356e6cc
Whitespace
2017-06-04 14:55:24 -05:00
Matthew Honnibal
8a683a4494
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 21:53:56 +02:00
Matthew Honnibal
92ae36f84e
Improve way noun chunks iterator is looked up
2017-06-04 21:53:39 +02:00
ines
9254a3dd78
Import and add Spanish syntax iterators
2017-06-04 21:42:15 +02:00
ines
7db1a0e83e
Make sure printed values are always strings
2017-06-04 21:27:20 +02:00
Matthew Honnibal
51e1541ddb
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 14:26:29 -05:00
Matthew Honnibal
add9a33782
Return False for vocab.has_vector
2017-06-04 14:26:14 -05:00
Matthew Honnibal
675f448313
Fix vector linkage on Doc
2017-06-04 14:25:30 -05:00
Matthew Honnibal
f4662e9218
Fix vector linkage for token
2017-06-04 14:19:58 -05:00
ines
070e026ed9
Ensure path on read_json
2017-06-04 20:44:37 +02:00
ines
e1e73936b1
Raise correct error
2017-06-04 20:44:27 +02:00
ines
848e47669e
Fix typo
2017-06-04 20:44:15 +02:00
ines
c4614c02a2
Fix dev resources URL
2017-06-04 15:45:50 +02:00
ines
a66cf24ee8
xfail tokenizer serialization tests for now
...
Tests pass locally, but not on Travis – needs more investigation
2017-06-04 13:58:20 +02:00
ines
7b7d46b64e
Fix typo and success message
2017-06-04 13:45:50 +02:00
ines
90d117f378
Update version
2017-06-04 13:41:16 +02:00
Matthew Honnibal
7ca215bc26
Resolve lex_attr_getters conflict
2017-06-03 16:12:01 -05:00
Matthew Honnibal
21eef90dbc
Support specifying which GPU
2017-06-03 16:10:23 -05:00
Matthew Honnibal
d0e42f9275
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-03 15:30:32 -05:00
Matthew Honnibal
8a17b99b1c
Use NORM attribute, not LOWER
2017-06-03 15:30:16 -05:00
ines
4c643d74c5
Add norm exceptions to other Language classes
2017-06-03 22:29:21 +02:00
ines
fa7e576c57
Change order of exception dicts
2017-06-03 21:52:06 +02:00
Matthew Honnibal
3f5c85d8de
Reorder setting of lex attrs, to avoid clobbering
2017-06-03 14:47:55 -05:00
Matthew Honnibal
aeb7520133
Make norm use lower-case
2017-06-03 14:47:38 -05:00
Matthew Honnibal
de3954843e
Populate norm exceptions with lower-case
2017-06-03 14:47:12 -05:00
Matthew Honnibal
f6955a459c
Fix prev commit
2017-06-03 14:38:37 -05:00
Matthew Honnibal
468ca6c760
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-03 14:33:51 -05:00
Matthew Honnibal
c647a0d33e
Fix training counter for gold preprocessing
2017-06-03 14:33:39 -05:00
ines
e47eef5e03
Update German tokenizer exceptions and tests
2017-06-03 21:07:44 +02:00
ines
d77c2cc8bb
Add tests for English norm exceptions
2017-06-03 20:59:50 +02:00
ines
0d6fa8b241
Add German norm exceptions
2017-06-03 20:54:18 +02:00
ines
5bd311c77e
Fix update of norm exceptions
2017-06-03 20:54:09 +02:00
Matthew Honnibal
94e063ae2a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-03 13:31:40 -05:00
Matthew Honnibal
fea1144e6d
Set max batch size in evaluate
2017-06-03 13:31:33 -05:00
Matthew Honnibal
805495af27
Fix off-by-one in number of tags
2017-06-03 13:29:23 -05:00
Matthew Honnibal
e62f46d39f
Clarify gold.pyx slightly
2017-06-03 13:28:52 -05:00
Matthew Honnibal
43353b5413
Improve train CLI script
2017-06-03 13:28:20 -05:00
ines
746653880c
Add English norm exceptions to lex_attrs
2017-06-03 20:27:28 +02:00
ines
095eeeb12f
Update English tokenizer exceptions and add norms
2017-06-03 20:27:16 +02:00
ines
e5d426406a
Add base norm exceptions
2017-06-03 20:27:05 +02:00
ines
4c2bbc3ccc
Add add_lookups util function
2017-06-03 19:44:47 +02:00
ines
05fe6758a7
Set lexeme attributes for tokenizer special cases
2017-06-03 19:44:39 +02:00
ines
3152ee5ca2
Update serialization tests for tokenizer
2017-06-03 17:05:28 +02:00
ines
7c919aeb09
Make sure serializers and deserializers are ordered
2017-06-03 17:05:09 +02:00
ines
1ebd0d3f27
Add assert_packed_msg_equal util function
2017-06-03 17:04:30 +02:00
ines
de974f7bef
Add serializer tests for tokenizer
2017-06-03 13:26:34 +02:00
ines
0153b66a86
Return self in Tokenizer.from_bytes
2017-06-03 13:26:13 +02:00
ines
82154a1861
Add letter spacing to arrow label
2017-06-03 13:25:41 +02:00
ines
32c6f05de9
Adjust spacing and sizing in compact mode
2017-06-03 13:25:32 +02:00
ines
cc8c8617a4
Shut down displaCy server on KeyboardInterrupt
2017-06-03 13:24:56 +02:00
ines
70fbba7d08
Clone Doc to never merge punctuation on original Doc
2017-06-03 13:24:43 +02:00
ines
459a1e8470
Fix whitespace
2017-06-03 11:31:18 +02:00
ines
5109bba910
Port over fix from #1070
2017-06-03 11:31:11 +02:00
ines
d21459f87d
Update serializer tests
2017-06-02 21:42:26 +02:00
ines
6669583f4e
Use OrderedDict
2017-06-02 21:07:56 +02:00
ines
2f1025a94c
Port over Spanish changes from #1096
2017-06-02 19:09:58 +02:00
ines
d86e7cde93
Add entity recognizer to parser serialization tests
2017-06-02 18:40:06 +02:00
ines
0051c05964
Add tests for serializing parser
2017-06-02 18:37:19 +02:00
ines
fdd0923be4
Translate model=True in exclude to lower_model and upper_model
2017-06-02 18:37:07 +02:00
ines
cef547a9f0
Add serialization tests for tensorizer
2017-06-02 18:18:30 +02:00
ines
924c58bde3
Fix serialization of optional elements
2017-06-02 18:18:17 +02:00
ines
f74a45c1fe
Remove unnecessary argument
2017-06-02 18:17:46 +02:00
ines
43b4d63f85
Add serialization tests for tagger
2017-06-02 17:29:34 +02:00
ines
1b593bbd6d
Fix encoding on tagger serialization
2017-06-02 17:29:21 +02:00
Matthew Honnibal
5f4d328e2c
Fix serialization of tag_map in NeuralTagger
2017-06-02 10:18:37 -05:00
Matthew Honnibal
ed6f575e06
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-02 04:26:39 -05:00
ines
acd65c00f6
Add serialization tests for StringStore and Vocab
2017-06-02 10:57:42 +02:00
ines
41a6adf1f6
Initialise Vocab length correctly
2017-06-02 10:57:25 +02:00
ines
53b82f972a
Add strings to Vocab in init, instead of StringStore
2017-06-02 10:57:06 +02:00
ines
023f38bdd4
Fix return value of Vocab.from_bytes
2017-06-02 10:56:40 +02:00
ines
9692c98f57
Add test utils for temp file and temp dir
2017-06-02 10:56:09 +02:00
Matthew Honnibal
c650bc481c
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-01 13:03:57 -05:00
Matthew Honnibal
307d615c5f
Fix serialization for tagger when tag_map has changed
2017-06-01 12:18:36 -05:00
Matthew Honnibal
1d18cedae8
Fiddle with msgpack bytes vs unicode
2017-06-01 10:48:43 -05:00
ines
7a2380f617
Rename "nn_tagger" to "tagger"
2017-06-01 17:37:53 +02:00
ines
e5ae6ccf4e
Fix typo
2017-06-01 16:46:15 +02:00
ines
a3e4f91f4a
Only load vocab if it exists
2017-06-01 14:38:35 +02:00
Matthew Honnibal
d310b0aab3
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-01 04:58:03 -05:00
Matthew Honnibal
3ff7d7fcef
Merge for updated requirements
2017-06-01 04:57:47 -05:00
Matthew Honnibal
5eae3b9a1e
Fix to/from disk in tagger
2017-06-01 04:55:49 -05:00
ines
d5c8d2f5fd
Update about.py and increment version
2017-06-01 11:52:24 +02:00
Matthew Honnibal
4c97371051
Fixes for thinc 6.7
2017-06-01 04:22:16 -05:00
Matthew Honnibal
53d00a0371
Move weight serialization to Thinc
2017-06-01 03:04:36 -05:00
Matthew Honnibal
ae8010b526
Move weight serialization to Thinc
2017-06-01 02:56:12 -05:00
Gyorgy Orosz
f0c3b09242
More robust Hungarian tokenizer.
2017-05-31 22:28:40 +02:00
Matthew Honnibal
c8a58cfcf8
Fix Python2/3 load bug
2017-05-31 15:21:44 -05:00
Matthew Honnibal
99982684b0
Fix normalize_string_keys function
2017-05-31 14:08:16 -05:00
Matthew Honnibal
67ade63fc4
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 08:28:42 -05:00
Matthew Honnibal
490b38e6bb
Fix reference to thinc copy_array util
2017-05-31 08:25:21 -05:00
Matthew Honnibal
9805e0e369
Fix vocab pickling
2017-05-31 08:25:01 -05:00
Matthew Honnibal
6c51cd77b4
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 15:06:56 +02:00
Matthew Honnibal
8dfb9546f0
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 07:21:14 -05:00
Matthew Honnibal
480ef8bfc8
Add compat function to normalize dict keys
2017-05-31 07:14:29 -05:00
Matthew Honnibal
92f9e5cc9a
Silence env_opt, and fix serialization for GPU
2017-05-31 07:14:11 -05:00
Matthew Honnibal
0561df2a9d
Fix tokenizer serialization
2017-05-31 14:12:38 +02:00
Matthew Honnibal
4a398c15b7
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 13:44:16 +02:00
Matthew Honnibal
097ab9c6e4
Fix transition system to/from disk
2017-05-31 13:44:00 +02:00
Matthew Honnibal
b1469d3360
Fix string serialisation
2017-05-31 13:43:44 +02:00
Matthew Honnibal
e9419072e7
Fix tokenizer serialisation
2017-05-31 13:43:31 +02:00
Matthew Honnibal
33e5ec737f
Fix to/from disk methods
2017-05-31 13:43:10 +02:00
ines
5e1c361270
Update tests README with info on model tests
2017-05-31 12:22:58 +02:00
Matthew Honnibal
fe28602f2e
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 11:43:56 +02:00
Matthew Honnibal
66af019d5d
Fix serialization of tokenizer
2017-05-31 11:43:40 +02:00
Ines Montani
e6cf3c7e1c
Merge pull request #1093 from oroszgy/hu_emoji_fix
...
Fixed emoji handling for Hungarian
2017-05-31 11:33:24 +02:00
Matthew Honnibal
e98eff275d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-31 10:29:15 +02:00
Matthew Honnibal
53a3824334
Fix mistake in ner feature
2017-05-31 03:01:02 +02:00
Matthew Honnibal
8a693c2605
Write binary file during training
2017-05-31 02:59:18 +02:00
Matthew Honnibal
498ad85309
Try using tensor for vector/similarity methods
2017-05-30 23:35:17 +02:00
Matthew Honnibal
a131981f3b
Work on vectors
2017-05-30 23:34:50 +02:00
Matthew Honnibal
6937e311a4
Update doc tests
2017-05-30 23:34:23 +02:00
Matthew Honnibal
cc911feab2
Fix bug in NER state
2017-05-30 22:12:19 +02:00
Gyorgy Orosz
8c0b4b850e
Fixed emoji handling for Hungarian
2017-05-30 21:34:46 +02:00
Matthew Honnibal
be4a640f0c
Fix arc eager label costs for uint64
2017-05-30 20:37:58 +02:00
Matthew Honnibal
b127645afc
Fix test_misc merge conflict
2017-05-29 18:31:44 -05:00
Matthew Honnibal
e0e8eae7c7
Tweak package test
2017-05-29 18:30:42 -05:00
Matthew Honnibal
11840ff5dd
Store tag map before normalizing props
2017-05-29 17:53:48 -05:00
Matthew Honnibal
b92a89f87b
Make it easier to reference embedding tables
2017-05-29 17:53:29 -05:00
Matthew Honnibal
293d1b425b
Serialize in consistent order
2017-05-29 17:53:06 -05:00
Matthew Honnibal
9bf22a94aa
Fix tag set serialisation
2017-05-29 17:52:36 -05:00
Matthew Honnibal
2a061e2777
Fix serialisation, for reals this time
2017-05-29 17:52:08 -05:00
ines
20a7003c0d
Update model fixtures and reorganise tests
2017-05-29 22:14:31 +02:00
ines
795fe43a4d
Add load_test_model function with importorskip()
...
Loads model only if it can be imported, i.e. if it's installed as a
package.
2017-05-29 22:11:31 +02:00
ines
ad3c8b3ad9
Fix formatting
2017-05-29 22:10:50 +02:00
ines
6e3937efc5
Check for arguments of model markers to specify models to test
...
Lets user set --models --en for only English models
2017-05-29 22:10:16 +02:00
Matthew Honnibal
35d981241f
Fix model deserialization
2017-05-29 14:46:31 -05:00
Matthew Honnibal
5b29f227ae
Fix serialization
2017-05-29 14:35:53 -05:00
Matthew Honnibal
1e6df0a2a1
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-29 14:30:12 -05:00
ines
08382f21e3
Pass model meta to nlp object in load_model
2017-05-29 20:44:11 +02:00
ines
6145fe6a93
Catch all kwargs on Language
2017-05-29 20:43:48 +02:00
ines
0d7d50fe22
Add __version__ to __init__.py
2017-05-29 20:43:24 +02:00
Matthew Honnibal
6522ea6c8b
More serialization fixes. Still broken
2017-05-29 13:23:47 -05:00
Matthew Honnibal
9c9ee24411
Fix broken lambda scoping in Python 2
2017-05-29 13:23:28 -05:00
Matthew Honnibal
f1acdaab55
Fix serialization of weight offsets
2017-05-29 13:23:11 -05:00
Matthew Honnibal
c044e9c21c
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-29 08:41:02 -05:00
Matthew Honnibal
aa4c33914b
Work on serialization
2017-05-29 08:40:45 -05:00
ines
9e83a17e95
Use new model templates
2017-05-29 15:27:24 +02:00
ines
567485a818
Fix and document model loading with pipeline and overrides
2017-05-29 14:10:10 +02:00
Matthew Honnibal
deac7eb01c
Fix for serialization
2017-05-29 13:54:18 +02:00
Matthew Honnibal
04c32aa091
Fix for serialization
2017-05-29 13:53:32 +02:00
Matthew Honnibal
a1960c2d09
Fix for serialization
2017-05-29 13:47:42 +02:00
Matthew Honnibal
7b06bb896e
Fix for serialization
2017-05-29 13:42:55 +02:00
Matthew Honnibal
74235587ef
Fix to serialization
2017-05-29 13:40:31 +02:00
Matthew Honnibal
59f355d525
Fixes for serialization
2017-05-29 13:38:20 +02:00
Matthew Honnibal
920887f4e4
Specify order of vocab deserialization
2017-05-29 13:04:40 +02:00
Matthew Honnibal
f4aafca222
Merge changes to test_misc
2017-05-29 12:26:02 +02:00
Matthew Honnibal
a318f0cae1
Add to/from disk/bytes methods for tokenizer
2017-05-29 12:24:41 +02:00
Matthew Honnibal
ff26aa6c37
Work on to/from bytes/disk serialization methods
2017-05-29 11:45:45 +02:00
ines
df920ba0e7
Add tests for displaCy and util functions and fix util typo
2017-05-29 10:51:19 +02:00
ines
c5714d4fb2
xfail matcher test for now until setting norm via Span.merge works
2017-05-29 10:51:02 +02:00
Matthew Honnibal
6b019b0540
Update to/from bytes methods
2017-05-29 10:14:20 +02:00
Matthew Honnibal
c91b121aeb
Move serialization functions to util
2017-05-29 10:13:42 +02:00
Matthew Honnibal
1fa2bfb600
Add model_to_bytes and model_from_bytes helpers. Probably belong in thinc.
2017-05-29 09:27:04 +02:00
Matthew Honnibal
6dad4117ad
Work on serialization for models
2017-05-29 01:37:57 +02:00
ines
7b1ddcc04d
Add test for vocab serialization
2017-05-29 01:09:52 +02:00
ines
00b2094dc3
Fix typos, long integers and tests
2017-05-29 01:09:52 +02:00
ines
804dbb8d25
Add StringStore test for API docs
2017-05-29 01:09:52 +02:00
Matthew Honnibal
6cd5730ee7
Fix lex struct setters for strings
2017-05-29 01:05:09 +02:00
Matthew Honnibal
2edd96ce47
Draft Vocab to/from disk/bytes
2017-05-28 23:34:12 +02:00
Matthew Honnibal
4ddff020c3
Fix compile error
2017-05-28 23:30:40 +02:00
Matthew Honnibal
6d3caeadd2
Fix type check for long
2017-05-28 23:22:45 +02:00
Matthew Honnibal
92dbf28c1e
Hack a fixture in the vectors tests, for xfail
2017-05-28 20:28:32 +02:00
Matthew Honnibal
9239f06ed3
Fix German noun chunks iterator
2017-05-28 20:13:03 +02:00
Matthew Honnibal
fd9b6722a9
Fix noun chunks iterator for new stringstore
2017-05-28 20:12:10 +02:00
ines
414193e9ba
Update docs to reflect StringStore changes
2017-05-28 18:19:11 +02:00
Matthew Honnibal
7996d21717
Fixes for new StringStore
2017-05-28 11:09:27 -05:00
Matthew Honnibal
8a24c60c1e
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-28 08:12:05 -05:00
Matthew Honnibal
bc97bc292c
Fix __call__ method
2017-05-28 08:11:58 -05:00
Matthew Honnibal
5cf47b847b
Handle iob with no tag in converter
2017-05-28 08:11:39 -05:00
Matthew Honnibal
fe11564b8e
Finish stringstore change. Also xfail vectors tests
2017-05-28 15:10:22 +02:00
Matthew Honnibal
b007a2b0d3
Update stringstore tests
2017-05-28 14:08:09 +02:00
Matthew Honnibal
84e66ca6d4
WIP on stringstore change. 27 failures
2017-05-28 14:06:40 +02:00
Matthew Honnibal
fe4a746300
Accommodate symbols in new string scheme
2017-05-28 13:03:16 +02:00
Matthew Honnibal
f51e6a6c16
Adjust lexeme sizing for attr_t being 64 bit
2017-05-28 12:51:09 +02:00
Matthew Honnibal
a5606c3eda
Work on changing StringStore to return hashes.
2017-05-28 12:36:27 +02:00
Matthew Honnibal
39293ab2ee
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-28 11:46:57 +02:00
Matthew Honnibal
dd052572d4
Update arc eager for SBD changes
2017-05-28 11:46:51 +02:00
Matthew Honnibal
3ea98e2043
Remove vector member from lexeme
2017-05-28 11:46:24 +02:00
Matthew Honnibal
2445707f3c
Re-delegate vectors to vocab
2017-05-28 11:46:10 +02:00
Matthew Honnibal
6863d01361
Remove vectors from lexeme
2017-05-28 11:45:48 +02:00
Matthew Honnibal
15f6efc127
Remove vectors from vocab
2017-05-28 11:45:32 +02:00
Matthew Honnibal
c1263a844b
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-27 18:32:57 -05:00
Matthew Honnibal
9e711c3476
Divide d_loss by batch size
2017-05-27 18:32:46 -05:00
Matthew Honnibal
b082f76494
Randomize pipeline order during training
2017-05-27 18:32:21 -05:00
Matthew Honnibal
a1d4c97fb7
Improve correctness of minibatching
2017-05-27 17:59:00 -05:00
ines
84189c1cab
Add 'xx' language ID for multi-language support
...
Allows models to specify their language ID as 'xx'.
2017-05-28 00:58:59 +02:00
ines
33e332e67c
Remove unused export
2017-05-28 00:57:59 +02:00
ines
c1983621fb
Update util functions for model loading
2017-05-28 00:22:40 +02:00
ines
c8543c8237
Fix formatting and docstrings and remove deprecated function
2017-05-28 00:22:40 +02:00
Matthew Honnibal
49235017bf
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-27 16:34:28 -05:00
Matthew Honnibal
7ebd26b8aa
Use ordered dict to specify transitions
2017-05-27 15:52:20 -05:00
Matthew Honnibal
3eea5383a1
Add move_names property to parser
2017-05-27 15:51:55 -05:00
Matthew Honnibal
8de9829f09
Don't overwrite model in initialization, when loading
2017-05-27 15:50:40 -05:00
Matthew Honnibal
99316fa631
Use ordered dict to specify actions
2017-05-27 15:50:21 -05:00
Matthew Honnibal
655ca58c16
Clarifying change to StateC.clone
2017-05-27 15:49:37 -05:00
Matthew Honnibal
5e4312feed
Evaluate loaded class, to ensure save/load works
2017-05-27 15:47:02 -05:00
Matthew Honnibal
34bbad8e0e
Add __reduce__ methods on parser subclasses. Fixes pickling.
2017-05-27 15:46:06 -05:00
Matthew Honnibal
7cc9c3e9a6
Fix convert CLI
2017-05-27 15:44:42 -05:00
ines
1203959625
Add pipeline setting to meta.json generator
2017-05-27 20:02:01 +02:00
ines
086a06e7d7
Fix CLI docstrings and add command as first argument
...
Workaround for Plac
2017-05-27 20:01:46 +02:00
ines
a8e58e04ef
Add symbols class to punctuation rules to handle emoji (see #1088)
...
Currently doesn't work for Hungarian, because of conflicts with the
custom punctuation rules. Also doesn't take multi-character emoji like
👩🏽💻 into account.
2017-05-27 17:57:10 +02:00
Matthew Honnibal
dc07d72d80
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-27 08:20:40 -05:00
Matthew Honnibal
de13fe0305
Remove length cap on sentences
2017-05-27 08:20:32 -05:00
Matthew Honnibal
73a643d32a
Don't randomise pipeline for training, and don't update if no gradient
2017-05-27 08:20:13 -05:00
Matthew Honnibal
3d22fcaf0b
Return None from parser if there are no annotations
2017-05-26 14:02:59 -05:00
Matthew Honnibal
d06f235fc9
Fix conflict on convert.py
2017-05-26 11:33:29 -05:00
Matthew Honnibal
2e587c6417
Export iob_to_biluo utility
2017-05-26 11:32:55 -05:00
Matthew Honnibal
2b3b937a04
Fix converter CLI
2017-05-26 11:32:41 -05:00
Matthew Honnibal
5a87bcf35f
Fix converters
2017-05-26 11:32:34 -05:00
Matthew Honnibal
8af3100143
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-26 11:31:41 -05:00
Matthew Honnibal
3d5a536eaa
Improve efficiency of parser batching
2017-05-26 11:31:23 -05:00
Matthew Honnibal
daac3e3573
Always shuffle gold data, and support length cap
2017-05-26 11:30:52 -05:00
Matthew Honnibal
d65f99a720
Improve model saving in train script
2017-05-26 05:52:09 -05:00
ines
51882c4984
Fix formatting
2017-05-26 12:37:45 +02:00
ines
353f0ef8d7
Use disable argument (list) for serialization
2017-05-26 12:33:54 +02:00
Matthew Honnibal
22d7b448a5
Fix convert command
2017-05-25 19:47:12 -05:00
Matthew Honnibal
dbf2a4cf57
Update all models on each epoch
2017-05-25 19:46:56 -05:00
Matthew Honnibal
faff1c23fb
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-25 17:16:10 -05:00
Matthew Honnibal
82b11b0320
Remove print statement
2017-05-25 17:15:59 -05:00
Matthew Honnibal
80cf42e33b
Fix compounding and decaying utils
2017-05-25 17:15:39 -05:00
Matthew Honnibal
df8015f05d
Tweaks to train script
2017-05-25 17:15:24 -05:00
Matthew Honnibal
3a6e59cc53
Add minibatch function in spacy.gold
2017-05-25 17:15:09 -05:00
Matthew Honnibal
702fe74a4d
Clean up spacy.cli.train
2017-05-25 16:16:30 -05:00
Matthew Honnibal
b9cea9cd93
Add compounding and decaying functions
2017-05-25 16:16:10 -05:00
Matthew Honnibal
2cb7cc2db7
Remove commented code from parser
2017-05-25 14:55:09 -05:00
Matthew Honnibal
f403c2cd5f
Add env opts for optimizer
2017-05-25 11:19:26 -05:00
Matthew Honnibal
c245ff6b27
Rebatch parser inputs, with mid-sentence states
2017-05-25 11:18:59 -05:00
Matthew Honnibal
679efe79c8
Make parser update less hacky
2017-05-25 06:49:00 -05:00
Matthew Honnibal
8500d9b1da
Only train one task per iter, holding grads
2017-05-25 06:47:42 -05:00
Matthew Honnibal
b27c587800
Fix pieces argument to PrecomputedMaxout
2017-05-25 06:46:59 -05:00
Matthew Honnibal
e1cb5be0c7
Adjust dropout, depth and multi-task in parser
2017-05-24 20:11:41 -05:00
Matthew Honnibal
e6cc927ab1
Rearrange multi-task learning
2017-05-24 20:10:54 -05:00
Matthew Honnibal
135a13790c
Disable gold preprocessing
2017-05-24 20:10:20 -05:00
Matthew Honnibal
467bbeadb8
Add hidden layers for tagger
2017-05-24 20:09:51 -05:00
ines
66088851dc
Add Doc.to_disk() and Doc.from_disk() methods
2017-05-24 11:58:17 +02:00
Matthew Honnibal
620df0414f
Fix dropout in parser
2017-05-23 15:20:45 -05:00
Matthew Honnibal
5b67bcbee0
Increase default embed size to 7500
2017-05-23 15:20:16 -05:00
Matthew Honnibal
48eef94f92
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 18:47:32 +02:00
Matthew Honnibal
d44b1eafc4
Fix conflict artefacts
2017-05-23 18:47:11 +02:00
Matthew Honnibal
01e59e4e6e
* Add Token.sent_start property, re Issue #235
2017-05-23 18:41:11 +02:00
Matthew Honnibal
4917cbb484
Include sent_start test
2017-05-23 18:40:37 +02:00
Matthew Honnibal
d68dd1f251
Add SENT_START attribute, for custom sentence boundary detection
2017-05-23 18:37:58 +02:00
Matthew Honnibal
8026c183d0
Add hacky logic to accelerate depth=0 case in parser
2017-05-23 11:06:49 -05:00
Matthew Honnibal
e7d3159d91
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 05:58:17 -05:00
Matthew Honnibal
a8b6d11c5b
Support optional maxout layer
2017-05-23 05:58:07 -05:00
Matthew Honnibal
c55b8fa7c5
Fix bugs in parse_batch
2017-05-23 05:57:52 -05:00
ines
fb0ff0272f
xfail neural parser tests for now and remove test for deprecated method
2017-05-23 12:40:37 +02:00
Matthew Honnibal
964707d795
Restore support for deeper networks in parser
2017-05-23 05:31:13 -05:00
Matthew Honnibal
e27262f431
Go back to previous matcher signature, with on_match positional
2017-05-23 04:37:40 -05:00
Matthew Honnibal
5418bcf5d7
Resolve conflict on test
2017-05-23 04:37:16 -05:00
ines
e6acd3bbf2
Fix matcher tests and matcher docs
2017-05-23 11:36:02 +02:00
ines
d0c6d4f76d
Fix formatting
2017-05-23 11:32:00 +02:00
Matthew Honnibal
f0bcc0bd8d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-23 04:29:28 -05:00
Matthew Honnibal
9adfe9e8fc
Don't hold gradient updates in language -- let the parser decide how to batch the updates.
2017-05-23 04:29:10 -05:00
Matthew Honnibal
6b918cc58e
Support making updates periodically during training
2017-05-23 04:23:29 -05:00
Matthew Honnibal
3f725ff7b3
Roll back changes to parser update
2017-05-23 04:23:05 -05:00
Matthew Honnibal
3959d778ac
Revert "Revert "WIP on improving parser efficiency""
...
This reverts commit 532afef4a8.
2017-05-23 03:06:53 -05:00
Matthew Honnibal
532afef4a8
Revert "WIP on improving parser efficiency"
...
This reverts commit bdaac7ab44.
2017-05-23 03:05:25 -05:00
Matthew Honnibal
bdaac7ab44
WIP on improving parser efficiency
2017-05-23 02:59:31 -05:00
Matthew Honnibal
8a9e318deb
Put the parsing loop in a nogil prange block
2017-05-22 17:58:12 -05:00
ines
a23f487b06
Tidy up displaCy and add "manual" option
...
Also don't require title in EntityRenderer
2017-05-22 18:48:20 +02:00
Matthew Honnibal
0264447c4d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 10:41:56 -05:00
Matthew Honnibal
6e8dce2c05
Fix train command line args
2017-05-22 10:41:39 -05:00
Matthew Honnibal
a7ee63c0ac
Fix labeller loss for unseen labels
2017-05-22 10:41:20 -05:00
Matthew Honnibal
c9760b2104
Support sentence limits in GoldCorpus
2017-05-22 10:40:46 -05:00
Matthew Honnibal
e2136232f9
Exclude states with no matching gold annotations from parsing
2017-05-22 10:30:12 -05:00
Matthew Honnibal
83ffd16474
Fix offset calculation for other negative values
2017-05-22 08:00:53 -05:00
ines
b3c7ee0148
Fix tests and use the new Matcher API
2017-05-22 13:54:20 +02:00
Matthew Honnibal
f00f821496
Fix pseudoprojectivity->nonproj
2017-05-22 06:14:42 -05:00
Matthew Honnibal
ae8cf70dc1
Fix CLI train signature
2017-05-22 06:13:39 -05:00
Matthew Honnibal
187f370734
Update tests for matcher changes
2017-05-22 12:59:50 +02:00
Matthew Honnibal
5d59e74cf6
PseudoProjectivity->nonproj
2017-05-22 05:49:53 -05:00
Matthew Honnibal
7e2cdc0c81
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 12:39:34 +02:00
Matthew Honnibal
70a8c531cd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 05:39:18 -05:00
Matthew Honnibal
2f78413a02
PseudoProjectivity->nonproj
2017-05-22 05:39:03 -05:00
Matthew Honnibal
89ebc5c3cd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-22 12:38:15 +02:00
Matthew Honnibal
d8bb5bb959
Implement StringStore serialization, and update tests
2017-05-22 12:38:00 +02:00
ines
54f04a9fe0
Update API docs with changes in spacy.gold and spacy.language
2017-05-22 12:29:30 +02:00
ines
b5fb43fdd8
Allow sys.exit status as exits keyword arg in util.prints()
2017-05-22 12:29:15 +02:00
ines
fc3ec733ea
Reduce complexity in CLI
...
Remove now redundant model command and move plac annotations to cli
files
2017-05-22 12:28:58 +02:00
Matthew Honnibal
b45b4aa392
PseudoProjectivity --> nonproj
2017-05-22 05:17:44 -05:00
Matthew Honnibal
aae97f00e9
Fix nonproj import
2017-05-22 05:15:06 -05:00
Matthew Honnibal
9262fc4829
Fix syntax error
2017-05-22 05:14:59 -05:00
Matthew Honnibal
93a042253b
Make GoldParse attributes writeable
2017-05-22 04:51:08 -05:00
Matthew Honnibal
2a5eb9f61e
Make nonproj methods top-level functions, instead of class methods
2017-05-22 04:51:08 -05:00
Matthew Honnibal
c998776c25
Make single array for features, to reduce GPU copies
2017-05-22 04:51:08 -05:00
Matthew Honnibal
bc2294d7f1
Add support for fiddly hyper-parameters to train func
2017-05-22 04:51:08 -05:00
Matthew Honnibal
80e19a2399
Simplify CLI implementation for subcommands. Remove model command.
2017-05-22 04:51:08 -05:00
Matthew Honnibal
33e2222839
Remove unused code in deprojectivize
2017-05-22 04:51:08 -05:00
Matthew Honnibal
4e0988605a
Pass through non-projective=True
2017-05-22 04:51:08 -05:00
Matthew Honnibal
025d9bbc37
Fix handling of non-projective deps
2017-05-22 04:51:08 -05:00
Matthew Honnibal
5738d373d5
Add deprojectivize to pipeline
2017-05-22 04:51:08 -05:00
Matthew Honnibal
1b5fa68996
Do pseudo-projective pre-processing for parser
2017-05-22 04:51:08 -05:00
Matthew Honnibal
1d5d9838a2
Fix action collection for parser
2017-05-22 04:51:08 -05:00
Matthew Honnibal
8d1e64be69
Add experimental NeuralLabeller
2017-05-22 04:51:08 -05:00
Matthew Honnibal
9b1b0742fd
Fix prediction for tok2vec
2017-05-22 04:51:08 -05:00
Matthew Honnibal
f13d6c7359
Support gold preprocessing and single gold files
2017-05-22 04:51:08 -05:00
Matthew Honnibal
e14533757b
Use averaged params for evaluation
2017-05-22 04:51:08 -05:00
Matthew Honnibal
7811d97339
Refactor CLI
2017-05-22 04:51:08 -05:00
Matthew Honnibal
5db89053aa
Merge docstrings
2017-05-21 13:46:23 -05:00
Matthew Honnibal
432b3499b3
Fix memory leak
2017-05-21 13:38:46 -05:00
Matthew Honnibal
59fbfb3829
Remove train.py -- functions now in GoldCorpus and Language
2017-05-21 09:08:27 -05:00
Matthew Honnibal
8904814c0e
Add missing import
2017-05-21 09:07:56 -05:00
Matthew Honnibal
baf3ef0ddc
Remove import of removed train_config script
2017-05-21 09:07:34 -05:00
Matthew Honnibal
4c9202249d
Refactor training, to fix memory leak
2017-05-21 09:07:06 -05:00
Matthew Honnibal
4803b3b69e
Add GoldCorpus class, to manage data streaming
2017-05-21 09:06:17 -05:00
Matthew Honnibal
180e5afede
Fix tokvecs flattening in pipeline
2017-05-21 09:05:34 -05:00
Matthew Honnibal
0731971bfc
Add itershuffle utility function. Maybe belongs in thinc
2017-05-21 09:05:05 -05:00
ines
2c5cfe8bbf
Update docstrings and API docs for StringStore
2017-05-21 14:18:58 +02:00
ines
251346b59f
Fix typos and formatting
2017-05-21 14:18:46 +02:00
ines
075f5ff87a
Update docstrings and API docs for GoldParse
2017-05-21 13:53:46 +02:00
ines
99b631617d
Reformat docstrings
2017-05-21 13:32:15 +02:00
ines
885e82c9b0
Update docstrings and remove deprecated load classmethod
2017-05-21 13:27:52 +02:00
ines
c5a653fa48
Update docstrings and API docs for Tokenizer
2017-05-21 13:18:14 +02:00
ines
f216422ac5
Remove deprecated load classmethod
2017-05-21 13:18:01 +02:00
ines
d82ae9a585
Change "function" to "callable" in docs
2017-05-21 13:17:40 +02:00
ines
3871157d84
Update spacy.util documentation
2017-05-21 01:12:09 +02:00
ines
0c6c65aa3c
Improve messaging if model linking fails after download
2017-05-21 00:28:37 +02:00
Matthew Honnibal
3b7c108246
Pass tokvecs through as a list, instead of concatenated. Also fix padding
2017-05-20 13:23:32 -05:00
ines
924e8506de
Move Defaults subclass to module scope (necessary for pickling)
2017-05-20 19:02:27 +02:00
Matthew Honnibal
d52b65aec2
Revert "Move to contiguous buffer for token_ids and d_vectors"
...
This reverts commit 3ff8c35a79.
2017-05-20 11:26:23 -05:00
ines
27de0834b2
Update docstrings and API docs for Lexeme
2017-05-20 15:13:42 +02:00
ines
7ed8a92ed1
Update docstrings and API docs for Token
2017-05-20 15:13:33 +02:00
ines
4ed6a36622
Update docstrings and API docs for Matcher
2017-05-20 14:43:10 +02:00
ines
39f36539f6
Update docstrings and API docs for Matcher
2017-05-20 14:32:34 +02:00
ines
c00ff257be
Update docstrings and API docs for Matcher
2017-05-20 14:26:10 +02:00
ines
790435e51c
Update docstrings
2017-05-20 14:05:07 +02:00
ines
f0cc642bb9
Update docstrings and API docs for Vocab
2017-05-20 14:00:41 +02:00
Matthew Honnibal
ce9234f593
Update Matcher API
2017-05-20 13:54:53 +02:00
Matthew Honnibal
b272890a8c
Try to move parser to simpler PrecomputedAffine class. Currently broken -- maybe the previous change
2017-05-20 06:40:10 -05:00
ines
e39ad78267
Resolve model name properly in cli.info
...
Use util.resolve_model_path() to also allow package names and paths.
2017-05-20 12:24:40 +02:00
Matthew Honnibal
3ff8c35a79
Move to contiguous buffer for token_ids and d_vectors
2017-05-20 04:17:30 -05:00
Matthew Honnibal
8b04b0af9f
Remove freqs from transition_system
2017-05-20 02:20:48 -05:00
Matthew Honnibal
61fe55efba
Move EnglishDefaults class out of English
2017-05-20 02:18:19 -05:00
Matthew Honnibal
a1ba20e2b1
Fix over-run on parse_batch
2017-05-19 18:57:30 -05:00
ines
1d4d3d0ecd
Add TODO
2017-05-20 01:38:04 +02:00
Matthew Honnibal
7ee1827af0
Disable data caching in parser
2017-05-19 18:17:11 -05:00
Matthew Honnibal
e84de028b5
Remove 'rebatch' op, and remove min-batch cap
2017-05-19 18:16:36 -05:00
Matthew Honnibal
3376d4d6e8
Update the train script, fixing GPU memory leak
2017-05-19 18:15:50 -05:00
Matthew Honnibal
836fe1d880
Update neural net tests
2017-05-19 18:11:29 -05:00
ines
fe5d8819ea
Update Matcher docstrings and API docs
2017-05-19 21:47:06 +02:00
Matthew Honnibal
08766240c3
Add incomplete iob converter
2017-05-19 13:27:51 -05:00
Matthew Honnibal
c12ab47a56
Remove state argument in pipeline. Other changes
2017-05-19 13:26:36 -05:00
Matthew Honnibal
66ea9aebe7
Remove the state argument from Language
2017-05-19 13:25:42 -05:00
Matthew Honnibal
09a877886b
WIP on iob converter
2017-05-19 13:24:39 -05:00
ines
a804045597
Use is_ancestor instead of deprecated is_ancestor_of
2017-05-19 20:23:40 +02:00
Matthew Honnibal
8d5e6d9f4f
Rename no_ner arg to no_entities
2017-05-19 13:23:11 -05:00
ines
e9e62b01b0
Update docstrings and API docs for Token
2017-05-19 18:47:56 +02:00
ines
62ceec4fc6
Update docstrings and API docs for Span
2017-05-19 18:47:46 +02:00
ines
23f9a3ccc8
Update docstrings and API docs for Doc
2017-05-19 18:47:39 +02:00
ines
2c8c9dc0c9
Update docstrings and API docs for Language
2017-05-19 18:47:24 +02:00
ines
0791f0aae6
Update docstrings and API docs for Span class
2017-05-19 00:31:31 +02:00
ines
8455cb1327
Update docstring for Doc.__getitem__
2017-05-19 00:30:51 +02:00
ines
0fc05e54e4
Document TokenVectorEncoder
2017-05-19 00:00:02 +02:00
ines
b687ad109d
Update docstrings and API docs for Doc class
2017-05-18 23:59:44 +02:00
ines
d42bc16868
Update docstrings and API docs for Language class
2017-05-18 23:57:38 +02:00
ines
593361ee3c
Update docstrings for Span class
2017-05-18 22:17:41 +02:00
ines
b87066ff10
Update docstrings and API docs for Doc class
2017-05-18 22:17:41 +02:00
Matthew Honnibal
238be0f16a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-18 08:32:22 -05:00
Matthew Honnibal
c214c0decb
Improve env_opt reporting
2017-05-18 08:32:03 -05:00
Matthew Honnibal
bbb59e371c
Fix GPU evaluation
2017-05-18 08:31:15 -05:00
Matthew Honnibal
c2c825127a
Fix use_params and pipe methods
2017-05-18 08:30:59 -05:00
Matthew Honnibal
ca70b08661
Fix GPU training and evaluation
2017-05-18 08:30:33 -05:00
ines
489d2fb4ba
Add is_in_jupyter() helper for displaCy (see #1058)
2017-05-18 14:13:14 +02:00
ines
abf0188b0a
Move cupy and CudaStream to compat
2017-05-18 14:12:45 +02:00
ines
33decd85b6
Reorganise and explicitly state what's importable
2017-05-18 14:12:31 +02:00
Matthew Honnibal
a438cef8c5
Fix significant bug in feature calculation -- off by 1
2017-05-18 06:21:32 -05:00
Matthew Honnibal
fc8d3a112c
Add util.env_opt support: Can set hyper params through environment variables.
2017-05-18 04:36:53 -05:00
Matthew Honnibal
d2626fdb45
Fix name error in nn parser
2017-05-18 04:31:01 -05:00
Matthew Honnibal
b460533827
Bug fixes to pipeline
2017-05-18 04:29:51 -05:00
Matthew Honnibal
8815507f8e
Move SpanishDefaults out of Language class, for pickle
2017-05-18 04:28:51 -05:00
Matthew Honnibal
2713041571
Fix GPU usage in Language
2017-05-18 04:25:19 -05:00
Matthew Honnibal
711ad5edc4
Cache features in doc2feats
2017-05-18 04:22:20 -05:00
Matthew Honnibal
39ea38c4b1
Add option to use gpu to spacy train
2017-05-18 04:21:49 -05:00
Matthew Honnibal
a1d8e420b5
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-17 08:00:04 -05:00
Matthew Honnibal
edfea3a513
Fix progress bar
2017-05-17 14:59:37 +02:00
Matthew Honnibal
0b7fd67408
Fix style check in displacy
2017-05-17 07:57:24 -05:00
Matthew Honnibal
55dab77de8
Add conversion rule for .conll
2017-05-17 13:13:48 +02:00
Matthew Honnibal
692bd2a186
Bug fix to tagger: wasn't backpropagating to token vectors
2017-05-17 13:13:14 +02:00
Matthew Honnibal
877f83807f
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-17 12:09:29 +02:00
Matthew Honnibal
793430aa7a
Get spaCy train command working with neural network
...
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Matthew Honnibal
3bf4a28d8d
Use tag in CoNLL converter, not POS
2017-05-17 12:04:33 +02:00
ines
1a05078c79
Add language-specific syntax iterators to en and de
2017-05-17 12:04:03 +02:00
Matthew Honnibal
c9a5d5d24b
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-16 16:22:05 +02:00
Matthew Honnibal
8cf097ca88
Redesign training to integrate NN components
...
* Obsolete .parser, .entity etc names in favour of .pipeline
* Components no longer create models on initialization
* Models created by loading method (from_disk(), from_bytes() etc), or
.begin_training()
* Add .predict(), .set_annotations() methods in components
* Pass state through pipeline, to allow components to share information
more flexibly.
2017-05-16 16:17:30 +02:00
Matthew Honnibal
221b4c1ee8
Fix test for Python 3
2017-05-16 13:06:30 +02:00
Matthew Honnibal
5211645af3
Get data flowing through pipeline. Needs redesign
2017-05-16 11:21:59 +02:00
Matthew Honnibal
1d7c18e58a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-15 21:53:47 +02:00
Matthew Honnibal
a9edb3aa1d
Improve integration of NN parser, to support unified training API
2017-05-15 21:53:27 +02:00
ines
98354be150
Only get user_data if it exists on doc
2017-05-15 13:39:47 +02:00
ines
c33bdeb564
Use uppercase for entity types
2017-05-15 01:24:57 +02:00
ines
4aaa607b8d
Add xmlns:xlink so SVGs are rendered properly as individual files
2017-05-14 19:54:13 +02:00
ines
9dd13cd76a
Update docstrings
2017-05-14 19:30:47 +02:00
ines
a04550605a
Add Jupyter notebook support (see #1058)
2017-05-14 18:39:01 +02:00
ines
c31792aaec
Add displaCy visualisers (see #1058)
2017-05-14 17:50:23 +02:00
ines
b462076d80
Merge load_lang_class and get_lang_class
2017-05-14 01:31:10 +02:00
ines
36bebe7164
Update docstrings
2017-05-14 01:30:29 +02:00
Matthew Honnibal
4b9d69f428
Merge branch 'v2' into develop
...
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module
Currently the two parsers live side-by-side, until we figure out how to
organize them.
2017-05-14 01:10:23 +02:00
Matthew Honnibal
5cac951a16
Move new parser to nn_parser.pyx, and restore old parser, to make tests pass.
2017-05-14 00:55:01 +02:00
Matthew Honnibal
f8c02b4341
Remove cupy imports from parser, so it can work on CPU
2017-05-14 00:37:53 +02:00
Matthew Honnibal
613ba79e2e
Fiddle with sizings for parser
2017-05-13 17:20:23 -05:00
Matthew Honnibal
e6d71e1778
Small fixes to parser
2017-05-13 17:19:04 -05:00
Matthew Honnibal
188c0f6949
Clean up unused import
2017-05-13 17:18:27 -05:00
Matthew Honnibal
f85c8464f7
Draft support of regression loss in parser
2017-05-13 17:17:27 -05:00
ines
1694c24e52
Add docstrings, error messages and fix consistency
2017-05-13 21:22:49 +02:00
ines
ee7dcf65c9
Fix expand_exc to make sure it returns combined dict
2017-05-13 21:22:25 +02:00
ines
824d09bb74
Move resolve_load_name to deprecated
2017-05-13 21:21:47 +02:00
ines
a4a37a783e
Remove import from non-existing module
2017-05-13 16:00:09 +02:00
ines
5858857a78
Update languages list in conftest
2017-05-13 15:37:54 +02:00
ines
9d85cda8e4
Fix models error message and use about.__docs_models__ (see #1051)
2017-05-13 13:05:47 +02:00
ines
6b942763f0
Tidy up imports
2017-05-13 13:04:40 +02:00
ines
8c2a0c026d
Fix parse_tree test
2017-05-13 12:32:45 +02:00
ines
6129016e15
Replace deepcopy
2017-05-13 12:32:37 +02:00
ines
df68bf45ce
Set defaults for light and flat kwargs
2017-05-13 12:32:23 +02:00
ines
b9dea345e5
Remove old import
2017-05-13 12:32:11 +02:00
ines
293ee359c5
Fix formatting
2017-05-13 12:32:06 +02:00
ines
4eefb288e3
Port over PR #1055
2017-05-13 03:25:32 +02:00
Matthew Honnibal
ee1d35bdb0
Fix merge conflict
2017-05-13 03:20:19 +02:00
Matthew Honnibal
b2540d2379
Merge Kengz's tree_print patch
2017-05-13 03:18:49 +02:00
Matthew Honnibal
827b5af697
Update draft of parser neural network model
...
Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.
Outline of the model:
We first predict context-sensitive vectors for each word in the input:
(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4
This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features.
To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this
by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a
representation that's one affine transform from this informative lexical information. This is obviously good for the
parser (which backprops to the convolutions too).
The parser model makes a state vector by concatenating the vector representations for its context tokens. Current
results suggest that few context tokens work well. Maybe this is a bug.
The current context tokens:
* S0, S1, S2: Top three words on the stack
* B0, B1: First two words of the buffer
* S0L1, S0L2: Leftmost and second leftmost children of S0
* S0R1, S0R2: Rightmost and second rightmost children of S0
* S1L1, S1L2, S1R2, S1R, B0L1, B0L2: Likewise for S1 and B0
This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately,
there's a way to structure the computation to save some expense (and make it more GPU friendly).
The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks
with a non-monotonic transition). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications
for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden
weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN
-- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model
is so big.)
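
The pre-computation trick described above can be checked on a toy example. This is only a sketch under stated assumptions, not the code from this commit: the sizes, the random integer weights, and the names `naive_hidden`, `fast_hidden` and `precomputed` are all hypothetical, and it ignores missing context tokens (for example an empty stack). It shows that multiplying a concatenated 13*T state vector by a (13*T, H) matrix gives the same result as summing per-slot products that were computed once per token.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

random.seed(0)
N, T, H, F = 5, 4, 3, 13  # tokens, token width, hidden width, feature slots

# Token vectors (stand-ins for the CNN output) and one (T, H) weight block per slot.
# Integers keep the comparison below exact.
tokens = [[random.randint(-3, 3) for _ in range(T)] for _ in range(N)]
W = [[[random.randint(-3, 3) for _ in range(H)] for _ in range(T)] for _ in range(F)]

def naive_hidden(context):
    """Concatenate the F context vectors, then do one (1, F*T) @ (F*T, H) product."""
    state = [x for idx in context for x in tokens[idx]]
    stacked = [row for block in W for row in block]
    return matmul([state], stacked)[0]

# Done once per sentence: every token multiplied by every slot's weight block.
precomputed = [[matmul([tok], W[f])[0] for f in range(F)] for tok in tokens]

def fast_hidden(context):
    """Sum the precomputed rows picked out by the context; no matmul per state."""
    out = [0] * H
    for f, idx in enumerate(context):
        for h in range(H):
            out[h] += precomputed[idx][f][h]
    return out

context = [random.randrange(N) for _ in range(F)]  # token index for each slot
print(naive_hidden(context) == fast_hidden(context))  # prints True
```

The per-state work drops from a full matrix product to F lookups and additions, which is why the parsing loop itself can run cheaply on the CPU.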
This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity.
The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved
to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier.
We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle
in CUDA to train.
Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to
be 0 cost. This is defined as:
(exp(score) / Z) - (exp(score) / gZ)
Where Z is the sum of exp(score) over all classes and gZ is the same sum restricted to the gold classes. I'm very interested in regressing on the cost directly,
but so far this isn't working well.
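
As a sketch of the gradient above, assuming the loss is -log(gZ / Z) with Z summed over exp(score) for all classes and gZ over the gold (zero-cost) classes only. The function name and the handling of non-gold classes (their second term is zero, which follows from differentiating that loss) are assumptions for illustration, not taken from the commit.

```python
import math

def multilabel_log_loss_grad(scores, is_gold):
    """Gradient of -log(gZ / Z) with respect to each score.

    `is_gold` marks the zero-cost classes; at least one must be True.
    """
    m = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    gZ = sum(e for e, g in zip(exps, is_gold) if g)
    # exp(score)/Z for every class, minus exp(score)/gZ for gold classes only
    return [e / Z - (e / gZ if g else 0.0) for e, g in zip(exps, is_gold)]

grad = multilabel_log_loss_grad([2.0, 1.0, 0.5], [True, True, False])
print(grad)
```

With a single gold class this reduces to the usual softmax-minus-one-hot gradient, and the entries always sum to zero.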
Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit
greatly from the pre-computation trick.
2017-05-12 16:09:15 -05:00
ines
c4857bc7db
Remove unused argument
2017-05-12 15:37:54 +02:00
ines
c13b3fa052
Add LEX_ATTRS
2017-05-12 15:37:45 +02:00
ines
bca2ea9c72
Update Portuguese lexical attributes
2017-05-12 15:37:39 +02:00
ines
2f870123bf
Fix formatting
2017-05-12 15:37:20 +02:00
ines
ca65993d59
Add basic Polish Language class
2017-05-12 09:25:37 +02:00
ines
48177c4f92
Add missing tokenizer exceptions
2017-05-12 09:25:24 +02:00
ines
bb8be3d194
Add Danish language data
2017-05-10 21:15:12 +02:00
Matthew Honnibal
4efb391994
Fix serializer
2017-05-09 18:45:18 +02:00
Matthew Honnibal
b16ae75824
Remove serializer hacks from pipeline classes
2017-05-09 18:16:40 +02:00
Matthew Honnibal
7253b4e649
Remove old serialization tests
2017-05-09 18:12:58 +02:00
Matthew Honnibal
f9327343ce
Start updating serializer test
2017-05-09 18:12:03 +02:00
Matthew Honnibal
1166b0c491
Implement Doc.to_bytes and Doc.from_bytes methods
2017-05-09 18:11:34 +02:00
Matthew Honnibal
9e167b7bb6
Strip serializer from code
2017-05-09 17:28:50 +02:00
Matthew Honnibal
b53f7dfdc3
Remove spacy.serialize
2017-05-09 17:22:06 +02:00
Matthew Honnibal
62ecdea9f2
Add binder class for document serialization
2017-05-09 17:21:00 +02:00
ines
a0b00624bb
Make sure like_email returns bool
2017-05-09 11:37:29 +02:00
ines
ea60932e1b
Fix formatting
2017-05-09 11:08:14 +02:00
ines
2c3bdd09b1
Add English test for like_num
2017-05-09 11:06:34 +02:00
ines
22375eafb0
Fix and merge attrs and lex_attrs tests
2017-05-09 11:06:25 +02:00
ines
02d0ac5cab
Remove redundant function and fix formatting
2017-05-09 11:06:04 +02:00
ines
b5ca50607e
Reorganise entity rules
2017-05-09 01:37:10 +02:00
ines
564939391a
Remove spacy.orth
2017-05-09 01:21:47 +02:00
ines
12c3d5fbba
Fix formatting
2017-05-09 01:15:28 +02:00
ines
2829a024ef
Re-add basic like_num check to global lex_attrs
2017-05-09 01:15:23 +02:00
ines
88adeee548
Add English lex_attrs overrides
2017-05-09 01:09:52 +02:00
ines
8f3fbbb147
Fix typos
2017-05-09 01:09:37 +02:00
ines
ea5fa46475
Import LEX_ATTRS from lang.lex_attrs
2017-05-09 00:58:10 +02:00
ines
2216e5f326
Reorganise lex_attrs and add dict
2017-05-09 00:57:54 +02:00
ines
e666f14d20
Add global lex_attrs
2017-05-09 00:41:53 +02:00
ines
41972c43fe
Use consistent regex imports
2017-05-09 00:34:31 +02:00
ines
7b83977020
Remove unused munge package
2017-05-09 00:16:16 +02:00
ines
c714841cc8
Move language-specific tests to tests/lang
2017-05-09 00:02:37 +02:00
ines
bd57b611cc
Update conftest to lazy load languages
2017-05-09 00:02:21 +02:00
ines
9f0fd5963f
Reorganise Hungarian punctuation rules
2017-05-09 00:01:59 +02:00
ines
fc0d793360
Reorganise Bengali punctuation rules
2017-05-09 00:01:52 +02:00
ines
e895d1afd7
Reorganise French punctuation rules
2017-05-09 00:00:54 +02:00
ines
014bda0ae3
Reorganise global punctuation rules
2017-05-09 00:00:46 +02:00
ines
a91278cb32
Rename _URL_PATTERN to URL_PATTERN
2017-05-09 00:00:00 +02:00
ines
604f299cf6
Add char classes to global language data
2017-05-08 23:59:33 +02:00
ines
f6f5d78cb9
Fix formatting
2017-05-08 23:59:17 +02:00
ines
6eb6306843
Fix language data imports
2017-05-08 23:58:31 +02:00
ines
3c0f85de8e
Remove imports in /lang/__init__.py
2017-05-08 23:58:07 +02:00
ines
86d9c29f30
Reorder util functions
2017-05-08 23:51:15 +02:00
ines
9a0d2fdef1
Add load_lang_class() util function
2017-05-08 23:50:45 +02:00
ines
614aa09582
Tidy up Bengali tokenizer exceptions
2017-05-08 22:29:49 +02:00
ines
73b577cb01
Fix relative imports
2017-05-08 22:29:04 +02:00
ines
ae99990f63
Fix formatting
2017-05-08 22:23:48 +02:00
ines
f46ffe3e89
Move language data to /lang module
2017-05-08 20:00:40 +02:00
ines
41a322c733
Fix LEMMA in exceptions and morph rules
2017-05-08 19:57:36 +02:00
ines
2edc0aee12
Update warning message
2017-05-08 19:53:36 +02:00
ines
6025cdb992
Fix string interpolation in times
2017-05-08 16:38:16 +02:00
ines
b9ba58ba5c
Add function to resolve load name
...
Warn if old 'path' keyword argument is used.
2017-05-08 16:33:37 +02:00
ines
e6f1a5d0a1
Add unicode declaration
2017-05-08 16:22:17 +02:00
ines
be5541bd16
Fix import and tokenizer exceptions
2017-05-08 16:20:14 +02:00
ines
2324788970
Remove bad tests
2017-05-08 16:15:27 +02:00
ines
b88c4193e7
Add missing symbol
2017-05-08 16:15:20 +02:00
ines
9a5b2bdd4c
Don't set morph rules without tag map
2017-05-08 16:15:12 +02:00
ines
4930f0fa8f
Explicitly import TOKEN_MATCH
2017-05-08 16:11:54 +02:00
ines
50b7ec03ca
Fix typo
2017-05-08 16:11:45 +02:00
ines
3ca611fe48
Fix wildcard imports
2017-05-08 15:56:29 +02:00
ines
c2469b8135
Remove __all__ export
2017-05-08 15:56:22 +02:00
ines
14a9c3ee7a
Fix wildcard import
2017-05-08 15:56:13 +02:00
ines
deed623864
Remove comment
2017-05-08 15:56:05 +02:00
ines
e7f95c37ee
Merge base tokenizer exceptions
2017-05-08 15:55:52 +02:00
ines
24606d364c
Remove redundant language_data.py files in languages
...
Originally intended to collect all components of a language, but just
made things messy. Now each component is in charge of exporting itself
properly.
2017-05-08 15:55:29 +02:00
ines
a627d3e3b0
Reorganise Chinese language data
2017-05-08 15:54:36 +02:00
ines
7b86ee093a
Reorganise Swedish language data
2017-05-08 15:54:29 +02:00
ines
50510fa947
Reorganise Portuguese language data
2017-05-08 15:52:01 +02:00
ines
279895ea83
Reorganise Dutch language data
2017-05-08 15:51:39 +02:00
ines
04ef5025bd
Reorganise Norwegian language data
2017-05-08 15:51:22 +02:00
ines
5edbc725d8
Reorganise Japanese language data
2017-05-08 15:50:46 +02:00
ines
51a389d3bb
Reorganise Italian language data
2017-05-08 15:50:17 +02:00
ines
1bbfa14436
Reorganise Hungarian language data
2017-05-08 15:49:56 +02:00
ines
a77c9fc60d
Reorganise Hebrew language data
2017-05-08 15:49:28 +02:00
ines
7f05e977fa
Reorganise French language data
2017-05-08 15:49:05 +02:00
ines
0207ffdd52
Reorganise Finnish language data
2017-05-08 15:48:31 +02:00
ines
8e483ec950
Reorganise Spanish language data
2017-05-08 15:48:04 +02:00
ines
c7c21b980f
Reorganise English language data
2017-05-08 15:47:25 +02:00
ines
1bf9d5ec8b
Reorganise German language data
2017-05-08 15:44:26 +02:00
ines
7b3a983f96
Reorganise Bengali language data
2017-05-08 15:43:50 +02:00
ines
607ba458e7
Fix whitespace
2017-05-08 15:42:31 +02:00
ines
60db497525
Add update_exc and expand_exc to util
...
Doesn't require separate language data util anymore
2017-05-08 15:42:12 +02:00
Matthew Honnibal
b44f7e259c
Clean up unused parser code
2017-05-08 15:42:04 +02:00
ines
6e5bd4f228
Remove unused functions from deprecated
2017-05-08 15:40:16 +02:00
Matthew Honnibal
17efb1c001
Change width
2017-05-08 08:40:13 -05:00
ines
f68e420bc0
Add PRON_LEMMA and DET_LEMMA to deprecated
...
Will be replaced with proper values across the language data later.
2017-05-08 15:35:30 +02:00
ines
bd6a7cf4f6
Simplify deprecated model downloading
...
Only relevant for spaCy < v1.7.0.
2017-05-08 15:32:10 +02:00
ines
95edd9e896
Let parse_package_meta take full path
2017-05-08 15:30:48 +02:00
ines
326746eb15
Add util function to resolve arg to model path
...
1. check if in data dir or shortcut link
2. check if installed as a pip package
3. check if string is path to model
4. check if Path or Path-like object
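
The four checks above could be sketched roughly as follows. This is a hypothetical illustration, not the util function added in this commit: the name `resolve_model_path`, its `data_dir` parameter, and the use of `importlib.util.find_spec` to detect an installed package are all assumptions.

```python
import importlib.util
import os
from pathlib import Path

def resolve_model_path(name, data_dir):
    """Hypothetical resolver that tries the four checks in order."""
    data_dir = Path(data_dir)
    if isinstance(name, str):
        # 1. A model directory or shortcut link inside the data directory
        if (data_dir / name).exists():
            return data_dir / name
        # 2. An installed package with that name
        if name.isidentifier():
            try:
                spec = importlib.util.find_spec(name)
            except (ImportError, ValueError):
                spec = None
            if spec is not None and spec.submodule_search_locations:
                return Path(next(iter(spec.submodule_search_locations)))
        # 3. A string that is itself a path to a model
        if Path(name).exists():
            return Path(name)
    # 4. A Path or other path-like object
    elif isinstance(name, os.PathLike) and Path(name).exists():
        return Path(name)
    raise IOError("Could not resolve model: {!r}".format(name))
```

The order matters in this sketch: a shortcut link in the data directory wins over an installed package of the same name.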
2017-05-08 15:29:47 +02:00
Matthew Honnibal
bef89ef23d
Mergery
2017-05-08 08:29:36 -05:00
ines
a7801e7342
Update spacy.load()
...
path argument is now deprecated and name can either take a model name
or path. Implement lazy loading by importing module and read Language
class name off __all__.
2017-05-08 15:27:25 +02:00
Matthew Honnibal
50ddc9fc45
Fix infinite loop bug
2017-05-08 07:54:26 -05:00
Matthew Honnibal
94e86ae00a
Predict tags with encoder
2017-05-08 07:53:45 -05:00
Matthew Honnibal
56073a11ef
Don't use tags when calculating token vectors
2017-05-08 07:52:24 -05:00
Matthew Honnibal
a66a4a4d0f
Replace einsums
2017-05-08 14:46:50 +02:00
Matthew Honnibal
8d2eab74da
Use PretrainableMaxouts
2017-05-08 14:24:55 +02:00
Matthew Honnibal
807cb2e370
Add PretrainableMaxouts
2017-05-08 14:24:43 +02:00
Matthew Honnibal
2e2268a442
Precomputable hidden now working
2017-05-08 11:36:37 +02:00
ines
94697e9afc
Fix typo
2017-05-08 02:00:37 +02:00
ines
0ee2a22b67
Merge branch 'pr/1024' into develop
2017-05-08 01:12:44 +02:00
ines
c4492d260a
Fix kwargs
2017-05-08 01:05:24 +02:00
Matthew Honnibal
10682d35ab
Get pre-computed version working
2017-05-08 00:38:35 +02:00
ines
b5a726c5cd
Tidy up deprecated.py
2017-05-07 23:29:22 +02:00
ines
59c3b9d4dd
Tidy up CLI and fix print functions
2017-05-07 23:25:29 +02:00
ines
311704674d
Add path2str compat function
2017-05-07 23:24:56 +02:00
ines
e34069db9f
Move is_package and get_model_package_path to util
2017-05-07 23:24:51 +02:00
ines
957ba676b4
Add model files base path to about.py
2017-05-07 23:22:35 +02:00
ines
8d8dd9ceb2
Don't set default value for model
2017-05-07 23:22:21 +02:00
Matthew Honnibal
35458987e8
Checkpoint -- nearly finished reimpl
2017-05-07 23:05:01 +02:00
Matthew Honnibal
4441866f55
Checkpoint -- nearly finished reimpl
2017-05-07 22:47:06 +02:00
Matthew Honnibal
6782eedf9b
Tmp GPU code
2017-05-07 11:04:24 -05:00
Matthew Honnibal
e420e5a809
Tmp
2017-05-07 07:31:09 -05:00
Matthew Honnibal
12039e80ca
Switch to single matmul for state layer
2017-05-07 14:26:34 +02:00
Matthew Honnibal
700979fb3c
CPU/GPU compat
2017-05-07 04:01:11 +02:00
Matthew Honnibal
f99f5b75dc
working residual net
2017-05-07 03:57:26 +02:00
Matthew Honnibal
bdf2dba9fb
WIP on refactor, with hidden pre-computing
2017-05-07 02:02:43 +02:00
Matthew Honnibal
b439e04f8d
Learning smoothly
2017-05-06 20:38:12 +02:00
Matthew Honnibal
08bee76790
Learns things
2017-05-06 18:24:38 +02:00
Matthew Honnibal
04ae1c01f1
Learns things
2017-05-06 18:21:02 +02:00
Matthew Honnibal
bcf4cd0a5f
Learns things
2017-05-06 17:37:36 +02:00
Matthew Honnibal
8e48b58cd6
Gradients look correct
2017-05-06 16:47:15 +02:00
Matthew Honnibal
7e04260d38
Data running through, likely errors in model
2017-05-06 14:22:20 +02:00
Matthew Honnibal
fa7c1990b6
Restore tok2vec function
2017-05-05 20:12:03 +02:00
Matthew Honnibal
efe9630e1c
Bug fixes
2017-05-05 20:09:50 +02:00
Matthew Honnibal
ef4fa594aa
Draft of NN parser, to be tested
2017-05-05 19:20:39 +02:00
Matthew Honnibal
7d1df50aec
Draft up Parser model
2017-05-04 13:31:40 +02:00
Matthew Honnibal
ccaf26206b
Pseudocode for parser
2017-05-04 12:17:59 +02:00
ines
b1f22c5a10
Fix formatting
2017-05-03 20:11:02 +02:00
ines
a04b5be1b2
Add glossary for annotation scheme (closes #1034)
...
Can be imported as explain from spacy.glossary, or called as
spacy.explain(term)
2017-05-03 17:02:17 +02:00
Gregory Howard
929f2792a7
Renaming cls in module. cls is now a class
2017-05-03 15:41:07 +02:00
Gregory Howard
0e8c41ea4f
Adding method lemmatizer for every class
2017-05-03 12:14:42 +02:00
Gregory Howard
32ca07989e
adding export japanese
2017-05-03 11:07:29 +02:00
Grégory Howard
f9d7144224
Merge branch 'master' into master
2017-05-03 11:04:51 +02:00
Gregory Howard
f2ab7d77b4
Lazy imports language
2017-05-03 11:01:42 +02:00
Ines Montani
3ea23a3f4d
Fix formatting
2017-05-03 09:44:38 +02:00
Ines Montani
d730eb0c0d
Raise custom ImportError if importing janome fails
2017-05-03 09:43:29 +02:00
Ines Montani
949ad6594b
Add newline
2017-05-03 09:38:43 +02:00
Ines Montani
d12ca587ea
Add newline
2017-05-03 09:38:29 +02:00
Ines Montani
8676cd0135
Add newline
2017-05-03 09:38:07 +02:00
Yasuaki Uechi
c8f83aeb87
Add basic japanese support
2017-05-03 13:56:21 +09:00
Gregory Howard
c0afcd22bb
Merge remote-tracking branch 'remotes/upstream/master'
2017-04-27 14:42:54 +02:00
Matthew Honnibal
31ec9e1371
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-27 13:21:39 +02:00
Matthew Honnibal
2da16adcc2
Add dropout option for parser and NER
...
Dropout can now be specified in the `Parser.update()` method via
the `drop` keyword argument, e.g.
nlp.entity.update(doc, gold, drop=0.4)
This will randomly drop 40% of features, and multiply the value of the
others by 1. / 0.4. This may be useful for generalising from small data
sets.
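
To illustrate the idea of feature dropout in general terms, here is a minimal sketch. It is not the parser's implementation: `dropout_features` is a hypothetical helper, and it uses the conventional "inverted dropout" rescaling of 1 / (1 - drop) for the surviving values, which is an assumption and differs from the 1. / 0.4 factor stated in the message above.

```python
import random

def dropout_features(values, drop, rng=random):
    """Zero each value with probability `drop` and rescale the survivors.

    Assumption: survivors are divided by (1 - drop), so the expected value
    of each feature is unchanged.
    """
    if not 0.0 <= drop < 1.0:
        raise ValueError("drop must be in [0, 1)")
    keep = 1.0 - drop
    return [0.0 if rng.random() < drop else v / keep for v in values]

rng = random.Random(0)
dropped = dropout_features([1.0] * 10, drop=0.4, rng=rng)
print(dropped)
```

Passing `drop=0.0` returns the values unchanged, which matches the behaviour expected at prediction time.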
This commit also patches the examples/training/train_new_entity_type.py
example, to use dropout and fix the output (previously it did not output
the learned entity).
2017-04-27 13:18:39 +02:00
Gregory Howard
92f368f83b
Removing extra spaces
2017-04-27 12:02:14 +02:00
Gregory Howard
13b6957c8e
Adding unit test for tokenization in French (with title)
2017-04-27 11:53:44 +02:00
Gregory Howard
8ff4682255
correcting tokenizer exception.
...
Adding tests for lemmatization
2017-04-27 11:52:14 +02:00
Ines Montani
7da9cefd25
Merge pull request #1022 from luvogels/master
...
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Ines Montani
c9e592ae6c
Add newline
2017-04-27 11:15:41 +02:00
Ines Montani
5942adccc2
Add newline
2017-04-27 11:15:19 +02:00
Ines Montani
4cd9269aef
Add newline
2017-04-27 11:15:04 +02:00
Ines Montani
ccf13ecc21
Add newline
2017-04-27 11:14:42 +02:00
Ines Montani
03d2b0cc05
Add newline
2017-04-27 11:14:26 +02:00
Gregory Howard
44cb486849
Adding unit test for tokenization in French (with title)
2017-04-27 10:59:38 +02:00
Gregory Howard
ad8129cb45
Improvement of rules: now title-insensitive and with the same declaration format
2017-04-27 10:23:56 +02:00
luvogels
d12a0b6431
Hooked up tokenizer tests
2017-04-26 23:21:41 +02:00
Matthew Honnibal
f0e1606d27
Increment version
2017-04-26 20:25:41 +02:00
luvogels
b331929a7e
Merge branch 'master' of https://github.com/luvogels/spaCy
2017-04-26 19:15:48 +02:00
luvogels
8de59ce3b9
Added tokenizer tests
2017-04-26 19:10:18 +02:00
Matthew Honnibal
4d98511db7
Make Span hashable. Closes #1019
2017-04-26 19:01:05 +02:00
Matthew Honnibal
24c4c51f13
Try to make test999 less flakey
2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang
460094bf09
Update __init__.py
2017-04-26 18:27:55 +02:00
ines
527d51ac9a
Fetch shortcuts from GitHub and improve error handling
2017-04-26 18:00:28 +02:00
Gregory Howard
ed5f094451
Adding insensitive lemmatisation test
2017-04-25 18:07:02 +02:00
ghoward
26e31afc18
Renaming tests
2017-04-25 17:46:01 +02:00
ghoward
c085c2d391
Adding some unit tests
2017-04-25 17:44:16 +02:00
ghoward
55c6910f90
Look_up table for languages in spacy.
...
Need to find another name for lemmatizerlookup. I was not inspired.
Trying to use new files in fr language.
2017-04-24 16:39:00 +02:00
Matthew Honnibal
c4be9c36fe
Fix unicode header in tests
2017-04-24 10:09:01 +02:00
Matthew Honnibal
65f10b53e5
Fix test
2017-04-24 00:25:55 +02:00
Matthew Honnibal
70a43858e1
Fix flakey test
2017-04-24 00:06:30 +02:00
Matthew Honnibal
3973af2d15
Make training test less flakey
2017-04-23 22:59:34 +02:00
Matthew Honnibal
4f9657b42b
Fix reporting if no dev data with train
2017-04-23 22:27:10 +02:00
Matthew Honnibal
df2ac8b843
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-23 21:25:07 +02:00
Matthew Honnibal
d0e19267e8
Create directory if missing in save_to_directory
2017-04-23 21:24:43 +02:00
ines
42305bc519
Remove unnecessary test
2017-04-23 21:21:41 +02:00
ines
012ea594d1
Add file for misc tests
2017-04-23 21:06:51 +02:00
ines
83f66947dc
Rename test_download to test_cli
2017-04-23 21:06:50 +02:00
ines
401045433c
Simplify compat.fix_text
2017-04-23 21:06:50 +02:00
Matthew Honnibal
e033c86a64
Increment version
2017-04-23 21:03:43 +02:00
Matthew Honnibal
d2436dc17b
Update fix for Issue #999
2017-04-23 18:14:37 +02:00
Matthew Honnibal
874a3cbb07
Add test for Issue #955
2017-04-23 17:57:01 +02:00
Matthew Honnibal
60703cede5
Ensure noun chunks can't be nested. Closes #955
2017-04-23 17:56:39 +02:00
Matthew Honnibal
c9ec24b257
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-23 17:07:46 +02:00
Matthew Honnibal
5d8af40445
Add test for Issue #999
2017-04-23 17:06:30 +02:00
Matthew Honnibal
4d2a659c52
Fix json dump for Python3
2017-04-23 17:05:53 +02:00
Matthew Honnibal
040751ad17
Remove xfail on Test #910
2017-04-23 16:28:55 +02:00
ines
3a9710f356
Pass dev_scores to print_progress correctly (resolves #1008)
...
Only read scores attribute if command is used with dev_data, otherwise
default dev_scores to empty dict.
2017-04-23 15:58:40 +02:00
Matthew Honnibal
1b12f342e4
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-20 17:03:11 +02:00
Matthew Honnibal
4eef200bab
Persist the actions within spacy.parser.cfg
2017-04-20 17:02:44 +02:00
ines
25c70b4cc5
Move fix_text to spacy.compat (see #1002)
2017-04-20 15:47:17 +02:00
Ines Montani
60b5243bee
Merge pull request #1002 from oroszgy/model_cli_fix
...
Fixes for the `model` CLI
2017-04-20 15:41:03 +02:00
Gyorgy Orosz
4a06a2572c
Using ftfy for handling broken encoded strings.
2017-04-20 13:34:51 +02:00
Ines Montani
3800b29046
Merge pull request #1001 from recognai/master
...
Add SPACE to es tag map
2017-04-20 12:16:34 +02:00
oeg
f0bcd0babb
fix(model): Add SPACE to es tag_map. Fixing error in morphology.pyx when SP tag is missing
2017-04-20 11:36:24 +02:00
Ben Eyal
e90e8a3f10
Enable test
2017-04-20 02:25:24 +03:00
Ben Eyal
33af52599e
Redefine alphabetic characters
...
For caseless languages (Hebrew, Bengali) all characters are both lowercase and uppercase.
2017-04-20 02:25:02 +03:00
Ben Eyal
d8098a8be2
Use regex instead of re
2017-04-20 02:22:52 +03:00
oeg
daaa42dd25
Merge remote-tracking branch 'upstream/master'
2017-04-19 23:30:36 +02:00
oeg
936a297241
fix(model): Fix tag map for fixing issues with tag SPACE
2017-04-19 23:30:21 +02:00
luvogels
c7cec7e5e2
Update __init__.py
2017-04-19 21:06:30 +02:00
luvogels
55e8cade36
Update __init__.py
2017-04-19 21:06:30 +02:00
luvogels
03abd0c8e6
Update __init__.py
2017-04-19 21:06:30 +02:00
Leif Uwe Vogelsang
538a8d6b12
Resolved merge conflict by incorporating both suggestions.
2017-04-19 21:06:07 +02:00
Leif Uwe Vogelsang
e821c48489
Norwegian language basics
2017-04-19 21:04:01 +02:00
Leif Uwe Vogelsang
3796c668d9
more norwegian
2017-04-19 21:01:32 +02:00
Leif Uwe Vogelsang
bc9557b21f
Norwegian language basics
2017-04-19 21:00:01 +02:00
ines
2bd89e7ade
Tidy up Hebrew tests and test for punctuation (see #995)
2017-04-19 19:28:03 +02:00
ines
48da244058
Use spacy.compat.json_dumps for Python 2/3 compatibility (resolves #991)
2017-04-19 11:50:36 +02:00
ines
ddd5194088
Update Language docs and docstrings
2017-04-17 01:52:13 +02:00
ines
f62b740961
Use compat.json_dumps
2017-04-17 01:46:14 +02:00
ines
8e83f8e2fa
Update docstrings
2017-04-17 01:40:26 +02:00
ines
e2299dc389
Ensure path in save_to_directory
2017-04-17 01:40:14 +02:00
ines
82f5f1f98f
Replace str with compat.unicode_
2017-04-17 01:29:54 +02:00
ines
16a8521efa
Increment version
2017-04-16 22:38:38 +02:00
Matthew Honnibal
4efd6fb9d6
Fix training
2017-04-16 15:28:27 -05:00
Matthew Honnibal
17c9fffb9e
Fix naked except
2017-04-16 15:28:16 -05:00
ines
5610fdcc06
Get language name first if no model path exists
...
Makes sure spaCy fails early if no tokenizer exists, and allows
printing better error message.
2017-04-16 22:16:47 +02:00
ines
ad168ba88c
Set model name to empty string if path override exists
...
Required for parse_package_meta, which composes path of data_path and
model_name (needs to be fixed in the future)
2017-04-16 22:15:51 +02:00
ines
97647c46cd
Add docstring and todo note
2017-04-16 22:14:45 +02:00
ines
5c5f8c0a72
Check if full string is found in lang classes first
...
This allows users to set arbitrary strings. (Otherwise, custom lang
class "my_custom_class" would always load Burmese "my" tokenizer if one
was available.)
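
A minimal sketch of this lookup order, assuming a simple dict registry and a fallback that takes the part of the name before the first underscore. The registry contents, the string stand-ins for classes, and the function signature are hypothetical and only illustrate why the full-string check has to come first.

```python
# Hypothetical registry; strings stand in for Language classes.
LANGUAGES = {"en": "English", "my": "Burmese"}

def get_lang_class(name, registry=LANGUAGES):
    """Look up the full name first, then fall back to its language prefix."""
    if name in registry:
        return registry[name]
    prefix = name.split("_", 1)[0]
    if prefix in registry:
        return registry[prefix]
    raise KeyError("No language class found for {!r}".format(name))

custom = dict(LANGUAGES, my_custom_class="MyCustomClass")
print(get_lang_class("my_custom_class", custom))  # prints MyCustomClass
print(get_lang_class("en_core", custom))          # prints English
```

Without the first check, "my_custom_class" would fall through to the prefix "my" and resolve to the Burmese entry.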
2017-04-16 22:14:38 +02:00
ines
13d30b6c01
xfail lemmatizer test that's causing problems (see #546)
2017-04-16 21:18:39 +02:00
Matthew Honnibal
4931c56afc
Increment version
2017-04-16 13:59:38 -05:00
ines
6145b7c153
Remove redundant Path
2017-04-16 20:53:25 +02:00
Matthew Honnibal
fa89613444
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-16 13:42:56 -05:00
ines
1f9f867c70
Remove unused util function
2017-04-16 20:37:45 +02:00
ines
7670c745b6
Update spacy.load() and fix path checks
2017-04-16 20:37:45 +02:00
ines
d3759dfb32
Fix docstring
2017-04-16 20:37:45 +02:00
ines
ed7e19ad68
Remove unused import
2017-04-16 20:37:45 +02:00
ines
0084466a66
Remove unused utf8open util and replace os.path with ensure_path
2017-04-16 20:37:45 +02:00
Matthew Honnibal
89a4f262fc
Fix training methods
2017-04-16 13:00:37 -05:00
Matthew Honnibal
6a4221a6de
Allow lemma to be set from Python. Re #973
2017-04-16 18:07:53 +02:00
Matthew Honnibal
137b210bcf
Restore use of FTRL training
2017-04-16 18:02:42 +02:00
ines
d10bd0eaf9
Fix formatting
2017-04-16 13:42:34 +02:00
ines
8191e33cf1
Update link error message with info on permissions
2017-04-16 13:32:31 +02:00
ines
a3ddbc0444
Add note about --force flag to error message
2017-04-16 13:14:36 +02:00
ines
e3de035814
Add meta validation to check for required settings
...
Complain if no "lang", "name" or "version" is found (those settings are
used in directory / package names). Package will still build without,
but it'll inevitably fail somewhere down the line.
2017-04-16 13:13:17 +02:00
ines
a7574b7572
Add more options to read in meta data in package command
...
Add meta option to supply path to meta.json. If no meta path is set,
check if meta.json exists in input directory and use it. Otherwise,
prompt for details on the command line.
2017-04-16 13:06:02 +02:00
ines
13c8a42d2b
Fix typos
2017-04-16 13:03:58 +02:00
ines
31fa73293a
Move read_json out to own util function
2017-04-16 13:03:28 +02:00
Matthew Honnibal
45464d065e
Remove print statement
2017-04-15 16:11:43 +02:00
Matthew Honnibal
c76cb8af35
Fix training for new labels
2017-04-15 16:11:26 +02:00
Matthew Honnibal
4884b2c113
Refix StepwiseState
2017-04-15 16:00:28 +02:00
Matthew Honnibal
e6ee7e130f
Fix parse package meta
2017-04-15 13:38:53 +02:00
Matthew Honnibal
1a98e48b8e
Fix StepwiseState
2017-04-15 13:35:01 +02:00
ines
0739ae7b76
Tidy up and fix formatting and imports
2017-04-15 13:05:15 +02:00
ines
fefe6684cd
Fix symlink function to check for Windows
2017-04-15 12:17:27 +02:00
ines
35fb4febe2
Fix whitespace
2017-04-15 12:13:45 +02:00
ines
e1efd589c3
Fix json imports and use ujson
2017-04-15 12:13:34 +02:00
ines
958b12dec8
Use pathlib instead of os.path
2017-04-15 12:13:00 +02:00
ines
956dc36785
Move functions to deprecated
2017-04-15 12:12:31 +02:00
ines
c05ec4b89a
Add compat functions and remove old workarounds
...
Add ensure_path util function to handle checking instance of path
2017-04-15 12:11:16 +02:00
ines
26445ee304
Add compat module for Python2/3 and platform compatibility
2017-04-15 12:07:02 +02:00
ines
d24589aa72
Clean up imports, unused code, whitespace, docstrings
2017-04-15 12:05:47 +02:00
ines
561f2a3eb4
Use consistent formatting for docstrings
2017-04-15 11:59:21 +02:00
Matthew Honnibal
d13f0a7017
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-04-14 23:54:57 +02:00
Matthew Honnibal
354458484c
WIP on add_label bug during NER training
...
Currently when a new label is introduced to NER during training,
it causes the labels to be read in in an unexpected order. This
invalidates the model.
2017-04-14 23:52:17 +02:00
Matthew Honnibal
33ba5066eb
Refactor Language.end_training, making new save_to_directory method
2017-04-14 23:51:24 +02:00
ines
84341c2975
Only compile list of models if data_path exists
2017-04-14 16:48:02 +02:00
Gyorgy Orosz
dd3244c08a
Made json dump to produce unicode strings in py2
2017-04-13 23:30:47 +02:00
Gyorgy Orosz
a9469c8173
Fixed typo
2017-04-13 15:24:14 +02:00
ines
41037f0f07
Remove unused imports
2017-04-13 13:52:11 +02:00
ines
1b92c8d5d5
Use unicode paths on Windows/Python 2 and catch other errors (resolves #970)
...
try/except here is quite dirty, but it'll at least make sure users see
an error message that explains what's going on
2017-04-10 17:49:51 +02:00
Matthew Honnibal
49e2de900e
Add costs property to StepwiseState, to show which moves are gold.
2017-04-10 11:37:04 +02:00
Matthew Honnibal
e26577b202
Increment version
2017-04-07 18:45:06 +02:00
Matthew Honnibal
40bf7ecf27
Increment version
2017-04-07 18:44:20 +02:00
Matthew Honnibal
1dca7eeb03
Add unicode declaration on new regression test
2017-04-07 18:09:23 +02:00
ines
887827fc6a
Merge branch 'develop'
2017-04-07 17:36:23 +02:00
ines
444dd511c5
Fix xpassing URL test case
2017-04-07 17:36:05 +02:00
ines
bf0f15e762
Add / to tokenizer infixes (resolves #891)
2017-04-07 17:30:44 +02:00
ines
00b9011a49
Fix whitespace
2017-04-07 17:29:59 +02:00
ines
f9869e4dc5
Merge branch 'master' into develop
2017-04-07 17:23:40 +02:00
Matthew Honnibal
4a6204dbad
Merge remote-tracking branch 'origin/develop'
2017-04-07 17:20:09 +02:00
Matthew Honnibal
0513c43bf0
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-07 17:07:10 +02:00
Matthew Honnibal
cc36c308f4
Fix noun_chunk rules around coordination
...
Closes #693.
2017-04-07 17:06:40 +02:00
Matthew Honnibal
ab846256cf
Merge pull request #966 from recognai/master
...
Prepare Spanish language for training models, including configuration, rich-UD tag map and tests
2017-04-07 16:12:29 +02:00
Matthew Honnibal
83dca920d4
Rename test #913 -> #957, comment
...
Make test for #957 reference correct bug. Add comment.
Previous commit closes #957.
2017-04-07 15:54:25 +02:00
Matthew Honnibal
be204ed714
Merge branch 'master' of https://github.com/explosion/spaCy
2017-04-07 15:50:14 +02:00
Matthew Honnibal
e7b1ee9efd
Switch to regex module for URL identification
...
The URL detection regex was failing on input such as 0.1.2.3, as this
input triggered excessive back-tracking in the builtin re module.
The solution was to switch to the regex module, which behaves better.
Closes #913.
2017-04-07 15:47:36 +02:00
Matthew Honnibal
5887383fc0
Add test for Issue #913: Hang from bad regex
2017-04-07 15:47:27 +02:00
ines
7ea1673072
Fix whitespace
2017-04-07 13:28:48 +02:00
ines
255650dbc2
Add conllu2json converter from explosion/spacy-dev-resources/#11
2017-04-07 13:05:12 +02:00
ines
789ce8a45e
Add convert command
2017-04-07 13:04:17 +02:00
ines
9952d3b08a
Fix whitespace
2017-04-07 13:02:05 +02:00
ines
47ddce6eb7
Remove unused variable
2017-04-07 13:01:48 +02:00
ines
dcf8ab0c47
Merge branch 'develop'
2017-04-07 12:00:09 +02:00
ines
75f9b4c6e2
Fix whitespace
2017-04-07 10:22:18 +02:00
oeg
c693d40791
feature(model): Add support for creating the Spanish model, including rich tagset, configuration, and basic tests
2017-04-06 18:48:45 +02:00
oeg
010293fb2f
fix(typo): Fixes typo in method calling PseudoProjectivity.deprojectivize, failing with new train cli
2017-04-06 17:33:15 +02:00
ines
808cd6cf7f
Add missing tags to verbs (resolves #948)
2017-04-03 18:12:52 +02:00
ines
ad8bf1829f
Import and combine Portuguese tokenizer exceptions (see #943)
2017-04-01 10:37:42 +02:00
Ines Montani
f8b2d9c3b7
Merge pull request #943 from mamoit/master
...
Portuguese improvements
2017-04-01 10:32:00 +02:00
ines
3b667a24d4
Remove whitespace
2017-04-01 10:21:08 +02:00
ines
e71a1f4bd0
Fix download commands in error messages (see #946)
2017-04-01 10:20:57 +02:00
ines
42382d5692
Fix download commands in error messages (see #946)
2017-04-01 10:19:32 +02:00
ines
d4a59c254b
Remove whitespace
2017-04-01 10:19:01 +02:00
Matthew Honnibal
51882ee2b8
Fix check for setting ent_id in merge
2017-03-31 19:32:01 +02:00
Miguel Almeida
4fde64c4ea
Portuguese contractions and some abbreviations
2017-03-31 15:52:55 +01:00
Miguel Almeida
465b240bcb
Review Portuguese stop words
...
Mainly to review typos and add missing masculines/feminines
2017-03-31 13:00:47 +01:00
Matthew Honnibal
fc3900e5b2
Allow ent_id to be set in Token
2017-03-31 14:00:14 +02:00
Matthew Honnibal
9720103428
Improve attribute handling in doc.merge(). Still unsatisfying
2017-03-31 13:59:58 +02:00
Matthew Honnibal
cfff4e0f61
Improve test
2017-03-31 13:59:32 +02:00
Matthew Honnibal
1bb7b4ca71
Add comment
2017-03-31 13:59:19 +02:00
Matthew Honnibal
725249c59a
Add merge_phrase callback in matcher.pyx
2017-03-31 13:58:59 +02:00
Matthew Honnibal
e854f28304
Add test for Issue #758
...
Issue #758 occurs when no actions are available for a single token
doc after merging.
2017-03-31 13:26:25 +02:00
Miguel Almeida
c1d020b0a6
Remove "ista" from Portuguese stop words
2017-03-31 12:26:13 +01:00
Miguel Almeida
17a1e7a119
Add Portuguese numbers and ordinals
2017-03-31 12:21:01 +01:00
Matthew Honnibal
47a3ef06a6
Unhack deprojectivization, moving it into pipeline
...
Previously the deprojectivize() call was attached to the transition
system, and only called for German. Instead it should be a separate
process, called after the parser. This makes it available for any
language. Closes #898.
2017-03-31 12:31:50 +02:00
Joshua Reeter
564daf6dec
Issue #934 symlink should not convert paths as_posix under windows.
2017-03-30 23:47:45 -05:00
Bruno P. Kinoshita
c2d48974bc
Fix typos in Portuguese stop words
2017-03-30 21:59:18 +13:00
Matthew Honnibal
0fefdfcbda
Merge pull request #935 from ericzhao28/master
...
Add option to use label=ent_type in doc.merge arguments (Bug fix for issue #862)
2017-03-30 02:51:24 +02:00
ines
4759fd437d
Merge branch 'master' into develop
2017-03-29 10:37:13 +02:00
ines
7e4befec88
Add Hebrew to init and setup.py
2017-03-29 10:34:57 +02:00
Grégory Howard
9c2996b27f
correction of package.py (encoding on open instead of write)
2017-03-29 09:11:02 +02:00
Eric Zhao
aafdf6ffb8
Add option to use label kwarg to determine ent_type in doc.merge
2017-03-28 23:35:03 -07:00
ines
7198cf1c8a
Remove unused import
2017-03-26 20:56:05 +02:00
ines
7ceaa1614b
Add experimental model init command
2017-03-26 20:51:40 +02:00
Matthew Honnibal
83ba6c247c
Fix init of Language without model
2017-03-26 16:46:00 +02:00
Matthew Honnibal
fa107f95f6
Remove unused train_config command
2017-03-26 09:28:59 -05:00
Matthew Honnibal
df83921f0a
Increment version
2017-03-26 09:27:32 -05:00
Matthew Honnibal
92ac3af21d
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-26 09:26:59 -05:00
Matthew Honnibal
a9b1f23c7d
Enable regression loss for parser
2017-03-26 09:26:30 -05:00
ines
c00d997924
Merge branch 'develop'
2017-03-26 15:57:00 +02:00
Matthew Honnibal
2efdbc08ff
Make training work with directories
2017-03-26 08:46:44 -05:00
ines
007a2492bd
Remove train_config command for now
2017-03-26 15:40:50 +02:00
ines
b297fab062
Update error message for missing commands
2017-03-26 15:40:02 +02:00
ines
7f95023fc0
Fix formatting
2017-03-26 15:37:37 +02:00
ines
5901c8f7f0
Update spacy train CLI documentation
2017-03-26 15:33:48 +02:00
Matthew Honnibal
9dcb58aaaf
Merge CLI changes
2017-03-26 07:30:45 -05:00
Matthew Honnibal
6b7f7a2060
Connect parser L1 option to train CLI
2017-03-26 07:24:07 -05:00
Matthew Honnibal
ed2b106f4d
Fix circular import in lemmatizer
2017-03-26 07:17:07 -05:00
Matthew Honnibal
dec5571bf3
Update train CLI
2017-03-26 07:16:52 -05:00
ines
53cf2f1c0e
Make dev data optional
2017-03-26 11:48:17 +02:00
Matthew Honnibal
5eac089fbe
Merge branch 'master' into develop
2017-03-26 04:45:43 -05:00
ines
0fc56e2544
Update flag and defaults
2017-03-26 11:42:11 +02:00
Matthew Honnibal
2f63806ddb
Update config when adding label. Re #910
2017-03-25 22:35:44 +01:00
Matthew Honnibal
b94286de30
Fix regression test
2017-03-25 22:35:07 +01:00
Matthew Honnibal
c748907a66
Fix errors in previous commit
2017-03-25 22:25:01 +01:00
Matthew Honnibal
4f400fa486
Prevent lemmatization of base nouns
...
Update lemmatizer's base-form check, for change in morphology class.
Closes #903.
2017-03-25 21:51:12 +01:00
Matthew Honnibal
850d35dcb3
Make morphology use int attributes internally
...
The morphology class was calling the lemmatizer inconsistently,
with some string-valued attributes. This caused Issue #903.
2017-03-25 21:49:10 +01:00
Matthew Honnibal
4454c1b23f
Block lemmatization of base-form adjectives
...
Fixes check that an adjective is a base form (as opposed to a
comparative or superlative), so that it's not lemmatized.
e.g. inner -!> inn. Closes #912.
2017-03-25 21:29:57 +01:00
ines
97814f8da6
Update Windows Python 2 link workaround to use helper functions
2017-03-25 14:04:27 +01:00
ines
fdec758113
Add is_windows and is_python2 utility functions
2017-03-25 14:04:02 +01:00
Ines Montani
09837158e4
Merge pull request #921 from solresol/master
...
Possible solution to #909
2017-03-25 13:51:55 +01:00
Greg Baker
b7f714b498
Possible solution to #909
2017-03-25 21:36:38 +11:00
Ines Montani
97cb4d5e3c
Merge branch 'master' into master
2017-03-25 10:03:47 +01:00
Iddo Berger
da135bd823
add hebrew tokenizer
2017-03-24 18:27:44 +03:00
Matthew Honnibal
f40fbc3710
Add test for Issue #910 : Resuming entity training
2017-03-23 23:38:57 +01:00
Matthew Honnibal
9c9cd99144
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-23 11:11:24 +01:00
ines
0035fd9efe
Add spacy train work in progress
2017-03-23 11:08:41 +01:00
ines
d5ebf583a4
Fix formatting
2017-03-23 11:08:30 +01:00
ines
3f20efe165
Merge branch 'develop'
...
# Conflicts:
# spacy/util.py
2017-03-22 17:14:15 +01:00
Ines Montani
f86a3a92d5
Merge pull request #899 from raphael0202/duplicate_keys
...
Remove duplicate keys in [en|fi] language data dicts
2017-03-22 10:20:11 +01:00
Ines Montani
87a2c85e1b
Merge pull request #900 from raphael0202/unused_imports
...
Remove unused import statements
2017-03-22 10:10:43 +01:00
ines
ce065e5d65
Fix imports
2017-03-22 10:02:14 +01:00
Andrew Poliakov
07199c3e8b
Fix infinite recursion in spacy.info
2017-03-22 11:43:22 +03:00
Raphaël Bournhonesque
f332bf05be
Remove unused import statements
2017-03-21 21:08:54 +01:00
ines
c3a9f73896
Fix writing to file
2017-03-21 12:35:22 +01:00
ines
d74aa428ad
Fix path
2017-03-21 12:26:00 +01:00
ines
83a999ea83
Change default license from MIT to CC
2017-03-21 12:24:43 +01:00
ines
ae46647560
Fix brackets
2017-03-21 12:21:42 +01:00
ines
3e134b5b2b
Make sure paths in copytree and rmtree are strings
2017-03-21 12:15:33 +01:00
ines
cf0094187e
Fetch MANIFEST.in from GitHub as well
2017-03-21 11:32:38 +01:00
ines
09b24bc5a9
Add docs for package command
2017-03-21 11:19:21 +01:00
ines
3f4e3fda1d
Update command and fetch file templates from GitHub
...
While feature is still experimental, this allows files to be modified
without having to ship a new version of spaCy.
2017-03-21 11:17:36 +01:00
ines
5230ed5b98
Move directory check and overwriting/creating dirs to own function
2017-03-21 02:06:53 +01:00
ines
46bc3c36b0
Fix typo
2017-03-21 02:06:37 +01:00
ines
64e38f304e
Only import shutil
2017-03-21 02:06:29 +01:00
ines
448a916d0d
Add --force option to override directory
2017-03-21 02:05:34 +01:00
ines
8eb9a2b355
Fix formatting
2017-03-21 02:05:14 +01:00
ines
b2bcdec0f6
Update docstring
2017-03-20 22:50:55 +01:00
ines
bf240132d7
Add cli.package command to build model packages
2017-03-20 22:50:13 +01:00
ines
a54e3c2efe
Remove empty line
2017-03-20 22:49:36 +01:00
ines
5aea327a5b
Add util function to get raw user input
2017-03-20 22:48:56 +01:00
ines
a6c0361803
Handle raw_input vs input in Python 2 and 3
2017-03-20 22:48:32 +01:00
ines
adbcac6591
Fix spacing
2017-03-20 22:48:21 +01:00
Matthew Honnibal
692eb0603d
Fix high memory usage in download command
...
Due to PyPi issue #2984, installing large packages via pip causes
a large spike in memory usage. The recommended fix is to disable
caching.
2017-03-20 18:24:44 +01:00
ines
f830213c4c
Remove compatibility check test
...
Will only cause problems when incrementing version and not updating
table. Also depends on external URL, which is bad.
2017-03-20 13:20:26 +01:00
Matthew Honnibal
f314d3d044
Increment version
2017-03-20 12:58:24 +01:00
Matthew Honnibal
b487b8735a
Decrease beam density, and fix Python 3 problem in beam
2017-03-20 12:56:05 +01:00
Ines Montani
b6ee241e26
Fix print statements
2017-03-20 11:46:37 +01:00
ines
b8f8d5d8bf
Make sure model_path is a Posix path
...
Otherwise, formatting the success message with model_path.as_posix()
fails when using a local path for linking (linking still works, but the
error message is confusing)
2017-03-19 11:57:13 +01:00
ines
fe0ff00fe1
Fix spacing
2017-03-19 11:55:37 +01:00
ines
5712da6095
Add regression test for #891
2017-03-19 11:48:01 +01:00
Raphaël Bournhonesque
7f579ae834
Remove duplicate keys in [en|fi] data dicts
2017-03-19 11:40:29 +01:00
ines
8de5108af6
Exclude common cache directories from model list in cli.info
...
This means models called "cache" etc. won't show up in the list, but it
seems worth it.
2017-03-19 01:44:43 +01:00
Matthew Honnibal
6ee2ea1128
Increment version
2017-03-19 01:40:52 +01:00
Matthew Honnibal
797f286c38
Use import to find data package
2017-03-19 01:39:36 +01:00
Matthew Honnibal
5941fb9e92
Make spacy/data a package
2017-03-18 20:04:22 +01:00
Matthew Honnibal
bc10d06bc2
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-18 19:32:54 +01:00
Matthew Honnibal
583628c350
Import metadata into __init__
2017-03-18 19:30:03 +01:00
Matthew Honnibal
1754e0db9b
Call pip via subprocess, to make it use virtualenv
2017-03-18 19:29:36 +01:00
ines
1277abcde2
Remove print statement
2017-03-18 19:14:58 +01:00
Matthew Honnibal
dcec104643
Remove unused import
2017-03-18 18:57:45 +01:00
Matthew Honnibal
703eb7bdbd
Fix link module
2017-03-18 18:57:31 +01:00
Matthew Honnibal
f6c6c89546
Add empty data directory
2017-03-18 18:32:29 +01:00
ines
7d33104180
Use distutils.sysconfig.get_python_lib
...
site.getsitepackages seems to not work as expected in Python 2
2017-03-18 18:20:40 +01:00
Matthew Honnibal
1a53fcc685
Fix CLI for Python 2
2017-03-18 18:14:03 +01:00
ines
aefb898e37
Add title-case version of morph rules (resolves #686)
2017-03-18 17:27:11 +01:00
ines
64ec17abc1
Pass xpassing tests and add xfails for failures
2017-03-18 17:20:46 +01:00
ines
d0b85faf69
Pass regression test for #401 (resolves #401)
...
Fixed in new English models.
2017-03-18 17:06:49 +01:00
ines
be9daefbdd
Remove actual model downloading from tests
2017-03-18 17:01:10 +01:00
ines
850650221a
Use correct command in deprecated download command message
2017-03-18 17:01:01 +01:00
ines
0dd7710556
Make sure paths are paths
2017-03-18 16:48:52 +01:00
Matthew Honnibal
de0e6385b4
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-18 16:17:28 +01:00
Matthew Honnibal
fe442cac53
Fix #717: Set correct lemma for contracted verbs
2017-03-18 16:16:10 +01:00
ines
ad934a9abd
Add regression test for #693
2017-03-18 16:12:30 +01:00
ines
f57c616830
Add regression test for #704 and test new model (resolves #704)
...
(using new English model)
2017-03-18 16:04:14 +01:00
Matthew Honnibal
413138de79
Fix #719: Lemmatizer can no longer output empty string
2017-03-18 16:02:06 +01:00
ines
ab1451f997
Don't mark compatibility test as slow
2017-03-18 15:17:39 +01:00
ines
ec3e810662
Add directory cli and set up command line interface
2017-03-18 15:14:48 +01:00
ines
cd94ea1095
Use info module for spacy.info()
2017-03-18 13:01:26 +01:00
ines
e3e25c0a33
Add spacy.info module
...
Print info about spaCy installation, local setup and models. Allow
export in Markdown format to copy-paste into GitHub issues.
2017-03-18 13:01:16 +01:00
ines
0eafc0f2c6
Add util functions to print data as table or markdown list
2017-03-18 13:00:14 +01:00
ines
6b9b444065
Fix imports
2017-03-18 12:59:41 +01:00
ines
a035ebd32a
Use pathlib.Path instead of os.path
2017-03-18 12:59:21 +01:00
ines
9605cf39cc
Handle default path in Language classes
2017-03-18 12:58:45 +01:00
Matthew Honnibal
ac4b88cce9
Fix auto-linking in download command
2017-03-17 21:36:13 +01:00
ines
8a34c3e666
Fix shortcut name
2017-03-17 20:07:34 +01:00
Matthew Honnibal
6420f86f02
Merge changes to __init__.py
2017-03-17 19:51:45 +01:00
ines
e01fbacf81
Update resolve_model_name
2017-03-17 19:26:28 +01:00
ines
aedefef49d
Add function to resolve model names and link them
2017-03-17 18:47:05 +01:00
Matthew Honnibal
d013aba7b5
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-17 18:30:53 +01:00
Matthew Honnibal
854cfce7cf
Make vocabs more compatible across versions
...
Previously, symbols were inserted into the string-store
before strings were loaded. This meant that adding a symbol
would invalidate saved models. We now make sure that strings
are loaded faithfully, so that compatibility is maintained.
2017-03-17 18:29:04 +01:00
Matthew Honnibal
1cc841e600
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-17 08:18:11 -05:00
Matthew Honnibal
4bfc55b532
Auto-add words to vocab when loading vectors
...
When calling vocab.load_vectors_from_bin_loc, ensure that missing
entries are added to the vocab. Otherwise, loading vectors into an
empty vocab object resulted in no vectors being added.
2017-03-17 08:15:59 -05:00
ines
0e533ad0cc
Mark compatibility table test as slow (temporary)
...
Prevent Travis from running the test until models repo is published
2017-03-17 13:11:36 +01:00
ines
279b1d1965
Update version
2017-03-17 12:43:08 +01:00
ines
8af4b9e4df
Fix compatibility.json link
2017-03-17 12:43:03 +01:00
Matthew Honnibal
a630726b13
Fix typo in tests
2017-03-16 20:50:36 -05:00
Matthew Honnibal
f98b30583f
Fix tests
2017-03-16 19:48:00 -05:00
Matthew Honnibal
db51abf685
Fix tests
2017-03-16 18:53:47 -05:00
Matthew Honnibal
adb0b7e43b
Fix loading when no package found
2017-03-16 18:30:23 -05:00
Matthew Honnibal
5c66cffafd
Add tag map for Spanish
2017-03-16 18:05:15 -05:00
Matthew Honnibal
c4351e1165
Update base-form check in lemmatizer, for UD 2.0 morphology
2017-03-16 17:59:31 -05:00
Matthew Honnibal
1e10383e1b
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-16 17:41:13 -05:00
Matthew Honnibal
859315863a
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-16 17:40:07 -05:00
Matthew Honnibal
fea9fe08af
Merge pull request #866 from juanmirocks/master
...
Fix lemmatization of OOV words
2017-03-16 23:37:36 +01:00
Matthew Honnibal
ffd4a19383
Increment version
2017-03-16 17:35:57 -05:00
Matthew Honnibal
28bb546939
Merge pull request #883 from ericzhao28/master
...
Add `lower_` and `upper_` properties to `Span` class
2017-03-16 23:35:47 +01:00
ines
fd60961825
Fix spacing
2017-03-16 23:23:26 +01:00
Matthew Honnibal
890747d8ff
Fix trailing whitespace on morphology features
2017-03-16 17:07:37 -05:00
Matthew Honnibal
af41a9790c
Merge remote-tracking branch 'origin/develop-downloads'
2017-03-16 20:41:37 +01:00
Matthew Honnibal
303a56f173
Get absolute path for linking
2017-03-16 20:41:23 +01:00
ines
3d484c3faf
Don't print in parse_package_meta and accept on_error callback instead
...
TODO: log warning for missing meta data in spacy.link, as this affects
the Language class returned by spacy.load()
2017-03-16 20:34:50 +01:00
ines
d8c984b65e
Don't exit if no model meta data is present
2017-03-16 20:33:33 +01:00
Matthew Honnibal
2524efc0ac
Merge remote-tracking branch 'origin/develop-downloads'
2017-03-16 20:20:41 +01:00
ines
8253581057
Link model automatically if not direct download
2017-03-16 19:54:51 +01:00
Matthew Honnibal
8843b84bd1
Merge remote-tracking branch 'origin/develop-downloads'
2017-03-16 12:00:42 -05:00
Matthew Honnibal
55f813bfbb
Don't reapply the model during training
2017-03-16 11:59:43 -05:00
Matthew Honnibal
c90dc7ac29
Clean up state initialisation in transition system
2017-03-16 11:59:11 -05:00
Matthew Honnibal
a46933a8fe
Clean up FTRL parsing stuff.
2017-03-16 11:58:20 -05:00
ines
618ce3b425
Add .meta to Language object
...
Allows getting the current model's meta data, e.g.:
nlp = spacy.load('my-model')
print(nlp.meta)
2017-03-16 17:14:56 +01:00
ines
e348d4434c
Add spacy.info(model_name) to show model meta
...
Allows "previewing" model before loading and making sure it's linked
correctly.
2017-03-16 17:13:40 +01:00
ines
eea3b35e3f
Update model loading to support links
...
Remove match_best_version check, fetch model language from meta instead
of directory name, and don't make too many assumptions – if model is
downloaded via downloader, version should match anyway. (Otherwise,
users should be free to add and load whichever models they want.)
2017-03-16 17:13:08 +01:00
ines
5f3f04bd0a
Add util function to load and parse package meta.json
2017-03-16 17:10:05 +01:00
ines
7f920c2f75
Don't break text when rendering print_msg
2017-03-16 17:09:50 +01:00
ines
16a63d9676
Add docstring
2017-03-16 17:09:11 +01:00
ines
68c04fa897
Move sys_exit() function to util
2017-03-16 17:08:58 +01:00
ines
ccd1a79988
Add spacy.link module to link model directories to shortcuts
2017-03-16 17:01:51 +01:00
Matthew Honnibal
2611ac2a89
Fix scorer bug for NER, related to ambiguity between missing annotations and misaligned tokens
2017-03-16 09:38:28 -05:00
ines
595d89698a
Add basestring
2017-03-16 10:01:14 +01:00
ines
7b2eca36e4
Revert "Fix formatting and remove unused code"
...
This reverts commit d7898d586f.
2017-03-16 09:58:41 +01:00
ines
2f0db1dd36
Use small English model as default
2017-03-16 09:54:40 +01:00
Matthew Honnibal
3d0833c3df
Fix off-by-1 in parse features fill_context
2017-03-15 19:55:35 -05:00
Matthew Honnibal
4ef68c413f
Approximate cost in Break transition, to speed things up a bit.
2017-03-15 16:40:27 -05:00
Matthew Honnibal
8543db8a5b
Use ftrl optimizer in parser
2017-03-15 11:56:37 -05:00
ines
4cfc8ffbd2
Reformat pickle tests
2017-03-15 17:39:54 +01:00
ines
2a0fcf1354
Add tests for new download module
2017-03-15 17:39:43 +01:00
ines
71956c94db
Handle deprecated language-specific model downloading
2017-03-15 17:37:55 +01:00
ines
58b884b6d4
Refactor download script and about.py to use new download method
2017-03-15 17:37:18 +01:00
ines
f5d1a39a5b
Add util functions for printing and wrapping messages
2017-03-15 17:35:57 +01:00
ines
d7898d586f
Fix formatting and remove unused code
2017-03-15 17:35:41 +01:00
ines
b672e95045
Fix formatting
2017-03-15 17:35:04 +01:00
ines
0474e706a0
Remove unused deprecated functions for sputnik
2017-03-15 17:34:54 +01:00
ines
b13e7f79b4
Fix formatting and remove unused imports
2017-03-15 17:33:57 +01:00
ines
1101fd3855
Fix formatting and remove unused imports
2017-03-15 17:33:39 +01:00
ines
842782c128
Move fix_deprecated_glove_vectors_loading to deprecated.py
2017-03-15 17:33:29 +01:00
Matthew Honnibal
4cab8ac136
Update morph exceptions test
2017-03-15 09:31:34 -05:00
Matthew Honnibal
d719f8e77e
Use nogil in parser, and set L1 to 0.0 by default
2017-03-15 09:31:01 -05:00
Matthew Honnibal
c61c501406
Update beam-parser to allow parser to maintain nogil
2017-03-15 09:30:22 -05:00
Matthew Honnibal
3d4e389d23
Whitespace
2017-03-15 09:29:42 -05:00
Matthew Honnibal
7769bc31e3
Add beam-search classes
2017-03-15 09:27:41 -05:00
Matthew Honnibal
c79b3129e3
Fix setting of empty lexeme in initial parse state
2017-03-15 09:26:53 -05:00
Matthew Honnibal
d864708072
Add more morphology names in attrs.pyx
2017-03-15 09:26:16 -05:00
Matthew Honnibal
b382dc902c
Add morph rules in Language
2017-03-15 09:24:40 -05:00
Matthew Honnibal
8dbff4f5f4
Wire up English lemma and morph rules.
2017-03-15 09:23:22 -05:00
Matthew Honnibal
f70be44746
Use lemmatizer in code, not from downloaded model.
2017-03-15 04:52:50 -05:00
ines
42ba740dde
Revert "Merge branch 'debug'"
...
This reverts commit 89b79d1178, reversing
changes made to 02bdf490a1.
2017-03-13 20:11:52 +01:00
ines
4c5f51e49e
Update regression test
2017-03-13 15:16:11 +01:00
ines
02bdf490a1
Remove regression test to see if it caused pytest Travis error
2017-03-13 13:00:22 +01:00
ines
17018750ac
Add regression test for #717
2017-03-13 12:58:22 +01:00
ines
2883ebfca2
Remove print statement
2017-03-13 12:30:42 +01:00
ines
98c13d8aa9
Add regression test for #401
2017-03-13 12:28:41 +01:00
ines
444d665f9d
Add regression test for #686
2017-03-13 12:23:35 +01:00
ines
46b17e5b51
Add regression test for #719
2017-03-13 12:17:35 +01:00
ines
c8ae682ff9
Add regression test for #636
2017-03-13 12:08:31 +01:00
ines
337f9601f2
Add missing unicode declaration
2017-03-13 12:08:19 +01:00
ines
d70386ec6e
Update docstring in #886 regression test
2017-03-13 12:00:38 +01:00
ines
51ba3ef0a8
Add regression test for #886
2017-03-13 11:44:58 +01:00
ines
eec3f21c50
Add WordNet license
2017-03-12 13:58:24 +01:00
ines
f9e603903b
Rename stop_words.py to word_sets.py and include more sets
...
NUM_WORDS and ORDINAL_WORDS are currently not used, but the hard-coded
list should be removed from orth.pyx and replaced to use
language-specific functions. This will later allow other languages to
use their own functions to set those flags. (In English, this is easier
because it only needs to be checked against a set – in German for
example, this requires a more complex function, as most number words
are one word.)
2017-03-12 13:58:22 +01:00
ines
f24f9b4b7b
Remove unused code
2017-03-12 13:58:22 +01:00
ines
1da29a7146
Use new Lemmatizer data and remove file import
...
Since there's currently only an English lemmatizer, the global
Lemmatizer imports from spacy.en. This is not ideal and still needs to be
fixed.
2017-03-12 13:58:22 +01:00
ines
0957737ee8
Add Python-formatted lemmatizer data and rules
2017-03-12 13:58:22 +01:00
ines
c89e30d1a3
Add test for English time exceptions ("1a.m." etc.)
2017-03-12 13:58:22 +01:00
ines
ce9568af84
Move English time exceptions ("1a.m." etc.) and refactor
2017-03-12 13:58:22 +01:00
ines
6b30541774
Fix formatting
2017-03-12 13:58:22 +01:00
Ines Montani
e97a30b99a
Merge pull request #885 from PySUST/master
...
[Bengali] Spell checked and add new stop words
2017-03-12 13:20:59 +01:00
ines
66c1f194f9
Use consistent unicode declarations
2017-03-12 13:07:28 +01:00
shuvanon
91cb4cdb2b
Sort stop_words
2017-03-12 17:55:51 +06:00
shuvanon
784f6cfa49
Update stop_words
2017-03-12 17:41:01 +06:00
shuvanon
73cc17078e
Merge branch 'master' of https://github.com/PySUST/spaCy
2017-03-12 14:52:17 +06:00
shuvanon
35ec7135bb
Spell checked and add new stop words
2017-03-12 14:51:34 +06:00
Em
9c809efc25
Removed mapStr
2017-03-11 16:23:26 -08:00
Matthew Honnibal
fa23278ee3
Add classes for beam parser and beam NER
2017-03-11 12:45:37 -06:00
Matthew Honnibal
6c4108c073
Add header for beam parser
2017-03-11 12:45:12 -06:00
Matthew Honnibal
4382f175b3
Squelch compiler warnings
2017-03-11 12:44:43 -06:00
Matthew Honnibal
ea2592879f
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-11 11:13:37 -06:00
Matthew Honnibal
1224c4d3c6
Improve output on trainer
2017-03-11 11:12:48 -06:00
Matthew Honnibal
b438dfd3f3
Add itn argument to tagger.update
2017-03-11 11:12:21 -06:00
Matthew Honnibal
931feb3360
Allow beam parsing for NER
2017-03-11 11:12:01 -06:00
Matthew Honnibal
f77a5bb60a
Switch back to greedy parser
2017-03-11 11:11:30 -06:00
Matthew Honnibal
ca9c8c57c0
Add iteration argument to parser.update
2017-03-11 07:00:47 -06:00
Matthew Honnibal
dcce9ca3f3
Use beam parser
2017-03-11 07:00:20 -06:00
Matthew Honnibal
e30ffdd003
Use ftrl optimizer in tagger
2017-03-11 06:59:13 -06:00
Matthew Honnibal
d59c6926c1
I think this fixes the segfault
2017-03-11 06:58:34 -06:00
Matthew Honnibal
318b9e32ff
WIP on beam parser. Currently segfaults.
2017-03-11 06:19:52 -06:00
Em
426d17167f
Added string manipulation for spans
2017-03-10 16:50:02 -08:00
Matthew Honnibal
b0d80dc9ae
Update name of 'train' function in BeamParser
2017-03-10 14:35:43 -06:00
Matthew Honnibal
d11f1a4ddf
Record negative costs in non-monotonic arc eager oracle
2017-03-10 11:22:04 -06:00
Matthew Honnibal
ecf91a2dbb
Support beam parser
2017-03-10 11:21:21 -06:00
Ines Montani
a16aff17aa
Merge pull request #876 from PySUST/master
...
[Bangla] Update "tokenizer_exceptions.py"
2017-03-10 14:46:00 +01:00
ines
10e29189ac
Adjust URL testcases and xfail problems (instead of comment)
2017-03-10 14:22:50 +01:00
ines
b04893a059
Make regex locale-independent for Python 2
2017-03-10 14:21:57 +01:00
Matthew Honnibal
ea53647362
Merge branch 'develop'
2017-03-10 02:49:39 -06:00
Ines Montani
1c40890321
Add missing comma
...
Should fix Travis build error
2017-03-10 09:34:54 +01:00
Shuvanon Razik
c251703428
Update abbreviations
2017-03-10 10:45:01 +06:00
Matthew Honnibal
b5247c49eb
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-03-09 18:45:43 -06:00
Matthew Honnibal
798450136d
Set L1 penalty to 0 in tagger.
2017-03-09 18:43:47 -06:00
Matthew Honnibal
c62da02344
Use ftrl training, to learn compressed model.
2017-03-09 18:43:21 -06:00
Matthew Honnibal
f71eeef9bb
Pass path argument to end_training
2017-03-09 18:42:40 -06:00
Dan Rapp
123d3f2d38
Fix error in test case parameterization
2017-03-09 12:18:21 -07:00
Dan Rapp
b9307dfcd7
Merge branch 'master' into rappdw/tokenizer_exceptions_url_fix
2017-03-09 11:42:14 -07:00
Dan Rapp
3b1df3808d
Issue #840 - URL pattern too broad
2017-03-09 11:39:39 -07:00
Matthew Honnibal
5b0b968d13
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-03-08 15:03:10 +01:00
Matthew Honnibal
0ac3d27689
Fix handling of trailing whitespace
...
Fix off-by-one error that meant trailing spaces were being dropped.
Closes #792
2017-03-08 15:01:40 +01:00
ines
c2e3e651b8
Re-add regression test for #859
2017-03-08 14:36:09 +01:00
Matthew Honnibal
0a6d7ca200
Fix spacing after token_match
...
The boolean flag indicating a space after the token was
being set incorrectly after the token_match regex was applied.
Fixes #859.
2017-03-08 14:33:32 +01:00
shuvanon
85438aee1b
update tokenizertokenizer
2017-03-08 17:29:39 +06:00
shuvanon
45bc78461c
update tokenizertokenizer
2017-03-08 17:27:12 +06:00
Matthew Honnibal
cd33b39a04
Fix 2/3 problem for json save/load
2017-03-08 01:39:13 +01:00
Matthew Honnibal
40703988bc
Use FTRL training in parser
2017-03-08 01:38:51 +01:00
Matthew Honnibal
d108534dc2
Fix 2/3 problems for training
2017-03-08 01:37:52 +01:00
Matthew Honnibal
d03d6a13f1
Merge branch 'rominf-ud20' into develop
2017-03-07 21:48:56 +01:00
Matthew Honnibal
f7374d0b86
Merge branch 'ud20' of https://github.com/rominf/spaCy into rominf-ud20
2017-03-07 21:48:37 +01:00
Matthew Honnibal
16670d3251
Xfail the vocab pickling for now
2017-03-07 21:43:28 +01:00
Matthew Honnibal
a89c3500f6
Fixes to hacky vocab pickling
2017-03-07 20:58:55 +01:00
Matthew Honnibal
d814892805
Hackish pickle support for Vocab.
2017-03-07 20:25:12 +01:00
Matthew Honnibal
26614e028f
Add hacky support for StringCFile, to make pickling easier.
2017-03-07 20:24:37 +01:00
Matthew Honnibal
3edb8ae207
Whitespace
2017-03-07 17:16:26 +01:00
Matthew Honnibal
5de7e712b7
Add support for pickling StringStore.
2017-03-07 17:15:18 +01:00
Matthew Honnibal
4e75e74247
Update regression test for variable-length pattern problem in the matcher.
2017-03-07 16:08:32 +01:00
Matthew Honnibal
6d67213b80
Add test for 850: Matcher fails on zero-or-more.
2017-03-07 15:55:28 +01:00
Aniruddha Adhikary
696215a3fb
add tests for Bengali
2017-03-05 11:25:12 +06:00
Aniruddha Adhikary
8f3bfe9bfc
[Bengali] basic tag map, morph, lemma rules and exceptions
2017-03-04 12:36:59 +06:00
Roman Inflianskas
66e1109b53
Add support for Universal Dependencies v2.0
2017-03-03 13:17:34 +01:00
ines
8dff040032
Revert "Add regression test for #859"
...
This reverts commit c4f16c66d1.
2017-03-01 21:56:20 +01:00
Juan Miguel Cejuela
25c29f072d
apply patch
2017-03-01 21:44:17 +01:00
Juan Miguel Cejuela
a8cfde46d3
#781 Fix test: colocalizes is lemmatized to colocaliz and colocalize
2017-03-01 21:43:08 +01:00
Juan Miguel Cejuela
a471114eb2
#781 add regression test, failing previous bug fix
2017-03-01 21:30:51 +01:00
ines
c4f16c66d1
Add regression test for #859
2017-03-01 16:07:27 +01:00
Aniruddha Adhikary
d91be7aed4
add punctuations for Bengali
2017-02-28 21:07:14 +06:00
Aniruddha Adhikary
5a4fc09576
add basic Bengali support
2017-02-28 07:48:37 +06:00
Matthew Honnibal
cc9b2b74e3
Merge branch 'french-tokenizer-exceptions'
2017-02-27 11:44:39 +01:00
Matthew Honnibal
bd4375a2e6
Remove comment
2017-02-27 11:44:26 +01:00
Matthew Honnibal
e7e22d8be6
Move import within get_exceptions() function, to speed import
2017-02-27 11:34:48 +01:00
Matthew Honnibal
34bcc8706d
Merge branch 'french-tokenizer-exceptions'
2017-02-27 11:21:21 +01:00
Matthew Honnibal
0aaa546435
Fix test after updating the French tokenizer stuff
2017-02-27 11:20:47 +01:00
Matthew Honnibal
26446aa728
Avoid loading all French exceptions on import
...
Move exceptions loading behind a get_tokenizer_exceptions() function
for French, instead of loading into the top-level namespace. This
cuts import times from 0.6s to 0.2s, at the expense of making the
French data a little different from the others (there's no top-level
TOKENIZER_EXCEPTIONS variable.) The current solution feels somewhat
unsatisfying.
2017-02-25 11:55:00 +01:00
ines
376c5813a7
Remove print statements from test
2017-02-24 18:26:32 +01:00
ines
7c1260e98c
Add regression test
2017-02-24 18:22:49 +01:00
ines
0e2e331b58
Convert exceptions to Python list
2017-02-24 18:22:40 +01:00
ines
51eb190ef4
Remove print statements from test
2017-02-24 17:41:12 +01:00
Matthew Honnibal
db5ada3995
Merge branch 'master' of https://github.com/explosion/spaCy
2017-02-24 14:28:12 +01:00
Matthew Honnibal
8f94897d07
Add 1 operator to matcher, and make sure open patterns are closed at end of document. Closes Issue #766
2017-02-24 14:27:02 +01:00
ines
67991b6e5f
Add more test cases to #775 regression test to cover #847
2017-02-18 14:10:44 +01:00
ines
30ce2a6793
Exclude "shed" and "Shed" from tokenizer exceptions (see #847)
2017-02-18 14:10:44 +01:00
Ines Montani
de997c1a33
Merge pull request #842 from magnusburton/master
...
Added regular verb rules for Swedish
2017-02-17 11:18:20 +01:00
Magnus Burton
41fcfd06b8
Added regular verb rules for Swedish
2017-02-17 10:04:04 +01:00
ines
aa92d4e9b5
Fix unicode regex for Python 2 (see #834)
2017-02-16 23:49:54 +01:00
ines
44de3c7642
Reformat test and use text_file fixture
2017-02-16 23:49:19 +01:00
ines
3dd22e9c88
Mark vectors test as xfail (temporary)
2017-02-16 23:28:51 +01:00
ines
85d249d451
Revert "Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)""
...
This reverts commit ea05f78660.
2017-02-16 23:26:25 +01:00
ines
ea05f78660
Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)"
...
This reverts commit 7d8c9eee7f, reversing
changes made to f6b69babcc.
2017-02-16 15:27:12 +01:00
Raphaël Bournhonesque
06a71d22df
Fix test failure by using unicode literals
2017-02-16 14:48:00 +01:00
Raphaël Bournhonesque
3ba109622c
Add regression test with non ' ' space character as token
2017-02-16 12:23:27 +01:00
Raphaël Bournhonesque
e17dc2db75
Remove useless import
2017-02-16 12:10:24 +01:00
Raphaël Bournhonesque
3fd2742649
load_vectors should accept arbitrary space characters as word tokens
...
Fix bug #834
2017-02-16 12:08:30 +01:00
ines
f08e180a47
Make groups non-capturing
...
Prevents hitting the 100 named groups limit in Python
2017-02-10 13:35:02 +01:00
ines
fa3b8512da
Use consistent imports and exports
...
Bundle everything in language_data to keep it consistent with other
languages and make TOKENIZER_EXCEPTIONS importable from there.
2017-02-10 13:34:09 +01:00
ines
21f09d10d7
Revert "Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions""
...
This reverts commit f02a2f9322.
2017-02-10 13:17:05 +01:00
ines
f02a2f9322
Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions"
...
This reverts commit b95afdf39c, reversing
changes made to b0ccf32378.
2017-02-09 17:07:21 +01:00
Raphaël Bournhonesque
309da78bf0
Merge branch 'master' into tokenizer_exceptions
2017-02-09 16:32:12 +01:00
Raphaël Bournhonesque
4ce0bbc6b6
Update unit tests
2017-02-09 16:30:43 +01:00
Raphaël Bournhonesque
5d706ab95d
Merge tokenizer exceptions from PR #802
2017-02-09 16:30:28 +01:00
ines
654fe447b1
Add Swedish tokenizer tests (see #807)
2017-02-05 11:47:07 +01:00
ines
6715615d55
Add missing EXC variable and combine tokenizer exceptions
2017-02-05 11:42:52 +01:00
Ines Montani
30a52d576b
Merge pull request #807 from magnusburton/master
...
Added swedish lemma rules and more verb contractions
2017-02-05 11:34:19 +01:00
Magnus Burton
19c0ce745a
Added swedish lemma rules
2017-02-04 17:53:32 +01:00
Michael Wallin
d25556bf80
[issue 805] Fix issue
2017-02-04 16:22:21 +02:00
Michael Wallin
35100c8bdd
[issue 805] Add regression test and the required fixture
2017-02-04 16:21:34 +02:00
ines
0ab353b0ca
Add line breaks to Finnish stop words for better readability
2017-02-04 13:40:25 +01:00
Michael Wallin
1a1952afa5
[finnish] Add initial tests for tokenizer
2017-02-04 13:54:10 +02:00
Michael Wallin
f9bb25d1cf
[finnish] Reformat and correct stop words
2017-02-04 13:54:10 +02:00
Michael Wallin
73f66ec570
Add preliminary support for Finnish
2017-02-04 13:54:10 +02:00
Ines Montani
65d6202107
Merge pull request #802 from Tpt/fr-tokenizer
...
Adds more French tokenizer exceptions
2017-02-03 10:52:20 +01:00
Tpt
75a74857bb
Adds more French tokenizer exceptions
2017-02-03 13:45:18 +04:00
Ines Montani
afc6365388
Update regression test for #801 to match current expected behaviour
2017-02-02 16:23:05 +01:00
Ines Montani
012f4820cb
Keep infixes of punctuation + hyphens as one token (see #801)
2017-02-02 16:22:40 +01:00
Ines Montani
1219a5f513
Add = to tokenizer prefixes
2017-02-02 16:21:11 +01:00
Ines Montani
ff04748eb6
Add missing emoticon
2017-02-02 16:21:00 +01:00
Ines Montani
13a4ab37e0
Add regression test for #801
2017-02-02 15:33:52 +01:00
Raphaël Bournhonesque
85f951ca99
Add tokenizer exceptions for French
2017-02-02 08:36:16 +01:00
Matvey Ezhov
32a22291bc
Small Doc.count_by documentation update
...
Current example doesn't work
2017-01-31 19:18:45 +03:00
Ines Montani
e4875834fe
Fix formatting
2017-01-31 15:19:33 +01:00
Ines Montani
c304834e45
Add missing import
2017-01-31 15:18:30 +01:00
Ines Montani
e6465b9ca3
Parametrize test cases and mark as xfail
2017-01-31 15:14:42 +01:00
latkins
e4c84321a5
Added regression test for Issue #792.
2017-01-31 13:47:42 +00:00
Matthew Honnibal
6c665b81df
Fix redundant == TAG in from_array conditional
2017-01-31 00:46:21 +11:00
Ines Montani
19501f3340
Add regression test for #775
2017-01-25 13:16:52 +01:00
Ines Montani
209c37bbcf
Exclude "shell" and "Shell" from English tokenizer exceptions (resolves #775)
2017-01-25 13:15:02 +01:00
Raphaël Bournhonesque
1be9c0e724
Add fr tokenization unit tests
2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
1faaf698ca
Add infixes and abbreviation exceptions (fr)
2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
cf8474401b
Remove unused import statement
2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
902f136f18
Add support for elision in French
2017-01-24 10:57:37 +01:00
Ines Montani
55c9c62abc
Use relative import
2017-01-23 21:27:49 +01:00
Ines Montani
0967eb07be
Add regression test for #768
2017-01-23 21:25:46 +01:00
Ines Montani
6baa98f774
Merge pull request #769 from raphael0202/spacy-768
...
Allow zero-width 'infix' token
2017-01-23 21:24:33 +01:00
Raphaël Bournhonesque
dce8f5515e
Allow zero-width 'infix' token
2017-01-23 18:28:01 +01:00
Ines Montani
5f6f48e734
Add regression test for #759
2017-01-20 15:11:48 +01:00
Ines Montani
09ecc39b4e
Fix multi-line string of NUM_WORDS (resolves #759)
2017-01-20 15:11:48 +01:00
Magnus Burton
69eab727d7
Added loops to handle contractions with verbs
2017-01-19 14:08:52 +01:00
Matthew Honnibal
be26085277
Fix missing import
...
Closes #755
2017-01-19 22:03:52 +11:00
Ines Montani
7e36568d5b
Fix title to accommodate sputnik
2017-01-17 00:51:09 +01:00
Ines Montani
d704cfa60d
Fix typo
2017-01-16 21:30:33 +01:00
Ines Montani
64e142f460
Update about.py
2017-01-16 14:23:08 +01:00
Matthew Honnibal
e889cd698e
Increment version
2017-01-16 14:01:35 +01:00
Matthew Honnibal
e7f8e13cf3
Make Token hashable. Fixes #743
2017-01-16 13:27:57 +01:00
Matthew Honnibal
2c60d0cb1e
Test #743: Tokens unhashable.
2017-01-16 13:27:26 +01:00
Matthew Honnibal
48c712f1c1
Merge branch 'master' of ssh://github.com/explosion/spaCy
2017-01-16 13:18:06 +01:00
Matthew Honnibal
7ccf490c73
Increment version
2017-01-16 13:17:58 +01:00
Ines Montani
50878ef598
Exclude "were" and "Were" from tokenizer exceptions and add regression test (resolves #744)
2017-01-16 13:10:38 +01:00
Ines Montani
e053c7693b
Fix formatting
2017-01-16 13:09:52 +01:00
Ines Montani
116c675c3c
Merge pull request #742 from oroszgy/hu_tokenizer_fix
...
Improved Hungarian tokenizer
2017-01-14 23:52:44 +01:00
Gyorgy Orosz
92345b6a41
Further numeric test.
2017-01-14 22:44:19 +01:00
Gyorgy Orosz
b4df202bfa
Better error handling
2017-01-14 22:24:58 +01:00
Gyorgy Orosz
b03a46792c
Better error handling
2017-01-14 22:09:29 +01:00
Gyorgy Orosz
a45f22913f
Added further abbreviations present in the Szeged corpus
2017-01-14 22:08:55 +01:00
Ines Montani
332ce2d758
Update README.md
2017-01-14 21:12:11 +01:00
Gyorgy Orosz
9505c6a72b
Passing all old tests.
2017-01-14 20:39:21 +01:00
Gyorgy Orosz
63037e79af
Fixed hyphen handling in the Hungarian tokenizer.
2017-01-14 16:30:11 +01:00
Gyorgy Orosz
f77c0284d6
Maintaining compatibility with other spacy tokenizers.
2017-01-14 16:19:15 +01:00
Gyorgy Orosz
be7a7aeb1a
Reversed accidental changes.
2017-01-14 15:59:36 +01:00
Gyorgy Orosz
1be5da1ac6
Fixed Hungarian tokenizer for numbers
2017-01-14 15:51:59 +01:00
Ines Montani
a89e269a5a
Fix test formatting and consistency
2017-01-14 13:41:19 +01:00
Ines Montani
3424e3a7e5
Update README.md
2017-01-13 15:54:54 +01:00
Ines Montani
49186b34a1
Mark lemmatizer tests as models since they use installed data
2017-01-13 15:12:07 +01:00
Ines Montani
138deb80a1
Modernise vector tests, use add_vecs_to_vocab and don't depend on models
2017-01-13 15:12:07 +01:00
Ines Montani
96f0caa28a
Fix test name for consistency
2017-01-13 15:12:07 +01:00
Ines Montani
dc2bb1259f
Add util function to add vectors to vocab
2017-01-13 15:12:07 +01:00
Ines Montani
db9b25663d
Reformat add_docs_equal and add docstring
2017-01-13 15:12:07 +01:00
Ines Montani
62ce0a0073
Add README.md to tests to explain organisation and conventions
2017-01-13 15:11:18 +01:00
Ines Montani
38d60f6b90
Modernise serializer I/O tests and don't depend on models where possible
2017-01-13 02:24:56 +01:00
Ines Montani
4bb5b89ee4
Add text_file_b fixture using BytesIO
2017-01-13 02:23:50 +01:00
Ines Montani
49febd8c62
Modernise noun chunks tests and don't depend on models
2017-01-13 02:01:00 +01:00
Ines Montani
3ee97b5686
Rename test_parser to test_noun_chunks
2017-01-13 01:36:33 +01:00
Ines Montani
a308703f47
Remove old tests
2017-01-13 01:34:48 +01:00
Ines Montani
12eb8edf26
Move parser tests from unit to parser
2017-01-13 01:34:38 +01:00
Ines Montani
138c53ff2e
Merge tokenizer tests
2017-01-13 01:34:14 +01:00
Ines Montani
01f36ca3ff
Move attrs tests from unit to root and modernise
2017-01-13 01:33:50 +01:00
Ines Montani
3610d27967
Move alignment tests from munge to gold and modernise
2017-01-13 01:33:31 +01:00
Ines Montani
094ff7396a
Reformat and rename Pragmatic Segmenter tests and mark xfails
2017-01-13 01:30:20 +01:00
Ines Montani
affcf1b19d
Modernise lemmatizer tests
2017-01-12 23:41:17 +01:00
Ines Montani
33d9cf87f9
Modernise tagger tests and fix xpassing test
2017-01-12 23:40:52 +01:00
Ines Montani
33e5f8dc2e
Create basic and extended test set for URLs
2017-01-12 23:40:02 +01:00
Ines Montani
5e4f5ebfc8
Modernise BILUO tests
2017-01-12 23:39:18 +01:00
Ines Montani
09acfbca01
Add Lemmatizer fixture
2017-01-12 23:38:55 +01:00
Ines Montani
514bfa2597
Add path fixture for spaCy data path
2017-01-12 23:38:47 +01:00
Ines Montani
0894b8c0ef
Don't split tokens with digits and "/" infixes (resolves #740)
2017-01-12 22:58:26 +01:00
Ines Montani
e9e99a5670
Add regression test for #740
2017-01-12 22:57:38 +01:00
Ines Montani
6935d55409
Fix formatting
2017-01-12 22:56:20 +01:00
Ines Montani
5f0d196a31
Modernise and merge matcher tests
2017-01-12 22:23:11 +01:00
Ines Montani
d5d774413a
Update comments on EN and DE fixtures
2017-01-12 22:03:07 +01:00
Ines Montani
9b4bea1df9
Tidy up and rename regression tests and remove unnecessary imports
2017-01-12 22:00:37 +01:00
Ines Montani
5e1b6178e3
Fix formatting and consistency
2017-01-12 22:00:06 +01:00
Ines Montani
a3fd32455e
Remove redundant language loading integration tests
2017-01-12 21:59:48 +01:00
Ines Montani
61f1ca09c2
Modernise serializer codecs tests
2017-01-12 21:58:55 +01:00
Ines Montani
5dbc6e59f6
Modernise Huffman tests
2017-01-12 21:58:40 +01:00
Ines Montani
edeeeccea5
Modernise packer tests and don't depend on models where possible
2017-01-12 21:58:07 +01:00
Ines Montani
d084676cd0
Modernise and merge serialization tests
2017-01-12 21:57:19 +01:00
Ines Montani
442237787c
Add assert_docs_equal util to compare two docs
2017-01-12 21:56:52 +01:00
Ines Montani
eac3f700fb
Add fixture for entity recognizer
2017-01-12 21:56:32 +01:00
Ines Montani
b438cfddbc
Modernise matcher tests and split into two files
2017-01-12 17:51:46 +01:00
Ines Montani
27482ebed8
Move matcher tests for #188 and #242 to regression tests
...
Modernise tests and remove unnecessary imports
2017-01-12 17:33:57 +01:00
Ines Montani
0a4dc632bd
Update test to not create redundant Doc object
2017-01-12 17:33:18 +01:00
Ines Montani
a2526e66d8
Fix formatting, naming and unicode declaration
2017-01-12 16:51:13 +01:00
Ines Montani
052cdff07d
Modernise vector similarity tests
2017-01-12 16:51:13 +01:00
Ines Montani
bd20ec0a6a
Add get_cosine util function
2017-01-12 16:51:13 +01:00
Ines Montani
51ef75f629
Fix regression test for #615 and remove unnecessary imports
2017-01-12 16:51:12 +01:00
Ines Montani
aeb747e10c
Adjust formatting
2017-01-12 16:51:12 +01:00
Ines Montani
8e3e58a7e6
Modernise and merge lexeme vocab tests
2017-01-12 16:51:12 +01:00
Ines Montani
c3d4516fc2
Move test for #361 to regression tests
2017-01-12 16:51:12 +01:00
Daniel Hershcovich
99eb494a82
Fix #737: support loading word vectors with " " as a word
2017-01-12 17:00:14 +02:00
Ines Montani
7cb3d74426
Modernise span tests and don't depend on models
2017-01-12 15:30:49 +01:00
Ines Montani
92e3d8b3ee
Modernise vocab API tests and remove old xfailing tests
2017-01-12 15:27:46 +01:00
Ines Montani
7ea87684cd
Rename test_vocab.py to test_vocab_api.py
2017-01-12 15:12:21 +01:00
Ines Montani
0da2ee5c68
Merge flag features tests into orth tests in tests root
2017-01-12 15:12:00 +01:00
Ines Montani
03c136cfd3
Remove StringStore tests from vocab tests
2017-01-12 15:11:15 +01:00
Ines Montani
d7bd57abdf
Modernise add vectors vocab test
2017-01-12 15:09:49 +01:00
Ines Montani
89525ef345
Use consistent test names
2017-01-12 15:09:21 +01:00
Ines Montani
f8803808ce
Remove old unused tests and conftest files
2017-01-12 15:09:05 +01:00
Ines Montani
4d0bfebcd9
Move Pragmatic Segmenter test cases (currently unused) to parser tests
2017-01-12 15:08:02 +01:00
Ines Montani
26d018d874
Add tests for StringStore
2017-01-12 15:07:31 +01:00
Ines Montani
9b6784bab5
Add fixture for StringStore
2017-01-12 15:05:40 +01:00
Ines Montani
99d66d613a
Modernise tests for merging spans and don't depend on models
2017-01-12 12:26:26 +01:00
Ines Montani
fa8f67596d
Remove unused old test
2017-01-12 12:26:08 +01:00
Ines Montani
359f73a96b
Move test for #54 to regression tests
2017-01-12 12:25:51 +01:00
Ines Montani
3f3a46722c
Remove unused conftest
2017-01-12 12:25:24 +01:00
Ines Montani
c2406e92bc
Allow setting ents in get_doc
2017-01-12 12:25:10 +01:00
Ines Montani
c5914c6fe5
Fix and pass regression test for #736
2017-01-12 11:48:56 +01:00
Matthew Honnibal
4e48862fa8
Remove print statement
2017-01-12 11:25:39 +01:00
Matthew Honnibal
d1d8214767
Increment version
2017-01-12 11:21:57 +01:00
Matthew Honnibal
fba67fa342
Fix Issue #736: Times were being tokenized with incorrect string values.
2017-01-12 11:21:01 +01:00
Ines Montani
a6790b6694
Rename tags to pos in get_doc and allow adding tags to tokens
2017-01-12 11:18:36 +01:00
Ines Montani
1add8ace67
Merge lemmatizer tests
2017-01-12 11:16:53 +01:00
Ines Montani
3bc082abdf
Modernise morph exceptions test and don't depend on models
2017-01-12 11:14:29 +01:00
Ines Montani
ec7739b76e
Add regression test for #736
2017-01-12 11:12:44 +01:00
Ines Montani
6c1c564891
Move language-specific tests out of redundant tokenizer directories
2017-01-12 02:17:18 +01:00
Ines Montani
8fecedac3a
Tidy up
2017-01-12 02:16:37 +01:00
Ines Montani
ae7edd30e7
Move text file back to tokenizer tests directory
2017-01-12 02:10:23 +01:00
Ines Montani
ffcaba9017
Remove old and/or redundant tests
2017-01-12 02:10:18 +01:00
Ines Montani
19c4132097
Modernise space attachment parser tests and don't depend on models
2017-01-12 01:54:44 +01:00
Ines Montani
69778924c8
Modernise and merge parser tests and don't depend on models
2017-01-12 01:07:29 +01:00
Ines Montani
178c147612
Modernise nonprojectivity tests and don't depend on models
2017-01-12 01:06:36 +01:00
Ines Montani
1a3984742c
Modernise sentence boundary detection tests and don't depend on models (where possible)
2017-01-11 23:53:08 +01:00
Ines Montani
0cdb6ea61d
Remove old unused pickle test
2017-01-11 23:52:28 +01:00
Ines Montani
c9671329dc
Move test for #309 to regression tests
2017-01-11 23:52:13 +01:00
Ines Montani
d0e37b5670
Modernise parser tests and don't depend on models
2017-01-11 21:30:27 +01:00
Ines Montani
342cb41782
Add apply_transition_sequence util function to utils
2017-01-11 21:30:14 +01:00
Ines Montani
09807addff
Add en_parser fixture
2017-01-11 21:29:59 +01:00
Ines Montani
55d151aa61
Modernise Doc parse tree navigation tests and don't depend on models
2017-01-11 21:14:15 +01:00
Ines Montani
7262421bb2
Use consistent test names
2017-01-11 19:00:52 +01:00
Ines Montani
33800c9367
Rename "tokens" tests to "doc"
2017-01-11 18:59:01 +01:00
Ines Montani
3a9c6a9563
Remove old unused files
2017-01-11 18:58:38 +01:00
Ines Montani
8e962de39f
Remove old word vector tests
2017-01-11 18:55:08 +01:00
Ines Montani
e027936920
Modernise Doc noun chunks tests
2017-01-11 18:54:56 +01:00
Ines Montani
439f396acd
Modernise Doc array tests and don't depend on models
2017-01-11 18:54:46 +01:00
Ines Montani
05447be884
Modernise test for adding entities
2017-01-11 18:54:24 +01:00
Ines Montani
6e883f4c00
Modernise Doc API tests and don't depend on models
2017-01-11 18:05:36 +01:00
Ines Montani
8bf3bb5c44
Make words optional for get_doc
2017-01-11 18:05:10 +01:00
Ines Montani
928db7e419
Fix StringIO import for Python 3
2017-01-11 14:07:48 +01:00
Ines Montani
69998f216b
Rename test_tokens_api.py to test_doc_api.py
2017-01-11 13:58:56 +01:00
Ines Montani
d94dea1b18
Merge token tests into token API tests
2017-01-11 13:57:02 +01:00
Ines Montani
eb23424ab0
Modernise token API tests and don't depend on loading models
2017-01-11 13:56:54 +01:00
Ines Montani
c682b8ca90
Merge conftests into one cohesive file
2017-01-11 13:56:32 +01:00
Ines Montani
909f24d7df
Add test utils and get_doc helper function
...
Create Doc object from given vocab, words and annotations to allow
tests not to depend on loading the models.
2017-01-11 13:55:33 +01:00
Matthew Honnibal
e12c90e03f
Merge branch 'master' of ssh://github.com/explosion/spaCy
2017-01-11 13:03:51 +01:00
Matthew Honnibal
12cd27b821
Amend 8ae8b443f: Handle comparison with None tokens.
2017-01-11 13:03:32 +01:00
Daniel Hershcovich
8e603cc917
Avoid "True if ... else False"
2017-01-11 11:18:22 +02:00
Matthew Honnibal
44e2b0100d
Support TAG attribute in doc.from_array
2017-01-10 22:47:07 +01:00
Ines Montani
3e6e1f0251
Tidy up regression tests
2017-01-10 19:24:10 +01:00
Magnus Burton
aad23ab0b4
Supplemented with capitalized Swedish exceptions
2017-01-10 16:07:20 +01:00
Ines Montani
869963c3c4
Mark extensive prefix/suffix tests as slow
2017-01-10 15:57:35 +01:00
Ines Montani
487e020ebe
Add simple test for surrounding brackets
2017-01-10 15:57:26 +01:00
Ines Montani
0ba5cf51d2
Assert length first
2017-01-10 15:57:00 +01:00
Ines Montani
2185d31907
Adjust names and formatting
2017-01-10 15:56:35 +01:00
Ines Montani
e10d4ca964
Remove semi-redundant URLs and punctuation for faster testing
2017-01-10 15:54:25 +01:00
Ines Montani
3a3cb2c90c
Add unicode declaration
2017-01-10 15:53:15 +01:00
Matthew Honnibal
0f9b8a00a5
Unbreak data download
2017-01-09 23:40:26 +01:00
Matthew Honnibal
8ae8b443f1
Add richcmp method to Token. Closes #631
2017-01-09 19:30:31 +01:00
Matthew Honnibal
64f747cb65
Token comparison test
2017-01-09 19:12:00 +01:00
Matthew Honnibal
18c3c2d05c
Add tests for token comparison, re Issue #631
2017-01-09 19:09:59 +01:00
Matthew Honnibal
97a1286129
Revert changes to tagger and parser for thinc 6
2017-01-09 10:08:34 -06:00
Matthew Honnibal
95a52005df
Revert "Fix Issue #683: Add 'SP' to tag_map, if it's not there already, within the Morphology class."
...
This reverts commit 40e71586d6.
2017-01-09 09:55:55 -06:00
Ines Montani
363f09e68c
Merge pull request #726 from magnusburton/master
...
Added Swedish abbreviations as token exceptions
2017-01-09 14:58:15 +01:00
Matthew Honnibal
42cd598f57
Use correct fixtures in URL tokenizer
2017-01-09 14:10:40 +01:00
Matthew Honnibal
d9a77ddf14
Return None for data path if it doesn't exist
2017-01-09 14:10:05 +01:00
Matthew Honnibal
e4862d1dab
Merge branch 'develop'
2017-01-09 13:36:01 +01:00
Ines Montani
aa876884f0
Revert "Revert "Merge remote-tracking branch 'origin/master'""
...
This reverts commit fb9d3bb022.
2017-01-09 13:28:13 +01:00
Ines Montani
d5c72c40eb
Remove old tests for old website example code
2017-01-08 22:28:53 +01:00
Ines Montani
eef94e3ee2
Split off period after two or more uppercase letters (fixes #483)
2017-01-08 22:28:25 +01:00
Ines Montani
a89a6000e5
Remove unused import
2017-01-08 22:17:37 +01:00
Ines Montani
5d28664fc5
Don't test Hungarian for numbers and hyphens for now
...
Reinvestigate behaviour of case affixes given reorganised tokenizer
patterns.
2017-01-08 20:45:40 +01:00
Ines Montani
53362b6b93
Reorganise Hungarian prefixes/suffixes/infixes
...
Use global prefixes and suffixes for non-language-specific rules,
import list of alpha unicode characters and adjust regexes.
2017-01-08 20:40:33 +01:00
Ines Montani
347c4a2d06
Reorganise and reformat global tokenizer prefixes, suffixes and infixes
2017-01-08 20:37:39 +01:00
Ines Montani
0dec90e9f7
Use global abbreviation data languages and remove duplicates
2017-01-08 20:36:00 +01:00
Ines Montani
7c3cb2a652
Add global abbreviations data
2017-01-08 20:34:03 +01:00
Ines Montani
de5aa92bc2
Handle deprecated tokenizer prefix data
2017-01-08 20:33:28 +01:00
Ines Montani
abb09782f9
Move sun.txt to original location and fix path to not break parser tests
2017-01-08 20:32:54 +01:00
Ines Montani
cab39c59c5
Add missing contractions to English tokenizer exceptions
...
Inspired by
https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py
2017-01-05 19:59:06 +01:00
Ines Montani
a23504fe07
Move abbreviations below other exceptions
2017-01-05 19:58:07 +01:00
Ines Montani
7d2cf934b9
Generate he/she/it correctly with 's instead of 've
2017-01-05 19:57:00 +01:00
Ines Montani
8328925e1f
Add newlines to long German text
2017-01-05 18:13:30 +01:00
Ines Montani
55b46d7cf6
Add tokenizer tests for German
2017-01-05 18:11:25 +01:00
Ines Montani
5bb4081f52
Remove redundant test_tokenizer.py for English
2017-01-05 18:11:11 +01:00
Ines Montani
8216ba599b
Add tests for longer and mixed English texts
2017-01-05 18:11:04 +01:00
Ines Montani
65f937d5c6
Move basic contraction tests to test_contractions.py
2017-01-05 18:09:53 +01:00
Ines Montani
bbe7cab3a1
Move non-English-specific tests back to general tokenizer tests
2017-01-05 18:09:29 +01:00
Ines Montani
038002d616
Reformat HU tokenizer tests and adapt to general style
...
Improve readability of test cases and add conftest.py with fixture
2017-01-05 18:06:44 +01:00
Ines Montani
bc911322b3
Move ") to emoticons (see Tweebo challenge test)
2017-01-05 18:05:38 +01:00
Ines Montani
637f785036
Add general sanity tests for all tokenizers
2017-01-05 16:25:38 +01:00
Ines Montani
c5f2dc15de
Move English tokenizer tests to directory /en
2017-01-05 16:25:04 +01:00
Ines Montani
8b45363b4d
Modernize and merge general tokenizer tests
2017-01-05 13:17:05 +01:00
Ines Montani
02cfda48c9
Modernize and merge tokenizer tests for string loading
2017-01-05 13:16:55 +01:00
Ines Montani
a11f684822
Modernize and merge tokenizer tests for whitespace
2017-01-05 13:16:33 +01:00
Ines Montani
8b284fc6f1
Modernize and merge tokenizer tests for text from file
2017-01-05 13:15:52 +01:00
Ines Montani
2c2e878653
Modernize and merge tokenizer tests for punctuation
2017-01-05 13:14:16 +01:00
Ines Montani
8a74129cdf
Modernize and merge tokenizer tests for prefixes/suffixes/infixes
2017-01-05 13:13:12 +01:00
Ines Montani
0e65dca9a5
Modernize and merge tokenizer tests for exception and emoticons
2017-01-05 13:11:31 +01:00
Ines Montani
34c47bb20d
Fix formatting
2017-01-05 13:10:51 +01:00
Ines Montani
2e72683baa
Add missing docstrings
2017-01-05 13:10:21 +01:00
Ines Montani
da10a049a6
Add unicode declarations
2017-01-05 13:09:48 +01:00
Ines Montani
58adae8774
Remove unused file
2017-01-05 13:09:22 +01:00
Ines Montani
c6e5a5349d
Move regression test for #360 into own file
2017-01-04 00:49:31 +01:00
Ines Montani
8279993a6f
Modernize and merge tokenizer tests for punctuation
2017-01-04 00:49:20 +01:00
Ines Montani
550630df73
Update tokenizer tests for contractions
2017-01-04 00:48:42 +01:00
Ines Montani
109f202e8f
Update conftest fixture
2017-01-04 00:48:21 +01:00
Ines Montani
ee6b49b293
Modernize tokenizer tests for emoticons
2017-01-04 00:47:59 +01:00
Ines Montani
f09b5a5dfd
Modernize tokenizer tests for infixes
2017-01-04 00:47:42 +01:00
Ines Montani
59059fed27
Move regression test for #351 to own file
2017-01-04 00:47:11 +01:00
Ines Montani
667051375d
Modernize tokenizer tests for whitespace
2017-01-04 00:46:35 +01:00
Ines Montani
aafc894285
Modernize tokenizer tests for contractions
...
Use @pytest.mark.parametrize.
2017-01-03 23:02:21 +01:00
Ines Montani
1d237664af
Add lowercase lemma to tokenizer exceptions
2017-01-03 23:02:21 +01:00
Ines Montani
84a87951eb
Fix typos
2017-01-03 18:27:43 +01:00
Ines Montani
35b39f53c3
Reorganise English tokenizer exceptions (as discussed in #718)
...
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:26:09 +01:00
Ines Montani
fb9d3bb022
Revert "Merge remote-tracking branch 'origin/master'"
...
This reverts commit d3b181cdf1, reversing
changes made to b19cfcc144.
2017-01-03 18:21:36 +01:00
Ines Montani
461cbb99d8
Revert "Reorganise English tokenizer exceptions (as discussed in #718)"
...
This reverts commit b19cfcc144.
2017-01-03 18:21:29 +01:00
Ines Montani
d3b181cdf1
Merge remote-tracking branch 'origin/master'
...
# Conflicts:
# spacy/en/tokenizer_exceptions.py
2017-01-03 18:20:01 +01:00
Ines Montani
b19cfcc144
Reorganise English tokenizer exceptions (as discussed in #718)
...
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:17:57 +01:00
Ines Montani
1bd53bbf89
Fix typos (resolves #718)
2017-01-03 11:26:21 +01:00
Matthew Honnibal
fde53be3b4
Move whole token match inside _split_affixes.
2016-12-30 17:11:50 -06:00
Matthew Honnibal
3ba7c167a8
Fix URL tests
2016-12-30 17:10:08 -06:00
Matthew Honnibal
9936a1b9b5
Merge branch 'tokenization_w_exception_patterns' of https://github.com/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns
2016-12-30 14:53:40 -06:00
Magnus Burton
56e2219b65
Added Swedish city abbreviations
2016-12-30 21:17:34 +01:00
Magnus Burton
e935c950d8
Added months and days as abbreviations for Swedish
2016-12-30 21:08:44 +01:00
kengz
73a38bd4d1
Merge remote-tracking branch 'upstream/master'
2016-12-30 12:19:59 -05:00
kengz
da44183ae1
move parse_tree logic to a new tokens/printers.py file
2016-12-30 12:19:18 -05:00
Matthew Honnibal
3e8d9c772e
Test interaction of token_match and punctuation
...
Check that the new token_match function applies after punctuation is split off.
2016-12-31 00:52:17 +11:00
Matthew Honnibal
74b921f394
Merge branch 'master' of ssh://github.com/explosion/spaCy into develop
2016-12-30 14:38:27 +01:00
Matthew Honnibal
623d94e14f
Whitespace
2016-12-31 00:30:28 +11:00
Matthew Honnibal
af81ac8bb0
Use thinc 6.0
2016-12-29 11:58:42 +01:00
Petter Hohle
f112e7754e
Add PART to tag map
...
16 of the 17 PoS tags in the UD tag set are added; PART is missing.
2016-12-28 18:39:01 +01:00
Matthew Honnibal
f62db78dc3
Increment version
2016-12-27 21:11:22 +01:00
Matthew Honnibal
cade536d1e
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-12-27 21:04:10 +01:00
Matthew Honnibal
ce4539dafd
Allow the vocabulary to grow to 10,000, to prevent cold-start problem.
2016-12-27 21:03:45 +01:00
Ines Montani
ad3669cef5
Merge pull request #703 from magnusburton/master
...
Added Swedish abbreviations
2016-12-27 01:01:49 +01:00
Ines Montani
78f754dd9a
Merge pull request #705 from oroszgy/hu_tokenizer
...
Initial support for Hungarian
2016-12-27 00:48:13 +01:00
Ines Montani
8785706039
Reformat stop words for better readability
2016-12-24 00:58:40 +01:00
Gyorgy Orosz
45e045a87b
Unicode/UTF8 compatibility for Python2
2016-12-24 00:21:00 +01:00
Gyorgy Orosz
72b61b6d03
Typo fix.
2016-12-24 00:10:29 +01:00
Gyorgy Orosz
3a9be4d485
Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers.
2016-12-23 23:49:34 +01:00
Ines Montani
1436b9f15a
Fix formatting and consistency
2016-12-23 21:36:01 +01:00
Ines Montani
1d64527727
Update Spanish tokenizer
...
Remove reflexive pronouns as they're part of an open class, fix
mistakes and add exceptions
2016-12-23 21:36:01 +01:00
Ines Montani
7f411fd01c
Remove exceptions containing whitespace / no special chars
2016-12-23 14:30:06 +01:00
Magnus Burton
fdf4776262
Added Swedish abbreviations
2016-12-22 22:45:18 +01:00
Gyorgy Orosz
d9c59c4751
Maintaining backward compatibility.
2016-12-21 23:30:49 +01:00
Gyorgy Orosz
1748549aeb
Added exception pattern mechanism to the tokenizer.
2016-12-21 23:16:19 +01:00
Gyorgy Orosz
35aa54765d
Hungarian module is exposed in spacy.
2016-12-21 20:45:36 +01:00
Gyorgy Orosz
ab2f6ea46c
Removed data files from tests.
2016-12-21 20:22:09 +01:00
Ines Montani
3c87c71d43
Add tokenizer exceptions for a.m. and p.m. in Spanish
2016-12-21 18:19:10 +01:00
Ines Montani
78e63dc7d0
Update tokenizer exceptions for English
2016-12-21 18:06:34 +01:00
Ines Montani
702d1eed93
Update tokenizer exceptions for German
2016-12-21 18:06:27 +01:00
Ines Montani
d60380418e
Update tokenizer exceptions for Spanish
2016-12-21 18:06:17 +01:00
Ines Montani
920fa0fed2
Add DET_LEMMA constant
2016-12-21 18:05:41 +01:00
Ines Montani
8978806ea6
Allow Vocab to load without serializer_freqs
2016-12-21 18:05:23 +01:00
Ines Montani
be8ed811f6
Remove trailing whitespace
2016-12-21 18:04:41 +01:00
Ines Montani
926e19184a
Merge pull request #695 from magnusburton/master
...
Added Swedish morph rules
2016-12-21 01:06:00 +01:00
Gyorgy Orosz
3d5306acb9
Added further testcases.
2016-12-20 23:49:35 +01:00
Gyorgy Orosz
23956e72ff
Improved partial support for tokenizing Hungarian numbers
2016-12-20 23:36:59 +01:00
Gyorgy Orosz
6add156075
Refactored language data structure
2016-12-20 22:28:20 +01:00
Gyorgy Orosz
366b3f8685
Merge branch 'master' into hu_tokenizer
2016-12-20 20:53:31 +01:00
Gyorgy Orosz
c035928156
Partial Hungarian number tokenization is added.
2016-12-20 20:46:20 +01:00
JM
70ff0639b5
Fixed missing vec_path declaration that was failing if 'add_vectors' was set
...
Added vec_path variable declaration to avoid accessing it before assignment in case 'add_vectors' is in overrides.
2016-12-20 18:21:05 +01:00
Magnus Burton
48dcc9f647
Added morph rules
2016-12-20 13:18:41 +01:00
Magnus Burton
db5a077d2b
Initial commit for Swedish
2016-12-20 11:05:06 +01:00
Matthew Honnibal
3f5747a9b2
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-12-18 23:44:22 +01:00
Matthew Honnibal
40e71586d6
Fix Issue #683: Add 'SP' to tag_map, if it's not there already, within the Morphology class.
2016-12-18 23:44:05 +01:00
Matthew Honnibal
fa1d23e10d
Merge branch 'master' of https://github.com/explosion/spaCy
2016-12-18 23:32:03 +01:00
Matthew Honnibal
f38eb25fe1
Fix test for word vector
2016-12-18 23:31:55 +01:00
Matthew Honnibal
4e68abebc4
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-12-18 23:19:45 +01:00
Matthew Honnibal
5a6328a5a4
Increment version
2016-12-18 23:19:19 +01:00
Matthew Honnibal
13a0b31279
Another tweak to GloVe path hackery.
2016-12-18 23:12:49 +01:00
Matthew Honnibal
2c6228565e
Fix vector loading re glove hack
2016-12-18 23:06:44 +01:00
Matthew Honnibal
618b50a064
Fix issue #684: GloVe vectors not loaded in spacy.en.English.
2016-12-18 22:46:31 +01:00
Matthew Honnibal
404019ad2f
Fix issue #672: ent_iob_ was a string, not unicode, due to missing unicode_literals statement.
2016-12-18 22:33:53 +01:00
Matthew Honnibal
2ef9d53117
Untested fix for issue #684: GloVe vectors hack should be inserted in English, not in spacy.load.
2016-12-18 22:29:31 +01:00
Matthew Honnibal
c065359459
Fix path-override bug in spacy.load
2016-12-18 22:15:29 +01:00
Matthew Honnibal
813249f826
Work on morphology class. Still not fully consistent with rest of library.
2016-12-18 17:35:22 +01:00
Matthew Honnibal
3679fb43a3
Fix loading of lemmatizer
2016-12-18 17:34:09 +01:00
Matthew Honnibal
3980f1b0cb
Ignore more morphology attributes in deprecated mode of intify_attrs
2016-12-18 17:33:46 +01:00
Matthew Honnibal
7a98ee5e5a
Merge language data change
2016-12-18 17:03:52 +01:00
Matthew Honnibal
e4c951c153
Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data
2016-12-18 17:01:08 +01:00
Ines Montani
b99d683a93
Fix formatting
2016-12-18 16:58:28 +01:00
Ines Montani
b11d8cd3db
Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data
2016-12-18 16:57:12 +01:00
Ines Montani
d1c1d3f9cd
Fix tokenizer test
2016-12-18 16:55:32 +01:00
Ines Montani
753068f1d5
Use base language data as default
2016-12-18 16:55:25 +01:00
Ines Montani
bcc1d50d09
Remove trailing whitespace
2016-12-18 16:54:52 +01:00
Ines Montani
4e95737c6c
Add base tag map
2016-12-18 16:54:28 +01:00
Ines Montani
2b2ea8ca11
Reorganise language data
2016-12-18 16:54:19 +01:00
Matthew Honnibal
1b31c05bf8
Whitespace
2016-12-18 16:51:40 +01:00
Matthew Honnibal
bdcecb3c96
Add import in regression test
2016-12-18 16:51:31 +01:00
Matthew Honnibal
6ee1df93c5
Set tag_map to None if it's not seen in the data by vocab
2016-12-18 16:51:10 +01:00
Matthew Honnibal
33996e770b
Update header for morphology class
2016-12-18 16:50:42 +01:00
Matthew Honnibal
d58187ffa7
Filter out morphology keys in deprecated attrs
2016-12-18 16:50:26 +01:00
Matthew Honnibal
837a5d4100
Update morphology class so that exceptions can be added one-by-one, and so that arbitrary attributes can be referenced.
2016-12-18 16:49:46 +01:00
Matthew Honnibal
44f4f008bd
Wire up lemmatizer rules for English
2016-12-18 15:50:09 +01:00
Matthew Honnibal
e6fc4afb04
Whitespace
2016-12-18 15:48:00 +01:00
Ines Montani
32b36c3882
Break language data components into their own files
2016-12-18 15:40:22 +01:00
Ines Montani
1bff59a8db
Update English language data
2016-12-18 15:36:53 +01:00
Ines Montani
2eb163c5dd
Add lemma rules
2016-12-18 15:36:53 +01:00
Ines Montani
29ad8143d8
Add morph rules
2016-12-18 15:36:53 +01:00
Ines Montani
bc40dad7d9
Add entity rules
2016-12-18 15:36:53 +01:00
Ines Montani
eaa3b1319d
Fix formatting
2016-12-18 15:36:53 +01:00
Ines Montani
704c7442e0
Break language data components into their own files
2016-12-18 15:36:53 +01:00
Ines Montani
62655fd36f
Add ENT_ID constant
2016-12-18 15:36:53 +01:00
Matthew Honnibal
fa272fdf12
Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data
2016-12-18 15:00:21 +01:00
Matthew Honnibal
57c4341453
Refactor loading of morphology exceptions, adding a method add_special_case.
2016-12-18 14:59:44 +01:00
Ines Montani
77cf2fb0f6
Remove unnecessary argument in test
2016-12-18 14:06:27 +01:00
Ines Montani
121c310566
Remove trailing whitespace
2016-12-18 14:06:27 +01:00
Ines Montani
0fc4e45cb3
Fix tag map for German
2016-12-18 13:30:03 +01:00
Ines Montani
28326649f3
Fix typo
2016-12-18 13:30:03 +01:00
Matthew Honnibal
0595cc0635
Change test595 to mock data, instead of requiring model.
2016-12-18 13:28:51 +01:00
Matthew Honnibal
a4eb5c2bff
Check POS key in lemmatizer, to update it for new data format
2016-12-18 13:28:20 +01:00
Matthew Honnibal
28d63ec58e
Restore missing '' character in tokenizer exceptions.
2016-12-18 05:34:51 +01:00
Ines Montani
a9421652c9
Remove duplicates in tag map
2016-12-17 22:44:31 +01:00
Ines Montani
69baf1c9a8
Fix tag map
2016-12-17 22:44:22 +01:00
Ines Montani
577adad945
Fix formatting
2016-12-17 14:00:52 +01:00
Ines Montani
fc4ad17136
Fix typo
2016-12-17 14:00:47 +01:00
Ines Montani
bb94e784dc
Fix typo
2016-12-17 13:59:30 +01:00
Ines Montani
afda532595
Use symbols in tag map
2016-12-17 13:56:24 +01:00
Ines Montani
07249145c9
Fix formatting
2016-12-17 13:34:46 +01:00
Ines Montani
dd55d085b6
Reformat Dutch language data to match new style
2016-12-17 13:26:01 +01:00
Ines Montani
f2c48ef504
Resolve stopwords conflict to merge Dutch
2016-12-17 13:08:16 +01:00
Matthew Honnibal
ff03ade08f
Merge pull request #688 from nlesc-sherlock/dutch
...
Support for Dutch in SpaCy
2016-12-17 22:44:58 +11:00
Ines Montani
a22322187f
Add missing lemmas to tokenizer exceptions (fixes #674)
2016-12-17 12:42:41 +01:00
Ines Montani
5445074cbd
Expand tokenizer exceptions with unicode apostrophe (fixes #685)
2016-12-17 12:34:08 +01:00
Ines Montani
e0a7b5c612
Fix formatting
2016-12-17 12:33:09 +01:00
Ines Montani
08162dce67
Move shared functions and constants to global language data
2016-12-17 12:32:48 +01:00
Ines Montani
6a60a61086
Move update_exc to global language data utils
2016-12-17 12:29:02 +01:00
Ines Montani
f324311249
Add global language data utils
2016-12-17 12:27:41 +01:00
Ines Montani
487ce1e20a
Add encoding declaration
2016-12-17 12:25:44 +01:00
Ines Montani
d8d50a0334
Add tokenizer exception for "gonna" (fixes #691)
2016-12-17 11:59:28 +01:00
Ines Montani
c69b77d8aa
Revert "Add exception for "gonna""
...
This reverts commit 280c03f67b.
2016-12-17 11:56:44 +01:00
Ines Montani
280c03f67b
Add exception for "gonna"
2016-12-17 11:54:59 +01:00
Ines Montani
5031a015e2
Fix typo in stopwords (fixes #689)
2016-12-15 17:57:06 +01:00
Janneke van der Zwaan
4a3fdcce8a
Merge github.com:explosion/spaCy into dutch
2016-12-13 09:25:23 +01:00
Matthew Honnibal
5965d3c2a7
Revert "Add acl to symbols.pyx"
2016-12-12 10:10:28 +11:00
Matthew Honnibal
6dee76dfed
Update symbols.pxd
2016-12-12 10:09:58 +11:00
Pokey Rule
18a15c0777
Add acl to symbols.pyx
2016-12-11 20:00:07 +00:00
Gyorgy Orosz
0cf2144d24
Adding partial hyphen and quote handling support.
2016-12-11 00:14:36 +01:00
Gyorgy Orosz
2051726fd3
Passing Hungarian abbrev tests.
2016-12-10 23:37:58 +01:00
Ines Montani
63024466a9
Add Portuguese stopwords
2016-12-08 20:45:07 +01:00
Ines Montani
7bfe2d4abc
Update Portuguese language data
2016-12-08 20:41:41 +01:00
Ines Montani
c0c5f31950
Remove unused data and download script
2016-12-08 20:39:49 +01:00
Ines Montani
0a6d529104
Remove unused data
2016-12-08 20:36:56 +01:00
Ines Montani
1b3b043660
Add French stopwords
2016-12-08 20:12:43 +01:00
Ines Montani
8863e504eb
Update French language data
2016-12-08 20:07:14 +01:00
Ines Montani
7cb9f51be6
Add Italian stopwords
2016-12-08 20:05:25 +01:00
Ines Montani
470a0e0bea
Update Italian language data
2016-12-08 19:52:18 +01:00
Ines Montani
1a284d342e
Add Spanish language data
2016-12-08 19:47:03 +01:00
Ines Montani
0c39654786
Remove unused import
2016-12-08 19:46:53 +01:00
Ines Montani
e47ee94761
Split punctuation into its own file
2016-12-08 19:46:43 +01:00
Ines Montani
70b51ed7c8
Remove time from German language data
2016-12-08 19:45:50 +01:00
Ines Montani
e8ae588be9
Add emoticons
2016-12-08 19:45:18 +01:00
Ines Montani
5908c0ed9f
Fix formatting
2016-12-08 19:45:11 +01:00
Ines Montani
311b30ab35
Reorganize exceptions for English and German
2016-12-08 13:58:32 +01:00
Ines Montani
66c7348cda
Add update_exc util function
2016-12-08 13:58:12 +01:00
Ines Montani
1256232fad
Fix formatting
2016-12-08 13:56:40 +01:00
Ines Montani
8e977cc71c
Fix formatting
2016-12-08 13:56:17 +01:00
Ines Montani
0176b99004
Fix formatting
2016-12-08 12:48:02 +01:00
Ines Montani
877f09218b
Add more custom rules for abbreviations
2016-12-08 12:47:01 +01:00
Gyorgy Orosz
0289b8ceaa
Additional abbreviation tests.
2016-12-08 12:17:44 +01:00
Gyorgy Orosz
90d22db023
Added Hungarian resource files.
2016-12-08 12:06:36 +01:00
Ines Montani
bfaa42636c
Update language data for German
2016-12-08 12:01:09 +01:00
Ines Montani
ec44bee321
Fix capitalization on morphological features
2016-12-08 12:00:54 +01:00
Gyorgy Orosz
5b00039955
First steps towards the Hungarian tokenizer code.
2016-12-07 23:07:43 +01:00
Ines Montani
ce979553df
Resolve conflict
2016-12-07 21:16:52 +01:00
Ines Montani
8350d65695
Change morphology and lemmatizer API
...
Take morphology features as object instead of keyword arguments
2016-12-07 21:12:49 +01:00
Ines Montani
52e7d634df
Remove trailing whitespace
2016-12-07 21:12:19 +01:00
Ines Montani
0d07d7fc80
Apply emoticon exceptions to tokenizer
2016-12-07 21:11:59 +01:00
Ines Montani
71f0f34cb3
Fix formatting
2016-12-07 21:11:29 +01:00
Ines Montani
9413bcd9ee
Declare encoding and unicode literals
2016-12-07 21:10:34 +01:00
Ines Montani
a280ff2657
Fix __all__
2016-12-07 21:10:12 +01:00
Ines Montani
ba8721953c
Add missing emoticons
2016-12-07 21:09:44 +01:00
Ines Montani
1285c4ba93
Update English language data
2016-12-07 20:33:28 +01:00
Ines Montani
79dce0aabe
Add emoticons
2016-12-07 20:33:28 +01:00
Ines Montani
a662a95294
Add line breaks
2016-12-07 20:33:28 +01:00
Ines Montani
07f0efb102
Add test for tokenizer regular expressions
2016-12-07 20:33:28 +01:00
Ines Montani
e0712d1b32
Reformat language data
2016-12-07 20:33:28 +01:00
Matthew Honnibal
0c0f4c965d
Increment version
2016-12-03 11:16:52 +01:00
Matthew Honnibal
f6e356aada
Add (and test) Span.sentiment attribute. By default we average token.span, but can override with custom hook. Re Issue #667
2016-12-02 11:05:50 +01:00
Janneke van der Zwaan
88869e0e07
Merge github.com:explosion/spaCy into dutch
2016-11-30 17:13:39 +01:00
Janneke van der Zwaan
51ade86b86
Update language data with tag map from UD_Dutch
2016-11-30 14:41:23 +01:00
Janneke van der Zwaan
90f6ff12c9
Update Dutch language data
...
- Use Dutch tag map
- remove tokenizer exceptions
2016-11-30 11:59:39 +01:00
dafnevk
7b8f4c49f2
Added language Dutch to init file
2016-11-29 16:42:05 +01:00
Matthew Honnibal
296d33a4fc
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-26 12:36:18 +01:00
Matthew Honnibal
1f6c37c6f5
Fix create_tokenizer when nlp is None
2016-11-26 12:36:04 +01:00
Matthew Honnibal
c7889492f9
Fix model saving error for Python 3
2016-11-25 18:04:30 -06:00
Matthew Honnibal
bc0a202c9c
Fix unicode problem in nonproj module
2016-11-25 17:29:17 -06:00
Matthew Honnibal
6dd3b94fa6
Filter out deprecated attributes when reading special-case tokenization rules.
2016-11-25 09:57:18 -06:00
Matthew Honnibal
e879c79b8c
Merge branch 'master' of https://github.com/explosion/spaCy
2016-11-25 09:18:28 -06:00
Matthew Honnibal
a335c6dcc2
Exclude morphs from deprecated token attributes for now
2016-11-25 16:17:32 +01:00
Matthew Honnibal
f799a07f25
Merge branch 'master' of https://github.com/explosion/spaCy
2016-11-25 09:16:43 -06:00
Matthew Honnibal
159e8c46e1
Merge old training fixes with newer state
2016-11-25 09:16:36 -06:00
Matthew Honnibal
846e80f2f4
Exclude morphs from deprecated token attributes for now
2016-11-25 16:14:54 +01:00
Matthew Honnibal
664f2dd1c0
Allow dep to be None in scorer, for missing labels.
2016-11-25 09:02:49 -06:00
Matthew Honnibal
39341598bb
Fix NER label calculation
2016-11-25 09:02:22 -06:00
Matthew Honnibal
ca773a1f53
Tweak arc_eager n_gold to deal with negative costs, and improve error message.
2016-11-25 09:01:52 -06:00
Matthew Honnibal
a2f55e7015
Pass cfg through loading, for training.
2016-11-25 09:01:20 -06:00
Matthew Honnibal
608d8f5421
Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state
2016-11-25 09:00:21 -06:00
Matthew Honnibal
cc7e607a8a
Fix gold.pyx for 1.0
2016-11-25 08:57:59 -06:00
root
080d29e092
Fix train.py for 1.0
2016-11-25 08:55:33 -06:00
Matthew Honnibal
6652f2a135
Test #656, #624: special case rules for tokenizer with attributes.
2016-11-25 12:44:13 +01:00
Matthew Honnibal
1e0f566d95
Fix #656, #624: Support arbitrary token attributes when adding special-case rules.
2016-11-25 12:43:24 +01:00
Matthew Honnibal
87613edf8f
Add set_struct_attr staticmethod to token
2016-11-25 12:41:47 +01:00
Matthew Honnibal
fb69aa648f
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-25 11:35:44 +01:00
Matthew Honnibal
9a03a3f85e
Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr.
2016-11-25 11:35:17 +01:00
Matthew Honnibal
53d8ca8f51
Add spacy.attrs.intify_attrs function, to normalize strings in token attribute dictionaries.
2016-11-25 11:34:30 +01:00
Ines Montani
d21ad01840
Add emoticons
2016-11-24 19:13:00 +01:00
dafnevk
d8c7ac203a
Added nl module for Dutch
2016-11-24 16:39:49 +01:00
dafnevk
3db8b0d322
Added language class and some language data (with some TODOs) for Dutch
2016-11-24 15:56:38 +01:00
Ines Montani
4dcfafde02
Add line breaks
2016-11-24 14:57:37 +01:00
Ines Montani
6247c005a2
Add test for tokenizer regular expressions
2016-11-24 13:51:59 +01:00
Ines Montani
de747e39e7
Reformat language data
2016-11-24 13:51:32 +01:00
Matthew Honnibal
b8c4f5ea76
Allow German noun chunks to work on Span
...
Update the German noun chunks iterator, so that it also works on Span objects.
2016-11-24 23:30:15 +11:00
Pokey Rule
3e3bda142d
Add noun_chunks to Span
2016-11-24 10:47:20 +00:00
Janneke van der Zwaan
83daade0e4
Add directory and initial (empty) files for language Dutch
2016-11-24 09:45:41 +01:00
Matthew Honnibal
09f68bc641
Fix Issue #639: stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored.
2016-11-24 00:13:55 +01:00
Matthew Honnibal
48e1dc29d4
Fix default path loading.
2016-11-23 23:48:55 +01:00
Matthew Honnibal
e01c1875ee
Work on test for #615
2016-11-23 23:48:41 +01:00
ExplodingCabbage
6c4f488e89
Fix syntax mistake
2016-11-23 15:12:45 +00:00
Matthew Honnibal
60eb2343ce
Only try to load vectors if they exist.
2016-11-23 13:50:24 +01:00
Matthew Honnibal
618ac36093
Fix use of path argument in Language.__init__. Needs to be keyword arg, not positional.
2016-11-23 13:26:34 +01:00
Mark Amery
fbe19680a6
Fix another bug related to Language.__init__'s path parameter
2016-11-20 20:31:34 +00:00
Mark Amery
b0a07c21a0
Fix path param of Language.__init__ always being ignored
...
There was an explicitly-declared `path` keyword argument, so 'path'
would never be present in `**overrides`. This line just overwrote
any manually-specified value the user might've passed to the `path`
parameter.
2016-11-20 16:29:57 +00:00
Mark Amery
1988fce389
Merge remote-tracking branch 'origin/master' into specify-data-path
2016-11-20 16:07:14 +00:00
Mark Amery
3871007c72
Let --data-path be specified when running download.py scripts
...
Resolves https://github.com/explosion/spaCy/issues/637
2016-11-20 15:48:04 +00:00
Ines Montani
dad2c6cae9
Strip trailing whitespace
2016-11-20 16:45:51 +01:00
Ines Montani
3082e49326
Update and reformat German stopwords
2016-11-20 16:45:26 +01:00
Sourav Singh
6745eac309
Update language_data.py
2016-11-20 19:52:02 +05:30
Sourav Singh
4d9aae7d6a
Add German Stopwords
2016-11-19 22:47:53 +05:30
Matthew Honnibal
7afb2544a7
Merge pull request #627 from sadovnychyi/patch-1
...
Remove duplicated line of vocab declaration
2016-11-16 06:09:18 +11:00
Yanhao
762169da29
Fixed bug: eg.guess is a tag id, rather than tag
2016-11-15 14:11:22 +08:00
Dmytro Sadovnychyi
e70a7050e1
Remove duplicated line of vocab declaration
...
As already declared on line 211.
2016-11-13 18:52:49 +08:00
Matthew Honnibal
f123f92e0c
Fix #617: Vocab.load() required Path. Should work with string as well.
2016-11-10 22:48:48 +01:00
Matthew Honnibal
e86f440ca6
Fix test for issue 617
2016-11-10 22:48:10 +01:00
Matthew Honnibal
faa7610c56
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-10 22:46:38 +01:00
Matthew Honnibal
a2c7de8329
spacy/tests/regression/test_issue617.py
...
Test Issue #617
2016-11-10 22:46:23 +01:00
tiago
2a3e342c1f
Added a test case to cover the span.merge returning values
2016-11-09 18:57:50 +00:00
tiago
b38cfd0ef9
now span.merge returns token like it says on documentation
2016-11-09 14:58:19 +00:00
Dmitry Sadovnychyi
9488222e79
Fix PhraseMatcher to work with updated Matcher
...
#613
2016-11-09 00:14:26 +08:00
Dmitry Sadovnychyi
86c056ba64
Add basic test for PhraseMatcher
...
#613
2016-11-09 00:10:32 +08:00
Matthew Honnibal
3ea15b257f
Fix test for 605
2016-11-06 11:59:26 +01:00
Matthew Honnibal
efe7790439
Test #590: Order dependence in Matcher rules.
2016-11-06 11:21:36 +01:00
Matthew Honnibal
5cd3acb265
Fix #605: Acceptor now rejects matches as expected.
2016-11-06 10:50:42 +01:00
Matthew Honnibal
75805397dd
Test Issue #605
2016-11-06 10:42:32 +01:00
Matthew Honnibal
014b6936ac
Fix #608 -- __version__ should be available at the base of the package.
2016-11-04 21:21:02 +01:00
Matthew Honnibal
42b0736db7
Increment version
2016-11-04 20:04:21 +01:00
Matthew Honnibal
9f93386994
Update version
2016-11-04 19:28:16 +01:00
Matthew Honnibal
1fb09c3dc1
Fix morphology tagger
2016-11-04 19:19:09 +01:00
Matthew Honnibal
a36353df47
Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.
2016-11-04 19:18:07 +01:00
Matthew Honnibal
f0917b6808
Fix Issue #376: and/or was tagged as a noun.
2016-11-04 15:21:28 +01:00
Matthew Honnibal
737816e86e
Fix #368: Tokenizer handled pattern 'unicode close quote, period' incorrectly.
2016-11-04 15:16:20 +01:00
Matthew Honnibal
ab952b4756
Fix #578 -- Sputnik had been purging all files on --force, not just the relevant one.
2016-11-04 10:44:11 +01:00
Matthew Honnibal
6e37ba1d82
Fix #602, #603 --- Broken build
2016-11-04 09:54:24 +01:00
Matthew Honnibal
293c79c09a
Fix #595: Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly.
2016-11-04 00:29:07 +01:00
Matthew Honnibal
e30348b331
Prefer to import from symbols instead of parts_of_speech
2016-11-04 00:27:55 +01:00
Matthew Honnibal
4a8a2b6001
Test #595 -- Bug in lemmatization of base forms.
2016-11-04 00:27:32 +01:00
Matthew Honnibal
f1605df2ec
Fix #588: Matcher should reject empty pattern.
2016-11-03 00:16:44 +01:00
Matthew Honnibal
72b9bd57ec
Test Issue #588: Matcher accepts invalid, empty patterns.
2016-11-03 00:09:35 +01:00
Matthew Honnibal
41a90a7fbb
Add tokenizer exception for 'Ph.D.', to fix 592.
2016-11-03 00:03:34 +01:00
Matthew Honnibal
532318e80b
Import Jieba inside zh.make_doc
2016-11-02 23:49:19 +01:00
Matthew Honnibal
f292f7f0e6
Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy.
2016-11-02 23:48:43 +01:00
Matthew Honnibal
b6b01d4680
Remove deprecated tokens_from_list test.
2016-11-02 23:47:21 +01:00
Matthew Honnibal
3d6c79e595
Test Issue #599: .is_tagged and .is_parsed attributes not reflected after deserialization for empty documents.
2016-11-02 23:40:11 +01:00
Matthew Honnibal
05a8b752a2
Fix Issue #600: Missing setters for Token attribute.
2016-11-02 23:28:59 +01:00
Matthew Honnibal
125c910a8d
Test Issue #600
2016-11-02 23:24:13 +01:00
Matthew Honnibal
e0c9695615
Fix doc strings for tokenizer
2016-11-02 23:15:39 +01:00
Matthew Honnibal
80824f6d29
Fix test
2016-11-02 20:48:40 +01:00
Matthew Honnibal
dbe47902bc
Add import fr
2016-11-02 20:48:29 +01:00
Matthew Honnibal
8f24dc1982
Fix infixes in Italian
2016-11-02 20:43:52 +01:00
Matthew Honnibal
41a4766c1c
Fix infixes in Spanish and Portuguese
2016-11-02 20:43:12 +01:00
Matthew Honnibal
3d4bd96e8a
Fix infixes in French
2016-11-02 20:41:43 +01:00
Matthew Honnibal
c09a8ce5bb
Add test for French tokenizer
2016-11-02 20:40:31 +01:00
Matthew Honnibal
b012ae3044
Add test for loading languages
2016-11-02 20:38:48 +01:00
Matthew Honnibal
ad1c747c6b
Fix stray POS in language stubs
2016-11-02 20:37:55 +01:00
Matthew Honnibal
e9e6fce576
Handle null prefix/suffix/infix search in tokenizer
2016-11-02 20:35:48 +01:00
Matthew Honnibal
22647c2423
Check that patterns aren't null before compiling regex for tokenizer
2016-11-02 20:35:29 +01:00
Matthew Honnibal
5ac735df33
Link languages in __init__.py
2016-11-02 20:05:14 +01:00
Matthew Honnibal
c68dfe2965
Stub out support for Italian
2016-11-02 20:03:24 +01:00
Matthew Honnibal
6dbf4f7ad7
Stub out support for French, Spanish, Italian and Portuguese
2016-11-02 20:02:41 +01:00
Matthew Honnibal
6b8b05ef83
Specify that spacy.util is encoded in utf8
2016-11-02 19:58:00 +01:00
Matthew Honnibal
5363224395
Add draft Jieba tokenizer for Chinese
2016-11-02 19:57:38 +01:00
Matthew Honnibal
f7fee6c24b
Check for class-defined make_docs method before assigning one provided as an argument
2016-11-02 19:57:13 +01:00
Matthew Honnibal
19c1e83d3d
Work on draft Italian tokenizer
2016-11-02 19:56:32 +01:00
Matthew Honnibal
9efe568177
Add missing unicode_literals to spacy.util. I think this was messing up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596
2016-11-02 12:31:34 +01:00
Matthew Honnibal
d8db648ebf
Add __init__.py file for regression tests
2016-11-01 13:45:06 +01:00
Matthew Honnibal
11664b9f20
Fix variable error in token
2016-11-01 13:28:00 +01:00
Matthew Honnibal
8c4d1b46ce
Fix variable error in Span
2016-11-01 13:27:44 +01:00
Matthew Honnibal
e7af6b937f
Fix syntax error while fixing doc strings
2016-11-01 13:27:32 +01:00
Matthew Honnibal
62fc6b1afa
Use 32 bit hashes for OOV, re Issue #589, Issue #285
2016-11-01 13:27:13 +01:00
Matthew Honnibal
6977a2b8cd
Add test for Issue #589
2016-11-01 12:33:36 +01:00
Matthew Honnibal
b86f8af0c1
Fix doc strings
2016-11-01 12:25:36 +01:00
Matthew Honnibal
d563f1eadb
Fix Issue #587: Segfault in Matcher, due to simple error in the state machine.
2016-10-28 17:42:00 +02:00
Matthew Honnibal
7e5f63a595
Improve test slightly
2016-10-28 17:41:16 +02:00
Matthew Honnibal
782e4814f4
Test Issue #587: Matcher segfaults on particular input
2016-10-28 16:38:32 +02:00
Matthew Honnibal
708ea22208
Infer types in transition_system.pyx
2016-10-27 18:08:13 +02:00
Matthew Honnibal
18590eba94
Fix training evaluate method
2016-10-27 18:02:19 +02:00
Matthew Honnibal
301f3cc898
Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found.
2016-10-27 18:01:55 +02:00
Matthew Honnibal
afea6505f3
Test Issue 429: No valid actions for NER after matcher adds a new entity label.
2016-10-27 18:01:34 +02:00
Matthew Honnibal
03a520ec4f
Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state.
2016-10-27 17:58:56 +02:00
Matthew Honnibal
6c47048912
Fix test, after IOB tweak.
2016-10-26 17:22:03 +02:00
Matthew Honnibal
4ca31b4d87
Fix clobbering of 'missing' named ent values after assigning ents.
2016-10-26 13:13:56 +02:00
Matthew Honnibal
cb49189477
Remove dead code
2016-10-26 13:11:07 +02:00
Matthew Honnibal
a209b10579
Improve error message when oracle fails for non-projective trees, re Issue #571.
2016-10-24 20:31:30 +02:00
Matthew Honnibal
b2d43b93d2
Fix Python 3 basestring error
2016-10-24 14:22:51 +02:00
Matthew Honnibal
276478fe0f
Update strings.pxd
2016-10-24 14:00:35 +02:00
Matthew Honnibal
d8134817ff
Workaround Issue #285: Allow the StringStore to be 'frozen', in which case strings will be pushed into an OOV map. We can then flush this OOV map, freeing all of the OOV strings.
2016-10-24 13:49:03 +02:00
Matthew Honnibal
d3a617aa99
Test workaround for Issue #285: Streaming data memory growth
2016-10-24 13:48:06 +02:00
Matthew Honnibal
64e5f02cf7
Update test
2016-10-23 21:08:07 +02:00
Matthew Honnibal
66d7a6eca2
Update test
2016-10-23 21:02:05 +02:00
Matthew Honnibal
90bf797125
Update test
2016-10-23 20:54:17 +02:00
Matthew Honnibal
5e76320ffe
Update test
2016-10-23 20:44:54 +02:00
Matthew Honnibal
aa105927f3
Update test
2016-10-23 20:31:25 +02:00
Matthew Honnibal
6b9237aa83
Increment version
2016-10-23 20:22:53 +02:00
Matthew Honnibal
150e02d72e
Fix Issue #566
2016-10-23 20:19:01 +02:00
Matthew Honnibal
e120561294
Fix vector_norm test.
2016-10-23 19:56:16 +02:00
Matthew Honnibal
fefde8aef8
Make installation print data path.
2016-10-23 19:46:44 +02:00
Matthew Honnibal
e7414cd064
Try to fix weird install glitch.
2016-10-23 19:46:28 +02:00
Matthew Honnibal
90f7544edd
Increment version
2016-10-23 19:43:06 +02:00
Matthew Honnibal
6036ec7c77
Fix vector norm when loading lexemes.
2016-10-23 19:40:18 +02:00
Matthew Honnibal
c05cd2356e
Fix similarity test for Python 3
2016-10-23 18:16:56 +02:00
Matthew Honnibal
3e688e6d4b
Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.
2016-10-23 17:45:44 +02:00
Matthew Honnibal
79aa03fe98
Test Issue #514: Serializer fails when new entity type has been added.
2016-10-23 17:41:44 +02:00
Matthew Honnibal
f97548c6f1
Fix broken test, re Issue #461
2016-10-23 17:02:23 +02:00
Matthew Honnibal
4de30a8e38
Test Issue #514: Serialization fails after adding a new entity label.
2016-10-23 16:40:27 +02:00
Matthew Honnibal
936e6246aa
Fix Issue #459 -- failed to deserialize empty doc.
2016-10-23 16:31:05 +02:00
Matthew Honnibal
e99b3f5322
Test Issue #459: Fail to deserialize empty doc
2016-10-23 16:30:22 +02:00
Matthew Honnibal
49c117960c
Fix bug where huffman codec died if given empty freqs dict.
2016-10-23 16:28:05 +02:00
Matthew Honnibal
99ff8b902f
Test that huffman codec works with empty freqs dict
2016-10-23 16:27:45 +02:00
Matthew Honnibal
15c9b59f0e
Fix Issue #461: O tag was being clobbered by doc.ents.__set__
2016-10-23 15:50:26 +02:00
Matthew Honnibal
e5627134d9
Test Issue #461: ent_iob tag incorrect after setting entities.
2016-10-23 15:50:04 +02:00
Matthew Honnibal
f62088d646
Fix compile error
2016-10-23 14:50:50 +02:00
Matthew Honnibal
2c3a67b693
Fix calculation of vector norm, re Issue #522. Need to consolidate the calculations into a helper function.
2016-10-23 14:49:31 +02:00
Matthew Honnibal
a0a4ada42a
Fix calculation of L2-norm for Lexeme
2016-10-23 14:44:45 +02:00
Matthew Honnibal
2989072aac
Add tests to verify that Issue #442 is fixed in 1.1
2016-10-23 14:33:13 +02:00
Matthew Honnibal
739213a8af
Fix create_pipeline keyword argument.
2016-10-23 14:24:16 +02:00
Matthew Honnibal
bea44bd3c4
Fix vector_norm when vector is assigned to Lexeme.
2016-10-23 14:23:56 +02:00
Matthew Honnibal
e838b6d53f
Add tests for using the new Entity ID tracking in the rule matcher
2016-10-23 14:04:01 +02:00
Matthew Honnibal
e7af75e0a9
Add test for vector resizing, re Issue #544
2016-10-21 17:07:21 +02:00
Matthew Honnibal
ca8ea33abc
Bump version to 1.1.0
2016-10-21 16:30:57 +02:00
Matthew Honnibal
7ab03050d4
Add resize_vectors method to Vocab
2016-10-21 01:44:50 +02:00
Matthew Honnibal
8ce8803824
Fix JSON in tokenizer
2016-10-21 01:44:20 +02:00
Matthew Honnibal
6eb73a095f
Fix JSON in tagger
2016-10-21 01:44:10 +02:00
Matthew Honnibal
e16e78a737
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-10-21 00:00:15 +02:00
Matthew Honnibal
147373c807
Increment version
2016-10-21 00:00:03 +02:00
Matthew Honnibal
e80944276f
Fix Span.vector_norm
2016-10-20 21:58:56 +02:00
Matthew Honnibal
f5fe4f595b
Fix json loading, for Python 3.
2016-10-20 21:23:26 +02:00
Matthew Honnibal
2e92c6fb3a
Fix JSON encoding issue on load
2016-10-20 21:06:48 +02:00
Matthew Honnibal
4ad7bb96c9
Increment version.
2016-10-20 20:48:30 +02:00
Matthew Honnibal
5ec32f5d97
Fix loading of GloVe vectors, to address Issue #541
2016-10-20 18:27:48 +02:00
Matthew Honnibal
ddeabd76c4
Fix mistake loading GloVe vectors. GloVe vectors now loaded by default if present, as promised.
2016-10-20 16:57:53 +02:00
Matthew Honnibal
bfe5cb1244
Increment version.
2016-10-20 14:52:00 +02:00
Matthew Honnibal
f189a3cb00
Fix encoding when opening files in Python 2.7, re Issue #539
2016-10-20 14:42:56 +02:00
Matthew Honnibal
c353a5214d
Increment version
2016-10-19 23:51:01 +02:00
Matthew Honnibal
d10c17f2a4
Fix Issue #536: oov_prob was 0 for OOV words.
2016-10-19 23:38:47 +02:00
Matthew Honnibal
dfa752d064
Increment version
2016-10-19 23:19:13 +02:00
Matthew Honnibal
3588a18fb8
Fix hook names in doc
2016-10-19 21:15:16 +02:00
Matthew Honnibal
5d5742b773
Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc.
2016-10-19 20:54:22 +02:00
Matthew Honnibal
ed5e178817
Add sentiment property on lexeme object
2016-10-19 20:52:52 +02:00
Matthew Honnibal
d4aaf2752c
Fix issue #535: Pipeline elements added even when data not installed.
2016-10-19 19:55:19 +02:00
Matthew Honnibal
04d1c959da
Fix version
2016-10-19 03:45:37 +02:00
Matthew Honnibal
d35aa7344e
Change version ID to make PyPi happy
2016-10-19 03:24:39 +02:00
Matthew Honnibal
89d2a5c8b3
Increment build version.
2016-10-19 03:05:17 +02:00
Matthew Honnibal
622b0a9674
Tweak download script
2016-10-19 00:52:16 +02:00
Matthew Honnibal
5a5c7192a5
Fix download.py for GloVe vectors.
2016-10-19 00:47:44 +02:00
Matthew Honnibal
edc45c19d6
Update download script
2016-10-19 00:41:14 +02:00
Matthew Honnibal
2bbb050500
Fix default of serializer_freqs
2016-10-18 19:55:41 +02:00
Matthew Honnibal
1b651db9c5
Fix parser creation in Language class.
2016-10-18 19:36:44 +02:00
Matthew Honnibal
45a6f9b9c7
Fix loading of tagger.
2016-10-18 19:33:04 +02:00
Matthew Honnibal
76c815f40d
Fix spacy.load
2016-10-18 19:23:31 +02:00
Matthew Honnibal
8c8f5c62c6
Add LANG attribute to English and German
2016-10-18 18:52:48 +02:00
Matthew Honnibal
05e2a589a4
Fix None label in matcher
2016-10-18 18:05:21 +02:00
Matthew Honnibal
c3a8a1cf51
Update serializer test.
2016-10-18 16:18:46 +02:00
Matthew Honnibal
7d5212f131
Refactor defaults
2016-10-18 16:18:25 +02:00
Matthew Honnibal
a45a9d5092
Remove stray .tensor attribute from Lexeme
2016-10-18 01:16:32 +02:00
Matthew Honnibal
9258db788a
Revert "Have the matcher return character offsets, to handle the match better."
This reverts commit 049c937540.
2016-10-17 16:49:51 +02:00
Matthew Honnibal
7d446e5094
Revert "Update matcher test, to reflect character offset return instead of token offset."
This reverts commit f8d3e3bcfe.
2016-10-17 16:49:49 +02:00
Matthew Honnibal
4bf2c53c13
Revert "Hack on matcher tests, for new implementation."
This reverts commit dbe60644ab.
2016-10-17 16:49:48 +02:00
Matthew Honnibal
2fd97c71cc
Revert "Don't try to pickle matcher."
This reverts commit 97bd0c9d00.
2016-10-17 16:49:43 +02:00
Matthew Honnibal
97bd0c9d00
Don't try to pickle matcher.
2016-10-17 16:38:40 +02:00
Matthew Honnibal
dbe60644ab
Hack on matcher tests, for new implementation.
2016-10-17 16:12:22 +02:00
Matthew Honnibal
f8d3e3bcfe
Update matcher test, to reflect character offset return instead of token offset.
2016-10-17 16:00:10 +02:00
Matthew Honnibal
049c937540
Have the matcher return character offsets, to handle the match better.
2016-10-17 15:58:57 +02:00
Matthew Honnibal
9b60186266
Fix doc class
2016-10-17 15:23:47 +02:00
Matthew Honnibal
6cbdc94959
Lots of updates to Matcher, to make entity handling sane.
2016-10-17 15:23:31 +02:00
Matthew Honnibal
7fd98fc91c
Remove deprecation shim around str/bytes in Token.
2016-10-17 14:02:47 +02:00
Matthew Honnibal
b67697a97b
Improve API for doc.merge() and span.merge(), to use keyword arguments.
2016-10-17 14:02:13 +02:00
Matthew Honnibal
fbb7f3f15c
Add user_data attribute to Doc object.
2016-10-17 11:43:22 +02:00
Matthew Honnibal
c1abc8f6ed
Fix deprecation stuff in Token: Remove the shim for the str/unicode semantics, and raise for has_repvec and repvec
2016-10-17 11:18:41 +02:00
Matthew Honnibal
4ba9eadf3d
Merge branch 'v1.0.0-rc1' of ssh://github.com/explosion/spaCy into v1.0.0-rc1
2016-10-17 02:45:44 +02:00
Matthew Honnibal
09ab447a18
Remove tensor property from token.
2016-10-17 02:45:09 +02:00
Matthew Honnibal
5d10e2005c
Defer some attributes to Doc, via getters_for_tokens attribute.
2016-10-17 02:44:49 +02:00
Matthew Honnibal
8829984efb
Remove tensor attribute from Span and Token.
2016-10-17 02:44:04 +02:00
Matthew Honnibal
d15a88c66a
Defer some attributes to Doc via getters_for_spans
2016-10-17 02:43:35 +02:00
Matthew Honnibal
62230dd13a
Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring
2016-10-17 02:42:51 +02:00
Matthew Honnibal
ae11ea8240
Add getters_for_tokens and getters_for_spans attributes to Doc object.
2016-10-17 02:42:05 +02:00
Matthew Honnibal
be48a7b4f3
Fix conftest for website tests.
2016-10-17 01:54:26 +02:00
Matthew Honnibal
8951bf6989
Update matcher tests
2016-10-17 01:53:24 +02:00