Commit Graph

2993 Commits

Author SHA1 Message Date
Matthew Honnibal
35458987e8 Checkpoint -- nearly finished reimpl 2017-05-07 23:05:01 +02:00
Matthew Honnibal
4441866f55 Checkpoint -- nearly finished reimpl 2017-05-07 22:47:06 +02:00
Matthew Honnibal
6782eedf9b Tmp GPU code 2017-05-07 11:04:24 -05:00
Matthew Honnibal
e420e5a809 Tmp 2017-05-07 07:31:09 -05:00
Matthew Honnibal
12039e80ca Switch to single matmul for state layer 2017-05-07 14:26:34 +02:00
Matthew Honnibal
700979fb3c CPU/GPU compat 2017-05-07 04:01:11 +02:00
Matthew Honnibal
f99f5b75dc working residual net 2017-05-07 03:57:26 +02:00
Matthew Honnibal
bdf2dba9fb WIP on refactor, with hidde pre-computing 2017-05-07 02:02:43 +02:00
Matthew Honnibal
b439e04f8d Learning smoothly 2017-05-06 20:38:12 +02:00
Matthew Honnibal
08bee76790 Learns things 2017-05-06 18:24:38 +02:00
Matthew Honnibal
04ae1c01f1 Learns things 2017-05-06 18:21:02 +02:00
Matthew Honnibal
bcf4cd0a5f Learns things 2017-05-06 17:37:36 +02:00
Matthew Honnibal
8e48b58cd6 Gradients look correct 2017-05-06 16:47:15 +02:00
Matthew Honnibal
7e04260d38 Data running through, likely errors in model 2017-05-06 14:22:20 +02:00
Matthew Honnibal
fa7c1990b6 Restore tok2vec function 2017-05-05 20:12:03 +02:00
Matthew Honnibal
efe9630e1c Bug fixes 2017-05-05 20:09:50 +02:00
Matthew Honnibal
ef4fa594aa Draft of NN parser, to be tested 2017-05-05 19:20:39 +02:00
Matthew Honnibal
7d1df50aec Draft up Parser model 2017-05-04 13:31:40 +02:00
Matthew Honnibal
ccaf26206b Pseudocode for parser 2017-05-04 12:17:59 +02:00
ines
b1f22c5a10 Fix formatting 2017-05-03 20:11:02 +02:00
ines
a04b5be1b2 Add glossary for annotation scheme (closes #1034)
Can be imported as explain from spacy.glossary, or called as
spacy.explain(term)
2017-05-03 17:02:17 +02:00
Gregory Howard
929f2792a7 Rennaming cls in module. cls is now a class 2017-05-03 15:41:07 +02:00
Gregory Howard
0e8c41ea4f Adding method lemmatizer for every class 2017-05-03 12:14:42 +02:00
Gregory Howard
32ca07989e adding export japanese 2017-05-03 11:07:29 +02:00
Grégory Howard
f9d7144224 Merge branch 'master' into master 2017-05-03 11:04:51 +02:00
Gregory Howard
f2ab7d77b4 Lazy imports language 2017-05-03 11:01:42 +02:00
Ines Montani
3ea23a3f4d Fix formatting 2017-05-03 09:44:38 +02:00
Ines Montani
d730eb0c0d Raise custom ImportError if importing janome fails 2017-05-03 09:43:29 +02:00
Ines Montani
949ad6594b Add newline 2017-05-03 09:38:43 +02:00
Ines Montani
d12ca587ea Add newline 2017-05-03 09:38:29 +02:00
Ines Montani
8676cd0135 Add newline 2017-05-03 09:38:07 +02:00
Yasuaki Uechi
c8f83aeb87 Add basic japanese support 2017-05-03 13:56:21 +09:00
Gregory Howard
c0afcd22bb Merge remote-tracking branch 'remotes/upstream/master' 2017-04-27 14:42:54 +02:00
Matthew Honnibal
31ec9e1371 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-27 13:21:39 +02:00
Matthew Honnibal
2da16adcc2 Add dropout optin for parser and NER
Dropout can now be specified in the `Parser.update()` method via
the `drop` keyword argument, e.g.

    nlp.entity.update(doc, gold, drop=0.4)

This will randomly drop 40% of features, and multiply the value of the
others by 1. / 0.4. This may be useful for generalising from small data
sets.

This commit also patches the examples/training/train_new_entity_type.py
example, to use dropout and fix the output (previously it did not output
the learned entity).
2017-04-27 13:18:39 +02:00
Gregory Howard
92f368f83b Removing extra spaces 2017-04-27 12:02:14 +02:00
Gregory Howard
13b6957c8e Adding unitest for tokenization in french (with title) 2017-04-27 11:53:44 +02:00
Gregory Howard
8ff4682255 correcting tokenizer exception.
Adding tests for lemmatization
2017-04-27 11:52:14 +02:00
Ines Montani
7da9cefd25 Merge pull request #1022 from luvogels/master
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Ines Montani
c9e592ae6c Add newline 2017-04-27 11:15:41 +02:00
Ines Montani
5942adccc2 Add newline 2017-04-27 11:15:19 +02:00
Ines Montani
4cd9269aef Add newline 2017-04-27 11:15:04 +02:00
Ines Montani
ccf13ecc21 Add newline 2017-04-27 11:14:42 +02:00
Ines Montani
03d2b0cc05 Add newline 2017-04-27 11:14:26 +02:00
Gregory Howard
44cb486849 Adding unitest for tokenization in french (with title) 2017-04-27 10:59:38 +02:00
Gregory Howard
ad8129cb45 Improvement of rules now title insentive and have same declaration format 2017-04-27 10:23:56 +02:00
luvogels
d12a0b6431 Hooked up tokenizer tests 2017-04-26 23:21:41 +02:00
Matthew Honnibal
f0e1606d27 Increment version 2017-04-26 20:25:41 +02:00
luvogels
b331929a7e Merge branch 'master' of https://github.com/luvogels/spaCy 2017-04-26 19:15:48 +02:00
luvogels
8de59ce3b9 Added tokenizer tests 2017-04-26 19:10:18 +02:00
Matthew Honnibal
4d98511db7 Make Span hashable. Closes #1019 2017-04-26 19:01:05 +02:00
Matthew Honnibal
24c4c51f13 Try to make test999 less flakey 2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang
460094bf09 Update __init__.py 2017-04-26 18:27:55 +02:00
ines
527d51ac9a Fetch shortcuts from GitHub and improve error handling 2017-04-26 18:00:28 +02:00
Gregory Howard
ed5f094451 Adding insensitive lemmatisation test 2017-04-25 18:07:02 +02:00
ghoward
26e31afc18 renamming tests 2017-04-25 17:46:01 +02:00
ghoward
c085c2d391 Adding some unitests 2017-04-25 17:44:16 +02:00
ghoward
55c6910f90 Look_up table for languages in spacy.
Need to find an another name for lemmatizerlookup. I was not inspired.
Trying to uses new files in fr language.
2017-04-24 16:39:00 +02:00
Matthew Honnibal
c4be9c36fe Fix unicode header in tests 2017-04-24 10:09:01 +02:00
Matthew Honnibal
65f10b53e5 Fix test 2017-04-24 00:25:55 +02:00
Matthew Honnibal
70a43858e1 Fix flakey test 2017-04-24 00:06:30 +02:00
Matthew Honnibal
3973af2d15 Make training test less flakey 2017-04-23 22:59:34 +02:00
Matthew Honnibal
4f9657b42b Fix reporting if no dev data with train 2017-04-23 22:27:10 +02:00
Matthew Honnibal
df2ac8b843 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-23 21:25:07 +02:00
Matthew Honnibal
d0e19267e8 Create directory if missing in save_to_directory 2017-04-23 21:24:43 +02:00
ines
42305bc519 Remove unnecessary test 2017-04-23 21:21:41 +02:00
ines
012ea594d1 Add file for misc tests 2017-04-23 21:06:51 +02:00
ines
83f66947dc Rename test_download to test_cli 2017-04-23 21:06:50 +02:00
ines
401045433c Simplify compat.fix_text 2017-04-23 21:06:50 +02:00
Matthew Honnibal
e033c86a64 Increment version 2017-04-23 21:03:43 +02:00
Matthew Honnibal
d2436dc17b Update fix for Issue #999 2017-04-23 18:14:37 +02:00
Matthew Honnibal
874a3cbb07 Add test for Issue #955 2017-04-23 17:57:01 +02:00
Matthew Honnibal
60703cede5 Ensure noun chunks can't be nested. Closes #955 2017-04-23 17:56:39 +02:00
Matthew Honnibal
c9ec24b257 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-23 17:07:46 +02:00
Matthew Honnibal
5d8af40445 Add test for Issue #999 2017-04-23 17:06:30 +02:00
Matthew Honnibal
4d2a659c52 Fix json dump for Python3 2017-04-23 17:05:53 +02:00
Matthew Honnibal
040751ad17 Remove xfail on Test #910 2017-04-23 16:28:55 +02:00
ines
3a9710f356 Pass dev_scores to print_progress correctly (resolves #1008)
Only read scores attribute if command is used with dev_data, otherwise
default dev_scores to empty dict.
2017-04-23 15:58:40 +02:00
Matthew Honnibal
1b12f342e4 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-20 17:03:11 +02:00
Matthew Honnibal
4eef200bab Persist the actions within spacy.parser.cfg 2017-04-20 17:02:44 +02:00
ines
25c70b4cc5 Move fix_text to spacy.compat (see #1002) 2017-04-20 15:47:17 +02:00
Ines Montani
60b5243bee Merge pull request #1002 from oroszgy/model_cli_fix
Fixes for the `model` CLI
2017-04-20 15:41:03 +02:00
Gyorgy Orosz
4a06a2572c Using ftfy for handling broken encoded strings. 2017-04-20 13:34:51 +02:00
Ines Montani
3800b29046 Merge pull request #1001 from recognai/master
Add SPACE to es tag map
2017-04-20 12:16:34 +02:00
oeg
f0bcd0babb fix(model): Add SPACE to es tag_map. Fixing error in morphology.pyx when SP tag is missing 2017-04-20 11:36:24 +02:00
Ben Eyal
e90e8a3f10 Enable test 2017-04-20 02:25:24 +03:00
Ben Eyal
33af52599e Redefine alphabetic characters
For caseless languages (Hebrew, Bengali) all characters are both lowercase and uppercase.
2017-04-20 02:25:02 +03:00
Ben Eyal
d8098a8be2 Use regex instead of re 2017-04-20 02:22:52 +03:00
oeg
daaa42dd25 Merge remote-tracking branch 'upstream/master' 2017-04-19 23:30:36 +02:00
oeg
936a297241 fix(model): Fix tag map for fixing issues with tag SPACE 2017-04-19 23:30:21 +02:00
luvogels
c7cec7e5e2 Update __init__.py 2017-04-19 21:06:30 +02:00
luvogels
55e8cade36 Update __init__.py 2017-04-19 21:06:30 +02:00
luvogels
03abd0c8e6 Update __init__.py 2017-04-19 21:06:30 +02:00
Leif Uwe Vogelsang
538a8d6b12 Resolved merge conflict by incorporating both suggestions. 2017-04-19 21:06:07 +02:00
Leif Uwe Vogelsang
e821c48489 Norwegian language basics 2017-04-19 21:04:01 +02:00
Leif Uwe Vogelsang
3796c668d9 more norwegian 2017-04-19 21:01:32 +02:00
Leif Uwe Vogelsang
bc9557b21f Norwegian language basics 2017-04-19 21:00:01 +02:00
ines
2bd89e7ade Tidy up Hebrew tests and test for punctuation (see #995) 2017-04-19 19:28:03 +02:00
ines
48da244058 Use spacy.compat.json_dumps for Python 2/3 compatibility (resolves #991) 2017-04-19 11:50:36 +02:00
ines
ddd5194088 Update Language docs and docstrings 2017-04-17 01:52:13 +02:00
ines
f62b740961 Use compat.json_dumps 2017-04-17 01:46:14 +02:00
ines
8e83f8e2fa Update docstrings 2017-04-17 01:40:26 +02:00
ines
e2299dc389 Ensure path in save_to_directory 2017-04-17 01:40:14 +02:00
ines
82f5f1f98f Replace str with compat.unicode_ 2017-04-17 01:29:54 +02:00
ines
16a8521efa Increment version 2017-04-16 22:38:38 +02:00
Matthew Honnibal
4efd6fb9d6 Fix training 2017-04-16 15:28:27 -05:00
Matthew Honnibal
17c9fffb9e Fix naked except 2017-04-16 15:28:16 -05:00
ines
5610fdcc06 Get language name first if no model path exists
Makes sure spaCy fails early if no tokenizer exists, and allows
printing better error message.
2017-04-16 22:16:47 +02:00
ines
ad168ba88c Set model name to empty string if path override exists
Required for parse_package_meta, which composes path of data_path and
model_name (needs to be fixed in the future)
2017-04-16 22:15:51 +02:00
ines
97647c46cd Add docstring and todo note 2017-04-16 22:14:45 +02:00
ines
5c5f8c0a72 Check if full string is found in lang classes first
This allows users to set arbitrary strings. (Otherwise, custom lang
class "my_custom_class" would always load Burmese "my" tokenizer if one
was available.)
2017-04-16 22:14:38 +02:00
ines
13d30b6c01 xfail lemmatizer test that's causing problems (see #546) 2017-04-16 21:18:39 +02:00
Matthew Honnibal
4931c56afc Increment version 2017-04-16 13:59:38 -05:00
ines
6145b7c153 Remove redundant Path 2017-04-16 20:53:25 +02:00
Matthew Honnibal
fa89613444 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-16 13:42:56 -05:00
ines
1f9f867c70 Remove unused util function 2017-04-16 20:37:45 +02:00
ines
7670c745b6 Update spacy.load() and fix path checks 2017-04-16 20:37:45 +02:00
ines
d3759dfb32 Fix docstring 2017-04-16 20:37:45 +02:00
ines
ed7e19ad68 Remove unused import 2017-04-16 20:37:45 +02:00
ines
0084466a66 Remove unused utf8open util and replace os.path with ensure_path 2017-04-16 20:37:45 +02:00
Matthew Honnibal
89a4f262fc Fix training methods 2017-04-16 13:00:37 -05:00
Matthew Honnibal
6a4221a6de Allow lemma to be set from Python. Re #973 2017-04-16 18:07:53 +02:00
Matthew Honnibal
137b210bcf Restore use of FTRL training 2017-04-16 18:02:42 +02:00
ines
d10bd0eaf9 Fix formatting 2017-04-16 13:42:34 +02:00
ines
8191e33cf1 Update link error message with info on permissions 2017-04-16 13:32:31 +02:00
ines
a3ddbc0444 Add note about --force flag to error message 2017-04-16 13:14:36 +02:00
ines
e3de035814 Add meta validation to check for required settings
Complain if no "lang", "name" or "version" is found (those settings are
used in directory / package names). Package will still build without,
but it'll inevitably fail somewhere down the line.
2017-04-16 13:13:17 +02:00
ines
a7574b7572 Add more options to read in meta data in package command
Add meta option to supply path to meta.json. If no meta path is set,
check if meta.json exists in input directory and use it. Otherwise,
prompt for details on the command line.
2017-04-16 13:06:02 +02:00
ines
13c8a42d2b Fix typos 2017-04-16 13:03:58 +02:00
ines
31fa73293a Move read_json out to own util function 2017-04-16 13:03:28 +02:00
Matthew Honnibal
45464d065e Remove print statement 2017-04-15 16:11:43 +02:00
Matthew Honnibal
c76cb8af35 Fix training for new labels 2017-04-15 16:11:26 +02:00
Matthew Honnibal
4884b2c113 Refix StepwiseState 2017-04-15 16:00:28 +02:00
Matthew Honnibal
e6ee7e130f Fix parse package meta 2017-04-15 13:38:53 +02:00
Matthew Honnibal
1a98e48b8e Fix Stepwisestate' 2017-04-15 13:35:01 +02:00
ines
0739ae7b76 Tidy up and fix formatting and imports 2017-04-15 13:05:15 +02:00
ines
fefe6684cd Fix symlink function to check for Windows 2017-04-15 12:17:27 +02:00
ines
35fb4febe2 Fix whitespace 2017-04-15 12:13:45 +02:00
ines
e1efd589c3 Fix json imports and use ujson 2017-04-15 12:13:34 +02:00
ines
958b12dec8 Use pathlib instead of os.path 2017-04-15 12:13:00 +02:00
ines
956dc36785 Move functions to deprecated 2017-04-15 12:12:31 +02:00
ines
c05ec4b89a Add compat functions and remove old workarounds
Add ensure_path util function to handle checking instance of path
2017-04-15 12:11:16 +02:00
ines
26445ee304 Add compat module for Python2/3 and platform compatibility 2017-04-15 12:07:02 +02:00
ines
d24589aa72 Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
ines
561f2a3eb4 Use consistent formatting for docstrings 2017-04-15 11:59:21 +02:00
Matthew Honnibal
d13f0a7017 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-04-14 23:54:57 +02:00
Matthew Honnibal
354458484c WIP on add_label bug during NER training
Currently when a new label is introduced to NER during training,
it causes the labels to be read in in an unexpected order. This
invalidates the model.
2017-04-14 23:52:17 +02:00
Matthew Honnibal
33ba5066eb Refactor Language.end_training, making new save_to_directory method 2017-04-14 23:51:24 +02:00
ines
84341c2975 Only compile list of models if data_path exists 2017-04-14 16:48:02 +02:00
Gyorgy Orosz
dd3244c08a Made json dump to produce unicode strings in py2 2017-04-13 23:30:47 +02:00
Gyorgy Orosz
a9469c8173 Fixed typo 2017-04-13 15:24:14 +02:00
ines
41037f0f07 Remove unused imports 2017-04-13 13:52:11 +02:00
ines
1b92c8d5d5 Use unicode paths on Windows/Python 2 and catch other errors (resolves #970)
try/except here is quite dirty, but it'll at least make sure users see
an error message that explains what's going on
2017-04-10 17:49:51 +02:00
Matthew Honnibal
49e2de900e Add costs property to StepwiseState, to show which moves are gold. 2017-04-10 11:37:04 +02:00
Matthew Honnibal
e26577b202 Increment version 2017-04-07 18:45:06 +02:00
Matthew Honnibal
40bf7ecf27 Increment version 2017-04-07 18:44:20 +02:00
Matthew Honnibal
1dca7eeb03 Add unicode declaration on new regression test 2017-04-07 18:09:23 +02:00
ines
887827fc6a Merge branch 'develop' 2017-04-07 17:36:23 +02:00
ines
444dd511c5 Fix xpassing URL test case 2017-04-07 17:36:05 +02:00
ines
bf0f15e762 Add / to tokenizer infixes (resolves #891) 2017-04-07 17:30:44 +02:00
ines
00b9011a49 Fix whitespace 2017-04-07 17:29:59 +02:00
ines
f9869e4dc5 Merge branch 'master' into develop 2017-04-07 17:23:40 +02:00
Matthew Honnibal
4a6204dbad Merge remote-tracking branch 'origin/develop' 2017-04-07 17:20:09 +02:00
Matthew Honnibal
0513c43bf0 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-07 17:07:10 +02:00
Matthew Honnibal
cc36c308f4 Fix noun_chunk rules around coordination
Closes #693.
2017-04-07 17:06:40 +02:00
Matthew Honnibal
ab846256cf Merge pull request #966 from recognai/master
Prepare Spanish language for training models, including configuration, rich-UD tag map and tests
2017-04-07 16:12:29 +02:00
Matthew Honnibal
83dca920d4 Rename test #913 -> #957, comment
Make test for #957 reference correct bug. Add comment.

Previous commit closes #957.
2017-04-07 15:54:25 +02:00
Matthew Honnibal
be204ed714 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-07 15:50:14 +02:00
Matthew Honnibal
e7b1ee9efd Switch to regex module for URL identification
The URL detection regex was failing on input such as 0.1.2.3, as this
input triggered excessive back-tracking in the builtin re module.
The solution was to switch to the regex module, which behaves better.

Closes #913.
2017-04-07 15:47:36 +02:00
Matthew Honnibal
5887383fc0 Add test for Issue #913: Hang from bad regex 2017-04-07 15:47:27 +02:00
ines
7ea1673072 Fix whitespace 2017-04-07 13:28:48 +02:00
ines
255650dbc2 Add connlu2json converter from explosion/spacy-dev-resources/#11 2017-04-07 13:05:12 +02:00
ines
789ce8a45e Add convert command 2017-04-07 13:04:17 +02:00
ines
9952d3b08a Fix whitespace 2017-04-07 13:02:05 +02:00
ines
47ddce6eb7 Remove unused variable 2017-04-07 13:01:48 +02:00
ines
dcf8ab0c47 Merge branch 'develop' 2017-04-07 12:00:09 +02:00
ines
75f9b4c6e2 Fix whitespace 2017-04-07 10:22:18 +02:00
oeg
c693d40791 feature(model): Add support for creating the Spanish model, including rich tagset, configuration, and basich tests 2017-04-06 18:48:45 +02:00
oeg
010293fb2f fix(typo): Fixes typo in method calling PseudoProjectivity.deprojectivize, failing with new train cli 2017-04-06 17:33:15 +02:00
ines
808cd6cf7f Add missing tags to verbs (resolves #948) 2017-04-03 18:12:52 +02:00
ines
ad8bf1829f Import and combine Portuguese tokenizer exceptions (see #943) 2017-04-01 10:37:42 +02:00
Ines Montani
f8b2d9c3b7 Merge pull request #943 from mamoit/master
Portuguese improvements
2017-04-01 10:32:00 +02:00
ines
3b667a24d4 Remove whitespace 2017-04-01 10:21:08 +02:00
ines
e71a1f4bd0 Fix download commands in error messages (see #946) 2017-04-01 10:20:57 +02:00
ines
42382d5692 Fix download commands in error messages (see #946) 2017-04-01 10:19:32 +02:00
ines
d4a59c254b Remove whitespace 2017-04-01 10:19:01 +02:00
Matthew Honnibal
51882ee2b8 Fix check for setting ent_id in merge 2017-03-31 19:32:01 +02:00
Miguel Almeida
4fde64c4ea Portuguese contractions and some abreviations 2017-03-31 15:52:55 +01:00
Miguel Almeida
465b240bcb Review Portuguese stop words
Mainly to review typos and add missing masculines/feminines
2017-03-31 13:00:47 +01:00
Matthew Honnibal
fc3900e5b2 Allow ent_id to be set in Token 2017-03-31 14:00:14 +02:00
Matthew Honnibal
9720103428 Improve attribute handlign in doc.merge(). Still unsatisfying 2017-03-31 13:59:58 +02:00
Matthew Honnibal
cfff4e0f61 Improve test 2017-03-31 13:59:32 +02:00
Matthew Honnibal
1bb7b4ca71 Add comment 2017-03-31 13:59:19 +02:00
Matthew Honnibal
725249c59a Add merge_phrase callback in matcher.pyx 2017-03-31 13:58:59 +02:00
Matthew Honnibal
e854f28304 Add test for Issue #758
Issue #758 occurs when no actions are available for a single token
doc after merging.
2017-03-31 13:26:25 +02:00
Miguel Almeida
c1d020b0a6 Remove "ista" from portuguese stop words 2017-03-31 12:26:13 +01:00
Miguel Almeida
17a1e7a119 Add Portuguese numbers and ordinals 2017-03-31 12:21:01 +01:00
Matthew Honnibal
47a3ef06a6 Unhack deprojetivization, moving it into pipeline
Previously the deprojectivize() call was attached to the transition
system, and only called for German. Instead it should be a separate
process, called after the parser. This makes it available for any
language. Closes #898.
2017-03-31 12:31:50 +02:00
Joshua Reeter
564daf6dec Issue #934 symlink should not convert paths as_posix under windows. 2017-03-30 23:47:45 -05:00
Bruno P. Kinoshita
c2d48974bc Fix typos in Portuguese stop words 2017-03-30 21:59:18 +13:00