Commit Graph

3171 Commits

Author SHA1 Message Date
Gregory Howard
0e8c41ea4f Adding method lemmatizer for every class 2017-05-03 12:14:42 +02:00
Gregory Howard
32ca07989e adding export japanese 2017-05-03 11:07:29 +02:00
Grégory Howard
f9d7144224 Merge branch 'master' into master 2017-05-03 11:04:51 +02:00
Gregory Howard
f2ab7d77b4 Lazy imports language 2017-05-03 11:01:42 +02:00
Ines Montani
3ea23a3f4d Fix formatting 2017-05-03 09:44:38 +02:00
Ines Montani
d730eb0c0d Raise custom ImportError if importing janome fails 2017-05-03 09:43:29 +02:00
Ines Montani
949ad6594b Add newline 2017-05-03 09:38:43 +02:00
Ines Montani
d12ca587ea Add newline 2017-05-03 09:38:29 +02:00
Ines Montani
8676cd0135 Add newline 2017-05-03 09:38:07 +02:00
Yasuaki Uechi
c8f83aeb87 Add basic japanese support 2017-05-03 13:56:21 +09:00
Gregory Howard
c0afcd22bb Merge remote-tracking branch 'remotes/upstream/master' 2017-04-27 14:42:54 +02:00
Matthew Honnibal
31ec9e1371 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-27 13:21:39 +02:00
Matthew Honnibal
2da16adcc2 Add dropout option for parser and NER
Dropout can now be specified in the `Parser.update()` method via
the `drop` keyword argument, e.g.

    nlp.entity.update(doc, gold, drop=0.4)

This will randomly drop 40% of features, and multiply the value of the
others by 1. / 0.4. This may be useful for generalising from small data
sets.

This commit also patches the examples/training/train_new_entity_type.py
example, to use dropout and fix the output (previously it did not output
the learned entity).
2017-04-27 13:18:39 +02:00
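The feature dropout described in commit 2da16adcc2 can be sketched in plain Python. This is only an illustration under assumptions: `apply_dropout` and its arguments are hypothetical names, not spaCy's API. The scaling factor is passed in explicitly because the message states `1. / 0.4`, whereas the common "inverted dropout" convention would scale survivors by `1. / (1 - drop)`; which one the commit implements cannot be confirmed from this log.

```python
import random


def apply_dropout(values, drop, scale, rng=None):
    """Zero each value with probability `drop`; multiply the survivors by `scale`.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    if rng is None:
        rng = random.Random()
    return [0.0 if rng.random() < drop else value * scale for value in values]


# Seeded generator so the run is repeatable.
rng = random.Random(0)
features = [1.0] * 10
# Scale taken literally from the commit message (1. / 0.4).
dropped = apply_dropout(features, drop=0.4, scale=1.0 / 0.4, rng=rng)
```

Every entry of `dropped` is either `0.0` or the scaled value, so the choice of `scale` alone determines whether the expected feature magnitude is preserved.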
Gregory Howard
92f368f83b Removing extra spaces 2017-04-27 12:02:14 +02:00
Gregory Howard
13b6957c8e Adding unit test for tokenization in French (with title) 2017-04-27 11:53:44 +02:00
Gregory Howard
8ff4682255 correcting tokenizer exception.
Adding tests for lemmatization
2017-04-27 11:52:14 +02:00
Ines Montani
7da9cefd25 Merge pull request #1022 from luvogels/master
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Ines Montani
c9e592ae6c Add newline 2017-04-27 11:15:41 +02:00
Ines Montani
5942adccc2 Add newline 2017-04-27 11:15:19 +02:00
Ines Montani
4cd9269aef Add newline 2017-04-27 11:15:04 +02:00
Ines Montani
ccf13ecc21 Add newline 2017-04-27 11:14:42 +02:00
Ines Montani
03d2b0cc05 Add newline 2017-04-27 11:14:26 +02:00
Gregory Howard
44cb486849 Adding unit test for tokenization in French (with title) 2017-04-27 10:59:38 +02:00
Gregory Howard
ad8129cb45 Improvement of rules: now title-insensitive, with the same declaration format 2017-04-27 10:23:56 +02:00
luvogels
d12a0b6431 Hooked up tokenizer tests 2017-04-26 23:21:41 +02:00
Matthew Honnibal
f0e1606d27 Increment version 2017-04-26 20:25:41 +02:00
luvogels
b331929a7e Merge branch 'master' of https://github.com/luvogels/spaCy 2017-04-26 19:15:48 +02:00
luvogels
8de59ce3b9 Added tokenizer tests 2017-04-26 19:10:18 +02:00
Matthew Honnibal
4d98511db7 Make Span hashable. Closes #1019 2017-04-26 19:01:05 +02:00
Matthew Honnibal
24c4c51f13 Try to make test999 less flakey 2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang
460094bf09 Update __init__.py 2017-04-26 18:27:55 +02:00
ines
527d51ac9a Fetch shortcuts from GitHub and improve error handling 2017-04-26 18:00:28 +02:00
Gregory Howard
ed5f094451 Adding insensitive lemmatisation test 2017-04-25 18:07:02 +02:00
ghoward
26e31afc18 Renaming tests 2017-04-25 17:46:01 +02:00
ghoward
c085c2d391 Adding some unit tests 2017-04-25 17:44:16 +02:00
ghoward
55c6910f90 Lookup table for languages in spaCy.
Need to find another name for lemmatizerlookup. I was not inspired.
Trying to use new files in fr language.
2017-04-24 16:39:00 +02:00
Matthew Honnibal
c4be9c36fe Fix unicode header in tests 2017-04-24 10:09:01 +02:00
Matthew Honnibal
65f10b53e5 Fix test 2017-04-24 00:25:55 +02:00
Matthew Honnibal
70a43858e1 Fix flakey test 2017-04-24 00:06:30 +02:00
Matthew Honnibal
3973af2d15 Make training test less flakey 2017-04-23 22:59:34 +02:00
Matthew Honnibal
4f9657b42b Fix reporting if no dev data with train 2017-04-23 22:27:10 +02:00
Matthew Honnibal
df2ac8b843 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-23 21:25:07 +02:00
Matthew Honnibal
d0e19267e8 Create directory if missing in save_to_directory 2017-04-23 21:24:43 +02:00
ines
42305bc519 Remove unnecessary test 2017-04-23 21:21:41 +02:00
ines
012ea594d1 Add file for misc tests 2017-04-23 21:06:51 +02:00
ines
83f66947dc Rename test_download to test_cli 2017-04-23 21:06:50 +02:00
ines
401045433c Simplify compat.fix_text 2017-04-23 21:06:50 +02:00
Matthew Honnibal
e033c86a64 Increment version 2017-04-23 21:03:43 +02:00
Matthew Honnibal
d2436dc17b Update fix for Issue #999 2017-04-23 18:14:37 +02:00
Matthew Honnibal
874a3cbb07 Add test for Issue #955 2017-04-23 17:57:01 +02:00
Matthew Honnibal
60703cede5 Ensure noun chunks can't be nested. Closes #955 2017-04-23 17:56:39 +02:00
Matthew Honnibal
c9ec24b257 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-23 17:07:46 +02:00
Matthew Honnibal
5d8af40445 Add test for Issue #999 2017-04-23 17:06:30 +02:00
Matthew Honnibal
4d2a659c52 Fix json dump for Python3 2017-04-23 17:05:53 +02:00
Matthew Honnibal
040751ad17 Remove xfail on Test #910 2017-04-23 16:28:55 +02:00
ines
3a9710f356 Pass dev_scores to print_progress correctly (resolves #1008)
Only read scores attribute if command is used with dev_data, otherwise
default dev_scores to empty dict.
2017-04-23 15:58:40 +02:00
Matthew Honnibal
1b12f342e4 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-20 17:03:11 +02:00
Matthew Honnibal
4eef200bab Persist the actions within spacy.parser.cfg 2017-04-20 17:02:44 +02:00
ines
25c70b4cc5 Move fix_text to spacy.compat (see #1002) 2017-04-20 15:47:17 +02:00
Ines Montani
60b5243bee Merge pull request #1002 from oroszgy/model_cli_fix
Fixes for the `model` CLI
2017-04-20 15:41:03 +02:00
Gyorgy Orosz
4a06a2572c Using ftfy for handling broken encoded strings. 2017-04-20 13:34:51 +02:00
Ines Montani
3800b29046 Merge pull request #1001 from recognai/master
Add SPACE to es tag map
2017-04-20 12:16:34 +02:00
oeg
f0bcd0babb fix(model): Add SPACE to es tag_map. Fixing error in morphology.pyx when SP tag is missing 2017-04-20 11:36:24 +02:00
Ben Eyal
e90e8a3f10 Enable test 2017-04-20 02:25:24 +03:00
Ben Eyal
33af52599e Redefine alphabetic characters
For caseless languages (Hebrew, Bengali) all characters are both lowercase and uppercase.
2017-04-20 02:25:02 +03:00
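The observation in commit 33af52599e about caseless scripts can be checked with Python's built-in string methods. This is a minimal sketch assuming Python 3; `is_caseless` is a hypothetical helper, not the definition the commit introduced.

```python
def is_caseless(char):
    """True for an alphabetic character whose lowercase and uppercase forms coincide.

    Hypothetical helper for illustration only.
    """
    return char.isalpha() and char.lower() == char.upper()


hebrew_alef = "\u05d0"  # Hebrew letter alef
bengali_ka = "\u0995"   # Bengali letter ka

caseless_flags = [is_caseless(c) for c in (hebrew_alef, bengali_ka, "a")]
```

A character class built only from lowercase and uppercase ranges would miss such letters, which is presumably why the alphabetic definition needed to change.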
Ben Eyal
d8098a8be2 Use regex instead of re 2017-04-20 02:22:52 +03:00
oeg
daaa42dd25 Merge remote-tracking branch 'upstream/master' 2017-04-19 23:30:36 +02:00
oeg
936a297241 fix(model): Fix tag map for fixing issues with tag SPACE 2017-04-19 23:30:21 +02:00
luvogels
c7cec7e5e2 Update __init__.py 2017-04-19 21:06:30 +02:00
luvogels
55e8cade36 Update __init__.py 2017-04-19 21:06:30 +02:00
luvogels
03abd0c8e6 Update __init__.py 2017-04-19 21:06:30 +02:00
Leif Uwe Vogelsang
538a8d6b12 Resolved merge conflict by incorporating both suggestions. 2017-04-19 21:06:07 +02:00
Leif Uwe Vogelsang
e821c48489 Norwegian language basics 2017-04-19 21:04:01 +02:00
Leif Uwe Vogelsang
3796c668d9 more norwegian 2017-04-19 21:01:32 +02:00
Leif Uwe Vogelsang
bc9557b21f Norwegian language basics 2017-04-19 21:00:01 +02:00
ines
2bd89e7ade Tidy up Hebrew tests and test for punctuation (see #995) 2017-04-19 19:28:03 +02:00
ines
48da244058 Use spacy.compat.json_dumps for Python 2/3 compatibility (resolves #991) 2017-04-19 11:50:36 +02:00
ines
ddd5194088 Update Language docs and docstrings 2017-04-17 01:52:13 +02:00
ines
f62b740961 Use compat.json_dumps 2017-04-17 01:46:14 +02:00
ines
8e83f8e2fa Update docstrings 2017-04-17 01:40:26 +02:00
ines
e2299dc389 Ensure path in save_to_directory 2017-04-17 01:40:14 +02:00
ines
82f5f1f98f Replace str with compat.unicode_ 2017-04-17 01:29:54 +02:00
ines
16a8521efa Increment version 2017-04-16 22:38:38 +02:00
Matthew Honnibal
4efd6fb9d6 Fix training 2017-04-16 15:28:27 -05:00
Matthew Honnibal
17c9fffb9e Fix naked except 2017-04-16 15:28:16 -05:00
ines
5610fdcc06 Get language name first if no model path exists
Makes sure spaCy fails early if no tokenizer exists, and allows
printing better error message.
2017-04-16 22:16:47 +02:00
ines
ad168ba88c Set model name to empty string if path override exists
Required for parse_package_meta, which composes path of data_path and
model_name (needs to be fixed in the future)
2017-04-16 22:15:51 +02:00
ines
97647c46cd Add docstring and todo note 2017-04-16 22:14:45 +02:00
ines
5c5f8c0a72 Check if full string is found in lang classes first
This allows users to set arbitrary strings. (Otherwise, custom lang
class "my_custom_class" would always load Burmese "my" tokenizer if one
was available.)
2017-04-16 22:14:38 +02:00
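The lookup order described in commit 5c5f8c0a72 can be sketched as follows. The registry contents, the function name `resolve_lang_class`, and the fallback of splitting on `_` are assumptions made for illustration; the log does not show the actual implementation.

```python
def resolve_lang_class(name, lang_classes):
    """Prefer an exact match on the full string, then fall back to the language prefix.

    Hypothetical helper; `lang_classes` maps names to language classes.
    """
    if name in lang_classes:
        return lang_classes[name]
    prefix = name.split("_")[0]  # assumed fallback: "my_custom_class" -> "my"
    return lang_classes.get(prefix)


# Placeholder strings stand in for real language classes.
registry = {"my": "BurmeseTokenizer", "my_custom_class": "CustomClass"}
exact = resolve_lang_class("my_custom_class", registry)
fallback = resolve_lang_class("my_model", registry)
```

Checking the full string first means a registered custom name takes precedence over a two-letter language code that happens to be its prefix.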
ines
13d30b6c01 xfail lemmatizer test that's causing problems (see #546) 2017-04-16 21:18:39 +02:00
Matthew Honnibal
4931c56afc Increment version 2017-04-16 13:59:38 -05:00
ines
6145b7c153 Remove redundant Path 2017-04-16 20:53:25 +02:00
Matthew Honnibal
fa89613444 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-16 13:42:56 -05:00
ines
1f9f867c70 Remove unused util function 2017-04-16 20:37:45 +02:00
ines
7670c745b6 Update spacy.load() and fix path checks 2017-04-16 20:37:45 +02:00
ines
d3759dfb32 Fix docstring 2017-04-16 20:37:45 +02:00
ines
ed7e19ad68 Remove unused import 2017-04-16 20:37:45 +02:00
ines
0084466a66 Remove unused utf8open util and replace os.path with ensure_path 2017-04-16 20:37:45 +02:00
Matthew Honnibal
89a4f262fc Fix training methods 2017-04-16 13:00:37 -05:00
Matthew Honnibal
6a4221a6de Allow lemma to be set from Python. Re #973 2017-04-16 18:07:53 +02:00
Matthew Honnibal
137b210bcf Restore use of FTRL training 2017-04-16 18:02:42 +02:00
ines
d10bd0eaf9 Fix formatting 2017-04-16 13:42:34 +02:00
ines
8191e33cf1 Update link error message with info on permissions 2017-04-16 13:32:31 +02:00
ines
a3ddbc0444 Add note about --force flag to error message 2017-04-16 13:14:36 +02:00
ines
e3de035814 Add meta validation to check for required settings
Complain if no "lang", "name" or "version" is found (those settings are
used in directory / package names). Package will still build without,
but it'll inevitably fail somewhere down the line.
2017-04-16 13:13:17 +02:00
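The meta validation described in commit e3de035814 might look roughly like this. Only the three setting names come from the message; `missing_meta_keys` and its return shape are hypothetical.

```python
REQUIRED_KEYS = ("lang", "name", "version")  # settings named in the commit message


def missing_meta_keys(meta):
    """Return the required settings that are absent or empty in a meta dict.

    Hypothetical helper for illustration only.
    """
    return [key for key in REQUIRED_KEYS if not meta.get(key)]


incomplete = missing_meta_keys({"lang": "en", "name": ""})
complete = missing_meta_keys({"lang": "en", "name": "model", "version": "1.0.0"})
```

Returning the list of missing keys, rather than raising, matches the message's note that the package still builds and the command only complains.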
ines
a7574b7572 Add more options to read in meta data in package command
Add meta option to supply path to meta.json. If no meta path is set,
check if meta.json exists in input directory and use it. Otherwise,
prompt for details on the command line.
2017-04-16 13:06:02 +02:00
ines
13c8a42d2b Fix typos 2017-04-16 13:03:58 +02:00
ines
31fa73293a Move read_json out to own util function 2017-04-16 13:03:28 +02:00
Matthew Honnibal
45464d065e Remove print statement 2017-04-15 16:11:43 +02:00
Matthew Honnibal
c76cb8af35 Fix training for new labels 2017-04-15 16:11:26 +02:00
Matthew Honnibal
4884b2c113 Refix StepwiseState 2017-04-15 16:00:28 +02:00
Matthew Honnibal
e6ee7e130f Fix parse package meta 2017-04-15 13:38:53 +02:00
Matthew Honnibal
1a98e48b8e Fix StepwiseState 2017-04-15 13:35:01 +02:00
ines
0739ae7b76 Tidy up and fix formatting and imports 2017-04-15 13:05:15 +02:00
ines
fefe6684cd Fix symlink function to check for Windows 2017-04-15 12:17:27 +02:00
ines
35fb4febe2 Fix whitespace 2017-04-15 12:13:45 +02:00
ines
e1efd589c3 Fix json imports and use ujson 2017-04-15 12:13:34 +02:00
ines
958b12dec8 Use pathlib instead of os.path 2017-04-15 12:13:00 +02:00
ines
956dc36785 Move functions to deprecated 2017-04-15 12:12:31 +02:00
ines
c05ec4b89a Add compat functions and remove old workarounds
Add ensure_path util function to handle checking instance of path
2017-04-15 12:11:16 +02:00
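A helper in the spirit of the `ensure_path` function named in commit c05ec4b89a could be sketched like this. The body is an assumption (Python 3, `pathlib` only); the log gives the function's name and purpose but not its code.

```python
from pathlib import Path


def ensure_path(path):
    """Return a Path for string input; pass any other object through unchanged.

    Sketch only; the real function's behaviour is not shown in this log.
    """
    if isinstance(path, str):
        return Path(path)
    return path


from_string = ensure_path("data/en")
existing = Path("data/en")
passthrough = ensure_path(existing)
```

Centralising the instance check lets callers accept either strings or `Path` objects without repeating the conversion.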
ines
26445ee304 Add compat module for Python2/3 and platform compatibility 2017-04-15 12:07:02 +02:00
ines
d24589aa72 Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
ines
561f2a3eb4 Use consistent formatting for docstrings 2017-04-15 11:59:21 +02:00
Matthew Honnibal
d13f0a7017 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-04-14 23:54:57 +02:00
Matthew Honnibal
354458484c WIP on add_label bug during NER training
Currently when a new label is introduced to NER during training,
it causes the labels to be read in in an unexpected order. This
invalidates the model.
2017-04-14 23:52:17 +02:00
Matthew Honnibal
33ba5066eb Refactor Language.end_training, making new save_to_directory method 2017-04-14 23:51:24 +02:00
ines
84341c2975 Only compile list of models if data_path exists 2017-04-14 16:48:02 +02:00
Gyorgy Orosz
dd3244c08a Made json dump to produce unicode strings in py2 2017-04-13 23:30:47 +02:00
Gyorgy Orosz
a9469c8173 Fixed typo 2017-04-13 15:24:14 +02:00
ines
41037f0f07 Remove unused imports 2017-04-13 13:52:11 +02:00
ines
1b92c8d5d5 Use unicode paths on Windows/Python 2 and catch other errors (resolves #970)
try/except here is quite dirty, but it'll at least make sure users see
an error message that explains what's going on
2017-04-10 17:49:51 +02:00
Matthew Honnibal
49e2de900e Add costs property to StepwiseState, to show which moves are gold. 2017-04-10 11:37:04 +02:00
Matthew Honnibal
e26577b202 Increment version 2017-04-07 18:45:06 +02:00
Matthew Honnibal
40bf7ecf27 Increment version 2017-04-07 18:44:20 +02:00
Matthew Honnibal
1dca7eeb03 Add unicode declaration on new regression test 2017-04-07 18:09:23 +02:00
ines
887827fc6a Merge branch 'develop' 2017-04-07 17:36:23 +02:00
ines
444dd511c5 Fix xpassing URL test case 2017-04-07 17:36:05 +02:00
ines
bf0f15e762 Add / to tokenizer infixes (resolves #891) 2017-04-07 17:30:44 +02:00
ines
00b9011a49 Fix whitespace 2017-04-07 17:29:59 +02:00
ines
f9869e4dc5 Merge branch 'master' into develop 2017-04-07 17:23:40 +02:00
Matthew Honnibal
4a6204dbad Merge remote-tracking branch 'origin/develop' 2017-04-07 17:20:09 +02:00
Matthew Honnibal
0513c43bf0 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-07 17:07:10 +02:00
Matthew Honnibal
cc36c308f4 Fix noun_chunk rules around coordination
Closes #693.
2017-04-07 17:06:40 +02:00
Matthew Honnibal
ab846256cf Merge pull request #966 from recognai/master
Prepare Spanish language for training models, including configuration, rich-UD tag map and tests
2017-04-07 16:12:29 +02:00
Matthew Honnibal
83dca920d4 Rename test #913 -> #957, comment
Make test for #957 reference correct bug. Add comment.

Previous commit closes #957.
2017-04-07 15:54:25 +02:00
Matthew Honnibal
be204ed714 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-07 15:50:14 +02:00
Matthew Honnibal
e7b1ee9efd Switch to regex module for URL identification
The URL detection regex was failing on input such as 0.1.2.3, as this
input triggered excessive back-tracking in the builtin re module.
The solution was to switch to the regex module, which behaves better.

Closes #913.
2017-04-07 15:47:36 +02:00
Matthew Honnibal
5887383fc0 Add test for Issue #913: Hang from bad regex 2017-04-07 15:47:27 +02:00
ines
7ea1673072 Fix whitespace 2017-04-07 13:28:48 +02:00
ines
255650dbc2 Add connlu2json converter from explosion/spacy-dev-resources/#11 2017-04-07 13:05:12 +02:00
ines
789ce8a45e Add convert command 2017-04-07 13:04:17 +02:00
ines
9952d3b08a Fix whitespace 2017-04-07 13:02:05 +02:00
ines
47ddce6eb7 Remove unused variable 2017-04-07 13:01:48 +02:00
ines
dcf8ab0c47 Merge branch 'develop' 2017-04-07 12:00:09 +02:00
ines
75f9b4c6e2 Fix whitespace 2017-04-07 10:22:18 +02:00
oeg
c693d40791 feature(model): Add support for creating the Spanish model, including rich tagset, configuration, and basic tests 2017-04-06 18:48:45 +02:00
oeg
010293fb2f fix(typo): Fixes typo in method calling PseudoProjectivity.deprojectivize, failing with new train cli 2017-04-06 17:33:15 +02:00
ines
808cd6cf7f Add missing tags to verbs (resolves #948) 2017-04-03 18:12:52 +02:00
ines
ad8bf1829f Import and combine Portuguese tokenizer exceptions (see #943) 2017-04-01 10:37:42 +02:00
Ines Montani
f8b2d9c3b7 Merge pull request #943 from mamoit/master
Portuguese improvements
2017-04-01 10:32:00 +02:00
ines
3b667a24d4 Remove whitespace 2017-04-01 10:21:08 +02:00
ines
e71a1f4bd0 Fix download commands in error messages (see #946) 2017-04-01 10:20:57 +02:00
ines
42382d5692 Fix download commands in error messages (see #946) 2017-04-01 10:19:32 +02:00
ines
d4a59c254b Remove whitespace 2017-04-01 10:19:01 +02:00
Matthew Honnibal
51882ee2b8 Fix check for setting ent_id in merge 2017-03-31 19:32:01 +02:00
Miguel Almeida
4fde64c4ea Portuguese contractions and some abbreviations 2017-03-31 15:52:55 +01:00
Miguel Almeida
465b240bcb Review Portuguese stop words
Mainly to review typos and add missing masculines/feminines
2017-03-31 13:00:47 +01:00
Matthew Honnibal
fc3900e5b2 Allow ent_id to be set in Token 2017-03-31 14:00:14 +02:00
Matthew Honnibal
9720103428 Improve attribute handling in doc.merge(). Still unsatisfying 2017-03-31 13:59:58 +02:00
Matthew Honnibal
cfff4e0f61 Improve test 2017-03-31 13:59:32 +02:00
Matthew Honnibal
1bb7b4ca71 Add comment 2017-03-31 13:59:19 +02:00
Matthew Honnibal
725249c59a Add merge_phrase callback in matcher.pyx 2017-03-31 13:58:59 +02:00
Matthew Honnibal
e854f28304 Add test for Issue #758
Issue #758 occurs when no actions are available for a single token
doc after merging.
2017-03-31 13:26:25 +02:00
Miguel Almeida
c1d020b0a6 Remove "ista" from portuguese stop words 2017-03-31 12:26:13 +01:00
Miguel Almeida
17a1e7a119 Add Portuguese numbers and ordinals 2017-03-31 12:21:01 +01:00
Matthew Honnibal
47a3ef06a6 Unhack deprojectivization, moving it into pipeline
Previously the deprojectivize() call was attached to the transition
system, and only called for German. Instead it should be a separate
process, called after the parser. This makes it available for any
language. Closes #898.
2017-03-31 12:31:50 +02:00
Joshua Reeter
564daf6dec Issue #934 symlink should not convert paths as_posix under windows. 2017-03-30 23:47:45 -05:00
Bruno P. Kinoshita
c2d48974bc Fix typos in Portuguese stop words 2017-03-30 21:59:18 +13:00
Matthew Honnibal
0fefdfcbda Merge pull request #935 from ericzhao28/master
Add option to use label=ent_type in doc.merge arguments (Bug fix for issue #862)
2017-03-30 02:51:24 +02:00
ines
4759fd437d Merge branch 'master' into develop 2017-03-29 10:37:13 +02:00
ines
7e4befec88 Add Hebrew to init and setup.py 2017-03-29 10:34:57 +02:00
Grégory Howard
9c2996b27f correction of package.py (encoding on open instead of write) 2017-03-29 09:11:02 +02:00
Eric Zhao
aafdf6ffb8 Add option to use label kwarg to determine ent_type in doc.merge 2017-03-28 23:35:03 -07:00
ines
7198cf1c8a Remove unused import 2017-03-26 20:56:05 +02:00
ines
7ceaa1614b Add experimental model init command 2017-03-26 20:51:40 +02:00
Matthew Honnibal
83ba6c247c Fix init of Language without model 2017-03-26 16:46:00 +02:00
Matthew Honnibal
fa107f95f6 Remove unused train_config command 2017-03-26 09:28:59 -05:00
Matthew Honnibal
df83921f0a Increment version 2017-03-26 09:27:32 -05:00
Matthew Honnibal
92ac3af21d Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-26 09:26:59 -05:00
Matthew Honnibal
a9b1f23c7d Enable regression loss for parser 2017-03-26 09:26:30 -05:00
ines
c00d997924 Merge branch 'develop' 2017-03-26 15:57:00 +02:00
Matthew Honnibal
2efdbc08ff Make training work with directories 2017-03-26 08:46:44 -05:00
ines
007a2492bd Remove train_config command for now 2017-03-26 15:40:50 +02:00
ines
b297fab062 Update error message for missing commands 2017-03-26 15:40:02 +02:00
ines
7f95023fc0 Fix formatting 2017-03-26 15:37:37 +02:00
ines
5901c8f7f0 Update spacy train CLI documentation 2017-03-26 15:33:48 +02:00
Matthew Honnibal
9dcb58aaaf Merge CLI changes 2017-03-26 07:30:45 -05:00
Matthew Honnibal
6b7f7a2060 Connect parser L1 option to train CLI 2017-03-26 07:24:07 -05:00
Matthew Honnibal
ed2b106f4d Fix circular import in lemmatizer 2017-03-26 07:17:07 -05:00
Matthew Honnibal
dec5571bf3 Update train CLI 2017-03-26 07:16:52 -05:00
ines
53cf2f1c0e Make dev data optional 2017-03-26 11:48:17 +02:00
Matthew Honnibal
5eac089fbe Merge branch 'master' into develop 2017-03-26 04:45:43 -05:00
ines
0fc56e2544 Update flag and defaults 2017-03-26 11:42:11 +02:00
Matthew Honnibal
2f63806ddb Update config when adding label. Re #910 2017-03-25 22:35:44 +01:00
Matthew Honnibal
b94286de30 Fix regression test 2017-03-25 22:35:07 +01:00
Matthew Honnibal
c748907a66 Fix errors in previous commit 2017-03-25 22:25:01 +01:00
Matthew Honnibal
4f400fa486 Prevent lemmatization of base nouns
Update lemmatizer's base-form check, for change in morphology class.
Closes #903.
2017-03-25 21:51:12 +01:00
Matthew Honnibal
850d35dcb3 Make morphology use int attributes internally
The morphology class was calling the lemmatizer inconsistently,
with some string-valued attributes. This caused Issue #903.
2017-03-25 21:49:10 +01:00
Matthew Honnibal
4454c1b23f Block lemmatization of base-form adjectives
Fixes check that an adjective is a base form (as opposed to a
comparative or superlative), so that it's not lemmatized.
e.g. inner -!> inn. Closes #912.
2017-03-25 21:29:57 +01:00
ines
97814f8da6 Update Windows Python 2 link workaround to use helper functions 2017-03-25 14:04:27 +01:00
ines
fdec758113 Add is_windows and is_python2 utility functions 2017-03-25 14:04:02 +01:00
Ines Montani
09837158e4 Merge pull request #921 from solresol/master
Possible solution to #909
2017-03-25 13:51:55 +01:00
Greg Baker
b7f714b498 Possible solution to #909 2017-03-25 21:36:38 +11:00
Ines Montani
97cb4d5e3c Merge branch 'master' into master 2017-03-25 10:03:47 +01:00
Iddo Berger
da135bd823 add hebrew tokenizer 2017-03-24 18:27:44 +03:00
Matthew Honnibal
f40fbc3710 Add test for Issue #910: Resuming entity training 2017-03-23 23:38:57 +01:00
Matthew Honnibal
9c9cd99144 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-23 11:11:24 +01:00
ines
0035fd9efe Add spacy train work in progress 2017-03-23 11:08:41 +01:00
ines
d5ebf583a4 Fix formatting 2017-03-23 11:08:30 +01:00
ines
3f20efe165 Merge branch 'develop'
# Conflicts:
#	spacy/util.py
2017-03-22 17:14:15 +01:00
Ines Montani
f86a3a92d5 Merge pull request #899 from raphael0202/duplicate_keys
Remove duplicate keys in [en|fi] language data dicts
2017-03-22 10:20:11 +01:00
Ines Montani
87a2c85e1b Merge pull request #900 from raphael0202/unused_imports
Remove unused import statements
2017-03-22 10:10:43 +01:00
ines
ce065e5d65 Fix imports 2017-03-22 10:02:14 +01:00
Andrew Poliakov
07199c3e8b Fix infinite recursion in spacy.info 2017-03-22 11:43:22 +03:00
Raphaël Bournhonesque
f332bf05be Remove unused import statements 2017-03-21 21:08:54 +01:00
ines
c3a9f73896 Fix writing to file 2017-03-21 12:35:22 +01:00
ines
d74aa428ad Fix path 2017-03-21 12:26:00 +01:00
ines
83a999ea83 Change default license from MIT to CC 2017-03-21 12:24:43 +01:00
ines
ae46647560 Fix brackets 2017-03-21 12:21:42 +01:00
ines
3e134b5b2b Make sure paths in copytree and rmtree are strings 2017-03-21 12:15:33 +01:00
ines
cf0094187e Fetch MANIFEST.in from GitHub as well 2017-03-21 11:32:38 +01:00
ines
09b24bc5a9 Add docs for package command 2017-03-21 11:19:21 +01:00
ines
3f4e3fda1d Update command and fetch file templates from GitHub
While feature is still experimental, this allows files to be modified
without having to ship a new version of spaCy.
2017-03-21 11:17:36 +01:00
ines
5230ed5b98 Move directory check and overwriting/creating dirs to own function 2017-03-21 02:06:53 +01:00
ines
46bc3c36b0 Fix typo 2017-03-21 02:06:37 +01:00
ines
64e38f304e Only import shutil 2017-03-21 02:06:29 +01:00
ines
448a916d0d Add --force option to override directory 2017-03-21 02:05:34 +01:00
ines
8eb9a2b355 Fix formatting 2017-03-21 02:05:14 +01:00
ines
b2bcdec0f6 Update docstring 2017-03-20 22:50:55 +01:00
ines
bf240132d7 Add cli.package command to build model packages 2017-03-20 22:50:13 +01:00
ines
a54e3c2efe Remove empty line 2017-03-20 22:49:36 +01:00
ines
5aea327a5b Add util function to get raw user input 2017-03-20 22:48:56 +01:00
ines
a6c0361803 Handle raw_input vs input in Python 2 and 3 2017-03-20 22:48:32 +01:00
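A common way to handle the `raw_input` versus `input` difference mentioned in commit a6c0361803 is a small alias resolved at import time. The names `input_` and `prompt` below are hypothetical, and the sketch is not taken from the commit.

```python
try:
    input_ = raw_input  # Python 2: raw_input returns the typed text as a string
except NameError:
    input_ = input  # Python 3: input already returns the typed text as a string


def prompt(description, default="", reader=input_):
    """Ask for a value and fall back to `default` on an empty answer.

    `reader` is injectable so the function can be exercised without a terminal.
    """
    answer = reader("{}: ".format(description))
    return answer or default


# Simulated answers instead of real keyboard input.
typed = prompt("Model name", reader=lambda _: "my_model")
defaulted = prompt("Model name", default="model", reader=lambda _: "")
```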
ines
adbcac6591 Fix spacing 2017-03-20 22:48:21 +01:00
Matthew Honnibal
692eb0603d Fix high memory usage in download command
Due to PyPi issue #2984, installing large packages via pip causes
a large spike in memory usage. The recommended fix is to disable
caching.
2017-03-20 18:24:44 +01:00
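The fix described in commit 692eb0603d amounts to passing pip's `--no-cache-dir` flag. The sketch below only builds the command; the function name and the package name are placeholders, and invoking pip through the current interpreter follows the later commit 1754e0db9b ("Call pip via subprocess") rather than anything shown in this message.

```python
import sys


def build_pip_install_command(package):
    """Build a pip install command that disables the cache.

    Hypothetical helper; pass the result to subprocess.call() to run it.
    """
    return [sys.executable, "-m", "pip", "install", "--no-cache-dir", package]


# "example-model" is a placeholder package name.
command = build_pip_install_command("example-model")
```

Using `sys.executable -m pip` keeps the installation inside whichever environment is running the script.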
ines
f830213c4c Remove compatibility check test
Will only cause problems when incrementing version and not updating
table. Also depends on external URL, which is bad.
2017-03-20 13:20:26 +01:00
Matthew Honnibal
f314d3d044 Increment version 2017-03-20 12:58:24 +01:00
Matthew Honnibal
b487b8735a Decrease beam density, and fix Python 3 problem in beam 2017-03-20 12:56:05 +01:00
Ines Montani
b6ee241e26 Fix print statements 2017-03-20 11:46:37 +01:00
ines
b8f8d5d8bf Make sure model_path is a Posix path
Otherwise, formatting the success message with model_path.as_posix()
fails when using a local path for linking (linking still works, but the
error message is confusing)
2017-03-19 11:57:13 +01:00
ines
fe0ff00fe1 Fix spacing 2017-03-19 11:55:37 +01:00
ines
5712da6095 Add regression test for #891 2017-03-19 11:48:01 +01:00
Raphaël Bournhonesque
7f579ae834 Remove duplicate keys in [en|fi] data dicts 2017-03-19 11:40:29 +01:00
ines
8de5108af6 Exclude common cache directories from model list in cli.info
This means models called "cache" etc. won't show up in the list, but it
seems worth it.
2017-03-19 01:44:43 +01:00
Matthew Honnibal
6ee2ea1128 Increment version 2017-03-19 01:40:52 +01:00
Matthew Honnibal
797f286c38 Use import to find data package 2017-03-19 01:39:36 +01:00
Matthew Honnibal
5941fb9e92 Make spacy/data a package 2017-03-18 20:04:22 +01:00
Matthew Honnibal
bc10d06bc2 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-18 19:32:54 +01:00
Matthew Honnibal
583628c350 Import metadata into __init__ 2017-03-18 19:30:03 +01:00
Matthew Honnibal
1754e0db9b Call pip via subprocess, to make it use virtualenv 2017-03-18 19:29:36 +01:00
ines
1277abcde2 Remove print statement 2017-03-18 19:14:58 +01:00
Matthew Honnibal
dcec104643 Remove unused import 2017-03-18 18:57:45 +01:00
Matthew Honnibal
703eb7bdbd Fix link module 2017-03-18 18:57:31 +01:00
Matthew Honnibal
f6c6c89546 Add empty data directory 2017-03-18 18:32:29 +01:00
ines
7d33104180 Use distutils.sysconfig.get_python_lib
site.getsitepackages seems to not work as expected in Python 2
2017-03-18 18:20:40 +01:00
Matthew Honnibal
1a53fcc685 Fix CLI for Python 2 2017-03-18 18:14:03 +01:00
ines
aefb898e37 Add title-case version of morph rules (resolves #686) 2017-03-18 17:27:11 +01:00
ines
64ec17abc1 Pass xpassing tests and add xfails for failures 2017-03-18 17:20:46 +01:00
ines
d0b85faf69 Pass regression test for #401 (resolves #401)
Fixed in new English models.
2017-03-18 17:06:49 +01:00
ines
be9daefbdd Remove actual model downloading from tests 2017-03-18 17:01:10 +01:00
ines
850650221a Use correct command in deprecated download command message 2017-03-18 17:01:01 +01:00
ines
0dd7710556 Make sure paths are paths 2017-03-18 16:48:52 +01:00
Matthew Honnibal
de0e6385b4 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-18 16:17:28 +01:00
Matthew Honnibal
fe442cac53 Fix #717: Set correct lemma for contracted verbs 2017-03-18 16:16:10 +01:00
ines
ad934a9abd Add regression test for #693 2017-03-18 16:12:30 +01:00
ines
f57c616830 Add regression test for #704 and test new model (resolves #704)
(using new English model)
2017-03-18 16:04:14 +01:00
Matthew Honnibal
413138de79 Fix #719: Lemmatizer can no longer output empty string 2017-03-18 16:02:06 +01:00
ines
ab1451f997 Don't mark compatibility test as slow 2017-03-18 15:17:39 +01:00
ines
ec3e810662 Add directory cli and set up command line interface 2017-03-18 15:14:48 +01:00
ines
cd94ea1095 Use info module for spacy.info() 2017-03-18 13:01:26 +01:00
ines
e3e25c0a33 Add spacy.info module
Print info about spaCy installation, local setup and models. Allow
export in Markdown format to copy-paste into GitHub issues.
2017-03-18 13:01:16 +01:00
ines
0eafc0f2c6 Add util functions to print data as table or markdown list 2017-03-18 13:00:14 +01:00
ines
6b9b444065 Fix imports 2017-03-18 12:59:41 +01:00
ines
a035ebd32a Use pathlib.Path instead of os.path 2017-03-18 12:59:21 +01:00
ines
9605cf39cc Handle default path in Language classes 2017-03-18 12:58:45 +01:00
Matthew Honnibal
ac4b88cce9 Fix auto-linking in download command 2017-03-17 21:36:13 +01:00
ines
8a34c3e666 Fix shortcut name 2017-03-17 20:07:34 +01:00
Matthew Honnibal
6420f86f02 Merge changes to __init__.py 2017-03-17 19:51:45 +01:00
ines
e01fbacf81 Update resolve_model_name 2017-03-17 19:26:28 +01:00
ines
aedefef49d Add function to resolve model names and link them 2017-03-17 18:47:05 +01:00
Matthew Honnibal
d013aba7b5 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-17 18:30:53 +01:00
Matthew Honnibal
854cfce7cf Make vocabs more compatible across versions
Previously, symbols were inserted into the string-store
before strings were loaded. This meant that adding a symbol
would invalidate saved models. We now make sure that strings
are loaded faithfully, so that compatibility is maintained.
2017-03-17 18:29:04 +01:00
Matthew Honnibal
1cc841e600 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-17 08:18:11 -05:00
Matthew Honnibal
4bfc55b532 Auto-add words to vocab when loading vectors
When calling vocab.load_vectors_from_bin_loc, ensure that missing
entries are added to the vocab. Otherwise, loading vectors into an
empty vocab object resulted in no vectors being added.
2017-03-17 08:15:59 -05:00
ines
0e533ad0cc Mark compatibility table test as slow (temporary)
Prevent Travis from running the test until models repo is published
2017-03-17 13:11:36 +01:00
ines
279b1d1965 Update version 2017-03-17 12:43:08 +01:00
ines
8af4b9e4df Fix compatibility.json link 2017-03-17 12:43:03 +01:00
Matthew Honnibal
a630726b13 Fix typo in tests 2017-03-16 20:50:36 -05:00
Matthew Honnibal
f98b30583f Fix tests 2017-03-16 19:48:00 -05:00
Matthew Honnibal
db51abf685 Fix tests 2017-03-16 18:53:47 -05:00
Matthew Honnibal
adb0b7e43b Fix loading when no package found 2017-03-16 18:30:23 -05:00
Matthew Honnibal
5c66cffafd Add tag map for Spanish 2017-03-16 18:05:15 -05:00
Matthew Honnibal
c4351e1165 Update base-form check in lemmatizer, for UD 2.0 morphology 2017-03-16 17:59:31 -05:00
Matthew Honnibal
1e10383e1b Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-16 17:41:13 -05:00
Matthew Honnibal
859315863a Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-16 17:40:07 -05:00
Matthew Honnibal
fea9fe08af Merge pull request #866 from juanmirocks/master
Fix lemmatization of OOV words
2017-03-16 23:37:36 +01:00
Matthew Honnibal
ffd4a19383 Increment version 2017-03-16 17:35:57 -05:00
Matthew Honnibal
28bb546939 Merge pull request #883 from ericzhao28/master
Add `lower_` and `upper_` properties to `Span` class
2017-03-16 23:35:47 +01:00
ines
fd60961825 Fix spacing 2017-03-16 23:23:26 +01:00
Matthew Honnibal
890747d8ff Fix trailing whitespace on morphology features 2017-03-16 17:07:37 -05:00
Matthew Honnibal
af41a9790c Merge remote-tracking branch 'origin/develop-downloads' 2017-03-16 20:41:37 +01:00
Matthew Honnibal
303a56f173 Get absolute path for linking 2017-03-16 20:41:23 +01:00
ines
3d484c3faf Don't print in parse_package_meta and accept on_error callback instead
TODO: log warning for missing meta data in spacy.link, as this affects
the Language class returned by spacy.load()
2017-03-16 20:34:50 +01:00
ines
d8c984b65e Don't exit if no model meta data is present 2017-03-16 20:33:33 +01:00
Matthew Honnibal
2524efc0ac Merge remote-tracking branch 'origin/develop-downloads' 2017-03-16 20:20:41 +01:00
ines
8253581057 Link model automatically if not direct download 2017-03-16 19:54:51 +01:00
Matthew Honnibal
8843b84bd1 Merge remote-tracking branch 'origin/develop-downloads' 2017-03-16 12:00:42 -05:00
Matthew Honnibal
55f813bfbb Don't reapply the model during training 2017-03-16 11:59:43 -05:00
Matthew Honnibal
c90dc7ac29 Clean up state initialisation in transition system 2017-03-16 11:59:11 -05:00
Matthew Honnibal
a46933a8fe Clean up FTRL parsing stuff. 2017-03-16 11:58:20 -05:00
ines
618ce3b425 Add .meta to Language object
Allows getting the current model's meta data, e.g.:

    nlp = spacy.load('my-model')
    print(nlp.meta)
2017-03-16 17:14:56 +01:00
ines
e348d4434c Add spacy.info(model_name) to show model meta
Allows "previewing" model before loading and making sure it's linked
correctly.
2017-03-16 17:13:40 +01:00
ines
eea3b35e3f Update model loading to support links
Remove match_best_version check, fetch model language from meta instead
of directory name, and don't make too many assumptions – if model is
downloaded via downloader, version should match anyway. (Otherwise,
users should be free to add and load whichever models they want.)
2017-03-16 17:13:08 +01:00
ines
5f3f04bd0a Add util function to load and parse package meta.json 2017-03-16 17:10:05 +01:00
ines
7f920c2f75 Don't break text when rendering print_msg 2017-03-16 17:09:50 +01:00
ines
16a63d9676 Add docstring 2017-03-16 17:09:11 +01:00
ines
68c04fa897 Move sys_exit() function to util 2017-03-16 17:08:58 +01:00
ines
ccd1a79988 Add spacy.link module to link model directories to shortcuts 2017-03-16 17:01:51 +01:00
Matthew Honnibal
2611ac2a89 Fix scorer bug for NER, related to ambiguity between missing annotations and misaligned tokens 2017-03-16 09:38:28 -05:00
ines
595d89698a Add basestring 2017-03-16 10:01:14 +01:00
ines
7b2eca36e4 Revert "Fix formatting and remove unused code"
This reverts commit d7898d586f.
2017-03-16 09:58:41 +01:00
ines
2f0db1dd36 Use small English model as default 2017-03-16 09:54:40 +01:00
Matthew Honnibal
3d0833c3df Fix off-by-1 in parse features fill_context 2017-03-15 19:55:35 -05:00
Matthew Honnibal
4ef68c413f Approximate cost in Break transition, to speed things up a bit. 2017-03-15 16:40:27 -05:00
Matthew Honnibal
8543db8a5b Use ftrl optimizer in parser 2017-03-15 11:56:37 -05:00
ines
4cfc8ffbd2 Reformat pickle tests 2017-03-15 17:39:54 +01:00
ines
2a0fcf1354 Add tests for new download module 2017-03-15 17:39:43 +01:00
ines
71956c94db Handle deprecated language-specific model downloading 2017-03-15 17:37:55 +01:00
ines
58b884b6d4 Refactor download script and about.py to use new download method 2017-03-15 17:37:18 +01:00
ines
f5d1a39a5b Add util functions for printing and wrapping messages 2017-03-15 17:35:57 +01:00
ines
d7898d586f Fix formatting and remove unused code 2017-03-15 17:35:41 +01:00
ines
b672e95045 Fix formatting 2017-03-15 17:35:04 +01:00
ines
0474e706a0 Remove unused deprecated functions for sputnik 2017-03-15 17:34:54 +01:00
ines
b13e7f79b4 Fix formatting and remove unused imports 2017-03-15 17:33:57 +01:00
ines
1101fd3855 Fix formatting and remove unused imports 2017-03-15 17:33:39 +01:00
ines
842782c128 Move fix_deprecated_glove_vectors_loading to deprecated.py 2017-03-15 17:33:29 +01:00
Matthew Honnibal
4cab8ac136 Update morph exceptions test 2017-03-15 09:31:34 -05:00
Matthew Honnibal
d719f8e77e Use nogil in parser, and set L1 to 0.0 by default 2017-03-15 09:31:01 -05:00
Matthew Honnibal
c61c501406 Update beam-parser to allow parser to maintain nogil 2017-03-15 09:30:22 -05:00
Matthew Honnibal
3d4e389d23 Whitespace 2017-03-15 09:29:42 -05:00
Matthew Honnibal
7769bc31e3 Add beam-search classes 2017-03-15 09:27:41 -05:00
Matthew Honnibal
c79b3129e3 Fix setting of empty lexeme in initial parse state 2017-03-15 09:26:53 -05:00
Matthew Honnibal
d864708072 Add more morphology names in attrs.pyx 2017-03-15 09:26:16 -05:00
Matthew Honnibal
b382dc902c Add morph rules in Language 2017-03-15 09:24:40 -05:00
Matthew Honnibal
8dbff4f5f4 Wire up English lemma and morph rules. 2017-03-15 09:23:22 -05:00
Matthew Honnibal
f70be44746 Use lemmatizer in code, not from downloaded model. 2017-03-15 04:52:50 -05:00
ines
42ba740dde Revert "Merge branch 'debug'"
This reverts commit 89b79d1178, reversing
changes made to 02bdf490a1.
2017-03-13 20:11:52 +01:00
ines
4c5f51e49e Update regression test 2017-03-13 15:16:11 +01:00
ines
02bdf490a1 Remove regression test to see if it caused pytest Travis error 2017-03-13 13:00:22 +01:00
ines
17018750ac Add regression test for #717 2017-03-13 12:58:22 +01:00
ines
2883ebfca2 Remove print statement 2017-03-13 12:30:42 +01:00
ines
98c13d8aa9 Add regression test for #401 2017-03-13 12:28:41 +01:00
ines
444d665f9d Add regression test for #686 2017-03-13 12:23:35 +01:00
ines
46b17e5b51 Add regression test for #719 2017-03-13 12:17:35 +01:00
ines
c8ae682ff9 Add regression test for #636 2017-03-13 12:08:31 +01:00
ines
337f9601f2 Add missing unicode declaration 2017-03-13 12:08:19 +01:00
ines
d70386ec6e Update docstring in #886 regression test 2017-03-13 12:00:38 +01:00
ines
51ba3ef0a8 Add regression test for #886 2017-03-13 11:44:58 +01:00
ines
eec3f21c50 Add WordNet license 2017-03-12 13:58:24 +01:00
ines
f9e603903b Rename stop_words.py to word_sets.py and include more sets
NUM_WORDS and ORDINAL_WORDS are currently not used, but the hard-coded
list should be removed from orth.pyx and replaced with
language-specific functions. This will later allow other languages to
use their own functions to set those flags. (In English, this is easier
because it only needs to be checked against a set – in German for
example, this requires a more complex function, as most number words
are one word.)
2017-03-12 13:58:22 +01:00
ines
f24f9b4b7b Remove unused code 2017-03-12 13:58:22 +01:00
ines
1da29a7146 Use new Lemmatizer data and remove file import
Since there's currently only an English lemmatizer, the global
Lemmatizer imports from spacy.en. This is not ideal and still needs to be
fixed.
2017-03-12 13:58:22 +01:00
ines
0957737ee8 Add Python-formatted lemmatizer data and rules 2017-03-12 13:58:22 +01:00
ines
c89e30d1a3 Add test for English time exceptions ("1a.m." etc.) 2017-03-12 13:58:22 +01:00
ines
ce9568af84 Move English time exceptions ("1a.m." etc.) and refactor 2017-03-12 13:58:22 +01:00
ines
6b30541774 Fix formatting 2017-03-12 13:58:22 +01:00
Ines Montani
e97a30b99a Merge pull request #885 from PySUST/master
[Bengali] Spell checked and add new stop words
2017-03-12 13:20:59 +01:00
ines
66c1f194f9 Use consistent unicode declarations 2017-03-12 13:07:28 +01:00
shuvanon
91cb4cdb2b Sort stop_words 2017-03-12 17:55:51 +06:00
shuvanon
784f6cfa49 Update stop_words 2017-03-12 17:41:01 +06:00
shuvanon
73cc17078e Merge branch 'master' of https://github.com/PySUST/spaCy 2017-03-12 14:52:17 +06:00
shuvanon
35ec7135bb Spell checked and add new stop words 2017-03-12 14:51:34 +06:00
Em
9c809efc25 Removed mapStr 2017-03-11 16:23:26 -08:00
Matthew Honnibal
fa23278ee3 Add classes for beam parser and beam NER 2017-03-11 12:45:37 -06:00
Matthew Honnibal
6c4108c073 Add header for beam parser 2017-03-11 12:45:12 -06:00
Matthew Honnibal
4382f175b3 Squelch compiler warnings 2017-03-11 12:44:43 -06:00
Matthew Honnibal
ea2592879f Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-11 11:13:37 -06:00
Matthew Honnibal
1224c4d3c6 Improve output on trainer 2017-03-11 11:12:48 -06:00
Matthew Honnibal
b438dfd3f3 Add itn argument to tagger.update 2017-03-11 11:12:21 -06:00
Matthew Honnibal
931feb3360 Allow beam parsing for NER 2017-03-11 11:12:01 -06:00
Matthew Honnibal
f77a5bb60a Switch back to greedy parser 2017-03-11 11:11:30 -06:00
Matthew Honnibal
ca9c8c57c0 Add iteration argument to parser.update 2017-03-11 07:00:47 -06:00
Matthew Honnibal
dcce9ca3f3 Use beam parser 2017-03-11 07:00:20 -06:00
Matthew Honnibal
e30ffdd003 Use ftrl optimizer in tagger 2017-03-11 06:59:13 -06:00
Matthew Honnibal
d59c6926c1 I think this fixes the segfault 2017-03-11 06:58:34 -06:00
Matthew Honnibal
318b9e32ff WIP on beam parser. Currently segfaults. 2017-03-11 06:19:52 -06:00
Em
426d17167f Added string manipulation for spans 2017-03-10 16:50:02 -08:00
Matthew Honnibal
b0d80dc9ae Update name of 'train' function in BeamParser 2017-03-10 14:35:43 -06:00
Matthew Honnibal
d11f1a4ddf Record negative costs in non-monotonic arc eager oracle 2017-03-10 11:22:04 -06:00
Matthew Honnibal
ecf91a2dbb Support beam parser 2017-03-10 11:21:21 -06:00