spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 11:29:13 +03:00

Author	SHA1	Message	Date
Ines Montani	89ad095900	Fix whitespace	2019-02-05 12:32:20 +01:00
Sofie	9745b0d523	Improve Italian & Urdu tokenization accuracy (#3228 ) ## Description 1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour. 2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour. ### Types of change Enhancement of Italian & Urdu tokenization ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-04 22:39:25 +01:00
Sofie	a3efa3e8d9	Improve Catalan tokenization accuracy (#3225 ) * small hyphen clean up for French * catalan infix similar to french	2019-02-04 20:37:19 +11:00
Ines Montani	e00680a33a	Remove unused outdated file	2019-02-01 11:39:48 +01:00
Matthew Honnibal	27e3f98cae	Set version to v2.1.0a7.dev0	2019-02-01 18:06:34 +11:00
Sofie	46dfe773e1	Replacing regex library with re to increase tokenization speed (#3218 ) * replace unicode categories with raw list of code points * simplifying ranges * fixing variable length quotes * removing redundant regular expression * small cleanup of regexp notations * quotes and alpha as ranges instead of alternations * removed most regexp dependencies and features * exponential backtracking - unit tests * rewrote expression with pathological backtracking * disabling double hyphen tests for now * test additional variants of repeating punctuation * remove regex and redundant backslashes from load_reddit script * small typo fixes * disable double punctuation test for russian * clean up old comments * format block code * final cleanup * naming consistency * french strings as unicode for python 2 support * french regular expression case insensitive	2019-02-01 18:05:22 +11:00
Amandine Périnet	d570e75dbb	Improving the French lookup dictionary for ambiguous words (#3185 ) * modifying FR lookup to remove ambiguity and adding lookup vocab to FR files * modifying FR lookup to remove ambiguity and adding lookup vocab to FR files * updating the contributor agreement for amperinet	2019-01-31 23:53:45 +01:00
Ines Montani	e9a6dbe4f3	Don't check for Jupyter in global scope and fix check (#3213 ) Resolves #3208. Prevent interactions with other libraries (pandas) that also access `get_ipython().config` and its parameters. See #3208 for details. I don't fully understand why this happens, but in spaCy, we can at least make sure we avoid calling into this method. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-31 23:49:13 +01:00
Amandine Périnet	b34bc9d2e9	add small fix for French lemmatizer (#3206 )	2019-01-31 23:44:10 +01:00
Loghi	5ca8e2b269	Tamil (#3194 ) * Tamil language support: stop words, examples and numerical attribute support added. Contributor agreement signed * Create Loghijiaha.md Added contributor agreement * Update CONTRIBUTOR_AGREEMENT.md Adjusted contributor_agreement.md * Norm exceptions added	2019-01-27 06:02:04 +01:00
foufaster	8bd85fd9d5	Fix french lemmatization (#3180 )	2019-01-27 06:01:30 +01:00
Sofie	66016ac289	Batch UD evaluation script (#3174 ) * running UD eval * printing timing of tokenizer: tokens per second * timing of default English model * structured output and parameterization to compare different runs * additional flag to allow evaluation without parsing info * printing verbose log of errors for manual inspection * printing over- and undersegmented cases (and combo's) * add under and oversegmented numbers to Score and structured output * print high-freq over/under segmented words and word shapes * printing examples as part of the structured output * print the results to file * batch run of different models and treebanks per language * cleaning up code * commandline script to process all languages in spaCy & UD * heuristic to remove blinded corpora and option to run one single best per language * pathlib instead of os for file paths	2019-01-27 06:01:02 +01:00
Matthew Honnibal	5a4737df09	Set version to 2.1.0a6	2019-01-21 18:32:34 +01:00
Matthew Honnibal	246538be2e	Set version to 2.1.0a6.dev1	2019-01-21 15:12:17 +01:00
Matthew Honnibal	77ddcf7381	💫 Update matcher engine for regex and extensions (#3173 ) * Update matcher engine for regex and extensions Add support for matching over arbitrary Python predicate functions, and arbitrary Python attribute getters. This will allow matching over regex patterns, and allow supporting extension attributes. The results of the Python predicate functions are cached, so that we don't call the same predicate function twice for the same token. The extension attributes are fetched into an array for each token in the doc. This should minimise the performance impact of the new features. We still need to wire up these features to the patterns, and test it all. * Work on wiring up extra attributes in matcher * Work on tests for extra matcher attrs * Add support for extension attrs to matcher * Test extension attribute matching * Work on implementing predicate-based match patterns * Get predicates working for set membership * Add test for set membership * Make extensions+predicates work * Test matcher extensions * Cache predicate results better in Matcher * Remove print statement in matcher test * Use srsly to get key for predicates	2019-01-21 13:23:15 +01:00
Björn Lennartsson	b892b446cc	Updates to Swedish Language (#3164 ) * Added the same punctuation rules as danish language. * Added abbreviations and also the possibility to have capitalized abbreviations on some. Added a few specific cases too * Added test for long texts in swedish * Added morph rules, infixes and suffixes to __init__.py for swedish * Added some tests for prefixes, infixes and suffixes * Added tests for lemma * Renamed files to follow convention * [sv] Removed ambiguous abbreviations * Added more tests for tokenizer exceptions * Added test for problem with punctuation in issue #2578 * Contributor agreement * Removed faulty lemmatization of 'jag' ('I') as it was lemmatized to 'jaga' ('hunt')	2019-01-16 13:45:50 +01:00
Gavriel Loria	9a5003d5c8	iob converter: add 'exception' for error 'too many values' (#3159 ) * added contributor agreement * issue #3128 throw exception on bad IOB/2 formatting * Update spacy/cli/converters/iob2json.py with ValueError Co-Authored-By: gavrieltal <gtloria@protonmail.com>	2019-01-16 13:44:16 +01:00
Mark Neumann	e599ed9ef8	Allow vectors to be optional in init-model, more robust string counting (#3155 ) * more robust init-model * key not word * add license agreement	2019-01-14 23:48:30 +01:00
mauryaland	214c2ec263	check if argument flat is true or not (#3156 )	2019-01-14 23:47:05 +01:00
Loghi	d97661d18b	Tamil language support (#3154 ) Tamil language support to spaCy Description Hereby, creating new PR to add support for Tamil language in spaCy added stop words, examples and numerical attributes <--Working on other language data--> Types of change Enhancement Checklist [ x] I have submitted the spaCy Contributor Agreement. [x ] I ran the tests, and all new and existing tests passed. [ x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-14 15:32:30 +01:00
Amandine Périnet	ee24e2534d	French lemmatization: adding lemmas for adverbs and irregular lemmas for function words (#3131 ) * adding adverbs and irregular cases for empty words * adding adverbs and irregular cases for empty words * adding adverbs and irregular cases for empty words * updating contributor agreement for amperinet	2019-01-10 15:41:15 +01:00
Kirill Bulygin	7b064542f7	Making `lang/th/test_tokenizer.py` pass by creating `ThaiTokenizer` (#3078 )	2019-01-10 15:40:37 +01:00
Álvaro Abella Bascarán	1cd8f9823f	Correct docs of `Token.subtree` and `Span.subtree` (issue #3122 ) (#3124 ) * solve inconsistency between docs and Span.subtree (issue #3122) * solve inconsistency between docs and Token.subtree (issue #3122)	2019-01-09 03:11:15 +01:00
Amandine Périnet	eef11a7a2c	French lemmatization: correcting wrong lemmas in the lookup dictionary (#3104 ) * modifying French lookup that contained wrong lemmas * correcting wrong line breaks on hyphen * adding contributor agreement for amperinet@ * correcting a typo	2019-01-07 14:15:19 +01:00
Álvaro Abella Bascarán	e03e1eee92	Bugfix/get lca matrix (#3110 ) This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](https://github.com/explosion/spaCy/pull/3089) (sorry!). ## Description The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed. ### Types of change Bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-06 19:07:50 +01:00
Matthew Honnibal	fe4e68cb71	Set version to v2.1.0a6.dev0	2019-01-05 14:44:42 +01:00
Matthew Honnibal	3c09d3d986	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-30 15:49:57 +01:00
Matthew Honnibal	d8d0ce081b	Fix clobber of doc.is_tagged in doc.from_array() If doc.from_array() was called with say, only entity information, this would cause doc.is_tagged to be set to False, even if tags were set. This caused tags to be dropped from serialisation. The same was true for doc.is_parsed. Closes #3012.	2018-12-30 15:48:10 +01:00
Matthew Honnibal	bf20252ae0	Update test for #3012	2018-12-30 15:46:46 +01:00
Matthew Honnibal	63b7accd74	💫 Make span.as_doc() return a copy, not a view. Closes #1537 (#3107 ) Initially span.as_doc() was designed to return a view of the span's contents, as a Doc object. This was a nice idea, but it fails due to the token.idx property, which refers to the character offset within the string. In a span, the idx of the first token might not be 0. Because this data is different, we can't have a view --- it'll be inconsistent. This patch changes span.as_doc() to instead return a copy. The docs are updated accordingly. Closes #1537 * Update test for span.as_doc() * Make span.as_doc() return a copy. Closes #1537 * Document change to Span.as_doc()	2018-12-30 15:17:46 +01:00
Matthew Honnibal	72e4d3782a	Resize doc.tensor when merging spans. Closes #1963 (#3106 ) The doc.retokenize() context manager wasn't resizing doc.tensor, leading to a mismatch between the number of tokens in the doc and the number of rows in the tensor. We fix this by deleting rows from the tensor. Merged spans are represented by the vector of their last token. * Add test for resizing doc.tensor when merging * Add test for resizing doc.tensor when merging. Closes #1963 * Update get_lca_matrix test for develop * Fix retokenize if tensor unset	2018-12-30 15:17:17 +01:00
Matthew Honnibal	3d64eb4a74	Update get_lca_matrix test for develop	2018-12-30 14:28:07 +01:00
Matthew Honnibal	ac9e3a4a8b	Add test for #1773	2018-12-30 13:16:05 +01:00
Matthew Honnibal	ee4d06fb1b	Prevent exceptions from setting POS but not TAG. Closes #1773	2018-12-30 13:16:05 +01:00
Kirill Bulygin	b665a32b95	Enabling `tests/lang/ru/test_lemmatizer.py`, fixing a `unicode` issue (#3084 ) ## Description See #3079. Here I'm merging into `develop` instead of `master`. ### Types of change Bug fix. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-12-30 12:10:26 +01:00
Álvaro Abella Bascarán	9bc4cc1352	Fix issue 2396 (#3089 ) * Test on #2396: bug in Doc.get_lca_matrix() * reimplementation of Doc.get_lca_matrix(), (closes #2396) * reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix() * tests Span.get_lca_matrix() as well as Doc.get_lca_matrix() * implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix * use memory view instead of np.ndarray in _get_lca_matrix (faster) * fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview * cleaner conditional, add comment	2018-12-29 18:05:52 +01:00
Álvaro Abella Bascarán	6fe276f85d	Fix issue 2396 (#3089 ) * Test on #2396: bug in Doc.get_lca_matrix() * reimplementation of Doc.get_lca_matrix(), (closes #2396) * reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix() * tests Span.get_lca_matrix() as well as Doc.get_lca_matrix() * implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix * use memory view instead of np.ndarray in _get_lca_matrix (faster) * fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview * cleaner conditional, add comment	2018-12-29 18:02:26 +01:00
Matthew Honnibal	76e3e695af	Allow single string attributes in doc.to_array() Previously inputs like doc.to_array('ORTH') didn't work. Closes #3064	2018-12-29 16:24:40 +01:00
Matthew Honnibal	174e85439b	Fix behaviour of Matcher's ? quantifier for v2.1 (#3105 ) * Add failing test for matcher bug #3009 * Deduplicate matches from Matcher * Update matcher ? quantifier test * Fix bug with ? quantifier in Matcher The ? quantifier indicates a token may occur zero or one times. If the token pattern fit, the matcher would fail to consider valid matches where the token pattern did not fit. Consider a simple regex like: .?b If we have the string 'b', the .? part will fit --- but then the 'b' in the pattern will not fit, leaving us with no match. The same bug left us with too few matches in some cases. For instance, consider: .?.? If we have a string of length two, like 'ab', we actually have three possible matches here: [a, b, ab]. We were only recovering 'ab'. This should now be fixed. Note that the fix also uncovered another bug, where we weren't deduplicating the matches. There are actually two ways we might match 'a' and two ways we might match 'b': as the second token of the pattern, or as the first token of the pattern. This ambiguity is spurious, so we need to deduplicate. Closes #2464 and #3009 * Fix Python2	2018-12-29 16:18:09 +01:00
Matthew Honnibal	e808bdd076	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-29 13:54:15 +01:00
Jari Bakken	ba8a840f84	spacy.cli.evaluate: fix TypeError (#3101 )	2018-12-28 11:14:28 +01:00
Jari Bakken	0546135fba	Set vectors.name when updating meta.json during training (#3100 ) * Set vectors.name when updating meta.json during training * add vectors name to meta in `spacy package`	2018-12-27 19:55:40 +01:00
Jari Bakken	cc95167b6d	cli.convert: fix typo in converter arguments (#3099 )	2018-12-27 18:08:41 +01:00
Jari Bakken	e172f2478e	Add three missing tags from the `nb` tag map (#3085 ) * Contributors agreement for jarib * Add tags from the UD/NORNE dataset that are missing in the nb tag map. Relates to #3082.	2018-12-27 14:48:40 +01:00
Will Price	4a6af0852a	Improve random prefix generation in displaCy arcs (#3096 ) * Improve random prefix generation in displaCy arcs * Add @willprice contributor agreement	2018-12-27 14:46:02 +01:00
Özcan Kasal	b573ebca77	trilyon forgotten (#3083 ) * trilyon forgotten * contributor added	2018-12-27 14:44:23 +01:00
Matthew Honnibal	978d8be8f9	Set version to v2.1.0a5	2018-12-21 00:26:39 +01:00
Matthew Honnibal	d3f03b1668	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-21 00:25:57 +01:00
Ines Montani	bb9ad37e05	Improve entry points and allow custom language classes via entry points (#3080 ) * Remove check for overwritten factory This needs to be handled differently – on first initialization, a new factory will be added and any subsequent initializations will trigger this warning, even if it's a new entry point that doesn't overwrite a built-in. * Add helper to only load specific entry point Useful for loading languages via entry points, so that they can be lazy-loaded. Otherwise, all entry point languages would have to be loaded upfront. * Check entry points for custom languages	2018-12-20 23:58:43 +01:00
Matthew Honnibal	f6ac00fab3	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-20 18:45:44 +01:00
Matthew Honnibal	d8d27f9129	Set version to v2.1.0a5.dev0	2018-12-20 18:45:34 +01:00
Ines Montani	ca244f5f84	Small fixes to displaCy (#3076 ) ## Description - [x] fix auto-detection of Jupyter notebooks (even if `jupyter=True` isn't set) - [x] add `displacy.set_render_wrapper` method to define a custom function called around the HTML markup generated in all calls to `displacy.render` (can be used to allow custom integrations, callbacks and page formatting) - [x] add option to customise host for web server - [x] show warning if `displacy.serve` is called from within Jupyter notebooks - [x] move error message to `spacy.errors.Errors`. ### Types of change enhancement ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-12-20 17:32:04 +01:00
Matthew Honnibal	f57bea8ab6	💫 Prevent parser from predicting unseen classes (#3075 ) The output weights often return negative scores for classes, especially via the bias terms. This means that when we add a new class, we can't rely on just zeroing the weights, or we'll end up with positive predictions for those labels. To solve this, we use nan values as the initial weights for new labels. This prevents them from ever coming out on top. During backprop, we replace the nan values with the minimum assigned score, so that we're still able to learn these classes.	2018-12-20 16:12:22 +01:00
Matthew Honnibal	9ec9f89b99	💫 Raise better error when using uninitialized pipeline component (#3074 ) After creating a component, the `.model` attribute is left with the value `True`, to indicate it should be created later during `from_disk()`, `from_bytes()` or `begin_training()`. This had led to confusing errors if you try to use the component without initializing the model. To fix this, we add a method `require_model()` to the `Pipe` base class. The `require_model()` method needs to be called at the start of the `.predict()` and `.update()` methods of the components. It raises a `ValueError` if the model is not initialized. An error message has been added to `spacy.errors`.	2018-12-20 15:54:53 +01:00
Muhammad Irfan	2e84ec1513	Fixed ISO code for Urdu. (#3073 )	2018-12-20 12:28:53 +01:00
Matthew Honnibal	c315e08e6e	Fix formatting of meta.json after spacy package	2018-12-19 14:36:08 +01:00
Matthew Honnibal	e24f94ce39	Fix handling of preset entities. closes #2779	2018-12-19 02:13:31 +01:00
Matthew Honnibal	faa8656582	Port parser fix for large label sets from master	2018-12-19 02:11:26 +01:00
Matthew Honnibal	99a84e4d0e	Make ParserModel.resize_output idempotent	2018-12-19 02:10:36 +01:00
Matthew Honnibal	0f83b98afa	Remove unused code from spacy pretrain	2018-12-18 19:19:26 +01:00
Ken	5f0c5fbfa4	issue #3012 : add test (#3021 ) * issue #3012: add test * add contributor agreement * Make test work without models and fix typos ten.pos_ instead of ten.orth_ and comparison against "10" instead of integer 10	2018-12-18 15:02:49 +01:00
Ines Montani	77a47b2b20	Auto-format	2018-12-18 15:02:11 +01:00
Kirill Bulygin	2fb004832f	Fix the first `nlp` call for `ja` (closes #2901 ) (#3065 ) * Fix the first `nlp` call for `ja` (closes #2901) * Add unicode declaration, formatting and use relative import	2018-12-18 15:01:06 +01:00
Kirill Bulygin	10189d9092	Fix the first `nlp` call for `ja` (closes #2901 ) (#3065 ) * Fix the first `nlp` call for `ja` (closes #2901) * Add unicode declaration, formatting and use relative import	2018-12-18 14:53:50 +01:00
Ines Montani	ae880ef912	Tidy up merge conflict leftovers	2018-12-18 13:58:30 +01:00
Ines Montani	61d09c481b	Merge branch 'master' into develop	2018-12-18 13:48:10 +01:00
Brixjohn	52f3c95004	Added alpha support for Tagalog language (#3062 ) I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language Filipino. I have heavily based the format on the EN and ES languages. I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to their Tagalog counterparts, added some tokenizer exceptions, and kept the tag map the same as the English language. While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases. * Added alpha support for Tagalog language * Edited contributor template * Included SCA; Reverted templates * Fixed SCA template * Fixed changes in SCA template	2018-12-18 13:08:38 +01:00
Matthew Honnibal	92f4b9c8ea	set max batch size to 1000	2018-12-17 23:15:39 +00:00
Matthew Honnibal	3c4a2edf4a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-17 23:08:40 +00:00
Matthew Honnibal	95fc0176d1	Pass tagger options in begin_training	2018-12-17 23:08:31 +00:00
Matthew Honnibal	7c504b6ddb	Try to implement more losses for pretraining * Try to implement cosine loss This one seems to be correct? Still unsure, but it performs okay * Try to implement the von Mises-Fisher loss This one's definitely not right yet.	2018-12-17 14:48:27 +00:00
Matthew Honnibal	ab4b61fb6e	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-16 20:11:43 +01:00
Matthew Honnibal	9ef30b0cde	Accept 'text' in matcher as an alternative to ORTH	2018-12-16 20:10:43 +01:00
Amandine Périnet	361554f629	Lemmatization of Adjectives - French : adding rules and vocabulary (#3045 ) * modifying FR lemmatisation for Adjectives * adding contributor agreement for amperinet * correcting some errors in vocabulary files	2018-12-16 18:11:07 +01:00
Sofie	c6ad557cea	French regular expressions instead of extensive exceptions list (on develop) (#3046 ) (resolves #2679 ) * merge changes of PR 3023 into develop branch instead of master * further deletions from exception list according to PR 3023	2018-12-16 18:04:55 +01:00
Ines Montani	7bbdffd36e	Remove pre-set lemma for "cause" (resolves #2165 )	2018-12-14 12:51:18 +01:00
Shooter23	6ae8e49bff	Fix docstring for is_right_punct(). (#3044 )	2018-12-14 10:11:11 +01:00
Matthew Honnibal	ab9494b2a3	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-12 21:08:50 +00:00
Matthew Honnibal	fb56028476	Remove b1 and b2 decay	2018-12-12 12:37:07 +01:00
Matthew Honnibal	df15279e88	Reduce batch size during pretrain	2018-12-10 15:30:23 +00:00
Matthew Honnibal	83ac227bd3	💫 Better support for semi-supervised learning (#3035 ) The new spacy pretrain command implemented BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting Support semi-supervised learning in spacy train One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing. Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning. Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective. Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage: python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-text ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze Implement rehearsal methods for pipeline components The new --raw-text argument and nlp.rehearse() method also give us a good place to implement the idea in the pseudo-rehearsal blog post in the parser. 
This works as follows: Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model. Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details. Implement rehearsal updates for tagger Implement rehearsal updates for text categoriz	2018-12-10 16:25:33 +01:00
Matthew Honnibal	449b889454	Fix KeyError in Vectors.most_similar. Fixes #2648	2018-12-10 16:19:18 +01:00
Matthew Honnibal	90aec6d2f6	Fix vectors for reserved words. Closes #2871	2018-12-10 16:09:49 +01:00
Matthew Honnibal	16fd8dce1d	Add get_string_id helper to spacy.strings	2018-12-10 16:09:26 +01:00
Matthew Honnibal	cc1ea03004	Add test for issue #2871 -- vectors for reserved words	2018-12-10 16:09:10 +01:00
Matthew Honnibal	375f0dc529	💫 Make TextCategorizer default to a simpler, GPU-friendly model (#3038 ) Currently the TextCategorizer defaults to a fairly complicated model, designed partly around the active learning requirements of Prodigy. The model's a bit slow, and not very GPU-friendly. This patch implements a straightforward CNN model that still performs pretty well. The replacement model also makes it easy to use the LMAO pretraining, since most of the parameters are in the CNN. The replacement model has a flag to specify whether labels are mutually exclusive, which defaults to True. This has been a common problem with the text classifier. We'll also now be able to support adding labels to pretrained models again. Resolves #2934, #2756, #1798, #1748.	2018-12-10 14:37:39 +01:00
Matthew Honnibal	b1c8731b4d	Make spacy train respect LOG_FRIENDLY	2018-12-10 09:46:53 +01:00
Matthew Honnibal	6936ca1664	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-10 09:44:07 +01:00
Matthew Honnibal	4405b5c875	Fix resizing edge-case for NER	2018-12-10 06:25:17 +00:00
Matthew Honnibal	0994dc50d8	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-10 05:35:01 +00:00
Matthew Honnibal	24f2e9bc07	Tweak training params	2018-12-09 17:08:58 +00:00
Matthew Honnibal	16c5861d29	Fix NER space constraints Allow entities to end on spaces, to avoid stumping the oracle when we're inside an entity, and there's a space just before a correct entity.	2018-12-09 08:06:45 +01:00
Matthew Honnibal	1b1a1af193	Fix printing in spacy train	2018-12-09 06:03:49 +01:00
Matthew Honnibal	d2ac618af1	Set cbb_maxout_pieces=3	2018-12-08 23:27:29 +01:00
Matthew Honnibal	cb16b78b0d	Set dropout rate to 0.2	2018-12-08 19:59:11 +01:00
Matthew Honnibal	2c2db0c492	💫 Allow Span to take text label (#3031 ) Fixes #3027. * Allow Span.__init__ to take unicode values for the `label` argument. * Allow `Span.label_` to be writeable. - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-12-08 13:08:41 +01:00
Matthew Honnibal	11a29af751	Set cupy.random seed in fix_random_seed helper	2018-12-08 12:37:38 +01:00
Ines Montani	ffdd5e964f	Small CLI improvements (#3030 ) * Add todo * Auto-format * Update wasabi pin * Format training results with wasabi * Remove loading animation from model saving Currently behaves weirdly * Inline messages * Remove unnecessary path2str Already taken care of by printer * Inline messages in CLI * Remove unused function * Move loading indicator into loading function * Check for invalid whitespace entities	2018-12-08 11:49:43 +01:00
Matthew Honnibal	8aa7882762	Make NORM a token attribute (#3029 ) See #3028. The solution in this patch is pretty debateable. What we do is give the TokenC struct a .norm field, by repurposing the previously idle .sense attribute. It's nice to repurpose a previous field because it means the TokenC doesn't change size, so even if someone's using the internals very deeply, nothing will break. The weird thing here is that the TokenC and the LexemeC both have an attribute named NORM. This arguably assists in backwards compatibility. On the other hand, maybe it's really bad! We're changing the semantics of the attribute subtly, so maybe it's better if someone calling lex.norm gets a breakage, and instead is told to write lex.default_norm? Overall I believe this patch makes the NORM feature work the way we sort of expected it to work. Certainly it's much more like how the docs describe it, and more in line with how we've been directing people to use the norm attribute. We'll also be able to use token.norm to do stuff like spelling correction, which is pretty cool.	2018-12-08 10:49:10 +01:00
Matthew Honnibal	a338c6f8f6	Fix JSON segmentation bug that affected French Fix a bug in the JSON streaming code that GoldCorpus uses. Escaped slashes were being handled incorrectly. This bug caused low scores for French in the early v2.1.0 alphas, because most of the data was not being read in. Fittingly, the document that triggered the bug was a Wikipedia article about Perl. Parsing perl remains difficult!	2018-12-08 10:41:24 +01:00
Matthew Honnibal	b2bfd1e1c8	Move dropout and batch sizes out of global scope in train cmd	2018-12-07 20:54:35 +01:00
Matthew Honnibal	40e0da9cc1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-07 00:12:22 +00:00
Matthew Honnibal	1e6725e9b7	Try to prevent spaces from being tagged as entities	2018-12-07 00:12:12 +00:00
Matthew Honnibal	427c0693c8	Fix missing comma in init-model command	2018-12-06 22:48:31 +01:00
Amandine Périnet	0b44ea23bd	Lemmatization of Nouns - French : adding rules and vocabulary (#2992 ) * modifying FR lemmatization for nouns * modifying FR lemmatization for nouns * adding contributor agreement for amperinet * adding rules for words with inclusive parentheses wrongly tokenized * adding contributor agreement for amperinet * adding a missing comma	2018-12-06 22:42:18 +01:00
Matthew Honnibal	d896fbca62	Fix batch size in parser.pipe	2018-12-06 21:45:56 +01:00
Matthew Honnibal	bb3304a4f1	Fix pickle tests	2018-12-06 20:46:36 +01:00
Matthew Honnibal	e619f45287	Fix pickle tests	2018-12-06 20:43:47 +01:00
Matthew Honnibal	0a60726215	Remove cytoolz usage in CLI	2018-12-06 20:37:00 +01:00
Matthew Honnibal	c0af627f32	Fix dill usage in vocab	2018-12-06 18:53:16 +01:00
Matthew Honnibal	9520489225	Fix removal of dill (for srsly)	2018-12-06 18:46:09 +01:00
Matthew Honnibal	711f108532	Fix cytoolz import cytoolz	2018-12-06 16:04:12 +01:00
Gavriel Loria	9c8c4287bf	Accept iob2 and allow generic whitespace (#2999 ) * accept non-pipe whitespace as delimiter; allow iob2 filename * added small documentation note for IOB2 allowance * added contributor agreement	2018-12-06 15:50:25 +01:00
Amandine Périnet	2457318b7a	Lemmatization of Verbs - French : adding rules and vocabulary (#3006 ) * updating rules and vocabulary for French lemmatization of verbs * updating the file with French auxiliary verb * updating rules and vocabulary for French lemmatization of verbs * adding contributor agreement for amperinet * adding rules for words with inclusive parentheses wrongly tokenized	2018-12-06 15:49:28 +01:00
Beate Sildnes	f0d7e206ec	Updated wordforms for Norwegian lemmatizer (#3007 ) * Updated wordforms for Norwegian lemmatizer Upload of updated lists of wordforms for the Norwegian lemmatizer (nouns, verbs, adverbs, adjectives and lookup). * Add spaCy contributor agreement for user beatesi * Updated wordforms for Norwegian lemmatizer	2018-12-06 15:46:18 +01:00
Matthew Honnibal	cabaadd793	Fix build error from bad import Thinc v7.0.0.dev6 moved FeatureExtracter around and didn't add a compatibility import.	2018-12-06 15:12:39 +01:00
Matthew Honnibal	ea00dbaaa4	Remove usage of itertools.islice	2018-12-03 02:43:03 +01:00
Matthew Honnibal	c7b33b24f1	Fix conflict	2018-12-03 02:20:20 +01:00
Matthew Honnibal	2402ef498b	Remove unused import	2018-12-03 02:19:23 +01:00
Matthew Honnibal	1c71fdb805	Remove cytoolz usage from spaCy	2018-12-03 02:19:12 +01:00
Ines Montani	5b2741f751	Remove unused cytoolz / itertools imports	2018-12-03 02:12:07 +01:00
Matthew Honnibal	a7b085ae46	Set version back to 2.1.0a4	2018-12-03 02:03:26 +01:00
Matthew Honnibal	8e9a4d2f5e	Increment version to 2.1.0a5	2018-12-03 01:59:50 +01:00
Ines Montani	f37863093a	💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003 ) Remove hacks and wrappers, keep code in sync across our libraries and move spaCy a few steps closer to only depending on packages with binary wheels 🎉 See here: https://github.com/explosion/srsly Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy have steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place. At the same time, we noticed that having a lot of small dependencies was making maintenance harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel. srsly currently includes forks of the following packages: ujson msgpack msgpack-numpy cloudpickle * WIP: replace json/ujson with srsly * Replace ujson in examples Use regular json instead of srsly to make code easier to read and follow * Update requirements * Fix imports * Fix typos * Replace msgpack with srsly * Fix warning	2018-12-03 01:28:22 +01:00
Matthew Honnibal	40a273245c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-01 14:43:29 +01:00
Matthew Honnibal	d9d339186b	Fix dropout and batch-size defaults	2018-12-01 13:42:35 +00:00
Matthew Honnibal	9536ee787c	Add comma deletion to data noising	2018-12-01 13:42:18 +00:00
Matthew Honnibal	21ee1c7a17	Improve parser multi-task objective	2018-12-01 13:41:24 +00:00
Matthew Honnibal	fe7d6f36b1	Fix parser default	2018-12-01 13:41:04 +00:00
Matthew Honnibal	a31d557f2d	Set version to v2.1.0a4	2018-12-01 14:40:03 +01:00
Ines Montani	5c966d0874	Simplify function	2018-12-01 04:59:12 +01:00
Ines Montani	ce7eec846b	Move CLI-specific Markdown helper to CLI	2018-12-01 04:55:48 +01:00
Ines Montani	40ae499f32	Remove unused helper function Now imported from wasabi	2018-12-01 04:54:46 +01:00
Matthew Honnibal	bbaca991ba	Set version to v2.0.18	2018-12-01 03:35:09 +01:00
Matthew Honnibal	e1a4b0d7f7	Set version to v2.0.18.dev1	2018-12-01 03:12:12 +01:00
Matthew Honnibal	413530b269	Set version to 2.0.18	2018-12-01 03:00:27 +01:00
Matthew Honnibal	24d52876e1	Set version to v2.0.18.dev0	2018-12-01 02:38:04 +01:00
Matthew Honnibal	3139b020b5	Fix train script	2018-11-30 22:17:08 +00:00
Matthew Honnibal	4aa1002546	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-30 20:58:51 +00:00
Matthew Honnibal	6bd1cc57ee	Increase length limit for pretrain	2018-11-30 20:58:18 +00:00
Gavriel Loria	919729d38c	replace user-facing references to "sbd" with "sentencizer" (#2985 ) ## Description Fixes #2693 Previously, the tokens `sbd` and `sentencizer` would create the same nlp pipe. Internally, both would be called `sbd`. This setup became problematic because it was hard for a user relying on the `sentencizer` pipe name to realize that their pipe's name would be `sbd` for all functions other than creating a pipe. This PR intends to change the API and API documentation to fully support `sentencizer` and drop any user-facing references to `sbd`. ### Types of change end-user API bug ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-30 21:22:40 +01:00
Ines Montani	37c7c85a86	💫 New JSON helpers, training data internals & CLI rewrite (#2932 ) * Support nowrap setting in util.prints * Tidy up and fix whitespace * Simplify script and use read_jsonl helper * Add JSON schemas (see #2928) * Deprecate Doc.print_tree Will be replaced with Doc.to_json, which will produce a unified format * Add Doc.to_json() method (see #2928) Converts Doc objects to JSON using the same unified format as the training data. Method also supports serializing selected custom attributes in the doc._. space. * Remove outdated test * Add write_json and write_jsonl helpers * WIP: Update spacy train * Tidy up spacy train * WIP: Use wasabi for formatting * Add GoldParse helpers for JSON format * WIP: add debug-data command * Fix typo * Add missing import * Update wasabi pin * Add missing import * 💫 Refactor CLI (#2943) To be merged into #2932. ## Description - [x] refactor CLI To use [`wasabi`](https://github.com/ines/wasabi) - [x] use [`black`](https://github.com/ambv/black) for auto-formatting - [x] add `flake8` config - [x] move all messy UD-related scripts to `cli.ud` - [x] make converters function that take the opened file and return the converted data (instead of having them handle the IO) ### Types of change enhancement ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Update wasabi pin * Delete old test * Update errors * Fix typo * Tidy up and format remaining code * Fix formatting * Improve formatting of messages * Auto-format remaining code * Add tok2vec stuff to spacy.train * Fix typo * Update wasabi pin * Fix path checks for when train() is called as function * Reformat and tidy up pretrain script * Update argument annotations * Raise error if model language doesn't match lang * Document new train command	2018-11-30 20:16:14 +01:00
Matthew Honnibal	0369db75c1	Fix support for parser multi-task objectives	2018-11-30 19:53:59 +01:00
Ines Montani	323fc26880	Tidy up and format remaining files	2018-11-30 17:43:08 +01:00
Matthew Honnibal	1b240f2119	Fix default token_vector_width	2018-11-30 16:40:11 +00:00
Ines Montani	eddeb36c96	💫 Tidy up and auto-format .py files (#2983 ) ## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` and actually get meaningful results. At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-30 17:03:03 +01:00
Ines Montani	c9bdeafbc7	Don't run weird failing test for now	2018-11-30 16:13:40 +01:00
Sofie	585de273cd	Fix small typo bug in French regexp + relevant unit test (#2980 ) * additional unit test for new entr word not in other lists * bugfix - unit test works * use _latin_lower instead of alpha_lower for french * revert back to ALPHA_LOWER (following the code for languages) * contributor agreement	2018-11-29 20:16:13 +01:00
Ines Montani	d33953037e	💫 Port master changes over to develop (#2979 ) * Create aryaprabhudesai.md (#2681) * Update _install.jade (#2688) Typo fix: "models" -> "model" * Add FAC to spacy.explain (resolves #2706) * Remove docstrings for deprecated arguments (see #2703) * When calling getoption() in conftest.py, pass a default option (#2709) * When calling getoption() in conftest.py, pass a default option This is necessary to allow testing an installed spacy by running: pytest --pyargs spacy * Add contributor agreement * update bengali token rules for hyphen and digits (#2731) * Less norm computations in token similarity (#2730) * Less norm computations in token similarity * Contributor agreement * Remove ')' for clarity (#2737) Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intentional then please let me know. * added contributor agreement for mbkupfer (#2738) * Basic support for Telugu language (#2751) * Lex _attrs for polish language (#2750) * Signed spaCy contributor agreement * Added polish version of english lex_attrs * Introduces a bulk merge function, in order to solve issue #653 (#2696) * Fix comment * Introduce bulk merge to increase performance on many span merges * Sign contributor agreement * Implement pull request suggestions * Describe converters more explicitly (see #2643) * Add multi-threading note to Language.pipe (resolves #2582) [ci skip] * Fix formatting * Fix dependency scheme docs (closes #2705) [ci skip] * Don't set stop word in example (closes #2657) [ci skip] * Add words to portuguese language _num_words (#2759) * Add words to portuguese language _num_words * Add words to portuguese language _num_words * Update Indonesian model (#2752) * adding e-KTP in tokenizer exceptions list * add exception token * removing lines with containing space as it won't matter since we use .split() method in the end, added new tokens in exception * add tokenizer exceptions
list * combining base_norms with norm_exceptions * adding norm_exception * fix double key in lemmatizer * remove unused import on punctuation.py * reformat stop_words to reduce number of lines, improve readability * updating tokenizer exception * implement is_currency for lang/id * adding orth_first_upper in tokenizer_exceptions * update the norm_exception list * remove bunch of abbreviations * adding contributors file * Fixed spaCy+Keras example (#2763) * bug fixes in keras example * created contributor agreement * Adding French hyphenated first name (#2786) * Fix typo (closes #2784) * Fix typo (#2795) [ci skip] Fixed typo on line 6 "regcognizer --> recognizer" * Adding basic support for Sinhala language. (#2788) * adding Sinhala language package, stop words, examples and lex_attrs. * Adding contributor agreement * Updating contributor agreement * Also include lowercase norm exceptions * Fix error (#2802) * Fix error ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function * added spaCy Contributor Agreement * Add charlax's contributor agreement (#2805) * agreement of contributor, may I introduce a tiny pl language contribution (#2799) * Contributors agreement * Contributors agreement * Contributors agreement * Add jupyter=True to displacy.render in documentation (#2806) * Revert "Also include lowercase norm exceptions" This reverts commit `70f4e8adf3`. * Remove deprecated encoding argument to msgpack * Set up dependency tree pattern matching skeleton (#2732) * Fix bug when too many entity types.
Fixes #2800 * Fix Python 2 test failure * Require older msgpack-numpy * Restore encoding arg on msgpack-numpy * Try to fix version pin for msgpack-numpy * Update Portuguese Language (#2790) * Add words to portuguese language _num_words * Add words to portuguese language _num_words * Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols * Extended punctuation and norm_exceptions in the Portuguese language * Correct error in spacy universe docs concerning spacy-lookup (#2814) * Update Keras Example for (Parikh et al, 2016) implementation (#2803) * bug fixes in keras example * created contributor agreement * baseline for Parikh model * initial version of parikh 2016 implemented * tested asymmetric models * fixed grievous error in normalization * use standard SNLI test file * begin to rework parikh example * initial version of running example * start to document the new version * start to document the new version * Update Decompositional Attention.ipynb * fixed calls to similarity * updated the README * import sys package duh * simplified indexing on mapping word to IDs * stupid python indent error * added code from https://github.com/tensorflow/tensorflow/issues/3388 for tf bug workaround * Fix typo (closes #2815) [ci skip] * Update regex version dependency * Set version to 2.0.13.dev3 * Skip seemingly problematic test * Remove problematic test * Try previous version of regex * Revert "Remove problematic test" This reverts commit `bdebbef455`. * Unskip test * Try older version of regex * 💫 Update training examples and use minibatching (#2830) ## Description Update the training examples in `/examples/training` to show usage of spaCy's `minibatch` and `compounding` helpers ([see here](https://spacy.io/usage/training#tips-batch-size) for details).
The lack of batching in the examples has caused some confusion in the past, especially for beginners who would copy-paste the examples, update them with large training sets and experience slow and unsatisfying results. ### Types of change enhancements ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Visual C++ link updated (#2842) (closes #2841) [ci skip] * New landing page * Add contribution agreement * Correcting lang/ru/examples.py (#2845) * Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement * Correct some grammatical inaccuracies in lang\ru\examples.py * Move contributor agreement to separate file * Set version to 2.0.13.dev4 * Add Persian(Farsi) language support (#2797) * Also include lowercase norm exceptions * Remove in favour of https://github.com/explosion/spaCy/graphs/contributors * Rule-based French Lemmatizer (#2818) ## Description Add a rule-based French Lemmatizer following the English one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class. ### Types of change
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version. - Add several files containing exhaustive list of words for each part of speech - Add some lemma rules - Add POS that are not checked in the standard Lemmatizer, i.e. PRON, DET, ADV and AUX - Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentioned - Modify the lemmatize function to check in lookup table as a last resort - Init files are updated so the model can support all the functionalities mentioned above - Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py ## Checklist - [X] I have submitted the spaCy Contributor Agreement. - [X] I ran the tests, and all new and existing tests passed. - [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Set version to 2.0.13 * Fix formatting and consistency * Update docs for new version [ci skip] * Increment version [ci skip] * Add info on wheels [ci skip] * Adding "This is a sentence" example to Sinhala (#2846) * Add wheels badge * Update badge [ci skip] * Update README.rst [ci skip] * Update murmurhash pin * Increment version to 2.0.14.dev0 * Update GPU docs for v2.0.14 * Add wheel to setup_requires * Import prefer_gpu and require_gpu functions from Thinc * Add tests for prefer_gpu() and require_gpu() * Update requirements and setup.py * Workaround bug in thinc require_gpu * Set version to v2.0.14 * Update push-tag script * Unhack prefer_gpu * Require thinc 6.10.6 * Update prefer_gpu and require_gpu docs [ci skip] * Fix specifiers for GPU * Set version to 2.0.14.dev1 * Set version to 2.0.14 * Update Thinc version pin * Increment version * Fix msgpack-numpy version pin * Increment version * Update version to 2.0.16 * Update version [ci skip] * Redundant ')' in the Stop words' example (#2856) ## Description ### Types of change ## Checklist - [ ] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Documentation improvement regarding joblib and SO (#2867) Some documentation improvements ## Description 1. Fixed the dead URL to joblib 2. Fixed Stack Overflow brand name (with space) ### Types of change Documentation ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * raise error when setting overlapping entities as doc.ents (#2880) * Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so they must be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed. * Change PyThaiNLP Url (#2876) * Fix missing comma * Add example showing a fix-up rule for space entities * Set version to 2.0.17.dev0 * Update regex version * Revert "Update regex version" This reverts commit `62358dd867`.
* Try setting older regex version, to align with conda * Set version to 2.0.17 * Add spacy-js to universe [ci-skip] * Add spacy-raspberry to universe (closes #2889) * Add script to validate universe json [ci skip] * Removed space in docs + added contributor info (#2909) * - removed unneeded space in documentation * - added contributor info * Allow input text of length up to max_length, inclusive (#2922) * Include universe spec for spacy-wordnet component (#2919) * feat: include universe spec for spacy-wordnet component * chore: include spaCy contributor agreement * Minor formatting changes [ci skip] * Fix image [ci skip] Twitter URL doesn't work on live site * Check if the word is in one of the regular lists specific to each POS (#2886) * 💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticeable here.) ### Types of change bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Fix typo [ci skip] * fixes symbolic link on py3 and windows (#2949) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com> * Fix formatting * Update universe [ci skip] * Catalan Language Support (#2940) * Catalan language Support * Adding Catalan to documentation * Sort languages alphabetically [ci skip] * Update tests for pytest 4.x (#2965) ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ### Types of change ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Fix regex pin to harmonize with conda (#2964) * Update README.rst * Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977) Fixes #2976 * Fix typo * Fix typo * Remove duplicate file * Require thinc 7.0.0.dev2 Fixes bug in gpu_ops that would use cupy instead of numpy on CPU * Add missing import * Fix error IDs * Fix tests	2018-11-29 16:30:29 +01:00
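The training-example update ported in d33953037e (#2830) refers to spaCy's `minibatch` and `compounding` helpers. The idea behind them can be sketched in plain Python as follows. This is an illustrative re-implementation under assumed names (`compounding_sizes`, `minibatches`), not spaCy's own code: batch sizes start small and grow by a constant factor up to a cap, and the training items are cut into batches of those sizes.

```python
import itertools


def compounding_sizes(start, stop, compound):
    """Yield an infinite series of sizes: start, start * compound, ...

    Each value is capped at `stop`. Assumes start <= stop and compound >= 1.
    """
    size = float(start)
    while True:
        yield min(size, stop)
        size *= compound


def minibatches(items, sizes):
    """Cut `items` into consecutive batches whose lengths follow `sizes`."""
    items = iter(items)
    for size in sizes:
        batch = list(itertools.islice(items, int(size)))
        if not batch:
            return  # the items are exhausted
        yield batch


# Ten hypothetical training items, batch sizes growing 1 -> 2 -> 4 (capped).
batches = list(minibatches(range(10), compounding_sizes(1.0, 4.0, 2.0)))
```

Small early batches give more frequent weight updates at the start of training, while larger later batches make each pass over a big training set cheaper.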
Matthew Honnibal	681258e29b	Add support for pretrained tok2vec to ud-train	2018-11-29 14:54:47 +00:00
Matthew Honnibal	93be3ad038	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-29 12:37:06 +00:00
Matthew Honnibal	008e1ee1dd	Update pretrain command	2018-11-29 12:36:43 +00:00
Ines Montani	8d3bfb3c04	Remove outdated options and fix formatting	2018-11-28 23:33:34 +01:00
Adam Schwalm	00566949de	Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977 ) Fixes #2976	2018-11-28 19:49:33 +01:00
Nathaniel J. Smith	73255091f8	Fix conftest getoption	2018-11-28 19:07:24 +01:00
Matthew Honnibal	87da5bcf5b	Set version to v2.1.0a3	2018-11-28 18:22:09 +01:00
Matthew Honnibal	647d1a1efc	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-28 18:21:45 +01:00
Matthew Honnibal	61e435610e	💫 Feature/improve pretraining (#2971 ) * Improve spacy pretrain script * Implement BERT-style 'masked language model' objective. Much better results. * Improve logging. * Add length cap for documents, to avoid memory errors. * Require thinc 7.0.0.dev1 * Require thinc 7.0.0.dev1 * Add argument for using pretrained vectors * Fix defaults * Fix syntax error * Improve spacy pretrain script * Implement BERT-style 'masked language model' objective. Much better results. * Improve logging. * Add length cap for documents, to avoid memory errors. * Require thinc 7.0.0.dev1 * Require thinc 7.0.0.dev1 * Add argument for using pretrained vectors * Fix defaults * Fix syntax error * Tweak pretraining script * Fix data limits in spacy.gold * Fix pretrain script	2018-11-28 18:04:58 +01:00
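For readers unfamiliar with the BERT-style "masked language model" objective mentioned in 61e435610e, the input corruption step can be sketched as below. This is a generic illustration, not the `spacy pretrain` implementation: the function name `mask_tokens`, the `[MASK]` symbol, and the 15% / 80% / 10% / 10% proportions are assumptions taken from the usual description of BERT, and spaCy's actual choices may differ.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, mask_symbol="[MASK]", rng=None):
    """Corrupt a token sequence for a masked-language-model objective.

    Returns (inputs, is_target): the corrupted tokens, and a flag per
    position saying whether the model should predict the original token.
    """
    rng = rng or random.Random()
    inputs, is_target = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, no prediction needed.
            inputs.append(token)
            is_target.append(False)
            continue
        is_target.append(True)
        roll = rng.random()
        if roll < 0.8:
            inputs.append(mask_symbol)        # replace with the mask symbol
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))  # replace with a random word
        else:
            inputs.append(token)              # keep the original word
    return inputs, is_target


tokens = ["Parsing", "perl", "remains", "difficult"]
vocab = ["the", "a", "of"]
inputs, is_target = mask_tokens(tokens, vocab, rng=random.Random(0))
```

The loss is then computed only at positions where `is_target` is true, so the model has to reconstruct the hidden words from their context.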
Matthew Honnibal	0fdb25b958	Fix msgpack error	2018-11-27 19:35:55 +01:00
Matthew Honnibal	ef0820827a	Update hyper-parameters after NER random search (#2972 ) These experiments were completed a few weeks ago, but I didn't make the PR, pending model release. Token vector width: 128->96 Hidden width: 128->64 Embed size: 5000->2000 Dropout: 0.2->0.1 Updated optimizer defaults (unclear how important?) This should improve speed, model size and load time, while keeping similar or slightly better accuracy. The tl;dr is we prefer to prevent over-fitting by reducing model size, rather than using more dropout.	2018-11-27 18:49:52 +01:00
Matthew Honnibal	c9f6acc564	Set version to 2.1.0a3.dev0	2018-11-27 05:15:27 +01:00
Ines Montani	b6e991440c	💫 Tidy up and auto-format tests (#2967 ) * Auto-format tests with black * Add flake8 config * Tidy up and remove unused imports * Fix redefinitions of test functions * Replace orths_and_spaces with words and spaces * Fix compatibility with pytest 4.0 * xfail test for now Test was previously overwritten by following test due to naming conflict, so failure wasn't reported * Unfail passing test * Only use fixture via arguments Fixes pytest 4.0 compatibility	2018-11-27 01:09:36 +01:00
Matthew Honnibal	2c37e0ccf6	💫 Use Blis for matrix multiplications (#2966 ) Our epic matrix multiplication odyssey is drawing to a close... I've now finally got the Blis linear algebra routines in a self-contained Python package, with wheels for Windows, Linux and OSX. The only missing platform at the moment is Windows Python 2.7. The result is at https://github.com/explosion/cython-blis Thinc v7.0.0 will make the change to Blis. I've put a Thinc v7.0.0.dev0 up on PyPi so that we can test these changes with the CI, and even get them out to spacy-nightly, before Thinc v7.0.0 is released. This PR also updates the other dependencies to be in line with the current versions master is using. I've also resolved the msgpack deprecation problems, and gotten spaCy and Thinc up to date with the latest Cython. The point of switching to Blis is to have control of how our matrix multiplications are executed across platforms. When we were using numpy for this, a different library would be used on pip and conda, OSX would use Accelerate, etc. This would open up different bugs and performance problems, especially when multi-threading was introduced. With the change to Blis, we now strictly single-thread the matrix multiplications. This will make it much easier to use multiprocessing to parallelise the runtime, since we won't have nested parallelism problems to deal with. * Use blis * Use -2 arg to Cython * Update dependencies * Fix requirements * Update setup dependencies * Fix requirement typo * Fix msgpack errors * Remove Python27 test from Appveyor, until Blis works there * Auto-format setup.py * Fix murmurhash version	2018-11-27 00:44:04 +01:00
Ines Montani	41c6002fd8	Tidy up [ci skip]	2018-11-26 18:56:04 +01:00
Ines Montani	c62d06ea5c	Port over #2949	2018-11-26 18:54:27 +01:00
Ines Montani	ec5ee9e616	Auto-format	2018-11-26 18:54:20 +01:00
Ines Montani	968aff2f6a	Update tests for pytest 4.x (#2965 ) ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ### Types of change ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-26 18:14:57 +01:00
Marc Puig	98fe1ab259	Catalan Language Support (#2940) * Catalan language support * Adding Catalan to documentation	2018-11-26 15:25:47 +01:00
Ines Montani	048416f265	Fix formatting	2018-11-26 13:27:41 +01:00
Shawn Cicoria	7601ae0cff	fixes symbolic link on py3 and windows (#2949) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com>	2018-11-24 15:34:23 +01:00
Ines Montani	350c8d25b0	Add EntityRecognizer.label property	2018-11-18 00:06:26 +01:00
Ines Montani	017bc2ef2f	Expose TextCategorizer via __all__	2018-11-18 00:06:13 +01:00
Ines Montani	b4581435f6	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-16 13:08:22 +01:00
Ines Montani	e2f75eb492	Fix message formatting	2018-11-16 13:08:20 +01:00
Matthew Honnibal	2874b8efd8	Fix tok2vec loading in spacy train	2018-11-15 23:34:54 +00:00
Matthew Honnibal	2ddd428834	Fix pretrain script	2018-11-15 23:34:35 +00:00
Matthew Honnibal	f8afaa0c1c	Fix pretrain	2018-11-15 22:46:53 +00:00
Matthew Honnibal	6af6950e46	Fix pretrain	2018-11-15 22:45:36 +00:00
Matthew Honnibal	3e7b214e57	Make pretrain script work with stream from stdin	2018-11-15 22:44:07 +00:00
Matthew Honnibal	8fdb9bc278	💫 Add experimental ULMFit/BERT/Elmo-like pretraining (#2931) * Add 'spacy pretrain' command * Fix pretrain command for Python 2 * Fix pretrain command * Fix pretrain command	2018-11-15 22:17:16 +01:00
Ines Montani	02fc73ca53	💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticeable here.) ### Types of change bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-15 11:40:10 +01:00
Ines Montani	e89708c3eb	💫 Allow matching non-ORTH attributes in PhraseMatcher (#2925) * Allow matching non-orth attributes in PhraseMatcher (see #1971) Usage: PhraseMatcher(nlp.vocab, attr='POS') * Allow attr argument to be int * Fix formatting * Fix typo	2018-11-15 03:00:58 +01:00
Ines Montani	0d5b142c78	Fix typos and whitespace	2018-11-14 19:12:34 +01:00
Ines Montani	bd1b0e396a	Add deprecation warning for PhraseMatcher max_length	2018-11-14 19:10:46 +01:00
Ines Montani	64257bf3a7	Fix formatting	2018-11-14 19:10:21 +01:00
Ines Montani	b3cadd5b81	Delete _matcher2_notes.py	2018-11-14 16:19:12 +01:00
mauryaland	87ce435aff	Check if the word is in one of the regular lists specific to each POS (#2886)	2018-11-14 15:58:43 +01:00
Daniel Hershcovich	d3d419ecc0	Allow input text of length up to max_length, inclusive (#2922)	2018-11-13 16:46:29 +01:00
Matthew Honnibal	5fc98ade04	Set version to 2.1.0a2	2018-11-08 09:56:56 +01:00
Matthew Honnibal	ad44982f01	Fix dropout in tensorizer, update comment	2018-11-03 12:46:58 +00:00
Matthew Honnibal	ba365ae1c9	Normalize gradient by number of words in tensorizer	2018-11-03 10:53:22 +00:00
Matthew Honnibal	dac3f1b280	Improve Tensorizer	2018-11-03 10:52:50 +00:00
Matthew Honnibal	2527ba68e5	Fix tensorizer	2018-11-02 23:29:54 +00:00
Matthew Honnibal	db08b168a3	Set version to 2.0.17	2018-10-29 23:22:18 +01:00
Suraj Rajan	0bf14082a4	Added more constructs for dependency tree matcher (#2836)	2018-10-29 23:21:39 +01:00
Matthew Honnibal	e2ae25d6f5	Try setting older regex version, to align with conda	2018-10-29 13:39:00 +01:00
Matthew Honnibal	d4fa9af56f	Set version to 2.0.17.dev0	2018-10-28 16:15:26 +01:00
Matthew Honnibal	b2e2bba8b0	Fix missing comma	2018-10-28 00:09:16 +02:00
Wannaphong Phatthiyaphaibun	2d2765fd8a	Change PyThaiNLP URL (#2876)	2018-10-27 14:46:07 +02:00
Matthew Honnibal	817e1fc5e5	Fix out-of-bounds access in NER training. The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so must be hunted down. Hunting this one down took a long time... I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed.	2018-10-27 01:12:50 +02:00

5707 Commits