spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-19 08:31:59 +03:00

Author	SHA1	Message	Date
Shooter23	6ae8e49bff	Fix docstring for is_right_punct(). (#3044 )	2018-12-14 10:11:11 +01:00
Matthew Honnibal	ab9494b2a3	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-12 21:08:50 +00:00
Matthew Honnibal	fb56028476	Remove b1 and b2 decay	2018-12-12 12:37:07 +01:00
Matthew Honnibal	df15279e88	Reduce batch size during pretrain	2018-12-10 15:30:23 +00:00
Matthew Honnibal	83ac227bd3	💫 Better support for semi-supervised learning (#3035 ) The new spacy pretrain command implemented BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting Support semi-supervised learning in spacy train One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing. Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning. Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective. Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage: python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-text ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze Implement rehearsal methods for pipeline components The new --raw-text argument and nlp.rehearse() method also give us a good place to implement the idea in the pseudo-rehearsal blog post in the parser. 
This works as follows: Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model. Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details. Implement rehearsal updates for tagger Implement rehearsal updates for text categoriz	2018-12-10 16:25:33 +01:00
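The rehearsal update described in 83ac227bd3 can be illustrated with a toy model. The sketch below is not spaCy's implementation: it assumes a plain linear scorer and a squared-error objective, and the names `predict`, `rehearsal_loss` and `rehearse_step` are hypothetical. It only shows the idea of a frozen copy of the initial model supervising the model being trained, using unlabelled data.

```python
def predict(weights, features):
    """Score one example with a linear model (dot product)."""
    return sum(w * x for w, x in zip(weights, features))


def rehearsal_loss(current, initial, batch):
    """Mean squared distance between current and initial predictions."""
    diffs = [predict(current, f) - predict(initial, f) for f in batch]
    return sum(d * d for d in diffs) / len(batch)


def rehearse_step(current, initial, batch, learn_rate=0.1):
    """One gradient step pulling `current` toward `initial`'s predictions.

    No gold labels are needed: the initial model's output is the target.
    """
    grads = [0.0] * len(current)
    for features in batch:
        diff = predict(current, features) - predict(initial, features)
        for i, x in enumerate(features):
            grads[i] += 2 * diff * x / len(batch)
    return [w - learn_rate * g for w, g in zip(current, grads)]


initial = [0.5, -1.0]   # frozen copy of the pre-trained weights
current = [0.9, -0.2]   # weights that drifted during fine-tuning
raw_batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # unlabelled examples

before = rehearsal_loss(current, initial, raw_batch)
current = rehearse_step(current, initial, raw_batch)
after = rehearsal_loss(current, initial, raw_batch)
print(before, after)
```

In a real training loop such a step would presumably be interleaved with ordinary supervised updates, so that the model learns the new task while staying close to its original predictions.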
Matthew Honnibal	449b889454	Fix KeyError in Vectors.most_similar. Fixes #2648	2018-12-10 16:19:18 +01:00
Matthew Honnibal	90aec6d2f6	Fix vectors for reserved words. Closes #2871	2018-12-10 16:09:49 +01:00
Matthew Honnibal	16fd8dce1d	Add get_string_id helper to spacy.strings	2018-12-10 16:09:26 +01:00
Matthew Honnibal	cc1ea03004	Add test for issue #2871 -- vectors for reserved words	2018-12-10 16:09:10 +01:00
Matthew Honnibal	375f0dc529	💫 Make TextCategorizer default to a simpler, GPU-friendly model (#3038 ) Currently the TextCategorizer defaults to a fairly complicated model, designed partly around the active learning requirements of Prodigy. The model's a bit slow, and not very GPU-friendly. This patch implements a straightforward CNN model that still performs pretty well. The replacement model also makes it easy to use the LMAO pretraining, since most of the parameters are in the CNN. The replacement model has a flag to specify whether labels are mutually exclusive, which defaults to True. This has been a common problem with the text classifier. We'll also now be able to support adding labels to pretrained models again. Resolves #2934, #2756, #1798, #1748.	2018-12-10 14:37:39 +01:00
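A flag for mutually exclusive labels, as mentioned in 375f0dc529, typically decides how raw scores become probabilities. The sketch below assumes the common choice of softmax for exclusive labels and an independent sigmoid per label otherwise; the commit message does not say which functions the model uses, and the names `label_probabilities` and `exclusive_classes` are hypothetical.

```python
import math


def softmax(scores):
    """Probabilities that sum to 1: exactly one label is expected to apply."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores):
    """Independent probability per label: any number of labels may apply."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


def label_probabilities(scores, exclusive_classes=True):
    """Hypothetical helper: choose the output function from the flag."""
    return softmax(scores) if exclusive_classes else sigmoid(scores)


scores = [2.0, 1.0, -1.0]
exclusive = label_probabilities(scores)
independent = label_probabilities(scores, exclusive_classes=False)
print(exclusive, independent)
```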
Matthew Honnibal	b1c8731b4d	Make spacy train respect LOG_FRIENDLY	2018-12-10 09:46:53 +01:00
Matthew Honnibal	6936ca1664	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-10 09:44:07 +01:00
Matthew Honnibal	4405b5c875	Fix resizing edge-case for NER	2018-12-10 06:25:17 +00:00
Matthew Honnibal	0994dc50d8	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-10 05:35:01 +00:00
Matthew Honnibal	24f2e9bc07	Tweak training params	2018-12-09 17:08:58 +00:00
Matthew Honnibal	16c5861d29	Fix NER space constraints Allow entities to end on spaces, to avoid stumping the oracle when we're inside an entity, and there's a space just before a correct entity.	2018-12-09 08:06:45 +01:00
Matthew Honnibal	1b1a1af193	Fix printing in spacy train	2018-12-09 06:03:49 +01:00
Matthew Honnibal	d2ac618af1	Set cbb_maxout_pieces=3	2018-12-08 23:27:29 +01:00
Matthew Honnibal	cb16b78b0d	Set dropout rate to 0.2	2018-12-08 19:59:11 +01:00
Matthew Honnibal	2c2db0c492	💫 Allow Span to take text label (#3031 ) Fixes #3027. * Allow Span.__init__ to take unicode values for the `label` argument. * Allow `Span.label_` to be writeable. - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-12-08 13:08:41 +01:00
Matthew Honnibal	11a29af751	Set cupy.random seed in fix_random_seed helper	2018-12-08 12:37:38 +01:00
Ines Montani	ffdd5e964f	Small CLI improvements (#3030 ) * Add todo * Auto-format * Update wasabi pin * Format training results with wasabi * Remove loading animation from model saving Currently behaves weirdly * Inline messages * Remove unnecessary path2str Already taken care of by printer * Inline messages in CLI * Remove unused function * Move loading indicator into loading function * Check for invalid whitespace entities	2018-12-08 11:49:43 +01:00
Matthew Honnibal	8aa7882762	Make NORM a token attribute (#3029 ) See #3028. The solution in this patch is pretty debatable. What we do is give the TokenC struct a .norm field, by repurposing the previously idle .sense attribute. It's nice to repurpose a previous field because it means the TokenC doesn't change size, so even if someone's using the internals very deeply, nothing will break. The weird thing here is that the TokenC and the LexemeC both have an attribute named NORM. This arguably assists in backwards compatibility. On the other hand, maybe it's really bad! We're changing the semantics of the attribute subtly, so maybe it's better if someone calling lex.norm gets a breakage, and instead is told to write lex.default_norm? Overall I believe this patch makes the NORM feature work the way we sort of expected it to work. Certainly it's much more like how the docs describe it, and more in line with how we've been directing people to use the norm attribute. We'll also be able to use token.norm to do stuff like spelling correction, which is pretty cool.	2018-12-08 10:49:10 +01:00
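The relationship between a token-level and a lexeme-level NORM described in 8aa7882762 can be pictured as a per-token override with a fallback to the shared lexeme default. The sketch below is a plain-Python illustration under that assumption; the classes `Lexeme` and `Token` and the field `norm_override` are hypothetical stand-ins, not the TokenC/LexemeC structs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lexeme:
    """Context-independent entry shared by every occurrence of a word."""
    text: str
    norm: str  # default norm for this word type


@dataclass
class Token:
    """One occurrence of a word in a document."""
    lex: Lexeme
    norm_override: Optional[str] = None  # hypothetical per-token field

    @property
    def norm(self) -> str:
        # Use the token's own norm if one was set, else the lexeme default.
        if self.norm_override is not None:
            return self.norm_override
        return self.lex.norm


lexeme = Lexeme(text="teh", norm="teh")
plain = Token(lexeme)
corrected = Token(lexeme, norm_override="the")  # e.g. a spelling correction
print(plain.norm, corrected.norm, lexeme.norm)
```

Setting the override on one token leaves the lexeme, and therefore every other occurrence of the word, unchanged.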
Matthew Honnibal	a338c6f8f6	Fix JSON segmentation bug that affected French Fix a bug in the JSON streaming code that GoldCorpus uses. Escaped slashes were being handled incorrectly. This bug caused low scores for French in the early v2.1.0 alphas, because most of the data was not being read in. Fittingly, the document that triggered the bug was a Wikipedia article about Perl. Parsing perl remains difficult!	2018-12-08 10:41:24 +01:00
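The kind of bug described in a338c6f8f6 tends to arise when a streaming reader finds object boundaries by scanning characters and mishandles backslash escapes inside strings. The sketch below is a hypothetical splitter, not the GoldCorpus code: it tracks string and escape state so that escaped slashes, escaped quotes and braces inside strings do not confuse the boundary detection. The function name `iter_json_objects` is an assumption.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON object found in a concatenated stream."""
    depth = 0
    in_string = False
    escaped = False
    start = None
    for i, char in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # this character was escaped; skip it
            elif char == "\\":
                escaped = True       # the next character is escaped
            elif char == '"':
                in_string = False    # unescaped quote closes the string
        elif char == '"':
            in_string = True
        elif char == "{":
            if depth == 0:
                start = i
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                yield json.loads(text[start : i + 1])


stream = r'{"title": "Perl", "url": "http:\/\/example.org\/a\\"} {"text": "brace } in \"string\""}'
docs = list(iter_json_objects(stream))
print(docs)
```

The first object ends its string with an escaped backslash followed by the closing quote, which is exactly the situation a naive "previous character was a backslash" check gets wrong.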
Matthew Honnibal	b2bfd1e1c8	Move dropout and batch sizes out of global scope in train cmd	2018-12-07 20:54:35 +01:00
Matthew Honnibal	40e0da9cc1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-07 00:12:22 +00:00
Matthew Honnibal	1e6725e9b7	Try to prevent spaces from being tagged as entities	2018-12-07 00:12:12 +00:00
Matthew Honnibal	427c0693c8	Fix missing comma in init-model command	2018-12-06 22:48:31 +01:00
Amandine Périnet	0b44ea23bd	Lemmatization of Nouns - French : adding rules and vocabulary (#2992 ) * modifying FR lemmatization for nouns * modifying FR lemmatization for nouns * adding contributor agreement for amperinet * adding rules for words with inclusive parentheses wrongly tokenized * adding contributor agreement for amperinet * adding a missing comma	2018-12-06 22:42:18 +01:00
Matthew Honnibal	d896fbca62	Fix batch size in parser.pipe	2018-12-06 21:45:56 +01:00
Matthew Honnibal	bb3304a4f1	Fix pickle tests	2018-12-06 20:46:36 +01:00
Matthew Honnibal	e619f45287	Fix pickle tests	2018-12-06 20:43:47 +01:00
Matthew Honnibal	0a60726215	Remove cytoolz usage in CLI	2018-12-06 20:37:00 +01:00
Matthew Honnibal	c0af627f32	Fix dill usage in vocab	2018-12-06 18:53:16 +01:00
Matthew Honnibal	9520489225	Fix removal of dill (for srsly)	2018-12-06 18:46:09 +01:00
Matthew Honnibal	711f108532	Fix cytoolz import cytoolz	2018-12-06 16:04:12 +01:00
Gavriel Loria	9c8c4287bf	Accept iob2 and allow generic whitespace (#2999 ) * accept non-pipe whitespace as delimiter; allow iob2 filename * added small documentation note for IOB2 allowance * added contributor agreement	2018-12-06 15:50:25 +01:00
Amandine Périnet	2457318b7a	Lemmatization of Verbs - French : adding rules and vocabulary (#3006 ) * updating rules and vocabulary for French lemmatization of verbs * updating the file with French auxiliary verb * updating rules and vocabulary for French lemmatization of verbs * adding contributor agreement for amperinet * adding rules for words with inclusive parentheses wrongly tokenized	2018-12-06 15:49:28 +01:00
Beate Sildnes	f0d7e206ec	Updated wordforms for Norwegian lemmatizer (#3007 ) * Updated wordforms for Norwegian lemmatizer Upload of updated lists of wordforms for the Norwegian lemmatizer (nouns, verbs, adverbs, adjectives and lookup). * Add spaCy contributor agreement for user beatesi * Updated wordforms for Norwegian lemmatizer	2018-12-06 15:46:18 +01:00
Matthew Honnibal	cabaadd793	Fix build error from bad import Thinc v7.0.0.dev6 moved FeatureExtracter around and didn't add a compatibility import.	2018-12-06 15:12:39 +01:00
Matthew Honnibal	ea00dbaaa4	Remove usage of itertools.islice	2018-12-03 02:43:03 +01:00
Matthew Honnibal	c7b33b24f1	Fix conflict	2018-12-03 02:20:20 +01:00
Matthew Honnibal	2402ef498b	Remove unused import	2018-12-03 02:19:23 +01:00
Matthew Honnibal	1c71fdb805	Remove cytoolz usage from spaCy	2018-12-03 02:19:12 +01:00
Ines Montani	5b2741f751	Remove unused cytoolz / itertools imports	2018-12-03 02:12:07 +01:00
Matthew Honnibal	a7b085ae46	Set version back to 2.1.0a4	2018-12-03 02:03:26 +01:00
Matthew Honnibal	8e9a4d2f5e	Increment version to 2.1.0a5	2018-12-03 01:59:50 +01:00
Ines Montani	f37863093a	💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003 ) Remove hacks and wrappers, keep code in sync across our libraries and move spaCy a few steps closer to only depending on packages with binary wheels 🎉 See here: https://github.com/explosion/srsly Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy have steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place. At the same time, we noticed that having a lot of small dependencies was making maintenance harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel. srsly currently includes forks of the following packages: ujson msgpack msgpack-numpy cloudpickle * WIP: replace json/ujson with srsly * Replace ujson in examples Use regular json instead of srsly to make code easier to read and follow * Update requirements * Fix imports * Fix typos * Replace msgpack with srsly * Fix warning	2018-12-03 01:28:22 +01:00
Matthew Honnibal	40a273245c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-01 14:43:29 +01:00
Matthew Honnibal	d9d339186b	Fix dropout and batch-size defaults	2018-12-01 13:42:35 +00:00
Matthew Honnibal	9536ee787c	Add comma deletion to data noising	2018-12-01 13:42:18 +00:00
Matthew Honnibal	21ee1c7a17	Improve parser multi-task objective	2018-12-01 13:41:24 +00:00
Matthew Honnibal	fe7d6f36b1	Fix parser default	2018-12-01 13:41:04 +00:00
Matthew Honnibal	a31d557f2d	Set version to v2.1.0a4	2018-12-01 14:40:03 +01:00
Ines Montani	5c966d0874	Simplify function	2018-12-01 04:59:12 +01:00
Ines Montani	ce7eec846b	Move CLI-specific Markdown helper to CLI	2018-12-01 04:55:48 +01:00
Ines Montani	40ae499f32	Remove unused helper function Now imported from wasabi	2018-12-01 04:54:46 +01:00
Matthew Honnibal	bbaca991ba	Set version to v2.0.18	2018-12-01 03:35:09 +01:00
Matthew Honnibal	e1a4b0d7f7	Set version to v2.0.18.dev1	2018-12-01 03:12:12 +01:00
Matthew Honnibal	413530b269	Set version to 2.0.18	2018-12-01 03:00:27 +01:00
Matthew Honnibal	24d52876e1	Set version to v2.0.18.dev0	2018-12-01 02:38:04 +01:00
Matthew Honnibal	3139b020b5	Fix train script	2018-11-30 22:17:08 +00:00
Matthew Honnibal	4aa1002546	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-30 20:58:51 +00:00
Matthew Honnibal	6bd1cc57ee	Increase length limit for pretrain	2018-11-30 20:58:18 +00:00
Gavriel Loria	919729d38c	replace user-facing references to "sbd" with "sentencizer" (#2985 ) ## Description Fixes #2693 Previously, the tokens `sbd` and `sentencizer` would create the same nlp pipe. Internally, both would be called `sbd`. This setup became problematic because it was hard for a user relying on the `sentencizer` pipe name to realize that their pipe's name would be `sbd` for all functions other than creating a pipe. This PR intends to change the API and API documentation to fully support `sentencizer` and drop any user-facing references to `sbd`. ### Types of change end-user API bug ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-30 21:22:40 +01:00
Ines Montani	37c7c85a86	💫 New JSON helpers, training data internals & CLI rewrite (#2932 ) * Support nowrap setting in util.prints * Tidy up and fix whitespace * Simplify script and use read_jsonl helper * Add JSON schemas (see #2928) * Deprecate Doc.print_tree Will be replaced with Doc.to_json, which will produce a unified format * Add Doc.to_json() method (see #2928) Converts Doc objects to JSON using the same unified format as the training data. Method also supports serializing selected custom attributes in the doc._. space. * Remove outdated test * Add write_json and write_jsonl helpers * WIP: Update spacy train * Tidy up spacy train * WIP: Use wasabi for formatting * Add GoldParse helpers for JSON format * WIP: add debug-data command * Fix typo * Add missing import * Update wasabi pin * Add missing import * 💫 Refactor CLI (#2943) To be merged into #2932. ## Description - [x] refactor CLI to use [`wasabi`](https://github.com/ines/wasabi) - [x] use [`black`](https://github.com/ambv/black) for auto-formatting - [x] add `flake8` config - [x] move all messy UD-related scripts to `cli.ud` - [x] make converters functions that take the opened file and return the converted data (instead of having them handle the IO) ### Types of change enhancement ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Update wasabi pin * Delete old test * Update errors * Fix typo * Tidy up and format remaining code * Fix formatting * Improve formatting of messages * Auto-format remaining code * Add tok2vec stuff to spacy.train * Fix typo * Update wasabi pin * Fix path checks for when train() is called as function * Reformat and tidy up pretrain script * Update argument annotations * Raise error if model language doesn't match lang * Document new train command	2018-11-30 20:16:14 +01:00
Matthew Honnibal	0369db75c1	Fix support for parser multi-task objectives	2018-11-30 19:53:59 +01:00
Ines Montani	323fc26880	Tidy up and format remaining files	2018-11-30 17:43:08 +01:00
Matthew Honnibal	1b240f2119	Fix default token_vector_width	2018-11-30 16:40:11 +00:00
Ines Montani	eddeb36c96	💫 Tidy up and auto-format .py files (#2983 ) ## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` and actually get meaningful results. At the moment, the code style and linting aren't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-30 17:03:03 +01:00
Ines Montani	c9bdeafbc7	Don't run weird failing test for now	2018-11-30 16:13:40 +01:00
Sofie	585de273cd	Fix small typo bug in French regexp + relevant unit test (#2980 ) * additional unit test for new entr word not in other lists * bugfix - unit test works * use _latin_lower instead of alpha_lower for french * revert back to ALPHA_LOWER (following the code for languages) * contributor agreement	2018-11-29 20:16:13 +01:00
Ines Montani	d33953037e	💫 Port master changes over to develop (#2979 ) * Create aryaprabhudesai.md (#2681) * Update _install.jade (#2688) Typo fix: "models" -> "model" * Add FAC to spacy.explain (resolves #2706) * Remove docstrings for deprecated arguments (see #2703) * When calling getoption() in conftest.py, pass a default option (#2709) * When calling getoption() in conftest.py, pass a default option This is necessary to allow testing an installed spacy by running: pytest --pyargs spacy * Add contributor agreement * update bengali token rules for hyphen and digits (#2731) * Less norm computations in token similarity (#2730) * Less norm computations in token similarity * Contributor agreement * Remove ')' for clarity (#2737) Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intentional then please let me know. * added contributor agreement for mbkupfer (#2738) * Basic support for Telugu language (#2751) * Lex _attrs for polish language (#2750) * Signed spaCy contributor agreement * Added polish version of english lex_attrs * Introduces a bulk merge function, in order to solve issue #653 (#2696) * Fix comment * Introduce bulk merge to increase performance on many span merges * Sign contributor agreement * Implement pull request suggestions * Describe converters more explicitly (see #2643) * Add multi-threading note to Language.pipe (resolves #2582) [ci skip] * Fix formatting * Fix dependency scheme docs (closes #2705) [ci skip] * Don't set stop word in example (closes #2657) [ci skip] * Add words to portuguese language _num_words (#2759) * Add words to portuguese language _num_words * Add words to portuguese language _num_words * Update Indonesian model (#2752) * adding e-KTP in tokenizer exceptions list * add exception token * removing lines with containing space as it won't matter since we use .split() method in the end, added new tokens in exception * add tokenizer exceptions 
list * combining base_norms with norm_exceptions * adding norm_exception * fix double key in lemmatizer * remove unused import on punctuation.py * reformat stop_words to reduce number of lines, improve readability * updating tokenizer exception * implement is_currency for lang/id * adding orth_first_upper in tokenizer_exceptions * update the norm_exception list * remove bunch of abbreviations * adding contributors file * Fixed spaCy+Keras example (#2763) * bug fixes in keras example * created contributor agreement * Adding French hyphenated first name (#2786) * Fix typo (closes #2784) * Fix typo (#2795) [ci skip] Fixed typo on line 6 "regcognizer --> recognizer" * Adding basic support for Sinhala language. (#2788) * adding Sinhala language package, stop words, examples and lex_attrs. * Adding contributor agreement * Updating contributor agreement * Also include lowercase norm exceptions * Fix error (#2802) * Fix error ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function * added spaCy Contributor Agreement * Add charlax's contributor agreement (#2805) * agreement of contributor, may I introduce a tiny pl language contribution (#2799) * Contributors agreement * Contributors agreement * Contributors agreement * Add jupyter=True to displacy.render in documentation (#2806) * Revert "Also include lowercase norm exceptions" This reverts commit `70f4e8adf3`. * Remove deprecated encoding argument to msgpack * Set up dependency tree pattern matching skeleton (#2732) * Fix bug when too many entity types. 
Fixes #2800 * Fix Python 2 test failure * Require older msgpack-numpy * Restore encoding arg on msgpack-numpy * Try to fix version pin for msgpack-numpy * Update Portuguese Language (#2790) * Add words to portuguese language _num_words * Add words to portuguese language _num_words * Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols * Extended punctuation and norm_exceptions in the Portuguese language * Correct error in spacy universe docs concerning spacy-lookup (#2814) * Update Keras Example for (Parikh et al, 2016) implementation (#2803) * bug fixes in keras example * created contributor agreement * baseline for Parikh model * initial version of parikh 2016 implemented * tested asymmetric models * fixed grievous error in normalization * use standard SNLI test file * begin to rework parikh example * initial version of running example * start to document the new version * start to document the new version * Update Decompositional Attention.ipynb * fixed calls to similarity * updated the README * import sys package duh * simplified indexing on mapping word to IDs * stupid python indent error * added code from https://github.com/tensorflow/tensorflow/issues/3388 for tf bug workaround * Fix typo (closes #2815) [ci skip] * Update regex version dependency * Set version to 2.0.13.dev3 * Skip seemingly problematic test * Remove problematic test * Try previous version of regex * Revert "Remove problematic test" This reverts commit `bdebbef455`. * Unskip test * Try older version of regex * 💫 Update training examples and use minibatching (#2830) ## Description Update the training examples in `/examples/training` to show usage of spaCy's `minibatch` and `compounding` helpers ([see here](https://spacy.io/usage/training#tips-batch-size) for details). 
The lack of batching in the examples has caused some confusion in the past, especially for beginners who would copy-paste the examples, update them with large training sets and experience slow and unsatisfying results. ### Types of change enhancements ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Visual C++ link updated (#2842) (closes #2841) [ci skip] * New landing page * Add contribution agreement * Correcting lang/ru/examples.py (#2845) * Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement * Correct some grammatical inaccuracies in lang\ru\examples.py * Move contributor agreement to separate file * Set version to 2.0.13.dev4 * Add Persian(Farsi) language support (#2797) * Also include lowercase norm exceptions * Remove in favour of https://github.com/explosion/spaCy/graphs/contributors * Rule-based French Lemmatizer (#2818) ## Description Add a rule-based French Lemmatizer following the English one and the excellent PR for [Greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class. ### Types of change 
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version. - Add several files containing exhaustive list of words for each part of speech - Add some lemma rules - Add POS that are not checked in the standard Lemmatizer, i.e. PRON, DET, ADV and AUX - Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentioned - Modify the lemmatize function to check in lookup table as a last resort - Init files are updated so the model can support all the functionalities mentioned above - Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py ## Checklist - [X] I have submitted the spaCy Contributor Agreement. - [X] I ran the tests, and all new and existing tests passed. - [X] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Set version to 2.0.13 * Fix formatting and consistency * Update docs for new version [ci skip] * Increment version [ci skip] * Add info on wheels [ci skip] * Adding "This is a sentence" example to Sinhala (#2846) * Add wheels badge * Update badge [ci skip] * Update README.rst [ci skip] * Update murmurhash pin * Increment version to 2.0.14.dev0 * Update GPU docs for v2.0.14 * Add wheel to setup_requires * Import prefer_gpu and require_gpu functions from Thinc * Add tests for prefer_gpu() and require_gpu() * Update requirements and setup.py * Workaround bug in thinc require_gpu * Set version to v2.0.14 * Update push-tag script * Unhack prefer_gpu * Require thinc 6.10.6 * Update prefer_gpu and require_gpu docs [ci skip] * Fix specifiers for GPU * Set version to 2.0.14.dev1 * Set version to 2.0.14 * Update Thinc version pin * Increment version * Fix msgpack-numpy version pin * Increment version * Update version to 2.0.16 * Update version [ci skip] * Redundant ')' in the Stop words' example (#2856) ## Description ### Types of change ## Checklist - [ ] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Documentation improvement regarding joblib and SO (#2867) Some documentation improvements ## Description 1. Fixed the dead URL to joblib 2. Fixed Stack Overflow brand name (with space) ### Types of change Documentation ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * raise error when setting overlapping entities as doc.ents (#2880) * Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so must be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed. * Change PyThaiNLP Url (#2876) * Fix missing comma * Add example showing a fix-up rule for space entities * Set version to 2.0.17.dev0 * Update regex version * Revert "Update regex version" This reverts commit `62358dd867`. 
* Try setting older regex version, to align with conda * Set version to 2.0.17 * Add spacy-js to universe [ci-skip] * Add spacy-raspberry to universe (closes #2889) * Add script to validate universe json [ci skip] * Removed space in docs + added contributor info (#2909) * - removed unneeded space in documentation * - added contributor info * Allow input text of length up to max_length, inclusive (#2922) * Include universe spec for spacy-wordnet component (#2919) * feat: include universe spec for spacy-wordnet component * chore: include spaCy contributor agreement * Minor formatting changes [ci skip] * Fix image [ci skip] Twitter URL doesn't work on live site * Check if the word is in one of the regular lists specific to each POS (#2886) * 💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticeable here.) ### Types of change bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Fix typo [ci skip] * fixes symbolic link on py3 and windows (#2949) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com> * Fix formatting * Update universe [ci skip] * Catalan Language Support (#2940) * Catalan language Support * Adding Catalan to documentation * Sort languages alphabetically [ci skip] * Update tests for pytest 4.x (#2965) ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Fix regex pin to harmonize with conda (#2964) * Update README.rst * Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977) Fixes #2976 * Fix typo * Fix typo * Remove duplicate file * Require thinc 7.0.0.dev2 Fixes bug in gpu_ops that would use cupy instead of numpy on CPU * Add missing import * Fix error IDs * Fix tests	2018-11-29 16:30:29 +01:00
Matthew Honnibal	681258e29b	Add support for pretrained tok2vec to ud-train	2018-11-29 14:54:47 +00:00
Matthew Honnibal	93be3ad038	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-29 12:37:06 +00:00
Matthew Honnibal	008e1ee1dd	Update pretrain command	2018-11-29 12:36:43 +00:00
Ines Montani	8d3bfb3c04	Remove outdated options and fix formatting	2018-11-28 23:33:34 +01:00
Adam Schwalm	00566949de	Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977 ) Fixes #2976	2018-11-28 19:49:33 +01:00
Nathaniel J. Smith	73255091f8	Fix conftest getoption	2018-11-28 19:07:24 +01:00
Matthew Honnibal	87da5bcf5b	Set version to v2.1.0a3	2018-11-28 18:22:09 +01:00
Matthew Honnibal	647d1a1efc	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-28 18:21:45 +01:00
Matthew Honnibal	61e435610e	💫 Feature/improve pretraining (#2971 ) * Improve spacy pretrain script * Implement BERT-style 'masked language model' objective. Much better results. * Improve logging. * Add length cap for documents, to avoid memory errors. * Require thinc 7.0.0.dev1 * Add argument for using pretrained vectors * Fix defaults * Fix syntax error * Tweak pretraining script * Fix data limits in spacy.gold * Fix pretrain script	2018-11-28 18:04:58 +01:00
Matthew Honnibal	0fdb25b958	Fix msgpack error	2018-11-27 19:35:55 +01:00
Matthew Honnibal	ef0820827a	Update hyper-parameters after NER random search (#2972 ) These experiments were completed a few weeks ago, but I didn't make the PR, pending model release. Changes: token vector width 128->96; hidden width 128->64; embed size 5000->2000; dropout 0.2->0.1; updated optimizer defaults (unclear how important?). This should improve speed, model size and load time, while keeping similar or slightly better accuracy. The tl;dr is we prefer to prevent over-fitting by reducing model size, rather than using more dropout.	2018-11-27 18:49:52 +01:00
Matthew Honnibal	c9f6acc564	Set version to 2.1.0a3.dev0	2018-11-27 05:15:27 +01:00
Ines Montani	b6e991440c	💫 Tidy up and auto-format tests (#2967 ) * Auto-format tests with black * Add flake8 config * Tidy up and remove unused imports * Fix redefinitions of test functions * Replace orths_and_spaces with words and spaces * Fix compatibility with pytest 4.0 * xfail test for now Test was previously overwritten by following test due to naming conflict, so failure wasn't reported * Unfail passing test * Only use fixture via arguments Fixes pytest 4.0 compatibility	2018-11-27 01:09:36 +01:00
Matthew Honnibal	2c37e0ccf6	💫 Use Blis for matrix multiplications (#2966 ) Our epic matrix multiplication odyssey is drawing to a close... I've now finally got the Blis linear algebra routines in a self-contained Python package, with wheels for Windows, Linux and OSX. The only missing platform at the moment is Windows Python 2.7. The result is at https://github.com/explosion/cython-blis Thinc v7.0.0 will make the change to Blis. I've put a Thinc v7.0.0.dev0 up on PyPi so that we can test these changes with the CI, and even get them out to spacy-nightly, before Thinc v7.0.0 is released. This PR also updates the other dependencies to be in line with the current versions master is using. I've also resolved the msgpack deprecation problems, and gotten spaCy and Thinc up to date with the latest Cython. The point of switching to Blis is to have control of how our matrix multiplications are executed across platforms. When we were using numpy for this, a different library would be used on pip and conda, OSX would use Accelerate, etc. This would open up different bugs and performance problems, especially when multi-threading was introduced. With the change to Blis, we now strictly single-thread the matrix multiplications. This will make it much easier to use multiprocessing to parallelise the runtime, since we won't have nested parallelism problems to deal with. * Use blis * Use -2 arg to Cython * Update dependencies * Fix requirements * Update setup dependencies * Fix requirement typo * Fix msgpack errors * Remove Python27 test from Appveyor, until Blis works there * Auto-format setup.py * Fix murmurhash version	2018-11-27 00:44:04 +01:00
Ines Montani	41c6002fd8	Tidy up [ci skip]	2018-11-26 18:56:04 +01:00
Ines Montani	c62d06ea5c	Port over #2949	2018-11-26 18:54:27 +01:00
Ines Montani	ec5ee9e616	Auto-format	2018-11-26 18:54:20 +01:00
Ines Montani	968aff2f6a	Update tests for pytest 4.x (#2965 ) ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-26 18:14:57 +01:00
Marc Puig	98fe1ab259	Catalan Language Support (#2940 ) * Catalan language Support * Adding Catalan to documentation	2018-11-26 15:25:47 +01:00
Ines Montani	048416f265	Fix formatting	2018-11-26 13:27:41 +01:00
Shawn Cicoria	7601ae0cff	fixes symbolic link on py3 and windows (#2949 ) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com>	2018-11-24 15:34:23 +01:00
Ines Montani	350c8d25b0	Add EntityRecognizer.label property	2018-11-18 00:06:26 +01:00
Ines Montani	017bc2ef2f	Expose TextCategorizer via __all__	2018-11-18 00:06:13 +01:00
Ines Montani	b4581435f6	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-16 13:08:22 +01:00
Ines Montani	e2f75eb492	Fix message formatting	2018-11-16 13:08:20 +01:00
Matthew Honnibal	2874b8efd8	Fix tok2vec loading in spacy train	2018-11-15 23:34:54 +00:00
Matthew Honnibal	2ddd428834	Fix pretrain script	2018-11-15 23:34:35 +00:00
Matthew Honnibal	f8afaa0c1c	Fix pretrain	2018-11-15 22:46:53 +00:00
Matthew Honnibal	6af6950e46	Fix pretrain	2018-11-15 22:45:36 +00:00
Matthew Honnibal	3e7b214e57	Make pretrain script work with stream from stdin	2018-11-15 22:44:07 +00:00
Matthew Honnibal	8fdb9bc278	💫 Add experimental ULMFit/BERT/Elmo-like pretraining (#2931 ) * Add 'spacy pretrain' command * Fix pretrain command for Python 2 * Fix pretrain command * Fix pretrain command	2018-11-15 22:17:16 +01:00
Ines Montani	02fc73ca53	💫 Create random IDs for SVGs to prevent ID clashes (#2927 ) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticeable here.) ### Types of change bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-15 11:40:10 +01:00
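The ID-clash fix described in the commit above can be illustrated with a small sketch. It is not displaCy's actual code: the function name `make_arc_ids` and the `arrow-<prefix>-<i>` ID format are assumptions made for illustration. The only point is that each render draws a fresh random prefix, so two renders of the same parse never share element IDs.

```python
import uuid


def make_arc_ids(n_arcs):
    """Return n_arcs element IDs that share one random per-render prefix.

    Hypothetical helper: the name and the ID format are illustrative only.
    """
    prefix = uuid.uuid4().hex  # new random prefix on every call (one per render)
    return ["arrow-{}-{}".format(prefix, i) for i in range(n_arcs)]


# Two renders of the same 3-arc parse get disjoint sets of IDs.
first_render = make_arc_ids(3)
second_render = make_arc_ids(3)
```

Drawing the prefix once per render, rather than once per arc, keeps the IDs within a single SVG easy to correlate while still separating different SVGs on the same page.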
Ines Montani	e89708c3eb	💫 Allow matching non-ORTH attributes in PhraseMatcher (#2925 ) * Allow matching non-orth attributes in PhraseMatcher (see #1971) Usage: PhraseMatcher(nlp.vocab, attr='POS') * Allow attr argument to be int * Fix formatting * Fix typo	2018-11-15 03:00:58 +01:00
Ines Montani	0d5b142c78	Fix typos and whitespace	2018-11-14 19:12:34 +01:00
Ines Montani	bd1b0e396a	Add deprecation warning for PhraseMatcher max_length	2018-11-14 19:10:46 +01:00
Ines Montani	64257bf3a7	Fix formatting	2018-11-14 19:10:21 +01:00
Ines Montani	b3cadd5b81	Delete _matcher2_notes.py	2018-11-14 16:19:12 +01:00
mauryaland	87ce435aff	Check if the word is in one of the regular lists specific to each POS (#2886 )	2018-11-14 15:58:43 +01:00
Daniel Hershcovich	d3d419ecc0	Allow input text of length up to max_length, inclusive (#2922 )	2018-11-13 16:46:29 +01:00
Matthew Honnibal	5fc98ade04	Set version to 2.1.0a2	2018-11-08 09:56:56 +01:00
Matthew Honnibal	ad44982f01	Fix dropout in tensorizer, update comment	2018-11-03 12:46:58 +00:00
Matthew Honnibal	ba365ae1c9	Normalize gradient by number of words in tensorizer	2018-11-03 10:53:22 +00:00
Matthew Honnibal	dac3f1b280	Improve Tensorizer	2018-11-03 10:52:50 +00:00
Matthew Honnibal	2527ba68e5	Fix tensorizer	2018-11-02 23:29:54 +00:00
Matthew Honnibal	db08b168a3	Set version to 2.0.17	2018-10-29 23:22:18 +01:00
Suraj Rajan	0bf14082a4	Added more constructs for dependency tree matcher (#2836 )	2018-10-29 23:21:39 +01:00
Matthew Honnibal	e2ae25d6f5	Try setting older regex version, to align with conda	2018-10-29 13:39:00 +01:00
Matthew Honnibal	d4fa9af56f	Set version to 2.0.17.dev0	2018-10-28 16:15:26 +01:00
Matthew Honnibal	b2e2bba8b0	Fix missing comma	2018-10-28 00:09:16 +02:00
Wannaphong Phatthiyaphaibun	2d2765fd8a	Change PyThaiNLP Url (#2876 )	2018-10-27 14:46:07 +02:00
Matthew Honnibal	817e1fc5e5	Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so must be hunted down. Hunting this one down took a long time... I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed.	2018-10-27 01:12:50 +02:00
Matthew Honnibal	9447739027	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-27 00:50:48 +02:00
Matthew Honnibal	ad068f51be	Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so must be hunted down. Hunting this one down took a long time... I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed.	2018-10-27 00:46:30 +02:00
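The bug pattern in the commit above, a sentinel of -1 used directly as an array index, can be sketched in plain Python. This is a toy model, not spaCy's Cython state class: `ToyState`, `EMPTY_TOKEN`, and the exact behaviour of `B()` and `safe_get()` are assumptions modelled on the commit message. Note that in C the bad access reads outside the array, whereas Python silently wraps -1 to the last element, which is arguably harder to notice.

```python
# Placeholder returned when an index does not refer to a real token.
EMPTY_TOKEN = {"text": ""}


class ToyState:
    """Toy parser state (hypothetical; names follow the commit message)."""

    def __init__(self, tokens, buffer_start):
        self.tokens = tokens
        self.buffer_start = buffer_start

    def B(self, i):
        """Index of buffer position i, or -1 if no such token exists."""
        idx = self.buffer_start + i
        return idx if 0 <= idx < len(self.tokens) else -1

    def safe_get(self, idx):
        """Return the token at idx, or EMPTY_TOKEN if idx is out of range."""
        if idx < 0 or idx >= len(self.tokens):
            return EMPTY_TOKEN
        return self.tokens[idx]


state = ToyState([{"text": "a"}, {"text": "b"}], buffer_start=1)
missing = state.B(1)              # no such buffer token, so this is -1

safe = state.safe_get(missing)    # guarded access: the empty token
unsafe = state.tokens[missing]    # unguarded access: wraps to the last token
```

The guarded and unguarded accesses disagree without raising any error, which fits the commit's account of needing to diff values across training runs to find the divergence.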
Grivaz	57f274b693	raise error when setting overlapping entities as doc.ents (#2880 )	2018-10-26 23:29:16 +02:00
Ines Montani	48b1bc44d3	Update version to 2.0.16	2018-10-15 14:39:25 +02:00
Ines Montani	a0f6647160	Increment version	2018-10-15 14:20:55 +02:00
Ines Montani	7bc7fa8f1e	Increment version	2018-10-15 01:40:44 +02:00
Matthew Honnibal	8612b75890	Set version to 2.0.14	2018-10-15 00:10:04 +02:00
Matthew Honnibal	d6e9cf8b09	Set version to 2.0.14.dev1	2018-10-15 00:09:02 +02:00
Matthew Honnibal	8ccfa52d19	Unhack prefer_gpu	2018-10-14 23:27:09 +02:00
Matthew Honnibal	41adf3572b	Set version to v2.0.14	2018-10-14 23:15:34 +02:00
Matthew Honnibal	38aa835ada	Workaround bug in thinc require_gpu	2018-10-14 23:15:08 +02:00
Matthew Honnibal	91593b7378	Add tests for prefer_gpu() and require_gpu()	2018-10-14 23:05:22 +02:00
Matthew Honnibal	62c70b3163	Import prefer_gpu and require_gpu functions from Thinc	2018-10-14 23:03:06 +02:00
Ines Montani	295da0f11b	Increment version to 2.0.14.dev0	2018-10-14 16:37:46 +02:00
Matthew Honnibal	7de0dcb91f	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-14 16:12:23 +02:00
Keshan	cb075c8e72	Adding "This is a sentence" example to Sinhala (#2846 )	2018-10-14 00:06:40 +02:00
Matthew Honnibal	9cfab5933a	Set version to 2.0.13	2018-10-13 19:42:16 +02:00
Matthew Honnibal	6a6ae5b0af	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-13 19:41:00 +02:00
mauryaland	36514b5762	Rule-based French Lemmatizer (#2818 ) ## Description Add a rule-based French Lemmatizer following the English one and the excellent PR for [Greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class. ### Types of change - Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version. - Add several files containing exhaustive list of words for each part of speech - Add some lemma rules - Add POS that are not checked in the standard Lemmatizer, i.e. PRON, DET, ADV and AUX - Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentioned - Modify the lemmatize function to check in lookup table as a last resort - Init files are updated so the model can support all the functionalities mentioned above - Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py ## Checklist - [X] I have submitted the spaCy Contributor Agreement. - [X] I ran the tests, and all new and existing tests passed. - [X] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-10-13 16:38:21 +02:00
Matthew Honnibal	de46286107	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-13 16:11:16 +02:00
Ines Montani	cb57b35bb8	Also include lowercase norm exceptions	2018-10-13 15:37:30 +02:00
JKhakpour	74a30d883c	Add Persian(Farsi) language support (#2797 )	2018-10-13 15:31:49 +02:00
Matthew Honnibal	c3ddf98b1e	Set version to 2.0.13.dev4	2018-10-13 15:20:59 +02:00
Marina Lysyuk	b76fe08308	Correcting lang/ru/examples.py (#2845 ) * Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement * Correct some grammatical inaccuracies in lang\ru\examples.py * Move contributor agreement to separate file	2018-10-13 15:19:43 +02:00
Matthew Honnibal	67ddce68d8	Unskip test	2018-10-02 23:47:55 +02:00
Matthew Honnibal	4cf5ce2cc2	Revert "Remove problematic test" This reverts commit `bdebbef455`.	2018-10-02 23:47:24 +02:00
Matthew Honnibal	bdebbef455	Remove problematic test	2018-10-02 23:16:29 +02:00
Matthew Honnibal	6afc6ffe56	Skip seemingly problematic test	2018-10-02 22:33:40 +02:00
Matthew Honnibal	9e4079ddb2	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-02 19:44:43 +02:00
Matthew Honnibal	40f228c2f2	Set version to 2.0.13.dev3	2018-10-02 19:44:25 +02:00
Ines Montani	ea20b72c08	💫 Make like_num work for prefixed numbers (#2808 ) * Only split + prefix if not numbers * Make like_num work for prefixed numbers * Add test for like_num	2018-10-01 10:49:14 +02:00
Filipe Caixeta	6c498f9ff4	Update Portuguese Language (#2790 ) * Add words to Portuguese language _num_words * Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols * Extended punctuation and norm_exceptions in the Portuguese language	2018-09-29 09:51:45 +02:00
Matthew Honnibal	b39810d692	Fix copy_reg compatibility on _serialize module	2018-09-28 15:23:14 +02:00
Matthew Honnibal	f82f8ba5dd	Fix serialization when empty parser model. Closes #2482	2018-09-28 15:18:52 +02:00
Matthew Honnibal	d5a6c63b62	Add regression test for #2482	2018-09-28 15:18:30 +02:00
Matthew Honnibal	e3e9fe18d4	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-09-28 14:27:35 +02:00
Matthew Honnibal	0323f5be0c	Fix _serialize module	2018-09-28 14:27:24 +02:00
Ines Montani	5d56eb70d7	Tidy up tests	2018-09-27 16:41:57 +02:00
Ines Montani	1f1bab9264	Remove unused import	2018-09-27 16:41:37 +02:00
Matthew Honnibal	6430b1fe64	Restore encoding arg on msgpack-numpy	2018-09-27 15:58:21 +02:00
Matthew Honnibal	2ac69facc6	Fix Python 2 test failure	2018-09-27 15:34:16 +02:00
Matthew Honnibal	b9ef8ac616	Fix GoldParse class when no entities	2018-09-27 15:14:27 +02:00
Matthew Honnibal	72778375fb	Merge branch 'master' of https://github.com/explosion/spaCy	2018-09-27 13:54:49 +02:00
Matthew Honnibal	96fe314d8d	Fix bug when too many entity types. Fixes #2800	2018-09-27 13:54:34 +02:00
Suraj Rajan	bbdc6456c6	Set up dependency tree pattern matching skeleton (#2732 )	2018-09-27 13:27:18 +02:00
Matthew Honnibal	8809dc4514	Remove deprecated encoding argument to msgpack	2018-09-27 12:56:23 +02:00
Matthew Honnibal	bae6b3e2b3	Merge branch 'master' of https://github.com/explosion/spaCy	2018-09-27 12:50:31 +02:00
Ines Montani	71cdbeada7	Revert "Also include lowercase norm exceptions" This reverts commit `70f4e8adf3`.	2018-09-27 12:29:25 +02:00
darindf	8227566805	Fix error (#2802 ) * Fix error ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function * added spaCy Contributor Agreement	2018-09-26 21:31:03 +02:00
Matthew Honnibal	c8a2841308	Add property to get morph key on token	2018-09-26 21:04:29 +02:00
Matthew Honnibal	823cc4127a	Update morphology tests	2018-09-26 21:04:13 +02:00
Matthew Honnibal	2b8a53ebdc	Fix morphology functions	2018-09-26 21:03:57 +02:00
Matthew Honnibal	022dcda964	Fix morphology enum	2018-09-26 21:03:44 +02:00
Matthew Honnibal	6350234929	Add morphologizer pipeline component to Language	2018-09-26 21:03:20 +02:00
Matthew Honnibal	6f98313254	Fix disjunctive features in English tag map	2018-09-26 21:03:03 +02:00
Matthew Honnibal	f03640b41f	Fix morphology task in ud-train	2018-09-26 21:02:42 +02:00
Matthew Honnibal	1f9f834dc0	Fix morphologizer	2018-09-26 21:02:13 +02:00
Matthew Honnibal	3b6b018904	Fix loading of gold morphology	2018-09-26 21:01:48 +02:00
Ines Montani	5e0dfb34fa	Merge branch 'master' of https://github.com/explosion/spaCy	2018-09-26 11:13:58 +02:00
Matthew Honnibal	2be15fa7d2	Fix Python feature enum in morphology	2018-09-25 23:03:43 +02:00
Matthew Honnibal	a4fc397880	Add helper to parse features into field and column IDs	2018-09-25 22:13:10 +02:00
Matthew Honnibal	d0dc032842	Fill in missing morphologizer methods	2018-09-25 22:12:54 +02:00
Matthew Honnibal	53eb96db09	Fix definition of morphology model	2018-09-25 22:12:32 +02:00
Matthew Honnibal	fb0abddd9e	Call morph morphology in GoldParse	2018-09-25 21:34:53 +02:00
Matthew Honnibal	2ba10493f7	Read morphology into gold standard in ud-train	2018-09-25 21:32:24 +02:00
Matthew Honnibal	834dfb0e9d	Add morph attribute to GoldParse	2018-09-25 21:32:05 +02:00
Matthew Honnibal	d89a1a91ac	Update morphology tests	2018-09-25 21:07:48 +02:00
Matthew Honnibal	51a297f934	Fix morphology add and update	2018-09-25 21:07:08 +02:00
Matthew Honnibal	34cab8cc49	Update morphology API	2018-09-25 20:53:24 +02:00
Matthew Honnibal	9998d9b9ff	Start testing morphology class	2018-09-25 20:38:08 +02:00
Matthew Honnibal	4b7e772f5d	Implement the is_animacy_feature etc functions	2018-09-25 17:28:34 +02:00
Matthew Honnibal	6fe7c72560	Reorder morphology enum, and add begin and end markers	2018-09-25 17:28:13 +02:00
Matthew Honnibal	8308c1525e	Fix exception loading	2018-09-25 15:18:21 +02:00
Ines Montani	70f4e8adf3	Also include lowercase norm exceptions	2018-09-25 12:22:02 +02:00
Keshan	9a016d17c2	Adding basic support for Sinhala language. (#2788 ) * adding Sinhala language package, stop words, examples and lex_attrs. * Adding contributor agreement * Updating contributor agreement	2018-09-25 12:18:25 +02:00
Matthew Honnibal	e4d8f86d7f	Merge branch 'develop' into feature/lemmatizer	2018-09-25 11:09:22 +02:00
Matthew Honnibal	b42c123e5d	Fix regression introduced by `1759abf1e`	2018-09-25 11:08:58 +02:00
Matthew Honnibal	500898907b	Fix regression in parser.begin_training()	2018-09-25 11:08:31 +02:00
Matthew Honnibal	c2357d3ba0	Fix morphologizer class	2018-09-25 10:58:13 +02:00
Matthew Honnibal	e6dde97295	Add function to make morphologizer model	2018-09-25 10:57:59 +02:00
Matthew Honnibal	be8cf39e16	Fix morphology	2018-09-25 10:57:33 +02:00
Matthew Honnibal	a3d2e616d5	Restore previous morphology stuff	2018-09-25 00:35:59 +02:00
Matthew Honnibal	3bba8e9245	Update structs	2018-09-24 23:58:08 +02:00
Matthew Honnibal	6ae645c4ef	WIP on supporting morphology features	2018-09-24 23:57:41 +02:00
Matthew Honnibal	ac5742223a	Draft class to predict morphological tags	2018-09-24 23:14:06 +02:00
Matthew Honnibal	b10d0cce05	Add MultiSoftmax class Add a new class for the Tagger model, MultiSoftmax. This allows softmax prediction of multiple classes on the same output layer, e.g. one variable with 3 classes, another with 4 classes. This makes a layer with 7 output neurons, which we softmax into two distributions.	2018-09-24 17:35:28 +02:00
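The MultiSoftmax idea in the commit above (one output layer of 7 neurons, normalised as a 3-class and a 4-class distribution) can be sketched without any ML library. This is only an illustration of the arithmetic, not Thinc's or spaCy's implementation; the function names `softmax` and `multi_softmax` and the example logits are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_softmax(logits, class_sizes):
    """Split one flat output vector into consecutive groups and softmax each.

    class_sizes lists the number of classes per variable, e.g. [3, 4].
    """
    if sum(class_sizes) != len(logits):
        raise ValueError("class sizes must add up to the output width")
    distributions = []
    start = 0
    for size in class_sizes:
        distributions.append(softmax(logits[start:start + size]))
        start += size
    return distributions


# Seven output neurons: one variable with 3 classes, another with 4.
logits = [0.5, 1.0, -0.3, 2.0, 0.1, 0.1, -1.0]
first_dist, second_dist = multi_softmax(logits, [3, 4])
```

Each group sums to 1 independently, so the layer predicts one value per variable rather than a single class over all 7 outputs.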
Matthew Honnibal	052c45dc2f	Add as_int and as_string methods to StringStore	2018-09-24 15:25:20 +02:00
Ines Montani	3c4e3ade30	Fix typo (closes #2784 )	2018-09-21 10:45:11 +02:00
mauryaland	68b3c544d5	Adding French hyphenated first name (#2786 )	2018-09-21 10:38:13 +02:00
Matthew Honnibal	1759abf1e5	Fix bug in sentence starts for non-projective parses The set_children_from_heads function assumed parse trees were projective. However, non-projective parses may be passed in during deserialization, or after deprojectivising. This caused incorrect sentence boundaries to be set for non-projective parses. Close #2772.	2018-09-19 14:50:06 +02:00
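Non-projectivity, which the fix above has to tolerate, means that some dependency arcs cross when drawn above the sentence. The sketch below is an illustration under stated assumptions rather than spaCy's `set_children_from_heads`: the name `has_crossing_arcs` is hypothetical, and `heads[i]` is taken to be the index of token i's head, with the root pointing at itself.

```python
def has_crossing_arcs(heads):
    """Return True if any two dependency arcs cross.

    heads[i] is the index of the head of token i; a root has heads[i] == i.
    Hypothetical helper for illustration only.
    """
    # Represent each arc as a (left, right) span, skipping root self-loops.
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if i != h]
    for a, b in arcs:
        for c, d in arcs:
            # Arc (c, d) starts strictly inside (a, b) and ends strictly outside.
            if a < c < b < d:
                return True
    return False


projective = [1, 1, 1]         # tokens 0 and 2 attach to the root at 1
non_projective = [2, 3, 2, 2]  # arcs 0-2 and 1-3 cross

flat_result = has_crossing_arcs(projective)
crossing_result = has_crossing_arcs(non_projective)
```

Code that derives sentence boundaries from contiguous subtree spans can go wrong on inputs like the second example, which matches the kind of failure the commit describes.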
Matthew Honnibal	48fd36bf05	Fix test for issue 2772	2018-09-19 14:47:27 +02:00
Matthew Honnibal	6cd920e088	Add xfail test for deprojectivization SBD bug	2018-09-19 14:00:31 +02:00
Matthew Honnibal	99a6011580	Avoid adding empty layer in model, to keep models backwards compatible	2018-09-14 22:51:58 +02:00
Matthew Honnibal	c046392317	Trigger on_data hooks in parser model	2018-09-14 20:51:21 +02:00
Matthew Honnibal	5afd98dff5	Add a stepping function, for changing batch sizes or learning rates	2018-09-14 18:37:16 +02:00
Matthew Honnibal	27c00f4f22	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-09-14 12:30:57 +02:00
Andrew Ongko	81564cc4e8	Update Indonesian model (#2752 ) * adding e-KTP in tokenizer exceptions list * add exception token * removing lines containing spaces, as it won't matter since we use .split() method in the end, added new tokens in exception * add tokenizer exceptions list * combining base_norms with norm_exceptions * adding norm_exception * fix double key in lemmatizer * remove unused import on punctuation.py * reformat stop_words to reduce number of lines, improve readability * updating tokenizer exception * implement is_currency for lang/id * adding orth_first_upper in tokenizer_exceptions * update the norm_exception list * remove bunch of abbreviations * adding contributors file	2018-09-14 12:30:32 +02:00
Filipe Caixeta	fe515085f3	Add words to Portuguese language _num_words (#2759 ) * Add words to Portuguese language _num_words	2018-09-14 12:30:16 +02:00
Matthew Honnibal	f32b52e611	Fix bug that caused deprojectivisation to run multiple times	2018-09-14 12:12:54 +02:00
Matthew Honnibal	8f2a6367e9	Fix usage of PyTorch BiLSTM in ud_train	2018-09-13 22:54:59 +00:00
Matthew Honnibal	afeddfff26	Fix PyTorch BiLSTM	2018-09-13 22:54:34 +00:00
Matthew Honnibal	a26fe8e7bb	Small hack in Language.update to make torch work	2018-09-13 22:51:52 +00:00
Matthew Honnibal	445b81ce3f	Support bilstm_depth argument in ud-train	2018-09-13 19:30:22 +02:00
Matthew Honnibal	b43643a953	Support bilstm_depth option in parser	2018-09-13 19:29:49 +02:00
Matthew Honnibal	45032fe9e1	Support option of BiLSTM in Tok2Vec (requires pytorch)	2018-09-13 19:28:35 +02:00
Matthew Honnibal	3eb9f3e2b8	Fix defaults for ud-train	2018-09-13 18:05:48 +02:00
Matthew Honnibal	59cf533879	Improve ud-train script. Make config optional	2018-09-13 14:24:08 +02:00
Matthew Honnibal	3e3a309764	Fix tagger	2018-09-13 14:14:38 +02:00
Matthew Honnibal	da7650e84b	Fix maximum doc length in ud_train script	2018-09-13 14:10:25 +02:00
Matthew Honnibal	a95eea4c06	Fix multi-task objective for parser	2018-09-13 14:08:55 +02:00
Matthew Honnibal	21321cd6cf	Add tok2vec property to parser model	2018-09-13 14:08:43 +02:00
Matthew Honnibal	d6aa60139d	Fix tagger training on GPU	2018-09-13 14:05:37 +02:00
Grivaz	aeba99ab0d	Introduces a bulk merge function, in order to solve issue #653 (#2696 ) * Fix comment * Introduce bulk merge to increase performance on many span merges * Sign contributor agreement * Implement pull request suggestions	2018-09-10 16:41:42 +02:00
tyburam	476472d181	lex_attrs for Polish language (#2750 ) * Signed spaCy contributor agreement * Added Polish version of English lex_attrs	2018-09-10 11:53:57 +02:00
Sainath Adapa	77139bc03c	Basic support for Telugu language (#2751 )	2018-09-10 11:53:18 +02:00
Maxim Kupfer	cebe50b5b8	Remove ')' for clarity (#2737 ) Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intentional, then please let me know.	2018-09-10 11:31:49 +02:00
Matthew Honnibal	b2cb1fc67d	Merge matcher tests	2018-09-06 01:39:53 +02:00
Suraj Krishnan Rajan	356af7b0a1	Fix tests	2018-09-06 01:39:36 +02:00
Piotr Żelasko	bdb2165bd1	Less norm computations in token similarity (#2730 ) * Less norm computations in token similarity * Contributor agreement	2018-09-05 21:50:23 +02:00
Aniruddha Adhikary	4530ddcc51	update bengali token rules for hyphen and digits (#2731 )	2018-09-05 21:49:00 +02:00
Nathaniel J. Smith	26849874ad	When calling getoption() in conftest.py, pass a default option (#2709 ) * When calling getoption() in conftest.py, pass a default option This is necessary to allow testing an installed spacy by running: pytest --pyargs spacy * Add contributor agreement	2018-09-03 09:57:52 +02:00
Matthew Honnibal	4d2d7d5866	Fix new feature flags	2018-08-27 02:12:39 +02:00
Matthew Honnibal	598dbf1ce0	Fix character-based tokenization for Japanese	2018-08-27 01:51:38 +02:00
Matthew Honnibal	3763e20afc	Pass subword_features and conv_depth params	2018-08-27 01:51:15 +02:00
Matthew Honnibal	8051136d70	Support subword_features and conv_depth params in Tok2Vec	2018-08-27 01:50:48 +02:00
Matthew Honnibal	9c33d4d1df	Add more hyper-parameters to spacy ud-train * subword_features: Controls whether subword features are used in the word embeddings. True by default (specifically, prefix, suffix and word shape). Should be set to False for languages like Chinese and Japanese. * conv_depth: Depth of the convolutional layers. Defaults to 4.	2018-08-27 01:48:46 +02:00
Ines Montani	e9022f7b33	Remove docstrings for deprecated arguments (see #2703 )	2018-08-26 14:23:13 +02:00
Ines Montani	559f4139e3	Add FAC to spacy.explain (resolves #2706 )	2018-08-26 14:13:50 +02:00
Matthew Honnibal	51a9efbf3b	Add draft Binder class	2018-08-22 13:12:51 +02:00
Matthew Honnibal	5ce459d2ee	Fix error in vocab	2018-08-16 17:18:09 +02:00
Matthew Honnibal	00febda2e3	Improve alignment around quotes	2018-08-16 01:04:34 +02:00
Matthew Honnibal	66a3f2ba21	Lower-case text before alignment	2018-08-16 00:42:36 +02:00
Matthew Honnibal	595c893791	Expose noise_level option in train CLI	2018-08-16 00:41:44 +02:00
Matthew Honnibal	8365226bf3	Fix lookup of symbols in vocab.	2018-08-15 23:43:34 +02:00
Matthew Honnibal	b9f0588580	Set version to v2.1.0a1	2018-08-15 17:22:39 +02:00
Matthew Honnibal	e968016417	Note link between issues #2671 and #2675	2018-08-15 17:18:28 +02:00
Matthew Honnibal	63bdc734ba	Skip flakey test	2018-08-15 16:56:55 +02:00
Matthew Honnibal	ce512e1d47	Fix #2671 : Incorrect match ID on some patterns	2018-08-15 16:19:08 +02:00
Matthew Honnibal	f12b9190f6	Xfail test for issue #2671	2018-08-15 15:55:31 +02:00
Matthew Honnibal	7cfa665ce6	Add failing test for issue 2671: Incorrect rule ID returned from matcher	2018-08-15 15:54:33 +02:00
Matthew Honnibal	1b2a5869ab	Set version to v2.1.0a2.dev0	2018-08-15 15:38:52 +02:00
Matthew Honnibal	5080760288	Add extra comment on 'add label' in parser	2018-08-15 15:37:24 +02:00
Matthew Honnibal	6e749d3c70	Skip flakey parser test	2018-08-15 15:37:04 +02:00
Matthew Honnibal	6ea981c839	Add converter for jsonl NER data	2018-08-14 14:04:32 +02:00
Matthew Honnibal	a9fb6d5511	Fix docs2jsonl function	2018-08-14 14:03:48 +02:00
Matthew Honnibal	ea2edd1e2c	Merge branch 'feature/docs_to_json' into develop	2018-08-14 13:23:42 +02:00
Matthew Honnibal	6ec236ab08	Fix label-clobber bug in parser.begin_training() The parser.begin_training() method was rewritten in v2.1. The rewrite introduced a regression, where if you added labels prior to begin_training(), these labels were discarded. This patch fixes that.	2018-08-14 13:20:19 +02:00
Matthew Honnibal	02c5c114d0	Fix usage of deprecated freqs.txt in init-model	2018-08-14 13:19:15 +02:00
Matthew Honnibal	2a5a61683e	Add function to get train format from Doc objects Our JSON training format is annoying to work with, and we've wanted to retire it for some time. In the meantime, we can at least add some missing functions to make it easier to live with. This patch adds a function that generates the JSON format from a list of Doc objects, one per paragraph. This should be a convenient way to handle a lot of data conversions: whatever format you have the source information in, you can use it to setup a Doc object. This approach should offer better future-proofing as well. Hopefully, we can steadily rewrite code that is sensitive to the current data-format, so that it instead goes through this function. Then when we change the data format, we won't have such a problem.	2018-08-14 13:13:10 +02:00
Matthew Honnibal	4336397ecb	Update develop from master	2018-08-14 03:04:28 +02:00
Matthew Honnibal	13fa550b36	Merge branch 'master' of https://github.com/explosion/spaCy	2018-08-14 02:32:01 +02:00
Ioannis Daras	fe94e696d3	Optimize Greek language support (#2658 )	2018-08-14 02:31:32 +02:00
Matthew Honnibal	85000ea13b	Increment version to 2.0.13.dev2	2018-08-10 00:43:55 +02:00
Matthew Honnibal	c4ac981e6d	Try again to filter warnings	2018-08-10 00:42:54 +02:00
Matthew Honnibal	ae7fc42a41	Increment version to v2.0.13.dev1	2018-08-10 00:14:31 +02:00
Matthew Honnibal	19f5046934	Undoing warning suppression, as it doesn't really work	2018-08-10 00:13:34 +02:00
Matthew Honnibal	3fb828352d	Set version to 2.0.13.dev0	2018-08-09 23:49:34 +02:00
Matthew Honnibal	1c0614ecd2	Catch numpy warning	2018-08-09 23:49:24 +02:00
Aashish Gangwani	6eebfc7bf4	Added numbers to ../lang/hi/lex_attrs.py (#2629 ) I have added numbers to the Hindi lex_attrs.py file according to the Indian numbering system (https://en.wikipedia.org/wiki/Indian_numbering_system). Here are their English translations: 'शून्य' => zero 'एक' => one 'दो' => two 'तीन' => three 'चार' => four 'पांच' => five 'छह' => six 'सात' => seven 'आठ' => eight 'नौ' => nine 'दस' => ten 'ग्यारह' => eleven 'बारह' => twelve 'तेरह' => thirteen 'चौदह' => fourteen 'पंद्रह' => fifteen 'सोलह' => sixteen 'सत्रह' => seventeen 'अठारह' => eighteen 'उन्नीस' => nineteen 'बीस' => twenty 'तीस' => thirty 'चालीस' => forty 'पचास' => fifty 'साठ' => sixty 'सत्तर' => seventy 'अस्सी' => eighty 'नब्बे' => ninety 'सौ' => hundred 'हज़ार' => thousand 'लाख' => hundred thousand 'करोड़' => ten million 'अरब' => billion 'खरब' => hundred billion ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-08-08 16:06:11 +02:00
Emil Stenström	3834f4146d	Add abbreviations from UD_Swedish-Talbanken (#2613 ) * Add abbreviations from UD_Swedish-Talbanken * Add contributor agreement.	2018-08-07 13:53:17 +02:00
Ole Henrik Skogstrøm	0473add369	Feature/span ents (#2599 ) * Created Span.ents property * Add tests for span.ents * Add tests for start and end of sentence	2018-08-07 13:52:32 +02:00
Xiaoquan Kong	87fa847e6e	Fix Chinese language related bugs (#2634 )	2018-08-07 11:26:31 +02:00
Xiaoquan Kong	f0c9652ed1	New Feature: display more detail when Error E067 (#2639 ) * Fix off-by-one error * Add verbose option * Update verbose option * Update documents for verbose option	2018-08-07 10:45:29 +02:00
Emil Stenström	1914c488d3	Swedish: Exceptions for single letter words ending sentence (#2615 ) * Exceptions for single letter words ending sentence Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), should be tokenized as two separate tokens. * Add test	2018-08-05 14:14:30 +02:00
Matthew Honnibal	860f5bd91f	Add test for issue 2626	2018-08-05 13:46:57 +02:00
Kaisa (Katarzyna) Korsak	e531a827db	Changed conllu2json to be able to extract NER tags (#2594 ) * extract ner tags from conllu file if available * fixed a bug in regex	2018-07-25 22:21:31 +02:00
Dmitry Bruhanov	07d0cc9de7	Update examples.py (#2597 )	2018-07-25 22:20:24 +02:00
Matthew Honnibal	66983d8412	Port BenDerPan's Chinese changes to v2 (finally) (#2591 ) * add template files for Chinese * add template files for Chinese, and test directory.	2018-07-25 02:47:23 +02:00
ines	f2e3e039b7	Update French stop words (resolves #2540 )	2018-07-24 23:41:51 +02:00
Ines Montani	75f3234404	💫 Refactor test suite (#2568 ) ## Description Related issues: #2379 (should be fixed by separating model tests) * total execution time down from > 300 seconds to under 60 seconds 🎉 * removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure * changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version) * merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways) * tidied up and rewrote existing tests wherever possible ### Todo - [ ] move tests to `/tests` and adjust CI commands accordingly - [x] move model test suite from internal repo to `spacy-models` - [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~ - [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted - [ ] update documentation on how to run tests ### Types of change enhancement, tests ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-24 23:38:44 +02:00
Matthew Honnibal	82277f63a3	💫 Small efficiency fixes to tokenizer (#2587 ) This patch improves tokenizer speed by about 10%, and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` indexed on different data in v1, but in v2 the orth and the hash are identical. The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This checks whether a chunk we're tokenizing triggers a special-case rule. If it does, then we avoid caching within the chunk. This check led to incorrectly rejecting some chunks from the cache. With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104 words per second. Prior to this patch, we had 465,764 words per second. Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to: * Fix the variable-length lookarounds in the suffix, infix and `token_match` rules * Improve the performance of the `token_match` regex * Switch back from the `regex` library to the `re` library. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-24 23:35:54 +02:00
Matthew Honnibal	6303ce3d0e	Try to fix memory error by moving fr_tokenizer to module scope	2018-07-24 20:09:06 +02:00
Matthew Honnibal	afe3fa4449	Merge branch 'master' of https://github.com/explosion/spaCy	2018-07-24 19:44:31 +02:00
Matthew Honnibal	b2e9e958b9	Add session scoping to tokenizers to try to fix oom on Appveyor	2018-07-24 19:44:18 +02:00
Ines Montani	a43ad114c2	Fix typo [ci skip]	2018-07-24 18:45:40 +02:00
Dmitry Bruhanov	27160b1516	added some widespread written jargon & dialectisms (#2584 ) This jargon is not offensive but emotionally colored as funny due to its deviation from the norm for various reasons: imitating a dialect, deliberately wrong spelling emphasizing its low colloquial nature, obsolete form, foreign borrowing with native inflections, etc. Dmitry Briukhanov, Linguist & Pythonist	2018-07-24 18:44:29 +02:00

5765 Commits