spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-22 10:02:01 +03:00

Author	SHA1	Message	Date
Stanisław Giziński	1448ad100c	Improved Polish tokenizer and stop words. (#2974 ) * Improved stop words list * Removed some wrong stop words from list * Improved stop words list * Removed some wrong stop words from list * Improved Polish Tokenizer (#38) * Add tests for Polish tokenizer * Add Polish tokenizer exceptions * Don't split any words containing hyphens * Fix test case with wrong model answer * Remove commented out line of code until better solution is found * Add source srx' license * Rename exception_list.py to match spaCy conventionality * Add a brief explanation of where the exception list comes from * Add newline after each exception * Rename COPYING.txt to LICENSE * Delete old files * Add header to the license * Agreements signed * Stanisław Giziński agreement * Krzysztof Kowalczyk - signed agreement * Mateusz Olko agreement * Add DoomCoder's contributor agreement * Improve like number checking in Polish lang * like num tests added * all from SI system added * Final licence and removed splitting exceptions * Added Polish stop words to LEX_ATTRA * Add encoding info to pl tokenizer exceptions	2019-02-08 14:27:21 +11:00
Julia Makogon	b41d64825a	Ukrainian language added. Small fixes in Russian (#3241 ) * Classes for Ukrainian; small fix in Russian. * Contributor agreement	2019-02-07 21:05:11 +01:00
Amandine Périnet	d570e75dbb	Improving the French lookup dictionary for ambiguous words (#3185 ) * modifying FR lookup to remove ambiguity and adding lookup vocab to FR files * modifying FR lookup to remove ambiguity and adding lookup vocab to FR files * updating the contributor agreement for amperinet	2019-01-31 23:53:45 +01:00
Ines Montani	e9a6dbe4f3	Don't check for Jupyter in global scope and fix check (#3213 ) Resolves #3208. Prevent interactions with other libraries (pandas) that also access `get_ipython().config` and its parameters. See #3208 for details. I don't fully understand why this happens, but in spaCy, we can at least make sure we avoid calling into this method. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-31 23:49:13 +01:00
Amandine Périnet	b34bc9d2e9	add small fix for French lemmatizer (#3206 )	2019-01-31 23:44:10 +01:00
Loghi	5ca8e2b269	Tamil (#3194 ) * Tamil language support: stop words, examples and numerical attribute support added. Contributor agreement signed * Create Loghijiaha.md Added contributor agreement * Update CONTRIBUTOR_AGREEMENT.md Adjusted contributor_agreement.md * Norm exceptions added	2019-01-27 06:02:04 +01:00
foufaster	8bd85fd9d5	Fix french lemmatization (#3180 )	2019-01-27 06:01:30 +01:00
Björn Lennartsson	b892b446cc	Updates to Swedish Language (#3164 ) * Added the same punctuation rules as Danish language. * Added abbreviations and also the possibility to have capitalized abbreviations on some. Added a few specific cases too * Added test for long texts in Swedish * Added morph rules, infixes and suffixes to __init__.py for Swedish * Added some tests for prefixes, infixes and suffixes * Added tests for lemma * Renamed files to follow convention * [sv] Removed ambiguous abbreviations * Added more tests for tokenizer exceptions * Added test for problem with punctuation in issue #2578 * Contributor agreement * Removed faulty lemmatization of 'jag' ('I') as it was lemmatized to 'jaga' ('hunt')	2019-01-16 13:45:50 +01:00
Gavriel Loria	9a5003d5c8	iob converter: add 'exception' for error 'too many values' (#3159 ) * added contributor agreement * issue #3128 throw exception on bad IOB/2 formatting * Update spacy/cli/converters/iob2json.py with ValueError Co-Authored-By: gavrieltal <gtloria@protonmail.com>	2019-01-16 13:44:16 +01:00
Mark Neumann	e599ed9ef8	Allow vectors to be optional in init-model, more robust string counting (#3155 ) * more robust init-model * key not word * add license agreement	2019-01-14 23:48:30 +01:00
mauryaland	214c2ec263	check if argument flat is true or not (#3156 )	2019-01-14 23:47:05 +01:00
Loghi	d97661d18b	Tamil language support (#3154 ) Tamil language support to spaCy Description Hereby, creating new PR to add support for Tamil language in spaCy added stop words, examples and numerical attributes <--Working on other language data--> Types of change Enhancement Checklist [x] I have submitted the spaCy Contributor Agreement. [x] I ran the tests, and all new and existing tests passed. [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-14 15:32:30 +01:00
Amandine Périnet	ee24e2534d	French lemmatization: adding lemmas for adverbs and irregular lemmas for function words (#3131 ) * adding adverbs and irregular cases for empty words * adding adverbs and irregular cases for empty words * adding adverbs and irregular cases for empty words * updating contributor agreement for amperinet	2019-01-10 15:41:15 +01:00
Kirill Bulygin	7b064542f7	Making `lang/th/test_tokenizer.py` pass by creating `ThaiTokenizer` (#3078 )	2019-01-10 15:40:37 +01:00
Álvaro Abella Bascarán	1cd8f9823f	Correct docs of `Token.subtree` and `Span.subtree` (issue #3122 ) (#3124 ) * solve inconsistency between docs and Span.subtree (issue #3122) * solve inconsistency between docs and Token.subtree (issue #3122)	2019-01-09 03:11:15 +01:00
Amandine Périnet	eef11a7a2c	French lemmatization: correcting wrong lemmas in the lookup dictionary (#3104 ) * modifying French lookup that contained wrong lemmas * correcting wrong line breaks on hyphen * adding contributor agreement for amperinet@ * correcting a typo	2019-01-07 14:15:19 +01:00
Álvaro Abella Bascarán	e03e1eee92	Bugfix/get lca matrix (#3110 ) This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](https://github.com/explosion/spaCy/pull/3089) (sorry!). ## Description The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed. ### Types of change Bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-06 19:07:50 +01:00
Álvaro Abella Bascarán	6fe276f85d	Fix issue 2396 (#3089 ) * Test on #2396: bug in Doc.get_lca_matrix() * reimplementation of Doc.get_lca_matrix(), (closes #2396) * reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix() * tests Span.get_lca_matrix() as well as Doc.get_lca_matrix() * implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix * use memory view instead of np.ndarray in _get_lca_matrix (faster) * fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview * cleaner conditional, add comment	2018-12-29 18:02:26 +01:00
Will Price	4a6af0852a	Improve random prefix generation in displaCy arcs (#3096 ) * Improve random prefix generation in displaCy arcs * Add @willprice contributor agreement	2018-12-27 14:46:02 +01:00
Özcan Kasal	b573ebca77	trilyon forgotten (#3083 ) * trilyon forgotten * contributor added	2018-12-27 14:44:23 +01:00
Muhammad Irfan	2e84ec1513	Fixed ISO code for Urdu. (#3073 )	2018-12-20 12:28:53 +01:00
Kirill Bulygin	10189d9092	Fix the first `nlp` call for `ja` (closes #2901 ) (#3065 ) * Fix the first `nlp` call for `ja` (closes #2901) * Add unicode declaration, formatting and use relative import	2018-12-18 14:53:50 +01:00
Brixjohn	52f3c95004	Added alpha support for Tagalog language (#3062 ) I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language Filipino. I have heavily based the format to the EN and ES languages. I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to its Tagalog counterpart, added some tokenizer exceptions, and kept the tag map the same as the English language. While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases. * Added alpha support for Tagalog language * Edited contributor template * Included SCA; Reverted templates * Fixed SCA template * Fixed changes in SCA template	2018-12-18 13:08:38 +01:00
Amandine Périnet	361554f629	Lemmatization of Adjectives - French : adding rules and vocabulary (#3045 ) * modifying FR lemmatisation for Adjectives * adding contributor agreement for amperinet * correcting some errors in vocabulary files	2018-12-16 18:11:07 +01:00
Shooter23	6ae8e49bff	Fix docstring for is_right_punct(). (#3044 )	2018-12-14 10:11:11 +01:00
Amandine Périnet	0b44ea23bd	Lemmatization of Nouns - French : adding rules and vocabulary (#2992 ) * modifying FR lemmatization for nouns * modifying FR lemmatization for nouns * adding contributor agreement for amperinet * adding rules for words with inclusive parentheses wrongly tokenized * adding contributor agreement for amperinet * adding a missing comma	2018-12-06 22:42:18 +01:00
Gavriel Loria	9c8c4287bf	Accept iob2 and allow generic whitespace (#2999 ) * accept non-pipe whitespace as delimiter; allow iob2 filename * added small documentation note for IOB2 allowance * added contributor agreement	2018-12-06 15:50:25 +01:00
Amandine Périnet	2457318b7a	Lemmatization of Verbs - French : adding rules and vocabulary (#3006 ) * updating rules and vocabulary for French lemmatization of verbs * updating the file with French auxiliary verb * updating rules and vocabulary for French lemmatization of verbs * adding contributor agreement for amperinet * adding rules for words with inclusive parentheses wrongly tokenized	2018-12-06 15:49:28 +01:00
Beate Sildnes	f0d7e206ec	Updated wordforms for Norwegian lemmatizer (#3007 ) * Updated wordforms for Norwegian lemmatizer Upload of updated lists of wordforms for the Norwegian lemmatizer (nouns, verbs, adverbs, adjectives and lookup). * Add spaCy contributor agreement for user beatesi * Updated wordforms for Norwegian lemmatizer	2018-12-06 15:46:18 +01:00
Matthew Honnibal	bbaca991ba	Set version to v2.0.18	2018-12-01 03:35:09 +01:00
Matthew Honnibal	e1a4b0d7f7	Set version to v2.0.18.dev1	2018-12-01 03:12:12 +01:00
Matthew Honnibal	413530b269	Set version to 2.0.18	2018-12-01 03:00:27 +01:00
Matthew Honnibal	24d52876e1	Set version to v2.0.18.dev0	2018-12-01 02:38:04 +01:00
Ines Montani	c9bdeafbc7	Don't run weird failing test for now	2018-11-30 16:13:40 +01:00
Sofie	585de273cd	Fix small typo bug in French regexp + relevant unit test (#2980 ) * additional unit test for new entr word not in other lists * bugfix - unit test works * use _latin_lower instead of alpha_lower for french * revert back to ALPHA_LOWER (following the code for languages) * contributor agreement	2018-11-29 20:16:13 +01:00
Adam Schwalm	00566949de	Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977 ) Fixes #2976	2018-11-28 19:49:33 +01:00
Ines Montani	968aff2f6a	Update tests for pytest 4.x (#2965 ) ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-26 18:14:57 +01:00
Marc Puig	98fe1ab259	Catalan Language Support (#2940 ) * Catalan language Support * Adding Catalan to documentation	2018-11-26 15:25:47 +01:00
Ines Montani	048416f265	Fix formatting	2018-11-26 13:27:41 +01:00
Shawn Cicoria	7601ae0cff	fixes symbolic link on py3 and windows (#2949 ) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com>	2018-11-24 15:34:23 +01:00
Ines Montani	02fc73ca53	💫 Create random IDs for SVGs to prevent ID clashes (#2927 ) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticeable here.) ### Types of change bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-15 11:40:10 +01:00
mauryaland	87ce435aff	Check if the word is in one of the regular lists specific to each POS (#2886 )	2018-11-14 15:58:43 +01:00
Daniel Hershcovich	d3d419ecc0	Allow input text of length up to max_length, inclusive (#2922 )	2018-11-13 16:46:29 +01:00
Matthew Honnibal	db08b168a3	Set version to 2.0.17	2018-10-29 23:22:18 +01:00
Matthew Honnibal	e2ae25d6f5	Try setting older regex version, to align with conda	2018-10-29 13:39:00 +01:00
Matthew Honnibal	d4fa9af56f	Set version to 2.0.17.dev0	2018-10-28 16:15:26 +01:00
Matthew Honnibal	b2e2bba8b0	Fix missing comma	2018-10-28 00:09:16 +02:00
Wannaphong Phatthiyaphaibun	2d2765fd8a	Change PyThaiNLP Url (#2876 )	2018-10-27 14:46:07 +02:00
Matthew Honnibal	9447739027	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-27 00:50:48 +02:00
Matthew Honnibal	ad068f51be	Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so must be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed.	2018-10-27 00:46:30 +02:00

5045 Commits