spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-18 00:35:50 +03:00

Author	SHA1	Message	Date
Brian Phillips	8227de0099	Update language.jade (#2616 )	2018-07-31 12:34:42 +02:00
Ioannis Daras	055cc0de44	Bug fix to pseudocode for tokenizer customization (#2604 )	2018-07-27 11:04:12 +02:00
Kaisa (Katarzyna) Korsak	e531a827db	Changed conllu2json to be able to extract NER tags (#2594 ) * extract ner tags from conllu file if available * fixed a bug in regex	2018-07-25 22:21:31 +02:00
Dmitry Bruhanov	07d0cc9de7	Update examples.py (#2597 )	2018-07-25 22:20:24 +02:00
Andriy Mulyar	e9ef51137d	Fixed typo (#2596 ) Changed 'The index of the first character after the span.' to 'The index of the last character after the span' in description of doc.char_span	2018-07-25 22:17:15 +02:00
Matthew Honnibal	66983d8412	Port BenDerPan's Chinese changes to v2 (finally) (#2591 ) * add template files for Chinese * add template files for Chinese, and test directory .	2018-07-25 02:47:23 +02:00
ines	f2e3e039b7	Update French stop words (resolves #2540 )	2018-07-24 23:41:51 +02:00
Ines Montani	75f3234404	💫 Refactor test suite (#2568 ) ## Description Related issues: #2379 (should be fixed by separating model tests) * total execution time down from > 300 seconds to under 60 seconds 🎉 * removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure * changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version) * merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyway) * tidied up and rewrote existing tests wherever possible ### Todo - [ ] move tests to `/tests` and adjust CI commands accordingly - [x] move model test suite from internal repo to `spacy-models` - [x] ~~investigate why `pipeline/test_textcat.py` is flaky~~ - [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted - [ ] update documentation on how to run tests ### Types of change enhancement, tests ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-24 23:38:44 +02:00
Matthew Honnibal	82277f63a3	💫 Small efficiency fixes to tokenizer (#2587 ) This patch improves tokenizer speed by about 10%, and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` were indexed on different data in v1, but in v2 the orth and the hash are identical. The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This flag checks whether a chunk we're tokenizing triggers a special-case rule. If it does, then we avoid caching within the chunk. Because the flag was uninitialized, the check incorrectly rejected some chunks from the cache. With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104 words per second. Prior to this patch, we had 465,764 words per second. Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to: * Fix the variable-length lookarounds in the suffix, infix and `token_match` rules * Improve the performance of the `token_match` regex * Switch back from the `regex` library to the `re` library. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-24 23:35:54 +02:00
kororo	b1ec827ee0	Fix typo (#2579 ) Update slogan, desc and code snippet to latest version	2018-07-24 22:47:33 +02:00
ines	cd687091fb	Remove nl examples from widget for now [ci skip] Restore for next spaCy version when path to example sentences is fixed	2018-07-24 22:41:20 +02:00
ines	2d8ffb8bcd	Fix formatting	2018-07-24 22:40:49 +02:00
ines	1b3da8d2ae	Update website for v2.0.12 [ci skip]	2018-07-24 21:04:22 +02:00
Matthew Honnibal	e05bebce8e	Try setting appveyor to Python2 64	2018-07-24 20:47:03 +02:00
Matthew Honnibal	6303ce3d0e	Try to fix memory error by moving fr_tokenizer to module scope	2018-07-24 20:09:06 +02:00
Matthew Honnibal	afe3fa4449	Merge branch 'master' of https://github.com/explosion/spaCy	2018-07-24 19:44:31 +02:00
Matthew Honnibal	b2e9e958b9	Add session scoping to tokenizers to try to fix oom on Appveyor	2018-07-24 19:44:18 +02:00
Ines Montani	a43ad114c2	Fix typo [ci skip]	2018-07-24 18:45:40 +02:00
Dmitry Bruhanov	27160b1516	added some widespread written jargon & dialectisms (#2584 ) This jargon is not offensive but emotionally colored as funny due to its deviation from the norm for various reasons: imitating a dialect, deliberately wrong spelling emphasizing its low colloquial nature, obsolete form, foreign borrowing with native inflections, etc. Dmitry Briukhanov, Linguist & Pythonist	2018-07-24 18:44:29 +02:00
Dmitry Bruhanov	4ad7de6ca9	DimaBryuhanov.md (#2590 ) Adds the signed spaCy Contributor Agreement (SCA) for Dmitry Briukhanov (GitHub username DimaBryuhanov), signed as an individual on 7/24/2018.	2018-07-24 18:43:27 +02:00
Matthew Honnibal	1a16162da9	Merge branch 'master' of https://github.com/explosion/spaCy	2018-07-21 15:57:18 +02:00
Matthew Honnibal	cabce07ba6	Fix thinc version requirement	2018-07-21 15:56:33 +02:00
ines	ae5ed2d698	Update docs for v2.0.12 [ci skip]	2018-07-21 15:51:44 +02:00
ines	d517dd4297	Document remove_extension methods	2018-07-21 15:51:28 +02:00
ines	153f41a5cc	Use better examples for Doc extension methods	2018-07-21 15:51:11 +02:00
ines	3c30d1763c	Merge branch 'master' into develop	2018-07-21 15:34:18 +02:00
Matthew Honnibal	f0024e3b13	Add script to push a tag	2018-07-21 15:10:54 +02:00
Matthew Honnibal	90c269e1a9	Set about to v2.0.12 release	2018-07-21 15:09:42 +02:00
Matthew Honnibal	1a1c7304cf	Set version to 2.0.12.dev1	2018-07-21 13:08:01 +02:00
Matthew Honnibal	a723fafea3	Require thinc 6.10.3.dev1	2018-07-21 12:49:09 +02:00
ines	1ea881c80b	Allow ignoring warnings and only overwrite if set explicitly	2018-07-20 22:50:19 +02:00
ines	95641f4026	Only install pathlib backport on Python < 3.4	2018-07-20 21:08:29 +02:00
Matthew Honnibal	e0caf3ae8c	Fix msgpack for new version	2018-07-20 17:32:00 +02:00
Matthew Honnibal	899f1cf442	Add regression test for issue 2179	2018-07-20 17:15:44 +02:00
Matthew Honnibal	9db77fd914	Fix deserialization for msgpack	2018-07-20 14:11:09 +02:00
Matthew Honnibal	adde3826e2	Build against thinc 6.10.3.dev0	2018-07-20 13:34:54 +02:00
katarkor	5ca853bee0	changed tag_map, morph_rules, lemmatizer for Norwegian (#2565 ) * changed tag_map, morph_rules, lemmatizer for Norwegian * Move unicode declaration up Hopefully fixes test failure on Python 2 * Update CONTRIBUTOR_AGREEMENT.md * Move unicode declarations Hopefully fixes test this time * Revert "Merge remote-tracking branch 'origin/patch-1'" This reverts commit `f5ccd5dd0d`, reversing changes made to `dd07e180ea`. * Update contributor agreement [ci skip]	2018-07-19 19:38:24 +02:00
kororo	2784babef9	Add ExcelCy into Universe list (#2572 ) Hi guys, This is my first spaCy extension. I am excited to be able to do this. Please do let me know if there are any suggestions or modifications I need to make. Feel free to use/contribute to the repo that I made. ## Description ExcelCy is a spaCy toolkit to help improve the training data experience. It provides easy annotation using the Excel file format. It has a helper to pre-train entity annotation with phrase and regex matcher pipes. ### Types of change Update to Universe list in website. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-19 19:28:33 +02:00
ines	4339f64128	Merge branch 'master' into develop	2018-07-19 16:15:03 +02:00
ines	d489ffb78b	Fix formatting [ci skip]	2018-07-19 13:22:25 +02:00
ines	c0b62ce13c	Ignore pytest cache	2018-07-19 12:30:09 +02:00
Ines Montani	e7b075565d	💫 Rule-based NER component (#2513 ) * Add helper function for reading in JSONL * Add rule-based NER component * Fix whitespace * Add component to factories * Add tests * Add option to disable indent on json_dumps compat Otherwise, reading JSONL back in line by line won't work * Fix error code	2018-07-18 19:43:16 +02:00
ines	d84b13e02c	Merge branch 'master' into develop	2018-07-18 18:57:00 +02:00
Ole Henrik Skogstrøm	6e2930a4a2	Conll(u)-bio converter (#2525 ) * Started simple conllxbiluo converter * Fix missing BIO to BILUO conversion	2018-07-18 18:55:42 +02:00
ines	02aefe7cc0	Merge branch 'master' into develop	2018-07-18 18:52:59 +02:00
Ioannis Daras	6ed18412d0	Greek language optimizations (#2558 ) * Greek language optimizations * Add encoding on files containing greek words * Add encoding on files containing greek words	2018-07-18 18:51:38 +02:00
ines	80e7485630	Merge branch 'master' into develop	2018-07-18 17:28:47 +02:00
Xiang Ji	19a5ef1c58	Fix venv command examples (#2560 ) [ci skip] * Fix venv command examples The documentation refers to `venv`, which is native to Python3. However, the command examples are as if they were still `virtualenv`, which is a package independent of `venv`: - It doesn't need to be installed via `pip`. In fact `pip install venv` would return an error. - The correct way to invoke `venv` is `python3 -m venv`, not `venv`, which would return command not found. See https://docs.python.org/3/library/venv.html I suspect the documentation simply replaced all occurrences of `virtualenv` with `venv`. However they are different modules and are used differently. * Update comment [ci skip]	2018-07-18 10:31:24 +02:00
Paul O'Leary McCann	61ef0739b8	Add Japanese stop words. (#2549 ) List created by taking the 2000 top words from a Wikipedia dump and removing everything that wasn't hiragana. Tried going through kanji words and deciding what to keep but there were too many obvious non-stopwords (東京 was in the top 500) and many other words where it wasn't clear if they should be included or not.	2018-07-17 10:12:48 +02:00
Tero K	f35980f865	Enhancement/lang fi examples (#2547 ) * Added a file with examples in finnish * added contributor agreement	2018-07-15 09:50:27 +02:00
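
The Japanese stop-word commit above (61ef0739b8) describes taking the most frequent words from a Wikipedia dump and removing everything that was not hiragana. The following is a minimal Python sketch of that filtering step, not the contributor's actual script: the function names, the assumption that the text is already tokenized, and the exact Unicode range used for hiragana are all illustrative choices.

```python
from collections import Counter


def is_hiragana(word):
    """Return True if every character lies in the Unicode Hiragana block.

    Assumption: U+3041..U+309F is used as the hiragana range; the long-vowel
    mark (U+30FC) belongs to the Katakana block and is therefore rejected.
    """
    return bool(word) and all("\u3041" <= ch <= "\u309f" for ch in word)


def hiragana_stop_candidates(tokens, top_n=2000):
    """Keep the top_n most frequent tokens, then drop anything not pure hiragana.

    `tokens` is assumed to be an iterable of already-segmented words;
    how the dump was tokenized is not stated in the commit message.
    """
    counts = Counter(tokens)
    return [word for word, _ in counts.most_common(top_n) if is_hiragana(word)]


if __name__ == "__main__":
    # Tiny hypothetical sample standing in for a Wikipedia dump.
    sample = ["の", "の", "東京", "は", "は", "は", "スパ"]
    print(hiragana_stop_candidates(sample))
```

Kanji words such as 東京 and katakana words are filtered out regardless of frequency, which matches the commit's note that frequent kanji words were too often obvious non-stopwords.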

9122 Commits