Commit Graph

5321 Commits

Author SHA1 Message Date
Matthew Honnibal
82277f63a3 💫 Small efficiency fixes to tokenizer (#2587)
This patch improves tokenizer speed by about 10%, and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` indexed on different data in v1, but in v2 the orth and the hash are identical.

The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This checks whether a chunk we're tokenizing triggers a special-case rule. If it does, then we avoid caching within the chunk. This check led to incorrectly rejecting some chunks from the cache. 

With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104k words per second. Prior to this patch, we had 465,764k words per second.

Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to:

* Fix the variable-length lookarounds in the suffix, infix and `token_match` rules
* Improve the performance of the `token_match` regex
* Switch back from the `regex` library to the `re` library.

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-24 23:35:54 +02:00
Matthew Honnibal
6303ce3d0e Try to fix memory error by moving fr_tokenizer to module scope 2018-07-24 20:09:06 +02:00
Matthew Honnibal
afe3fa4449 Merge branch 'master' of https://github.com/explosion/spaCy 2018-07-24 19:44:31 +02:00
Matthew Honnibal
b2e9e958b9 Add session scoping to tokenizers to try to fix oom on Appveyor 2018-07-24 19:44:18 +02:00
Ines Montani
a43ad114c2
Fix typo [ci skip] 2018-07-24 18:45:40 +02:00
Dmitry Bruhanov
27160b1516 added some widespread written jargon & dialectizms (#2584)
This jargon is not offencive but emotionally colored as funny due to its deviation from the norm for various reasons: immitating a dialect, deliberately wrong spelling emphasizing its low colloquial nature, obsolete form, foreign borrowing with native flections, etc.
Dmitry Briukhanov, Linguist & Pythonist
2018-07-24 18:44:29 +02:00
ines
3c30d1763c Merge branch 'master' into develop 2018-07-21 15:34:18 +02:00
Matthew Honnibal
90c269e1a9 Set about to v2.0.12 release 2018-07-21 15:09:42 +02:00
Matthew Honnibal
1a1c7304cf Set version to 2.0.12.dev1 2018-07-21 13:08:01 +02:00
ines
1ea881c80b Allow ignoring warnings and only overwrite if set explicitly 2018-07-20 22:50:19 +02:00
Matthew Honnibal
e0caf3ae8c Fix msgpack for new version 2018-07-20 17:32:00 +02:00
Matthew Honnibal
899f1cf442 Add regression test for issue 2179 2018-07-20 17:15:44 +02:00
Matthew Honnibal
9db77fd914 Fix deserialization for msgpack 2018-07-20 14:11:09 +02:00
katarkor
5ca853bee0 changed tag_map, morph_rules, lemmatizer for Norwegian (#2565)
* changed tag_map, morph_rules, lemmatizer for Norwegian

* Move unicode declaration up

Hopefully fixes test failure on Python 2

* Update CONTRIBUTOR_AGREEMENT.md

* Move unicode declarations

Hopefully fixes test this time

* Revert "Merge remote-tracking branch 'origin/patch-1'"

This reverts commit f5ccd5dd0d, reversing
changes made to dd07e180ea.

* Update contributor agreement [ci skip]
2018-07-19 19:38:24 +02:00
Ines Montani
e7b075565d
💫 Rule-based NER component (#2513)
* Add helper function for reading in JSONL

* Add rule-based NER component

* Fix whitespace

* Add component to factories

* Add tests

* Add option to disable indent on json_dumps compat

Otherwise, reading JSONL back in line by line won't work

* Fix error code
2018-07-18 19:43:16 +02:00
ines
d84b13e02c Merge branch 'master' into develop 2018-07-18 18:57:00 +02:00
Ole Henrik Skogstrøm
6e2930a4a2 Conll(u)-bio converter (#2525)
* Started simple conllxbiluo converter

* Fix missing BIO to BILUO conversion
2018-07-18 18:55:42 +02:00
ines
02aefe7cc0 Merge branch 'master' into develop 2018-07-18 18:52:59 +02:00
Ioannis Daras
6ed18412d0 Greek language optimizations (#2558)
* Greek language optimizations

* Add encoding on files containing greek words

* Add encoding on files containing greek words
2018-07-18 18:51:38 +02:00
ines
80e7485630 Merge branch 'master' into develop 2018-07-18 17:28:47 +02:00
Paul O'Leary McCann
61ef0739b8 Add Japanese stop words. (#2549)
List created by taking the 2000 top words from a Wikipedia dump and
removing everything that wasn't hiragana.

Tried going through kanji words and deciding what to keep but there were
too many obvious non-stopwords (東京 was in the top 500) and many other
words where it wasn't clear if they should be included or not.
2018-07-17 10:12:48 +02:00
Tero K
f35980f865 Enhancement/lang fi examples (#2547)
* Added a file with examples in finnish

* added contributor agreement
2018-07-15 09:50:27 +02:00
Paul O'Leary McCann
1987f3f784 Add Japanese lemmas (#2543)
This info was already available from Mecab, forgot to add it before.
2018-07-13 10:55:14 +02:00
ines
3a321e79ac Merge branch 'master' into develop 2018-07-10 13:49:08 +02:00
Eleni170
6042723535 Add support for Greek language (#2535)
* Add contributor agreement

* Support for Greek language

* Fix missing el_tokenizer
2018-07-10 13:48:38 +02:00
Stefan Schweter
3dfc7f86be lemmatizer: correct lemma for Rang (#2537)
<!--- Provide a general summary of your changes in the title. -->

## Description

This PR corrects the German lemma form for the word "Rang". Initially, the lemma form was "ringen", which is not correct, because it refers to the verb ("ringen") and not to the noun ("Rang").

### Types of change

The lemma form for "Rang" is corrected to "Rang", see also the [Duden](https://www.duden.de/rechtschreibung/Rang) entry.

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-10 13:11:19 +02:00
ines
fd6207426a Merge branch 'master' into develop 2018-07-09 18:05:10 +02:00
Duygu Altinok
00b9a58558 German lemmatizer additions (#2529)
* lemma of was-> was

* added new pairs issue @2486

* added article tests
2018-07-09 11:10:15 +02:00
Ole Henrik Skogstrøm
c21efea9bb Add sent property to token (#2521)
* Add sent property to token

* Refactored and cleaned up copy paste errors.
2018-07-06 15:54:15 +02:00
ines
38e07ade4c Add test for custom tokenizer serialization (resolves #2494) 2018-07-06 12:40:51 +02:00
ines
c2581f9172 Tidy up tokenizer test 2018-07-06 12:40:28 +02:00
Matthew Honnibal
43dcaa473e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-06 12:36:42 +02:00
Matthew Honnibal
6c8d627733 Fix tokenizer deserialization 2018-07-06 12:36:33 +02:00
ines
c001d46153 Tidy up 2018-07-06 12:33:42 +02:00
Matthew Honnibal
63f5651f8d Fix tokenizer serialization 2018-07-06 12:32:11 +02:00
Matthew Honnibal
e1569fda4e Fix compile error in matcher 2018-07-06 12:29:23 +02:00
Matthew Honnibal
f5b2076700 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-06 12:23:14 +02:00
Matthew Honnibal
1a2f61725c Fix tokenizer serialization 2018-07-06 12:23:04 +02:00
ines
9e09477b2f Remove unused import 2018-07-06 12:18:17 +02:00
ines
26f04a6ac3 Fix Matcher tests and add test for any token with operator 2018-07-06 12:17:50 +02:00
Matthew Honnibal
f5703b7a91 Clean up unused stuff in matcher 2018-07-06 12:16:44 +02:00
Matthew Honnibal
08c362d541 Suppress compiler warning about unreachable code 2018-07-06 11:31:22 +02:00
Matthew Honnibal
8ae1bec8bf Fix init_model 2018-07-05 14:02:06 +02:00
Matthew Honnibal
7b09a4ca49 Fix lemmatization 2018-07-05 13:56:02 +02:00
Matthew Honnibal
ec41ceb383 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-05 13:49:42 +02:00
Matthew Honnibal
4eb3405df7 Fix lemmatizer ordering, re Issue #1387 2018-07-05 13:49:29 +02:00
ines
63666af328 Merge branch 'master' into develop 2018-07-04 14:52:25 +02:00
ines
8feb7cfe2d Remove model dependency from French lemmatizer tests 2018-07-04 14:46:45 +02:00
kleinay
a82c3153ad fix issue #2452 - displacy arrow direction is always forward (#2506) (closes #2452)
<!--- Provide a general summary of your changes in the title. -->
Referring #2452, fixing displacy arrow directions to match the input. 

## Description
The fix is simply replacing `direction is 'left'` with `direction == 'left'` to include the case `direction` is a `str` and not a `unicode`.

### Types of change
bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-04 14:12:08 +02:00
Bùi Trung Chí
9af46b4f1b Fix loading tokenizer with custom prefix search (#2495)
* Add contributor agreement

* Fix loading tokenizer with cutom prefix search
2018-07-04 12:56:07 +02:00
Matthew Honnibal
dee8bdb900 Fix init-model for npz vectors 2018-07-04 02:29:48 +02:00
Matthew Honnibal
59d655e8d0 Fix model init from jsonl 2018-07-04 01:30:40 +02:00
Matthew Honnibal
1e38bea6e9 Save vectors init 2018-07-03 23:55:04 +02:00
Matthew Honnibal
6692833887 Fix init_model 2018-07-03 23:24:11 +02:00
Matthew Honnibal
4a38a26cb5 Fix init_model 2018-07-03 22:57:11 +02:00
Matthew Honnibal
019d09e3c3 Fix init model 2018-07-03 22:16:44 +02:00
Matthew Honnibal
2543f8c93a Support .npz vectors in init-model command 2018-07-03 21:42:16 +02:00
Matthew Honnibal
86aad11939 Fix init_model arg 2018-07-03 17:00:42 +02:00
Matthew Honnibal
eff42d36e3 Fix init model command 2018-07-03 16:32:23 +02:00
Matthew Honnibal
97487122ea Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-03 15:44:37 +02:00
Matthew Honnibal
6a89faf12e Add support for jsonl-formatted lexical attributes to init-model command. 2018-07-03 12:22:56 +02:00
Matthew Honnibal
2ec2192000 Revert #1389: Don't overrule rules when lemma exception is present 2018-06-29 19:43:02 +02:00
Matthew Honnibal
01ace9734d Make pipeline work on empty docs 2018-06-29 19:21:38 +02:00
Matthew Honnibal
a1b05048d0 Fix tagger when doc is empty 2018-06-29 16:05:40 +02:00
Matthew Honnibal
3786942ff1 Fix tagger when docs are empty 2018-06-29 15:13:45 +02:00
ines
526be40823 Add test for 46d8a66 2018-06-29 14:33:12 +02:00
ines
f08c871adf Fix typo in Language.from_disk 2018-06-29 14:32:16 +02:00
Matthew Honnibal
46d8a66fef Fix tokenizer serialization if token_match is None 2018-06-29 14:24:46 +02:00
Matthew Honnibal
e0860bcfb3 Fix bug when docs are empty 2018-06-29 13:56:29 +02:00
Matthew Honnibal
a4d2b0c293 Fix bug when docs are empty 2018-06-29 13:44:25 +02:00
Matthew Honnibal
c83fccfe2a Fix output of best model 2018-06-25 23:05:56 +02:00
Matthew Honnibal
5a65418c40 Fix handling of unseen labels in tagger 2018-06-25 22:28:59 +02:00
Matthew Honnibal
5b56aad4c2 Fix handling of unseen labels in tagger 2018-06-25 22:24:54 +02:00
Matthew Honnibal
3aabf621a3 Fix handling of unknown tags in tagger update 2018-06-25 22:01:02 +02:00
Matthew Honnibal
69c900f003 Fix init-model if no vectors provided 2018-06-25 18:26:02 +02:00
Matthew Honnibal
664f89327a Fix init-model if no vectors provided 2018-06-25 17:58:45 +02:00
Matthew Honnibal
c4698f5712 Don't collate model unless training succeeds 2018-06-25 16:36:42 +02:00
Ole Henrik Skogstrøm
d16cb6bee6 Accept Span to displacy render (#2478) (closes #2477)
* Add Span to displacy render

* Fix span support, errors and add tests
2018-06-25 14:55:16 +02:00
Matthew Honnibal
24dfbb8a28 Fix model collation 2018-06-25 14:35:24 +02:00
Matthew Honnibal
62237755a4 Import shutil 2018-06-25 13:40:17 +02:00
Matthew Honnibal
a040fca99e Import json into cli.train 2018-06-25 11:50:37 +02:00
Matthew Honnibal
2c703d99c2 Fix collation of best models 2018-06-25 01:21:34 +02:00
Matthew Honnibal
9d6a1c57f2 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-06-24 23:40:06 +02:00
Matthew Honnibal
2c80b7c013 Collate best model after training 2018-06-24 23:39:52 +02:00
Muhammad Irfan
f33c703066 Add Urdu Language Support (#2430)
* added Urdu language support.

* added Urdu language tests.

* modified conftest.py for Urdu language support.

* added spacy contributor agreement.
2018-06-22 11:14:03 +02:00
himkt
14d9007efd fix wrong indexing (#2416)
* fix wrong indexing

* add agreement
2018-06-19 10:20:57 +02:00
Aliia E
428bae66b5 Add Tatar Language Support (#2444)
* add Tatar lang support

* add Tatar letters

* add Tatar tests

* sign contributor agreement

* sign contributor agreement [x]

* remove comments from Language class

* remove all template comments
2018-06-19 10:17:53 +02:00
Cory Hurst
446f5ec41b Silent keyword in info function in init (#2459)
* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue  #2196

* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue  #2196

* contributor agreement
2018-06-18 12:24:21 +02:00
ines
778e5f4da3 Merge branch 'master' into develop 2018-06-11 00:38:04 +02:00
himkt
57311d5d47 replace janome with mecab in the documentation and the test (#2415)
* Add links to Reddit data (see #2401)

* replace janome with mecab in the documentation and the test

* add the assignment
2018-06-11 00:33:13 +02:00
Nour Shalabi
a169b79092 Additions to Arabic stop words. (#2422)
* Additions to Arabic stop words.

* Create nourshalabi.md
2018-06-08 02:33:23 +02:00
ines
a0017e4909 Merge branch 'master' into develop 2018-05-30 14:10:47 +02:00
ines
b8ef9c1000 Fix model names in conftest (see #2379) 2018-05-30 14:10:20 +02:00
ines
4a62486340 Merge branch 'master' into develop 2018-05-30 13:01:01 +02:00
Maciej
c7d53348d7 Fix bug in CLI iob and ner converter (#2392) (fixes #2385)
* issue_2385 add tests for iob_to_biluo converter function

* issue_2385 fix and modify iob_to_biluo function to accept either iob or biluo tags in cli.converter

* issue_2385 add test to fix b char bug

* add contributor agreement

* fill contributor agreement
2018-05-30 12:28:44 +02:00
ines
3c3a175018 Merge branch 'master' into develop 2018-05-28 18:37:09 +02:00
ansgar-t
9732988951 escape html in displacy.render (#2378) (closes #2361)
## Description
Fix for issue #2361 :
replace &, <, >, " with &amp;amp; , &amp;lt; , &amp;gt; , &amp;quot; in before rendering svg

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
(As discussed in the comments to #2361)
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-05-28 18:36:41 +02:00
ines
f7103babd9 Only overwrite warnings filter if set explicitly (resolves #2369)
This way, pre-defined warning filters are respected and users are still able to use the fine-grained warning settings if they like.
2018-05-26 18:44:15 +02:00
ines
330c039106 Merge branch 'master' into develop 2018-05-26 18:30:52 +02:00
James Messinger
4515e96e90 Better formatting for spacy train CLI (#2357)
* Better formatting for `spacy train` CLI

Changed to use fixed-spaces rather than tabs to align table headers and data.

### Before:
```
Itn.    P.Loss  N.Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %
0       4618.857        2910.004        76.172  79.645  67.987  88.732  88.261  100.000 4436.9  6376.4
1       4671.972        3764.812        74.481  78.046  62.374  82.680  88.377  100.000 4672.2  6227.1
2       4742.756        3673.473        71.994  77.380  63.966  84.494  90.620  100.000 4298.0  5983.9
```

### After:
```
Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS
0     4618.857  2910.004  76.172  79.645  67.987  88.732  88.261  100.000  4436.9   6376.4
1     4671.972  3764.812  74.481  78.046  62.374  82.680  88.377  100.000  4672.2   6227.1
2     4742.756  3673.473  71.994  77.380  63.966  84.494  90.620  100.000  4298.0   5983.9
```

* Added contributor file
2018-05-25 13:08:45 +02:00
Aristo Rinjuang
432ede04af adding more words and rephrasing (#2351)
* adding more words and rephrasing

* adding a contributor

* tokenizer bugs solved
2018-05-24 11:40:57 +02:00
Jani Monoses
ec62cadf4c Updates to Romanian support (#2354)
* Add back Romanian in conftest

* Romanian lex_attr

* More tokenizer exceptions for Romanian

* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
Matthew Honnibal
5d281cf302 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-22 20:50:59 +02:00
Matthew Honnibal
ce458c2428 Fix spacy requirement constraint in package template 2018-05-22 20:50:46 +02:00
Ines Montani
862da5e793 Support pipeline factories via entry points (#2348) 2018-05-22 18:29:45 +02:00
Matthew Honnibal
d5af38f80c Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-21 17:42:55 +02:00
Matthew Honnibal
ee33de8652 Fix unpickling of NER parser 2018-05-21 17:42:40 +02:00
ines
f9dbcac8e4 Merge branch 'master' into develop 2018-05-21 02:29:29 +02:00
cclauss
f7dcaa1f6b Simplify is_config() and normalize_string_keys() (#2305)
* Simplify is_config() and normalize_string_keys()

* Use __in__ to avoid the nested _ands_ and _ors_.
* Dict comprehension directly tracks with the doc string

* Keep more basic loop in normalize_string_keys

* Whitespace
2018-05-21 01:54:35 +02:00
Ines Montani
cae4457c38 💫 Add .similarity warnings for no vectors and option to exclude warnings (#2197)
* Add logic to filter out warning IDs via environment variable

Usage: SPACY_WARNING_EXCLUDE=W001,W007

* Add warnings for empty vectors

* Add warning if no word vectors are used in .similarity methods

For example, if only tensors are available in small models – should hopefully clear up some confusion around this

* Capture warnings in tests

* Rename SPACY_WARNING_EXCLUDE to SPACY_WARNING_IGNORE
2018-05-21 01:22:38 +02:00
Matthew Honnibal
b096b22c20
Merge pull request #2247 from skrcode/1480
1480 - Implement Fast-Text vectors with subword features
2018-05-21 01:16:21 +02:00
Matthew Honnibal
f3b4f6a4ec Merge setup.py 2018-05-20 23:21:00 +02:00
Ines Montani
d4cc736b7c 💫 Improve model downloads: check for existing install, customise pip and use requests library again (#2346)
* Go back to using requests instead of urllib (closes #2320)

Fewer dependencies are good, but this one was simply causing too many other problems around SSL verification and Python 2/3 compatibility. requests is a popular enough package that it's okay for spaCy to depend on it – and this will hopefully make model downloads less flakey.

* Only download model if not installed (see #1456)

Use #egg=model==version to allow pip to check for existing installations. The download is only started if no installation matching the package/version is found. Fixes a long-standing inconvenience.

* Pass additional options to pip when installing model (resolves #1456)

Treat all additional arguments passed to the download command as pip options to allow user to customise the command. For example:

python -m spacy download en --user

* Add CLI option to enable installing model package dependencies

* Revert "Add CLI option to enable installing model package dependencies"

This reverts commit 9336ffe695.

* Update documentation
2018-05-20 20:26:56 +02:00
Matthew Honnibal
3eb446e0a5 Require thinc 6.11.1 and prepare for release to spacy-nightly 2018-05-20 19:00:34 +02:00
Matthew Honnibal
bdc23dd8c1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-20 18:59:24 +02:00
ines
5401c55c75 Merge branch 'master' into develop 2018-05-20 16:49:40 +02:00
ines
b59e3b157f Don't require attrs argument in Doc.retokenize and allow both ints and unicode (resolves #2304) 2018-05-20 15:15:37 +02:00
ines
5768df4f09 Add SimpleFrozenDict util to use as default function argument 2018-05-20 15:13:37 +02:00
Matthew Honnibal
7431e9c87f Fix parser for GPU 2018-05-19 17:24:34 +00:00
Matthew Honnibal
401213fb1f Only warn about unnamed vectors if non-zero sized. 2018-05-19 18:51:55 +02:00
Matthew Honnibal
74d5c625b3 Use rising beam update prob 2018-05-16 20:11:59 +02:00
Matthew Honnibal
544ae7f1db Merge branch 'develop' into feature/refactor-parser 2018-05-16 02:06:49 +02:00
Matthew Honnibal
d1b27fe5aa Revert "Improve dynamic oracle when values are missing in parse"
This reverts commit f56bd4736b.
2018-05-16 00:31:52 +02:00
Matthew Honnibal
83acaa0358 Add missing name attribute for parser 2018-05-15 19:01:53 +02:00
Matthew Honnibal
f328c195ca Fix size limits in training data 2018-05-15 19:01:41 +02:00
Matthew Honnibal
8446b35ce0 Fix parser model loading 2018-05-15 18:43:46 +02:00
Matthew Honnibal
dc1a479fbd Merge branch 'develop' into feature/refactor-parser 2018-05-15 18:39:21 +02:00
Matthew Honnibal
546dd99cdf Merge master into develop -- mostly Arabic and website 2018-05-15 18:14:28 +02:00
Matthew Honnibal
5664ab7e6c Revert hacks to tests 2018-05-15 18:00:09 +02:00
Matthew Honnibal
7b9195657b Restore beam_density argument for parser beam 2018-05-15 17:55:11 +02:00
Matthew Honnibal
581d318971 Fix conftest 2018-05-15 00:54:45 +02:00
Tahar Zanouda
00417794d3 Add Arabic language (#2314)
* added support for Arabic lang

* added Arabic language support

* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses
0e08e49e87 Lemmatizer ro (#2319)
* Add Romanian lemmatizer lookup table.

Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).

The original dataset is licensed under the Open Database License.

* Fix one blatant issue in the Romanian lemmatizer

* Romanian examples file

* Add ro_tokenizer in conftest

* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Matthew Honnibal
887631ca25 Disable some tests to figure out why CI fails 2018-05-10 16:42:01 +02:00
Matthew Honnibal
902a172cb7 Disable some tests to figure out why CI fails 2018-05-10 16:30:07 +02:00
Matthew Honnibal
614d45ea58 Set a more aggressive threshold on the max violn update 2018-05-10 15:38:24 +02:00
Matthew Honnibal
8e8724b55b Default to beam_update_prob 1 2018-05-10 15:38:02 +02:00
Jani Monoses
42b34832e4 Update Romanian stopword list (#2316)
* Contributor agreement for janimo

* Update Romanian stopword list

Include the correct spellings of all the words already in the repo
that are using cedillas (ş and ţ) instead of commas (ș and ț).

Add another unrelated spelling fix.

See https://github.com/stopwords-iso/stopwords-ro/pull/1 and
https://github.com/stopwords-iso/stopwords-ro/pull/2
2018-05-10 12:16:56 +02:00
Lucas Abbade
be7fdc59d1 Update lex_attrs.py (#2307)
* Update lex_attrs.py

Fixed spelling mistakes of some numbers (according to Brazilian Portuguese).

* Update lex_attrs.py

As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese.

I will advise however, that the two are separated in the future. Brazilian Portuguese is a very different language from the original one, although most of the writing is unified, the way people talk in both countries is radically different. Keeping both languages as one may lead to bigger issues in the future, especially when it comes to spell checking.
2018-05-09 20:49:31 +02:00
mauryaland
5368ba028a Update stop_words.py for French language (#2310)
* Add contraction forms of some common stopwords

All the stopwords added contain the apostrophe" ' "or " ’ ".

* Adds contributor agreement mauryaland

* Update mauryaland.md
2018-05-09 12:04:38 +02:00
Matthew Honnibal
a61fd60681 Fix error in beam gradient calculation 2018-05-09 02:44:09 +02:00
Matthew Honnibal
a6ae1ee6f7 Don't modify Token in global scope 2018-05-09 00:43:00 +02:00
Matthew Honnibal
f94f721f40 Avoid importing fused token symbol in ud-run-test, untl that's added 2018-05-09 00:28:03 +02:00
Matthew Honnibal
659ec5b975 Avoid importing fused token symbol in ud-run-test, untl that's added 2018-05-08 19:40:33 +02:00
Matthew Honnibal
4cb0494bef Bug fixes to beam search after refactor 2018-05-08 13:48:50 +02:00
Matthew Honnibal
5ed71973b3 Add a keyword argument sink to GoldParse 2018-05-08 13:48:32 +02:00
Matthew Honnibal
8cfe326f87 Avoid relying on final gold check in beam search 2018-05-08 13:48:19 +02:00
Matthew Honnibal
fc4dd49b77 Support oracle segmentation in ud-train CLI command 2018-05-08 13:47:45 +02:00
Matthew Honnibal
c49e44349a Fix beam parsing 2018-05-08 02:53:24 +02:00
Matthew Honnibal
99649d114d Fix parser 2018-05-08 00:27:26 +02:00