Gregory Howard
f2ab7d77b4
Lazy imports language
2017-05-03 11:01:42 +02:00
Gregory Howard
92f368f83b
Removing extra spaces
2017-04-27 12:02:14 +02:00
Gregory Howard
8ff4682255
correcting tokenizer exception.
...
Adding tests for lemmatization
2017-04-27 11:52:14 +02:00
Gregory Howard
ad8129cb45
Improve rules: now title-insensitive, with a consistent declaration format
2017-04-27 10:23:56 +02:00
Gregory Howard
ed5f094451
Adding insensitive lemmatisation test
2017-04-25 18:07:02 +02:00
ghoward
55c6910f90
Lookup table for languages in spaCy.
...
Need to find another name for lemmatizerlookup. I was not inspired.
Trying to use the new files in the fr language.
2017-04-24 16:39:00 +02:00
Ben Eyal
d8098a8be2
Use regex instead of re
2017-04-20 02:22:52 +03:00
ines
66c1f194f9
Use consistent unicode declarations
2017-03-12 13:07:28 +01:00
Matthew Honnibal
bd4375a2e6
Remove comment
2017-02-27 11:44:26 +01:00
Matthew Honnibal
e7e22d8be6
Move import within get_exceptions() function, to speed import
2017-02-27 11:34:48 +01:00
Matthew Honnibal
26446aa728
Avoid loading all French exceptions on import
...
Move exceptions loading behind a get_tokenizer_exceptions() function
for French, instead of loading into the top-level namespace. This
cuts import times from 0.6s to 0.2s, at the expense of making the
French data a little different from the others (there's no top-level
TOKENIZER_EXCEPTIONS variable.) The current solution feels somewhat
unsatisfying.
2017-02-25 11:55:00 +01:00
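The deferred-loading idea described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: only the name `get_tokenizer_exceptions` comes from the message, while `_build_exceptions`, `_EXCEPTIONS_CACHE`, the sample entry, and the caching step are assumptions added for the example.

```python
# Hypothetical sketch of loading exceptions on demand instead of at import time.
# Only get_tokenizer_exceptions() is named in the commit message; the helper,
# the cache variable and the sample data are illustrative assumptions.

_EXCEPTIONS_CACHE = None  # nothing is built when the module is imported


def _build_exceptions():
    # Stand-in for the expensive step (building a large exceptions table).
    return {"aujourd'hui": [{"ORTH": "aujourd'hui"}]}


def get_tokenizer_exceptions():
    """Build the exceptions on first call and reuse them afterwards."""
    global _EXCEPTIONS_CACHE
    if _EXCEPTIONS_CACHE is None:
        _EXCEPTIONS_CACHE = _build_exceptions()
    return _EXCEPTIONS_CACHE


if __name__ == "__main__":
    exceptions = get_tokenizer_exceptions()
    print(len(exceptions))
```

The trade-off matches the one noted in the message: importing the module stays cheap, but callers have to go through a function rather than a top-level TOKENIZER_EXCEPTIONS variable.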
ines
0e2e331b58
Convert exceptions to Python list
2017-02-24 18:22:40 +01:00
ines
f08e180a47
Make groups non-capturing
...
Prevents hitting the 100 named groups limit in Python
2017-02-10 13:35:02 +01:00
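To illustrate the change in the commit above: older versions of Python's `re` module refused to compile patterns with more than 100 capturing groups, so a large alternation built from one group per exception could hit that limit. The sketch below is an assumption-laden illustration, with a made-up word list rather than the real French data; it shows that `(?:...)` groups do not count toward the pattern's group total.

```python
import re

# Illustrative word list; the real exception data is much larger.
words = ["aujourd'hui", "grand'mère", "prud'homme"]

# One capturing group per alternative: each one counts toward the group total.
capturing = re.compile("|".join("({})".format(re.escape(w)) for w in words))

# Non-capturing groups match the same strings but are not counted.
non_capturing = re.compile("|".join("(?:{})".format(re.escape(w)) for w in words))

print(capturing.groups)      # 3
print(non_capturing.groups)  # 0
print(bool(non_capturing.match("prud'homme")))  # True
```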
ines
fa3b8512da
Use consistent imports and exports
...
Bundle everything in language_data to keep it consistent with other
languages and make TOKENIZER_EXCEPTIONS importable from there.
2017-02-10 13:34:09 +01:00
ines
21f09d10d7
Revert "Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions""
...
This reverts commit f02a2f9322.
2017-02-10 13:17:05 +01:00
ines
f02a2f9322
Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions"
...
This reverts commit b95afdf39c, reversing
changes made to b0ccf32378.
2017-02-09 17:07:21 +01:00
Raphaël Bournhonesque
5d706ab95d
Merge tokenizer exceptions from PR #802
2017-02-09 16:30:28 +01:00
Raphaël Bournhonesque
85f951ca99
Add tokenizer exceptions for French
2017-02-02 08:36:16 +01:00
Raphaël Bournhonesque
1faaf698ca
Add infixes and abbreviation exceptions (fr)
2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
cf8474401b
Remove unused import statement
2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
902f136f18
Add support for elision in French
2017-01-24 10:57:37 +01:00
Ines Montani
0dec90e9f7
Use global abbreviation data across languages and remove duplicates
2017-01-08 20:36:00 +01:00
Ines Montani
2b2ea8ca11
Reorganise language data
2016-12-18 16:54:19 +01:00
Ines Montani
e0a7b5c612
Fix formatting
2016-12-17 12:33:09 +01:00
Ines Montani
08162dce67
Move shared functions and constants to global language data
2016-12-17 12:32:48 +01:00
Ines Montani
6a60a61086
Move update_exc to global language data utils
2016-12-17 12:29:02 +01:00
Ines Montani
487ce1e20a
Add encoding declaration
2016-12-17 12:25:44 +01:00
Ines Montani
1b3b043660
Add French stopwords
2016-12-08 20:12:43 +01:00
Ines Montani
8863e504eb
Update French language data
2016-12-08 20:07:14 +01:00
Matthew Honnibal
3d4bd96e8a
Fix infixes in French
2016-11-02 20:41:43 +01:00
Matthew Honnibal
ad1c747c6b
Fix stray POS in language stubs
2016-11-02 20:37:55 +01:00
Matthew Honnibal
6dbf4f7ad7
Stub out support for French, Spanish, Italian and Portuguese
2016-11-02 20:02:41 +01:00