Commit Graph

13010 Commits

Author SHA1 Message Date
Matthew Honnibal
7be8a0516a Fix project pull 2020-09-03 18:54:03 +02:00
Ines Montani
b1eb98b15c Remove todos [ci skip] 2020-09-03 17:43:58 +02:00
Ines Montani
23b7d9cfa3 Prefix span getters 2020-09-03 17:37:06 +02:00
Ines Montani
5afe6447cd registry.assets -> registry.misc 2020-09-03 17:31:14 +02:00
Ines Montani
c063e55eb7 Add prefix to batchers 2020-09-03 17:30:41 +02:00
Ines Montani
804f120361 Don't use registered function version in title 2020-09-03 17:29:47 +02:00
Ines Montani
896caf45e3
Merge pull request #6023 from explosion/ux/model-terminology-consistency [ci skip] 2020-09-03 17:13:44 +02:00
Ines Montani
c53b1433b9 Adjust more arguments [ci skip] 2020-09-03 17:12:24 +02:00
Ines Montani
121809dd1e Fix anchor [ci skip] 2020-09-03 16:49:56 +02:00
Ines Montani
25a595dc10 Fix typos and wording [ci skip] 2020-09-03 16:37:45 +02:00
Ines Montani
b5a0657fd6 "model" terminology consistency in docs 2020-09-03 13:13:03 +02:00
Matthew Honnibal
f038841798 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-03 12:52:39 +02:00
Matthew Honnibal
ef0d0630a4 Let Langugae.use_params work with falsey inputs
The Language.use_params method was failing if you passed in None, which
meant we had to use awkward conditionals for the parameter averaging.
This solves the problem.
2020-09-03 12:51:04 +02:00
Ines Montani
b02ad8045b Update docs [ci skip] 2020-09-03 10:10:13 +02:00
Yohei Tamura
5af432e0f2
fix for empty string (#5936) 2020-09-03 10:09:03 +02:00
Ines Montani
1815c613c9 Update docs [ci skip] 2020-09-03 10:07:45 +02:00
Ines Montani
6f46d4e4d2
Merge pull request #6017 from svlandeg/feature/docs-layers [ci skip] 2020-09-03 10:03:23 +02:00
Adriane Boyd
77ac4a38aa
Simplify specials and cache checks (#6012) 2020-09-03 09:42:49 +02:00
Adriane Boyd
8b5594df86 Remove near-duplicate test 2020-09-02 20:32:01 +02:00
Matthew Honnibal
122cb02001 Fix averages 2020-09-02 19:37:43 +02:00
Adriane Boyd
960d9cfadc Officially support DependencyMatcher
Add official support for the `DependencyMatcher`. Redesign the pattern
specification. Fix and extend operator implementations. Update API docs
and add usage docs.

Patterns
--------

Refactor pattern structure to:

```
{
  "LEFT_ID": str,
  "REL_OP": str,
  "RIGHT_ID": str,
  "RIGHT_ATTRS": dict,
}
```

The first node contains only `RIGHT_ID` and `RIGHT_ATTRS` and all
subsequent nodes contain all four keys.

New operators
-------------

Because of the way patterns are constructed from left to right, it's
helpful to have `follows` operators along with `precedes` operators. Add
operators for simple precedes / follows alongside immediate precedes /
follows.

* `.*`: precedes
* `;`: immediately follows
* `;*`: follows

Operator fixes
--------------

* `<` and `<<` do not include the node itself
* Fix reversed order for all operators involving linear precedence (`.`,
  all sibling operators)
* Linear precedence operators do not match nodes outside the same parse

Additional fixes
----------------

* Use v3 Matcher API
* Support `get` and `remove`
* Support pickling
2020-09-02 17:45:29 +02:00
svlandeg
ab909a3f68 Merge branch 'feature/docs-layers' of https://github.com/svlandeg/spaCy into feature/docs-layers 2020-09-02 17:44:00 +02:00
svlandeg
cda45dd1ab Merge remote-tracking branch 'upstream/develop' into feature/docs-layers 2020-09-02 17:43:45 +02:00
svlandeg
19298de352 small fix 2020-09-02 17:43:11 +02:00
svlandeg
bbaea530f6 sublayers paragraph 2020-09-02 17:36:22 +02:00
svlandeg
1be7ff02a6 swapping section 2020-09-02 15:26:07 +02:00
Marek Grzenkowicz
92d7832a86
Fix off-by-one error for best iteration calculation (closes #6014) (#6016) 2020-09-02 15:15:45 +02:00
Matthew Honnibal
737a1408d9 Improve implementation of fix #6010
Follow-ups to the parser efficiency fix.

* Avoid introducing new counter for number of pushes
* Base cut on number of transitions, keeping it more even
* Reintroduce the randomization we had in v2.
2020-09-02 14:42:32 +02:00
svlandeg
57e432ba2a editor tip as Accordion instead of Infobox 2020-09-02 14:26:57 +02:00
svlandeg
d19ec6c67b small rewrites in types paragraph 2020-09-02 14:25:18 +02:00
svlandeg
821b2d4e63 update examples 2020-09-02 14:15:50 +02:00
svlandeg
e29a33449d rewrite intro, simpel Model example 2020-09-02 13:41:18 +02:00
svlandeg
422df9c2e2 Merge remote-tracking branch 'upstream/develop' into feature/docs-layers
# Conflicts:
#	website/docs/usage/layers-architectures.md
2020-09-02 13:17:11 +02:00
Sofie Van Landeghem
eb56377799
Fix overfitting test (#6011)
* remove unused MORPH_RULES

* fix textcat architecture in overfitting test
2020-09-02 13:07:41 +02:00
Adriane Boyd
b97d98783a
Fix Hungarian % tokenization (#6013) 2020-09-02 13:06:16 +02:00
Ines Montani
70238543c8 Update layers/arch docs structure [ci skip] 2020-09-02 13:04:35 +02:00
Matthew Honnibal
c1bf3a5602
Fix significant performance bug in parser training (#6010)
The parser training makes use of a trick for long documents, where we
use the oracle to cut up the document into sections, so that we can have
batch items in the middle of a document. For instance, if we have one
document of 600 words, we might make 6 states, starting at words 0, 100,
200, 300, 400 and 500.

The problem is for v3, I screwed this up and didn't stop parsing! So
instead of a batch of [100, 100, 100, 100, 100, 100], we'd have a batch
of [600, 500, 400, 300, 200, 100]. Oops.

The implementation here could probably be improved, it's annoying to
have this extra variable in the state. But this'll do.

This makes the v3 parser training 5-10 times faster, depending on document
lengths. This problem wasn't in v2.
2020-09-02 12:57:13 +02:00
svlandeg
474abb2e59 remove unused MORPH_RULES from test 2020-09-02 11:37:56 +02:00
svlandeg
6fd7f140ec custom-architectures section 2020-09-02 11:14:06 +02:00
svlandeg
3d9ae9286f small fixes 2020-09-02 10:46:38 +02:00
Sofie Van Landeghem
f7a25d69f7
Bugfix in merge_entities (#6005)
* failing test

* bugfix
2020-09-01 21:57:52 +02:00
Sofie Van Landeghem
6bfb1b3a29
Fix sparse checkout for 'spacy project' (#6008)
* exit if cloning fails

* UX

* rewrite http link to git protocol, don't use stdin

* fixes to sparse checkout

* formatting
2020-09-01 19:49:01 +02:00
Matthew Honnibal
4cce32f090 Fix tagger initialization 2020-09-01 16:38:34 +02:00
Matthew Honnibal
046c38bd26
Remove 'cleanup' of strings (#6007)
A long time ago we went to some trouble to try to clean up "unused"
strings, to avoid the `StringStore` growing in long-running processes.

This never really worked reliably, and I think it was a really wrong
approach. It's much better to let the user reload the `nlp` object as
necessary, now that the string encoding is stable (in v1, the string IDs
were sequential integers, making reloading the NLP object really
annoying.)

The extra book-keeping does make some performance difference, and the
feature is unsed, so it's past time we killed it.
2020-09-01 16:12:15 +02:00
Ines Montani
690bd77669 Add todos [ci skip] 2020-09-01 14:04:36 +02:00
Ines Montani
70b226f69d Support ignore marker in project document [ci skip] 2020-09-01 12:49:04 +02:00
Ines Montani
a4c51f0f18 Add v3 info to project docs [ci skip] 2020-09-01 12:36:21 +02:00
Ines Montani
ef9005273b Update fill-config command and add silent mode [ci skip] 2020-09-01 12:07:04 +02:00
Matthew Honnibal
027c82c068 Update makefile 2020-09-01 01:22:54 +02:00
Matthew Honnibal
bff1640a75 Try to debug tmpdir problem 2020-09-01 01:13:09 +02:00