spaCy/website
Kevin Humphreys 19650ebb52
Enable fuzzy text matching in Matcher (#11359)
* enable fuzzy matching

* add fuzzy param to EntityMatcher

* include rapidfuzz_capi

not yet used

* fix type

* add FUZZY predicate

* add fuzzy attribute list

* fix type properly

* tidying

* remove unnecessary dependency

* handle fuzzy sets

* simplify fuzzy sets

* case fix

* switch to FUZZYn predicates

use Levenshtein distance.
remove fuzzy param.
remove rapidfuzz_capi.

* revert changes added for fuzzy param

* switch to polyleven

(Python package)

* fuzzy match only on oov tokens

* remove polyleven

* exclude whitespace tokens

* don't allow more edits than characters

* fix min distance

* reinstate FUZZY operator

with length-based distance function

* handle sets inside regex operator

* remove is_oov check

* attempt build fix

no mypy failure locally

* re-attempt build fix

* don't overwrite fuzzy param value

* move fuzzy_match

to its own Python module to allow patching

* move fuzzy_match back inside Matcher

simplify logic and add tests

* Format tests

* Parametrize fuzzyn tests

* Parametrize and merge fuzzy+set tests

* Format

* Move fuzzy_match to a standalone method

* Change regex kwarg type to bool

* Add types for fuzzy_match

- Refactor variable names
- Add test for symmetrical behavior

* Parametrize fuzzyn+set tests

* Minor refactoring for fuzz/fuzzy

* Make fuzzy_match a Matcher kwarg

* Update type for _default_fuzzy_match

* don't overwrite function param

* Rename to fuzzy_compare

* Update fuzzy_compare default argument declarations

* allow fuzzy_compare override from EntityRuler

* define new Matcher keyword arg

* fix type definition

* Implement fuzzy_compare config option for EntityRuler and SpanRuler

* Rename _default_fuzzy_compare to fuzzy_compare, remove from reexported objects

* Use simpler fuzzy_compare algorithm

* Update types

* Increase minimum to 2 in fuzzy_compare to allow one transposition

* Fix predicate keys and matching for SetPredicate with FUZZY and REGEX

* Add FUZZY6..9

* Add initial docs

* Increase default fuzzy to rounded 30% of pattern length

* Update docs for fuzzy_compare in components

* Update EntityRuler and SpanRuler API docs

* Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare

To keep naming similar to `phrase_matcher_attr`, rename the
`fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to
`matcher_fuzzy_compare`. Organize next to `phrase_matcher_attr` in docs.

* Fix schema aliases

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typo

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add FUZZY6-9 operators and update tests

* Parameterize test over greedy

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type for fuzzy_compare to remove Optional

* Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein

* Update docs following levenshtein_compare renaming

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-10 10:36:17 +01:00
docs Enable fuzzy text matching in Matcher (#11359) 2023-01-10 10:36:17 +01:00
meta Add spacy-pythainlp (#12038) 2023-01-03 17:03:59 +09:00
setup Remove now-built-in jinja2>=3.1.0 extensions 2022-03-25 14:29:33 +01:00
src Update custom solutions links (#11903) 2022-12-07 16:02:09 +01:00
.eslintrc Tidy up website and add eslint config [ci skip] 2019-03-12 15:21:58 +01:00
.prettierrc 💫 Update website (#3285) 2019-02-17 19:31:19 +01:00
Dockerfile Docker Image for Website Dev (#10098) 2022-01-20 23:02:13 +01:00
gatsby-browser.js Merge branch 'spacy.io' into develop [ci skip] 2019-02-26 16:51:22 +01:00
gatsby-config.js Fix icon [ci skip] 2021-01-30 18:27:55 +11:00
gatsby-node.js Remove docs references to starters for now (see #6262) [ci skip] 2020-10-16 15:46:34 +02:00
package-lock.json Update GitHub link [ci skip] 2021-02-02 14:27:46 +11:00
package.json Update GitHub link [ci skip] 2021-02-02 14:27:46 +11:00
README.md Update README.md 2022-11-27 03:47:11 +01:00
runtime.txt Fix website deployment [ci skip] 2020-08-24 14:28:24 +02:00
UNIVERSE.md Update UNIVERSE.md (#9941) 2021-12-27 13:46:04 +01:00

spacy.io website and docs

The styleguide for the spaCy website is available at spacy.io/styleguide.

Setup and installation

Before running the setup, make sure your versions of Node and npm are up to date. Node v10.15 or later is required.
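
A quick way to check the installed Node version against the v10.15 minimum is sketched below. The helper name `version_at_least` is hypothetical and not part of this repository, and the comparison assumes a `sort` that supports `-V` (GNU coreutils or a recent BSD/macOS `sort`).

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Assumes plain dotted numeric versions and a `sort` with -V support.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# `node --version` prints something like "v18.12.1"; strip the leading "v".
if command -v node >/dev/null 2>&1; then
  node_version="$(node --version | sed 's/^v//')"
  if version_at_least "$node_version" "10.15"; then
    echo "Node $node_version is recent enough"
  else
    echo "Node $node_version is older than 10.15, please upgrade"
  fi
else
  echo "Node is not installed"
fi
```

The script only reports the result and changes nothing on the system.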

# Clone the repository
git clone https://github.com/explosion/spaCy
cd spaCy/website

# Install Gatsby's command-line tool
npm install --global gatsby-cli

# Install the dependencies
npm install

# Start the development server
npm run dev

If you are planning on making edits to the site, you should also set up the Prettier code formatter. It takes care of formatting Markdown and other files automatically. See the Prettier documentation for the available extensions for your code editor. The .prettierrc file in the root defines the settings used in this codebase.

Building & developing the site with Docker

Sometimes it's hard to get a local environment working due to rapid updates to Node dependencies, so it may be easier to use Docker to build the docs.

If you'd like to do this, make sure your local node_modules folder is not included, since some dependencies need to be built for the system inside the image. Rename the folder before running the container.
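
One way to move the local node_modules folder out of the way is sketched below. The function name `stash_node_modules` and the backup name `node_modules.local` are arbitrary choices for illustration, not conventions of this repository.

```shell
# Hypothetical helper: rename node_modules in the given directory
# (default: current directory) so it is no longer picked up.
stash_node_modules() {
  dir="${1:-.}"
  if [ ! -d "$dir/node_modules" ]; then
    echo "No node_modules folder in $dir, nothing to do"
    return 0
  fi
  # Refuse to clobber an earlier backup.
  if [ -e "$dir/node_modules.local" ]; then
    echo "$dir/node_modules.local already exists, not overwriting" >&2
    return 1
  fi
  mv "$dir/node_modules" "$dir/node_modules.local"
  echo "Moved $dir/node_modules to $dir/node_modules.local"
}
```

Run `stash_node_modules` from the website directory before starting the container, and rename the folder back afterwards if you want to return to a local setup.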

docker run -it \
  -v $(pwd):/spacy-io/website \
  -p 8000:8000 \
  ghcr.io/explosion/spacy-io \
  gatsby develop -H 0.0.0.0

This serves the site at http://0.0.0.0:8000/ in your browser, and changes you make in your editor are reflected in the running site.

Note: If you're working on a Mac with an M1 processor, you might see segfault errors from qemu if you use the default image. To fix this, use the arm64-tagged image in the docker run command (ghcr.io/explosion/spacy-io:arm64).
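
As a rough sketch, the image can be chosen from the machine architecture reported by `uname -m`. The function name `spacy_io_image` is hypothetical, and treating `aarch64` the same as `arm64` is an assumption; the note above only mentions M1 Macs.

```shell
# Hypothetical helper: print the image name for a given architecture string.
spacy_io_image() {
  case "$1" in
    # "arm64" is what M1 Macs report; "aarch64" is an assumed equivalent.
    arm64|aarch64) echo "ghcr.io/explosion/spacy-io:arm64" ;;
    *)             echo "ghcr.io/explosion/spacy-io" ;;
  esac
}

# Print the full command for this machine instead of running it,
# so it can be reviewed before use.
echo "docker run -it -v \$(pwd):/spacy-io/website -p 8000:8000 $(spacy_io_image "$(uname -m)") gatsby develop -H 0.0.0.0"
```

The script only prints the command; copy it into the terminal from the website directory to start the container.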

Building the Docker image

If you'd like to build the image locally, you can do so like this:

docker build -t spacy-io .

Building the image takes some time, so using the prebuilt image is faster.

Project structure

├── docs                 # the actual markdown content
├── meta                 # JSON-formatted site metadata
|   ├── languages.json   # supported languages and statistical models
|   ├── sidebars.json    # sidebar navigations for different sections
|   ├── site.json        # general site metadata
|   ├── type-annotations.json # Type annotations
|   └── universe.json    # data for the spaCy universe section
├── public               # compiled site
├── setup                # Jinja setup
├── src                  # source
|   ├── components       # React components
|   ├── fonts            # webfonts
|   ├── images           # images used in the layout
|   ├── plugins          # custom plugins to transform Markdown
|   ├── styles           # CSS modules and global styles
|   ├── templates        # page layouts
|   |   ├── docs.js      # layout template for documentation pages
|   |   ├── index.js     # global layout template
|   |   ├── models.js    # layout template for model pages
|   |   └── universe.js  # layout templates for universe
|   └── widgets          # non-reusable components with content, e.g. changelog
├── .eslintrc            # ESLint config file
├── .prettierrc          # Prettier config file
├── gatsby-browser.js    # browser-specific hooks for Gatsby
├── gatsby-config.js     # Gatsby configuration
├── gatsby-node.js       # Node-specific hooks for Gatsby
└── package.json         # package settings and dependencies