"description":"greCy offers state-of-the-art pipelines for ancient Greek NLP. The repository makes language models available in various sizes, some of them containing floret word vectors and a BERT transformer layer.",
"github":"jmyerston/greCy",
"code_example":[
"import spacy",
"#After installing the grc_ud_proiel_trf wheel package from the greCy repository",
" description='Paris is located in northern central France, in a north-bending arc of the river Seine'),",
" Entity(name='IBM',",
" description='International Business Machines Corporation (IBM) is an American multinational technology corporation headquartered in Armonk, New York'),",
" Entity(name='New York', description='New York is a city in U.S. state'),",
" Entity(name='Florida', description='southeasternmost U.S. state'),",
" Entity(name='American',",
" description='American, something of, from, or related to the United States of America, commonly known as the United States or America'),",
" Entity(name='Chemical formula',",
" description='In chemistry, a chemical formula is a way of presenting information about the chemical proportions of atoms that constitute a particular chemical compound or molecul'),",
" Entity(name='Acetamide',",
" description='Acetamide (systematic name: ethanamide) is an organic compound with the formula CH3CONH2. It is the simplest amide derived from acetic acid. It finds some use as a plasticizer and as an industrial solvent.'),",
" Entity(name='Armonk',",
" description='Armonk is a hamlet and census-designated place (CDP) in the town of North Castle, located in Westchester County, New York, United States.'),",
" Entity(name='Acetic Acid',",
" description='Acetic acid, systematically named ethanoic acid, is an acidic, colourless liquid and organic compound with the chemical formula CH3COOH'),",
" Entity(name='Industrial solvent',",
" description='Acetamide (systematic name: ethanamide) is an organic compound with the formula CH3CONH2. It is the simplest amide derived from acetic acid. It finds some use as a plasticizer and as an industrial solvent.'),",
"slogan":"Aim-spaCy is an Aim-based spaCy experiment tracker.",
"description":"Aim-spaCy helps to easily collect, store and explore training logs for spaCy, including: hyper-parameters, metrics and displaCy visualizations",
"slogan":"Generates interactive reports for spaCy models.",
"description":"The goal of spacy-report is to offer static reports for spaCy models that help users make better decisions on how the models can be used.",
"slogan":"Remove personally identifiable information from text using spaCy.",
"description":"scrubadub removes personally identifiable information from text. scrubadub_spacy is an extension that uses spaCy NLP models to remove personal information from text.",
"print(scrubber.clean(\"My name is Alex, I work at LifeGuard in London, and my eMail is alex@lifeguard.com btw. my super secret twitter login is username: alex_2000 password: g-dragon180888\"))",
"# My name is {{NAME}}, I work at {{ORGANIZATION}} in {{LOCATION}}, and my eMail is {{EMAIL}} btw. my super secret twitter login is username: {{USERNAME}} password: {{PASSWORD}}"
]
},
{
"id":"spacy-setfit-textcat",
"title":"spacy-setfit-textcat",
"category":["research"],
"tags":["SetFit","Few-Shot"],
"slogan":"spaCy Project: Experiments with SetFit & Few-Shot Classification",
"description":"This project is an experiment with spaCy and few-shot text classification using SetFit",
"slogan":"Cutting-edge experimental spaCy components and features",
"description":"This package includes experimental components and features for spaCy v3.x, for example model architectures, pipeline components and utilities.",
"slogan":"Easy PDF to text to spaCy text extraction in Python.",
"description":"*spacypdfreader* is a Python library that allows you to convert PDF files directly into *spaCy* `Doc` objects. The library provides several built in parsers or bring your own parser. `Doc` objects are annotated with several custom attributes including: `token._.page_number`, `doc._.page_range`, `doc._.first_page`, `doc._.last_page`, `doc._.pdf_file_name`, and `doc._.page(int)`.",
"slogan":"Production-ready API for spaCy models in production",
"description":"A highly-available hosted API to easily deploy and use spaCy models in production. Supports NER, POS tagging, dependency parsing, and tokenization.",
"description":"eMFDscore is a library for the fast and flexible extraction of various moral information metrics from textual input data. eMFDscore is built on spaCy for faster execution and performs minimal preprocessing consisting of tokenization, syntactic dependency parsing, lower-casing, and stopword/punctuation/whitespace removal. eMFDscore lets users score documents with multiple Moral Foundations Dictionaries, provides various metrics for analyzing moral information, and extracts moral patient, agent, and attribute words related to entities.",
"description":"`skweak` brings the power of weak supervision to NLP tasks, and in particular sequence labelling and text classification. Instead of annotating documents by hand, `skweak` allows you to define *labelling functions* to automatically label your documents, and then aggregate their results using a statistical model that estimates the accuracy and confusions of each labelling function.",
"slogan":"Use DBpedia Spotlight to link entities inside SpaCy",
"description":"This library links SpaCy with [DBpedia Spotlight](https://www.dbpedia-spotlight.org/). You can easily get the DBpedia entities from your documents, using the public web service or by using your own instance of DBpedia Spotlight. The `doc.ents` are populated with the entities and all their details (URI, type, ...).",
"github":"MartinoMensio/spacy-dbpedia-spotlight",
"pip":"spacy-dbpedia-spotlight",
"code_example":[
"import spacy_dbpedia_spotlight",
"# load your model as usual",
"nlp = spacy.load('en_core_web_lg')",
"# add the pipeline stage",
"nlp.add_pipe('dbpedia_spotlight')",
"# get the document",
"doc = nlp('The president of USA is calling Boris Johnson to decide what to do about coronavirus')",
"# see the entities",
"print('Entities', [(ent.text, ent.label_, ent.kb_id_) for ent in doc.ents])",
"description":"spacytextblob is a pipeline component that enables sentiment analysis using the [TextBlob](https://github.com/sloria/TextBlob) library. It will add the additional extension `._.blob` to `Doc`, `Span`, and `Token` objects.",
"description":"This library lets you use the embeddings from [sentence-transformers](https://github.com/UKPLab/sentence-transformers) of Docs, Spans and Tokens directly from spaCy. Most models are for the english language but three of them are multilingual.",
"github":"MartinoMensio/spacy-sentence-bert",
"pip":"spacy-sentence-bert",
"code_example":[
"import spacy_sentence_bert",
"# load one of the models listed at https://github.com/MartinoMensio/spacy-sentence-bert/",
"slogan":"spaCy building blocks for Streamlit apps",
"github":"explosion/spacy-streamlit",
"description":"This package contains utilities for visualizing spaCy models and building interactive spaCy-powered apps with [Streamlit](https://streamlit.io). It includes various building blocks you can use in your own Streamlit app, like visualizers for **syntactic dependencies**, **named entities**, **text classification**, **semantic similarity** via word vectors, token attributes, and more.",
"description":"Spaczz provides fuzzy matching and multi-token regex matching functionality for spaCy. Spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.",
"slogan":"Make interactive visualisations to figure out 'what lies' in word embeddings.",
"description":"This small library offers tools to make visualisation easier of both word embeddings as well as operations on them. It has support for spaCy prebuilt models as a first class citizen but also offers support for sense2vec. There's a convenient API to perform linear algebra as well as support for popular transformations like PCA/UMAP/etc.",
"slogan":"Leveraging BERT and c-TF-IDF to create easily interpretable topics.",
"description":"BERTopic is a topic modeling technique that leverages embedding models and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. BERTopic supports guided, (semi-) supervised, hierarchical, and dynamic topic modeling.",
"slogan":"Connect vowpal-wabbit & scikit-learn models to spaCy to run simple classification benchmarks. Comes with many utility functions for spaCy pipelines.",
"slogan":"Use the latest Stanza (StanfordNLP) research models directly in spaCy",
"description":"This package wraps the Stanza (formerly StanfordNLP) library, so you can use Stanford's models as a spaCy pipeline. Using this wrapper, you'll be able to use the following annotations, computed by your pretrained `stanza` model:\n\n- Statistical tokenization (reflected in the `Doc` and its tokens)\n - Lemmatization (`token.lemma` and `token.lemma_`)\n - Part-of-speech tagging (`token.tag`, `token.tag_`, `token.pos`, `token.pos_`)\n - Dependency parsing (`token.dep`, `token.dep_`, `token.head`)\n - Named entity recognition (`doc.ents`, `token.ent_type`, `token.ent_type_`, `token.ent_iob`, `token.ent_iob_`)\n - Sentence segmentation (`doc.sents`)",
"slogan":"Use the latest UDPipe models directly in spaCy",
"description":"This package wraps the fast and efficient UDPipe language-agnostic NLP pipeline (via its Python bindings), so you can use UDPipe pre-trained models as a spaCy pipeline for 50+ languages out-of-the-box. Inspired by spacy-stanza, this package offers slightly less accurate models that are in turn much faster.",
"github":"TakeLab/spacy-udpipe",
"pip":"spacy-udpipe",
"code_example":[
"import spacy_udpipe",
"",
"spacy_udpipe.download(\"en\") # download English model",
"",
"text = \"Wikipedia is a free online encyclopedia, created and edited by volunteers around the world.\"",
"slogan":"\uD83E\uDD9C Containerized HTTP API for spaCy NLP",
"description":"For developers who need programming language agnostic NLP, spaCy Server is a containerized HTTP API that provides industrial-strength natural language processing. Unlike other servers, our server is fast, idiomatic, and well documented.",
"github":"neelkamath/spacy-server",
"code_example":[
"docker run --rm -dp 8080:8080 neelkamath/spacy-server",
"curl http://localhost:8080/ner -H 'Content-Type: application/json' -d '{\"sections\": [\"My name is John Doe. I grew up in California.\"]}'"
"description":"spaCy extension and pipeline component for adding emoji meta data to `Doc` objects. Detects emoji consisting of one or more unicode characters, and can optionally merge multi-char emoji (combined pictures, emoji with skin tone modifiers) into one token. Human-readable emoji descriptions are added as a custom attribute, and an optional lookup table can be provided for your own descriptions. The extension sets the custom `Doc`, `Token` and `Span` attributes `._.is_emoji`, `._.emoji_desc`, `._.has_emoji` and `._.emoji`.",
"slogan":"Add text readability meta data to Doc objects",
"description":"spaCy v2.0 pipeline component for calculating readability scores of of text. Provides scores for Flesh-Kincaid grade level, Flesh-Kincaid reading ease, and Dale-Chall.",
"github":"mholtzscher/spacy_readability",
"pip":"spacy-readability",
"code_example":[
"import spacy",
"from spacy_readability import Readability",
"",
"nlp = spacy.load('en')",
"read = Readability(nlp)",
"nlp.add_pipe(read, last=True)",
"doc = nlp(\"I am some really difficult text to read because I use obnoxiously large words.\")",
"description":"spaCy-CLD operates on `Doc` and `Span` spaCy objects. When called on a `Doc` or `Span`, the object is given two attributes: `languages` (a list of up to 3 language codes) and `language_scores` (a dictionary mapping language codes to confidence scores between 0 and 1).\n\nspacy-cld is a little extension that wraps the [PYCLD2](https://github.com/aboSamoor/pycld2) Python library, which in turn wraps the [Compact Language Detector 2](https://github.com/CLD2Owners/cld2) C library originally built at Google for the Chromium project. CLD2 uses character n-grams as features and a Naive Bayes classifier to identify 80+ languages from Unicode text strings (or XML/HTML). It can detect up to 3 different languages in a given document, and reports a confidence score (reported in with each language.",
"description":"This package uses the [spaCy 2.0 extensions](https://spacy.io/usage/processing-pipelines#extensions) to add [IWNLP-py](https://github.com/Liebeck/iwnlp-py) as German lemmatizer directly into your spaCy pipeline.",
"description":"This package uses the [spaCy 2.0 extensions](https://spacy.io/usage/processing-pipelines#extensions) to add [SentiWS](http://wortschatz.uni-leipzig.de/en/download) as German sentiment score directly into your spaCy pipeline.",
"slogan":"POS and French lemmatization with Lefff",
"description":"spacy v2.0 extension and pipeline component for adding a French POS and lemmatizer based on [Lefff](https://hal.inria.fr/inria-00521242/).",
"description":"Lemmy is a lemmatizer for Danish 🇩🇰 . It comes already trained on Dansk Sprognævns (DSN) word list (‘fuldformliste’) and the Danish Universal Dependencies and is ready for use. Lemmy also supports training on your own dataset. The model currently included in Lemmy was evaluated on the Danish Universal Dependencies dev dataset and scored an accruacy > 99%.\n\nYou can use Lemmy as a spaCy extension, more specifcally a spaCy pipeline component. This is highly recommended and makes the lemmas easily accessible from the spaCy tokens. Lemmy makes use of POS tags to predict the lemmas. When wired up to the spaCy pipeline, Lemmy has the benefit of using spaCy’s builtin POS tagger.",
"github":"sorenlind/lemmy",
"pip":"lemmy",
"code_example":[
"import da_custom_model as da # name of your spaCy model",
"import lemmy.pipe",
"nlp = da.load()",
"",
"# create an instance of Lemmy's pipeline component for spaCy",
"pipe = lemmy.pipe.load()",
"",
"# add the comonent to the spaCy pipeline.",
"nlp.add_pipe(pipe, after='tagger')",
"",
"# lemmas can now be accessed using the `._.lemma` attribute on the tokens",
"slogan":"The cherry on top of your NLP pipeline",
"description":"Augmenty is an augmentation library based on spaCy for augmenting texts. Augmenty differs from other augmentation libraries in that it corrects (as far as possible) the token, sentence and document labels under the augmentation.",
"github":"kennethenevoldsen/augmenty",
"pip":"augmenty",
"code_example":[
"import spacy",
"import augmenty",
"",
"nlp = spacy.load('en_core_web_md')",
"",
"docs = nlp.pipe(['Augmenty is a great tool for text augmentation'])",
"description":"DaCy is a Danish preprocessing pipeline trained in SpaCy. It has achieved State-of-the-Art performance on Named entity recognition, part-of-speech tagging and dependency parsing for Danish. This repository contains material for using the DaCy, reproducing the results and guides on usage of the package. Furthermore, it also contains a series of behavioural test for biases and robustness of Danish NLP pipelines.",
"slogan":"For Wrapping fine-tuned transformers in spaCy pipelines",
"description":"spaCy-wrap is a wrapper library for spaCy for including fine-tuned transformers from Huggingface in your spaCy pipeline allowing inclusion of existing models within existing workflows.",
"github":"kennethenevoldsen/spacy-wrap",
"pip":"spacy_wrap",
"code_example":[
"import spacy",
"import spacy_wrap",
"",
"nlp = spacy.blank('en')",
"config = {",
" 'doc_extension_trf_data': 'clf_trf_data', # document extention for the forward pass",
" 'doc_extension_prediction': 'sentiment', # document extention for the prediction",
"slogan":"Fast, flexible and transparent sentiment analysis",
"description":"Asent is a rule-based sentiment analysis library for Python made using spaCy. It is inspired by VADER, but uses a more modular ruleset, that allows the user to change e.g. the method for finding negations. Furthermore it includes visualisers to visualize the model predictions, making the model easily interpretable.",
"slogan":"State-of-the-art coreference resolution based on neural nets and spaCy v2",
"description":"This coreference resolution module is based on the super fast spaCy parser and uses the neural net scoring model described in [Deep Reinforcement Learning for Mention-Ranking Coreference Models](http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf) by Kevin Clark and Christopher D. Manning, EMNLP 2016. Since ✨Neuralcoref v2.0, you can train the coreference resolution system on your own dataset—e.g., another language than English! — **provided you have an annotated dataset**. Note that to use neuralcoref with spaCy > 2.1.0, you'll have to install neuralcoref from source, and v3+ is not supported.",
"slogan":"State-of-the-art coreference resolution based on neural nets and spaCy",
"description":"In short, coreference is the fact that two or more expressions in a text – like pronouns or nouns – link to the same person or thing. It is a classical Natural language processing task, that has seen a revival of interest in the past two years as several research groups applied cutting-edge deep-learning and reinforcement-learning techniques to it. It is also one of the key building blocks to building conversational Artificial intelligences.",
"url":"https://huggingface.co/coref/",
"image":"https://i.imgur.com/3yy4Qyf.png",
"thumb":"https://i.imgur.com/j6FO9O6.jpg",
"github":"huggingface/neuralcoref",
"category":["visualizers","conversational"],
"tags":["coref","chatbots"],
"author":"Hugging Face",
"author_links":{
"github":"huggingface"
}
},
{
"id":"matcher-explorer",
"title":"Rule-based Matcher Explorer",
"slogan":"Test spaCy's rule-based Matcher by creating token patterns interactively",
"description":"Test spaCy's rule-based `Matcher` by creating token patterns interactively and running them over your text. Each token can set multiple attributes like text value, part-of-speech tag or boolean flags. The token-based view lets you explore how spaCy processes your text – and why your pattern matches, or why it doesn't. For more details on rule-based matching, see the [documentation](https://spacy.io/usage/rule-based-matching).",
"slogan":"A modern syntactic dependency visualizer",
"description":"Visualize spaCy's guess at the syntactic structure of a sentence. Arrows point from children to heads, and are labelled by their relation type.",
"description":"Visualize spaCy's guess at the named entities in the document. You can filter the displayed types, to only show the annotations you're interested in.",
"slogan":"Beautiful visualizations of how language differs among document types",
"description":"A tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in a sexy, interactive scatter plot with non-overlapping term labels. Exploratory data analysis just got more fun.",
"slogan":"Conversational AI platform for deep-domain voice interfaces and chatbots",
"description":"The MindMeld Conversational AI platform is among the most advanced AI platforms for building production-quality conversational applications. It is a Python-based machine learning framework which encompasses all of the algorithms and utilities required for this purpose. (https://github.com/cisco/mindmeld)",
"slogan":"An open-source NLP research library, built on PyTorch and spaCy",
"description":"AllenNLP is a new library designed to accelerate NLP research, by providing a framework that supports modern deep learning workflows for cutting-edge language understanding problems. AllenNLP uses spaCy as a preprocessing component. You can also use Allen NLP to develop spaCy pipeline components, to add annotations to the `Doc` object.",
"github":"allenai/allennlp",
"pip":"allennlp",
"thumb":"https://i.imgur.com/U8opuDN.jpg",
"url":"http://allennlp.org",
"author":" Allen Institute for Artificial Intelligence",
"description":"`textacy` is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance `spacy` library. With the fundamentals – tokenization, part-of-speech tagging, dependency parsing, etc. – delegated to another library, `textacy` focuses on the tasks that come before and follow after.",
"description":"`textpipe` is a Python package for converting raw text in to clean, readable text and extracting metadata from that text. Its functionalities include transforming raw text into readable text by removing HTML tags and extracting metadata such as the number of words and named entities from the text.",
"slogan":"Full text geoparsing using spaCy, Geonames and Keras",
"description":"Extract the place names from a piece of text, resolve them to the correct place, and return their coordinates and structured geographic information.",
"github":"openeventdata/mordecai",
"pip":"mordecai",
"thumb":"https://i.imgur.com/gPJ9upa.jpg",
"code_example":[
"from mordecai import Geoparser",
"geo = Geoparser()",
"geo.geoparse(\"I traveled from Oxford to Ottawa.\")"
"slogan":"Biomedical relation extraction using spaCy",
"description":"Kindred is a package for relation extraction in biomedical texts. Given some training data, it can build a model to identify relations between entities (e.g. drugs, genes, etc) in a sentence.",
"description":"sense2vec ([Trask et. al](https://arxiv.org/abs/1511.06388), 2015) is a nice twist on [word2vec](https://en.wikipedia.org/wiki/Word2vec) that lets you learn more interesting, detailed and context-sensitive word vectors. For an interactive example of the technology, see our [sense2vec demo](https://explosion.ai/demos/sense2vec) that lets you explore semantic similarities across all Reddit comments of 2015.",
"txt <- c(d1 = \"spaCy excels at large-scale information extraction tasks.\",",
" d2 = \"Mr. Smith goes to North Carolina.\")",
"",
"# process documents and obtain a data.table",
"parsedtxt <- spacy_parse(txt)"
],
"code_language":"r",
"author":"Kenneth Benoit & Aki Matsuo",
"category":["nonpython"]
},
{
"id":"cleannlp",
"title":"CleanNLP",
"slogan":"A tidy data model for NLP in R",
"description":"The cleanNLP package is designed to make it as painless as possible to turn raw text into feature-rich data frames. the package offers four backends that can be used for parsing text: `tokenizers`, `udpipe`, `spacy` and `corenlp`.",
"github":"statsmaths/cleanNLP",
"cran":"cleanNLP",
"author":"Taylor B. Arnold",
"author_links":{
"github":"statsmaths"
},
"category":["nonpython"]
},
{
"id":"spacy-cpp",
"slogan":"C++ wrapper library for spaCy",
"description":"The goal of spacy-cpp is to expose the functionality of spaCy to C++ applications, and to provide an API that is similar to that of spaCy, enabling rapid development in Python and simple porting to C++.",
"slogan":"Wrapper module for using spaCy from Ruby via PyCall",
"description":"ruby-spacy is a wrapper module for using spaCy from the Ruby programming language via PyCall. This module aims to make it easy and natural for Ruby programmers to use spaCy.",
"slogan":"Radically efficient machine teaching, powered by active learning",
"description":"Prodigy is an annotation tool so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration. Whether you're working on entity recognition, intent detection or image classification, Prodigy can help you train and evaluate your models faster. Stream in your own examples or real-world data from live APIs, update your model in real-time and chain models together to build more complex systems.",
"thumb":"https://i.imgur.com/UVRtP6g.jpg",
"image":"https://i.imgur.com/Dt5vrY6.png",
"url":"https://prodi.gy",
"code_example":[
"prodigy dataset ner_product \"Improve PRODUCT on Reddit data\"",
"with Flow(\"Natural Language Processing\") as flow:",
" doc = SpacyNLP(text=\"This is some text\", nlp=nlp)",
"",
"flow.run()"
],
"author":"Prefect",
"author_links":{
"website":"https://prefect.io"
},
"category":["standalone"]
},
{
"id":"graphbrain",
"title":"Graphbrain",
"slogan":"Automated meaning extraction and text understanding",
"description":"Graphbrain is an Artificial Intelligence open-source software library and scientific research tool. Its aim is to facilitate automated meaning extraction and text understanding, as well as the exploration and inference of knowledge.",
"title":"Natural Language Processing Using Python",
"slogan":"No Starch Press, 2020",
"description":"Natural Language Processing Using Python is an introduction to natural language processing (NLP), the task of converting human language into data that a computer can process. The book uses spaCy, a leading Python library for NLP, to guide readers through common NLP tasks related to generating and understanding human language with code. It addresses problems like understanding a user's intent, continuing a conversation with a human, and maintaining the state of a conversation.",
"title":"Introduction to Machine Learning with Python: A Guide for Data Scientists",
"slogan":"O'Reilly, 2016",
"description":"Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination.",
"description":"*Text Analytics with Python* teaches you the techniques related to natural language processing and text analytics, and you will gain the skills to know which technique is best suited to solve a particular problem. You will look at each technique and algorithm with both a bird's eye view to understand how it can be used as well as with a microscopic view to understand the mathematical concepts and to implement them to solve your own problems.",
"description":"Master the essential skills needed to recognize and solve complex problems with machine learning and deep learning. Using real-world examples that leverage the popular Python machine learning ecosystem, this book is your perfect companion for learning the art and science of machine learning to become a successful practitioner. The concepts, techniques, tools, frameworks, and methodologies used in this book will teach you how to think, design, build, and execute machine learning systems and projects successfully.",
"title":"Natural Language Processing and Computational Linguistics",
"slogan":"Packt, 2018",
"description":"This book shows you how to use natural language processing, and computational linguistics algorithms, to make inferences and gain insights about data you have. These algorithms are based on statistical machine learning and artificial intelligence techniques. The tools to work with these algorithms are available to you right now - with Python, and tools like Gensim and spaCy.",
"description":"This is your ultimate spaCy book. Master the crucial skills to use spaCy components effectively to create real-world NLP applications with spaCy. Explaining linguistic concepts such as dependency parsing, POS-tagging and named entity extraction with many examples, this book will help you to conquer computational linguistics with spaCy. The book further focuses on ML topics with Keras and Tensorflow. You'll cover popular topics, including intent recognition, sentiment analysis and context resolution; and use them on popular datasets and interpret the results. A special hands-on section on chatbot design is included.",
"title":"Applied Natural Language Processing in the Enterprise: Teaching Machines to Read, Write, and Understand",
"slogan":"O'Reilly, 2021",
"description":"Natural language processing (NLP) is one of the hottest topics in AI today. Having lagged behind other deep learning fields such as computer vision for years, NLP only recently gained mainstream popularity. Even though Google, Facebook, and OpenAI have open sourced large pretrained language models to make NLP easier, many organizations today still struggle with developing and productionizing NLP applications. This hands-on guide helps you learn the field quickly.",
"github":"nlpbook/nlpbook",
"cover":"https://i.imgur.com/6RxLBvf.jpg",
"url":"https://www.amazon.com/dp/149206257X",
"author":"Ankur A. Patel",
"author_links":{
"github":"aapatel09",
"website":"https://www.ankurapatel.io"
},
"category":["books"]
},
{
"type":"education",
"id":"introduction-into-spacy-3",
"title":"Introduction to spaCy 3",
"slogan":"A free course for beginners by Dr. W.J.B. Mattingly",
"description":"In this free interactive course, you'll learn how to use spaCy to build advanced natural language understanding systems, using both rule-based and machine learning approaches.",
"slogan":"NLP for newcomers using spaCy and Stanza",
"description":"These learning materials provide an introduction to applied language technology for audiences who are unfamiliar with language technology and programming. The learning materials assume no previous knowledge of the Python programming language.",
"slogan":"Incremental parsing with bloom embeddings and residual CNNs",
"description":"spaCy v2.0's Named Entity Recognition system features a sophisticated word embedding strategy using subword features and \"Bloom\" embeddings, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing. The system is designed to give a good balance of efficiency, accuracy and adaptability. In this talk, I sketch out the components of the system, explaining the intuition behind the various choices. I also give a brief introduction to the named entity recognition problem, with an overview of what else Explosion AI is working on, and why.",
"youtube":"sqDHBH9IjRU",
"author":"Matthew Honnibal",
"author_links":{
"twitter":"honnibal",
"github":"honnibal",
"website":"https://explosion.ai"
},
"category":["videos"]
},
{
"type":"education",
"id":"video-new-nlp-solutions",
"title":"Building new NLP solutions with spaCy and Prodigy",
"slogan":"PyData Berlin 2018",
"description":"In this talk, I will discuss how to address some of the most likely causes of failure for new Natural Language Processing (NLP) projects. My main recommendation is to take an iterative approach: don't assume you know what your pipeline should look like, let alone your annotation schemes or model architectures.",
"author":"Matthew Honnibal",
"author_links":{
"twitter":"honnibal",
"github":"honnibal",
"website":"https://explosion.ai"
},
"youtube":"jpWqz85F_4Y",
"category":["videos"]
},
{
"type":"education",
"id":"video-modern-nlp-in-python",
"title":"Modern NLP in Python",
"slogan":"PyData DC 2016",
"description":"Academic and industry research in Natural Language Processing (NLP) has progressed at an accelerating pace over the last several years. Members of the Python community have been hard at work moving cutting-edge research out of papers and into open source, \"batteries included\" software libraries that can be applied to practical problems. We'll explore some of these tools for modern NLP in Python.",
"title":"Advanced NLP with spaCy · A free online course",
"description":"spaCy is a modern Python library for industrial-strength Natural Language Processing. In this free and interactive online course, you'll learn how to use spaCy to build advanced natural language understanding systems, using both rule-based and machine learning approaches.",
"url":"https://course.spacy.io/en",
"author":"Ines Montani",
"author_links":{
"twitter":"_inesmontani",
"github":"ines"
},
"youtube":"THduWAnG97k",
"category":["videos"]
},
{
"type":"education",
"id":"video-spacy-course-de",
"title":"Modernes NLP mit spaCy · Ein Gratis-Onlinekurs",
"description":"spaCy ist eine moderne Python-Bibliothek für industriestarkes Natural Language Processing. In diesem kostenlosen und interaktiven Onlinekurs lernst du, mithilfe von spaCy fortgeschrittene Systeme für die Analyse natürlicher Sprache zu entwickeln und dabei sowohl regelbasierte Verfahren, als auch moderne Machine-Learning-Technologie einzusetzen.",
"title":"NLP avanzado con spaCy · Un curso en línea gratis",
"description":"spaCy es un paquete moderno de Python para hacer Procesamiento de Lenguaje Natural de potencia industrial. En este curso en línea, interactivo y gratuito, aprenderás a usar spaCy para construir sistemas avanzados de comprensión de lenguaje natural usando enfoques basados en reglas y en machine learning.",
"description":"In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recogntion model from scratch.",
"description":"In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recogntion model from scratch.",
"description":"In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recogntion model from scratch.",
"author":"Vincent Warmerdam",
"author_links":{
"twitter":"fishnets88",
"github":"koaning"
},
"youtube":"4V0JDdohxAk",
"category":["videos"]
},
{
"type":"education",
"id":"video-intro-to-nlp-episode-4",
"title":"Intro to NLP with spaCy (4)",
"slogan":"Episode 4: Named Entity Recognition",
"description":"In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recogntion model from scratch.",
"description":"In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recogntion model from scratch.",
"description":"In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recogntion model from scratch.",
"description":"Most NLP projects rely crucially on the quality of annotations used for training and evaluating models. In this episode, Matt and Ines of Explosion AI tell us how Prodigy can improve data annotation and model development workflows. Prodigy is an annotation tool implemented as a python library, and it comes with a web application and a command line interface. A developer can define input data streams and design simple annotation interfaces. Prodigy can help break down complex annotation decisions into a series of binary decisions, and it provides easy integration with spaCy models. Developers can specify how models should be modified as new annotations come in in an active learning framework.",
"description":"As the amount of text available on the internet and in businesses continues to increase, the need for fast and accurate language analysis becomes more prominent. This week Matthew Honnibal, the creator of spaCy, talks about his experiences researching natural language processing and creating a library to make his findings accessible to industry.",
"description":"The state of the art in natural language processing is a constantly moving target. With the rise of deep learning, previously cutting edge techniques have given way to robust language models. Through it all the team at Explosion AI have built a strong presence with the trifecta of spaCy, Thinc, and Prodigy to support fast and flexible data labeling to feed deep learning models and performant and scalable text processing. In this episode founder and open source author Matthew Honnibal shares his experience growing a business around cutting edge open source libraries for the machine learning developent process.",
"description":"One core question around open source is how do you fund it? Well, there is always that PayPal donate button. But that's been a tremendous failure for many projects. Often the go-to answer is consulting. But what if you don't want to trade time for money? You could take things up a notch and change the equation, exchanging value for money. That's what Ines Montani and her co-founder did when they started Explosion AI with spaCy as the foundation.",
"description":"\"Ines and I caught up to discuss her various projects, including the aforementioned spaCy, an open-source NLP library built with a focus on industry and production use cases. In our conversation, Ines gives us an overview of the spaCy Library, a look at some of the use cases that excite her, and the Spacy community and contributors. We also discuss her work with Prodigy, an annotation service tool that uses continuous active learning to train models, and finally, what other exciting projects she is working on.\"",
"title":"DataHack Radio #23: The Brains behind spaCy",
"slogan":"June 2019",
"description":"\"What would you do if you had the chance to pick the brains behind one of the most popular Natural Language Processing (NLP) libraries of our era? A library that has helped usher in the current boom in NLP applications and nurtured tons of NLP scientists? Well – you invite the creators on our popular DataHack Radio podcast and let them do the talking! We are delighted to welcome Ines Montani and Matt Honnibal, the developers of spaCy – a powerful and advanced library for NLP.\"",
"description":"\"spaCy is awesome for NLP! It’s easy to use, has widespread adoption, is open source, and integrates the latest language models. Ines Montani and Matthew Honnibal (core developers of spaCy and co-founders of Explosion) join us to discuss the history of the project, its capabilities, and the latest trends in NLP. We also dig into the practicalities of taking NLP workflows to production. You don’t want to miss this episode!\"",
"slogan":"Query spaCy's linguistic annotations using GraphQL",
"github":"ines/spacy-graphql",
"description":"A very simple and experimental app that lets you query spaCy's linguistic annotations using [GraphQL](https://graphql.org/). The API currently supports most token attributes, named entities, sentences and text categories (if available as `doc.cats`, i.e. if you added a text classifier to a model). The `meta` field will return the model meta data. Models are only loaded once and kept in memory.",
"url":"https://explosion.ai/demos/spacy-graphql",
"category":["apis"],
"tags":["graphql"],
"thumb":"https://i.imgur.com/xC7zpTO.png",
"code_example":[
"{",
" nlp(text: \"Zuckerberg is the CEO of Facebook.\", model: \"en_core_web_sm\") {",
"slogan":"JavaScript API for spaCy with Python REST API",
"github":"ines/spacy-js",
"description":"JavaScript interface for accessing linguistic annotations provided by spaCy. This project is mostly experimental and was developed for fun to play around with different ways of mimicking spaCy's Python API.\n\nThe results will still be computed in Python and made available via a REST API. The JavaScript API resembles spaCy's Python API as closely as possible (with a few exceptions, as the values are all pre-computed and it's tricky to express complex recursive relationships).",
"code_language":"javascript",
"code_example":[
"const spacy = require('spacy');",
"",
"(async function() {",
" const nlp = spacy.load('en_core_web_sm');",
" const doc = await nlp('This is a text about Facebook.');",
"description":"`spacy-wordnet` creates annotations that easily allow the use of WordNet and [WordNet Domains](http://wndomains.fbk.eu/) by using the [NLTK WordNet interface](http://www.nltk.org/howto/wordnet.html)",
"slogan":"Parsing from and to CoNLL-U format with `spacy`, `spacy-stanza` and `spacy-udpipe`",
"description":"This module allows you to parse text into CoNLL-U format or read ConLL-U into a spaCy `Doc`. You can use it as a command line tool, or embed it in your own scripts by adding it as a custom pipeline component to a `spacy`, `spacy-stanza` or `spacy-udpipe` pipeline. It also provides an easy-to-use function to quickly initialize any spaCy-wrapped parser. CoNLL-related properties are added to `Doc` elements, `Span` sentences, and `Token` objects.",
"doc = nlp(\"A cookie is a baked or cooked food that is typically small, flat and sweet. It usually contains flour, sugar and some type of oil or fat.\")",
"",
"# Get the CoNLL representation of the whole document, including headers",
"description":"Ludwig makes it easy to build deep learning models for many applications, including NLP ones. It uses spaCy for tokenizing text in different languages.",
"description":"pic2phrase_bot runs inside Telegram messenger and can be used to generate a phrase describing a submitted photo, employing computer vision, web scraping, and syntactic dependency analysis powered by spaCy.",
"description":"This package uses the [spaCy 2.0 extensions](https://spacy.io/usage/processing-pipelines#extensions) to add word inflections to the system.",
"slogan":"A Python module for English lemmatization and inflection",
"description":"LemmInflect uses a dictionary approach to lemmatize English words and inflect them into forms specified by a user supplied [Universal Dependencies](https://universaldependencies.org/u/pos/) or [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) tag. The library works with out-of-vocabulary (OOV) words by applying neural network techniques to classify word forms and choose the appropriate morphing rules. The system acts as a standalone module or as an extension to spaCy.",
"slogan":"A python library that makes AMR parsing, generation and visualization simple.",
"description":"amrlib is a python module and spaCy add-in for Abstract Meaning Representation (AMR). The system can parse sentences to AMR graphs or generate text from existing graphs. It includes a GUI for visualization and experimentation.",
"github":"bjascob/amrlib",
"pip":"amrlib",
"code_example":[
"import spacy",
"import amrlib",
"amrlib.setup_spacy_extension()",
"nlp = spacy.load('en_core_web_sm')",
"doc = nlp('This is a test of the spaCy extension. The test has multiple sentences.')",
"slogan":"Have you ever struggled with needing a spaCy TextCategorizer but didn't have the time to train one from scratch? Classy Classification is the way to go!",
"description":"Have you ever struggled with needing a [spaCy TextCategorizer](https://spacy.io/api/textcategorizer) but didn't have the time to train one from scratch? Classy Classification is the way to go! For few-shot classification using [sentence-transformers](https://github.com/UKPLab/sentence-transformers) or [spaCy models](https://spacy.io/usage/models), provide a dictionary with labels and examples, or just provide a list of labels for zero shot-classification with [Huggingface zero-shot classifiers](https://huggingface.co/models?pipeline_tag=zero-shot-classification).",
"slogan":"Concise Concepts uses few-shot NER based on word embedding similarity to get you going with easy!",
"description":"When wanting to apply NER to concise concepts, it is really easy to come up with examples, but it takes some effort to train an entire pipeline. Concise Concepts uses few-shot NER based on word embedding similarity to get you going with easy!",
"slogan":"One multi-lingual coreference model to rule them all!",
"description":"Coreference is amazing but the data required for training a model is very scarce. In our case, the available training for non-English languages also data proved to be poorly annotated. Crosslingual Coreference therefore uses the assumption a trained model with English data and cross-lingual embeddings should work for other languages with similar sentence structure. Verified to work quite well for at least (EN, NL, DK, FR, DE).",
"slogan":"A spaCy pipeline and model for NLP on unstructured legal text",
"description":"Blackstone is a spaCy model and library for processing long-form, unstructured legal text. Blackstone is an experimental research project from the [Incorporated Council of Law Reporting for England and Wales'](https://iclr.co.uk/) research lab, [ICLR&D](https://research.iclr.co.uk/).",
"slogan":"A little Windows GUI for training models with spaCy",
"description":"NeuralGym is a Python application for Windows with a graphical user interface to train models with spaCy. Run the application, select an output folder, a training data file in spaCy's data format, a spaCy model or blank model and press 'Start'.",
"description":"Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural extraction, topic matching and supervised document classification. There is a [website demonstrating intelligent search based on topic matching](https://holmes-demo.explosion.services).",
"description":"Coreferee is a pipeline plugin that performs coreference resolution for English, French, German and Polish. It is designed so that it is easy to add support for new languages and optimised for limited training data. It uses a mixture of neural networks and programmed rules. Please note you will need to [install models](https://github.com/explosion/coreferee#getting-started) before running the code example.",
"doc = nlp('Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.')",
"description":"This package provides spaCy model pipelines that wrap [Hugging Face's `transformers`](https://github.com/huggingface/transformers) package, so you can use them in spaCy. The result is convenient access to state-of-the-art transformer architectures, such as BERT, GPT-2, XLNet, etc.",
"slogan":"Push your spaCy pipelines to the Hugging Face Hub",
"description":"This package provides a CLI command for uploading any trained spaCy pipeline packaged with [`spacy package`](https://spacy.io/api/cli#package) to the [Hugging Face Hub](https://huggingface.co). It auto-generates all meta information for you, uploads a pretty README (requires spaCy v3.1+) and handles version control under the hood.",
"slogan":"Implementation of the ClausIE information extraction system for Python+spaCy",
"github":"mmxgn/spacy-clausie",
"url":"https://github.com/mmxgn/spacy-clausie",
"description":"ClausIE, a novel, clause-based approach to open information extraction, which extracts relations and their arguments from natural language text",
"# [AE died in Princeton in 1955, AE died in 1955, AE died in Princeton"
],
"author":"Emmanouil Theofanis Chourdakis",
"author_links":{
"github":"mmxgn"
}
},
{
"id":"ipymarkup",
"slogan":"NER, syntax markup visualizations",
"description":"Collection of NLP visualizations for NER and syntax tree markup. Similar to [displaCy](https://explosion.ai/demos/displacy) and [displaCy ENT](https://explosion.ai/demos/displacy-ent).",
"text = 'В мероприятии примут участие не только российские учёные, но и зарубежные исследователи, в том числе, Крис Хелмбрехт - управляющий директор и совладелец креативного агентства Kollektiv (Германия, США), Ннека Угбома - руководитель проекта Mushroom works (Великобритания), Гергей Ковач - политик и лидер субкультурной партии «Dog with two tails» (Венгрия), Георг Жено - немецкий режиссёр, один из создателей экспериментального театра «Театр.doc», Театра им. Йозефа Бойса (Германия).'",
"slogan":"spaCy pipeline object for negating concepts in text based on the NegEx algorithm.",
"github":"jenojp/negspacy",
"url":"https://github.com/jenojp/negspacy",
"description":"negspacy is a spaCy pipeline component that evaluates whether Named Entities are negated in text. It adds an extension to 'Span' objects.",
"description":"The corpus holds 5127 sentences, annotated with 16 classes, with a total of 26376 annotated entities. The corpus comes into two formats: BRAT and CONLLUP.",
"slogan":"Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects",
"description":"This spaCy project trains an NER model and a custom Text Classification model with Clause Segmentation and Blinding capabilities to analyze supplement reviews and their potential effects on health.",
"slogan":"Context aware, pluggable and customizable data protection and PII data anonymization",
"description":"Presidio *(Origin from Latin praesidium ‘protection, garrison’)* helps to ensure sensitive text is properly managed and governed. It provides fast ***analytics*** and ***anonymization*** for sensitive text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers and financial data. Presidio analyzes the text using predefined or custom recognizers to identify entities, patterns, formats, and checksums with relevant context.",
"slogan":"Toolbox for developing and evaluating PII detectors, NER models for PII and generating fake PII data",
"description":"This package features data-science related tasks for developing new recognizers for Microsoft Presidio. It is used for the evaluation of the entire system, as well as for evaluating specific PII recognizers or PII detection models. Anyone interested in evaluating an existing Microsoft Presidio instance, a specific PII recognizer or to develop new models or logic for detecting PII could leverage the preexisting work in this package. Additionally, anyone interested in generating new data based on previous datasets (e.g. to increase the coverage of entity values) for Named Entity Recognition models could leverage the data generator contained in this package.",
"description":"pySBD is 'real-world' sentence segmenter which extracts reasonable sentences when the format and domain of the input text are unknown. It is a rules-based algorithm based on [The Golden Rules](https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt) - a set of tests to check accuracy of segmenter in regards to edge case scenarios developed by [TM-Town](https://www.tm-town.com/) dev team. pySBD is python port of ruby gem [Pragmatic Segmenter](https://github.com/diasks2/pragmatic_segmenter).",
"description":"Docker-based cookiecutter for easy spaCy APIs using FastAPI. The default endpoints expect batch requests with a list of Records in the Azure Search Cognitive Skill format. So out of the box, this cookiecutter can be setup as a Custom Cognitive Skill. For more on Azure Search and Cognitive Skills [see this page](https://docs.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-interface).",
"slogan":"Py impl of TextRank for lightweight phrase extraction",
"description":"An implementation of TextRank in Python for use in spaCy pipelines which provides fast, effective phrase extraction from texts, along with extractive summarization. The graph algorithm works independent of a specific natural language and does not require domain knowledge. See (Mihalcea 2004) https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf",
"description":"Spacy Syllables is a pipeline component that adds multilingual syllable annotations to Tokens. It uses Pyphen under the hood and has support for a long list of languages.",
"description":"gobbli is a Python library which wraps several modern deep learning models in a uniform interface that makes it easy to evaluate feasibility and conduct analyses. It leverages the abstractive powers of Docker to hide nearly all dependency management and functional differences between models from the user. It also contains an interactive app for exploring text data and evaluating classification models. spaCy's base text classification models, as well as models integrated from `spacy-transformers`, are available in the collection of classification models. In addition, spaCy is used for data augmentation and document embeddings.",
"slogan":"An open source platform for the machine learning lifecycle",
"description":"MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: Tracking, Projects, Models and Registry.",
"description":"PyATE is a term extraction library written in Python using Spacy POS tagging with Basic, Combo Basic, C-Value, TermExtractor, and Weirdness.",
"string = 'Central to the development of cancer are genetic changes that endow these “cancer cells” with many of the hallmarks of cancer, such as self-sufficient growth and resistance to anti-growth and pro-death signals. However, while the genetic changes that occur within cancer cells themselves, such as activated oncogenes or dysfunctional tumor suppressors, are responsible for many aspects of cancer development, they are not sufficient. Tumor promotion and progression are dependent on ancillary processes provided by cells of the tumor environment but that are not necessarily cancerous themselves. Inflammation has long been associated with the development of cancer. This review will discuss the reflexive relationship between cancer and inflammation with particular focus on how considering the role of inflammation in physiologic processes such as the maintenance of tissue homeostasis and repair may provide a logical framework for understanding the connection between the inflammatory response and cancer.'",
"description":"This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using BERT model. The idea of using BERT was to use the context when correcting NWE.",
"slogan":"Text preprocessing, representation and visualization from zero to hero.",
"description":"Texthero is a python package to work with text data efficiently. It empowers NLP developers with a tool to quickly understand any text-based dataset and it provides a solid pipeline to clean and represent text data, from zero to hero.",
"slogan":"spaCy pipeline for COVID-19 surveillance.",
"github":"abchapman93/VA_COVID-19_NLP_BSV",
"description":"A spaCy rule-based pipeline for identifying positive cases of COVID-19 from clinical text. A version of this system was deployed as part of the US Department of Veterans Affairs biosurveillance response to COVID-19.",
"slogan":"A toolkit for clinical NLP with spaCy.",
"github":"medspacy/medspacy",
"description":"A toolkit for clinical NLP with spaCy. Features include sentence splitting, section detection, and asserting negation, family history, and uncertainty.",
"slogan":"Domain Specific Language for creating language rules",
"github":"zaibacu/rita-dsl",
"description":"A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format",
"slogan":"SpacyDotNet is a .NET Core compatible wrapper for spaCy, based on Python.NET",
"description":"This projects relies on [Python.NET](http://pythonnet.github.io/) to interop with spaCy. It's not meant to be a complete and exhaustive implementation of all spaCy features and [APIs](https://spacy.io/api). Although it should be enough for basic tasks, it's considered as a starting point if you need to build a complex project using spaCy in .NET Most of the basic features in _Spacy101_ are available. All `Container` classes are present (`Doc`, `Token`, `Span` and `Lexeme`) with their basic properties/methods running and also `Vocab` and `StringStore` in a limited form. Anyway, any developer should be ready to add the missing properties or classes in a very straightforward manner.",
"slogan":"A library for statistics extraction from texts in Russian",
"description":"The library allows extracting the following statistics from a text: basic statistics, readability metrics, lexical diversity metrics, morphological statistics",
"slogan":"A text complexity library for text analysis built on spaCy",
"description":"With all the basic NLP capabilities provided by spaCy (dependency parsing, POS tagging, tokenizing), `TRUNAJOD` focuses on extracting measurements from texts that might be interesting for different applications and use cases.",
"slogan":"A Linguistic Feature Extraction (Text Analysis) Tool for Readability Assessment and Text Simplification",
"description":"LingFeat is a feature extraction library which currently extracts 255 linguistic features from English string input. Categories include syntax, semantics, discourse, and also traditional readability formulas. Published in EMNLP 2021.",
"github":"brucewlee/lingfeat",
"pip":"lingfeat",
"code_example":[
"from lingfeat import extractor",
"",
"",
"text = 'TAEAN, South Chungcheong Province -- Just before sunup, Lee Young-ho, a seasoned fisherman with over 30 years of experience, silently waits for boats carrying blue crabs as the season for the seafood reaches its height. Soon afterward, small and big boats sail into Sinjin Port in Taean County, South Chungcheong Province, the second-largest source of blue crab after Incheon, accounting for 29 percent of total production of the country. A crane lifts 28 boxes filled with blue crabs weighing 40 kilograms each from the boat, worth about 10 million won ($8,500). “It has been a productive fall season for crabbing here. The water temperature is a very important factor affecting crab production. They hate cold water,” Lee said. The temperature of the sea off Taean appeared to have stayed at the level where crabs become active. If the sea temperature suddenly drops, crabs go into their winter dormancy mode, burrowing into the mud and sleeping through the cold months.'",
"",
"",
"#Pass text",
"LingFeat = extractor.pass_text(text)",
"",
"",
"#Preprocess text",
"LingFeat.preprocess()",
"",
"",
"#Extract features",
"#each method returns a dictionary of the corresponding features",
"description":"Hammurabi works as a rule engine to parse input using a defined set of rules. It uses a simple and readable syntax to define complex rules to handle phrase matching. The syntax supports nested logical statements, regular expressions, reusable or side-loaded variables and match triggered callback functions to modularize your rules. The latest version works with both spaCy 2.X and 3.X. For more information check the documentation on [ReadTheDocs](https://hmrb.readthedocs.io/en/latest/).",
"slogan":"Forte is a toolkit for building Natural Language Processing pipelines, featuring cross-task interaction, adaptable data-model interfaces and composable pipelines.",
"description":"Forte provides a platform to assemble state-of-the-art NLP and ML technologies in a highly-composable fashion, including a wide spectrum of tasks ranging from Information Retrieval, Natural Language Understanding to Natural Language Generation.",
"description":"Combination of the RapidFuzz library with Spacy PhraseMatcher The goal of this component is to find matches when there were NO \"perfect matches\" due to typos or abbreviations between a Spacy doc and a list of phrases.",
"slogan":"A calibre plugin that generates Word Wise and X-Ray files.",
"description":"A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. X-Ray supports 18 languages.",
"description":"textnets represents collections of texts as networks of documents and words. This provides novel possibilities for the visualization and analysis of texts.",
"slogan":"Text mining and topic modeling toolkit",
"description":"tmtoolkit is a set of tools for text mining and topic modeling with Python developed especially for the use in the social sciences, in journalism or related disciplines. It aims for easy installation, extensive documentation and a clear programming interface while offering good performance on large datasets by the means of vectorized operations (via NumPy) and parallel computation (using Python’s multiprocessing module and the loky package).",
"slogan":"spaCy components to extract information from clinical notes written in French.",
"description":"EDS-NLP provides a set of rule-based spaCy components to extract information for French clinical notes. It also features _qualifier_ pipelines that detect negations, speculations and family context, among other modalities. Check out the [demo](https://aphp.github.io/edsnlp/demo/)!",
"github":"aphp/edsnlp",
"pip":"edsnlp",
"code_example":[
"import spacy",
"",
"nlp = spacy.blank(\"fr\")",
"",
"terms = dict(",
" covid=[\"covid\", \"coronavirus\"],",
")",
"",
"# Sentencizer component, needed for negation detection",
"slogan":"English interpretation for accurate translation from English to Japanese",
"description":"This package categorizes English sentences into one of five basic sentence patterns and identifies the subject, verb, object, and other components. The five basic sentence patterns are based on C. T. Onions's Advanced English Syntax and are frequently used when teaching English in Japan.",
"slogan":"Sequence Tagger for Partially Annotated Dataset in spaCy",
"description":"This is a library to build a CRF tagger with a partially annotated dataset in spaCy. You can build your own tagger only from dictionary.",