Name	Name	Last commit message	Last commit date
parent directory ..
resource	resource
README.md	README.md
__init__.py	__init__.py
language_detector.py	language_detector.py
language_support.py	language_support.py
requirements.txt	requirements.txt
spacy_tokenizer.py	spacy_tokenizer.py
symspell_checker.py	symspell_checker.py
text_cleaner.py	text_cleaner.py

NLP

Description

The lib provides the following tools for NLP:

spacy_tokenizer.py
- Tokenization, Lemmatization & other components available in spaCy
- By default the requirements.txt includes models for several common languages that can be used by calling MultilingualTokenizer(use_models=True,). These tend to have better performance than the default rule-based system, but if not needed feel free to remove them from the requirments.
language_detection.py
- Language detection
text_cleaner.py
- Text cleaning (e.g. remove emojis)
symspell_checker.py
- Spell checking

Examples

Here are examples of using the nlp submodules.

Each module acts upon a pandas DataFrame, adding a new column with the module's output. For additional examples feel free to check tests/nlp/.

Tokenization

The MultilingualTokenizer in spacy_tokenizer.py can be used as follows:

import pandas as pd
from core.nlp.spacy_tokenizer import MultilingualTokenizer

input_df = pd.DataFrame({"input_text": ["I hope nothing. I fear nothing. I am free. 💩 😂 #OMG"]})
tokenizer = MultilingualTokenizer(use_models=False)
output_df = tokenizer.tokenize_df(df=input_df, text_column="input_text", language="en")

Language detection

import pandas as pd
from core.nlp.language_detector import LanguageDetector

input_df = pd.DataFrame({"input_text": ["Comment est votre blanquette ?"]})
detector = LanguageDetector(minimum_score=0.2, fallback_language="es")
output_df = detector.detect_languages_df(input_df, "input_text").sort_values(by=["input_text"])

Text cleaning

import pandas as pd
from core.nlp.spacy_tokenizer import MultilingualTokenizer
from core.nlp.text_cleaner import TextCleaner

input_df = pd.DataFrame({"input_text": ["Hi, I have two apples costing 3$ 😂    \n and unicode has #snowpersons ☃"]})
token_filters = {"is_punct", "is_stop", "like_num", "is_symbol", "is_currency", "is_emoji"}
tokenizer = MultilingualTokenizer()
tokenizer.spacy_nlp_dict["en"] = tokenizer._create_spacy_tokenizer("en")
tokenizer._activate_components_to_lemmatize("en")
text_cleaner = TextCleaner(tokenizer=tokenizer, token_filters=token_filters, lemmatization=True)
output_df = text_cleaner.clean_df(df=input_df, text_column="input_text", language="en")

Spell checking

import pandas as pd
from core.nlp.spacy_tokenizer import MultilingualTokenizer
from core.nlp.symspell_checker import SpellChecker

dictionary_folder_path = "./core/nlp/resource/dictionaries"

input_df = pd.DataFrame(
        {"input_text": ["Can yu read tHISs message despite the horible AB1234 sppeling msitakes 😂 #OMG"]}
    )
spellchecker = SpellChecker(tokenizer=MultilingualTokenizer(), dictionary_folder_path=dictionary_folder_path)
output_df = spellchecker.check_df(df=input_df, text_column="input_text", language="en")

Projects using the library

Don't hesitate to check these plugins using the library for more examples:

dss-plugin-nlp-visualization

Version

Version: 0.2.0
State: Supported

Credit

Library created and maintained by Alex Combessie.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

NLP

Description

Examples

Tokenization

Language detection

Text cleaning

Spell checking

Projects using the library

Version

Credit

FilesExpand file tree

nlp

Directory actions

More options

Directory actions

More options

Latest commit

History

nlp

Folders and files

parent directory

README.md

NLP

Description

Examples

Tokenization

Language detection

Text cleaning

Spell checking

Projects using the library

Version

Credit