Skip to content

Latest commit

 

History

History

README.md

NLP

Description

The lib provides the following tools for NLP:

  • spacy_tokenizer.py
    • Tokenization, Lemmatization & other components available in spaCy
    • By default the requirements.txt includes models for several common languages that can be used by calling MultilingualTokenizer(use_models=True,). These tend to have better performance than the default rule-based system, but if not needed feel free to remove them from the requirments.
  • language_detection.py
    • Language detection
  • text_cleaner.py
    • Text cleaning (e.g. remove emojis)
  • symspell_checker.py
    • Spell checking

Examples

Here are examples of using the nlp submodules.

Each module acts upon a pandas DataFrame, adding a new column with the module's output. For additional examples feel free to check tests/nlp/.

Tokenization

The MultilingualTokenizer in spacy_tokenizer.py can be used as follows:

import pandas as pd
from core.nlp.spacy_tokenizer import MultilingualTokenizer

input_df = pd.DataFrame({"input_text": ["I hope nothing. I fear nothing. I am free. 💩 😂 #OMG"]})
tokenizer = MultilingualTokenizer(use_models=False)
output_df = tokenizer.tokenize_df(df=input_df, text_column="input_text", language="en")

Language detection

import pandas as pd
from core.nlp.language_detector import LanguageDetector

input_df = pd.DataFrame({"input_text": ["Comment est votre blanquette ?"]})
detector = LanguageDetector(minimum_score=0.2, fallback_language="es")
output_df = detector.detect_languages_df(input_df, "input_text").sort_values(by=["input_text"])

Text cleaning

import pandas as pd
from core.nlp.spacy_tokenizer import MultilingualTokenizer
from core.nlp.text_cleaner import TextCleaner

input_df = pd.DataFrame({"input_text": ["Hi, I have two apples costing 3$ 😂    \n and unicode has #snowpersons ☃"]})
token_filters = {"is_punct", "is_stop", "like_num", "is_symbol", "is_currency", "is_emoji"}
tokenizer = MultilingualTokenizer()
tokenizer.spacy_nlp_dict["en"] = tokenizer._create_spacy_tokenizer("en")
tokenizer._activate_components_to_lemmatize("en")
text_cleaner = TextCleaner(tokenizer=tokenizer, token_filters=token_filters, lemmatization=True)
output_df = text_cleaner.clean_df(df=input_df, text_column="input_text", language="en")

Spell checking

import pandas as pd
from core.nlp.spacy_tokenizer import MultilingualTokenizer
from core.nlp.symspell_checker import SpellChecker

dictionary_folder_path = "./core/nlp/resource/dictionaries"

input_df = pd.DataFrame(
        {"input_text": ["Can yu read tHISs message despite the horible AB1234 sppeling msitakes 😂 #OMG"]}
    )
spellchecker = SpellChecker(tokenizer=MultilingualTokenizer(), dictionary_folder_path=dictionary_folder_path)
output_df = spellchecker.check_df(df=input_df, text_column="input_text", language="en")

Projects using the library

Don't hesitate to check these plugins using the library for more examples:

Version

  • Version: 0.2.0
  • State: Supported

Credit

Library created and maintained by Alex Combessie.