This library provides the following tools for NLP:
- spacy_tokenizer.py: Tokenization, lemmatization, and other components available in spaCy.
  - By default, requirements.txt includes models for several common languages, which can be used by calling MultilingualTokenizer(use_models=True). These tend to perform better than the default rule-based pipeline, but if you do not need them, feel free to remove them from requirements.txt.
- language_detection.py: Language detection.
- text_cleaner.py: Text cleaning (e.g. removing emojis).
- symspell_checker.py: Spell checking.
Here are examples of using the nlp submodules.
Each module operates on a pandas DataFrame, adding a new column with the module's output.
For additional examples, see tests/nlp/.
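As a minimal sketch of this DataFrame-in, DataFrame-out pattern (the function and column names below are illustrative, not part of the library's API):

```python
import pandas as pd

def add_upper_column(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
    """Toy transform following the library's pattern: read one text column,
    write the result to a new column, and return the DataFrame."""
    output_df = df.copy()
    output_df[text_column + "_upper"] = output_df[text_column].str.upper()
    return output_df

input_df = pd.DataFrame({"input_text": ["hello world"]})
output_df = add_upper_column(input_df, "input_text")
# output_df now carries an extra "input_text_upper" column
```

Each real module follows this shape, so outputs can be chained: the column added by one step (e.g. detected language) can feed the next.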
The MultilingualTokenizer in spacy_tokenizer.py can be used as follows:

```python
import pandas as pd

from core.nlp.spacy_tokenizer import MultilingualTokenizer

input_df = pd.DataFrame({"input_text": ["I hope nothing. I fear nothing. I am free. 💩 😂 #OMG"]})
tokenizer = MultilingualTokenizer(use_models=False)
output_df = tokenizer.tokenize_df(df=input_df, text_column="input_text", language="en")
```
The LanguageDetector can be used as follows:

```python
import pandas as pd

from core.nlp.language_detector import LanguageDetector

input_df = pd.DataFrame({"input_text": ["Comment est votre blanquette ?"]})
detector = LanguageDetector(minimum_score=0.2, fallback_language="es")
output_df = detector.detect_languages_df(input_df, "input_text").sort_values(by=["input_text"])
```
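The roles of minimum_score and fallback_language can be understood with a toy detector (a pure-Python sketch, not the library's actual implementation, which relies on a proper language identification model):

```python
# Tiny stopword lists standing in for a real language model
STOPWORDS = {
    "en": {"the", "is", "and", "of"},
    "fr": {"le", "est", "et", "de", "votre", "comment"},
}

def detect_language(text: str, minimum_score: float = 0.2, fallback_language: str = "es") -> str:
    """Score each language by the share of words found in its stopword list;
    return the fallback when no language reaches minimum_score."""
    words = [w.lower().strip("?!.,") for w in text.split()]
    scores = {
        lang: sum(w in stops for w in words) / max(len(words), 1)
        for lang, stops in STOPWORDS.items()
    }
    best_lang, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_lang if best_score >= minimum_score else fallback_language

detect_language("Comment est votre blanquette ?")  # → "fr"
detect_language("xyzzy plugh")                     # below threshold → "es"
```

The same contract applies above: texts whose best score falls below minimum_score are assigned fallback_language instead of an unreliable guess.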
The TextCleaner in text_cleaner.py can be used as follows:

```python
import pandas as pd

from core.nlp.spacy_tokenizer import MultilingualTokenizer
from core.nlp.text_cleaner import TextCleaner

input_df = pd.DataFrame({"input_text": ["Hi, I have two apples costing 3$ 😂 \n and unicode has #snowpersons ☃"]})
token_filters = {"is_punct", "is_stop", "like_num", "is_symbol", "is_currency", "is_emoji"}
tokenizer = MultilingualTokenizer()
tokenizer.spacy_nlp_dict["en"] = tokenizer._create_spacy_tokenizer("en")
tokenizer._activate_components_to_lemmatize("en")
text_cleaner = TextCleaner(tokenizer=tokenizer, token_filters=token_filters, lemmatization=True)
output_df = text_cleaner.clean_df(df=input_df, text_column="input_text", language="en")
```
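The token_filters idea can be sketched with plain Python (a toy illustration; the real TextCleaner filters spaCy tokens by these same boolean attribute names):

```python
from dataclasses import dataclass

@dataclass
class Token:
    """Stand-in for a spaCy token with a few boolean attribute flags."""
    text: str
    is_punct: bool = False
    is_stop: bool = False
    like_num: bool = False

def clean(tokens, token_filters):
    """Keep only tokens for which none of the named attributes is True."""
    return [t.text for t in tokens if not any(getattr(t, f) for f in token_filters)]

tokens = [
    Token("I", is_stop=True),
    Token("have", is_stop=True),
    Token("two", like_num=True),
    Token("apples"),
    Token(",", is_punct=True),
]
clean(tokens, {"is_punct", "is_stop", "like_num"})  # → ["apples"]
```

Passing attribute names rather than hard-coded rules is what makes the filter set configurable per use case.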
The SpellChecker in symspell_checker.py can be used as follows:

```python
import pandas as pd

from core.nlp.spacy_tokenizer import MultilingualTokenizer
from core.nlp.symspell_checker import SpellChecker

dictionary_folder_path = "./core/nlp/resource/dictionaries"
input_df = pd.DataFrame(
    {"input_text": ["Can yu read tHISs message despite the horible AB1234 sppeling msitakes 😂 #OMG"]}
)
spellchecker = SpellChecker(tokenizer=MultilingualTokenizer(), dictionary_folder_path=dictionary_folder_path)
output_df = spellchecker.check_df(df=input_df, text_column="input_text", language="en")
```
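The SymSpell idea behind symspell_checker.py can be sketched in a few lines: precompute deletion variants of every dictionary word, then match deletion variants of a misspelled word against that index (a simplified edit-distance-1 version; the real library handles larger distances and ranks candidates by word frequency):

```python
def deletes(word: str):
    """All strings obtainable by deleting at most one character."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(dictionary):
    """Map each deletion variant to the dictionary words that produce it."""
    index = {}
    for word in dictionary:
        for d in deletes(word):
            index.setdefault(d, set()).add(word)
    return index

def correct(word: str, index, dictionary):
    """Return the word if known, else the smallest matching candidate."""
    if word in dictionary:
        return word
    candidates = set()
    for d in deletes(word):
        candidates |= index.get(d, set())
    return min(candidates) if candidates else word

dictionary = {"you", "read", "this", "message"}
index = build_index(dictionary)
correct("yu", index, dictionary)  # → "you"
```

Looking up deletion variants instead of generating all insertions and substitutions is what makes SymSpell fast: the candidate space per word is tiny and precomputed.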
For more examples, don't hesitate to check the plugins that use this library.
- Version: 0.2.0
- State: Supported
Library created and maintained by Alex Combessie.