
10 Text Preprocessing Benchmarks on CPU, GPU, and TPU
by Bruce H. Cottman, Ph.D. | Jun 2022


NLP benchmarks

Python code and benchmarks for ten different spaCy text preprocessing actions

Figure 1. Racing computing platforms. Source: photo by Pietro Mattia on Unsplash.

Estimates state that 70%–85% of the world’s data is text (unstructured data). New deep learning language models (transformers) have also driven explosive growth in industrial NLP applications.

This article is not an introduction to Natural Language Processing (NLP). Feeding the sequence of tokens created from raw text into different NLP models is not covered here. Instead, we focus on preprocessing text before it is input as tokens into an NLP model.

Raw text degrades NLP modeling unless a noise removal operation deletes or transforms noisy words before the text becomes a sequence of tokens. Noise removal is usually NLP-model dependent. For example, email addresses may be removed for a text classification task but kept for a text redaction task.

Normalizing the corpus (sequence of tokens) transforms the text into a standard form. The most frequent example is transforming all characters to lowercase. Nevertheless, be careful: some advanced NLP models make use of capitalization information.
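For example, lowercasing in plain Python:

text = "Apple is looking at buying U.K. startup for $1 billion."
print(text.lower())
# => apple is looking at buying u.k. startup for $1 billion.

Note that lowercasing erases the cue that distinguishes the company “Apple” from the fruit, which is exactly the capitalization information some advanced models exploit.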

In production-grade NLP, text preprocessing (noise cleaning and normalization) is critical to model deployment.

We benchmark token noise cleaning and normalization preprocessing using spaCy on CPU, TPU, and three GPUs.

Google Cloud Platform (GCP) Colab is a customized Jupyter notebook image offered as a cloud service in the GCP framework. The short definition is that “Colab is a Jupyter notebook running in GCP.” Follow the URLs should any question arise about Jupyter or Colab.

One way to get to Colab:

  1. Create a Google account
  2. Go to https://colab.research.google.com/ and log in with your Google account
  3. Create a Google Drive; Colab can access any files in your shared drive

Note: You can access GitHub files via a GitHub URL, or search by organization or user.

The Colab CPU configuration for the benchmark is an Intel CPU @ 2.2 GHz with 3 CPUs, six cores, and 58 MB of CPU cache.

!cat /proc/cpuinfo

=>

Figure 2: Colab instance CPU resources configured.

Colab is free and can provide an Nvidia GPU or Google TPU for you.

Figure 3: Colab “Change runtime type” panel.
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

=>

...
physical_device_desc: "device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0"]

A Tesla (Nvidia) P100 GPU with 16 GB of memory was provisioned in this case. Depending on availability, you may get anything from a T4 up to a high-end Nvidia V100 GPU.

We used Intel CPU, Google TPU, Nvidia T4, P100, and V100 for the benchmarks.

spaCy is the fastest package we know of for Natural Language Processing (NLP) operations. spaCy is an NLP library implemented in both Python and Cython.

Because of Cython, parts of spaCy are faster than a pure-Python implementation would be. spaCy is available for MS Windows, macOS, and Ubuntu operating systems and runs natively on Nvidia GPUs.
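In Colab, a typical setup might look like this (the CUDA extra and the model name are assumptions; match them to your runtime):

!pip install -U "spacy[cuda113]" spacymoji
!python -m spacy download en_core_web_sm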

To preprocess text with spaCy, we transform the text into a sequence of tokens. The spaCy pipeline acts on the sequence of tokens, resulting in a corpus Doc object.

Our text preprocessing end goal is to produce tokens that feed into our NLP models.
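For example, a minimal sketch (the model name en_core_web_sm is an assumption):

import spacy

nlp = spacy.load("en_core_web_sm")  # tokenizer plus the default pipeline
doc = nlp("Send the invoice to bhc@gmail.com :)")
print([token.text for token in doc])  # the sequence of tokens in the Doc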

Figure 4. The spaCy pipeline. Source: https://github.com/explosion/spaCy/blob/master/website/docs/images/pipeline.svg

You configure spaCy to use a GPU by:

import spacy
spacy.prefer_gpu()  # or spacy.require_gpu()

=>

True

You can either create your own Tokenizer class from scratch or even replace it with an entirely custom function. You customize the tokenizer by adding a custom pipeline. We cover spaCy pipeline customization in an upcoming article.
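The customization code itself is not reproduced in this copy. A minimal sketch that yields the ['emoji', 'ner'] pipeline shown below, assuming the spacymoji extension and the en_core_web_sm model (both assumptions):

import spacy
from spacymoji import Emoji  # importing registers the "emoji" component factory

# Keep only NER from the stock pipeline (the exclusions are an assumption):
nlp = spacy.load("en_core_web_sm",
                 exclude=["tok2vec", "tagger", "parser",
                          "attribute_ruler", "lemmatizer"])
nlp.add_pipe("emoji", first=True)  # emoji detection runs before NER
print(nlp.pipe_names)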

=>

['emoji', 'ner']

Note: The tokenizer is a “special” component and isn’t part of the regular pipeline. It also doesn’t show up in nlp.pipe_names. The reason is that there can only be one tokenizer, and while all other pipeline components take a Doc and return it, the tokenizer takes a string of text and turns it into a sequence of tokens (Doc).
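For example, you can call the tokenizer directly on a string to get a Doc without running the rest of the pipeline (a minimal sketch):

doc = nlp.tokenizer("Don't delete bhc@gmail.com yet.")
print([t.text for t in doc])  # tokens only; no tagging, parsing, or NER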

We create long_s, a long string with extra whitespace, emoji, email addresses, $ symbols, HTML tags, punctuation, and other text that may or may not be noise for the downstream NLP model.
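The construction code is not reproduced in this copy; a minimal sketch, assuming a noisy fragment repeated many times stands in for the author's full string (the fragment, repeat count, and preview length text_l are all guesses):

# Hypothetical reconstruction: repeat a noisy fragment until the string is long.
fragment = (":( 😻 😈 #google +1 608-444-0000 bhc@gmail.com f@z.yx can't Be a ckunk. "
            "$4 $123,456 won't seven $Shine $$beighty?$ ")
long_s = fragment * 4000  # a few hundred thousand characters
text_l = 2000             # preview length used by the print statements below
%time long_s_doc = nlp(long_s)
print('size: {:g} {}'.format(len(long_s_doc), long_s[:text_l]))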

=>

Wall time: <processor dependent> size: 307800

:( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$ :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$ :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$ :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$ :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508

Note: Size is the number of tokens in the spaCy Doc sequence, not the number of characters in long_s.

Each token operation is described below. The time to compute each operation is given in Figure 5.

Figure 5. Benchmark times for spaCy actions by the processor.

You can remove emojis using the spaCy pipeline add-on:

%time long_s_doc_no_emojicon = [token  for token in long_s_doc if token._.is_emoji == False]
print('size: {:g} {}'.format(len(long_s_doc_no_emojicon),long_s_doc_no_emojicon[:int(text_l/5)]))

We can translate emoticons into a natural language phrase.
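EMOJI_TO_PHRASE is built earlier in the author's notebook and is not reproduced in this copy; a hypothetical stand-in:

# Hypothetical stand-in for the notebook's full emoji/emoticon-to-phrase mapping:
EMOJI_TO_PHRASE = {
    "😻": "smiling cat face with heart eyes",
    "😈": "smiling face with horns",
    ":(": "frowning face",
}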

%time text = [token.text if (token.text in EMOJI_TO_PHRASE) == False \
else EMOJI_TO_PHRASE[token.text] for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

We remove email tokens using token.like_email.

%time tokens = [token for token in long_s_doc if not token.like_email]
print('size: {:g} {}'.format(len(tokens),tokens[:int(text_l/3)]))

EMOJI sentiment scoring is not a text preprocessor in the classic sense. However, we find that emojis can dominate the sentiment of a document’s text. For example, consider two similar phrases from a legal-notes email with opposite sentiments:

The client was challenging. :(
The client was difficult. :)

We calculate sentiment only from the emojis present in a note or email.
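EMOJI_TO_SENTIMENT_VALUE is defined earlier in the author's notebook and is not shown in this copy; a hypothetical stand-in with scores in [-1, 1]:

# Hypothetical stand-in; sentiment scores in [-1, 1], keyed by emoji/emoticon text:
EMOJI_TO_SENTIMENT_VALUE = {
    "😻": 0.7,
    "😈": 0.4,
    ":(": -0.4,
    ":)": 0.5,
}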

%time scl = [EMOJI_TO_SENTIMENT_VALUE[token.text] for token in long_s_doc if (token.text in EMOJI_TO_SENTIMENT_VALUE)]
len(scl), sum(scl), sum(scl)/len(scl)

The sentiment was 0.07 (neutral) for a 0.5-million-character “note” with 15,200 emojis and emoticons, computed in 171 to 195 ms across the different processors. A fast sentiment analysis calculation!

We remove whitespace and punctuation simultaneously using spaCy tokens.

%time tokens = [token.text for token in long_s_doc if (token.pos_ not in ['SPACE','PUNCT'])]
%time text = ' '.join(tokens)
print('size: {:g} {}'.format(len(text),text[:text_l]))

We remove the currency symbol in tokens using token.is_currency.

%time token = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(token)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

Note: spaCy’s punctuation removal also deletes the :) emoticon along with emoji. You can protect emoticons and emoji with:

%time long_s_doc = [token  for token in long_s_doc if token.is_punct == False or token._.is_emoji == True]
print('size: {:g} {}'.format(len(long_s_doc),long_s_doc[:50]))

However, replace_currency_symbols and regex-based approaches ignore context and replace any currency symbol. You may have multiple uses of $ in your text and thus cannot ignore context. In this case, you can use spaCy:

%time tk = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(tk)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

NLP models (e.g., logistic regression and transformers) and NLP tasks (e.g., sentiment analysis) continue to be added. Some benefit from stopword removal, and some do not. — Industrial-Strength Natural Language Processing: Turbo-charge your spaCy NLP pipeline

Note: We use different deep learning language models (transformers) and do not remove stopwords.

%time tokens = [token.text for token in long_s_doc if token.is_stop == False]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

Lemmatization looks beyond word reduction and considers a language’s complete vocabulary to apply a morphological analysis to words.

Lemmatization looks at the surrounding text to determine a given word’s part of speech. It does not categorize phrases.

%time tokens = [token.lemma_ for token in long_s_doc]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

Note: spaCy does not include stemming. You can add it if you want, but stemming does not work as well as lemmatization because stemming does not consider context. (This is one reason some researchers consider spaCy “opinionated.”)

Clean (preprocess) the data (text) into a corpus (document or set of documents) before it is input into any NLP model.

Note: Stop-word removal is expensive computationally. We found that the best way to achieve faster stop-word removal was not to do it.

Note: The Noise Removal and Normalization lists are not exhaustive. These are some of the tasks I have encountered.

The result is that parsing text into a sequence of tokens (the Doc object) is where most of the computation lies. In row 1 of Figure 5, GPUs double the processing speed relative to CPUs.

Figure 5 also shows that preprocessing the sequence of tokens by their attributes is relatively fast and sequential. GPUs have a slight advantage over CPUs for token operations. Performance see-saws by approximately 10% across the CPU, Google TPU, and Nvidia GPUs.


Comparing the Intel 16-core times to the Intel 6-core benchmark times shows a change of less than 30%.

In a future article, we show how to speed up spaCy on a multi-core platform using Dask and other packages.

Keep on coding happily and productively!


