Python – Text Processing Introduction

One field of machine learning is Natural Language Processing (NLP). It can be useful, for example, when building an automatic system that handles CVs: the user uploads a CV document, and we look for keywords and classify the document based on the result

As part of feature engineering, we create new input features based on existing ones. Sometimes we have a description field and want to look for something in that field

Examples:

  • One or more words from a closed set (sale, offer, buy, free, …)
  • Two related terms (Python, machine learning)
  • Numbers

To process text we can use:

  • Python string type
  • Regular expressions
  • Dedicated packages

 

Python Strings

We can use Python string methods for simple operations:

  • Split a string using a single separator
  • Find a substring
  • Count substrings
  • Check some simple conditions (starts with, ends with, digits, alpha, etc.)

These operations are very simple; you can read more in the official Python documentation
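As a minimal sketch of the operations listed above, the snippet below uses only built-in str methods. The sample sentence is made up for illustration.

```python
text = "free offer: buy 2 get 1 free"

words = text.split(" ")           # split on a single separator
position = text.find("offer")     # index of the first match, -1 if absent
free_count = text.count("free")   # number of non-overlapping occurrences

# Simple condition checks
starts_with_free = text.startswith("free")
ends_with_free = text.endswith("free")
only_digits = "2024".isdigit()
only_letters = "offer".isalpha()

print(words)
print(position, free_count)
print(starts_with_free, ends_with_free, only_digits, only_letters)
```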

 

Regular Expressions

Regular expressions help with complex string manipulation. You need to learn the rules, and the regex engine will do the job for you

For a complete guide, look here

Some examples where regular expressions can be used:

 

Removing numbers
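One possible way to do this with the standard re module is to substitute every run of digits with an empty string. The input string is made up for illustration.

```python
import re

text = "abc123def45gh"

# \d+ matches one or more consecutive digits; replace each run with nothing
no_numbers = re.sub(r"\d+", "", text)
print(no_numbers)  # abcdefgh
```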

 

Splitting with multiple separators (numbers, commas, spaces, plus signs, colons)
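A sketch of such a split, assuming any run of the listed separators should count as a single break. The input string is made up for illustration.

```python
import re

text = "apple,banana orange+kiwi:melon7grape"

# Inside a character class, + is a literal plus sign.
# The trailing + treats a run of separators as one split point.
parts = re.split(r"[\d,\s+:]+", text)
print(parts)  # ['apple', 'banana', 'orange', 'kiwi', 'melon', 'grape']
```

If the text starts or ends with a separator, re.split returns an empty string at that end, which you may want to filter out.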

 

Find all numbers
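A minimal sketch using re.findall, which returns every non-overlapping match as a string. The sentence is made up, and only whole non-negative integers are handled here.

```python
import re

text = "Room 12 has 3 windows and 40 chairs"

numbers = re.findall(r"\d+", text)      # matches come back as strings
as_ints = [int(n) for n in numbers]     # convert when numeric values are needed
print(numbers)  # ['12', '3', '40']
print(as_ints)  # [12, 3, 40]
```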

 

Find 2 related words with up to 20 characters between them

Search for quick or slow followed by fox or camel (at most 20 characters apart)
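One way to express this is a pattern with two alternations and a bounded gap between them. This is a sketch, not necessarily the pattern originally used; the word boundaries (\b) are an added assumption to avoid matching inside longer words, and both sentences are made up.

```python
import re

# (quick|slow), then at most 20 arbitrary characters, then (fox|camel).
# Note that . does not match a newline unless re.DOTALL is passed.
pattern = re.compile(r"\b(quick|slow)\b.{0,20}\b(fox|camel)\b")

near = pattern.search("The quick brown fox jumps")
far = pattern.search("The quick and extremely well rested fox")

print(near.group(0))  # quick brown fox
print(near.groups())  # ('quick', 'fox')
print(far)            # None, the words are more than 20 characters apart
```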

 

Using Dedicated Packages

You can find many packages for text handling, strings and NLP

 

The string module

The string module from the standard library is not so useful, but it has some nice options

Removing punctuation 
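A common way to do this is to combine string.punctuation with str.translate. The sketch below assumes only ASCII punctuation needs to go, and the sample sentence is made up.

```python
import string

text = "Hello, world! (Python)"

# Build a translation table that deletes every ASCII punctuation character
table = str.maketrans("", "", string.punctuation)
clean = text.translate(table)
print(clean)  # Hello world Python
```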

See also: string.digits, string.hexdigits, string.ascii_letters and more

 

NLTK

NLTK is a huge package with many natural language modules.

One useful resource for text preprocessing is the stopwords corpus; it helps with removing common stop words from our text (I, you, have, …)

First you need to download it; after that we can use it to remove all stop words from a text

This is not a post about NLTK, only an intro with a simple example. For more information, see the official site and this blog post
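The two steps could look roughly like the sketch below. The helper names load_english_stop_words and remove_stop_words are hypothetical, chosen for this example; nltk.download and nltk.corpus.stopwords are the NLTK calls involved. The import sits inside the loader so the filtering helper can be tried with any set of words, even without NLTK installed.

```python
def load_english_stop_words():
    """Fetch NLTK's stop word corpus and return the English list as a set."""
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)  # needs network access
    return set(stopwords.words("english"))


def remove_stop_words(text, stop_words):
    """Return text without the words found in stop_words (case-insensitive)."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


# Usage (requires nltk to be installed):
#   stop_words = load_english_stop_words()
#   print(remove_stop_words("I have a quick brown fox", stop_words))
```

Splitting on whitespace is a simplification; punctuation attached to a word will keep it from matching the stop word list, so removing punctuation first can help.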

 

Scikit-learn Text processing

scikit-learn is a very popular package for machine learning. It can be used to build many models for supervised and unsupervised learning

One useful module for text preprocessing is sklearn.feature_extraction.text. We can use it to extract and count words from a document, build a vocabulary and more

For example, given some documents, we first build a vocabulary from them and then transform them into a matrix of word counts
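A sketch of these steps using CountVectorizer from sklearn.feature_extraction.text. The original documents are not shown, so the four documents below are made up, chosen so that the word indices line up with the ones quoted in the text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, for illustration only
docs = [
    "Python tools",
    "machine learning models",
    "count text strings",
    "build learning features",
]

vectorizer = CountVectorizer()
vectorizer.fit(docs)                  # build the vocabulary
print(vectorizer.vocabulary_)         # maps each word to its column index

counts = vectorizer.transform(docs)   # sparse document-term matrix
print(counts)                         # (document, word index) and its count
```

By default CountVectorizer lowercases the text and assigns indices in alphabetical order, so here "python" gets index 6 and "tools" index 9.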

We can see from the results that in the first document we have words 6 and 9, in the second words 3, 4 and 5, and so on

To count each word across all documents, we convert the result to a NumPy array and sum it along axis 0:
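A self-contained sketch of that summation. The documents are made up for illustration; with them, the word at index 3 ("learning") appears twice and every other word once.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, for illustration only
docs = [
    "Python tools",
    "machine learning models",
    "count text strings",
    "build learning features",
]

counts = CountVectorizer().fit_transform(docs)  # sparse document-term matrix

# Convert to a dense NumPy array and add up each column (one column per word)
totals = counts.toarray().sum(axis=0)
print(totals)  # [1 1 1 2 1 1 1 1 1 1]
```

For a large corpus, the dense conversion can use a lot of memory; calling counts.sum(axis=0) directly on the sparse matrix avoids it.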

Word with index 0 ==> count = 1

Word with index 3 ==> count = 2

and so on

 

 

 

 

 
