This post shows how to improve accuracy of **character-level text classification** with the deep learning framework Keras, using ** n-gram extraction inside the model**:

## Our motivation

In a project on large-scale text classification, a colleague of mine significantly raised the accuracy of our Keras model by feeding it with bigrams and trigrams instead of single characters. For his experiments he could just modify the preprocessing and the model as he wished, but for production, it was much preferable to just replace the model being served by tensorflow and leave all other code unchanged. And that is what we did — move the bigram and trigram extraction into our neural network. In this blog post, I'll show you

- the basic idea,
- the implementation,
- an application and
- the limitations

of our approach.

## The idea: *n*-gram extraction via convolution

Suppose we want to process the quote

*"I'd far rather be happy than right any day"*

of Douglas Adams. Instead of looking at the text as a sequence of characters

`I`

`'`

`d`

` `

`f`

`a`

`r`

` `

`r`

`a`

`t`

`h`

`e`

`r`

…

a neural network may profit from looking at pairs of adjacent characters, that is, at the sequence of bigrams

`I'`

`'d`

`d `

` f`

`fa`

`ar`

`r `

` r`

`ra`

`at`

`th`

`he`

`er`

…

or even trigrams or *n*-grams for *n* larger 3. To feed the neural network, we need to convert characters into numbers, for example, using the ASCII or UTF-8 codes,

`73`

`39`

`100`

`32`

`102`

`97`

`114`

`32`

`114`

`97`

`116`

`104`

`101`

`114`

…

Our bigrams then become sequences of pairs of numbers:

`73, 39`

`39, 100`

`100, 32`

`32, 102`

`102, 97`

…

If we encode these bigrams using the rule

*(a, b) ↦ N · a + b,*

where *N* is the size of our alphabet, we obtain a sequence of numbers again: in case *N=256*, this would be

`73*256+39=18727`

`39*256+100=10084`

`100*256+32=25632`

`32*256+102=8294`

…

More generally, we can encode *n*-grams for arbitrary *n* using the rule

*(a _{0}, …,a_{n-1}) ↦ N^{n-1} · a_{1} + N^{n-2} · a_{2} + … + N · a_{n-2} + a_{n-1}.*

Here comes the key observation: with this encoding rule,

extractingn-grams becomes a convolution of the sequence of character codes with the kernel(1,N, …, N.^{n-1})

And this preprocessing step can easily be inserted as a first step into any character-level text-processing neural network.

## The implementation

As a warm-up, let us implement the *n*-gram extraction as a convolution with NumPy. Given a NumPy array of character codes, the *n*-gram length *n* and the size of the alphabet *N*, the following function returns the sequence of encoded *n*-grams as an array:

import numpy as np def ngrams_numpy(array, n, alphabet_size): kernel = np.power(alphabet_size, range(0, n)) return np.convolve(array, kernel, mode='valid')

Next, how about the deep learning library Keras? Suppose we already have a working text-processing model whose input are (batches of) sequences of character codes. Then we can add bigram or *n*-gram extraction as a first layer using a lambda layer in one line. Indeed, given a batch of samples in form of a tensor of shape (`batch_size`

, `sample_length`

), the following function returns a batch of encoded bigrams in form of a tensor of shape (`batch_size`

, `sample_length - 1`

):

from keras import layers def bigrams_lambda_layer(alphabet_size): convolve = lambda x: x[:,:-1] + x[:,1:] * alphabet_size return layers.Lambda(convolve)

However, lambda layers in Keras may cause problems when saving, loading or checkpointing the model. For further deployment of a model, for example with tensorflow serving, it might be better to avoid a lambda layer and to use a 1d-convolutional layer with fixed weights as follows:

import numpy as np from keras import layers, backend def ngram_block(n, alphabet_size): def wrapped(inputs): layer = layers.Conv1D(1, n, use_bias=False, trainable=False) x = layers.Reshape((-1, 1))(inputs) x = layer(x) kernel = np.power(alphabet_size, range(0, n), dtype=backend.floatx()) layer.set_weights([kernel.reshape(n, 1, 1)]) return layers.Reshape((-1,))(x) return wrapped

This function can be used like a layer:

bigrams_tensor = ngram_block(2, alphabet_size)(input_tensor)

See also the source code for the experiment below. What this function does is

**create a 1d-convolutional layer**`layer`

with one feature map, window size*n*, zero bias vector and frozen weights that are not changed during training,**reshape the input**`inputs`

, which is a tensor of shape (`batch_size`

,`sample_length`

), to a tensor`x`

with shape (`batch_size`

,`sample_length`

, 1) (necessary because convolutional layers operate on sequences of vectors and not on sequences of scalars),**apply the convolutional layer**to the reshaped input,**set the kernel**of the convolutional layer and**reshape the output**of the convolutional layer from (`batch_size`

,`sample_length`

, 1) to (`batch_size`

,`sample_length`

) again.

## An experiment

Let us finally see how this idea works out for a classical test case, the 20 newsgroups dataset, where the task is to guess the topic of a given post from its text. We shall use a simple character-level convolutional network and see how *n*-gram extraction inside the model affects the classification accuracy and and training time.

To load the data, we use the datasets module of scikit-learn:

from sklearn.datasets import fetch_20newsgroups data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")) posts, topics = data["data"], data["target"]

Now `posts`

is a list of newsgroup posts as strings, and `topics`

is a list of numbers representing the respective newsgroup topics. For each topic, we have 350 to 600 samples:

Note that this is **way too little data for a character-level model to perform well**. But let us try nevertheless. We apply some minimal preprocessing and

- convert the characters to lower case,
- filter out all characters that are not contained in our chosen
`ALPHABET`

, - replace the remaining characters by their index in the
`ALPHABET`

, - trim the sequence of indices to a fixed length
`MAX_LEN`

, - stack all those sequences in one large NumPy array:

import numpy as np ALPHABET = "abcdefghijklmnopqrstuvwxyz1234567890 !$#()-=+:;,.?/" MAX_LEN = 1000 def encode_sample(sample, index): indices = np.array([index[char] for char in sample if char in index]) return np.resize(indices, MAX_LEN) index = {char: i + 1 for i, char in enumerate(ALPHABET)} X = np.stack([encode_sample(x.lower(), index) for x in posts]) y = np.eye(20)[topics]

Now `X`

is an array of shape (`len(posts)`

, `MAX_LEN`

), and `y`

is an array of shape (`len(posts)`

, 20) containing the one-hot encoded topics. As a baseline, we train a simple convolutional model:

from keras import layers, models, optimizers LAYER_PARAMS = [[64, 3, 3], [128, 3, 3]] EMBEDDING_DIM = 16 def build_model(): inputs = layers.Input(shape=(MAX_LEN,)) x = layers.Embedding(len(ALPHABET), EMBEDDING_DIM)(inputs) for filters, kernel_size, pool_size in LAYER_PARAMS: x = layers.Conv1D(filters, kernel_size, activation="relu")(x) x = layers.BatchNormalization()(x) x = layers.SpatialDropout1D(0.15)(x) x = layers.MaxPooling1D(pool_size)(x) x = layers.GlobalAveragePooling1D()(x) x = layers.Dense(20, activation="softmax")(x) model = models.Model(inputs=inputs, outputs=x) model.compile(optimizer=optimizers.Adadelta(), loss="categorical_crossentropy", metrics=["acc"]) return model model = build_model() history = model.fit(X, y, epochs=60, batch_size=20, validation_split=0.2) import json with open('baseline.json', 'w') as file: json.dump(history.history, file)

The results are quite poor — the validation accuracy reaches just 60 percent:

By careful tuning of hyperparameters, things certainly could be improved a bit. Now let us see how bigram and trigram extraction will affect performance of the model. Using the function `ngram_block`

, we only need to insert the line `x = ngram_block(n, size)(inputs)`

between the `Input`

and `Embedding`

layers in `build_model`

as follows:

def build_ngram_model(n): inputs = layers.Input(shape=(MAX_LEN,)) x = ngram_block(n, len(ALPHABET))(inputs) x = layers.Embedding(pow(len(ALPHABET), n), n * EMBEDDING_DIM)(x) for filters, kernel_size, pool_size in LAYER_PARAMS: x = layers.Conv1D(filters, kernel_size, activation="relu")(x) x = layers.BatchNormalization()(x) x = layers.SpatialDropout1D(0.05 + 0.1 * n)(x) x = layers.MaxPooling1D(pool_size)(x) x = layers.GlobalAveragePooling1D()(x) x = layers.Dense(20, activation="softmax")(x) model = models.Model(inputs=inputs, outputs=x) model.compile( optimizer=optimizers.Adadelta(), loss="categorical_crossentropy", metrics=["acc"], ) return model

We also raised the embedding dimension (because now we want to embed bigrams and trigrams instead of single characters) and use an adaptive spatial dropout rate. Let us see how the *n*-gram model performs:

for n in range(1, 4): build_ngram_model(n).fit(X, y, epochs=40, batch_size=20, validation_split=0.2)

The training histories show that the *n*-gram extraction yields a significant improvement:

Indeed, the mean validation accuracy of the last 5 training epochs increased by **more than 10 percent**:

n | 1 | 2 | 3 |
---|---|---|---|

mean validation accuracy |
0.5796 | 0.6401 | 0.7064 |

## Limitations of the technique

Why did we stop at tri-grams in the experiment above? The reason is that we do not only encode the *n*-grams that occur in our samples, but reserve codings for all *n*-grams that could possibly occur. And that makes a huge difference when *n* is growing larger:

n | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|

#(occuring n-grams) |
52 | 2,596 | 47,203 | 214,362 | 551,904 |

#(potential n-grams) |
51 | 2,601 | 132,651 | 6,765,201 | 345,025,251 |

And therefore, **the embedding layer will need memory increasing exponentially with n**. This is the reason why we stick to bigrams or trigrams. By the way, the numbers above where extracted as follows:

import pandas as pd def all_ngrams(n): length = MAX_LEN - n + 1 ngrams = lambda x: set(zip(*[x[i:length + i] for i in range(0, n)])) return set().union(*[ngrams(x) for x in X]) ns = range(1,6) alphabet_size = len(ALPHABET) counts = {'#(occuring n-grams)': [len(all_ngrams(n)) for n in ns], '#(potential n-grams)': [pow(alphabet_size, n) for n in ns]} pd.DataFrame(counts, index = pd.Index(ns, name='n')).transpose()