[FIXED] How to use word embedding and feature for text classification


I have a bunch of sentences that I am trying to classify. For each sentence, I generated a word embedding using word2vec. I also performed a cluster analysis which clustered the sentences into 3 separate clusters.

What I want to do is use the cluster id (1-3) as a feature for my model. However, I am not sure how to do this, and I can't find a good article that explains it clearly.

I was thinking I could create a one-hot encoding for the cluster id and then somehow combine the one-hot vector with the word embedding, but I am really not sure what to do here.

I already have a model that will take the word embedding and classify the sentence:


from sklearn import metrics, svm
from sklearn.model_selection import train_test_split

indices = filtered_products.index.values
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X, y, indices, test_size=0.3, random_state=428)

# degree and gamma have no effect with a linear kernel
clf = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')

DSVM = clf.fit(X_train, y_train)
prediction = DSVM.predict(X_test)

print(metrics.classification_report(y_test, prediction))

Here X is the word embedding and y is the category. I am just not sure how to add the cluster id as a feature.


Assuming you want to use TensorFlow, you can either one-hot encode the ids or map them to n-dimensional, randomly initialized trainable vectors using an Embedding layer. Here is an example with an Embedding layer, where each id is mapped to a 10-dimensional vector that is then repeated 50 times to match the maximum sentence length (so every word in a given input gets the same 10-dimensional vector). Afterwards, the two tensors are concatenated along the feature axis:

import tensorflow as tf

word_embedding_dim = 300
max_sentence_length = 50

word_embedding_input = tf.keras.layers.Input((max_sentence_length, word_embedding_dim))

id_input = tf.keras.layers.Input((1, ))
# input_dim must exceed the largest id: ids 1-3 need input_dim=4 (index 0 stays unused)
embedding_layer = tf.keras.layers.Embedding(4, 10) # or one-hot encode
x = embedding_layer(id_input)
# Drop the length-1 axis, then repeat the id vector once per word position
x = tf.keras.layers.RepeatVector(max_sentence_length)(x[:, 0, :])

output = tf.keras.layers.Concatenate()([word_embedding_input, x])
model = tf.keras.Model([word_embedding_input, id_input], output)
model.summary()

Model: "model_1"
 Layer (type)                                Output Shape         Param #   Connected to
 input_17 (InputLayer)                       [(None, 1)]          0         []
 embedding_3 (Embedding)                     (None, 1, 10)        40        ['input_17[0][0]']
 tf.__operators__.getitem (SlicingOpLambda)  (None, 10)           0         ['embedding_3[0][0]']
 input_16 (InputLayer)                       [(None, 50, 300)]    0         []
 repeat_vector_1 (RepeatVector)              (None, 50, 10)       0         ['tf.__operators__.getitem[0][0]']
 concatenate (Concatenate)                   (None, 50, 310)      0         ['input_16[0][0]', 'repeat_vector_1[0][0]']
Total params: 40
Trainable params: 40
Non-trainable params: 0

If each sample is a single sentence embedding rather than a 2D sequence of word embeddings, it is even simpler:

import tensorflow as tf

sentence_embedding_dim = 300

sentence_embedding_input = tf.keras.layers.Input((sentence_embedding_dim,))
id_input = tf.keras.layers.Input((1, ))
# input_dim must exceed the largest id: ids 1-3 need input_dim=4 (index 0 stays unused)
embedding_layer = tf.keras.layers.Embedding(4, 10) # or one-hot encode
x = embedding_layer(id_input)

output = tf.keras.layers.Concatenate()([sentence_embedding_input, x[:, 0, :]])
model = tf.keras.Model([sentence_embedding_input, id_input], output)

Here is a solution with numpy and sklearn for reference:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

samples = 10
word_embedding_dim = 300
max_sentence_length = 50

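# Random cluster ids in {1, 2, 3}, one per sample
ids = np.random.randint(low=1, high=4, size=(samples,)).reshape(-1, 1)
# Fix the categories so there are always 3 one-hot columns, even if an id is absent
enc = OneHotEncoder(categories=[[1, 2, 3]], handle_unknown='ignore')
ids = enc.fit_transform(ids).toarray()[:, None, :]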

X_train = np.random.random((samples, max_sentence_length, word_embedding_dim))

ids = np.repeat(ids, max_sentence_length, axis=1)
X_train = np.concatenate([X_train, ids], axis=-1)
# (10, 50, 303)
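
For the scikit-learn SVM from the question, where X holds one vector per sentence, the same idea reduces to appending the one-hot columns to X before fitting. The following is only a minimal sketch under assumptions: the random arrays stand in for the real embeddings, cluster ids and labels, and the names cluster_ids, X_train_aug and X_test_aug are made up for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(428)

# Placeholder data (assumption): 200 sentences, one 300-dim vector each,
# a cluster id in {1, 2, 3} and a binary category label per sentence.
n_samples, embedding_dim = 200, 300
X = rng.random((n_samples, embedding_dim))
cluster_ids = rng.integers(low=1, high=4, size=n_samples)
y = rng.integers(low=0, high=2, size=n_samples)

# Split the embeddings, labels and cluster ids together so rows stay aligned.
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, cluster_ids, test_size=0.3, random_state=428
)

# One-hot encode the cluster ids; fixed categories give exactly 3 columns.
enc = OneHotEncoder(categories=[[1, 2, 3]], handle_unknown="ignore")
ids_train_onehot = enc.fit_transform(ids_train.reshape(-1, 1)).toarray()
ids_test_onehot = enc.transform(ids_test.reshape(-1, 1)).toarray()

# Append the one-hot columns to each sentence vector: 300 + 3 = 303 features.
X_train_aug = np.hstack([X_train, ids_train_onehot])
X_test_aug = np.hstack([X_test, ids_test_onehot])

clf = svm.SVC(C=1.0, kernel="linear")
clf.fit(X_train_aug, y_train)
prediction = clf.predict(X_test_aug)
```

The three extra columns are 0/1 values, so if the embedding values live on a very different scale it may be worth standardizing the features before the SVM; whether that helps depends on the data.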

Answered By – AloneTogether

Answer Checked By – Pedro (Easybugfix Volunteer)
