The success of Natural Language Processing tasks generally depends on data representation, and representation learning built into predictive models is the prevailing trend in deep learning. Learning sentence representations that capture the full semantics of a document remains a challenge: a good semantic representation vector directly improves performance on the similar-question retrieval problem. In this paper, we implement a series of LSTM models with different strategies for extracting sentence representations and apply them to question retrieval, with the aim of exploiting the hidden semantics of sentences. These strategies derive the sentence representation from the hidden states of the LSTM model: using the last hidden state, max pooling, and mean pooling. The results show that combining max pooling and mean pooling gives the best performance on the SemEval 2017 dataset for the similar-question retrieval task.
Finding similar questions in community question answering (cQA) systems is a popular problem in Natural Language Processing that has recently received a lot of attention from researchers and industry. Many web forums, such as Stack Overflow (https://stackoverflow.com/) and Qatar Living (https://www.qatarliving.com/forum), are growing in popularity and flexibility in order to provide information to users [1]. Users can post questions and potentially receive multiple responses from others. The problem of finding similar questions is addressed so that users can automatically receive answers already given to existing questions. This is why it is necessary to create a tool that automatically finds questions related to a query question.
The task of finding similar questions is defined as follows: given a query question and a set of questions from the question archive, the goal is to rank the archived questions by their similarity to the query question.
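This ranking setup can be sketched as follows, assuming sentence vectors are already available; the function name and the cosine-similarity scoring are illustrative only, not the paper's method:

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs):
    """Rank archived questions by cosine similarity to the query question.

    Returns candidate indices sorted from most to least similar.
    The sentence vectors themselves would come from a representation model.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = [cos(query_vec, np.asarray(c)) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```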
Previous research [2] has indicated that the most difficult aspect of this problem is the lexical gap: two questions can have the same meaning even though the words and phrases in the first question differ from those in the second. Here is an example of two questions labeled as similar in the SemEval 2017 dataset [3,4]:
Question 1: Where I can buy good oil for massage?
Question 2: Hi there, I can see a lot of massage center here but I dont which one is better. Can someone help me which massage center is good…and how much will it cost me? Tks
These two questions have the same meaning but use words differently. Question 2 contains many explanatory words, a spoken tone, and several acronyms. A major challenge of this task lies in the complex and flexible semantic relationship between the query question and the candidate question. Furthermore, question 1 in the example above has 11 words, while question 2 uses 39 words; question 2 also contains phrases that are not directly related to the question itself. Moreover, the two questions share few common lexical units, which can confuse simple word-matching systems. These challenges make hand-crafted features much less attractive than deep learning approaches, which can learn to distinguish useful parts from irrelevant ones and to focus on the important words. The task is usually approached as a pairwise ranking problem, and the best strategy for capturing question-to-question associations remains an open research question. Previous work, such as [5-8], uses CNN or RNN models; however, CNNs emphasize local interactions within n-grams, while RNNs capture long-range information but tend to forget unimportant local information, compressing everything into the hidden vector of the final layer.
In this paper, we explore a series of strategies for learning sentence representations that address these weaknesses. We start with a basic LSTM model that uses the hidden vector at the last position as the sentence representation. We then build the sentence representation with max pooling and mean pooling over the hidden states of the LSTM network, and finally we evaluate a model that combines both max and mean pooling.
The rest of the paper is organized as follows: (II) related work; (III) proposed models; (IV) results and discussion; (V) conclusion.
Related Work
In recent years, many studies have been proposed to solve the problem of finding similar questions, with positive results, as follows:
Previous work on question retrieval often relied on hand-crafted features, linguistic tools, and external knowledge. For example, semantic features have been built on top of WordNet [7], combining semantically related words based on the semantic relations between words.
At SemEval 2017, the system that won the contest on this dataset used highly complex engineered features [8], such as tree kernels computed over syntactic parse trees. Another study exploits various similarity features, such as cosine and Euclidean measures of lexical, syntactic, and semantic distance [5], to represent sentence pairs for an SVM model.
Studies on answer selection [9-12] in cQA systems have achieved good results using neural networks without manually extracted features. These models learn sentence representations and then measure question-to-question and question-to-answer similarity [10].
Following this neural direction, we propose a series of representation-learning models to address the weaknesses above; the next section describes them in detail.
Proposed Models
Original LSTM Model: We first briefly review the LSTM model [13]. LSTM is a special type of recurrent neural network (RNN) for sequence data. At each position, LSTM uses several gate vectors to control how information flows along the sequence, which improves the modeling of long-range dependencies. Several variants of LSTM exist; we use the standard formulation. Let X = (x₁, x₂, ..., x_N) denote an input sequence, where xₖ ∈ Rᴸ (1 ≤ k ≤ N). These vectors are used to generate a d-dimensional hidden state hₖ as follows [11]:
iₖ = σ(Wⁱxₖ + Vⁱh₍ₖ₋₁₎ + bⁱ)
fₖ = σ(Wᶠxₖ + Vᶠh₍ₖ₋₁₎ + bᶠ)
oₖ = σ(Wᵒxₖ + Vᵒh₍ₖ₋₁₎ + bᵒ)
cₖ = fₖ ⊙ c₍ₖ₋₁₎ + iₖ ⊙ tanh(Wᶜxₖ + Vᶜh₍ₖ₋₁₎ + bᶜ)
hₖ = oₖ ⊙ tanh(cₖ)
(1)
where i, f, and o are the input, forget, and output gates respectively, ⊙ denotes element-wise multiplication, σ is the sigmoid function, and the matrices W, V and bias vectors b are parameters learned by the model.
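As a minimal sketch, the update equations (1) can be written in NumPy as follows; the parameter layout and function names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_k, h_prev, c_prev, params):
    """One LSTM step following Eq. (1).

    params maps each gate name ('i', 'f', 'o', 'c') to a tuple (W, V, b),
    where W is (d, L), V is (d, d), and b is (d,).
    """
    Wi, Vi, bi = params["i"]
    Wf, Vf, bf = params["f"]
    Wo, Vo, bo = params["o"]
    Wc, Vc, bc = params["c"]
    i = sigmoid(Wi @ x_k + Vi @ h_prev + bi)   # input gate
    f = sigmoid(Wf @ x_k + Vf @ h_prev + bf)   # forget gate
    o = sigmoid(Wo @ x_k + Vo @ h_prev + bo)   # output gate
    # cell state: keep part of the old memory, add gated new content
    c = f * c_prev + i * np.tanh(Wc @ x_k + Vc @ h_prev + bc)
    h = o * np.tanh(c)                         # hidden state
    return h, c
```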
Methods of Representing Sentences
Figure 1 shows how the sentence representation is obtained from the last hidden state in the matching problem.

Figure 1: The LSTM model uses the hidden vector at the last layer to represent the sentence
Figure 2 describes how the sentence representation is obtained with max pooling over the hidden states: for each dimension, we take the maximum value across all hidden states.

Figure 2: The LSTM model uses max pooling to get the sentence representation
Figure 3 below describes how the sentence representation is obtained with mean pooling over the hidden states: for each dimension, we compute the average value across all hidden states.

Figure 3: The LSTM model uses mean pooling to get the sentence representation
Finally, we concatenate the mean-pooled and max-pooled vectors to form the sentence representation.
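The four representation strategies can be sketched together as follows; the function and method names are ours:

```python
import numpy as np

def sentence_representation(H, method="mean_max"):
    """Build a sentence vector from LSTM hidden states H of shape (N, d).

    Strategies (names are ours):
      'last'     - hidden vector at the final position, h_N
      'max'      - element-wise maximum over all hidden states
      'mean'     - element-wise average over all hidden states
      'mean_max' - concatenation of mean and max vectors (shape 2d)
    """
    H = np.asarray(H)
    if method == "last":
        return H[-1]
    if method == "max":
        return H.max(axis=0)
    if method == "mean":
        return H.mean(axis=0)
    if method == "mean_max":
        return np.concatenate([H.mean(axis=0), H.max(axis=0)])
    raise ValueError(f"unknown method: {method}")
```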
The loss function is cross entropy [13] with L2 regularization:
L(W) = −(1⁄S) × Σᵢ₌₁^S [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)] + γ‖W‖²
(2)
where S is the number of question pairs in the training set, yᵢ and pᵢ are the gold label and predicted similarity probability for pair i, γ is the model's tuning parameter, and W is the set of the model's weight matrices.
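A minimal sketch of such a regularized cross-entropy loss; the variable names are ours:

```python
import numpy as np

def loss(p, y, weights, gamma=1e-4):
    """Binary cross-entropy over S question pairs plus L2 regularization.

    p: predicted similarity probabilities, shape (S,)
    y: gold labels in {0, 1}, shape (S,)
    weights: list of model weight matrices W; gamma is the tuning parameter.
    """
    eps = 1e-12  # avoid log(0)
    ce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    l2 = gamma * sum(np.sum(W ** 2) for W in weights)
    return ce + l2
```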
Dataset
We use the SemEval 2017 dataset [11] to evaluate the proposed models. The data is taken from the Qatar Living forum, a forum where foreigners living in Qatar discuss all kinds of issues. The dataset is labeled and split into a training set, a development set, and a test set. Table 1 lists the number of question pairs in each split.
Table 1: Number of Question Pairs in the SemEval 2017 Dataset [11]
Split | SemEval 2017 |
Training set | 3170 |
Development set | 700 |
Test set | 880 |
We use MAP and MRR [9] to evaluate the effectiveness of the proposed models; MAP is computed as follows:
MAP = (1⁄|N|) × Σⱼ₌₁^|N| [(1⁄mⱼ) × Σₖ₌₁^|mⱼ| Precision(Rⱼₖ)]
(3)
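As a minimal sketch, MAP (Eq. 3) and MRR can be computed from ranked 0/1 relevance lists; the function names are ours:

```python
def average_precision(relevance):
    """AP for one query: relevance is a list of 0/1 in ranked order."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / rank   # precision at this relevant position
    return score / hits if hits else 0.0

def mean_average_precision(rankings):
    """MAP: mean of per-query average precision."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

def mean_reciprocal_rank(rankings):
    """MRR: average of 1/rank of the first relevant result per query."""
    total = 0.0
    for r in rankings:
        for rank, rel in enumerate(r, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(rankings)
```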
Model Parameters
We feed 300-dimensional GloVe embeddings into the model at the input layer; out-of-vocabulary (OOV) words are randomly initialized. The hidden state of the LSTM model is set to 400 dimensions. We use the Adam optimizer with a learning rate of 0.0001, γ set to 0.0001, a batch size of 64, and a drop-out rate of 30%. The model is implemented in TensorFlow and run on Google Colab. We evaluate the model on the development set, select the best parameters there, and then report results on the test set.
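The hyperparameters above can be collected into a single configuration; the key names and the exact GloVe file are our assumptions:

```python
# Hyperparameters reported in the paper (key names are ours).
config = {
    "embedding": "glove.300d",   # 300-dim GloVe; exact file is an assumption
    "embedding_dim": 300,
    "hidden_dim": 400,           # LSTM hidden state size
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "gamma": 1e-4,               # L2 regularization weight
    "batch_size": 64,
    "dropout": 0.3,              # 30% drop-out
}
```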
Results
Table 2 shows the test results of the models. When using the max or mean pooling techniques, the MAP measure increases from 40.03% to 40.50% and 40.51%, respectively. This shows that synthesizing the sentence representation vector from all hidden states exploits more of the sentence's semantic information than using only the final hidden state. Moreover, when combining mean and max pooling, MAP rises to 41.07%: concatenating the mean- and max-pooled vectors preserves more summary information about the sentence, which improves the model's predictions.
Table 2: Results of the Proposed Model
Model | MAP |
LSTM, last hidden state | 40.03 |
LSTM, max pooling | 40.50 |
LSTM, mean pooling | 40.51 |
LSTM, mean + max pooling | 41.07 |
Conclusion
In this paper, we have proposed LSTM models with different techniques for summarizing sentence representations for the similar-question retrieval problem. Experimentally, we find that combining the mean and max pooling strategies improves the prediction of similar questions. In the future, we will experiment with biLSTM and CNN models, combine models, and apply attention mechanisms to this problem.
Zhou, Guangyou et al. “Towards Faster and Better Retrieval Models for Question Search.” Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM ’13), Association for Computing Machinery, 2013, pp. 2139-2148.
Zhou, Guangyou et al. “Learning Continuous Word Embedding with Metadata for Question Retrieval in Community Question Answering.” Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM ’13), 2015, pp. 250-259.
Cai, Li et al. “Learning the Latent Topics for Question Retrieval in Community QA.” Proceedings of the 5th International Joint Conference on Natural Language Processing, Asian Federation of Natural Language Processing, November 2011, pp. 273-281.
Wu, Wei et al. “Question Condensing Networks for Answer Selection in Community Question Answering.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, July 2018, pp. 1746-1755.
Feng, Minwei et al. “Applying Deep Learning to Answer Selection: A Study and an Open Task.” IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015.
Wang, Di and Eric Nyberg. “A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering.” Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015.
Yih, Wen-Tau et al. “Question Answering Using Enhanced Lexical Semantic Models.” Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), 2013.
Robertson, Stephen et al. “Okapi at TREC-3.” Overview of the Third Text Retrieval Conference (TREC-3), January 1995.
Cao, Xin et al. “The Use of Categorization Information in Language Models for Question Retrieval.” Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM ’09), Association for Computing Machinery, 2009, pp. 265-274.
Blei, David M. et al. “Latent Dirichlet Allocation.” Advances in Neural Information Processing Systems 14, edited by T.G. Dietterich, S. Becker, and Z. Ghahramani, MIT Press, 2002, pp. 601-608.
Nakov, Preslav et al. “SemEval-2017 Task 3: Community Question Answering.” Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, August 2017, pp. 27-48.
Filice, Simone et al. “KeLP at SemEval-2017 Task 3: Learning Pairwise Patterns in Community Question Answering.” Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, August 2017, pp. 326-333.