Soft Computing Techniques for Duplicate Question Detection in Transliterated Bilingual Data
By Seema Rani
About this ebook
With the increased penetration of the Internet, social networking websites have become an
integral and indispensable part of our lives. Social networks make sharing information,
communicating and collaborating straightforward and convenient. Social media websites have
grown significantly in popularity over the last decade as key open platforms for general
information and knowledge sharing. Social media news feeds and question answering sites are
increasingly popular and valuable resources for enriching the knowledge base. The
teaching-learning process is now strongly influenced by the emerging role of social media,
which cannot be ignored. Increased accessibility of the internet and ubiquitous networks are
major factors changing the dynamics of the pedagogical and learning ecosystem. Community
question answering (CQA), a crowd-sourced service, has emerged as a collective-intelligence
social system in which volunteers share their knowledge and resolve their doubts on a topic.
The alternative perspectives it offers promote receptiveness in sharing and learning,
interaction and collaboration, which are the advantages of intensive use of a typical Q&A
website as an open source of information. On the flip side, it is a laborious and
time-consuming task to separate out semantically duplicate information, the best answers,
semantically matched questions and experts for a better user experience. Although these Q&A
forums provide instant information, the influx of questions and answers brings longer
response times and compromised answer quality.
Furthermore, semantically duplicate content undermines the mechanisms employed for filtering.
The focus has therefore shifted from the issue of 'information overload' to the problem of
'filter failure'. Building intelligent, efficient semantic filtering solutions that can adjust
and realign responses and offer options according to the user's interests has become pivotal.
TABLE OF CONTENTS
Table of Contents
List of Abbreviations
List of Figure(s)
List of Table(s)
CHAPTER 1: INTRODUCTION AND OUTLINE
1.1 Introduction
1.2 Semantically Equivalent Text
Semantic Relation Identification
1.3 Duplicate Data on Community Question-Answering sites
Duplicate Mono-lingual Questions
Duplicate Multi-lingual Questions
Duplicate Transliterated Questions
Issues of Duplicate Questions
1.4 Techniques to Detect Duplicate Text
Soft Computing Techniques
Fuzzy Logic
Bayesian Network
Classification
Evolutionary Algorithms
Neural Networks
1.5 Phases of Duplicate Question Detection using soft Computing
Data collection
Pre-processing
Feature Engineering
Semantic Similarity measures
Classification and Evaluation
1.6 Challenges in Duplicate Question Detection
1.7 Organization of the Thesis
CHAPTER 2: RELATED WORK
2.1 Introduction
2.2 Duplicate Short Text Detection
2.3 Duplicate Detection in Multilingual or code-mixed Text
2.4 Duplicate Question Detection on CQA
2.5 Duplicate Detection of Multilingual Questions
2.6 Duplicate Detection of Transliterated Bi-lingual Questions
2.7 Summary of literature review
2.8 Conclusion
CHAPTER 3: PROBLEM STATEMENT FORMULATION
3.1 Introduction
3.2 Origin of the Problem
3.3 Gaps in Present Work
3.4 Problem Statement and Research Objectives
3.5 Research Methodology
3.6 Conclusion
CHAPTER 4: PROPOSED DQDHINGLISH MODEL
4.1 Introduction
4.2 Proposed DQDHinglish Model for Detecting Duplicate Question in Hinglish Pair
4.3 Language Transforming Module
4.4 Module for Semantic Matching
4.5 Dataset
4.6 Semantic Matching using Siamese MLP
Language Transformation
Semantic Matching
4.7 Experimental Requirements For DQDHinglish
4.8 Result Analysis and discussions
4.9 Conclusion
CHAPTER 5: HYBRID DEEP NEURAL APPROACH FOR DQDHINGLISH
5.1 Introduction
5.2 Semantic Matching using Siamese LSTM+MLP
Language Transformation
Semantic Matching Module
5.3 Performance of Proposed Methodology
5.4 Discussion
5.5 Conclusion
CHAPTER 6: SIAMESE CAPSULE NETWORK FOR DQDHINGLISH
6.1 Introduction
6.2 Semantic Matching using Siamese Capsule Network
Language Transformation
Semantic Matching
6.3 Result Analysis and Discussions
Performance on various type of questions
Performance with distinct similarity measures
6.4 Conclusion
CHAPTER 7: DQD USING SUPPORT VECTOR MACHINES (SVM)
7.1 Introduction
7.2 Dataset
7.3 Feature Engineering
7.4 Semantic matching
7.5 Results and discussions
7.6 Conclusion
CHAPTER 8: CONCLUSION AND FUTURE SCOPE
8.1 Introduction
8.2 Overview of Thesis
8.3 Conclusion of Research
8.4 Contribution of the Work
8.5 Future Research Directions
REFERENCES
LIST OF ABBREVIATIONS
CQA : Community Question Answering
NLP : Natural Language Processing
Tf-idf : Term frequency-Inverse document frequency
RNN : Recurrent Neural Network
CNN : Convolutional Neural Network
LSTM : Long Short-Term Memory
BERT : Bidirectional Encoder Representations from Transformers
SC : Soft Computing
ML : Machine Learning
SVM : Support Vector Machine
AI : Artificial Intelligence
ES : Evolution Strategies
EDA : Estimation of Distribution Algorithms
DE : Differential Evolution
GA : Genetic Algorithm
MOEA : Multi-objective Evolutionary Algorithms
MA : Memetic Algorithms
GP : Genetic Programming
LCS : Learning Classifier Systems
ANN : Artificial Neural Network
BOW : Bag of Words
ROC : Receiver Operating Characteristic
AUC : Area Under the Curve
CKY : Cocke-Kasami-Younger
SICK : Sentences Involving Compositional Knowledge
MSRP : Microsoft Research Paraphrase Corpus
STS : Semantic Textual Similarity
GRU : Gated Recurrent Unit
LR : Linear Regression
IDF : Inverse Document Frequency
SIS : Semantic Information Space
Bi-LSTM : Bi-directional LSTM
Bi-GRU : Bi-directional GRU
OHNLP : Open Health Natural Language Processing
DQG : Duplicate Question Generation
WS-TB : Weak Supervision-Title Body
RCNN : Region-based Convolutional Neural Network
SNLI : Stanford Natural Language Inference
AMAN : Adaptive Multi-Attention Network
AeQQP : Answer-enhanced Question-Question Pair
MLP : Multilayer Perceptron
PCQA : Programming Community Question Answering
CHAPTER 1
INTRODUCTION AND OUTLINE
1.1 Introduction
With the increased penetration of the Internet, social networking websites have become an integral and indispensable part of our lives. Social networks make sharing information, communicating and collaborating straightforward and convenient. Social media websites have grown significantly in popularity over the last decade as key open platforms for general information and knowledge sharing. Social media news feeds and question answering sites are increasingly popular and valuable resources for enriching the knowledge base. The teaching-learning process is now strongly influenced by the emerging role of social media, which cannot be ignored. Increased accessibility of the internet and ubiquitous networks are major factors changing the dynamics of the pedagogical and learning ecosystem [1], [2]. Community question answering (CQA), a crowd-sourced service, has emerged as a collective-intelligence social system in which volunteers share their knowledge and resolve their doubts on a topic. The alternative perspectives it offers promote receptiveness in sharing and learning, interaction and collaboration, which are the advantages of intensive use of a typical Q&A website as an open source of information. On the flip side, it is a laborious and time-consuming task to separate out semantically duplicate information, the best answers, semantically matched questions and experts for a better user experience [3]. Although these Q&A forums provide instant information, the influx of questions and answers brings longer response times and compromised answer quality. Furthermore, semantically duplicate content undermines the mechanisms employed for filtering. The focus has therefore shifted from the issue of ‘information overload’ to the problem of ‘filter failure’.
Building intelligent, efficient semantic filtering solutions that can adjust and realign responses and offer options according to the user’s interests has become pivotal. Users are usually unable to state their preferences with certainty. This, together with fuzziness in their concerns, duplication in queries, and the imprecision associated with the large and varied set of replies, is among the challenges that obstruct better information filtering systems [4].
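To make the notion of 'filter failure' on semantically duplicate questions concrete, the following minimal Python sketch compares questions by plain word overlap (Jaccard similarity). It is only an illustration and not the method developed in this work; the function names and the sample questions are assumptions introduced here. A surface-level filter of this kind catches near-verbatim repeats but gives a very low score to a paraphrase that asks the same thing in different words.

```python
import re


def tokens(question: str) -> set:
    """Lowercase a question and return its set of word tokens (illustrative helper)."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; 1.0 means the two word sets are identical."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty questions are treated as identical
    return len(ta & tb) / len(ta | tb)


# Hypothetical sample questions, chosen only for illustration.
q1 = "How can I learn Python quickly?"
q2 = "What is the fastest way to pick up Python?"  # same intent, different words
q3 = "How can I learn Python quickly"              # near-verbatim repeat of q1

print(jaccard(q1, q3))  # 1.0: the repeat is caught
print(jaccard(q1, q2))  # about 0.07: the paraphrase slips through the filter
```

The second pair shares only the word "python", so a lexical threshold would treat the two questions as unrelated even though a human reader would mark them as duplicates. This gap is what motivates semantic, rather than purely lexical, matching.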
Because they are public platforms, CQA sites receive queries not only from a wide variety of people around the world but also in languages other than English. This gives rise to bilingual or multilingual duplicate questions. The problem becomes harder with the frequent use of informal language or a mash-up of multiple languages within a sentence. Writing the words of one language using the alphabet of another within a sentence is a very common practice