Soft Computing Techniques for Duplicate Question Detection in Transliterated Bilingual Data
Ebook, 230 pages, 1 hour

About this ebook

With the increased penetration of the Internet, social networking websites have become an integral and indispensable part of our lives. Social networks make sharing information, communicating and collaborating straightforward and convenient. Over the last decade, social media websites have grown significantly in popularity as key open platforms for sharing general information and knowledge. Social media news feeds and question answering sites are increasingly popular and valuable resources for enriching the knowledge base. The teaching-learning process is now strongly influenced by the emerging role of social media, which cannot be ignored. Increased accessibility of the Internet and ubiquitous networks are major factors changing the dynamics of the pedagogical and learning ecosystem. Community question answering (CQA), as a crowd-sourced service, has emerged as a collective-intelligence social system in which volunteers share their knowledge and resolve their doubts on various topics. The alternative perspectives it offers promote openness in sharing and learning, interaction and collaboration, which are the advantages of intensive use of a typical Q&A website as an open source of information. On the flip side, it is a laborious and long-drawn-out task to identify semantically duplicate information, the best answers or semantically matched questions, and experts for a better user experience. Although these Q&A forums provide instant information, they suffer from higher response times and compromised answer quality as the influx of questions and answers grows.

Furthermore, semantically duplicate content undermines the mechanisms employed for filtering. The focus has therefore shifted from the problem of 'information overload' to that of 'filter failure'. Building intelligent, efficient semantic filtering solutions that can adjust and realign responses and offer options according to the user's interest has become pivotal.

Language: English
Release date: Aug 21, 2023
ISBN: 9798223296850

    Book preview

    Soft Computing Techniques for Duplicate Question Detection in Transliterated Bilingual Data - Seema Rani

    TABLE OF CONTENTS

    Table of Contents
    List of Abbreviations
    List of Figure(s)
    List of Table(s)

    CHAPTER 1: INTRODUCTION AND OUTLINE
    1.1  Introduction
    1.2  Semantically Equivalent Text
        Semantic Relation Identification
    1.3  Duplicate Data on Community Question-Answering Sites
        Duplicate Mono-lingual Questions
        Duplicate Multi-lingual Questions
        Duplicate Transliterated Questions
        Issues of Duplicate Questions
    1.4  Techniques to Detect Duplicate Text
        Soft Computing Techniques
        Fuzzy Logic
        Bayesian Network
        Classification
        Evolutionary Algorithms
        Neural Networks
    1.5  Phases of Duplicate Question Detection using Soft Computing
        Data Collection
        Pre-processing
        Feature Engineering
        Semantic Similarity Measures
        Classification and Evaluation
    1.6  Challenges in Duplicate Question Detection
    1.7  Organization of the Thesis

    CHAPTER 2: RELATED WORK
    2.1  Introduction
    2.2  Duplicate Short Text Detection
    2.3  Duplicate Detection in Multilingual or Code-mixed Text
    2.4  Duplicate Question Detection on CQA
    2.5  Duplicate Detection of Multilingual Questions
    2.6  Duplicate Detection of Transliterated Bi-lingual Questions
    2.7  Summary of Literature Review
    2.8  Conclusion

    CHAPTER 3: PROBLEM STATEMENT FORMULATION
    3.1  Introduction
    3.2  Origin of the Problem
    3.3  Gaps in Present Work
    3.4  Problem Statement and Research Objectives
    3.5  Research Methodology
    3.6  Conclusion

    CHAPTER 4: PROPOSED DQDHINGLISH MODEL
    4.1  Introduction
    4.2  Proposed DQDHinglish Model for Detecting Duplicate Question in Hinglish Pair
    4.3  Language Transforming Module
    4.4  Module for Semantic Matching
    4.5  Dataset
    4.6  Semantic Matching using Siamese MLP
        Language Transformation
        Semantic Matching
    4.7  Experimental Requirements for DQDHinglish
    4.8  Result Analysis and Discussions
    4.9  Conclusion

    CHAPTER 5: HYBRID DEEP NEURAL APPROACH FOR DQDHINGLISH
    5.1  Introduction
    5.2  Semantic Matching using Siamese LSTM+MLP
        Language Transformation
        Semantic Matching Module
    5.3  Performance of Proposed Methodology
    5.4  Discussion
    5.5  Conclusion

    CHAPTER 6: SIAMESE CAPSULE NETWORK FOR DQDHINGLISH
    6.1  Introduction
    6.2  Semantic Matching using Siamese Capsule Network
        Language Transformation
        Semantic Matching
    6.3  Result Analysis and Discussions
        Performance on Various Types of Questions
        Performance with Distinct Similarity Measures
    6.4  Conclusion

    CHAPTER 7: DQD USING SUPPORT VECTOR MACHINES (SVM)
    7.1  Introduction
    7.2  Dataset
    7.3  Feature Engineering
    7.4  Semantic Matching
    7.5  Results and Discussions
    7.6  Conclusion

    CHAPTER 8: CONCLUSION AND FUTURE SCOPE
    8.1  Introduction
    8.2  Overview of Thesis
    8.3  Conclusion of Research
    8.4  Contribution of the Work
    8.5  Future Research Directions

    REFERENCES

    LIST OF ABBREVIATIONS

    CQA : Community Question Answering
    NLP : Natural Language Processing
    Tf-idf : Term Frequency-Inverse Document Frequency
    RNN : Recurrent Neural Network
    CNN : Convolution Neural Network
    LSTM : Long Short-Term Memory
    BERT : Bidirectional Encoder Representations from Transformers
    SC : Soft Computing
    ML : Machine Learning
    SVM : Support Vector Machine
    AI : Artificial Intelligence
    ES : Evolution Strategies
    EDA : Estimation of Distribution Algorithms
    DE : Differential Evolution
    GA : Genetic Algorithm
    MOEA : Multi-objective Evolutionary Algorithms
    MA : Memetic Algorithms
    GP : Genetic Programming
    LCS : Learning Classifier Systems
    ANN : Artificial Neural Network
    BOW : Bag of Words
    ROC : Receiver Operating Characteristic
    AUC : Area under the Curve
    CKY : Cocke-Kasami-Younger
    SICK : Sentences Involving Compositional Knowledge
    MSRP : Microsoft Research Paraphrase Corpus
    STS : Semantic Text Similarity
    GRU : Gated Recurrent Unit
    LR : Linear Regression
    IDF : Inverse Document Frequency
    SIS : Semantic Information Space
    Bi-LSTM : Bi-directional LSTM
    Bi-GRU : Bi-directional GRU
    OHNLP : Open Health Natural Language Processing
    DQG : Duplicate Question Generation
    WS-TB : Weak Supervision-Title Body
    RCNN : Region-based Convolution Neural Network
    SNLI : Stanford Natural Language Inference
    AMAN : Adaptive Multi-Attention Network
    AeQQP : Answer-enhanced Question-Question Pair
    MLP : Multilayer Perceptron
    PCQA : Programming Community Question Answering

    CHAPTER 1

    INTRODUCTION AND OUTLINE

    1.1  Introduction

    With the increased penetration of the Internet, social networking websites have become an integral and indispensable part of our lives. Social networks make sharing information, communicating and collaborating straightforward and convenient. Over the last decade, social media websites have grown significantly in popularity as key open platforms for sharing general information and knowledge. Social media news feeds and question answering sites are increasingly popular and valuable resources for enriching the knowledge base. The teaching-learning process is now strongly influenced by the emerging role of social media, which cannot be ignored. Increased accessibility of the Internet and ubiquitous networks are major factors changing the dynamics of the pedagogical and learning ecosystem [1], [2]. Community question answering (CQA), as a crowd-sourced service, has emerged as a collective-intelligence social system in which volunteers share their knowledge and resolve their doubts on various topics. The alternative perspectives it offers promote openness in sharing and learning, interaction and collaboration, which are the advantages of intensive use of a typical Q&A website as an open source of information. On the flip side, it is a laborious and long-drawn-out task to identify semantically duplicate information, the best answers or semantically matched questions, and experts for a better user experience [3]. Although these Q&A forums provide instant information, they suffer from higher response times and compromised answer quality as the influx of questions and answers grows. Furthermore, semantically duplicate content undermines the mechanisms employed for filtering. The focus has therefore shifted from the problem of ‘information overload’ to that of ‘filter failure’. Building intelligent, efficient semantic filtering solutions that can adjust and realign responses and offer options according to the user’s interest has become pivotal. Users are usually unable to express their preferences with certainty; fuzziness in their concerns, duplication in inquiries, and the imprecision associated with the enormous number and variety of replies are among the challenges that obstruct better information filtering systems [4].

    Being public platforms, these CQA sites receive queries not only from a variety of individuals around the world but also in languages other than English. This gives rise to bilingual or multilingual duplicate questions. The problem becomes harder with the frequent use of informal language or a mash-up of multiple languages within a sentence. Writing words of a source language in the alphabet of another language within a sentence is a very common practice
