Abstract
This paper describes the system submitted to "SemEval-2019 Task 5" (Task B) for the English language, where we had to detect hate speech and then detect aggressive behaviour and its target audience on Twitter. There were two specific target audiences, immigrants and women. We were required to first detect whether a tweet contains hate speech, then whether the tweet shows aggressive behaviour, and finally whether the targeted audience is an individual or a group of people.
Link to full paper
Abstract
This paper describes the system submitted to "SemEval-2019 Task 6", where we had to detect offensive language in English tweets. We were required to first detect whether a tweet contains offensive content, then to find out whether the tweet was targeted against some individual, group or other entity, and finally to classify the targeted audience.
Link to full paper
Abstract
In the current work, we present a description of the system submitted to the WMT 2019 News Translation shared task. The system was created to translate news text from Lithuanian to English. To accomplish the given task, our system used a word-embedding-based Neural Machine Translation model to post-edit the outputs generated by a Statistical Machine Translation model. The current paper documents the architecture of our model, describes its various modules, and reports the results it produced. Our system garnered a BLEU score of 17.6.
Link to full paper
Abstract
This paper describes the system submitted to the "Sentiment Analysis at SEPLN (TASS)-2019" shared task. The task involves sentiment analysis of Spanish tweets, where the tweets are in different dialects spoken in Spain, Peru, Costa Rica, Uruguay and Mexico. The tweets are short (up to 240 characters) and the language is informal, i.e., it contains misspellings, emojis, onomatopoeias etc. Sentiment analysis includes classification of the tweets into 4 classes, viz., Positive, Negative, Neutral and None. For preparing the proposed system, we use Deep Learning networks like LSTMs.
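A minimal sketch of the kind of LSTM classifier described above, assuming a Keras backend; the layer sizes, vocabulary limits, and toy data are illustrative assumptions, not the submitted system's exact configuration.

```python
# Minimal sketch of an LSTM-based 4-class tweet sentiment classifier.
# Hyperparameters and preprocessing are illustrative assumptions.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS, MAX_LEN, N_CLASSES = 20000, 60, 4  # Positive, Negative, Neutral, None

model = Sequential([
    Embedding(MAX_WORDS, 128),              # learned word vectors
    LSTM(64),                               # sequence encoder
    Dense(N_CLASSES, activation="softmax")  # 4-way sentiment head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Usage: tokenize raw tweets, pad to a fixed length, then fit.
tweets = ["me encanta este lugar", "no me gusta nada"]  # toy examples
labels = np.array([0, 1])                               # toy labels
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(tweets)
X = pad_sequences(tokenizer.texts_to_sequences(tweets), maxlen=MAX_LEN)
model.fit(X, labels, epochs=1, verbose=0)
```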
Link to full paper
Abstract
This paper describes the system submitted to the "Humor Analysis based on Human Annotation (HAHA)-2019" shared task. The task is divided into two sub-tasks, which include detection of humour in Spanish tweets and prediction of a humour score for the same. The tweets are short (up to 240 characters) and the language is informal, i.e., it contains spelling mistakes, emojis, emoticons, onomatopoeias etc. Humour detection includes classification of the tweets into 2 classes, viz., Humorous and Not Humorous. For preparing the proposed system, I use Deep Learning networks like LSTMs.
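Since the two sub-tasks share one input, a natural layout is a shared LSTM encoder with two heads, one binary and one regression. This is a hedged sketch of that idea; the shared-encoder design and all sizes are assumptions for illustration.

```python
# Sketch: one shared LSTM encoder with two heads, matching the two HAHA
# sub-tasks (binary humour detection + humour-score regression).
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense

tokens = Input(shape=(60,), name="tokens")              # padded token ids
h = Embedding(20000, 128)(tokens)
h = LSTM(64)(h)
is_humor = Dense(1, activation="sigmoid", name="is_humor")(h)  # sub-task 1
score = Dense(1, activation="linear", name="score")(h)         # sub-task 2

model = Model(tokens, [is_humor, score])
model.compile(optimizer="adam",
              loss={"is_humor": "binary_crossentropy", "score": "mse"})
```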
Link to full paper
Abstract
Most text-simplification systems require an indicator of the complexity of words. The prevalent approaches to word difficulty prediction are based on manual feature engineering, while deep learning based models have largely been left unexplored due to their comparatively poor performance. In this paper we explore the use of one such model for predicting the difficulty of words. We treat the problem as a binary classification problem. One of our primary aims was to remove the dependency on the frequency of previously acquired words for measuring difficulty. We first train traditional machine learning models and evaluate their performance on the task, and then analyze a convolutional neural network based prediction model which operates at the character level, evaluating its efficiency against the others.
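A hedged sketch of what a character-level CNN for this binary task can look like; the character encoding, filter sizes, and word-length cap are assumptions, not the paper's exact model.

```python
# Sketch of a character-level CNN for binary word-difficulty classification.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

MAX_WORD_LEN, N_CHARS = 20, 128   # pad words to 20 chars; byte-level vocabulary

model = Sequential([
    Embedding(N_CHARS, 16),             # one vector per character
    Conv1D(64, 3, activation="relu"),   # character-trigram detectors
    GlobalMaxPooling1D(),               # word-level feature vector
    Dense(1, activation="sigmoid"),     # difficult vs. not difficult
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

def encode(word: str) -> list[int]:
    """Map a word to a fixed-length sequence of character codes."""
    codes = [min(ord(c), N_CHARS - 1) for c in word[:MAX_WORD_LEN]]
    return codes + [0] * (MAX_WORD_LEN - len(codes))
```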
Link to full paper
Abstract
Text simplification is one of the domains in Natural Language Processing which offers great promise for exploration. Simplified sentences also offer better results in many language processing applications, as compared to complex/compound sentences. Recently, Neural Networks have been used for simplifying texts, be it by state-of-the-art LSTM and GRU cells or by Reinforcement Learning models. In contrast, in this work, we present a classical approach consisting of two separate algorithms for simplifying complex and compound sentences into their corresponding simple forms.
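To make the idea concrete, here is a toy illustration of splitting a compound sentence at a coordinating conjunction; the paper's actual algorithms are more involved (clause detection, subject copying, etc.), so treat this only as an intuition sketch.

```python
# Toy heuristic: split a compound sentence at its first coordinating
# conjunction into two simple sentences. Not the paper's algorithm.
import re

COORDINATORS = r"\b(and|but|or|so|yet)\b"

def split_compound(sentence: str) -> list[str]:
    """Split on a coordinating conjunction that joins two clauses."""
    parts = re.split(COORDINATORS, sentence, maxsplit=1)
    if len(parts) == 3:                       # [left, conjunction, right]
        left, _, right = parts
        return [left.strip().rstrip(",") + ".", right.strip().capitalize()]
    return [sentence]

print(split_compound("The sun was setting, and the birds flew home."))
# -> ['The sun was setting.', 'The birds flew home.']
```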
Link to full paper
Abstract
The identification of Hate Speech in Social Media has received much attention in research recently, with an ever-growing demand particularly for research in languages other than English. The Hate Speech and Offensive Content (HASOC) track has created resources for Hate Speech Identification in three different languages, namely Hindi, German, and English. We have participated in both Sub-tasks A and B of the 2020 shared task on hate speech and offensive content identification in Indo-European languages. Our approach relies on a combined model: a multilingual RoBERTa (Robustly Optimized BERT Pretraining Approach) model with pre-trained vectors, and a Random Forest model using Word2Vec, TF-IDF, and other textual features as input. Our system achieved a maximum Macro F1-score of 50.28% on English Sub-task A, which is quite satisfactory relative to the performance of other systems, and secured 8th position among the participating teams.
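A hedged sketch of the Random Forest half of the pipeline: TF-IDF features concatenated with averaged Word2Vec vectors. Dimensions, training settings, and the toy corpus are illustrative assumptions.

```python
# Sketch: TF-IDF + averaged Word2Vec features feeding a Random Forest.
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["you are awful", "have a nice day"]   # toy corpus
labels = [1, 0]                                # 1 = hateful/offensive

tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(texts).toarray()

w2v = Word2Vec([t.split() for t in texts], vector_size=100, min_count=1)

def avg_vector(text):
    """Average the Word2Vec vectors of a text's tokens."""
    vecs = [w2v.wv[w] for w in text.split() if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(100)

X = np.hstack([X_tfidf, np.vstack([avg_vector(t) for t in texts])])
clf = RandomForestClassifier(n_estimators=200).fit(X, labels)
```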
Link to full paper
Abstract
Clustering is an unsupervised learning problem in the domain of machine learning and data science, where information about data instances may or may not be given. The K-Means algorithm is one such clustering algorithm whose use is widespread. At the same time, K-Means suffers from a few disadvantages such as low accuracy and a high number of iterations. In order to rectify these problems, a modified K-Means algorithm, named the K-RMS clustering algorithm, is demonstrated in the present work. The modifications have been made so that accuracy increases with fewer iterations, and the algorithm performs especially well on decimal data compared to K-Means. The modified algorithm has been tested on 12 datasets obtained from the UCI web archive, and the results gathered are very promising.
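The abstract does not spell out the modification, so the sketch below is only an assumed reading of the name "K-RMS": K-Means with the centroid update replaced by a per-cluster root mean square. The paper's exact changes may differ.

```python
# Sketch of a K-Means variant with a root-mean-square centroid update.
# NOTE: "RMS update" is an assumption about K-RMS; it is only sensible
# for non-negative feature values.
import numpy as np

def k_rms(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # random init
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # RMS centroid update (assumption): sqrt(mean(x^2)) per cluster.
        new = np.array([np.sqrt((X[labels == j] ** 2).mean(axis=0))
                        if (labels == j).any() else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```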
Link to full paper
Abstract
This paper presents a method that applies Natural Language Processing to normalize numeronyms and make them understandable to humans. We deal with the problem using two approaches, viz., a semi-supervised approach and a supervised approach. For the semi-supervised approach, we make use of the Damerau-Levenshtein distance between words and then apply Cosine Similarity to select the normalized text, reaching greater accuracy in solving the problem. For the supervised approach, we use a deep learning architecture. Our approaches garner accuracy figures of 71% and 72% for Bengali and English respectively (for the semi-supervised approach), and 89% for the supervised approach.
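A hedged sketch of the semi-supervised idea: rank lexicon candidates by Damerau-Levenshtein distance, then pick the winner by cosine similarity over character n-grams. The digit-expansion table, lexicon, and candidate cutoff are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: Damerau-Levenshtein candidate generation + cosine selection.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

DIGIT_WORDS = {"2": "to", "4": "for", "8": "eight"}   # partial, illustrative

def expand_digits(token: str) -> str:
    """Replace digits with an assumed pronunciation, e.g. 'gr8' -> 'greight'."""
    return "".join(DIGIT_WORDS.get(c, c) for c in token)

def dl_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def normalize(numeronym: str, lexicon: list[str]) -> str:
    candidates = sorted(lexicon, key=lambda w: dl_distance(numeronym, w))[:5]
    vec = CountVectorizer(analyzer="char", ngram_range=(2, 3))
    m = vec.fit_transform([numeronym] + candidates)
    sims = cosine_similarity(m[0], m[1:]).ravel()
    return candidates[int(sims.argmax())]

print(normalize(expand_digits("gr8"), ["great", "grate", "grade"]))
# -> 'great' (with this toy lexicon)
```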
Link to full paper
Abstract
For any topic, its factuality can be defined as the category that determines the status of events with respect to the certainty with which they are presented. The first edition of the FACT task mainly focused on determining the factuality of verb-based events. The present edition is aimed at identifying noun-based events and determining the factuality of all events, be they verbs or nouns. We have participated in Subtask-1 of the FACT 2020 task, which is to automatically propose a factual tag for each event in the text. In this paper we present a method which extracts various features like BERT embeddings, Word2Vec embeddings and TF-IDF (Term Frequency-Inverse Document Frequency) scores of commonly recurring words, along with other manually extracted features, and passes them through an SVM (Support Vector Machine) classifier for classification. Our system has achieved an F1-score of 36.6% and an accuracy of 59.9%, which is quite satisfactory relative to the performance of other systems.
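A hedged sketch of the classification stage: concatenate sentence-level BERT embeddings with TF-IDF features and classify with an SVM. The sentence-transformers library and the "bert-base-nli-mean-tokens" checkpoint are assumed stand-ins for the paper's BERT embedding step, and the toy data is illustrative.

```python
# Sketch: BERT embeddings + TF-IDF features -> SVM factuality classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sentence_transformers import SentenceTransformer

events = ["The meeting took place.", "He may attend the ceremony."]
tags = ["Fact", "Undefined"]                    # toy factuality tags

bert = SentenceTransformer("bert-base-nli-mean-tokens")
X_bert = bert.encode(events)                    # (n, 768) sentence vectors
X_tfidf = TfidfVectorizer().fit_transform(events).toarray()

X = np.hstack([X_bert, X_tfidf])                # combined feature matrix
clf = SVC(kernel="rbf").fit(X, tags)
```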
Link to full paper
Abstract
An Optical Character Recognition (OCR) system is used to convert document images, either printed or handwritten, into their electronic counterparts. Dealing with handwritten texts is much more challenging than printed ones due to the erratic writing styles of individuals. The problem becomes more severe when the input image is a doctor's prescription. Before feeding such an image to the OCR engine, classifying the printed and handwritten texts is a necessity, as a doctor's prescription contains both handwritten and printed texts which are to be processed separately. Much work has been done in the domain of handwritten and printed text separation, albeit little of it related to doctors' handwriting. In this paper, a method is proposed which first localizes the positions of texts in a doctor's prescription, and then separates the printed texts from the handwritten ones. Due to the unavailability of a large database, we have used some standard data (image) augmentation techniques to evaluate as well as to prove the robustness of our method. Besides, we have also designed a Graphical User Interface (GUI) so that anybody can visualize the output by providing a prescription image as input.
Link to full paper
Abstract
Finding a suitable hotel based on a user's needs and affordability is a complex decision-making process. Nowadays, the availability of an ample amount of online reviews made by customers helps us in this regard. This very fact gives us a promising research direction in the field of tourism, the hotel recommendation system, which also helps in improving the information processing of consumers. Real-world reviews may showcase different sentiments of the customers towards a hotel, and each review can be categorized based on different aspects such as cleanliness, value, service, etc. Keeping these facts in mind, in the present work, we have proposed a hotel recommendation system using Sentiment Analysis of the hotel reviews and aspect-based review categorization, which works on the queries given by a user. Furthermore, we have provided a new rich and diverse dataset of online hotel reviews crawled from Tripadvisor.com. We have followed a systematic approach which first uses an ensemble of three binary Bidirectional Encoder Representations from Transformers (BERT) classifiers, one each for positive–negative, neutral–negative, and neutral–positive sentiments, merged using a weight assigning protocol. We have then fed the pre-trained word embeddings generated by the BERT models, along with other textual features such as word vectors generated by Word2vec, TF–IDF of frequent words, subjectivity score, etc., to a Random Forest classifier. After that, we have also grouped the reviews into different categories using an approach that involves fuzzy logic and cosine similarity. Finally, we have created a recommender system from the aforementioned frameworks. Our model has achieved a Macro F1-score of 84% and a test accuracy of 92.36% in the classification of sentiment polarities. Also, the results of the categorized reviews have formed compact clusters. The results are quite promising and much better compared to state-of-the-art models.
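A hedged sketch of the three-phase binary merge described above: each binary model emits a probability pair, and per-class scores are combined with weights before the argmax. The weight values and combination rule here are placeholders, since the abstract does not specify the weight assigning protocol.

```python
# Sketch: merging three binary sentiment classifiers into a 3-way decision.
def merge(p_pos_neg, p_neu_neg, p_neu_pos, w=(1.0, 1.0, 1.0)):
    """Each argument is (P(first_label), P(second_label)) from one binary
    model, in the order the pair's name suggests; w holds the weight
    assigning protocol's weights (placeholder values)."""
    scores = {
        "positive": w[0] * p_pos_neg[0] + w[2] * p_neu_pos[1],
        "negative": w[0] * p_pos_neg[1] + w[1] * p_neu_neg[1],
        "neutral":  w[1] * p_neu_neg[0] + w[2] * p_neu_pos[0],
    }
    return max(scores, key=scores.get)

print(merge((0.9, 0.1), (0.4, 0.6), (0.2, 0.8)))  # -> 'positive'
```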
Link to full paper
Abstract
SemEval-2020 Task 12 was OffensEval: Multilingual Offensive Language Identification in Social Media (Zampieri et al., 2020). The task was subdivided into multiple languages and datasets were provided for each one. The task was further divided into three sub-tasks: offensive language identification, automatic categorization of offense types, and offense target identification. I participated in Sub-task C, that is, offense target identification. For preparing the proposed system, I made use of Deep Learning networks like LSTMs and frameworks like Keras, which combine the bag of words model with automatically generated sequence-based features and manually extracted features from the given dataset. My system, trained on 25% of the whole dataset, achieves a macro-averaged F1 score of 47.763%.
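A hedged sketch of combining a bag-of-words vector and hand-crafted features with a learned sequence representation, as the abstract describes; all sizes are illustrative, and the two-input fusion layout is my assumption about how the combination can be wired.

```python
# Sketch: fuse an LSTM sequence view with BoW + manual features (Keras).
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Concatenate

seq_in = Input(shape=(60,), name="token_ids")          # padded token ids
bow_in = Input(shape=(5000,), name="bow_and_manual")   # BoW + manual features

h_seq = LSTM(64)(Embedding(20000, 128)(seq_in))        # learned sequence features
h = Concatenate()([h_seq, bow_in])                     # fuse both views
out = Dense(3, activation="softmax")(h)                # IND / GRP / OTH targets

model = Model([seq_in, bow_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```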
Link to full paper
Abstract
Code-mixing is a phenomenon which arises mainly in multilingual societies. Multilingual people who are well versed in their native languages and are also English speakers tend to code-mix using English-based phonetic typing and the insertion of anglicisms in their main language. This linguistic phenomenon poses a great challenge to conventional NLP domains such as Sentiment Analysis, Machine Translation, and Text Summarization, to name a few. In this work, we focus on working out a plausible solution to the domain of Code-Mixed Sentiment Analysis. This work was done as participation in the SemEval-2020 Sentimix Task, where we focused on the sentiment analysis of English-Hindi code-mixed sentences. Our username for the submission was "sainik.mahata" and our team name was "JUNLP". We used feature extraction algorithms in conjunction with traditional machine learning algorithms such as SVR, tuned using Grid Search, in an attempt to solve the task. Our approach garnered an F1-score of 66.2% when tested using metrics prepared by the organizers of the task.
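A hedged sketch of the SVR + Grid Search idea: map sentiment labels to numbers, fit an SVR on TF-IDF features with hyperparameters tuned by GridSearchCV, and round predictions back to classes. The label mapping, feature choice, parameter grid, and toy data are all assumptions.

```python
# Sketch: TF-IDF -> SVR pipeline tuned with GridSearchCV.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

texts = ["bahut achhi movie", "worst movie ever", "theek thaak hai",
         "kamaal ka gaana", "bilkul bakwas", "average story hai"]
y = [1, -1, 0, 1, -1, 0]        # positive / negative / neutral as numbers

pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), SVR())
grid = GridSearchCV(pipe, {"svr__C": [0.1, 1, 10]}, cv=2)
grid.fit(texts, y)

pred = grid.predict(["bahut achhi hai"])
label = int(round(float(pred[0])))  # snap the regression output back to {-1, 0, 1}
```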
Link to full paper
Abstract
In this modern era, language has no geographic boundary. Therefore, for developing automated systems such as audio-based search engines, tele-medicine, or emergency services via phone, the first and foremost requirement is to identify the language. The fundamental difficulty of automatic speech recognition is that speech signals vary significantly across speakers, speech variations, languages, age- and sex-related voice modulations, contents, acoustic conditions and so on. In this paper, we propose a deep learning based ensemble architecture, called FuzzyGCP, for spoken language identification from speech signals. This architecture combines the classification principles of a Deep Dumb Multi Layer Perceptron (DDMLP), a Deep Convolutional Neural Network (DCNN) and a Semi-supervised Generative Adversarial Network (SSGAN) to maximize precision, and finally applies ensemble learning using the Choquet integral to predict the final output, i.e., the language class. We have evaluated our model on four standard benchmark datasets, comprising two Indic language datasets and two foreign language datasets. Irrespective of the languages, the F1-score of the proposed language identification model is as high as 98% on the MaSS dataset; the worst performance is 67% on the VoxForge dataset, which is still much better than the maximum of 44% achieved by state-of-the-art models on multi-class classification. The link to the source code of our model is available here.
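To make the ensemble step concrete, here is a sketch of the discrete Choquet integral used to fuse classifier confidences. The fuzzy measure below is a toy example; the paper defines or learns its own measure.

```python
# Sketch: Choquet-integral fusion of per-class classifier confidences.
def choquet(scores, mu):
    """scores: each classifier's confidence for one class.
    mu: fuzzy measure, mapping frozensets of classifier indices to [0, 1]."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending
    total, prev = 0.0, 0.0
    for k, i in enumerate(order):
        coalition = frozenset(order[k:])   # classifiers scoring >= scores[i]
        total += (scores[i] - prev) * mu[coalition]
        prev = scores[i]
    return total

# Toy measure over 3 classifiers (DDMLP, DCNN, SSGAN).
mu = {frozenset({0, 1, 2}): 1.0, frozenset({0, 1}): 0.7,
      frozenset({0, 2}): 0.8, frozenset({1, 2}): 0.6,
      frozenset({0}): 0.4, frozenset({1}): 0.3, frozenset({2}): 0.5}
print(choquet([0.9, 0.6, 0.7], mu))   # -> 0.76
```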
Link to full paper
Abstract
The outbreak of a global pandemic caused by the coronavirus has created unprecedented circumstances, resulting in a large number of deaths and a risk of community spreading throughout the world. Desperate times have called for desperate measures to detect the disease at an early stage via various medically proven methods like chest computed tomography (CT) scan, chest X-Ray, etc., in order to prevent the virus from spreading across the community. Developing deep learning models for analysing these kinds of radiological images is a well-known methodology in the domain of computer based medical image analysis. However, doing the same by mimicking biological models and leveraging the newly developed neuromorphic computing chips might be more economical. These chips have been shown to be more powerful and more efficient than conventional central and graphics processing units. Additionally, these chips facilitate the implementation of spiking neural networks (SNNs) in real-world scenarios. To this end, in this work, we have tried to simulate SNNs using various deep learning libraries. We have applied them to the classification of chest CT scan images into COVID and non-COVID classes. Our approach has achieved a very high F1-score of 0.99 for the potential-based model and outperforms many state-of-the-art models. The working code associated with our present work can be found here.
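As an illustration of the potential-based units underlying an SNN, here is a sketch of a leaky integrate-and-fire neuron that accumulates membrane potential and spikes on crossing a threshold. The constants are illustrative, and this is a standard LIF unit rather than necessarily the paper's exact neuron model.

```python
# Sketch: a leaky integrate-and-fire (LIF) spiking neuron.
import numpy as np

def lif(inputs, tau=10.0, v_thresh=1.0, v_reset=0.0):
    """Simulate one LIF neuron over a sequence of input currents."""
    v, spikes = 0.0, []
    for i in inputs:
        v += (-v + i) / tau          # leaky integration of the potential
        if v >= v_thresh:            # spike when the threshold is crossed
            spikes.append(1)
            v = v_reset              # reset after the spike
        else:
            spikes.append(0)
    return np.array(spikes)

rng = np.random.default_rng(0)
print(lif(rng.uniform(0, 20, size=50)).sum(), "spikes out of 50 steps")
```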
Link to full paper
Abstract
Offensive language identification has been an active area of research in natural language processing. With the emergence of multiple social media platforms, offensive language identification has become a need of the hour. Traditional offensive language identification models fail to deliver acceptable results, as social media content is largely multilingual and code-mixed in nature. This paper tries to resolve this problem by using IndicBERT and BERT architectures to facilitate the identification of offensive language in Kannada-English, Malayalam-English, and Tamil-English code-mixed language pairs extracted from social media. When evaluated on the test corpus, the presented approach yielded precision, recall, and F1 scores of 0.62, 0.71, and 0.66 respectively for the Kannada-English pair, 0.77, 0.43, and 0.53 respectively for the Malayalam-English pair, and 0.71, 0.74, and 0.72 respectively for the Tamil-English pair.
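A hedged sketch of the fine-tuning setup with Hugging Face Transformers; "ai4bharat/indic-bert" is the public IndicBERT checkpoint, but the training configuration and toy example below are assumptions, not the paper's exact setup.

```python
# Sketch: IndicBERT sequence classification with Hugging Face Transformers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModelForSequenceClassification.from_pretrained(
    "ai4bharat/indic-bert", num_labels=2)      # offensive vs. not offensive

batch = tok(["intha padam romba mosam da"],    # toy Tamil-English code-mix
            padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1])

out = model(**batch, labels=labels)            # returns loss + logits
out.loss.backward()                            # one training step (optimizer omitted)
```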
Link to full paper
Abstract
The problem of gender and age identification has been addressed by many researchers; however, the attention given to it is far less compared to other face recognition problems, and the success achieved in this domain has not seen as much improvement. Every language in the world has a separate set of words and grammatical rules for addressing people of different ages, and the decision associated with their usage relies on our ability to discern individual characteristics like gender and age from facial appearances at a glance. With the rapid usage of Artificial Intelligence (AI) based systems in different fields, we expect the decision making capability of these systems to match human capability as closely as possible. To this end, in this work, we have designed a deep learning based model, called GRANet (Gated Residual Attention Network), for the prediction of age and gender from facial images. This is a modified and improved version of the Residual Attention Network in which we have included the concept of a Gate in the architecture. Gender identification is a binary classification problem, whereas prediction of age is a regression problem; we have decomposed this regression problem into a combination of classification and regression problems to achieve better accuracy. Experiments have been done on five publicly available standard datasets, namely FG-Net, Wikipedia, AFAD, UTKFace and AdienceDB. The obtained results prove the model's effectiveness for both age and gender prediction, making it a proper candidate against other state-of-the-art methods.
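One common way to realize the classification-plus-regression decomposition mentioned above is to classify the face into coarse age bins and take the softmax-weighted expectation over bin centres as the continuous age. This is an assumed reading, not necessarily the paper's exact scheme, and the bin layout is illustrative.

```python
# Sketch: age regression as expectation over classified age bins.
import numpy as np

BIN_CENTRES = np.array([5, 15, 25, 35, 45, 55, 65, 75])  # assumed 10-year bins

def expected_age(bin_probs: np.ndarray) -> float:
    """Softmax probabilities over age bins -> continuous age estimate."""
    return float(np.dot(bin_probs, BIN_CENTRES))

probs = np.array([0.0, 0.05, 0.6, 0.3, 0.05, 0.0, 0.0, 0.0])
print(expected_age(probs))   # -> 28.5
```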
Link to full paper
Abstract
Nowadays, we can observe applications of machine learning in every field, ranging from the quality testing of materials to the building of powerful computer vision tools. One such recent application is the recommendation system, a method that suggests products to users based on their preferences. In this paper, our focus is on a specific recommendation system called movie recommendation. Here, we make use of user reviews of movies in order to establish a general outlook about each movie and then use that outlook to recommend the movie to other users. However, the huge number of available reviews has baffled sophisticated review systems. Consequently, there is a need for a method of extracting meaningful information from the available reviews and using it to classify a movie review and predict the sentiment in each one. In a typical scenario, a review can be positive, negative, or indifferent about a movie. However, the available research articles in the field mainly treat this as a two-class classification problem: positive and negative. The most popular work in this field was performed on the Stanford and Rotten Tomatoes datasets, which are somewhat outdated. Our work is based on self-scraped reviews from the IMDB website, and we have annotated the reviews into one of three classes: positive, negative, and neutral. Our dataset is called JUMRv1 (Jadavpur University Movie Recommendation dataset version 1). For the evaluation of JUMRv1, we took an exhaustive approach, testing various combinations of word embeddings, feature selection methods, and classifiers. We also analysed the performance trends, if there were any, and attempted to explain them. Our work sets a benchmark for movie recommendation systems based on the newly developed dataset, using three-class sentiment classification.
Link to full paper
Abstract
The on-going pandemic has opened a Pandora's box of problems which society has been hiding for years. But the positive side of the present scenario is the opening up of opportunities to solve these problems on the global stage. One such place, which has been flooded with all kinds of emotions and reactions from people all over the world, is Twitter, a microblogging platform. Coronavirus related hashtags have been trending for many days, unlike any other event in the past. Our experiment mainly deals with the collection, tagging and classification of these tweets based on the different keywords they may belong to, using the Naive Bayes algorithm at the core.
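A minimal sketch of the core classifier: a multinomial Naive Bayes over token counts, assigning tweets to keyword-based categories. The categories and data here are toy examples, not the collected corpus.

```python
# Sketch: multinomial Naive Bayes for keyword-based tweet categorization.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["lockdown extended again", "wear a mask please",
          "vaccine trials show promise"]
tags = ["lockdown", "mask", "vaccine"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(tweets, tags)
print(clf.predict(["new mask mandate announced"]))  # -> ['mask']
```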
Link to full paper
Abstract
Training translation systems with complex and compound sentences is generally considered computationally tough, and such systems fail to process the large amount of syntactic information given out by these sentences. This issue subsequently affects the overall quality of translations. On the other hand, simple sentences are shorter by nature and produce less syntactic information. Therefore, it would be safe to say that training translation systems using simple sentences only would result in better translation output. However, training a translation system requires a large, high-quality parallel corpus involving two natural languages. While parallel corpora for various language pairs are abundant, such lexicons for low-resourced languages, consisting of only simple sentences, are rare. In such a scenario, the development of such a parallel lexicon is the initial purpose of the present work. Building it requires differentiating complex and/or compound sentences from the overall corpus and then converting them into simple sentences. Since the work includes two languages, English and Bengali, different algorithms to accomplish this are documented in this paper. Converting complex and compound sentences to simple instances fragments the sentences into two or more segments, which then need to be aligned to make them semantically similar; hence, a basic alignment technique has also been proposed to mitigate this problem. After developing the parallel corpus, we needed to check its effectiveness in solving the quality issues of translation systems discussed earlier. For this, state-of-the-art translation modules like Statistical Machine Translation and Neural Machine Translation have been trained using the developed corpus as well as the raw parallel corpus consisting of sentences of mixed complexities. The performance of these translation models has been compared using automated as well as manual evaluation metrics. The results are promising and prove that translation systems do perform better when trained using simple sentence language pairs.
Link to full paper
Abstract
Compared to other features of the human body, voice is quite complex and dynamic, in the sense that speech can be spoken in various languages with different accents and in different emotional states. Recognizing the gender, i.e. male or female, from the voice of an individual is by all accounts a minor task for human beings; the same goes for speaker identification if we have been well accustomed with the speaker for a long time. Our ears function as the front end, accepting the sound signals which our brain processes to settle on a decision. Although trivial for us, it becomes a challenging task to mimic for any computing device. Automatic gender, emotion and speaker identification systems have many applications in surveillance, multimedia technology, robotics and social media. In this paper, we propose a Golden Ratio-aided Neural Network (GRaNN) architecture for these purposes. As deciding the number of units for each layer in a deep NN is a challenging issue, we have done this using the concept of the Golden Ratio. Prior to that, an optimal subset of features is selected from the feature vector, common to all three tasks, extracted from spectral images obtained from the input voice signals. We have used a wrapper-filter framework where features selected by minimum redundancy maximum relevance are fed to the Mayfly algorithm combined with the adaptive beta hill climbing (AβHC) algorithm. Our model achieves accuracies of 99.306% and 95.68% for gender identification on the RAVDESS and Voice Gender datasets, 95.27% for emotion identification on the RAVDESS dataset, and 67.172% for speaker identification on the RAVDESS dataset. Performance comparison of this model with existing models on the publicly available datasets confirms its superiority over those models. The results also ensure that we have chosen the common feature set meticulously, as it works equally well on three different pattern classification tasks. The proposed wrapper-filter framework reduces the feature dimension significantly, thereby lessening the storage requirement and training time. Finally, strategically selecting the number of units in each NN layer helps increase the overall performance of all three pattern classification tasks.
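A hedged sketch of the Golden Ratio idea for sizing layers, assuming each hidden layer is roughly 1/φ times the width of the previous one; the starting width, depth, and shrinking direction are illustrative assumptions, not the paper's exact rule.

```python
# Sketch: choosing layer widths with the golden ratio.
PHI = (1 + 5 ** 0.5) / 2          # golden ratio, ~1.618

def golden_layer_sizes(first: int, depth: int) -> list[int]:
    """Shrink successive layer widths by the golden ratio."""
    sizes = [first]
    for _ in range(depth - 1):
        sizes.append(max(1, round(sizes[-1] / PHI)))
    return sizes

print(golden_layer_sizes(512, 6))  # -> [512, 316, 195, 121, 75, 46]
```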
Link to full paper
Abstract
User privacy is an important concern that should be handled in data intensive applications. Interestingly, differential privacy is a privacy model that can be applied to such datasets; it is advantageous as it does not make any strong assumption about the adversary. In this work, we introduce the notion of differential privacy in the domain of Human Activity Recognition (HAR). Real life accelerometer data has been collected from different smartphone configurations carried by users in different ways according to their convenience. Our contribution in this work is a privacy preserving HAR framework incorporating algorithms that preserve the differential privacy of the user data. The algorithm exploits the scalar and the vector parts of the accelerometer readings and applies privacy preserving mechanisms to them. A Deep Multi Layer Perceptron (DMLP) framework has been utilized for activity classification. We have achieved comparable classification results while additionally preserving the privacy of the data; to the best of our knowledge, this is the first work of its kind in the domain of HAR based on smartphone sensing data. The proposed framework is implemented both on a collected real-life dataset capturing different smartphone configurations and usage behaviours, and on benchmark datasets.
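A hedged sketch of perturbing the scalar part of accelerometer readings with the Laplace mechanism, the canonical differential-privacy mechanism; the abstract does not name the mechanism used, so treating it as Laplace noise, along with the sensitivity and epsilon values, is an assumption for illustration.

```python
# Sketch: Laplace mechanism applied to accelerometer magnitudes.
import numpy as np

def laplace_mechanism(values, sensitivity, epsilon, rng=None):
    """Add Laplace(sensitivity / epsilon) noise to each reading."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return values + rng.laplace(0.0, scale, size=values.shape)

acc = np.array([[0.1, 9.7, 0.3], [0.2, 9.6, 0.1]])   # toy x, y, z readings
magnitude = np.linalg.norm(acc, axis=1)              # scalar part of the signal
private_mag = laplace_mechanism(magnitude, sensitivity=1.0, epsilon=0.5)
```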
Link to full paper
I will send over a mail whenever I add new content :)
Avishek Garain is an undergraduate at Jadavpur University in the Department of Computer Science and Engineering. His interest lies primarily in Deep Learning for Natural Language Processing and Computer Vision. He follows several research journals to keep himself up-to-date with current research and also actively takes part in research at the undergraduate level under Dr. Sudip Kumar Naskar, Dr. Dipankar Das and Dr. Ram Sarkar of the Dept. of CSE, Jadavpur University.
He loves to teach people about his experience in Deep Learning and enthusiastically invites research collaborations and internship opportunities in the field of Deep Learning for Natural Language Processing and Computer Vision.