Data Science Projects

Predicting Political Orientation From Twitter Data (Italian political parties)

4 min read

The objective of this project was to predict political orientation from Twitter content. Twitter is a social networking service that lets users post real-time messages (called tweets). In the first phase, we collected two kinds of data: responses to an online survey that indirectly asked about respondents’ political preferences, and Twitter accounts that follow the official account of one of the five parties included in the analysis (Lega, Movimento-5-Stelle, Partito Democratico, Forza Italia and Fratelli d’Italia). The second phase consisted of downloading tweets from the accounts gathered in phase one. In the third phase, the data was preprocessed: we cleaned it of noise and applied several feature engineering techniques. In the final phase, we tried multiple classifiers: Logistic Regression, XGBoost, Random Forest, a Stochastic Gradient Descent classifier and Multinomial Naïve Bayes. With these models, we were finally able to predict the political orientation of “Movimento-5-Stelle” followers.
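
As a rough illustration of one of the classifiers listed above, here is a minimal sketch of Multinomial Naïve Bayes with Laplace smoothing, written in plain Python. It is not the project's code: the function names, the tokens and the labels `party_a` / `party_b` are all made up for the example, and a real pipeline would more likely rely on a library implementation.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on already-tokenised documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {"alpha": alpha, "vocab": vocab, "classes": {}}
    for label, n_docs in class_docs.items():
        model["classes"][label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "counts": word_counts[label],
            "total": sum(word_counts[label].values()),
        }
    return model


def predict(model, tokens):
    """Return the class with the highest log-posterior for the given tokens."""
    alpha, vocab = model["alpha"], model["vocab"]
    best_label, best_score = None, -math.inf
    for label, stats in model["classes"].items():
        score = stats["log_prior"]
        denom = stats["total"] + alpha * len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((stats["counts"][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: tokens and labels are placeholders, not real tweets.
toy_docs = [
    ["tax", "cut", "border"],
    ["border", "security", "tax"],
    ["climate", "school", "health"],
    ["health", "climate", "rights"],
]
toy_labels = ["party_a", "party_a", "party_b", "party_b"]
toy_model = train_multinomial_nb(toy_docs, toy_labels)
print(predict(toy_model, ["tax", "border"]))
```

The smoothing parameter `alpha` keeps a word that never appeared with a given class from driving that class's probability to zero.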

Image Classification with Fashion-MNIST Data (Zalando.de)

7 min read

Image classification is a supervised learning problem: models are trained on labelled sample images to classify new, unseen images. Fashion-MNIST is a dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from 10 classes. Analyzing the dataset, we found that most of the classic ML models (Logistic Regression, Support Vector Machine, Random Forests and k-Nearest Neighbors), as well as a simple Neural Network, achieve an accuracy of 85%-90%, and that all of them struggled mainly to differentiate between two of the ten classes: shirt and t-shirt. We then tried a Convolutional Neural Network, which uses multiple filters trained to detect different features.
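
One simple way to spot which two classes a model confuses most is to count its misclassifications per pair of labels. The sketch below shows the idea in plain Python; the helper name `most_confused_pair` and the handful of predictions are invented for illustration and are not results from the project.

```python
from collections import Counter


def most_confused_pair(y_true, y_pred):
    """Return the unordered pair of classes most often mistaken for each other."""
    errors = Counter()
    for true_label, predicted_label in zip(y_true, y_pred):
        if true_label != predicted_label:
            # frozenset makes (shirt, t-shirt) and (t-shirt, shirt) the same key
            errors[frozenset((true_label, predicted_label))] += 1
    if not errors:
        return None, 0
    pair, count = errors.most_common(1)[0]
    return tuple(sorted(pair)), count


# Made-up labels and predictions, only to exercise the function.
y_true = ["shirt", "shirt", "t-shirt", "t-shirt", "sneaker", "bag", "shirt"]
y_pred = ["t-shirt", "shirt", "shirt", "t-shirt", "sneaker", "bag", "coat"]
print(most_confused_pair(y_true, y_pred))
```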

Subject and Posture Classification with Convolutional Neural Network

7 min read

Sleep is one of the most important activities in our daily life and directly affects our health. However, few people really know their sleep habits, even though knowing them is important for avoiding sleep-related diseases. One benefit of sleep monitoring is that it can lead to positive changes: people are more likely to change their habits if they track them. Pressure sensor mats, which consist of flexible force sensors arranged in a grid, are now commercially available and can continuously measure the pressure distribution under body parts in different sleeping positions. In this study, we propose a convolutional neural network (CNN) for three classification tasks. The first task is subject identification, the second is recognition of three standard sleeping postures (supine, right and left), and the third is a multitask classification that performs subject identification and posture recognition at the same time. We evaluated our models on two different datasets. They showed promising results in both experiments for each classification task.
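
The summary does not say how the multitask model combines its two objectives. One common choice, assumed here purely for illustration, is to add the cross-entropy loss of a subject head to that of a posture head. The function names and the `posture_weight` parameter in this sketch are hypothetical.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of a softmax over `logits` against one true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def multitask_loss(subject_logits, subject_id, posture_logits, posture_id,
                   posture_weight=1.0):
    """Weighted sum of the two task losses (an assumed combination rule)."""
    subject_loss = softmax_cross_entropy(subject_logits, subject_id)
    posture_loss = softmax_cross_entropy(posture_logits, posture_id)
    return subject_loss + posture_weight * posture_loss


# Posture indices are arbitrary here: 0 = supine, 1 = right, 2 = left.
example_loss = multitask_loss([2.0, 0.5, -1.0], 0, [0.1, 1.5, 0.3], 1)
print(example_loss)
```

In a network with a shared convolutional trunk, minimising such a summed loss would update the shared layers using the error signals of both tasks.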

6 min read

Since its founding in 2006, Twitter has grown steadily in popularity and has become an important source of information about people’s interactions, opinions and language. Moreover, it is widely used by politicians and by those interested in politics, which makes Twitter a great resource for political psychology analyses. Two types of data can be extracted from Twitter: non-textual information (e.g. the follower-friend ratio and other user behaviors) and the content of the tweets (Sylwester and Purver, 2015). Both kinds of data may reveal differences or similarities between Republicans and Democrats. The main hypothesis was that the language used on Twitter may reflect the psychological differences between liberals and conservatives. The analysis consisted of three parts: a study of the way Republican and Democrat followers interact on Twitter, an analysis of which words differentiate the groups most, and a timeline content analysis. In the first part, the results showed that Republican users had more followers than Democrat users, while Democrat users followed more accounts. The follower-friend ratio correlated with the use of 1st person plural pronouns (“we”, “us”, “our”) instead of 1st person singular ones (“I”, “me”, “mine”): users who used the plural pronouns more often were followed by more people. Moreover, Republican users employed mentions more often than Democrat users. No significant difference was found in retweeting. The content analysis was conducted on the most frequent words using LIWC (Linguistic Inquiry and Word Count). The results showed some differences both in the most discussed topics (e.g. national identity vs culture) and in some meta-semantic categories: Republicans more often use 1st person plural pronouns, negation (“not”) and the article “the”, which is usually related to an appeal to authority (e.g. “the lord”, “the senate”). Democrat followers tend to use 1st person singular pronouns, more swear words and more emotionally expressive words (“feel” was one of the most differentiating words), often expressing positive sentiments but also anxiety. Overall, the study revealed some differences between Republican and Democrat followers, both in the style of interaction and in the language used.
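
The two measures discussed above, the follower-friend ratio and the rate of first-person pronouns, are simple to compute. The sketch below is only an illustration: the pronoun sets are short hand-written lists rather than the LIWC dictionaries, the tokenizer is deliberately naive, and the function names and sample tweets are made up.

```python
import re

# Hand-picked pronoun lists (an assumption, not the LIWC categories).
FIRST_PLURAL = {"we", "us", "our", "ours", "ourselves"}
FIRST_SINGULAR = {"i", "me", "my", "mine", "myself"}


def follower_friend_ratio(followers, friends):
    """Followers divided by accounts followed; infinite if following nobody."""
    return followers / friends if friends else float("inf")


def pronoun_rates(tweets):
    """Return (plural_rate, singular_rate) as shares of all word tokens."""
    tokens = [tok for tweet in tweets
              for tok in re.findall(r"[a-z']+", tweet.lower())]
    if not tokens:
        return 0.0, 0.0
    plural = sum(tok in FIRST_PLURAL for tok in tokens)
    singular = sum(tok in FIRST_SINGULAR for tok in tokens)
    return plural / len(tokens), singular / len(tokens)


# Invented sample tweets, only to exercise the functions.
sample_tweets = ["We love our town", "I think we can"]
print(follower_friend_ratio(300, 150), pronoun_rates(sample_tweets))
```

Because the tokenizer keeps apostrophes, contractions such as "I'm" are not counted as "I"; a fuller analysis would need to handle them.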

Statistical Analysis of Data Gathered During the COVID-19 Outbreak (Worldwide)

4 min read

The objective of this project was to perform a statistical data analysis of the worldwide COVID-19 outbreak. The data was collected from five major sources: the World Health Organization (WHO), the European Centre for Disease Prevention and Control (ECDC), Worldometers, the COVID Tracking Project and Oxford University. The data was then preprocessed: we cleaned it of noise and applied several feature engineering techniques. Finally, we applied a linear regression model and polynomial models to analyze the peak of the total number of confirmed cases globally and to compare the trends of confirmed cases, recoveries and deaths across the globe. From these models, we could predict the peak of the global trend in confirmed cases.
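
To make the polynomial-fitting step concrete, here is a minimal sketch of a least-squares polynomial fit solved through the normal equations, in plain Python. The summary does not specify which quantity was fitted or with which library, so the setup is an assumption: the `days` and `cases` values are synthetic, and reading the peak off the vertex of a fitted quadratic is just one simple way to estimate it.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns [c0, c1, ...] for c0 + c1*x + c2*x**2 + ..."""
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


# Synthetic data generated from -(x - 4)**2 + 20, not real case counts.
days = [0, 1, 2, 3, 4, 5, 6]
cases = [4, 11, 16, 19, 20, 19, 16]
coeffs = fit_polynomial(days, cases, degree=2)
peak_day = -coeffs[1] / (2 * coeffs[2])  # vertex of the fitted parabola
print(coeffs, peak_day)
```

With `degree=1` the same function performs ordinary linear regression. The normal equations become ill-conditioned for higher degrees, and the function fails with a division by zero when there are too few distinct x values, so a numerical library is the safer choice for real data.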

Biological Data Classification Based on RNA & DNA Sequences and PSI-BLAST

3 min read

The aim of this project was to characterize a single domain, Nucleotidyl Transferase, from the organism Yersinia pseudotuberculosis, a bacterial species that most commonly causes foodborne illness. We were assigned a domain sequence, from which we built a sequence model to provide a structural and functional characterization of the domain family. Using an automated process and Multiple Sequence Alignment tools such as T-Coffee, Clustal Omega and MUSCLE, we developed a model that represents our domain precisely according to the relevant metrics. We then constructed databases of the family’s structures and sequences to gain additional insight into the structural and functional characteristics of our domain. After performing pairwise and multiple sequence structural alignments, we obtained the CATH superfamily 3.90.550.10 and several families, of which the most dominant was our starting sequence’s family, PF00483. On the functional side, we found that the domain covers only two of the three sub-ontologies: biological process (biosynthetic and metabolic processes) and molecular function (transferase activity).
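
For readers unfamiliar with sequence alignment, the sketch below computes a global pairwise alignment score with the classic Needleman-Wunsch dynamic program. It is a teaching simplification and not what the tools named above do internally: the flat match, mismatch and gap scores are arbitrary assumptions, whereas protein alignments typically use substitution matrices and more elaborate gap penalties, and the example strings are not the project's sequences.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,            # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


# Arbitrary toy strings, chosen only to exercise the function.
print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the scoring table are kept in memory, which is enough to get the score; recovering the alignment itself would require storing the full table or traceback pointers.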
