Abstract: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
In this paper we make a predictive analysis of what sorts of people were likely to survive, and use some tools of machine learning to predict with accuracy which passengers survived the tragedy.

Index Terms - Machine learning.

I. Introduction

Machine learning means the application of any computer-enabled algorithm that can be applied against a data set to find a pattern in the data. This encompasses basically all types of data science algorithms: supervised, unsupervised, segmentation, classification, or regression. A few important areas where machine learning can be applied are:

Handwriting Recognition: convert written letters into digital letters
Language Translation: translate spoken and/or written languages (e.g.
Google Translate)
Speech Recognition: convert voice snippets to text (e.g. Siri, Cortana, and Alexa)
Image Classification: label images with appropriate categories (e.g.
Google Photos)
Autonomous Driving: enable cars to drive (e.g. NVIDIA and the Google Car)

Some notes on the features used by machine learning algorithms: features are the observations that are used to form predictions. For image classification, the pixels are the features. For voice recognition, the pitch and volume of the sound samples are the features. For autonomous cars, data from the cameras, range sensors, and GPS are the features. Extracting relevant features is important for building a model: the source of a mail is an irrelevant feature when classifying images, but the source is relevant when classifying emails, because spam often originates from reported sources.

2. Literature Survey

Every machine learning algorithm works best under a given set of conditions. Making sure your algorithm fits the assumptions/requirements ensures superior performance. You can't use any algorithm in any condition. Instead, you should try using algorithms such as Logistic Regression, Decision Trees, SVM, Random Forest, etc.
Why Logistic Regression? It is used to model the probability of an event occurring depending on the values of the independent variables, which can be categorical or numerical; to estimate the probability that an event occurs for a randomly selected observation versus the probability that the event does not occur; to predict the effect of a series of variables on a binary response variable; and to classify observations by estimating the probability that an observation is in a particular category.

Performance of the logistic regression model:

AIC (Akaike Information Criterion) - The analogous metric to adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients. Therefore, we always prefer the model with the minimum AIC value.

Null Deviance and Residual Deviance - Null deviance indicates the response predicted by a model with nothing but an intercept. The lower the value, the better the model. Residual deviance indicates the response predicted by a model on adding independent variables.
The lower the value, the better the model.

Confusion Matrix: It is nothing but a tabular representation of actual vs. predicted values. This helps us to find the accuracy of the model and avoid overfitting.

McFadden R² is called a pseudo R². When analyzing data with a logistic regression, an equivalent statistic to R-squared does not exist. However, to evaluate the goodness-of-fit of logistic models, several pseudo R-squareds have been developed.
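These goodness-of-fit quantities can be illustrated numerically. The sketch below uses a tiny made-up dataset and hand-picked (not fitted) coefficients; only the formulas for the log-likelihood, McFadden pseudo R², AIC, and deviance follow the discussion above.

```python
import math

def logistic_prob(x, b0, b1):
    """Logistic model: p = 1 / (1 + e^-(b0 + b1*x))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(xs, ys, b0, b1):
    """Sum log p over observed events and log(1 - p) over non-events."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic_prob(x, b0, b1)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Tiny made-up sample: one numeric feature, one binary outcome.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

# Hand-picked coefficients for illustration (a real model would be fitted).
ll_model = log_likelihood(xs, ys, b0=-3.5, b1=1.0)

# Null model: intercept only, so p is the observed event rate for every row.
p0 = sum(ys) / len(ys)
ll_null = len(ys) * (p0 * math.log(p0) + (1 - p0) * math.log(1 - p0))

mcfadden_r2 = 1.0 - ll_model / ll_null  # pseudo R^2: closer to 1 is better
aic = 2 * 2 - 2 * ll_model              # 2k - 2*LL with k = 2 coefficients
null_deviance = -2 * ll_null            # deviance of the intercept-only model
residual_deviance = -2 * ll_model       # lower than null deviance => predictors help

print(round(mcfadden_r2, 3), round(aic, 2))
print(round(null_deviance, 2), round(residual_deviance, 2))
```

Because the feature separates the two classes well, the model's log-likelihood is much higher than the null model's, so the pseudo R² is large and the residual deviance is far below the null deviance.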
accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives)

Decision Trees

A decision tree is a hierarchical tree structure that can be used to divide a large collection of records into smaller sets of classes by applying a sequence of simple decision rules. A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous (mutually exclusive) classes. The attributes of the classes can be any type of variable, from binary, nominal, and ordinal to quantitative values, while the classes must be of qualitative type (categorical, binary, or ordinal). In short, given data of attributes together with its classes, a decision tree produces a sequence of rules (or series of questions) that can be used to recognize the class. One rule is applied after another, resulting in a hierarchy of segments within segments.
The hierarchy is called a tree, and each segment is called a node. With each successive division, the members of the resulting sets become more and more similar to each other. Hence, the algorithm used to construct a decision tree is referred to as recursive partitioning.

Decision tree applications:
predicting tumor cells as benign or malignant
classifying credit card transactions as legitimate or fraudulent
classifying buyers from non-buyers
deciding whether or not to approve a loan
diagnosing various diseases based on symptoms and profiles
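As a concrete illustration of a sequence of simple decision rules, here is a toy, hand-written tree in the spirit of the Titanic problem. The split features and thresholds below are illustrative assumptions, not the tree actually learned from the data in this paper.

```python
def predict_survival(passenger):
    """Toy decision tree: each if/else below is one node that splits
    the records into smaller, more homogeneous segments.

    `passenger` is a dict with keys 'sex', 'pclass', and 'age';
    the rules are illustrative, not the fitted model from the paper."""
    if passenger["sex"] == "female":
        # Female passengers: split further on ticket class.
        return 1 if passenger["pclass"] <= 2 else 0
    else:
        # Male passengers: split on age (children were prioritized).
        return 1 if passenger["age"] < 10 else 0

print(predict_survival({"sex": "female", "pclass": 1, "age": 29}))  # 1
print(predict_survival({"sex": "male", "pclass": 3, "age": 35}))    # 0
```

A learned tree works the same way: one rule is applied after another until a leaf node assigns the class.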
3. Methodology: Our approach to solve the problem:
1. Collect the raw data needed to solve the problem.
2. Import the dataset into the working environment.
3. Perform data preprocessing, which includes data wrangling and feature engineering.
4. Explore the data and prepare a model for performing analysis using machine learning algorithms.
5. Evaluate the model and re-iterate until we get satisfactory model performance.
6.
Compare the results and select the model that gives the more accurate result.

The data we collected is still raw data, which is very likely to contain mistakes, missing values, and corrupt values. Before drawing any conclusions from the data we need to do some data preprocessing, which involves data wrangling and feature engineering. Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis. The feature engineering process attempts to create additional relevant features from the existing raw features in the data, to increase the predictive power of the learning algorithms.

4. Experimental Analysis and Discussion

a) Data set description: The original data has been split into two groups: a training dataset (70%) and a test dataset (30%). The training set should be used to build your machine learning models.
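The 70/30 split described above can be sketched with the standard library alone. A real pipeline would more commonly call a library helper such as scikit-learn's train_test_split; the integer records below are stand-ins for passenger rows.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the records and split them into training and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Toy stand-ins for passenger records.
data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))  # 7 3
```

Shuffling before the cut ensures both sets are random samples rather than, say, the first and last passengers in the file.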
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

b) Measures

c) Results

After training with the algorithms, we have to validate our trained algorithms with the test data set and measure the algorithms' performance using goodness of fit with a confusion matrix for validation. We used 70% of the data as the training data set and 30% as the test data set.

Confusion matrix for the decision tree:

Training data set:
              Reference
Prediction      0      1
     0        395     71
     1         45    203

Test data set:
              Reference
Prediction      0      1
     0         97     20
     1         12     48

Confusion matrix for logistic regression:

Training data set:
              Reference
Prediction      0      1
     0        395     12
     1         21    204

Test data set:
              Reference
Prediction      0      1
     0         97     12
     1         21     47

d) Enhancements and reasoning

Predicting the survival rate with other machine learning algorithms, such as random forests and various support vector machines, may improve the accuracy of prediction for the given data set.

5. Conclusion: The analyses revealed interesting patterns across individual-level features.
Factors such as socioeconomic status, social norms, and family composition appeared to have an impact on the likelihood of survival. These conclusions, however, were derived from findings in the data. The accuracy of predicting the survival rate using the decision tree algorithm (83.7%) is high when compared with logistic regression (81.3%) for the given data set.
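As a sanity check, the accuracies quoted in the conclusion can be recomputed from the confusion matrices in Section 4 using the accuracy formula from Section 2; the decision-tree figure appears to correspond to its training-set matrix and the logistic-regression figure to its test-set matrix.

```python
def accuracy(tp, tn, fp, fn):
    """accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Cells read off the confusion matrices above, treating class 0
# (did not survive) as the negative class.
dt_train = accuracy(tp=203, tn=395, fp=45, fn=71)  # decision tree, training set
lr_test = accuracy(tp=47, tn=97, fp=21, fn=12)     # logistic regression, test set

print(round(dt_train, 4))  # 0.8375, i.e. approximately the 83.7% quoted
print(round(lr_test, 4))   # 0.8136, i.e. approximately the 81.3% quoted
```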