Titanic Survival Prediction using Decision Trees

A machine learning project that uses Decision Trees to predict the survival chances of Titanic passengers.

Project Overview

Why choose this?

Although this project has no practical use, I believe it highlights how significantly age and class affected survival chances aboard the Titanic.

This project aims to predict whether a given Titanic passenger survived, using a Decision Tree classifier.

The model was trained on features such as age, gender, and passenger class, with survival status as the target label.

Data Sources

Given the niche nature of this project, Kaggle was the primary source for the dataset used to make predictions.

The data was cleaned (e.g., by removing empty values) to enable graph generation.

Pre-processing steps included converting non-numerical data into numerical format (e.g., gender) and selecting relevant features for training the Decision Tree.
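The cleaning and feature selection described above can be sketched as follows. This is a minimal illustration, not the project's actual notebook code: the rows are made up, the column names (Pclass, Sex, Age, Survived) are assumed to follow the Kaggle Titanic CSV, and the male/female to 0/1 mapping is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; these are not real passenger records.
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1],
    "Sex": ["female", "male", "female", None, "male"],
    "Age": [38.0, 22.0, np.nan, 35.0, 54.0],
    "Survived": [1, 0, 1, 0, 0],
})

# Remove rows that have an empty value in any column we intend to use.
df = df.dropna(subset=["Pclass", "Sex", "Age", "Survived"]).reset_index(drop=True)

# Convert the non-numerical gender column to numbers (assumed mapping).
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Select the features used for training and the target label.
X = df[["Pclass", "Sex", "Age"]]
y = df["Survived"]

print(X.shape, y.shape)
```

Keeping the features (X) and the target (y) in separate objects ensures the survival column is never passed to the model as an input.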

Data Visualization and Understanding

Using Python (Jupyter Notebook), I visualized the Decision Tree generated by the model to better understand how features influenced survival predictions.

The Decision Tree demonstrates splits in data based on features like gender, passenger class, and age to make predictions. These splits highlight the importance of certain variables in determining survival.
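A tree like this can be drawn with Scikit-Learn's plot_tree, and the relative weight of each feature can be read from the fitted model's feature_importances_ attribute. The sketch below uses a handful of made-up, already-encoded rows and an assumed max_depth of 3, so the resulting tree is illustrative and will not match the one trained on the full dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for plain scripts; not needed inside Jupyter
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Made-up rows for illustration: [Pclass, Sex (0 = male, 1 = female), Age].
feature_names = ["Pclass", "Sex", "Age"]
X = [
    [1, 1, 29.0], [3, 0, 35.0], [2, 1, 8.0], [3, 0, 40.0],
    [1, 0, 50.0], [3, 1, 22.0], [2, 0, 6.0], [3, 0, 28.0],
]
y = [1, 0, 1, 0, 0, 1, 1, 0]  # 1 = survived, 0 = did not survive

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

# Draw the tree; each box shows the split rule, sample counts, and majority class.
fig, ax = plt.subplots(figsize=(10, 6))
nodes = plot_tree(
    model,
    feature_names=feature_names,
    class_names=["died", "survived"],  # listed in ascending order of the class labels
    filled=True,
    ax=ax,
)
# In a notebook the figure renders inline; in a script, call fig.savefig(...) instead.

# Importance scores sum to 1; higher means the feature contributed more to the splits.
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.2f}")
```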

The images below showcase different decision trees from the random forest, each varying slightly from the others.

The visualization clearly shows how female passengers, children, and those in higher classes had better survival chances. This aligns with historical accounts of the Titanic disaster.

Algorithms

Decision Trees were chosen for their interpretability and because they work well with a mix of numerical and categorical features. (In Scikit-Learn, categorical features such as gender must first be encoded as numbers.)

Unlike linear models such as linear regression, which assume a linear relationship between the features and the target, Decision Trees can capture non-linear patterns and produce explicit if/then rules for each prediction.
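To make the "clear rules" point concrete, Scikit-Learn's export_text prints a fitted tree as nested threshold rules. The sketch below uses made-up rows in which every female passenger survives but only the youngest male passenger does, an interaction between gender and age that a single linear boundary would not describe well. The data, column names, and max_depth value are all assumptions for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows for illustration: [Pclass, Sex (0 = male, 1 = female), Age].
feature_names = ["Pclass", "Sex", "Age"]
X = [
    [1, 1, 29.0], [3, 0, 35.0], [2, 1, 8.0], [3, 0, 40.0],
    [1, 0, 50.0], [3, 1, 22.0], [2, 0, 6.0], [3, 0, 28.0],
]
y = [1, 0, 1, 0, 0, 1, 1, 0]  # 1 = survived, 0 = did not survive

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each indented line is one threshold test; leaf lines report the predicted class.
rules = export_text(model, feature_names=feature_names)
print(rules)
```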

In this project, I limited the tree depth to prevent overfitting and ensure generalizable predictions.
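The effect of limiting depth can be checked by comparing an unrestricted tree with a depth-limited one on held-out data. The sketch below is an illustration under stated assumptions: the data is synthetic (randomly generated with deliberately noisy labels, not the Titanic dataset), and max_depth=3 is an example value, since the depth actually used in this project is not stated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the Titanic dataset.
rng = np.random.default_rng(0)
n = 400
pclass = rng.integers(1, 4, size=n)   # 1, 2, or 3
sex = rng.integers(0, 2, size=n)      # 0 = male, 1 = female
age = rng.uniform(1, 80, size=n)

# Label loosely tied to sex and class, with about 20% of labels flipped as noise.
signal = (sex == 1) | ((pclass == 1) & (age < 50))
flip = rng.random(n) < 0.2
y = np.where(flip, ~signal, signal).astype(int)
X = np.column_stack([pclass, sex, age])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unrestricted tree keeps splitting until it fits the training noise;
# the depth-limited tree stops after three levels of splits.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for label, tree in [("unrestricted", deep), ("max_depth=3", shallow)]:
    print(
        f"{label}: depth={tree.get_depth()}, "
        f"train accuracy={tree.score(X_train, y_train):.2f}, "
        f"test accuracy={tree.score(X_test, y_test):.2f}"
    )
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting; the depth limit trades some training accuracy for a simpler tree.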

Online Sources

Online tutorials and documentation were invaluable for understanding and implementing Decision Trees. The following sources were particularly helpful:

Scikit-Learn Decision Tree Documentation

Kaggle Titanic Dataset

Tools Used

This project was developed using Python, with Jupyter Notebook as the primary development environment.

The following Python libraries were crucial for the project:

Pandas: for data manipulation and preparation.
NumPy: for efficient numerical computations.
Scikit-Learn: for implementing the Decision Tree algorithm.
Matplotlib: for visualizing the Decision Tree structure.

These tools allowed for seamless pre-processing, model training, and visualization.

Pre-Processing

The Kaggle Titanic dataset required significant pre-processing to be usable for training the Decision Tree model.

Steps included:

Filling missing values in the 'Age' column with the median age.
Converting non-numerical columns, such as 'Sex', into numerical values for the model.
Removing irrelevant columns to focus on key features like passenger class, age, and gender.
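
The three steps above can be sketched as a single helper function. This is a minimal illustration rather than the project's exact code: the function name preprocess is hypothetical, the sample rows are made up, the Pclass and Survived column names are assumed from the Kaggle CSV, and the male/female to 0/1 mapping is an arbitrary choice.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three listed steps to a Titanic-style DataFrame (hypothetical helper)."""
    out = df.copy()
    # Step 1: fill missing values in 'Age' with the median age.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Step 2: convert the non-numerical 'Sex' column to numbers (assumed mapping).
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    # Step 3: drop irrelevant columns by keeping only the key features and the target.
    return out[["Pclass", "Sex", "Age", "Survived"]]

# Made-up rows for illustration only; these are not real passenger records.
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Pclass": [1, 3, 2, 3],
    "Sex": ["female", "male", "female", "male"],
    "Age": [30.0, np.nan, 10.0, 40.0],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Survived": [1, 0, 1, 0],
})

clean = preprocess(sample)
print(clean)
```

Filling with the median rather than the mean keeps the imputed value from being pulled toward a few very old or very young passengers.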