Project Objective
The sinking of the Titanic is a well-known disaster. The passenger data provided is incomplete; it includes fields such as gender, ticket price, title, and cabin class. The data is split into 'train' and 'test' sets according to whether each passenger's survival status is known or unknown. The goal is to use the known data to predict the likelihood of survival for the passengers whose fate is unknown.
Data Ingestion
The data is sourced from Kaggle.
Data Processing
1. First, merge the 'train' and 'test' data and explore the merged data in Power BI.
Based on observation:
The only difference between the 'test' and 'train' data is the 'Survived' column, which only 'train' contains.
There are many missing values in the 'Age' column that need to be filled in.
There is a large number of missing values in the 'Cabin' column, and based on observation, the cabin class appears to be helpful in inferring the survival probability.
There may be a relationship between 'Cabin' and 'Fare.' Investigating the correlation between 'Fare,' 'Cabin,' and 'Survived' may require statistical tests or grouping the values into intervals.
'Embark' and 'Fare' have only a few missing values, which can be inferred easily.
Rows with a missing 'Cabin' value differ little from rows with a known 'Cabin' value across the other columns.
There are no extreme outliers, so no standardization step is needed.
2. Fill in the missing values for 'Embark' and 'Fare'.
3. Bin 'Cabin' and 'Fare,' then inspect the data again.
4. Feature engineering: title, Ticket, family_size, Cabin.
5. Convert the column data into numerical values (required for running models) and fill in the missing 'Cabin' values.
6. Convert the column data into numerical values (required for running models) and fill in the missing 'Age' values. The data after filling in the values is presented as follows:
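As a rough illustration of steps 1 to 6 above (not the original output, and not necessarily the code used in this project), the following pandas sketch walks through the same sequence on a few made-up rows. Several things here are assumptions: the column names follow the Kaggle layout ('Name', 'SibSp', 'Parch', 'Embarked'), 'Embarked' is filled with the most frequent port and 'Fare' with the median, 'Fare' is binned by quantile, 'Cabin' is reduced to its first letter with 'U' standing in for unknown, and 'Age' is filled with the median age of passengers sharing the same title. The 'Ticket' feature is left out for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the Kaggle column layout (values are made up).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann", "Poe, Mr. Sam"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Cabin": [None, "C85", None, "E46"],
    "Embarked": ["S", "C", None, "S"],
})
test = pd.DataFrame({
    "PassengerId": [5, 6],
    "Name": ["Low, Mr. Tom", "Low, Master. Tim"],
    "Sex": ["male", "male"],
    "Age": [np.nan, 4.0],
    "SibSp": [0, 1],
    "Parch": [1, 1],
    "Fare": [np.nan, 21.0],
    "Cabin": [None, None],
    "Embarked": ["Q", "S"],
})

# Step 1: stack both sets; 'test' rows get NaN in 'Survived'.
merged = pd.concat([train, test], ignore_index=True, sort=False)
missing_counts = merged.isna().sum()  # missing values per column

# Step 2 (assumed rule): most frequent port, median fare.
merged["Embarked"] = merged["Embarked"].fillna(merged["Embarked"].mode().iloc[0])
merged["Fare"] = merged["Fare"].fillna(merged["Fare"].median())

# Step 3: bin 'Fare' into quantiles and reduce 'Cabin' to its deck letter.
merged["fare_bin"] = pd.qcut(merged["Fare"], q=3, labels=False, duplicates="drop")
merged["cabin_deck"] = merged["Cabin"].str[0].fillna("U")  # 'U' = unknown (assumed)

# Step 4: title from 'Name', family size from 'SibSp' and 'Parch' plus the passenger.
merged["title"] = merged["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
merged["family_size"] = merged["SibSp"] + merged["Parch"] + 1

# Steps 5-6: integer-encode the text columns, then fill 'Age' with the
# median age per title, falling back to the overall median.
for col in ["Sex", "Embarked", "cabin_deck", "title"]:
    merged[col] = pd.factorize(merged[col])[0]
title_median = merged.groupby("title")["Age"].transform("median")
merged["Age"] = merged["Age"].fillna(title_median).fillna(merged["Age"].median())

print(missing_counts)
print(merged[["Age", "Fare", "fare_bin", "cabin_deck", "title", "family_size"]])
```

Merging before filling means the fill values and bin edges are computed over all passengers, which matches the merge-first order of step 1; the 'Survived' column is the only one left with missing values at the end.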
Analysis Methods
This is a supervised learning task: a multi-feature dataset with known outcomes ('Survived') is used to predict unknown outcomes. Random forest regression is therefore adopted, both to make the predictions and to obtain the weight (importance) of each variable.
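A minimal sketch of this approach, assuming scikit-learn's RandomForestRegressor stands in for the random forest regression described above. The feature list, the synthetic data, the number of trees, and the 0.5 cut-off used to turn a predicted likelihood into a survived/not-survived outcome are all illustrative assumptions, not values taken from this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up numeric table standing in for the processed data.
rng = np.random.default_rng(0)
n = 60
data = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),
    "Pclass": rng.integers(1, 4, n),
    "Age": rng.uniform(1, 70, n),
    "family_size": rng.integers(1, 6, n),
})
# Synthetic label tied to 'Sex', only so the model has a pattern to learn.
data["Survived"] = (data["Sex"] + rng.uniform(0, 1, n) > 1.0).astype(float)
# Mimic the train/test split: the last 10 rows have an unknown outcome.
data.loc[n - 10:, "Survived"] = np.nan

features = ["Sex", "Pclass", "Age", "family_size"]
known = data[data["Survived"].notna()]
unknown = data[data["Survived"].isna()]

# Fit on passengers with a known outcome.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(known[features], known["Survived"])

# Predicted likelihood of survival for the unknown passengers,
# converted to 0/1 with an assumed 0.5 threshold.
likelihood = model.predict(unknown[features])
predicted = (likelihood >= 0.5).astype(int)

# Per-feature weights (importances), largest first.
weights = pd.Series(model.feature_importances_, index=features)
weights = weights.sort_values(ascending=False)
print(weights)
```

Because the target is 0 or 1, each regression output is an average over trees and can be read as a survival likelihood. A RandomForestClassifier would be a common alternative for a binary outcome and exposes the same feature_importances_ attribute.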
Presentation
The model yields predicted survival outcomes for the 418 passengers whose fate is unknown.