Test Train Split

When should we do the test train split on a dataset?

  1. After cleaning the complete dataset
  2. First split the dataset and then only clean the training dataset

When attempting the final submission of the bootcamp, I adopted the first approach. But when I imported the test dataset (provided separately for the assignment), I got the following error:
"could not convert string to float: 'blue-collar'"
The column containing this value had been converted into numeric values using the get_dummies() function.

The same error occurred with the second approach, when I ran the code to predict values for the test dataset (the split was done at the beginning).

So to summarize, I have two questions:

  1. When should the dataset be split into train and test?
  2. How do I fix the error "could not convert string to float: 'blue-collar'"?


Hi @anshmjn01
Whenever you provide data to a machine learning model, the data should be numerical. The model won't throw this error when it is given numerical data.

Coming to the dataset splitting: it's generally best to split the dataset before performing cleaning tasks that learn something from the data, such as filling missing values or applying transformations. Otherwise, information from the test set can leak into the ML model (data leakage).
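
As a rough illustration of the "split first" idea, here is a minimal sketch using pandas and scikit-learn's train_test_split. The toy DataFrame and the column names ("age", "y") are made up for the example and are not from the bootcamp data. The fill value is computed from the training rows only and then reused on the test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0, 38.0, None, 29.0, 50.0],
    "y":   [0, 1, 0, 1, 0, 1, 0, 1],
})

X = df[["age"]]
y = df["y"]

# Split first, so the test rows play no part in any learned statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn the fill value from the training rows only...
age_median = X_train["age"].median()

# ...then apply that same value to both the training and the test rows.
X_train = X_train.fillna({"age": age_median})
X_test = X_test.fillna({"age": age_median})
```

The same pattern applies to scaling or encoding: fit on the training split, then apply the fitted result to the test split.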

But if I split the data first and then convert categorical data into numeric data, my test data will contain categorical data.

So I should convert the categorical columns of test data first and then predict the test values, right?
Or is there any other way to go about it?

Yes @anshmjn01, convert the categorical columns of the test dataset into numerical ones in the same way as for the training data, then make predictions using the trained model.
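
One possible way to do this with get_dummies(), sketched on made-up data (the "job" column and the category values other than 'blue-collar' are assumptions for the example): encode both sets, then reindex the test columns to match the training columns, so the model sees the same features in the same order.

```python
import pandas as pd

# Hypothetical data; only 'blue-collar' comes from the error message above.
train = pd.DataFrame({"job": ["blue-collar", "admin", "technician", "admin"]})
test = pd.DataFrame({"job": ["blue-collar", "student"]})  # "student" is unseen in train

# One-hot encode each set; dtype=int gives 0/1 columns instead of booleans.
train_enc = pd.get_dummies(train, columns=["job"], dtype=int)
test_enc = pd.get_dummies(test, columns=["job"], dtype=int)

# Align the test columns to the training columns:
# columns missing from test are added as 0, and columns the model
# never saw during training (here job_student) are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

If the test set is passed to the model without this step, the raw strings are still in the column, which is likely what triggers the "could not convert string to float" error.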

Okay thanks @manish_kc_06