Getting Started With Data Science Bootcamp: Final Assignment | DPhi


This is a companion discussion topic for the original entry at https://dphi.tech/challenges/74

Hello, I would like to ask a question. What is the difference between the variables no_of_contacts and prev_attempts? It is not clearly specified in the data description. Thanks

Hi @samuelandrom
no_of_contacts tells you how many times the particular customer was contacted during the current campaign, while prev_attempts tells you how many times that customer was contacted in previous campaigns (i.e. before this campaign).

Thank you @manish_kc_06
Also, another question: last_contact_day and last_contact_month show when a particular customer was last contacted. Why, then, are those values inconsistent with days_passed (days since last contact)? I found two rows whose last_contact_day and last_contact_month differ by only one day, yet whose days_passed values differ by a lot.

Can you share a screenshot of those data points? Also, remember there can be irregularities in the data. Resolving them is part of data cleaning and preparation, based on your understanding of the dataset from EDA.

@manish_kc_06 Here they are… It’s index number 1 and 4 in the Train_data.csv

I ended up dropping the days_passed column during EDA, as it contained too little usable data. But with no information about when the data was recorded, I didn't think I could do any further EDA on last_contact_day and last_contact_month.

Perform the data cleaning accordingly. One of them holds the correct information; both cannot be true at the same time.

@manish_kc_06 Hi, please confirm the Datathon's last day to submit the notebook. The Dashboard → Assignment tab says 22 April, but when I click the registration link the last day is 25 April. Please confirm. Thanks in advance.

Hi @priyankagrover
The deadline for the final assignment has been extended by 3 days, so the final deadline to submit predictions or upload notebooks is 25 April, as given on the datathon page.

Hello, I need help calculating the difference between the call_start and call_end columns and converting it to an int datatype. Please guide me.

Please help!
After applying one-hot encoding to the train data, its shape becomes (3102, 4673), but when the same encoding is applied to the test data provided by dphi, the shape becomes (935, 1781). As a result, my model, which scores 0.81 locally, doesn't work on the provided test data.

I keep getting this error: ValueError: X has 1781 features per sample; expecting 4673

What am I doing wrong?

Hi @codepanther
You are not applying one-hot encoding properly. Please see this tutorial to understand how to do it with OneHotEncoder(): Handling Unknown Categories in both train and test set during One Hot Encoding
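A shape mismatch like (3102, 4673) vs (935, 1781) usually means the encoder was fitted separately on train and test, so each learned its own categories. A minimal sketch of the fix: fit the encoder on the training data only, then only transform the test data. The column name and values here are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for Train_data.csv / Test_data.csv; "job" is a made-up column.
train = pd.DataFrame({"job": ["admin", "technician", "services"]})
test = pd.DataFrame({"job": ["admin", "student"]})  # "student" unseen in train

# Fit on the TRAINING data only; handle_unknown="ignore" zero-encodes
# categories that appear only in the test set.
enc = OneHotEncoder(handle_unknown="ignore")
X_train = enc.fit_transform(train[["job"]])
X_test = enc.transform(test[["job"]])  # transform, never fit_transform, on test

print(X_train.shape[1], X_test.shape[1])  # same feature count for both
```

Because the category vocabulary is learned once from the train set, both matrices get the same number of columns and the fitted model accepts the test features.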

Hi @akashbnsl88
First convert the columns from object type to datetime type, then compute data['call_end'] - data['call_start'].
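A minimal sketch of those two steps, assuming call_start/call_end are stored as time-of-day strings (the sample values are made up):

```python
import pandas as pd

# Hypothetical call_start/call_end values stored as strings (object dtype).
data = pd.DataFrame({
    "call_start": ["10:15:00", "11:00:30"],
    "call_end":   ["10:18:20", "11:05:00"],
})

# Convert to datetime, subtract, then express the duration in whole seconds.
start = pd.to_datetime(data["call_start"], format="%H:%M:%S")
end = pd.to_datetime(data["call_end"], format="%H:%M:%S")
data["duration_sec"] = (end - start).dt.total_seconds().astype(int)

print(data["duration_sec"].tolist())  # [200, 270]
```

The `.dt.total_seconds().astype(int)` step also answers the follow-up below: it turns the timedelta into a plain int column you can use for EDA and correlation.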

Yes, I have been trying that, but will it be possible to convert the datetime format to int afterwards, for EDA and for measuring correlation?
If so, how can we do that?
I can send you the code where I am stuck.

You can convert a datetime object to a numeric timestamp. Refer to this thread: Python pandas convert datetime to timestamp effectively through dt accessor - Stack Overflow

Hello, I wanted to know what exactly the default_or_not column represents. Please explain it in a little detail, if possible.

Default is the failure to repay a debt, including interest or principal, on a loan or security. A default can occur when a borrower is unable to make timely payments, misses payments, or avoids or stops making payments. Default risks are often calculated well in advance by creditors.

Thank you @manish_kc_06, I’ve seen what I was doing wrong. I really appreciate your help.

@manish_kc_06 Hi Manish,
My notebook showed an evaluation error and was marked as the final submission by mistake. I am unable to upload a new notebook. Is it possible for you to re-enable notebook submissions for my account?
Second, I split the data into an 80/20 train/test ratio. On prediction, the model returns results for 621 rows, which is 20% of the dataset. When I upload prediction.csv, it says the predictions have fewer rows than expected. Please help me in this regard.
Thanks in advance.

Hi @priyankagrover
Please use the test dataset that we have provided to make the submissions.


There should be 935 rows in the prediction file, as there are 935 rows in the test dataset.
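In other words, the 80/20 split is only for local validation; the submission must be predicted on the full provided test set. A sketch with synthetic stand-in data (in the datathon the frames would come from Train_data.csv and Test_data.csv; the column names, model choice, and target are all made up here):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-ins: 200 synthetic training rows, 935 synthetic test rows to
# mirror the size of dphi's provided test set. Columns are hypothetical.
rng = np.random.default_rng(0)
train = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
train["target"] = (train["x1"] + train["x2"] > 0).astype(int)
test = pd.DataFrame({"x1": rng.normal(size=935), "x2": rng.normal(size=935)})

# The 80/20 split is used only to estimate performance locally.
X_tr, X_val, y_tr, y_val = train_test_split(
    train[["x1", "x2"]], train["target"], test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_tr, y_tr)

# Predict on the FULL provided test set, not on the 20% validation split.
pred = model.predict(test)
pd.DataFrame({"prediction": pred}).to_csv("prediction.csv", index=False)
print(len(pred))  # 935 rows, matching the expected submission size
```

Predicting on the held-out 20% of the training data is what produced the 621-row file; the platform compares row counts against its own 935-row test set.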