Contribute: Found a typo? Or any other change that could improve the notebook tutorial? Please consider sending us a pull request in the public repo of the notebook here.
Assignment - 2: Solution
(Intermediate - Advanced)
This is the second assignment of DPhi 5 Week Data Science Bootcamp that revolves aroung Exploratory Data Analysis on the Covid_19 dataset
Recommended to go through the dataset description here: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset. We have used the dataset named ‘covid_19_data.csv’ file.
Note: To solve these questions please load the data from here: https://raw.githubusercontent.com/dphi-official/Datasets/master/covid_19_data.csv because the data at kaggle is updated everyday using which might lead to get you wrong answers for this particular assignment.
import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead. import pandas.util.testing as tm
data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/covid_19_data.csv")
|0||1||01/22/2020||Anhui||Mainland China||1/22/2020 17:00||1.0||0.0||0.0|
|1||2||01/22/2020||Beijing||Mainland China||1/22/2020 17:00||14.0||0.0||0.0|
|2||3||01/22/2020||Chongqing||Mainland China||1/22/2020 17:00||6.0||0.0||0.0|
|3||4||01/22/2020||Fujian||Mainland China||1/22/2020 17:00||1.0||0.0||0.0|
|4||5||01/22/2020||Gansu||Mainland China||1/22/2020 17:00||0.0||0.0||0.0|
<class 'pandas.core.frame.DataFrame'> RangeIndex: 40805 entries, 0 to 40804 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SNo 40805 non-null int64 1 ObservationDate 40805 non-null object 2 Province/State 22790 non-null object 3 Country/Region 40805 non-null object 4 Last Update 40805 non-null object 5 Confirmed 40805 non-null float64 6 Deaths 40805 non-null float64 7 Recovered 40805 non-null float64 dtypes: float64(3), int64(1), object(4) memory usage: 2.5+ MB
Information about the dataset:
There are 8 features:
- SNo: Serial number
- ObservationDate: Date of observation of the cases (format: MM/DD/YYYY)
- Province/State: Province or State of the country where cases were observed
- Country/Region: Country where cases were observed
- Last Update: Time in UTC at which the row is updated for the given province or country. (It is not in a standard format)
- Confirmed: Cumulative number of confirmed cases till the date
- Deaths: Cumulative number of deaths till the date
- Recovered: Cumulative number of recovered cases till date
Instruction for the learner:
- Recommended to revise ‘groupby()’ method of pandas
- Recommended to understand the word ‘Cummulative’ properly for three features - ‘Confirmed’, ‘Recovered’ and ‘Deaths’.
- Please note that the dtype of datetime columns are of object, convert them to datetime as shown below:
data['ObservationDate'] = pd.to_datetime(data['ObservationDate']) data['Last Update'] = pd.to_datetime(data['Last Update'])
|0||1||2020-01-22||Anhui||Mainland China||2020-01-22 17:00:00||1.0||0.0||0.0|
|1||2||2020-01-22||Beijing||Mainland China||2020-01-22 17:00:00||14.0||0.0||0.0|
|2||3||2020-01-22||Chongqing||Mainland China||2020-01-22 17:00:00||6.0||0.0||0.0|
|3||4||2020-01-22||Fujian||Mainland China||2020-01-22 17:00:00||1.0||0.0||0.0|
|4||5||2020-01-22||Gansu||Mainland China||2020-01-22 17:00:00||0.0||0.0||0.0|
Which of the following is correct month of median observation date?
- May (correct answer)
data['ObservationDate'].quantile(.5) # Median is nothing but the second quantile or the 50th percentile
Create a subset of your data, only taking observations which was last updated on 2020-06-13 03:33:14 and name this dataframe as updated_data. Select the correct statement about this dataframe i.e. updated_data?
- There are 729 observations / records
- There are 190 unique country’s records available in the dataframe
- There are 223 unique country’s records available in the dataframe.
- This dataset doesn’t have any missing values.
Correct Anwers: 1, 2
check option 1
updated_data = data[data['Last Update'] == "2020-06-13 03:33:14"] # Get the observations on last updated date 2020-06-13 03:33:14 using subsetting a dataframe
len(updated_data) # get the length of the dataframe i.e. the number of observations in the dataframe updated_data
check option 2 and 3
len(updated_data['Country/Region'].unique()) # Get the unique countries in the dataframe and the number of unique values using len()
check option 4
updated_data.isnull().sum() # get the total number of missing values in each column
SNo 0 ObservationDate 0 Province/State 169 Country/Region 0 Last Update 0 Confirmed 0 Deaths 0 Recovered 0 dtype: int64
Create a subset of data from the dataframe updated_data, only taking the top 10 countries which have the maximum number of confirmed cases with features - ‘Country/Region’, ‘Confirmed’, ‘Deaths’ and ‘Recovered’. Name this dataframe as ‘top_10’. Which of the following countries are not present in the dataframe ‘top_10’.
Correct answers: 2, 6, 7, 8
Which country has the lowest number of confirmed cases among the ten countries in top_10 dataframe?
Correct answer: 1
Solution code: for question 3 and 4
use groupby() to group all the observations with same country, sum all the observations and then sort the dataframe by confirmed cases.
top_10 = updated_data[['Country/Region', 'Confirmed','Recovered','Deaths']].groupby('Country/Region').sum().sort_values(by = 'Confirmed', ascending = False)[:10] top_10
Add two columns in top_10 dataframe - ‘Recovered_percentage’ and ‘Deaths_percentage’ where,
‘Recovered_percentage’ = (Recovered cases / Confirmed cases) * 100
‘Death_percentage’ = (Deaths cases / Confirmed cases) * 100
Among these 10 countries which country has the highest recovery percent?
correct answer: 4
top_10['Recovered_percentage'] = top_10['Recovered'] / top_10['Confirmed'] * 100 # calculating recovered percentage top_10['Deaths_percentage'] = top_10['Deaths'] / top_10['Confirmed'] * 100 # calculating deaths percentage
top_10.Recovered_percentage.sort_values(ascending = False) # sort values of recovery percentage in descending order
Country/Region Germany 91.619220 Italy 73.246440 Spain 61.829949 Brazil 53.706278 Russia 52.639493 Peru 49.878485 India 49.471491 France 37.622917 US 26.714970 UK 0.435459 Name: Recovered_percentage, dtype: float64
Among the ten countries in top_10 dataframe, which country has the lowest death percentage?
correct answer: 5
top_10.Deaths_percentage.sort_values() # sort deaths percentage in increading order
Country/Region Russia 1.312747 Peru 2.834423 India 2.856135 Germany 4.691122 Brazil 5.046754 US 5.596378 Spain 11.157482 UK 14.118790 Italy 14.482554 France 15.203913 Name: Deaths_percentage, dtype: float64
Create a subset of data from the initially loaded data i.e. the datafram ‘data’, which should include the day wise observations of country ‘Germany’ only with features - ‘ObservationDate’, ‘Confirmed’, ‘Recovered’, ‘Deaths’. Name this dataframe as ‘Germany_data’. From the dataset ‘Germany_data’, what does it look like?
[Note: Here you need graphs to visualize]
- The rate of confirmed cases and the rate of recovered cases are going parallel to each other.
- The rate of recovered cases is more than the rate of confirmed cases.
Correct answer: 1
# Getting daywise information for Germany Germany_day_wise = data[data['Country/Region'] == 'Germany'][['ObservationDate', 'Confirmed', 'Recovered', 'Deaths']].groupby('ObservationDate').sum().sort_values(by = 'ObservationDate', ascending = True) # plot the graphs for recovered, confirmed and deaths cases Germany_day_wise.plot(figsize = (16,6)) plt.title("Confirmed, Recovered and Deaths rates of Germany per day") plt.ylabel("Number of cases") plt.show()
As it can be seen the graphs confirmed cases and recovered cases are nearly parallel to each other
Take the help of the dataframe ‘updated_data’ to get the total confirmed, recovered and deaths cases worldwide. Choose the correct option
- Confirmed cases worldwide is 7632802.
- Recovered cases worldwide is 3613277.
- Total number of deaths worldwide is 425394.
Correct answer: 1, 2, 3
updated_data[['Confirmed', 'Recovered', 'Deaths']].sum() # updated_data has total confirmed recoverd and deaths cases country wise. using sum() will add all those values in each column which is nothing but worldwide data
Confirmed 7632802.0 Recovered 3613277.0 Deaths 425394.0 dtype: float64
Get the data of USA from the dataframe ‘updated_data’, group the data state/province wise. Which of the following information is correct about Province/State of USA?
- New York has the highest number of confirmed cases.
- Massachusetts has the 5th highest number of confirmed cases.
correct answers: 1, 2
# USA data USA_data = updated_data[updated_data['Country/Region'] == 'US'] # get USA data only
# group by state USA_data_by_state = USA_data[['Province/State', 'Confirmed', 'Recovered', 'Deaths']].groupby('Province/State').sum().sort_values( by = 'Confirmed', ascending = False)
USA_data_by_state['Confirmed'].sort_values(ascending = False) # sort the values in descending order
Province/State New York 381714.0 New Jersey 166164.0 California 146659.0 Illinois 131198.0 Massachusetts 105059.0 Texas 84927.0 Pennsylvania 82481.0 Florida 70971.0 Michigan 65672.0 Maryland 60613.0 Georgia 55783.0 Virginia 53211.0 Louisiana 44995.0 Connecticut 44689.0 North Carolina 41417.0 Ohio 40424.0 Indiana 39146.0 Arizona 33039.0 Minnesota 29795.0 Tennessee 29118.0 Colorado 28807.0 Washington 25171.0 Alabama 23710.0 Iowa 23350.0 Wisconsin 22246.0 Mississippi 19091.0 South Carolina 17170.0 Nebraska 16522.0 Missouri 15997.0 Rhode Island 15947.0 Utah 13577.0 Kentucky 12166.0 Arkansas 11547.0 Kansas 10973.0 Nevada 10704.0 Delaware 10173.0 District of Columbia 9654.0 New Mexico 9526.0 Oklahoma 7849.0 South Dakota 5742.0 Puerto Rico 5536.0 Oregon 5377.0 New Hampshire 5251.0 Idaho 3353.0 North Dakota 3016.0 Maine 2721.0 West Virginia 2249.0 Vermont 1119.0 Wyoming 1027.0 Hawaii 706.0 Alaska 624.0 Montana 573.0 Guam 183.0 Grand Princess 103.0 Virgin Islands 72.0 Diamond Princess cruise ship 49.0 Northern Mariana Islands 30.0 Recovered 0.0 Name: Confirmed, dtype: float64