1 Weather data

This week you will investigate historic weather data.

Figure 1

An image of filter-like diagonal strips across various skies, such as an orange sunset, a storm and a clear blue sky

Of course, such data is hugely important for research into the large-scale, long-term shift in our planet’s weather patterns and average temperatures – climate change. However, such data is also incredibly useful for more mundane planning purposes. To demonstrate the learning this week, I, Rob Griffiths, will be using historic weather data to try and plan a summer holiday in the UK. You’ll use the data too and get a chance to work on your own project at the end of the week.

The dataset we’ll use to do this will come from the Weather Underground, which creates weather forecasts from data sent to them by a worldwide network of over 100,000 weather enthusiasts who have personal weather stations on their house or in their garden.

In addition to creating weather forecasts from that data, the Weather Underground also keeps that data as historic weather records allowing members of the public to download weather datasets for a particular time period and location. These datasets are downloaded as CSV files, explained in the next step.

Datasets are rarely ‘clean’ and fit for purpose, so it will be necessary to clean up the data and ‘mould it’ for your purposes. You will then learn how to visualise data by creating graphs using the plot() function.

1.1 What is a CSV file?

A CSV file is a plain text file that is used to hold tabular data. The acronym CSV is short for ‘comma-separated values’.

Figure 2

An image of many pins marking various countries on a globe

Take a look at the first few lines of a CSV file that holds the same data as the Excel file ‘WHO POP TB all.xls’ that you encountered in Week 2:

Country,Population (1000s),TB deaths
Afghanistan,30552,13000.0
Albania,3173,20.0
Algeria,39208,5100.0
Andorra,79,0.26
Angola,21472,6900.0
Antigua and Barbuda,90,1.2
Argentina,41446,570.0
Armenia,2977,170.0

Notice that the first line is a row of column names. The subsequent lines are rows of actual data that correspond to the column names. The row of column names is optional, but it is helpful in understanding the data in the following lines and making sure the right values fall in the right place. In this example, the first value on every row must be a string representing a country’s name, the second value is an integer representing that country’s population (in 1000s) and the third value is a decimal representing the number of deaths due to TB. Note that the third value is a decimal (like 0.26 deaths for Andorra) and not an integer because it is an estimate obtained from statistical processing of collected data.

Note that each value or column name is separated by a comma. In fact, any character can be used to separate values in a CSV file, including spaces and tabs, hence CSV can also stand for ‘character-separated values’.
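As a minimal sketch of the point about separators, the following code reads the same two rows twice, once separated by commas and once by semicolons. The text is held in strings and wrapped in StringIO so that no file is needed, and the variable names are invented for this illustration. The sep argument of read_csv() tells pandas which character separates the values.

```python
from io import StringIO
from pandas import read_csv

# The same data, first comma-separated and then semicolon-separated.
comma_text = ('Country,Population (1000s),TB deaths\n'
              'Afghanistan,30552,13000.0\n'
              'Albania,3173,20.0\n')
semicolon_text = comma_text.replace(',', ';')

# By default read_csv() expects commas; sep names a different separator.
comma_df = read_csv(StringIO(comma_text))
semicolon_df = read_csv(StringIO(semicolon_text), sep=';')

# Check whether both calls produced the same dataframe.
print(comma_df.equals(semicolon_df))
```

The replacement trick only works here because none of the values contains a comma of its own.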

Because CSV files are plain text, the data is easy to import into any spreadsheet program, database or pandas dataframe.

Before anything can be done with a CSV file with pandas, the following import statement must be executed:

In []:

from pandas import *

As you learned in Week 2, the import statement loads into memory all the code in the pandas module.

To read a CSV file into a dataframe, the pandas function read_csv() needs to be called.

In []:

df = read_csv('WHO POP TB all.csv')

The above code creates a dataframe from the data in the file WHO POP TB all.csv and assigns it to the variable df. This is the simplest usage of the read_csv() function, just using a single argument, a string that holds the name of the CSV file.

However, the function can take many additional arguments (some of which you’ll use later), which determine how the file is to be read.
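If you would like to check how read_csv() interprets the values in each column, the sketch below may help. It assumes the first few rows of the file are held in a string (csv_text and sample are names made up for this example) and uses the dtypes attribute, which has not been introduced in the course text, to list the data type pandas chose for each column.

```python
from io import StringIO
from pandas import read_csv

# The first rows of the CSV file, held in a string so no file is needed.
csv_text = '''Country,Population (1000s),TB deaths
Afghanistan,30552,13000.0
Albania,3173,20.0
Algeria,39208,5100.0
Andorra,79,0.26
'''

# read_csv() accepts a file name or, as here, a file-like object.
sample = read_csv(StringIO(csv_text))

# dtypes lists the data type pandas chose for each column.
print(sample.dtypes)
```

The population column should be reported as an integer type and the TB deaths column as a floating-point (decimal) type, in line with the description of the three values on each row.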

In the next step, find out about dataframes and the ‘dot’ notation.

1.2 Dataframes and the ‘dot’ notation

In Week 2 you learned that dataframes have methods, which are like functions, that can only be called in the context of a dataframe.

For example, because the TB deaths dataframe df has a column named ‘Country’, the sort_values() method can be called like this:

In []:

df.sort_values('Country')

Because there is a variable name, followed by a dot, followed by the method, this is called dot notation. Methods are said to be a property of a dataframe. In addition to methods, dataframes have another property – attributes.

Figure 3

A multi-coloured image of many different sized circles. They could be described as bubbles

Attributes

A dataframe attribute is like a variable that can only be accessed in the context of a dataframe. One such attribute is columns, which holds a dataframe’s column names.

So the expression df.columns evaluates to the value of the columns attribute inside the dataframe df. The following code will get and display the names of the columns in the dataframe df:

In []:

df.columns

Out[]:

Index(['Country', 'Population (1000s)', 'TB deaths'],
dtype='object')

1.3 Getting and displaying dataframe rows

Dataframes can have hundreds or thousands of rows, so it is not practical to display a whole dataframe.

However, there are a number of dataframe attributes and methods that allow you to get and display either a single row or a number of rows at a time. Three of the most useful are the iloc attribute and the head() and tail() methods. Note that to distinguish methods and attributes, we write () after a method’s name.

Figure 4

An image of a data algorithm

The iloc attribute

A dataframe has a default integer index for its rows, which starts at 0 (zero). You can get and display any single row in a dataframe by using the iloc attribute with the index of the row you want to access in square brackets. For example, the following code will get and display the first row of data in the dataframe df, which is at index 0:

In []:

df.iloc[0]

Out[]:

Country               Afghanistan
Population (1000s)          30552
TB deaths                 13000.0
Name: 0, dtype: object

Similarly, the following code will get and display the third row of data in the dataframe df, which is at index 2:

In []:

df.iloc[2]

Out[]:

Country               Algeria
Population (1000s)      39208
TB deaths              5100.0
Name: 2, dtype: object
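The row returned by iloc is itself a series whose index is made up of the column names, so a single value can be picked out of it by name. The following is a minimal sketch, assuming a three-row stand-in dataframe called small (a name invented here) built from the first rows shown earlier.

```python
from pandas import DataFrame

# A three-row stand-in for df, built from the first rows shown earlier.
small = DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria'],
    'Population (1000s)': [30552, 3173, 39208],
    'TB deaths': [13000.0, 20.0, 5100.0],
})

row = small.iloc[2]      # the third row, at index 2
print(row['Country'])    # a single value, looked up by column name
print(row.name)          # the row's index in the dataframe
```

The value of row.name is what appears after ‘Name:’ when the row is displayed.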


The head() method

The first few rows of a dataframe can be printed out with the head() method.

You can tell head() is a method, rather than an attribute such as columns, because of the parentheses (round brackets) after the property name.

If you don’t give any argument, i.e. don’t put any number within those parentheses, the default behaviour is to return the first five rows of the dataframe. If you give an argument, it will print that number of rows (starting from the row indexed by 0).

For example, executing the following code will get and display the first five rows in the dataframe df.

In []:

df.head()

Out[]:

       Country  Population (1000s)  TB deaths
0  Afghanistan               30552   13000.00
1      Albania                3173      20.00
2      Algeria               39208    5100.00
3      Andorra                  79       0.26
4       Angola               21472    6900.00

And, executing the following code will get and display the first seven rows in the dataframe df.

In []:

df.head(7)

Out[]:

               Country  Population (1000s)  TB deaths
0          Afghanistan               30552   13000.00
1              Albania                3173      20.00
2              Algeria               39208    5100.00
3              Andorra                  79       0.26
4               Angola               21472    6900.00
5  Antigua and Barbuda                  90       1.20
6            Argentina               41446     570.00



The tail() method

The tail() method is similar to the head() method.

If no argument is given, the last five rows of the dataframe are returned, otherwise the number of rows returned is dependent on the argument, just like for the head() method.

In []:

df.tail()

Out[]:

                                Country  Population (1000s)  TB deaths
189  Venezuela (Bolivarian Republic of)               30405        480
190                            Viet Nam               91680      17000
191                               Yemen               24407        990
192                              Zambia               14539       3600
193                            Zimbabwe               14150       5700
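To see how the argument changes what head() and tail() return, here is a small sketch. It assumes an eight-row, single-column dataframe called small (a name invented for this example) holding the countries from the CSV extract shown earlier.

```python
from pandas import DataFrame

# The eight countries from the CSV extract, used as a stand-in for df.
countries = ['Afghanistan', 'Albania', 'Algeria', 'Andorra',
             'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia']
small = DataFrame({'Country': countries})

first_three = small.head(3)   # rows at index 0, 1 and 2
last_two = small.tail(2)      # rows at index 6 and 7

print(list(first_three['Country']))
print(list(last_two.index))
```

Note that the rows returned by tail() keep their original index values rather than being renumbered from zero.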


1.4 Getting and displaying dataframe columns

You learned in Week 2 that you can get and display a single column of a dataframe by putting the name of the column (in quotes) within square brackets immediately after the dataframe’s name.

For example, like this:

In []:

df['TB deaths']

You then get output like this:

Out[]:

0    13000.00
1       20.00
2     5100.00
3        0.26
4     6900.00
5        1.20
6      570.00
...

Notice that although there is an index, there is no column heading. This is because what is returned is not a new dataframe with a single column but an example of the Series data type.

Figure 5

A perspective image of the aisle between many data storage towers. The floor and the storage units are lit up.

Each column in a dataframe is an example of a series

The Series data type is a collection of values with an integer index that starts from zero. In addition, the Series data type has many of the same methods and attributes as the DataFrame data type, so you can still execute code like:

In []:

df['TB deaths'].head()

Out[]:

0    13000.00
1       20.00
2     5100.00
3        0.26
4     6900.00
Name: TB deaths, dtype: float64

And

In []:

df['TB deaths'].iloc[2]

Out[]:

5100.00

However, pandas does provide a mechanism for you to get and display one or more selected columns as a new dataframe in its own right. To do this you need to use a list. A list in Python consists of one or more items separated by commas and enclosed within square brackets, for example ['Country'] or ['Country', 'Population (1000s)']. This list is then put within outer square brackets immediately after the dataframe’s name, like this:

In []:

df[['Country']].head()

Out[]:

       Country
0  Afghanistan
1      Albania
2      Algeria
3      Andorra
4       Angola

Note that the column is now named. The expression df[['Country']] (with two square brackets) evaluates to a new dataframe (which happens to have a single column) rather than a series.

To get a new dataframe with multiple columns you just need to put more column names in the list, like this:

In []:

df[['Country', 'Population (1000s)']].head()

Out[]:

       Country  Population (1000s)
0  Afghanistan               30552
1      Albania                3173
2      Algeria               39208
3      Andorra                  79
4       Angola               21472

The code has returned a new dataframe with just the 'Country' and 'Population (1000s)' columns.
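If you want to confirm the difference between single and double square brackets for yourself, the sketch below may help. It assumes a small stand-in dataframe called small (a made-up name) and asks Python for the type of each result.

```python
from pandas import DataFrame, Series

# A small stand-in for df.
small = DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria'],
    'Population (1000s)': [30552, 3173, 39208],
})

as_series = small['Country']        # single brackets: a series
as_dataframe = small[['Country']]   # a list inside brackets: a dataframe

print(type(as_series).__name__)
print(type(as_dataframe).__name__)
print(list(as_dataframe.columns))
```

Only the dataframe has a columns attribute; the series instead records the column’s name, which is displayed after ‘Name:’.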

Exercise 1 Dataframes and CSV files

Question

Now that you’ve learned about CSV files and more about pandas you are ready to complete Exercise 1 in the exercise notebook 2.

Open the Exercise notebook 2 and the data file you used last week, WHO POP TB all.csv, and save them in the folder you created in Week 1.

If you’re using Anaconda instead of CoCalc, remember that to open the notebook you’ll need to navigate to the notebook using Jupyter. Once it’s open, run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter, watch again the video in Week 1 Exercise 1.


1.5 Comparison operators

In Expressions, you learned that Python has arithmetic operators: +, /, - and * and that expressions such as 5 + 2 evaluate to a value (in this case the number 7).

Figure 6

An illustration of two girls holding up signs. One sign says, ‘YES’, the other says, ‘NO’.

Python also has what are called comparison operators. These are:

==    equals
!=    not equal
<     less than
>     greater than
<=    less than or equal to
>=    greater than or equal to

Expressions involving these operators always evaluate to a Boolean value, that is True or False. Here are some examples:

2 == 2       evaluates to True
2 + 2 == 5   evaluates to False
2 != 1 + 1   evaluates to False
45 < 50      evaluates to True
20 > 30      evaluates to False
100 <= 100   evaluates to True
101 >= 100   evaluates to True

The comparison operators can be used with other types of data, not just numbers. Used with strings they compare using alphabetical order. For example:

'aardvark' < 'zebra' evaluates to True
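One caveat is worth checking for yourself. Python compares strings character by character using each character’s Unicode code point, and all the uppercase letters A to Z come before all the lowercase letters a to z. The short sketch below, with variable names chosen only for this illustration, shows that the ordering is therefore alphabetical only when the strings use the same case.

```python
# Both strings are lowercase, so the result follows alphabetical order.
lower_case = 'aardvark' < 'zebra'

# 'Z' is uppercase, and every uppercase letter sorts before 'a'.
mixed_case = 'Zebra' < 'aardvark'

# Converting to a common case restores alphabetical order.
same_case = 'Zebra'.lower() < 'aardvark'

print(lower_case, mixed_case, same_case)
```

This matters little for the country names used here, which all start with an uppercase letter.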

In Calculating over columns you saw that when applied to whole columns, the arithmetic operators did the calculations row by row. Similarly, an expression like df['Country'] >= 'K' will compare the country names, row by row, against the string ‘K’ and record whether the result is True or False in a series like this:

0    False
1    False
2    False
3    False
4    False
5    False
...
Name: Country, dtype: bool

If such an expression is put within square brackets immediately after a dataframe’s name, a new dataframe is obtained with only those rows where the result is True. So:

df[df['Country'] >= 'K']

returns a new dataframe with all the columns of df but with only the rows corresponding to countries starting with K or a letter later in the alphabet.
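The two steps, comparing and then selecting, can be sketched separately. The code assumes a five-row stand-in dataframe called small, with rows taken from the tables in this section; the names small, is_k_or_later and selected are invented for the example.

```python
from pandas import DataFrame

# A five-row stand-in for df.
small = DataFrame({
    'Country': ['Afghanistan', 'Brazil', 'Japan', 'Mexico', 'Zimbabwe'],
    'Population (1000s)': [30552, 200362, 127144, 122332, 14150],
})

# Step 1: compare row by row, giving a series of Boolean values.
is_k_or_later = small['Country'] >= 'K'
print(is_k_or_later.tolist())

# Step 2: keep only the rows where the comparison was True.
selected = small[is_k_or_later]
print(list(selected['Country']))
print(list(selected.index))
```

The selected rows keep the index values they had in the original dataframe, which is why the indexes in filtered output are not consecutive.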

As another example, to see the data for countries with over 80 million inhabitants, the following code will return and display a new dataframe with all the columns of df but with only the rows where it is True that the value in the 'Population (1000s)' column is greater than 80000:

In []:

df[df['Population (1000s)'] > 80000]

Out[]:

                      Country  Population (1000s)  TB deaths
 13                Bangladesh              156595      80000
 23                    Brazil              200362       4400
 36                     China             1393337      41000
 53                     Egypt               82056        550
 58                  Ethiopia               94101      30000
 65                   Germany               82727        300
 77                     India             1252140     240000
 78                 Indonesia              249866      64000
 85                     Japan              127144       2100
109                    Mexico              122332       2200
124                   Nigeria              173615     160000
128                  Pakistan              182143      49000
134               Philippines               98394      27000
141        Russian Federation              142834      17000
185  United States of America              320051        490
190                  Viet Nam               91680      17000

Exercise 2 Comparison operators

Question

You are ready to complete Exercise 2 in the Exercise notebook 2.

Remember to run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook.

1.6 Bitwise operators

To build more complicated expressions involving column comparisons, you can use two bitwise operators.

Figure 7

An image of someone constructing a building from wooden blocks

The & operator means ‘and’ and the | operator (vertical bar, not uppercase letter ‘i’) means ‘or’. So, for example the expression:

(df['Country'] >= 'Latvia') & (df['Country'] <= 'Sweden')

will evaluate to a series containing Boolean values where the values are True only if the equivalent rows in the dataframe contain the countries ‘Latvia’ to ‘Sweden’, inclusive. However, the following expression which uses | (or) rather than & (and):

(df['Country'] >= 'Latvia') | (df['Country'] <= 'Sweden')

will evaluate to True for all countries, because every country comes alphabetically after ‘Latvia’ (e.g. the ‘UK’) or before ‘Sweden’ (e.g. ‘Brazil’).

Note the round brackets around each comparison. Without them you will get an error, because Python applies & and | before the comparison operators.
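The effect of leaving out the round brackets can be seen in the following sketch. It assumes a three-row stand-in dataframe called small, with rows taken from the tables in this section. The exact error message depends on the pandas version and on the column types, so the code only reports the kind of error raised.

```python
from pandas import DataFrame

# A three-row stand-in for df.
small = DataFrame({
    'Country': ['Andorra', 'Bangladesh', 'Brazil'],
    'Population (1000s)': [79, 156595, 200362],
    'TB deaths': [0.26, 80000.0, 4400.0],
})
population = small['Population (1000s)']
deaths = small['TB deaths']

# With round brackets, each comparison is evaluated before & combines them.
both = (population > 80000) & (deaths > 10000)
print(both.tolist())

# Without them, Python tries 80000 & deaths first, so an error is raised.
try:
    population > 80000 & deaths > 10000
    raised_error = False
except (TypeError, ValueError) as error:
    raised_error = True
    print(type(error).__name__)
```

Only Bangladesh satisfies both conditions in this small sample, so it is the only row for which the bracketed expression is True.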

The whole expression with multiple comparisons has to be put within df[…] to get a dataframe with only those rows that match the condition.

As a further example, using different columns, it is relatively easy to find the rows in df where ‘Population (1000s)’ is greater than 80000 and where ‘TB deaths’ are greater than 10000.

In []:

df[(df['Population (1000s)'] > 80000) & (df['TB deaths'] > 10000)]

Out []:

                Country  Population (1000s)  TB deaths
 13          Bangladesh              156595      80000
 36               China             1393337      41000
 58            Ethiopia               94101      30000
 77               India             1252140     240000
 78           Indonesia              249866      64000
124             Nigeria              173615     160000
128            Pakistan              182143      49000
134         Philippines               98394      27000
141  Russian Federation              142834      17000
190            Viet Nam               91680      17000

These expressions can get long and complicated, making it easy to miss a crucial round or square bracket. In those cases it is best to break up the expression into small steps. The previous example could also be written as:

In []:

population = df['Population (1000s)']
deaths = df['TB deaths']
df[(population > 80000) & (deaths > 10000)]

Exercise 3 Bitwise operators

Question

Complete Exercise 3 in the Exercise notebook 2.