4 Summary¶
Phew – you made it! Well done!
Figure 11 An image of a gymnast on her front with her legs split over her head
During this week you’ve learned how to take a dataset that contains multiple possible groupings or subsets of data, and work with those groups to perform a variety of transformations.
In particular, you have learned how to:
split the data contained in a dataframe into multiple groups based on the unique ‘key’ values in a single column, or unique combinations of values that appear across two or more columns
apply an aggregate (summary) function to generate a single summary result for each group, and then combine these results to generate a summary report with one row per group
apply a filter function that uses the rows contained in each group as the basis for a filtering operation, returning rows from each group whose group properties match the filter conditions
use a pivot table to generate a variety of summary reports.
You may not have thought of performing gymnastics with data before, but as you’ve seen, we can start to work our data quite hard if we need to get it into shape!
4.1 Week 7 and 8 glossary¶
Here are alphabetical lists, for quick look up, of what this week introduced.
Concepts¶
An API, or application programming interface, provides a way for computer programs to request functions or resources from a software application. Many online applications provide a web API that allows requests to be made over the internet using web addresses or URLs (uniform resource locators). A URL may include several parameters that act as arguments used to pass information into a function provided by the API. To prevent ambiguity, characters such as spaces and punctuation marks that have a special meaning in a URL are not written directly in parameter values. Instead, ‘websafe’ encodings based on the ASCII encoding scheme are typically used to represent them.
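As a minimal sketch of how parameters are encoded into a URL, the following uses Python's standard urllib.parse module. The web address api.example.com and the parameter names are hypothetical and stand in for whatever a real API documents.

```python
from urllib.parse import quote, urlencode

# Hypothetical API address, used purely for illustration.
base_url = "https://api.example.com/data"

# Hypothetical parameters to pass to the API.
params = {"country": "United Kingdom", "indicator": "GDP & growth"}

# urlencode() joins the parameters with '&' and replaces characters that
# would be ambiguous in a URL: a space becomes '+', and '&' becomes '%26'.
query = urlencode(params)
print(query)  # country=United+Kingdom&indicator=GDP+%26+growth

# The full request URL separates the address and the parameters with '?'.
url = base_url + "?" + query
print(url)

# quote() encodes a single string; here a space becomes '%20'.
print(quote("open data"))  # open%20data
```

The %26 and %20 codes are the hexadecimal ASCII codes for the ampersand and the space.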
The notion of grouping refers to the collecting together of sets of rows based on some defining characteristic. Grouping on one or more key columns splits the dataset into separate groups, one group for each unique combination of values that appears in the dataset across the key columns. Note that not all possible combinations of cross-column key values will necessarily exist in a dataset.
The split-apply-combine pattern describes a process in which a dataset is split into separate groups, some function is applied to the members of each separate group, and the results then combined to form an output dataset.
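To make the three steps of the pattern concrete, here is a rough sketch that carries them out one at a time. The region and sales data are made up for illustration. In practice, pandas performs all three steps for you in a single groupby expression.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "sales": [10, 20, 5, 15, 25],
})

# Split: one sub-dataframe for each unique value in the key column.
groups = {name: group for name, group in df.groupby("region")}

# Apply: calculate a summary value for each group separately.
totals = {name: group["sales"].sum() for name, group in groups.items()}

# Combine: collect the per-group results into a single output.
result = pd.Series(totals, name="sales")
print(result)
```

The same result could be obtained more directly with df.groupby("region")["sales"].sum().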
Functions and methods¶
df.to_csv(filename, index=False) writes the contents of the dataframe df to a CSV file with the specified filename in the current folder. The index parameter controls whether the dataframe index is included in the output file.
read_csv(URL, dtype={}) can be used to read a CSV file in from a web location given the web address or URL of the file. We also made use of an additional parameter, dtype, to specify the data type of specified columns in a dataframe created from a CSV file.
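The short sketch below shows both functions working together. So that it runs without an internet connection, it reads the file back from a temporary local folder rather than from a URL. The column names and values are made up.

```python
import os
import tempfile

import pandas as pd

# Made-up data: the codes have leading zeros that are easy to lose.
df = pd.DataFrame({"code": ["007", "042"], "value": [1.5, 2.5]})

with tempfile.TemporaryDirectory() as folder:
    filename = os.path.join(folder, "example.csv")

    # index=False leaves the dataframe index out of the file.
    df.to_csv(filename, index=False)

    # Without dtype, pandas guesses that the codes are numbers.
    default = pd.read_csv(filename)

    # With dtype, the 'code' column is kept as text.
    as_text = pd.read_csv(filename, dtype={"code": str})

print(default["code"].tolist())  # [7, 42]
print(as_text["code"].tolist())  # ['007', '042']
```

Passing a web address instead of a local filename to read_csv() works in the same way.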
df.groupby(columnName) or df.groupby(listOfColumnNames) is used to split a dataframe into separate groups indexed by the unique values of columnName or unique combinations of the column values specified in the listOfColumnNames.
grouped.get_group(groupName) is used to retrieve a particular group of rows by group name from a set of grouped items.
grouped.groups.keys() is used to retrieve the names of groups that exist within a set of grouped items.
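As an illustration of these three methods, the following sketch groups a small, made-up trade dataset first on one column and then on two. The column names and figures are assumptions chosen for the example.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "country": ["UK", "UK", "FR", "FR"],
    "year": [2013, 2014, 2013, 2014],
    "exports": [100, 110, 90, 95],
})

# Group on a single column: one group per country.
grouped = df.groupby("country")
print(list(grouped.groups.keys()))
print(grouped.get_group("UK"))

# Group on two columns: one group per (country, year) combination
# that actually appears in the data.
by_two = df.groupby(["country", "year"])
print(list(by_two.groups.keys()))

# With more than one key column, the group name is a tuple.
print(by_two.get_group(("FR", 2014)))
```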
grouped.aggregate(operation) applies a specified operation (such as sum) to each group and then combines the results into a single dataframe indexed by group.
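A minimal sketch of aggregation, using made-up trade figures. Here the operation is given by name as the string "sum", which is one of several ways pandas accepts it.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "country": ["UK", "UK", "FR", "FR"],
    "exports": [100, 110, 90, 95],
    "imports": [80, 85, 70, 75],
})

grouped = df.groupby("country")

# One row per group, with each remaining column summed within the group.
totals = grouped.aggregate("sum")
print(totals)
```

The result has one row for FR and one for UK, with the country names as the index.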
grouped.apply(myFunction) will apply a user-defined function to the rows associated with each group in a set of grouped items and return a dataframe containing the combined rows returned from the user-defined function.
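The sketch below applies a hypothetical user-defined function, top_row, that returns the row with the largest exports value in each group. The data is made up. The columns to work on are selected explicitly before calling apply() because pandas versions differ in whether the key column is also passed to the function.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "country": ["UK", "UK", "FR", "FR"],
    "year": [2013, 2014, 2013, 2014],
    "exports": [100, 110, 90, 95],
})


def top_row(group):
    """Return the single row of the group with the largest exports."""
    return group.sort_values("exports", ascending=False).head(1)


# Select the columns the function needs, then apply it to each group.
grouped = df.groupby("country")[["year", "exports"]]
result = grouped.apply(top_row)
print(result)
```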
grouped.filter(myFilterFunction) will apply a user-defined filtration function to each group in a set of grouped items that tests each group and returns a Boolean True or False value to say whether each group has passed the filter test. The .filter() function then returns a single dataframe that contains the combined rows from groups that the user-defined filter function let through.
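As a small sketch of group filtering, the hypothetical function big_exporter below lets through only those groups whose total exports exceed 200. The threshold and the data are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "country": ["UK", "UK", "FR", "FR"],
    "year": [2013, 2014, 2013, 2014],
    "exports": [100, 110, 90, 95],
})


def big_exporter(group):
    """Return True if the group's total exports exceed 200."""
    return group["exports"].sum() > 200


grouped = df.groupby("country")

# All rows of a group are kept or dropped together.
kept = grouped.filter(big_exporter)
print(kept)
```

The UK rows total 210 and are kept, whereas the FR rows total 185 and are dropped.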
pivot_table(df, index=indexColumnNames, columns=columnsColumnNames, values=valueColumnName, aggfunc=aggregationFunction) generates a pivot table from a dataframe using unique combinations of values from one or more columns specified by the indexColumnNames list to define the row index and unique combinations of values from one or more columns specified by the columnsColumnNames list to define the columns. The pivot table cells are calculated by applying the aggfunc function to the valueColumnName column in the group of rows associated with each cell.
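Here is a minimal sketch of a pivot table built from made-up trade data, with countries as rows, years as columns and summed exports in the cells. The column names are assumptions chosen for the example.

```python
import pandas as pd

# Made-up data: the UK has two rows for 2013.
df = pd.DataFrame({
    "country": ["UK", "UK", "UK", "FR", "FR"],
    "year": [2013, 2013, 2014, 2013, 2014],
    "exports": [100, 50, 110, 90, 95],
})

# Each cell holds the sum of 'exports' over the rows that share
# that cell's country and year.
table = pd.pivot_table(
    df,
    index=["country"],
    columns=["year"],
    values="exports",
    aggfunc="sum",
)
print(table)
```

The two UK rows for 2013 are combined into a single cell with the value 150.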
4.2 What next?¶
##-- MEDIACONTENT
RUTH ALEXANDER: Hello again. You've now reached the end of this course on 'learning to code for data analysis'. You've seen and experienced some of the fundamentals of programming such as basic data types and structures like numbers, strings and lists and assignments and variables to store intermediate results. You've also seen the basic techniques to obtain, clean, transform, aggregate and visualise data. With a single line of code you can filter out the missing values, join two tables or make a chart. Finally you have seen how to use interactive notebooks to write up your own data analysis as a mixture of explanatory text and runnable code. Notebooks can be easily shared among a group of colleagues or publicly which means you can make a real contribution to ongoing research and debates. We hope you're keen to apply your newly gained skills to other data sets on issues that you care about. You'll be able to find data sets you can explore and interrogate in the fields of health, education, energy, climate change, poverty, crime and many more besides. Below this video you'll find links to open data sources but don't forget that your national government or even your local authority might provide open data that's more relevant to you. In a short course like this we could only scratch the surface of coding and data analysis. We hope to have inspired you to learn more about programming, data science and data management, or even statistics. Below you'll also find links to Open University courses, qualifications and free online resources that are related to the topics of this course. Whether you continue your studies with the Open University or not we do hope you really enjoyed learning to code for data analysis. We'd love to hear your feedback and suggestions. Thanks for participating and all the best for the future.
##-- ENDMEDIACONTENT
As the course comes to an end, what’s next in your learning journey? Ruth mentions extending your learning by investigating more open data.
Exploring open data further¶
The last few years have seen a wide variety of local and national governments and agencies publishing data as ‘open data’ that can be freely re-used by anyone. Explore some of this data yourself, at the following links:
UK government open data site – a directory of UK public datasets
US government open data site – the home of the US Government’s open data
Open Knowledge Global Open Data Index – a comprehensive directory of national open data initiatives
Open Data Inception – a geographic list of over 1500 data portals around the world
Google Public Data Explorer – a further list of data providers, with charts for some datasets
Many towns and cities also have their own data sites: search for the name of your town and the keywords ‘open data store’
Open data published by government departments and agencies such as performance of UK schools or prices paid for house sales in the UK
The pandas library supports a growing number of external data sources such as Google Analytics.
Complete our survey¶
We would love to know what you thought of the course and what you plan to do next. Whether you studied the course all in one go or dipped in and out, please take our end-of-course survey. Your feedback is anonymous but will help us to improve what we deliver.