From d0252221e769d79b10bd20c4fca959671d0768f3 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Fri, 28 May 2021 18:13:22 +0200 Subject: [PATCH 1/2] Add version of 2019 as md --- _solved/case1_bike_count.md | 587 +++++++++ _solved/case2_biodiversity_analysis.md | 653 ++++++++++ _solved/case2_biodiversity_processing.md | 1090 +++++++++++++++++ ...se3_bacterial_resistance_lab_experiment.md | 360 ++++++ _solved/case4_air_quality_analysis.md | 774 ++++++++++++ _solved/case4_air_quality_processing.md | 467 +++++++ _solved/pandas_01_data_structures.md | 464 +++++++ _solved/pandas_02_basic_operations.md | 302 +++++ _solved/pandas_03a_selecting_data.md | 490 ++++++++ _solved/pandas_03b_indexing.md | 357 ++++++ _solved/pandas_04_time_series_data.md | 444 +++++++ _solved/pandas_05_combining_datasets.md | 209 ++++ _solved/pandas_06_groupby_operations.md | 636 ++++++++++ _solved/pandas_07_reshaping_data.md | 465 +++++++ _solved/visualization_01_matplotlib.md | 443 +++++++ _solved/visualization_02_plotnine.md | 509 ++++++++ _solved/visualization_03_landscape.md | 799 ++++++++++++ _solved/workflow_example_evaluation.md | 364 ++++++ 18 files changed, 9413 insertions(+) create mode 100644 _solved/case1_bike_count.md create mode 100644 _solved/case2_biodiversity_analysis.md create mode 100644 _solved/case2_biodiversity_processing.md create mode 100644 _solved/case3_bacterial_resistance_lab_experiment.md create mode 100644 _solved/case4_air_quality_analysis.md create mode 100644 _solved/case4_air_quality_processing.md create mode 100644 _solved/pandas_01_data_structures.md create mode 100644 _solved/pandas_02_basic_operations.md create mode 100644 _solved/pandas_03a_selecting_data.md create mode 100644 _solved/pandas_03b_indexing.md create mode 100644 _solved/pandas_04_time_series_data.md create mode 100644 _solved/pandas_05_combining_datasets.md create mode 100644 _solved/pandas_06_groupby_operations.md create mode 100644 _solved/pandas_07_reshaping_data.md create mode 
100644 _solved/visualization_01_matplotlib.md create mode 100644 _solved/visualization_02_plotnine.md create mode 100644 _solved/visualization_03_landscape.md create mode 100644 _solved/workflow_example_evaluation.md diff --git a/_solved/case1_bike_count.md b/_solved/case1_bike_count.md new file mode 100644 index 0000000..4bb9333 --- /dev/null +++ b/_solved/case1_bike_count.md @@ -0,0 +1,587 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

# CASE - Bike count data

> *DS Data manipulation, analysis and visualisation in Python*
> *December, 2019*

> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

---

+++

In this case study, we will make use of the freely available bike count data of the city of Ghent. At the Coupure Links, next to the Faculty of Bioscience Engineering, a counter keeps track of the number of passing cyclists in both directions.

Those data are available on the open data portal of the city: https://data.stad.gent/data/236

```{code-cell} ipython3
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')

%matplotlib notebook
```

## Reading and processing the data

+++

### Read csv data from URL

+++

The data are available in CSV, JSON and XML format. We will make use of the CSV data. The link to download the data can be found on the webpage. For the first dataset, this is:

    link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv"

The size of the requested data set can be restricted by adding a `limit` parameter to the URL:

    link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv?limit=100000"

Those datasets contain the historical data of the bike counters and consist of the following columns:

- The first column `datum` is the date, in `dd/mm/yyyy` format
- The second column `tijd` is the time of the day, in `hh:mm` format
- The third and fourth columns, `ri Centrum` and `ri Mariakerke`, are the counts at that point in time (counts between this timestamp and the previous one)

```{code-cell} ipython3
limit = 200000
link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv?limit={}".format(limit)
```
**EXERCISE**:

- Read the csv file from the URL into a DataFrame `df`; the delimiter of the data is `;`
- Inspect the first and last 5 rows, and check the number of observations
- Inspect the data types of the different columns
+ +```{code-cell} ipython3 +:clear_cell: true + +df = pd.read_csv(link, sep=';') +``` + +```{code-cell} ipython3 +:clear_cell: true + +df.head() +``` + +```{code-cell} ipython3 +:clear_cell: true + +df.tail() +``` + +```{code-cell} ipython3 +:clear_cell: true + +len(df) +``` + +```{code-cell} ipython3 +:clear_cell: true + +df.dtypes +``` + +
**Remark**: If the download is very slow, consider setting the `limit` variable to a lower value; most exercises will work with just the first 100,000 records as well.
+ ++++ + +### Data processing + ++++ + +As explained above, the first and second column (respectively `datum` and `tijd`) indicate the date and hour of the day. To obtain a time series, we have to combine those two columns into one series of actual datetime values. + ++++ + +
**EXERCISE**: Preprocess the data

- Combine the 'datum' and 'tijd' columns into one Series of string datetime values (Hint: concatenating strings can be done with the addition operation)
- Parse the string datetime values (Hint: specifying the format will make this a lot faster)
- Set the resulting dates as the index
- Remove the original 'datum' and 'tijd' columns (Hint: check the drop method)
- Rename 'ri Centrum' and 'ri Mariakerke' to 'direction_centre' and 'direction_mariakerke' (Hint: check the rename function)
+ +```{code-cell} ipython3 +:clear_cell: true + +combined = df['datum'] + ' ' + df['tijd'] +combined.head() +``` + +```{code-cell} ipython3 +:clear_cell: true + +df.index = pd.to_datetime(combined, format="%d/%m/%Y %H:%M") +``` + +```{code-cell} ipython3 +:clear_cell: true + +df = df.drop(columns=['datum', 'tijd']) +``` + +```{code-cell} ipython3 +:clear_cell: true + +df = df.rename(columns={'ri Centrum': 'direction_centre', 'ri Mariakerke':'direction_mariakerke'}) +``` + +```{code-cell} ipython3 +df.head() +``` + +Having the data available with an interpreted datetime, provides us the possibility of having time aware plotting: + +```{code-cell} ipython3 +fig, ax = plt.subplots(figsize=(10, 6)) +df.plot(colormap='coolwarm', ax=ax) +``` + +
+ + Remark: Interpretation of the dates with and without predefined date format. + +
+++

When we just want to interpret the dates without specifying how they are formatted, Pandas makes its best attempt to guess:

```{code-cell} ipython3
%timeit -n 1 -r 1 pd.to_datetime(combined, dayfirst=True)
```

However, when we already know the format of the dates (and if this is consistent throughout the full dataset), we can use this information to interpret the dates:

```{code-cell} ipython3
%timeit pd.to_datetime(combined, format="%d/%m/%Y %H:%M")
```
**Remember**: Whenever possible, specify the date format when parsing dates to datetime values!
+ ++++ + +### Write the data set cleaning as a function + +In order to make it easier to reuse the code for the preprocessing we have now implemented, let's convert the code to a Python function + ++++ + +
**EXERCISE**:

- Write a function `process_bike_count_data(df)` that performs the processing steps as done above for an input DataFrame and returns the updated DataFrame
+ +```{code-cell} ipython3 +:clear_cell: true + +def process_bike_count_data(df): + """ + Process the provided dataframe: parse datetimes and rename columns. + + """ + df.index = pd.to_datetime(df['datum'] + ' ' + df['tijd'], format="%d/%m/%Y %H:%M") + df = df.drop(columns=['datum', 'tijd']) + df = df.rename(columns={'ri Centrum': 'direction_centre', 'ri Mariakerke':'direction_mariakerke'}) + return df +``` + +```{code-cell} ipython3 +df_raw = pd.read_csv(link, sep=';') +df_preprocessed = process_bike_count_data(df_raw) +``` + +### Store our collected dataset as an interim data product + ++++ + +As we finished our data-collection step, we want to save this result as a interim data output of our small investigation. As such, we do not have to re-download all the files each time something went wrong, but can restart from our interim step. + +```{code-cell} ipython3 +df_preprocessed.to_csv("bike_count_interim.csv") +``` + +## Data exploration and analysis + ++++ + +We now have a cleaned-up dataset of the bike counts at Coupure Links. Next, we want to get an impression of the characteristics and properties of the data + ++++ + +### Load the interim data + ++++ + +Reading the file in from the interim file (when you want to rerun the whole analysis on the updated online data, you would comment out this cell...) + +```{code-cell} ipython3 +df = pd.read_csv("bike_count_interim.csv", index_col=0, parse_dates=True) +``` + +### Count interval verification + ++++ + +The number of bikers are counted for intervals of approximately 15 minutes. But let's check if this is indeed the case. +For this, we want to calculate the difference between each of the consecutive values of the index. We can use the `Series.diff()` method: + +```{code-cell} ipython3 +pd.Series(df.index).diff() +``` + +Again, the count of the possible intervals is of interest: + +```{code-cell} ipython3 +pd.Series(df.index).diff().value_counts() +``` + +There are a few records that is not exactly 15min. 
But since these are only a few records, we will ignore them for the current case study and keep them as they are in this exploratory analysis.

Bonus question: do you know where the values of `-1 days +23:15:01` and `01:15:00` are coming from?

```{code-cell} ipython3
df.describe()
```

### Quiet periods

+++
**EXERCISE**:

- Create a new Series, `df_both`, which contains the sum of the counts of both directions

_Tip:_ check the purpose of the `axis` argument of the `sum` function
+ +```{code-cell} ipython3 +:clear_cell: true + +df_both = df.sum(axis=1) +``` + +
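As a reminder of what the `axis` argument does, here is a minimal sketch on a toy DataFrame (the `toy` frame below is made up for illustration and is not part of the bike data):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

col_sums = toy.sum(axis=0)  # axis=0: collapse the rows -> one value per column
row_sums = toy.sum(axis=1)  # axis=1: collapse the columns -> one value per row

print(col_sums.tolist())  # [3, 30]
print(row_sums.tolist())  # [11, 22]
```

For the bike counts, summing over `axis=1` adds the two direction columns together for every timestamp.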
**EXERCISE**:

- Using the `df_both` from the previous exercise, create a new Series `df_quiet` which contains only those intervals during which fewer than 5 cyclists passed in both directions combined
+ +```{code-cell} ipython3 +:clear_cell: true + +df_quiet = df_both[df_both < 5] +``` + +
**EXERCISE**:

- Using the original data, select only the intervals during which fewer than 3 cyclists passed in one or the other direction. Hence, fewer than 3 cyclists towards the centre or fewer than 3 cyclists towards Mariakerke.
+ +```{code-cell} ipython3 +:clear_cell: true + +df[(df['direction_centre'] < 3) | (df['direction_mariakerke'] < 3)] +``` + +### Count statistics + ++++ + +
**EXERCISE**:

- What is the average number of bikers passing every 15 minutes?
+ +```{code-cell} ipython3 +:clear_cell: true + +df.mean() +``` + +
**EXERCISE**:

- What is the average number of bikers passing each hour?

_Tip:_ you can use `resample` to first calculate the number of bikers passing each hour.
+ +```{code-cell} ipython3 +:clear_cell: true + +df.resample('H').sum().mean() +``` + +
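A minimal sketch of what `resample` does, on a made-up quarter-hourly series (not the actual bike data):

```python
import pandas as pd

# eight quarter-hour timestamps, one cyclist counted in each interval
idx = pd.date_range("2019-01-01 00:00", periods=8, freq="15min")
counts = pd.Series(1, index=idx)

hourly = counts.resample("H").sum()  # four quarter-hour intervals per hourly bin
print(hourly.tolist())  # [4, 4]
```

Chaining `.sum()` and `.mean()` as in the solution first aggregates the counts to hourly totals, then averages those totals.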
**EXERCISE**:

- What are the 10 highest peak values observed during any of the intervals for the direction towards the centre of Ghent?
+ +```{code-cell} ipython3 +:clear_cell: true + +df['direction_centre'].nlargest(10) +# alternative: +# df['direction_centre'].sort_values(ascending=False).head(10) +``` + +
**EXERCISE**:

- What is the maximum number of cyclists that passed on a single day, counting both directions combined?
```{code-cell} ipython3
:clear_cell: true

df_both = df.sum(axis=1)
```

```{code-cell} ipython3
:clear_cell: true

df_daily = df_both.resample('D').sum()
```

```{code-cell} ipython3
:clear_cell: true

df_daily.max()
```

```{code-cell} ipython3
df_daily.nlargest(10)
```

2013-06-05 was the first time more than 10,000 bikers passed on one day. Apparently, this was not just a coincidence... http://www.nieuwsblad.be/cnt/dmf20130605_022

+++

### Trends as a function of time

+++
**EXERCISE**:

- What does the long-term trend look like? Calculate monthly sums and plot the result.
+ +```{code-cell} ipython3 +:clear_cell: true + +df_monthly = df.resample('M').sum() +df_monthly.plot() +``` + +
**EXERCISE**:

- Let's have a look at some short-term patterns. For the data of the first 3 weeks of January 2014, calculate the hourly counts and visualize them.
+ +```{code-cell} ipython3 +:clear_cell: true + +df_hourly = df.resample('H').sum() +``` + +```{code-cell} ipython3 +:clear_cell: true + +df_hourly.head() +``` + +```{code-cell} ipython3 +:clear_cell: true + +df_hourly['2014-01-01':'2014-01-20'].plot() +``` + +**New Year's Eve 2013-2014** + ++++ + +
**EXERCISE**:

- Select a subset of the data set from 2013-12-31 12:00:00 until 2014-01-01 12:00:00, store it as the variable `newyear` and plot this subset
- Use a rolling function (check the documentation of the function!) to smooth the data of this period and make a plot of the smoothed version
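Before looking at the solution, a toy illustration of `rolling` on a made-up series (independent of the bike data):

```python
import pandas as pd

s = pd.Series([0, 10, 20, 30, 40])

# window of 3, centered: each value becomes the mean of itself and its two neighbours;
# the edges have incomplete windows and therefore become NaN
smoothed = s.rolling(3, center=True).mean()
print(smoothed.tolist())  # [nan, 10.0, 20.0, 30.0, nan]
```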
+ +```{code-cell} ipython3 +:clear_cell: true + +newyear = df["2013-12-31 12:00:00": "2014-01-01 12:00:00"] +``` + +```{code-cell} ipython3 +:clear_cell: true + +newyear.plot() +``` + +```{code-cell} ipython3 +:clear_cell: true + +newyear.rolling(10, center=True).mean().plot(linewidth=2) +``` + +A more advanced usage of matplotlib to create a combined plot: + +```{code-cell} ipython3 +:clear_cell: true + +# A more in-detail plotting version of the graph. +fig, ax = plt.subplots() +newyear.plot(ax=ax, color=['LightGreen', 'LightBlue'], legend=False, rot=0) +newyear.rolling(10, center=True).mean().plot(linewidth=2, ax=ax, color=['DarkGreen', 'DarkBlue'], rot=0) + +ax.set_xlabel('') +ax.set_ylabel('Cyclists count') +``` + +--- + +## The power of `groupby`... + +Looking at the data in the above exercises, there seems to be clearly a: + +- daily pattern +- weekly pattern +- yearly pattern + +Such patterns can easily be calculated and visualized in pandas using the DatetimeIndex attributes `weekday` combined with `groupby` functionality. Below a taste of the possibilities, and we will learn about this in the proceeding notebooks: + ++++ + +**Weekly pattern**: + +```{code-cell} ipython3 +df_daily = df.resample('D').sum() +``` + +```{code-cell} ipython3 +df_daily.groupby(df_daily.index.weekday).mean().plot(kind='bar') +``` + +**Daily pattern:** + +```{code-cell} ipython3 +df_hourly.groupby(df_hourly.index.hour).mean().plot() +``` + +So the daily pattern is clearly different for both directions. In the morning more people go north, in the evening more people go south. The morning peak is also more condensed. + ++++ + +**Monthly pattern** + +```{code-cell} ipython3 +df_monthly = df.resample('M').sum() +``` + +```{code-cell} ipython3 +from calendar import month_abbr +``` + +```{code-cell} ipython3 +ax = df_monthly.groupby(df_monthly.index.month).mean().plot() +ax.set_ylim(0) +xlabels = ax.set_xticklabels(list(month_abbr)[0::2]) #too lazy to write the month values yourself... 
+``` + +## Acknowledgements +Thanks to the [city of Ghent](https://data.stad.gent/) for opening their data diff --git a/_solved/case2_biodiversity_analysis.md b/_solved/case2_biodiversity_analysis.md new file mode 100644 index 0000000..95a7ceb --- /dev/null +++ b/_solved/case2_biodiversity_analysis.md @@ -0,0 +1,653 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

# CASE - Biodiversity data - analysis

+ + +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + +```{code-cell} ipython3 +%matplotlib inline + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +import seaborn as sns + +plt.style.use('seaborn-whitegrid') +``` + +## Reading in the enriched survey data set + ++++ + +
**EXERCISE**:

- Read in the 'survey_data_completed.csv' file and save the resulting DataFrame as the variable `survey_data_processed` (if you did not complete the previous notebook, a version of the csv file is available in the `../data` folder).
- Interpret the 'eventDate' column directly as a Python datetime object and make sure the 'occurrenceID' column is used as the index of the resulting DataFrame (both can be done at once when reading the csv file, using parameters of the `read_csv` function)
- Inspect the resulting frame (remember `.head()` and `.info()`) and check that the 'eventDate' column indeed has a datetime data type.
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data_processed = pd.read_csv("../data/survey_data_completed.csv", + parse_dates=['eventDate'], index_col="occurrenceID") +``` + +```{code-cell} ipython3 +:clear_cell: true + +survey_data_processed.head() +``` + +```{code-cell} ipython3 +:clear_cell: true + +survey_data_processed.info() +``` + +## Tackle missing values (NaN) and duplicate values + ++++ + +
+ EXERCISE: How many records are in the data set without information on the 'species' name? +
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data_processed['species'].isnull().sum() +``` + +
+ EXERCISE: How many duplicate records are present in the dataset? + +_Tip_: Pandas has a function to find `duplicated` values... +
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data_processed.duplicated().sum() +``` + +
+ EXERCISE: Extract a list of all duplicates, sort on the columns `eventDate` and `verbatimLocality` and show the first 10 records + +_Tip_: Check documentation of `duplicated` +
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data_processed[survey_data_processed.duplicated(keep=False)].sort_values(["eventDate", "verbatimLocality"]).head(10) +``` + +
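A toy illustration of the `keep` argument of `duplicated` (made-up values, not the survey data):

```python
import pandas as pd

s = pd.Series([1, 1, 2])

print(s.duplicated().tolist())            # [False, True, False]: first occurrence is not flagged
print(s.duplicated(keep=False).tolist())  # [True, True, False]: all copies are flagged
```

`keep=False` is what lets the solution above list *every* record involved in a duplication, not only the repeats.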
+ +EXERCISE: Exclude the duplicate values from the survey data set and save the result as survey_data_unique + +__Tip__: Next to finding `duplicated` values, Pandas has a function to `drop duplicates`... + +
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data_unique = survey_data_processed.drop_duplicates() +``` + +```{code-cell} ipython3 +len(survey_data_unique) +``` + +
**EXERCISE**: For how many records (rows) do we have all the information available (i.e. no NaN values in any of the columns)?

__Tip__: Just counting the NaN (null) values won't work; maybe `dropna` can help you?
+ +```{code-cell} ipython3 +:clear_cell: true + +len(survey_data_unique.dropna()) +``` + +
+ +EXERCISE: Select the subset of records without a species name, while having information on the sex and store the result as variable not_identified + +__Tip__: next to `isnull`, also `notnull` exists... + +
+ +```{code-cell} ipython3 +:clear_cell: true + +mask = survey_data_unique['species'].isnull() & survey_data_unique['sex'].notnull() +not_identified = survey_data_unique[mask] +``` + +```{code-cell} ipython3 +not_identified.head() +``` + +
+ EXERCISE: Select only those records that do have species information and save them as the variable survey_data. Make sure survey_data is a copy of the original DataFrame. This is the DataFrame we will use in the further analyses. +
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data = survey_data_unique.dropna(subset=['species']).copy() +``` + +
**NOTE**: For biodiversity studies, absence values (knowing that something is not present) are useful as well to normalize the observations, but this is out of scope for these exercises.
+ ++++ + +## Observations over time + ++++ + +
**EXERCISE**: Make a plot visualizing the evolution of the number of observations for each of the individual years (i.e. annual counts).

__Tip__: The `pandas_04_time_series_data.ipynb` notebook introduced a powerful method to resample a time series.
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data.resample('A', on='eventDate').size().plot() +``` + +To evaluate the intensity or number of occurrences during different time spans, a heatmap is an interesting representation. We can actually use the plotnine library as well to make heatmaps, as it provides the [`geom_tile`](http://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_tile.html) geometry. Loading the library: + +```{code-cell} ipython3 +import plotnine as pn +``` + +
**EXERCISE**: Create a table, called `heatmap_prep_plotnine`, based on the `survey_data` DataFrame with a column for the years, a column for the months and a column with the counts (called `count`).

__Tip__: You have to count for each year/month combination. Also `reset_index` could be useful.
+ +```{code-cell} ipython3 +:clear_cell: true + +heatmap_prep_plotnine = survey_data.groupby([survey_data['eventDate'].dt.year, + survey_data['eventDate'].dt.month]).size() +heatmap_prep_plotnine.index.names = ["year", "month"] +heatmap_prep_plotnine = heatmap_prep_plotnine.reset_index(name='count') +``` + +```{code-cell} ipython3 +:clear_cell: true + +heatmap_prep_plotnine.head() +``` + +
+ +EXERCISE: Based on heatmap_prep_plotnine, make a heatmap using the plotnine package. + + +__Tip__: When in trouble, check [this section of the documentation](http://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_tile.html#Annotated-Heatmap) + +
+ +```{code-cell} ipython3 +:clear_cell: true + +(pn.ggplot(heatmap_prep_plotnine, pn.aes(x="factor(month)", y="year", fill="count")) + + pn.geom_tile() + + pn.scale_fill_cmap("Reds") + + pn.scale_y_reverse() + + pn.theme( + axis_ticks=pn.element_blank(), + panel_background=pn.element_rect(fill='white')) +) +``` + +Remark that we started from a `tidy` data format (also called *long* format). + +The heatmap functionality is also provided by the plotting library [seaborn](http://seaborn.pydata.org/generated/seaborn.heatmap.html) (check the docs!). Based on the documentation, seaborn uses the *short* format with in the row index the years, in the column the months and the counts for each of these year/month combinations as values. + +Let's reformat the `heatmap_prep_plotnine` data to be useable for the seaborn heatmap function: + ++++ + +
+ +EXERCISE: Create a table, called heatmap_prep_sns, based on the heatmap_prep_plotnine DataFrame with in the row index the years, in the column the months and as values of the table, the counts for each of these year/month combinations. + +__Tip__: The `pandas_07_reshaping_data.ipynb` notebook provides all you need to know + +
+ +```{code-cell} ipython3 +:clear_cell: true + +heatmap_prep_sns = heatmap_prep_plotnine.pivot_table(index='year', columns='month', values='count') +``` + +
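To see this reshaping in isolation, here is a toy round trip between the long and the wide format (hypothetical counts, not the survey data):

```python
import pandas as pd

# hypothetical long-format year/month counts
long_df = pd.DataFrame({"year": [2000, 2000, 2001],
                        "month": [1, 2, 1],
                        "count": [5, 3, 7]})

# long -> wide: years as rows, months as columns
wide = long_df.pivot_table(index="year", columns="month", values="count")

# wide -> long again (a NaN appears for the missing 2001/2 combination)
back = wide.reset_index().melt(id_vars=["year"], value_name="count")
print(wide.shape, len(back))  # (2, 2) 4
```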
+ EXERCISE: Using the seaborn documentation make a heatmap starting from the heatmap_prep_sns variable. +
+ +```{code-cell} ipython3 +:clear_cell: true + +fig, ax = plt.subplots(figsize=(10, 8)) +ax = sns.heatmap(heatmap_prep_sns, cmap='Reds') +``` + +
+ +EXERCISE: Based on the heatmap_prep_sns DataFrame, return to the long format of the table with the columns `year`, `month` and `count` and call the resulting variable heatmap_tidy. + +__Tip__: The `pandas_07_reshaping_data.ipynb` notebook provides all you need to know, but a `reset_index` could be useful as well + +
+ +```{code-cell} ipython3 +:clear_cell: true + +heatmap_tidy = heatmap_prep_sns.reset_index().melt(id_vars=["year"], value_name="count") +heatmap_tidy.head() +``` + +## Species abundance for each of the plots + ++++ + +The name of the observed species consists of two parts: the 'genus' and 'species' columns. For the further analyses, we want the combined name. This is already available as the 'name' column if you completed the previous notebook, otherwise you can add this again in the following exercise. + ++++ + +
**EXERCISE**: Make a new column 'name' that combines the 'genus' and 'species' columns (with a space in between).

__Tip__: Remember that you can sum strings in Python: 'a' + 'b' = 'ab'.
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data['name'] = survey_data['genus'] + ' ' + survey_data['species'] +``` + +
**EXERCISE**: Which 8 species have been observed most often?

__Tip__: Pandas provides a function to combine sorting and showing the first n records, see [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.nlargest.html)...
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data.groupby("name").size().nlargest(8) +``` + +```{code-cell} ipython3 +:clear_cell: true + +survey_data['name'].value_counts()[:8] +``` + +
+ EXERCISE: How many records are available of each of the species in each of the plots (called `verbatimLocality`)? How would you visualize this information with seaborn? +
+ +```{code-cell} ipython3 +:clear_cell: true + +species_per_plot = survey_data.reset_index().pivot_table(index="name", + columns="verbatimLocality", + values="occurrenceID", + aggfunc='count') + +# alternative ways to calculate this +#species_per_plot = survey_data.groupby(['name', 'plot_id']).size().unstack(level=-1) +#species_per_plot = pd.crosstab(survey_data['name'], survey_data['plot_id']) +``` + +```{code-cell} ipython3 +:clear_cell: true + +fig, ax = plt.subplots(figsize=(8,8)) +sns.heatmap(species_per_plot, ax=ax, cmap='Reds') +``` + +
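The `pd.crosstab` alternative mentioned in the commented-out code can be sketched on made-up observations (hypothetical data, not the survey set):

```python
import pandas as pd

obs = pd.DataFrame({"species": ["a", "a", "b"],
                    "plot": [1, 2, 1]})

# count every species/plot combination in one call
table = pd.crosstab(obs["species"], obs["plot"])
print(table.values.tolist())  # [[1, 1], [1, 0]]
```

Note that `crosstab` fills absent combinations with 0, while the `pivot_table` approach with `aggfunc='count'` leaves them as NaN.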
**EXERCISE**: What is the number of different species in each of the plots? Using the Pandas `plot` function, make a bar chart showing the species diversity of each plot; define a matplotlib figure and ax first to make the plot. Change the y-label to 'plot number'.

__Tip__: next to `unique`, Pandas also provides a function `nunique`...
+ +```{code-cell} ipython3 +:clear_cell: true + +n_species_per_plot = survey_data.groupby(["verbatimLocality"])["name"].nunique() + +fig, ax = plt.subplots(figsize=(6, 6)) +n_species_per_plot.plot(kind="barh", ax=ax, color="lightblue") +ax.set_ylabel("plot number") + +# Alternative option: +# inspired on the pivot table we already had: +# species_per_plot = survey_data.reset_index().pivot_table( +# index="name", columns="verbatimLocality", values="occurrenceID", aggfunc='count') +# n_species_per_plot = species_per_plot.count() +``` + +
**EXERCISE**: In how many plots has each species been observed? Make a horizontal bar chart using the Pandas `plot` function, showing for each species the number of plots it occurs in, with the species sorted by that number.
+ +```{code-cell} ipython3 +:clear_cell: true + +n_plots_per_species = survey_data.groupby(["name"])["verbatimLocality"].nunique().sort_values() + +fig, ax = plt.subplots(figsize=(8, 8)) +n_plots_per_species.plot(kind="barh", ax=ax, color='0.4') + +# Alternatives +# species_per_plot2 = survey_data.reset_index().pivot_table(index="verbatimLocality", +# columns="name", +# values="occurrenceID", +# aggfunc='count') +# nplots_per_species = species_per_plot2.count().sort_values(ascending=False) +# or +# species_per_plot.count(axis=1).sort_values(ascending=False).plot(kind='bar') +``` + +
**EXERCISE**: First, exclude the NaN values from the `sex` column and save the result as a new variable called `subselection_sex`. Based on this variable `subselection_sex`, calculate the number of males and females present in each of the plots. Save the result (with the `verbatimLocality` as index and `sex` as column names) as a variable `n_plot_sex`.

__Tip__: Release the power of `unstack`...
+ +```{code-cell} ipython3 +:clear_cell: true + +subselection_sex = survey_data.dropna(subset=["sex"]) +#subselection_sex = survey_data[survey_data["sex"].notnull()] +``` + +```{code-cell} ipython3 +:clear_cell: true + +n_plot_sex = subselection_sex.groupby(["sex", "verbatimLocality"]).size().unstack(level=0) +n_plot_sex.head() +``` + +As such, we can use the variable `n_plot_sex` to plot the result: + +```{code-cell} ipython3 +:clear_cell: false + +n_plot_sex.plot(kind='bar', figsize=(12, 6), rot=0) +``` + +
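What `unstack` does to a two-level grouped result can be sketched with made-up data (not the survey set):

```python
import pandas as pd

obs = pd.DataFrame({"sex": ["M", "F", "M"],
                    "plot": [1, 1, 2]})

counts = obs.groupby(["sex", "plot"]).size()  # Series with a two-level (MultiIndex) index
wide = counts.unstack(level=0)                # move the first level ('sex') to the columns
print(wide.columns.tolist())  # ['F', 'M']
```

Combinations that never occur (here: a female in plot 2) become NaN in the unstacked table.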
+ +EXERCISE: Create the previous plot with the plotnine library, directly from the variable subselection_sex. + +__Tip__: When in trouble, check these [docs](http://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_col.html#Two-Variable-Bar-Plot). + +
+ +```{code-cell} ipython3 +:clear_cell: true + +(pn.ggplot(subselection_sex, pn.aes(x="verbatimLocality", fill="sex")) + + pn.geom_bar(position='dodge') + + pn.scale_x_discrete(breaks=np.arange(1, 25, 1), limits=np.arange(1, 25, 1)) +) +``` + +## Select subsets according to taxa of species + +```{code-cell} ipython3 +survey_data["taxa"].unique() +``` + +```{code-cell} ipython3 +survey_data['taxa'].value_counts() +#survey_data.groupby('taxa').size() +``` + +
+ +EXERCISE: Select the records for which the `taxa` is equal to 'Rabbit', 'Bird' or 'Reptile'. Call the resulting variable `non_rodent_species`. + +__Tip__: You do not have to combine three different conditions, as Pandas has a function to check if something is in a certain list of values + +
+ +```{code-cell} ipython3 +:clear_cell: true + +non_rodent_species = survey_data[survey_data['taxa'].isin(['Rabbit', 'Bird', 'Reptile'])] +``` + +```{code-cell} ipython3 +len(non_rodent_species) +``` + +
**EXERCISE**: Select the records for which the `taxa` starts with 'ro' (make sure it does not matter if a capital character is used in the 'taxa' name). Call the resulting variable `r_species`.

__Tip__: Remember the `.str.` construction that provides all kinds of string functionality?
+ +```{code-cell} ipython3 +:clear_cell: true + +r_species = survey_data[survey_data['taxa'].str.lower().str.startswith('ro')] +``` + +```{code-cell} ipython3 +len(r_species) +``` + +
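A toy sketch of chaining `.str` methods for a case-insensitive prefix test (made-up values):

```python
import pandas as pd

taxa = pd.Series(["Rodent", "Rabbit", "rodent"])

# lower-case first, then test the prefix, all element-wise
mask = taxa.str.lower().str.startswith("ro")
print(mask.tolist())  # [True, False, True]
```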
+ EXERCISE: Select the records that are not Birds. Call the resulting variable non_bird_species. +
```{code-cell} ipython3
:clear_cell: true

non_bird_species = survey_data[survey_data['taxa'] != 'Bird']
```

```{code-cell} ipython3
len(non_bird_species)
```

## (OPTIONAL SECTION) Evolution of species during monitoring period

+++

*In this section, all plots can be made with the embedded Pandas plot function, unless specifically asked otherwise*

+++
**EXERCISE**: Using the Pandas `plot` function, plot the yearly number of records for `Dipodomys merriami` over time.
+ +```{code-cell} ipython3 +:clear_cell: true + +merriami = survey_data[survey_data["name"] == "Dipodomys merriami"] +``` + +```{code-cell} ipython3 +:clear_cell: true + +fig, ax = plt.subplots() +merriami.groupby(merriami['eventDate'].dt.year).size().plot(ax=ax) +ax.set_xlabel("") +ax.set_ylabel("number of occurrences") +``` + +
**NOTE**: Check the difference between the following two graphs. What is different? Which one would you use?
+ +```{code-cell} ipython3 +merriami = survey_data[survey_data["species"] == "merriami"] +fig, ax = plt.subplots(2, 1, figsize=(14, 8)) +merriami.groupby(merriami['eventDate']).size().plot(ax=ax[0], style="-") # top graph +merriami.resample("D", on="eventDate").size().plot(ax=ax[1], style="-") # lower graph +``` + +
+ +EXERCISE: Plot, for the species 'Dipodomys merriami', 'Dipodomys ordii', 'Reithrodontomys megalotis' and 'Chaetodipus baileyi', the monthly number of records as a function of time for the whole monitoring period. Plot each of the individual species in a separate subplot and provide them all with the same y-axis scale + +__Tip__: have a look at the documentation of the pandas plot function. + +
+ +```{code-cell} ipython3 +:clear_cell: true + +subsetspecies = survey_data[survey_data["name"].isin(['Dipodomys merriami', 'Dipodomys ordii', + 'Reithrodontomys megalotis', 'Chaetodipus baileyi'])] +``` + +```{code-cell} ipython3 +:clear_cell: true + +month_evolution = subsetspecies.groupby("name").resample('M', on='eventDate').size() +``` + +```{code-cell} ipython3 +:clear_cell: true + +species_evolution = month_evolution.unstack(level=0) +axs = species_evolution.plot(subplots=True, figsize=(14, 8), sharey=True) +``` + +
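The `unstack(level=0)` step above pivots the outer index level (the species `name`) into columns, so that each species becomes its own column. A minimal sketch on a toy two-level Series (toy labels, for illustration only):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["A", "B"], [2000, 2001]], names=["name", "year"])
counts = pd.Series([1, 2, 3, 4], index=idx)

# move the 'name' level from the rows to the columns
table = counts.unstack(level=0)
```

`table` now has the years as rows and the names "A" and "B" as columns, which is exactly the shape that `plot(subplots=True)` expects.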
+ EXERCISE: Reproduce the previous plot using the plotnine package. +
+ +```{code-cell} ipython3 +:clear_cell: true + +subsetspecies = survey_data[survey_data["name"].isin(['Dipodomys merriami', 'Dipodomys ordii', + 'Reithrodontomys megalotis', 'Chaetodipus baileyi'])] +month_evolution = subsetspecies.groupby("name").resample('M', on='eventDate').size() +``` + +```{code-cell} ipython3 +:clear_cell: true + +(pn.ggplot(month_evolution.reset_index(name='count'), + pn.aes(x='eventDate', y='count', color='name')) + + pn.geom_line() + + pn.facet_wrap('name', nrow=4) + + pn.theme_light() +) +``` + +
+ EXERCISE: Evaluate the yearly amount of occurrences for each of the 'taxa' as a function of time. +
+ +```{code-cell} ipython3 +:clear_cell: true + +year_evolution = survey_data.groupby("taxa").resample('A', on='eventDate').size() +species_evolution = year_evolution.unstack(level=0) +axs = species_evolution.plot(subplots=True, figsize=(16, 8), sharey=False) +``` + +
+ EXERCISE: Calculate the number of occurrences for each weekday, grouped by each year of the monitoring campaign, without using the `pivot` functionality. Call the variable count_weekday_years +
+
+```{code-cell} ipython3
+:clear_cell: true
+
+count_weekday_years = survey_data.groupby([survey_data["eventDate"].dt.year, survey_data["eventDate"].dt.dayofweek]).size().unstack()
+```
+
+```{code-cell} ipython3
+:clear_cell: true
+
+# Alternative
+#years = survey_data["eventDate"].dt.year.rename('year')
+#dayofweeks = survey_data["eventDate"].dt.dayofweek.rename('dayofweek')
+#count_weekday_years = pd.crosstab(index=years, columns=dayofweeks)
+```
+
+```{code-cell} ipython3
+count_weekday_years.head()
+```
+
+```{code-cell} ipython3
+count_weekday_years.plot()
+```
+
+
+EXERCISE: Based on the variable `count_weekday_years`, calculate for each weekday the median number of records over the yearly count values. Modify the labels of the plot to indicate the actual days of the week (instead of numbers)
+
+ +```{code-cell} ipython3 +:clear_cell: true + +fig, ax = plt.subplots() +count_weekday_years.median(axis=0).plot(kind='barh', ax=ax, color='#66b266') +xticks = ax.set_yticklabels(['Monday', 'Tuesday', 'Wednesday', "Thursday", "Friday", "Saturday", "Sunday"]) +``` + +Nice work! diff --git a/_solved/case2_biodiversity_processing.md b/_solved/case2_biodiversity_processing.md new file mode 100644 index 0000000..5a2ac1e --- /dev/null +++ b/_solved/case2_biodiversity_processing.md @@ -0,0 +1,1090 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

CASE - Biodiversity data - data cleaning and enrichment

+ + +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + +```{code-cell} ipython3 +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt + +plt.style.use('seaborn-whitegrid') +``` + +**Scenario**:
You are interested in occurrence data for a number of species in Flanders. Unfortunately, the sources for this type of data are still scattered among different institutes. After a mailing campaign, you receive a number of files from different institutes, in various data formats and styles...
+
+You decide to be brave and script the interpretation and transformation, in order to provide reproducibility of your work. Moreover, similar analyses will be needed in the future with new data requests. You *hope* that future data requests will result in similar data formats from the individual partners. So, having a script will enhance the efficiency at that moment.
+
++++
+
+Besides the technical differences in the data formats (`csv`, `excel`, `shapefile`, `sqlite`...), there are also differences in the naming of the content. For example, the coordinates can be named `x`/`y`, `decimalLatitude`/`decimalLongitude`, `lat`/`long`... Luckily, you know of an international **open data standard** to describe occurrence data, i.e. [Darwin Core (DwC)](http://rs.tdwg.org/dwc/terms/#sex). Instead of inventing your own data model, you decide to comply with this international standard. The latter will enhance communication and will also make your data compatible with other data services working with this kind of data.
+
++++
+
+In short, the DwC describes a flat table (cf. CSV) with an agreed naming convention for the headers and conventions on how certain data types need to be represented. While the standard definitions themselves are out of scope, an in-depth description is given [here](https://www.tdwg.org/standards/dwc/). 
For this tutorial, we will focus on a few of the existing terms to learn some elements about data cleaning:
+* `eventDate`: ISO 8601 format of dates
+* `scientificName`: the accepted scientific name of the species
+* `decimalLatitude`/`decimalLongitude`: coordinates of the occurrence in WGS84 format
+* `sex`: either `male` or `female` to characterise the sex of the occurrence
+* `occurrenceID`: an identifier within the dataset to identify the individual records
+* `datasetName`: a static string defining the source of the data
+
+Furthermore, additional information concerning the taxonomy will be added using an external API service.
+
++++
+
+**Dataset to work on:**
+
++++
+
+For this dataset, the data is provided in the following main data files:
+* `surveys.csv` the data with the surveys observed in the individual plots
+* `species.csv` the overview list of the species shortnames
+* `plot_location.xlsx` the overview of coordinates of the individual locations
+
+The data originates from a [study](http://esapubs.org/archive/ecol/E090/118/metadata.htm) of a Chihuahuan desert ecosystem near Portal, Arizona.
+
+![](../img/plot_overview.png)
+
++++
+
+## Survey-data
+
++++
+
+Reading in the data of the individual surveys:
+
+```{code-cell} ipython3
+survey_data = pd.read_csv("../data/surveys.csv")
+```
+
+```{code-cell} ipython3
+survey_data.head()
+```
+
+ +EXERCISE: How many individual records (occurrences) does the survey data set contain? + +
+ +```{code-cell} ipython3 +:clear_cell: true + +len(survey_data) +``` + +### Adding the data source information as static column + ++++ + +For convenience when this dataset will be combined with other datasets, we first add a column of static values, defining the `datasetName` of this particular data: + +```{code-cell} ipython3 +datasetname = "Ecological Archives E090-118-D1." +``` + +Adding this static value as a new column `datasetName`: + ++++ + +
+ +EXERCISE: Add a new column, 'datasetName', to the survey data set with datasetname as value for all of the records (static value for the entire data set) + +
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data["datasetName"] = datasetname +``` + +### Cleaning the sex_char column into a DwC called [sex](http://rs.tdwg.org/dwc/terms/#sex) column + ++++ + +
+
+EXERCISE: Get a list of the unique values for the column sex_char.
+
+__Tip__: to find the unique values, look for a function called `unique`...
+
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data["sex_char"].unique().tolist() +``` + +So, apparently, more information is provided in this column, whereas according to the [metadata](http://esapubs.org/archive/ecol/E090/118/Portal_rodent_metadata.htm) information, the sex information should be either `M` (male) or `F` (female). We will create a column, named `sex` and convert the symbols to the corresponding sex, taking into account the following mapping of the values (see [metadata](http://esapubs.org/archive/ecol/E090/118/Portal_rodent_metadata.htm) for more details): +* `M` -> `male` +* `F` -> `female` +* `R` -> `male` +* `P` -> `female` +* `Z` -> nan + +At the same time, we will save the original information of the `sex_char` in a separate column, called `verbatimSex`, as a reference. + ++++ + +In summary, we have to: +* create a new column `verbatimSex`, which is a copy of the current `sex_char` column +* create a new column with the name `sex` +* map the original values of the `sex_char` to the values `male` and `female` according to the listing above + ++++ + +Converting the name of the column header `sex_char` to `verbatimSex` with the `rename` function: + +```{code-cell} ipython3 +survey_data = survey_data.rename(columns={'sex_char': 'verbatimSex'}) +``` + +
+
+EXERCISE: Express the mapping of the values (e.g. M -> male) as a dictionary object called sex_dict
+
+__Tip__: (1) a NaN-value can be defined as `np.nan`, (2) a dictionary is a Python data structure, https://docs.python.org/3/tutorial/datastructures.html#dictionaries
+
+ +```{code-cell} ipython3 +:clear_cell: true + +sex_dict = {"M": "male", + "F": "female", + "R": "male", + "P": "female", + "Z": np.nan} +``` + +
+
+EXERCISE: Use the `sex_dict` mapping dictionary to convert the values in the `verbatimSex` column to the new values, and save the mapped values in a new column 'sex'.
+
+__Tip__: to replace values using a mapping dictionary, look for a function called `replace`...
+
+
+```{code-cell} ipython3
+:clear_cell: true
+
+survey_data['sex'] = survey_data['verbatimSex'].replace(sex_dict)
+```
+
+Checking the current frequency of values (this should result in the values `male`, `female` and `nan`):
+
+```{code-cell} ipython3
+survey_data["sex"].unique()
+```
+
+To check the frequency of occurrences for the male/female categories, a bar chart is a possible representation:
+
++++
+
+ +EXERCISE: Make a horizontal bar chart comparing the number of male, female and unknown (NaN) records in the dataset + +__Tip__: check in the help of the Pandas plot function for the `kind` parameter + +
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data["sex"].value_counts(dropna=False).plot(kind="barh", color="#00007f") +``` + +
+
+NOTE: The usage of `groupby` combined with a `count` of each group would be an option as well. However, the latter does not count the `NaN` values. The `value_counts` method does include them when the `dropna=False` argument is passed.
+
+
++++
+
+### Solving double entry field by decoupling
+
++++
+
+When checking the unique values of the species information:
+
+```{code-cell} ipython3
+survey_data["species"].unique()
+```
+
+```{code-cell} ipython3
+survey_data.head(10)
+```
+
+There apparently exists a double entry: `'DM and SH'`, which basically defines two records and should be decoupled into two individual records (i.e. rows). Hence, we should be able to create an additional row based on this split. To do so, Pandas provides a dedicated function since version 0.25, called `explode`. Starting from a small subset example:
+
+```{code-cell} ipython3
+example = survey_data.loc[7:10, "species"]
+example
+```
+
+Using the `split` method on strings, we can split the string using a given character, in this case the word `and`:
+
+```{code-cell} ipython3
+example.str.split("and")
+```
+
+The `explode` method will create a row for each element in the list:
+
+```{code-cell} ipython3
+example_split = example.str.split("and").explode()
+example_split
+```
+
+Hence, the `DM` and `SH` are now listed in separate rows. Other rows remain unchanged. 
The only remaining issue is the spaces around the characters:
+
+```{code-cell} ipython3
+example_split.iloc[1], example_split.iloc[2]
+```
+
+We can solve this using the string method `strip`, which removes the spaces before and after the characters:
+
+```{code-cell} ipython3
+example_split.str.strip()
+```
+
+To make this reusable, let's create a dedicated function to combine these steps, called `solve_double_field_entry`:
+
+```{code-cell} ipython3
+def solve_double_field_entry(df, keyword="and", column="verbatimEventDate"):
+    """split on keyword in column for an enumeration and create extra record
+
+    Parameters
+    ----------
+    df: pd.DataFrame
+        DataFrame with a double field entry in one or more values
+    keyword: str
+        word/character to split the double records on
+    column: str
+        column name to use for the decoupling of the records
+    """
+    df[column] = df[column].str.split(keyword)
+    df = df.explode(column)
+    df[column] = df[column].str.strip()  # remove white space around the words
+    return df
+```
+
+The function takes a DataFrame as input, splits the double record into separate rows and returns the updated DataFrame. We can use this function to get an update of the DataFrame, with an additional row (occurrence) added by decoupling the specific field:
+
++++
+
+
+EXERCISE: Use the function solve_double_field_entry to create a DataFrame with an additional row, by decoupling the double entries. Save the result as a variable survey_data_decoupled.
+
+
+```{code-cell} ipython3
+:clear_cell: true
+
+survey_data_decoupled = solve_double_field_entry(survey_data.copy(),
+                                                 "and",
+                                                 column="species") # get help of the function by SHIFT + TAB
+# REMARK: the copy() statement here (!) see pandas_03b_indexing.ipynb notebook `chained indexing section`
+```
+
+```{code-cell} ipython3
+survey_data_decoupled["species"].unique()
+```
+
+```{code-cell} ipython3
+survey_data_decoupled.head(11)
+```
+
+### Create new occurrence identifier
+
++++
+
+The `record_id` is no longer a unique identifier after the decoupling of this dataset. We will make a new dataset-specific identifier, by adding a column called `occurrenceID` that takes a new counter as identifier. As a simple and straightforward approach, we will use a new counter for the whole dataset, starting with 1:
+
+```{code-cell} ipython3
+np.arange(1, len(survey_data_decoupled) + 1, 1)
+```
+
+Create a new column with header occurrenceID with the values 1 -> 35550 as field values:
+
+```{code-cell} ipython3
+:clear_cell: false
+
+survey_data_decoupled["occurrenceID"] = np.arange(1, len(survey_data_decoupled) + 1, 1)
+```
+
+ + Remark: A reset of the index to generate this column with `reset_index(drop=False)` would be technically perfectly valid. Still, we want the indices to start at 1 instead of 0 (and Python starts counting at 0!) + +
+
++++
+
+To overcome the confusion of having both a `record_id` and an `occurrenceID` field, we will remove the `record_id` term:
+
+```{code-cell} ipython3
+survey_data_decoupled = survey_data_decoupled.drop(columns="record_id")
+```
+
+Hence, columns can be `drop`-ped out of a DataFrame.
+
+```{code-cell} ipython3
+survey_data_decoupled.head(10)
+```
+
+### Converting the date values
+
++++
+
+In the survey dataset we received, the `month`, `day`, and `year` columns contain the information about the date, i.e. `eventDate` in DarwinCore terms. We want this data in the ISO format `YYYY-MM-DD`. A convenient Pandas function is `to_datetime`, which provides multiple options to interpret dates. One of these options is the automatic interpretation of some 'typical' columns, like `year`, `month` and `day`, when passing a DataFrame.
+
+```{code-cell} ipython3
+# pd.to_datetime(survey_data_decoupled[["year", "month", "day"]]) # uncomment the line and test this statement
+```
+
+This does not work: not all dates can be interpreted... We should get some more information on the reason for the errors. By using the option `errors='coerce'`, the problematic records will be labeled as a missing value `NaT`. We can count the number of dates that can not be interpreted:
+
+```{code-cell} ipython3
+sum(pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull())
+```
+
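To see exactly what `errors='coerce'` does, here is a minimal sketch on two toy rows (the second one contains the impossible date 31 April):

```python
import pandas as pd

toy = pd.DataFrame({"year": [2000, 2000], "month": [4, 4], "day": [15, 31]})

# invalid year/month/day combinations become NaT instead of raising an error
parsed = pd.to_datetime(toy, errors="coerce")
```

The first row parses to a valid date; the second becomes `NaT`, which is what the `isnull()` count above detects.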
+ +EXERCISE: Make a subselection of survey_data_decoupled containing those records that can not correctly be interpreted as date values and save the resulting dataframe as variable trouble_makers + +
+
+```{code-cell} ipython3
+:clear_cell: true
+
+mask = pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull()
+trouble_makers = survey_data_decoupled[mask]
+```
+
+Checking some characteristics of the trouble_makers:
+
+```{code-cell} ipython3
+trouble_makers.head()
+```
+
+```{code-cell} ipython3
+trouble_makers["day"].unique()
+```
+
+```{code-cell} ipython3
+trouble_makers["month"].unique()
+```
+
+```{code-cell} ipython3
+trouble_makers["year"].unique()
+```
+
+So, basically the problem is the presence of day `31` during the months April and September of the year 2000. At this moment, we would have to recheck the original data in order to know how the issue could be solved. Apparently, - for this specific case - there has been a data-entry problem in 2000, meaning the days recorded as `31` during this period should actually be `30`. It would be optimal to correct this in the source dataset, but for the further exercise, it will be corrected here.
+
++++
+
+
+EXERCISE: In the DataFrame survey_data_decoupled, replace the day values of all the troublemakers with the value 30
+
+
+```{code-cell} ipython3
+:clear_cell: true
+
+mask = pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull()
+survey_data_decoupled.loc[mask, "day"] = 30
+```
+
+Now, we do the parsing again to create a proper `eventDate` field, containing the dates:
+
+```{code-cell} ipython3
+survey_data_decoupled["eventDate"] = \
+    pd.to_datetime(survey_data_decoupled[["year", "month", "day"]])
+```
+
+Let's do a quick check of the amount of data for each year:
+
++++
+
+ EXERCISE: Create a horizontal bar chart with the number of records for each year +
+
+```{code-cell} ipython3
+:clear_cell: true
+
+survey_data_decoupled.groupby("year").size().plot(kind='barh', color="#00007f", figsize=(10, 10))
+```
+
+```{code-cell} ipython3
+survey_data_decoupled.head()
+```
+
+Currently, the dates are stored in a Python-specific date format:
+
+```{code-cell} ipython3
+survey_data_decoupled["eventDate"].dtype
+```
+
+This is great, because it allows for many functionalities:
+
+```{code-cell} ipython3
+survey_data_decoupled.eventDate.dt #add a dot and press TAB to explore the date options it provides
+```
+
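As a quick illustration of what the `.dt` accessor provides, a sketch on a toy Series of dates (toy values, for illustration only):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2000-04-15", "2001-12-31"]))

years = dates.dt.year        # the calendar year of each date
months = dates.dt.month      # the month number (1-12)
weekdays = dates.dt.weekday  # Monday=0 ... Sunday=6
```

These attributes return ordinary Series, so they can be passed directly to `groupby` without creating extra columns.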
+
+EXERCISE: Create a horizontal bar chart with the number of records for each year (cf. the previous bar chart), but without using the column `year`
+
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data_decoupled.groupby(survey_data_decoupled["eventDate"].dt.year).size().plot(kind='barh', color="#00007f", figsize=(10, 10)) +``` + +So, we actually do not need the `day`, `month`, `year` columns anymore and have other options available as well + ++++ + +
+ EXERCISE: Create a bar chart with the number of records for each weekday +
+
+```{code-cell} ipython3
+:clear_cell: true
+
+nrecords_by_weekday = survey_data_decoupled.groupby(survey_data_decoupled["eventDate"].dt.weekday).size()
+ax = nrecords_by_weekday.plot(kind="barh", color="#00007f", figsize=(6, 6))
+# If you want to represent the ticklabels as proper names, uncomment the following line:
+#ticklabels = ax.set_yticklabels(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
+```
+
+When saving the information to a file (e.g. CSV-file), this data type will be automatically converted to a string representation. However, we could also decide to explicitly provide the string format in which the dates are stored (losing the date type functionalities), in order to have full control over the way these dates are formatted:
+
+```{code-cell} ipython3
+survey_data_decoupled["eventDate"] = survey_data_decoupled["eventDate"].dt.strftime('%Y-%m-%d')
+```
+
+```{code-cell} ipython3
+survey_data_decoupled["eventDate"].head()
+```
+
+As we do not need the `day`, `month`, `year` columns anymore, we can drop them from the DataFrame:
+
++++
+
+ +EXERCISE: Remove the columns day, month and year from the `survey_data_decoupled` DataFrame: + +__Tip__: Remember the `drop` method? + +
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data_decoupled = survey_data_decoupled.drop(columns=["day", "month", "year"]) +``` + +## Add coordinates from the plot locations + ++++ + +### Loading the coordinate data + ++++ + +The individual plots are only identified by a `plot` identification number. In order to provide sufficient information to external users, additional information about the coordinates should be added. The coordinates of the individual plots are saved in another file: `plot_location.xlsx`. We will use this information to further enrich our dataset and add the Darwin Core Terms `decimalLongitude` and `decimalLatitude`. + ++++ + +
+ +EXERCISE: Read in the excel file 'plot_location.xlsx' and store the data as the variable `plot_data`, with 3 columns: plot, xutm, yutm. + +
+
+```{code-cell} ipython3
+:clear_cell: true
+
+plot_data = pd.read_excel("../data/plot_location.xlsx", skiprows=3, index_col=0)
+```
+
+```{code-cell} ipython3
+plot_data.head()
+```
+
+### Transforming to another coordinate reference system
+
++++
+
+These coordinates are in meters, more specifically in the [UTM 12 N](https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system) coordinate system. However, the agreed coordinate system for Darwin Core is the [World Geodetic System 1984 (WGS84)](http://spatialreference.org/ref/epsg/wgs-84/).
+
+As this is not a GIS course, we will shortcut the discussion about different projection systems, but provide an example of how such a conversion from UTM12N to WGS84 can be performed with the projection toolkit `pyproj`, relying on the existing EPSG codes (a registry originally set up by the association of oil & gas producers).
+
++++
+
+First, we define our two projection systems, using their corresponding EPSG codes:
+
+```{code-cell} ipython3
+import pyproj
+```
+
+```{code-cell} ipython3
+utm12n = pyproj.Proj("+init=EPSG:32612")
+wgs84 = pyproj.Proj("+init=EPSG:4326")
+```
+
+The reprojection can be done by the function `transform` of the projection toolkit, providing the coordinate systems and a set of x, y coordinates. For example, for a single coordinate, this can be applied as follows:
+
+```{code-cell} ipython3
+pyproj.transform(utm12n, wgs84, 681222.131658, 3.535262e+06)
+```
+
+Instead of writing a `for` loop to do this for each of the coordinates in the list, we can apply this function to each of them:
+
++++
+
+
+EXERCISE: Apply the pyproj function transform to plot_data, using the columns xutm and yutm, and save the resulting output in 2 new columns, called decimalLongitude and decimalLatitude:
+
+* Create a function transform_utm_to_wgs that takes a row of a DataFrame and returns a Series of two elements with the longitude and latitude.
+* Test this function on the first row of plot_data.
+* Now apply this function on all rows (remember the axis parameter).
+* Assign the result of the previous step to the decimalLongitude and decimalLatitude columns.
+
+
+```{code-cell} ipython3
+:clear_cell: true
+
+def transform_utm_to_wgs(row):
+    """
+    Converts the x and y coordinates of this row into a Series of the
+    longitude and latitude.
+
+    """
+    utm12n = pyproj.Proj("+init=EPSG:32612")
+    wgs84 = pyproj.Proj("+init=EPSG:4326")
+
+    return pd.Series(pyproj.transform(utm12n, wgs84, row['xutm'], row['yutm']))
+```
+
+```{code-cell} ipython3
+:clear_cell: true
+
+transform_utm_to_wgs(plot_data.loc[0])
+```
+
+```{code-cell} ipython3
+:clear_cell: true
+
+plot_data.apply(transform_utm_to_wgs, axis=1)
+```
+
+```{code-cell} ipython3
+:clear_cell: true
+
+plot_data[["decimalLongitude", "decimalLatitude"]] = plot_data.apply(transform_utm_to_wgs, axis=1)
+```
+
+```{code-cell} ipython3
+plot_data.head()
+```
+
+The above function `transform_utm_to_wgs` you have created is a very specific function that knows the structure of the DataFrame you will apply it to (it assumes the 'xutm' and 'yutm' column names). We could also make a more generic function that just takes an X and Y coordinate and returns the Series of converted coordinates (`transform_utm_to_wgs2(X, Y)`).
+
+To apply such a more generic function to the `plot_data` DataFrame, we can make use of the `lambda` construct, which lets you specify a function on one line as an argument:
+
+    plot_data.apply(lambda row : transform_utm_to_wgs2(row['xutm'], row['yutm']), axis=1)
+
+If you have time, try to implement this function and test it as well.
+
++++
+
+### (intermezzo) Checking the coordinates on a map
+
++++
+
+To check the transformation, let's put these on an interactive map. [Leaflet](http://leafletjs.com/) is a well-known JavaScript library for this, and wrappers exist in many programming languages to simplify its usage. [Folium](https://github.com/python-visualization/folium) is an extensive library providing multiple options. 
As we just want to do a quick checkup of the coordinates, we will rely on the package [mplleaflet](https://github.com/jwass/mplleaflet), which just converts a matplotlib image to a leaflet map: + +```{code-cell} ipython3 +import mplleaflet # https://github.com/jwass/mplleaflet +``` + +```{code-cell} ipython3 +fig, ax = plt.subplots(figsize=(5, 8)) +plt.plot(plot_data['decimalLongitude'], plot_data['decimalLatitude'], 'rs') + +mplleaflet.display(fig=fig) # zoom out to see where the measurement plot is located +``` + +### Join the coordinate information to the survey data set + ++++ + +All points are inside the desert region as we expected, so we can extend our survey dataset with this coordinate information. Making the combination of two data sets based on a common identifier is completely similar to the usage of `JOIN` operations in databases. In Pandas, this functionality is provided by [`pd.merge`](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-DataFrame-joining-merging). + +In practice, we have to add the columns `decimalLongitude`/`decimalLatitude` to the current dataset `survey_data_decoupled`, by using the plot identification number as key to join. + ++++ + +
+ EXERCISE: Extract only the columns to join to our survey dataset: the plot identifiers, decimalLatitude and decimalLongitude into a new variable named plot_data_selection +
+ +```{code-cell} ipython3 +:clear_cell: true + +plot_data_selection = plot_data[["plot", "decimalLongitude", "decimalLatitude"]] +``` + +
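Such a database-style left join with `pd.merge` can be sketched on toy frames first (hypothetical plot numbers and coordinates, only to show the mechanics):

```python
import pandas as pd

observations = pd.DataFrame({"plot": [1, 2, 1], "value": [10, 20, 30]})
locations = pd.DataFrame({"plot": [1, 2], "lon": [-109.1, -109.2]})

# every row of `observations` is kept; the coordinate is looked up by plot number
merged = pd.merge(observations, locations, how="left", on="plot")
```

Note that the left frame keeps all of its rows, and the coordinate of plot 1 is simply repeated for every observation in that plot.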
+ +EXERCISE: Based on the documentation of Pandas merge, add the coordinate information (plot_data_selection) to the survey data set and save the resulting DataFrame as survey_data_plots. + +__Tip__: documentation of [merge](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-DataFrame-joining-merging)... + +
+
+```{code-cell} ipython3
+:clear_cell: true
+
+survey_data_plots = pd.merge(survey_data_decoupled, plot_data_selection,
+                             how="left", on="plot")
+```
+
+```{code-cell} ipython3
+survey_data_plots.head()
+```
+
+The plot locations need to be stored with the variable name `verbatimLocality`, indicating the integer identifier of the plot:
+
+```{code-cell} ipython3
+survey_data_plots = survey_data_plots.rename(columns={'plot': 'verbatimLocality'})
+```
+
+## Add species names to dataset
+
++++
+
+The column `species` only provides a short identifier in the survey overview. The name information is stored in a separate file `species.csv`. As we want our dataset to include this information, we will read in this data and add it to our survey dataset:
+
++++
+
+ +EXERCISE: Read in the 'species.csv' file and save the resulting DataFrame as variable species_data + +__Tip__: check the delimiter (`sep`) to define + +
+ +```{code-cell} ipython3 +:clear_cell: true + +species_data = pd.read_csv("../data/species.csv", sep=";") +``` + +```{code-cell} ipython3 +species_data.head() +``` + +### Fix a wrong acronym naming + ++++ + +When reviewing the metadata, you see that in the data-file the acronym `NE` is used to describe `Neotoma albigula`, whereas in the [metadata description](http://esapubs.org/archive/ecol/E090/118/Portal_rodent_metadata.htm), the acronym `NA` is used. + ++++ + +
+ EXERCISE: Convert the value of 'NE' to 'NA' by using boolean indexing for the `species_id` column +
+
+```{code-cell} ipython3
+:clear_cell: true
+
+species_data.loc[species_data["species_id"] == "NE", "species_id"] = "NA"
+```
+
+(At the same time, you decide to cure this problem at the source and alert the data provider about this issue.)
+
++++
+
+### Merging surveys and species
+
++++
+
+As we have now prepared the two data sets, we can combine the data using the `merge` operation. Take into account that our key column is different for `species_data` and `survey_data_plots`, respectively `species_id` and `species`:
+
++++
+
+We want to add the data of the species to the survey data, in order to see the full species names:
+
++++
+
+ +EXERCISE: Merge the `survey_data_plots` data set with the `species_data` information in order to pass the species information to the survey data: + +
+
+```{code-cell} ipython3
+:clear_cell: true
+
+survey_data_species = pd.merge(survey_data_plots, species_data, how="left", # LEFT OR INNER?
+                               left_on="species", right_on="species_id")
+```
+
+```{code-cell} ipython3
+len(survey_data_species)
+```
+
+The join is ok, but we are left with some redundant columns and wrong naming:
+
+```{code-cell} ipython3
+survey_data_species.head()
+```
+
+We do not need the `species_x` and `species_id` columns anymore, as we will use the scientific names from now on:
+
+```{code-cell} ipython3
+survey_data_species = survey_data_species.drop(["species_x", "species_id"], axis=1)
+```
+
+The column `species_y` could just be named `species`:
+
+```{code-cell} ipython3
+survey_data_species = survey_data_species.rename(columns={"species_y": "species"})
+```
+
+```{code-cell} ipython3
+survey_data_species.head()
+```
+
+```{code-cell} ipython3
+len(survey_data_species)
+```
+
+Let's now save our cleaned-up data set to a csv file, so we can further analyze the data in a following notebook:
+
+```{code-cell} ipython3
+survey_data_species.to_csv("interim_survey_data_species.csv", index=False)
+```
+
+## (OPTIONAL SECTION) Using an API service to match the scientific names
+
++++
+
+As the current species names are rather short and could possibly lead to confusion when shared with other users, retrieving additional information about the different species in our dataset would be useful to integrate our work with other research. An option is to match our names with an external service to request additional information about the different species.
+
+One of these services is the [GBIF API](http://www.gbif.org/developer/species). The service can most easily be illustrated with a small example:

+In a new tab of the browser, go to the URL [http://www.gbif.org/species/2475532](http://www.gbif.org/species/2475532), which corresponds to the page of `Alcedo atthis` (*ijsvogel* in Dutch). For each of the species in our list, one could search the GBIF website to find the corresponding species page, from which more information can be extracted manually. However, this would take a lot of time...
+
+Therefore, GBIF (and many other organisations!) provides a service to extract the same information in a machine-readable way, in order to automate these searches. As an example, let's search for the information of `Alcedo atthis`, using the GBIF API: Go to the URL [http://api.gbif.org/v1/species/match?name=Alcedo atthis](http://api.gbif.org/v1/species/match?name=%22Alcedo%20atthis%22) and check the output. What we did is a machine-based search on the GBIF website for information about `Alcedo atthis`.
+
+The same can be done using Python. The main library we need for this kind of automated search is the [`requests` package](http://docs.python-requests.org/en/master/), which can be used to do requests to any kind of API out there.
+
+```{code-cell} ipython3
+import requests
+```
+
+### Example matching with Alcedo atthis
+
++++
+
+For the example of `Alcedo atthis`:
+
+```{code-cell} ipython3
+species_name = 'Alcedo atthis'
+```
+
+```{code-cell} ipython3
+base_string = 'http://api.gbif.org/v1/species/match?'
+request_parameters = {'verbose': False, 'strict': True, 'name': species_name}
+message = requests.get(base_string, params=request_parameters).json()
+message
+```
+
+This returns a dictionary containing more information about the taxonomy of `Alcedo atthis`. 
+

+++

In the available species data set, the name to match is provided in the combination of two columns, so we have to combine those two in order to execute the name matching:

```{code-cell} ipython3
genus_name = "Callipepla"
species_name = "squamata"
name_to_match = '{} {}'.format(genus_name, species_name)
base_string = 'http://api.gbif.org/v1/species/match?'
request_parameters = {'strict': True, 'name': name_to_match} # use strict matching(!)
message = requests.get(base_string, params=request_parameters).json()
message
```

To apply this on our species data set, we will have to do this request for each individual species/genus combination. As this is recurring functionality, we will write a small function to do this:

+++

### Writing a custom matching function

+++
+

EXERCISE: Write a function called `name_match` that takes the `genus`, the `species` and the option to perform a strict matching or not as inputs, performs a matching with the GBIF name matching API and returns the received message as a dictionary.

+

```{code-cell} ipython3
:clear_cell: true

def name_match(genus_name, species_name, strict=True):
    """
    Perform a GBIF name matching using the species and genus names

    Parameters
    ----------
    genus_name: str
        name of the genus of the species
    species_name: str
        name of the species to request more information
    strict: boolean
        define if the matching needs to be performed with the strict
        option (True) or not (False)

    Returns
    -------
    message: dict
        dictionary with the information returned by the GBIF matching service
    """
    name = '{} {}'.format(genus_name, species_name)
    base_string = 'http://api.gbif.org/v1/species/match?'
    request_parameters = {'strict': strict, 'name': name}  # use strict matching(!)
    message = requests.get(base_string, params=request_parameters).json()
    return message
```
+

NOTE: For many of these APIs, dedicated packages exist, e.g. pygbif provides different functions to do requests to the GBIF API, basically wrapping the request possibilities. For any kind of service, just ask yourself: does the dedicated library provide sufficient additional advantage, or can I easily set up the request myself? (or sometimes: which one has the best documentation...)

Many such services exist for a wide range of applications, e.g. scientific name matching, matching of addresses, downloading of data,... +
+ ++++ + +Testing our custom matching function: + +```{code-cell} ipython3 +genus_name = "Callipepla" +species_name = "squamata" +name_match(genus_name, species_name, strict=True) +``` + +However, the matching won't provide an answer for every search: + +```{code-cell} ipython3 +genus_name = "Lizard" +species_name = "sp." +name_match(genus_name, species_name, strict=True) +``` + +### Match each of the species names of the survey data set + ++++ + +Hence, in order to add this information to our survey DataFrame, we need to perform the following steps: +1. extract the unique genus/species combinations in our dataset and combine them in single column +2. match each of these names to the GBIF API service +3. process the returned message: + * if a match is found, add the information of the columns 'class', 'kingdom', 'order', 'phylum', 'scientificName', 'status' and 'usageKey' + * if no match was found: nan-values +4. Join the DataFrame of unique genus/species information with the enriched GBIF info to the `survey_data_species` data set + ++++ + +
+ EXERCISE: Extract the unique combinations of genus and species in the `survey_data_species` using the function drop_duplicates(). Save the result as the variable unique_species +
+ +```{code-cell} ipython3 +:clear_cell: true + +#%%timeit +unique_species = survey_data_species[["genus", "species"]].drop_duplicates().dropna() +``` + +```{code-cell} ipython3 +len(unique_species) +``` + +
+ +EXERCISE: Extract the unique combinations of genus and species in the `survey_data_species` using groupby. Save the result as the variable unique_species + +
+ +```{code-cell} ipython3 +:clear_cell: true + +#%%timeit +unique_species = \ + survey_data_species.groupby(["genus", "species"]).first().reset_index()[["genus", "species"]] +``` + +```{code-cell} ipython3 +len(unique_species) +``` + +
+ EXERCISE: Combine the columns genus and species to a single column with the complete name, save it in a new column named 'name' +
+

```{code-cell} ipython3
:clear_cell: true

unique_species["name"] = unique_species["genus"] + " " + unique_species["species"]
# an alternative approach worthwhile to know:
#unique_species["name"] = unique_species["genus"].str.cat(unique_species["species"], " ")
```

```{code-cell} ipython3
unique_species.head()
```

To perform the matching for each of the combinations, different options exist.

Just to show the possibility of using `for` loops, the addition of the matched information will be done as such. First, we will store everything in one dictionary, where the keys of the dictionary are the index values of `unique_species` (in order to later merge them again) and the values are the entire messages (which are dictionaries themselves). The format will look as follows:

```
species_annotated = {0: {'canonicalName': 'Squamata', 'class': 'Reptilia', 'classKey': 358, ...},
                     1: {'canonicalName':...},
                     2:...}
```

```{code-cell} ipython3
species_annotated = {}
for key, row in unique_species.iterrows():
    species_annotated[key] = name_match(row["genus"], row["species"], strict=True)
```

```{code-cell} ipython3
species_annotated
```

We can now transform this to a pandas DataFrame:

+++
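As a small aside on dummy data (the keys and values below are invented), note how pandas orients such a dictionary of dictionaries: the outer keys become *columns*, so a `transpose()` is needed to get one row per outer key:

```python
import pandas as pd

# Dummy nested dictionary mimicking the {index: message_dict} structure
toy = {0: {"canonicalName": "Squamata", "classKey": 358},
       1: {"canonicalName": "Aves", "classKey": 212}}

# pd.DataFrame interprets the outer keys as columns...
df_columns = pd.DataFrame(toy)

# ...so transpose to get one row per outer key
df_rows = pd.DataFrame(toy).transpose()

# an equivalent alternative:
df_rows2 = pd.DataFrame.from_dict(toy, orient="index")

print(df_rows)
```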
+ +EXERCISE: Convert the dictionary species_annotated into a pandas DataFrame with the row index the key-values corresponding to unique_species and the column headers the output columns of the API response. Save the result as the variable df_species_annotated. + +__Tip__: `transpose` can be used to flip rows and columns + +
+ +```{code-cell} ipython3 +:clear_cell: true + +df_species_annotated = pd.DataFrame(species_annotated).transpose() +``` + +```{code-cell} ipython3 +df_species_annotated.head() +``` + +### Select relevant information and add this to the survey data + ++++ + +
+ +EXERCISE: Subselect the columns 'class', 'kingdom', 'order', 'phylum', 'scientificName', 'status' and 'usageKey' from the DataFrame `df_species_annotated`. Save it as the variable df_species_annotated_subset + +
+ +```{code-cell} ipython3 +:clear_cell: true + +df_species_annotated_subset = df_species_annotated[['class', 'kingdom', 'order', 'phylum', + 'scientificName', 'status', 'usageKey']] +``` + +```{code-cell} ipython3 +df_species_annotated_subset.head() +``` + +
+ EXERCISE: Join the df_species_annotated_subset information to the `unique_species` overview of species. Save the result as variable unique_species_annotated: +
+ +```{code-cell} ipython3 +:clear_cell: true + +unique_species_annotated = pd.merge(unique_species, df_species_annotated_subset, + left_index=True, right_index=True) +``` + +
+ +EXERCISE: Add the `unique_species_annotated` data to the `survey_data_species` data set, using both the genus and species column as keys. Save the result as the variable survey_data_completed. + +
+ +```{code-cell} ipython3 +:clear_cell: true + +survey_data_completed = pd.merge(survey_data_species, unique_species_annotated, + how='left', on= ["genus", "species"]) +``` + +```{code-cell} ipython3 +len(survey_data_completed) +``` + +```{code-cell} ipython3 +survey_data_completed.head() +``` + +Congratulations! You did a great cleaning job, save your result: + +```{code-cell} ipython3 +survey_data_completed.to_csv("../data/survey_data_completed.csv", index=False) +``` + +## Acknowledgements + ++++ + +* `species.csv` and `survey.csv` are used from the [data carpentry workshop](https://github.com/datacarpentry/python-ecology-lesson) This data is from the paper S. K. Morgan Ernest, Thomas J. Valone, and James H. +Brown. 2009. Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA. Ecology 90:1708. http://esapubs.org/archive/ecol/E090/118/ +* The `plot_location.xlsx` is a dummy created location file purely created for this exercise, using the plots location on google maps +* [GBIF API](http://www.gbif.org/developer/summary) diff --git a/_solved/case3_bacterial_resistance_lab_experiment.md b/_solved/case3_bacterial_resistance_lab_experiment.md new file mode 100644 index 0000000..8050eb5 --- /dev/null +++ b/_solved/case3_bacterial_resistance_lab_experiment.md @@ -0,0 +1,360 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

# CASE - Bacterial resistance experiment

+ + +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2017, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + ++++ + +In this case study, we will make use of the open data, affiliated to the following [journal article](http://rsbl.royalsocietypublishing.org/content/12/5/20160064): + +>Arias-Sánchez FI, Hall A (2016) Effects of antibiotic resistance alleles on bacterial evolutionary responses to viral parasites. Biology Letters 12(5): 20160064. https://doi.org/10.1098/rsbl.2016.0064 + + ++++ + + + ++++ + +Check the full paper on the [web version](http://rsbl.royalsocietypublishing.org/content/12/5/20160064). The study handles: +> Antibiotic resistance has wide-ranging effects on bacterial phenotypes and evolution. However, the influence of antibiotic resistance on bacterial responses to parasitic viruses remains unclear, despite the ubiquity of such viruses in nature and current interest in therapeutic applications. We experimentally investigated this by exposing various Escherichia coli genotypes, including eight antibiotic-resistant genotypes and a mutator, to different viruses (lytic bacteriophages). Across 960 populations, we measured changes in population density and sensitivity to viruses, and tested whether variation among bacterial genotypes was explained by their relative growth in the absence of parasites, or mutation rate towards phage resistance measured by fluctuation tests for each phage + +```{code-cell} ipython3 +import pandas as pd +import matplotlib.pyplot as plt +import plotnine as p9 +``` + +## Reading and processing the data + ++++ + +The data is available on [Dryad](http://www.datadryad.org/resource/doi:10.5061/dryad.90qb7.3), a general purpose data repository providing all kinds of data sets linked to journal papers. 
The downloaded data is available in this repository in the `data` folder as an excel-file called `Dryad_Arias_Hall_v3.xlsx`.

For the exercises, two sheets of the excel file will be used:
* `Main experiment`:


| Variable name | Description |
|---------------:|:-------------|
|**AB_r** | Antibiotic resistance |
|**Bacterial_genotype** | Bacterial genotype |
|**Phage_t** | Phage treatment |
|**OD_0h** | Optical density at the start of the experiment (0h) |
|**OD_20h** | Optical density after 20h |
|**OD_72h** | Optical density at the end of the experiment (72h) |
|**Survival_72h** | Population survival at 72h (1=survived, 0=extinct) |
|**PhageR_72h** | Bacterial sensitivity to the phage they were exposed to (0=no bacterial growth, 1= colony formation in the presence of phage) |

* `Falcor`: we focus on a subset of the columns:

| Variable name | Description |
|---------------:|:-------------|
| **Phage** | Bacteriophage used in the fluctuation test (T4, T7 and lambda) |
| **Bacterial_genotype** | Bacterial genotype. |
| **log10 Mc** | Log 10 of corrected mutation rate |
| **log10 UBc** | Log 10 of corrected upper bound |
| **log10 LBc** | Log 10 of corrected lower bound |

+++

Reading the `main experiment` data set from the corresponding sheet:

```{code-cell} ipython3
main_experiment = pd.read_excel("../data/Dryad_Arias_Hall_v3.xlsx", sheet_name="Main experiment")
main_experiment
```

Read the `Falcor` data and subset the columns of interest:

```{code-cell} ipython3
falcor = pd.read_excel("../data/Dryad_Arias_Hall_v3.xlsx", sheet_name="Falcor",
                       skiprows=1)
falcor = falcor[["Phage", "Bacterial_genotype", "log10 Mc", "log10 UBc", "log10 LBc"]]
falcor.head()
```

## Tidy the `main_experiment` data

+++

*(If you're wondering what `tidy` data representations are, check again the `visualization_02_plotnine.ipynb` notebook)*

+++

Actually, the columns `OD_0h`, `OD_20h` and `OD_72h` represent the same variable (i.e. `optical_density`) and the column names themselves represent a variable, i.e. `experiment_time_h`. Hence, the data is stored in the table in *wide* format and we could *tidy* these columns by converting them to 2 columns: `experiment_time_h` and `optical_density`.

+++

Before making any changes to the data, we will add an identifier column for each of the current rows to make sure we keep the connection between the entries of a row when converting from wide to long format.

```{code-cell} ipython3
main_experiment["experiment_ID"] = ["ID_" + str(idx) for idx in range(len(main_experiment))]
main_experiment
```
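As a minimal reminder on a dummy table (the column names below are invented for the illustration), this is what such a wide-to-long conversion with `melt` does:

```python
import pandas as pd

# Dummy wide table: one row per sample, one column per measurement time
wide = pd.DataFrame({"sample_ID": ["A", "B"],
                     "OD_0h": [0.1, 0.2],
                     "OD_20h": [0.5, 0.7]})

# Gather the OD_* columns into a variable/value pair of long-format columns
tidy = wide.melt(id_vars=["sample_ID"],
                 value_vars=["OD_0h", "OD_20h"],
                 var_name="experiment_time_h",
                 value_name="optical_density")
print(tidy)
```

Each (sample, time) combination now occupies its own row, which is exactly what the exercise below asks for on the real data.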
+ +EXERCISE: + +
    +
  • Convert the columns `OD_0h`, `OD_20h` and `OD_72h` to a long format with the values stored in a column `optical_density` and the time in the experiment as `experiment_time_h`. Save the variable as tidy_experiment
  • + +
+ +__Tip__: Have a look at `pandas_07_reshaping_data.ipynb` to find out the required function + +
+ +```{code-cell} ipython3 +:clear_cell: true + +tidy_experiment = main_experiment.melt(id_vars=['AB_r', 'Bacterial_genotype', 'Phage_t', + 'Survival_72h', 'PhageR_72h', 'experiment_ID'], + value_vars=['OD_0h', 'OD_20h', 'OD_72h'], + var_name='experiment_time_h', + value_name='optical_density', ) +tidy_experiment +``` + +## Visual data exploration + +```{code-cell} ipython3 +tidy_experiment.head() +``` + +
+ +EXERCISE: + +
    +
  • Make a histogram to check the distribution of the `optical_density`
  • +
  • Change the border color of the bars to `white` and the fill color to `lightgrey`
  • +
  • Change the overall theme to any of the available themes
  • +
+

_Tip_: plotnine requires data, aesthetics and a geometry. Add color options to the geometry as parameters of the method and theme options as additional statements (`+`)
+ +```{code-cell} ipython3 +:clear_cell: true + +(p9.ggplot(tidy_experiment, p9.aes(x='optical_density')) + + p9.geom_histogram(bins=30, color='white', fill='lightgrey') + + p9.theme_bw() +) +``` + +
+ +EXERCISE: + +
    +
  • Use a `violin plot` to check the distribution of the `optical_density` in each of the experiment time phases (`experiment_time_h`)
  • + +
+ +_Tip_: within plotnine, searching for a specific geometry always starts with typing `geom_` + TAB-button +
+ +```{code-cell} ipython3 +:clear_cell: true + +(p9.ggplot(tidy_experiment, p9.aes(x='experiment_time_h', + y='optical_density')) + + p9.geom_violin() +) +``` + +
+ +EXERCISE: + +
    +
  • For each `Phage_t` in an individual subplot, use a `violin plot` to check the distribution of the `optical_density` in each of the experiment time phases (`experiment_time_h`)
  • +
+ +_Tip_: remember `facet_wrap`? + +
+ + +```{code-cell} ipython3 +:clear_cell: true + +(p9.ggplot(tidy_experiment, p9.aes(x='experiment_time_h', + y='optical_density')) + + p9.geom_violin() + + p9.facet_wrap('Phage_t') +) +``` + +
+ +EXERCISE: + +
    +
  • Create a summary table of the average `optical_density` with the `Bacterial_genotype` in the rows and the `experiment_time_h` in the columns
  • +
+ +_Tip_: no plotnine required here + +
+ +```{code-cell} ipython3 +:clear_cell: true + +pd.pivot_table(tidy_experiment, values='optical_density', + index='Bacterial_genotype', + columns='experiment_time_h', + aggfunc='mean') +``` + +```{code-cell} ipython3 +:clear_cell: true + +tidy_experiment.groupby(['Bacterial_genotype', 'experiment_time_h'])['optical_density'].mean().unstack() +``` + +
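Both solutions give the same table; on a small dummy data set (invented values) you can verify that `pivot_table` here is essentially a `groupby` followed by `unstack`:

```python
import pandas as pd

# Dummy long-format measurements
df = pd.DataFrame({"genotype": ["WT", "WT", "MUT", "MUT"],
                   "time": ["0h", "20h", "0h", "20h"],
                   "od": [0.1, 0.5, 0.2, 0.8]})

# Summary table via pivot_table...
via_pivot = pd.pivot_table(df, values="od", index="genotype",
                           columns="time", aggfunc="mean")

# ...and the same via groupby + unstack
via_groupby = df.groupby(["genotype", "time"])["od"].mean().unstack()

print(via_pivot.equals(via_groupby))
```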
+ +EXERCISE: + +
    +
  • Calculate for each combination of `Bacterial_genotype`, `Phage_t` and `experiment_time_h` the mean `optical_density` and store the result as a dataframe called `density_mean`
  • +
  • Based on `density_mean`, make a barplot of the mean values for each `Bacterial_genotype`, with for each Bacterial_genotype an individual bar per `Phage_t` in a different color (grouped bar chart).
  • +
  • Use the `experiment_time_h` to split into subplots. As we mainly want to compare the values within each subplot, make sure the scales in each of the subplots are adapted to the data range, and put the subplots on different rows.
  • +
  • (OPTIONAL) change the color scale of the bars to a color scheme provided by colorbrewer
  • + +
+ +
+ + +```{code-cell} ipython3 +:clear_cell: true + +density_mean = (tidy_experiment + .groupby(['Bacterial_genotype','Phage_t', 'experiment_time_h'])['optical_density'] + .mean().reset_index()) +``` + +```{code-cell} ipython3 +density_mean.head() +``` + +```{code-cell} ipython3 +:clear_cell: true + +(p9.ggplot(density_mean, p9.aes(x='Bacterial_genotype', + y='optical_density', + fill='Phage_t')) + + p9.geom_bar(stat='identity', position='dodge') + + p9.facet_wrap('experiment_time_h', scales='free', nrow=3) + + p9.scale_fill_brewer(type='qual', palette=8) +) +``` + +## Reproduce the graphs of the original paper + ++++ + +
+ +EXERCISE: + +
    +
  • Check Figure 2 of the original journal paper in the 'correction' part of the pdf:
  • + +
  • Reproduce the graph using the `falcor` data and the plotnine package (don't bother yet about the style or the order on the x axis). The 'log10 mutation rate' on the figure corresponds to the `log10 Mc` column.
  • +
  • Check the documentation to find out how to add errorbars to the graph. The upper and lower bound for the error bars are given in the `log10 UBc` and `log10 LBc` columns.
  • +
  • Make sure the `WT(2)` and `MUT(2)` are used as respectively `WT` and `MUT`.
  • +
+
+ +```{code-cell} ipython3 +:clear_cell: true + +falcor["Bacterial_genotype"] = falcor["Bacterial_genotype"].replace({'WT(2)': 'WT', + 'MUT(2)': 'MUT'}) +``` + +```{code-cell} ipython3 +:clear_cell: true + +(p9.ggplot(falcor, p9.aes(x='Bacterial_genotype', y='log10 Mc')) + + p9.geom_point() + + p9.facet_wrap('Phage', nrow=3) + + p9.geom_errorbar(p9.aes(ymin='log10 LBc', ymax='log10 UBc'), width=.2) + + p9.theme_bw() +) +``` + +
+ +EXERCISE (OPTIONAL): + +
    +
  • Check Figure 1 of the original journal paper:
  • +
  • Reproduce the graph using the `tidy_experiment` data and the plotnine package. Notice that the plot shows the optical density at the end of the experiment (72h).
  • +
  • Take the `geom_` that closest represents the original.
  • +
  • Check the documentation for further tuning, e.g. `as_labeller`...
  • +
+
+ + +```{code-cell} ipython3 +:clear_cell: true + +end_of_experiment = tidy_experiment[tidy_experiment["experiment_time_h"] == "OD_72h"].copy() +``` + +```{code-cell} ipython3 +:clear_cell: true + +# The Nan-values of the PhageR_72h when no phage represent survival (1) +end_of_experiment["PhageR_72h"] = end_of_experiment["PhageR_72h"].fillna(0.) +``` + +```{code-cell} ipython3 +:clear_cell: true + +# precalculate the median value +end_of_experiment["Phage_median"] = end_of_experiment.groupby(["Phage_t", "Bacterial_genotype"])['optical_density'].transform('median') + +p9.options.figure_size = (8, 10) +(p9.ggplot(end_of_experiment, p9.aes(x='Bacterial_genotype', + y='optical_density')) + + p9.geom_jitter(mapping=p9.aes(color='factor(PhageR_72h)'), + width=0.2, height=0., size=2, fill='white') + + p9.facet_wrap("Phage_t", nrow=4, + labeller=p9.as_labeller({'C_noPhage' : '(a) no phage', 'L' : '(b) phage $\lambda$', + 'T4' : '(c) phage T4', 'T7': '(d) phage T7'})) + + p9.theme_bw() + + p9.xlab("Bacterial genotype") + + p9.ylab("Bacterial density (OD)") + + p9.theme(strip_text=p9.element_text(size=11)) + + p9.geom_crossbar(inherit_aes=False, alpha=0.5, + mapping=p9.aes(x='Bacterial_genotype', y='Phage_median', + ymin='Phage_median', ymax='Phage_median')) + + p9.scale_color_manual(values=["black", "red"], guide=False) +) +``` diff --git a/_solved/case4_air_quality_analysis.md b/_solved/case4_air_quality_analysis.md new file mode 100644 index 0000000..9ffed58 --- /dev/null +++ b/_solved/case4_air_quality_analysis.md @@ -0,0 +1,774 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

# Case study: air quality data of European monitoring stations (AirBase)


+ +__AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.__ + +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + +```{code-cell} ipython3 +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import plotnine as pn + +pd.options.display.max_rows = 8 +``` + +In the previous notebook, we processed some raw data files of the AirBase air quality data. As a reminder, the data contains hourly concentrations of nitrogen dioxide (NO2) for 4 different measurement stations: + +- FR04037 (PARIS 13eme): urban background site at Square de Choisy +- FR04012 (Paris, Place Victor Basch): urban traffic site at Rue d'Alesia +- BETR802: urban traffic site in Antwerp, Belgium +- BETN029: rural background site in Houtem, Belgium + +See http://www.eea.europa.eu/themes/air/interactive/no2 + ++++ + +# Importing and quick exploration + ++++ + +We processed the individual data files in the previous notebook, and saved it to a csv file `../data/airbase_data_processed.csv`. Let's import the file here (if you didn't finish the previous notebook, a version of the dataset is also available in `../data/airbase_data.csv`): + +```{code-cell} ipython3 +alldata = pd.read_csv('../data/airbase_data.csv', index_col=0, parse_dates=True) +``` + +We only use the data from 1999 onwards: + +```{code-cell} ipython3 +data = alldata['1999':].copy() +``` + +Some first exploration with the *typical* functions: + +```{code-cell} ipython3 +data.head() # tail() +``` + +```{code-cell} ipython3 +data.info() +``` + +```{code-cell} ipython3 +data.describe(percentiles=[0.1, 0.5, 0.9]) +``` + +```{code-cell} ipython3 +data.plot(figsize=(12,6)) +``` + +
+
ATTENTION!

+

When just using `.plot()` without further processing (selection, aggregation,...):
    +
  • Risk of running into trouble by overloading your computer (certainly with very long time series)
  • +
  • Not always the most informative/interpretable visualisation
  • +
+
+ ++++ + +**Plot only a subset** + ++++ + +Why not just using the `head`/`tail` possibilities? + +```{code-cell} ipython3 +data.tail(500).plot(figsize=(12,6)) +``` + +**Summary figures** + ++++ + +Use summary statistics... + +```{code-cell} ipython3 +data.plot(kind='box', ylim=[0,250]) +``` + +Also with seaborn plots function, just start with some subsets as first impression... + +As we already have seen previously, the plotting library [seaborn](http://seaborn.pydata.org/generated/seaborn.heatmap.html) provides some high-level plotting functions on top of matplotlib (check the [docs](http://seaborn.pydata.org/examples/index.html)!). One of those functions is `pairplot`, which we can use here to quickly visualize the concentrations at the different stations and their relation: + +```{code-cell} ipython3 +import seaborn as sns +``` + +```{code-cell} ipython3 +sns.pairplot(data.tail(5000).dropna()) +``` + +# Is this a tidy dataset ? + +```{code-cell} ipython3 +data.head() +``` + +In principle this is not a tidy dataset. The variable that was measured is the NO2 concentration, and is divided in 4 columns. Of course those measurements were made at different stations, so one could interpet it as separate variables. But in any case, such format typically does not work well with `plotnine` which expects a pure tidy format. + + +Reason to not use a tidy dataset here: + +* bigger memory use +* timeseries functionality like resample works better +* pandas plotting already does what we want when having different columns for *some* types of plots (eg line plots of the timeseries) + ++++ + +
+ +EXERCISE: + +
    +
  • Create a tidy version of this dataset data_tidy, ensuring the result has new columns 'station' and 'no2'.
  • +
  • Check how many missing values are contained in the 'no2' column.
  • +
  • Drop the rows with missing values in that column.
  • +
+
+

```{code-cell} ipython3
:clear_cell: true

data_tidy = data.reset_index().melt(id_vars=["datetime"], var_name='station', value_name='no2')
data_tidy.head()
```

```{code-cell} ipython3
:clear_cell: true

data_tidy['no2'].isnull().sum()
```

```{code-cell} ipython3
:clear_cell: true

data_tidy = data_tidy.dropna()
```

In the following exercises we will mostly do our analysis on `data` and often use pandas (or seaborn) plotting, but once we have produced some kind of summary dataframe as the result of an analysis, it becomes more interesting to convert that result to a tidy format to be able to use the more advanced plotting functionality of `plotnine`.

+++

# Exercises
+ +REMINDER:

+ +Take a look at the [Timeseries notebook](pandas_04_time_series_data.ipynb) when you require more info about: + +
    +
  • resample
  • +
  • string indexing of DateTimeIndex
  • +

+ +Take a look at the [matplotlib](visualization_01_matplotlib.ipynb) and [plotnine](visualization_02_plotnine.ipynb) notebooks when you require more info about the plot requirements. + +
+ ++++ + +
+ +EXERCISE: + +
    +
  • Plot the monthly mean and median concentration of the 'FR04037' station for the years 2009 - 2013 in a single figure/ax
  • +
+
+ +```{code-cell} ipython3 +:clear_cell: true + +fig, ax = plt.subplots() +data.loc['2009':, 'FR04037'].resample('M').mean().plot(ax=ax, label='mean') +data.loc['2009':, 'FR04037'].resample('M').median().plot(ax=ax, label='median') +ax.legend(ncol=2) +ax.set_title("FR04037"); +``` + +```{code-cell} ipython3 +:clear_cell: true + +data.loc['2009':, 'FR04037'].resample('M').agg(['mean', 'median']).plot() +``` + +
+ +EXERCISE + +
    +
  • Make a violin plot for January 2011 until August 2011 (check out the documentation to improve the plotting settings)
  • +
  • Change the y-label to 'NO$_2$ concentration (µg/m³)'
  • +

+

NOTE:

When the data is not in a long format but has different columns for which you want to make violin plots, you can use [seaborn](http://seaborn.pydata.org/examples/index.html).
When using the tidy data, we can use `plotnine`.
+ +```{code-cell} ipython3 +:clear_cell: true + +# with seaborn +fig, ax = plt.subplots() +sns.violinplot(data=data['2011-01': '2011-08'], palette="GnBu_d", ax=ax) +ax.set_ylabel("NO$_2$ concentration (µg/m³)") +``` + +```{code-cell} ipython3 +:clear_cell: true + +# with plotnine +data_tidy_subset = data_tidy[(data_tidy['datetime'] >= "2011-01") & (data_tidy['datetime'] < "2011-09")] + +(pn.ggplot(data_tidy_subset, pn.aes(x='station', y='no2')) + + pn.geom_violin() + + pn.ylab("NO$_2$ concentration (µg/m³)")) +``` + +
+ +EXERCISE + +
    +
  • Make a bar plot with pandas of the mean of each of the stations in the year 2012 (check the documentation of Pandas plot to adapt the rotation of the labels) and make sure all bars have the same color.
  • +
  • Using the matplotlib objects, change the y-label to 'NO$_2$ concentration (µg/m³)'
  • +
  • Add a 'darkorange' horizontal line on the ax for the y-value 40 µg/m³ (command for horizontal line from matplotlib: axhline).
  • +
  • Place the text 'Yearly limit is 40 µg/m³' just above the 'darkorange' line.
  • +

+ +
+ +```{code-cell} ipython3 +:clear_cell: true + +fig, ax = plt.subplots() +data['2012'].mean().plot(kind='bar', ax=ax, rot=0, color='C0') +ax.set_ylabel("NO$_2$ concentration (µg/m³)") +ax.axhline(y=40., color='darkorange') +ax.text(0.01, 0.48, 'Yearly limit is 40 µg/m³', + horizontalalignment='left', fontsize=13, + transform=ax.transAxes, color='darkorange'); +``` + +
+ +EXERCISE: Did the air quality improve over time? + +
    +
  • For the data from 1999 till the end, plot the yearly averages
  • +
  • For the same period, add the overall mean (all stations together) as an additional line to the graph, use a thicker black line (linewidth=4 and linestyle='--')
  • +
  • [OPTIONAL] Add a legend above the ax for all lines
  • + + +
+
+ +```{code-cell} ipython3 +:clear_cell: true + +fig, ax = plt.subplots() + +data['1999':].resample('A').mean().plot(ax=ax) +data['1999':].mean(axis=1).resample('A').mean().plot(color='k', + linestyle='--', + linewidth=4, + ax=ax, + label='Overall mean') +ax.legend(loc='center', ncol=3, + bbox_to_anchor=(0.5, 1.06)) +ax.set_ylabel("NO$_2$ concentration (µg/m³)"); +``` + +
+ +REMEMBER:

+

`resample` is a special version of a `groupby` operation. For example, taking annual means with `data.resample('A').mean()` is equivalent to `data.groupby(data.index.year).mean()` (but the result of `resample` still has a DatetimeIndex).

+

Checking the index of the resulting DataFrame when using **groupby** instead of resample: You'll notice that the Index lost the DateTime capabilities:


> data.groupby(data.index.year).mean().index


Results in:


    Int64Index([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
                2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
                2012],
               dtype='int64')

+ +When using **resample**, we keep the DateTime capabilities: + + +> data.resample('A').mean().index + + +Results in: + + +DatetimeIndex(['1999-12-31', '2000-12-31', '2001-12-31', '2002-12-31', + '2003-12-31', '2004-12-31', '2005-12-31', '2006-12-31', + '2007-12-31', '2008-12-31', '2009-12-31', '2010-12-31', + '2011-12-31', '2012-12-31'], + dtype='datetime64[ns]', freq='A-DEC') + +
+

But `groupby` is more flexible and can also do resamples that do not result in a new continuous time series, e.g. by grouping by the hour of the day to get the diurnal cycle.
+ ++++ + +
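A short self-contained illustration on generated data: both approaches give the same daily means, but only `resample` returns a `DatetimeIndex`:

```python
import numpy as np
import pandas as pd

# Four days of synthetic "hourly" measurements
idx = pd.date_range("2021-01-01", periods=96, freq="60min")
ts = pd.Series(np.arange(96.0), index=idx)

# Daily means, computed with resample and with groupby
by_resample = ts.resample("D").mean()
by_groupby = ts.groupby(ts.index.date).mean()

# Same values, but the index types differ
print(type(by_resample.index).__name__, type(by_groupby.index).__name__)
```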
+ +EXERCISE + +
    +
  • What does the typical yearly profile (typical averages for the different months over the years) look like for the different stations? (add a 'month' column as a first step)
  • + +
+
+ +```{code-cell} ipython3 +:clear_cell: true + +# add a column to the dataframe that indicates the month (integer value of 1 to 12): +data['month'] = data.index.month + +# now, we can calculate the mean of each month over the different years: +data.groupby('month').mean() + +# plot the typical monthly profile of the different stations: +data.groupby('month').mean().plot() +``` + +```{code-cell} ipython3 +data = data.drop("month", axis=1) +``` + +Note: Technically, we could reshape the result of the groupby operation to a tidy format (we no longer have a real time series), but since we already have the things we want to plot as lines in different columns, doing `.plot` already does what we want. + ++++ + +
+ +EXERCISE + +
    +
  • Plot the weekly 95th percentile of the concentration in 'BETR801' and 'BETN029' for 2011
  • + +
+
+

```{code-cell} ipython3
:clear_cell: true

# Groupby wise
df2011 = data['2011']
df2011.groupby(df2011.index.week)[['BETN029', 'BETR801']].quantile(0.95).plot()
```

```{code-cell} ipython3
:clear_cell: true

# Resample wise (not possible to use quantile directly on a resample, so you need a lambda function)
# Note the different x-axis labels
df2011[['BETN029', 'BETR801']].resample('W').agg(lambda x: x.quantile(0.95)).plot()
```
+ +EXERCISE + +
    +
  • Plot the typical diurnal profile (typical hourly averages) for the different stations taking into account the whole time period.
  • + +
+
+ +```{code-cell} ipython3 +:clear_cell: true + +data.groupby(data.index.hour).mean().plot() +``` + +
+ +EXERCISE

+

What is the difference in the typical diurnal profile between weekdays and weekend days? (and visualise it)

+

Start with only visualizing the difference in diurnal profile for the BETR801 station. In a next step, make the same plot for each station.

+ +**Hints:** + +
    +
  • Add a column 'weekend' defining if a value of the index is in the weekend (i.e. weekdays 5 and 6) or not
  • +
  • Add a column 'hour' with the hour of the day for each row.
  • +
  • You can groupby on multiple items at the same time.
  • + +
+
+ +```{code-cell} ipython3 +:clear_cell: true + +data['weekend'] = data.index.weekday.isin([5, 6]) +data['weekend'] = data['weekend'].replace({True: 'weekend', False: 'weekday'}) +data['hour'] = data.index.hour +``` + +```{code-cell} ipython3 +:clear_cell: true + +data_weekend = data.groupby(['weekend', 'hour']).mean() +data_weekend.head() +``` + +```{code-cell} ipython3 +:clear_cell: true + +# using unstack and pandas plotting +data_weekend_BETR801 = data_weekend['BETR801'].unstack(level=0) +data_weekend_BETR801.plot() +``` + +```{code-cell} ipython3 +:clear_cell: true + +# using a tidy dataset and plotnine +data_weekend_BETR801_tidy = data_weekend['BETR801'].reset_index() + +(pn.ggplot(data_weekend_BETR801_tidy, + pn.aes(x='hour', y='BETR801', color='weekend')) + + pn.geom_line()) +``` + +```{code-cell} ipython3 +:clear_cell: true + +# tidy dataset that still includes all stations + +data_weekend_tidy = pd.melt(data_weekend.reset_index(), id_vars=['weekend', 'hour'], + var_name='station', value_name='no2') +data_weekend_tidy.head() +``` + +```{code-cell} ipython3 +:clear_cell: true + +# when still having multiple factors, it becomes useful to convert to tidy dataset and use plotnine +(pn.ggplot(data_weekend_tidy, + pn.aes(x='hour', y='no2', color='weekend')) + + pn.geom_line() + + pn.facet_wrap('station')) +``` + +```{code-cell} ipython3 +data = data.drop(['hour', 'weekend'], axis=1) +``` + +
+ +EXERCISE:

+ +
    +
  • Calculate the correlation between the different stations (check in the documentation, google "pandas correlation" or use the magic function %psearch)
  • + +
+
+ +```{code-cell} ipython3 +:clear_cell: true + +data[['BETR801', 'BETN029', 'FR04037', 'FR04012']].corr() +``` + +
+ +EXERCISE:

+

Count the number of exceedances of hourly values above the European limit of 200 µg/m³ for each year and station after 2005. Make a barplot of the counts. Add a horizontal line indicating the maximum number of exceedances allowed per year (which is 18).

+ +**Hints:** + +
    +
  • Create a new DataFrame, called exceedances, (with boolean values) indicating if the threshold is exceeded or not
  • +
  • Remember that the sum of True values can be used to count elements
  • +
  • Adding a horizontal line can be done with the matplotlib function ax.axhline
  • + + +
+
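The counting trick from the hints can be illustrated on a made-up toy series (not the actual measurement data):

```{code-cell} ipython3
import pandas as pd

# toy measurements spread over two years (hypothetical values)
values = pd.Series([150, 250, 180, 300, 90, 220],
                   index=pd.to_datetime(["2005-03-01", "2005-06-01", "2005-09-01",
                                         "2006-02-01", "2006-05-01", "2006-08-01"]))

exceeded = values > 200          # boolean Series

# summing booleans counts the True values
total = exceeded.sum()

# per year: group the booleans by the year of the index and sum
per_year = exceeded.groupby(exceeded.index.year).sum()
```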
+ +```{code-cell} ipython3 +:clear_cell: true + +exceedances = data > 200 +``` + +```{code-cell} ipython3 +:clear_cell: true + +# group by year and count exceedances (sum of boolean) +exceedances = exceedances.groupby(exceedances.index.year).sum() +``` + +```{code-cell} ipython3 +:clear_cell: true + +# Make a barplot of the yearly number of exceedances +ax = exceedances.loc[2005:].plot(kind='bar') +ax.axhline(18, color='k', linestyle='--') +``` + +# More advanced exercises... + +```{code-cell} ipython3 +data = alldata['1999':].copy() +``` + +
+ +EXERCISE: Perform the following actions for the station `'FR04012'` only: + +
    +
  • Remove the rows containing NaN or zero values
  • +
  • Sort the values of the rows according to the air quality values (low to high values)
  • +
  • Rescale the values to the range [0-1] and store the result as FR_scaled (Hint: check min-max scaling on wikipedia)
  • +
  • Use pandas to plot these values sorted, not taking into account the dates
  • +
  • Add the station name 'FR04012' as y-label
  • +
  • [OPTIONAL] Add a vertical line to the plot where the line (hence, the values of variable FR_scaled) reaches the value 0.3. You will need the documentation of np.searchsorted and matplotlib's axvline
  • +
+
+ +```{code-cell} ipython3 +:clear_cell: true + +FR_station = data['FR04012'] # select the specific data series +FR_station = FR_station[(FR_station.notnull()) & (FR_station != 0.0)] # exclude the Nan and zero values +``` + +```{code-cell} ipython3 +:clear_cell: true + +FR_sorted = FR_station.sort_values(ascending=True) +FR_scaled = (FR_sorted - FR_sorted.min())/(FR_sorted.max() - FR_sorted.min()) +``` + +```{code-cell} ipython3 +:clear_cell: true + +fig, axfr = plt.subplots() +FR_scaled.plot(use_index=False, ax = axfr) #alternative version: FR_scaled.reset_index(drop=True).plot(use_index=False) +axfr.set_ylabel('FR04012') +# optional addition, just in case you need this +axfr.axvline(x=FR_scaled.searchsorted(0.3), color='0.6', linestyle='--', linewidth=3) +``` + +
+ +EXERCISE: + +
    +
  • Create a Figure with two subplots (axes), for which both axis are shared
  • +
  • In the left subplot, plot the histogram (30 bins) of station 'BETN029', only for the year 2009
  • +
  • In the right subplot, plot the histogram (30 bins) of station 'BETR801', only for the year 2009
  • +
  • Add the title representing the station name on each of the subplots, you do not want to have a legend
  • +
+
+

```{code-cell} ipython3
:clear_cell: true

# Mixing and matching matplotlib and Pandas
fig, (ax1, ax2) = plt.subplots(1, 2,
                               sharex=True,
                               sharey=True)

data.loc['2009', ['BETN029', 'BETR801']].plot(kind='hist', subplots=True,
                                              bins=30, legend=False,
                                              ax=(ax1, ax2))
ax1.set_title('BETN029')
ax2.set_title('BETR801')
# Remark: the width of the bins is calculated over the x data range for both plots together
```

```{code-cell} ipython3
:clear_cell: true

# A more step by step approach (equally valid)
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, sharex=True)
data.loc['2009', 'BETN029'].plot(kind='hist', bins=30, ax=ax1)
ax1.set_title('BETN029')
data.loc['2009', 'BETR801'].plot(kind='hist', bins=30, ax=ax2)
ax2.set_title('BETR801')
# Remark: the width of the bins is calculated over the x data range for each plot individually
```

+ +EXERCISE + +
    +
  • Make a selection of the original dataset with only the data of January 2009, and call the resulting variable subset
  • +
  • Add a new column, called 'weekday', to the variable subset which defines for each data point the day of the week
  • +
  • From the subset DataFrame, select only Monday (= day 0) and Sunday (=day 6) and remove the others (so, keep this as variable subset)
  • +
  • Change the values of the weekday column in subset according to the following mapping: {0:"Monday", 6:"Sunday"}
  • +
  • With plotnine, make a scatter plot of the measurements at 'BETN029' vs 'FR04037', with the color variation based on the weekday. Add a linear regression to this plot.
  • +

+ +**Note**: If you run into the **SettingWithCopyWarning** and do not know what to do, recheck [pandas_03b_indexing](pandas_03b_indexing.ipynb) + +
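A minimal sketch of the safe pattern (toy data, just illustrating the use of `.copy()` before assigning new columns to a selection):

```{code-cell} ipython3
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# taking an explicit copy makes the subset independent of the original,
# avoiding the SettingWithCopyWarning on later assignments
subset = df[df["A"] > 1].copy()
subset["C"] = subset["A"] * 10
```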
+ +```{code-cell} ipython3 +:clear_cell: true + +subset = data['2009-01'].copy() +subset["weekday"] = subset.index.weekday +subset = subset[subset['weekday'].isin([0, 6])] +``` + +```{code-cell} ipython3 +:clear_cell: true + +subset["weekday"] = subset["weekday"].replace(to_replace={0:"Monday", 6:"Sunday"}) +``` + +```{code-cell} ipython3 +:clear_cell: true + +(pn.ggplot(subset, + pn.aes(x="BETN029", y="FR04037", color="weekday")) + + pn.geom_point() + + pn.stat_smooth(method='lm')) +``` + +
+ +EXERCISE: + +
    +
  • The maximum daily 8-hour mean should be below 100 µg/m³. What is the number of exceedances of this limit for each year/station?


  • +
+ + +**Tip:**
+ +Have a look at the `rolling` method to perform moving window operations.

+ +**Note:**
+
This is not an actual limit for NO$_2$, but a nice exercise to introduce the `rolling` method. Other pollutants, such as O$_3$, actually have limit values based on 8-hour means.

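Before applying it to the real data, the effect of `rolling` can be sketched on a tiny toy series (illustrative numbers only):

```{code-cell} ipython3
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# moving average over a window of 3 consecutive values;
# the first 2 positions are NaN because the window is incomplete there
r = s.rolling(3).mean()
```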
+ +```{code-cell} ipython3 +:clear_cell: true + +exceedances = data.rolling(8).mean().resample('D').max() > 100 +``` + +```{code-cell} ipython3 +:clear_cell: true + +exceedances = exceedances.groupby(exceedances.index.year).sum() +ax = exceedances.plot(kind='bar') +``` + +
+ +EXERCISE: + +
    +
  • Visualize the typical week profile for station 'BETR801' as boxplots (where the values in one boxplot are the daily means for the different weeks for a certain weekday).


  • +
+ + +**Tip:**
+

The boxplot method of a DataFrame expects the data for the different boxes in different columns. For this, you can either use `pivot_table` or a combination of `groupby` and `unstack`.

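A minimal sketch of the `pivot_table` approach on made-up long-format data (the column names and values are hypothetical):

```{code-cell} ipython3
import pandas as pd

# toy long-format data: one daily mean per (week, weekday) combination
df_long = pd.DataFrame({
    'week':    [1, 1, 2, 2, 3, 3],
    'weekday': [0, 1, 0, 1, 0, 1],
    'value':   [10.0, 20.0, 12.0, 18.0, 11.0, 22.0],
})

# pivot so that each weekday becomes a separate column -> ready for .boxplot()
wide = df_long.pivot_table(columns='weekday', index='week', values='value')
```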
+ ++++ + +Calculating daily means and add weekday information: + +```{code-cell} ipython3 +:clear_cell: true + +data_daily = data.resample('D').mean() +``` + +```{code-cell} ipython3 +:clear_cell: true + +# add a weekday column +data_daily['weekday'] = data_daily.index.weekday +data_daily.head() +``` + +Plotting with plotnine: + +```{code-cell} ipython3 +:clear_cell: true + +# plotnine +(pn.ggplot(data_daily["2012"], + pn.aes(x='factor(weekday)', y='BETR801')) + + pn.geom_boxplot()) +``` + +Reshaping and plotting with pandas: + +```{code-cell} ipython3 +:clear_cell: true + +# when using pandas to plot, the different boxplots should be different columns +# therefore, pivot table so that the weekdays are the different columns +data_daily['week'] = data_daily.index.week +data_pivoted = data_daily['2012'].pivot_table(columns='weekday', index='week', values='BETR801') +data_pivoted.head() +data_pivoted.boxplot(); +``` + +```{code-cell} ipython3 +:clear_cell: true + +# An alternative method using `groupby` and `unstack` +data_daily['2012'].groupby(['weekday', 'week'])['BETR801'].mean().unstack(level=0).boxplot(); +``` diff --git a/_solved/case4_air_quality_processing.md b/_solved/case4_air_quality_processing.md new file mode 100644 index 0000000..8e67a29 --- /dev/null +++ b/_solved/case4_air_quality_processing.md @@ -0,0 +1,467 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

Case study: air quality data of European monitoring stations (AirBase)


+

__AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.__

> *DS Data manipulation, analysis and visualisation in Python*
> *December, 2019*

> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

---

+++

AirBase is the European air quality database maintained by the European Environment Agency (EEA). It contains air quality monitoring data and information submitted by participating countries throughout Europe. The [air quality database](https://www.eea.europa.eu/data-and-maps/data/aqereporting-8/air-quality-zone-geometries) consists of a multi-annual time series of air quality measurement data and statistics for a number of air pollutants.

+++

Some of the data files that are available from AirBase were included in the data folder: the hourly **concentrations of nitrogen dioxide (NO2)** for 4 different measurement stations:

- FR04037 (PARIS 13eme): urban background site at Square de Choisy
- FR04012 (Paris, Place Victor Basch): urban traffic site at Rue d'Alesia
- BETR801: urban traffic site in Antwerp, Belgium
- BETN029: rural background site in Houtem, Belgium

See http://www.eea.europa.eu/themes/air/interactive/no2

```{code-cell} ipython3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_rows = 8
plt.style.use("seaborn-whitegrid")
```

# Processing a single file

We will start with processing one of the downloaded files (`BETR8010000800100hour.1-1-1990.31-12-2012`). Looking at the data, you will see it does not look like a nice csv file:

```{code-cell} ipython3
with open("../data/BETR8010000800100hour.1-1-1990.31-12-2012") as f:
    print(f.readline())
```

+ ++++ + +Just reading the tab-delimited data: + +```{code-cell} ipython3 +data = pd.read_csv("../data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')#, header=None) +``` + +```{code-cell} ipython3 +data.head() +``` + +The above data is clearly not ready to be used! Each row contains the 24 measurements for each hour of the day, and also contains a flag (0/1) indicating the quality of the data. Furthermore, there is no header row with column names. + ++++ + +
+ +EXERCISE:

Clean up this dataframe by using more options of `read_csv` (see its [docstring](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)) + + +
+ +```{code-cell} ipython3 +# Column names: list consisting of 'date' and then intertwined the hour of the day and 'flag' +hours = ["{:02d}".format(i) for i in range(24)] +column_names = ['date'] + [item for pair in zip(hours, ['flag' + str(i) for i in range(24)]) for item in pair] +``` + +```{code-cell} ipython3 +:clear_cell: true + +data = pd.read_csv("../data/BETR8010000800100hour.1-1-1990.31-12-2012", + sep='\t', header=None, names=column_names, na_values=[-999, -9999]) +``` + +```{code-cell} ipython3 +:clear_cell: true + +data.head() +``` + +For the sake of this tutorial, we will disregard the 'flag' columns (indicating the quality of the data). + ++++ + +
+ +EXERCISE: +

+
Drop all 'flag' columns ('flag0', 'flag1', ...)

```{code-cell} ipython3
flag_columns = [col for col in data.columns if 'flag' in col]
# we can now use this list to drop these columns
```

```{code-cell} ipython3
:clear_cell: true

data = data.drop(flag_columns, axis=1)
```

```{code-cell} ipython3
data.head()
```

Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index. Here we have a wide and long dataframe, and want to make this a long, narrow timeseries.

+ +REMEMBER: + + +Recap: reshaping your data with [`stack` / `melt` and `unstack` / `pivot`](./pandas_07_reshaping_data.ipynb) + + + + + +
+ ++++ + +
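As a quick refresher, both reshapes on a tiny toy table (made-up values):

```{code-cell} ipython3
import pandas as pd

df_wide = pd.DataFrame({'date': ['2012-01-01', '2012-01-02'],
                        '01': [10, 30],
                        '02': [20, 40]})

# melt: the column labels become values in a new 'hour' column
long_melt = pd.melt(df_wide, id_vars=['date'], var_name='hour')

# stack: the same reshape, but via the (row) index
long_stack = df_wide.set_index('date').stack()
```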
+ +EXERCISE: + +

+ +Reshape the dataframe to a timeseries. +The end result should look like:

+ + +
+

|                     | BETR801 |
|---------------------|---------|
| 1990-01-02 09:00:00 | 48.0 |
| 1990-01-02 12:00:00 | 48.0 |
| 1990-01-02 13:00:00 | 50.0 |
| 1990-01-02 14:00:00 | 55.0 |
| ... | ... |
| 2012-12-31 20:00:00 | 16.5 |
| 2012-12-31 21:00:00 | 14.5 |
| 2012-12-31 22:00:00 | 16.5 |
| 2012-12-31 23:00:00 | 15.0 |

170794 rows × 1 columns

+
+ +
    +
  • Reshape the dataframe so that each row consists of one observation for one date + hour combination
  • +
  • When you have the date and hour values as two columns, combine these columns into a datetime (tip: string columns can be summed to concatenate the strings) and remove the original columns
  • +
  • Set the new datetime values as the index, and remove the original columns with date and hour values
  • + +
+ + +**NOTE**: This is an advanced exercise. Do not spend too much time on it and don't hesitate to look at the solutions. + +
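The tip about summing string columns can be tried on a tiny toy frame first (made-up values mimicking the date and hour columns):

```{code-cell} ipython3
import pandas as pd

df = pd.DataFrame({'date': ['1990-01-02', '1990-01-02'],
                   'hour': ['09', '12']})

# string columns can be summed to concatenate them element-wise
combined = df['date'] + df['hour']          # e.g. '1990-01-0209'

# parse the concatenated strings into datetimes
parsed = pd.to_datetime(combined, format="%Y-%m-%d%H")
```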
+ + ++++ + +Reshaping using `melt`: + +```{code-cell} ipython3 +:clear_cell: true + +data_stacked = pd.melt(data, id_vars=['date'], var_name='hour') +data_stacked.head() +``` + +Reshaping using `stack`: + +```{code-cell} ipython3 +:clear_cell: true + +# we use stack to reshape the data to move the hours (the column labels) into a column. +# But we don't want to move the 'date' column label, therefore we first set this as the index. +# You can check the difference with "data.stack()" +data_stacked = data.set_index('date').stack() +data_stacked.head() +``` + +```{code-cell} ipython3 +:clear_cell: true + +# We reset the index to have the date and hours available as columns +data_stacked = data_stacked.reset_index() +data_stacked = data_stacked.rename(columns={'level_1': 'hour'}) +data_stacked.head() +``` + +Combine date and hour: + +```{code-cell} ipython3 +:clear_cell: true + +# Now we combine the dates and the hours into a datetime, and set this as the index +data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['hour'], format="%Y-%m-%d%H") +``` + +```{code-cell} ipython3 +:clear_cell: true + +# Drop the origal date and hour columns +data_stacked = data_stacked.drop(['date', 'hour'], axis=1) +data_stacked.head() +``` + +```{code-cell} ipython3 +:clear_cell: true + +# rename the remaining column to the name of the measurement station +# (this is 0 or 'value' depending on which method was used) +data_stacked = data_stacked.rename(columns={0: 'BETR801'}) +``` + +```{code-cell} ipython3 +data_stacked.head() +``` + +Our final data is now a time series. In pandas, this means that the index is a `DatetimeIndex`: + +```{code-cell} ipython3 +data_stacked.index +``` + +```{code-cell} ipython3 +data_stacked.plot() +``` + +# Processing a collection of files + ++++ + +We now have seen the code steps to process one of the files. We have however multiple files for the different stations with the same structure. 
Therefore, to avoid repeating the actual code, let's turn the steps we have seen above into a function.

+++

+ +EXERCISE: + +
    +
  • Write a function read_airbase_file(filename, station), using the above steps to read in and process the data, and that returns a processed timeseries.
  • +
+
+ +```{code-cell} ipython3 +def read_airbase_file(filename, station): + """ + Read hourly AirBase data files. + + Parameters + ---------- + filename : string + Path to the data file. + station : string + Name of the station. + + Returns + ------- + DataFrame + Processed dataframe. + """ + + ... + + return ... +``` + +```{code-cell} ipython3 +:clear_cell: true + +def read_airbase_file(filename, station): + """ + Read hourly AirBase data files. + + Parameters + ---------- + filename : string + Path to the data file. + station : string + Name of the station. + + Returns + ------- + DataFrame + Processed dataframe. + """ + + # construct the column names + hours = ["{:02d}".format(i) for i in range(24)] + flags = ['flag' + str(i) for i in range(24)] + colnames = ['date'] + [item for pair in zip(hours, flags) for item in pair] + + # read the actual data + data = pd.read_csv(filename, sep='\t', header=None, na_values=[-999, -9999], names=colnames) + + # drop the 'flag' columns + data = data.drop([col for col in data.columns if 'flag' in col], axis=1) + + # reshape + data = data.set_index('date') + data_stacked = data.stack() + data_stacked = data_stacked.reset_index() + + # parse to datetime and remove redundant columns + data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['level_1'], format="%Y-%m-%d%H") + data_stacked = data_stacked.drop(['date', 'level_1'], axis=1) + data_stacked = data_stacked.rename(columns={0: station}) + + return data_stacked +``` + +Test the function on the data file from above: + +```{code-cell} ipython3 +import os +``` + +```{code-cell} ipython3 +filename = "../data/BETR8010000800100hour.1-1-1990.31-12-2012" +station = os.path.split(filename)[-1][:7] +``` + +```{code-cell} ipython3 +station +``` + +```{code-cell} ipython3 +:clear_cell: false + +test = read_airbase_file(filename, station) +test.head() +``` + +We now want to use this function to read in all the different data files from AirBase, and combine them into one 
DataFrame.

+++

+ +EXERCISE: + +
    +
  • Use the glob.glob function to list all 4 AirBase data files that are included in the 'data' directory, and call the result data_files.
  • +
+
+ +```{code-cell} ipython3 +:clear_cell: false + +import glob +``` + +```{code-cell} ipython3 +:clear_cell: true + +data_files = glob.glob("../data/*0008001*") +data_files +``` + +
+ +EXERCISE: + +
    +
  • Loop over the data files, read and process the file using our defined function, and append the dataframe to a list.
  • +
  • Combine the different DataFrames in the list into a single DataFrame where the different columns are the different stations. Call the result combined_data.
  • + +
+
+ +```{code-cell} ipython3 +:clear_cell: true + +dfs = [] + +for filename in data_files: + station = filename.split("/")[-1][:7] + df = read_airbase_file(filename, station) + dfs.append(df) +``` + +```{code-cell} ipython3 +:clear_cell: true + +combined_data = pd.concat(dfs, axis=1) +``` + +```{code-cell} ipython3 +combined_data.head() +``` + +Finally, we don't want to have to repeat this each time we use the data. Therefore, let's save the processed data to a csv file. + +```{code-cell} ipython3 +# let's first give the index a descriptive name +combined_data.index.name = 'datetime' +``` + +```{code-cell} ipython3 +combined_data.to_csv("../data/airbase_data_processed.csv") +``` diff --git a/_solved/pandas_01_data_structures.md b/_solved/pandas_01_data_structures.md new file mode 100644 index 0000000..fc116f3 --- /dev/null +++ b/_solved/pandas_01_data_structures.md @@ -0,0 +1,464 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

01 - Pandas: Data Structures

+ + +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + +```{code-cell} ipython3 +import pandas as pd +``` + +```{code-cell} ipython3 +import numpy as np +import matplotlib.pyplot as plt +``` + +# Introduction + ++++ + +Let's directly start with importing some data: the `titanic` dataset about the passengers of the Titanic and their survival: + +```{code-cell} ipython3 +df = pd.read_csv("../data/titanic.csv") +``` + +```{code-cell} ipython3 +df.head() +``` + +Starting from reading such a tabular dataset, Pandas provides the functionalities to answer questions about this data in a few lines of code. Let's start with a few examples as illustration: + ++++ + +
+ +
    +
  • What is the age distribution of the passengers?
  • +
+ +
+ +```{code-cell} ipython3 +df['Age'].hist() +``` + +
+ +
    +
  • How does the survival rate of the passengers differ between sexes?
  • +
+ +
+ +```{code-cell} ipython3 +df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x)) +``` + +
+ +
    +
  • Or how does the survival rate differ between the different classes of the Titanic?
  • +
+ +
+ +```{code-cell} ipython3 +df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar') +``` + +
+ +
    +
  • Are young people (e.g. < 25 years) more likely to survive than the overall passenger population?
  • +
+ +
+

```{code-cell} ipython3
df['Survived'].sum() / df['Survived'].count()
```

```{code-cell} ipython3
df25 = df[df['Age'] <= 25]
df25['Survived'].sum() / len(df25['Survived'])
```

All the needed functionality for the above examples will be explained throughout the course, but as a start: the data types to work with.

+++

# The pandas data structures: `DataFrame` and `Series`

Pandas provides two fundamental data objects, for 1D (``Series``) and 2D data (``DataFrame``).

+++

## DataFrame: 2D tabular data

+++

A `DataFrame` is a **tabular data structure** (multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects which share the same index.

+++

For the examples here, we are going to create a small DataFrame with some data about a few countries.

When creating a DataFrame manually, a common way to do this is from a dictionary of arrays or lists:

```{code-cell} ipython3
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
```

In practice, you will of course often import your data from an external source (text file, excel, database, ..), which we will see later.

Note that in the IPython notebook, the dataframe will display in a rich HTML view. 
+ ++++ + +### Attributes of the DataFrame + +The DataFrame has a built-in concept of named rows and columns, the **`index`** and **`columns`** attributes: + +```{code-cell} ipython3 +countries.index +``` + +By default, the index is the numbers *0* through *N - 1* + +```{code-cell} ipython3 +countries.columns +``` + +To check the data types of the different columns: + +```{code-cell} ipython3 +countries.dtypes +``` + +An overview of that information can be given with the `info()` method: + +```{code-cell} ipython3 +countries.info() +``` + +
+ +__NumPy__ provides + +* multi-dimensional, homogeneously typed arrays (single data type!) + +
+ +__Pandas__ provides + +* 2D, heterogeneous data structure (multiple data types!) +* labeled (named) row and column index + +
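A small sketch of this difference (toy data, just for illustration):

```{code-cell} ipython3
import numpy as np
import pandas as pd

# a NumPy array upcasts everything to one common dtype
arr = np.array([1, 'a', 2.5])

# a DataFrame keeps a separate dtype per column
df = pd.DataFrame({'ints': [1, 2], 'strings': ['a', 'b'], 'floats': [1.5, 2.5]})
```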
+ ++++ + +## One-dimensional data: `Series` (a column of a DataFrame) + +A Series is a basic holder for **one-dimensional labeled data**. It can be created much as a NumPy array is created: + +```{code-cell} ipython3 +s = pd.Series([0.1, 0.2, 0.3, 0.4]) +s +``` + +And often, you access a Series representing a column in the data, using typical `[]` indexing syntax and the column name: + +```{code-cell} ipython3 +countries['area'] +``` + +### Attributes of a Series: `index` and `values` + +The series also has an **index**, which by default is the numbers *0* through *N - 1* (but no `.columns`): + +```{code-cell} ipython3 +s.index +``` + +You can access the underlying numpy array representation with the `.values` attribute: + +```{code-cell} ipython3 +s.values +``` + +We can access series values via the index, just like for NumPy arrays: + +```{code-cell} ipython3 +s[0] +``` + +Unlike the NumPy array, though, this index can be something other than integers: + +```{code-cell} ipython3 +s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd']) +s2 +``` + +```{code-cell} ipython3 +s2['c'] +``` + +### Pandas Series versus dictionaries + ++++ + +In this way, a ``Series`` object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value. + +In fact, it's possible to construct a series directly from a Python dictionary: + +```{code-cell} ipython3 +pop_dict = {'Germany': 81.3, + 'Belgium': 11.3, + 'France': 64.3, + 'United Kingdom': 64.9, + 'Netherlands': 16.9} +population = pd.Series(pop_dict) +population +``` + +We can index the populations like a dict as expected ... + +```{code-cell} ipython3 +population['France'] +``` + +... but with the power of numpy arrays. Many things you can do with numpy arrays, can also be applied on DataFrames / Series. 
+ +Eg element-wise operations: + +```{code-cell} ipython3 +population * 1000 +``` + +## Some useful methods on these data structures + ++++ + +Exploration of the Series and DataFrame is essential (check out what you're dealing with). + +```{code-cell} ipython3 +countries.head() # Top rows +``` + +```{code-cell} ipython3 +countries.tail() # Bottom rows +``` + +The ``describe`` method computes summary statistics for each column: + +```{code-cell} ipython3 +countries.describe() +``` + +**Sort**ing your data **by** a specific column is another important first-check: + +```{code-cell} ipython3 +countries.sort_values(by='population') +``` + +The **`plot`** method can be used to quickly visualize the data in different ways: + +```{code-cell} ipython3 +countries.plot() +``` + +However, for this dataset, it does not say that much: + +```{code-cell} ipython3 +countries['population'].plot(kind='barh') +``` + +
+ +**EXERCISE**: + +* You can play with the `kind` keyword of the `plot` function in the figure above: 'line', 'bar', 'hist', 'density', 'area', 'pie', 'scatter', 'hexbin', 'box' + +
+ ++++ + +# Importing and exporting data + ++++ + +A wide range of input/output formats are natively supported by pandas: + +* CSV, text +* SQL database +* Excel +* HDF5 +* json +* html +* pickle +* sas, stata +* Parquet +* ... + +```{code-cell} ipython3 +# pd.read_ +``` + +```{code-cell} ipython3 +# countries.to_ +``` + +
+ +**Note: I/O interface** + +* All readers are `pd.read_...` +* All writers are `DataFrame.to_...` + +
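A minimal sketch of this reader/writer pairing, using an in-memory text buffer instead of a real file:

```{code-cell} ipython3
import io
import pandas as pd

df = pd.DataFrame({'country': ['Belgium', 'France'], 'population': [11.3, 64.3]})

# writer: DataFrame.to_...
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# reader: pd.read_...
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)
```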
+

+++

# Application on a real dataset

+++

Throughout the pandas notebooks, many of the exercises will use the titanic dataset. This dataset has records of all the passengers of the Titanic, with characteristics of the passengers (age, class, etc. See below), and an indication whether they survived the disaster.


The available metadata of the titanic data set provides the following information:

VARIABLE | DESCRIPTION
------ | --------
Survived | Survival (0 = No; 1 = Yes)
Pclass | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name | Name
Sex | Sex
Age | Age
SibSp | Number of Siblings/Spouses Aboard
Parch | Number of Parents/Children Aboard
Ticket | Ticket Number
Fare | Passenger Fare
Cabin | Cabin
Embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

+++

+

**EXERCISE**:

* Read the CSV file (available at `../data/titanic.csv`) into a pandas DataFrame. Call the result `df`.

+ +```{code-cell} ipython3 +:clear_cell: true + +df = pd.read_csv("../data/titanic.csv") +``` + +
+ +**EXERCISE**: + +* Quick exploration: show the first 5 rows of the DataFrame. + +
+ +```{code-cell} ipython3 +:clear_cell: true + +df.head() +``` + +
+

**EXERCISE**:

* How many records (i.e. rows) does the titanic dataset have?

**Hints**

* The length of a DataFrame gives the number of rows (`len(..)`). Alternatively, you can check the "shape" (number of rows, number of columns) of the DataFrame using the `shape` attribute.

+
+ +```{code-cell} ipython3 +:clear_cell: true + +len(df) +``` + +
+

**EXERCISE**:

* Select the 'Age' column (remember: we can use the [] indexing notation and the column label).

+ +```{code-cell} ipython3 +:clear_cell: true + +df['Age'] +``` + +
+

**EXERCISE**:

* Make a box plot of the Fare column.

+ +```{code-cell} ipython3 +:clear_cell: true + +df['Fare'].plot(kind='box') +``` + +
+

**EXERCISE**:

* Sort the rows of the DataFrame by the 'Age' column, with the oldest passenger at the top. Check the help of the `sort_values` function and find out how to sort from the largest values to the lowest values.

+ +```{code-cell} ipython3 +:clear_cell: true + +df.sort_values(by='Age', ascending=False) +``` + +--- +# Acknowledgement + + +> This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014). diff --git a/_solved/pandas_02_basic_operations.md b/_solved/pandas_02_basic_operations.md new file mode 100644 index 0000000..2cd628a --- /dev/null +++ b/_solved/pandas_02_basic_operations.md @@ -0,0 +1,302 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

02 - Pandas: Basic operations on Series and DataFrames

+ + +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + +```{code-cell} ipython3 +import pandas as pd + +import numpy as np +import matplotlib.pyplot as plt +``` + +As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes. + +```{code-cell} ipython3 +# redefining the example objects + +population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, + 'United Kingdom': 64.9, 'Netherlands': 16.9}) + +countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'], + 'population': [11.3, 64.3, 81.3, 16.9, 64.9], + 'area': [30510, 671308, 357050, 41526, 244820], + 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}) +``` + +```{code-cell} ipython3 +countries.head() +``` + +# The 'new' concepts + ++++ + +## Elementwise-operations + ++++ + +Just like with numpy arrays, many operations are element-wise: + +```{code-cell} ipython3 +population / 100 +``` + +```{code-cell} ipython3 +countries['population'] / countries['area'] +``` + +```{code-cell} ipython3 +np.log(countries['population']) +``` + +which can be added as a new column, as follows: + +```{code-cell} ipython3 +countries["log_population"] = np.log(countries['population']) +``` + +```{code-cell} ipython3 +countries.columns +``` + +```{code-cell} ipython3 +countries['population'] > 40 +``` + +
<div class="alert alert-info">

**REMEMBER**:

* When you have an operation that does NOT work element-wise, or you have no idea how to do it directly in pandas, use the **`apply()`** method
* A typical use case is applying a custom-written function or a **lambda** function

</div>
+ +```{code-cell} ipython3 +countries["capital"].apply(lambda x: len(x)) # in case you forgot the functionality: countries["capital"].str.len() +``` + +```{code-cell} ipython3 +def population_annotater(population): + """annotate as large or small""" + if population > 50: + return 'large' + else: + return 'small' +``` + +```{code-cell} ipython3 +countries["population"].apply(population_annotater) # a custom user function +``` + +## Aggregations (reductions) + ++++ + +Pandas provides a large set of **summary** functions that operate on different kinds of pandas objects (DataFrames, Series, Index) and produce single value. When applied to a DataFrame, the result is returned as a pandas Series (one value for each column). + ++++ + +The average population number: + +```{code-cell} ipython3 +population.mean() +``` + +The minimum area: + +```{code-cell} ipython3 +countries['area'].min() +``` + +For dataframes, often only the numeric columns are included in the result: + +```{code-cell} ipython3 +countries.median() +``` + +# Application on a real dataset + ++++ + +Reading in the titanic data set... + +```{code-cell} ipython3 +df = pd.read_csv("../data/titanic.csv") +``` + +Quick exploration first... + +```{code-cell} ipython3 +df.head() +``` + +```{code-cell} ipython3 +len(df) +``` + +The available metadata of the titanic data set provides the following information: + +VARIABLE | DESCRIPTION +------ | -------- +Survived | Survival (0 = No; 1 = Yes) +Pclass | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) +Name | Name +Sex | Sex +Age | Age +SibSp | Number of Siblings/Spouses Aboard +Parch | Number of Parents/Children Aboard +Ticket | Ticket Number +Fare | Passenger Fare +Cabin | Cabin +Embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) + ++++ + +
<div class="alert alert-success">

**EXERCISE**:

* What is the average age of the passengers?

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +df['Age'].mean() +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Plot the age distribution of the Titanic passengers.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +df['Age'].hist() #bins=30, log=True +``` + +
<div class="alert alert-success">

**EXERCISE**:

* What is the survival rate? (the relative number of people that survived)

Note: the 'Survived' column indicates whether someone survived (1) or not (0).

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +df['Survived'].sum() / len(df['Survived']) +``` + +```{code-cell} ipython3 +:clear_cell: true + +df['Survived'].mean() +``` + +
<div class="alert alert-success">

**EXERCISE**:

* What is the maximum Fare? And the median?

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +df['Fare'].max() +``` + +```{code-cell} ipython3 +:clear_cell: true + +df['Fare'].median() +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Calculate the 75th percentile (`quantile`) of the Fare price (Tip: look in the docstring to see how to specify the percentile).

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +df['Fare'].quantile(0.75) +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Calculate the normalized Fares (normalized relative to the mean Fare).

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +df['Fare'] / df['Fare'].mean() +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Calculate the log of the Fares, and add this as a new column ('Fare_log') to the DataFrame.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +np.log(df['Fare']) +``` + +```{code-cell} ipython3 +:clear_cell: true + +df['Fare_log'] = np.log(df['Fare']) +df.head() +``` + +# Acknowledgement + + +> This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014). + +--- diff --git a/_solved/pandas_03a_selecting_data.md b/_solved/pandas_03a_selecting_data.md new file mode 100644 index 0000000..fa397f3 --- /dev/null +++ b/_solved/pandas_03a_selecting_data.md @@ -0,0 +1,490 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

# 03 - Pandas: Indexing and selecting data - part I

+ + +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + +```{code-cell} ipython3 +import pandas as pd +``` + +```{code-cell} ipython3 +# redefining the example objects + +# series +population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, + 'United Kingdom': 64.9, 'Netherlands': 16.9}) + +# dataframe +data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'], + 'population': [11.3, 64.3, 81.3, 16.9, 64.9], + 'area': [30510, 671308, 357050, 41526, 244820], + 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']} +countries = pd.DataFrame(data) +countries +``` + +# Subsetting data + ++++ + +## Subset variables (columns) + ++++ + +For a DataFrame, basic indexing selects the columns (cfr. the dictionaries of pure python) + +Selecting a **single column**: + +```{code-cell} ipython3 +countries['area'] # single [] +``` + +Remember that the same syntax can also be used to *add* a new columns: `df['new'] = ...`. + +We can also select **multiple columns** by passing a list of column names into `[]`: + +```{code-cell} ipython3 +countries[['area', 'population']] # double [[]] +``` + +## Subset observations (rows) + ++++ + +Using `[]`, slicing or boolean indexing accesses the **rows**: + ++++ + +### Slicing + +```{code-cell} ipython3 +countries[0:4] +``` + +### Boolean indexing (filtering) + ++++ + +Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL) and comparable to numpy. + +The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed. 
```{code-cell} ipython3
countries['area'] > 100000
```

```{code-cell} ipython3
countries[countries['area'] > 100000]
```

```{code-cell} ipython3
countries[countries['population'] > 50]
```

An overview of the possible comparison operations:

Operator | Description
------ | --------
== | Equal
!= | Not equal
\> | Greater than
\>= | Greater than or equal
< | Less than
<= | Less than or equal

and to combine multiple conditions:

Operator | Description
------ | --------
& | And (`cond1 & cond2`)
\| | Or (`cond1 \| cond2`)

++++
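A side note on combining conditions: the bitwise operators `&` and `|` bind more tightly than comparisons in Python, so each condition must be wrapped in parentheses (the keywords `and`/`or` do not work on a Series). A minimal sketch with the example data from above:

```python
import pandas as pd

# same example data as above
countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820]})

# parentheses are required around each condition
large_and_populous = countries[(countries['area'] > 100000) & (countries['population'] > 50)]
print(large_and_populous['country'].tolist())  # -> ['France', 'Germany', 'United Kingdom']
```

Writing `countries['area'] > 100000 & countries['population'] > 50` without parentheses raises an error, because `&` would be evaluated first.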
<div class="alert alert-info">

**REMEMBER**:

So as a summary, `[]` provides the following convenience shortcuts:

* **Series**: selecting a **label**: `s[label]`
* **DataFrame**: selecting a single or multiple **columns**: `df['col']` or `df[['col1', 'col2']]`
* **DataFrame**: slicing or filtering the **rows**: `df['row_label1':'row_label2']` or `df[mask]`

</div>
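A compact recap of those shortcuts on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3], 'other': [4, 5, 6]}, index=['a', 'b', 'c'])

single = df['col']             # single column -> Series
subset = df[['col', 'other']]  # list of columns -> DataFrame
rows = df['a':'b']             # slice on row labels (end label included)
filtered = df[df['col'] > 1]   # boolean mask on the rows
print(list(rows.index), len(filtered))  # -> ['a', 'b'] 2
```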
+ ++++ + +## Some other useful methods: `isin` and `string` methods + ++++ + +The `isin` method of Series is very useful to select rows that may contain certain values: + +```{code-cell} ipython3 +s = countries['capital'] +``` + +```{code-cell} ipython3 +s.isin? +``` + +```{code-cell} ipython3 +s.isin(['Berlin', 'London']) +``` + +This can then be used to filter the dataframe with boolean indexing: + +```{code-cell} ipython3 +countries[countries['capital'].isin(['Berlin', 'London'])] +``` + +Let's say we want to select all data for which the capital starts with a 'B'. In Python, when having a string, we could use the `startswith` method: + +```{code-cell} ipython3 +string = 'Berlin' +``` + +```{code-cell} ipython3 +string.startswith('B') +``` + +In pandas, these are available on a Series through the `str` namespace: + +```{code-cell} ipython3 +countries['capital'].str.startswith('B') +``` + +For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling + ++++ + +# Exercises using the Titanic dataset + +```{code-cell} ipython3 +df = pd.read_csv("../data/titanic.csv") +``` + +```{code-cell} ipython3 +df.head() +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Select all rows for male passengers and calculate the mean age of those passengers. Do the same for the female passengers.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +males = df[df['Sex'] == 'male'] +``` + +```{code-cell} ipython3 +:clear_cell: true + +males['Age'].mean() +``` + +```{code-cell} ipython3 +:clear_cell: true + +df[df['Sex'] == 'female']['Age'].mean() +``` + +We will later see an easier way to calculate both averages at the same time with groupby. + ++++ + +
<div class="alert alert-success">

**EXERCISE**:

* How many passengers older than 70 were on the Titanic?

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +len(df[df['Age'] > 70]) +``` + +```{code-cell} ipython3 +:clear_cell: true + +(df['Age'] > 70).sum() +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Select the passengers that are between 30 and 40 years old.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +df[(df['Age'] > 30) & (df['Age'] <= 40)] +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Split the 'Name' column on the comma, extract the first part (the surname), and add this as a new column 'Surname'.

Tip: try it first on a single string (for this, check the `split` method of a string), and then try to 'apply' this on each row.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +df['Surname'] = df['Name'].apply(lambda x: x.split(',')[0]) +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Select all passengers that have a surname starting with 'Williams'.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +df[df['Surname'].str.startswith('Williams')] +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Select all rows for the passengers with a surname of more than 15 characters.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +df[df['Surname'].str.len() > 15] +``` + +```{code-cell} ipython3 + +``` + +# [OPTIONAL] more exercises + ++++ + +For the quick ones among you, here are some more exercises with some larger dataframe with film data. These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so all credit to him!) and the datasets he prepared for that. You can download these data from here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8) and put them in the `/data` folder. + +```{code-cell} ipython3 +cast = pd.read_csv('../data/cast.csv') +cast.head() +``` + +```{code-cell} ipython3 +titles = pd.read_csv('../data/titles.csv') +titles.head() +``` + +
<div class="alert alert-success">

**EXERCISE**:

* How many movies are listed in the titles dataframe?

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +len(titles) +``` + +
<div class="alert alert-success">

**EXERCISE**:

* What are the earliest two films listed in the titles dataframe?

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +titles.sort_values('year').head(2) +``` + +
<div class="alert alert-success">

**EXERCISE**:

* How many movies have the title "Hamlet"?

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +len(titles[titles['title'] == 'Hamlet']) +``` + +
<div class="alert alert-success">

**EXERCISE**:

* List all of the "Treasure Island" movies from earliest to most recent.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +titles[titles.title == 'Treasure Island'].sort_values('year') +``` + +
<div class="alert alert-success">

**EXERCISE**:

* How many movies were made from 1950 through 1959?

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +len(titles[(titles['year'] >= 1950) & (titles['year'] <= 1959)]) +``` + +```{code-cell} ipython3 +:clear_cell: true + +len(titles[titles['year'] // 10 == 195]) +``` + +
<div class="alert alert-success">

**EXERCISE**:

* How many roles in the movie "Inception" are NOT ranked by an "n" value?

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +inception = cast[cast['title'] == 'Inception'] +``` + +```{code-cell} ipython3 +:clear_cell: true + +len(inception[inception['n'].isnull()]) +``` + +```{code-cell} ipython3 +:clear_cell: true + +inception['n'].isnull().sum() +``` + +
<div class="alert alert-success">

**EXERCISE**:

* And how many roles in the movie "Inception" did receive an "n" value?

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +len(inception[inception['n'].notnull()]) +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Display the cast of the "Titanic" (the most famous 1997 one) in their correct "n"-value order, ignoring roles that did not earn a numeric "n" value.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +titanic = cast[(cast['title'] == 'Titanic') & (cast['year'] == 1997)] +titanic = titanic[titanic['n'].notnull()] +titanic.sort_values('n') +``` + +
<div class="alert alert-success">

**EXERCISE**:

* List the supporting roles (having n=2) played by Brad Pitt in the 1990s, in order by year.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +brad = cast[cast['name'] == 'Brad Pitt'] +brad = brad[brad['year'] // 10 == 199] +brad = brad[brad['n'] == 2] +brad.sort_values('year') +``` + +# Acknowledgement + + +> The optional exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so all credit to him!) and the datasets he prepared for that. + +--- diff --git a/_solved/pandas_03b_indexing.md b/_solved/pandas_03b_indexing.md new file mode 100644 index 0000000..717bff4 --- /dev/null +++ b/_solved/pandas_03b_indexing.md @@ -0,0 +1,357 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

# 03 - Pandas: Indexing and selecting data - part II

+ + +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + +```{code-cell} ipython3 +import pandas as pd +``` + +```{code-cell} ipython3 +# redefining the example objects + +# series +population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, + 'United Kingdom': 64.9, 'Netherlands': 16.9}) + +# dataframe +data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'], + 'population': [11.3, 64.3, 81.3, 16.9, 64.9], + 'area': [30510, 671308, 357050, 41526, 244820], + 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']} +countries = pd.DataFrame(data) +countries +``` + +
<div class="alert alert-info">

**REMEMBER**:

So as a summary, `[]` provides the following convenience shortcuts:

* **Series**: selecting a **label**: `s[label]`
* **DataFrame**: selecting a single or multiple **columns**: `df['col']` or `df[['col1', 'col2']]`
* **DataFrame**: slicing or filtering the **rows**: `df['row_label1':'row_label2']` or `df[mask]`

</div>
+ ++++ + +# Changing the DataFrame index + ++++ + +We have mostly worked with DataFrames with the default *0, 1, 2, ... N* row labels (except for the time series data). But, we can also set one of the columns as the index. + +Setting the index to the country names: + +```{code-cell} ipython3 +countries = countries.set_index('country') +countries +``` + +Reversing this operation, is `reset_index`: + +```{code-cell} ipython3 +countries.reset_index('country') +``` + +# Selecting data based on the index + ++++ + +
<div class="alert alert-danger">

**ATTENTION!**:

One of pandas' basic features is the labeling of rows and columns, but this also makes indexing a bit more complex compared to numpy.

We now have to distinguish between:

* selection by **label** (using the row and column names)
* selection by **position** (using integers)

</div>
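To see why this distinction matters, consider a small made-up Series (not part of the course data) whose labels happen to be integers in reversed order:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[2, 1, 0])

# selection by label and by position give different answers here
print(s.loc[0])   # label 0    -> 30 (the last element)
print(s.iloc[0])  # position 0 -> 10 (the first element)
```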
+ ++++ + +## Systematic indexing with `loc` and `iloc` + ++++ + +When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes: + +* `loc`: selection by label +* `iloc`: selection by position + +Both `loc` and `iloc` use the following pattern: `df.loc[ , ]`. + +This 'selection of the rows / columns' can be: a single label, a list of labels, a slice or a boolean mask. + ++++ + +Selecting a single element: + +```{code-cell} ipython3 +countries.loc['Germany', 'area'] +``` + +But the row or column indexer can also be a list, slice, boolean array (see next section), .. + +```{code-cell} ipython3 +countries.loc['France':'Germany', ['area', 'population']] +``` + +
<div class="alert alert-warning">

**NOTE**:

* Unlike slicing in numpy, the end label is **included**!

</div>
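A quick sketch contrasting both conventions on a toy Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

print(list(s.loc['a':'c']))  # -> [1, 2, 3]: the end label 'c' is included
print(list(s.iloc[0:2]))     # -> [1, 2]: the end position is excluded, as in numpy
```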
+ ++++ + +--- +Selecting by position with `iloc` works similar as **indexing numpy arrays**: + +```{code-cell} ipython3 +countries.iloc[0:2,1:3] +``` + +--- + +The different indexing methods can also be used to **assign data**: + +```{code-cell} ipython3 +countries2 = countries.copy() +countries2.loc['Belgium':'Germany', 'population'] = 10 +``` + +```{code-cell} ipython3 +countries2 +``` + +
<div class="alert alert-info">

**REMEMBER**:

Advanced indexing with **loc** and **iloc**

* **loc**: select by label: `df.loc[row_indexer, column_indexer]`
* **iloc**: select by position: `df.iloc[row_indexer, column_indexer]`

</div>
+ ++++ + +
<div class="alert alert-success">

**EXERCISE**:

* Add the population density as a column to the DataFrame.

Note: the population column is expressed in millions.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +countries['density'] = countries['population']*1000000 / countries['area'] +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Select the capital and the population column of those countries where the density is larger than 300.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +countries.loc[countries['density'] > 300, ['capital', 'population']] +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Add a column 'density_ratio' with the ratio of the population density to the average population density for all countries.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +countries['density_ratio'] = countries['density'] / countries['density'].mean() +countries +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Change the capital of the UK to Cambridge.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +countries.loc['United Kingdom', 'capital'] = 'Cambridge' +countries +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Select all countries whose population density is between 100 and 300 people/km².

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +countries[(countries['density'] > 100) & (countries['density'] < 300)] +``` + +# Alignment on the index + ++++ + +
<div class="alert alert-danger">

**WARNING**: **Alignment!** (unlike numpy)

* Pay attention to **alignment**: operations between series will align on the index:

</div>
+ +```{code-cell} ipython3 +population = countries['population'] +s1 = population[['Belgium', 'France']] +s2 = population[['France', 'Germany']] +``` + +```{code-cell} ipython3 +s1 +``` + +```{code-cell} ipython3 +s2 +``` + +```{code-cell} ipython3 +s1 + s2 +``` + +# Pitfall: chained indexing (and the 'SettingWithCopyWarning') + +```{code-cell} ipython3 +df = countries.copy() +``` + +When updating values in a DataFrame, you can run into the infamous "SettingWithCopyWarning" and issues with chained indexing. + +Assume we want to cap the population and replace all values above 50 with 50. We can do this using the basic `[]` indexing operation twice ("chained indexing"): + +```{code-cell} ipython3 +df[df['population'] > 50]['population'] = 50 +``` + +However, we get a warning, and we can also see that the original dataframe did not change: + +```{code-cell} ipython3 +df +``` + +The warning message explains that we should use `.loc[row_indexer,col_indexer] = value` instead. That is what we just learned in this notebook, so we can do: + +```{code-cell} ipython3 +df.loc[df['population'] > 50, 'population'] = 50 +``` + +And now the dataframe actually changed: + +```{code-cell} ipython3 +df +``` + +To explain *why* the original `df[df['population'] > 50]['population'] = 50` didn't work, we can do the "chained indexing" in two explicit steps: + +```{code-cell} ipython3 +temp = df[df['population'] > 50] +temp['population'] = 50 +``` + +For Python, there is no real difference between the one-liner or this two-liner. And when writing it as two lines, you can see we make a temporary, filtered dataframe (called `temp` above). So here, with `temp['population'] = 50`, we are actually updating `temp` but not the original `df`. + ++++ + +
<div class="alert alert-info">

**REMEMBER!**

What to do when encountering the *value is trying to be set on a copy of a slice from a DataFrame* warning?

* Use `loc` instead of chained indexing **if possible**!
* Or `copy` explicitly if you don't want to change the original data.

</div>
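A minimal sketch of the second option (an explicit `copy`), using made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'population': [11.3, 64.3, 81.3]})

# take an explicit copy of the filtered data: modifying it is then safe
subset = df[df['population'] > 50].copy()
subset['population'] = 50

print(subset['population'].tolist())  # -> [50, 50]
print(df['population'].tolist())      # -> [11.3, 64.3, 81.3], original unchanged
```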
+ ++++ + +# Exercises using the Titanic dataset + +```{code-cell} ipython3 +df = pd.read_csv("../data/titanic.csv") +``` + +```{code-cell} ipython3 +df.head() +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Select all rows for male passengers and calculate the mean age of those passengers. Do the same for the female passengers. Do this now using `.loc`.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +df.loc[df['Sex'] == 'male', 'Age'].mean() +``` + +```{code-cell} ipython3 +:clear_cell: true + +df.loc[df['Sex'] == 'female', 'Age'].mean() +``` + +We will later see an easier way to calculate both averages at the same time with groupby. + +```{code-cell} ipython3 + +``` diff --git a/_solved/pandas_04_time_series_data.md b/_solved/pandas_04_time_series_data.md new file mode 100644 index 0000000..97abdb9 --- /dev/null +++ b/_solved/pandas_04_time_series_data.md @@ -0,0 +1,444 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

# 04 - Pandas: Working with time series data

+ +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + +```{code-cell} ipython3 +# %matplotlib notebook +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt + +plt.style.use('ggplot') +``` + +# Introduction: `datetime` module + ++++ + +Standard Python contains the `datetime` module to handle date and time data: + +```{code-cell} ipython3 +import datetime +``` + +```{code-cell} ipython3 +dt = datetime.datetime(year=2016, month=12, day=19, hour=13, minute=30) +dt +``` + +```{code-cell} ipython3 +print(dt) # .day,... +``` + +```{code-cell} ipython3 +print(dt.strftime("%d %B %Y")) +``` + +# Dates and times in pandas + ++++ + +## The ``Timestamp`` object + ++++ + +Pandas has its own date and time objects, which are compatible with the standard `datetime` objects, but provide some more functionality to work with. + +The `Timestamp` object can also be constructed from a string: + +```{code-cell} ipython3 +ts = pd.Timestamp('2016-12-19') +ts +``` + +Like with `datetime.datetime` objects, there are several useful attributes available on the `Timestamp`. For example, we can get the month (experiment with tab completion!): + +```{code-cell} ipython3 +ts.month +``` + +There is also a `Timedelta` type, which can e.g. be used to add intervals of time: + +```{code-cell} ipython3 +ts + pd.Timedelta('5 days') +``` + +## Parsing datetime strings + ++++ + +![](http://imgs.xkcd.com/comics/iso_8601.png) + ++++ + +Unfortunately, when working with real world data, you encounter many different `datetime` formats. Most of the time when you have to deal with them, they come in text format, e.g. from a `CSV` file. To work with those data in Pandas, we first have to *parse* the strings to actual `Timestamp` objects. + ++++ + +
<div class="alert alert-info">

**REMEMBER**:

To convert string formatted dates to Timestamp objects: use the `pandas.to_datetime` function

</div>
+ +```{code-cell} ipython3 +pd.to_datetime("2016-12-09") +``` + +```{code-cell} ipython3 +pd.to_datetime("09/12/2016") +``` + +```{code-cell} ipython3 +pd.to_datetime("09/12/2016", dayfirst=True) +``` + +```{code-cell} ipython3 +pd.to_datetime("09/12/2016", format="%d/%m/%Y") +``` + +A detailed overview of how to specify the `format` string, see the table in the python documentation: https://docs.python.org/3.5/library/datetime.html#strftime-and-strptime-behavior + ++++ + +## `Timestamp` data in a Series or DataFrame column + +```{code-cell} ipython3 +s = pd.Series(['2016-12-09 10:00:00', '2016-12-09 11:00:00', '2016-12-09 12:00:00']) +``` + +```{code-cell} ipython3 +s +``` + +The `to_datetime` function can also be used to convert a full series of strings: + +```{code-cell} ipython3 +ts = pd.to_datetime(s) +``` + +```{code-cell} ipython3 +ts +``` + +Notice the data type of this series has changed: the `datetime64[ns]` dtype. This indicates that we have a series of actual datetime values. + ++++ + +The same attributes as on single `Timestamp`s are also available on a Series with datetime data, using the **`.dt`** accessor: + +```{code-cell} ipython3 +ts.dt.hour +``` + +```{code-cell} ipython3 +ts.dt.weekday +``` + +To quickly construct some regular time series data, the [``pd.date_range``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html) function comes in handy: + +```{code-cell} ipython3 +pd.Series(pd.date_range(start="2016-01-01", periods=10, freq='3H')) +``` + +# Time series data: `Timestamp` in the index + ++++ + +## River discharge example data + ++++ + +For the following demonstration of the time series functionality, we use a sample of discharge data of the Maarkebeek (Flanders) with 3 hour averaged values, derived from the [Waterinfo website](https://www.waterinfo.be/). 
+ +```{code-cell} ipython3 +data = pd.read_csv("../data/vmm_flowdata.csv") +``` + +```{code-cell} ipython3 +data.head() +``` + +We already know how to parse a date column with Pandas: + +```{code-cell} ipython3 +data['Time'] = pd.to_datetime(data['Time']) +``` + +With `set_index('datetime')`, we set the column with datetime values as the index, which can be done by both `Series` and `DataFrame`. + +```{code-cell} ipython3 +data = data.set_index("Time") +``` + +```{code-cell} ipython3 +data +``` + +The steps above are provided as built-in functionality of `read_csv`: + +```{code-cell} ipython3 +data = pd.read_csv("../data/vmm_flowdata.csv", index_col=0, parse_dates=True) +``` + +
<div class="alert alert-info">

**REMEMBER**:

`pd.read_csv` provides a lot of built-in functionality to support this kind of transformation when reading in a file! Check the help of the `read_csv` function...

</div>
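As a self-contained sketch of that behaviour (using a small inline CSV with made-up values rather than the course data file):

```python
import io
import pandas as pd

csv_text = "Time,L06_347\n2012-01-01 00:00,0.5\n2012-01-01 03:00,0.4\n"

# index_col + parse_dates replace the manual to_datetime/set_index steps
flow = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(type(flow.index).__name__)  # -> DatetimeIndex
```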
+ ++++ + +## The DatetimeIndex + ++++ + +When we ensure the DataFrame has a `DatetimeIndex`, time-series related functionality becomes available: + +```{code-cell} ipython3 +data.index +``` + +Similar to a Series with datetime data, there are some attributes of the timestamp values available: + +```{code-cell} ipython3 +data.index.day +``` + +```{code-cell} ipython3 +data.index.dayofyear +``` + +```{code-cell} ipython3 +data.index.year +``` + +The `plot` method will also adapt its labels (when you zoom in, you can see the different levels of detail of the datetime labels): + +```{code-cell} ipython3 +# %matplotlib notebook +``` + +```{code-cell} ipython3 +data.plot() +``` + +We have too much data to sensibly plot on one figure. Let's see how we can easily select part of the data or aggregate the data to other time resolutions in the next sections. + ++++ + +## Selecting data from a time series + ++++ + +We can use label based indexing on a timeseries as expected: + +```{code-cell} ipython3 +data[pd.Timestamp("2012-01-01 09:00"):pd.Timestamp("2012-01-01 19:00")] +``` + +But, for convenience, indexing a time series also works with strings: + +```{code-cell} ipython3 +data["2012-01-01 09:00":"2012-01-01 19:00"] +``` + +A nice feature is **"partial string" indexing**, where we can do implicit slicing by providing a partial datetime string. + +E.g. all data of 2013: + +```{code-cell} ipython3 +data['2013'] +``` + +Normally you would expect this to access a column named '2013', but as for a DatetimeIndex, pandas also tries to interprete it as a datetime slice. + ++++ + +Or all data of January up to March 2012: + +```{code-cell} ipython3 +data['2012-01':'2012-03'] +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Select all data starting from 2012.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +data['2012':] +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Select all data in January for all different years.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +data[data.index.month == 1] +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Select all data in April, May and June for all different years.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +data[data.index.month.isin([4, 5, 6])] +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Select all 'daytime' data (between 8h and 20h) for all days.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +data[(data.index.hour > 8) & (data.index.hour < 20)] +``` + +## The power of pandas: `resample` + ++++ + +A very powerfull method is **`resample`: converting the frequency of the time series** (e.g. from hourly to daily data). + +The time series has a frequency of 1 hour. I want to change this to daily: + +```{code-cell} ipython3 +data.resample('D').mean().head() +``` + +Other mathematical methods can also be specified: + +```{code-cell} ipython3 +data.resample('D').max().head() +``` + +
<div class="alert alert-info">

**REMEMBER**:

The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

These strings can also be combined with numbers, e.g. `'10D'`...

</div>
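For instance, combining a number with the `'D'` alias (a made-up daily series, not the flow data):

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(30), index=pd.date_range("2016-01-01", periods=30, freq="D"))

# '10D' buckets the 30 daily values into three 10-day bins
print(ts.resample("10D").count().tolist())  # -> [10, 10, 10]
```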
+ + +```{code-cell} ipython3 +data.resample('M').mean().plot() # 10D +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Plot the monthly standard deviation of the columns.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +data.resample('M').std().plot() # 'A' +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Plot the monthly mean and median values for the years 2011-2012 for 'L06_347'.

**Note**: remember the `agg` method when using `groupby` to derive multiple statistics at the same time?

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +subset = data['2011':'2012']['L06_347'] +subset.resample('M').agg(['mean', 'median']).plot() +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Plot the monthly minimum and maximum daily average value of the 'LS06_348' column.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +daily = data['LS06_348'].resample('D').mean() # daily averages calculated +``` + +```{code-cell} ipython3 +:clear_cell: true + +daily.resample('M').agg(['min', 'max']).plot() # monthly minimum and maximum values of these daily averages +``` + +
<div class="alert alert-success">

**EXERCISE**:

* Make a bar plot of the mean of the stations for the year 2013.

</div>
+ +```{code-cell} ipython3 +:clear_cell: true + +data['2013'].mean().plot(kind='barh') +``` diff --git a/_solved/pandas_05_combining_datasets.md b/_solved/pandas_05_combining_datasets.md new file mode 100644 index 0000000..d6b34d4 --- /dev/null +++ b/_solved/pandas_05_combining_datasets.md @@ -0,0 +1,209 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

# Pandas: Combining datasets Part I - concat

+ +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + +```{code-cell} ipython3 +import pandas as pd +``` + +Combining data is essential functionality in a data analysis workflow. + +Data is distributed in multiple files, different information needs to be merged, new data is calculated, .. and needs to be added together. Pandas provides various facilities for easily combining together Series and DataFrame objects + +```{code-cell} ipython3 +# redefining the example objects + +# series +population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, + 'United Kingdom': 64.9, 'Netherlands': 16.9}) + +# dataframe +data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'], + 'population': [11.3, 64.3, 81.3, 16.9, 64.9], + 'area': [30510, 671308, 357050, 41526, 244820], + 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']} +countries = pd.DataFrame(data) +countries +``` + +# Adding columns + ++++ + +As we already have seen before, adding a single column is very easy: + +```{code-cell} ipython3 +pop_density = countries['population']*1e6 / countries['area'] +``` + +```{code-cell} ipython3 +pop_density +``` + +```{code-cell} ipython3 +countries['pop_density'] = pop_density +``` + +```{code-cell} ipython3 +countries +``` + +Adding multiple columns at once is also possible. 
For example, the following method gives us a DataFrame of two columns: + +```{code-cell} ipython3 +countries["country"].str.split(" ", expand=True) +``` + +We can add both at once to the dataframe: + +```{code-cell} ipython3 +countries[['first', 'last']] = countries["country"].str.split(" ", expand=True) +``` + +```{code-cell} ipython3 +countries +``` + +# Concatenating data + ++++ + +The ``pd.concat`` function does all of the heavy lifting of combining data in different ways. + +``pd.concat`` takes a list or dict of Series/DataFrame objects and concatenates them in a certain direction (`axis`) with some configurable handling of “what to do with the other axes”. + ++++ + +## Combining rows - ``pd.concat`` + ++++ + +![](../img/schema-concat0.svg) + ++++ + +Assume we have some similar data as in `countries`, but for a set of different countries: + +```{code-cell} ipython3 +data = {'country': ['Nigeria', 'Rwanda', 'Egypt', 'Morocco', ], + 'population': [182.2, 11.3, 94.3, 34.4], + 'area': [923768, 26338 , 1010408, 710850], + 'capital': ['Abuja', 'Kigali', 'Cairo', 'Rabat']} +countries_africa = pd.DataFrame(data) +countries_africa +``` + +We now want to combine the rows of both datasets: + +```{code-cell} ipython3 +pd.concat([countries, countries_africa]) +``` + +If we don't want the index to be preserved: + +```{code-cell} ipython3 +pd.concat([countries, countries_africa], ignore_index=True) +``` + +When the two dataframes don't have the same set of columns, by default missing values get introduced: + +```{code-cell} ipython3 +pd.concat([countries, countries_africa[['country', 'capital']]], ignore_index=True) +``` + +We can also pass a dictionary of objects instead of a list of objects. 
Now the keys of the dictionary are preserved as an additional index level: + +```{code-cell} ipython3 +pd.concat({'europe': countries, 'africa': countries_africa}) +``` + +## Combining columns - ``pd.concat`` with ``axis=1`` + ++++ + +![](../img/schema-concat1.svg) + ++++ + +Assume we have another DataFrame for the same countries, but with some additional statistics: + +```{code-cell} ipython3 +data = {'country': ['Belgium', 'France', 'Netherlands'], + 'GDP': [496477, 2650823, 820726], + 'area': [8.0, 9.9, 5.7]} +country_economics = pd.DataFrame(data).set_index('country') +country_economics +``` + +```{code-cell} ipython3 +pd.concat([countries, country_economics], axis=1) +``` + +`pd.concat` matches the different objects based on the index: + +```{code-cell} ipython3 +countries2 = countries.set_index('country') +``` + +```{code-cell} ipython3 +countries2 +``` + +```{code-cell} ipython3 +pd.concat([countries2, country_economics], axis=1) +``` + +# Joining data with `pd.merge` + ++++ + +Using `pd.concat` above, we combined datasets that had the same columns or the same index values. But, another typical case if where you want to add information of second dataframe to a first one based on one of the columns. That can be done with [`pd.merge`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html). 
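Before turning to a real dataset, here is a minimal sketch (with made-up toy tables, not the course data) of how `pd.merge` matches rows on a key column, and how the `how` argument controls which keys are kept:

```python
import pandas as pd

# Two small toy tables sharing a 'key' column (hypothetical data).
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'other': [20, 30, 40]})

# how='inner' keeps only the keys present in both tables
inner = pd.merge(left, right, on='key', how='inner')

# how='left' keeps every row of `left`, filling missing matches with NaN
left_join = pd.merge(left, right, on='key', how='left')
```

With `how='left'`, the unmatched key 'a' gets a missing value in the added column.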
+ +Let's look again at the titanic passenger data, but taking a small subset of it to make the example easier to grasp: + +```{code-cell} ipython3 +df = pd.read_csv("../data/titanic.csv") +df = df.loc[:9, ['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked']] +``` + +```{code-cell} ipython3 +df +``` + +Assume we have another dataframe with more information about the 'Embarked' locations: + +```{code-cell} ipython3 +locations = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'N'], + 'City': ['Southampton', 'Cherbourg', 'Queenstown', 'New York City'], + 'Country': ['United Kindom', 'France', 'Ireland', 'United States']}) +``` + +```{code-cell} ipython3 +locations +``` + +We now want to add those columns to the titanic dataframe, for which we can use `pd.merge`, specifying the column on which we want to merge the two datasets: + +```{code-cell} ipython3 +pd.merge(df, locations, on='Embarked', how='left') +``` + +In this case we use `how='left` (a "left join") because we wanted to keep the original rows of `df` and only add matching values from `locations` to it. Other options are 'inner', 'outer' and 'right' (see the [docs](http://pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra) for more on this). diff --git a/_solved/pandas_06_groupby_operations.md b/_solved/pandas_06_groupby_operations.md new file mode 100644 index 0000000..81c4f60 --- /dev/null +++ b/_solved/pandas_06_groupby_operations.md @@ -0,0 +1,636 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

# 06 - Pandas: "Group by" operations

+ +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + +```{code-cell} ipython3 +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +plt.style.use('seaborn-whitegrid') +``` + +# Some 'theory': the groupby operation (split-apply-combine) + +```{code-cell} ipython3 +df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'], + 'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]}) +df +``` + +### Recap: aggregating functions + ++++ + +When analyzing data, you often calculate summary statistics (aggregations like the mean, max, ...). As we have seen before, we can easily calculate such a statistic for a Series or column using one of the many available methods. For example: + +```{code-cell} ipython3 +df['data'].sum() +``` + +However, in many cases your data has certain groups in it, and in that case, you may want to calculate this statistic for each of the groups. + +For example, in the above dataframe `df`, there is a column 'key' which has three possible values: 'A', 'B' and 'C'. When we want to calculate the sum for each of those groups, we could do the following: + +```{code-cell} ipython3 +for key in ['A', 'B', 'C']: + print(key, df[df['key'] == key]['data'].sum()) +``` + +This becomes very verbose when having multiple groups. You could make the above a bit easier by looping over the different values, but still, it is not very convenient to work with. + +What we did above, applying a function on different groups, is a "groupby operation", and pandas provides some convenient functionality for this. 
+ ++++ + +### Groupby: applying functions per group + ++++ + +The "group by" concept: we want to **apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets** + +This operation is also referred to as the "split-apply-combine" operation, involving the following steps: + +* **Splitting** the data into groups based on some criteria +* **Applying** a function to each group independently +* **Combining** the results into a data structure + + + +Similar to SQL `GROUP BY` + ++++ + +Instead of doing the manual filtering as above + + + df[df['key'] == "A"].sum() + df[df['key'] == "B"].sum() + ... + +pandas provides the `groupby` method to do exactly this: + +```{code-cell} ipython3 +df.groupby('key').sum() +``` + +```{code-cell} ipython3 +df.groupby('key').aggregate(np.sum) # 'sum' +``` + +And many more methods are available. + +```{code-cell} ipython3 +df.groupby('key')['data'].sum() +``` + +# Application of the groupby concept on the titanic data + ++++ + +We go back to the titanic passengers survival data: + +```{code-cell} ipython3 +df = pd.read_csv("../data/titanic.csv") +``` + +```{code-cell} ipython3 +df.head() +``` + +
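As a side note to the `aggregate` example above: `agg` also accepts a list of functions, giving one output column per aggregation. A small sketch on the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# One row per group, one column per aggregation function
summary = df.groupby('key')['data'].agg(['sum', 'mean', 'max'])
```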
**EXERCISE**:

- Using `groupby()`, calculate the average age for each sex.
+ +```{code-cell} ipython3 +:clear_cell: true + +df.groupby('Sex')['Age'].mean() +``` + +
**EXERCISE**:

- Calculate the average survival ratio for all passengers.
+ +```{code-cell} ipython3 +:clear_cell: true + +# df['Survived'].sum() / len(df['Survived']) +df['Survived'].mean() +``` + +
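The trick used in the solution (taking the `mean` of a 0/1 column to get a ratio) can be verified on a tiny made-up Series:

```python
import pandas as pd

survived = pd.Series([0, 1, 1, 0, 1])

# For a 0/1 (or boolean) column, the mean equals the fraction of ones
ratio = survived.mean()
```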
**EXERCISE**:

- Calculate this survival ratio for all passengers younger than 25 (remember: filtering/boolean indexing).
+ +```{code-cell} ipython3 +:clear_cell: true + +df25 = df[df['Age'] < 25] +df25['Survived'].sum() / len(df25['Survived']) +``` + +
**EXERCISE**:

- What is the difference in the survival ratio between the sexes?
+ +```{code-cell} ipython3 +:clear_cell: true + +df.groupby('Sex')['Survived'].mean() +``` + +
**EXERCISE**:

- Make a bar plot of the survival ratio for the different classes ('Pclass' column).
+ +```{code-cell} ipython3 +:clear_cell: true + +df.groupby('Pclass')['Survived'].mean().plot(kind='bar') #and what if you would compare the total number of survivors? +``` + +
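Following up on the comment in the solution (comparing the total number of survivors instead of ratios), a sketch on a small hypothetical frame shows the difference between `mean` and `sum` per group:

```python
import pandas as pd

# Hypothetical mini example with a 'Pclass' and a 0/1 'Survived' column
df = pd.DataFrame({'Pclass': [1, 1, 2, 2, 3, 3, 3],
                   'Survived': [1, 1, 1, 0, 1, 0, 0]})

ratio = df.groupby('Pclass')['Survived'].mean()  # fraction that survived per class
total = df.groupby('Pclass')['Survived'].sum()   # absolute number of survivors per class
```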
**EXERCISE**:

- Make a bar plot to visualize the average Fare paid by people depending on their age. The age column is divided into separate classes using the `pd.cut` function, as provided below.
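As a quick illustration of what `pd.cut` does (on made-up ages, independent of the exercise data): each value is assigned to a half-open interval `(lower, upper]` defined by the bin edges:

```python
import pandas as pd
import numpy as np

ages = pd.Series([3, 17, 25, 46, 79])

# Bin edges 0, 10, ..., 80 produce intervals (0, 10], (10, 20], ..., (70, 80]
classes = pd.cut(ages, bins=np.arange(0, 90, 10))
```

Note that values falling outside the outermost edges would become missing values.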
+ +```{code-cell} ipython3 +:clear_cell: false + +df['AgeClass'] = pd.cut(df['Age'], bins=np.arange(0,90,10)) +``` + +```{code-cell} ipython3 +:clear_cell: true + +df.groupby('AgeClass')['Fare'].mean().plot(kind='bar', rot=0) +``` + +If you are ready, more groupby exercises can be found below. + ++++ + +# Some more theory + ++++ + +## Specifying the grouper + ++++ + +In the previous example and exercises, we always grouped by a single column by passing its name. But, a column name is not the only value you can pass as the grouper in `df.groupby(grouper)`. Other possibilities for `grouper` are: + +- a list of strings (to group by multiple columns) +- a Series (similar to a string indicating a column in df) or array +- function (to be applied on the index) +- levels=[], names of levels in a MultiIndex + +```{code-cell} ipython3 +df.groupby(df['Age'] < 18)['Survived'].mean() +``` + +```{code-cell} ipython3 +df.groupby(['Pclass', 'Sex'])['Survived'].mean() +``` + +## The size of groups - value counts + ++++ + +Oftentimes you want to know how many elements there are in a certain group (or in other words: the number of occurences of the different values from a column). + +To get the size of the groups, we can use `size`: + +```{code-cell} ipython3 +df.groupby('Pclass').size() +``` + +```{code-cell} ipython3 +df.groupby('Embarked').size() +``` + +Another way to obtain such counts, is to use the Series `value_counts` method: + +```{code-cell} ipython3 +df['Embarked'].value_counts() +``` + +# [OPTIONAL] Additional exercises using the movie data + ++++ + +These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so credit to him!) and the datasets he prepared for that. You can download these data from here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8) and put them in the `/data` folder. 
+ ++++ + +`cast` dataset: different roles played by actors/actresses in films + +- title: title of the movie +- year: year it was released +- name: name of the actor/actress +- type: actor/actress +- n: the order of the role (n=1: leading role) + +```{code-cell} ipython3 +cast = pd.read_csv('../data/cast.csv') +cast.head() +``` + +`titles` dataset: + +* title: title of the movie +* year: year of release + +```{code-cell} ipython3 +titles = pd.read_csv('../data/titles.csv') +titles.head() +``` + +
**EXERCISE**:

- Using `groupby()`, plot the number of films that have been released each decade in the history of cinema.
+ +```{code-cell} ipython3 +:clear_cell: true + +titles['decade'] = titles['year'] // 10 * 10 +``` + +```{code-cell} ipython3 +:clear_cell: true + +titles.groupby('decade').size().plot(kind='bar', color='green') +``` + +
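The `// 10 * 10` idiom used in the solution floors a year to the start of its decade; in plain Python:

```python
# Integer division drops the last digit, multiplying by 10 restores the scale
years = [1999, 2003, 1987, 1990]
decades = [year // 10 * 10 for year in years]
```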
**EXERCISE**:

- Use `groupby()` to plot the number of 'Hamlet' movies made each decade.
+ +```{code-cell} ipython3 +:clear_cell: true + +titles['decade'] = titles['year'] // 10 * 10 +hamlet = titles[titles['title'] == 'Hamlet'] +hamlet.groupby('decade').size().plot(kind='bar', color="orange") +``` + +
**EXERCISE**:

- For each decade, plot the number of movies whose title contains "Hamlet".
+ +```{code-cell} ipython3 +:clear_cell: true + +titles['decade'] = titles['year'] // 10 * 10 +hamlet = titles[titles['title'].str.contains('Hamlet')] +hamlet.groupby('decade').size().plot(kind='bar', color="lightblue") +``` + +
**EXERCISE**:

- List the 10 actors/actresses that have the most leading roles (n=1) since the 1990s.
+ +```{code-cell} ipython3 +:clear_cell: true + +cast1990 = cast[cast['year'] >= 1990] +cast1990 = cast1990[cast1990['n'] == 1] +cast1990.groupby('name').size().nlargest(10) +``` + +```{code-cell} ipython3 +:clear_cell: true + +cast1990['name'].value_counts().head(10) +``` + +
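The two solutions above give the same counts: `value_counts()` is essentially a `groupby` + `size`, sorted in descending order. A tiny sketch with made-up data:

```python
import pandas as pd

names = pd.Series(['a', 'b', 'a', 'a', 'c'])

via_value_counts = names.value_counts()
via_groupby = names.groupby(names).size().sort_values(ascending=False)
```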
**EXERCISE**:

- In a previous exercise, the number of 'Hamlet' films released each decade was counted. Not all titles are exactly called 'Hamlet'. Give an overview of the titles that contain 'Hamlet' and an overview of the titles that start with 'Hamlet', each time listing the number of occurrences of each title in the data set.
+ +```{code-cell} ipython3 +:clear_cell: true + +hamlets = titles[titles['title'].str.contains('Hamlet')] +hamlets['title'].value_counts() +``` + +```{code-cell} ipython3 +:clear_cell: true + +hamlets = titles[titles['title'].str.startswith('Hamlet')] +hamlets['title'].value_counts() +``` + +
**EXERCISE**:

- List the 10 movie titles with the longest name.
+ +```{code-cell} ipython3 +:clear_cell: true + +title_longest = titles['title'].str.len().nlargest(10) +title_longest +``` + +```{code-cell} ipython3 +:clear_cell: true + +pd.options.display.max_colwidth = 210 +titles.loc[title_longest.index] +``` + +
**EXERCISE**:

- How many leading (n=1) roles were available to actors, and how many to actresses, in each year of the 1950s?
+ +```{code-cell} ipython3 +:clear_cell: true + +cast1950 = cast[cast['year'] // 10 == 195] +cast1950 = cast1950[cast1950['n'] == 1] +cast1950.groupby(['year', 'type']).size() +``` + +
**EXERCISE**:

- What are the 11 most common character names in movie history?
+ +```{code-cell} ipython3 +:clear_cell: true + +cast.character.value_counts().head(11) +``` + +
**EXERCISE**:

- Plot how many roles Brad Pitt has played in each year of his career.
+ +```{code-cell} ipython3 +:clear_cell: true + +cast[cast.name == 'Brad Pitt'].year.value_counts().sort_index().plot() +``` + +
**EXERCISE**:

- What are the 10 most occurring movie titles that start with the words 'The Life'?
+ +```{code-cell} ipython3 +:clear_cell: true + +titles[titles['title'].str.startswith('The Life')]['title'].value_counts().head(10) +``` + +
**EXERCISE**:

- Which actors or actresses were most active in the year 2010 (i.e. appeared in the most movies)?
+ +```{code-cell} ipython3 +:clear_cell: true + +cast[cast.year == 2010].name.value_counts().head(10) +``` + +
**EXERCISE**:

- Determine how many roles are listed for each of 'The Pink Panther' movies.
+ +```{code-cell} ipython3 +:clear_cell: true + +pink = cast[cast['title'] == 'The Pink Panther'] +pink.groupby(['year'])[['n']].max() +``` + +
**EXERCISE**:

- List, in order by year, each of the movies in which 'Frank Oz' has played more than 1 role.
+ +```{code-cell} ipython3 +:clear_cell: true + +oz = cast[cast['name'] == 'Frank Oz'] +oz_roles = oz.groupby(['year', 'title']).size() +oz_roles[oz_roles > 1] +``` + +
**EXERCISE**:

- List each of the characters that Frank Oz has portrayed at least twice.
+ +```{code-cell} ipython3 +:clear_cell: true + +oz = cast[cast['name'] == 'Frank Oz'] +oz_roles = oz.groupby(['character']).size() +oz_roles[oz_roles > 1].sort_values() +``` + +
**EXERCISE**:

- Add a new column to the `cast` DataFrame that indicates the number of roles for each movie. [Hint](http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation)
+ +```{code-cell} ipython3 +:clear_cell: true + +cast['n_total'] = cast.groupby('title')['n'].transform('max') # transform will return an element for each row, so the max value is given to the whole group +cast.head() +``` + +
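To see why `transform` is needed here, compare it with a plain aggregation on a toy frame: `max` per group collapses each group to a single row, while `transform('max')` broadcasts the group maximum back to every original row, so it can be assigned as a new column:

```python
import pandas as pd

df = pd.DataFrame({'title': ['X', 'X', 'Y'], 'n': [1, 2, 1]})

collapsed = df.groupby('title')['n'].max()                  # one row per group
df['n_total'] = df.groupby('title')['n'].transform('max')   # one value per original row
```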
**EXERCISE**:

- Calculate the ratio of leading actor and actress roles to the total number of leading roles per decade.

**Tip**: you can do the groupby in two steps: first calculate the counts, then the ratios.
+ +```{code-cell} ipython3 +:clear_cell: true + +leading = cast[cast['n'] == 1] +sums_decade = leading.groupby([cast['year'] // 10 * 10, 'type']).size() +sums_decade +``` + +```{code-cell} ipython3 +:clear_cell: true + +#sums_decade.groupby(level='year').transform(lambda x: x / x.sum()) +ratios_decade = sums_decade / sums_decade.groupby(level='year').transform('sum') +ratios_decade +``` + +```{code-cell} ipython3 +:clear_cell: true + +ratios_decade[:, 'actor'].plot() +ratios_decade[:, 'actress'].plot() +``` + +
**EXERCISE**:

- In which years were the most films released?
+ +```{code-cell} ipython3 +:clear_cell: true + +t = titles +t.year.value_counts().head(3) +``` + +
**EXERCISE**:

- How many leading (n=1) roles were available to actors, and how many to actresses, in the 1950s? And in the 2000s?
+ +```{code-cell} ipython3 +:clear_cell: true + +cast1950 = cast[cast['year'] // 10 == 195] +cast1950 = cast1950[cast1950['n'] == 1] +cast1950['type'].value_counts() +``` + +```{code-cell} ipython3 +:clear_cell: true + +cast2000 = cast[cast['year'] // 10 == 200] +cast2000 = cast2000[cast2000['n'] == 1] +cast2000['type'].value_counts() +``` diff --git a/_solved/pandas_07_reshaping_data.md b/_solved/pandas_07_reshaping_data.md new file mode 100644 index 0000000..9091b81 --- /dev/null +++ b/_solved/pandas_07_reshaping_data.md @@ -0,0 +1,465 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

# 07 - Pandas: Reshaping data

+ +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + +```{code-cell} ipython3 +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +``` + +# Pivoting data + ++++ + +## Cfr. excel + ++++ + +People who know Excel, probably know the **Pivot** functionality: + ++++ + +![](../img/pivot_excel.png) + ++++ + +The data of the table: + +```{code-cell} ipython3 +excelample = pd.DataFrame({'Month': ["January", "January", "January", "January", + "February", "February", "February", "February", + "March", "March", "March", "March"], + 'Category': ["Transportation", "Grocery", "Household", "Entertainment", + "Transportation", "Grocery", "Household", "Entertainment", + "Transportation", "Grocery", "Household", "Entertainment"], + 'Amount': [74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.]}) +``` + +```{code-cell} ipython3 +excelample +``` + +```{code-cell} ipython3 +excelample_pivot = excelample.pivot(index="Category", columns="Month", values="Amount") +excelample_pivot +``` + +Interested in *Grand totals*? + +```{code-cell} ipython3 +# sum columns +excelample_pivot.sum(axis=1) +``` + +```{code-cell} ipython3 +# sum rows +excelample_pivot.sum(axis=0) +``` + +## Pivot is just reordering your data: + ++++ + +Small subsample of the titanic dataset: + +```{code-cell} ipython3 +df = pd.DataFrame({'Fare': [7.25, 71.2833, 51.8625, 30.0708, 7.8542, 13.0], + 'Pclass': [3, 1, 1, 2, 3, 2], + 'Sex': ['male', 'female', 'male', 'female', 'female', 'male'], + 'Survived': [0, 1, 0, 1, 0, 1]}) +``` + +```{code-cell} ipython3 +df +``` + +```{code-cell} ipython3 +df.pivot(index='Pclass', columns='Sex', values='Fare') +``` + +```{code-cell} ipython3 +df.pivot(index='Pclass', columns='Sex', values='Survived') +``` + +So far, so good... 
+ ++++ + +Let's now use the full titanic dataset: + +```{code-cell} ipython3 +df = pd.read_csv("../data/titanic.csv") +``` + +```{code-cell} ipython3 +df.head() +``` + +And try the same pivot (*no worries about the try-except, this is here just used to catch a loooong error*): + +```{code-cell} ipython3 +try: + df.pivot(index='Sex', columns='Pclass', values='Fare') +except Exception as e: + print("Exception!", e) +``` + +This does not work, because we would end up with multiple values for one cell of the resulting frame, as the error says: `duplicated` values for the columns in the selection. As an example, consider the following rows of our three columns of interest: + +```{code-cell} ipython3 +df.loc[[1, 3], ["Sex", 'Pclass', 'Fare']] +``` + +Since `pivot` is just restructering data, where would both values of `Fare` for the same combination of `Sex` and `Pclass` need to go? + +Well, they need to be combined, according to an `aggregation` functionality, which is supported by the function`pivot_table` + ++++ + +
**NOTE**:

- Pivot is purely restructuring: a single value for each index/column combination is required.
+ ++++ + +# Pivot tables - aggregating while pivoting + +```{code-cell} ipython3 +df = pd.read_csv("../data/titanic.csv") +``` + +```{code-cell} ipython3 +df.pivot_table(index='Sex', columns='Pclass', values='Fare') +``` + +
+ +REMEMBER: + +* By default, `pivot_table` takes the **mean** of all values that would end up into one cell. However, you can also specify other aggregation functions using the `aggfunc` keyword. + +
+ +```{code-cell} ipython3 +df.pivot_table(index='Sex', columns='Pclass', + values='Fare', aggfunc='max') +``` + +```{code-cell} ipython3 +df.pivot_table(index='Sex', columns='Pclass', + values='Fare', aggfunc='count') +``` + +
**REMEMBER**:

- There is a shortcut function for a `pivot_table` with `aggfunc='count'` as aggregation: `crosstab`.
+ +```{code-cell} ipython3 +pd.crosstab(index=df['Sex'], columns=df['Pclass']) +``` + ++++ {"clear_cell": false} + +
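A related option (check the pandas documentation for your installed version): `crosstab` also accepts a `normalize` argument, which converts the counts directly into fractions. A sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male', 'male'],
                   'Pclass': [1, 1, 2, 2, 2]})

counts = pd.crosstab(index=df['Sex'], columns=df['Pclass'])

# normalize='index' divides each row by its row total
fractions = pd.crosstab(index=df['Sex'], columns=df['Pclass'], normalize='index')
```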
**EXERCISE**:

- Make a pivot table with the survival rates for Pclass vs Sex.
+ +```{code-cell} ipython3 +:clear_cell: true + +df.pivot_table(index='Pclass', columns='Sex', + values='Survived', aggfunc='mean') +``` + +```{code-cell} ipython3 +:clear_cell: true + +fig, ax1 = plt.subplots() +df.pivot_table(index='Pclass', columns='Sex', + values='Survived', aggfunc='mean').plot(kind='bar', + rot=0, + ax=ax1) +ax1.set_ylabel('Survival ratio') +``` + ++++ {"clear_cell": false} + +
**EXERCISE**:

- Make a table of the median Fare paid by underaged vs adult passengers, split by Sex.
+ +```{code-cell} ipython3 +:clear_cell: true + +df['Underaged'] = df['Age'] <= 18 +``` + +```{code-cell} ipython3 +:clear_cell: true + +df.pivot_table(index='Underaged', columns='Sex', + values='Fare', aggfunc='median') +``` + +# Melt - from pivot table to long or tidy format + ++++ + +The `melt` function performs the inverse operation of a `pivot`. This can be used to make your frame longer, i.e. to make a *tidy* version of your data. + +```{code-cell} ipython3 +pivoted = df.pivot_table(index='Sex', columns='Pclass', values='Fare').reset_index() +pivoted.columns.name = None +``` + +```{code-cell} ipython3 +pivoted +``` + +Assume we have a DataFrame like the above. The observations (the average Fare people payed) are spread over different columns. In a tidy dataset, each observation is stored in one row. To obtain this, we can use the `melt` function: + +```{code-cell} ipython3 +pd.melt(pivoted) +``` + +As you can see above, the `melt` function puts all column labels in one column, and all values in a second column. + +In this case, this is not fully what we want. We would like to keep the 'Sex' column separately: + +```{code-cell} ipython3 +pd.melt(pivoted, id_vars=['Sex']) #, var_name='Pclass', value_name='Fare') +``` + +# Reshaping with `stack` and `unstack` + ++++ + +The docs say: + +> Pivot a level of the (possibly hierarchical) column labels, returning a +DataFrame (or Series in the case of an object with a single level of +column labels) having a hierarchical index with a new inner-most level +of row labels. + +Indeed... 
+ + +Before we speak about `hierarchical index`, first check it in practice on the following dummy example: + +```{code-cell} ipython3 +df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], + 'B':['a', 'b', 'a', 'b'], + 'C':range(4)}) +df +``` + +To use `stack`/`unstack`, we need the values we want to shift from rows to columns or the other way around as the index: + +```{code-cell} ipython3 +df = df.set_index(['A', 'B']) # Indeed, you can combine two indices +df +``` + +```{code-cell} ipython3 +result = df['C'].unstack() +result +``` + +```{code-cell} ipython3 +df = result.stack().reset_index(name='C') +df +``` + +
**REMEMBER**:

- `stack`: make your data longer and smaller
- `unstack`: make your data shorter and wider
+ ++++ + +## Mimick pivot table + ++++ + +To better understand and reason about pivot tables, we can express this method as a combination of more basic steps. In short, the pivot is a convenient way of expressing the combination of a `groupby` and `stack/unstack`. + +```{code-cell} ipython3 +df = pd.read_csv("../data/titanic.csv") +``` + +```{code-cell} ipython3 +df.head() +``` + +```{code-cell} ipython3 +df.pivot_table(index='Pclass', columns='Sex', + values='Survived', aggfunc='mean') +``` + +
**EXERCISE**:

- Get the same result as above based on a combination of `groupby` and `unstack`
- First use `groupby` to calculate the survival ratio for all groups
- Then, use `unstack` to reshape the output of the groupby operation
+ +```{code-cell} ipython3 +:clear_cell: true + +df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack() +``` + +# [OPTIONAL] Exercises: use the reshaping methods with the movie data + ++++ + +These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so credit to him!) and the datasets he prepared for that. You can download these data from here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8) and put them in the `/data` folder. + +```{code-cell} ipython3 +cast = pd.read_csv('../data/cast.csv') +cast.head() +``` + +```{code-cell} ipython3 +titles = pd.read_csv('../data/titles.csv') +titles.head() +``` + +
**EXERCISE**:

- Plot the number of actor roles each year and the number of actress roles each year over the whole period of available movie data.
+ +```{code-cell} ipython3 +:clear_cell: true + +grouped = cast.groupby(['year', 'type']).size() +table = grouped.unstack('type') +table.plot() +``` + +```{code-cell} ipython3 +:clear_cell: true + +cast.pivot_table(index='year', columns='type', values="character", aggfunc='count').plot() +# for values in using the , take a column with no Nan values in order to count effectively all values -> at this stage: aha-erlebnis about crosstab function(!) +``` + +```{code-cell} ipython3 +:clear_cell: true + +pd.crosstab(index=cast['year'], columns=cast['type']).plot() +``` + +
**EXERCISE**:

- Plot the number of actor roles each year and the number of actress roles each year. Use `kind='area'` as plot type.
+ +```{code-cell} ipython3 +:clear_cell: true + +pd.crosstab(index=cast['year'], columns=cast['type']).plot(kind='area') +``` + +
**EXERCISE**:

- Plot the fraction of roles that have been 'actor' roles each year over the whole period of available movie data.
+ +```{code-cell} ipython3 +:clear_cell: true + +grouped = cast.groupby(['year', 'type']).size() +table = grouped.unstack('type') +(table['actor'] / (table['actor'] + table['actress'])).plot(ylim=[0,1]) +``` + +
**EXERCISE**:

- Define a year as a "Superman year" when films of that year feature more Superman characters than Batman characters. How many years in film history have been Superman years?
+ +```{code-cell} ipython3 +:clear_cell: true + +c = cast +c = c[(c.character == 'Superman') | (c.character == 'Batman')] +c = c.groupby(['year', 'character']).size() +c = c.unstack() +c = c.fillna(0) +c.head() +``` + +```{code-cell} ipython3 +:clear_cell: true + +d = c.Superman - c.Batman +print('Superman years:') +print(len(d[d > 0.0])) +``` diff --git a/_solved/visualization_01_matplotlib.md b/_solved/visualization_01_matplotlib.md new file mode 100644 index 0000000..b3300ff --- /dev/null +++ b/_solved/visualization_01_matplotlib.md @@ -0,0 +1,443 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

# Matplotlib: Introduction

+ + +> *DS Data manipulation, analysis and visualisation in Python* +> *December, 2019* + +> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* + +--- + ++++ + +# Matplotlib + ++++ + +[Matplotlib](http://matplotlib.org/) is a Python package used widely throughout the scientific Python community to produce high quality 2D publication graphics. It transparently supports a wide range of output formats including PNG (and other raster formats), PostScript/EPS, PDF and SVG and has interfaces for all of the major desktop GUI (graphical user interface) toolkits. It is a great package with lots of options. + +However, matplotlib is... + +> The 800-pound gorilla — and like most 800-pound gorillas, this one should probably be avoided unless you genuinely need its power, e.g., to make a **custom plot** or produce a **publication-ready** graphic. + +> (As we’ll see, when it comes to statistical visualization, the preferred tack might be: “do as much as you easily can in your convenience layer of choice [nvdr e.g. directly from Pandas, or with seaborn], and then use matplotlib for the rest.”) + +(quote used from [this](https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/) blogpost) + +And that's we mostly did, just use the `.plot` function of Pandas. So, why do we learn matplotlib? Well, for the *...then use matplotlib for the rest.*; at some point, somehow! + +Matplotlib comes with a convenience sub-package called ``pyplot`` which, for consistency with the wider matplotlib community, should always be imported as ``plt``: + +```{code-cell} ipython3 +import numpy as np +import matplotlib.pyplot as plt +``` + +## - dry stuff - The matplotlib `Figure`, `axes` and `axis` + +At the heart of **every** plot is the figure object. 
+The "Figure" object is the top-level concept which can be drawn to one of the many output formats, or simply just to screen. Any object which can be drawn in this way is known as an "Artist" in matplotlib.
+
+Let's create our first artist using pyplot, and then show it:
+
+```{code-cell} ipython3
+fig = plt.figure()
+plt.show()
+```
+
+On its own, drawing the figure artist is uninteresting and will result in an empty piece of paper (that's why we didn't see anything above).
+
+By far the most useful artist in matplotlib is the **Axes** artist. The Axes artist represents the "data space" of a typical plot; a rectangular axes (the most common case, but not the only one, e.g. polar plots) will have 2 (confusingly named) **Axis** artists with tick labels and tick marks.
+
+There is no limit on the number of Axes artists which can exist on a Figure artist. Let's go ahead and create a figure with a single Axes artist, and show it using pyplot:
+
+```{code-cell} ipython3
+ax = plt.axes()
+```
+
+Matplotlib's ``pyplot`` module makes the process of creating graphics easier by allowing us to skip some of the tedious Artist construction. For example, we did not need to manually create the Figure artist with ``plt.figure`` because it was implicit that we needed a figure when we created the Axes artist.
+
+Under the hood matplotlib still had to create a Figure artist; it's just that we didn't need to capture it in a variable. We can access the created objects with the "state" functions found in pyplot called **``gcf``** and **``gca``**.
+
++++
+
+## - essential stuff - `pyplot` versus object-based
+
++++
+
+Some example data:
+
+```{code-cell} ipython3
+x = np.linspace(0, 5, 10)
+y = x ** 2
+```
+
+Observe the following difference:
+
++++
+
+**1. pyplot style: plt...** (you will see this a lot for code online!)
+
+```{code-cell} ipython3
+plt.plot(x, y, '-')
+```
+
+**2. creating objects**
+
+```{code-cell} ipython3
+fig, ax = plt.subplots()
+ax.plot(x, y, '-')
+```
+
+Although a little bit more code is involved, the advantage is that we now have **full control** of where the plot axes are placed, and we can easily add more than one axis to the figure:
+
+```{code-cell} ipython3
+fig, ax1 = plt.subplots()
+ax1.plot(x, y, '-')
+ax1.set_ylabel('y')
+
+ax2 = fig.add_axes([0.2, 0.5, 0.4, 0.3])  # inset axes
+ax2.set_xlabel('x')
+ax2.plot(x, y*2, 'r-')
+```
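The `gcf`/`gca` state functions mentioned above are easy to try out; a minimal sketch (the data is just made up, any figure will do):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 10)

plt.plot(x, x ** 2)  # pyplot style: draws on the "current" axes (created implicitly)
ax = plt.gca()       # "get current axes": the Axes artist pyplot just drew on
fig = plt.gcf()      # "get current figure": the Figure artist that Axes lives in

ax.set_ylabel("y")   # from here on, we can continue in the object-oriented style
```

So even a script started in pyplot style can be switched to the object-oriented style halfway through.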
+**REMEMBER**:
+
+- Use the object-oriented power of Matplotlib!
+- Get yourself used to writing `fig, ax = plt.subplots()`.
+
+```{code-cell} ipython3
+fig, ax = plt.subplots()
+ax.plot(x, y, '-')
+# ...
+```
+
+## A small cheat-sheet reference for some common elements
+
+```{code-cell} ipython3
+x = np.linspace(-1, 0, 100)
+
+fig, ax = plt.subplots(figsize=(10, 7))
+
+# Adjust the created axes so that its topmost extent is 0.9 of the figure.
+fig.subplots_adjust(top=0.9)
+
+ax.plot(x, x**2, color='0.4', label='power 2')
+ax.plot(x, x**3, color='0.8', linestyle='--', label='power 3')
+
+ax.vlines(x=-0.75, ymin=0., ymax=0.8, color='0.4', linestyle='-.')
+ax.axhline(y=0.1, color='0.4', linestyle='-.')
+ax.fill_between(x=[-1, 1.1], y1=[0.65], y2=[0.75], color='0.85')
+
+fig.suptitle('Figure title', fontsize=18,
+             fontweight='bold')
+ax.set_title('Axes title', fontsize=16)
+
+ax.set_xlabel('The X axis')
+ax.set_ylabel('The Y axis $y=f(x)$', fontsize=16)
+
+ax.set_xlim(-1.0, 1.1)
+ax.set_ylim(-0.1, 1.)
+
+ax.text(0.5, 0.2, 'Text centered at (0.5, 0.2)\nin data coordinates.',
+        horizontalalignment='center', fontsize=14)
+
+ax.text(0.5, 0.5, 'Text centered at (0.5, 0.5)\nin Axes coordinates.',
+        horizontalalignment='center', fontsize=14,
+        transform=ax.transAxes, color='grey')
+
+ax.legend(loc='upper right', frameon=True, ncol=2, fontsize=14)
+```
+
+For more information on legend positioning, check [this post](http://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot) on Stack Overflow!
+
++++
+
+## I do not like the style...
+
++++
+
+**...understandable**
+
++++
+
+Matplotlib had a bad reputation in terms of its default styling, as figures created with earlier versions of Matplotlib looked very Matlab-like and were not really catchy.
+
+Since Matplotlib 2.0, this has changed: https://matplotlib.org/users/dflt_style_changes.html!
+
+However...
+> *Des goûts et des couleurs, on ne discute pas...*
+
+(check [this link](https://fr.wiktionary.org/wiki/des_go%C3%BBts_et_des_couleurs,_on_ne_discute_pas) if you're not French-speaking)
+
+To account for different tastes, Matplotlib provides a number of styles that can be used to quickly change a number of settings:
+
+```{code-cell} ipython3
+plt.style.available
+```
+
+```{code-cell} ipython3
+x = np.linspace(0, 10)
+
+with plt.style.context('seaborn'):  # 'seaborn', 'ggplot', 'bmh', 'grayscale', 'seaborn-whitegrid', 'seaborn-muted'
+    fig, ax = plt.subplots()
+    ax.plot(x, np.sin(x) + x + np.random.randn(50))
+    ax.plot(x, np.sin(x) + 0.5 * x + np.random.randn(50))
+    ax.plot(x, np.sin(x) + 2 * x + np.random.randn(50))
+```
+
+We should not start discussing colors and styles, just pick **your favorite style**!
+
+```{code-cell} ipython3
+plt.style.use('seaborn-whitegrid')
+```
+
+or go all the way and define your own custom style, see the [official documentation](https://matplotlib.org/3.1.1/tutorials/introductory/customizing.html) or [this tutorial](https://colcarroll.github.io/yourplotlib/#/).
+
++++
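Rather than a full custom style sheet, a handful of rc parameters can also be overridden locally; a small sketch (the chosen parameters are just an arbitrary example, not an official style):

```python
import matplotlib.pyplot as plt

# a hypothetical mini-style: just a dict of rc parameters
my_style = {"axes.grid": True,
            "grid.linestyle": ":",
            "figure.figsize": (8, 4)}

with plt.rc_context(my_style):   # the overrides only apply inside this block
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
```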
+**REMEMBER**:
+
+- If you just quickly want a good-looking plot, use one of the available styles (`plt.style.use('...')`).
+- Otherwise, the object-oriented way of working makes it possible to change everything!
+ ++++ + +## Interaction with Pandas + ++++ + +What we have been doing while plotting with Pandas: + +```{code-cell} ipython3 +import pandas as pd +``` + +```{code-cell} ipython3 +flowdata = pd.read_csv('../data/vmm_flowdata.csv', + index_col='Time', + parse_dates=True) +``` + +```{code-cell} ipython3 +flowdata.plot() +``` + +### Pandas versus matplotlib + ++++ + +#### Comparison 1: single plot + +```{code-cell} ipython3 +flowdata.plot(figsize=(16, 6)) # shift tab this! +``` + +Making this with matplotlib... + +```{code-cell} ipython3 +fig, ax = plt.subplots(figsize=(16, 6)) +ax.plot(flowdata) +ax.legend(["L06_347", "LS06_347", "LS06_348"]) +``` + +is still ok! + ++++ + +#### Comparison 2: with subplots + +```{code-cell} ipython3 +axs = flowdata.plot(subplots=True, sharex=True, + figsize=(16, 8), colormap='viridis', # Dark2 + fontsize=15, rot=0) +``` + +Mimicking this in matplotlib (just as a reference): + +```{code-cell} ipython3 +from matplotlib import cm +import matplotlib.dates as mdates + +colors = [cm.viridis(x) for x in np.linspace(0.0, 1.0, len(flowdata.columns))] # list comprehension to set up the colors + +fig, axs = plt.subplots(3, 1, figsize=(16, 8)) + +for ax, col, station in zip(axs, colors, flowdata.columns): + ax.plot(flowdata.index, flowdata[station], label=station, color=col) + ax.legend() + if not ax.is_last_row(): + ax.xaxis.set_ticklabels([]) + ax.xaxis.set_major_locator(mdates.YearLocator()) + else: + ax.xaxis.set_major_locator(mdates.YearLocator()) + ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y')) + ax.set_xlabel('Time') + ax.tick_params(labelsize=15) +``` + +Is already a bit harder ;-) + ++++ + +### Best of both worlds... 
+ +```{code-cell} ipython3 +fig, ax = plt.subplots() #prepare a matplotlib figure + +flowdata.plot(ax=ax) # use pandas for the plotting +``` + +```{code-cell} ipython3 +fig, ax = plt.subplots(figsize=(15, 5)) #prepare a matplotlib figure + +flowdata.plot(ax=ax) # use pandas for the plotting + +# Provide further adaptations with matplotlib: +ax.set_xlabel("") +ax.grid(which="major", linewidth='0.5', color='0.8') +fig.suptitle('Flow station time series', fontsize=15) +``` + +```{code-cell} ipython3 +fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(16, 6)) #provide with matplotlib 2 axis + +flowdata[["L06_347", "LS06_347"]].plot(ax=ax1) # plot the two timeseries of the same location on the first plot +flowdata["LS06_348"].plot(ax=ax2, color='0.2') # plot the other station on the second plot + +# further adapt with matplotlib +ax1.set_ylabel("L06_347") +ax2.set_ylabel("LS06_348") +ax2.legend() +``` + +
+**Remember**:
+
+- You can do anything with matplotlib, but at a cost... (Stack Overflow to the rescue!)
+- The preformatting of Pandas mostly provides enough flexibility for quick analysis and draft reporting. It is not meant for publication-ready figures or deep customization.
+ +If you take the time to make your perfect/spot-on/greatest-ever matplotlib-figure: Make it a reusable function! + +
+
++++
+
+An example of such a reusable function to plot data:
+
+```{code-cell} ipython3
+%%file plotter.py
+#this writes a file in your directory, check it(!)
+
+import numpy as np
+import matplotlib.pyplot as plt
+import matplotlib.dates as mdates
+
+from matplotlib import cm
+from matplotlib.ticker import MaxNLocator
+
+def vmm_station_plotter(flowdata, label="flow (m$^3$s$^{-1}$)"):
+    colors = [cm.viridis(x) for x in np.linspace(0.0, 1.0, len(flowdata.columns))]  # list comprehension to set up the color sequence
+
+    fig, axs = plt.subplots(3, 1, figsize=(16, 8))
+
+    for ax, col, station in zip(axs, colors, flowdata.columns):
+        ax.plot(flowdata.index, flowdata[station], label=station, color=col)  # this plots the data itself
+
+        ax.legend(fontsize=15)
+        ax.set_ylabel(label, size=15)
+        ax.yaxis.set_major_locator(MaxNLocator(4))  # smaller set of y-ticks for clarity
+
+        if not ax.is_last_row():  # hide the xticklabels on the non-bottom x-axes
+            ax.xaxis.set_ticklabels([])
+            ax.xaxis.set_major_locator(mdates.YearLocator())
+        else:  # yearly xticklabels on the bottom x-axis of the subplots
+            ax.xaxis.set_major_locator(mdates.YearLocator())
+            ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
+        ax.tick_params(axis='both', labelsize=15, pad=8)  # enlarge the ticklabels and increase distance to axis (otherwise overlap)
+    return fig, axs
+```
+
+```{code-cell} ipython3
+from plotter import vmm_station_plotter
+# fig, axs = vmm_station_plotter(flowdata)
+```
+
+```{code-cell} ipython3
+fig, axs = vmm_station_plotter(flowdata,
+                               label="NO$_3$ (mg/l)")
+fig.suptitle('Nitrate concentrations in the Maarkebeek', fontsize='17')
+fig.savefig('nitrate_concentration.pdf')
+```
+**NOTE**:
+
+- Let your hard work pay off, write your own custom functions!
+ ++++ + +
+**Remember**: use `fig.savefig()` to save your Figure object!
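For example (the file name and keyword arguments below are just an illustration; the output format is inferred from the extension):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# dpi controls raster resolution; bbox_inches="tight" trims surrounding whitespace
fig.savefig("my_figure.png", dpi=150, bbox_inches="tight")
```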
+ ++++ + +# Need more matplotlib inspiration? + ++++ + +For more in-depth material: +* http://www.labri.fr/perso/nrougier/teaching/matplotlib/ +* notebooks in matplotlib section: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb#4.-Visualization-with-Matplotlib +* main reference: [matplotlib homepage](http://matplotlib.org/) + ++++ + +
diff --git a/_solved/visualization_02_plotnine.md b/_solved/visualization_02_plotnine.md new file mode 100644 index 0000000..a57372e --- /dev/null +++ b/_solved/visualization_02_plotnine.md @@ -0,0 +1,509 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

+**Plotnine: Introduction**
+
+> *DS Data manipulation, analysis and visualisation in Python*
+> *December, 2019*
+
+> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*
+
+---
+
+```{code-cell} ipython3
+import pandas as pd
+```
+
+# Plotnine
+
+http://plotnine.readthedocs.io/en/stable/
+
++++
+
+* Built on top of Matplotlib, but providing
+  1. High-level functions
+  2. An implementation of the [Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448), which became famous due to the `ggplot2` R package
+  3. A syntax that is highly similar to the `ggplot2` R package
+* Works well with Pandas
+
+```{code-cell} ipython3
+import plotnine as p9
+```
+
+## Introduction
+
++++
+
+We will use the Titanic example data set:
+
+```{code-cell} ipython3
+titanic = pd.read_csv('../data/titanic.csv')
+```
+
+```{code-cell} ipython3
+titanic.head()
+```
+
+Let's consider the following question:
+>*For each class on the Titanic, how many people survived and how many died?*
+
++++
+
+Hence, we should compute the *size* of the zeros (died) and ones (survived) groups of the column `Survived`, also grouped by `Pclass`. In Pandas terminology:
+
+```{code-cell} ipython3
+survived_stat = titanic.groupby(["Pclass", "Survived"]).size().rename('count').reset_index()
+survived_stat
+# Remark: the `rename` syntax is to provide the count column a column name
+```
+
+Providing this data in a bar chart with pure Pandas is still partly supported:
+
+```{code-cell} ipython3
+survived_stat.plot(x='Survived', y='count', kind='bar')
+## A possible other way of plotting this could be using groupby again:
+#survived_stat.groupby('Pclass').plot(x='Survived', y='count', kind='bar')  # (try yourself by uncommenting)
+```
+
+but with mixed results...
+
++++
+
+Plotting libraries focussing on the **grammar of graphics** are really targeting these *grouped* plots.
For example, the plotting of the resulting counts can be expressed in the grammar of graphics: + +```{code-cell} ipython3 +(p9.ggplot(survived_stat, + p9.aes(x='Survived', y='count', fill='factor(Survived)')) + + p9.geom_bar(stat='identity', position='dodge') + + p9.facet_wrap(facets='Pclass')) +``` + +Moreover, these `count` operations are embedded in the typical Grammar of Graphics packages and we can do these operations directly on the original `titanic` data set in a single coding step: + +```{code-cell} ipython3 +(p9.ggplot(titanic, + p9.aes(x='Survived', fill='factor(Survived)')) + + p9.geom_bar(stat='count', position='dodge') + + p9.facet_wrap(facets='Pclass')) +``` + +
+**Remember**:
+
+- The Grammar of Graphics is especially suitable for these so-called tidy dataframe representations (see the section on `tidy` data at the end of this notebook).
+- plotnine is a library that supports the Grammar of Graphics.
+
++++
+
+## Building a plotnine graph
+
++++
+
+Building plots with plotnine is typically an iterative process. As illustrated in the introduction, a graph is set up by layering different elements on top of each other using the `+` operator. Putting everything together in brackets `()` provides Python-compatible syntax.
+
++++
+
+#### data
+
++++
+
+* Bind the plot to a specific data frame using the data argument:
+
+```{code-cell} ipython3
+(p9.ggplot(data=titanic))
+```
+
+We haven't defined anything else, so just an empty *figure* is available.
+
++++
+
+#### aesthetics
+
++++
+
+* Define aesthetics (**aes**) by **selecting variables** used in the plot and linking them to presentation such as plotting size, shape, color, etc. You can interpret this as: **how** the variable will influence the plotted objects/geometries:
+
++++
+
+The most important `aes` are: `x`, `y`, `alpha`, `color`, `colour`, `fill`, `linetype`, `shape`, `size` and `stroke`
+
+```{code-cell} ipython3
+(p9.ggplot(titanic,
+           p9.aes(x='factor(Pclass)', y='Fare')))
+```
+
+#### geometry
+
++++
+
+* Still nothing plotted yet, as we have to define what kind of [**geometry**](http://plotnine.readthedocs.io/en/stable/api.html#geoms) will be used for the plot. The easiest is probably using points:
+
+```{code-cell} ipython3
+(p9.ggplot(titanic,
+           p9.aes(x='factor(Pclass)', y='Fare'))
+ + p9.geom_point()
+)
+```
+**EXERCISE**:
+
+- Starting from the code of the last figure, adapt the code in such a way that the `Sex` variable defines the color of the points in the graph.
+- As both sex categories overlap, use an alternative geometry, the so-called `geom_jitter`.
+ +```{code-cell} ipython3 +:clear_cell: true + +(p9.ggplot(titanic, + p9.aes(x='factor(Pclass)', y='Fare', color='Sex')) + + p9.geom_jitter() +) +``` + +These are the basic elements to have a graph, but other elements can be added to the graph: + ++++ + +#### labels + ++++ + +* Change the [**labels**](http://plotnine.readthedocs.io/en/stable/api.html#Labels): + +```{code-cell} ipython3 +(p9.ggplot(titanic, + p9.aes(x='factor(Pclass)', y='Fare')) + + p9.geom_point() + + p9.xlab("Cabin class") +) +``` + +#### facets + ++++ + +* Use the power of `groupby` and define [**facets**](http://plotnine.readthedocs.io/en/stable/api.html#facets) to group the plot by a grouping variable: + +```{code-cell} ipython3 +(p9.ggplot(titanic, + p9.aes(x='factor(Pclass)', y='Fare')) + + p9.geom_point() + + p9.xlab("Cabin class") + + p9.facet_wrap('Sex')#, dir='v') +) +``` + +#### scales + ++++ + +* Defining [**scale**](http://plotnine.readthedocs.io/en/stable/api.html#scales) for colors, axes,... + +For example, a log-version of the y-axis could support the interpretation of the lower numbers: + +```{code-cell} ipython3 +(p9.ggplot(titanic, + p9.aes(x='factor(Pclass)', y='Fare')) + + p9.geom_point() + + p9.xlab("Cabin class") + + p9.facet_wrap('Sex') + + p9.scale_y_log10() +) +``` + +#### theme + ++++ + +* Changing [**theme** ](http://plotnine.readthedocs.io/en/stable/api.html#themes): + +```{code-cell} ipython3 +(p9.ggplot(titanic, + p9.aes(x='factor(Pclass)', y='Fare')) + + p9.geom_point() + + p9.xlab("Cabin class") + + p9.facet_wrap('Sex') + + p9.scale_y_log10() + + p9.theme_bw() +) +``` + +or changing specific [theming elements](http://plotnine.readthedocs.io/en/stable/api.html#Themeables), e.g. text size: + +```{code-cell} ipython3 +(p9.ggplot(titanic, + p9.aes(x='factor(Pclass)', y='Fare')) + + p9.geom_point() + + p9.xlab("Cabin class") + + p9.facet_wrap('Sex') + + p9.scale_y_log10() + + p9.theme_bw() + + p9.theme(text=p9.element_text(size=14)) +) +``` + +#### more... 
+ ++++ + +* adding [**statistical derivatives**](http://plotnine.readthedocs.io/en/stable/api.html#stats) +* changing the [**plot coordinate**](http://plotnine.readthedocs.io/en/stable/api.html#coordinates) system + ++++ + +
+**Remember**:
+
+- Start by defining your data, `aes` variables and a geometry.
+- Further extend your plot with `scale_*`, `theme_*`, `xlab`/`ylab` and `facet_*`.
+ ++++ + +## plotnine is built on top of Matplotlib + ++++ + +As plotnine is built on top of Matplotlib, we can still retrieve the matplotlib `figure` object from plotnine for eventual customization: + +```{code-cell} ipython3 +myplot = (p9.ggplot(titanic, + p9.aes(x='factor(Pclass)', y='Fare')) + + p9.geom_point() +) +``` + +The trick is to use the `draw()` function in plotnine: + +```{code-cell} ipython3 +my_plt_version = myplot.draw() +``` + +```{code-cell} ipython3 +my_plt_version.axes[0].set_title("Titanic fare price per cabin class") +ax2 = my_plt_version.add_axes([0.5, 0.5, 0.3, 0.3], label="ax2") +my_plt_version +``` + +
+**Remember**: similar to the Pandas handling above, we can set up a matplotlib `Figure` with plotnine. Use `draw()` and the matplotlib `Figure` is returned.
+
++++
+
+## (OPTIONAL SECTION) Some more plotnine functionalities to remember...
+
++++
+
+**Histogram**: getting the univariate distribution of `Age`
+
+```{code-cell} ipython3
+(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Age'))
+ + p9.geom_histogram(bins=30))
+```
+**EXERCISE**:
+
+- Make a histogram of the age, grouped by the `Sex` of the passengers.
+- Make sure both graphs are underneath each other instead of next to each other to enhance comparison.
+
+```{code-cell} ipython3
+:clear_cell: true
+
+(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Age'))
+ + p9.geom_histogram(bins=30)
+ + p9.facet_wrap('Sex', nrow=2)
+)
+```
+
+**boxplot/violin plot**: getting the univariate distribution of `Age` per `Sex`
+
+```{code-cell} ipython3
+(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
+ + p9.geom_boxplot())
+```
+
+Actually, a *violinplot* provides more insight into the distribution:
+
+```{code-cell} ipython3
+(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
+ + p9.geom_violin()
+)
+```
+**EXERCISE**:
+
+- Make a violin plot of the `Age` for each `Sex`.
+- Add `jitter` to the plot to see the actual data points.
+- Adjust the transparency of the jitter dots to improve readability.
+
+```{code-cell} ipython3
+:clear_cell: true
+
+(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
+ + p9.geom_violin()
+ + p9.geom_jitter(alpha=0.2)
+)
+```
+
+**regressions**
+
++++
+
+plotnine supports a number of statistical functions with the [`geom_smooth` function](http://plotnine.readthedocs.io/en/stable/generated/plotnine.stats.stat_smooth.html#plotnine.stats.stat_smooth)
+
+The available methods are:
+```
+* 'auto'       # Use loess if (n<1000), glm otherwise
+* 'lm', 'ols'  # Linear Model
+* 'wls'        # Weighted Linear Model
+* 'rlm'        # Robust Linear Model
+* 'glm'        # Generalized linear Model
+* 'gls'        # Generalized Least Squares
+* 'lowess'     # Locally Weighted Regression (simple)
+* 'loess'      # Locally Weighted Regression
+* 'mavg'       # Moving Average
+* 'gpr'        # Gaussian Process Regressor
+```
+
+Each of these functions is provided by an existing Python library and integrated in plotnine, so make sure to have these dependencies installed (read the error message!)
+
+```{code-cell} ipython3
+(p9.ggplot(titanic.dropna(subset=['Age', 'Sex', 'Fare']),
+           p9.aes(x='Fare', y='Age', color="Sex"))
+ + p9.geom_point()
+ + p9.geom_rug(alpha=0.2)
+ + p9.geom_smooth(method='lm')
+)
+```
+
+```{code-cell} ipython3
+(p9.ggplot(titanic.dropna(subset=['Age', 'Sex', 'Fare']),
+           p9.aes(x='Fare', y='Age', color="Sex"))
+ + p9.geom_point()
+ + p9.geom_rug(alpha=0.2)
+ + p9.geom_smooth(method='lm')
+ + p9.facet_wrap("Survived")
+ + p9.scale_color_brewer(type="qual")
+)
+```
+
+# Need more plotnine inspiration?
+
++++
+**Remember**: the [plotnine gallery](http://plotnine.readthedocs.io/en/stable/gallery.html) and the [great documentation](http://plotnine.readthedocs.io/en/stable/api.html) are important resources to start from!
+
++++
+
+# What is `tidy`?
+
++++
+
+If you're wondering what *tidy* data representations are, you can read the scientific paper by Hadley Wickham, http://vita.had.co.nz/papers/tidy-data.pdf.
+
+Here, we just introduce the main principle very briefly:
+
++++
+
+Compare:
+
+#### un-tidy
+
+| WWTP | Treatment A | Treatment B |
+|:------|-------------|-------------|
+| Destelbergen | 8. | 6.3 |
+| Landegem | 7.5 | 5.2 |
+| Dendermonde | 8.3 | 6.2 |
+| Eeklo | 6.5 | 7.2 |
+
+*versus*
+
+#### tidy
+
+| WWTP | Treatment | pH |
+|:------|:-------------:|:-------------:|
+| Destelbergen | A | 8. |
+| Landegem | A | 7.5 |
+| Dendermonde | A | 8.3 |
+| Eeklo | A | 6.5 |
+| Destelbergen | B | 6.3 |
+| Landegem | B | 5.2 |
+| Dendermonde | B | 6.2 |
+| Eeklo | B | 7.2 |
+
++++
+
+This is sometimes also referred to as *short* versus *long* format for a specific variable... Plotnine (and other grammar of graphics libraries) work better on `tidy` data, as it better supports `groupby`-like operations!
+
++++
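In pandas, converting between the two layouts is a one-liner in each direction; a sketch using the WWTP table above (`melt` goes un-tidy → tidy, `pivot` goes back):

```python
import pandas as pd

untidy = pd.DataFrame({
    "WWTP": ["Destelbergen", "Landegem", "Dendermonde", "Eeklo"],
    "Treatment A": [8.0, 7.5, 8.3, 6.5],
    "Treatment B": [6.3, 5.2, 6.2, 7.2],
})

# wide ("un-tidy") -> long ("tidy"): one row per (WWTP, Treatment) observation
tidy = untidy.melt(id_vars="WWTP", var_name="Treatment", value_name="pH")
tidy["Treatment"] = tidy["Treatment"].str.replace("Treatment ", "")

# and back again: long -> wide
wide_again = tidy.pivot(index="WWTP", columns="Treatment", values="pH")
```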
+**Remember**:
+
+A tidy data set is set up as follows:
+
+- Each variable forms a column and contains values.
+- Each observation forms a row.
+- Each type of observational unit forms a table.
diff --git a/_solved/visualization_03_landscape.md b/_solved/visualization_03_landscape.md new file mode 100644 index 0000000..730925f --- /dev/null +++ b/_solved/visualization_03_landscape.md @@ -0,0 +1,799 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.11.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +

+# Python's Visualization Landscape
+
+> *DS Data manipulation, analysis and visualisation in Python*
+> *December, 2019*
+
+> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*
+
+---
+
++++
+
+---
+**Remark:**
+
+The packages used in this notebook are not provided by default in the conda environment of the course. In case you want to try these features yourself, make sure to install these packages with conda.
+
+To make some of the more general plotting packages available:
+
+```
+conda install -c conda-forge bokeh plotly altair vega
+```
+
+A notice will appear about making the vega nbextension available. This can be activated with the command:
+
+```
+jupyter nbextension enable vega --py --sys-prefix
+```
+
+To use the interaction between plotly and pandas, install `cufflinks` as well:
+
+```
+pip install cufflinks --upgrade
+```
+
+To run the large data set section, additional package installations are required:
+
+```
+conda install -c bokeh datashader holoviews
+```
+---
+
++++
+
+## What have we done so far?
+
++++
+
+What we have encountered until now:
+
+* [matplotlib](https://matplotlib.org/)
+* [pandas .plot](https://pandas.pydata.org/pandas-docs/stable/visualization.html)
+* [plotnine](https://github.com/has2k1/plotnine)
+* a bit of [seaborn](https://seaborn.pydata.org/)
+
+```{code-cell} ipython3
+import numpy as np
+import pandas as pd
+
+import matplotlib.pylab as plt
+import plotnine as p9
+import seaborn as sns
+```
+
+### To 'grammar of graphics' or not to 'grammar of graphics'?
+
++++
+
+#### Introduction
+
++++
+
+There is `titanic` again...
+
+```{code-cell} ipython3
+titanic = pd.read_csv("../data/titanic.csv")
+```
+
+Pandas plot...
+
+```{code-cell} ipython3
+fig, ax = plt.subplots()
+plt.style.use('ggplot')
+survival_rate = titanic.groupby("Pclass")['Survived'].mean()
+survival_rate.plot(kind='bar', color='grey',
+                   rot=0, figsize=(6, 4), ax=ax)
+ylab = ax.set_ylabel("Survival rate")
+xlab = ax.set_xlabel("Cabin class")
+```
+
+Plotnine plot...
+
+```{code-cell} ipython3
+(p9.ggplot(titanic, p9.aes(x="factor(Pclass)",
+                           y="Survived"))  # add color/fill
+ + p9.geom_bar(stat='stat_summary', width=0.5)
+ + p9.theme(figure_size=(5, 3))
+ + p9.ylab("Survival rate")
+ + p9.xlab("Cabin class")
+)
+```
+
+An important difference is the *imperative* approach of `matplotlib` versus the *declarative* approach of `plotnine`:
+
++++
+
+| imperative | declarative |
+|------------|-------------|
+| Specify **how** something should be done | Specify **what** should be done |
+| **Manually specify** the individual plotting steps | Individual plotting steps based on **declaration** |
+| e.g. `for ax in axes: ax.plot(...` | e.g. `+ facet_wrap('my_variable')` |
+
++++
+*(seaborn lands somewhere in between)*
+
++++
+
+Which approach to use is also a matter of personal preference...
+
++++
+
+Still, take the following elements into account:
+* When your data consists of only **1 factor variable**, such as
+
+| ID | variable 1 | variable 2 | variable ... |
+|------------|-------------| ---- | ----- |
+| 1 | 0.2 | 0.8 | ... |
+| 2 | 0.3 | 0.1 | ... |
+| 3 | 0.9 | 0.6 | ... |
+| 4 | 0.1 | 0.7 | ... |
+| ... | ... | ... | ...|
+
+the added value of using a grammar of graphics approach is LOW.
+
+* When working with **timeseries data** from sensors or continuous logging, such as
+
+| datetime | station 1 | station 2 | station ... |
+|------------|-------------| ---- | ----- |
+| 2017-12-20T17:50:46Z | 0.2 | 0.8 | ... |
+| 2017-12-20T17:50:52Z | 0.3 | 0.1 | ... |
+| 2017-12-20T17:51:03Z | 0.9 | 0.6 | ... |
+| 2017-12-20T17:51:40Z | 0.1 | 0.7 | ... |
+| ... | ... | ... | ...|
+
+the added value of using a grammar of graphics approach is LOW.
+
+* When working with different experiments, different conditions, (factorial) **experimental designs**, such as
+
+| ID | substrate | addition (ml) | measured_value |
+|----|-----------| ----- | ------ |
+| 1 | Eindhoven | 0.3 | 7.2 |
+| 2 | Eindhoven | 0.6 | 6.7 |
+| 3 | Eindhoven | 0.9 | 5.2 |
+| 4 | Destelbergen | 0.3 | 7.2 |
+| 5 | Destelbergen | 0.6 | 6.8 |
+| ... | ... | ... | ...|
+
+the added value of using a grammar of graphics approach is HIGH. Represent your data [`tidy`](http://www.jeannicholashould.com/tidy-data-in-python.html) to achieve maximal benefit!
+
+**Remember**:
+
+- These packages will support you towards static, publication-quality figures in a variety of hardcopy formats.
+- In general, start with a high-level function and finish with matplotlib.
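The advice to start with a high-level function and finish with matplotlib can be as simple as the following sketch (the data is made up, standing in for e.g. the flow data used earlier):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# some made-up time series, just for illustration
flow = pd.DataFrame({"station_1": np.random.rand(30)},
                    index=pd.date_range("2019-01-01", periods=30))

ax = flow.plot(figsize=(10, 3))              # 1. high-level: pandas sets up the plot
ax.set_ylabel("discharge (m$^3$ s$^{-1}$)")  # 2. finish with plain matplotlib
ax.figure.suptitle("Draft station overview")
```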
+ ++++ + +Still... + +> *I've been wasting too much time on this one stupid detail for this publication graph* + +![](https://imgs.xkcd.com/comics/is_it_worth_the_time.png) + +```{code-cell} ipython3 +fig.savefig("my_plot_with_one_issue.pdf") +``` + +
+**Notice**:
+
+- In the end... there is still Inkscape to the rescue!
+ ++++ + +### Seaborn + +```{code-cell} ipython3 +plt.style.use('seaborn-white') +``` + +> Seaborn is a library for making attractive and **informative statistical** graphics in Python. It is built **on top of matplotlib** and tightly integrated with the PyData stack, including **support for numpy and pandas** data structures and statistical routines from scipy and statsmodels. + ++++ + +Seaborn provides a set of particularly interesting plot functions: + ++++ + +#### scatterplot matrix + ++++ + +We've already encountered the [`pairplot`](https://seaborn.pydata.org/examples/scatterplot_matrix.html), a typical quick explorative plot function + +```{code-cell} ipython3 +# the discharge data for a number of measurement stations as example +flow_data = pd.read_csv("../data/vmm_flowdata.csv", parse_dates=True, index_col=0) +flow_data = flow_data.dropna() +flow_data['year'] = flow_data.index.year +flow_data.head() +``` + +```{code-cell} ipython3 +# pairplot +sns.pairplot(flow_data, vars=["L06_347", "LS06_347", "LS06_348"], + hue='year', palette=sns.color_palette("Blues_d"), + diag_kind='kde', dropna=True) +``` + +#### heatmap + ++++ + +Let's just start from a Ghent data set: The city of Ghent provides data about migration in the different districts as open data, https://data.stad.gent/data/58 + +```{code-cell} ipython3 +district_migration = pd.read_csv("https://datatank.stad.gent/4/bevolking/wijkmigratieperduizend.csv", + sep=";", index_col=0) +district_migration.index.name = "wijk" +district_migration.head() +``` + +```{code-cell} ipython3 +# cleaning the column headers +district_migration.columns = [year[-4:] for year in district_migration.columns] +district_migration.head() +``` + +```{code-cell} ipython3 +#adding a total column +district_migration['TOTAAL'] = district_migration.sum(axis=1) +``` + +```{code-cell} ipython3 +fig, ax = plt.subplots(figsize=(10, 10)) +sns.heatmap(district_migration, annot=True, fmt=".1f", linewidths=.5, + cmap="PiYG", ax=ax, vmin=-20, 
vmax=20) +ylab = ax.set_ylabel("") +ax.set_title("Migration of Ghent districts", size=14) +``` + +#### jointplot + ++++ + +[jointplot](https://seaborn.pydata.org/generated/seaborn.jointplot.html#seaborn.jointplot) provides a very convenient function to check the combined distribution of two variables in a DataFrame (bivariate plot) + ++++ + +Using the default options on the flow_data dataset + +```{code-cell} ipython3 +g = sns.jointplot(data=flow_data, + x='LS06_347', y='LS06_348') +``` + +```{code-cell} ipython3 +g = sns.jointplot(data=flow_data, + x='LS06_347', y='LS06_348', + kind="reg", space=0) +``` + +more options, applied on the migration data set: + +```{code-cell} ipython3 +g = sns.jointplot(data=district_migration.transpose(), + x='Oud Gentbrugge', y='Nieuw Gent - UZ', + kind="kde", height=7, space=0) # kde +``` + +
+**Notice!**:
+
+- Watch out with the interpretation: these representations (`kde`, `regression`) are based on a very limited set of data points!
+
++++
+
+Adding the data points themselves provides at least this info to the user:
+
+```{code-cell} ipython3
+g = (sns.jointplot(
+        data=district_migration.transpose(),
+        x='Oud Gentbrugge', y='Nieuw Gent - UZ',
+        kind="scatter", height=7, space=0, stat_func=None,
+        marginal_kws=dict(bins=20, rug=True)
+     ).plot_joint(sns.kdeplot, zorder=0,
+                  n_levels=5, cmap='Reds'))
+g.savefig("my_great_plot.pdf")
+```
+
+#### catplot and relplot
+
++++
+
+With [catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html) and [relplot](https://seaborn.pydata.org/generated/seaborn.relplot.html#seaborn.relplot), Seaborn provides similarities with the Grammar of Graphics
+
+```{code-cell} ipython3
+sns.catplot(data=titanic, x="Survived",
+            col="Pclass", kind="count")
+```
+
+**Remember**: check the package galleries for inspiration!
+ ++++ + +## Interactivity and the web matter these days! + ++++ + +### Bokeh + ++++ + +> *[Bokeh](https://bokeh.pydata.org/en/latest/) is a Python interactive visualization library that targets modern web browsers for presentation* + +```{code-cell} ipython3 +from bokeh.plotting import figure, output_file, show +``` + +By default, Bokeh will open a new webpage to plot the figure. Still, the **integration with notebooks** is provided as well: + +```{code-cell} ipython3 +from bokeh.io import output_notebook +``` + +```{code-cell} ipython3 +output_notebook() +``` + +```{code-cell} ipython3 +p = figure() +p.line(x=[1, 2, 3], y=[4,6,2]) +show(p) +``` + +
+**Notice!**:
+
+- Bokeh does not support eps or pdf export of plots directly. Exporting to svg is available but still limited, see the documentation.
+
++++
+
+To accommodate the users of **Pandas**, a `pd.DataFrame` can also be used as the input for a Bokeh plot:
+
+```{code-cell} ipython3
+from bokeh.models import ColumnDataSource
+source_data = ColumnDataSource(data=flow_data)
+```
+
+```{code-cell} ipython3
+flow_data.head()
+```
+
+Useful to know when you want to use the index as well:
+> *If the DataFrame has a named index column, then CDS will also have a column with this name. However, if the index name (or any subname of a MultiIndex) is None, then the CDS will have a column generically named index for the index.*
+
+```{code-cell} ipython3
+p = figure(x_axis_type="datetime", plot_height=300, plot_width=900)
+p.line(x='Time', y='L06_347', source=source_data)
+show(p)
+```
+
+The graph is set up by adding new elements to the figure object, e.g. adding annotations:
+
+```{code-cell} ipython3
+from bokeh.models import ColumnDataSource, BoxAnnotation, Label
+```
+
+```{code-cell} ipython3
+p = figure(x_axis_type="datetime", plot_height=300, plot_width=900)
+p.line(x='Time', y='L06_347', source=source_data)
+p.circle(x='Time', y='L06_347', source=source_data, fill_alpha=0.3, line_alpha=0.3)
+
+alarm_box = BoxAnnotation(bottom=10, fill_alpha=0.3,
+                          fill_color='#ff6666')  # arbitrary value; this is NOT the real-case value
+p.add_layout(alarm_box)
+
+alarm_label = Label(text="Flood risk", x_units='screen',
+                    x=10, y=10, text_color="#330000")
+p.add_layout(alarm_label)
+
+show(p)
+```
+
+Also [this `jointplot`](https://demo.bokehplots.com/apps/selection_histogram) and [this gapminder reproduction](https://demo.bokehplots.com/apps/gapminder) are based on Bokeh!
+
++++
+
+**More Bokeh?**
+
+
++++
+
+### Plotly
+
++++
+
+> [plotly.py](https://plot.ly/python/) is an interactive, browser-based graphing library for Python
+
+```{code-cell} ipython3
+import plotly
+```
+
+In recent years, plotly has developed rapidly and now provides extensive functionality for interactive plotting, see https://plot.ly/python/#fundamentals. It consists of two main components: __plotly__ provides all the basic components (the so-called `plotly.graph_objects`) to create plots, and __plotly express__ provides a more high-level wrapper around `plotly.graph_objects` for rapid data exploration and figure generation. The latter focuses on a _tidy_ data representation.
+
+As an example, create a histogram using the plotly `graph_objects`:
+
+```{code-cell} ipython3
+import plotly.graph_objects as go
+
+fig = go.Figure(data=[go.Histogram(x=titanic['Fare'].values)])
+fig.show()
+```
+
+The same can be done with plotly express, which supports direct interaction with a Pandas DataFrame:
+
+```{code-cell} ipython3
+import plotly.express as px
+
+fig = px.histogram(titanic, x="Fare")
+fig.show()
+```
+
+**Notice!**
+
+- Prior versions of plotly.py contained functionality for creating figures in both "online" and "offline" modes. Version 4 of plotly is "offline"-only. Make sure you check the latest documentation and watch out for outdated Stack Overflow suggestions. The previous commercial/online version is rebranded into chart studio.
+
+
++++
+
+As mentioned in the example, the interaction of plotly with Pandas is supported:
+
++++
+
+1. Indirectly, by using the `plotly` specific [dictionary](https://plot.ly/python/creating-and-updating-figures/#figures-as-dictionaries) syntax:
+
+```{code-cell} ipython3
+import plotly.graph_objects as go
+
+df = flow_data[["L06_347", "LS06_348"]]
+
+fig = go.Figure({
+    "data": [{'x': df.index,
+              'y': df[col],
+              'name': col} for col in df.columns],  # remark, we use a list comprehension here ;-)
+    "layout": {"title": {"text": "Streamflow data"}}
+})
+fig.show()
+```
+
+2. or using the `plotly` object-oriented approach with [graph objects](https://plot.ly/python/creating-and-updating-figures/#figures-as-graph-objects):
+
+```{code-cell} ipython3
+df = flow_data[["L06_347", "LS06_348"]]
+
+fig = go.Figure()
+
+for col in df.columns:
+    fig.add_trace(go.Scatter(
+        x=df.index,
+        y=df[col],
+        name=col))
+
+fig.layout = go.Layout(
+    title=go.layout.Title(text="Streamflow data")
+)
+fig.show()
+```
+
+3. or using the `plotly express` functionalities:
+
+```{code-cell} ipython3
+df = flow_data[["L06_347", "LS06_348"]].reset_index()  # reset index, as plotly express cannot use the index directly
+df = df.melt(id_vars="Time")  # from wide to long format
+df.head()
+```
+
+As mentioned, plotly express targets __tidy__ data (cf. `plotnine`, ...), so we converted the data to tidy/long format before plotting:
+
+```{code-cell} ipython3
+import plotly.express as px
+
+fig = px.line(df, x='Time', y='value', color="variable", title="Streamflow data")
+fig.show()
+```
+
+4.
or by installing an additional package, `cufflinks`, which enables Pandas plotting with `iplot` instead of `plot`:
+
+```{code-cell} ipython3
+import cufflinks as cf
+
+df = flow_data[["L06_347", "LS06_348"]]
+fig = df.iplot(kind='scatter', asFigure=True)
+fig.show()
+```
+
+`cufflinks` applied to the data set of district migration:
+
+```{code-cell} ipython3
+district_migration.transpose().iplot(kind='box', asFigure=True).show()
+```
+
+
+**Plotly**
+
+- Check the package gallery for plot examples.
+- Plotly express provides high level plotting functionalities and plotly graph objects the low level components.
+- More information about the cufflinks connection with Pandas is available here.
+
+ ++++ + +
+
+**For R users...**
+
+Both plotly and Bokeh provide interactivity (sliders, ...), but are not the full equivalent of [`Rshiny`](https://shiny.rstudio.com/).
+Functionality similar to Rshiny is provided by [`dash`](https://plot.ly/products/dash/), created by the same company as plotly.
+
+
++++
+
+## You like web development and JavaScript?
+
++++
+
+### Altair
+
+> *[Altair](https://altair-viz.github.io/) is a declarative statistical visualization library for Python, based on Vega-Lite.*
+
+```{code-cell} ipython3
+import altair as alt
+```
+
+Reconsider the titanic example from the start of this notebook:
+
+```{code-cell} ipython3
+fig, ax = plt.subplots()
+plt.style.use('ggplot')
+survival_rate = titanic.groupby("Pclass")['Survived'].mean()
+survival_rate.plot(kind='bar', color='grey',
+                   rot=0, figsize=(6, 4), ax=ax)
+ylab = ax.set_ylabel("Survival rate")
+xlab = ax.set_xlabel("Cabin class")
+```
+
+Translating this to `Altair` syntax:
+
+```{code-cell} ipython3
+alt.Chart(titanic).mark_bar().encode(
+    x=alt.X('Pclass:O', axis=alt.Axis(title='Cabin class')),
+    y=alt.Y('mean(Survived):Q',
+            axis=alt.Axis(format='%',
+                          title='survival_rate'))
+)
+```
+
+Similar to the `aesthetic` in `plotnine`, the influence of a variable on the plot building can be `encoded`:
+
+```{code-cell} ipython3
+alt.Chart(titanic).mark_bar().encode(
+    x=alt.X('Pclass:O', axis=alt.Axis(title='Cabin class')),
+    y=alt.Y('mean(Survived):Q',
+            axis=alt.Axis(format='%',
+                          title='survival_rate')),
+    column="Sex"
+)
+```
+
+The typical ingredients of the **grammar of graphics** are available again:
+
+```{code-cell} ipython3
+(alt.Chart(titanic)  # Link with the data
+ .mark_circle().encode(  # defining a geometry
+    x="Fare:Q",  # provide aesthetics by linking variables to channels
+    y="Age:Q",
+    column="Pclass:O",
+    color="Sex:N",
+))
+# scales,...
can be adjusted as well +``` + +For information on this `...:Q`, `...:N`,`...:O`, see the [data type section](https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types) of the documentation: + +Data Type | Shorthand Code | Description +----------|-----------------|--------------- +quantitative | Q | a continuous real-valued quantity +ordinal | O | a discrete ordered quantity +nominal | N | a discrete unordered category +temporal | T | a time or date value + ++++ + +
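The shorthand in the table above is simply the column name plus a one-letter type code separated by a colon. A tiny helper to illustrate the mapping (an illustrative sketch only, not Altair's actual shorthand parser):

```python
# Illustrative helper only -- Altair has its own shorthand parsing.
TYPE_CODES = {"Q": "quantitative", "O": "ordinal",
              "N": "nominal", "T": "temporal"}

def parse_shorthand(shorthand):
    """Split 'field:CODE' into (field, full type name)."""
    field, _, code = shorthand.partition(":")
    return field, TYPE_CODES[code]

print(parse_shorthand("Pclass:O"))  # ('Pclass', 'ordinal')
```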
+
+**Remember**
+
+- Altair provides a pure-Python Grammar of Graphics implementation!
+- Altair is built on top of the Vega-Lite visualization grammar, which can be interpreted as a language to specify a graph (from data to figure).
+- Altair easily integrates with web-technology (HTML/Javascript).
+
+
++++
+
+## Your data sets are HUGE?
+
++++
+
+When you're working with a lot of records, the visualization of the individual points does not always make sense as there are simply too many dots overlapping each other (check [this](https://bokeh.github.io/datashader-docs/user_guide/1_Plotting_Pitfalls.html) notebook for a more detailed explanation).
+
++++
+
+Consider the open data set:
+> Bird tracking - GPS tracking of Lesser Black-backed Gulls and Herring Gulls breeding at the southern North Sea coast https://www.gbif.org/dataset/83e20573-f7dd-4852-9159-21566e1e691e with > 8e6 records
+
++++
+
+Working with such a data set on a local machine is not straightforward anymore, as this data set will consume a lot of memory to be handled by the default plotting libraries. Moreover, visualizing every single dot is not useful anymore at coarser zoom levels.
+
++++
+
+The package [datashader](https://bokeh.github.io/datashader-docs/index.html) provides a solution for data sets of this size and works together with other packages such as `bokeh` and `holoviews`.
+
++++
+
+We download just a single year (e.g. 2018) of data from [the gull data set](zenodo.org/record/3541812#.XfZYcNko-V6) and store it in the `data` folder. The 2018 data file has around 4.8 million records.
+ +```{code-cell} ipython3 +import pandas as pd, holoviews as hv +from colorcet import fire +from datashader.geo import lnglat_to_meters +from holoviews.element.tiles import EsriImagery +from holoviews.operation.datashader import rasterize, shade + +df = pd.read_csv('../data/HG_OOSTENDE-gps-2018.csv', usecols=['location-long', 'location-lat']) +df.columns = ['longitude', 'latitude'] +df.loc[:,'longitude'], df.loc[:,'latitude'] = lnglat_to_meters(df.longitude, df.latitude) +``` + +```{code-cell} ipython3 +hv.extension('bokeh') + +map_tiles = EsriImagery().opts(alpha=1.0, width=800, height=800, bgcolor='black') +points = hv.Points(df, ['longitude', 'latitude']) +rasterized = shade(rasterize(points, x_sampling=1, y_sampling=1, width=800, height=800), cmap=fire) + +map_tiles * rasterized +``` + +
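Independently of the plotting step, pandas itself can keep the memory footprint bounded when loading a file of this size by reading it in chunks. A small self-contained sketch (the column names mimic the gull data set, but the data here are synthetic):

```python
import io
import pandas as pd

# Stand-in for a multi-million-record tracking file
# (column names as in the gull data set, values synthetic).
csv_file = io.StringIO(
    "location-long,location-lat\n" + "3.1,51.2\n" * 10
)

# Only one chunk lives in memory at a time.
n_rows = 0
for chunk in pd.read_csv(csv_file, usecols=['location-long', 'location-lat'],
                         chunksize=4):
    n_rows += len(chunk)

print(n_rows)  # 10
```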
+
+**When not to use datashader**
+
+- Plotting less than 1e5 or 1e6 data points
+- When every datapoint matters; standard Bokeh will render all of them
+- For full interactivity (hover tools) with every datapoint
+
+**When to use datashader**
+
+- Actual big data; when Bokeh/Matplotlib have trouble
+- When the distribution matters more than individual points
+- When you find yourself sampling or binning to better understand the distribution
+
+([source](http://nbviewer.jupyter.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/A2%20-%20Visualizing%20Big%20Data%20with%20Datashader.ipynb))
+
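The "distribution matters more than individual points" idea comes down to counting points on a fixed pixel grid, which is in essence what `rasterize` does before shading. The core of that aggregation can be sketched with plain numpy (synthetic data, not the gull data set):

```python
import numpy as np

# 100,000 synthetic points; plotting every dot would mostly overplot.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)

# Aggregate onto a fixed "pixel" grid -- the core idea behind rasterize().
counts, xedges, yedges = np.histogram2d(x, y, bins=(80, 80))

print(counts.shape)       # (80, 80)
print(int(counts.sum()))  # 100000
```

The resulting 2D array of counts, not the raw points, is what gets colour-mapped to the screen.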
+ ++++ + +
+
+**More alternatives for large data set visualisation that are worthwhile exploring:**
+
+- vaex, which also provides on-the-fly binning or aggregating of the data on a grid to be represented.
+- Glumpy and Vispy, which both rely on OpenGL to achieve high performance.
+
+
++++
+
+## You want to dive deeper into Python viz?
+
++++
+
+For an overview of the status of Python visualisation packages and tools, have a look at the [pyviz](https://pyviz.org) website.
+
+```{code-cell} ipython3
+from IPython.display import Image
+Image('https://raw.githubusercontent.com/rougier/python-visualization-landscape/master/landscape.png')
+```
+
+or check the interactive version [here](https://rougier.github.io/python-visualization-landscape/landscape-colors.html).
+
++++
+
+Further reading: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833
+
++++
+
+## Acknowledgements
+
+
+https://speakerdeck.com/jakevdp/pythons-visualization-landscape-pycon-2017
diff --git a/_solved/workflow_example_evaluation.md b/_solved/workflow_example_evaluation.md
new file mode 100644
index 0000000..eb51a7e
--- /dev/null
+++ b/_solved/workflow_example_evaluation.md
@@ -0,0 +1,364 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.13
+    jupytext_version: 1.11.1
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+---
+
+```{code-cell} ipython3
+%matplotlib inline
+```
+
+```{code-cell} ipython3
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+```
+
+# Phase 1: testing/munging/...
(notebook `.ipynb`)
+
+```{code-cell} ipython3
+data = pd.read_csv("../data/vmm_flowdata.csv", parse_dates=True, index_col=0).dropna()
+```
+
+```{code-cell} ipython3
+data.head()
+```
+
+## Implementing a model evaluation criterion
+
++++
+
+Root mean squared error (**numpy** based) - testing the function
+
+```{code-cell} ipython3
+modelled = data["L06_347"].values
+observed = data["LS06_347"].values
+```
+
+```{code-cell} ipython3
+residuals = observed - modelled
+```
+
+```{code-cell} ipython3
+np.sqrt((residuals**2).mean())
+```
+
+Converting this to a small function, to easily reuse the code - **[add docstring(!)](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html)**:
+
+```{code-cell} ipython3
+def root_mean_square_error(observed, modelled):
+    '''
+    Root Mean Square Error (RMSE)
+
+    Parameters
+    -----------
+    observed : np.ndarray or pd.DataFrame
+        observed/measured values of the variable
+    modelled : np.ndarray or pd.DataFrame
+        simulated values of the variable
+
+    Notes
+    -------
+    * range: [0, inf]
+    * optimum: 0
+    '''
+    residuals = observed - modelled
+    return np.sqrt((residuals**2).mean())
+```
+
+Test the created function:
+
+```{code-cell} ipython3
+root_mean_square_error(data["L06_347"].values, data["LS06_347"].values)
+```
+
+```{code-cell} ipython3
+#root_mean_square_error()  # remove the comment, SHIFT-TAB inside the brackets and see your own docstring
+```
+
+Very brief basic/minimal setup of a docstring:
+
+    '''very brief one-line function description
+
+    A more extended description about the function...
+    ...which can take multiple lines if required
+
+    Parameters
+    -----------
+    inputname1 : dtype of inputname1
+        description of the first input
+    inputname2 : dtype of inputname2
+        description of the second input
+    ...
+
+    Returns
+    -------
+    out1 : dtype of output
+        description of the first output
+    ...
+
+    Notes
+    -------
+    Some information about your function,...
+ ''' + ++++ + +## Making a plot function + ++++ + +When making the plot, I still want the degrees of freedom to change the colors, linewidt,.. of the figure when using my figure: + ++++ + +Compare: + +```{code-cell} ipython3 +fig, axs = plt.subplots() +axs.scatter(data["L06_347"].values, data["LS06_347"].values, + color="#992600", s=50, edgecolor='white') +axs.set_aspect('equal') +``` + +with: + +```{code-cell} ipython3 +fig, axs = plt.subplots() +axs.scatter(data["L06_347"].values, data["LS06_347"].values, + color="#009999", s=150, edgecolor='0.3') +axs.set_aspect('equal') +``` + +When making a plot function, you want to keep this flexibility. + +Some options: +* use the `args, kwargs` construction, which provides the option to pipe a flexible amount of inputs from your function input towards the plot function +* Adapt everything on a `ax` object in order to make result further adaptable afterwards (# you don't have to return the ax, but you actually can) + +```{code-cell} ipython3 +def dummy_plot_wrapper(ax, *args, **kwargs): + """small example function to illustrate some plot concepts""" + x = np.linspace(1, 5, 30) + ax.plot(x, x**2, *args, **kwargs) +``` + +With this setup, you have the following degrees of freedom: + ++++ + +

+- without usage of additional arguments, but adapting the `ax` object further outside the function:
+
+```{code-cell} ipython3
+fig, ax = plt.subplots()
+dummy_plot_wrapper(ax)
+ax.set_ylabel('Putting the label should not \nbe inside my custom function')
+```
+
+Working on the `ax` object inside a function also provides the flexibility to use the same function (or two functions) to fill different subplots of matplotlib:
+
+```{code-cell} ipython3
+fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
+dummy_plot_wrapper(ax1, 'r-')
+dummy_plot_wrapper(ax2, 'b--')
+```
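A side benefit of drawing on the `ax` object that is passed in: the result can be inspected programmatically afterwards, which is handy for testing plot functions. A minimal sketch (the wrapper is repeated so the snippet is self-contained):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; fine for programmatic checks
import matplotlib.pyplot as plt
import numpy as np

def dummy_plot_wrapper(ax, *args, **kwargs):
    """small example function to illustrate some plot concepts"""
    x = np.linspace(1, 5, 30)
    ax.plot(x, x**2, *args, **kwargs)

fig, ax = plt.subplots()
dummy_plot_wrapper(ax, color="red")

# The drawn data can be read back from the ax object:
line = ax.lines[0]
print(len(line.get_xdata()))  # 30
```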

+- adding additional style features by providing additional arguments:

+
++++
+
+(As we pipe the arguments to the `plot()` function of matplotlib, the available additional arguments are the plot options of matplotlib itself: http://matplotlib.org/api/lines_api.html#matplotlib.lines.Line2D)
+
+```{code-cell} ipython3
+fig, ax = plt.subplots()
+dummy_plot_wrapper(ax, linestyle='--', linewidth=3, color="#990000")
+ax.set_ylabel('Putting the label should not \nbe inside my custom function')
+```
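Stripped of matplotlib, the piping of `*args, **kwargs` is plain Python. A minimal sketch with hypothetical stand-in functions:

```python
def receiver(**kwargs):
    # Stand-in for matplotlib's plot(): just report what it was given.
    return kwargs

def wrapper(*args, **kwargs):
    # Everything the caller passes is piped through untouched.
    return receiver(**kwargs)

print(wrapper(linewidth=3, color="#990000"))
# {'linewidth': 3, 'color': '#990000'}
```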

+- adding additional style features by providing additional arguments and adapting the graph afterwards:

+
+```{code-cell} ipython3
+fig, ax = plt.subplots(figsize=(10, 6))
+dummy_plot_wrapper(ax, linewidth=2, color="#67a9cf",
+                   marker='o', linestyle='--',
+                   markeredgecolor='#ef8a62',
+                   markersize=15,
+                   markeredgewidth=2)
+
+# removing the spines of the graph afterwards
+for key, spine in ax.spines.items():
+    spine.set_visible(False)
+ax.yaxis.set_ticks_position('left')
+ax.xaxis.set_ticks_position('bottom')
+```
+
+If you use an option frequently after plotting a graph (but maybe not always), it can be useful to add it as a named argument to your function:
+
+```{code-cell} ipython3
+def dummy_plot_wrapper(ax, remove_spines=None, *args, **kwargs):
+    """small example function to illustrate some plot concepts
+
+    Parameters
+    ----------
+    ax : plt.ax object
+        an axis to put the data on
+    remove_spines : None | list of 'left', 'bottom', 'right', 'top'
+        will remove the spines according to the defined sides inside the list
+    *args, **kwargs :
+        arguments passed to the 2D line plot of matplotlib
+    """
+    x = np.linspace(1, 5, 30)
+    ax.plot(x, x**2, *args, **kwargs)
+
+    if remove_spines and isinstance(remove_spines, list):
+        for key, spine in ax.spines.items():
+            if key in remove_spines:
+                spine.set_visible(False)
+```
+
+So, we added this flexibility to our own graph:
+
+```{code-cell} ipython3
+fig, ax = plt.subplots()
+dummy_plot_wrapper(ax, linewidth=2, color="#67a9cf")  # no information about removing spines, just as before -> default is used
+```
+
+```{code-cell} ipython3
+fig, ax = plt.subplots()
+dummy_plot_wrapper(ax, remove_spines=['right', 'top'],
+                   linewidth=2, color="#67a9cf")
+```
+
+# Phase 2: I've got something useful here...
+
++++
+
+When satisfied with the function's behavior, move it to a Python (`.py`) file...
+
++++
+
+## Writing the useful elements into a function (towards a Python file `.py`)
+
++++
+
+Check the file [spreaddiagram](spreaddiagram.py) as an example...
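The content of `spreaddiagram.py` is not reproduced in this document. As a purely hypothetical illustration of what one of its metric helpers could look like (an assumption modelled on the RMSE function above; the actual file may differ):

```python
import numpy as np

def bias(observed, modelled):
    """Mean error (bias) between observed and modelled values.

    Hypothetical sketch of the `bias` helper imported from
    spreaddiagram.py further down -- the actual file may differ.

    Notes
    -----
    * range: [-inf, inf]
    * optimum: 0
    """
    residuals = observed - modelled
    return residuals.mean()

obs = np.array([1.0, 2.0, 3.0])
mod = np.array([1.5, 2.5, 3.5])
print(bias(obs, mod))  # -0.5
```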
+
+**Some advice:**
+
+* Keep the functionalities small:
+    * A single function has a single task
+    * Keep the number of lines restricted (< 50 lines), unless you have good reasons
+* Write [**docstrings**](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html)(!)
+* Make your function more flexible with arguments and **named arguments**
+
++++
+
+## Using your function for real (anywhere: new notebooks `.ipynb`, new `.py` files)
+
++++
+
+Loading the custom functions:
+
+```{code-cell} ipython3
+from spreaddiagram import spread_diagram, bias, root_mean_square_error
+```
+
+Using the new function:
+
+```{code-cell} ipython3
+fig, ax = plt.subplots(figsize=(8, 8))
+
+spread_diagram(ax, data["L06_347"].values,
+               data["LS06_347"].values,
+               infobox=False,
+               color="#67a9cf",
+               s=40)
+ax.set_ylabel("Modelled", fontsize=15)
+ax.set_xlabel("Observed", fontsize=15)
+ax.spines["right"].set_visible(False)
+ax.spines["top"].set_visible(False)
+ax.yaxis.set_ticks_position('left')
+ax.xaxis.set_ticks_position('bottom')
+```
+
+**Remark**: when you have to select colors, have a look at http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3
+
++++
+
+On many occasions, the story will end here and you will further use/adapt the function...
+
+**Advice:** use [version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control) and keep track of your changes! This is not only for IT specialists; with [Github desktop](https://desktop.github.com/), version control becomes accessible to everyone.
+
+However, sometimes you need further adaptation:
+
++++
+
+# Phase 3 (optional): It is a recurrent task (towards cmd/bash functionality)
+
++++
+
+When using this on a regular basis (e.g. you frequently get output text files from a model), it is worthwhile to make the same functionality available outside Python as well (as a command line function or inside bash scripts)!
+
++++
+
+A minimal working template:
+
+```{code-cell} ipython3
+%%file puretest.py
+
+import sys
+
+
+def main(argv=None):
+    # first argument argv[0] is always the python file name itself
+    print('Working on the', argv[0], 'file, with the argument', argv[1])
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv))
+```
+
+**Want to dive into the command line options?**
+* example with more advanced arguments: https://github.com/inbo/inbo-pyutils/blob/master/gbif/gbif_name_match/gbif_species_name_match.py
+* pure Python standard library support for argument parsing: https://docs.python.org/3/library/argparse.html
+* library for more advanced support (easy creation of a cmd interface): http://click.pocoo.org/5/
+
++++
+
+# Phase 4 (optional): You need more python power (towards python package)
+
++++
+
+* When working together with other people on the code,
+* when requiring more advanced management of the code,
+* when installation on new machines should be easier,
+* when you want to make your code installable by others,
+* ...
+
++++
+
+**Create a package from your code...**
+
++++
+
+As an example: https://github.com/inbo/data-validator
+
++++
+
+* Actually, it is not much more than a set of files in a folder accompanied by a `setup.py` file
+* register on [pypi](https://pypi.python.org/pypi) and people can install your code with: `pip install your_awesome_package_name`
+* Take advantage of **unit testing**, **code coverage**, ... the enlightening path of code development!
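The `argparse` module linked above removes the manual `sys.argv` handling from the minimal template. A small sketch with a hypothetical command line interface (the argument names are made up for illustration):

```python
import argparse

def build_parser():
    # Hypothetical interface: one input file and an optional output name.
    parser = argparse.ArgumentParser(description="Evaluate model output.")
    parser.add_argument("inputfile", help="model output text file")
    parser.add_argument("--output", default="report.txt",
                        help="file to write the evaluation report to")
    return parser

# parse_args accepts an explicit list, which makes the parser easy to test:
args = build_parser().parse_args(["run1.txt", "--output", "run1_report.txt"])
print(args.inputfile, args.output)  # run1.txt run1_report.txt
```

Calling the script without arguments would print a usage message and exit, which is exactly the behaviour the manual `sys.argv` template lacks.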
From 532ddac8f02cb576d96d36ec9d70d8102620cd58 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Fri, 28 May 2021 18:13:47 +0200 Subject: [PATCH 2/2] Add version with edits from Stijn --- _solved/case1_bike_count.md | 306 +++++---- ...se3_bacterial_resistance_lab_experiment.md | 266 ++++---- _solved/case4_air_quality_analysis.md | 43 +- _solved/pandas_03a_selecting_data.md | 13 +- _solved/pandas_04_time_series_data.md | 8 +- _solved/pandas_06_groupby_operations.md | 14 +- _solved/pandas_07_reshaping_data.md | 22 +- _solved/visualization_01_matplotlib.md | 66 +- _solved/visualization_03_landscape.md | 617 ++++++++---------- 9 files changed, 703 insertions(+), 652 deletions(-) diff --git a/_solved/case1_bike_count.md b/_solved/case1_bike_count.md index 4bb9333..56bd79a 100644 --- a/_solved/case1_bike_count.md +++ b/_solved/case1_bike_count.md @@ -14,73 +14,67 @@ kernelspec:

CASE - Bike count data

-> *DS Data manipulation, analysis and visualisation in Python* -> *December, 2019* - -> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* +> *DS Data manipulation, analysis and visualization in Python* +> *May/June, 2021* +> +> *© 2021, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* --- +++ - + +++ -In this case study, we will make use of the freely available bike count data of the city of Ghent. At the Coupure Links, next to the Faculty of Bioscience Engineering, a counter keeps track of the number of passing cyclists in both directions. - -Those data are available on the open data portal of the city: https://data.stad.gent/data/236 +In this case study, we will make use of the openly available bike count data of the city of Ghent (Belgium). At the Coupure Links, next to the Faculty of Bioscience Engineering, a counter keeps track of the number of passing cyclists in both directions. ```{code-cell} ipython3 import pandas as pd import matplotlib.pyplot as plt plt.style.use('seaborn-whitegrid') - -%matplotlib notebook ``` -## Reading and processing the data +# Reading and processing the data +++ -### Read csv data from URL +## Read csv data +++ -The data are avaible in CSV, JSON and XML format. We will make use of the CSV data. The link to download the data can be found on the webpage. For the first dataset, this is: - - link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv" - -A limit defines the size of the requested data set, by adding a limit parameter `limit` to the URL : - - link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv?limit=100000" +The data were previously available on the open data portal of the city, and we downloaded them in the `CSV` format, and provided the original file as `data/fietstellingencoupure.csv`. 
-Those datasets contain the historical data of the bike counters, and consist of the following columns: +This data set contains the historical data of the bike counters, and consists of the following columns: - The first column `datum` is the date, in `dd/mm/yy` format - The second column `tijd` is the time of the day, in `hh:mm` format - The third and fourth column `ri Centrum` and `ri Mariakerke` are the counts at that point in time (counts between this timestamp and the previous) -```{code-cell} ipython3 -limit = 200000 -link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv?limit={}".format(limit) -``` ++++
- EXERCISE: -
    -
  • Read the csv file from the url into a DataFrame `df`, the delimiter of the data is `;`
  • -
  • Inspect the first and last 5 rows, and check the number of observations
  • -
  • Inspect the data types of the different columns
  • -
+
+**EXERCISE**
+
+- Read the csv file `data/fietstellingencoupure.csv` into a DataFrame `df`, the delimiter of the data is `;`
+- Inspect the first and last 5 rows, and check the number of observations
+- Inspect the data types of the different columns
+
Hints + +- With the cursor on a function, you can combine the SHIFT + TAB keystrokes to see the documentation of a function. +- Both the `sep` and `delimiter` argument will work to define the delimiter. +- Methods like `head`/`tail` have round brackets `()`, attributes like `dtypes` not. +
+
```{code-cell} ipython3 :clear_cell: true -df = pd.read_csv(link, sep=';') +df = pd.read_csv("data/fietstellingencoupure.csv", sep=';') ``` ```{code-cell} ipython3 @@ -107,35 +101,34 @@ len(df) df.dtypes ``` -
- - Remark: If the download is very slow, consider to reset the limit variable to a lower value as most execises will just work with the first 100000 records as well. - -
+## Data processing +++ -### Data processing +As explained above, the first and second column (respectively `datum` and `tijd`) indicate the date and hour of the day. To obtain a time series, we have to combine those two columns into one series of actual `datetime` values. +++ -As explained above, the first and second column (respectively `datum` and `tijd`) indicate the date and hour of the day. To obtain a time series, we have to combine those two columns into one series of actual datetime values. +
-+++ +**EXERCISE** -
+Pre-process the data: - EXERCISE: Preprocess the data +* Combine the 'datum' and 'tijd' columns into one Pandas Series of string `datetime` values, call this new variable `combined`. +* Parse the string `datetime` values to `datetime` objects. +* Set the resulting `datetime` column as the index of the `df` DataFrame. +* Remove the original 'datum' and 'tijd' columns using the `drop` method, and call the new dataframe `df2`. +* Rename the columns in the DataFrame 'ri Centrum', 'ri Mariakerke' to resp. 'direction_centre', 'direction_mariakerke' using the `rename` method. -
    -
  • Combine the 'datum' and 'tijd' columns into one Series of string datetime values (Hint: concatenating strings can be done with the addition operation)
  • -
  • Parse the string datetime values (Hint: specifying the format will make this a lot faster)
  • -
  • Set the resulting dates as the index
  • -
  • Remove the original 'tijd' and 'tijd' columns (Hint: check the drop method)
  • -
  • Rename the 'ri Centrum', 'ri Mariakerke' to 'direction_centre', 'direction_mariakerke' (Hint: check the rename function)
  • -
+
Hints -
+- Concatenating strings can be done with the addition operation `+`. +- When converting strings to a `datetime` with `pd.to_datetime`, specifying the format will make the conversion a lot faster. +- `drop` can remove both rows and columns using the names of the index or column name. Make sure to define `columns=` argument to remove columns. +- `rename` can be used for both rows/columns. It needs a dictionary with the current names as keys and the new names as values. + + ```{code-cell} ipython3 :clear_cell: true @@ -153,20 +146,21 @@ df.index = pd.to_datetime(combined, format="%d/%m/%Y %H:%M") ```{code-cell} ipython3 :clear_cell: true -df = df.drop(columns=['datum', 'tijd']) +df2 = df.drop(columns=['datum', 'tijd']) ``` ```{code-cell} ipython3 :clear_cell: true -df = df.rename(columns={'ri Centrum': 'direction_centre', 'ri Mariakerke':'direction_mariakerke'}) +df2 = df2.rename(columns={'ri Centrum': 'direction_centre', + 'ri Mariakerke':'direction_mariakerke'}) ``` ```{code-cell} ipython3 -df.head() +df2.head() ``` -Having the data available with an interpreted datetime, provides us the possibility of having time aware plotting: +Having the data available with an interpreted `datetime`, provides us the possibility of having time aware plotting: ```{code-cell} ipython3 fig, ax = plt.subplots(figsize=(10, 6)) @@ -183,11 +177,15 @@ df.plot(colormap='coolwarm', ax=ax) When we just want to interpret the dates, without specifying how the dates are formatted, Pandas makes an attempt as good as possible: +```{code-cell} ipython3 +combined = df['datum'] + ' ' + df['tijd'] +``` + ```{code-cell} ipython3 %timeit -n 1 -r 1 pd.to_datetime(combined, dayfirst=True) ``` -However, when we already know the format of the dates (and if this is consistent throughout the full dataset), we can use this information to interpret the dates: +However, when we already know the format of the dates (and if this is consistent throughout the full data set), we can use this information to 
interpret the dates: ```{code-cell} ipython3 %timeit pd.to_datetime(combined, format="%d/%m/%Y %H:%M") @@ -195,7 +193,7 @@ However, when we already know the format of the dates (and if this is consistent
- Remember: Whenever possible, specify the date format to interpret the dates to datetime values! + Remember: Whenever possible, specify the date format to interpret the dates to `datetime` values!
@@ -203,44 +201,61 @@ However, when we already know the format of the dates (and if this is consistent ### Write the data set cleaning as a function -In order to make it easier to reuse the code for the preprocessing we have now implemented, let's convert the code to a Python function +In order to make it easier to reuse the code for the pre-processing we have implemented, let's convert the code to a Python function: +++
-EXERCISE: +**EXERCISE** -
    -
  • Write a function process_bike_count_data(df) that performs the processing steps as done above for an input DataFrame and return the updated DataFrame
  • -
+Write a function `process_bike_count_data(df)` that performs the processing steps as done above for an input Pandas DataFrame and returns the updated DataFrame. -
+
Hints + +- Want to know more about proper documenting your Python functions? Check out the official guide of [numpydoc](https://numpydoc.readthedocs.io/en/latest/format.html). The `Parameters` and `Returns` sections should always be explained. + +
```{code-cell} ipython3 :clear_cell: true def process_bike_count_data(df): - """ - Process the provided dataframe: parse datetimes and rename columns. + """Process the provided dataframe: parse datetimes and rename columns. + Parameters + ---------- + df : pandas.DataFrame + DataFrame as read from the raw `fietstellingen`, + containing the `datum`, `tijd`, `ri Centrum` + and `ri Mariakerke` columns. + + Returns + ------- + df2 : pandas.DataFrame + DataFrame with the datetime info as index and the + `direction_centre` and `direction_mariakerke` columns + with the counts. """ - df.index = pd.to_datetime(df['datum'] + ' ' + df['tijd'], format="%d/%m/%Y %H:%M") - df = df.drop(columns=['datum', 'tijd']) - df = df.rename(columns={'ri Centrum': 'direction_centre', 'ri Mariakerke':'direction_mariakerke'}) - return df + df.index = pd.to_datetime(df['datum'] + ' ' + df['tijd'], + format="%d/%m/%Y %H:%M") + df2 = df.drop(columns=['datum', 'tijd']) + df2 = df2.rename(columns={'ri Centrum': 'direction_centre', + 'ri Mariakerke':'direction_mariakerke'}) + return df2 ``` ```{code-cell} ipython3 -df_raw = pd.read_csv(link, sep=';') +df_raw = pd.read_csv("data/fietstellingencoupure.csv", sep=';') df_preprocessed = process_bike_count_data(df_raw) +df_preprocessed.head() ``` -### Store our collected dataset as an interim data product +### Store our collected data set as an interim data product +++ -As we finished our data-collection step, we want to save this result as a interim data output of our small investigation. As such, we do not have to re-download all the files each time something went wrong, but can restart from our interim step. +As we finished our data-collection step, we want to save this result as an interim data output of our small investigation. As such, we do not have to re-download all the files each time something went wrong, but can restart from our interim step. 
```{code-cell} ipython3
df_preprocessed.to_csv("bike_count_interim.csv")
@@ -250,7 +265,7 @@ df_preprocessed.to_csv("bike_count_interim.csv")

+++

-We now have a cleaned-up dataset of the bike counts at Coupure Links. Next, we want to get an impression of the characteristics and properties of the data
+We now have a cleaned-up data set of the bike counts at Coupure Links in Ghent (Belgium). Next, we want to get an impression of the characteristics and properties of the data

+++

@@ -268,20 +283,19 @@ df = pd.read_csv("bike_count_interim.csv", index_col=0, parse_dates=True)

+++

-The number of bikers are counted for intervals of approximately 15 minutes. But let's check if this is indeed the case.
-For this, we want to calculate the difference between each of the consecutive values of the index. We can use the `Series.diff()` method:
+The number of bikers is counted for intervals of approximately 15 minutes. But let's check if this is indeed the case. Calculate the difference between each of the consecutive values of the index. We can use the `Series.diff()` method:

```{code-cell} ipython3
pd.Series(df.index).diff()
```

-Again, the count of the possible intervals is of interest:
+The count of the possible intervals is of interest:

```{code-cell} ipython3
pd.Series(df.index).diff().value_counts()
```

-There are a few records that is not exactly 15min. But given it are only a few ones, we will ignore this for the current case study and just keep them as such for this explorative study.
+There are a few records that are not exactly 15 min. As these are only a few, we will ignore them for the current case study and just keep them for this explorative study.

Bonus question: do you know where the values of `-1 days +23:15:01` and `01:15:00` are coming from?

@@ -295,33 +309,35 @@ df.describe()
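As a standalone illustration of the interval check above, `Series.diff()` followed by `value_counts()` exposes irregular gaps in any datetime index (a toy index here, not the course data):

```python
import pandas as pd

# Synthetic 15-minute index with one irregular 30-minute gap
# (invented stand-in for the bike-count index).
idx = pd.date_range("2014-01-01 00:00", periods=5, freq="15min").append(
    pd.DatetimeIndex(["2014-01-01 01:30"])
)

# Differences between consecutive index values reveal the irregular interval
deltas = pd.Series(idx).diff()
counts = deltas.value_counts()
print(counts)
```

Any interval other than the expected `0 days 00:15:00` shows up as its own row in the output.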
-EXERCISE: +**EXERCISE** -
    -
  • Create a new Series, df_both which contains the sum of the counts of both directions
  • -
+Create a new Pandas Series `df_both` which contains the sum of the counts of both directions. -
+
Hints -_Tip:_ check the purpose of the `axis` argument of the `sum` function +- Check the purpose of the `axis` argument of the `sum` method. -
+ ```{code-cell} ipython3 :clear_cell: true df_both = df.sum(axis=1) +df_both ```
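A minimal sketch of what the `axis` argument changes for `sum` (invented numbers, column names mimicking the bike-count data):

```python
import pandas as pd

df = pd.DataFrame({"direction_centre": [1, 2, 3],
                   "direction_mariakerke": [10, 20, 30]})

col_totals = df.sum(axis=0)  # collapse the rows: one total per column
row_totals = df.sum(axis=1)  # collapse the columns: one total per row
print(row_totals.tolist())
```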
-EXERCISE: +**EXERCISE** -
    -
  • Using the df_both from the previous exercise, create a new Series df_quiet which contains only those intervals for which less than 5 cyclists passed in both directions combined
  • -
+Using the `df_both` from the previous exercise, create a new Series `df_quiet` which contains only those intervals for which less than 5 cyclists passed in both directions combined -
+
Hints + +- Use the `[]` to select data. You can use conditions (so-called _boolean indexing_) returning True/False inside the brackets. + +
+ ```{code-cell} ipython3 :clear_cell: true @@ -331,13 +347,17 @@ df_quiet = df_both[df_both < 5]
-EXERCISE: +**EXERCISE** -
    -
  • Using the original data, select only the intervals for which less than 3 cyclists passed in one or the other direction. Hence, less than 3 cyclists towards the centre or less than 3 cyclists towards Mariakerke.
  • -
+Using the original data `df`, select only the intervals for which less than 3 cyclists passed in one or the other direction. Hence, less than 3 cyclists towards the center or less than 3 cyclists towards Mariakerke. -
+
Hints + +- To combine conditions use the `|` (or) or the `&` (and) operators. +- Make sure to use `()` around each individual condition. + +
+ ```{code-cell} ipython3 :clear_cell: true @@ -351,13 +371,16 @@ df[(df['direction_centre'] < 3) | (df['direction_mariakerke'] < 3)]
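The hint about `()` around each condition can be sketched on toy counts; `&` and `|` bind tighter than `<`, so the parentheses are not optional:

```python
import pandas as pd

df = pd.DataFrame({"direction_centre": [0, 5, 2, 8],
                   "direction_mariakerke": [1, 0, 9, 7]})

# "or": quiet in at least one direction; "and": quiet in both directions
quiet_either = df[(df["direction_centre"] < 3) | (df["direction_mariakerke"] < 3)]
quiet_both = df[(df["direction_centre"] < 3) & (df["direction_mariakerke"] < 3)]
print(len(quiet_either), len(quiet_both))
```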
-EXERCISE: +**EXERCISE** -
    -
  • What is the average number of bikers passing each 15 min?
  • -
+What is the average number of bikers passing each 15 min? + +
Hints -
+- As the time series is already 15min level, this is just the same as taking the mean. + + + ```{code-cell} ipython3 :clear_cell: true @@ -367,15 +390,16 @@ df.mean()
-EXERCISE: +**EXERCISE** -
    -
  • What is the average number of bikers passing each hour?
  • -
+What is the average number of bikers passing each hour? -_Tip:_ you can use `resample` to first calculate the number of bikers passing each hour. +
Hints -
+- Use `resample` to first calculate the number of bikers passing each hour. +- `resample` requires an aggregation function that defines how to combine the values within each group (in this case all values within each hour). + + ```{code-cell} ipython3 :clear_cell: true @@ -385,13 +409,15 @@ df.resample('H').sum().mean()
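The resample-then-aggregate pattern from the hints, sketched on a toy series (lowercase `'h'` is the hourly alias in recent pandas; older versions also accept `'H'`):

```python
import pandas as pd

# Eight quarter-hour counts covering two hours (made-up values)
idx = pd.date_range("2014-01-01 00:00", periods=8, freq="15min")
counts = pd.Series([1, 2, 3, 4, 10, 10, 10, 10], index=idx)

hourly = counts.resample("h").sum()  # 15 min counts -> hourly totals
print(hourly.tolist())
print(hourly.mean())
```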
-EXERCISE: +**EXERCISE** -
    -
  • What are the 10 highest peak values observed during any of the intervals for the direction towards the centre of Ghent?
  • -
+What are the 10 highest peak values observed during any of the intervals for the direction towards the center of Ghent? -
+
Hints + +- Pandas provides the `nsmallest` and `nlargest` methods to derive N smallest/largest values of a column. + +
```{code-cell} ipython3 :clear_cell: true @@ -403,13 +429,17 @@ df['direction_centre'].nlargest(10)
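A quick standalone demo of the `nlargest`/`nsmallest` hint on made-up values:

```python
import pandas as pd

s = pd.Series([3, 41, 7, 99, 12, 85])

print(s.nlargest(3).tolist())   # three highest values, in descending order
print(s.nsmallest(2).tolist())  # two lowest values, in ascending order
```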
-EXERCISE: +**EXERCISE** -
    -
  • What is the maximum number of cyclist that passed on a single day calculated on both directions combined?
  • -
+What is the maximum number of cyclists that passed on a single day calculated on both directions combined? -
+
Hints + +- Combine both directions by taking the sum. +- Next, `resample` to daily values +- Get the maximum value or ask for the n largest to see the dates as well. + +
```{code-cell} ipython3 :clear_cell: true @@ -430,10 +460,12 @@ df_daily.max() ``` ```{code-cell} ipython3 +:clear_cell: false + df_daily.nlargest(10) ``` -2013-06-05 was the first time more than 10,000 bikers passed on one day. Apparanlty, this was not just by coincidence... http://www.nieuwsblad.be/cnt/dmf20130605_022 +The high number of bikers passing on 2013-06-05 was not by coincidence: http://www.nieuwsblad.be/cnt/dmf20130605_022 ;-) +++ @@ -443,13 +475,16 @@ df_daily.nlargest(10)
-EXERCISE: +**EXERCISE** -
    -
  • How does the long-term trend look like? Calculate monthly sums and plot the result.
  • -
+What does the long-term trend look like? Calculate monthly sums and plot the result. -
+
Hints + +- The symbol for monthly resampling is `M`. +- Use the `plot` method of Pandas, which will generate a line plot of each numeric column by default. + +
```{code-cell} ipython3 :clear_cell: true @@ -460,13 +495,15 @@ df_monthly.plot()
-EXERCISE: +**EXERCISE** -
    -
  • Let's have a look at some short term patterns. For the data of the first 3 weeks of January 2014, calculate the hourly counts and visualize them.
  • -
+Let's have a look at some short term patterns. For the data of the first 3 weeks of January 2014, calculate the hourly counts and visualize them. -
+
Hints + +- Slicing is done using `[]`, you can use string representation of dates to select from a `datetime` index: e.g. `'2010-01-01':'2020-12-31'` + +
```{code-cell} ipython3 :clear_cell: true @@ -492,14 +529,17 @@ df_hourly['2014-01-01':'2014-01-20'].plot()
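The string-based slicing from the hint, sketched on synthetic daily data. String slices on a `DatetimeIndex` are inclusive on both ends, and a bare month string selects the whole month:

```python
import pandas as pd

idx = pd.date_range("2013-12-01", "2014-02-28", freq="D")
s = pd.Series(range(len(idx)), index=idx)

january_window = s.loc["2014-01-01":"2014-01-20"]  # both endpoints included
whole_month = s.loc["2014-01"]                     # partial string: full month
print(len(january_window), len(whole_month))
```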
-EXERCISE: +**EXERCISE** -
    -
  • Select a subset of the data set from 2013-12-31 12:00:00 untill 2014-01-01 12:00:00, store as variable newyear and plot this subset
  • -
  • Use a rolling function (check documentation of the function!) to smooth the data of this period and make a plot of the smoothed version
  • -
+- Select a subset of the data set from 2013-12-31 12:00:00 until 2014-01-01 12:00:00 and assign the result to a new variable `newyear` +- Plot the selected data `newyear`. +- Use a `rolling` function with a window of 10 values (check documentation of the function) to smooth the data of this period and make a plot of the smoothed version. -
+
Hints + +- Just like `resample`, `rolling` requires an aggregate statistic (e.g. mean, median,...) to combine the values within the window. + +
```{code-cell} ipython3 :clear_cell: true @@ -519,7 +559,7 @@ newyear.plot() newyear.rolling(10, center=True).mean().plot(linewidth=2) ``` -A more advanced usage of matplotlib to create a combined plot: +A more advanced usage of Matplotlib to create a combined plot: ```{code-cell} ipython3 :clear_cell: true @@ -543,7 +583,7 @@ Looking at the data in the above exercises, there seems to be clearly a: - weekly pattern - yearly pattern -Such patterns can easily be calculated and visualized in pandas using the DatetimeIndex attributes `weekday` combined with `groupby` functionality. Below a taste of the possibilities, and we will learn about this in the proceeding notebooks: +Such patterns can easily be calculated and visualized in pandas using the `DatetimeIndex` attributes `dayofweek` combined with `groupby` functionality. Below a taste of the possibilities, and we will learn about this in the proceeding notebooks: +++ @@ -554,7 +594,7 @@ df_daily = df.resample('D').sum() ``` ```{code-cell} ipython3 -df_daily.groupby(df_daily.index.weekday).mean().plot(kind='bar') +df_daily.groupby(df_daily.index.dayofweek).mean().plot(kind='bar') ``` **Daily pattern:** diff --git a/_solved/case3_bacterial_resistance_lab_experiment.md b/_solved/case3_bacterial_resistance_lab_experiment.md index 8050eb5..8f78937 100644 --- a/_solved/case3_bacterial_resistance_lab_experiment.md +++ b/_solved/case3_bacterial_resistance_lab_experiment.md @@ -11,13 +11,13 @@ kernelspec: name: python3 --- -

CASE - Bacterial resistance experiment

+

CASE - Bacterial resistance experiment

-> *DS Data manipulation, analysis and visualisation in Python* -> *December, 2019* - -> *© 2017, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* +> *DS Data manipulation, analysis and visualization in Python* +> *May/June, 2021* +> +> *© 2021, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* --- @@ -39,8 +39,8 @@ Check the full paper on the [web version](http://rsbl.royalsocietypublishing.org ```{code-cell} ipython3 import pandas as pd +import seaborn as sns import matplotlib.pyplot as plt -import plotnine as p9 ``` ## Reading and processing the data @@ -55,7 +55,7 @@ For the exercises, two sheets of the excel file will be used: | Variable name | Description | |---------------:|:-------------| -|**AB_r** | Antibotic resistance | +|**AB_r** | Antibiotic resistance | |**Bacterial_genotype** | Bacterial genotype | |**Phage_t** | Phage treatment | |**OD_0h** | Optical density at the start of the experiment (0h) | @@ -79,14 +79,15 @@ For the exercises, two sheets of the excel file will be used: Reading the `main experiment` data set from the corresponding sheet: ```{code-cell} ipython3 -main_experiment = pd.read_excel("../data/Dryad_Arias_Hall_v3.xlsx", sheet_name="Main experiment") +main_experiment = pd.read_excel("data/Dryad_Arias_Hall_v3.xlsx", + sheet_name="Main experiment") main_experiment ``` Read the `Falcor` data and subset the columns of interest: ```{code-cell} ipython3 -falcor = pd.read_excel("../data/Dryad_Arias_Hall_v3.xlsx", sheet_name="Falcor", +falcor = pd.read_excel("data/Dryad_Arias_Hall_v3.xlsx", sheet_name="Falcor", skiprows=1) falcor = falcor[["Phage", "Bacterial_genotype", "log10 Mc", "log10 UBc", "log10 LBc"]] falcor.head() @@ -96,7 +97,7 @@ falcor.head() +++ -*(If you're wondering what `tidy` data representations are, check again the `visualization_02_plotnine.ipynb` 
notebook)* +*(If you're wondering what `tidy` data representations are, check again the `pandas_07_reshaping_data.ipynb` notebook)* +++ @@ -104,7 +105,7 @@ Actually, the columns `OD_0h`, `OD_20h` and `OD_72h` are representing the same v +++ -Before making any changes to the data, we will add an identifier column for each of the current rows to make sure we keep the connection in between the entries of a row when converting from wide to ong format. +Before making any changes to the data, we will add an identifier column for each of the current rows to make sure we keep the connection in between the entries of a row when converting from wide to long format. ```{code-cell} ipython3 main_experiment["experiment_ID"] = ["ID_" + str(idx) for idx in range(len(main_experiment))] @@ -115,12 +116,14 @@ main_experiment EXERCISE: -
    -
  • Convert the columns `OD_0h`, `OD_20h` and `OD_72h` to a long format with the values stored in a column `optical_density` and the time in the experiment as `experiment_time_h`. Save the variable as tidy_experiment
  • +Convert the columns `OD_0h`, `OD_20h` and `OD_72h` to a long format with the values stored in a column `optical_density` and the time in the experiment as `experiment_time_h`. Save the variable as tidy_experiment -
+
Hints -__Tip__: Have a look at `pandas_07_reshaping_data.ipynb` to find out the required function +- Have a look at `pandas_07_reshaping_data.ipynb` to find out the required function. +- Remember to check the documentation of a function using the `SHIFT` + `TAB` keystroke combination when the cursor is on the function of interest. + +
@@ -145,79 +148,81 @@ tidy_experiment.head() EXERCISE: -
    -
  • Make a histogram to check the distribution of the `optical_density`
  • -
  • Change the border color of the bars to `white` and the fillcolor to `lightgrey`
  • -
  • Change the overall theme to any of the available themes
  • -
+* Make a histogram using the [Seaborn package](https://seaborn.pydata.org/index.html) to visualize the distribution of the `optical_density` +* Change the overall theme to any of the available Seaborn themes +* Change the border color of the bars to `white` and the fill color of the bars to `grey` + +
Hints
+
+- See https://seaborn.pydata.org/tutorial/distributions.html#plotting-univariate-histograms.
+- There are five preset seaborn themes: `darkgrid`, `whitegrid`, `dark`, `white`, and `ticks`.
+- Make sure to set the theme before creating the graph.
+- Seaborn relies on Matplotlib to plot the individual bars, so the additional parameters (`**kwargs`) that can be passed to adjust the bars (e.g. `color` and `edgecolor`) are listed in the [matplotlib.axes.Axes.bar](https://matplotlib.org/3.3.2/api/_as_gen/matplotlib.axes.Axes.bar.html) documentation.
+
+
+ -_Tip_: plotnine required data, aesthetics and a geometry. Add color additions to the geometry as parameters of the method and theme options as additional statements (`+`)
```{code-cell} ipython3 :clear_cell: true -(p9.ggplot(tidy_experiment, p9.aes(x='optical_density')) - + p9.geom_histogram(bins=30, color='white', fill='lightgrey') - + p9.theme_bw() -) +sns.set_style("white") +sns.displot(tidy_experiment, x="optical_density", + color='grey', edgecolor='white') ```
-EXERCISE: +**EXERCISE** -
    -
  • Use a `violin plot` to check the distribution of the `optical_density` in each of the experiment time phases (`experiment_time_h`)
  • +Use a Seaborn `violin plot` to check the distribution of the `optical_density` in each of the experiment time phases (`experiment_time_h` in the x-axis). -
+
Hints -_Tip_: within plotnine, searching for a specific geometry always starts with typing `geom_` + TAB-button -
+- See https://seaborn.pydata.org/tutorial/categorical.html#violinplots.
+- Whereas the previous exercise focuses on the distribution of data (`displot`), this exercise focuses on distributions _for each category of..._ and needs the categorical functions of Seaborn (`catplot`).
+
+

```{code-cell} ipython3
:clear_cell: true

-(p9.ggplot(tidy_experiment, p9.aes(x='experiment_time_h',
-                                   y='optical_density'))
-    + p9.geom_violin()
-)
+sns.catplot(data=tidy_experiment, x="experiment_time_h",
+            y="optical_density", kind="violin")
```
-EXERCISE: +**EXERCISE** -
    -
  • For each `Phage_t` in an individual subplot, use a `violin plot` to check the distribution of the `optical_density` in each of the experiment time phases (`experiment_time_h`)
  • -
+For each `Phage_t` in an individual subplot, use a `violin plot` to check the distribution of the `optical_density` in each of the experiment time phases (`experiment_time_h`) -_Tip_: remember `facet_wrap`? - -
+
Hints +- The technical term for splitting in subplots using a categorical variable is 'faceting' (or sometimes also 'small multiple'), see https://seaborn.pydata.org/tutorial/categorical.html#showing-multiple-relationships-with-facets +- You want to wrap the number of columns on 2 subplots, look for a function argument in the documentation of the `catplot` function. + +
```{code-cell} ipython3 :clear_cell: true -(p9.ggplot(tidy_experiment, p9.aes(x='experiment_time_h', - y='optical_density')) - + p9.geom_violin() - + p9.facet_wrap('Phage_t') -) +sns.catplot(data=tidy_experiment, x="experiment_time_h", y="optical_density", + col="Phage_t", col_wrap=2, kind="violin") ```
-EXERCISE: +**EXERCISE** -
    -
  • Create a summary table of the average `optical_density` with the `Bacterial_genotype` in the rows and the `experiment_time_h` in the columns
  • -
+Create a summary table of the __average__ `optical_density` with the `Bacterial_genotype` in the rows and the `experiment_time_h` in the columns -_Tip_: no plotnine required here +
Hints -
+- No Seaborn required here, rely on Pandas `pivot_table()` function to reshape tables. + + ```{code-cell} ipython3 :clear_cell: true @@ -231,23 +236,29 @@ pd.pivot_table(tidy_experiment, values='optical_density', ```{code-cell} ipython3 :clear_cell: true +# advanced/optional solution tidy_experiment.groupby(['Bacterial_genotype', 'experiment_time_h'])['optical_density'].mean().unstack() ```
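The two solutions above are equivalent reshapes; this can be checked on a toy table (invented column names and values):

```python
import pandas as pd

df = pd.DataFrame({"genotype": ["WT", "WT", "MUT", "MUT"],
                   "time": ["0h", "20h", "0h", "20h"],
                   "od": [0.1, 0.5, 0.2, 0.8]})

via_pivot = pd.pivot_table(df, values="od", index="genotype",
                           columns="time", aggfunc="mean")
via_groupby = df.groupby(["genotype", "time"])["od"].mean().unstack()

# Both produce the same genotype x time table of means
print(via_pivot.equals(via_groupby))
```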
-EXERCISE: +**EXERCISE** -
    -
  • Calculate for each combination of `Bacterial_genotype`, `Phage_t` and `experiment_time_h` the mean `optical_density` and store the result as a dataframe called `density_mean`
  • -
  • Based on `density_mean`, make a barplot of the mean values for each `Bacterial_genotype`, with for each Bacterial_genotype an individual bar per `Phage_t` in a different color (grouped bar chart).
  • -
  • Use the `experiment_time_h` to split into subplots. As we mainly want to compare the values within each subplot, make sure the scales in each of the subplots are adapted to the data range, and put the subplots on different rows.
  • -
  • (OPTIONAL) change the color scale of the bars to a color scheme provided by colorbrewer
  • +- Calculate for each combination of `Bacterial_genotype`, `Phage_t` and `experiment_time_h` the mean `optical_density` and store the result as a DataFrame called `density_mean` (tip: use `reset_index()` to convert the resulting Series to a DataFrame). +- Based on `density_mean`, make a _barplot_ of the (mean) values for each `Bacterial_genotype`, with for each `Bacterial_genotype` an individual bar and with each `Phage_t` in a different color/hue (i.e. grouped bar chart). +- Use the `experiment_time_h` to split into subplots. As we mainly want to compare the values within each subplot, make sure the scales in each of the subplots are adapted to its own data range, and put the subplots on different rows. +- Adjust the size and aspect ratio of the Figure to your own preference. +- Change the color scale of the bars to another Seaborn palette. + +
    Hints -
- -
+- _Calculate for each combination of..._ should remind you of the `groupby` functionality of Pandas to calculate statistics for each group.
+- The exercise is still using the `catplot` function of Seaborn with `bar`s. Variables are used to vary the `hue` and `row`.
+- Giving each subplot its own range is the same as not sharing axes (`sharey` argument).
+- Seaborn in fact has six variations of matplotlib’s palette, called `deep`, `muted`, `pastel`, `bright`, `dark`, and `colorblind`. See https://seaborn.pydata.org/tutorial/color_palettes.html#qualitative-color-palettes
+
+

```{code-cell} ipython3
:clear_cell: true

@@ -257,38 +268,49 @@ density_mean = (tidy_experiment
                .mean().reset_index())
```

-```{code-cell} ipython3
-density_mean.head()
-```
-
```{code-cell} ipython3
:clear_cell: true

-(p9.ggplot(density_mean, p9.aes(x='Bacterial_genotype',
-                                y='optical_density',
-                                fill='Phage_t'))
-    + p9.geom_bar(stat='identity', position='dodge')
-    + p9.facet_wrap('experiment_time_h', scales='free', nrow=3)
-    + p9.scale_fill_brewer(type='qual', palette=8)
-)
+sns.catplot(data=density_mean, kind="bar",
+            x='Bacterial_genotype',
+            y='optical_density',
+            hue='Phage_t',
+            row="experiment_time_h",
+            sharey=False,
+            aspect=3, height=3,
+            palette="colorblind")
```

-## Reproduce the graphs of the original paper
+## (Optional) Reproduce chart of the original paper

+++

+Check Figure 2 of the original journal paper in the 'correction' part of the pdf:
+
+
+
+```{code-cell} ipython3
+falcor.head()
+```
-EXERCISE: +**EXERCISE** -
    -
  • Check Figure 2 of the original journal paper in the 'correction' part of the pdf:
  • - -
  • Reproduce the graph using the `falcor` data and the plotnine package (don't bother yet about the style or the order on the x axis). The 'log10 mutation rate' on the figure corresponds to the `log10 Mc` column.
  • -
  • Check the documentation to find out how to add errorbars to the graph. The upper and lower bound for the error bars are given in the `log10 UBc` and `log10 LBc` columns.
  • -
  • Make sure the `WT(2)` and `MUT(2)` are used as respectively `WT` and `MUT`.
  • -
-
+We will first reproduce 'Figure 2' without the error bars: + +- Make sure the `WT(2)` and `MUT(2)` categories are used as respectively `WT` and `MUT` by adjusting them with Pandas first. +- Use the __falcor__ data and the Seaborn package. The 'log10 mutation rate' on the figure corresponds to the `log10 Mc` column. + + +
Hints
+
+- To replace values using a mapping (a dictionary with the current values as keys and the new values as values), use the Pandas `replace` method.
+- This is another example of a `catplot`, using `point`s to represent the data.
+- The `join` argument defines whether the individual points are connected or not.
+- One combination appears multiple times, so do not use confidence intervals yet: set `ci` to `None`.
+
```{code-cell} ipython3 :clear_cell: true @@ -300,61 +322,65 @@ falcor["Bacterial_genotype"] = falcor["Bacterial_genotype"].replace({'WT(2)': 'W ```{code-cell} ipython3 :clear_cell: true -(p9.ggplot(falcor, p9.aes(x='Bacterial_genotype', y='log10 Mc')) - + p9.geom_point() - + p9.facet_wrap('Phage', nrow=3) - + p9.geom_errorbar(p9.aes(ymin='log10 LBc', ymax='log10 UBc'), width=.2) - + p9.theme_bw() -) +sns.catplot(data=falcor, kind="point", + x='Bacterial_genotype', + y='log10 Mc', + row="Phage", + join=False, ci=None, + aspect=3, height=3, + color="black") +``` + +```{code-cell} ipython3 +falcor.head() ``` +Seaborn supports confidence intervals by different estimators when multiple values are combined (see [this example](https://seaborn.pydata.org/examples/pointplot_anova.html)). In this particular case, the error estimates are already provided and are not symmetrical. Hence, we need to find a method to use the lower `log10 LBc` and upper `log10 UBc` confidence intervals. + +Stackoverflow can help you with this, see [this thread](https://stackoverflow.com/questions/38385099/adding-simple-error-bars-to-seaborn-factorplot) to solve the following exercise. + ++++ +
-EXERCISE (OPTIONAL): +**EXERCISE** -
    -
  • Check Figure 1 of the original journal paper:
  • -
  • Reproduce the graph using the `tidy_experiment` data and the plotnine package. Notice that the plot shows the optical density at the end of the experiment (72h).
  • -
  • Take the `geom_` that closest represents the original.
  • -
  • Check the documentation for further tuning, e.g. `as_labeller`...
  • -
-
+Reproduce 'Figure 2' with the error bars using the information from [this Stackoverflow thread](https://stackoverflow.com/questions/38385099/adding-simple-error-bars-to-seaborn-factorplot). You do not have to adjust the order of the categories in the x-axis. + +
Hints
+
+- Do not use the `catplot` function, but first create the layout of the graph with `FacetGrid` on the `Phage` variable.
+- Next, map a custom `errorbar` function to the `FacetGrid`, as in the example from Stackoverflow.
+- Adjust/simplify the custom `errorbar` function for your purpose.
+- Matplotlib uses `capsize` to draw the upper and lower caps of the error bars.
+
```{code-cell} ipython3 :clear_cell: true -end_of_experiment = tidy_experiment[tidy_experiment["experiment_time_h"] == "OD_72h"].copy() +falcor["Bacterial_genotype"] = falcor["Bacterial_genotype"].replace({'WT(2)': 'WT', + 'MUT(2)': 'MUT'}) ``` ```{code-cell} ipython3 :clear_cell: true -# The Nan-values of the PhageR_72h when no phage represent survival (1) -end_of_experiment["PhageR_72h"] = end_of_experiment["PhageR_72h"].fillna(0.) +def errorbar(x, y, low, high, **kws): + """Utility function to link falcor data representation with the errorbar representation""" + plt.errorbar(x, y, (y - low, high - y), capsize=3, fmt="o", color="black", ms=4) ``` ```{code-cell} ipython3 :clear_cell: true -# precalculate the median value -end_of_experiment["Phage_median"] = end_of_experiment.groupby(["Phage_t", "Bacterial_genotype"])['optical_density'].transform('median') - -p9.options.figure_size = (8, 10) -(p9.ggplot(end_of_experiment, p9.aes(x='Bacterial_genotype', - y='optical_density')) - + p9.geom_jitter(mapping=p9.aes(color='factor(PhageR_72h)'), - width=0.2, height=0., size=2, fill='white') - + p9.facet_wrap("Phage_t", nrow=4, - labeller=p9.as_labeller({'C_noPhage' : '(a) no phage', 'L' : '(b) phage $\lambda$', - 'T4' : '(c) phage T4', 'T7': '(d) phage T7'})) - + p9.theme_bw() - + p9.xlab("Bacterial genotype") - + p9.ylab("Bacterial density (OD)") - + p9.theme(strip_text=p9.element_text(size=11)) - + p9.geom_crossbar(inherit_aes=False, alpha=0.5, - mapping=p9.aes(x='Bacterial_genotype', y='Phage_median', - ymin='Phage_median', ymax='Phage_median')) - + p9.scale_color_manual(values=["black", "red"], guide=False) -) +sns.set_style("ticks") +g = sns.FacetGrid(falcor, row="Phage", aspect=3, height=3) +g.map(errorbar, + "Bacterial_genotype", "log10 Mc", + "log10 LBc", "log10 UBc") +``` + +```{code-cell} ipython3 + ``` diff --git a/_solved/case4_air_quality_analysis.md b/_solved/case4_air_quality_analysis.md index 9ffed58..b4b9f42 100644 --- 
a/_solved/case4_air_quality_analysis.md +++ b/_solved/case4_air_quality_analysis.md @@ -49,7 +49,7 @@ See http://www.eea.europa.eu/themes/air/interactive/no2 We processed the individual data files in the previous notebook, and saved it to a csv file `../data/airbase_data_processed.csv`. Let's import the file here (if you didn't finish the previous notebook, a version of the dataset is also available in `../data/airbase_data.csv`): ```{code-cell} ipython3 -alldata = pd.read_csv('../data/airbase_data.csv', index_col=0, parse_dates=True) +alldata = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True) ``` We only use the data from 1999 onwards: @@ -158,7 +158,7 @@ data_tidy.head() ```{code-cell} ipython3 :clear_cell: true -data_tidy['no2'].isnull().sum() +data_tidy['no2'].isna().sum() ``` ```{code-cell} ipython3 @@ -435,7 +435,7 @@ Start with only visualizing the different in diurnal profile for the BETR801 sta **Hints:**
    -
  • Add a column 'weekend' defining if a value of the index is in the weekend (i.e. weekdays 5 and 6) or not
  • +
  • Add a column 'weekend' defining if a value of the index is in the weekend (i.e. days of the week 5 and 6) or not
  • Add a column 'hour' with the hour of the day for each row.
  • You can groupby on multiple items at the same time.
  • @@ -445,7 +445,7 @@ Start with only visualizing the different in diurnal profile for the BETR801 sta ```{code-cell} ipython3 :clear_cell: true -data['weekend'] = data.index.weekday.isin([5, 6]) +data['weekend'] = data.index.dayofweek.isin([5, 6]) data['weekend'] = data['weekend'].replace({True: 'weekend', False: 'weekday'}) data['hour'] = data.index.hour ``` @@ -644,10 +644,10 @@ ax2.set_title('BETR801')
    • Make a selection of the original dataset of the data in January 2009, call the resulting variable subset
    • -
    • Add a new column, called 'weekday', to the variable subset which defines for each data point the day of the week
    • +
    • Add a new column, called 'dayofweek', to the variable subset which defines for each data point the day of the week
    • From the subset DataFrame, select only Monday (= day 0) and Sunday (=day 6) and remove the others (so, keep this as variable subset)
    • -
    • Change the values of the weekday column in subset according to the following mapping: {0:"Monday", 6:"Sunday"}
    • -
    • With plotnine, make a scatter plot of the measurements at 'BETN029' vs 'FR04037', with the color variation based on the weekday. Add a linear regression to this plot.
    • +
    • Change the values of the dayofweek column in subset according to the following mapping: {0:"Monday", 6:"Sunday"}
    • +
    • With plotnine, make a scatter plot of the measurements at 'BETN029' vs 'FR04037', with the color variation based on the dayofweek. Add a linear regression to this plot.

    **Note**: If you run into the **SettingWithCopyWarning** and do not know what to do, recheck [pandas_03b_indexing](pandas_03b_indexing.ipynb) @@ -658,21 +658,21 @@ ax2.set_title('BETR801') :clear_cell: true subset = data['2009-01'].copy() -subset["weekday"] = subset.index.weekday -subset = subset[subset['weekday'].isin([0, 6])] +subset["dayofweek"] = subset.index.dayofweek +subset = subset[subset['dayofweek'].isin([0, 6])] ``` ```{code-cell} ipython3 :clear_cell: true -subset["weekday"] = subset["weekday"].replace(to_replace={0:"Monday", 6:"Sunday"}) +subset["dayofweek"] = subset["dayofweek"].replace(to_replace={0:"Monday", 6:"Sunday"}) ``` ```{code-cell} ipython3 :clear_cell: true (pn.ggplot(subset, - pn.aes(x="BETN029", y="FR04037", color="weekday")) + pn.aes(x="BETN029", y="FR04037", color="dayofweek")) + pn.geom_point() + pn.stat_smooth(method='lm')) ``` @@ -713,7 +713,7 @@ ax = exceedances.plot(kind='bar') EXERCISE:
      -
    • Visualize the typical week profile for station 'BETR801' as boxplots (where the values in one boxplot are the daily means for the different weeks for a certain weekday).


    • +
    • Visualize the typical week profile for station 'BETR801' as boxplots (where the values in one boxplot are the daily means for the different weeks for a certain day of the week).


    @@ -726,7 +726,7 @@ The boxplot method of a DataFrame expects the data for the different boxes in di +++ -Calculating daily means and add weekday information: +Calculating daily means and add dayofweek information: ```{code-cell} ipython3 :clear_cell: true @@ -737,8 +737,8 @@ data_daily = data.resample('D').mean() ```{code-cell} ipython3 :clear_cell: true -# add a weekday column -data_daily['weekday'] = data_daily.index.weekday +# add a dayofweek column +data_daily['dayofweek'] = data_daily.index.dayofweek data_daily.head() ``` @@ -749,7 +749,7 @@ Plotting with plotnine: # plotnine (pn.ggplot(data_daily["2012"], - pn.aes(x='factor(weekday)', y='BETR801')) + pn.aes(x='factor(dayofweek)', y='BETR801')) + pn.geom_boxplot()) ``` @@ -760,8 +760,9 @@ Reshaping and plotting with pandas: # when using pandas to plot, the different boxplots should be different columns # therefore, pivot table so that the weekdays are the different columns -data_daily['week'] = data_daily.index.week -data_pivoted = data_daily['2012'].pivot_table(columns='weekday', index='week', values='BETR801') +data_daily['week'] = data_daily.index.isocalendar().week +data_pivoted = data_daily['2012'].pivot_table(columns='dayofweek', index='week', + values='BETR801') data_pivoted.head() data_pivoted.boxplot(); ``` @@ -770,5 +771,9 @@ data_pivoted.boxplot(); :clear_cell: true # An alternative method using `groupby` and `unstack` -data_daily['2012'].groupby(['weekday', 'week'])['BETR801'].mean().unstack(level=0).boxplot(); +data_daily['2012'].groupby(['dayofweek', 'week'])['BETR801'].mean().unstack(level=0).boxplot(); +``` + +```{code-cell} ipython3 + ``` diff --git a/_solved/pandas_03a_selecting_data.md b/_solved/pandas_03a_selecting_data.md index fa397f3..96e7396 100644 --- a/_solved/pandas_03a_selecting_data.md +++ b/_solved/pandas_03a_selecting_data.md @@ -272,6 +272,11 @@ Tip: try it first on a single string (and for this, check the `split` method of df['Surname'] = df['Name'].apply(lambda x: 
x.split(',')[0]) ``` +```{code-cell} ipython3 +# alternative solution with pandas' string methods +df['Surname'] = df['Name'].str.split(",").str.get(0) +``` +
    EXERCISE: @@ -423,13 +428,13 @@ inception = cast[cast['title'] == 'Inception'] ```{code-cell} ipython3 :clear_cell: true -len(inception[inception['n'].isnull()]) +len(inception[inception['n'].isna()]) ``` ```{code-cell} ipython3 :clear_cell: true -inception['n'].isnull().sum() +inception['n'].isna().sum() ```
    @@ -444,7 +449,7 @@ inception['n'].isnull().sum() ```{code-cell} ipython3 :clear_cell: true -len(inception[inception['n'].notnull()]) +len(inception[inception['n'].notna()]) ```
    @@ -460,7 +465,7 @@ len(inception[inception['n'].notnull()]) :clear_cell: true titanic = cast[(cast['title'] == 'Titanic') & (cast['year'] == 1997)] -titanic = titanic[titanic['n'].notnull()] +titanic = titanic[titanic['n'].notna()] titanic.sort_values('n') ``` diff --git a/_solved/pandas_04_time_series_data.md b/_solved/pandas_04_time_series_data.md index 97abdb9..53b4628 100644 --- a/_solved/pandas_04_time_series_data.md +++ b/_solved/pandas_04_time_series_data.md @@ -116,7 +116,7 @@ pd.to_datetime("09/12/2016", dayfirst=True) pd.to_datetime("09/12/2016", format="%d/%m/%Y") ``` -A detailed overview of how to specify the `format` string, see the table in the python documentation: https://docs.python.org/3.5/library/datetime.html#strftime-and-strptime-behavior +A detailed overview of how to specify the `format` string, see the table in the python documentation: https://docs.python.org/3.8/library/datetime.html#strftime-and-strptime-behavior +++ @@ -151,7 +151,7 @@ ts.dt.hour ``` ```{code-cell} ipython3 -ts.dt.weekday +ts.dt.dayofweek ``` To quickly construct some regular time series data, the [``pd.date_range``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html) function comes in handy: @@ -393,10 +393,10 @@ data.resample('M').std().plot() # 'A' EXERCISE:
      -
    • plot the monthly mean and median values for the years 2011-2012 for 'L06_347'

    • +
    • Plot the monthly mean and median values for the years 2011-2012 for 'L06_347'

    -**Note** remember the `agg` when using `groupby` to derive multiple statistics at the same time? +__Note__ Did you know agg to derive multiple statistics at the same time?
    diff --git a/_solved/pandas_06_groupby_operations.md b/_solved/pandas_06_groupby_operations.md index 81c4f60..824a630 100644 --- a/_solved/pandas_06_groupby_operations.md +++ b/_solved/pandas_06_groupby_operations.md @@ -197,7 +197,7 @@ df.groupby('Pclass')['Survived'].mean().plot(kind='bar') #and what if you would EXERCISE:
      -
    • Make a bar plot to visualize the average Fare payed by people depending on their age. The age column is devided is separate classes using the `pd.cut` function as provided below.
    • +
    • Make a bar plot to visualize the average Fare payed by people depending on their age. The age column is divided in separate classes using the `pd.cut` function as provided below.
    @@ -244,7 +244,7 @@ df.groupby(['Pclass', 'Sex'])['Survived'].mean() +++ -Oftentimes you want to know how many elements there are in a certain group (or in other words: the number of occurences of the different values from a column). +Often you want to know how many elements there are in a certain group (or in other words: the number of occurences of the different values from a column). To get the size of the groups, we can use `size`: @@ -279,7 +279,7 @@ These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://gith - n: the order of the role (n=1: leading role) ```{code-cell} ipython3 -cast = pd.read_csv('../data/cast.csv') +cast = pd.read_csv('data/cast.csv') cast.head() ``` @@ -289,7 +289,7 @@ cast.head() * year: year of release ```{code-cell} ipython3 -titles = pd.read_csv('../data/titles.csv') +titles = pd.read_csv('data/titles.csv') titles.head() ``` @@ -556,7 +556,7 @@ oz_roles[oz_roles > 1].sort_values() ```{code-cell} ipython3 :clear_cell: true -cast['n_total'] = cast.groupby('title')['n'].transform('max') # transform will return an element for each row, so the max value is given to the whole group +cast['n_total'] = cast.groupby(['title', 'year'])['n'].transform('max') # transform will return an element for each row, so the max value is given to the whole group cast.head() ``` @@ -634,3 +634,7 @@ cast2000 = cast[cast['year'] // 10 == 200] cast2000 = cast2000[cast2000['n'] == 1] cast2000['type'].value_counts() ``` + +```{code-cell} ipython3 + +``` diff --git a/_solved/pandas_07_reshaping_data.md b/_solved/pandas_07_reshaping_data.md index 9091b81..612348a 100644 --- a/_solved/pandas_07_reshaping_data.md +++ b/_solved/pandas_07_reshaping_data.md @@ -107,7 +107,7 @@ So far, so good... 
Let's now use the full titanic dataset: ```{code-cell} ipython3 -df = pd.read_csv("../data/titanic.csv") +df = pd.read_csv("data/titanic.csv") ``` ```{code-cell} ipython3 @@ -150,7 +150,7 @@ Well, they need to be combined, according to an `aggregation` functionality, whi # Pivot tables - aggregating while pivoting ```{code-cell} ipython3 -df = pd.read_csv("../data/titanic.csv") +df = pd.read_csv("data/titanic.csv") ``` ```{code-cell} ipython3 @@ -329,7 +329,7 @@ df To better understand and reason about pivot tables, we can express this method as a combination of more basic steps. In short, the pivot is a convenient way of expressing the combination of a `groupby` and `stack/unstack`. ```{code-cell} ipython3 -df = pd.read_csv("../data/titanic.csv") +df = pd.read_csv("data/titanic.csv") ``` ```{code-cell} ipython3 @@ -365,12 +365,12 @@ df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack() These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so credit to him!) and the datasets he prepared for that. You can download these data from here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8) and put them in the `/data` folder. ```{code-cell} ipython3 -cast = pd.read_csv('../data/cast.csv') +cast = pd.read_csv('data/cast.csv') cast.head() ``` ```{code-cell} ipython3 -titles = pd.read_csv('../data/titles.csv') +titles = pd.read_csv('data/titles.csv') titles.head() ``` @@ -432,8 +432,8 @@ pd.crosstab(index=cast['year'], columns=cast['type']).plot(kind='area') :clear_cell: true grouped = cast.groupby(['year', 'type']).size() -table = grouped.unstack('type') -(table['actor'] / (table['actor'] + table['actress'])).plot(ylim=[0,1]) +table = grouped.unstack('type').fillna(0) +(table['actor'] / (table['actor'] + table['actress'])).plot(ylim=[0, 1]) ```
    @@ -463,3 +463,11 @@ d = c.Superman - c.Batman print('Superman years:') print(len(d[d > 0.0])) ``` + +```{code-cell} ipython3 + +``` + +```{code-cell} ipython3 + +``` diff --git a/_solved/visualization_01_matplotlib.md b/_solved/visualization_01_matplotlib.md index b3300ff..4a615af 100644 --- a/_solved/visualization_01_matplotlib.md +++ b/_solved/visualization_01_matplotlib.md @@ -14,10 +14,10 @@ kernelspec:

    Matplotlib: Introduction

    -> *DS Data manipulation, analysis and visualisation in Python* -> *December, 2019* +> *DS Data manipulation, analysis and visualization in Python* +> *May/June, 2021* -> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* +> *© 2021, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* --- @@ -61,15 +61,25 @@ On its own, drawing the figure artist is uninteresting and will result in an emp By far the most useful artist in matplotlib is the **Axes** artist. The Axes artist represents the "data space" of a typical plot, a rectangular axes (the most common, but not always the case, e.g. polar plots) will have 2 (confusingly named) **Axis** artists with tick labels and tick marks. +![](../img/matplotlib_fundamentals.png) + There is no limit on the number of Axes artists which can exist on a Figure artist. Let's go ahead and create a figure with a single Axes artist, and show it using pyplot: ```{code-cell} ipython3 ax = plt.axes() ``` +```{code-cell} ipython3 +type(ax) +``` + +```{code-cell} ipython3 +type(ax.xaxis), type(ax.yaxis) +``` + Matplotlib's ``pyplot`` module makes the process of creating graphics easier by allowing us to skip some of the tedious Artist construction. For example, we did not need to manually create the Figure artist with ``plt.figure`` because it was implicit that we needed a figure when we created the Axes artist. -Under the hood matplotlib still had to create a Figure artist, its just we didn't need to capture it into a variable. We can access the created object with the "state" functions found in pyplot called **``gcf``** and **``gca``**. +Under the hood matplotlib still had to create a Figure artist, its just we didn't need to capture it into a variable. 
+++ @@ -166,6 +176,12 @@ ax.text(0.5, 0.5, 'Text centered at (0.5, 0.5)\nin Figure coordinates.', ax.legend(loc='upper right', frameon=True, ncol=2, fontsize=14) ``` +Adjusting specific parts of a plot is a matter of accessing the correct element of the plot: + +![](https://matplotlib.org/stable/_images/anatomy.png) + ++++ + For more information on legend positioning, check [this post](http://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot) on stackoverflow! +++ @@ -236,15 +252,19 @@ import pandas as pd ``` ```{code-cell} ipython3 -flowdata = pd.read_csv('../data/vmm_flowdata.csv', +flowdata = pd.read_csv('data/vmm_flowdata.csv', index_col='Time', parse_dates=True) ``` ```{code-cell} ipython3 -flowdata.plot() +out = flowdata.plot() # print type() ``` +Under the hood, it creates an Matplotlib Figure with an Axes object. + ++++ + ### Pandas versus matplotlib +++ @@ -252,7 +272,7 @@ flowdata.plot() #### Comparison 1: single plot ```{code-cell} ipython3 -flowdata.plot(figsize=(16, 6)) # shift tab this! +flowdata.plot(figsize=(16, 6)) # SHIFT + TAB this! ``` Making this with matplotlib... @@ -275,7 +295,7 @@ axs = flowdata.plot(subplots=True, sharex=True, fontsize=15, rot=0) ``` -Mimicking this in matplotlib (just as a reference): +Mimicking this in matplotlib (just as a reference, it is basically what Pandas is doing under the hood): ```{code-cell} ipython3 from matplotlib import cm @@ -288,7 +308,7 @@ fig, axs = plt.subplots(3, 1, figsize=(16, 8)) for ax, col, station in zip(axs, colors, flowdata.columns): ax.plot(flowdata.index, flowdata[station], label=station, color=col) ax.legend() - if not ax.is_last_row(): + if not ax.get_subplotspec().is_last_row(): ax.xaxis.set_ticklabels([]) ax.xaxis.set_major_locator(mdates.YearLocator()) else: @@ -305,9 +325,9 @@ Is already a bit harder ;-) ### Best of both worlds... 
```{code-cell} ipython3 -fig, ax = plt.subplots() #prepare a matplotlib figure +fig, ax = plt.subplots() #prepare a Matplotlib figure -flowdata.plot(ax=ax) # use pandas for the plotting +flowdata.plot(ax=ax) # use Pandas for the plotting ``` ```{code-cell} ipython3 @@ -375,7 +395,7 @@ def vmm_station_plotter(flowdata, label="flow (m$^3$s$^{-1}$)"): ax.set_ylabel(label, size=15) ax.yaxis.set_major_locator(MaxNLocator(4)) # smaller set of y-ticks for clarity - if not ax.is_last_row(): # hide the xticklabels from the none-lower row x-axis + if not ax.get_subplotspec().is_last_row(): # hide the xticklabels from the none-lower row x-axis ax.xaxis.set_ticklabels([]) ax.xaxis.set_major_locator(mdates.YearLocator()) else: # yearly xticklabels from the lower x-axis in the subplots @@ -397,13 +417,11 @@ fig.suptitle('Ammonium concentrations in the Maarkebeek', fontsize='17') fig.savefig('ammonium_concentration.pdf') ``` -
    +
    - NOTE: +**NOTE** -
      -
    • Let your hard work pay off, write your own custom functions!
    • -
    +- Let your hard work pay off, write your own custom functions!
    @@ -411,7 +429,7 @@ fig.savefig('ammonium_concentration.pdf')
    - Remember: +**Remember** `fig.savefig()` to save your Figure object! @@ -432,12 +450,12 @@ For more in-depth material:
    - Remember(!) - - -
    +**Remember** +- matplotlib gallery is an important resource to start from
    + +```{code-cell} ipython3 + +``` diff --git a/_solved/visualization_03_landscape.md b/_solved/visualization_03_landscape.md index 730925f..0cab14e 100644 --- a/_solved/visualization_03_landscape.md +++ b/_solved/visualization_03_landscape.md @@ -14,10 +14,10 @@ kernelspec:

    Python's Visualization Landscape

    -> *DS Data manipulation, analysis and visualisation in Python* -> *December, 2019* +> *DS Data manipulation, analysis and visualization in Python* +> *May/June, 2021* -> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* +> *© 2021, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)* --- @@ -26,24 +26,17 @@ kernelspec: --- **Remark:** -The packages used in this notebook are not provided by default in the conda environment of the course. In case you want to try these featutes yourself, make sure to install these packages with conda. +Some Python visualization packages used in this notebook are not provided by default in the conda environment of the course. In case you want to try these features yourself, make sure to install these packages with conda. To make some of the more general plotting packages available: ``` -conda install -c conda-forge bokeh plotly altair vega -``` - -an additional advice will appear about the making the vega nbextension available. 
This can be activated with the command: +conda install -c conda-forge bokeh plotly altair hvplot holoviews +``` +To have support of plotly inside the Jupyter Lab environment ``` -jupyter nbextension enable vega --py --sys-prefix -``` - -and use the interaction between plotly and pandas, install `cufflinks` as well - -``` -pip install cufflinks --upgrade +jupyter labextension install jupyterlab-plotly@4.14.3 ``` To run the large data set section, additional package installations are required: @@ -63,57 +56,56 @@ What we have encountered until now: * [matplotlib](https://matplotlib.org/) * [pandas .plot](https://pandas.pydata.org/pandas-docs/stable/visualization.html) -* [plotnine](https://github.com/has2k1/plotnine) -* a bit of [seaborn](https://seaborn.pydata.org/) +* [seaborn](https://seaborn.pydata.org/) ```{code-cell} ipython3 import numpy as np import pandas as pd -import matplotlib.pylab as plt +import matplotlib.pyplot as plt import plotnine as p9 import seaborn as sns ``` -### To 'grammar of graphics' or not to 'grammar of graphics'? +### When should I use Seaborn versus Matplotlib? +++ -#### Introduction - -+++ - -There is `titanic` again... +There is `titanic` data again... ```{code-cell} ipython3 -titanic = pd.read_csv("../data/titanic.csv") +titanic = pd.read_csv("data/titanic.csv") ``` -Pandas plot... +Pandas/Matplotlib plot... 
```{code-cell} ipython3 -fig, ax = plt.subplots() -plt.style.use('ggplot') -survival_rate = titanic.groupby("Pclass")['Survived'].mean() -survival_rate.plot(kind='bar', color='grey', - rot=0, figsize=(6, 4), ax=ax) -ylab = ax.set_ylabel("Survival rate") -xlab = ax.set_xlabel("Cabin class") +with plt.style.context('seaborn-whitegrid'): # context manager for styling the figure + + fig, ax = plt.subplots() + + survival_rate = titanic.groupby("Pclass")['Survived'].mean() + survival_rate.plot(kind='bar', color='grey', + rot=0, figsize=(6, 4), ax=ax) + + ylab = ax.set_ylabel("Survival rate") + xlab = ax.set_xlabel("Cabin class") ``` -Plotnine plot... +Using Seaborn: ```{code-cell} ipython3 -(p9.ggplot(titanic, p9.aes(x="factor(Pclass)", - y="Survived")) #add color/fill - + p9.geom_bar(stat='stat_summary', width=0.5) - + p9.theme(figure_size=(5, 3)) - + p9.ylab("Survival rate") - + p9.xlab("Cabin class") -) +with sns.axes_style("whitegrid"): # context manager for styling the figure + + g = sns.catplot(data=titanic, + x="Pclass", y="Survived", + kind="bar", estimator=np.mean, + ci=None, color="grey") + + g.set_axis_labels("Cabin class", "Survival rate") ``` -An important difference is the *imperative* approach from `matplotlib` versus the *declarative* approach from `plotnine`: +An important difference is the *imperative* approach from `matplotlib` versus the *declarative* approach from `seaborn`: +++ @@ -121,20 +113,15 @@ An important difference is the *imperative* approach from `matplotlib` versus th |------------|-------------| | Specify **how** something should be done | Specify **what** should be done | | **Manually specify** the individual plotting steps | Individual plotting steps based on **declaration** | -| e.g. `for ax in axes: ax.plot(...` | e.g. `+ facet_wrap('my_variable)` | - -+++ - -
    (seaborn lands somewhere in between)
    +| e.g. `for ax in axes: ax.plot(...` | e.g. `, col=my_variable` | +++ -Which approach to use, is also a matter of personal preference.... +Which approach to use, is sometimes just a matter of personal preference... Although, take following elements into account: +++ -Although, take following elements into account: -* When your data consists of only **1 factor variable**, such as +* When your data consists of only **1 categorical variable**, such as | ID | variable 1 | variable 2 | variabel ... | |------------|-------------| ---- | ----- | @@ -144,7 +131,7 @@ Although, take following elements into account: | 4 | 0.1 | 0.7 | ... | | ... | ... | ... | ...| -the added value of using a grammar of graphics approach is LOW. +the added value of using Seaborn approach is LOW. Pandas `.plot()` will probably suffice. * When working with **timeseries data** from sensors or continuous logging, such as @@ -156,11 +143,11 @@ the added value of using a grammar of graphics approach is LOW. | 2017-12-20T17:51:40Z | 0.1 | 0.7 | ... | | ... | ... | ... | ...| -the added value of using a grammar of graphics approach is LOW. +the added value of using a grammar of graphics approach is LOW. Pandas `.plot()` will probably suffice. * When working with different experiments, different conditions, (factorial) **experimental designs**, such as -| ID | substrate | addition (ml) | measured_value | +| ID | origin | addition (ml) | measured_value | |----|-----------| ----- | ------ | | 1 | Eindhoven | 0.3 | 7.2 | | 2 | Eindhoven | 0.6 | 6.7 | @@ -169,19 +156,18 @@ the added value of using a grammar of graphics approach is LOW. | 5 | Destelbergen | 0.6 | 6.8 | | ... | ... | ... | ...| -the added value of using a grammar of graphics approach is HIGH. Represent you're data [`tidy`](http://www.jeannicholashould.com/tidy-data-in-python.html) to achieve maximal benefit! +the added value of using Seaborn approach is HIGH. 
Represent your data [`tidy`](http://www.jeannicholashould.com/tidy-data-in-python.html) to achieve maximal benefit! + +* When you want to visualize __distributions__ of data or __regressions__ between variables, the added value of using Seaborn approach is HIGH. +++
    - Remember: +**Remember** -
      -
    • These packages will support you towards static, publication quality figures in a variety of hardcopy formats
    • -
    • In general, start with a high-level function and finish with matplotlib
    • -
    -
    +- These packages will support you towards __static, publication quality__ figures in a variety of __hardcopy__ formats +- In general, start with a _high-level_ function and adjust the details with Matplotlib
    @@ -199,167 +185,183 @@ fig.savefig("my_plot_with_one_issue.pdf")
    - Notice: +**Notice** -
      -
    • In the end... there is still Inkscape to the rescue!
    • -
    -
    +- In the end... there is still Inkscape to the rescue!
    +++ -### Seaborn - -```{code-cell} ipython3 -plt.style.use('seaborn-white') -``` - -> Seaborn is a library for making attractive and **informative statistical** graphics in Python. It is built **on top of matplotlib** and tightly integrated with the PyData stack, including **support for numpy and pandas** data structures and statistical routines from scipy and statsmodels. +## The 'Grammar of graphics' +++ -Seaborn provides a set of particularly interesting plot functions: +Seaborn provides a high level abstraction to create charts and is highly related to the concept of the so-called (layered) `Grammar of Graphics`, a visualization framework originally described [by Leland Wilkinson](https://www.springer.com/gp/book/9780387245447), which became famous due to the [ggplot2](https://ggplot2.tidyverse.org/) R package. -+++ +The `Grammar of Graphics` is especially targeted for so-called __tidy__ `DataFrame` representations and has currently implementations in different programming languages, e.g. -#### scatterplot matrix +- [ggplot2](https://ggplot2.tidyverse.org/) for R +- [vega-lite API](https://vega.github.io/vega-lite-api/) for Javascript -+++ +Each chart requires the definition of: -We've already encountered the [`pairplot`](https://seaborn.pydata.org/examples/scatterplot_matrix.html), a typical quick explorative plot function +1. data +1. a geometry (e.g. points, lines, bars,...) +1. a translation of the variables in the data to the elements of the geometry (aka `aesthetics` or `encoding`) -```{code-cell} ipython3 -# the discharge data for a number of measurement stations as example -flow_data = pd.read_csv("../data/vmm_flowdata.csv", parse_dates=True, index_col=0) -flow_data = flow_data.dropna() -flow_data['year'] = flow_data.index.year -flow_data.head() -``` +And additional elements can be added or adjusted to create more complex charts. 
-```{code-cell} ipython3 -# pairplot -sns.pairplot(flow_data, vars=["L06_347", "LS06_347", "LS06_348"], - hue='year', palette=sns.color_palette("Blues_d"), - diag_kind='kde', dropna=True) -``` +In the Python visualization ecosystem, both `Plotnine` as well as `Altair` provide an implementation of the `Grammar of Graphics` -#### heatmap +| Plotnine | Altair | +|------------|-------------| +| Works well with Pandas | Works well with Pandas | +| Built on top of [Matplotlib](https://matplotlib.org/) | Built on top of [Vega-lite](https://vega.github.io/vega-lite/) | +| Python-clone of the R package `ggplot` | Plot specification to define a vega-lite 'JSON string' | +| Static plots | Web/interactive plots | +++ -Let's just start from a Ghent data set: The city of Ghent provides data about migration in the different districts as open data, https://data.stad.gent/data/58 +### Plotnine -```{code-cell} ipython3 -district_migration = pd.read_csv("https://datatank.stad.gent/4/bevolking/wijkmigratieperduizend.csv", - sep=";", index_col=0) -district_migration.index.name = "wijk" -district_migration.head() -``` ++++ -```{code-cell} ipython3 -# cleaning the column headers -district_migration.columns = [year[-4:] for year in district_migration.columns] -district_migration.head() -``` +> _[Plotnine](https://plotnine.readthedocs.io/en/stable/) is an implementation of a grammar of graphics in Python, it is based on `ggplot2`. The grammar allows users to compose plots by explicitly mapping data to the visual objects that make up the plot._ -```{code-cell} ipython3 -#adding a total column -district_migration['TOTAAL'] = district_migration.sum(axis=1) -``` +The syntax of the package will feel _very familiar_ to users familiar with the R package ggplot, but might feel _odd_ for Python developers. 
+ +The main ingredients (data, geometry, aesthetics) of the `Grammar of Graphics` framework need to be defined to create a chart: ```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(10, 10)) -sns.heatmap(district_migration, annot=True, fmt=".1f", linewidths=.5, - cmap="PiYG", ax=ax, vmin=-20, vmax=20) -ylab = ax.set_ylabel("") -ax.set_title("Migration of Ghent districts", size=14) -``` +import plotnine as p9 -#### jointplot +myplot = (p9.ggplot(titanic) # 1. DATA + + p9.geom_bar( # 2. GEOMETRY, geom_* + stat='stat_summary', + mapping=p9.aes(x='Pclass', + y='Survived') # 3. AESTHETICS - relate variables to geometry + ) +) -+++ +myplot +``` -[jointplot](https://seaborn.pydata.org/generated/seaborn.jointplot.html#seaborn.jointplot) provides a very convenient function to check the combined distribution of two variables in a DataFrame (bivariate plot) +And further customization (_layers_) can be added to the specification, e.g. -+++ +```{code-cell} ipython3 +import plotnine as p9 -Using the default options on the flow_data dataset +myplot = (p9.ggplot(titanic) # 1. DATA + + p9.geom_bar( # 2. GEOMETRY, geom_* + stat='stat_summary', + mapping=p9.aes(x='Pclass', + y='Survived') # 3. AESTHETICS - relate variables to geometry + ) + + p9.xlab("Cabin class") # labels + + p9.theme_minimal() # theme + # ... +) -```{code-cell} ipython3 -g = sns.jointplot(data=flow_data, - x='LS06_347', y='LS06_348') +myplot ``` -```{code-cell} ipython3 -g = sns.jointplot(data=flow_data, - x='LS06_347', y='LS06_348', - kind="reg", space=0) -``` +As Plotnine is built on top of Matplotlib, one can still retrieve the Matplotlib `Figure` object from Plotnine for eventual customization. 
-more options, applied on the migration data set: +The trick is to use the `.draw()` function in Plotnine: ```{code-cell} ipython3 -g = sns.jointplot(data=district_migration.transpose(), - x='Oud Gentbrugge', y='Nieuw Gent - UZ', - kind="kde", height=7, space=0) # kde +my_plt_version = myplot.draw(); # extract as Matplotlib Figure + +# Do some Matplotlib magick... +my_plt_version.axes[0].set_title("Titanic fare price per cabin class") +ax2 = my_plt_version.add_axes([0.7, 0.5, 0.3, 0.3], label="ax2") ``` -
    +
    - Notice!: +**REMEMBER** -
      -
    • Watch out with the interpretation. The representations (`kde`, `regression`) is based on a very limited set of data points!
    • -
    -
    +- If you are already familiar to ggplot in R, the conversion of Plotnine will be easy. +- Plotnine is based on Matplotlib, making further customization possible as we have seen before.
    +++ -Adding the data points itself, provides at least this info to the user: +### Altair + +> *[Altair](https://altair-viz.github.io/) is a declarative statistical visualization library for Python, based on Vega-Lite.* + +```{code-cell} ipython3 +:tags: [] + +import altair as alt +``` + +Altair implements the `Grammar of Graphics` with the same main ingredients, but a different syntax: ```{code-cell} ipython3 -g = (sns.jointplot( - data=district_migration.transpose(), - x='Oud Gentbrugge', y='Nieuw Gent - UZ', - kind="scatter", height=7, space=0, stat_func=None, - marginal_kws=dict(bins=20, rug=True) - ).plot_joint(sns.kdeplot, zorder=0, - n_levels=5, cmap='Reds')) -g.savefig("my_great_plot.pdf") +(alt.Chart(titanic) # 1. DATA + .mark_bar() # 2. GEOMETRY, geom_* + .encode( # 3. AESTHETICS - relate variables to geometry + x=alt.X('Pclass:O', + axis=alt.Axis(title='Cabin class')), + y=alt.Y('mean(Survived):Q', + axis=alt.Axis(format='%', + title='Survival rate')) + ) +) ``` -#### jointplot +When encoding the variables for the chosen geometry, Altair provides a specific syntax on the data type of each variable. For information on this `...:Q`, `...:N`,`...:O`, see the [data type section](https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types) of the documentation: + +Data Type | Shorthand Code | Description +----------|-----------------|--------------- +quantitative | Q | a continuous real-valued quantity +ordinal | O | a discrete ordered quantity +nominal | N | a discrete unordered category +temporal | T | a time or date value +++ -With [catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html) and [relplot](https://seaborn.pydata.org/generated/seaborn.relplot.html#seaborn.relplot), Seaborn provides similarities with the Grammar of Graphics +Altair is made for the web, providing interactive features for the plots. See more examples [here](https://altair-viz.github.io/gallery/index.html#interactive-charts). 
```{code-cell} ipython3 -sns.catplot(data=titanic, x="Survived", - col="Pclass", kind="count") +brush = alt.selection(type='interval') + +(alt.Chart(titanic) + .mark_circle().encode( + x="Fare:Q", + y="Age:Q", + column="Sex:O", + color=alt.condition(brush, "Pclass:N", alt.value('grey')), +).add_selection(brush)) ```
    - Remember - Check the package galleries: +**Remember** - -
    +- Altair provides a pure-Python Grammar of Graphics implementation +- Altair is built on top of the Vega-Lite visualization grammar, which can be interpreted as a language to specify a graph (from data to figure). +- Altair easily integrates with web-technology (HTML/Javascript).
    +++ -## Interactivity and the web matter these days! +## Interactivity and the web + ++++ + +Whereas Matplotlib/Seaborn/Plotnine are packages to create static charts, the charts created by Altair are mainly targeted to __integrate in websites and web applications__. + +With the increasing interest for interactive data visualization and dashboards, other packages were designed to fulfill this requirement. Both the [Bokeh](https://bokeh.org/) package and the [Plotly](https://plotly.com/python/) package can be used as a stand-alone data visualization tool or as part of web applications and dashboards. + ++++ + +__Note:__ Bokeh and Plotly are also the components for some packages to build interactive web applications, respectively [Panel](https://panel.holoviz.org/) and [Dash](https://dash.plotly.com/). +++ @@ -367,13 +369,13 @@ sns.catplot(data=titanic, x="Survived", +++ -> *[Bokeh](https://bokeh.pydata.org/en/latest/) is a Python interactive visualization library that targets modern web browsers for presentation* +> *[Bokeh](https://bokeh.pydata.org/en/latest/) is a Python interactive visualization library that targets modern web browsers for presentation*. ```{code-cell} ipython3 from bokeh.plotting import figure, output_file, show ``` -By default, Bokeh will open a new webpage to plot the figure. Still, the **integration with notebooks** is provided as well: +By default, Bokeh will open a new webpage to plot the figure. Still, an **integration with Jupyter notebooks** is provided: ```{code-cell} ipython3 from bokeh.io import output_notebook @@ -385,17 +387,16 @@ output_notebook() ```{code-cell} ipython3 p = figure() -p.line(x=[1, 2, 3], y=[4,6,2]) +p.line(x=[1, 2, 3], + y=[4,6,2]) show(p) ```
    - Notice!: +__Warning__ -
      -
    • Bokeh does not support eps, pdf export of plots directly. Exporting to svg is available but still limited, see documentation
    • . -
    +- Bokeh does not support eps, pdf export of plots directly. Exporting to svg is available but still limited, see documentation.
    @@ -405,32 +406,32 @@ To accomodate the users of **Pandas**, a `pd.DataFrame` can also be used as the ```{code-cell} ipython3 from bokeh.models import ColumnDataSource -source_data = ColumnDataSource(data=flow_data) -``` -```{code-cell} ipython3 -flow_data.head() +flow_data = pd.read_csv("data/vmm_flowdata.csv", parse_dates=True, index_col=0) + +source_data = ColumnDataSource(data=flow_data) ``` Useful to know when you want to use the index as well: > *If the DataFrame has a named index column, then CDS will also have a column with this name. However, if the index name (or any subname of a MultiIndex) is None, then the CDS will have a column generically named index for the index.* ```{code-cell} ipython3 -p = figure(x_axis_type="datetime", plot_height=300, plot_width=900) -p.line(x='Time', y='L06_347', source=source_data) +p = figure(x_axis_type="datetime", plot_height=200, plot_width=600) +p.line(x='Time', y='LS06_347', source=source_data) show(p) ``` -The setup of the graph, is by adding new elements to the figure object, e.g. adding annotations: +Bokeh has lots of functionalities to adjust and customize charts, e.g. 
by adding new annotations to the figure object: ```{code-cell} ipython3 from bokeh.models import ColumnDataSource, BoxAnnotation, Label ``` ```{code-cell} ipython3 -p = figure(x_axis_type="datetime", plot_height=300, plot_width=900) +p = figure(x_axis_type="datetime", plot_height=200, plot_width=600) p.line(x='Time', y='L06_347', source=source_data) -p.circle(x='Time', y='L06_347', source=source_data, fill_alpha= 0.3, line_alpha=0.3) +p.circle(x='Time', y='L06_347', source=source_data, + fill_alpha= 0.3, line_alpha=0.3) alarm_box = BoxAnnotation(bottom=10, fill_alpha=0.3, fill_color='#ff6666') # arbitrary value; this is NOT the real-case value @@ -443,150 +444,121 @@ p.add_layout(alarm_label) show(p) ``` -Also [this `jointplot`](https://demo.bokehplots.com/apps/selection_histogram) and [this gapminder reproduction](https://demo.bokehplots.com/apps/gapminder) is based on Bokeh! +### hvplot/holoviews +++ -
-__More Bokeh?__
    +> hvPlot provides an alternative for the static plotting API provided by Pandas and other libraries, with an interactive Bokeh-based plotting API that supports panning, zooming, hovering, and clickable/selectable legends +++ -### Plotly +Similar to Matplotlib, Bokeh is a low-level package. Whereas Matplotlib provides the building blocks to define static plots, Bokeh provides the building blocks to create interactive visualizations. -+++ +Just as Seaborn provides an abstraction on top of the Matplotlib package, the [hvplot](https://hvplot.holoviz.org/index.html) and [Holoviews](http://holoviews.org/index.html) packages provide an abstraction on top of Bokeh, i.e. plots with less code. -> [plotly.py](https://plot.ly/python/) is an interactive, browser-based graphing library for Python +_Actually, hvplot is built on top of Holoviews, which is built on top of Bokeh_ ```{code-cell} ipython3 -import plotly -``` +import hvplot.pandas -In the last years, plotly has been developed a lot and provides now a lot of functionalities for interactive plotting, see https://plot.ly/python/#fundamentals. It consists of two main components: __plotly__ provides all the basic components (so called `plotly.graph_objects`) to create plots and __plotly express__ provides a more high-level wrapper around `plotly.graph_objects` for rapid data exploration and figure generation. The latter focuses on _tidy_ data representation. 
-
-As an example: create a histogram using the plotly `graph_objects`:
-
-```{code-cell} ipython3
-import plotly.graph_objects as go
-
-fig = go.Figure(data=[go.Histogram(x=titanic['Fare'].values)])
-fig.show()
+flow_data.hvplot()
```

-Can be done in plotly express as well, supporting direct interaction with a Pandas DataFrame:
+The link between hvplot/holoviews and Bokeh (for further adjustments) can be made using the `render` function:

```{code-cell} ipython3
-import plotly.express as px
+import holoviews as hv

-fig = px.histogram(titanic, x="Fare")
-fig.show()
+fig = hv.render(flow_data.hvplot())
+type(fig)
```
-__Notice!__
-
-- Prior versions of plotly.py contained functionality for creating figures in both "online" and "offline" modes. Version 4 of plotly is "offline"-only. Make sure you check the latest documentation and watch out with outdated stackoverflow suggestions. The previous commercial/online version is rebranded into chart studio.
+Similar advice as with Matplotlib applies: "do as much as you easily can in your convenience layer of choice [e.g. hvplot, GeoViews, Holoviews], use Bokeh for customization."
+
+**More Bokeh?**
+
+- Try the quickstart notebook yourself and check the tutorials.
+- Check the Bokeh package gallery.
+- Documentation is very extensive...
    +++ -As mentioned in the example, the interaction of plotly with Pandas is supported: +### Plotly +++ -.1. Indirectly, by using the `plotly` specific [dictionary](https://plot.ly/python/creating-and-updating-figures/#figures-as-dictionaries) syntax: +> _[plotly.py](https://plot.ly/python/) is an interactive, browser-based graphing library for Python_ ```{code-cell} ipython3 -import plotly.graph_objects as go - -df = flow_data[["L06_347", "LS06_348"]] - -fig = go.Figure({ - "data": [{'x': df.index, - 'y': df[col], - 'name': col} for col in df.columns], # remark, we use a list comprehension here ;-) - "layout": {"title": {"text": "Streamflow data"}} -}) -fig.show() +import plotly ``` -.2. or using the `plotly` object oriented approach with [graph objects](https://plot.ly/python/creating-and-updating-figures/#figures-as-graph-objects): +Similar to Bokeh, Plotly provides a lot of building blocks for interactive plotting, see https://plot.ly/python/#fundamentals. It consists of two main components: __plotly__ provides all the basic components (so called `plotly.graph_objects`) to create plots and __plotly express__ provides a high-level wrapper/abstraction around `plotly.graph_objects` for rapid data exploration and figure generation. The latter focuses more on _tidy_ data representation. + +As an example: create a our example plot using the plotly `graph_objects`: ```{code-cell} ipython3 -df = flow_data[["L06_347", "LS06_348"]] +import plotly.graph_objects as go + +survival_rate = titanic.groupby("Pclass")['Survived'].mean().reset_index() fig = go.Figure() -for col in df.columns: - fig.add_trace(go.Scatter( - x=df.index, - y=df[col], - name=col)) - -fig.layout=go.Layout( - title=go.layout.Title(text="Streamflow data") - ) +fig.add_trace(go.Bar( + x=survival_rate["Pclass"], + y=survival_rate["Survived"]) +) +fig.update_xaxes(type='category') +fig.update_layout( + xaxis_title="Cabin class", + yaxis_title="Survival rate") fig.show() ``` -.3. 
or using the `plotly express` functionalities: - -```{code-cell} ipython3 -df = flow_data[["L06_347", "LS06_348"]].reset_index() # reset index, as plotly express can not use the index directly -df = df.melt(id_vars="Time") # from wide to long format -df.head() -``` - -As mentioned, plotly express targets __tidy__ data (cfr. plotnine,...), so we converted the data to tidy/long format before plotting: +Similar to other high-level interfaces, this can be done by `Plotly Express` as well, supporting direct interaction with a Pandas `DataFrame`: ```{code-cell} ipython3 import plotly.express as px -fig = px.line(df, x='Time', y='value', color="variable", title="Streamflow data") -fig.show() +# plotly express does not provide the count statistics out of the box, so calculating these +survival_rate = titanic.groupby("Pclass")['Survived'].mean().reset_index() + +px.bar(survival_rate, x="Pclass", y="Survived") ``` -.4. or by installing an additional package, `cufflinks`, which enables Pandas plotting with `iplot` instead of `plot`: +
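The `groupby("Pclass")["Survived"].mean()` aggregation feeding these bar charts can be checked on its own. A minimal sketch with a made-up stand-in for the Titanic table (the values below are illustrative, not the real data set):

```python
import pandas as pd

# Made-up stand-in for the titanic DataFrame used in the notebook
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Survival rate per cabin class, the input for the bar charts
survival_rate = titanic.groupby("Pclass")["Survived"].mean().reset_index()
print(survival_rate)
```

The `reset_index()` call turns the grouped result back into a plain two-column DataFrame, which is the tidy shape both plotly express and `go.Bar` expect.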
-```{code-cell} ipython3
-import cufflinks as cf
-
-df = flow_data[["L06_347", "LS06_348"]]
-fig = df.iplot(kind='scatter', asFigure=True)
-fig.show()
-```
-
-`cufflinks` applied on the data set of district migration:

+**Notice!**
+
+Prior versions of plotly.py contained functionality for creating figures in both "online" and "offline" modes. Since version 4 plotly is "offline"-only. Make sure you check the latest documentation and watch out with outdated Stackoverflow suggestions. The previous commercial/online version is rebranded into chart studio.
    + ++++ + +The main interface to use plotly with Pandas is using Plotly Express: ```{code-cell} ipython3 -district_migration.transpose().iplot(kind='box', asFigure=True).show() +df = flow_data.reset_index() + +fig = px.line(flow_data.reset_index(), x="Time", y=df.columns, + hover_data={"Time": "|%B %d, %Y"} + ) +fig.show() ```
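Plotly express also accepts tidy/long data; the wide-to-long reshaping step can be sketched with plain Pandas, without plotly installed. The station names and values below are made up to stand in for the flow data:

```python
import pandas as pd

# Made-up stand-in for the wide-format flow data: one column per measurement station
wide = pd.DataFrame({
    "Time": pd.to_datetime(["2013-01-01", "2013-01-02"]),
    "L06_347": [0.3, 0.4],
    "LS06_348": [0.5, 0.6],
})

# Tidy/long format: one observed value per row, with the station as its own column
tidy = wide.melt(id_vars="Time", var_name="station", value_name="flow")
print(tidy)
```

The long table can then be plotted with a single `color="station"` mapping instead of one trace per column.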
-__Plotly__
-
-- Check the package gallery for plot examples.
-- Plotly express provides high level plotting functionalities and plotly graph objects the low level components.
-- More information about the cufflinks connection with Pandas is available here.

+Similar advice as with Matplotlib/Bokeh applies: "do as much as you easily can in your convenience layer of choice [e.g. plotly express]."
+
+**More plotly?**
+
+- Check the package gallery for plot examples.
+- Plotly express provides high-level plotting functionalities and plotly graph objects the low-level components.
    @@ -603,87 +575,46 @@ Both plotly and Bokeh provide interactivity (sliders,..), but are not the full e +++ -## You like web-development and Javascript? +## Change the default Pandas plotting backend +++ -### Altair - -> *[Altair](https://altair-viz.github.io/) is a declarative statistical visualization library for Python, based on Vega-Lite.* +During the course, we mostly used the `.plot()` method of Pandas to create charts, which relied on Matplotlib. Matplotlib is the default back-end for Pandas to create plots: ```{code-cell} ipython3 -import altair as alt -``` - -Reconsider the titanic example of the start fo this notebook: - -```{code-cell} ipython3 -fig, ax = plt.subplots() -plt.style.use('ggplot') -survival_rate = titanic.groupby("Pclass")['Survived'].mean() -survival_rate.plot(kind='bar', color='grey', - rot=0, figsize=(6, 4), ax=ax) -ylab = ax.set_ylabel("Survival rate") -xlab = ax.set_xlabel("Cabin class") +pd.options.plotting.backend = 'matplotlib' +flow_data.plot() ``` -Translating this to `Altair` syntax: +However, both Holoviews/hvplot and Plotly can be used as Pandas back-end for plotting, by defining the `pd.options.plotting.backend` variable: ```{code-cell} ipython3 -alt.Chart(titanic).mark_bar().encode( - x=alt.X('Pclass:O', axis=alt.Axis(title='Cabin class')), - y=alt.Y('mean(Survived):Q', - axis=alt.Axis(format='%', - title='survival_rate')) -) +pd.options.plotting.backend = 'holoviews' +flow_data.plot() ``` -Similar to `plotnine` with `aesthetic`, expressing the influence of a varibale on the plot building can be `encoded`: - ```{code-cell} ipython3 -alt.Chart(titanic).mark_bar().encode( - x=alt.X('Pclass:O', axis=alt.Axis(title='Cabin class')), - y=alt.Y('mean(Survived):Q', - axis=alt.Axis(format='%', - title='survival_rate')), - column="Sex" -) +pd.options.plotting.backend = 'plotly' +flow_data.plot() ``` -The typical ingedrients of the **grammar of graphics** are available again: +
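The backend switch itself is just a regular Pandas option. A small sketch that reads the default and guards the switch, since assigning a backend validates that the corresponding package is importable (plotly/hvplot may not be installed):

```python
import pandas as pd

# The plotting backend is a regular Pandas option; 'matplotlib' on a default install
default_backend = pd.options.plotting.backend
print(default_backend)

# Setting the option checks that the backend package can be imported,
# so guard the switch when plotly may be missing
try:
    pd.options.plotting.backend = "plotly"
except Exception:
    pass  # plotly not available; the option keeps its previous value

pd.options.plotting.backend = default_backend  # restore for the rest of the notebook
```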
    -```{code-cell} ipython3 -(alt.Chart(titanic) # Link with the data - .mark_circle().encode( # defining a geometry - x="Fare:Q", # provide aesthetics by linking variables to channels - y="Age:Q", - column="Pclass:O", - color="Sex:N", -)) -# scales,... can be adjusted as well -``` +**Remember** -For information on this `...:Q`, `...:N`,`...:O`, see the [data type section](https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types) of the documentation: +To get an interactive version of a plot created with Pandas, switch the `pd.options.plotting.backend` to `'holoviews'` or `'plotly'`. -Data Type | Shorthand Code | Description -----------|-----------------|--------------- -quantitative | Q | a continuous real-valued quantity -ordinal | O | a discrete ordered quantity -nominal | N | a discrete unordered category -temporal | T | a time or date value +
+++

-__Remember__
-
-- Altair provides a pure-Python Grammar of Graphics implementation!
-- Altair is built on top of the Vega-Lite visualization grammar, which can be interpreted as a language to specify a graph (from data to figure).
-- Altair easily integrates with web-technology (HTML/Javascript).

+**Warning**
+
+When saving Jupyter notebooks with interactive visualizations in the output of multiple cells, the file size will increase a lot, making these files less suitable for version control.
+
+Consider saving your notebook with the outputs cleared (Menu > `Kernel` > `Restart kernel and clear all outputs...`) or automate this with a tool like [nbstripout](https://pypi.org/project/nbstripout/).
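One way to automate the clearing (assuming `nbstripout` is pip-installed in the environment) is registering it as a Git filter via `nbstripout --install`; roughly, this writes configuration along these lines (a sketch — the exact `clean` command differs per install):

```text
# .gitattributes
*.ipynb filter=nbstripout

# .git/config
[filter "nbstripout"]
    clean = nbstripout
    smudge = cat
```

With the filter active, outputs are stripped from `.ipynb` files at commit time while the working copies keep their outputs.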
@@ -693,7 +624,7 @@ temporal | T | a time or date value

+++

-When you're working with a lot of records, the visualization of the individual points does not always make sense as there are simply to many dots overlapping eachother (check [this](https://bokeh.github.io/datashader-docs/user_guide/1_Plotting_Pitfalls.html) notebook for a more detailed explanation).
+When you're working with a lot of records, the visualization of the individual points does not always make sense as there are simply too many dots overlapping each other (check [this](https://bokeh.github.io/datashader-docs/user_guide/1_Plotting_Pitfalls.html) notebook for a more detailed explanation).

+++

@@ -706,30 +637,31 @@ Working with such a data set on a local machine is not straightforward anymore,

+++

-The package [datashader](https://bokeh.github.io/datashader-docs/index.html) provides a solution for this size of data sets and works together with other packages such as `bokeh` and `holoviews`.
+The package [datashader](https://bokeh.github.io/datashader-docs/index.html) provides a solution for this size of data sets and works together with other packages such as `Bokeh` and `Holoviews`.

+++

-We download just a single year (e.g. 2018) of data from [the gull data set](zenodo.org/record/3541812#.XfZYcNko-V6) and store it in the `data` folder. The 2018 data file has around 4.8 million records.
+The data from [the gull data set](https://zenodo.org/record/3541812#.XfZYcNko-V6) is downloaded and stored in the `data` folder and is not part of the Github repository.
For example, downloading the [2018 data set](https://zenodo.org/record/3541812/files/HG_OOSTENDE-acceleration-2018.csv?download=1) from Zenodo: ```{code-cell} ipython3 import pandas as pd, holoviews as hv from colorcet import fire -from datashader.geo import lnglat_to_meters +from datashader.utils import lnglat_to_meters from holoviews.element.tiles import EsriImagery from holoviews.operation.datashader import rasterize, shade -df = pd.read_csv('../data/HG_OOSTENDE-gps-2018.csv', usecols=['location-long', 'location-lat']) -df.columns = ['longitude', 'latitude'] -df.loc[:,'longitude'], df.loc[:,'latitude'] = lnglat_to_meters(df.longitude, df.latitude) +df = pd.read_csv('data/HG_OOSTENDE-gps-2018.csv', nrows=1_000_000, # for the live demo on my laptop, I just use 1_000_000 points + usecols=['location-long', 'location-lat', 'individual-local-identifier']) +df.loc[:,'location-long'], df.loc[:,'location-lat'] = lnglat_to_meters(df["location-long"], df["location-lat"]) ``` ```{code-cell} ipython3 hv.extension('bokeh') -map_tiles = EsriImagery().opts(alpha=1.0, width=800, height=800, bgcolor='black') -points = hv.Points(df, ['longitude', 'latitude']) -rasterized = shade(rasterize(points, x_sampling=1, y_sampling=1, width=800, height=800), cmap=fire) +map_tiles = EsriImagery().opts(alpha=1.0, width=600, height=600, bgcolor='black') +points = hv.Points(df, ['location-long', 'location-lat']) +rasterized = shade(rasterize(points, x_sampling=1, y_sampling=1, + width=600, height=600), cmap=fire) map_tiles * rasterized ``` @@ -760,12 +692,12 @@ map_tiles * rasterized +++ -
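The `lnglat_to_meters` helper projects WGS84 longitude/latitude to Web Mercator coordinates. The same standard projection formula can be sketched with the standard library only (no datashader required); the Ostend coordinates below are approximate:

```python
import math

def lnglat_to_meters(longitude, latitude):
    """Project WGS84 longitude/latitude (degrees) to Web Mercator x/y (meters)."""
    origin_shift = math.pi * 6378137  # half the equatorial circumference of the earth
    easting = longitude * origin_shift / 180.0
    northing = math.log(math.tan((90 + latitude) * math.pi / 360.0)) * origin_shift / math.pi
    return easting, northing

# Ostend, Belgium -- roughly where the gull colony is located
x, y = lnglat_to_meters(2.92, 51.23)
print(x, y)
```

This is the coordinate system the Esri/OpenStreetMap web tiles use, which is why the points have to be reprojected before overlaying them on `EsriImagery`.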
 More alternatives for large data set visualisation that are worthwhile exploring:
 
-- vaex, which also provides on the fly binning or aggregating the data on a grid to be represented.
+- vaex has a Pandas-like syntax and also provides on the fly binning or aggregating the data on a grid to be represented.
 - Glumpy and Vispy, which both rely on OpenGL to achieve high performance.
    @@ -789,11 +721,24 @@ or check the interactive version [here](https://rougier.github.io/python-visuali +++ -further reading: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833 +Further reading: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833 + ++++ + +
+__Remember__
+
+Check the package galleries:
+
+- Matplotlib gallery.
+- Seaborn gallery.
+- Plotnine gallery and R ggplot2 gallery.
+- An overview based on the type of graph using Python is given here.
    +++ ## Acknowledgements -https://speakerdeck.com/jakevdp/pythons-visualization-landscape-pycon-2017 +- https://speakerdeck.com/jakevdp/pythons-visualization-landscape-pycon-2017