CSC108-Fall-2022-A3

Assignment 3 of CSC108, University of Toronto, Fall 2022

Hypertension and Low Income in Toronto Neighbourhoods

Goals of this Assignment

In this assignment, you will practise working with files, building and using dictionaries, designing functions using the Function Design Recipe, reading documentation, and writing unit tests.

Background

A commonly-held belief is that an individual's health is largely influenced by the choices they make. However, there is lots of evidence that health is affected by systemic factors.

Health researchers often study the relationships between an individual's health outcomes and factors related to their physical environment, social and economic situations, and geographic location.

In this assignment, you will write code to assist with analyzing data on the relationship between hypertension (also known as high blood pressure) and income levels in Toronto neighbourhoods. The data you will work with is real data; however, we have simplified it somewhat to make this assignment clearer for you.

A note on math and stats

The data analysis that your code will do will include some statistical analysis that we have not talked about in the course. You do NOT need to understand the underlying statistics to complete this assignment. The code you write will do some simple mathematical operations, like adding up some numbers, or finding ratios using division. We will use Pearson correlation for the more advanced analysis and you will use existing functions that we have imported for you. You will need to take a look at the examples of these functions in order to figure out what arguments you need to pass to them, and what types of data they return, but you do not need to understand how they work in any detail.

Correlation is a single coefficient expressing the tendency of one set of data to grow linearly, in the same or opposite direction, with another set of data. It is computed by checking whether points that have been paired between the two sets are similarly greater or less than their own set's average. For example, to check whether age is correlated with height for students in the class, we would have two sets of data: ages (which we could express as, say, number of weeks old for finer granularity) and heights. The numbers in each set are ordered in the same way, so that each height value corresponds to the age value for the same student. A nice property of the correlation metric we are using is that it is normalized to lie between -1 and 1, and these values have a simple interpretation. A value of 1 means that the paired points lie on a straight line: in our example, a given increase in age always comes with a consistent increase in height. Similarly, a value of -1 describes the same relationship with the direction flipped, where older students would be shorter than younger ones. Finally, a value of 0 means that there is no consistent linear increase or decrease in height for a change in age. We will use this to investigate whether low income rates and hypertension rates tend to increase or decrease together.
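To make the idea concrete, here is a minimal sketch of how a Pearson correlation coefficient can be computed by hand. It is only an illustration: the assignment provides imported functions for this, and the function name `pearson_correlation` and the age and height values below are made up for this example.

```python
import math


def pearson_correlation(xs: list[float], ys: list[float]) -> float:
    """Return the Pearson correlation coefficient of the paired values
    in xs and ys. Assumes both lists have the same length (at least 2)
    and that neither list is constant.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of each pair's deviations from its own set's average.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations, used to normalize into the range -1 to 1.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


# Made-up data: each height is paired with the age at the same index.
ages_in_weeks = [950, 1000, 1050, 1100]
heights_cm = [160, 165, 170, 175]

# Heights rise steadily with age, so the coefficient is close to 1.
r_up = pearson_correlation(ages_in_weeks, heights_cm)
# Reversing the heights flips the direction, giving a value close to -1.
r_down = pearson_correlation(ages_in_weeks, heights_cm[::-1])
```

Real data rarely falls exactly on a line, so in practice the coefficient usually lands somewhere strictly between -1 and 1.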

If you are a statistics person, keep in mind that the learning goals of the assignment are about writing code using what we've learned in the course, not about doing a proper statistical analysis :)

Dataset descriptions

Each data file used in this assignment relates to one of the two variables of interest (i.e. hypertension data or income data). The files are CSV (comma separated values) files, where each column in a line is separated by a comma. You can assume there are no commas anywhere else in the files, other than to separate columns, and that any file given is in the correct format. The two file types are described below.

Neighbourhood hypertension data files

The first row in a neighbourhood hypertension file contains header information, and the remaining rows each contain data relating to hypertension prevalence in a particular Toronto neighbourhood.

Here is a description of the different columns of the dataset:

| Column index | Description |
| --- | --- |
| 0 | An ID that uniquely identifies each neighbourhood. |
| 1 | The name of the neighbourhood. Neighbourhood names are unique. |
| 2 | The number of people aged 20 to 44 with hypertension in the neighbourhood. |
| 3 | The total number of people aged 20 to 44 in the neighbourhood. |
| 4 | The number of people aged 45 to 64 with hypertension in the neighbourhood. |
| 5 | The total number of people aged 45 to 64 in the neighbourhood. |
| 6 | The number of people aged 65 and older with hypertension in the neighbourhood. |
| 7 | The total number of people aged 65 and older in the neighbourhood. |
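As a rough sketch of how this column layout could be read, the example below parses a small made-up file held in memory. The function name `read_hypertension_rows`, the header text, and the neighbourhood rows are all hypothetical; they are not taken from the starter code or the real data files.

```python
import io

# Made-up file contents that follow the column layout described above.
sample_file = io.StringIO(
    "id,name,20-44 with,20-44 total,45-64 with,45-64 total,65+ with,65+ total\n"
    "1,Example Park,10,100,20,80,30,60\n"
    "2,Sample Heights,5,50,15,60,25,40\n"
)


def read_hypertension_rows(f) -> dict[str, dict]:
    """Return a dictionary mapping each neighbourhood name in the open
    file f to its id and its six hypertension numbers (columns 2 to 7).
    """
    f.readline()  # skip the header row
    result = {}
    for line in f:
        columns = line.strip().split(',')
        result[columns[1]] = {
            'id': int(columns[0]),
            # Columns 2 through 7 inclusive, converted from str to int.
            'hypertension': [int(value) for value in columns[2:8]],
        }
    return result


data = read_hypertension_rows(sample_file)
```

Splitting on commas is safe here only because the files are guaranteed to contain no commas other than column separators.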

Neighbourhood income data files

The first row in a neighbourhood income data file contains header information, and the remaining rows each contain data about low income status.

Here is a description of the different columns of the dataset:

| Column index | Description |
| --- | --- |
| 0 | An ID that uniquely identifies each neighbourhood. |
| 1 | The name of the neighbourhood. Neighbourhood names are unique. |
| 2 | The total population in the neighbourhood. |
| 3 | The number of people in the neighbourhood with low income status. |

Neighbourhood names and ids are the same between our hypertension data files and our low income data files. However, the total population of a neighbourhood can be different between the two data files, as they were collected at different times.
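One possible way to combine the two file types is sketched below: low income rows are added to a dictionary that already holds hypertension data for the same neighbourhood. The function name `add_low_income_rows`, the header text, and all values are hypothetical. Note that this helper is written to mutate the dictionary it is given.

```python
import io

# Made-up low income file contents in the column layout described above.
income_file = io.StringIO(
    "id,name,total population,low income\n"
    "1,Example Park,250,50\n"
)

# A made-up dictionary that already holds hypertension data.
city_data = {
    'Example Park': {'id': 1, 'hypertension': [10, 100, 20, 80, 30, 60]},
}


def add_low_income_rows(f, data: dict[str, dict]) -> None:
    """Modify data so that it includes the low income data from the
    open file f. Neighbourhoods not yet in data are added with their id.
    """
    f.readline()  # skip the header row
    for line in f:
        columns = line.strip().split(',')
        # Reuse the existing entry if there is one, else start a new one.
        entry = data.setdefault(columns[1], {'id': int(columns[0])})
        entry['total'] = int(columns[2])
        entry['low_income'] = int(columns[3])


add_low_income_rows(income_file, city_data)
```

In this made-up data the total population from the income file (250) does not equal the sum of the age group totals from the hypertension data (240), which mirrors the point that the two files were collected at different times.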

The CityData Type

The code you will write for this assignment will build and then use a dictionary that contains hypertension and low income data about neighbourhoods in a city. This section describes the format of that dictionary.

Key/value pairs in a CityData dictionary

Each key in a CityData dictionary is a string representing the name of a neighbourhood. As is necessary for dictionary keys, all neighbourhood names will be unique.

The values in a CityData dictionary are dictionaries containing information about a neighbourhood. These inner dictionaries contain specific keys that label a neighbourhood's data.

Format of the inner dictionaries

A dictionary that is a value in a dictionary of type CityData has the following key/value pairs:

| Key | Type | Value |
| --- | --- | --- |
| 'id' | int | The id number of this neighbourhood. |
| 'total' | int | The total population of this neighbourhood, as given in the low income data file. |
| 'low_income' | int | The number of people in this neighbourhood who are classified as low income. |
| 'hypertension' | list[int] | A list of the hypertension data of this neighbourhood. This list will have length exactly 6, and the values will be the numbers from columns 2 through 7 as described in the section above on neighbourhood hypertension data files. |
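For illustration, a small dictionary in this format might look like the sketch below. The neighbourhood names and all numbers are made up; this is not the provided SAMPLE_DATA dictionary.

```python
# A made-up dictionary in the CityData format.
example_city_data = {
    'Example Park': {
        'id': 1,
        'total': 250,
        'low_income': 50,
        # [20-44 with, 20-44 total, 45-64 with, 45-64 total, 65+ with, 65+ total]
        'hypertension': [10, 100, 20, 80, 30, 60],
    },
    'Sample Heights': {
        'id': 2,
        'total': 120,
        'low_income': 30,
        'hypertension': [5, 50, 15, 60, 25, 40],
    },
}

info = example_city_data['Example Park']
# Even indices hold the counts of people with hypertension per age group.
people_with_hypertension = sum(info['hypertension'][0::2])
# Odd indices hold the total number of people per age group.
people_in_hypertension_data = sum(info['hypertension'][1::2])
```

Because the 'hypertension' list alternates between a count and a total for each age group, slicing with a step of 2 separates the two kinds of values.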

Age standardization

This section describes the process of age standardization that we will use in this assignment to perform a more accurate analysis. Note that we have given you a function that computes the age standardized rate from the raw rate (described in Task 3). This section is for your information only; we have already implemented this for you.

Our dataset will let us calculate the rate of hypertension in each Toronto neighbourhood. One complicating factor is that different neighbourhoods have different age demographics. For example, the Henry Farm neighbourhood has a significantly lower proportion of 65+ residents than Hillcrest Village. And because people aged 65+ have a higher overall rate of hypertension, this demographic difference alone would cause us to expect to see a difference in the overall hypertension rate between these neighbourhoods.

Because we care about the impact of low income status on hypertension rates, we want to remove the impact of different age demographics between the neighbourhoods. To do so, we will use a process called age standardization to calculate an adjusted hypertension rate that ignores differences in ages. This process involves the following steps for each neighbourhood:

  • First, we'll calculate the hypertension rate within each of the following age groups: 20-44, 45-64, and 65+. We'll report these rates as percentages, which you can think of as being "X cases of hypertension per 100 people aged 20-44".
  • Then, we'll pick one standard population with certain numbers of people in these age groups.
  • Then, we'll use the neighbourhood rates to calculate the hypothetical number of people in the standard population who would have hypertension. For example, if the rates for neighbourhood X were 20% of 20-44, 30% of 45-64, and 66% of 65+, the total number of people with hypertension in the standard population would be ...
  • Finally, divide this number of people with hypertension by the total size of the standard population, yielding a final percentage. This percentage is the age standardized rate for the neighbourhood.
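As a rough illustration of these steps, the sketch below applies the example rates (20%, 30%, and 66%) to a made-up standard population. The standard population sizes and the function name `age_standardized_rate` are assumptions for this example only. They are not the values or the function provided with the assignment.

```python
# HYPOTHETICAL standard population sizes for the age groups
# 20-44, 45-64, and 65+. The assignment's real values may differ.
EXAMPLE_STANDARD_POPULATION = [1000, 800, 200]


def age_standardized_rate(rates_percent: list[float],
                          standard_population: list[int]) -> float:
    """Return the age standardized rate, as a percentage, given one
    hypertension rate (in percent) per age group and the size of each
    age group in the standard population.
    """
    # Expected number of people with hypertension in each age group
    # if the standard population had this neighbourhood's rates.
    expected_cases = sum(
        rate * size / 100
        for rate, size in zip(rates_percent, standard_population)
    )
    # Divide by the total size of the standard population.
    return 100 * expected_cases / sum(standard_population)


# With these made-up sizes: 200 + 240 + 132 = 572 expected cases
# out of 2000 people, which is about 28.6%.
rate = age_standardized_rate([20, 30, 66], EXAMPLE_STANDARD_POPULATION)
```

Because every neighbourhood is measured against the same standard population, two neighbourhoods with identical rates in each age group get the same standardized rate, no matter how their real populations are split across age groups.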

Required Functions

In the starter code file a3.py, follow the Function Design Recipe to complete the functions described below.

You will need helper functions (i.e. functions you define yourself to be called in other functions) for some of the required functions, but likely not for all of them. Helper functions also require complete docstrings. We strongly recommend you also follow any suggestions about helper functions in the table below; we give you these hints to make your programming task easier.

For each of the functions below, other than the file reading functions in Task 1, write at least two examples in the docstring. You can use the provided SAMPLE_DATA dictionary, and you should also create another small CityData dictionary for examples and testing. If your helper function takes an open file as an argument, you do NOT need to write any examples in that function's docstring. Otherwise, for any helper functions you add, write at least two examples in the docstring.
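As an example of what two docstring examples might look like, here is a sketch of a small function written in that style. The function name `get_low_income_rate` and the `EXAMPLE_DATA` dictionary are hypothetical; they are not among the required functions or the provided constants.

```python
# A made-up dictionary in the CityData format, used in the examples below.
EXAMPLE_DATA = {
    'Example Park': {
        'id': 1, 'total': 250, 'low_income': 50,
        'hypertension': [10, 100, 20, 80, 30, 60],
    },
    'Sample Heights': {
        'id': 2, 'total': 120, 'low_income': 30,
        'hypertension': [5, 50, 15, 60, 25, 40],
    },
}


def get_low_income_rate(city_data: dict[str, dict], name: str) -> float:
    """Return the fraction of the population of the neighbourhood called
    name in city_data that has low income status.

    Precondition: name is a key in city_data.

    >>> get_low_income_rate(EXAMPLE_DATA, 'Example Park')
    0.2
    >>> get_low_income_rate(EXAMPLE_DATA, 'Sample Heights')
    0.25
    """
    info = city_data[name]
    # Only reads from city_data, so the argument is not mutated.
    return info['low_income'] / info['total']
```

Examples written this way can be checked automatically with Python's `doctest` module, for instance by calling `doctest.testmod()` from the file that defines the function.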

Your functions should not mutate their arguments, unless the description says that is what they do.

Assumptions

Assume the following about the data:

  • All neighbourhood ids and names are unique, and will appear the same in all data files. That is, no neighbourhood will have a different id between files, or a different name.
  • In all tasks except Task 1, the dictionary parameter will have both hypertension and low income data for every neighbourhood. That is, it will be a valid CityData dictionary.
  • All float values should be left as is; do not round any of them.

Using Constants

The starter code contains constants that you should use in your solution for the list indices and key identifiers for the CityData dictionary as well as the column numbers for the input files. You may add other constants if you wish.
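The sketch below shows the general idea of using constants in place of literal keys and indices. The constant names here are hypothetical and the starter code's actual names may differ, so use the ones defined in a3.py.

```python
# Hypothetical constant names, for illustration only.
ID = 'id'
TOTAL = 'total'
LOW_INCOME = 'low_income'
HT = 'hypertension'

# Hypothetical indices into the 'hypertension' list.
HT_20_44, TOTAL_20_44 = 0, 1
HT_45_64, TOTAL_45_64 = 2, 3
HT_65_UP, TOTAL_65_UP = 4, 5

# A made-up inner dictionary built with the constants.
entry = {ID: 1, TOTAL: 250, LOW_INCOME: 50, HT: [10, 100, 20, 80, 30, 60]}

# Named constants document what each key and index means.
seniors_with_hypertension = entry[HT][HT_65_UP]
seniors_total = entry[HT][TOTAL_65_UP]
```

Compared with writing `entry['hypertension'][4]`, the named version is easier to read and only needs to change in one place if the data layout changes.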

This project was completed using Wing IDE.