Data Management and Visualization: Making Data Management Decisions

For the week 3 assignment of the "Data Management and Visualization" course, I am to apply some data management techniques, which I describe in detail below.

First, I will code out missing data for some of the selected variables. After that, I will create a new variable whose values are based on the response codes of another variable. Lastly, I will create a secondary variable by multiplying two variables (in fact, the very same variables that were used in the lecture examples on the NESARC dataset). Before we start, let's recall the original question of the project: "Have ex-cigarette smokers smoked, on average, fewer cigarettes than current ones, prior to their quitting?", where "fewer" was given two meanings: "fewer as a count" and "fewer as briefer duration periods". Now, let us begin!

1. Coding out missing data

Before I started coding out the missing data, I made a copy of the dataframe on which to make the changes.
The variables I worked on were:
- "SMOKER" - replacing '3' values (Lifetime non-smoker) with NaN;
- "S3AQ3D1R" (Duration (days) of usual cigarette smoking) - replacing '99999' values (Unknown) with NaN;
- "S3AQ3C1" (Usual quantity when smoked cigarettes) - replacing '99' values (Unknown) with NaN;
- "S3AQ3B1" (Usual (daily) frequency when smoked cigarettes) - replacing '9' values (Unknown) with NaN.

All of the changes were made successfully!
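The replacements listed above could be sketched in pandas roughly as follows. This is only a minimal illustration: the toy rows and the names `data`, `sub` and `missing_codes` are my assumptions, not the actual NESARC data or the exact code used for the assignment.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NESARC data; the rows are made up for illustration.
data = pd.DataFrame({
    "SMOKER":   [1, 2, 3, 1],
    "S3AQ3D1R": [3650, 99999, 730, 1825],
    "S3AQ3C1":  [20, 10, 99, 5],
    "S3AQ3B1":  [1, 9, 4, 6],
})

# Work on a copy so the original frame stays untouched.
sub = data.copy()

# The response code treated as missing for each variable, as listed above.
missing_codes = {"SMOKER": 3, "S3AQ3D1R": 99999, "S3AQ3C1": 99, "S3AQ3B1": 9}

# Replace each code with NaN in its own column only.
for column, code in missing_codes.items():
    sub[column] = sub[column].replace(code, np.nan)

print(sub)
```

Keeping the codes in one dictionary makes it easy to check each variable against the codebook before running the loop.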

2. Creating a new variable with values based on response codes of a variable

As evident from the picture, I used a dictionary to assign the values of the new variable "USFREQMO" based on the existing values of the variable "S3AQ3B1".

Again, no issue occurred while making the changes!
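A recode of this kind might look like the sketch below. The days-per-month values in `recode` are an assumption made for illustration (the post does not list them), and the toy rows are made up.

```python
import numpy as np
import pandas as pd

# Toy frequency codes; NaN stands for an already coded-out "Unknown".
sub = pd.DataFrame({"S3AQ3B1": [1, 2, 4, 6, np.nan]})

# Assumed days-per-month value for each frequency code (illustrative only).
recode = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}

# map() looks up every code in the dictionary; NaN stays NaN.
sub["USFREQMO"] = sub["S3AQ3B1"].map(recode)

print(sub)
```

Any code that is absent from the dictionary also becomes NaN, so it is worth confirming that every valid response code has an entry.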

3. Creating a secondary variable by combining values from a couple of other variables

As evident from the picture, I multiplied the values of "USFREQMO" (usual frequency of smoking cigarettes per month) and "S3AQ3C1" (Usual quantity when smoked cigarettes) to create a variable "NUMCIGMO_EST" which indicates the estimated number of cigarettes smoked per month.
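The multiplication described above can be sketched as a single column-wise product. The rows below are invented for illustration; only the variable names come from the post.

```python
import numpy as np
import pandas as pd

# Toy rows: smoking days per month and usual cigarettes per smoking day.
sub = pd.DataFrame({
    "USFREQMO": [30.0, 5.0, np.nan],
    "S3AQ3C1":  [20.0, np.nan, 10.0],
})

# Element-wise product; a NaN in either input gives NaN in the result.
sub["NUMCIGMO_EST"] = sub["USFREQMO"] * sub["S3AQ3C1"]

print(sub)
```

Because NaN propagates through the product, respondents with an unknown frequency or quantity get no estimate rather than a misleading one.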

After concluding these steps, I made a couple of other data management decisions.

Firstly, I subset the data to only a few chosen variables, as shown in the picture below:



Secondly, I removed from the new set all records with NaN values for the variable "SMOKER". These records held the value '3' (Lifetime non-smoker) for that variable before step 1 of this article. Because our question concerns only current and ex-cigarette smokers, the data for lifetime non-smokers is unnecessary for the project.
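Both decisions, keeping a few columns and then dropping the lifetime non-smokers, could be sketched as below. The list in `keep`, the extra column `OTHER` and the toy rows are assumptions for illustration; the post does not list the exact variables that were kept.

```python
import numpy as np
import pandas as pd

# Toy frame after the earlier steps; "OTHER" stands for any unneeded column.
sub = pd.DataFrame({
    "SMOKER":       [1, 2, np.nan, 1],
    "S3AQ3C1":      [20.0, 10.0, np.nan, 5.0],
    "USFREQMO":     [30.0, np.nan, 5.0, 1.0],
    "NUMCIGMO_EST": [600.0, np.nan, np.nan, 5.0],
    "OTHER":        ["a", "b", "c", "d"],
})

# Assumed list of variables to keep (hypothetical selection).
keep = ["SMOKER", "S3AQ3C1", "USFREQMO", "NUMCIGMO_EST"]
subset = sub[keep].copy()

# Drop only the rows where SMOKER is NaN, i.e. the former code 3.
subset = subset.dropna(subset=["SMOKER"])

print(subset)
```

Passing `subset=["SMOKER"]` to `dropna` matters here: without it, every row with a NaN in any column would be dropped, including smokers whose quantity or frequency is unknown.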





