Cluster Analysis to Find Patterns in Patients with Heart Disease
The SAS VA HEART dataset on the Teradata University Network site contains data about patients with heart disease. It includes variables such as gender, age of death, age of diagnosis, weight status, cholesterol status, and smoking status. Smoking status is based on the Smoking variable: Non-Smoker is coded as 0, Light as 1-5, Moderate as 6-15, Heavy as 16-25, and Very Heavy as >25. In this problem, you will perform a cluster analysis to group similar patients among those in this dataset who have died.
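The smoking status coding described above can be sketched as a small Python function. This is only an illustration: the function name `smoking_status` is hypothetical and not part of SAS VA, and it takes the Heavy bin to end at 25 so that Heavy and Very Heavy do not overlap.

```python
def smoking_status(smoking):
    """Map a numeric Smoking value to its status label (illustrative helper)."""
    if smoking is None:
        return None  # missing value: no status assigned
    if smoking == 0:
        return "Non-Smoker"
    if smoking <= 5:
        return "Light"
    if smoking <= 15:
        return "Moderate"
    if smoking <= 25:  # assumption: Heavy covers 16-25
        return "Heavy"
    return "Very Heavy"


for value in (0, 3, 10, 20, 40):
    print(value, smoking_status(value))
```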
1. Open the HEART dataset and create some visualizations to get familiar with the data. (Note: You do not need to submit these visualizations to Moodle for this problem.)
2. Create clusters over patients who have died. To do so, filter the entire dataset on Status = Dead, then remove missing values.
3. Click the New Cluster icon on the toolbar and assign all the measure variables except Metropolitan Relative Weight and Age at Start to it.
4. Click the Properties tab. Notice that the number of clusters is set to 5, which is the default. Five clusters were created with cluster IDs 0-4. Change the number of clusters to 4.
5. Increase the Visible Roles to 7. Maximize the cluster matrix. Right-click on one of the cells that have Age of Death on the X axis. Select Plot Age of Death by Cluster ID. Which cluster has the patients who died the youngest?
6. Create a box plot of Smoking by Cluster ID. Which cluster represents the patients who were heavy smokers in this dataset?
7. Minimize the cluster matrix and maximize the parallel coordinates plot. The plot shows the cluster IDs on the left side of the plot and the effects along the top. The clusters are colored differently. The bar sizes on the left represent the number of observations in each cluster. The minimum and maximum values for each effect are shown at the top and bottom of the effect. By looking at the plot with all the clusters shown, what can you assess? For example, which cluster appears to have the patients with the highest cholesterol?
8. Which cluster can be classified as follows: Non-smokers who were older in age at death, had lower cholesterol, and had lower blood pressure?
9. Characterize each of the other clusters.
10. Which cluster is the most internally varied; that is, which one has the largest Within-Cluster SS?
11. Right-click any of the cells in the cluster matrix. Select Derive a Cluster ID Variable. A new variable is created and appears in the Data pane. This variable can now be used as an input to other models.
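To see the mechanics behind the steps above outside SAS Visual Analytics, here is a minimal Python sketch. It is not the SAS VA implementation, whose algorithm and settings may differ. The sketch filters to Status = Dead, drops rows with missing measures, runs a plain k-means (Lloyd's algorithm), and reports the within-cluster sum of squares for each cluster. The rows are made-up toy values, not HEART data. The column names `AgeAtDeath` and `Cholesterol` and all function names are assumptions. The number of clusters is set to 2 only because the toy sample is tiny, whereas the exercise uses 4.

```python
MEASURES = ["AgeAtDeath", "Cholesterol", "Smoking"]  # assumed column names

# Toy rows for illustration only; these are NOT values from the HEART dataset.
rows = [
    {"Status": "Dead", "AgeAtDeath": 55, "Cholesterol": 300, "Smoking": 30},
    {"Status": "Dead", "AgeAtDeath": 80, "Cholesterol": 180, "Smoking": 0},
    {"Status": "Alive", "AgeAtDeath": None, "Cholesterol": 220, "Smoking": 5},
    {"Status": "Dead", "AgeAtDeath": 57, "Cholesterol": 310, "Smoking": 25},
    {"Status": "Dead", "AgeAtDeath": 82, "Cholesterol": None, "Smoking": 0},
    {"Status": "Dead", "AgeAtDeath": 78, "Cholesterol": 190, "Smoking": 0},
    {"Status": "Dead", "AgeAtDeath": 53, "Cholesterol": 290, "Smoking": 35},
]


def filter_dead_complete(rows, measures):
    """Keep rows with Status == 'Dead' and no missing (None) measures."""
    return [
        r for r in rows
        if r["Status"] == "Dead" and all(r.get(m) is not None for m in measures)
    ]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each point to its own cluster centroid."""
    ss = [0.0] * len(centroids)
    for p, lab in zip(points, labels):
        ss[lab] += sq_dist(p, centroids[lab])
    return ss


dead = filter_dead_complete(rows, MEASURES)
points = [tuple(r[m] for m in MEASURES) for r in dead]
labels, centroids = kmeans(points, k=2)
ss = within_cluster_ss(points, labels, centroids)

for cluster_id, (center, value) in enumerate(zip(centroids, ss)):
    print(cluster_id, center, value)
print("Largest within-cluster SS: cluster", ss.index(max(ss)))
```

The measures are used unscaled here for brevity. Because cholesterol spans a much wider numeric range than the other measures, it dominates the distances, so standardizing each measure before clustering is usually advisable.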