Monday, November 30, 2015

Lab 5 - Regression Analysis

Part 1

A local news outlet claims that the more kids who receive free school lunch in an area, the higher the crime rate is for that area. To check the accuracy of the claim, a regression analysis can be done. After the regression analysis is complete (Figures 1-3), the results show a significance value of .005, which indicates a statistically significant relationship. With that being said, the null hypothesis of no relationship is rejected: there is a small relationship between free lunches and crime. The regression equation is y = 1.685x + 21.819, so a crime rate of 79.7 corresponds to just over 34% of kids getting free lunch. The news station is making a correct claim, except the correlation between the two is not very strong.
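The "just over 34%" figure comes from solving the regression equation for x. A minimal sketch of that arithmetic is below; it assumes x is the percent of kids receiving free lunch and y is the crime rate, which matches the numbers in the write-up, and the function names are made up for illustration.

```python
# Coefficients taken from the regression equation y = 1.685x + 21.819.
SLOPE = 1.685
INTERCEPT = 21.819

def predicted_crime_rate(pct_free_lunch):
    """Crime rate the fitted line predicts for a given free-lunch percent."""
    return SLOPE * pct_free_lunch + INTERCEPT

def free_lunch_for_crime_rate(crime_rate):
    """Invert the fitted line: free-lunch percent implied by a crime rate."""
    return (crime_rate - INTERCEPT) / SLOPE

pct = free_lunch_for_crime_rate(79.7)
print(round(pct, 2))  # 34.35
```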







Figure 1 - Summary

Figure 2









Figure 3 - Results
 

Part 2 
Introduction:

The UW System is looking at all the universities in its network to draw conclusions from an analysis of its enrollment data. Analyzing data from the universities might help the UW System understand how Wisconsin students choose among the various UW schools. For this lab, the data for UW - Eau Claire and UW - Milwaukee were examined.

Methods:
Using spatial regression, many conclusions can be drawn from the data. SPSS and ArcMap complement each other well for this: SPSS for analyzing the data and ArcMap for presenting the results. A series of variables can be analyzed using the data provided by the UW System. In this case there are two possible hypotheses: the null hypothesis states there is no relationship between the two variables, and the alternate hypothesis states there is a relationship. The data looked at for this lab was the number of students attending Eau Claire and Milwaukee, tested against Population/Distance, % Population with a Bachelor's Degree and Median Household Income at the county level. The regression analysis was done in SPSS, and once run, the results showed what sort of relationships existed between the variables. After getting the results, the data that was statistically significant was mapped. The residuals could be exported as a .dbf and used in ArcMap to make the data more visually appealing and the patterns easier to see.
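To make the regression and residual steps concrete, here is a minimal sketch of a one-predictor least-squares fit written out by hand. It is not what SPSS does internally, and the numbers are made-up placeholders rather than the UW System data; the function and variable names are hypothetical.

```python
# Ordinary least squares with one predictor, plus residuals and r^2.
# All data below is invented for illustration only.

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(x, y, slope, intercept):
    """Observed minus predicted value for each observation."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

def r_squared(y, resid):
    """Share of the variation in y that the fitted line accounts for."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum(e ** 2 for e in resid)
    return 1 - ss_resid / ss_total

pop_dist = [1.0, 2.0, 3.0, 4.0, 5.0]        # placeholder predictor values
students = [12.0, 19.0, 33.0, 38.0, 52.0]   # placeholder student counts

slope, intercept = fit_line(pop_dist, students)
resid = residuals(pop_dist, students, slope, intercept)
r2 = r_squared(students, resid)

print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("r^2:", round(r2, 3))
print("residuals:", [round(e, 2) for e in resid])
```

A positive residual means a county sent more students than the line predicts and a negative one means fewer, which is the quantity the residual maps display.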
Results:
Three tests were run for both Eau Claire and Milwaukee, focusing on Population/Distance, % Population with a Bachelor's Degree and Median Household Income at the county level. Of these three tests only two were significant. Median Household Income was not significant. For both Eau Claire (Figure 4) and Milwaukee (Figure 5), the test of the number of students attending against Median Household Income (HHI) had significance values greater than .005 (.104 and .027, respectively). With that being said, both fail to reject the null hypothesis.
Figure 4 - HHI Results for Eau Claire

Figure 5 - HHI Results for Milwaukee
    



The other two tests, on the other hand, did show a level of significance. Both population/distance and the percent of the county's population with a Bachelor's degree had a significance value lower than .005. With that being said, the tests show the null hypothesis has to be rejected. The pop/dist test for Eau Claire (Figure 6) had a significance of .000 and a high r^2 (.945), which signifies that the regression fits the data strongly. The pop/dist test for Milwaukee (Figure 7) was very similar: it also had a significance of .000 and a high r^2 value (.922).
Figure 6 - Pop/Dist results for Eau Claire

Figure 7 - Pop/Dist results for Milwaukee


The last test, regarding Bachelor's degrees, was also significant. For Eau Claire (Figure 8) the test produced a significance value of .003 but an r^2 value of only .121, which shows a weak fit. Milwaukee (Figure 9) had a similar result, with a significance of .001 and an r^2 value of .160, again showing a weak fit.
Figure 8 - Bachelor Degree results for Eau Claire

Figure 9 - Bachelor Degree results for Milwaukee


Because two of the three tests came back significant, these results could be mapped to give a visual representation. After being exported from SPSS, the data was joined to a WI .shp file. Figure 10 represents the results from the pop/dist test for UW - Eau Claire. For the most part the counties followed the regression. The exception is among the higher-population counties: most sent more students to Eau Claire than the model predicted, while Milwaukee County sent fewer. Figure 11 represents the results from the bachelor's degree test; the counties closest to Eau Claire deviate more than counties farther away. Figure 12 shows the results of the pop/dist test for Milwaukee. The map shows higher deviation in counties with higher populations, except for Milwaukee County itself. Closer counties with smaller populations do not deviate as much. The last map (Figure 13) shows the results of the bachelor's degree test. The lower-population, more rural counties fall above the regression, and the deviation drops the farther a county is from Milwaukee.

Figure 10 - Map results from Pop/Dist test for Eau Claire


Figure 11 - Map results from Pop/Dist test for Milwaukee
Figure 12 - Map results from Bachelor test for Eau Claire

Figure 13 - Map results from Bachelor test for Milwaukee


Conclusion:
   Out of the three tests, the biggest indicator is population normalized by distance. The high r^2 values highlight that. Although the bachelor's degree test was significant, the relationship was rather weak overall. People generally prefer going to school near where they live, and this case is no different. The more populated counties send more kids to these schools, which could be assumed even without the tests. Areas with more money are likely to send kids to a better-known school such as Milwaukee or Madison. Going off of that, areas where a high share of the population holds a degree will often produce more kids going to school to seek a degree. Although pop/dist and bachelor's degrees per population are good variables to look at, there are many other variables that could be explored when marketing a certain school to an area.


Thursday, November 12, 2015

Lab 4 - Correlation and Spatial Autocorrelation

Part 1: Correlation
1.



Correlation Chart


When looking at a set of data on distance and sound level, one might get curious as to how the two correlate with one another. With tools such as Excel and SPSS, calculating and visualizing the correlation between the two becomes much easier.

The null hypothesis is that there is no linear association between distance (ft) and sound level (dB). The alternate hypothesis is that there is a linear association between distance and sound level.

Looking at the Correlation chart, you can see the correlation between the two is -.896. Because .896 is close to 1, the variables have a strong correlation, and the negative sign tells us, as the scatter plot also shows, that as distance increases the sound level decreases.

The null hypothesis is rejected.
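For readers who want to see what the correlation coefficient is measuring, here is a minimal sketch of Pearson's r computed by hand. The distance and sound values are invented placeholders, not the lab's data, so the result will not equal -.896; the function name is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Placeholder readings: sound level falling as distance grows.
distance_ft = [10, 20, 30, 40, 50]
sound_db = [90, 84, 80, 71, 70]

r = pearson_r(distance_ft, sound_db)
print("r =", round(r, 3))
```

A value near -1 means the points fall close to a downward-sloping line, and a value near +1 means an upward-sloping one.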

2. Correlation matrices can also be created using SPSS. A correlation matrix compares every variable with every other variable. The following matrix is based on Detroit Census data.
Looking at the matrix, you can see that strong correlations exist between the following pairs:
White and having a Bachelor Degree (+)
White and Black residents (-)
Median Household Income and Median Home Value (+)
White and Retail Employee (+)

3. Introduction:
     The Texas Election Commission (TEC) wants to evaluate elections in Texas by analyzing voting patterns and voter turnout by county. The TEC has provided voter data from the 1980 and 2012 Presidential Elections. Through the examination of spatial auto-correlation and correlation, the TEC will have a better idea of what the voter breakdown and turnout is across the state.

Methodology:
     Unfortunately, the TEC data alone was not enough to analyze and come to any conclusions. Luckily, the U.S. Census Bureau website has a database that includes the demographic data needed. Once downloaded, the data was joined together using the 'join' tool in ArcMap and exported as a shapefile. Once exported, the data could be opened and analyzed in GeoDa, a freeware program.

     GeoDa is able to run spatial auto-correlation tests, and this is essential in determining information for the TEC. Because the tests rely on spatial weights, the weights were built on the Poly ID field with the default settings left in place.
     Spatial auto-correlation is relevant because it looks at each individual county and evaluates it against the counties touching it in each direction (side to side, below and above).
A Moran's I test was done, and a LISA map was created to represent the results on a per county basis.
     This was done 5 times, covering Hispanic Population percent in 2010, Voter Turnout in both 1980 and 2012, and Democratic Vote % in 1980 and again in 2012.
The Moran's I test measures the degree of spatial auto-correlation. This is represented by a value between -1 and +1. A value close to -1 indicates a strong negative pattern (unlike values next to each other), a value close to 0 represents a lack of spatial pattern, and a value close to +1 indicates a strong positive pattern (similar values clustered together).
The LISA maps give a visually friendly representation of where the significant differences are at the county level.
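To show how the Moran's I value behaves, here is a minimal sketch of the global statistic using plain 0/1 contiguity weights (1 if two regions touch, 0 otherwise). The four-region "map" and its values are invented for illustration, and GeoDa's own settings may differ (for example, by row-standardizing the weights), so this is not a reproduction of the GeoDa output.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary contiguity weights.

    values    -- one number per region
    neighbors -- dict mapping a region index to the indices of regions it touches
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]  # deviations from the mean
    # Cross-products of deviations over every (region, neighbor) pair.
    cross = sum(z[i] * z[j] for i, nbrs in neighbors.items() for j in nbrs)
    total_weight = sum(len(nbrs) for nbrs in neighbors.values())
    variance_sum = sum(zi ** 2 for zi in z)
    return (n * cross) / (total_weight * variance_sum)

# Four hypothetical regions in a row: 0 touches 1, 1 touches 2, 2 touches 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = morans_i([1, 1, 5, 5], chain)    # similar values side by side
alternating = morans_i([1, 5, 1, 5], chain)  # unlike values side by side

print(round(clustered, 3))    # 0.333
print(round(alternating, 3))  # -1.0
```

The clustered layout gives a positive value and the alternating layout gives -1, matching the interpretation of the scale described above.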

Results:

Figure 1 - 2012 Democratic Vote Percent

Figure 2 - Moran's I chart for 2012 Democratic Vote Percent

     Looking at the map (Figure 1) and Moran's I chart (Figure 2) for the 2012 Democratic Vote Percent in Texas, we can see a significant number of High (red) counties in the southwest part of the state and Low (blue) counties in the north. The Moran's I chart has a value of .6959, which is a positive spatial auto-correlation. The data is grouped tightly in the lower area and spread widely in the high area.
Figure 3 - 1980 Democratic Vote Percent

Figure 4 - Moran's I chart for 1980 Democratic Vote Percent

      This LISA map (Figure 3) shows a larger variation in voting pattern for the 1980 election. The Moran's I chart (Figure 4) shows a spread as well and has a value of .5752, which equates to a moderate, positive spatial auto-correlation.
Figure 5 - 2012 voter turnout
Figure 6 - 2012 voter turnout Moran's I chart
     The LISA map (Figure 5) shows that 2012 voter turnout was high in only a few counties and low in the southern counties nearest to the border. The Moran's I chart (Figure 6) shows a weak trend with much variation, but it is still slightly positive with a value of .3359.
Figure 7 - 1980 voter turnout
Figure 8 - Moran's I chart for 1980 voter turnout
     For 1980, the LISA map (Figure 7) shows a few more high turnout counties, while the south even then had a very low turnout. The Moran's I chart (Figure 8) presents a low yet positive trend at .4681.

Figure 9 - 2010 Hispanic Population Percent
Figure 10 - Moran's I chart for 2010 Hispanic Population Percent
     The last map (Figure 9) represents the Hispanic % throughout Texas. The southwestern counties have a very high percent, while the northeastern counties are generally low except for one county which sits at High-Low. The Moran's I chart (Figure 10) shows a significant positive trend and has a value of .7787, which is the closest to +1 of all of our values.
Figure 11 - Correlation Matrix
The correlation matrix (Figure 11) shows that the correlations are significant at a very high confidence level. At 99%, it shows that each variable is related to the others.

Conclusion:
     Looking at the maps and charts, one can see there is an apparent relationship between the percent of Hispanic population and the percent of Democratic votes. It appears that the counties with the highest percent of Hispanic population also have a very low voter turnout. It makes sense that there is a higher Hispanic population percentage in the southern part of the state, as it is closest to the border with Mexico. Using this information and these visuals, the TEC and the Governor can see that voter turnout has changed only slightly over the last 30+ years. Key areas to consider are the counties with higher Hispanic populations, as these counties are less likely to turn out and vote, but perhaps with the right campaigning these results could change, depending on the message projected to those populations.