Data Science For Dummies - Lillian Pierson - Page 68

Calculating correlation with Pearson’s r

If you want to uncover dependent relationships between continuous variables in a dataset, you’d use statistics to estimate their correlation. The simplest form of correlation analysis is the Pearson correlation, which assumes that

Your data is normally distributed.

You have continuous, numeric variables.

Your variables are linearly related. You can identify a linear relationship by plotting the data points on a chart and looking to see if there is a clear increasing or decreasing trend within the values of the data points, such that a straight line can be drawn to summarize that trend. See Figure 4-1 for an illustration of what a linear relationship looks like.

FIGURE 4-1: An example of a linear relationship between months and YouTube subscribers.

Because the Pearson correlation rests on so many assumptions, use it only to determine whether a linear relationship between two variables exists, not to rule out other possible relationships. An r-value close to 0 indicates that there is no linear relationship between the variables, but a nonlinear relationship between them could still exist.

To use Pearson’s r to test for linear correlation between two variables, you’d simply plug your data into the following formula and calculate the result.

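$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$

$\bar{x}$ = mean of x variable

$\bar{y}$ = mean of y variable

r = Pearson r coefficient of correlation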
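If you'd rather let a computer do the arithmetic, the calculation can be sketched in plain Python. The function below follows the standard definition of Pearson's r: the sum of the products of each pair's deviations from the means, divided by the square root of the product of the summed squared deviations. The function name pearson_r and the sample numbers are assumptions made for illustration, not data from this book.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's r for two equal-length sequences of numbers."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(xs) < 2:
        raise ValueError("need at least two data points")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Numerator: sum of the products of paired deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Denominator terms: summed squared deviations of each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)

    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")

    return sxy / sqrt(sxx * syy)


# Made-up example: months since launch and subscriber counts
months = [1, 2, 3, 4, 5]
subscribers = [2, 4, 6, 8, 10]

r = pearson_r(months, subscribers)
print(r)
```

Because the made-up subscriber counts rise by exactly the same amount every month, the points fall on a straight line and r comes out at +1.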

Once you get a value for your Pearson r, you’d interpret its value according to the following standards:

If r is close to +1: Strong positive correlation between variables

If r is close to 0: Variables are not linearly correlated

If r is close to -1: Strong negative correlation between variables
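These standards can be turned into a small helper function, sketched below. Keep in mind that "close to" has no fixed definition: the cutoffs of 0.7 and 0.1 used here are arbitrary assumptions chosen for illustration, as are the function name describe_r and the catch-all label for in-between values.

```python
def describe_r(r, strong=0.7, near_zero=0.1):
    """Put a rough label on an r-value.

    The `strong` and `near_zero` cutoffs are arbitrary, illustrative
    choices; pick values that make sense for your own field.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r >= strong:
        return "strong positive correlation"
    if r <= -strong:
        return "strong negative correlation"
    if abs(r) <= near_zero:
        return "no clear linear correlation"
    # Anything in between doesn't fit the three standards neatly
    return "weak or moderate linear correlation"


print(describe_r(0.95))
print(describe_r(-0.9))
print(describe_r(0.02))
```

Treat the labels as a starting point rather than a verdict. An r-value that lands between the cutoffs still deserves a look at a plot of the data.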
