Generating multivariate normal variables with a specific covariance matrix
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 | Solution ID: 100000316 Question Subtype: Statistical Distributions Title: Generating MVN data with a specific covariance matrix Description: Q. How can I generate data which are multivariate normal and have a covariance or correlation matrix that I specify? A. In SPSS 4.0 or above on non-DOS platforms, this job can be performed with the matrix language of SPSS. Users of SPSS/PC+ or SPSSX can do part of the job but will have to perform some matrix algebra outside of SPSS. The SPSS 4.0 approach is described first, followed by a description of the SPSSX approach. The general algorithm is described in Rubinstein, R. (1981), "Simulation and the Monte Carlo Method", Wiley. I SPSS 4.0 Method The basic strategy is: 1. Before entering matrix, generate the required number of normally- distributed variables by using compute commands. These variables should be multivariate normal (MVN) with correlations approaching 0, i.e. approaching independence. 2. To compute a set of variables which are multivariate normal and strictly independent, or orthogonal, run the FACTOR procedure with the variables created in Step 1. Use the default extraction method of principal components, extracting as many components as you included in the analysis. Save these components to a new set of variables with the FACTOR /SAVE subcommand. These new variables will be MVN and orthogonal with means of 0 and variances of 1. 3. Enter the matrix language and read the set of standard normal variables from Step 2 as a matrix, Z for example. Then define and enter the target covariance matrix, S. 4. Calculate the Cholesky factor for the target covariance matrix. The Cholesky factor is an upper triangular matrix which is the "square root" of the covariance matrix. If V is the Cholesky factor of matrix S, and VT is the transpose of V, then VT*V=S , where * is the matrix multiplication operator. 5. Post-multiply the independent standard normals, Z, by the Cholesky factor, V, to give a new data matrix, X. If the target matrix was a correlation matrix and you want the new variables to have a standard deviation other than 1, then multiply X by the desired SD. If you want the new variables to have a nonzero mean, add the desired mean to X (but not before multiplication by SD). If the target covariance matrix is not a correlation matrix, the new variables will have variances equal to their respective diagonal elements in the target matrix. Their means will be 0. The desired mean can be added to the variables with a compute statement, either before or after leaving matrix. 6. Save the X matrix to variables in the SPSS active file and leave matrix. An example of the implementation of this algorithm is shown below. The target matrix is a correlation matrix for the five variables to be generated. I added some statements to print the determinant and the condition number of the input covariance matrix to insure that it was not singular or ill-conditioned. I also printed the product of VT*V to insure that the target matrix was reproduced. All the new variables were scaled to have means of 100 and standard deviations of 15 before leaving matrix. input program. + loop #i = 1 to 10000. + do repeat response=r1 to r5 . + compute response = normal(1) . + end repeat. + end case. + end loop. + end file. end input program. correlations r1 to r5 / statistics=descriptives. * Factor procedure computes pr1 to pr5, which are standard MVN . factor variables = r1 to r5 / print = default det /criteria = factors(5) /save=reg (all,pr). * use matrix to set corr matrix. * x is a 10,000 by 5 matrix of independent standard normals . * cor is the target covariance matrix. * cho is the Cholesky factor of cor . * newx is the 10,000 by 5 data matrix which has target covariance matrix . matrix. get x / variables=pr1 to pr5. compute cor={1, 0.4, 0.3, 0.2, 0.1 ; 0.4, 1, 0.4, 0.3, 0.2 ; 0.3, 0.4, 1, 0.4, 0.3 ; 0.2, 0.3, 0.4, 1, 0.4 ; 0.1, 0.2, 0.3, 0.4, 1 }. compute deter=det(cor). print deter / title "determinant of corr matrix" / format=f10.7 . print sval(cor) / title "singular value decomposition of corr". print eval(cor) / title "eigenvalues of input corr". * In a symmetric matrix sval and eigenvalues are identical - choose 1 . compute condnum=mmax(sval(cor))/mmin(sval(cor)). print condnum / title "condition number of corr matrix" / format=f10.2 . compute cho=chol(cor). print cho / title "cholesky factor of corr matrix" . compute chochek=t(cho)*cho. print chochek / title "chol factor premult by its transpose " /format=f10.2 . compute newx=x*cho. compute newx=newx*15 + 100. save newx /outfile=* /variables= nr1 to nr5. end matrix. correlations nr1 to nr5 / statistics=descriptives. II SPSSX and SPSS/PC+ Method If the matrix language is not available, users can apply the Cholesky factor matrix to the standard normal variables with a series of compute commands. However, the Cholesky factorization must be calculated outside of SPSS. The formulae for the Cholesky are provided in Rubinstein's book and in Bock (1975), "Multivariate Statistical Methods in Behavioral Research", McGraw-Hill. The example below uses the Cholesky factor for the target matrix in the previous example. The Cholesky factor matrix is printed below. One compute statement is required for each new variable. Note that the values in the columns of the Cholesky matrix act as weights for the original standard normal variables and these weighted variables are combined to form the new variables. Cholesky Matrix for the Target Correlation Matrix. 1.0 0.4 0.3 0.2 0.1 0 0.91652 0.30551 0.24004 0.17457 0 0 0.9037 0.29508 0.23976 0 0 0 0.90294 0.29608 0 0 0 0 0.90243 * SPSS-X, SPSS or SPSS/PC+ program - assume an active file exists with five independent random vars created by compute r1=normal(1). correlations r1 to r5 / statistics=descriptives. * This correlation matrix should be nearly an identity matrix, 1's in the diagonal and 0's elsewhere. factor variables = r1 to r5 / print = default det /criteria = factors(5) /save=reg (all,pr). correlations pr1 to pr5 / statistics=descriptives. * This correlation matrix should be an identity matrix. * use series of computes to post-mult data matrix by a Cholesky matrix calculated outside spss . compute nr1 = pr1. compute nr2 = 0.4*pr1 + 0.91652*pr2 . compute nr3 = 0.3*pr1 + 0.30551*pr2 + 0.9037*pr3. compute nr4 = 0.2*pr1 + 0.24004*pr2 + 0.29508*pr3 + 0.90294*pr4. compute nr5 = 0.1*pr1 + 0.17457*pr2 + 0.23976*pr3 + 0.29608*pr4 + 0.90243*pr5. correlations nr1 to nr5 / statistics=descriptives. * This correlation matrix should be like the target correlation matrix. |