* soundex
* SPSS Code: Simon Freidin 2003
* *** Start algorithm ***
* (http://www.fearme.com/misc/alg/node128.html) by Scott Gasch
* 0.5.10 Soundex English word-sounding Algorithm
* M. K. Odell and R. C. Russell patented the Soundex phonetic comparison system in 1918 and 1922.
* Soundex coding takes an English word and produces a four-character code (a letter followed
* by three digits) designed to match the phonetic pronunciation of the word.
* It is normally used for "fuzzy" searches where a close match may be desired.
* For example, to come up with alternative possibilities for a misspelled word, some spelling
* checker programs generate a Soundex code for the misspelled word and then suggest other
* words with the same Soundex value.
* Additionally, Soundex codes are often used on surnames which are difficult to spell.
* The creation of a Soundex code is a pretty simple operation.
* The first step is to remove all non-English letters or symbols.
* In the case of accented vowels, simply remove the accents. Remove any hyphens, spaces, etc. as well.
* In addition, remove all H's and W's unless they are the initial letter in the word.
* Next, take the first letter in the word and make it the first letter of the Soundex code.
* For each remaining letter in the word, translate it to a number with the table below and
* concatenate the numbers, preserving order, on to the Soundex value.
*
* A, E, I, O, U, Y = 0
* B, F, P, V = 1
* C, G, J, K, Q, S, X, Z = 2
* D, T = 3
* L = 4
* M, N = 5
* R = 6
*
* Now, combine any double numbers into a single instance of that number.
* Further, if the first number in the Soundex value is the same as the code number for
* the initial letter, delete the first number. Now, remove all zeros from the Soundex string.
* Finally, return the first four characters of the end product as the Soundex encoding.
* If there are fewer than four characters to be returned, concatenate enough zeros to make the length four.
* **** End algorithm ***** .

set printback=listing.
data list list/name (a20).
begin data.
Oconnell
smythe
smith
end data.

/* convert to upper case and remove leading spaces */
compute name=ltrim(rtrim(upcase(name))).
string a1 to a20 (a1) soundex1 (a20).

* break the name into characters, make the first letter the first character of soundex string .
do repeat a=a1 to a20/b=1 to 20.
compute a=substr(name,b,1).
end repeat.
compute soundex1=a1.
recode a1 to a20 ('A', 'E', 'I', 'O', 'U', 'Y' = '0')('B', 'F', 'P', 'V' = '1')
 ('C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z' = '2')
 ('D', 'T' = '3')('L' = '4')('M', 'N' = '5')('R' = '6')(else='').

* add numbers to soundex string .
* (dropping spaces, H, W and non-alpha characters which were recoded to '') .
do repeat a=a2 to a20.
if a ~= '' soundex1=concat(ltrim(rtrim(soundex1)),a).
end repeat.
execute.

* Now, combine any double numbers into a single instance of that number.
string pl cl (a1) soundex2 (a20).
loop x=1 to 20.
compute cl=substr(soundex1,x,1).
if cl ~= pl soundex2=concat(ltrim(rtrim(soundex2)),cl).
compute pl=cl.
end loop.

* Further, if the first number in the Soundex value is the same as the code number for
* the initial letter, delete the first number.
string soundex3 (a20).
compute soundex3=soundex2.
if a1=substr(soundex2,2,1) soundex3=concat(substr(soundex2,1,1),substr(soundex2,3)).

* Now, remove all zeros from the Soundex string.
string soundex4 (a20).
loop x=1 to 20.
compute cl=substr(soundex3,x,1).
if cl ~= '0' soundex4=concat(ltrim(rtrim(soundex4)),cl).
end loop.

* Finally, return the first four characters of the end product as the Soundex encoding.
* If there are less than four characters to be returned, concatenate enough zeros to make the length four.
string soundex (a4).
compute soundex=soundex4.
if length(ltrim(rtrim(soundex)))=3 soundex=concat(ltrim(rtrim(soundex)),'0').
if length(ltrim(rtrim(soundex)))=2 soundex=concat(ltrim(rtrim(soundex)),'00').
if length(ltrim(rtrim(soundex)))=1 soundex=concat(ltrim(rtrim(soundex)),'000').
execute.

match files file=*/keep=name soundex.
execute.
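
* Optional check, not part of the original syntax: list each name with its code .
* For the three sample names the standard Soundex codes are O254 (OCONNELL)
* and S530 (both SMYTHE and SMITH), so the last two names should share a code .
list variables=name soundex.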