* soundex
* SPSS Code: Simon Freidin 2003
* *** Start algorithm ***
* (http://www.fearme.com/misc/alg/node128.html) by Scott Gasch
* 0.5.10 Soundex English word-sounding Algorithm
* M. K. Odell and R. C. Russell patented the Soundex phonetic comparison
* system in 1918 and 1922.
* Soundex coding takes an English word and produces a four-character code
* (a letter followed by three digits) designed to match the phonetic
* pronunciation of the word. It is normally used for "fuzzy" searches where
* a close match may be desired. For example, to come up with alternative
* possibilities for a misspelled word, some spelling checker programs
* generate a Soundex code for the misspelled word and then suggest other
* words with the same Soundex value.
* Additionally, Soundex codes are often used on surnames which are difficult
* to spell.
* The creation of a Soundex code is a pretty simple operation.
* The first step is to remove all non-English letters or symbols.
* In the case of accented vowels, simply remove the accents. Remove any
* hyphens, spaces, etc. as well.
* In addition, remove all H's and W's unless they are the initial letter in
* the word.
* Next, take the first letter in the word and make it the first letter of
* the Soundex code.
* For each remaining letter in the word, translate it to a number with the
* table below and concatenate the numbers, preserving order, on to the
* Soundex value.
*
*           A, E, I, O, U, Y = 0
*                 B, F, P, V = 1
*     C, G, J, K, Q, S, X, Z = 2
*                       D, T = 3
*                          L = 4
*                       M, N = 5
*                          R = 6
*
* Now, combine any double numbers into a single instance of that number.
* Further, if the first number in the Soundex value is the same as the code
* number for the initial letter, delete the first number. Now, remove all
* zeros from the Soundex string.
* Finally, return the first four characters of the end product as the
* Soundex encoding.
* If there are fewer than four characters to be returned, concatenate enough
* zeros to make the length four.
* ****  End algorithm *****.
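* Worked example (a hand trace of the steps above, not from the original
* text), using the surname Pfister:
*   letters to codes       P F I S T E R -> P 1 0 2 3 0 6
*   combine doubles        P102306 (no adjacent repeats)
*   initial-letter rule    P codes to 1, same as first number -> P02306
*   remove zeros           P236
*   pad/truncate to four   P236
* A short name such as Lee gives L00 -> L0 -> L -> L000.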
set printback=listing.
data list list/name (a20).
begin data.
Oconnell
smythe
smith
end data.
/* convert to upper case and remove leading spaces */
compute name=ltrim(rtrim(upcase(name))).
string a1 to a20 (a1) soundex1 (a20).
* break the name into characters, make the first letter the first character
* of soundex string .
do repeat a=a1 to a20/b=1 to 20.
compute a=substr(name,b,1).
end repeat.
compute soundex1=a1.
recode a1 to a20
  ('A', 'E', 'I', 'O', 'U', 'Y' = '0')
  ('B', 'F', 'P', 'V' = '1')
  ('C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z' = '2')
  ('D', 'T' = '3')('L' = '4')('M', 'N' = '5')('R' = '6')(else='').
* add numbers to soundex string .
* (dropping spaces, H, W and non-alpha characters which were recoded to '') .
do repeat a=a2 to a20.
if a ~= '' soundex1=concat(ltrim(rtrim(soundex1)),a).
end repeat.
execute.
* Now, combine any double numbers into a single instance of that number.
* (pl holds the previous character, cl the current one).
string pl cl (a1) soundex2 (a20).
loop x=1 to 20.
compute cl=substr(soundex1,x,1).
if cl ~= pl soundex2=concat(ltrim(rtrim(soundex2)),cl).
compute pl=cl.
end loop.

* Further, if the first number in the Soundex value is the same as the code
* number for the initial letter, delete the first number.
* (a1 now holds the recoded digit for the initial letter).
string soundex3 (a20).
compute soundex3=soundex2.
if a1=substr(soundex2,2,1)
  soundex3=concat(substr(soundex2,1,1),substr(soundex2,3)).

* Now, remove all zeros from the Soundex string.
string soundex4 (a20).
loop x=1 to 20.
compute cl=substr(soundex3,x,1).
if cl ~= '0' soundex4=concat(ltrim(rtrim(soundex4)),cl).
end loop.

* Finally, return the first four characters of the end product as the
* Soundex encoding.
* If there are fewer than four characters to be returned, concatenate enough
* zeros to make the length four.
string soundex (a4).
compute soundex=soundex4.
if length(ltrim(rtrim(soundex)))=3 soundex=concat(ltrim(rtrim(soundex)),'0').
if length(ltrim(rtrim(soundex)))=2 soundex=concat(ltrim(rtrim(soundex)),'00').
if length(ltrim(rtrim(soundex)))=1 soundex=concat(ltrim(rtrim(soundex)),'000').
execute.
match files file=*/keep=name soundex.
execute.
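* A quick check, offered as a sketch rather than part of the original code:
* list the three sample names with their codes. Tracing the steps above by
* hand gives OCONNELL = O254, SMYTHE = S530 and SMITH = S530, so the two
* spellings of Smith should match.
list variables=name soundex.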