*(Q)
 In a large file, I have a string variable (part of a street address) that
 contains errors in which certain letters are erroneously substituted for
 numerals (for example, the string '123B SMITH ST' should be '1238 SMITH
 ST'). These problems result from scanning the data.
 
*To detect at least some such errors, I want to run a search that detects
 things like "any instance of a string of numerals that contains an
 embedded letter."  I'm thinking of creating a logical variable that flags
 such cases, and I can imagine syntax that looks at a three-character
 window within a string and then checks whether the middle character = '1'
 or '2' or ... This seems like a mess. 
*Any thoughts here?


*(A) From: SPSSX(r) Discussion [SPSSX-L@UGA.CC.UGA.EDU] on behalf of
marso@MY-DEJANEWS.COM
Sent: July 23, 1998 10:42 AM
To: SPSSX-L@UGA.CC.UGA.EDU
Subject: Re: Sophisticated search in string variable

* Michael,
  Just check adjacent characters for a number-letter flip-flop!
  David

DATA LIST /id 1-2 address 4-25 (a).
BEGIN DATA
01 123B SMITH ST.
02 461 OCEAN BVD.
03 12A PENNSYLVANIA AVE.
04 444 N. MICHIGAN AVE.
05 22B4 BAKER ST.
END DATA.
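* Scratch variables: the two character classes and an upper-cased copy.
STRING  #ALPHA (A26) #NUM (A10) #ADDR (A22).
COMPUTE #ALPHA = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".
COMPUTE #NUM = "0123456789".
COMPUTE #ADDR = UPCASE(ADDRESS).

* #NC and #SC: is the current character a numeral or a letter.
* #NP and #SP: the same tests for the previous character.
* BAD is set to 1 on any numeral-letter switch, else stays system-missing.
LOOP #=2 TO LENGTH(ADDRESS).
COMPUTE #NC = INDEX(SUBSTR(#ADDR,#,1),#NUM,1) > 0.
COMPUTE #NP = INDEX(SUBSTR(#ADDR,#-1,1),#NUM,1) > 0.
COMPUTE #SC = INDEX(SUBSTR(#ADDR,#,1),#ALPHA,1) > 0.
COMPUTE #SP = INDEX(SUBSTR(#ADDR,#-1,1),#ALPHA,1) > 0.
IF (#NC * #SP + #SC * #NP) BAD=1.
END LOOP.
EXECUTE.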


* Or, as a three-liner after setup:

DATA LIST /id 1-2 address 4-25 (a).
BEGIN DATA
01 123B SMITH ST.
02 461 OCEAN BVD.
03 12A PENNSYLVANIA AVE.
04 444 N. MICHIGAN AVE.
05 22B4 BAKER ST.
END DATA.
STRING  #ADDR (A22).

* Same test as above, with the character classes written inline.
COMPUTE #ADDR = UPCASE(ADDRESS).
LOOP #=2 TO LENGTH(ADDRESS).
IF ((INDEX(SUBSTR(#ADDR,#,1),"0123456789",1) > 0)
  * (INDEX(SUBSTR(#ADDR,#-1,1),"ABCDEFGHIJKLMNOPQRSTUVWXYZ",1) > 0)
  + (INDEX(SUBSTR(#ADDR,#,1),"ABCDEFGHIJKLMNOPQRSTUVWXYZ",1) > 0)
  * (INDEX(SUBSTR(#ADDR,#-1,1),"0123456789",1) > 0)) BAD=1.
END LOOP.
LIST.
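
* The block below is an added sketch, not part of the original reply. It
  assumes the sample data above are still the active file. It sets the flag
  to 0 first, so unflagged cases are not left system-missing, and it records
  the position of the first numeral-letter switch for manual review. The
  names BAD2, FLIPPOS, #ADDR2, #I, #CURNUM, #PRVNUM, #CURLET and #PRVLET
  are hypothetical.
STRING #ADDR2 (A22).
COMPUTE #ADDR2 = UPCASE(ADDRESS).
COMPUTE BAD2 = 0.
COMPUTE FLIPPOS = 0.
LOOP #I = 2 TO LENGTH(ADDRESS).
COMPUTE #CURNUM = INDEX(SUBSTR(#ADDR2,#I,1),"0123456789",1) > 0.
COMPUTE #PRVNUM = INDEX(SUBSTR(#ADDR2,#I-1,1),"0123456789",1) > 0.
COMPUTE #CURLET =
  INDEX(SUBSTR(#ADDR2,#I,1),"ABCDEFGHIJKLMNOPQRSTUVWXYZ",1) > 0.
COMPUTE #PRVLET =
  INDEX(SUBSTR(#ADDR2,#I-1,1),"ABCDEFGHIJKLMNOPQRSTUVWXYZ",1) > 0.
DO IF (#CURNUM AND #PRVLET) OR (#CURLET AND #PRVNUM).
COMPUTE BAD2 = 1.
COMPUTE FLIPPOS = #I.
END IF.
END LOOP IF (BAD2 = 1).
LIST.

* With the sample data, cases 01, 03 and 05 should be flagged, with the
  switch at positions 4, 3 and 3. An address such as 12A may be a valid
  unit number rather than a scanning error, so flagged cases still need to
  be checked by eye.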