Sophisticated search in string variable
*(Q) In a large file, I have a string variable (part of a street address)
 that contains errors in which certain letters are erroneously substituted
 for numerals (for example, the string '123B SMITH ST' should be
 '1238 SMITH ST'). (These problems result from scanning data.)

*To detect at least some such errors, I want to do a sort of search to detect
 things like "any instance of a string of numerals that contains an embedded
 letter." I'm thinking of creating a logical variable that flags such cases,
 and I can imagine syntax that looks at a three-character window within a
 string, and then checks whether the middle character = '1' or '2'
 or ... This seems like a mess.

*Any thoughts here?

*(A) From: SPSSX(r) Discussion [SPSSX-L@UGA.CC.UGA.EDU] on behalf of
 marso@MY-DEJANEWS.COM
 Sent: July 23, 1998 10:42 AM
 To: SPSSX-L@UGA.CC.UGA.EDU
 Subject: Re: Sophisticated search in string variable
 Michael,
 Just check adjacent characters for Number-String flip flop!
 David

DATA LIST /id 1-2 address 4-25 (a).
BEGIN DATA
01 123B SMITH ST.
02 461 OCEAN BVD.
03 12A PENNSYLVANIA AVE.
04 444 N. MICHIGAN AVE.
05 22B4 BAKER ST.
END DATA.

STRING #ALPHA (A26) #NUM (A10) #ADDR (A22).
COMPUTE #ALPHA = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".
COMPUTE #NUM = "0123456789" .
COMPUTE #ADDR = UPCASE(ADDRESS).
* #NC and #NP: the current and the previous character is a numeral.
* #SC and #SP: the current and the previous character is a letter.
LOOP #=2 TO LEN(ADDRESS).
COMPUTE #NC = IND(SUB(#ADDR,#,1),#NUM,1) > 0.
COMPUTE #NP = IND(SUB(#ADDR,#-1,1),#NUM,1) > 0.
COMPUTE #SC = IND(SUB(#ADDR,#,1),#ALPHA,1) > 0.
COMPUTE #SP = IND(SUB(#ADDR,#-1,1),#ALPHA,1) > 0.
IF (#NC * #SP + #SC * #NP) BAD=1.
END LOOP.
EXE .

* Or, for a three-liner after setup.

DATA LIST /id 1-2 address 4-25 (a).
BEGIN DATA
01 123B SMITH ST.
02 461 OCEAN BVD.
03 12A PENNSYLVANIA AVE.
04 444 N. MICHIGAN AVE.
05 22B4 BAKER ST.
END DATA.

STRING #ADDR (A22) .
COMPUTE #ADDR = UPCASE(ADDRESS).
LOOP #=2 TO LEN(ADDRESS).
IF ((IND(SUB(#ADDR,#,1),"0123456789",1) > 0) *
    (IND(SUB(#ADDR,#-1,1),"ABCDEFGHIJKLMNOPQRSTUVWXYZ",1) > 0) +
    (IND(SUB(#ADDR,#,1),"ABCDEFGHIJKLMNOPQRSTUVWXYZ",1) > 0) *
    (IND(SUB(#ADDR,#-1,1),"0123456789",1) > 0)) BAD=1.
END LOOP.
LIST .
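The adjacent-character check in the answer leaves BAD system-missing for clean addresses and keeps scanning after the first hit. The following is a minimal sketch of a variant, not a definitive implementation. It assumes the same ADDRESS (A22) variable from the DATA LIST setup and is meant to run in place of the loop, right after END DATA. It spells out the function names INDEX, SUBSTR and LENGTH in full. The scratch names #I, #CN, #PN, #CA and #PA are illustrative and do not come from the original post.

```spss
* Sketch only. Assumes ADDRESS (A22) exists as read by the DATA LIST setup.
STRING #ADDR (A22).
COMPUTE #ADDR = UPCASE(ADDRESS).
* Initialise the flag so clean addresses get 0 instead of system-missing.
COMPUTE BAD = 0.
LOOP #I = 2 TO LENGTH(ADDRESS).
* #CN and #PN: the current and the previous character is a numeral.
COMPUTE #CN = INDEX(SUBSTR(#ADDR,#I,1),"0123456789",1) > 0.
COMPUTE #PN = INDEX(SUBSTR(#ADDR,#I-1,1),"0123456789",1) > 0.
* #CA and #PA: the current and the previous character is a letter.
COMPUTE #CA = INDEX(SUBSTR(#ADDR,#I,1),"ABCDEFGHIJKLMNOPQRSTUVWXYZ",1) > 0.
COMPUTE #PA = INDEX(SUBSTR(#ADDR,#I-1,1),"ABCDEFGHIJKLMNOPQRSTUVWXYZ",1) > 0.
* Flag a numeral next to a letter, in either order.
IF (#CN * #PA + #CA * #PN > 0) BAD = 1.
* Stop scanning this address once a flip has been found.
END LOOP IF (BAD = 1).
LIST.
```

For the five sample addresses, this should list BAD = 1 for ids 1, 3 and 5 and BAD = 0 for ids 2 and 4. Note that 12A is flagged even though it may be a genuine unit suffix, so flagged cases still warrant a manual check.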