Function Module Soundex

The Soundex function returns a string that is a phonetic representation of a word. It can be used to identify words that are spelled differently but sound alike in English. Soundex can also help detect typing errors, one of the challenges of data cleaning.

History

The original Soundex algorithm was not named Soundex. Index systems similar to what we now call “Soundex” were originally developed and used for indexing American census records in the late nineteenth and early twentieth centuries.

SoundEx Limitations

SoundEx acts as a bridge between the fuzzy and inexact process of human vocal interaction and the concise true/false processes at the foundation of computer communication. As such, SoundEx is an inherently unreliable interface.

Below is an SAP function module based on the algorithm described in the sources quoted above; the article URL appears in a comment at the top of the code. I used it in one of my data cleaning and upload projects. You can adapt it into a class method if you prefer.

FUNCTION zsoundex.
*"----------------------------------------------------------------------
*"*"Local Interface:
*"  IMPORTING
*"     REFERENCE(WORDSTRING) TYPE  STRING
*"     VALUE(LENGTHOPTION) TYPE  I DEFAULT 4
*"     VALUE(CENSUSOPTION) TYPE  I OPTIONAL
*"  EXPORTING
*"     REFERENCE(SOUNDEX_CODE) TYPE  STRING
*"----------------------------------------------------------------------

* http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm

  DATA : lv_input   TYPE string     ,
         lv_option  TYPE i          ,
         lv_soundex TYPE char10     ,
         lv_length  TYPE i          ,
         lv_codelen TYPE i          ,
         lv_tmp     TYPE string     .

* Option = 0
*   Enhanced SoundEx
* Option = 1 - Normal Census
*   Properly calculated SoundEx codes found in all census years.
* Option = 2
*   Special Census - Improperly calculated SoundEx codes found
*   in SOME of the censuses performed in 1880, 1900, and 1910.
* Other values will be treated as 0

  lv_option = censusoption .

  IF lv_option < 0 OR lv_option > 2 .
    lv_option = 0 .
  ENDIF.

* Set the code length: census options (1 and 2) always produce a
* 4-character code; otherwise use the requested length, limited
* to the range 4..10.
  IF lv_option > 0 .
    lv_codelen = 4 .
  ELSEIF lengthoption > 0 .
    lv_codelen = lengthoption .
  ENDIF .
  IF lv_codelen > 10 .
    lv_codelen = 10 .
  ENDIF .
  IF lv_codelen < 4 .
    lv_codelen = 4 .
  ENDIF .

* Step 1
* Capitalize all letters in the word and drop all punctuation marks.
* Pad the word with rightmost blanks as needed during each procedure step.

  lv_input = wordstring .

  REPLACE ALL OCCURRENCES OF REGEX '[^A-Z]' IN lv_input WITH space IGNORING CASE.

  CONDENSE  lv_input NO-GAPS  .
  TRANSLATE lv_input TO UPPER CASE.

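* Nothing left to encode: return an empty code instead of leaving
* whatever value the caller passed in SOUNDEX_CODE.
  IF lv_input IS INITIAL .
    CLEAR soundex_code .
    RETURN .
  ENDIF .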

  IF lv_option = 0 .
*   DG with G
    REPLACE ALL OCCURRENCES OF 'DG' IN lv_input WITH 'G' .

*   GH with H
    REPLACE ALL OCCURRENCES OF 'GH' IN lv_input WITH 'H' .

*   GN with N (not 'ng')
    REPLACE ALL OCCURRENCES OF 'GN' IN lv_input WITH 'N' .

*   KN with N
    REPLACE ALL OCCURRENCES OF 'KN' IN lv_input WITH 'N' .

*   PH with F
    REPLACE ALL OCCURRENCES OF 'PH' IN lv_input WITH 'F' .

*   MP with M ...WHEN... it is followed by S, Z, or T
    REPLACE ALL OCCURRENCES OF 'MPS' IN lv_input WITH 'MS' .
    REPLACE ALL OCCURRENCES OF 'MPZ' IN lv_input WITH 'MZ' .
    REPLACE ALL OCCURRENCES OF 'MPT' IN lv_input WITH 'MT' .

*   PS with S ...WHEN... it starts a word
    REPLACE ALL OCCURRENCES OF REGEX '^PS' IN lv_input WITH 'S' .

*   PF with F ...WHEN... it starts a word
    REPLACE ALL OCCURRENCES OF REGEX '^PF' IN lv_input WITH 'F' .

*   MB with M
    REPLACE ALL OCCURRENCES OF 'MB' IN lv_input WITH 'M' .

*   TCH with CH
    REPLACE ALL OCCURRENCES OF 'TCH' IN lv_input WITH 'CH' .

*   A or I with E ...WHEN... it starts a word and is followed by [AEIO]
*   (the following vowel is kept via the $1 back-reference)
    REPLACE FIRST OCCURRENCE OF REGEX '^[AI]([AEIO])' IN lv_input WITH 'E$1' .
  ENDIF.

* Step 2
* Retain the first letter of the word.
  lv_soundex = lv_input+0(1) .

  IF lv_input+0(1) = 'H' OR lv_input+0(1) = 'W' .
    CONCATENATE '-' lv_input+1 INTO lv_input .
  ENDIF.

  IF lv_option = 1 .
    REPLACE ALL OCCURRENCES OF 'H' IN lv_input WITH space .
    REPLACE ALL OCCURRENCES OF 'W' IN lv_input WITH space .
    CONDENSE lv_input NO-GAPS .
  ENDIF.

* Step 3
* Change all occurrences of the following letters to '0' (zero):
* 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
  REPLACE ALL OCCURRENCES OF REGEX '[AEIOUHWY]' IN lv_input WITH '0' .

* Step 4
* Change letters from the following sets into the digit given:
* 1 = 'B', 'F', 'P', 'V'
* 2 = 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'
* 3 = 'D','T'
* 4 = 'L'
* 5 = 'M','N'
* 6 = 'R'

  REPLACE ALL OCCURRENCES OF REGEX '[BFPV]'     IN lv_input WITH '1' .
  REPLACE ALL OCCURRENCES OF REGEX '[CGJKQSXZ]' IN lv_input WITH '2' .
  REPLACE ALL OCCURRENCES OF REGEX '[DT]'       IN lv_input WITH '3' .
  REPLACE ALL OCCURRENCES OF REGEX '[L]'        IN lv_input WITH '4' .
  REPLACE ALL OCCURRENCES OF REGEX '[MN]'       IN lv_input WITH '5' .
  REPLACE ALL OCCURRENCES OF REGEX '[R]'        IN lv_input WITH '6' .

* Step 5
* Collapse each run of identical adjacent digits into a single digit
* in the string that resulted after step (4).
  lv_length = STRLEN( lv_input ) .

  DATA : lv_last  TYPE char01     ,
         lv_index TYPE syst-index .

  DO lv_length TIMES .
    IF sy-index = 1 .
      lv_tmp  = lv_input+0(1) .
      lv_last = lv_input+0(1) .
    ELSE.
      lv_index = sy-index - 1 .
      IF lv_last <> lv_input+lv_index(1) .
        lv_last = lv_input+lv_index(1) .
        CONCATENATE lv_tmp lv_last INTO lv_tmp .
      ENDIF.
    ENDIF.
  ENDDO.

  lv_input = lv_tmp+1 .

* Step 6
* Remove all zeros (placed there in step 3) from the string that
* resulted from step 5.
  REPLACE ALL OCCURRENCES OF '0' IN lv_input WITH space .
  CONDENSE lv_input NO-GAPS .

* Step 7
* Pad the string that resulted from step 6 with trailing zeros and
* return only the first lv_codelen positions (four by default), which
* will be of the form <uppercase letter> <digit> <digit> <digit>.
  lv_soundex+1 = lv_input .

* Turn every remaining blank of the fixed-length field into '0' so
* that codes longer than four characters are padded as well.
  TRANSLATE lv_soundex USING ' 0' .

  soundex_code = lv_soundex+0(lv_codelen) .

ENDFUNCTION.
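
As mentioned above, the function module can be wrapped in a class method. The following is only a minimal sketch of one way to do that, assuming ZSOUNDEX exists exactly as listed; the class name lcl_soundex, the method name encode and its parameter names are hypothetical and not part of the original code.

```abap
* Hypothetical local class wrapping the function module ZSOUNDEX.
CLASS lcl_soundex DEFINITION FINAL.
  PUBLIC SECTION.
    CLASS-METHODS encode
      IMPORTING iv_word        TYPE string
                iv_length      TYPE i DEFAULT 4
                iv_census      TYPE i DEFAULT 0
      RETURNING VALUE(rv_code) TYPE string.
ENDCLASS.

CLASS lcl_soundex IMPLEMENTATION.
  METHOD encode.
*   Delegate to the function module; the parameter names on the left
*   are the ones from its interface.
    CALL FUNCTION 'ZSOUNDEX'
      EXPORTING
        wordstring   = iv_word
        lengthoption = iv_length
        censusoption = iv_census
      IMPORTING
        soundex_code = rv_code.
  ENDMETHOD.
ENDCLASS.

* Usage: a functional call replaces the CALL FUNCTION boilerplate.
DATA lv_code TYPE string.
lv_code = lcl_soundex=>encode( iv_word = `Morgan` ).
WRITE / lv_code.
```

A returning parameter lets the code be used directly in expressions and comparisons, which tends to be more convenient than an exporting parameter when comparing many names.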

And here is a small program that demonstrates how the algorithm identifies words that are spelled differently but sound alike.

TYPES : BEGIN OF ty_data ,
         lastname  TYPE string ,
         soundcode TYPE string ,
        END OF ty_data .

DATA : i_names TYPE TABLE OF ty_data ,
       w_name  TYPE ty_data          .

DEFINE insert_name .
  clear w_name .

  w_name-lastname = &1 .
  call function 'ZSOUNDEX'
    exporting
      wordstring   = w_name-lastname
    importing
      soundex_code = w_name-soundcode.
  append w_name to i_names .

END-OF-DEFINITION.

insert_name 'Morgan' .
insert_name 'Morgana' .
insert_name 'Morgahan' .
insert_name 'Morgain' .
insert_name 'Morgahan' .
insert_name 'Morgahin' .

LOOP AT i_names INTO w_name .
  WRITE : / w_name-lastname , 30 w_name-soundcode .
ENDLOOP .
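
For data cleaning it is often more useful to see the names grouped by their code than listed in input order. The sketch below assumes the table i_names has been filled as in the program above; the variable lv_prev_code is a hypothetical helper that is not part of the original program.

```abap
* Hypothetical helper holding the code of the previous row.
DATA lv_prev_code TYPE string.

* Bring names with the same code together and drop exact repeats.
SORT i_names BY soundcode lastname.
DELETE ADJACENT DUPLICATES FROM i_names COMPARING soundcode lastname.

LOOP AT i_names INTO w_name.
* Print a header line whenever a new Soundex code starts.
  IF w_name-soundcode <> lv_prev_code.
    WRITE: / 'Code:', w_name-soundcode.
    lv_prev_code = w_name-soundcode.
  ENDIF.
* List every spelling that shares the current code, indented.
  WRITE: /5 w_name-lastname.
ENDLOOP.
```

Any code that lists more than one spelling is a candidate for manual review. Because Soundex is inherently inexact, a shared code is a hint that two entries may refer to the same name, not proof.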
