
Preprocess

Normalize text (especially CJK characters)

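import re
import unicodedata

text = "株式会社KADOKAWA\u3000Future\u3000Publishing"

# Replace non-breaking spaces and tabs with a plain space.
# r"\xa0" and r"\xA0" are the same escape, so one substitution is enough.
text = re.sub(r"\xa0", " ", text)
text = re.sub(r"\t", " ", text)

# NFKC maps the ideographic space (U+3000) to an ordinary space.
text = unicodedata.normalize("NFKC", text)
# text is now "株式会社KADOKAWA Future Publishing"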
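The steps above can be folded into one helper. The following is a minimal sketch, not a definitive implementation: the name normalize_text is hypothetical, and collapsing runs of whitespace is an added assumption that goes beyond the original snippet. It relies on the fact that NFKC already maps the no-break space (U+00A0) and the ideographic space (U+3000) to a plain space, so only tabs need a regular expression.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper: NFKC-normalize, then tidy up whitespace."""
    # NFKC maps U+00A0 (no-break space) and U+3000 (ideographic space)
    # to U+0020, and folds full-width Latin letters to ASCII.
    text = unicodedata.normalize("NFKC", text)
    # NFKC leaves tabs alone, so replace tabs and repeated spaces here
    # (assumption: a single space is the desired separator).
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


print(normalize_text("株式会社KADOKAWA\u3000Future\xa0\tPublishing"))
```

Running the normalization first keeps the regular expression small, because the Unicode-specific space variants are already gone by the time it runs.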
  • For unicodedata.normalize, the four normalization forms (NFC, NFD, NFKC, NFKD) give the following results:

    アイウエオ ==(NFC)==> アイウエオ
    アイウエオ ==(NFD)==> アイウエオ
    アイウエオ ==(NFKC)==> アイウエオ
    アイウエオ ==(NFKD)==> アイウエオ
    パピプペポ ==(NFC)==> パピプペポ
    パピプペポ ==(NFD)==> パピプペポ
    パピプペポ ==(NFKC)==> パピプペポ
    パピプペポ ==(NFKD)==> パピプペポ
    パピプペポ ==(NFC)==> パピプペポ
    パピプペポ ==(NFD)==> パピプペポ
    パピプペポ ==(NFKC)==> パピプペポ
    パピプペポ ==(NFKD)==> パピプペポ
    abcABC ==(NFC)==> abcABC
    abcABC ==(NFD)==> abcABC
    abcABC ==(NFKC)==> abcABC
    abcABC ==(NFKD)==> abcABC
    123 ==(NFC)==> 123
    123 ==(NFD)==> 123
    123 ==(NFKC)==> 123
    123 ==(NFKD)==> 123
    +-.~)} ==(NFC)==> +-.~)}
    +-.~)} ==(NFD)==> +-.~)}
    +-.~)} ==(NFKC)==> +-.~)}
    +-.~)} ==(NFKD)==> +-.~)}
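Because half-width and full-width characters, and composed and decomposed ones, can look identical on screen, it may help to compare the forms by code point. The sketch below is illustrative only: the helper names compare_forms and code_points and the sample strings are assumptions chosen for this example, with inputs written as escapes so their width is unambiguous.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def code_points(s: str) -> str:
    """Render a string as a list of code points, e.g. 'U+30CF U+309A'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


def compare_forms(s: str) -> dict:
    """Return the result of each normalization form for the string s."""
    return {form: unicodedata.normalize(form, s) for form in FORMS}


# Illustrative inputs (not taken from the listing above).
samples = {
    "half-width katakana A, I, U": "\uff71\uff72\uff73",
    "half-width HA + handakuten": "\uff8a\uff9f",
    "full-width katakana PA": "\u30d1",
    "full-width Latin a, B": "\uff41\uff22",
    "full-width digits 1, 2, 3": "\uff11\uff12\uff13",
}

for label, s in samples.items():
    print(label)
    for form, out in compare_forms(s).items():
        print(f"    {code_points(s)} ==({form})==> {code_points(out)}")
```

NFC and NFD leave character width alone, while NFKC and NFKD fold half-width katakana and full-width Latin characters or digits into their standard forms. NFD and NFKD additionally split a character such as PA (U+30D1) into its base HA (U+30CF) plus a combining handakuten (U+309A), which is why two strings that print the same can still compare unequal.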
    
