API

Build Unihan into tabular / structured format and export it.

class unihan_etl.process.Packager(options)

Download and generate a tabular release of UNIHAN.

download(urlretrieve_fn=<function urlretrieve>)

Download raw UNIHAN data if it does not already exist.

Parameters: urlretrieve_fn (function) – function used to download the file

export()

Extract zip and process information into CSVs.

classmethod from_cli(argv)

Create Packager instance from CLI argparse arguments.

Parameters: argv (list) – arguments passed in via the CLI.
Returns: builder
Return type: Packager

unihan_etl.process.ALLOWED_EXPORT_TYPES = [u'json', u'csv']

Allowed export types

unihan_etl.process.DESTINATION_DIR = '/home/docs/.local/share/unihan_etl'

Directory where the built output file is written.

unihan_etl.process.UNIHAN_FIELDS = ('kAccountingNumeric', 'kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCompatibilityVariant', 'kCowles', 'kDaeJaweon', 'kDefinition', 'kEACC', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kFrequency', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kGSR', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHKSCS', 'kHanYu', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource', 'kJIS0213', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJis0', 'kJis1', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kKangXi', 'kKarlgren', 'kKorean', 'kLau', 'kMainlandTelegraph', 'kMandarin', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kPseudoGB1', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKanWa', 'kRSKangXi', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kVietnamese', 'kXHC1983', 'kXerox', 'kZVariant')

Default Unihan fields

unihan_etl.process.UNIHAN_FILES = ['Unihan_RadicalStrokeCounts.txt', 'Unihan_NumericValues.txt', 'Unihan_Variants.txt', 'Unihan_DictionaryIndices.txt', 'Unihan_DictionaryLikeData.txt', 'Unihan_OtherMappings.txt', 'Unihan_Readings.txt', 'Unihan_IRGSources.txt']

Default Unihan Files

unihan_etl.process.UNIHAN_URL = u'http://www.unicode.org/Public/UNIDATA/Unihan.zip'

URI of Unihan.zip data.

unihan_etl.process.UNIHAN_ZIP_PATH = u'/home/docs/.cache/unihan_etl/downloads/Unihan.zip'

File path where the zip file is downloaded.

unihan_etl.process.WORK_DIR = u'/home/docs/.cache/unihan_etl/downloads'

Directory to use for processing intermediate files.

unihan_etl.process.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None)

Download a file to a destination.

Parameters:
  • url (str) – URL to download from.
  • dest (str) – file path where the download is saved.
  • urlretrieve_fn (function) – function used to download the file.
  • reporthook (function) – function that writes a progress bar to the stdout buffer.
Returns: destination the file was downloaded to.
Return type: str
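
The injectable urlretrieve_fn parameter makes the download step testable without network access. The following is a minimal sketch of that pattern, not the library's implementation: download_sketch, fake_urlretrieve and the example.invalid URL are hypothetical names, and creating the destination directory first is an assumption of the sketch.

```python
import os
import tempfile
from urllib.request import urlretrieve


def download_sketch(url, dest, urlretrieve_fn=urlretrieve, reporthook=None):
    """Fetch url into dest using an injectable retrieval function."""
    dest_dir = os.path.dirname(dest)
    if dest_dir:
        # Assumption of this sketch: create the target directory up front.
        os.makedirs(dest_dir, exist_ok=True)
    urlretrieve_fn(url, dest, reporthook)
    return dest


def fake_urlretrieve(url, filename, reporthook=None):
    """Stand-in downloader: writes a marker file instead of using the network."""
    with open(filename, "w") as f:
        f.write("fetched from " + url)


tmp_dir = tempfile.mkdtemp()
target = os.path.join(tmp_dir, "downloads", "Unihan.zip")
result = download_sketch("http://example.invalid/Unihan.zip", target, fake_urlretrieve)
```

Passing a stand-in function keeps the call signature identical to the default urllib.request.urlretrieve while avoiding any real HTTP request.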

unihan_etl.process.expand_delimiters(normalized_data)

Return expanded multi-value fields in UNIHAN.

Parameters: normalized_data (list of dict) – data as a list of dicts, as returned by process.normalize()
Returns: items whose fields use delimiters or custom separation rules, expanded. This includes multi-value fields that currently hold only one value, so every such field keeps a consistent shape.
Return type: list of dict
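
As a rough illustration of the expansion described above, the sketch below splits space-delimited values into lists. It is not the library's code: expand_space_delimited and the two-field SPACE_FIELDS tuple are hypothetical, and the rows are sample values.

```python
# Hypothetical subset of space-delimited field names, for illustration only.
SPACE_FIELDS = ("kCantonese", "kJapaneseOn")


def expand_space_delimited(normalized_data, space_fields=SPACE_FIELDS):
    """Split space-delimited string values into lists, in place."""
    for char in normalized_data:
        for field in space_fields:
            value = char.get(field)
            if isinstance(value, str):
                # A single value still becomes a one-item list,
                # so the field has the same type in every row.
                char[field] = value.split(" ")
    return normalized_data


rows = [
    {"ucn": "U+4E00", "kCantonese": "jat1", "kJapaneseOn": "ICHI ITSU"},
    {"ucn": "U+4E01", "kCantonese": None, "kJapaneseOn": None},
]
expanded = expand_space_delimited(rows)
```

Empty (None) values are left untouched here; whether the library does the same is not stated in this reference.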
unihan_etl.process.extract_zip(zip_path, dest_dir)

Extract zip file. Return zipfile.ZipFile instance.

Parameters:
  • zip_path (str) – filepath to extract.
  • dest_dir (str) – (optional) directory to extract to.
Returns: The extracted zip.
Return type: zipfile.ZipFile
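
A minimal sketch of this behavior using only the standard zipfile module is shown below. extract_zip_sketch is a hypothetical name, and the archive is a tiny stand-in created on the fly rather than the real Unihan.zip.

```python
import os
import tempfile
import zipfile


def extract_zip_sketch(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the ZipFile."""
    z = zipfile.ZipFile(zip_path)
    z.extractall(dest_dir)
    return z


# Build a small stand-in archive so the sketch runs without a download.
work = tempfile.mkdtemp()
archive = os.path.join(work, "sample.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "U+4E00\tkMandarin\tyī\n")

out_dir = os.path.join(work, "out")
extracted = extract_zip_sketch(archive, out_dir)
names = extracted.namelist()
extracted.close()
```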

unihan_etl.process.files_exist(path, files)

Return True if all files exist in specified path.
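
One plausible way to express this check is sketched below; files_exist_sketch is a hypothetical stand-in, not the library function.

```python
import os
import tempfile


def files_exist_sketch(path, files):
    """Return True only if every name in files exists under path."""
    return all(os.path.exists(os.path.join(path, f)) for f in files)


data_dir = tempfile.mkdtemp()
# Create one of the two files we are going to ask about.
open(os.path.join(data_dir, "Unihan_Readings.txt"), "w").close()

only_present = files_exist_sketch(data_dir, ["Unihan_Readings.txt"])
with_missing = files_exist_sketch(
    data_dir, ["Unihan_Readings.txt", "Unihan_Variants.txt"]
)
```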

unihan_etl.process.filter_manifest(files)

Return filtered UNIHAN_MANIFEST from list of file names.

unihan_etl.process.get_fields(d)

Return list of fields from dict of {filename: ['field', 'field1']}.
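
The two helpers above can be pictured with a shortened manifest. This is a sketch under assumptions: the function names are hypothetical, MANIFEST is a two-entry excerpt rather than the full UNIHAN_MANIFEST, and sorting the result is a choice made here for determinism.

```python
# Two-entry excerpt of a file-to-fields manifest, shortened for illustration.
MANIFEST = {
    "Unihan_NumericValues.txt": (
        "kAccountingNumeric",
        "kOtherNumeric",
        "kPrimaryNumeric",
    ),
    "Unihan_Variants.txt": ("kSemanticVariant", "kSimplifiedVariant"),
}


def filter_manifest_sketch(files, manifest=MANIFEST):
    """Keep only the manifest entries for the requested file names."""
    return {f: manifest[f] for f in files}


def get_fields_sketch(d):
    """Flatten {filename: fields} into one sorted, de-duplicated list."""
    return sorted({field for fields in d.values() for field in fields})


subset = filter_manifest_sketch(["Unihan_NumericValues.txt"])
fields = get_fields_sketch(subset)
```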

unihan_etl.process.get_parser()

Return argparse.ArgumentParser instance for CLI.

Returns: argument parser for CLI use.
Return type: argparse.ArgumentParser

unihan_etl.process.has_valid_zip(zip_path)

Return True if valid zip exists.

Parameters: zip_path (str) – absolute path to zip
Returns: True if valid zip exists at path
Return type: bool
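
A check of this kind can be sketched with zipfile.is_zipfile from the standard library. has_valid_zip_sketch is a hypothetical name and may differ from how the library validates the archive.

```python
import os
import tempfile
import zipfile


def has_valid_zip_sketch(zip_path):
    """Return True if zip_path exists and looks like a zip archive."""
    return os.path.isfile(zip_path) and zipfile.is_zipfile(zip_path)


tmp = tempfile.mkdtemp()

good = os.path.join(tmp, "good.zip")
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("a.txt", "hello")

bad = os.path.join(tmp, "bad.zip")
with open(bad, "w") as f:
    f.write("not a zip archive")

missing = os.path.join(tmp, "missing.zip")
```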
unihan_etl.process.in_fields(c, fields)

Return True if string is in the default fields.

unihan_etl.process.listify(data, fields)

Convert tabularized data to a CSV-friendly list.

Parameters:
  • data (list) – list of dicts
  • fields (list) – keys/columns, e.g. ['kDictionary']
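
To make "CSV-friendly list" concrete, the sketch below turns a list of dicts into rows that csv.writer accepts. listify_sketch is hypothetical, and emitting the field names as the first row is an assumption of this sketch.

```python
import csv
import io


def listify_sketch(data, fields):
    """Return a header row followed by one row per dict, in field order."""
    rows = [list(fields)]
    for item in data:
        rows.append([item.get(field, "") for field in fields])
    return rows


chars = [
    {"ucn": "U+4E00", "char": "一", "kTotalStrokes": "1"},
    {"ucn": "U+4E01", "char": "丁", "kTotalStrokes": "2"},
]
table = listify_sketch(chars, ["ucn", "char", "kTotalStrokes"])

# The resulting list of lists can be handed straight to csv.writer.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(table)
csv_text = buffer.getvalue()
```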
unihan_etl.process.load_data(files)

Load the given UNIHAN data files and return their combined contents.

Parameters: files (list) – files to read
Returns: string of combined data from files
Return type: str

unihan_etl.process.normalize(raw_data, fields)

Return normalized data from UNIHAN data files.

Parameters:
  • raw_data (str) – combined text files from UNIHAN
  • fields (list) – list of columns to pull
Returns: list of unihan character information
Return type: list
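
The normalization step can be pictured as grouping the tab-separated lines of the Unihan text files by code point. The sketch below is an illustration only: normalize_sketch is a hypothetical name, and the output shape (one dict per character with ucn, char and one key per requested field) is an assumption based on the index fields listed under Constants.

```python
def normalize_sketch(raw_data, fields):
    """Group 'UCN<TAB>field<TAB>value' lines into one dict per character."""
    chars = {}
    for line in raw_data.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        ucn, field, value = line.split("\t", 2)
        if field not in fields:
            continue  # only pull the requested columns
        if ucn not in chars:
            code_point = int(ucn.replace("U+", ""), 16)
            chars[ucn] = {"ucn": ucn, "char": chr(code_point)}
            # Pre-fill every requested field so all rows share the same keys.
            chars[ucn].update({f: None for f in fields})
        chars[ucn][field] = value
    return list(chars.values())


raw = (
    "# comment line\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
    "U+4E00\tkTotalStrokes\t1\n"
    "U+4E01\tkTotalStrokes\t2\n"
)
normalized = normalize_sketch(raw, ["kDefinition", "kTotalStrokes"])
```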

unihan_etl.process.not_junk(line)

Return False on newlines and comment lines (Unihan data files mark comments with a leading #).
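
A filter like this is typically applied while reading the raw files line by line. The sketch below assumes comment lines start with "#", as they do in the Unihan text files; not_junk_sketch is a hypothetical stand-in.

```python
def not_junk_sketch(line):
    """Return False for bare newlines and for lines starting with '#'."""
    return line != "\n" and not line.startswith("#")


lines = [
    "# Unihan_Readings.txt\n",
    "\n",
    "U+4E00\tkMandarin\tyī\n",
]
kept = [line for line in lines if not_junk_sketch(line)]
```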

unihan_etl.process.setup_logger(logger=None, level=u'DEBUG')

Set up logging for CLI use.

Parameters: logger (Logger) – instance of logger

unihan_etl.process.zip_has_files(files, zip_file)

Return True if zip has the files inside.

Parameters:
  • files (list) – list of files inside zip
  • zip_file (zipfile.ZipFile) – zip file to look inside.
Returns: True if all files are listed in zipfile.ZipFile.namelist().
Return type: bool
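
The membership test against namelist() can be sketched as a subset check. zip_has_files_sketch is a hypothetical name, and the archive is built in memory so the example needs no files on disk.

```python
import io
import zipfile


def zip_has_files_sketch(files, zip_file):
    """Return True if every name in files is a member of zip_file."""
    return set(files).issubset(zip_file.namelist())


# Build an in-memory archive holding two empty members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "")
    zf.writestr("Unihan_Variants.txt", "")

archive = zipfile.ZipFile(buffer)
has_readings = zip_has_files_sketch(["Unihan_Readings.txt"], archive)
has_sources = zip_has_files_sketch(["Unihan_IRGSources.txt"], archive)
archive.close()
```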

Constants

unihan_etl.constants.CUSTOM_DELIMITED_FIELDS = ('kDefinition', 'kDaeJaweon', 'kHDZRadBreak', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource')

Fields with multiple values separated by custom delimiters

unihan_etl.constants.INDEX_FIELDS = ('ucn', 'char')

Default index fields for unihan CSVs. You probably want these.

unihan_etl.constants.SPACE_DELIMITED_DICT_FIELDS = ('kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')

Fields with multiple values UNIHAN delimits by spaces -> dict

unihan_etl.constants.SPACE_DELIMITED_FIELDS = ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJis0', 'kJIS0213', 'kJis1', 'kKangXi', 'kKarlgren', 'kKorean', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant', 'kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')

Any space delimited field regardless of expanded form

unihan_etl.constants.SPACE_DELIMITED_LIST_FIELDS = ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJis0', 'kJIS0213', 'kJis1', 'kKangXi', 'kKarlgren', 'kKorean', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant')

Fields with multiple values UNIHAN delimits by spaces -> list

unihan_etl.constants.UNIHAN_MANIFEST = {'Unihan_RadicalStrokeCounts.txt': ('kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode'), 'Unihan_NumericValues.txt': ('kAccountingNumeric', 'kOtherNumeric', 'kPrimaryNumeric'), 'Unihan_Variants.txt': ('kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTraditionalVariant', 'kZVariant'), 'Unihan_DictionaryIndices.txt': ('kCheungBauerIndex', 'kCowles', 'kDaeJaweon', 'kFennIndex', 'kGSR', 'kHanYu', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kKangXi', 'kKarlgren', 'kLau', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kSBGY'), 'Unihan_DictionaryLikeData.txt': ('kCangjie', 'kCheungBauer', 'kCihaiT', 'kFenn', 'kFourCornerCode', 'kFrequency', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kPhonetic', 'kTotalStrokes'), 'Unihan_OtherMappings.txt': ('kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kEACC', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kHKSCS', 'kIBMJapan', 'kJa', 'kJis0', 'kJis1', 'kJIS0213', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kMainlandTelegraph', 'kPseudoGB1', 'kTaiwanTelegraph', 'kXerox'), 'Unihan_Readings.txt': ('kCantonese', 'kDefinition', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kJapaneseKun', 'kJapaneseOn', 'kKorean', 'kMandarin', 'kTang', 'kVietnamese', 'kXHC1983'), 'Unihan_IRGSources.txt': ('kCompatibilityVariant', 'kIICore', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource')}

Dictionary mapping each UNIHAN file name to a tuple of the fields it contains

Expansion

Functions to uncompact details inside field values.

Note

re.compile() calls are kept inside the expand functions for these reasons:

  1. readability
  2. module-level function bytecode is cached in python
  3. the last used compiled regexes are cached
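
The third point can be observed directly: in CPython, the re module keeps a cache of recently compiled patterns, so compiling the same pattern again returns the cached object. The sketch below illustrates this; expand_two_numbers and its pattern are hypothetical and not taken from the library.

```python
import re

PATTERN = r"^(?P<first>\d+)(?: (?P<second>\d+))?$"


def expand_two_numbers(value):
    """Hypothetical expander: the pattern is compiled where it is used."""
    pattern = re.compile(PATTERN)  # served from re's cache after the first call
    match = pattern.match(value)
    return match.groupdict() if match else None


# Compiling the same pattern twice yields the same object in CPython,
# which is why compiling inside the function costs little.
first_call = re.compile(PATTERN)
second_call = re.compile(PATTERN)
same_object = first_call is second_call

parsed = expand_two_numbers("12 13")
single = expand_two_numbers("8")
```

The cache is an implementation detail of CPython's re module, so code that needs a guaranteed single compilation would keep the compiled pattern in a module-level name instead.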
unihan_etl.expansion.N_DIACRITICS = u'\u0144\u0148\u01f9'

diacritics from kHanyuPinlu

unihan_etl.expansion.expand_field(field, fvalue)

Return structured value of information in UNIHAN field.

Parameters:
  • field (str) – field name
  • fvalue – value of field
Returns: list or dict of expanded field information per UNIHAN’s documentation
Return type: list or dict
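
As an illustration of a dict-shaped expansion, the sketch below handles kMandarin, relying on the rule in the Unihan documentation that when two readings are given the first is preferred for zh-Hans and the second for zh-Hant. The function names, the EXPANDERS registry and the output keys are assumptions of this sketch, not the library's API, and the readings are sample values.

```python
def expand_kMandarin_sketch(value):
    """Split a kMandarin value into zh-Hans and zh-Hant readings."""
    readings = value.split(" ")
    # With a single reading, both locales share it.
    return {"zh-Hans": readings[0], "zh-Hant": readings[-1]}


# Hypothetical registry mapping field names to expander functions.
EXPANDERS = {"kMandarin": expand_kMandarin_sketch}


def expand_field_sketch(field, fvalue):
    """Dispatch to a field-specific expander, or return the value unchanged."""
    expander = EXPANDERS.get(field)
    return expander(fvalue) if expander else fvalue


two_readings = expand_field_sketch("kMandarin", "xuè xiě")
one_reading = expand_field_sketch("kMandarin", "yī")
untouched = expand_field_sketch("kFrequency", "1")
```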

Utilities and test helpers

Utility and helper methods for script.

util

unihan_etl.util.ucn_to_unicode(ucn)

Return a Python unicode value from a UCN.

Converts a Unicode Universal Character Number (e.g. "U+4E00" or "4E00") to Python unicode (u'\u4e00').

unihan_etl.util.ucnstring_to_python(ucn_string)

Return string with Unicode UCN (e.g. "U+4E00") converted to native Python Unicode (u'\u4e00').

unihan_etl.util.ucnstring_to_unicode(ucn_string)

Return ucnstring as Unicode.
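
The conversions above amount to parsing the hexadecimal code point and calling chr(). The Python 3 sketch below shows the idea; both function names are hypothetical, and the regular expression for embedded UCNs (4 to 6 hex digits) is an assumption of this sketch.

```python
import re


def ucn_to_unicode_sketch(ucn):
    """Convert 'U+4E00' or '4E00' to the corresponding character."""
    hex_digits = ucn[2:] if ucn.upper().startswith("U+") else ucn
    return chr(int(hex_digits, 16))


def ucnstring_to_unicode_sketch(ucn_string):
    """Replace every 'U+XXXX' occurrence in a string with its character."""
    return re.sub(
        r"U\+([0-9A-Fa-f]{4,6})",
        lambda m: chr(int(m.group(1), 16)),
        ucn_string,
    )


with_prefix = ucn_to_unicode_sketch("U+4E00")
without_prefix = ucn_to_unicode_sketch("4E00")
converted = ucnstring_to_unicode_sketch("U+4E00 U+4E01")
```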

Test helper functions for downloading and processing Unihan data.

unihan_etl.test.assert_dict_contains_subset(subset, dictionary, msg=None)

Ported assertion for dict subsets in py.test.

Parameters:
  • subset (dict) – needle
  • dictionary (dict) – haystack
  • msg (str) – message for assert if failure found
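
A subset assertion of this kind can be sketched in a few lines. The helper below is a hypothetical stand-in for illustration and may differ from the ported implementation, in particular in the wording of its failure message.

```python
def assert_dict_contains_subset_sketch(subset, dictionary, msg=None):
    """Raise AssertionError unless every subset item appears in dictionary."""
    missing = [k for k in subset if k not in dictionary]
    mismatched = {
        k: (subset[k], dictionary[k])
        for k in subset
        if k in dictionary and subset[k] != dictionary[k]
    }
    if missing or mismatched:
        raise AssertionError(
            msg or "missing keys: %r; mismatched values: %r" % (missing, mismatched)
        )


record = {"ucn": "U+4E00", "char": "一", "kTotalStrokes": "1"}

# Passes silently: the needle is fully contained in the haystack.
assert_dict_contains_subset_sketch({"ucn": "U+4E00"}, record)

# Fails: same key, different value.
try:
    assert_dict_contains_subset_sketch({"ucn": "U+4E01"}, record)
    failed = False
except AssertionError:
    failed = True
```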