API

Build Unihan into tabular / structured format and export it.

class unihan_etl.process.Packager(options)[source]

Download and generate a tabular release of UNIHAN.

download(urlretrieve_fn=<function urlretrieve>)[source]

Download raw UNIHAN data if it does not already exist.

Parameters:urlretrieve_fn (function) – function to download file
export()[source]

Extract zip and process information into CSVs.

classmethod from_cli(argv)[source]

Create Packager instance from CLI argparse arguments.

Parameters:argv (list) – Arguments passed in via CLI.
Returns:builder
Return type:Packager
unihan_etl.process.ALLOWED_EXPORT_TYPES = ['json', 'csv']

Allowed export types

unihan_etl.process.DESTINATION_DIR = '/home/docs/.local/share/unihan_etl'

Filepath to output built CSV file to.

unihan_etl.process.UNIHAN_FIELDS = ('kAccountingNumeric', 'kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCompatibilityVariant', 'kCowles', 'kDaeJaweon', 'kDefinition', 'kEACC', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kFrequency', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kGSR', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHKSCS', 'kHanYu', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource', 'kJIS0213', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJoyoKanji', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kLau', 'kMainlandTelegraph', 'kMandarin', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kPseudoGB1', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKanWa', 'kRSKangXi', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTGH', 'kTaiwanTelegraph', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kVietnamese', 'kXHC1983', 'kXerox', 'kZVariant')

Default Unihan fields

unihan_etl.process.UNIHAN_FILES = dict_keys(['Unihan_DictionaryIndices.txt', 'Unihan_DictionaryLikeData.txt', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_OtherMappings.txt', 'Unihan_RadicalStrokeCounts.txt', 'Unihan_Readings.txt', 'Unihan_Variants.txt'])

Default Unihan Files

unihan_etl.process.UNIHAN_URL = 'http://www.unicode.org/Public/UNIDATA/Unihan.zip'

URI of Unihan.zip data.

unihan_etl.process.UNIHAN_ZIP_PATH = '/home/docs/.cache/unihan_etl/downloads/Unihan.zip'

Filepath to download Zip file.

unihan_etl.process.WORK_DIR = '/home/docs/.cache/unihan_etl/downloads'

Directory to use for processing intermediate files.

unihan_etl.process.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None)[source]

Download file at URL to a destination.

Parameters:
  • url (str) – URL to download from.
  • dest (str) – file path where download is to be saved.
  • urlretrieve_fn (callable) – function to download file
  • reporthook (function) – Function to write progress bar to stdout buffer.
Returns:

destination where file downloaded to.

Return type:

str
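
The urlretrieve_fn parameter makes the downloader swappable, which is useful for offline tests. The following is a minimal sketch of that pattern, not the library's implementation; download_sketch and fake_urlretrieve are hypothetical names, and creating the destination directory is an assumption.

```python
import os
from urllib.request import urlretrieve


def download_sketch(url, dest, urlretrieve_fn=urlretrieve, reporthook=None):
    """Illustrative stand-in showing how an injected downloader is called."""
    # Assumption: the destination directory is created on demand.
    dest_dir = os.path.dirname(dest)
    if dest_dir:
        os.makedirs(dest_dir, exist_ok=True)
    # Delegate the actual transfer to whichever function was passed in.
    urlretrieve_fn(url, dest, reporthook)
    return dest


def fake_urlretrieve(url, filename, reporthook=None):
    """Hypothetical offline replacement that writes a placeholder file."""
    with open(filename, "w") as f:
        f.write("placeholder for " + url)
    return filename, None
```

Passing fake_urlretrieve in place of the default lets a test exercise the surrounding logic without touching the network.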

unihan_etl.process.expand_delimiters(normalized_data)[source]

Return expanded multi-value fields in UNIHAN.

Parameters:normalized_data (list of dict) – Expects data in list of hashes, per process.normalize()
Returns:Items whose fields use delimiters and custom separation rules are expanded. This includes multi-value fields that do not use both, so all fields stay consistent.
Return type:list of dict
unihan_etl.process.extract_zip(zip_path, dest_dir)[source]

Extract zip file. Return zipfile.ZipFile instance.

Parameters:
  • zip_path (str) – filepath to extract.
  • dest_dir (str) – directory to extract to.
Returns:

The extracted zip.

Return type:

zipfile.ZipFile
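
As a rough illustration of the behavior described above, the sketch below extracts an archive with the standard zipfile module and returns the archive object. extract_zip_sketch is a hypothetical name, and the sample archive is built locally so the example runs offline.

```python
import os
import tempfile
import zipfile


def extract_zip_sketch(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the archive."""
    archive = zipfile.ZipFile(zip_path)
    archive.extractall(dest_dir)
    return archive


# Build a tiny archive with one UNIHAN-style file for demonstration.
work = tempfile.mkdtemp()
archive_path = os.path.join(work, "sample.zip")
with zipfile.ZipFile(archive_path, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "U+3400\tkCantonese\tjau1\n")

extracted = extract_zip_sketch(archive_path, os.path.join(work, "out"))
names = extracted.namelist()
extracted.close()
```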

unihan_etl.process.files_exist(path, files)[source]

Return True if all files exist in specified path.

unihan_etl.process.filter_manifest(files)[source]

Return filtered UNIHAN_MANIFEST from list of file names.

unihan_etl.process.get_fields(d)[source]

Return list of fields from dict of {filename: [‘field’, ‘field1’]}.
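
A minimal sketch of flattening such a mapping into a single list of field names. get_fields_sketch is a hypothetical name; de-duplicating and sorting the result are assumptions made for a predictable output, not confirmed library behavior.

```python
def get_fields_sketch(d):
    """Flatten {filename: [field, ...]} into a sorted list of unique fields."""
    return sorted({field for fields in d.values() for field in fields})


manifest = {
    "Unihan_NumericValues.txt": ("kAccountingNumeric", "kOtherNumeric", "kPrimaryNumeric"),
    "Unihan_Variants.txt": ("kSemanticVariant", "kZVariant"),
}
fields = get_fields_sketch(manifest)
```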

unihan_etl.process.get_parser()[source]

Return argparse.ArgumentParser instance for CLI.

Returns:argument parser for CLI use.
Return type:argparse.ArgumentParser
unihan_etl.process.has_valid_zip(zip_path)[source]

Return True if valid zip exists.

Parameters:zip_path (str) – absolute path to zip
Returns:True if valid zip exists at path
Return type:bool
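
One way such a check could be written with the standard library, shown only as a sketch: has_valid_zip_sketch is a hypothetical name, and using zipfile.is_zipfile as the validity test is an assumption.

```python
import os
import tempfile
import zipfile


def has_valid_zip_sketch(zip_path):
    """Return True only if zip_path exists and looks like a zip archive."""
    return os.path.isfile(zip_path) and zipfile.is_zipfile(zip_path)


work_dir = tempfile.mkdtemp()

# A real (if nearly empty) archive.
good = os.path.join(work_dir, "good.zip")
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "")

# A plain text file that merely has a .zip extension.
bad = os.path.join(work_dir, "bad.zip")
with open(bad, "w") as f:
    f.write("not a zip archive")

# A path that does not exist at all.
missing = os.path.join(work_dir, "missing.zip")
```
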
unihan_etl.process.in_fields(c, fields)[source]

Return True if string is in the default fields.

unihan_etl.process.listify(data, fields)[source]

Convert tabularized data to a CSV-friendly list.

Parameters:
  • data (list of dict) –
  • fields (list of str) – keys/columns, e.g. [‘kDictionary’]
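
To illustrate what a CSV-friendly list can look like, the sketch below turns a list of dicts into rows that csv.writer accepts. listify_sketch is a hypothetical name; emitting a header row first and filling missing keys with an empty string are assumptions, not confirmed library behavior.

```python
import csv
import io


def listify_sketch(data, fields):
    """Return a header row followed by one row of values per record."""
    rows = [list(fields)]
    for record in data:
        # Assumption: a field absent from a record becomes an empty cell.
        rows.append([record.get(field, "") for field in fields])
    return rows


data = [
    {"ucn": "U+4E00", "char": "一", "kTotalStrokes": "1"},
    {"ucn": "U+4E8C", "char": "二"},
]
rows = listify_sketch(data, ["ucn", "char", "kTotalStrokes"])

# The rows can be handed straight to the csv module.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
```
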
unihan_etl.process.load_data(files)[source]

Load the given UNIHAN files and return their combined contents.

Parameters:files (list of str) –
Returns:combined data from files
Return type:str
unihan_etl.process.normalize(raw_data, fields)[source]

Return normalized data from UNIHAN data files.

Parameters:
  • raw_data (str) – combined text files from UNIHAN
  • fields (list of str) – list of columns to pull
Returns:

list of unihan character information

Return type:

list
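
UNIHAN text files store one value per line as a code point, a field name, and a value separated by tabs. The sketch below groups such lines into one dict per code point, keeping only the requested fields. normalize_sketch is a hypothetical name, it takes an iterable of lines rather than a single string, and the shape of the resulting dicts is an assumption.

```python
def normalize_sketch(raw_lines, fields):
    """Group tab-separated UNIHAN lines into one dict per code point."""
    records = {}
    for line in raw_lines:
        line = line.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        ucn, field, value = line.split("\t", 2)
        if field not in fields:
            continue
        record = records.setdefault(ucn, {"ucn": ucn})
        record[field] = value
    return list(records.values())


sample = [
    "# comment line\n",
    "U+3400\tkCantonese\tjau1\n",
    "U+3400\tkDefinition\t(same as U+4E18 丘) hillock or mound\n",
    "U+3401\tkCantonese\ttim2\n",
]
result = normalize_sketch(sample, ["kCantonese"])
```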

unihan_etl.process.not_junk(line)[source]

Return False on newlines and comment lines (those starting with #).
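
UNIHAN data files mark comment lines with a leading #. A minimal sketch of a line filter along those lines, with not_junk_sketch as a hypothetical name:

```python
def not_junk_sketch(line):
    """Return False for bare newlines and lines that start with '#'."""
    return line != "\n" and not line.startswith("#")


lines = [
    "# Unihan_Readings.txt\n",
    "\n",
    "U+3400\tkCantonese\tjau1\n",
]
kept = [line for line in lines if not_junk_sketch(line)]
```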

unihan_etl.process.setup_logger(logger=None, level='DEBUG')[source]

Set up logging for CLI use.

Parameters:
  • logger (Logger) – instance of logger
  • level (str) – logging level, e.g. ‘DEBUG’
unihan_etl.process.zip_has_files(files, zip_file)[source]

Return True if zip has the files inside.

Parameters:
  • files (list of str) – files inside zip file
  • zip_file (zipfile.ZipFile) –
Returns:

True if files are inside of zipfile.ZipFile.namelist()

Return type:

bool
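
The check described above amounts to comparing the wanted names against namelist(). A small sketch, assuming a subset test is sufficient; zip_has_files_sketch is a hypothetical name and the archive is built in memory so the example runs offline.

```python
import io
import zipfile


def zip_has_files_sketch(files, zip_file):
    """Return True if every name in files appears in zip_file.namelist()."""
    return set(files).issubset(zip_file.namelist())


# Build an in-memory archive holding two empty members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "")
    zf.writestr("Unihan_Variants.txt", "")

archive = zipfile.ZipFile(buffer)
```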

Constants

unihan_etl.constants.CUSTOM_DELIMITED_FIELDS = ('kDefinition', 'kDaeJaweon', 'kHDZRadBreak', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource')

FIELDS with multiple values via custom delimiters

unihan_etl.constants.INDEX_FIELDS = ('ucn', 'char')

Default index fields for Unihan CSVs. You probably want these.

unihan_etl.constants.SPACE_DELIMITED_DICT_FIELDS = ('kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')

Fields with multiple values UNIHAN delimits by spaces -> dict

unihan_etl.constants.SPACE_DELIMITED_FIELDS = ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant', 'kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')

Any space delimited field regardless of expanded form

unihan_etl.constants.SPACE_DELIMITED_LIST_FIELDS = ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant')

Fields with multiple values UNIHAN delimits by spaces -> list

unihan_etl.constants.UNIHAN_MANIFEST = {'Unihan_DictionaryIndices.txt': ('kCheungBauerIndex', 'kCowles', 'kDaeJaweon', 'kFennIndex', 'kGSR', 'kHanYu', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kKangXi', 'kKarlgren', 'kLau', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kSBGY'), 'Unihan_DictionaryLikeData.txt': ('kCangjie', 'kCheungBauer', 'kCihaiT', 'kFenn', 'kFourCornerCode', 'kFrequency', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kPhonetic', 'kTotalStrokes'), 'Unihan_IRGSources.txt': ('kCompatibilityVariant', 'kIICore', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource'), 'Unihan_NumericValues.txt': ('kAccountingNumeric', 'kOtherNumeric', 'kPrimaryNumeric'), 'Unihan_OtherMappings.txt': ('kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kEACC', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kHKSCS', 'kIBMJapan', 'kJa', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJIS0213', 'kJoyoKanji', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kMainlandTelegraph', 'kPseudoGB1', 'kTaiwanTelegraph', 'kTGH', 'kXerox'), 'Unihan_RadicalStrokeCounts.txt': ('kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode'), 'Unihan_Readings.txt': ('kCantonese', 'kDefinition', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kJapaneseKun', 'kJapaneseOn', 'kKorean', 'kMandarin', 'kTang', 'kVietnamese', 'kXHC1983'), 'Unihan_Variants.txt': ('kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTraditionalVariant', 'kZVariant')}

Dictionary of tuples mapping locations of files to fields

Expansion

Functions to uncompact details inside field values.

Notes

re.compile() operations are kept inside the expand functions for three reasons:

  1. readability
  2. module-level function bytecode is cached in python
  3. the last used compiled regexes are cached
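
The pattern described in these notes can be sketched as follows. The function name, the regular expression, and the returned keys are all hypothetical; the input format is loosely modeled on radical-stroke values such as "120'.3" and is not taken from the library's code.

```python
import re


def expand_radical_stroke_sketch(value):
    """Split a radical-stroke style value such as "120'.3" into named parts."""
    # Compiled inside the function: the pattern sits next to the code that
    # uses it, and the re module keeps recently compiled patterns in a cache,
    # so repeated calls do not recompile from scratch.
    pattern = re.compile(r"(?P<radical>\d+)(?P<simplified>')?\.(?P<strokes>-?\d+)$")
    match = pattern.match(value)
    if match is None:
        raise ValueError("unrecognized value: %r" % value)
    return {
        "radical": int(match.group("radical")),
        "simplified": match.group("simplified") is not None,
        "strokes": int(match.group("strokes")),
    }


parsed = expand_radical_stroke_sketch("120'.3")
```
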
unihan_etl.expansion.N_DIACRITICS = 'ńňǹ'

diacritics from kHanyuPinlu

unihan_etl.expansion.expand_field(field, fvalue)[source]

Return structured value of information in UNIHAN field.

Parameters:
  • field (str) – field name
  • fvalue (str) – value of field
Returns:

expanded field information per UNIHAN’s documentation

Return type:

list or dict
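
As a simplified picture of field expansion, the sketch below splits the value of a space-delimited field into a list and leaves other values untouched. expand_field_sketch and SAMPLE_LIST_FIELDS are hypothetical; the real function handles many more field-specific rules, and the sample values are for illustration only.

```python
# Hypothetical subset of fields whose values are separated by spaces.
SAMPLE_LIST_FIELDS = ("kCantonese", "kJapaneseKun")


def expand_field_sketch(field, fvalue):
    """Split space-delimited fields into a list; return other values as-is."""
    if field in SAMPLE_LIST_FIELDS:
        return fvalue.split(" ")
    return fvalue


readings = expand_field_sketch("kJapaneseKun", "HITOTSU HITOTABI HAJIME")
definition = expand_field_sketch("kDefinition", "one; a, an; alone")
```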

Utilities and test helpers

Utility and helper methods for script.

util

unihan_etl.util.ucn_to_unicode(ucn)[source]

Return a python unicode value from a UCN.

Converts a Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Python unicode (u’\u4e00’)
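
The conversion can be sketched in Python 3 as below: strip an optional "U+" prefix, read the rest as hexadecimal, and map the code point to a character. ucn_to_unicode_sketch is a hypothetical name, not the library's implementation.

```python
def ucn_to_unicode_sketch(ucn):
    """Convert "U+4E00" or "4E00" to the corresponding one-character string."""
    hex_digits = ucn[2:] if ucn.upper().startswith("U+") else ucn
    return chr(int(hex_digits, 16))


char = ucn_to_unicode_sketch("U+4E00")
```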

unihan_etl.util.ucnstring_to_python(ucn_string)[source]

Return string with Unicode UCNs (e.g. “U+4E00”) converted to native Python Unicode (u’\u4e00’).

unihan_etl.util.ucnstring_to_unicode(ucn_string)[source]

Return ucnstring as Unicode.

Test helper functions for downloading and processing Unihan data.

unihan_etl.test.assert_dict_contains_subset(subset, dictionary, msg=None)[source]

Ported assertion for dict subsets in py.test.

Parameters:
  • subset (dict) – needle
  • dictionary (dict) – haystack
  • msg (str, optional) – message to display if the assertion fails
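
A minimal sketch of a dict-subset assertion in this spirit. The function name and the wording of the failure message are hypothetical and not taken from the library.

```python
def assert_dict_contains_subset_sketch(subset, dictionary, msg=None):
    """Raise AssertionError unless every item of subset is in dictionary."""
    missing = [key for key in subset if key not in dictionary]
    mismatched = {
        key: (subset[key], dictionary[key])
        for key in subset
        if key in dictionary and subset[key] != dictionary[key]
    }
    if missing or mismatched:
        detail = "missing: %r; mismatched: %r" % (missing, mismatched)
        raise AssertionError("%s : %s" % (msg, detail) if msg else detail)


# Passes silently: the needle is fully contained in the haystack.
assert_dict_contains_subset_sketch(
    {"ucn": "U+4E00"}, {"ucn": "U+4E00", "char": "一"}
)
```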