API

Build Unihan into tabular / structured format and export it.

class unihan_etl.process.Packager(options)

Download and generate a tabular release of UNIHAN.

download(urlretrieve_fn=<function urlretrieve>)

Download the raw UNIHAN data if it does not already exist.

Parameters:	urlretrieve_fn (function) – function used to download the file

export(): Extract the zip and process its contents into CSVs.

classmethod from_cli(argv)

Create a Packager instance from CLI argparse arguments.

Parameters:	argv (list) – Arguments passed in via CLI.
Returns:	builder
Return type:	`Packager`

unihan_etl.process.ALLOWED_EXPORT_TYPES = ['json', 'csv']: Allowed export types

unihan_etl.process.DESTINATION_DIR = '/home/docs/.local/share/unihan_etl': Filepath to output built CSV file to.

unihan_etl.process.UNIHAN_FIELDS = ('kAccountingNumeric', 'kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCompatibilityVariant', 'kCowles', 'kDaeJaweon', 'kDefinition', 'kEACC', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kFrequency', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kGSR', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHKSCS', 'kHanYu', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource', 'kJIS0213', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJoyoKanji', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kLau', 'kMainlandTelegraph', 'kMandarin', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kPseudoGB1', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKanWa', 'kRSKangXi', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTGH', 'kTaiwanTelegraph', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kVietnamese', 'kXHC1983', 'kXerox', 'kZVariant')¶: Default Unihan fields

unihan_etl.process.UNIHAN_FILES = dict_keys(['Unihan_DictionaryIndices.txt', 'Unihan_DictionaryLikeData.txt', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_OtherMappings.txt', 'Unihan_RadicalStrokeCounts.txt', 'Unihan_Readings.txt', 'Unihan_Variants.txt']): Default Unihan files

unihan_etl.process.UNIHAN_URL = 'http://www.unicode.org/Public/UNIDATA/Unihan.zip': URI of Unihan.zip data.

unihan_etl.process.UNIHAN_ZIP_PATH = '/home/docs/.cache/unihan_etl/downloads/Unihan.zip': Filepath to download the zip file to.

unihan_etl.process.WORK_DIR = '/home/docs/.cache/unihan_etl/downloads': Directory to use for processing intermediate files.

unihan_etl.process.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None)

Download the file at a URL to a destination.

Parameters:	url (str) – URL to download from.
	dest (str) – file path where the download is to be saved.
	urlretrieve_fn (callable) – function used to download the file.
	reporthook (function) – function that writes a progress bar to the stdout buffer.
Returns:	destination the file was downloaded to.
Return type:	str
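
Because urlretrieve_fn is a parameter, the download step can be exercised without network access by passing a stand-in. The following is a minimal sketch of that pattern rather than the library's implementation: download_sketch and fake_urlretrieve are hypothetical names, and creating the parent directory is an assumption.

```python
import os
import tempfile
from urllib.request import urlretrieve


def download_sketch(url, dest, urlretrieve_fn=urlretrieve, reporthook=None):
    """Fetch url into dest through an injectable retriever and return dest."""
    dest_dir = os.path.dirname(dest)
    if dest_dir:
        # Assumption: create the parent directory when it is missing.
        os.makedirs(dest_dir, exist_ok=True)
    urlretrieve_fn(url, dest, reporthook)
    return dest


def fake_urlretrieve(url, filename, reporthook=None):
    """Hypothetical stand-in for urlretrieve that writes a stub file."""
    with open(filename, "w") as handle:
        handle.write("stub from " + url)
    return filename, None


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "downloads", "Unihan.zip")
    saved = download_sketch(
        "http://www.unicode.org/Public/UNIDATA/Unihan.zip",
        target,
        urlretrieve_fn=fake_urlretrieve,
    )
    saved_ok = saved == target and os.path.exists(saved)
```

The stand-in accepts the same positional arguments as urllib.request.urlretrieve (URL, file name, report hook), so the real function can be swapped back in unchanged.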

unihan_etl.process.expand_delimiters(normalized_data)

Return expanded multi-value fields in UNIHAN.

Parameters:	normalized_data (list of dict) – Expects data in list of hashes, per `process.normalize()`
Returns:	Items whose fields use delimiters or custom separation rules are expanded. This includes multi-value fields that hold only a single value, so all fields stay consistent.
Return type:	list of dict

unihan_etl.process.extract_zip(zip_path, dest_dir)

Extract zip file. Return zipfile.ZipFile instance.

Parameters:	zip_path (str) – filepath to extract.
	dest_dir (str) – directory to extract to.
Returns:	The extracted zip.
Return type:	`zipfile.ZipFile`
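
As a rough illustration of the contract described above (extract, then hand back the ZipFile), here is a sketch built only on the standard library. extract_zip_sketch is a hypothetical name, and the archive contents are a made-up one-line sample, not real release data.

```python
import os
import tempfile
import zipfile


def extract_zip_sketch(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir; return the ZipFile."""
    archive = zipfile.ZipFile(zip_path)
    archive.extractall(dest_dir)
    return archive


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny archive so the sketch runs without downloading anything.
    zip_path = os.path.join(tmp, "sample.zip")
    with zipfile.ZipFile(zip_path, "w") as writer:
        writer.writestr("Unihan_Readings.txt", "U+3400\tkCantonese\tjau1\n")

    archive = extract_zip_sketch(zip_path, os.path.join(tmp, "out"))
    names = archive.namelist()
    extracted = os.path.exists(os.path.join(tmp, "out", "Unihan_Readings.txt"))
    archive.close()
```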

unihan_etl.process.files_exist(path, files): Return True if all files exist in the specified path.

unihan_etl.process.filter_manifest(files): Return filtered UNIHAN_MANIFEST from list of file names.

unihan_etl.process.get_fields(d): Return list of fields from dict of {filename: ['field', 'field1']}.
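
The two manifest helpers above can be pictured with a small sketch. It is not the library's code: the _sketch functions are hypothetical, the manifest is passed in explicitly (the documented filter_manifest takes only files), and sorting and de-duplicating the result is an assumption.

```python
def filter_manifest_sketch(manifest, files):
    """Keep only the manifest entries whose file name is in files."""
    return {name: fields for name, fields in manifest.items() if name in files}


def get_fields_sketch(manifest):
    """Flatten {filename: fields} into a sorted list of unique field names."""
    return sorted({field for fields in manifest.values() for field in fields})


# Two entries in the shape of UNIHAN_MANIFEST, shortened for illustration.
sample_manifest = {
    "Unihan_NumericValues.txt": ("kAccountingNumeric", "kOtherNumeric", "kPrimaryNumeric"),
    "Unihan_Variants.txt": ("kSemanticVariant", "kZVariant"),
}

subset = filter_manifest_sketch(sample_manifest, ["Unihan_Variants.txt"])
all_fields = get_fields_sketch(sample_manifest)
variant_fields = get_fields_sketch(subset)
```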

unihan_etl.process.get_parser()

Return argparse.ArgumentParser instance for CLI.

Returns:	argument parser for CLI use.
Return type:	`argparse.ArgumentParser`

unihan_etl.process.has_valid_zip(zip_path)

Return True if valid zip exists.

Parameters:	zip_path (str) – absolute path to zip
Returns:	True if valid zip exists at path
Return type:	bool
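
One plausible way to implement such a check with the standard library is zipfile.is_zipfile. The sketch below assumes that approach; has_valid_zip_sketch is a hypothetical name and the library may validate differently.

```python
import os
import tempfile
import zipfile


def has_valid_zip_sketch(zip_path):
    """Return True if zip_path is an existing file that is a zip archive."""
    return os.path.isfile(zip_path) and zipfile.is_zipfile(zip_path)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "Unihan.zip")
    with zipfile.ZipFile(good, "w") as writer:
        writer.writestr("Unihan_Readings.txt", "")

    bad = os.path.join(tmp, "broken.zip")
    with open(bad, "w") as handle:
        handle.write("not a zip archive")

    results = (
        has_valid_zip_sketch(good),
        has_valid_zip_sketch(bad),
        has_valid_zip_sketch(os.path.join(tmp, "missing.zip")),
    )
```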

unihan_etl.process.in_fields(c, fields): Return True if string is in the default fields.

unihan_etl.process.listify(data, fields)

Convert tabularized data to a CSV-friendly list.

Parameters:	data (list of dict)
	fields (list of str) – keys/columns, e.g. ['kDictionary']
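
To make "CSV-friendly list" concrete, the sketch below turns a list of dicts into a list of rows. The header row and the empty string used for missing values are assumptions, and listify_sketch is a hypothetical name.

```python
def listify_sketch(data, fields):
    """Return a header row followed by one row per item, in fields order."""
    rows = [list(fields)]
    for item in data:
        # Assumption: a field missing from an item becomes an empty cell.
        rows.append([item.get(field, "") for field in fields])
    return rows


data = [
    {"ucn": "U+3400", "char": "\u3400", "kCantonese": "jau1"},
    {"ucn": "U+3401", "char": "\u3401"},
]
table = listify_sketch(data, ["ucn", "char", "kCantonese"])
```

A list shaped like this can be passed directly to csv.writer(...).writerows().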

unihan_etl.process.load_data(files)

Load the given files and return their combined data.

Parameters:	files (list of str)
Returns:	combined data from files
Return type:	str

unihan_etl.process.normalize(raw_data, fields)

Return normalized data from UNIHAN data files.

Parameters:	raw_data (str) – combined text files from UNIHAN
	fields (list of str) – list of columns to pull
Returns:	list of unihan character information
Return type:	list

unihan_etl.process.not_junk(line): Return False on newlines and C-style comments.
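
The Unihan text files hold one tab-separated record per line (code point, field, value), with comment lines starting with "#". Under that assumption, normalization can be sketched as grouping records by code point. The _sketch functions are hypothetical, the sample rows are illustrative, and the shape of each output dict ('ucn', 'char', then fields) is assumed from INDEX_FIELDS.

```python
def not_junk_sketch(line):
    """Return False for blank lines and for lines starting with '#'."""
    return bool(line.strip()) and not line.startswith("#")


def normalize_sketch(raw_data, fields):
    """Group 'UCN<TAB>field<TAB>value' records into one dict per character."""
    by_ucn = {}
    for line in filter(not_junk_sketch, raw_data.splitlines()):
        ucn, field, value = line.split("\t", 2)
        if field not in fields:
            continue  # only pull the requested columns
        row = by_ucn.setdefault(ucn, {"ucn": ucn, "char": chr(int(ucn[2:], 16))})
        row[field] = value
    return list(by_ucn.values())


raw = (
    "# Illustrative excerpt in the Unihan file layout\n"
    "\n"
    "U+3400\tkCantonese\tjau1\n"
    "U+3400\tkDefinition\thillock or mound\n"
    "U+3401\tkCantonese\ttim2\n"
)
rows = normalize_sketch(raw, fields=["kCantonese"])
```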

unihan_etl.process.setup_logger(logger=None, level='DEBUG')

Set up logging for CLI use.

Parameters:	logger (`Logger`) – instance of logger
	level (str) – logging level, e.g. 'DEBUG'

unihan_etl.process.zip_has_files(files, zip_file)

Return True if zip has the files inside.

Parameters:	files (list of str) – files inside zip file
	zip_file (`zipfile.ZipFile`)
Returns:	True if all files are listed in `zipfile.ZipFile.namelist()`
Return type:	bool
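
Since the return value is defined in terms of ZipFile.namelist(), the check reduces to a subset test. A minimal sketch, with zip_has_files_sketch as a hypothetical name and an in-memory archive standing in for Unihan.zip:

```python
import io
import zipfile


def zip_has_files_sketch(files, zip_file):
    """Return True if every name in files is a member of zip_file."""
    return set(files).issubset(zip_file.namelist())


# Build an in-memory archive with two empty members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as writer:
    writer.writestr("Unihan_Readings.txt", "")
    writer.writestr("Unihan_Variants.txt", "")

with zipfile.ZipFile(buffer) as archive:
    has_readings = zip_has_files_sketch(["Unihan_Readings.txt"], archive)
    has_missing = zip_has_files_sketch(["Unihan_IRGSources.txt"], archive)
```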

Constants

unihan_etl.constants.CUSTOM_DELIMITED_FIELDS = ('kDefinition', 'kDaeJaweon', 'kHDZRadBreak', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource'): Fields with multiple values via custom delimiters

unihan_etl.constants.INDEX_FIELDS = ('ucn', 'char'): Default index fields for Unihan CSVs. You probably want these.

unihan_etl.constants.SPACE_DELIMITED_DICT_FIELDS = ('kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes'): Fields with multiple values UNIHAN delimits by spaces -> dict

unihan_etl.constants.SPACE_DELIMITED_FIELDS = ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant', 'kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')¶: Any space delimited field regardless of expanded form

unihan_etl.constants.SPACE_DELIMITED_LIST_FIELDS = ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant')¶: Fields with multiple values UNIHAN delimits by spaces -> list

unihan_etl.constants.UNIHAN_MANIFEST = {'Unihan_DictionaryIndices.txt': ('kCheungBauerIndex', 'kCowles', 'kDaeJaweon', 'kFennIndex', 'kGSR', 'kHanYu', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kKangXi', 'kKarlgren', 'kLau', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kSBGY'), 'Unihan_DictionaryLikeData.txt': ('kCangjie', 'kCheungBauer', 'kCihaiT', 'kFenn', 'kFourCornerCode', 'kFrequency', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kPhonetic', 'kTotalStrokes'), 'Unihan_IRGSources.txt': ('kCompatibilityVariant', 'kIICore', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource'), 'Unihan_NumericValues.txt': ('kAccountingNumeric', 'kOtherNumeric', 'kPrimaryNumeric'), 'Unihan_OtherMappings.txt': ('kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kEACC', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kHKSCS', 'kIBMJapan', 'kJa', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJIS0213', 'kJoyoKanji', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kMainlandTelegraph', 'kPseudoGB1', 'kTaiwanTelegraph', 'kTGH', 'kXerox'), 'Unihan_RadicalStrokeCounts.txt': ('kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode'), 'Unihan_Readings.txt': ('kCantonese', 'kDefinition', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kJapaneseKun', 'kJapaneseOn', 'kKorean', 'kMandarin', 'kTang', 'kVietnamese', 'kXHC1983'), 'Unihan_Variants.txt': ('kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTraditionalVariant', 'kZVariant')}¶: Dictionary of tuples mapping locations of files to fields

Expansion

Functions to expand compacted details inside field values.

Notes

re.compile() calls are kept inside the expand functions for these reasons:

readability
module-level function bytecode is cached in Python
the most recently used compiled regexes are cached

unihan_etl.expansion.N_DIACRITICS = 'ńňǹ': diacritics from kHanyuPinlu

unihan_etl.expansion.expand_field(field, fvalue)

Return structured value of information in UNIHAN field.

Parameters:	field (str) – field name
	fvalue (str) – value of field
Returns:	expanded field information per UNIHAN's documentation
Return type:	list or dict
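
The idea of expansion, together with the note about calling re.compile() inside the expand functions, can be sketched as follows. This is an illustration only: the _sketch names are hypothetical, the two-field tuple is a shortened stand-in for SPACE_DELIMITED_LIST_FIELDS, and the dict keys 'phonetic' and 'frequency' are an assumed output shape for kHanyuPinlu values of the form reading(count).

```python
import re

# Shortened stand-in for SPACE_DELIMITED_LIST_FIELDS (illustration only).
LIST_FIELDS_SAMPLE = ("kCantonese", "kJapaneseOn")


def expand_pinlu_sketch(fvalue):
    """Split 'reading(count)' items into dicts; the output shape is assumed."""
    # Compiled inside the function: the re module caches recently compiled
    # patterns, so repeated calls do not recompile from scratch.
    pattern = re.compile(r"^(?P<phonetic>[^()\s]+)\((?P<frequency>\d+)\)$")
    expanded = []
    for item in fvalue.split(" "):
        match = pattern.match(item)
        if match is None:
            continue  # skip anything that does not look like reading(count)
        expanded.append(
            {
                "phonetic": match.group("phonetic"),
                "frequency": int(match.group("frequency")),
            }
        )
    return expanded


def expand_field_sketch(field, fvalue):
    """Dispatch on the field name; unknown fields are returned unchanged."""
    if field == "kHanyuPinlu":
        return expand_pinlu_sketch(fvalue)
    if field in LIST_FIELDS_SAMPLE:
        return fvalue.split(" ")
    return fvalue


pinlu = expand_field_sketch("kHanyuPinlu", "shàng(12308) shang(392)")
cantonese = expand_field_sketch("kCantonese", "jau1 jau2")
untouched = expand_field_sketch("kFrequency", "1")
```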

Utilities and test helpers

Utility and helper methods for script.

util

unihan_etl.util.ucn_to_unicode(ucn)

Return a Python unicode value from a UCN.

Converts a Unicode Universal Character Number (e.g. "U+4E00" or "4E00") to Python unicode (u'\u4e00').

unihan_etl.util.ucnstring_to_python(ucn_string): Return string with Unicode UCN (e.g. "U+4E00") converted to native Python Unicode (u'\u4e00').

unihan_etl.util.ucnstring_to_unicode(ucn_string): Return ucnstring as Unicode.
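
On Python 3 the conversions described above come down to parsing the hexadecimal digits and calling chr(). The sketch below shows that idea under hypothetical names; it is not the library's code and does not validate that the value is within the Unicode range.

```python
import re


def ucn_to_unicode_sketch(ucn):
    """Convert 'U+4E00' or '4E00' to the matching one-character string."""
    hex_digits = ucn[2:] if ucn.upper().startswith("U+") else ucn
    return chr(int(hex_digits, 16))


def ucnstring_sketch(ucn_string):
    """Replace every 'U+XXXX' occurrence inside a longer string."""
    pattern = re.compile(r"U\+([0-9A-Fa-f]{4,6})")
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), ucn_string)


single = ucn_to_unicode_sketch("U+4E00")
bare = ucn_to_unicode_sketch("4E00")
embedded = ucnstring_sketch("(same as U+4E18) hillock or mound")
```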

Test helper functions for downloading and processing Unihan data.

unihan_etl.test.assert_dict_contains_subset(subset, dictionary, msg=None)

Ported assertion for dict subsets in py.test.

Parameters:	subset (dict) – needle
	dictionary (dict) – haystack
	msg (str, optional) – message displayed if the assertion fails
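
A subset assertion of this kind checks that every key in the needle is present in the haystack with an equal value. The sketch below is one way to write it, under a hypothetical name; the wording of the failure message is an assumption.

```python
def assert_dict_contains_subset_sketch(subset, dictionary, msg=None):
    """Raise AssertionError unless every subset item appears in dictionary."""
    missing = [key for key in subset if key not in dictionary]
    mismatched = [
        key for key in subset
        if key in dictionary and dictionary[key] != subset[key]
    ]
    default_msg = "missing keys: %r, mismatched keys: %r" % (missing, mismatched)
    assert not missing and not mismatched, msg or default_msg


# Passes: the haystack may contain extra keys.
assert_dict_contains_subset_sketch(
    {"ucn": "U+3400"}, {"ucn": "U+3400", "kCantonese": "jau1"}
)

# Fails: the value differs, so an AssertionError is raised.
try:
    assert_dict_contains_subset_sketch({"ucn": "U+3401"}, {"ucn": "U+3400"})
    raised = False
except AssertionError:
    raised = True
```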
