API¶
Build Unihan into tabular / structured format and export it.
-
unihan_etl.process.
ALLOWED_EXPORT_TYPES
= ['json', 'csv']¶ Allowed export types
-
unihan_etl.process.
DESTINATION_DIR
= '/home/docs/.local/share/unihan_etl'¶ Filepath to output built CSV file to.
-
class
unihan_etl.process.
Packager
(options)[source] Download and generate a tabular release of UNIHAN.
-
download
(urlretrieve_fn=<function urlretrieve>)[source] Download raw UNIHAN data if it does not already exist.
Parameters: urlretrieve_fn (function) – function to download file
-
export
()[source] Extract zip and process information into CSVs.
-
-
unihan_etl.process.
UNIHAN_FIELDS
= ('kAccountingNumeric', 'kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCompatibilityVariant', 'kCowles', 'kDaeJaweon', 'kDefinition', 'kEACC', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kFrequency', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kGSR', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHKSCS', 'kHanYu', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource', 'kJIS0213', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJoyoKanji', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kLau', 'kMainlandTelegraph', 'kMandarin', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kPseudoGB1', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKanWa', 'kRSKangXi', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTGH', 'kTaiwanTelegraph', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kVietnamese', 'kXHC1983', 'kXerox', 'kZVariant')¶ Default Unihan fields
-
unihan_etl.process.
UNIHAN_FILES
= dict_keys(['Unihan_DictionaryIndices.txt', 'Unihan_DictionaryLikeData.txt', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_OtherMappings.txt', 'Unihan_RadicalStrokeCounts.txt', 'Unihan_Readings.txt', 'Unihan_Variants.txt'])¶ Default Unihan Files
-
unihan_etl.process.
UNIHAN_URL
= 'http://www.unicode.org/Public/UNIDATA/Unihan.zip'¶ URI of Unihan.zip data.
-
unihan_etl.process.
UNIHAN_ZIP_PATH
= '/home/docs/.cache/unihan_etl/downloads/Unihan.zip'¶ Filepath to download Zip file.
-
unihan_etl.process.
WORK_DIR
= '/home/docs/.cache/unihan_etl/downloads'¶ Directory to use for processing intermediate files.
-
unihan_etl.process.
download
(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None)[source]¶ Download file at URL to a destination.
Returns: destination where the file was downloaded to.
Return type:
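The `urlretrieve_fn` parameter suggests the downloader can be swapped out, for example in tests. The following is a minimal sketch of that idea under stated assumptions, not the library's implementation: `download_sketch` and `fake_urlretrieve` are hypothetical names, and the directory-creation step is an assumption.

```python
import os
import tempfile
from urllib.request import urlretrieve


def download_sketch(url, dest, urlretrieve_fn=urlretrieve, reporthook=None):
    """Fetch `url` into `dest` and return `dest` (hypothetical re-creation)."""
    dest_dir = os.path.dirname(dest)
    if dest_dir:
        # Assumption: create the destination directory when it is missing.
        os.makedirs(dest_dir, exist_ok=True)
    urlretrieve_fn(url, dest, reporthook)
    return dest


def fake_urlretrieve(url, filename, reporthook=None):
    """Stand-in downloader: writes a marker file instead of using the network."""
    with open(filename, "w") as fh:
        fh.write("downloaded from " + url)
    if reporthook is not None:
        reporthook(1, 8192, -1)  # block count, block size, total size
    return filename, None


progress = []
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "downloads", "Unihan.zip")
    returned = download_sketch(
        "http://www.unicode.org/Public/UNIDATA/Unihan.zip",
        target,
        urlretrieve_fn=fake_urlretrieve,
        reporthook=lambda *args: progress.append(args),
    )
    created = os.path.isfile(target)
```

Injecting the retrieval function keeps the example offline; passing the default `urllib.request.urlretrieve` would perform a real download.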
-
unihan_etl.process.
expand_delimiters
(normalized_data)[source]¶ Return expanded multi-value fields in UNIHAN.
Parameters: normalized_data (list of dict) – Expects data in list of hashes, per process.normalize()
Returns: Items whose fields use delimiters or custom separation rules are expanded, including multi-value fields not using both fields (so all fields stay consistent).
Return type: list of dict
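To illustrate what expanding a multi-value field can look like, here is a small sketch that handles only the space-delimited case. The function name, the two-field excerpt, and the sample row are assumptions for illustration; the library's own expansion rules cover more fields and custom delimiters.

```python
# Hypothetical excerpt of the space-delimited field names, for illustration only.
SPACE_DELIMITED_EXCERPT = ("kCantonese", "kJapaneseOn")


def expand_space_delimited(normalized_data, fields=SPACE_DELIMITED_EXCERPT):
    """Return copies of the rows with space-delimited fields split into lists."""
    expanded = []
    for row in normalized_data:
        new_row = dict(row)  # leave the caller's data untouched
        for field in fields:
            value = new_row.get(field)
            if isinstance(value, str):
                # Single values also become lists, so the field type stays consistent.
                new_row[field] = value.split(" ")
        expanded.append(new_row)
    return expanded


rows = [
    {"ucn": "U+4E00", "char": "一", "kCantonese": "jat1", "kJapaneseOn": "ICHI ITSU"}
]
result = expand_space_delimited(rows)
```

In this sketch a field with one value is still turned into a one-item list, which mirrors the note above about keeping all fields consistent.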
-
unihan_etl.process.
extract_zip
(zip_path, dest_dir)[source]¶ Extract zip file. Return zipfile.ZipFile instance.
Returns: The extracted zip.
Return type: zipfile.ZipFile
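A minimal sketch of extracting an archive and handing back the open `zipfile.ZipFile`, using only the standard library. `extract_zip_sketch` is a hypothetical name and the sample archive is created on the fly; the library's function may differ in detail.

```python
import os
import tempfile
import zipfile


def extract_zip_sketch(zip_path, dest_dir):
    """Extract every member of `zip_path` into `dest_dir`; return the ZipFile."""
    archive = zipfile.ZipFile(zip_path)
    archive.extractall(dest_dir)
    return archive


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny placeholder archive to extract.
    zip_path = os.path.join(tmp, "sample.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("Unihan_Readings.txt", "placeholder")

    out_dir = os.path.join(tmp, "out")
    archive = extract_zip_sketch(zip_path, out_dir)
    names = archive.namelist()
    extracted = os.path.isfile(os.path.join(out_dir, "Unihan_Readings.txt"))
    archive.close()  # the caller owns the returned archive
```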
-
unihan_etl.process.
files_exist
(path, files)[source]¶ Return True if all files exist in specified path.
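The check described here can be sketched with `os.path` alone. `files_exist_sketch` is a hypothetical stand-in, not the library's code.

```python
import os
import tempfile


def files_exist_sketch(path, files):
    """Return True only if every name in `files` exists under `path`."""
    return all(os.path.exists(os.path.join(path, name)) for name in files)


# Usage: a scratch directory that contains a single file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "Unihan_Readings.txt"), "w").close()
    one_present = files_exist_sketch(tmp, ["Unihan_Readings.txt"])
    one_missing = files_exist_sketch(
        tmp, ["Unihan_Readings.txt", "Unihan_Variants.txt"]
    )
```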
-
unihan_etl.process.
filter_manifest
(files)[source]¶ Return filtered UNIHAN_MANIFEST from list of file names.
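As a rough illustration, filtering a manifest by file name amounts to a dictionary comprehension. The two-entry manifest excerpt and the function name are assumptions; how the library treats unknown file names is not stated, so this sketch simply lets a `KeyError` propagate.

```python
# Hypothetical two-entry excerpt of UNIHAN_MANIFEST, for illustration only.
MANIFEST_EXCERPT = {
    "Unihan_NumericValues.txt": (
        "kAccountingNumeric",
        "kOtherNumeric",
        "kPrimaryNumeric",
    ),
    "Unihan_Variants.txt": (
        "kSemanticVariant",
        "kSimplifiedVariant",
        "kSpecializedSemanticVariant",
        "kTraditionalVariant",
        "kZVariant",
    ),
}


def filter_manifest_sketch(files, manifest=MANIFEST_EXCERPT):
    """Keep only the manifest entries whose file name appears in `files`."""
    return {name: manifest[name] for name in files}


filtered = filter_manifest_sketch(["Unihan_Variants.txt"])
```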
-
unihan_etl.process.
get_fields
(d)[source]¶ Return list of fields from dict of {filename: [‘field’, ‘field1’]}.
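Flattening a `{filename: fields}` mapping into one list can be sketched as below. De-duplicating and sorting the result are assumptions made for this example, and `get_fields_sketch` is a hypothetical name.

```python
def get_fields_sketch(d):
    """Collect the field names from every file entry into one sorted list."""
    return sorted({field for fields in d.values() for field in fields})


fields = get_fields_sketch(
    {
        "Unihan_NumericValues.txt": ["kPrimaryNumeric", "kOtherNumeric"],
        "Unihan_Readings.txt": ["kMandarin", "kCantonese"],
    }
)
```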
-
unihan_etl.process.
get_parser
()[source]¶ Return argparse.ArgumentParser instance for CLI.
Returns: argument parser for CLI use.
Return type: argparse.ArgumentParser
-
unihan_etl.process.
has_valid_zip
(zip_path)[source]¶ Return True if valid zip exists.
Parameters: zip_path (str) – absolute path to zip
Returns: True if valid zip exists at path
Return type: bool
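One way to express this check with the standard library is `zipfile.is_zipfile`. This is a sketch with a hypothetical function name; the library may validate the archive differently.

```python
import os
import tempfile
import zipfile


def has_valid_zip_sketch(zip_path):
    """Return True if `zip_path` exists and looks like a zip archive."""
    return os.path.isfile(zip_path) and zipfile.is_zipfile(zip_path)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.zip")
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("Unihan_Readings.txt", "placeholder")

    bad = os.path.join(tmp, "bad.zip")
    with open(bad, "w") as fh:
        fh.write("not a zip archive")

    results = (
        has_valid_zip_sketch(good),
        has_valid_zip_sketch(bad),
        has_valid_zip_sketch(os.path.join(tmp, "missing.zip")),
    )
```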
-
unihan_etl.process.
listify
(data, fields)[source]¶ Convert tabularized data to a CSV-friendly list.
Parameters: - data (list of dict) –
- fields (list of str) – keys/columns, e.g. [‘kDictionary’]
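A sketch of turning a list of dicts into rows that `csv.writer` accepts. Emitting a header row first and filling missing keys with `None` are assumptions for this example, and `listify_sketch` is a hypothetical name.

```python
import csv
import io


def listify_sketch(data, fields):
    """Return a header row followed by one list of values per dict."""
    rows = [list(fields)]
    rows.extend([row.get(field) for field in fields] for row in data)
    return rows


data = [
    {"ucn": "U+4E00", "char": "一", "kTotalStrokes": "1"},
    {"ucn": "U+4E8C", "char": "二", "kTotalStrokes": "2"},
]
table = listify_sketch(data, ["ucn", "char", "kTotalStrokes"])

# The resulting list of lists can be written directly as CSV.
buffer = io.StringIO()
csv.writer(buffer).writerows(table)
csv_lines = buffer.getvalue().splitlines()
```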
-
unihan_etl.process.
load_data
(files)[source]¶ Load the given UNIHAN files and return their combined data.
Parameters: files (list of str) –
Returns: combined data from files
Return type: str
-
unihan_etl.process.
normalize
(raw_data, fields)[source]¶ Return normalized data from UNIHAN data files.
Parameters: - raw_data (str) – combined text files from UNIHAN
- fields (list of str) – list of columns to pull
Returns: list of unihan character information
Return type:
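The Unihan text files hold one tab-separated `code point, field, value` triple per line, with `#` comment lines. The sketch below groups such lines into one dict per character. It is an illustration only: `normalize_sketch` is a hypothetical name, and the library may, for example, fill absent fields with a placeholder value.

```python
def normalize_sketch(raw_data, fields):
    """Group tab-separated Unihan lines into one dict per code point."""
    items = {}
    for line in raw_data.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        ucn, field, value = line.split("\t", 2)
        if field not in fields:
            continue  # only pull the requested columns
        item = items.setdefault(
            ucn, {"ucn": ucn, "char": chr(int(ucn[2:], 16))}
        )
        item[field] = value
    return list(items.values())


sample = (
    "# illustrative excerpt\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
    "U+4E00\tkMandarin\tyī\n"
    "U+4E8C\tkMandarin\tèr\n"
)
normalized = normalize_sketch(sample, ["kMandarin"])
```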
-
unihan_etl.process.
setup_logger
(logger=None, level='DEBUG')[source]¶ Setup logging for CLI use.
Parameters: - logger (Logger) – instance of logger
- level (str) – logging level, e.g. ‘DEBUG’
-
unihan_etl.process.
zip_has_files
(files, zip_file)[source]¶ Return True if zip has the files inside.
Parameters: - files (list of str) – files inside zip file
- zip_file (zipfile.ZipFile) –
Returns: True if files are inside of zipfile.ZipFile.namelist()
Return type: bool
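Since the description refers to `zipfile.ZipFile.namelist()`, the check can be sketched as a subset test against that list. `zip_has_files_sketch` is a hypothetical name and the archive is built in memory for the example.

```python
import io
import zipfile


def zip_has_files_sketch(files, zip_file):
    """Return True if every name in `files` is a member of `zip_file`."""
    return set(files).issubset(zip_file.namelist())


# Build a small in-memory archive with two empty members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "")
    zf.writestr("Unihan_Variants.txt", "")

with zipfile.ZipFile(buffer) as zf:
    has_subset = zip_has_files_sketch(["Unihan_Readings.txt"], zf)
    has_missing = zip_has_files_sketch(["Unihan_IRGSources.txt"], zf)
```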
Constants¶
-
unihan_etl.constants.
CUSTOM_DELIMITED_FIELDS
= ('kDefinition', 'kDaeJaweon', 'kHDZRadBreak', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource')¶ FIELDS with multiple values via custom delimiters
-
unihan_etl.constants.
INDEX_FIELDS
= ('ucn', 'char')¶ Default index fields for Unihan CSVs. You probably want these.
-
unihan_etl.constants.
SPACE_DELIMITED_DICT_FIELDS
= ('kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')¶ Fields with multiple values UNIHAN delimits by spaces -> dict
-
unihan_etl.constants.
SPACE_DELIMITED_FIELDS
= ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant', 'kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')¶ Any space delimited field regardless of expanded form
-
unihan_etl.constants.
SPACE_DELIMITED_LIST_FIELDS
= ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant')¶ Fields with multiple values UNIHAN delimits by spaces -> list
-
unihan_etl.constants.
UNIHAN_MANIFEST
= {'Unihan_DictionaryIndices.txt': ('kCheungBauerIndex', 'kCowles', 'kDaeJaweon', 'kFennIndex', 'kGSR', 'kHanYu', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kKangXi', 'kKarlgren', 'kLau', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kSBGY'), 'Unihan_DictionaryLikeData.txt': ('kCangjie', 'kCheungBauer', 'kCihaiT', 'kFenn', 'kFourCornerCode', 'kFrequency', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kPhonetic', 'kTotalStrokes'), 'Unihan_IRGSources.txt': ('kCompatibilityVariant', 'kIICore', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource'), 'Unihan_NumericValues.txt': ('kAccountingNumeric', 'kOtherNumeric', 'kPrimaryNumeric'), 'Unihan_OtherMappings.txt': ('kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kEACC', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kHKSCS', 'kIBMJapan', 'kJa', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJIS0213', 'kJoyoKanji', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kMainlandTelegraph', 'kPseudoGB1', 'kTaiwanTelegraph', 'kTGH', 'kXerox'), 'Unihan_RadicalStrokeCounts.txt': ('kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode'), 'Unihan_Readings.txt': ('kCantonese', 'kDefinition', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kJapaneseKun', 'kJapaneseOn', 'kKorean', 'kMandarin', 'kTang', 'kVietnamese', 'kXHC1983'), 'Unihan_Variants.txt': ('kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTraditionalVariant', 'kZVariant')}¶ Dictionary of tuples mapping locations of files to fields
Expansion¶
Functions to uncompact details inside field values.
Notes
re.compile() operations are inside of expand functions:
- readability
- module-level function bytecode is cached in python
- the last used compiled regexes are cached
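The pattern described in the notes above, compiling a regex inside the function that uses it, might look like the following sketch. The function name, the regex, the result keys, and the sample value are assumptions for illustration and are not taken from the module.

```python
import re


def expand_reading_sketch(value):
    """Split a value shaped like 'reading(count)' into its two parts."""
    # Compiled inside the function so the pattern sits next to the code that
    # uses it; the re module caches recently compiled patterns, so repeated
    # calls do not recompile from scratch.
    pattern = re.compile(r"(?P<reading>[^()]+)\((?P<count>\d+)\)")
    match = pattern.match(value)
    if match is None:
        return None
    return {
        "phonetic": match.group("reading"),
        "frequency": int(match.group("count")),
    }


parsed = expand_reading_sketch("yī(32747)")
unparsed = expand_reading_sketch("no-count-here")
```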
-
unihan_etl.expansion.
N_DIACRITICS
= 'ńňǹ'¶ diacritics from kHanyuPinlu
Utilities and test helpers¶
Utility and helper methods for the script.
util¶
-
unihan_etl.util.
ucn_to_unicode
(ucn)[source]¶ Return a python unicode value from a UCN.
Converts a Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to a Python unicode string (u’\u4e00’).
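On Python 3 the conversion can be sketched in a couple of lines. `ucn_to_unicode_sketch` is a hypothetical name; the library's function may also handle Python 2 and other input forms.

```python
def ucn_to_unicode_sketch(ucn):
    """Convert 'U+4E00' or '4E00' to the corresponding character."""
    if ucn.upper().startswith("U+"):
        ucn = ucn[2:]  # drop the 'U+' prefix
    return chr(int(ucn, 16))


with_prefix = ucn_to_unicode_sketch("U+4E00")
without_prefix = ucn_to_unicode_sketch("4E00")
```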
Test helper functions for downloading and processing Unihan data.