API¶
Build Unihan into tabular / structured format and export it.
-
unihan_etl.process.ALLOWED_EXPORT_TYPES= ['json', 'csv']¶ Allowed export types
-
unihan_etl.process.DESTINATION_DIR= '/home/docs/.local/share/unihan_etl'¶ Filepath to output built CSV file to.
-
class
unihan_etl.process.Packager(options)[source] Download and generate a tabular release of UNIHAN.
-
download(urlretrieve_fn=<function urlretrieve>)[source] Download raw UNIHAN data if it does not already exist.
Parameters: urlretrieve_fn (function) – function to download file
-
export()[source] Extract zip and process information into CSVs.
-
-
unihan_etl.process.UNIHAN_FIELDS= ('kAccountingNumeric', 'kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCompatibilityVariant', 'kCowles', 'kDaeJaweon', 'kDefinition', 'kEACC', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kFrequency', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kGSR', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHKSCS', 'kHanYu', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource', 'kJIS0213', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJoyoKanji', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kLau', 'kMainlandTelegraph', 'kMandarin', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kPseudoGB1', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKanWa', 'kRSKangXi', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTGH', 'kTaiwanTelegraph', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kVietnamese', 'kXHC1983', 'kXerox', 'kZVariant')¶ Default Unihan fields
-
unihan_etl.process.UNIHAN_FILES= dict_keys(['Unihan_DictionaryIndices.txt', 'Unihan_DictionaryLikeData.txt', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_OtherMappings.txt', 'Unihan_RadicalStrokeCounts.txt', 'Unihan_Readings.txt', 'Unihan_Variants.txt'])¶ Default Unihan Files
-
unihan_etl.process.UNIHAN_URL= 'http://www.unicode.org/Public/UNIDATA/Unihan.zip'¶ URI of Unihan.zip data.
-
unihan_etl.process.UNIHAN_ZIP_PATH= '/home/docs/.cache/unihan_etl/downloads/Unihan.zip'¶ Filepath to download Zip file.
-
unihan_etl.process.WORK_DIR= '/home/docs/.cache/unihan_etl/downloads'¶ Directory to use for processing intermediate files.
-
unihan_etl.process.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None)[source]¶ Download file at URL to a destination.
Returns: destination the file was downloaded to.
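The `urlretrieve_fn` parameter lets a caller swap in a different retrieval function, which is handy for tests that must not touch the network. The following is a minimal sketch of that idea, not the library's implementation: `download_sketch` and `fake_urlretrieve` are hypothetical names, and the URL is a placeholder that is never fetched.

```python
import os
import tempfile


def download_sketch(url, dest, urlretrieve_fn, reporthook=None):
    """Hypothetical stand-in: fetch url into dest and return dest."""
    # Make sure the destination directory exists before writing into it.
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    # Same positional order as urllib.request.urlretrieve(url, filename, reporthook).
    urlretrieve_fn(url, dest, reporthook)
    return dest


def fake_urlretrieve(url, filename, reporthook=None):
    """Test double that writes a stub file instead of using the network."""
    with open(filename, "wb") as f:
        f.write(b"stub for " + url.encode("utf-8"))
    return filename, None


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "downloads", "Unihan.zip")
result = download_sketch("http://example.invalid/Unihan.zip", target, fake_urlretrieve)
```

Passing `urllib.request.urlretrieve` in place of `fake_urlretrieve` would perform a real download with the same call shape.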
-
unihan_etl.process.expand_delimiters(normalized_data)[source]¶ Return expanded multi-value fields in UNIHAN.
Parameters: normalized_data (list of dict) – Expects data in list of hashes, per process.normalize()
Returns: Items which have fields with delimiters and custom separation rules will be expanded, including multi-value fields not using both fields (so all fields stay consistent).
Return type: list of dict
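To make the idea of expanding a multi-value field concrete, here is a small sketch that only handles the simplest case, splitting space-delimited values into lists. It is an illustration under assumptions, not the library's code: `expand_space_delimited` is a hypothetical helper, the field tuple is a two-item subset of SPACE_DELIMITED_LIST_FIELDS, and the sample row is illustrative.

```python
# Small illustrative subset of the space-delimited list fields.
SPACE_DELIMITED = ("kCantonese", "kJapaneseOn")


def expand_space_delimited(rows, fields=SPACE_DELIMITED):
    """Return copies of rows with the given fields split on spaces."""
    expanded = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for field in fields:
            value = new_row.get(field)
            if isinstance(value, str):
                # Single values also become one-item lists, so every
                # expanded field has the same shape.
                new_row[field] = value.split(" ")
        expanded.append(new_row)
    return expanded


rows = [{"ucn": "U+4E00", "kCantonese": "jat1", "kJapaneseOn": "ICHI ITSU"}]
expanded = expand_space_delimited(rows)
```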
-
unihan_etl.process.extract_zip(zip_path, dest_dir)[source]¶ Extract zip file. Return zipfile.ZipFile instance.
Returns: The extracted zip.
-
unihan_etl.process.files_exist(path, files)[source]¶ Return True if all files exist in specified path.
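A check of this kind can be sketched with the standard library alone. `all_files_exist` below is a hypothetical stand-in written for illustration; whether the library tests with `os.path.isfile` or some other call is an assumption.

```python
import os
import tempfile


def all_files_exist(path, files):
    """Return True only if every name in files is a file under path."""
    return all(os.path.isfile(os.path.join(path, name)) for name in files)


workdir = tempfile.mkdtemp()
# Create one empty placeholder file; the second name is left missing.
open(os.path.join(workdir, "Unihan_Readings.txt"), "w").close()

present = all_files_exist(workdir, ["Unihan_Readings.txt"])
missing = all_files_exist(workdir, ["Unihan_Readings.txt", "Unihan_Variants.txt"])
```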
-
unihan_etl.process.filter_manifest(files)[source]¶ Return filtered UNIHAN_MANIFEST from list of file names.
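Filtering a manifest by file name amounts to keeping only the matching keys of a dict. The sketch below assumes that shape; `filter_manifest_sketch` is a hypothetical name and `MANIFEST` is a two-entry excerpt of UNIHAN_MANIFEST.

```python
# Two-entry excerpt of UNIHAN_MANIFEST, used only for illustration.
MANIFEST = {
    "Unihan_NumericValues.txt": (
        "kAccountingNumeric",
        "kOtherNumeric",
        "kPrimaryNumeric",
    ),
    "Unihan_Variants.txt": (
        "kSemanticVariant",
        "kSimplifiedVariant",
        "kSpecializedSemanticVariant",
        "kTraditionalVariant",
        "kZVariant",
    ),
}


def filter_manifest_sketch(files, manifest=MANIFEST):
    """Keep only the manifest entries whose file name is in files."""
    return {name: fields for name, fields in manifest.items() if name in files}


filtered = filter_manifest_sketch(["Unihan_NumericValues.txt"])
```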
-
unihan_etl.process.get_fields(d)[source]¶ Return list of fields from dict of {filename: [‘field’, ‘field1’]}.
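One way to flatten such a dict into a single field list is shown below. This is a sketch, not the library's code: `get_fields_sketch` is a hypothetical name, and the de-duplication and sorting are assumptions (the sorted order matches how UNIHAN_FIELDS is displayed on this page, but the source does not state how it is produced).

```python
def get_fields_sketch(d):
    """Collect the unique field names from {filename: [field, ...]}."""
    unique = set()
    for file_fields in d.values():
        unique.update(file_fields)
    # Sorting gives a stable order regardless of dict or list ordering.
    return sorted(unique)


fields = get_fields_sketch(
    {
        "Unihan_Readings.txt": ["kMandarin", "kCantonese"],
        "Unihan_Variants.txt": ["kZVariant", "kCantonese"],
    }
)
```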
-
unihan_etl.process.get_parser()[source]¶ Return argparse.ArgumentParser instance for CLI.
Returns: argument parser for CLI use.
Return type: argparse.ArgumentParser
-
unihan_etl.process.has_valid_zip(zip_path)[source]¶ Return True if valid zip exists.
Parameters: zip_path (str) – absolute path to zip
Returns: True if valid zip exists at path
Return type: bool
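The standard library's `zipfile.is_zipfile` covers most of what such a check needs. The sketch below is an assumption about how the check might look, with `has_valid_zip_sketch` as a hypothetical name; it distinguishes a real archive, a file with the wrong contents, and a missing path.

```python
import os
import tempfile
import zipfile


def has_valid_zip_sketch(zip_path):
    """Return True if zip_path is an existing file that zipfile accepts."""
    return os.path.isfile(zip_path) and zipfile.is_zipfile(zip_path)


workdir = tempfile.mkdtemp()

# A real (tiny) archive.
good_zip = os.path.join(workdir, "Unihan.zip")
with zipfile.ZipFile(good_zip, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "# stub contents\n")

# A file that exists but is not a zip archive.
bad_zip = os.path.join(workdir, "broken.zip")
with open(bad_zip, "wb") as f:
    f.write(b"this is not a zip archive")

# A path that does not exist at all.
missing_zip = os.path.join(workdir, "missing.zip")
```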
-
unihan_etl.process.listify(data, fields)[source]¶ Convert tabularized data to a CSV-friendly list.
Parameters: - data (list of dict) –
- fields (list of str) – keys/columns, e.g. [‘kDictionary’]
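A "CSV-friendly list" can be read as a list of rows that `csv.writer` accepts directly. The following sketch works under that reading; `listify_sketch` is a hypothetical name, and both the leading header row and the empty string for missing values are assumptions rather than documented behaviour. The sample data is illustrative.

```python
import csv
import io


def listify_sketch(data, fields):
    """Turn a list of dicts into rows: one header row, then one row per item."""
    rows = [list(fields)]
    for item in data:
        # Missing keys become empty cells so every row has the same width.
        rows.append([item.get(field, "") for field in fields])
    return rows


data = [
    {"ucn": "U+4E00", "char": "一", "kDefinition": "one"},
    {"ucn": "U+4E8C", "char": "二"},
]
rows = listify_sketch(data, ["ucn", "char", "kDefinition"])

# The rows can be handed straight to the csv module.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
```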
-
unihan_etl.process.load_data(files)[source]¶ Extract zip and process information into CSVs.
Parameters: files (list of str) –
Returns: combined data from files
Return type: str
-
unihan_etl.process.normalize(raw_data, fields)[source]¶ Return normalized data from UNIHAN data files.
Parameters: - raw_data (str) – combined text files from UNIHAN
- fields (list of str) – list of columns to pull
Returns: list of unihan character information
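Unihan data files are tab-separated text, one `U+XXXX<TAB>field<TAB>value` entry per line, with `#` starting a comment. The sketch below shows how such text could be grouped into one dict per character. It is an illustration, not the library's implementation: `normalize_sketch` is a hypothetical name, the `ucn` and `char` keys are borrowed from INDEX_FIELDS, and the sample lines are illustrative.

```python
def normalize_sketch(raw_data, fields):
    """Group tab-separated Unihan lines into one dict per character."""
    items = {}
    for line in raw_data.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        ucn, field, value = line.split("\t", 2)
        if field not in fields:
            continue  # only pull the requested columns
        # Create the per-character record on first sight of this code point.
        item = items.setdefault(ucn, {"ucn": ucn, "char": chr(int(ucn[2:], 16))})
        item[field] = value
    return list(items.values())


raw = (
    "# illustrative sample, not real file contents\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
    "U+4E00\tkMandarin\tyī\n"
    "U+4E8C\tkDefinition\ttwo; twice\n"
)
normalized = normalize_sketch(raw, ["kDefinition"])
```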
-
unihan_etl.process.setup_logger(logger=None, level='DEBUG')[source]¶ Set up logging for CLI use.
Parameters: - logger (Logger) – instance of logger
- level (str) – logging level, e.g. ‘DEBUG’
-
unihan_etl.process.zip_has_files(files, zip_file)[source]¶ Return True if zip has the files inside.
Parameters: - files (list of str) – files inside zip file
- zip_file (zipfile.ZipFile) –
Returns: True if files are inside of zipfile.ZipFile.namelist()
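Since the return value is described in terms of `zipfile.ZipFile.namelist()`, the check can be sketched as a subset test against that list. `zip_has_files_sketch` is a hypothetical name, and the in-memory archive exists only to keep the example self-contained.

```python
import io
import zipfile


def zip_has_files_sketch(files, zip_file):
    """Return True if every name in files appears in the archive listing."""
    return set(files).issubset(zip_file.namelist())


# Build a tiny archive in memory with a single member.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "# stub contents\n")

archive = zipfile.ZipFile(buffer)
has_readings = zip_has_files_sketch(["Unihan_Readings.txt"], archive)
has_variants = zip_has_files_sketch(["Unihan_Variants.txt"], archive)
```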
Constants¶
-
unihan_etl.constants.CUSTOM_DELIMITED_FIELDS= ('kDefinition', 'kDaeJaweon', 'kHDZRadBreak', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource')¶ Fields with multiple values via custom delimiters
-
unihan_etl.constants.INDEX_FIELDS= ('ucn', 'char')¶ Default index fields for Unihan CSVs. You probably want these.
-
unihan_etl.constants.SPACE_DELIMITED_DICT_FIELDS= ('kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')¶ Fields with multiple values UNIHAN delimits by spaces -> dict
-
unihan_etl.constants.SPACE_DELIMITED_FIELDS= ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant', 'kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')¶ Any space delimited field regardless of expanded form
-
unihan_etl.constants.SPACE_DELIMITED_LIST_FIELDS= ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant')¶ Fields with multiple values UNIHAN delimits by spaces -> list
-
unihan_etl.constants.UNIHAN_MANIFEST= {'Unihan_DictionaryIndices.txt': ('kCheungBauerIndex', 'kCowles', 'kDaeJaweon', 'kFennIndex', 'kGSR', 'kHanYu', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kKangXi', 'kKarlgren', 'kLau', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kSBGY'), 'Unihan_DictionaryLikeData.txt': ('kCangjie', 'kCheungBauer', 'kCihaiT', 'kFenn', 'kFourCornerCode', 'kFrequency', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kPhonetic', 'kTotalStrokes'), 'Unihan_IRGSources.txt': ('kCompatibilityVariant', 'kIICore', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource'), 'Unihan_NumericValues.txt': ('kAccountingNumeric', 'kOtherNumeric', 'kPrimaryNumeric'), 'Unihan_OtherMappings.txt': ('kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kEACC', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kHKSCS', 'kIBMJapan', 'kJa', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJIS0213', 'kJoyoKanji', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kMainlandTelegraph', 'kPseudoGB1', 'kTaiwanTelegraph', 'kTGH', 'kXerox'), 'Unihan_RadicalStrokeCounts.txt': ('kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode'), 'Unihan_Readings.txt': ('kCantonese', 'kDefinition', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kJapaneseKun', 'kJapaneseOn', 'kKorean', 'kMandarin', 'kTang', 'kVietnamese', 'kXHC1983'), 'Unihan_Variants.txt': ('kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTraditionalVariant', 'kZVariant')}¶ Dictionary of tuples mapping locations of files to fields
Expansion¶
Functions to uncompact details inside field values.
Notes
re.compile() operations are kept inside the expand functions, for three reasons:
- readability
- module-level function bytecode is cached in python
- the last used compiled regexes are cached
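The pattern described in these notes can be sketched as follows. `split_reading` is a hypothetical function written for this illustration and is not part of `unihan_etl.expansion`; the regular expression and the sample values are likewise assumptions.

```python
import re


def split_reading(value):
    """Split a reading such as 'jat1' into its syllable and tone number."""
    # Compiled here rather than at module level, so the pattern sits next
    # to the code that uses it. The re module keeps recently compiled
    # patterns in an internal cache, so repeat calls stay cheap.
    pattern = re.compile(r"^(?P<syllable>[a-z]+)(?P<tone>[1-6])$")
    match = pattern.match(value)
    if match is None:
        return None
    return {"syllable": match.group("syllable"), "tone": int(match.group("tone"))}


parsed = [split_reading(v) for v in ("jat1", "ji6", "not a reading")]
```

Keeping the pattern local trades a cache lookup per call for having the expression visible at the point of use.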
-
unihan_etl.expansion.N_DIACRITICS= 'ńňǹ'¶ diacritics from kHanyuPinlu
Utilities and test helpers¶
Utility and helper methods for script.
util¶
-
unihan_etl.util.ucn_to_unicode(ucn)[source]¶ Return a python unicode value from a UCN.
Converts a Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Python unicode (u’\u4e00’)
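The conversion itself is short in Python 3. The sketch below is one way to write it, assuming string input in either of the two forms mentioned above; `ucn_to_unicode_sketch` is a hypothetical name and not the library function.

```python
def ucn_to_unicode_sketch(ucn):
    """Convert 'U+4E00' or '4E00' to the corresponding one-character string."""
    # Drop the optional 'U+' prefix, then parse the rest as hexadecimal.
    if ucn.upper().startswith("U+"):
        ucn = ucn[2:]
    return chr(int(ucn, 16))


with_prefix = ucn_to_unicode_sketch("U+4E00")
without_prefix = ucn_to_unicode_sketch("4E00")
```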
Test helper functions for downloading and processing Unihan data.