python - Regex to parse well-formatted multi-line data dictionary


Reposted by: 太空宇宙 | Updated: 2023-11-03 18:07:34

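I am trying to read in and parse the data dictionary for the Census Bureau's American Community Survey Public Use Microdata Sample (PUMS) releases; the URL of the file appears in the code below.

It is reasonably well formatted, although there are a few lapses where explanatory notes have been inserted.

I think my preferred outcome is a dataframe with one row per variable, with all value labels for a given variable serialized into one dictionary stored in a value-dictionary field on the same row (a hierarchical JSON-like format would not be bad either, but would be more complicated).

I got as far as the following code: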

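import re
import urllib.request

import pandas as pd

url = 'http://www.census.gov/acs/www/Downloads/data_documentation/pums/DataDict/PUMSDataDict13.txt'
# Read the response body and decode it to str before doing any string work.
# The file is not entirely valid UTF-8, so replace undecodable bytes.
data = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')

## replace newline characters so we can use dots and find everything until a double
## carriage return (replaced to ||) with a lookahead assertion.
data = data.replace('\n', '|')

# Raw string so that the regex escapes (\s, \|, \-) reach the regex engine unchanged.
datadict = pd.DataFrame(
    re.findall(r"([A-Z]{2,8})\s{2,9}([0-9]{1})\s{2,6}\|\s{2,4}([A-Za-z\-\(\) ]{3,85})",
               data, re.MULTILINE),
    columns=['variable', 'width', 'description'])
datadict.head(5)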

+----+----------+-------+------------------------------------------------+
|    | variable | width | description                                    |
+----+----------+-------+------------------------------------------------+
| 0  | RT       | 1     | Record Type                                    |
+----+----------+-------+------------------------------------------------+
| 1  | SERIALNO | 7     | Housing unit                                   |
+----+----------+-------+------------------------------------------------+
| 2  | DIVISION | 1     | Division code                                  |
+----+----------+-------+------------------------------------------------+
| 3  | PUMA     | 5     | Public use microdata area code (PUMA) based on |
+----+----------+-------+------------------------------------------------+
| 4  | REGION   | 1     | Region code                                    |
+----+----------+-------+------------------------------------------------+
| 5  | ST       | 2     | State Code                                     |
+----+----------+-------+------------------------------------------------+

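So far so good. The list of variables is there, along with the width in characters of each.

I can extend the pattern to pick up the additional lines (where the value labels live), like so: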

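# Same approach, extended to capture the first value code and its label.
datadict_exp = pd.DataFrame(
    re.findall(r"([A-Z]{2,9})\s{2,9}([0-9]{1})\s{2,6}\|\s{4}([A-Za-z\-\(\)\;\<\> 0-9]{2,85})\|\s{11,15}([a-z0-9]{0,2})[ ]\.([A-Za-z/\-\(\) ]{2,120})",
               data, re.MULTILINE),
    columns=['variable', 'width', 'description', 'value_1', 'label_1'])
datadict_exp.head(5)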

+----+----------+-------+---------------------------------------------------+---------+--------------+
| id | variable | width | description                                       | value_1 | label_1      |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 0  | DIVISION | 1     | Division code                                     | 0       | Puerto Rico  |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 1  | REGION   | 1     | Region code                                       | 1       | Northeast    |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 2  | ST       | 2     | State Code                                        | 1       | Alabama/AL   |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 3  | NP       | 2     | Number of person records following this housin... | 0       | Vacant unit  |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 4  | TYPE     | 1     | Type of unit                                      | 1       | Housing unit |
+----+----------+-------+---------------------------------------------------+---------+--------------+

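That gets the first value and its associated label. My regex problem is how to repeat the multi-line match that starts with \s{11,15} and runs to the end of the line, because some variables have a large number of unique values (ST, the state code, is followed by some 50 lines giving the value and label for each state).

Early on I changed the carriage returns in the source file to pipes, thinking I could then shamelessly rely on the dot to match everything up to a double carriage return, which marks the end of that particular variable. This is where I got stuck.

So: how do I repeat a multi-line pattern an arbitrary number of times?

(A later complication is that some variables are not fully enumerated in the dictionary but are shown with a valid range of values. NP, for example [the number of persons associated with the same household], is denoted by "02..20" following the description. If I do not account for this, my parsing will of course miss such entries.)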
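
One common way around the repetition problem is to match in two stages, because Python's `re` keeps only the last iteration of a repeated capture group. The first pattern grabs a whole variable block, including all of its value lines as one string; the second pattern is then run with `finditer` over just that string. The following is a minimal sketch under assumptions: `SAMPLE` is a hypothetical excerpt that imitates the layout described above (it is not the real file), and `BLOCK_RE`, `VALUE_RE`, and `parse_blocks` are illustrative names. It leaves the newlines in place instead of replacing them with pipes.

```python
import re

# Hypothetical excerpt imitating the layout described in the question.
SAMPLE = """\
REGION     1
    Region code
           1 .Northeast
           2 .Midwest

NP         2
    Number of person records following this housing record
           00 .Vacant unit
           01 .One person record
           02..20 .Number of person records
"""

# Stage 1: one match per variable block: a header line, a single
# description line, then zero or more indented "code .label" lines.
BLOCK_RE = re.compile(
    r"^(?P<variable>[A-Z][A-Z0-9]*)[ ]+(?P<width>\d+)[ ]*\n"
    r"[ ]+(?P<description>\S.*)\n"
    r"(?P<values>(?:[ ]+\S+[ ]\..*\n?)*)",
    re.MULTILINE)

# Stage 2: one match per value line inside a block. The code is any run of
# non-space characters, so a range such as "02..20" is kept as a single key.
VALUE_RE = re.compile(r"^[ ]+(?P<code>\S+)[ ]\.(?P<label>.*)$", re.MULTILINE)


def parse_blocks(text):
    """Return one dict per variable, with its value labels in a nested dict."""
    rows = []
    for block in BLOCK_RE.finditer(text):
        labels = {m.group('code'): m.group('label').strip()
                  for m in VALUE_RE.finditer(block.group('values'))}
        rows.append({'variable': block.group('variable'),
                     'width': block.group('width'),
                     'description': block.group('description'),
                     'value_labels': labels})
    return rows


rows = parse_blocks(SAMPLE)
```

Passing `rows` to `pandas.DataFrame` would give one row per variable with the labels held in a dictionary column. The real file contains multi-line descriptions, notes, and other irregularities that this sketch does not handle, so the patterns would need extending before use on it.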

Best answer

This is not a regex, but I parsed PUMSDataDict2013.txt and PUMS_Data_Dictionary_2009-2013.txt (Census ACS 2013 documentation, FTP server) with the Python 3x script below. I used pandas.DataFrame.from_dict and pandas.concat to create a hierarchical dataframe, also shown below.

Python 3x function to parse PUMSDataDict2013.txt and PUMS_Data_Dictionary_2009-2013.txt:

import collections
import os


def parse_pumsdatadict(path:str) -> collections.OrderedDict:
    r"""Parse ACS PUMS Data Dictionaries.

    Args:
        path (str): Path to downloaded data dictionary.

    Returns:
        ddict (collections.OrderedDict): Parsed data dictionary with original
            key order preserved.

    Raises:
        FileNotFoundError: Raised if `path` does not exist.

    Notes:
        * Only some data dictionaries have been tested.[^urls]
        * Values are all strings. No data types are inferred from the
            original file.
        * Example structure of returned `ddict`:
            ddict['title'] = '2013 ACS PUMS DATA DICTIONARY'
            ddict['date'] = 'August 7, 2015'
            ddict['record_types']['HOUSING RECORD']['RT']\
                ['length'] = '1'
                ['description'] = 'Record Type'
                ['var_codes']['H'] = 'Housing Record or Group Quarters Unit'
            ddict['record_types']['HOUSING RECORD'][...]
            ddict['record_types']['PERSON RECORD'][...]
            ddict['notes'] =
                ['Note for both Industry and Occupation lists...',
                 '*  In cases where the SOC occupation code ends...',
                 ...]

    References:
        [^urls]: http://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/
            PUMSDataDict2013.txt
            PUMS_Data_Dictionary_2009-2013.txt

    """
    # Check arguments.
    if not os.path.exists(path):
        raise FileNotFoundError(
            "Path does not exist:\n{path}".format(path=path))
    # Parse data dictionary.
    # Note:
    # * Data dictionary keys and values are "codes for variables",
    #   using the ACS terminology,
    #   https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.html
    # * The data dictionary is not all encoded in UTF-8. Replace encoding
    #   errors when found.
    # * Catch instances of inconsistently formatted data.
    ddict = collections.OrderedDict()
    with open(path, encoding='utf-8', errors='replace') as fobj:
        # Data dictionary name is line 1.
        ddict['title'] = fobj.readline().strip()
        # Data dictionary date is line 2.
        ddict['date'] = fobj.readline().strip()    
        # Initialize flags to catch lines.
        (catch_var_name, catch_var_desc,
         catch_var_code, catch_var_note) = (None, )*4
        var_name = None
        var_name_last = 'PWGTP80' # Necessary for unformatted end-of-file notes.
        for line in fobj:
            # Replace tabs with 4 spaces
            line = line.replace('\t', ' '*4).rstrip()
            # Record type is section header 'HOUSING RECORD' or 'PERSON RECORD'.
            if (line.strip() == 'HOUSING RECORD'
                or line.strip() == 'PERSON RECORD'):
                record_type = line.strip()
                if 'record_types' not in ddict:
                    ddict['record_types'] = collections.OrderedDict()
                ddict['record_types'][record_type] = collections.OrderedDict()
            # A newline precedes a variable name.
            # A newline follows the last variable code.
            elif line == '':
                # Example inconsistent format case:
                # WGTP54     5
                #     Housing Weight replicate 54
                #
                #           -9999..09999 .Integer weight of housing unit
                if (catch_var_code
                    and 'var_codes' not in ddict['record_types'][record_type][var_name]):
                    pass
                # Terminate the previous variable block and look for the next
                # variable name, unless past last variable name.
                else:
                    catch_var_code = False
                    catch_var_note = False
                    if var_name != var_name_last:
                        catch_var_name = True
            # Variable name is 1 line with 0 space indent.
            # Variable name is followed by variable description.
            # Variable note is optional.
            # Variable note is preceded by newline.
            # Variable note is 1+ lines.
            # Variable note is followed by newline.
            elif (catch_var_name and not line.startswith(' ') 
                and var_name != var_name_last):
                # Example: "Note: Public use microdata areas (PUMAs) ..."
                if line.lower().startswith('note:'):
                    var_note = line.strip() # type(var_note) == str
                    if 'notes' not in ddict['record_types'][record_type][var_name]:
                        ddict['record_types'][record_type][var_name]['notes'] = list()
                    # Append a new note.
                    ddict['record_types'][record_type][var_name]['notes'].append(var_note)
                    catch_var_note = True
                # Example: """
                # Note: Public Use Microdata Areas (PUMAs) designate areas ...
                # population.  Use with ST for unique code. PUMA00 applies ...
                # ...
                # """
                elif catch_var_note:
                    var_note = line.strip() # type(var_note) == str
                    if 'notes' not in ddict['record_types'][record_type][var_name]:
                        ddict['record_types'][record_type][var_name]['notes'] = list()
                    # Concatenate to most recent note.
                    ddict['record_types'][record_type][var_name]['notes'][-1] += ' '+var_note
                # Example: "NWAB       1 (UNEDITED - See 'Employment Status Recode' (ESR))"
                else:
                    # type(var_note) == list
                    (var_name, var_len, *var_note) = line.strip().split(maxsplit=2)
                    ddict['record_types'][record_type][var_name] = collections.OrderedDict()
                    ddict['record_types'][record_type][var_name]['length'] = var_len
                    # Append a new note if exists.
                    if len(var_note) > 0:
                        if 'notes' not in ddict['record_types'][record_type][var_name]:
                            ddict['record_types'][record_type][var_name]['notes'] = list()
                        ddict['record_types'][record_type][var_name]['notes'].append(var_note[0])
                    catch_var_name = False
                    catch_var_desc = True
                    var_desc_indent = None
            # Variable description is 1+ lines with 1+ space indent.
            # Variable description is followed by variable code(s).
            # Variable code(s) is 1+ line with larger whitespace indent
            # than variable description. Example:"""
            # PUMA00     5      
            #     Public use microdata area code (PUMA) based on Census 2000 definition for data
            #     collected prior to 2012. Use in combination with PUMA10.          
            #           00100..08200 .Public use microdata area codes 
            #                   77777 .Combination of 01801, 01802, and 01905 in Louisiana
            #             -0009 .Code classification is Not Applicable because data 
            #                         .collected in 2012 or later            
            # """
            # The last variable code is followed by a newline.
            elif (catch_var_desc or catch_var_code) and line.startswith(' '):
                indent = len(line) - len(line.lstrip())
                # For line 1 of variable description.
                if catch_var_desc and var_desc_indent is None:
                    var_desc_indent = indent
                    var_desc = line.strip()
                    ddict['record_types'][record_type][var_name]['description'] = var_desc
                # For lines 2+ of variable description.
                elif catch_var_desc and indent <= var_desc_indent:
                    var_desc = line.strip()
                    ddict['record_types'][record_type][var_name]['description'] += ' '+var_desc
                # For lines 1+ of variable codes.
                else:
                    catch_var_desc = False
                    catch_var_code = True
                    is_valid_code = None
                    if not line.strip().startswith('.'):
                        # Example case: "01 .One person record (one person in household or"
                        if ' .' in line:
                            (var_code, var_code_desc) = line.strip().split(
                                sep=' .', maxsplit=1)
                            is_valid_code = True
                        # Example inconsistent format case:"""
                        #            bbbb. N/A (age less than 15 years; never married)
                        # """
                        elif '. ' in line:
                            (var_code, var_code_desc) = line.strip().split(
                                sep='. ', maxsplit=1)
                            is_valid_code = True
                        else:
                            raise AssertionError(
                                "Program error. Line unaccounted for:\n" +
                                "{line}".format(line=line))
                        if is_valid_code:
                            if 'var_codes' not in ddict['record_types'][record_type][var_name]:
                                ddict['record_types'][record_type][var_name]['var_codes'] = collections.OrderedDict()
                            ddict['record_types'][record_type][var_name]['var_codes'][var_code] = var_code_desc
                    # Example case: ".any person in group quarters)"
                    else:
                        var_code_desc = line.strip().lstrip('.')
                        ddict['record_types'][record_type][var_name]['var_codes'][var_code] += ' '+var_code_desc
            # Example inconsistent format case:"""
            # ADJHSG     7      
            # Adjustment factor for housing dollar amounts (6 implied decimal places)
            # """
            elif (catch_var_desc and
                'description' not in ddict['record_types'][record_type][var_name]):
                var_desc = line.strip()
                ddict['record_types'][record_type][var_name]['description'] = var_desc
                catch_var_desc = False
                catch_var_code = True
            # Example inconsistent format case:"""
            # WGTP10     5
            #     Housing Weight replicate 10
            #           -9999..09999 .Integer weight of housing unit
            # WGTP11     5
            #     Housing Weight replicate 11
            #           -9999..09999 .Integer weight of housing unit
            # """
            elif ((var_name == 'WGTP10' and 'WGTP11' in line)
                or (var_name == 'YOEP12' and 'ANC' in line)):
                # type(var_note) == list
                (var_name, var_len, *var_note) = line.strip().split(maxsplit=2)
                ddict['record_types'][record_type][var_name] = collections.OrderedDict()
                ddict['record_types'][record_type][var_name]['length'] = var_len
                if len(var_note) > 0:
                    if 'notes' not in ddict['record_types'][record_type][var_name]:
                        ddict['record_types'][record_type][var_name]['notes'] = list()
                    ddict['record_types'][record_type][var_name]['notes'].append(var_note[0])
                catch_var_name = False
                catch_var_desc = True
                var_desc_indent = None
            else:
                if (catch_var_name, catch_var_desc,
                    catch_var_code, catch_var_note) != (False, )*4:
                    raise AssertionError(
                        "Program error. All flags to catch lines should be set " +
                        "to `False` by end-of-file.")
                if var_name != var_name_last:
                    raise AssertionError(
                        "Program error. End-of-file notes should only be read "+
                        "after `var_name_last` has been processed.")
                if 'notes' not in ddict:
                    ddict['notes'] = list()
                ddict['notes'].append(line)
    return ddict
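
The parser above leans on two small idioms: star-unpacking the header line with `split(maxsplit=2)`, and measuring indentation with `len(line) - len(line.lstrip())`. The sketch below isolates them so their behavior is easier to see; `split_header` and `indent_of` are hypothetical helper names that do not appear in the function itself.

```python
def split_header(line):
    """Split a variable header line into (name, length, notes).

    `notes` is an empty list when the header has no trailing note and a
    one-element list otherwise, because maxsplit=2 keeps the remainder
    of the line together as a single string.
    """
    var_name, var_len, *var_note = line.strip().split(maxsplit=2)
    return var_name, var_len, var_note


def indent_of(line):
    """Return the number of leading whitespace characters on a line."""
    return len(line) - len(line.lstrip())


plain = split_header("RT         1")
noted = split_header(
    "NWAB       1 (UNEDITED - See 'Employment Status Recode' (ESR))")
desc_indent = indent_of("    Region code")
code_indent = indent_of("           1 .Northeast")
```

Here `plain` carries an empty notes list while `noted` carries the parenthesized remark as one string, and `code_indent` is larger than `desc_indent`, which is the comparison the parser uses to tell description lines from value-code lines.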

Create the hierarchical dataframe (formatted below as Jupyter Notebook cells):

In [ ]:
import pandas as pd
ddict = parse_pumsdatadict(path=r'/path/to/PUMSDataDict2013.txt')
tmp = dict()
for record_type in ddict['record_types']:
    tmp[record_type] = pd.DataFrame.from_dict(ddict['record_types'][record_type], orient='index')
df_ddict = pd.concat(tmp, names=['record_type', 'var_name'])
df_ddict.head()

Out[ ]:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th></th>
      <th>length</th>
      <th>description</th>
      <th>var_codes</th>
      <th>notes</th>
    </tr>
    <tr>
      <th>record_type</th>
      <th>var_name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="5" valign="top">HOUSING RECORD</th>
      <th>ACCESS</th>
      <td>1</td>
      <td>Access to the Internet</td>
      <td>{'b': 'N/A (GQ)', '1': 'Yes, with subscription...</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>ACR</th>
      <td>1</td>
      <td>Lot size</td>
      <td>{'b': 'N/A (GQ/not a one-family house or mobil...</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>ADJHSG</th>
      <td>7</td>
      <td>Adjustment factor for housing dollar amounts (...</td>
      <td>{'1000000': '2013 factor (1.000000)'}</td>
      <td>[Note: The value of ADJHSG inflation-adjusts r...</td>
    </tr>
    <tr>
      <th>ADJINC</th>
      <td>7</td>
      <td>Adjustment factor for income and earnings doll...</td>
      <td>{'1007549': '2013 factor (1.007549)'}</td>
      <td>[Note: The value of ADJINC inflation-adjusts r...</td>
    </tr>
    <tr>
      <th>AGS</th>
      <td>1</td>
      <td>Sales of Agriculture Products (Yearly sales)</td>
      <td>{'b': 'N/A (GQ/vacant/not a one family house o...</td>
      <td>[Note: no adjustment factor is applied to AGS.]</td>
    </tr>
  </tbody>
</table>
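
Because the parsed `var_codes` keep range entries such as `02..20` or `-9999..09999` as plain string keys, looking up the label for an observed value takes one extra step. The following is a rough sketch of such a lookup, assuming ranges are always written as `low..high` with integer bounds; `lookup_label` and the sample `np_codes` dictionary are hypothetical and only imitate the shape of the parser's output.

```python
def lookup_label(var_codes, value):
    """Return the label for `value`, or None if nothing matches.

    An exact key match is tried first; otherwise every 'low..high' key is
    checked as an inclusive integer range. Non-numeric values or bounds
    (for example blank codes written as 'b') are skipped.
    """
    if value in var_codes:
        return var_codes[value]
    for code, label in var_codes.items():
        if '..' not in code:
            continue
        low, high = code.split('..', maxsplit=1)
        try:
            if int(low) <= int(value) <= int(high):
                return label
        except ValueError:
            continue
    return None


# Illustrative codes in the same shape as ddict[...]['var_codes'].
np_codes = {
    '00': 'Vacant unit',
    '01': 'One person record',
    '02..20': 'Number of person records',
}
label_exact = lookup_label(np_codes, '00')
label_range = lookup_label(np_codes, '07')
label_missing = lookup_label(np_codes, '21')
```

Trying the exact match first keeps enumerated codes from being shadowed by a range that happens to contain them.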

Regarding "python - Regex to parse well-formatted multi-line data dictionary", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26564775/
