How can I modify my Python code to extract and format list items from a mixed input string as described, including contextualizing each list item?(如何修改我的Python代码以从混合输入字符串中提取列表项并设置其格式(包括将每个列表项设置为上下文)？)

转载作者：bug小助手更新时间：2023-10-28 13:29:02

Code:

import textwrap


def filter_lists(text):
    # split text into lines
    lines = text.split('\n')

    patterns = [
        r'^[0-9]+\.',  # number list item
        r'^[a-zA-Z]\.',  # letter list item
        r'^\u2022',  # bullet point list item
        r'^[ivx]+\.',  # roman numeral list item
        r'^\u25E6',  # special bullet point list item
        r'^\u2713',  # Checkmark List Item
        r'^[→←↑↓]',  # Arrow List Item (Add arrows as needed)
        r'^♦',  # Diamond List Item
        r'^★',  # Star List Item
        r'^[^\w\s]',  # Emoji List Item (Matches any non-word, non-space character)
        r'^\uE000',  # Icon List Item (Replace with the specific Unicode code for your icon)
        r'^[@#*%!&]',  # Custom Symbol List Item (Add your custom symbols within the brackets)
        r'^(red|blue|green|yellow)',  # Color-Coded List Item (Add color names or codes)
        r'^\d+\.(jpg|png|gif)',  # Image List Item (Matches numbered image file names)
        r'^\[\d{1,3}%\]',  # Progress Bar List Item (Matches percentages in square brackets)
        r'^\[[A-Za-z]+\]',  # Tag or Label List Item (Matches words in square brackets)
        r'^\d+⚫',  # Numbered Icon List Item (Matches numbers followed by a black circle)
        r'^"([^"]+)"',  # Quote List Item (Matches text enclosed in double quotes)
        r'^\d{8}',  # Barcode List Item (Matches 8-digit numbers, adjust as needed)
    ]
    # initialize list for filtered lines
    filtered_lines = []

    # iterate over lines
    for line in lines:
        # iterate over patterns
        for pattern in patterns:
            # if line matches pattern, add to filtered lines
            if re.match(pattern, line.strip()):
                filtered_lines.append(line)
                break

    return '\n'.join(filtered_lines)
import hashlib
import re

def process_string(input_string, prefix_format=None, hierarchy=None):
    output_string = ''
    used_prefixes = {}
    stack = []
    level = 0
    if prefix_format is None:
        first_line = input_string.split('\n')[0]
        prefix_format = infer_prefix_format(first_line)
    if not isinstance(prefix_format, list):
        prefix_format = [prefix_format]
    if hierarchy is None:
        hierarchy = infer_hierarchy(input_string)
    if not isinstance(hierarchy, list):
        hierarchy = [hierarchy]
    input_string = textwrap.dedent(input_string)
    input_string = '\n'.join(textwrap.wrap(input_string, width=80))
    lines = input_string.split('\n')
    if not input_string or input_string.isspace():
        return output_string
    for line in lines:
        line = line.strip()
        if not line:
            output_string += '\n'
            used_prefixes = {k: v for k, v in used_prefixes.items() if k in stack[:level]}
            stack = stack[:level]
            continue
        current_level = len(re.match(r'^(\s+)', line).group(1)) // 4 if re.match(r'^(\s+)', line) else 0
        if current_level > level:
            level += 1
            if len(prefix_format) >= level:
                format_dict = prefix_format[level - 1]
                validate_prefix_format(format_dict)
            else:
                raise ValueError(f"Missing prefix format for level {level}")
            if len(hierarchy) >= level:
                preference_list = hierarchy[level - 1]
                validate_hierarchy(preference_list)
            else:
                raise ValueError(f"Missing hierarchy for level {level}")
        elif current_level < level:
            while current_level < level and stack:
                stack.pop()
                used_prefixes.popitem()
                level -= 1
        else:
            format_dict = prefix_format[level - 1]
            preference_list = hierarchy[level - 1]
        try:
            match = re.match(r'(\d+\.|\w+\.|\w+\-|\w+\:|\S+)(\s+)(.+)?', line)
            if match:
                prefix = match.group(1)
                separator = match.group(2)
                content = match.group(3) or 'N/A'
            else:
                prefix = ''
                separator = ''
                content = ''
            prefix, separator, content, different_level = normalize_prefix(prefix, stack, used_prefixes=used_prefixes, separator=separator, mode='lenient', hierarchy=preference_list)
        except ValueError as e:
            print(e)
            continue
        else:
            if not content or content.isspace():
                content = 'N/A'
            output_string += format_output(prefix, separator, content, level, format_dict, preference_list)
            if different_level:
                stack.append(prefix)
    return output_string
def infer_hierarchy(input_string):
    """
    This function infers the hierarchy from the input string by using a regular expression to extract the prefixes at different levels of indentation from the input string,
    and creates a list based on them.
    """
    inferred_hierarchy = []
    lines = input_string.split('\n')
    for line in lines:
        match = re.match(r'^(\s*)(\d+\.|\w+\.|\w+\-|\w+\:|\S+)', line)
        if match:
            indentation = len(match.group(1))
            prefix = match.group(2)
            if indentation < len(inferred_hierarchy):
                preference_list = inferred_hierarchy[indentation]
                if prefix.isdigit() and 'number' not in preference_list:
                    preference_list.append('number')
                elif prefix.isalpha() and 'letter' not in preference_list:
                    preference_list.append('letter')
                elif not prefix.isdigit() and not prefix.isalpha() and 'other' not in preference_list:
                    preference_list.append('other')
            else:
                preference_list = []
                if prefix.isdigit():
                    preference_list.append('number')
                elif prefix.isalpha():
                    preference_list.append('letter')
                else:
                    preference_list.append('other')
                inferred_hierarchy.append(preference_list)
    return inferred_hierarchy
def validate_hierarchy(preference_list):
    """This function validates a given preference list and raises an exception if it is invalid.
    A valid preference list must be a list of strings that contain only 'number', 'letter', or 'other',
    and must have at least one element.
    """
    if not isinstance(preference_list, list):
        raise ValueError("Preference list must be a list")
    if not preference_list:
        raise ValueError("Preference list cannot be empty")
    for element in preference_list:
        if not isinstance(element, str):
            raise ValueError("Preference list must contain only strings")
        if not (element == 'number' or element == 'letter' or element == 'other'):
            raise ValueError("Preference list must contain only 'number', 'letter', or 'other'")
def format_output(prefix, separator, content, level, format_dict, hierarchy):
    """
    This function formats the output string with proper indentation and formatting based on
    - The prefix type and format.
    - The separator.
    - The hierarchy list that specifies the order of preference for different types of prefixes at each level of indentation.
    """
    output_string = ''
    output_string += ' ' * (level * 4)
    if hierarchy:
        if level < len(hierarchy):
            preference = hierarchy[level]
            if prefix.isdigit():
                output_string += prefix + preference[0] + ' '
            elif prefix.isalpha():
                output_string += prefix + preference[1] + ' '
            else:
                output_string += preference[2].format(prefix) + ' '
        else:
            if prefix.isdigit():
                output_string += prefix + format_dict['number'] + ' '
            elif prefix.isalpha():
                output_string += prefix + format_dict['letter'] + ' '
            else:
                output_string += format_dict['other'].format(prefix) + ' '
    else:
        if prefix.isdigit():
            output_string += prefix + format_dict['number'] + ' '
        elif prefix.isalpha():
            output_string += prefix + format_dict['letter'] + ' '
        else:
            output_string += format_dict['other'].format(prefix) + ' '
    output_string += separator
    output_string += content + '\n'
    return output_string
def validate_prefix_format(format_dict):
    """This function validates a given prefix format dictionary and raises an exception if it is invalid.
    A valid prefix format dictionary must have keys for numbers, letters, and other types of prefixes,
    and values that are valid separators or formats.
    """
    # Check if the format_dict has keys for numbers, letters, and other types of prefixes
    if not ('number' in format_dict and 'letter' in format_dict and 'other' in format_dict):
        raise ValueError("Prefix format dictionary must have keys for numbers, letters, and other types of prefixes")
    # Check if the format_dict has values that are valid separators or formats
    for key, value in format_dict.items():
        # If the value is a list or not a string, use the first element as the value
        if isinstance(value, list) or not isinstance(value, str):
            value = value[0]
        # If the value is not a dot, a dash, or a placeholder for other types of prefixes, raise an exception
        if not (value == '.' or value == '-' or re.match(r'\(\{\}\)', value)):
            raise ValueError(f"Prefix format dictionary must have a valid separator or format for {key} type of prefix")
        # If the key is 'l', add a key for lowercase letters with the same value
        if key == 'l':
            format_dict['letter'] = value

def normalize_prefix(prefix, previous_prefixes, used_prefixes={}, separator='.', mode='lenient', hierarchy=None):
    """
    This function normalizes a given prefix and returns it along with the separator and the content as a tuple.
    A normalized prefix is one that follows a logical sequence or hierarchy based on the previous prefixes at the same level,
    does not contain both numbers and letters, is a valid alphanumeric character or a symbol, and is not repeated at different levels of indentation.
    The hierarchy argument specifies the order of preference for different types of prefixes at each level of indentation.
    For example, hierarchy = [['number', 'letter', 'other'], ['letter', 'number', 'other']] means that at level 1, numbers are preferred over letters and other symbols,
    and at level 2, letters are preferred over numbers and other symbols. If hierarchy is None, then it can be inferred from the input string or assigned a default value.
    """
    format_dict = {'number': '.', 'letter': '-', 'other': '({})'}
    hashed_prefix = hashlib.md5(prefix.encode()).hexdigest()
    # split prefix into number and letter parts
    match = re.match(r'^(\d+)(\w+)', prefix)
    if match:
        number_part = match.group(1)
        letter_part = match.group(2)
        # normalize number part
        number_part, number_separator, _, _ = normalize_prefix(number_part, previous_prefixes, used_prefixes=used_prefixes, separator=format_dict['number'], mode=mode, hierarchy=hierarchy)
        # normalize letter part
        letter_part, letter_separator, _, _ = normalize_prefix(letter_part, previous_prefixes, used_prefixes=used_prefixes, separator=format_dict['letter'], mode=mode, hierarchy=hierarchy)
        # combine number and letter parts
        content = prefix.replace(number_part, '').replace(letter_part, '')
        return (number_part + letter_part, format_dict['number'] if number_part.isdigit() else format_dict['letter'], content, False)
    # use a list of symbols
    symbols = ['•', '◦', '▪', '▫', '▸', '◂', '▴', '▾']
    # check if prefix is a symbol
    if prefix in symbols:
        # normalize symbol according to its position in the list
        index = symbols.index(prefix)
        sorted_symbols = sorted([p for p in previous_prefixes if p in symbols], key=symbols.index)
        new_index = sorted_symbols.index(prefix) if prefix in sorted_symbols else len(sorted_symbols)
        new_prefix = symbols[new_index]
        content = prefix[1:]
        return (new_prefix, format_dict['other'].format(prefix), content, False)
    match = re.match(r'^(\w|\S)(\W+)(.+)', prefix)
    if match:
        prefix = match.group(1)
        separator = match.group(2)
        content = match.group(3)
    else:
        match = re.match(r'^(\w|\S)(.+)', prefix)
        if match:
            prefix = match.group(1)
            separator = ''
            content = match.group(2)
        else:
            match = re.match(r'^(\w|\S)', prefix)
            if match:
                prefix = match.group(1)
                separator = ''
                content = ''
            else:
                prefix = ''
                separator = ''
                content = ''
    # handle empty or whitespace prefixes by assigning a default prefix based on the hierarchy
    if not prefix or prefix.isspace():
        level = len(previous_prefixes)
        if level < len(hierarchy):
            preference_list = hierarchy[level]
            first_type = preference_list[0]
            if first_type == 'number':
                sorted_number_prefixes = sorted([p for p in previous_prefixes if p.isdigit()], key=int)
                new_prefix = str(int(sorted_number_prefixes[-1]) + 1) if sorted_number_prefixes else '1'
            elif first_type == 'letter':
                sorted_letter_prefixes = sorted([p for p in previous_prefixes if p.isalpha()])
                new_prefix = chr(ord(sorted_letter_prefixes[-1]) + 1) if sorted_letter_prefixes else 'A'
            else:
                symbols = ['•', '◦', '▪', '▫', '▸', '◂', '▴', '▾']
                sorted_symbol_prefixes = sorted([p for p in previous_prefixes if p in symbols], key=symbols.index)
                new_prefix = symbols[symbols.index(sorted_symbol_prefixes[-1]) + 1] if sorted_symbol_prefixes else symbols[0]
            return (new_prefix, format_dict[first_type], content, True)
        else:
            return ('*', format_dict['other'].format('*'), content, True)
    # check if prefix contains both numbers and letters
    if re.match(r'\d+\w+|\w+\d+', prefix):
        if mode == 'strict':
            raise ValueError(f"Invalid prefix: {prefix}")
        else:
            unique_prefix = prefix + str(used_prefixes.get(prefix, 0))
            used_prefixes[prefix] = used_prefixes.get(prefix, 0) + 1
            return (unique_prefix, format_dict['other'].format(prefix), content, False)
    # check if prefix is a valid alphanumeric character or a symbol
    if not re.match(r'\w|\S', prefix):
        if mode == 'strict':
            raise ValueError(f"Invalid prefix: {prefix}")
        else:
            return (hashed_prefix, format_dict['other'].format(prefix), content, False)
    # check if there are previous prefixes at the same level
    if previous_prefixes:
        last_prefix = previous_prefixes[-1]
        # check if prefix has the same type as the last prefix
        if (prefix.isdigit() == last_prefix.isdigit()) and (prefix.isalpha() == last_prefix.isalpha()):
            # check if prefix is a number
            if prefix.isdigit():
                # check if prefix has one digit
                if len(prefix) == 1:
                    if mode == 'strict':
                        raise ValueError(f"Inconsistent prefix: {prefix}")
                    else:
                        # increment the prefix by one
                        new_prefix = str(int(prefix) + 1)
                        return (new_prefix, format_dict['number'], content, False)
                # check if prefix is less than or equal to the last prefix
                elif int(prefix) <= int(last_prefix):
                    if mode == 'strict':
                        raise ValueError(f"Inconsistent prefix: {prefix}")
                    else:
                        # sort the prefixes and find the index of the current prefix
                        sorted_prefixes = sorted(previous_prefixes + [prefix], key=int)
                        index = sorted_prefixes.index(prefix)
                        # assign a new prefix based on the index
                        new_prefix = str(index + 1)
                        return (new_prefix, format_dict['number'], content, False)
                else:
                    pass
            # check if prefix is a letter
            elif prefix.isalpha():
                # check if prefix has one letter
                if len(prefix) == 1:
                    if mode == 'strict':
                        raise ValueError(f"Inconsistent prefix: {prefix}")
                    else:
                        # increment the prefix by one
                        new_prefix = chr(ord(prefix) + 1)
                        return (new_prefix, format_dict['letter'], content, False)
                # check if prefix is less than or equal to the last prefix
                elif ord(prefix) <= ord(last_prefix):
                    if mode == 'strict':
                        raise ValueError(f"Inconsistent prefix: {prefix}")
                    else:
                        # sort the prefixes and find the index of the current prefix
                        sorted_prefixes = sorted(previous_prefixes + [prefix])
                        index = sorted_prefixes.index(prefix)
                        # assign a new prefix based on the index
                        new_prefix = chr(ord('A') + index)
                        return (new_prefix, format_dict['letter'], content, False)
                else:
                    pass
            else:
                pass
    else:
        # check if there is a hierarchy for the current level
        if hierarchy:
            level = len(previous_prefixes)
            # check if the hierarchy list has an element for the current level
            if level < len(hierarchy):
                preference_list = hierarchy[level]
                # check if the prefix type matches the first preference for the current level
                if (prefix.isdigit() and preference_list[0] == 'number') or (prefix.isalpha() and preference_list[0] == 'letter') or (not prefix.isdigit() and not prefix.isalpha() and preference_list[0] == 'other'):
                    pass
                else:
                    # handle inconsistent prefixes according to the mode argument
                    if mode == 'strict':
                        raise ValueError(f"Inconsistent prefix: {prefix}")
                    else:
                        # assign a new prefix based on the first preference for the current level
                        first_type = preference_list[0]
                        if first_type == 'number':
                            sorted_number_prefixes = sorted([p for p in previous_prefixes if p.isdigit()], key=int)
                            new_prefix = str(int(sorted_number_prefixes[-1]) + 1) if sorted_number_prefixes else '1'
                        elif first_type == 'letter':
                            sorted_letter_prefixes = sorted([p for p in previous_prefixes if p.isalpha()])
                            new_prefix = chr(ord(sorted_letter_prefixes[-1]) + 1) if sorted_letter_prefixes else 'A'
                        else:
                            symbols = ['•', '◦', '▪', '▫', '▸', '◂', '▴', '▾']
                            sorted_symbol_prefixes = sorted([p for p in previous_prefixes if p in symbols], key=symbols.index)
                            new_prefix = symbols[symbols.index(sorted_symbol_prefixes[-1]) + 1] if sorted_symbol_prefixes else symbols[0]
                        return (new_prefix, format_dict['other'], content, False)
            else:
                return (hashed_prefix, format_dict['other'].format(prefix), content, False)
def infer_prefix_format(input_string):
    """
    This function infers the prefix format from the input string by using a regular expression to extract the prefixes and their separators from the first line of the input string,
    and creates a dictionary based on them.
    """
    inferred_format = []
    first_line = input_string.split('\n')[0]
    matches = re.findall(r'(\w|\S)(\W+)', first_line)
    for match in matches:
        prefix = match[0]
        separator = match[1]
        if prefix.isdigit():
            format_dict = {'number': separator}
        elif prefix.isalpha():
            format_dict = {'letter': separator}
        elif prefix == 'o': # added this line to handle the KeyError: 'o'
            format_dict = {'o': separator} # added this line to handle the KeyError: 'o'
        else:
            format_dict = {'other': '({})'}
        inferred_format.append(format_dict)
    return inferred_format

testData = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent hendrerit dapibus posuere. Suspendisse tempus, mi rhoncus gravida pulvinar, orci orci congue diam, vel malesuada felis elit sit amet nisl. Aliquam sagittis facilisis rhoncus. \n\n1. England Cities:\n- Liverpool\n- London\n- Huyton\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\nB. Spain Cities: \na. Place1 \nb. Place2 \nc. Place3\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\nIII. Germany Cities: \ni. Place1 \nii. Place2 \niii. Place3\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n1. England Cities:\n 1. Liverpool\n 2. London\n 3. Huyton\n \nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n• Spain Cities:\n◦ Place1\n◦ Place2\n◦ Place3\n\na. Germany\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\na. England\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent hendrerit dapibus posuere. Suspendisse tempus, mi rhoncus gravida pulvinar, orci orci congue diam, vel malesuada felis elit sit amet nisl. Aliquam sagittis facilisis rhoncus."
output1 = filter_lists(testData)
print(output1)
output2 = process_string(output1)
print("The output is: "+output2)

Purpose of Code:

I've created a Python function that should take a string as input. This string will contain a series of lists along with non-list items. The code should output only the list data, using the item at the top of each list to contextualize the printing of each list item, and remove non-list items unrelated to lists. An example of the input and expected output is provided below.

我已经创建了一个Python函数，它应该接受字符串作为输入。此字符串将包含一系列列表以及非列表项。代码应该只输出列表数据，使用每个列表顶部的项来设置每个列表项的打印上下文，并删除与列表无关的非列表项。下面提供了输入和预期输出的示例。

Problem with Code:

I've been able to extract only the list items from the input, but thus far, I haven't been able to output them in the format I described earlier (as shown in the expected output below).

我只能从输入中提取列表项，但到目前为止，我还不能以前面描述的格式输出它们(如下面的预期输出所示)。

Input:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent hendrerit dapibus posuere. Suspendisse tempus, mi rhoncus gravida pulvinar, orci orci congue diam, vel malesuada felis elit sit amet nisl. Aliquam sagittis facilisis rhoncus. \n\n1. England Cities:\n- Liverpool\n- London\n- Huyton\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\nB. Spain Cities: \na. Place1 \nb. Place2 \nc. Place3\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\nIII. Germany Cities: \ni. Place1 \nii. Place2 \niii. Place3\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n1. England Cities:\n 1. Liverpool\n 2. London\n 3. Huyton\n \nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n• Spain Cities:\n◦ Place1\n◦ Place2\n◦ Place3\n\na. Germany\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\na. England\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent hendrerit dapibus posuere. Suspendisse tempus, mi rhoncus gravida pulvinar, orci orci congue diam, vel malesuada felis elit sit amet nisl. Aliquam sagittis facilisis rhoncus."

Expected Output:

['England Cities: Liverpool', 'England Cities: London', 'England Cities: Huyton', 'Spain Cities: Place1', 'Spain Cities: Place2', 'Spain Cities: Place3', 'Germany Cities: Place1', 'Germany Cities: Place2', 'Germany Cities: Place3', 'England Cities: Liverpool', 'England Cities: London', 'England Cities: Huyton', 'Spain Cities: Place1', 'Spain Cities: Place2', 'Spain Cities: Place3', 'Germany', 'England']

Actual Output:


1. England Cities:
- Liverpool
- London
- Huyton
B. Spain Cities: 
a. Place1 
b. Place2 
c. Place3
i. Place1 
ii. Place2 
iii. Place3
1. England Cities:
 1. Liverpool
 2. London
 3. Huyton
• Spain Cities:
◦ Place1
◦ Place2
◦ Place3
a. Germany
a. England

h ({}).
h ({})lace2
h ({})iverpool
h ({})ermany

How can I modify my code to get it working?

更多回答

Does this assume the text on the input will always have some non-list lines between the lists?

这是否假设输入中的文本在列表之间总是有一些非列表行？

There should be some way to differentiate what lines are title of lists and what lines are items of lists, as you are putting the title before every item 'England Cities: Liverpool', 'England Cities: London', ...

应该有一些方法来区分哪些行是列表的标题，哪些行是列表的项目，因为您要将标题放在每个项目之前，例如英格兰城市：利物浦、英格兰城市：伦敦、……

Please trim your code to make it easier to find your problem. Follow these guidelines to create a minimal reproducible example.

请修改您的代码，以便更容易地找到您的问题。遵循这些指导原则，创建一个最小的可重现示例。

@user316108, the code should not assume that 'the input will always have some non-list lines between the lists,' but the code should be prepared to remove all instances of it

@user316108，则代码不应该假定“输入在列表之间总是有一些非列表行”，但代码应该准备好删除它的所有实例

Does the title of the lists with at least one item always end with ':'? like in "Spain Cities:"

至少有一项的列表的标题是否总是以‘：’结尾？就像《西班牙城市》中的：

优秀答案推荐

A suggestion:

一个建议：

import re

s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent hendrerit dapibus posuere. Suspendisse tempus, mi rhoncus gravida pulvinar, orci orci congue diam, vel malesuada felis elit sit amet nisl. Aliquam sagittis facilisis rhoncus. \n\n1. England Cities:\n- Liverpool\n- London\n- Huyton\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\nB. Spain Cities: \na. Barcelona \nb. Córdoba \nc. Valladolid\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\nIII. Germany Cities: \ni. Köln \nii. Hambourg \niii. Aachen\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n1. England Cities:\n 1. Cardiff\n 2. Bristol\n 3. Coventry\n \nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n• Spain Cities:\n◦ Toledo\n◦ Sevilla\n◦ Zaragoza\n\na. Germany\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\na. England\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent hendrerit dapibus posuere. Suspendisse tempus, mi rhoncus gravida pulvinar, orci orci congue diam, vel malesuada felis elit sit amet nisl. Aliquam sagittis facilisis rhoncus."

pList = re.compile(r'^\w{0,4}[-•◦.] +(?P<title>.*)(?:: *\n(?P<items>(?:.*$\n?)+?)(?: *\n|\Z)| *\n\n)', re.M)
pItem = re.compile(r'^ *\S+ +(?P<item>.*\S)', re.M)

result = []

for m in pList.finditer(s):
    if m.group('items'):
        for i in pItem.finditer(m.group('items')):
            result.append(f"{m.group('title')}: {i.group('item')}")
    else:
        result.append(m.group('title'))
    
print(result)

Here I made the choice to not split the text into lines. List items (or orphan list title) are identified by two things: the double newline sequence at the end and the sequence that ends with a symbol at the beginning of the "block".

在这里，我选择了不将文本拆分成行。列表项(或孤立列表标题)由两件事标识：末尾的双换行符序列和以“块”开头的符号结束的序列。

Note that the description of items start is actually naive: \w{0,4}[-•◦.] for the title and \S+ for items, but feel free to write something more precise. Also, if you capture these parts and use the re.findall method for the second pattern, you will be able to perform more checks as in your initial code.

请注意，项目Start的描述实际上很幼稚：\W{0，4}[-·◦。]标题和\S+的项目，但请随意写一些更准确的东西。此外，如果您捕获这些部分并对第二个模式使用re.findall方法，您将能够像在初始代码中一样执行更多的检查。

更多回答

I genuinely appreciate your efforts to resolve my code problem. However, your solution does not produce the 'expected outcome.' For the input 's,' it generates the following: ['England Cities: Liverpool', 'England Cities: London', 'England Cities: Huyton', 'Spain Cities: Place1', 'Spain Cities: Place2', 'Spain Cities: Place3', 'Germany Cities: Place1', 'Germany Cities: Place2', 'Germany Cities: Place3', 'Spain Cities: Place1', 'Spain Cities: Place2', 'Spain Cities: Place3', 'Germany', 'England'] This differs from the expected result, which should be:

我真诚地感谢您为解决我的代码问题所做的努力。然而，你的解决方案并没有产生“预期的结果”。对于输入‘S’，它生成以下内容：[‘England Cities：利物浦’，‘England Cities：London’，‘England Cities：Huyton’，‘西班牙城市：Place1’，‘西班牙城市：Place2’，‘西班牙城市：Place3’，‘德国城市：Place1’，‘德国城市：Place2’，‘德国城市：Place3’，‘西班牙城市：Place1’，‘西班牙城市：Place2’，‘西班牙城市：Place3’，‘德国’，‘英格兰’]这与预期的结果不同：

['England Cities: Liverpool', 'England Cities: London', 'England Cities: Huyton', 'Spain Cities: Place1', 'Spain Cities: Place2', 'Spain Cities: Place3', 'Germany Cities: Place1', 'Germany Cities: Place2', 'Germany Cities: Place3', 'England Cities: Liverpool', 'England Cities: London', 'England Cities: Huyton', 'Spain Cities: Place1', 'Spain Cities: Place2', 'Spain Cities: Place3', 'Germany', 'England'] The expected result is missing the second occurrence of 'England Cities: Liverpool', 'England Cities: London', and 'England Cities: Huyton'.

英格兰城市：Liverpool“、”England Cities：London“、”England Cities：Huyton“、"Spain Cities：Place1”、" Spain Cities：Place2“、”Spain Cities：Place3“、"Germany Cities：Place1”、" Germany Cities：Place2“、”Germany Cities：Place3“、"England Cities：Liverpool”、" England Cities：London“、”England Cities：Huyton“、"Spain Cities：Place1”、" Spain Cities：Place1“地点2“、”西班牙城市：地点3“、”德国“、”英格兰“]预期结果缺少第二次出现的”英格兰城市：利物浦“、”英格兰城市：伦敦“和”英格兰城市：惠顿“。

My above comments refer to the use of the following input: "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent hendrerit dapibus posuere. Suspendisse tempus, mi rhoncus gravida pulvinar, orci orci congue diam, vel malesuada felis elit sit amet nisl. Aliquam sagittis facilisis rhoncus. \n\n1. England Cities:\n- Liverpool\n- London\n- Huyton\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\nB. Spain Cities: \na. Place1 \nb. Place2 \nc. Place3\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\nIII. Germany Cities: \ni. Place1 \nii. Place2 \niii. Place3

我以上的评论指的是以下输入的使用：“Lorem ipsum dolor set amet，consecteur adipiscing elit.这是一种可能存在的错误。妊娠枕悬吊、妊娠枕悬吊。手性矢状肌软膏。\n\n1。英格兰城市：\n-利物浦\n-伦敦\n-惠顿\n\n\nLorem ipsum dolor坐在一起，敬爱你的脚下精灵。西班牙城市：。地点1\nb。位置2\nC。位置3\n\nLorem ipsum door坐在一起，敬爱你的脚。德国城市：。位置1\n第二。位置2\n。排名3

\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n1. England Cities:\n 1. Liverpool\n 2. London\n 3. Huyton\n \nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n• Spain Cities:\n◦ Place1\n◦ Place2\n◦ Place3\n\na. Germany\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\na. England\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent hendrerit dapibus posuere. Suspendisse tempus, mi rhoncus gravida pulvinar, orci orci congue diam, vel malesuada felis elit sit amet nisl. Aliquam sagittis facilisis rhoncus."

\n\nLorem ipsum door坐在一起，敬爱你的脚下。英格兰城市：\n 1.利物浦\n 2.伦敦\n 3.惠顿\n\n\nLorem ipsum Dolor坐在阿梅特，圣徒们正在努力。\n\n·西班牙城市：\n◦Place 1\n◦Place 2\n◦Place 3\n\n.德国\n\nLorem ipsum door坐在一起，敬爱你的精灵。英国\n\nLorem ipsum door坐在一起，敬爱你的精灵。这是一种可能存在的错误。妊娠枕悬吊、妊娠枕悬吊。[医]手足矢状肌

@hopeful1412: It's only because there are spaces before each item in the second "England cities" block. Add an optional space to the second pattern. Note that my answer shows how to proceed and do not pretend to provide waterproof patterns. You have to build them yourself according to your data.

@Hope 1412：这只是因为在第二个“英格兰城市”区块的每一项之前都有空格。在第二个模式中添加一个可选的空格。请注意，我的回答说明了如何继续进行，而不是假装提供防水图案。你必须根据你的数据自己建立它们。

文章推荐： Kotlin 外部作用域

文章推荐： enums - 如何在 Kotlin 中声明枚举类型的变量？

文章推荐： mongodb - 在 MongoDB 中存储出生日期的最佳方法是什么？

css - mini-css-extract-plugin block 中的警告chunkName [mini-css-extract-plugin]之间的冲突顺序:
警告在块AccessRights〜Groups〜Navigator中[mini-css-extract-plugin] 之间的顺序冲突: css ../node_modules/css-loader?
mysql - 选择 unique (extract(year from saledate) || extract(month from saledate)) as ym from trnsact order by ym;
解决练习问题“对于数据库中的每个月/年组合，交易表的销售日期列中有多少个不同的日期？”我发现一个有效的查询，它显然结合了年份和月份 SELECT DISTINCT (extract(year from
mysql - 使用 EXTRACT YEAR、EXTRACT MONTH 和 CONCAT 函数获取 "Month Name, YYYY"格式的日期
我有一个名为 Student 的 MySQL 表，它有一个名为 entry_date 的列，类型为 date。我想选择以下形式的 entry_date: November, 2014 即它将显示 en
Python + Selenium ( Chrome ): How can I extract a specific text from my current url and use the extracted text to go to another?
我已经使用 iMacros 很长时间了，我有一个代码可以提取文本并使用它来创建我需要的特定 URL。事情是这样的: #I can extract the XPath text with this.
extract - 如何从saz文件中提取文件？
我将一个 session 从 Fiddler 导出到 saz 文件。此 session 仅包含 jpg 文件，我想知道 - 如何快速轻松地从 saz 中提取 jpg 文件？谢谢! 最佳答案提取 J
extract - 如何从倒谱中提取基频？
应用 FFT 后，我得到了具有多个频率段的频谱。如何使用倒谱方法从该频谱中获取基频？我做了很多研究，尝试了很多代码，并在 stackoverflow 上问了三遍(这很有帮助)，我非常确定倒谱方法是在
extract - 反编译/提取微软代理？
我正在寻找一种解压缩或反编译微软代理的方法。例如梅林代理我想提取动画/图像。到目前为止我发现的最好方法是。用粉红色背景录制我的屏幕。并分割视频......但这不是一个很好的做法...... 有小费
python - --extract-audio等同于YoutubeDL类？
我只想从youtube链接中提取MP3格式的音频，但是如果不从命令行使用youtube-dl选项调用--extract-audio，就无法弄清楚该如何做。在YoutubeDL类中是否有一种类似于her
Azure管道: could not extract archive
我有一个 Azure 管道，应该构建一个项目并将 jar 复制到 Artifactory。这是应该安装节点的 yml: - task: NodeTool@0 inputs: version
php - 使用带有连字符的 extract()
例如，如果您有一个如下所示的关联数组: $array = array('first-value' => 'Hello'); 然后你要提取它: extract($array); 由于变量名称中不能使用连
text-extraction - 如何使用查询从大文本中自动提取数据
我有大型 pdf 文件(法语的 100 页)描述了我的事件部门的一套规则。我正在寻找一种服务，允许我一次查询一个 pdf(或我从中提取的文本)以自动获取信息。 (示例:x 的最大授权长度是多少？)
extract - RDFa Reader提取工具
我是 RDF 初学者，我想从 HTML 中提取 RDF我正在使用 GRDDL，但它不太适合我，我每次都会收到安全异常:(您能向我推荐另一个工具吗？感谢您的帮助。最佳答案我通常使用说唱歌手，你可以
php - extract() 如何在当前范围内创建变量？
我很好奇，PHP 的函数是如何实现的extract有用吗？我想做一个稍微修改的版本。我希望我的函数在从蛇形符号到驼峰式的数组键中提取时生成变量名，例如: 现在 extract 这样做: $array
PHP extract() 函数
如果我使用 PHP 的 extract() 函数从数组中导入变量，同名变量会被覆盖吗？我问的原因是因为我正在尝试初始化所有变量。感谢您的宝贵时间。最佳答案默认情况下它将覆盖。 http://ph
text-extraction - 如何使用查询从大文本中自动提取数据
我有大型 pdf 文件(法语的 100 页)描述了我的事件部门的一套规则。我正在寻找一种服务，允许我一次查询一个 pdf(或我从中提取的文本)以自动获取信息。 (示例:x 的最大授权长度是多少？)
Python多处理: Extracting results
我正在尝试在 Python 中运行大量模拟，因此我尝试使用多处理来实现它。 import numpy as np import matplotlib.pyplot as plt import mult
javascript extract ..在值的第一个下划线处停止
尝试从主机名-rt45_34_we_35 中提取主机名-rt45。我正在使用/(.)_?./g。这似乎不起作用。我已经查看了正则表达式文档。想用？会使它变得贪婪并在第一个下划线处停止。我错过了什么？
java - Extract 通过正则表达式连接字符串中的变量
我目前正在从事一个数据挖掘项目。我必须阅读 C# 源代码，并且必须找到连接 SQL 语句的位置。我真正想要的是获取连接字符串变量名称的名称。示例: stat = "SELECT * FROM CUS
MySQL Extract 函数给出语法错误
我正在订单表上运行查询，以计算每个用户在 6 个月前一个月内的任何一天发出的请求总数(例如:2013 年 8 月)。这工作正常:- SELECT userid,firstname,surname,s
PHP:如何避免在特定情况下使用 extract()
我对 PHP 很陌生，并且到处都看到不建议使用提取函数。我正在从 mysql 表中获取数据来填充网站的一部分。因此我不知道该表可以有多少行。所以我使用 extract 函数，它为每行提供一个数组数组

bug小助手

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城