
mediawiki - How to get the text of a specific section via the Wikipedia API

Reposted. Author: 行者123. Updated: 2023-12-04 02:40:50

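I only want to extract a specific section from a Wikipedia page.

Example:
I want to extract the text of the "Parts" section of the Wikipedia article "House".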

https://en.wikipedia.org/wiki/House

The resulting text would be:

Many houses have several large rooms  .....  sections of the home (including in more recent eras a garage). 

We can get the whole text of an article with a request like this:

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=house&rvprop=content&format=json

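But how do I get the text of a specific section?

Best answer

Do you need the plain wikitext or the HTML produced by the parser?

The examples below give you the "Layout" section (section 3 of the House article; you can use any other section index as well).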

When you want to retrieve the parsed HTML of a specific section, use the parse API:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=house&prop=text&section=3&disabletoc=1
Or, as an API request outside the sandbox:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=text&section=3&disabletoc=1
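
The request above can also be built and sent from a script. The following is a minimal sketch using only the Python standard library. The function names `build_parse_url` and `fetch_json` and the `User-Agent` string are placeholders chosen for this illustration and are not part of the original answer.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Placeholder value; replace it with something that identifies your own client.
USER_AGENT = "section-example/0.1 (example@example.org)"


def build_parse_url(page, prop, section=None):
    """Build an action=parse URL with the same parameters as the URLs above."""
    params = {"action": "parse", "format": "json", "page": page, "prop": prop}
    if section is not None:
        params["section"] = section
    params["disabletoc"] = 1
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Send the request and decode the JSON body (requires network access)."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def fetch_section_html(page, section):
    """Return the parsed HTML of one section as a string."""
    data = fetch_json(build_parse_url(page, "text", section))
    # In the JSON shape used throughout this answer the payload sits under "*".
    return data["parse"]["text"]["*"]
```

Calling `fetch_section_html("house", 3)` should return the HTML of the Layout section, provided the section numbering of the article has not changed since the answer was written.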

If you want the wikitext of a specific section, just use the wikitext prop instead of the text prop:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=wikitext&section=3&disabletoc=1

To find out which section has which index, you can request the "sections" property, without passing any section index:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=sections&disabletoc=1
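
Once the JSON reply to this request has been decoded, the index of each section can be read from the `sections` array. The sketch below assumes the reply has the shape used in this answer (a `parse` object containing a `sections` list whose entries carry `index`, `number` and `line`). The helper name `summarize_sections` and the trimmed `sample` data are illustrative only.

```python
def summarize_sections(response):
    """Return (index, number, heading) tuples from a prop=sections reply."""
    return [
        (section["index"], section["number"], section["line"])
        for section in response["parse"]["sections"]
    ]


# Hand-trimmed sample in the shape of a prop=sections reply for "House".
sample = {
    "parse": {
        "title": "House",
        "sections": [
            {"toclevel": 2, "line": "Layout", "number": "2.1", "index": "3"},
            {"toclevel": 2, "line": "Parts", "number": "2.2", "index": "4"},
        ],
    }
}

for index, number, heading in summarize_sections(sample):
    print(index, number, heading)
```

Note that `index` arrives as a string, and it is the value to pass as the `section` parameter in the follow-up request.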

So, as a complete example of retrieving the text of the Layout section using only the API, you would:

  • Retrieve the sections of the article:
    https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=sections&disabletoc=1

  • Response:
    {
    "parse": {
    "title": "House",
    "pageid": 13590,
    "sections": [
    {
    "toclevel": 1,
    "level": "2",
    "line": "Etymology",
    "number": "1",
    "index": "1",
    "fromtitle": "House",
    "byteoffset": 3549,
    "anchor": "Etymology"
    },
    {
    "toclevel": 1,
    "level": "2",
    "line": "Elements",
    "number": "2",
    "index": "2",
    "fromtitle": "House",
    "byteoffset": 4960,
    "anchor": "Elements"
    },
    {
    "toclevel": 2,
    "level": "3",
    "line": "Layout",
    "number": "2.1",
    "index": "3",
    "fromtitle": "House",
    "byteoffset": 4976,
    "anchor": "Layout"
    },
    {
    "toclevel": 2,
    "level": "3",
    "line": "Parts",
    "number": "2.2",
    "index": "4",
    "fromtitle": "House",
    "byteoffset": 6432,
    "anchor": "Parts"
    },
    {
    "toclevel": 2,
    "level": "3",
    "line": "History of the interior",
    "number": "2.3",
    "index": "5",
    "fromtitle": "House",
    "byteoffset": 7539,
    "anchor": "History_of_the_interior"
    },
    {
    "toclevel": 3,
    "level": "4",
    "line": "Communal rooms",
    "number": "2.3.1",
    "index": "6",
    "fromtitle": "House",
    "byteoffset": 8786,
    "anchor": "Communal_rooms"
    },
    {
    "toclevel": 3,
    "level": "4",
    "line": "Interconnecting rooms",
    "number": "2.3.2",
    "index": "7",
    "fromtitle": "House",
    "byteoffset": 9736,
    "anchor": "Interconnecting_rooms"
    },
    {
    "toclevel": 3,
    "level": "4",
    "line": "Corridor",
    "number": "2.3.3",
    "index": "8",
    "fromtitle": "House",
    "byteoffset": 11126,
    "anchor": "Corridor"
    },
    {
    "toclevel": 3,
    "level": "4",
    "line": "Employment-free house",
    "number": "2.3.4",
    "index": "9",
    "fromtitle": "House",
    "byteoffset": 13092,
    "anchor": "Employment-free_house"
    },
    {
    "toclevel": 2,
    "level": "3",
    "line": "Work location, technology and doctors",
    "number": "2.4",
    "index": "10",
    "fromtitle": "House",
    "byteoffset": 15969,
    "anchor": "Work_location,_technology_and_doctors"
    },
    {
    "toclevel": 3,
    "level": "4",
    "line": "Technology and privacy",
    "number": "2.4.1",
    "index": "11",
    "fromtitle": "House",
    "byteoffset": 17291,
    "anchor": "Technology_and_privacy"
    },
    {
    "toclevel": 1,
    "level": "2",
    "line": "Construction",
    "number": "3",
    "index": "12",
    "fromtitle": "House",
    "byteoffset": 18782,
    "anchor": "Construction"
    },
    {
    "toclevel": 2,
    "level": "3",
    "line": "Energy efficiency",
    "number": "3.1",
    "index": "13",
    "fromtitle": "House",
    "byteoffset": 21899,
    "anchor": "Energy_efficiency"
    },
    {
    "toclevel": 2,
    "level": "3",
    "line": "Earthquake protection",
    "number": "3.2",
    "index": "14",
    "fromtitle": "House",
    "byteoffset": 23057,
    "anchor": "Earthquake_protection"
    },
    {
    "toclevel": 1,
    "level": "2",
    "line": "Found materials",
    "number": "4",
    "index": "15",
    "fromtitle": "House",
    "byteoffset": 25172,
    "anchor": "Found_materials"
    },
    {
    "toclevel": 1,
    "level": "2",
    "line": "Legal issues",
    "number": "5",
    "index": "16",
    "fromtitle": "House",
    "byteoffset": 26235,
    "anchor": "Legal_issues"
    },
    {
    "toclevel": 2,
    "level": "3",
    "line": "United Kingdom",
    "number": "5.1",
    "index": "17",
    "fromtitle": "House",
    "byteoffset": 26644,
    "anchor": "United_Kingdom"
    },
    {
    "toclevel": 1,
    "level": "2",
    "line": "Identifying houses",
    "number": "6",
    "index": "18",
    "fromtitle": "House",
    "byteoffset": 26922,
    "anchor": "Identifying_houses"
    },
    {
    "toclevel": 1,
    "level": "2",
    "line": "Animal houses",
    "number": "7",
    "index": "19",
    "fromtitle": "House",
    "byteoffset": 27397,
    "anchor": "Animal_houses"
    },
    {
    "toclevel": 1,
    "level": "2",
    "line": "Houses and symbolism",
    "number": "8",
    "index": "20",
    "fromtitle": "House",
    "byteoffset": 27826,
    "anchor": "Houses_and_symbolism"
    },
    {
    "toclevel": 1,
    "level": "2",
    "line": "See also",
    "number": "9",
    "index": "21",
    "fromtitle": "House",
    "byteoffset": 28620,
    "anchor": "See_also"
    },
    {
    "toclevel": 1,
    "level": "2",
    "line": "References",
    "number": "10",
    "index": "22",
    "fromtitle": "House",
    "byteoffset": 29690,
    "anchor": "References"
    },
    {
    "toclevel": 1,
    "level": "2",
    "line": "External links",
    "number": "11",
    "index": "23",
    "fromtitle": "House",
    "byteoffset": 29720,
    "anchor": "External_links"
    }
    ]
    }
    }
  • Iterate over the result, find the section you want, and take its index
  • Use that index in the next API request to get the section content:
    https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=house&prop=wikitext&section=3&disabletoc=1

  • Response:
    {
    "parse": {
    "title": "House",
    "pageid": 13590,
    "wikitext": {
    "*": "=== Layout ===\n[[File:Gingerbread House Essex CT.jpg|thumb|Example of an early [[Victorian architecture|Victorian]] \"Gingerbread House\" in [[Connecticut]], United States, built in 1855]]\n\nIdeally, [[architect]]s of houses design [[room]]s to meet the needs of the people who will live in the house. [[Feng shui]], originally a [[China|Chinese]] method of moving houses according to such factors as rain and micro-climates, has recently expanded its scope to address the design of interior spaces, with a view to promoting harmonious effects on the people living inside the house, although no actual effect has ever been demonstrated. Feng shui can also mean the \"aura\" in or around a dwelling, making it comparable to the [[real estate|real-estate]] sales concept of \"indoor-outdoor flow\".\n\nThe [[square footage]] of a house in the United States reports the area of \"living space\", excluding the garage and other non-living spaces. The \"square metres\" figure of a house in Europe <!-- including Malta ? --> reports the area of the walls enclosing the home, and thus includes any attached garage and non-living spaces.<ref>{{Cite book|title=Land Management: Challenges and Strategies (First Edition)|last=Iyyer|first=Chaitanya|publisher=Global India Publications Pvt Ltd|year=2009|isbn=978-9380228488|location=|pages=}}</ref>{{Citation needed|date=February 2007}} The number of floors or levels making up the house can affect the square footage of a home."
    }
    }
    }
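
The steps above can be combined into one small script. This is a sketch under a few assumptions rather than a definitive implementation: the helper names (`call_parse_api`, `find_section_index`, `get_section_wikitext`) and the `User-Agent` value are made up for illustration, and the heading is matched by exact comparison against the `line` field, which may not hold for headings that contain markup.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Placeholder value; replace it with something that identifies your own client.
USER_AGENT = "section-example/0.1 (example@example.org)"


def call_parse_api(**params):
    """Call action=parse with the given parameters (requires network access)."""
    query = urllib.parse.urlencode({"action": "parse", "format": "json", **params})
    request = urllib.request.Request(
        API_URL + "?" + query, headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def find_section_index(sections, heading):
    """Return the index of the first section whose heading equals `heading`."""
    for section in sections:
        if section["line"] == heading:
            return section["index"]
    return None


def get_section_wikitext(page, heading):
    """Step 1: list the sections. Step 2: fetch the wikitext of the match."""
    sections = call_parse_api(page=page, prop="sections")["parse"]["sections"]
    index = find_section_index(sections, heading)
    if index is None:
        raise ValueError(f"no section titled {heading!r} on page {page!r}")
    reply = call_parse_api(page=page, prop="wikitext", section=index)
    return reply["parse"]["wikitext"]["*"]
```

Looking the section up by heading instead of hard-coding `section=3` keeps the script working if sections are added or reordered; for the question's example the call would be `get_section_wikitext("House", "Parts")`.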

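Background:
The idea of sections within a page is not integrated into revisions (yet). A revision is "just" the content of the whole page plus additional metadata (for example in several other slots), whereas sections are part of the content, which is only one slot of the revision. That is why the revisions query API can only give you the whole text. The page has to be parsed to find out what its sections are, because sections are a wikitext concept and therefore involve the parser.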

Regarding "mediawiki - How to get the text of a specific section via the Wikipedia API", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59499885/
