python - Scraping patent data from the Indian patent website

Reposted. Author: 行者123. Updated: 2023-11-28 22:33:59
I am trying to write a web scraper for the Indian patent search website to get data about patents. Here is the code I have so far.

#import the necessary modules
import urllib2
#import the beautifulsoup functions to parse the data
from bs4 import BeautifulSoup

#mention the website that you are trying to scrape
patentsite="http://ipindiaservices.gov.in/publicsearch/"

#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(patentsite)

#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page)

print soup

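Unfortunately, either the Indian patent website is not robust, or I am not sure how to proceed further with it.

This is the output of the above code.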

<!-- 
###################################################################
## ##
## ##
## SIDDHAST.COM ##
## ##
## ##
###################################################################
--><!DOCTYPE HTML>
<html>
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta charset="utf-8"/>
<title>:: InPASS - Indian Patent Advanced Search System ::</title>
<link href="resources/ipats-all.css" rel="stylesheet"/>
<script src="app.js" type="text/javascript"></script>
<link href="resources/app.css" rel="stylesheet"/>
</head>
<body></body>
</html>

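What I want is this: given a company name, the scraper should get all the patents for that particular company. If I can get this part right, I want to do other things, such as providing a set of inputs that the scraper uses to find patents. But I am stuck and unable to proceed any further.

Any pointers on how to obtain this data would be much appreciated.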

Best answer

You can do this with just requests. The POST goes to http://ipindiaservices.gov.in/publicsearch/resources/webservices/search.php with a query parameter _dc, which is a timestamp that we create with time.time.

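Each value in "field[]" should match up with the corresponding value in "fieldvalue[]" and, in turn, in "operator[]", whether you pick AND, OR or NOT. The [] after each key specifies that we are passing an array of value(s); without it nothing will work: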

data = {
    "publication_type_published": "on",
    "publication_type_granted": "on",
    "fieldDate": "APD",
    "datefieldfrom": "19120101",
    "datefieldto": "20160906",
    "operatordate": " AND ",
    "field[]": ["PA"],  # claims, description, patent-number codes go here
    "fieldvalue[]": ["chris*"],  # matching values for ^^ go here
    "operator[]": [" AND "],  # matching sql logic for ^^ goes here
    "page": "1",  # gives you next page results
    "start": "0",  # not sure what effect this actually has.
    "limit": "25"}  # not sure how this relates as len(r.json()[u'record']) stays 25 regardless

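import requests
from time import time

# _dc is the timestamp query parameter: the current time with the dot removed.
post = "http://ipindiaservices.gov.in/publicsearch/resources/webservices/search.php?_dc={}".format(
    str(time()).replace(".", ""))

with requests.Session() as s:
    # Visit the search page first, presumably so the session picks up any cookies it sets.
    s.get("http://ipindiaservices.gov.in/publicsearch/")
    # Mark the request as an AJAX call.
    s.headers.update({"X-Requested-With": "XMLHttpRequest"})
    r = s.post(post, data=data)
    print(r.json())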
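Since "field[]", "fieldvalue[]" and "operator[]" have to stay aligned, it may help to build them from a single list of criteria. The following is only a sketch written for Python 3: build_search_payload is a hypothetical helper, not part of the site's API, and the fixed keys are copied from the data dict above without knowing how the server treats other values.

```python
def build_search_payload(criteria, date_from="19120101", date_to="20160906"):
    """Build the form data from (field, value, operator) triples.

    Keeping each criterion in one tuple guarantees that "field[]",
    "fieldvalue[]" and "operator[]" have the same length and order.
    """
    fields, values, operators = [], [], []
    for field, value, operator in criteria:
        operator = operator.strip().upper()
        if operator not in ("AND", "OR", "NOT"):
            raise ValueError("unsupported operator: %r" % operator)
        fields.append(field)
        values.append(value)
        # The payload in the answer pads operators with spaces (" AND ").
        operators.append(" %s " % operator)

    return {
        "publication_type_published": "on",
        "publication_type_granted": "on",
        "fieldDate": "APD",
        "datefieldfrom": date_from,
        "datefieldto": date_to,
        "operatordate": " AND ",
        "field[]": fields,
        "fieldvalue[]": values,
        "operator[]": operators,
        "page": "1",
        "start": "0",
        "limit": "25",
    }


# Same query as the hand-written dict above.
payload = build_search_payload([("PA", "chris*", "AND")])
```

The resulting dict can be passed as data=payload to s.post in place of the hand-written one.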

The output of the request will look like the following; I can't include all of it because there is too much data to post:

{u'success': True, u'record': [{u'Publication_Status': u'Published', u'appDate': u'2016/06/16', u'pubDate': u'2016/08/31', u'title': u'ACTUATOR FOR DEPLOYABLE IMPLANT', u'sourceID': u'inpat', u'abstract': u'\n    Systems and methods are provided for usin.............

If you use the record key, you get a list of dicts like this:

{u'Publication_Status': u'Published', u'appDate': u'2015/01/27', u'pubDate': u'2015/06/26', u'title': u'CORRUGATED PALLET', u'sourceID': u'inpat', u'abstract': u'\n    A corrugated paperboard pallet is produced from two flat blanks which comprise a pallet top and a pallet bottom. The two blanks are each folded to produce only two parallel vertically extending double thickness ribs&nbsp;three horizontal panels&nbsp;two vertical side walls and two horizontal flaps. The ribs of the pallet top and pallet bottom lock each other from opening in the center of the pallet by intersecting perpendicularly with notches in the ribs. The horizontal flaps lock the ribs from opening at the edges of the pallet by intersecting perpendicularly with notches&nbsp;and the vertical sidewalls include vertical flaps that open inward defining fork passages whereby the vertical flaps lock said horizontal flaps from opening.\n  ', u'Assignee': u'OLVEY Douglas A., SKETO James L., GUMBERT Sean G., DANKO Joseph J., GABRYS Christopher W., ', u'field_of_invention': u'FI10', u'publication_no': u'26/2015', u'patent_no': u'', u'application_no': u'642/DELNP/2015', u'UCID': u'WVJ4NVVIYzFLcUQvVnJsZGczcVRmSS96Vkh3NWsrS1h3Qk43S2xHczJ2WT0%3D', u'Publication_Type': u'A'}

This is your patent information.
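The raw records contain HTML entities such as &nbsp;, padded abstracts and a trailing comma in Assignee, so some tidying is likely needed before storing them. Below is a rough sketch for Python 3; clean_text and tidy_record are hypothetical helpers, the sample is a shortened excerpt of the record shown above, and the comma-separated Assignee format is assumed from that one record.

```python
import html


def clean_text(value):
    """Decode HTML entities and collapse runs of whitespace."""
    text = html.unescape(value).replace("\xa0", " ")  # "&nbsp;" decodes to "\xa0"
    return " ".join(text.split())


def tidy_record(record):
    """Keep a few keys from one entry of the 'record' list, cleaned up."""
    return {
        "title": clean_text(record.get("title", "")),
        "application_no": record.get("application_no", ""),
        "abstract": clean_text(record.get("abstract", "")),
        # Assumption: names are comma-separated, with a trailing ", ".
        "assignees": [name.strip()
                      for name in record.get("Assignee", "").split(",")
                      if name.strip()],
    }


# Shortened excerpt of the record shown above.
sample = {
    "title": "CORRUGATED PALLET",
    "application_no": "642/DELNP/2015",
    "abstract": "\n    ribs&nbsp;three horizontal panels&nbsp;two vertical side walls\n  ",
    "Assignee": "OLVEY Douglas A., SKETO James L., ",
}
tidy = tidy_record(sample)
```

Applied to a whole response, this would be something like [tidy_record(rec) for rec in r.json()["record"]].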

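If we choose a few values in the browser, you can see that the values in fieldvalue, field and operator all line up one-to-one; AND is the default, so it appears for each option.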

So figure out the field codes, pick the ones you want, and post.
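To collect every match for a company rather than a single page, one option is to post repeatedly with an increasing "page" value. This is a sketch only, for Python 3: iter_records and fetch_page are hypothetical names, and it assumes that changing "page" alone is enough (the payload comments say it gives the next page) and that the server eventually returns an empty record list. Neither is confirmed, which is why there is a max_pages cap.

```python
def iter_records(fetch_page, max_pages=100):
    """Yield records page by page until a page comes back empty.

    fetch_page(page_number) must return the decoded JSON dict for that page.
    """
    for page in range(1, max_pages + 1):
        result = fetch_page(page)
        records = result.get("record") or []
        # Stop on a failed request or when no more records are returned.
        if not result.get("success") or not records:
            break
        for record in records:
            yield record


# Possible wiring to the session from the answer (not run here):
#
#     def fetch_page(page):
#         return s.post(post, data=dict(data, page=str(page))).json()
#
#     all_records = list(iter_records(fetch_page))
```

Passing the fetch function in keeps the loop independent of requests, so it can be tried against canned responses before hitting the site.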

Regarding python - scraping patent data from the Indian patent website, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/39356677/
