- html - 出于某种原因,IE8 对我的 Sass 文件中继承的 html5 CSS 不友好?
- JMeter 在响应断言中使用 span 标签的问题
- html - 在 :hover and :active? 上具有不同效果的 CSS 动画
- html - 相对于居中的 html 内容固定的 CSS 重复背景?
这是我要抓取的页面:
<!DOCTYPE html><html dir="ltr" class="rezemp-ResumeViewLayout-html"><head>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-WNSB8XG');</script>
<!-- End Google Tag Manager -->
<script src="https://cdn.optimizely.com/js/6377170661.js"></script>
<script>
window.createRecaptchaPromise = function () {
return new Promise(function(resolve) { resolve(''); });
};
window.createRecaptchaChallengePromise = function () {
return new Promise(function(resolve) { resolve(''); });
};
</script>
<title>Cat Sitter - Perkasie, PA | Indeed.com</title><meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=1.0,user-scalable=no"><link rel="stylesheet" type="text/css" href="/static/a965426693faf68209ad/styles/resume-view-app.css"></head><body class="rezemp-ResumeViewLayout-body">
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-WNSB8XG"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<!-- End Google Tag Manager (noscript) -->
<div id="content"><noscript>This page requires JavaScript.</noscript></div><script type="text/javascript">var _sift = window._sift = window._sift || []; _sift.push(['_setAccount', 'fb21e9c129']); _sift.push(['_setUserId', 'b90ff823cb1bcec9']); _sift.push(['_setSessionId', '1dcicspl2f8a9800']); _sift.push(['_trackPageview']);
(function() {
function ls() {
var e = document.createElement('script');
e.src = 'https://cdn.siftscience.com/s.js';
document.body.appendChild(e);
}
if (window.attachEvent) {
window.attachEvent('onload', ls);
} else {
window.addEventListener('load', ls, false);
}
})();
</script><script>window.initialState = JSON.parse('{\x22commonModel\x22:{\x22advertiser\x22:\x22Nyaa Studio\x22,\x22baseAdsUrl\x22:\x22https:\\u002F\\u002Fads.indeed.com\x22,\x22baseAnalyticsUrl\x22:\x22https:\\u002F\\u002Fanalytics.indeed.com\x22,\x22baseBillingUrl\x22:\x22https:\\u002F\\u002Fbilling.indeed.com\x22,\x22baseIndeedEmployerHelpUrl\x22:\x22https:\\u002F\\u002Findeedemployers.zendesk.com\x22,\x22baseIndeedUrl\x22:\x22https:\\u002F\\u002Fwww.indeed.com\x22,\x22baseMyIndeedUrl\x22:\x22https:\\u002F\\u002Fmy.indeed.com\x22,\x22basePieUrl\x22:\x22https:\\u002F\\u002Faccount.indeed.com\x22,\x22baseRozUrl\x22:\x22https:\\u002F\\u002Fresumes.indeed.com\x22,\x22baseSecureUrl\x22:\x22https:\\u002F\\u002Fsecure.indeed.com\x22,\x22billingIssue\x22:\x22CAN_PURCHASE\x22,\x22canSwitchAccount\x22:false,\x22confirmed\x22:true,\x22country\x22:\x22US\x22,\x22csrfParam\x22:\x22indeedcsrftoken\x22,\x22csrfToken\x22:\x22RonYXgzB6OxlClQV4QY9woqaatyPStN8\x22,\x22currentRelativeUrl\x22:\x22\\u002Fresume\\u002Fd53377828e23d884?s\x3dl%3D%26q%3Dcat%2520sitter%26searchFields%3Djt\x22,\x22currentUrl\x22:\x22https:\\u002F\\u002Fresumes.indeed.com\\u002Fresume\\u002Fd53377828e23d884?s\x3dl%3D%26q%3Dcat%2520sitter%26searchFields%3Djt\x22,\x22currentUserAccountKey\x22:\x2286c56776fbc49dff\x22,\x22emailAddress\x22:\x22nyaa.studio.apps@gmail.com\x22,\x22featuredEmployer\x22:false,\x22isMasquerade\x22:false,\x22language\x22:\x22en\x22,\x22locale\x22:\x22en_US\x22,\x22loggedIn\x22:true,\x22masquerade\x22:false,\x22moderated\x22:false,\x22nonMonetizedMarket\x22:false,\x22privileged\x22:false,\x22showLaunchBanner\x22:true,\x22subscriptionInfo\x22:{\x22admin\x22:true,\x22bulkContact\x22:false,\x22contactsRemaining\x22:0,\x22hasUnassignedSubscription\x22:false,\x22hasUnlimitedContacts\x22:false,\x22subscriptionAssigned\x22:false,\x22trial\x22:false},\x22subscriptionsEnabled\x22:true},\x22contactRecord\x22:{\x22allowContact\x22:true,\x22allowRepeatedContact\x22:false,\x22contactedByCoworkerDate\x22:\x22\x22,\x22contactedByCoworkerEmail\x22:\x22\x22,\x22contactedByUserDate\x22:\x22\x22,\x22responseStatus\x22:\x22UNRESPONDED\x22},\x22countryOfEligibility\x22:\x22United States\x22,\x22eligibility\x22:\x22ELG\x22,\x22isSavedResume\x22:false,\x22resumeModel\x22:{\x22accountKey\x22:\x22d53377828e23d884\x22,\x22additionalInformation\x22:\x22Skills\\nWord 2010 and 2013, Excel, Powerpoint, computer and typing skills, interpersonal skills,\\norganizational skills, some ASL knowledge, love of animals, previous animal care experience\x22,\x22assessments\x22:[],\x22awards\x22:[],\x22certifications\x22:[],\x22education\x22:[{\x22dateRange\x22:\x22December 2015 to Present\x22,\x22degree\x22:\x22Liberal Arts degree\x22,\x22field\x22:\x22Liberal Arts\x22,\x22id\x22:\x22EecYz-PgixmaTKmUQsuaQg\x22,\x22location\x22:\x22Newtown, PA\x22,\x22university\x22:\x22Bucks County Community College\x22}],\x22email\x22:\x22\x22,\x22firstName\x22:\x22Cat Sitter\x22,\x22fullName\x22:\x22Cat Sitter\x22,\x22groups\x22:[],\x22headline\x22:\x22Cat Sitter - Local Residence\x22,\x22highlightedWords\x22:[\x22sitters\x22,\x22cat\x22,\x22sitter\x22],\x22id\x22:\x22EecYz-PgixWaTKmUQsuaQg\x22,\x22licenses\x22:[],\x22links\x22:[],\x22location\x22:\x22Perkasie, PA\x22,\x22militaryService\x22:[],\x22patents\x22:[],\x22phoneNumber\x22:\x22\x22,\x22publications\x22:[],\x22skills\x22:[{\x22id\x22:\x22EecYz-PgixqaTKmUQsuaQg\x22,\x22monthsOfExperience\x22:12,\x22skill\x22:\x22Excel\x22},{\x22id\x22:\x22EecYz-PgixuaTKmUQsuaQg\x22,\x22monthsOfExperience\x22:120,\x22skill\x22:\x22organizational skills\x22},{\x22id\x22:\x22EecYz-PgixyaTKmUQsuaQg\x22,\x22monthsOfExperience\x22:12,\x22skill\x22:\x22Powerpoint\x22},{\x22id\x22:\x22EecYz-Pgix2aTKmUQsuaQg\x22,\x22monthsOfExperience\x22:24,\x22skill\x22:\x22typing\x22},{\x22id\x22:\x22EecYz-Pgix6aTKmUQsuaQg\x22,\x22monthsOfExperience\x22:24,\x22skill\x22:\x22Word\x22},{\x22id\x22:\x22Eehfj2sVJeqeoM7c3iCmnw\x22,\x22monthsOfExperience\x22:72,\x22skill\x22:\x22working with animals\x22}],\x22summary\x22:\x22\x22,\x22updatedDate\x22:\x22May 26, 2019\x22,\x22workExperience\x22:[{\x22company\x22:\x22Local Residence\x22,\x22dateRange\x22:\x222015 to Present\x22,\x22description\x22:\x22Feed cats\\n●\\tClean litter boxes\\n●\\tDaily check-ins on cats\x22,\x22id\x22:\x22EecYz-PgixaaTKmUQsuaQg\x22,\x22location\x22:\x22Quakertown, PA\x22,\x22title\x22:\x22Cat Sitter\x22},{\x22company\x22:\x22Local Residence - Dog Walker\x22,\x22dateRange\x22:\x22January 2014 to January 2016\x22,\x22description\x22:\x22Walk dogs\\n●\\tFeed dogs\\n●\\tCheck-ins and play time with dogs\x22,\x22id\x22:\x22EecYz-PgixeaTKmUQsuaQg\x22,\x22location\x22:\x22Quakertown, PA\x22,\x22title\x22:\x22Dog Sitter\x22}]},\x22tk\x22:\x221dckr2vhn3p22800\x22}');</script><script>window.proctorGroups = JSON.parse('[[5,null],[1,null],[1,null],[9,null],[7,null],[1,null],[0,null],[19,null],[-1,null],[0,null],[3,null],[1,null],[19,null],[-1,null],[-1,null],[-1,null],[-1,null],[0,null],[1,null],[-1,null],[-1,\x22${contactName} sent you a message about your resume on Indeed.\x22],[1,null],[1,null],[-1,null],[0,null],[-1,null],[-1,null],[-1,null],[1,null],[1,null],[-1,null],[-1,null],[1,null],[1,null],[-1,null],[1,null],[-1,null],[1,null],[1,{\x22recaptchaThreshold\x22:0.49}],[2,null],[-1,null],[1,null],[1,null],[0,null],[1,null],[-1,null],[1,null],[1,null],[-1,null],[-1,null],[-1,null],[-1,null],[-1,null],[1,null],[-1,null],[-1,null],[1,{\x22accountBlocks\x22:[371495985,371492945,371496796,371495142,371255403,180896675,402708456],\x22ipBlocks\x22:[\x22142.93.160.149\x22,\x22156.213.187.109\x22,\x2254.144.251.118\x22,\x2254.160.231.37\x22,\x2254.161.232.223\x22,\x2254.163.111.234\x22,\x2254.166.201.27\x22,\x2254.167.132.121\x22,\x2254.211.243.158\x22,\x2254.221.65.205\x22,\x2254.234.36.11\x22,\x2254.235.23.71\x22,\x2254.242.123.36\x22,\x2254.242.125.90\x22,\x2254.242.94.44\x22,\x2254.91.29.30\x22,\x2254.81.91.102\x22,\x2218.130.133.224\x22,\x2218.130.98.215\x22,\x223.8.18.212\x22,\x223.8.20.40\x22,\x2234.206.53.38\x22,\x2282.12.238.32\x22,\x22137.135.96.20\x22,\x2213.90.195.83\x22,\x22137.135.96.20\x22,\x22106.51.66.119\x22,\x22116.75.87.250\x22,\x2213.90.195.83\x22,\x22104.131.19.173\x22,\x22106.51.66.119\x22,\x22108.2.166.209\x22,\x2212.133.183.51\x22,\x22163.198.35.32\x22,\x22168.62.165.43\x22,\x2218.203.123.118\x22,\x2223.96.14.105\x22,\x2252.60.89.234\x22,\x2271.14.194.130\x22,\x2273.2.223.45\x22]}],[1,null],[1,null],[1,null],[1,null],[-1,null],[-1,null],[-1,null],[-1,null],[1,null],[1,null],[1,null],[1,null],[-1,null],[2,null],[1,null],[1,null],[-1,null],[3,null]]');</script><script type="text/javascript" src="/static/b9c32234bdbed298be40/scripts/vendor.js"></script><script type="text/javascript" src="/static/f38ebfd/en_US.js"></script><script>!function(n){function r(n){for(var r=a,t=n.length;t;)r=33*r^n.charCodeAt(--t);return r>>>0}var t=this['indeed.i18n.localeData'],e=t['']||{},a=e.salt;if(e.hasOwnProperty('salt'))for(var i in n)t[function(n){var t=r(n);return e.hasOwnProperty('id_length')&&(t=String(t).substring(0,e.id_length)),t}(i)]=n[i];else for(var i in n)t[i]=[null].concat(n[i])}({"Email {0} job seeker":["Contact {0} job seeker","Contact {0} job seekers"],"Email":["Contact"],"Email {0}":["Contact {0}"],"Send Email":["Message"]});</script><script type="text/javascript" src="/static/70ab8de6e2102d523c43/scripts/resume-view-app.js"></script></body></html>
我感兴趣的是末尾的 window.initialState
的一部分。我应该如何提取它?
附言我目前正在使用 Selenium
和 Chromedriver
。使用 requests
抓取信息是不可能的。
最佳答案
decode('unicode-escape')
普通字节字符串上的方法,它将把它们转换为 Unicode 字符串,encode('utf8')
方法从 Unicode 编码到 UTF-8 字节字符串,jsonString[2:-2]
删除字符串的第一个和最后两个字符,json.loads()
方法将字符串转换为 json。
re.compile() 返回一个 regular expression object ,这意味着 h 是一个正则表达式对象。
regex 对象有自己的 match 方法,带有可选的 pos 和 endpos 参数:
regex.match(string[, pos[, endpos]])
from bs4 import BeautifulSoup
import re
import json
html = """ <!DOCTYPE html><html dir="ltr" class="rezemp-ResumeViewLayout-html"><head>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-WNSB8XG');</script>
<!-- End Google Tag Manager -->
<script src="https://cdn.optimizely.com/js/6377170661.js"></script>
<script>
window.createRecaptchaPromise = function () {
return new Promise(function(resolve) { resolve(''); });
};
window.createRecaptchaChallengePromise = function () {
return new Promise(function(resolve) { resolve(''); });
};
</script>
<title>Cat Sitter - Perkasie, PA | Indeed.com</title><meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=1.0,user-scalable=no"><link rel="stylesheet" type="text/css" href="/static/a965426693faf68209ad/styles/resume-view-app.css"></head><body class="rezemp-ResumeViewLayout-body">
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-WNSB8XG"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<!-- End Google Tag Manager (noscript) -->
<div id="content"><noscript>This page requires JavaScript.</noscript></div><script type="text/javascript">var _sift = window._sift = window._sift || []; _sift.push(['_setAccount', 'fb21e9c129']); _sift.push(['_setUserId', 'b90ff823cb1bcec9']); _sift.push(['_setSessionId', '1dcicspl2f8a9800']); _sift.push(['_trackPageview']);
(function() {
function ls() {
var e = document.createElement('script');
e.src = 'https://cdn.siftscience.com/s.js';
document.body.appendChild(e);
}
if (window.attachEvent) {
window.attachEvent('onload', ls);
} else {
window.addEventListener('load', ls, false);
}
})();
</script><script>
window.initialState = JSON.parse('{\x22commonModel\x22:{\x22advertiser\x22:\x22Nyaa Studio\x22,\x22baseAdsUrl\x22:\x22https:\\u002F\\u002Fads.indeed.com\x22,\x22baseAnalyticsUrl\x22:\x22https:\\u002F\\u002Fanalytics.indeed.com\x22,\x22baseBillingUrl\x22:\x22https:\\u002F\\u002Fbilling.indeed.com\x22,\x22baseIndeedEmployerHelpUrl\x22:\x22https:\\u002F\\u002Findeedemployers.zendesk.com\x22,\x22baseIndeedUrl\x22:\x22https:\\u002F\\u002Fwww.indeed.com\x22,\x22baseMyIndeedUrl\x22:\x22https:\\u002F\\u002Fmy.indeed.com\x22,\x22basePieUrl\x22:\x22https:\\u002F\\u002Faccount.indeed.com\x22,\x22baseRozUrl\x22:\x22https:\\u002F\\u002Fresumes.indeed.com\x22,\x22baseSecureUrl\x22:\x22https:\\u002F\\u002Fsecure.indeed.com\x22,\x22billingIssue\x22:\x22CAN_PURCHASE\x22,\x22canSwitchAccount\x22:false,\x22confirmed\x22:true,\x22country\x22:\x22US\x22,\x22csrfParam\x22:\x22indeedcsrftoken\x22,\x22csrfToken\x22:\x22RonYXgzB6OxlClQV4QY9woqaatyPStN8\x22,\x22currentRelativeUrl\x22:\x22\\u002Fresume\\u002Fd53377828e23d884?s\x3dl%3D%26q%3Dcat%2520sitter%26searchFields%3Djt\x22,\x22currentUrl\x22:\x22https:\\u002F\\u002Fresumes.indeed.com\\u002Fresume\\u002Fd53377828e23d884?s\x3dl%3D%26q%3Dcat%2520sitter%26searchFields%3Djt\x22,\x22currentUserAccountKey\x22:\x2286c56776fbc49dff\x22,\x22emailAddress\x22:\x22nyaa.studio.apps@gmail.com\x22,\x22featuredEmployer\x22:false,\x22isMasquerade\x22:false,\x22language\x22:\x22en\x22,\x22locale\x22:\x22en_US\x22,\x22loggedIn\x22:true,\x22masquerade\x22:false,\x22moderated\x22:false,\x22nonMonetizedMarket\x22:false,\x22privileged\x22:false,\x22showLaunchBanner\x22:true,\x22subscriptionInfo\x22:{\x22admin\x22:true,\x22bulkContact\x22:false,\x22contactsRemaining\x22:0,\x22hasUnassignedSubscription\x22:false,\x22hasUnlimitedContacts\x22:false,\x22subscriptionAssigned\x22:false,\x22trial\x22:false},\x22subscriptionsEnabled\x22:true},\x22contactRecord\x22:{\x22allowContact\x22:true,\x22allowRepeatedContact\x22:false,\x22contactedByCoworkerDate\x22:\x22\x22,\x22contactedByCoworkerEmail\x22:\x22\x22,\x22contactedByUserDate\x22:\x22\x22,\x22responseStatus\x22:\x22UNRESPONDED\x22},\x22countryOfEligibility\x22:\x22United States\x22,\x22eligibility\x22:\x22ELG\x22,\x22isSavedResume\x22:false,\x22resumeModel\x22:{\x22accountKey\x22:\x22d53377828e23d884\x22,\x22additionalInformation\x22:\x22Skills\\nWord 2010 and 2013, Excel, Powerpoint, computer and typing skills, interpersonal skills,\\norganizational skills, some ASL knowledge, love of animals, previous animal care experience\x22,\x22assessments\x22:[],\x22awards\x22:[],\x22certifications\x22:[],\x22education\x22:[{\x22dateRange\x22:\x22December 2015 to Present\x22,\x22degree\x22:\x22Liberal Arts degree\x22,\x22field\x22:\x22Liberal Arts\x22,\x22id\x22:\x22EecYz-PgixmaTKmUQsuaQg\x22,\x22location\x22:\x22Newtown, PA\x22,\x22university\x22:\x22Bucks County Community College\x22}],\x22email\x22:\x22\x22,\x22firstName\x22:\x22Cat Sitter\x22,\x22fullName\x22:\x22Cat Sitter\x22,\x22groups\x22:[],\x22headline\x22:\x22Cat Sitter - Local Residence\x22,\x22highlightedWords\x22:[\x22sitters\x22,\x22cat\x22,\x22sitter\x22],\x22id\x22:\x22EecYz-PgixWaTKmUQsuaQg\x22,\x22licenses\x22:[],\x22links\x22:[],\x22location\x22:\x22Perkasie, PA\x22,\x22militaryService\x22:[],\x22patents\x22:[],\x22phoneNumber\x22:\x22\x22,\x22publications\x22:[],\x22skills\x22:[{\x22id\x22:\x22EecYz-PgixqaTKmUQsuaQg\x22,\x22monthsOfExperience\x22:12,\x22skill\x22:\x22Excel\x22},{\x22id\x22:\x22EecYz-PgixuaTKmUQsuaQg\x22,\x22monthsOfExperience\x22:120,\x22skill\x22:\x22organizational skills\x22},{\x22id\x22:\x22EecYz-PgixyaTKmUQsuaQg\x22,\x22monthsOfExperience\x22:12,\x22skill\x22:\x22Powerpoint\x22},{\x22id\x22:\x22EecYz-Pgix2aTKmUQsuaQg\x22,\x22monthsOfExperience\x22:24,\x22skill\x22:\x22typing\x22},{\x22id\x22:\x22EecYz-Pgix6aTKmUQsuaQg\x22,\x22monthsOfExperience\x22:24,\x22skill\x22:\x22Word\x22},{\x22id\x22:\x22Eehfj2sVJeqeoM7c3iCmnw\x22,\x22monthsOfExperience\x22:72,\x22skill\x22:\x22working with animals\x22}],\x22summary\x22:\x22\x22,\x22updatedDate\x22:\x22May 26, 2019\x22,\x22workExperience\x22:[{\x22company\x22:\x22Local Residence\x22,\x22dateRange\x22:\x222015 to Present\x22,\x22description\x22:\x22Feed cats\\n●\\tClean litter boxes\\n●\\tDaily check-ins on cats\x22,\x22id\x22:\x22EecYz-PgixaaTKmUQsuaQg\x22,\x22location\x22:\x22Quakertown, PA\x22,\x22title\x22:\x22Cat Sitter\x22},{\x22company\x22:\x22Local Residence - Dog Walker\x22,\x22dateRange\x22:\x22January 2014 to January 2016\x22,\x22description\x22:\x22Walk dogs\\n●\\tFeed dogs\\n●\\tCheck-ins and play time with dogs\x22,\x22id\x22:\x22EecYz-PgixeaTKmUQsuaQg\x22,\x22location\x22:\x22Quakertown, PA\x22,\x22title\x22:\x22Dog Sitter\x22}]},\x22tk\x22:\x221dckr2vhn3p22800\x22}');</script><script>window.proctorGroups = JSON.parse('[[5,null],[1,null],[1,null],[9,null],[7,null],[1,null],[0,null],[19,null],[-1,null],[0,null],[3,null],[1,null],[19,null],[-1,null],[-1,null],[-1,null],[-1,null],[0,null],[1,null],[-1,null],[-1,\x22${contactName} sent you a message about your resume on Indeed.\x22],[1,null],[1,null],[-1,null],[0,null],[-1,null],[-1,null],[-1,null],[1,null],[1,null],[-1,null],[-1,null],[1,null],[1,null],[-1,null],[1,null],[-1,null],[1,null],[1,{\x22recaptchaThreshold\x22:0.49}],[2,null],[-1,null],[1,null],[1,null],[0,null],[1,null],[-1,null],[1,null],[1,null],[-1,null],[-1,null],[-1,null],[-1,null],[-1,null],[1,null],[-1,null],[-1,null],[1,{\x22accountBlocks\x22:[371495985,371492945,371496796,371495142,371255403,180896675,402708456],\x22ipBlocks\x22:[\x22142.93.160.149\x22,\x22156.213.187.109\x22,\x2254.144.251.118\x22,\x2254.160.231.37\x22,\x2254.161.232.223\x22,\x2254.163.111.234\x22,\x2254.166.201.27\x22,\x2254.167.132.121\x22,\x2254.211.243.158\x22,\x2254.221.65.205\x22,\x2254.234.36.11\x22,\x2254.235.23.71\x22,\x2254.242.123.36\x22,\x2254.242.125.90\x22,\x2254.242.94.44\x22,\x2254.91.29.30\x22,\x2254.81.91.102\x22,\x2218.130.133.224\x22,\x2218.130.98.215\x22,\x223.8.18.212\x22,\x223.8.20.40\x22,\x2234.206.53.38\x22,\x2282.12.238.32\x22,\x22137.135.96.20\x22,\x2213.90.195.83\x22,\x22137.135.96.20\x22,\x22106.51.66.119\x22,\x22116.75.87.250\x22,\x2213.90.195.83\x22,\x22104.131.19.173\x22,\x22106.51.66.119\x22,\x22108.2.166.209\x22,\x2212.133.183.51\x22,\x22163.198.35.32\x22,\x22168.62.165.43\x22,\x2218.203.123.118\x22,\x2223.96.14.105\x22,\x2252.60.89.234\x22,\x2271.14.194.130\x22,\x2273.2.223.45\x22]}],[1,null],[1,null],[1,null],[1,null],[-1,null],[-1,null],[-1,null],[-1,null],[1,null],[1,null],[1,null],[1,null],[-1,null],[2,null],[1,null],[1,null],[-1,null],[3,null]]');</script><script type="text/javascript" src="/static/b9c32234bdbed298be40/scripts/vendor.js"></script><script type="text/javascript" src="/static/f38ebfd/en_US.js"></script><script>!function(n){function r(n){for(var r=a,t=n.length;t;)r=33*r^n.charCodeAt(--t);return r>>>0}var t=this['indeed.i18n.localeData'],e=t['']||{},a=e.salt;if(e.hasOwnProperty('salt'))for(var i in n)t[function(n){var t=r(n);return e.hasOwnProperty('id_length')&&(t=String(t).substring(0,e.id_length)),t}(i)]=n[i];else for(var i in n)t[i]=[null].concat(n[i])}({"Email {0} job seeker":["Contact {0} job seeker","Contact {0} job seekers"],"Email":["Contact"],"Email {0}":["Contact {0}"],"Send Email":["Message"]});</script>
<script type="text/javascript" src="/static/70ab8de6e2102d523c43/scripts/resume-view-app.js"></script></body></html>"""
soup = BeautifulSoup(html, 'lxml')
script = soup.find_all("script")
pattern = re.compile('window.initialState = JSON.parse(.*);')
for i in script:
strObj = i.text
match = pattern.search(strObj)
if match:
jsonString = strObj.split("window.initialState = JSON.parse")[1][:-1].encode('utf8').decode('unicode_escape')
jsonData = json.loads(jsonString[2:-2], strict=False)
print(jsonData)
O/P:
{'commonModel': {'advertiser': 'Nyaa Studio', 'baseAdsUrl': 'https://ads.indeed.com', 'baseAnalyticsUrl': 'https://analytics.indeed.com', 'baseBillingUrl': 'https://billing.indeed.com', 'baseIndeedEmployerHelpUrl': 'https://indeedemployers.zendesk.com', 'baseIndeedUrl': 'https://www.indeed.com', 'baseMyIndeedUrl': 'https://my.indeed.com', 'basePieUrl': 'https://account.indeed.com', 'baseRozUrl': 'https://resumes.indeed.com', 'baseSecureUrl': 'https://secure.indeed.com', 'billingIssue': 'CAN_PURCHASE', 'canSwitchAccount': False, 'confirmed': True, 'country': 'US', 'csrfParam': 'indeedcsrftoken', 'csrfToken': 'RonYXgzB6OxlClQV4QY9woqaatyPStN8', 'currentRelativeUrl': '/resume/d53377828e23d884?s=l%3D%26q%3Dcat%2520sitter%26searchFields%3Djt', 'currentUrl': 'https://resumes.indeed.com/resume/d53377828e23d884?s=l%3D%26q%3Dcat%2520sitter%26searchFields%3Djt', 'currentUserAccountKey': '86c56776fbc49dff', 'emailAddress': 'nyaa.studio.apps@gmail.com', 'featuredEmployer': False, 'isMasquerade': False, 'language': 'en', 'locale': 'en_US', 'loggedIn': True, 'masquerade': False, 'moderated': False, 'nonMonetizedMarket': False, 'privileged': False, 'showLaunchBanner': True, 'subscriptionInfo': {'admin': True, 'bulkContact': False, 'contactsRemaining': 0, 'hasUnassignedSubscription': False, 'hasUnlimitedContacts': False, 'subscriptionAssigned': False, 'trial': False}, 'subscriptionsEnabled': True}, 'contactRecord': {'allowContact': True, 'allowRepeatedContact': False, 'contactedByCoworkerDate': '', 'contactedByCoworkerEmail': '', 'contactedByUserDate': '', 'responseStatus': 'UNRESPONDED'}, 'countryOfEligibility': 'United States', 'eligibility': 'ELG', 'isSavedResume': False, 'resumeModel': {'accountKey': 'd53377828e23d884', 'additionalInformation': 'Skills\nWord 2010 and 2013, Excel, Powerpoint, computer and typing skills, interpersonal skills,\norganizational skills, some ASL knowledge, love of animals, previous animal care experience', 'assessments': [], 'awards': [], 'certifications': [], 'education': [{'dateRange': 'December 2015 to Present', 'degree': 'Liberal Arts degree', 'field': 'Liberal Arts', 'id': 'EecYz-PgixmaTKmUQsuaQg', 'location': 'Newtown, PA', 'university': 'Bucks County Community College'}], 'email': '', 'firstName': 'Cat Sitter', 'fullName': 'Cat Sitter', 'groups': [], 'headline': 'Cat Sitter - Local Residence', 'highlightedWords': ['sitters', 'cat', 'sitter'], 'id': 'EecYz-PgixWaTKmUQsuaQg', 'licenses': [], 'links': [], 'location': 'Perkasie, PA', 'militaryService': [], 'patents': [], 'phoneNumber': '', 'publications': [], 'skills': [{'id': 'EecYz-PgixqaTKmUQsuaQg', 'monthsOfExperience': 12, 'skill': 'Excel'}, {'id': 'EecYz-PgixuaTKmUQsuaQg', 'monthsOfExperience': 120, 'skill': 'organizational skills'}, {'id': 'EecYz-PgixyaTKmUQsuaQg', 'monthsOfExperience': 12, 'skill': 'Powerpoint'}, {'id': 'EecYz-Pgix2aTKmUQsuaQg', 'monthsOfExperience': 24, 'skill': 'typing'}, {'id': 'EecYz-Pgix6aTKmUQsuaQg', 'monthsOfExperience': 24, 'skill': 'Word'}, {'id': 'Eehfj2sVJeqeoM7c3iCmnw', 'monthsOfExperience': 72, 'skill': 'working with animals'}], 'summary': '', 'updatedDate': 'May 26, 2019', 'workExperience': [{'company': 'Local Residence', 'dateRange': '2015 to Present', 'description': 'Feed cats\nâ\x97\x8f\tClean litter boxes\nâ\x97\x8f\tDaily check-ins on cats', 'id': 'EecYz-PgixaaTKmUQsuaQg', 'location': 'Quakertown, PA', 'title': 'Cat Sitter'}, {'company': 'Local Residence - Dog Walker', 'dateRange': 'January 2014 to January 2016', 'description': 'Walk dogs\nâ\x97\x8f\tFeed dogs\nâ\x97\x8f\tCheck-ins and play time with dogs', 'id': 'EecYz-PgixeaTKmUQsuaQg', 'location': 'Quakertown, PA', 'title': 'Dog Sitter'}]}, 'tk': '1dckr2vhn3p22800'}
关于javascript - 如何从网页中抓取 `window.initialState`?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/56486511/
如本answer所述,如果浏览器不支持 e,可以设置后备游标。 G。 光标:抓取;。我现在的问题是获取这些图像。在我的驱动器上本地搜索“.cur”只给了我系统光标,其中 grab.cur 和 grab
以下代码在计算机上运行以从 Instagram 帐户中抓取数据。当我尝试在 VPS 服务器上使用它时,我被重定向到 Instagram 登录页面,因此脚本不起作用。 为什么当我在电脑上或服务器上时,I
我在使用 Ruby 和 Mechanize 将 POST 查询传递到站点的网站上。访问站点的查询基于 firebug,如下所示 param.PrdNo=-1¶m.Type=Prop¶m
我正在尝试抓取一个具有多个页面结果的网站,例如“1、2、3、4、5...”。 每个分页号都是到另一个页面的链接,我需要抓取每个页面。 到目前为止,我想出了这个: while lien = page.l
我正在使用 HtmlAgilityPack 在 C# Asp.Net 中执行 Scraping,到目前为止,我在从多个 Web 执行 Scratch 时没有遇到问题,但是,尝试弹出以下代码时出现错误
如果我有一个 css 文件做这样的事情 #foo:after{content:"bar;} ,有没有办法用 javascript 获取 :after 的内容?获取父元素的内容只返回 #foo 元素的内
问题是这样的: 我有一个 Web 应用程序 - 一个经常更改的通知系统 - 在一系列本地计算机上运行。该应用程序每隔几秒刷新一次以显示新信息。计算机仅显示信息,没有键盘或任何输入设备。 问题是,如果与
我想制作一个程序来模拟用户浏览网站和点击链接。必须启用 Cookie 和 javascript。我已经在 python 中成功地做到了这一点,但我想把它写成一种可编译的语言(python ide 不会
我制作了这个小机器人,它通过搜索参数列表进行处理。它工作正常,直到页面上有几个结果: product_prices_euros 给出了一半为空的项目列表。因此,当我与 product_prices_c
我需要找到一个单词的匹配项,例如: 在网上找到所有单词“学习”https://www.georgetown.edu/(结果:4个字)(您可以看到它按CTRL + F并搜索) 我有我的 Python 代
有一个站点\资源提供一些一般统计信息以及搜索工具的界面。这种搜索操作成本高昂,因此我想限制频繁且连续(即自动)的搜索请求(来自人,而不是来自搜索引擎)。 我相信有很多现有的技术和框架可以执行一些情报抓
这并不是真正的抓取,我只是想在网页中找到类具有特定值的 URL。例如: 我想获取 href 值。关于如何做到这一点的任何想法?也许正则表达式?你能发布一些示例代码吗?我猜 html 抓取库,比如 B
我正在使用 scrapy。 我正在使用的网站具有无限滚动功能。 该网站有很多帖子,但我只抓取了 13 个。 如何抓取剩余的帖子? 这是我的代码: class exampleSpider(scrapy.
我正在尝试从这个 website 中抓取图像和新闻 url .我定义的标签是 root_tag=["div", {"class":"ngp_col ngp_col-bottom-gutter-2 ng
关闭。这个问题需要更多focused .它目前不接受答案。 想改进这个问题吗? 更新问题,使其只关注一个问题 editing this post . 关闭上个月。 Improve this ques
我在几个文件夹中有数千个 html 文件,我想从评论中提取数据并将其放入 csv 文件中。这将允许我为项目格式化和清理它。例如,我在这个文件夹中有 640 个 html 文件: D:\My Web S
我在编写用于抓取网页的实用程序时遇到了一个问题。 我正在发送 POST 请求来检索数据,我模仿我正在抓取的网络行为(根据使用 fiddler 收集的信息)。 我已经能够自动替换我的 POST 中除 V
对于 Googlebot 的 AJAX 抓取,我在我的网站中使用“_escaped_fragment_”参数。 现在我查看了 Yandex 对我网站的搜索结果。 我看到搜索结果中不存在 AJAX 响应
我正在尝试抓取网站的所有结果页面,它可以工作,但有时脚本会停止并显示此错误: 502 => Net::HTTPBadGateway for https://website.com/id/12/ --
我是一个学习网络爬虫的初学者,由于某种原因我无法爬网this地点。当我在 Chrome 中检查它时,代码看起来不错,但是当我用 BeautifulSoup 阅读它时,它不再是可刮的。汤提到“谷歌分析”
我是一名优秀的程序员,十分优秀!