
html - Parsing HTML into clean XML

Reposted. Author: 行者123. Updated: 2023-11-27 23:05:08

I have a file that arrives as a ".EXCEL" file (fake Excel, but that is outside our control). It is actually HTML, and I am having trouble converting it to XML.

The HTML looks like this:

<table class="c41">
<tr class="c5">
<td valign="top" class="c6"><p class="c7"><span class="c8">Cash Activity </span>
</p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">FRIDAY&nbsp;&nbsp; </span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c5">
<td valign="top" class="c6"><p class="c11"><br/></p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">05-JAN-18</span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c12">
<td valign="top" class="c13"><p class="c7"><span class="c14">Prior Day Available Balance</span></p>
</td>
<td valign="top" class="c15"><p class="c10"><span class="c16">6,472,679.45
</span></p>
</td>
</tr>
</table>

It renders like this:

Cash Activity               | Friday 05-JAN-18
______________________________________________
Prior Day Available Balance | $123,456.58

Is there any way to parse this in PowerShell into output XML like the following?

<?xml version="1.0" encoding="utf-8" ?>
<Cash Activities>
<Cash Activity>
<Activity>Prior Day Available Balance</Activity>
<Balance>123456.58</Balance>
</Cash Activity>
</Cash Activities>

So far, my PowerShell only pulls the attachment out of the email and saves it as an HTML file:

$account = "my.email@mycompany.com"
#date to append to new file name
$date = Get-Date -Format yyyyMMdd
$searchDate = Get-Date -Format M/dd/yyyy
Write-Host $searchDate
#file to save attachment as
$newFileName = "Balance_Import_$date.xml"
$newFilePath = "C:\MyDirectory\"

#Go into Outlook and get the MAPI
$mail = New-Object -ComObject outlook.application
$mailNS = $mail.GetNamespace("MAPI")


#get the account and Inbox we want
$myAccount = $mailNS.Folders | ? {$_.Name -eq $account}
$myInbox = $myAccount.Folders | ? {$_.Name -eq "Inbox"};
$myItems = $myAccount.Items | ? {$_.ReceivedTime.Date -eq $searchDate};

#loop through the Inbox and get any Attachments with the extension of .EXCEL
foreach ($f in $myInbox)
{
    foreach($i in $f.Items)
    {
        Write-Host "Checking "$i.Subject"..."

        if($i.ReceivedTime.Date -eq $searchDate)
        {
            Write-Host "---"
            Write-Host $i.Subject
            Write-Host "---"

            foreach($a in $i.Attachments)
            {
                if($a.FileName -like "*.EXCEL")
                {
                    #Move the attachment to the desired directory
                    $a.SaveAsFile((Join-Path $newFilePath $newFileName))
                    Write-Host $a.FileName " Saved as HTML"

                    #TODO: PARSE HTML INTO XML

                }
            }
        }

    }
}

Best answer

Parsing fake Excel/HTML input can run into a few problems:

  1. The HTML is not well-formed.
  2. HTML entities such as &nbsp; break the XML parser.

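Assuming the HTML sample above is representative, the first problem does not apply here (the markup is well-formed), and you can brute-force the second by decoding the input like this: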

[xml]$html = [System.Net.WebUtility]::HtmlDecode(@'
<table class="c41">
<tr class="c5">
<td valign="top" class="c6"><p class="c7"><span class="c8">Cash Activity </span>
</p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">FRIDAY&nbsp;&nbsp; </span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c5">
<td valign="top" class="c6"><p class="c11"><br/></p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">05-JAN-18</span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c12">
<td valign="top" class="c13"><p class="c7"><span class="c14">Prior Day Available Balance</span></p>
</td>
<td valign="top" class="c15"><p class="c10"><span class="c16">6,472,679.45
</span></p>
</td>
</tr>
</table>
'@);
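
One caveat with this approach: HtmlDecode decodes every entity, including &amp; and &lt;, so input that contains those would no longer be well-formed XML after decoding. The following is a minimal sketch of a narrower alternative, assuming &nbsp; is the only named HTML entity in the file. The sample string and the variable names ($raw, $safe, $doc, $cells) are made up for illustration and are not part of the original answer.

```powershell
# Hypothetical sample; in practice this would be the text of the saved attachment.
$raw = '<table><tr><td>FRIDAY&nbsp;&nbsp;</td><td>A &amp; B</td></tr></table>'

# Swap the named entity for its numeric equivalent, which an XML parser
# accepts without a DTD. Entities such as &amp; are left untouched.
$safe = $raw -replace '&nbsp;', '&#160;'

# The cast now succeeds, and &amp; is resolved by the XML parser itself.
[xml]$doc = $safe
$cells = $doc.SelectNodes('//td')
$cells[1].InnerText   # A & B
```

For the real file, the input could presumably be read with Get-Content -Raw from the path the attachment was saved to, instead of being pasted inline.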

Now just use some simple XPath to select the nodes you want and build the XML specified above (tested and working):

$xml = @'
<?xml version="1.0" encoding="utf-8" ?>
<Cash Activities>

'@;
$rows = $html.DocumentElement.SelectNodes('//tr');
foreach ($row in $rows) {
    if ($row.GetAttribute('class') -eq 'c12') {
        $xml += "`t<Cash Activity>`n";
        $spans = $row.SelectNodes('.//descendant::span[@class]');
        if ($spans.Count -eq 2) {
            $xml += "`t`t<Activity>$($spans[0].InnerText.Trim())</Activity>`n";
            $xml += "`t`t<Balance>$($spans[1].InnerText.Trim())</Balance>`n";
        }
        $xml += "`t</Cash Activity>`n";
    }
}

$xml += @'
</Cash Activities>
'@;
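
Two details of the output are worth noting. XML element names cannot contain spaces, so a strict parser will reject "Cash Activities" and "Cash Activity", and the balance still carries its thousands separators. The following is a rough sketch of one way to handle both. It builds the result with System.Xml.XmlDocument instead of string concatenation. The element names CashActivities and CashActivity are substitutes chosen here, not names from the question, and the trimmed-down input is a stand-in for the decoded attachment.

```powershell
# Minimal stand-in for the decoded attachment; the real input has more rows.
[xml]$html = @'
<table>
<tr class="c12">
<td><p><span class="c14">Prior Day Available Balance</span></p></td>
<td><p><span class="c16">6,472,679.45
</span></p></td>
</tr>
</table>
'@

# Build the output as a DOM so names and text content are validated and escaped.
$out  = New-Object System.Xml.XmlDocument
$root = $out.AppendChild($out.CreateElement('CashActivities'))

foreach ($row in $html.SelectNodes('//tr[@class="c12"]')) {
    $spans = $row.SelectNodes('.//span[@class]')
    if ($spans.Count -ne 2) { continue }

    $item = $root.AppendChild($out.CreateElement('CashActivity'))

    $activity = $item.AppendChild($out.CreateElement('Activity'))
    $activity.InnerText = $spans[0].InnerText.Trim()

    # Strip thousands separators so the value reads as a plain number.
    $balance = $item.AppendChild($out.CreateElement('Balance'))
    $balance.InnerText = $spans[1].InnerText.Trim() -replace ',', ''
}

$out.OuterXml
```

To write the result to disk, $out.Save() accepts a file path, for example the path built with Join-Path from $newFilePath and $newFileName in the question.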

Source: a similar question on Stack Overflow about html - Parsing HTML into clean XML: https://stackoverflow.com/questions/50026451/
