
html - How to parse Gmail chat logs from a web page?

Reposted. Author: 可可西里. Updated: 2023-11-01 14:58:25

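What is the best way to parse Gmail chat logs out of the web page that displays them? As far as I know, this is still the only way to access server-hosted Gmail chat logs (through desktop Gmail or mobile Gmail).

In the generated source of a conversation, the markup looks like nested divs and spans (divs elsewhere on the page have random two-character IDs and classes with no discernible pattern). Here is an excerpt of one line with the timestamp on the left: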

<div>
  <span style="display:block;float:left;color:#888">
    2:56 PM&nbsp;
  </span>

  <span style="display:block;padding-left:6em">
    <span>
      <span style="font-weight:bold">me</span>: i'm trying to think of a good way to parse gmail chat logs
    </span>
  </span>
</div>

Not every line has a timestamp, though; lines without one appear to have non-breaking spaces in its place:

<div>
  <span style="display:block;float:left;color:#888">
    &nbsp;&nbsp;
  </span>

  <span style="display:block;padding-left:6em">
    <span>
      and reformat that into something like an xml format
    </span>
  </span>
</div>

Should I use XPath? Is there something more efficient?

Edit:

As plain data, it looks like this:

12:43 AM John: Something something something.
Something something something.
me: Something something something?
12:44 AM Also, something something something.
12:47 AM Something something something.
12:48 AM Something something something
with something something something.
12:49 AM John: Something.
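
If the transcript is already available as plain text like the sample above, a small line-by-line pass may be enough to recover the structure. The sketch below is not part of the original question: `parse_chat`, `TIME_RE`, and `SPEAKER_RE` are hypothetical names, and it assumes that timestamps look like `12:43 AM`, that a speaker is a single word followed by a colon, and that a line without a timestamp or speaker inherits the most recent one. A message that itself starts with a word and a colon would be misread as a speaker change.

```ruby
# Hypothetical helper (not from the original post): turn a plain-text
# chat transcript into an array of hashes with time, speaker and message.

# Assumed timestamp shape: "12:43 AM" at the start of a line.
TIME_RE    = /\A(\d{1,2}:\d{2}\s[AP]M)\s+(.*)\z/
# Assumed speaker shape: one word immediately followed by a colon.
SPEAKER_RE = /\A([^\s:]+):\s+(.*)\z/

def parse_chat(text)
  time = nil     # most recent timestamp seen so far
  speaker = nil  # most recent speaker seen so far

  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    if (m = TIME_RE.match(line))
      time = m[1]
      line = m[2]
    end
    if (m = SPEAKER_RE.match(line))
      speaker = m[1]
      line = m[2]
    end
    # Lines with no timestamp or speaker inherit the previous values.
    { time: time, speaker: speaker, message: line }
  end
end

SAMPLE = <<~TEXT
  12:43 AM John: Something something something.
  Something something something.
  me: Something something something?
  12:44 AM Also, something something something.
  12:47 AM Something something something.
  12:48 AM Something something something
  with something something something.
  12:49 AM John: Something.
TEXT

parse_chat(SAMPLE).each { |row| puts row.inspect }
```

Each physical line becomes its own entry here; merging wrapped lines into the preceding message would need an extra rule, which this sketch leaves out.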

Best answer

Should I use XPath? Is there something more efficient?

I would use Ruby with the Nokogiri library, which gives you more flexibility than XPath/XSLT:

#!/usr/bin/ruby
require 'nokogiri'

# Sample markup: a timestamp span followed by a message span, twice.
src = <<EOS
<div>
  <span style="display:block;float:left;color:#888">
    2:56 PM&nbsp;
  </span>
  <span style="display:block;padding-left:6em">
    <span>
      <span style="font-weight:bold">me</span>: i'm trying to think of a good way to parse gmail chat logs
    </span>
  </span>
  <span style="display:block;float:left;color:#888">
    &nbsp;&nbsp;
  </span>
  <span style="display:block;padding-left:6em">
    <span>
      and reformat that into something like an xml format
    </span>
  </span>
</div>
EOS

chatlog = []
last_timestamp = nil
doc = Nokogiri::HTML(src)

# Walk the top-level spans of each row; the inline style tells them apart.
doc.xpath('//div/span').each do |span|
  # to_s guards against spans that carry no style attribute at all.
  style = span['style'].to_s

  if style.include?('color:')
    # Grey, floated span: the timestamp column.
    last_timestamp = span.content.strip
  elsif style.include?('padding-left:')
    # Indented span: the message, paired with the preceding timestamp span.
    chatlog << {:timestamp => last_timestamp, :message => span.content.strip}
  end
end

# Emit the collected lines as XML.
builder = Nokogiri::XML::Builder.new do |xml|
  xml.chatlog {
    chatlog.each do |line|
      xml.line {
        xml.time line[:timestamp]
        xml.message line[:message]
      }
    end
  }
end

puts builder.to_xml

Returns:

<?xml version="1.0" encoding="UTF-8"?>
<chatlog>
  <line>
    <time>2:56 PM </time>
    <message>me: i'm trying to think of a good way to parse gmail chat logs</message>
  </line>
  <line>
    <time>  </time>
    <message>and reformat that into something like an xml format</message>
  </line>
</chatlog>
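
Note that the second `<time>` element in this output holds only non-breaking spaces: `&nbsp;` is decoded to U+00A0, and Ruby's `String#strip` does not remove that character. If the goal is for untimestamped lines to inherit the previous timestamp, one possible post-processing step is sketched below. It is an addition, not part of the original answer; `clean` and `fill_timestamps` are hypothetical helper names, and the input is assumed to be an array of hashes shaped like the `chatlog` entries above.

```ruby
# Hypothetical helpers (not from the original answer).

# Remove ordinary whitespace and non-breaking spaces (U+00A0) from both ends.
def clean(text)
  text.gsub(/\A[\s\u00A0]+|[\s\u00A0]+\z/, "")
end

# rows: array of hashes with :timestamp and :message keys.
# Returns new hashes in which a blank timestamp is replaced by the most
# recent non-blank one (or nil if none has been seen yet).
def fill_timestamps(rows)
  last = nil
  rows.map do |row|
    stamp = clean(row[:timestamp].to_s)
    last = stamp unless stamp.empty?
    { timestamp: last, message: clean(row[:message].to_s) }
  end
end

rows = [
  { timestamp: "2:56 PM\u00A0", message: "me: i'm trying to think of a good way to parse gmail chat logs" },
  { timestamp: "\u00A0\u00A0",  message: "and reformat that into something like an xml format" }
]

fill_timestamps(rows).each { |row| puts row.inspect }
```

Running `chatlog` through such a step before building the XML would give both lines the `2:56 PM` timestamp instead of leaving the second one blank.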

Regarding "html - How to parse Gmail chat logs from a web page?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/3151860/
