
java - How to read or parse MHTML (.mht) files in Java

Reposted. Author: 塔克拉玛干. Updated: 2023-11-03 04:09:23

I need to mine the content of most well-known document file formats, such as:

  1. pdf
  2. html
  3. doc/docx, etc.

For most of these file formats I plan to use:

http://tika.apache.org/

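But as of now Tika does not support MHTML (*.mht) files ( http://en.wikipedia.org/wiki/MHTML ). There are a few examples in C# ( http://www.codeproject.com/KB/files/MhtBuilder.aspx ), but I could not find any in Java.

I tried to open a *.mht file in 7Zip, but it failed, although WinZip was able to unpack the file into images and text (CSS, HTML, script) as text and binary files.

According to the MSDN page ( http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content ) and the Code Project page I mentioned earlier, mht files use GZip compression.

Trying to decompress it in Java results in the following exceptions. With java.util.zip.GZIPInputStream: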

java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:16)

And with java.util.zip.ZipFile:

 java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(Unknown Source)
at java.util.zip.ZipFile.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:21)

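Please suggest how to decompress it.

Thanks.

Best answer

Frankly, I was not expecting to find a solution in the near future and was about to give up, but somehow I stumbled upon these pages: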

http://en.wikipedia.org/wiki/MIME#Multipart_messages

http://msdn.microsoft.com/en-us/library/ms527355%28EXCHG.10%29.aspx

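They do not look very promising at first glance, but if you read them carefully you will find the clue. After reading them I fired up IE and started saving random pages as *.mht files. Let me go through one line by line...

But let me explain beforehand that my ultimate goal was to separate/extract the HTML content and parse it. The solution is not complete in itself, because it depends on the character set/encoding I chose while saving. Even so, it will extract the individual files with minor glitches.

I hope this will be useful for anyone who is trying to parse/decompress *.mht/MHTML files :)

======= Explanation ========
** Taken from an mht file **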

From: "Saved by Windows Internet Explorer 7"

The software used to save the file

Subject: Google
Date: Tue, 13 Jul 2010 21:23:03 +0530
MIME-Version: 1.0

Subject, date and MIME version ... much like a mail message
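Because the file opens with plain-text mail headers like these, it is neither a GZIP stream nor a ZIP archive, which would explain why GZIPInputStream and ZipFile reject it. The following is only a minimal sketch of that check: the class name `MhtSniffer` and its methods are hypothetical, and the header test is a rough heuristic rather than a real MIME parser.

```java
import java.nio.charset.StandardCharsets;

public class MhtSniffer {
    /** GZIP streams start with the two magic bytes 0x1f 0x8b. */
    public static boolean looksLikeGzip(byte[] head) {
        return head.length >= 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b;
    }

    /** ZIP archives start with the ASCII letters "PK". */
    public static boolean looksLikeZip(byte[] head) {
        return head.length >= 2 && head[0] == 'P' && head[1] == 'K';
    }

    /** Heuristic (an assumption of this sketch): the first line is a "Name: value" header. */
    public static boolean looksLikeMimeText(byte[] head) {
        String text = new String(head, StandardCharsets.US_ASCII);
        String firstLine = text.split("\r?\n", 2)[0];
        return firstLine.matches("[A-Za-z-]+:\\s.*");
    }

    public static void main(String[] args) {
        // The first bytes of a saved page, as in the sample above
        byte[] head = "From: \"Saved by Windows Internet Explorer 7\"\r\nSubject: Google\r\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println("gzip: " + looksLikeGzip(head));
        System.out.println("zip:  " + looksLikeZip(head));
        System.out.println("mime: " + looksLikeMimeText(head));
    }
}
```

In practice the `head` array would hold the first few dozen bytes read from the *.mht file.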

  Content-Type: multipart/related;
type="text/html";

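This is the part that tells us it is a multipart document. A multipart document combines one or more different sets of data in a single body, and a multipart Content-Type field must appear in the entity's header. Here we can also see that the type is "text/html".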

boundary="----=_NextPart_000_0007_01CB22D1.93BBD1A0"

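This is the most important part. It is the unique delimiter that separates two different parts (html, images, css, script, etc.). Once you get hold of this, everything becomes easy. Now I just have to iterate through the document, find the different parts, and save them according to their Content-Transfer-Encoding (base64, quoted-printable, etc.).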
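The idea of iterating over the document and cutting it at the boundary can be sketched in a few lines. This is a simplified illustration, not the parser from this answer: the class `MhtSplitter` and its methods are hypothetical names, and the sketch assumes the whole file fits in one string and that each delimiter line is the boundary value prefixed with "--", with "--" appended on the closing line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MhtSplitter {
    // Matches boundary="value" (quotes optional) in the top-level Content-Type header
    private static final Pattern BOUNDARY =
            Pattern.compile("boundary=\"?([^\";\\r\\n]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the boundary value, or null when no boundary parameter is present. */
    public static String findBoundary(String mht) {
        Matcher m = BOUNDARY.matcher(mht);
        return m.find() ? m.group(1) : null;
    }

    /** Splits the document into raw parts (part headers plus still-encoded body). */
    public static List<String> splitParts(String mht, String boundary) {
        String delimiter = "--" + boundary;
        List<String> parts = new ArrayList<>();
        StringBuilder current = null; // stays null until the first delimiter is seen
        for (String line : mht.split("\r?\n", -1)) {
            if (line.startsWith(delimiter)) {
                if (current != null) {
                    parts.add(current.toString());
                }
                // "--boundary--" closes the document, anything else opens a new part
                current = line.startsWith(delimiter + "--") ? null : new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String mht = String.join("\r\n",
                "MIME-Version: 1.0",
                "Content-Type: multipart/related;",
                "\ttype=\"text/html\";",
                "\tboundary=\"----=_NextPart_000\"",
                "",
                "------=_NextPart_000",
                "Content-Type: text/html",
                "",
                "<html></html>",
                "------=_NextPart_000",
                "Content-Type: image/gif",
                "",
                "R0lGODlh",
                "------=_NextPart_000--",
                "");
        String boundary = findBoundary(mht);
        System.out.println("boundary = " + boundary);
        System.out.println("parts    = " + splitParts(mht, boundary).size());
    }
}
```

Each returned part still carries its own headers, so the Content-Transfer-Encoding of every part can be read before its body is decoded.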

Sample

 ------=_NextPart_000_0007_01CB22D1.93BBD1A0
Content-Type: text/html;
charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Location: http://www.google.com/webhp?sourceid=navclient&ie=UTF-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" =
.
.
.
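The trailing "=" on the DOCTYPE line in the sample is a quoted-printable soft line break, and sequences such as "=3D" stand for a single byte written as two hex digits. A minimal decoding sketch follows; the class name `QuotedPrintable` is hypothetical, and the sketch assumes well-formed input and leaves any "=" that is not followed by two hex digits untouched.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class QuotedPrintable {
    /** Decodes quoted-printable text into the raw bytes it represents. */
    public static byte[] decode(String encoded) {
        // A "=" directly before a line ending is a soft line break: join the lines
        String joined = encoded.replaceAll("=\\r?\\n", "");
        ByteArrayOutputStream out = new ByteArrayOutputStream(joined.length());
        for (int i = 0; i < joined.length(); i++) {
            char c = joined.charAt(i);
            boolean escape = c == '=' && i + 2 < joined.length()
                    && Character.digit(joined.charAt(i + 1), 16) >= 0
                    && Character.digit(joined.charAt(i + 2), 16) >= 0;
            if (escape) {
                // "=XX" encodes one byte as two hex digits
                out.write(Integer.parseInt(joined.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String sample = "<meta http-equiv=3D\"Content-Type\" content=3D\"text/html; =\r\n"
                + "charset=3Dutf-8\">caf=C3=A9";
        // The decoded bytes are in the part's charset, utf-8 in this sample
        System.out.println(new String(decode(sample), StandardCharsets.UTF_8));
    }
}
```

The result is a byte array on purpose: the bytes only become text once they are interpreted with the charset declared in the part's Content-Type header.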

** Java code **

An interface that defines the constants.

package com.test.mht.core;

public interface IConstants
{
    // Header and parameter names looked for while scanning the file
    public String BOUNDARY = "boundary";
    public String CHAR_SET = "charset";
    public String CONTENT_TYPE = "Content-Type";
    public String CONTENT_TRANSFER_ENCODING = "Content-Transfer-Encoding";
    public String CONTENT_LOCATION = "Content-Location";

    // Byte order marks as they appear in quoted-printable text
    public String UTF8_BOM = "=EF=BB=BF";

    public String UTF16_BOM1 = "=FF=FE";
    public String UTF16_BOM2 = "=FE=FF";
}

The main parser class...

/**
* This program and the accompanying materials are made available under the terms of the Eclipse Public License v1.0
* which accompanies this distribution, and is available at
* http://www.eclipse.org/legal/epl-v10.html
*/
package com.test.mht.core;

import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Parses a *.mht file and splits it into its constituent parts.
 * @author Manish Shukla
 */
public class MHTParser implements IConstants
{
    private File mhtFile;
    private File outputFolder;

    public MHTParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    /**
     * Splits the MHT file into its parts and writes each part to the output folder.
     * @throws Exception if the boundary cannot be found or a part cannot be written
     */
    public void decompress() throws Exception
    {
        String type = "";
        String encoding = "";
        String location = "";
        String filename = "";
        String charset = "utf-8";
        StringBuilder buffer = null;

        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost before decoding
        try (BufferedReader reader = Files.newBufferedReader(mhtFile.toPath(), StandardCharsets.ISO_8859_1))
        {
            final String boundary = getBoundary(reader);
            if(boundary == null)
                throw new Exception("Failed to find document 'boundary'... Aborting");

            String line = null;
            while((line = reader.readLine()) != null)
            {
                String temp = line.trim();
                if(temp.contains(boundary))
                {
                    // A boundary line ends the current part (if any) and starts the next one
                    if(buffer != null) {
                        writeBufferContentToFile(buffer,encoding,filename,charset);
                    }

                    buffer = new StringBuilder();
                }else if(temp.startsWith(CONTENT_TYPE)) {
                    type = getType(temp);
                }else if(temp.startsWith(CHAR_SET)) {
                    charset = getCharSet(temp);
                }else if(temp.startsWith(CONTENT_TRANSFER_ENCODING)) {
                    encoding = getEncoding(temp);
                }else if(temp.startsWith(CONTENT_LOCATION)) {
                    location = temp.substring(temp.indexOf(":")+1).trim();
                    filename = getFileName(location,type);
                }else {
                    if(buffer != null) {
                        buffer.append(line + "\n");
                    }
                }
            }
        }
    }

    /**
     * Extracts the value from a line such as charset="utf-8", with or without quotes.
     */
    private String getCharSet(String temp)
    {
        String t = temp.split("=")[1].trim();
        return t.replace("\"", "").replace(";", "");
    }

    /**
     * Save the file as per character set and encoding
     */
    private void writeBufferContentToFile(StringBuilder buffer,String encoding, String filename, String charset)
    throws Exception
    {
        if(!outputFolder.exists())
            outputFolder.mkdirs();

        byte[] content = null;

        boolean text = true;

        if(encoding.equalsIgnoreCase("base64")){
            content = getBase64EncodedString(buffer);
            text = false;
        }else if(encoding.equalsIgnoreCase("quoted-printable")) {
            content = getQuotedPrintableString(buffer);
        }
        else
            content = buffer.toString().getBytes(StandardCharsets.ISO_8859_1);

        if(!text)
        {
            try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(filename))) {
                bos.write(content);
            }
        }else
        {
            try (BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(filename), charset))) {
                // Decode with the part's charset instead of the platform default
                bw.write(new String(content, charset));
            }
        }
    }

    /**
     * Decodes a quoted-printable body into raw bytes.<br>
     * When the *.mht file is saved with 'utf-8' encoding, '=EF=BB=BF' (the BOM) is added; it is dropped here.
     * @see http://en.wikipedia.org/wiki/Byte_order_mark
     */
    private byte[] getQuotedPrintableString(StringBuilder buffer)
    {
        // Remove the encoded BOM and the soft line breaks ("=" at the end of a line)
        String temp = buffer.toString().replace(UTF8_BOM, "").replace("=\n", "");

        ByteArrayOutputStream out = new ByteArrayOutputStream(temp.length());
        for(int i = 0; i < temp.length(); i++)
        {
            char c = temp.charAt(i);
            if(c == '=' && i + 2 < temp.length()
                    && Character.digit(temp.charAt(i + 1), 16) >= 0
                    && Character.digit(temp.charAt(i + 2), 16) >= 0) {
                // "=XX" encodes one byte as two hex digits
                out.write(Integer.parseInt(temp.substring(i + 1, i + 3), 16));
                i += 2;
            }else {
                out.write(c);
            }
        }

        return out.toByteArray();
    }

    /**
     * Decodes a base64 body. Uses java.util.Base64 (Java 8+) instead of the internal
     * sun.misc.BASE64Decoder; the MIME decoder ignores the line breaks in the body.
     */
    private byte[] getBase64EncodedString(StringBuilder buffer) throws Exception {
        return Base64.getMimeDecoder().decode(buffer.toString());
    }

    /**
     * Tries to get a qualified file name. If the name is not apparent it tries to guess it from the URL.
     * Otherwise it returns 'unknown.<type>'
     */
    private String getFileName(String location, String type)
    {
        final Pattern p = Pattern.compile("(\\w|_|-)+\\.\\w+");
        String ext = "";
        String name = "";
        if(type.toLowerCase().endsWith("jpeg"))
            ext = "jpg";
        else
            ext = type.split("/")[1];

        if(location.endsWith("/")) {
            name = "main";
        }else
        {
            name = location.substring(location.lastIndexOf("/") + 1);

            // Keep the last "name.ext" match found in the final URL segment
            Matcher m = p.matcher(name);
            String fname = "";
            while(m.find()) {
                fname = m.group();
            }

            if(fname.trim().length() == 0)
                name = "unknown";
            else
                return getUniqueName(fname.substring(0,fname.indexOf(".")), fname.substring(fname.indexOf(".") + 1, fname.length()));
        }
        return getUniqueName(name,ext);
    }

    /**
     * Returns a qualified unique output file path for the parsed path.<br>
     * If the file already exists, it appends a numerical suffix and keeps counting up until the name is free.
     */
    private String getUniqueName(String name,String ext)
    {
        int i = 1;
        File file = new File(outputFolder,name + "." + ext);
        if(file.exists())
        {
            while(true)
            {
                file = new File(outputFolder, name + i + "." + ext);
                if(!file.exists())
                    return file.getAbsolutePath();
                i++;
            }
        }

        return file.getAbsolutePath();
    }

    private String getType(String line) {
        return splitUsingColonSpace(line);
    }

    private String getEncoding(String line){
        return splitUsingColonSpace(line);
    }

    private String splitUsingColonSpace(String line) {
        return line.split(":\\s*")[1].replaceAll(";", "");
    }

    /**
     * Gives you the boundary string
     */
    private String getBoundary(BufferedReader reader) throws Exception
    {
        String line = null;

        while((line = reader.readLine()) != null)
        {
            line = line.trim();
            if(line.startsWith(BOUNDARY)) {
                return line.substring(line.indexOf("\"") + 1, line.lastIndexOf("\""));
            }
        }

        return null;
    }
}
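Parts marked base64 (images and other binary files) are wrapped over many lines inside the *.mht file, so the decoder has to tolerate line breaks. The sketch below is separate from the parser and only illustrates that point with java.util.Base64; the class name `Base64PartDemo`, the method `decodePart` and the sample bytes are made up for the demonstration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class Base64PartDemo {
    /** Decodes a base64 part body; the MIME decoder skips line separators. */
    public static byte[] decodePart(String body) {
        return Base64.getMimeDecoder().decode(body);
    }

    public static void main(String[] args) {
        byte[] original = "GIF89a pretend image bytes, long enough to wrap"
                .getBytes(StandardCharsets.US_ASCII);

        // Wrap at 16 characters per line to imitate the body of a part
        String wrapped = Base64.getMimeEncoder(16, "\r\n".getBytes(StandardCharsets.US_ASCII))
                .encodeToString(original);
        System.out.println(wrapped);

        System.out.println("round trip ok: " + Arrays.equals(original, decodePart(wrapped)));

        try {
            // The basic decoder rejects characters outside the base64 alphabet, including CR/LF
            Base64.getDecoder().decode(wrapped);
        } catch (IllegalArgumentException e) {
            System.out.println("basic decoder failed: " + e.getMessage());
        }
    }
}
```

This is why a MIME-aware decoder is the safer choice for part bodies than the basic one.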

Regards,

Source: a similar question on Stack Overflow about how to read or parse MHTML (.mht) files in Java: https://stackoverflow.com/questions/3230305/
