
java - Splitting XML into smaller XML files of a specified size

Reposted. Author: 行者123. Updated: 2023-12-01 13:28:29

I'm fairly new to XML, and the bad news is that I have XML with the following structure:

<record>
    <record_id>200</record_id>
    <record_rows>
        <record_row>some text</record_row>
        .................................
    </record_rows>
</record>

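The number of record rows differs from record to record, so the size of each record also varies a lot. My task is to split the file (over 1 GB) into separate XML files of a specified size. Which parser is best for this? I also suppose I need some record selection strategy to get close to the target size (given the size of the input file and the unpredictable size of the next record, I can't think of one right now).

My friends, you are my only hope. How would you approach this problem?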

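Best answer

Assuming that no single record row is larger than the desired size of one file, you can use a SAX parser to read the file sequentially, counting the characters read and storing the data read so far in a buffer. When the character count reaches a value close to the size limit, the handler creates a new file containing only the rows read so far, resets the buffer and the character count, and continues reading the next group until the limit is reached again, and so on. In the end you have a set of files of roughly the same size (except the last one, which may be smaller) that together contain the same data.

To use the SAX parser you need an executable class containing the following code: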
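The buffer-and-flush strategy described above can be sketched without any XML at all. The class and method names below (RowBatcher, batch) are hypothetical and are not part of the answer's code; the sketch assumes each row is already available as a string and that size is measured in characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only: groups rows into batches and
// closes a batch as soon as its character count reaches the limit.
final class RowBatcher {

    static List<List<String>> batch(List<String> rows, long maxChars) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long charCount = 0;
        for (String row : rows) {
            current.add(row);
            charCount += row.length();
            if (charCount >= maxChars) { // limit reached: close this batch
                batches.add(current);
                current = new ArrayList<>();
                charCount = 0;
            }
        }
        if (!current.isEmpty()) { // leftover rows form the last, smaller batch
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("aaaa", "bb", "cccc", "d");
        System.out.println(batch(rows, 5)); // prints [[aaaa, bb], [cccc, d]]
    }
}
```

Because a batch is closed only after the row that crosses the limit has been added, a batch can overshoot the limit by up to one row. That is why the result is approximate rather than exact.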

import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;

public class SAXReader {

    // Directory that contains the input file, relative to the working directory
    public static final String PATH = "src/main/resources";

    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        XMLReader reader = sp.getXMLReader();
        reader.setContentHandler(new DataSaxHandler()); // this handler class has to be implemented as well
        // try-with-resources closes the input stream even if parsing fails
        try (InputStream in = new FileInputStream(new File(PATH, "data.xml"))) {
            reader.parse(new InputSource(in));
        }
    }
}

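Your XML file is expected at src/main/resources/data.xml (relative to where you run the application). You may want to change that.

If the split files are to be well-formed XML, they also need a root element, and they should probably keep information such as record_id so that you can tell which record they came from. I added a part attribute that contains a sequence number for ordering the file fragments. The resulting files look like this: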

data_part_1.xml

<record part='1'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row> ... <record_row>...</record_row></record_rows></record>

data_part_2.xml

<record part='2'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row> ... <record_row>...</record_row></record_rows></record>

...

data_part_n.xml

<record part='n'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row><record_row>...</record_row><record_row>...</record_row></record_rows></record>

where "n" is the number of files created.
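One way to confirm that a generated fragment really is well-formed is to parse it again. The sketch below is an illustration under assumptions: FragmentCheck and isWellFormed are hypothetical names, and the fragment is passed in as a string rather than read from one of the data_part files.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class FragmentCheck {

    // Returns true if the parser reads the whole text without a fatal error.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // DefaultHandler rethrows fatal parse errors
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<record part='1'><record_id>200</record_id><record_rows>"
                + "<record_row>a &amp; b</record_row></record_rows></record>";
        // A bare '&' in element content is not allowed in XML
        String bad = "<record part='1'><record_rows><record_row>a & b</record_row></record_rows></record>";
        System.out.println(isWellFormed(good)); // prints true
        System.out.println(isWellFormed(bad));  // prints false
    }
}
```

The second case matters for a splitter: a SAX parser reports text with entities such as &amp; already resolved, so text copied into a new file has to be escaped again.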

The SAX ContentHandler implementation that produces these files is shown below. You may want to change the DIRECTORY and MAX_SIZE constants:

import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

class DataSaxHandler extends DefaultHandler {

    // Change this to the directory where the files will be stored
    public static final String DIRECTORY = "target/results";

    // Change this to the approximate size of the resulting files (in characters)
    public static final long MAX_SIZE = 1024;

    // Number of markup characters in an empty element pair: "<></>"
    public static final long TAG_CHAR_SIZE = 5;

    // counts number of files created
    private int fileCount = 0;

    // counts characters to decide where to split file
    private long charCount = 0;
    // data line buffer (is reset when the file is split)
    private StringBuilder recordRowDataLines = new StringBuilder();

    // temporary variables used for the parser events
    private String currentElement = null;
    private String currentRecordId = null;
    private String currentRecordRowData = null;

    @Override
    public void startDocument() throws SAXException {
        File dir = new File(DIRECTORY);
        if (!dir.exists()) {
            // mkdirs() also creates missing parent directories such as "target"
            dir.mkdirs();
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        currentElement = qName;
        // start from empty text; characters() may deliver the content in several chunks
        if (qName.equals("record_id")) {
            currentRecordId = "";
        } else if (qName.equals("record_row")) {
            currentRecordRowData = "";
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (qName.equals("record_rows")) { // no more rows in this record - save last file here!
            try {
                saveFragment();
            } catch (IOException ex) {
                throw new SAXException(ex);
            }
        }
        if (qName.equals("record_row")) { // one row finished - save in buffer & calculate size so far
            charCount += tagSize("record_row");
            // the parser delivers unescaped text, so escape it again before it is written out
            recordRowDataLines.append("<record_row>")
                    .append(currentRecordRowData.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;"))
                    .append("</record_row>");
            if (charCount >= MAX_SIZE) { // if max size was reached, save what was read so far in a new file
                try {
                    saveFragment();
                } catch (IOException ex) {
                    throw new SAXException(ex);
                }
            }
        }
        currentElement = null;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (currentElement == null) {
            return;
        }
        // append instead of assign: the parser may report one text node in several calls
        if (currentElement.equals("record_id")) {
            currentRecordId += new String(ch, start, length);
        }
        if (currentElement.equals("record_row")) {
            currentRecordRowData += new String(ch, start, length);
            charCount += length; // storing size so far
        }
    }

    public long tagSize(String tagName) {
        return TAG_CHAR_SIZE + tagName.length() * 2; // markup size of <tagName></tagName>
    }

    /**
     * Saves a new file containing approximately MAX_SIZE in chars
     */
    public void saveFragment() throws IOException {
        ++fileCount;
        StringBuilder fileContent = new StringBuilder();
        fileContent.append("<record part='")
                .append(fileCount)
                .append("'><record_id>")
                .append(currentRecordId)
                .append("</record_id>")
                .append("<record_rows>")
                .append(recordRowDataLines)
                .append("</record_rows></record>");
        File fragment = new File(DIRECTORY, "data_part_" + fileCount + ".xml");
        // Write UTF-8, which is what XML parsers assume when a file has no XML declaration.
        // try-with-resources flushes and closes the writer even if the write fails.
        try (Writer out = new OutputStreamWriter(new FileOutputStream(fragment), "UTF-8")) {
            out.write(fileContent.toString());
        }

        // reset fragment data - record buffer and char count
        recordRowDataLines = new StringBuilder();
        charCount = 0;
    }

}
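MAX_SIZE in the handler is measured in characters, while the question asks for a file size. The two can differ once the data contains non-ASCII text, because the size on disk depends on the encoding used to write the file. The sketch below assumes UTF-8 output; SizeCheck and utf8Bytes are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class SizeCheck {

    // Number of bytes the text occupies when encoded as UTF-8
    static long utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "<record_row>abc</record_row>";
        // \u00e4 \u00f6 \u00fc are a-umlaut, o-umlaut, u-umlaut: two bytes each in UTF-8
        String accented = "<record_row>\u00e4\u00f6\u00fc</record_row>";
        System.out.println(ascii.length() + " chars, " + utf8Bytes(ascii) + " bytes");       // prints 28 chars, 28 bytes
        System.out.println(accented.length() + " chars, " + utf8Bytes(accented) + " bytes"); // prints 28 chars, 31 bytes
    }
}
```

If the target size has to be met in bytes, one option is to add the encoded byte length of each row to the counter instead of its character length.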

Regarding java - Splitting XML into smaller XML files of a specified size, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/21688898/
