java - Reading a large file with multiple threads

Reposted. Author: 行者123. Updated: 2023-11-30 07:53:28

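I am implementing a class that should receive a large text file. I want to split the file into chunks, with each chunk handed to a different thread that counts the frequency of every character in that chunk. I expected better performance from starting more threads, but performance actually gets worse as the thread count grows. Here is my code: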

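import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

public class Main {

    public static void main(String[] args)
            throws IOException, InterruptedException, ExecutionException, ParseException
    {
        // save the current run's start time
        long startTime = System.currentTimeMillis();

        // create options
        Options options = new Options();
        options.addOption("t", true, "number of threads to start");

        // parse options; fall back to a single thread when -t is absent
        CommandLineParser parser = new DefaultParser();
        CommandLine cmd = parser.parse(options, args);
        int numberOfThreads = Integer.parseInt(cmd.getOptionValue("t", "1"));

        // map the file (the first non-option argument) into memory;
        // a single mapping is limited to 2 GB
        RandomAccessFile raf = new RandomAccessFile(cmd.getArgs()[0], "r");
        MappedByteBuffer mbb
                = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());

        ExecutorService pool = Executors.newFixedThreadPool(numberOfThreads);
        List<Future<int[]>> futures = new ArrayList<>();

        // at least 1 byte per chunk, otherwise the loop below never advances
        long chunkSize = Math.max(1, raf.length() / numberOfThreads);
        byte[] buffer = new byte[(int) chunkSize];

        while (mbb.hasRemaining())
        {
            int remaining = Math.min(buffer.length, mbb.remaining());
            mbb.get(buffer, 0, remaining);
            // decode only the bytes just read; the tail of the buffer may
            // still hold stale data from the previous chunk
            String content = new String(buffer, 0, remaining, StandardCharsets.ISO_8859_1);
            Callable<int[]> callable = new FrequenciesCounter(content);
            futures.add(pool.submit(callable));
        }

        raf.close();

        // let's assume we will use extended ASCII characters only
        int alphabet = 256;

        // hold how many times each character is contained in the input file
        int[] frequencies = new int[alphabet];

        // sum the frequencies from each thread
        for (Future<int[]> future : futures)
        {
            int[] partial = future.get();
            for (int i = 0; i < alphabet; i++)
            {
                frequencies[i] += partial[i];
            }
        }

        // let the worker threads exit so the JVM can terminate
        pool.shutdown();
    }

}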

// helper class for multithreaded frequency counting
class FrequenciesCounter implements Callable<int[]>
{
    private final int[] frequencies = new int[256];
    private final char[] content;

    public FrequenciesCounter(String input)
    {
        content = input.toCharArray();
    }

    @Override
    public int[] call()
    {
        System.out.println("Thread " + Thread.currentThread().getName() + " start");

        // ISO-8859-1 decoding guarantees every char is in the range 0..255
        for (int i = 0; i < content.length; i++)
        {
            frequencies[content[i]]++;
        }

        System.out.println("Thread " + Thread.currentThread().getName() + " finished");

        return frequencies;
    }
}

Best answer

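As suggested in the comments, you will (usually) not get better performance by reading from multiple threads. Instead, you should process the chunks you have read on multiple threads. Processing usually involves some blocking I/O operation (saving to another file? saving to a database? an HTTP call?), and your performance will improve if that processing runs on multiple threads.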
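A minimal sketch of that split, assuming a single reader thread and a fixed pool of workers. The class `ChunkedByteCounter` and its methods `countChunk` and `countFile` are hypothetical names chosen for illustration; they are not part of the question's code. The sketch counts raw byte values instead of decoding each chunk into a String.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: the calling thread reads, a pool of workers counts.
final class ChunkedByteCounter {

    // Counts how often each byte value (0..255) occurs in the first `length` bytes.
    static long[] countChunk(byte[] chunk, int length) {
        long[] counts = new long[256];
        for (int i = 0; i < length; i++) {
            counts[chunk[i] & 0xFF]++; // mask so the byte is treated as unsigned
        }
        return counts;
    }

    // Reads the file sequentially and hands each chunk to the pool.
    static long[] countFile(Path file, int workers, int chunkSize)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try (InputStream in = Files.newInputStream(file)) {
            List<Future<long[]>> partials = new ArrayList<>();
            while (true) {
                // A fresh array per chunk, so no worker sees another chunk's data.
                byte[] chunk = new byte[chunkSize];
                int read = in.readNBytes(chunk, 0, chunkSize); // 0 means end of file
                if (read == 0) {
                    break;
                }
                partials.add(pool.submit(() -> countChunk(chunk, read)));
            }
            // Merge the per-chunk results on the calling thread.
            long[] total = new long[256];
            for (Future<long[]> partial : partials) {
                long[] counts = partial.get();
                for (int i = 0; i < 256; i++) {
                    total[i] += counts[i];
                }
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call might look like `ChunkedByteCounter.countFile(Path.of("input.txt"), 4, 1 << 20)`. Note that the reader can run ahead of slow workers, so queued chunks may accumulate in memory; for very large files some form of back-pressure would be needed.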

For processing, you can use an ExecutorService (with a sensible number of threads). Use java.util.concurrent.Executors to obtain an instance of java.util.concurrent.ExecutorService.
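As a rough illustration of obtaining such an instance: the sketch below sizes a fixed pool from the number of available processors. That sizing rule is an assumption that tends to suit CPU-bound work; tasks that mostly wait on I/O may benefit from more threads. `PoolFactory` and its methods are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical factory for a processing pool.
final class PoolFactory {

    // Assumption: one thread per available core is a reasonable starting point.
    static int sensibleThreadCount() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    // Creates a pool with a fixed number of worker threads.
    static ExecutorService newProcessingPool() {
        return Executors.newFixedThreadPool(sensibleThreadCount());
    }
}
```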

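Once you have an ExecutorService instance, you can submit your chunks for processing. Submitting a chunk does not block. The ExecutorService will start processing each chunk on a separate thread (the details depend on how the ExecutorService is configured). You can submit instances of Runnable or Callable.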
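The following sketch illustrates both points: `submit` returns a Future immediately while the task is still running, and it accepts either a Callable (which returns a value) or a Runnable (which does not). `SubmitDemo` is a hypothetical class, and the CountDownLatch exists only to hold the first task open for the demonstration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo of submitting a Callable and a Runnable.
final class SubmitDemo {

    static int demo() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CountDownLatch gate = new CountDownLatch(1);

            // Callable: blocks on the gate, then returns a value.
            Future<Integer> length = pool.submit(() -> {
                gate.await();
                return "chunk".length();
            });

            // Runnable: does some work but produces no result.
            Future<?> done = pool.submit(
                    () -> System.out.println("processed on " + Thread.currentThread().getName()));

            // submit() has already returned although the Callable is still waiting.
            if (length.isDone()) {
                throw new IllegalStateException("task finished before the gate was opened");
            }

            gate.countDown();    // let the Callable finish
            done.get();          // wait for the Runnable
            return length.get(); // wait for the Callable and fetch its result
        } finally {
            pool.shutdown();
        }
    }
}
```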

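Finally, after you have submitted all items, call shutdown() and then awaitTermination on your ExecutorService. awaitTermination waits until all submitted items have been processed or its timeout expires; it can only report termination once a shutdown has been requested, which is why shutdown() comes first. If awaitTermination returns before the pool has terminated, call shutdownNow() to abort processing (otherwise the pool may hang indefinitely on some rogue task).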
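One common way to arrange these calls is sketched below: request an orderly shutdown, wait with a timeout, and abort only if the timeout expires. `PoolShutdown.shutdownAndWait` is a hypothetical helper, and the timeout is whatever the caller considers acceptable.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper that shuts a pool down and waits for its tasks.
final class PoolShutdown {

    // Returns true if every submitted task finished within the timeout.
    static boolean shutdownAndWait(ExecutorService pool, long timeout, TimeUnit unit)
            throws InterruptedException {
        pool.shutdown(); // stop accepting new tasks; already submitted tasks still run
        if (pool.awaitTermination(timeout, unit)) {
            return true;
        }
        // Timed out: interrupt running tasks and drop the ones still queued.
        List<Runnable> neverStarted = pool.shutdownNow();
        System.err.println("Aborted; " + neverStarted.size() + " task(s) never started");
        return false;
    }
}
```

Note that shutdownNow() only interrupts the worker threads; a task that ignores interruption can still keep its thread alive.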

Regarding java - reading a large file with multiple threads, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44734483/
