
java - Logic of the WebCrawler stop method (Java Concurrency in Practice, 7.2.5)

Reposted. Author: 行者123. Updated: 2023-11-30 10:33:04

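I have read section 7.2.5 (Limitations of shutdownNow) of Java Concurrency in Practice.

The problem with shutdownNow is that it returns only the tasks that were never started.

First, we create an ExecutorService that keeps track of the tasks cancelled after shutdown.

TrackingExecutor: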

/**
 * TrackingExecutor
 * <p/>
 * ExecutorService that keeps track of cancelled tasks after shutdown
 *
 * @author Brian Goetz and Tim Peierls
 */
public class TrackingExecutor extends AbstractExecutorService {
    private final ExecutorService exec;
    private final Set<Runnable> tasksCancelledAtShutdown =
            Collections.synchronizedSet(new HashSet<Runnable>());

    public TrackingExecutor(ExecutorService exec) {
        this.exec = exec;
    }

    public void shutdown() {
        exec.shutdown();
    }

    public List<Runnable> shutdownNow() {
        return exec.shutdownNow();
    }

    public boolean isShutdown() {
        return exec.isShutdown();
    }

    public boolean isTerminated() {
        return exec.isTerminated();
    }

    public boolean awaitTermination(long timeout, TimeUnit unit)
            throws InterruptedException {
        return exec.awaitTermination(timeout, unit);
    }

    public List<Runnable> getCancelledTasks() {
        if (!exec.isTerminated())
            throw new IllegalStateException(/*...*/);
        return new ArrayList<Runnable>(tasksCancelledAtShutdown);
    }

    public void execute(final Runnable runnable) {
        exec.execute(new Runnable() {
            public void run() {
                try {
                    runnable.run();
                } finally {
                    if (isShutdown()
                            && Thread.currentThread().isInterrupted())
                        tasksCancelledAtShutdown.add(runnable);
                }
            }
        });
    }
}

Then we create a crawler that uses TrackingExecutor.

WebCrawler:

/**
 * WebCrawler
 * <p/>
 * Using TrackingExecutorService to save unfinished tasks for later execution
 *
 * @author Brian Goetz and Tim Peierls
 */
public abstract class WebCrawler {
    private volatile TrackingExecutor exec;
    @GuardedBy("this") private final Set<URL> urlsToCrawl = new HashSet<URL>();

    private final ConcurrentMap<URL, Boolean> seen = new ConcurrentHashMap<URL, Boolean>();
    private static final long TIMEOUT = 500;
    private static final TimeUnit UNIT = MILLISECONDS;

    public WebCrawler(URL startUrl) {
        urlsToCrawl.add(startUrl);
    }

    public synchronized void start() {
        exec = new TrackingExecutor(Executors.newCachedThreadPool());
        for (URL url : urlsToCrawl) submitCrawlTask(url);
        urlsToCrawl.clear();
    }

    public synchronized void stop() throws InterruptedException {
        try {
            saveUncrawled(exec.shutdownNow());
            if (exec.awaitTermination(TIMEOUT, UNIT))
                saveUncrawled(exec.getCancelledTasks());
        } finally {
            exec = null;
        }
    }

    protected abstract List<URL> processPage(URL url);

    private void saveUncrawled(List<Runnable> uncrawled) {
        for (Runnable task : uncrawled)
            urlsToCrawl.add(((CrawlTask) task).getPage());
    }

    private void submitCrawlTask(URL u) {
        exec.execute(new CrawlTask(u));
    }

    private class CrawlTask implements Runnable {
        private final URL url;

        CrawlTask(URL url) {
            this.url = url;
        }

        private int count = 1;

        boolean alreadyCrawled() {
            return seen.putIfAbsent(url, true) != null;
        }

        void markUncrawled() {
            seen.remove(url);
            System.out.printf("marking %s uncrawled%n", url);
        }

        public void run() {
            for (URL link : processPage(url)) {
                if (Thread.currentThread().isInterrupted())
                    return;
                submitCrawlTask(link);
            }
        }

        public URL getPage() {
            return url;
        }
    }
}

Let's examine the stop method:

public synchronized void stop() throws InterruptedException {
    try {
        saveUncrawled(exec.shutdownNow()); //1
        if (exec.awaitTermination(TIMEOUT, UNIT)) //2
            saveUncrawled(exec.getCancelledTasks()); //3
    } finally {
        exec = null;
    }
}

saveUncrawled(exec.shutdownNow()); //1

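In line 1 we call shutdownNow and save the returned (never started) tasks.
If I understand correctly, shutdownNow returns the tasks that were never started and interrupts the ones that are already running.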

exec.awaitTermination(TIMEOUT, UNIT) //2

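We also want to add the cancelled tasks to this set. In line 2 we specify a timeout and wait for termination.

Question 1

Why do we set a timeout for this operation?

As I understand it, shutdownNow interrupts the in-progress tasks anyway, so I see no reason to wait.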

exec.getCancelledTasks() 

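The awaitTermination method returns true when the tasks have completed successfully, so it is not clear to me why we try to add the cancelled tasks in that case.

Please clarify the logic of the stop method.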

Best answer

Regarding the timeout of boolean awaitTermination(long timeout, TimeUnit unit):

Interrupting a thread does not necessarily stop it immediately (or at all). Quoting the Java Tutorial on Interrupts:

An interrupt is an indication to a thread that it should stop what it is doing and do something else. It's up to the programmer to decide exactly how a thread responds to an interrupt, but it is very common for the thread to terminate. This is the usage emphasized in this lesson.
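To make the quoted point concrete, here is a minimal sketch of two common ways a thread can choose to respond to an interrupt: polling the interrupt flag, and reacting to the InterruptedException thrown by a blocking call. The class name InterruptResponseSketch, the helper stopsAfterInterrupt, and the 2-second grace period are assumptions made for illustration; none of them come from the book or the original post.

```java
class InterruptResponseSketch {
    /** Style 1: poll the interrupt flag between units of non-blocking work. */
    static final Runnable POLLING = () -> {
        long units = 0;
        while (!Thread.currentThread().isInterrupted()) {
            units++; // stand-in for one unit of work
        }
    };

    /** Style 2: a blocking call such as sleep() throws InterruptedException. */
    static final Runnable BLOCKING = () -> {
        try {
            Thread.sleep(60_000);
        } catch (InterruptedException e) {
            // Treat the interrupt as a request to stop and restore the flag.
            Thread.currentThread().interrupt();
        }
    };

    /** Starts a worker, interrupts it, and reports whether it ended within the grace period. */
    static boolean stopsAfterInterrupt(Runnable body) {
        Thread worker = new Thread(body);
        worker.start();
        worker.interrupt(); // only sets a flag; the worker decides what to do
        try {
            worker.join(2_000); // hypothetical grace period
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("polling worker stopped:  " + stopsAfterInterrupt(POLLING));
        System.out.println("blocking worker stopped: " + stopsAfterInterrupt(BLOCKING));
    }
}
```

Neither worker is forced to stop here. Each one ends only because its own code checks for the interrupt and returns.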

The javadoc of ExecutorService#shutdownNow() says so directly as well:

There are no guarantees beyond best-effort attempts to stop processing actively executing tasks. For example, typical implementations will cancel via Thread.interrupt(), so any task that fails to respond to interrupts may never terminate.

The javadoc of Thread#interrupt() mentions further reasons why a thread may still be alive after an interrupt. For example:

Unless the current thread is interrupting itself, which is always permitted, the checkAccess method of this thread is invoked, which may cause a SecurityException to be thrown.

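The logic of the stop() method is not obvious without a careful read of the ExecutorService javadoc (see the second example in the "Usage Examples" section). The problem with shutdownNow() is that it attempts to cancel all actively executing tasks, but (a) this may take some time and (b) there is no guarantee that it will succeed (see above). awaitTermination(long, TimeUnit) lets you track that progress. I will go through the stop() method line by line: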

saveUncrawled(exec.shutdownNow());

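Initiates the shutdown of the ExecutorService and collects the tasks that were still waiting to be executed. Tasks that have already completed are ignored, and so are the tasks that are currently executing: shutdownNow() does not return them.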
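As a rough illustration of that first step, the sketch below keeps one task running and two tasks queued, then counts what shutdownNow() hands back. It assumes a single-thread executor so that queuing is easy to force (the crawler in the question uses a cached thread pool instead); the class and method names are made up for this example.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ShutdownNowReturnsQueuedOnly {
    /** Returns how many tasks shutdownNow() hands back with one task running and two queued. */
    static int countReturnedTasks() {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);

        // Occupies the only worker thread until it is interrupted.
        exec.execute(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // Interrupted by shutdownNow(): simply finish.
            }
        });
        // The worker is busy, so these two stay in the queue.
        exec.execute(() -> { });
        exec.execute(() -> { });

        try {
            started.await(); // make sure the first task is really running
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // The running task is interrupted but NOT part of the returned list.
        List<Runnable> neverStarted = exec.shutdownNow();
        return neverStarted.size();
    }

    public static void main(String[] args) {
        System.out.println("returned by shutdownNow: " + countReturnedTasks());
    }
}
```

Only the two queued tasks come back. The task that was running when shutdownNow() was called is exactly the kind of task that TrackingExecutor exists to record.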

if (exec.awaitTermination(TIMEOUT, UNIT))

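shutdownNow() only signals the currently running tasks, via interruption, that they should stop. It does not kill them. Stopping also takes time after the interrupt arrives, so you have to wait for execution to finish. The timeout keeps you from blocking forever in case some task never finishes (for whatever reason). Keep in mind that threads can ignore interrupts, or may need longer to stop than the timeout allows. So some tasks may still be running after awaitTermination(TIMEOUT, UNIT) returns. TrackingExecutor records only the tasks that were actually cancelled, not those that may still be executing when the timeout expires.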
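The effect of the timeout can be sketched with one task that honours the interrupt and one that swallows it. This is an illustrative example, not code from the book: the class name, the honoursInterrupt parameter, and the release latch (used only so the stubborn task can end and the JVM can exit) are assumptions. The 500 ms timeout mirrors TIMEOUT and UNIT in WebCrawler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class AwaitTerminationTimeoutSketch {
    /**
     * Runs one task, calls shutdownNow(), and returns what awaitTermination reports.
     * If honoursInterrupt is false, the task swallows interrupts and keeps waiting.
     */
    static boolean terminatesInTime(boolean honoursInterrupt) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        exec.execute(() -> {
            started.countDown();
            while (release.getCount() > 0) {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    if (honoursInterrupt) {
                        return; // stop promptly when asked
                    }
                    // Otherwise ignore the interrupt and go back to waiting.
                }
            }
        });

        try {
            started.await();    // the task is running, so it cannot be returned as "queued"
            exec.shutdownNow(); // interrupts the running task
            return exec.awaitTermination(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let a stubborn task finish eventually
        }
    }

    public static void main(String[] args) {
        System.out.println("responsive task terminated in time: " + terminatesInTime(true));
        System.out.println("stubborn task terminated in time:   " + terminatesInTime(false));
    }
}
```

With the responsive task the executor terminates well inside the timeout and awaitTermination returns true. With the stubborn task the call gives up after 500 ms and returns false instead of blocking the caller indefinitely.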

saveUncrawled(exec.getCancelledTasks());

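If every task could be cancelled, awaitTermination() returns true, and in that case all cancelled tasks are collected. The check is also required because getCancelledTasks() throws IllegalStateException when the executor has not yet terminated. If not every task could be cancelled within the timeout (awaitTermination() returns false), some tasks remain unaccounted for.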
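Putting the three steps together, the following sketch reuses the wrapping idea from TrackingExecutor.execute in a compact, self-contained form. It is an assumption-laden illustration rather than the book's code: the class name, the single-thread executor, and the returned int pair are all invented for this example. Note that the running task restores its interrupted status; the tracking check relies on tasks doing so.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TrackCancelledSketch {
    /** Returns {tasks handed back by shutdownNow, tasks recorded as cancelled while running}. */
    static int[] shutdownAndCount() {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Set<Runnable> cancelledAtShutdown =
                Collections.synchronizedSet(new HashSet<Runnable>());
        CountDownLatch started = new CountDownLatch(1);

        Runnable running = () -> {
            started.countDown();
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupted status
            }
        };
        Runnable queued = () -> { };

        for (Runnable task : Arrays.asList(running, queued)) {
            // Same wrapping idea as TrackingExecutor.execute.
            exec.execute(() -> {
                try {
                    task.run();
                } finally {
                    if (exec.isShutdown() && Thread.currentThread().isInterrupted())
                        cancelledAtShutdown.add(task);
                }
            });
        }

        try {
            started.await();
            // Step 1: tasks that never started (here these are the wrappers, not the originals).
            List<Runnable> neverStarted = exec.shutdownNow();
            // Step 2: give interrupted tasks a bounded amount of time to finish.
            boolean terminated = exec.awaitTermination(500, TimeUnit.MILLISECONDS);
            // Step 3: trust the cancelled set only once the executor has terminated.
            int cancelled = terminated ? cancelledAtShutdown.size() : -1;
            return new int[] { neverStarted.size(), cancelled };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = shutdownAndCount();
        System.out.println("never started: " + counts[0] + ", cancelled while running: " + counts[1]);
    }
}
```

In this setup one task is reported by shutdownNow() and one by the cancelled set, which together cover every task that did not run to completion. If the running task swallowed the interrupt without restoring the flag, the finally block would not record it.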

For java - Logic of the WebCrawler stop method (Java Concurrency in Practice, 7.2.5), a similar question was found on Stack Overflow: https://stackoverflow.com/questions/42385664/
