LLM Applications in Practice: Automatic Aggregation of Financial News

Reposted by: 撒哈拉 | Last updated: 2024-12-16 14:47:52

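1. Background

Work has been busy lately, so my own study has slipped a bit. Still, I carved out some time to support a reader's request: crawl news from several financial websites and aggregate it automatically.

After reading the earlier article 《AI资讯的自动聚合及报告生成》 ("Automatic Aggregation of AI News and Report Generation"), the reader wanted to apply the same pipeline to the finance domain. All told, the work took about 2-3 days.

Note: web scraping is not my strong suit, just a small hobby. Also, this article is meant for personal study; please do not use it directly for commercial purposes.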

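2. Challenges

1. Choosing a crawler framework: I used crawl4ai, which I had picked up on the fly for the earlier article, as the base framework, and relied on its more advanced features to approximate a person browsing in a real browser, because these sites all have anti-scraping mechanisms such as authentication and cookies.

2. Overseas news: reaching these sites requires a proxy.

3. Parsing news content: this took the most effort. Parsing the HTML itself is not hard; the main difficulty is handling dynamically loaded pages through crawl4ai, and every news site is structured differently.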

3. Data sources

Source	URL	Notes
Cailian Press (CLS)	https://www.cls.cn/depth?id=1000 , https://www.cls.cn/depth?id=1003 , https://www.cls.cn/depth?id=1007	1000: headlines; 1003: A-shares; 1007: global
Ifeng	https://finance.ifeng.com/shanklist/1-64-/	
Sina	https://finance.sina.com.cn/roll/#pageid=384&lid=2519&k=&num=50&page=1 , https://finance.sina.com.cn/roll/#pageid=384&lid=2672&k=&num=50&page=1	2519: finance; 2672: US stocks
Huanqiu (Global Times)	https://finance.huanqiu.com	
Zaobao	https://www.zaobao.com/finance/china , https://www.zaobao.com/finance/world	China and world
Fox News	https://www.foxnews.com/category/us/economy , https://www.foxnews.com//world/global-economy	US and world
CNN	https://edition.cnn.com/business , https://edition.cnn.com/business/china	China and world
Reuters	https://www.reuters.com/business	
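
If more of these sources are wired in later, the table could be kept as data instead of being hard-coded in each crawler. The following is only a sketch under that assumption: the SOURCES mapping and the urls_for helper are hypothetical names that do not appear in the original code, and only the CLS and Sina rows are shown, with URLs and labels copied from the table.

```python
from typing import Dict, List, Tuple

# Hypothetical registry: domain -> list of (list-page URL, category label).
SOURCES: Dict[str, List[Tuple[str, str]]] = {
    'cls': [
        (f'https://www.cls.cn/depth?id={cid}', label)
        for cid, label in (('1000', 'headlines'), ('1003', 'A-shares'), ('1007', 'global'))
    ],
    'sina': [
        (f'https://finance.sina.com.cn/roll/#pageid=384&lid={lid}&k=&num=50&page=1', label)
        for lid, label in (('2519', 'finance'), ('2672', 'US stocks'))
    ],
}


def urls_for(domain: str) -> List[str]:
    """Return the list-page URLs registered for a domain (empty if unknown)."""
    return [url for url, _ in SOURCES.get(domain, [])]


print(urls_for('cls'))
```

A per-site crawler would then loop over SOURCES[domain] rather than carrying its own menu dictionary.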

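4. Partial source code

To limit risk, I list only the parsing code for the Cailian Press (cls.cn) pages. Readers who want to discuss further can contact me by private message.

Notes on the code

1. schema is a JSON-format description combined with CSS selectors; given a schema, crawl4ai can extract specific elements into structured data.

2. js_commands holds JavaScript code, used mainly to simulate scrolling down and loading more items while browsing the news list.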
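
To make point 1 concrete, here is a minimal sketch of the structure the list-page schema is expected to produce and how it is flattened into absolute links. The sample data and the helper name flatten_links are made up for illustration. The shape (one dict per baseSelector match, each field holding a list of {'href': ...} items) is inferred from how the crawler code in this article consumes the result, not from the crawl4ai documentation.

```python
from typing import Dict, List

BASE_URL = 'https://www.cls.cn'


def flatten_links(extracted: List[Dict[str, List[Dict[str, str]]]],
                  category: str) -> Dict[str, str]:
    """Map every unique absolute link to its category label."""
    if not extracted:
        return {}
    # Collect hrefs from all fields of the first baseSelector match,
    # skipping items where no <a href> was found.
    hrefs = {
        item['href']
        for items in extracted[0].values()
        for item in items
        if 'href' in item
    }
    return {f'{BASE_URL}{href}': category for href in sorted(hrefs)}


# Hypothetical extraction result for one list page.
sample = [{
    'top_titles': [{'href': '/detail/111'}],
    'sec_titles': [{'href': '/detail/222'}, {'href': '/detail/111'}],
    'bottom_titles': [{}, {'href': '/detail/333'}],
}]

links = flatten_links(sample, '头条')
print(links)
```

Using a set removes the duplicates that arise when the same article matches more than one selector.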

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
from typing import Dict, Any, Union, List
import os
import datetime
import re
import hashlib


def md5(text):
    m = hashlib.md5()
    m.update(text.encode('utf-8'))
    return m.hexdigest()


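def get_datas(file_path, json_flag=True, all_flag=False, mode='r'):
    """Read a text file.

    all_flag=True treats the whole file as one JSON document (or one string);
    otherwise each non-empty line is parsed as JSON (JSON Lines) or stripped.
    """
    with open(file_path, mode, encoding='utf-8') as f:
        if all_flag:
            text = f.read()
            if json_flag:
                return json.loads(text)
            return '\n'.join(line.strip() for line in text.splitlines())
        results = []
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines so json.loads does not fail
            results.append(json.loads(line) if json_flag else line)
        return results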
    

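def save_datas(file_path, datas, json_flag=True, all_flag=False, with_indent=False, mode='w'):
    """Save data to a text file (JSON Lines unless all_flag is set)."""
    # Create the parent directory (e.g. data/) if it does not exist yet.
    os.makedirs(os.path.dirname(file_path) or '.', exist_ok=True)
    with open(file_path, mode, encoding='utf-8') as f: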
        if all_flag:
            if json_flag:
                f.write(json.dumps(datas, ensure_ascii=False, indent= 4 if with_indent else None))
            else:
                f.write(''.join(datas))
        else:
            for data in datas:
                if json_flag:
                    f.write(json.dumps(data, ensure_ascii=False) + '\n') 
                else:
                    f.write(data + '\n')


class AbstractAICrawler():
    """Base class; subclasses must implement crawl()."""

    def __init__(self) -> None:
        pass

    def crawl(self):
        raise NotImplementedError()


class AINewsCrawler(AbstractAICrawler):
    def __init__(self, domain) -> None:
        super().__init__()
        self.domain = domain
        self.file_path = f'data/{self.domain}.json'
        self.history = self.init()
    
    def init(self):
        if not os.path.exists(self.file_path):
            return {}
        return {ele['id']: ele for ele in get_datas(self.file_path)}
    
    def save(self, datas: Union[List, Dict]):
        if isinstance(datas, dict):
            datas = [datas]
        self.history.update({ele['id']: ele for ele in datas})
        save_datas(self.file_path, datas=list(self.history.values()))
    
    async def crawl(self, url: str,
                    schema: Dict[str, Any] = None,
                    always_by_pass_cache=True,
                    bypass_cache=True,
                    headless=True,
                    verbose=False,
                    magic=True,
                    page_timeout=15000,
                    delay_before_return_html=2.0,
                    wait_for='',
                    js_code=None,
                    js_only=False,
                    screenshot=False,
                    headers=None):
        # Use CSS-based structured extraction only when a schema is supplied.
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=verbose) if schema else None

        # headers defaults to None to avoid sharing a mutable default dict.
        async with AsyncWebCrawler(verbose=verbose,
                                   headless=headless,
                                   always_by_pass_cache=always_by_pass_cache,
                                   headers=headers or {}) as crawler:
            result = await crawler.arun(
                url=url,
                extraction_strategy=extraction_strategy,
                bypass_cache=bypass_cache,
                page_timeout=page_timeout,
                delay_before_return_html=delay_before_return_html,
                wait_for=wait_for,
                js_code=js_code,
                magic=magic,
                remove_overlay_elements=True,
                process_iframes=True,
                exclude_external_links=True,
                js_only=js_only,
                screenshot=screenshot
            )

            # Raise explicitly; assert statements are skipped under python -O.
            if not result.success:
                raise RuntimeError(f'Failed to crawl the page: {url}')
            if schema:
                res = json.loads(result.extracted_content)
                if screenshot:
                    return res, result.screenshot
                return res
            return result.html


class FinanceNewsCrawler(AINewsCrawler):
    
    def __init__(self, domain='') -> None:
        super().__init__(domain)
    
    def save(self, datas: Union[List, Dict]):
        if isinstance(datas, dict):
            datas = [datas]
        self.history.update({ele['id']: ele for ele in datas})
        save_datas(self.file_path, datas=datas, mode='a')
    
    async def get_last_day_data(self):
        last_day = (datetime.date.today() - datetime.timedelta(days=1)).strftime('%Y-%m-%d')
        datas = self.init()
        return [v for v in datas.values() if last_day in v['date']]
    

class CLSCrawler(FinanceNewsCrawler):
    """
        Crawler for Cailian Press (cls.cn) news
    """
    def __init__(self) -> None:
        self.domain = 'cls'
        super().__init__(self.domain)
        self.url = 'https://www.cls.cn'
        
    async def crawl_url_list(self, url='https://www.cls.cn/depth?id=1000'):
        schema = {
            'name': 'caijingwang toutiao page crawler',
            'baseSelector': 'div.f-l.content-left',
            'fields': [
                {
                    'name': 'top_titles',
                    'selector': 'div.depth-top-article-list',
                    'type': 'nested_list',
                    'fields': [
                        {'name': 'href', 'type': 'attribute', 'attribute':'href', 'selector': 'a[href]'}
                    ]
                },
                {
                    'name': 'sec_titles',
                    'selector': 'div.depth-top-article-list  li.f-l',
                    'type': 'nested_list',
                    'fields': [
                        {'name': 'href', 'type': 'attribute', 'attribute':'href', 'selector': 'a[href]'}
                    ]
                },
                {
                    'name': 'bottom_titles',
                    'selector': 'div.b-t-1 div.clearfix',
                    'type': 'nested_list',
                    'fields': [
                        {'name': 'href', 'type': 'attribute', 'attribute':'href', 'selector': 'a[href]'}
                    ]
                }
            ]
        }
        
        js_commands = [
            """
            (async () => {
                // Give the page a moment to settle before interacting with it.
                await new Promise(resolve => setTimeout(resolve, 500));

                const targetItemCount = 100;

                let currentItemCount = document.querySelectorAll('div.b-t-1 div.clearfix a.f-w-b').length;
                let loadMoreButton = document.querySelector('.list-more-button.more-button');

                // Scroll down and click "load more" until enough items are present.
                while (currentItemCount < targetItemCount) {
                    window.scrollTo(0, document.body.scrollHeight);

                    await new Promise(resolve => setTimeout(resolve, 1000));

                    if (loadMoreButton) {
                        loadMoreButton.click();
                    } else {
                        console.log('Load-more button not found');
                        break;
                    }

                    await new Promise(resolve => setTimeout(resolve, 1000));

                    currentItemCount = document.querySelectorAll('div.b-t-1 div.clearfix a.f-w-b').length;

                    loadMoreButton = document.querySelector('.list-more-button.more-button');
                }
                console.log(`Loaded ${currentItemCount} items`);
                return currentItemCount;
            })();
            """
        ]
        wait_for = ''
        
        results = {}
        
        # Channel id -> category label stored with each article.
        menu_dict = {
            '1000': '头条',  # headlines
            '1003': 'A股',  # A-shares
            '1007': '环球'  # global
        }
        for k, v in menu_dict.items():
            url = f'https://www.cls.cn/depth?id={k}'
            try:
                links = await super().crawl(url, schema, always_by_pass_cache=True, bypass_cache=True, js_code=js_commands, wait_for=wait_for, js_only=False)
            except Exception as e:
                print(f'error {url}: {e}')
                links = []
            if links:
                links = [ele['href'] for eles in links[0].values() for ele in eles if 'href' in ele]
            links = sorted(list(set(links)), key=lambda x: x)
            results.update({f'{self.url}{ele}': v for ele in links})
        return results
    
    async def crawl_newsletter(self, url, category):
        schema = {
            'name': 'CLS news detail page',
            'baseSelector': 'div.f-l.content-left',
            'fields': [
                {
                    'name': 'title',
                    'selector': 'span.detail-title-content',
                    'type': 'text'
                },
                {
                    'name': 'time',
                    'selector': 'div.m-r-10',
                    'type': 'text'
                },
                {
                    'name': 'abstract',
                    'selector': 'pre.detail-brief',
                    'type': 'text'
                },
                {
                    'name': 'contents',
                    'selector': 'div.detail-content p',
                    'type': 'list',
                    'fields': [
                        {'name': 'content', 'type': 'text'}
                    ]
                },
                {
                    'name': 'read_number',
                    'selector': 'div.detail-option-readnumber',
                    'type': 'text'
                }
            ]
        }
        
        wait_for = 'div.detail-content'
        try:
            results = await super().crawl(url, schema, always_by_pass_cache=True, bypass_cache=True, wait_for=wait_for)
            result = results[0]
        except Exception as e:
            print(f'crawler error: {url}: {e}')
            return {}
        
        # Use .get so a field missing from the page does not raise KeyError.
        return {
            'title': result.get('title', ''),
            'abstract': result.get('abstract', ''),
            'date': result.get('time', ''),  # publication time as shown on the page
            'link': url,
            'content': '\n'.join([ele['content'] for ele in result.get('contents', []) if 'content' in ele and ele['content']]),
            'id': md5(url),
            'type': category,
            'read_number': await self.get_first_float_number(result.get('read_number', ''), r'[-+]?\d*\.\d+|\d+'),
            'time': datetime.datetime.now().strftime('%Y-%m-%d')  # crawl date
        }
    
    async def get_first_float_number(self, text, pattern):
        match = re.search(pattern, text)
        if match:
            return round(float(match.group()), 4)
        return 0
    
    async def crawl(self):
        link_2_category = await self.crawl_url_list()
        for link, category in link_2_category.items():
            _id = md5(link)
            if _id in self.history:
                continue
            news = await self.crawl_newsletter(link, category)
            if news:
                self.save(news)
        return await self.get_last_day_data()
    
if __name__ == '__main__':
    asyncio.run(CLSCrawler().crawl())
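
The incremental part of the crawler (hash each link, skip ids already in the history file, append new records) can be tried without a browser. The sketch below uses only the standard library. The temporary file path, the sample links, and the stub fake_fetch are hypothetical stand-ins; fake_fetch takes the place of crawl_newsletter.

```python
import hashlib
import json
import os
import tempfile


def md5(text: str) -> str:
    return hashlib.md5(text.encode('utf-8')).hexdigest()


def load_history(path: str) -> dict:
    """Load a JSON Lines file into {id: record}; a missing file means no history."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding='utf-8') as f:
        records = [json.loads(line) for line in f if line.strip()]
    return {rec['id']: rec for rec in records}


def fake_fetch(link: str) -> dict:
    # Stand-in for the real page crawl; returns a minimal record.
    return {'id': md5(link), 'link': link, 'title': f'title of {link}'}


def crawl_new(links, path):
    """Fetch only links whose id is not yet in the history and append them."""
    history = load_history(path)
    fetched = []
    with open(path, 'a', encoding='utf-8') as f:
        for link in links:
            if md5(link) in history:
                continue  # already stored in an earlier run
            rec = fake_fetch(link)
            f.write(json.dumps(rec, ensure_ascii=False) + '\n')
            history[rec['id']] = rec
            fetched.append(link)
    return fetched


path = os.path.join(tempfile.mkdtemp(), 'cls.json')
first = crawl_new(['https://www.cls.cn/detail/1', 'https://www.cls.cn/detail/2'], path)
second = crawl_new(['https://www.cls.cn/detail/2', 'https://www.cls.cn/detail/3'], path)
print(first)
print(second)
```

The second run fetches only the link it has not seen, which is why FinanceNewsCrawler.save can append to the file instead of rewriting it.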

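5. Summary

One sentence is enough:

I built a tool that automatically aggregates news, based on the crawl4ai framework.

If you have questions, feel free to send a private message or leave a comment.

6. References

(1) Crawl4ai: https://github.com/unclecode/crawl4ai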
