
php - Detecting the MIME type of an Excel .xlsx file via PHP

Reposted. Author: 行者123. Last updated: 2023-12-04 20:47:43

I can't detect the MIME type of an xlsx Excel file via PHP, because the file is a ZIP archive.

The file utility:

file file.xlsx
file.xlsx: Zip archive data, at least v2.0 to extract

PECL fileinfo:
$finfo = finfo_open(FILEINFO_MIME_TYPE);
finfo_file($finfo, "file.xlsx");
application/zip

How can I validate it? Unpack it and inspect the structure? But what if it is a zip bomb?

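Best answer

Overview

PHP uses libmagic. When Magic reports the MIME type as "application/zip" instead of "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", it is because the files inside the ZIP archive are not in the order the detection rules expect.

This causes problems when uploading files to services that require the file extension and the MIME type to match. For example, MediaWiki-based wikis (written in PHP) block the upload of some XLSX files because they are detected as ZIP files.

The fix is to repair the XLSX by reordering the files written to the ZIP archive, so that Magic can detect the MIME type correctly.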

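Analyzing the files

For this example, we will analyze one XLSX file created with Openpyxl and one created with Excel.

The list of files in each archive can be viewed with unzip: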

$ unzip -l Openpyxl.xlsx
Archive: Openpyxl.xlsx
Length Date Time Name
--------- ---------- ----- ----
177 2019-12-21 04:34 docProps/app.xml
452 2019-12-21 04:34 docProps/core.xml
10140 2019-12-21 04:34 xl/theme/theme1.xml
22445 2019-12-21 04:34 xl/worksheets/sheet1.xml
586 2019-12-21 04:34 xl/tables/table1.xml
238 2019-12-21 04:34 xl/worksheets/_rels/sheet1.xml.rels
951 2019-12-21 04:34 xl/styles.xml
534 2019-12-21 04:34 _rels/.rels
552 2019-12-21 04:34 xl/workbook.xml
507 2019-12-21 04:34 xl/_rels/workbook.xml.rels
1112 2019-12-21 04:34 [Content_Types].xml
--------- -------
37694 11 files

$ unzip -l Excel.xlsx
Archive: Excel.xlsx
Length Date Time Name
--------- ---------- ----- ----
1476 1980-01-01 00:00 [Content_Types].xml
732 1980-01-01 00:00 _rels/.rels
831 1980-01-01 00:00 xl/_rels/workbook.xml.rels
1159 1980-01-01 00:00 xl/workbook.xml
239 1980-01-01 00:00 xl/sharedStrings.xml
293 1980-01-01 00:00 xl/worksheets/_rels/sheet1.xml.rels
6796 1980-01-01 00:00 xl/theme/theme1.xml
1540 1980-01-01 00:00 xl/styles.xml
1119 1980-01-01 00:00 xl/worksheets/sheet1.xml
39574 1980-01-01 00:00 docProps/thumbnail.wmf
785 1980-01-01 00:00 docProps/app.xml
169 1980-01-01 00:00 xl/calcChain.xml
513 1980-01-01 00:00 xl/tables/table1.xml
601 1980-01-01 00:00 docProps/core.xml
--------- -------
55827 14 files

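Note that the file order differs: the Openpyxl archive starts with the docProps files and ends with [Content_Types].xml, while the Excel archive starts with [Content_Types].xml.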
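The same comparison can be made from Python. The sketch below lists the entries in the order in which their local headers appear in the file, which is what a detector reading the first bytes would encounter. The function names and the in-memory demo archive are illustrative only and are not part of the original answer.

```python
import io
import zipfile


def entries_in_physical_order(source):
    """Return entry names sorted by the position of their local headers."""
    with zipfile.ZipFile(source) as archive:
        # header_offset is the byte offset of each entry's local file header.
        infos = sorted(archive.infolist(), key=lambda info: info.header_offset)
        return [info.filename for info in infos]


def build_demo_archive(names):
    """Build a small in-memory ZIP whose entries are written in the given order."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            archive.writestr(name, b"<x/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Hypothetical archive that mimics the start and end of the Openpyxl listing.
    demo = build_demo_archive(
        ["docProps/app.xml", "xl/workbook.xml", "[Content_Types].xml"]
    )
    for name in entries_in_physical_order(demo):
        print(name)
```

To inspect a real file, pass its path instead of the demo buffer, for example entries_in_physical_order("Openpyxl.xlsx").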

The MIME type can be checked with PHP:
<?php
echo mime_content_type('Openpyxl.xlsx') . "<br/>\n";
echo mime_content_type('Excel.xlsx');

Or with python-magic:
pip install python-magic

On Windows:
pip install python-magic-bin==0.4.14

Code:
import magic
mime = magic.Magic(mime=True)
print(mime.from_file("Openpyxl.xlsx"))
print(mime.from_file("Excel.xlsx"))

Output:
application/zip
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Solution

@adrilo has investigated this problem and worked out a solution:

Hey @garak,

After pulling my hair out for a few hours, I finally figured out why the mime type is wrong. It turns out the order in which the XML files gets added to the final ZIP file (an XLSX file being a ZIP file with the xlsx extension) matters for the heuristics used to detect types.

Currently, files are added in this order:

[Content_Types].xml
_rels/.rels
docProps/app.xml
docProps/core.xml
xl/_rels/workbook.xml.rels
xl/sharedStrings.xml
xl/styles.xml
xl/workbook.xml
xl/worksheets/sheet1.xml

The problem comes from inserting the "docProps" related files. It seems like the heuristic is to look at the first few bytes and check if it finds Content_Types and xl. By having the "docProps" files inserted in between, the first xl occurrence must happen outside of the first bytes the algorithm looks at and therefore concludes it's a simple zip file.

I'll try to fix this nicely


https://github.com/box/spout/issues/149#issuecomment-162049588

Fixes #149

Heuristics to detect proper mime type for XLSX files expect to see certain files at the beginning of the XLSX archive. The order in which the XML files are added therefore matters. Specifically, "[Content_Types].xml" should be added first, followed by the files located in the "xl" folder (at least 1 file).

https://github.com/box/spout/pull/152

According to Spout's FileSystemHelper.php:

In order to have the file's mime type detected properly, files need to be added to the zip file in a particular order. "[Content_Types].xml" then at least 2 files located in "xl" folder should be zipped first.

https://github.com/box/spout/blob/master/src/Spout/Writer/XLSX/Helper/FileSystemHelper.php#L382

The solution is to add the files "[Content_Types].xml", "xl/workbook.xml", and "xl/styles.xml" first, in that order, and then add the remaining files.
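To check whether an archive already follows this ordering, the rule described in the quotes can be approximated on the list of entry names. This is only a rough sketch of that description: the real rules live in libmagic's magic database and operate on byte offsets, not on names. The function name and the window parameter are assumptions made for illustration.

```python
CONTENT_TYPES = "[Content_Types].xml"


def order_looks_detectable(names, window=3):
    """Approximate the ordering rule quoted above (not libmagic's real logic).

    names  -- entry names in the order they were written to the archive
    window -- hypothetical number of leading entries that are inspected
    """
    # The first entry has to be the content types file.
    if not names or names[0] != CONTENT_TYPES:
        return False
    # One of the entries right after it has to live in the "xl" folder.
    return any(name.startswith("xl/") for name in names[1:window])


if __name__ == "__main__":
    # Leading entries taken from the listings and quotes in this answer.
    excel = [CONTENT_TYPES, "_rels/.rels", "xl/_rels/workbook.xml.rels"]
    openpyxl = ["docProps/app.xml", "docProps/core.xml", "xl/theme/theme1.xml"]
    old_spout = [CONTENT_TYPES, "_rels/.rels", "docProps/app.xml", "docProps/core.xml"]
    fixed = [CONTENT_TYPES, "xl/workbook.xml", "xl/styles.xml"]

    for label, names in [("Excel", excel), ("Openpyxl", openpyxl),
                         ("old Spout", old_spout), ("fixed", fixed)]:
        print(label, order_looks_detectable(names))
```

Under these assumptions the Excel and fixed orderings pass, while the Openpyxl and old Spout orderings fail, which matches the detection results reported above.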

Code

This Python script rewrites an XLSX file so that the files in the archive are in the correct order.
#!/usr/bin/env python3

from io import BytesIO
from zipfile import ZipFile, ZIP_DEFLATED

XL_FOLDER_NAME = "xl"

CONTENT_TYPES_XML_FILE_NAME = "[Content_Types].xml"
WORKBOOK_XML_FILE_NAME = "workbook.xml"
STYLES_XML_FILE_NAME = "styles.xml"

# These entries must be written first so the MIME type is detected correctly.
FIRST_NAMES = [
    CONTENT_TYPES_XML_FILE_NAME,
    f"{XL_FOLDER_NAME}/{WORKBOOK_XML_FILE_NAME}",
    f"{XL_FOLDER_NAME}/{STYLES_XML_FILE_NAME}",
]


def fix_workbook_mime_type(file_path):
    buffer = BytesIO()

    with ZipFile(file_path) as zip_file:
        names = zip_file.namelist()
        print(names)

        remaining_names = [name for name in names if name not in FIRST_NAMES]
        ordered_names = FIRST_NAMES + remaining_names
        print(ordered_names)

        with ZipFile(buffer, "w", ZIP_DEFLATED, allowZip64=True) as buffer_zip_file:
            for name in ordered_names:
                try:
                    # Copy the entry; read() raises KeyError if it is absent.
                    buffer_zip_file.writestr(name, zip_file.read(name))
                except KeyError:
                    # A FIRST_NAMES entry missing from the source is skipped.
                    pass

    # Overwrite the original only after both archives have been closed.
    with open(file_path, "wb") as file:
        file.write(buffer.getvalue())


def main(*args):
    fix_workbook_mime_type("File.xlsx")


if __name__ == "__main__":
    main()
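The question also asked how to validate an upload without unpacking it, given the risk of a zip bomb. One possible approach, sketched below, reads only the ZIP central directory: it checks that the expected parts are present and that the declared uncompressed sizes stay under a limit, without decompressing anything. The function name, the two limits, and the choice of required entries are assumptions (both listings above contain these two entries, but other producers may differ).

```python
import io
import zipfile

# Assumption: parts present in both listings shown earlier in this answer.
REQUIRED_ENTRIES = {"[Content_Types].xml", "xl/workbook.xml"}
MAX_TOTAL_UNCOMPRESSED = 100 * 1024 * 1024  # hypothetical limit: 100 MiB
MAX_ENTRIES = 10_000  # hypothetical limit on the number of entries


def looks_like_safe_xlsx(source):
    """Check structure and declared sizes using only the central directory."""
    try:
        with zipfile.ZipFile(source) as archive:
            infos = archive.infolist()  # metadata only, nothing is extracted
    except zipfile.BadZipFile:
        return False

    if len(infos) > MAX_ENTRIES:
        return False
    # file_size is the uncompressed size declared by the archive itself.
    if sum(info.file_size for info in infos) > MAX_TOTAL_UNCOMPRESSED:
        return False

    names = {info.filename for info in infos}
    return REQUIRED_ENTRIES.issubset(names)


if __name__ == "__main__":
    # Build a tiny in-memory archive to exercise the check.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", b"<Types/>")
        archive.writestr("xl/workbook.xml", b"<workbook/>")
    buffer.seek(0)
    print(looks_like_safe_xlsx(buffer))
```

The declared sizes come from the archive and can be forged, so code that later reads the entries should still cap the number of bytes it decompresses.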

This answer to "php - Detecting the MIME type of an Excel .xlsx file via PHP" comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/7274030/
