php - How can I extract the text content of a Word document with PHP?

Reposted. Author: IT王子. Updated: 2023-10-29 00:08:46

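I want to extract the text content of a Word document using PHP.

I created a new Word document in Microsoft Word for Mac 2011. Edit: I also tested with the same document created in Microsoft Word on Windows 7.

The content of the document is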

The quick brown fox jumps over the lazy dog

I saved it to disk as a Word 97-2004 document (.doc).

I am using phpoffice/phpword and this code to extract the text:

<?php

require 'vendor/autoload.php'; // Composer autoloader for phpoffice/phpword

$source = "word.doc";

// Load the legacy .doc file with the MsDoc reader
$phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');

$text = '';

$sections = $phpWord->getSections();

foreach ($sections as $s) {
    $els = $s->getElements();
    foreach ($els as $e) {
        if (get_class($e) === 'PhpOffice\PhpWord\Element\Text') {
            // Plain text run
            $text .= $e->getText();
        } elseif (get_class($e) === 'PhpOffice\PhpWord\Element\TextBreak') {
            // Line break element
            $text .= " \n";
        } else {
            throw new Exception('Unknown class type ' . get_class($e));
        }
    }
}

print $text;

The output of this code is only part of the text:

The quick brown fox j

Is something wrong with the code, or is this some kind of compatibility issue?

Edit:

If I add a var_dump($els); before foreach ($els as $e) {, the output is this:

array(1) {
[0]=>
object(PhpOffice\PhpWord\Element\Text)#1265 (14) {
["text":protected]=>
string(21) "The quick brown fox j"
["fontStyle":protected]=>
object(PhpOffice\PhpWord\Style\Font)#1267 (25) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"
}
["type":"PhpOffice\PhpWord\Style\Font":private]=>
string(4) "text"
["name":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["hint":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["size":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["color":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["bold":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["italic":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["underline":"PhpOffice\PhpWord\Style\Font":private]=>
string(4) "none"
["superScript":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["subScript":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["strikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["doubleStrikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["smallCaps":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["allCaps":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["fgColor":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["scale":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["kerning":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["paragraph":"PhpOffice\PhpWord\Style\Font":private]=>
object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"
}
["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(6) "Normal"
["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(0) ""
["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(true)
["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
int(0)
["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
array(0) {
}
["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["shading":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["rtl":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["paragraphStyle":protected]=>
object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"
}
["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(6) "Normal"
["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(0) ""
["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(true)
["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
int(0)
["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
array(0) {
}
["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["phpWord":protected]=>
object(PhpOffice\PhpWord\PhpWord)#1247 (3) {
["sections":"PhpOffice\PhpWord\PhpWord":private]=>
array(1) {
[0]=>
object(PhpOffice\PhpWord\Element\Section)#1261 (16) {
["container":protected]=>
string(7) "Section"
["style":"PhpOffice\PhpWord\Element\Section":private]=>
object(PhpOffice\PhpWord\Style\Section)#1262 (28) {
["orientation":"PhpOffice\PhpWord\Style\Section":private]=>
string(8) "portrait"
["paper":"PhpOffice\PhpWord\Style\Section":private]=>
object(PhpOffice\PhpWord\Style\Paper)#1263 (8) {
["sizes":"PhpOffice\PhpWord\Style\Paper":private]=>
array(6) {
["A3"]=>
array(3) {
[0]=>
int(297)
[1]=>
int(420)
[2]=>
string(2) "mm"
}
["A4"]=>
array(3) {
[0]=>
int(210)
[1]=>
int(297)
[2]=>
string(2) "mm"
}
["A5"]=>
array(3) {
[0]=>
int(148)
[1]=>
int(210)
[2]=>
string(2) "mm"
}
["Folio"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(13)
[2]=>
string(2) "in"
}
["Legal"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(14)
[2]=>
string(2) "in"
}
["Letter"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(11)
[2]=>
string(2) "in"
}
}
["size":"PhpOffice\PhpWord\Style\Paper":private]=>
string(2) "A4"
["width":"PhpOffice\PhpWord\Style\Paper":private]=>
int(11870)
["height":"PhpOffice\PhpWord\Style\Paper":private]=>
int(16787)
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["aliases":protected]=>
array(0) {
}
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["pageSizeW":"PhpOffice\PhpWord\Style\Section":private]=>
int(11906)
["pageSizeH":"PhpOffice\PhpWord\Style\Section":private]=>
int(16838)
["marginTop":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginLeft":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginRight":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginBottom":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["gutter":"PhpOffice\PhpWord\Style\Section":private]=>
int(0)
["headerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["footerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["pageNumberingStart":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["colsNum":"PhpOffice\PhpWord\Style\Section":private]=>
int(1)
["colsSpace":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["breakType":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["lineNumbering":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["aliases":protected]=>
array(0) {
}
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["headers":"PhpOffice\PhpWord\Element\Section":private]=>
array(0) {
}
["footers":"PhpOffice\PhpWord\Element\Section":private]=>
array(0) {
}
["elements":protected]=>
array(1) {
[0]=>
*RECURSION*
}
["phpWord":protected]=>
*RECURSION*
["sectionId":protected]=>
int(1)
["docPart":protected]=>
string(7) "Section"
["docPartId":protected]=>
int(1)
["elementIndex":protected]=>
int(1)
["elementId":protected]=>
NULL
["relationId":protected]=>
NULL
["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
int(0)
["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
NULL
["mediaRelation":protected]=>
bool(false)
["collectionRelation":protected]=>
bool(false)
}
}
["collections":"PhpOffice\PhpWord\PhpWord":private]=>
array(5) {
["Bookmarks"]=>
object(PhpOffice\PhpWord\Collection\Bookmarks)#1248 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Titles"]=>
object(PhpOffice\PhpWord\Collection\Titles)#1249 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Footnotes"]=>
object(PhpOffice\PhpWord\Collection\Footnotes)#1250 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Endnotes"]=>
object(PhpOffice\PhpWord\Collection\Endnotes)#1251 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Charts"]=>
object(PhpOffice\PhpWord\Collection\Charts)#1252 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
}
["metadata":"PhpOffice\PhpWord\PhpWord":private]=>
array(3) {
["DocInfo"]=>
object(PhpOffice\PhpWord\Metadata\DocInfo)#1253 (12) {
["creator":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["lastModifiedBy":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["created":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
int(1483515248)
["modified":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
int(1483515248)
["title":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["description":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["subject":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["keywords":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["category":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["company":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["manager":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["customProperties":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
array(0) {
}
}
["Protection"]=>
object(PhpOffice\PhpWord\Metadata\Protection)#1254 (1) {
["editing":"PhpOffice\PhpWord\Metadata\Protection":private]=>
NULL
}
["Compatibility"]=>
object(PhpOffice\PhpWord\Metadata\Compatibility)#1255 (1) {
["ooxmlVersion":"PhpOffice\PhpWord\Metadata\Compatibility":private]=>
int(12)
}
}
}
["sectionId":protected]=>
NULL
["docPart":protected]=>
string(7) "Section"
["docPartId":protected]=>
int(1)
["elementIndex":protected]=>
int(1)
["elementId":protected]=>
string(6) "5d531b"
["relationId":protected]=>
NULL
["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
int(0)
["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
string(7) "Section"
["mediaRelation":protected]=>
bool(false)
["collectionRelation":protected]=>
bool(false)
}
}
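
The dump is long mainly because each element carries a reference back to the whole PhpWord document. For debugging, a shorter summary may be easier to read. The following is a minimal sketch, not part of the original question: the helper name summarizeElements is hypothetical, and it only relies on get_class() and on a getText() method being present.

```php
<?php
/**
 * Return one short line per element: its index, class name and,
 * when the element exposes a string via getText(), the text length.
 * (Hypothetical helper, introduced here for illustration only.)
 */
function summarizeElements(array $elements): array
{
    $lines = [];
    foreach ($elements as $i => $element) {
        $line = sprintf('[%d] %s', $i, get_class($element));
        // Only report a length when getText() exists and yields a string.
        if (method_exists($element, 'getText') && is_string($element->getText())) {
            $line .= sprintf(' (%d chars)', strlen($element->getText()));
        }
        $lines[] = $line;
    }
    return $lines;
}

// Usage inside the loop over sections, in place of var_dump($els):
// echo implode("\n", summarizeElements($s->getElements())), "\n";
```

For the document in the question this should report a single PhpOffice\PhpWord\Element\Text element of 21 characters, which matches the truncated string in the dump.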

Best answer

Try creating your reader first:

$source = "word.doc";
// Create the reader object explicitly
$phpWordReader = \PhpOffice\PhpWord\IOFactory::createReader('MsDoc');
// Load the source only if the reader reports that it can read it
if ($phpWordReader->canRead($source)) {
    $phpWord = $phpWordReader->load($source);
    // ... rest of your code
}
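
The reader approach can be combined with a recursive walk, so that nested elements such as text runs or table cells are collected instead of raising an exception. This is a sketch under assumptions rather than the answer's own code: the function names collectText and extractDocText are hypothetical, Composer's autoloader is assumed to be loaded already, and the walk uses method_exists() checks rather than a fixed list of classes. It does not by itself change what the MsDoc reader is able to parse.

```php
<?php
// Assumes vendor/autoload.php has already been required when used with PHPWord.

/**
 * Recursively collect text from an element or container (hypothetical helper).
 */
function collectText($element): string
{
    // Line breaks carry no text of their own.
    if ($element instanceof \PhpOffice\PhpWord\Element\TextBreak) {
        return "\n";
    }
    // Containers (Section, TextRun, Cell, ...) expose getElements().
    if (method_exists($element, 'getElements')) {
        $out = '';
        foreach ($element->getElements() as $child) {
            $out .= collectText($child);
        }
        return $out;
    }
    // Tables expose rows, and rows expose cells.
    if (method_exists($element, 'getRows')) {
        $out = '';
        foreach ($element->getRows() as $row) {
            foreach ($row->getCells() as $cell) {
                $out .= collectText($cell) . "\t";
            }
            $out .= "\n";
        }
        return $out;
    }
    // Leaf elements such as Text expose getText().
    if (method_exists($element, 'getText')) {
        $text = $element->getText();
        if (is_string($text)) {
            return $text;
        }
        // Some elements may return a nested object instead of a string.
        return is_object($text) ? collectText($text) : '';
    }
    return ''; // unknown element types are skipped instead of throwing
}

/**
 * Load a document through an explicit reader and return its text,
 * or null when the reader reports that it cannot read the file.
 */
function extractDocText(string $source, string $readerName = 'MsDoc'): ?string
{
    $reader = \PhpOffice\PhpWord\IOFactory::createReader($readerName);
    if (!$reader->canRead($source)) {
        return null;
    }
    $text = '';
    foreach ($reader->load($source)->getSections() as $section) {
        $text .= collectText($section);
    }
    return $text;
}

// Usage: print extractDocText('word.doc');
```

Skipping unknown element types silently is a design choice made for this sketch; throwing, as the code in the question does, is more useful while diagnosing which elements the reader actually produces.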

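This answer is based on this example and the API documentation.

Regarding "php - How can I extract the text content of a Word document with PHP?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41216935/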
