[PHP] P.S. html parsing from html file through php

victor Wed, 14 Aug 2002 14:25:35 -0700

Actually, I think there is even less work to be done. The paragraph
spacing problem I talked about earlier can be avoided if I leave all
those foobar tags in place and erase only:


<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<link rel=File-List href="community_files/filelist.xml">
<title>Community</title>
<o:SmartTagType
namespaceuri="urn:schemas-microsoft-com:office:smarttags"
 name="City"/>
<o:SmartTagType
namespaceuri="urn:schemas-microsoft-com:office:smarttags"
 name="place"/>

<--------------[Need everything here]-------------> <- in there, not the
following tags ;)

</head>

<body lang=EN-US style='tab-interval:.5in'>

<---------------[Need everything in here]--------------> <- in there,
not the following tags ;)

</body>

</html>

Thanks. If someone can tell me how to learn to work with a search and
replace function, that would be neat.
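
A minimal sketch of that kind of search and replace, assuming PHP's
preg_replace() is acceptable for the job. The function name
strip_wrapper_tags() and the sample string are made up for illustration.
It removes only the wrapper tags listed above and leaves everything else,
including Word's style blocks, in place:

```php
<?php
// Hypothetical helper: strips only the wrapper tags listed above and keeps
// everything else (Word's style blocks and the page content) untouched.
function strip_wrapper_tags(string $html): string
{
    $patterns = [
        '~</?html\b[^>]*>~i',             // <html xmlns:...> and </html>
        '~</?head\b[^>]*>~i',             // <head> and </head>
        '~</?body\b[^>]*>~i',             // <body lang=...> and </body>
        '~<meta\b[^>]*>~i',               // every <meta ...> tag
        '~<link\b[^>]*>~i',               // <link rel=File-List ...>
        '~<title\b[^>]*>.*?</title>~is',  // <title>Community</title>
        '~<o:SmartTagType\b[^>]*>~i',     // Word smart tag declarations
    ];
    // preg_replace() returns null on a regex error; fall back to the input.
    return preg_replace($patterns, '', $html) ?? $html;
}

// Made-up sample standing in for a saved Word page.
$sample = '<html xmlns="x"><head><meta name=Generator content="Microsoft Word 10">'
        . '<title>Community</title><style>p{margin:0}</style></head>'
        . '<body lang=EN-US><p>Hello</p></body></html>';

echo strip_wrapper_tags($sample), "\n";
```

Regular expressions are a blunt tool for HTML, so this is only a starting
point; it relies on none of these tags containing a ">" inside an
attribute value.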

- Vic


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
Sent: Wednesday, August 14, 2002 5:10 PM
To: '[EMAIL PROTECTED]'
Subject: html parsing from html file through php

Hello, I am making an app that reads from an HTML file output by MS
Word (yes, it's for people who need to make web pages but don't know
how to write HTML). Anyway, using MS Word is a requirement. After the
user saves their .doc file as a web page (now an .htm file), the PHP
script will take that HTML file from a dir on the server, open it, read
it, and ignore everything from the beginning of the file up to and
including the opening body tag. Then it must ignore everything at the
end of the page, from the closing body tag through the closing html
tag. So basically, after it's done doing its thing, I would have all the
content of the page ready to be echoed inside another page that would be
a sort of shell or template.
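
The read-and-extract step described above could be sketched like this.
It is only an illustration: extract_body(), the sample markup, the
"shell" wrapper div, and the file path in the comment are assumptions,
not part of the actual app.

```php
<?php
// Hypothetical helper: returns whatever sits between <body ...> and </body>,
// or null when the input has no recognisable body element.
function extract_body(string $html): ?string
{
    if (preg_match('~<body\b[^>]*>(.*)</body>~is', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}

// In the real app the markup would come from the saved file, for example:
// $html = file_get_contents('/path/to/uploads/community.htm'); // assumed path
$html = "<html><head><title>Community</title></head>\n"
      . "<body lang=EN-US style='tab-interval:.5in'>\n"
      . "<div class=Section1><p>Welcome</p></div>\n"
      . "</body></html>";

// Echo the extracted content inside a stand-in for the shell/template page.
$content = extract_body($html);
echo "<div id=\"shell\">\n", $content, "\n</div>\n";
```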

I am looking right now at regular expressions, fopen(), etc., but
just to give you an idea and to see if anybody has any helpful pointers,
this (yes, can you believe it?) is the beginning of the Word-to-HTML
translation that MS Word does (BAH!). (I have to get rid of this,
remember?)

----------------------------------------------------------------------

<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<link rel=File-List href="community_files/filelist.xml">
<title>Community</title>
<o:SmartTagType
namespaceuri="urn:schemas-microsoft-com:office:smarttags"
 name="City"/>
<o:SmartTagType
namespaceuri="urn:schemas-microsoft-com:office:smarttags"
 name="place"/>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Jim Weathers</o:Author>
  <o:LastAuthor>vic</o:LastAuthor>
  <o:Revision>2</o:Revision>
  <o:TotalTime>1</o:TotalTime>
  <o:Created>2002-08-14T19:54:00Z</o:Created>
  <o:LastSaved>2002-08-14T19:54:00Z</o:LastSaved>
  <o:Pages>1</o:Pages>
  <o:Words>79</o:Words>
  <o:Characters>451</o:Characters>
  <o:Company>x-core</o:Company>
  <o:Lines>3</o:Lines>
  <o:Paragraphs>1</o:Paragraphs>
  <o:CharactersWithSpaces>529</o:CharactersWithSpaces>
  <o:Version>10.2625</o:Version>
 </o:DocumentProperties>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:SpellingState>Clean</w:SpellingState>
  <w:GrammarState>Clean</w:GrammarState>
 
  <w:DisplayHorizontalDrawingGridEvery>0</w:DisplayHorizontalDrawingGridEvery>
 
<w:DisplayVerticalDrawingGridEvery>0</w:DisplayVerticalDrawingGridEvery>
  <w:UseMarginsForDrawingGridOrigin/>
  <w:Compatibility>
   <w:FootnoteLayoutLikeWW8/>
   <w:ShapeLayoutLikeWW8/>
   <w:AlignTablesRowByRow/>
   <w:ForgetLastTabAlignment/>
   <w:LayoutRawTableWidth/>
   <w:LayoutTableRowsApart/>
   <w:UseWord97LineBreakingRules/>
  </w:Compatibility>
  <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
 </w:WordDocument>
</xml><![endif]--><!--[if !mso]><object
 classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D"
id=ieooui></object>
<style>
st1\:*{behavior:url(#ieooui) }
</style>
<![endif]-->
<style>
<!--
 /* Font Definitions */
 @font-face
        {font-family:Times;
        panose-1:2 2 6 3 5 4 5 2 3 4;
        mso-font-charset:0;
        mso-generic-font-family:roman;
        mso-font-pitch:variable;
        mso-font-signature:536902279 -2147483648 8 0 511 0;}
@font-face
        {font-family:Verdana;
        panose-1:2 11 6 4 3 5 4 4 2 4;
        mso-font-charset:0;
        mso-generic-font-family:swiss;
        mso-font-pitch:variable;
        mso-font-signature:536871559 0 0 0 415 0;}
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
        {mso-style-parent:"";
        margin:0in;
        margin-bottom:.0001pt;
        mso-pagination:widow-orphan;
        font-size:12.0pt;
        mso-bidi-font-size:10.0pt;
        font-family:Times;
        mso-fareast-font-family:Times;
        mso-bidi-font-family:"Times New Roman";}
p.MsoBodyTextIndent, li.MsoBodyTextIndent, div.MsoBodyTextIndent
        {margin:0in;
        margin-bottom:.0001pt;
        mso-pagination:widow-orphan;
        font-size:12.0pt;
        mso-bidi-font-size:10.0pt;
        font-family:Verdana;
        mso-fareast-font-family:"Times New Roman";
        mso-bidi-font-family:"Times New Roman";
        mso-ansi-language:EN-AU;
        mso-fareast-language:EN-US;}
span.SpellE
        {mso-style-name:"";
        mso-spl-e:yes;}
@page Section1
        {size:8.5in 11.0in;
        margin:1.0in 1.25in 1.0in 1.25in;
        mso-header-margin:.5in;
        mso-footer-margin:.5in;
        mso-paper-source:0;}
div.Section1
        {page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
 /* Style Definitions */
 table.MsoNormalTable
        {mso-style-name:"Table Normal";
        mso-tstyle-rowband-size:0;
        mso-tstyle-colband-size:0;
        mso-style-noshow:yes;
        mso-style-parent:"";
        mso-padding-alt:0in 5.4pt 0in 5.4pt;
        mso-para-margin:0in;
        mso-para-margin-bottom:.0001pt;
        mso-pagination:widow-orphan;
        font-size:10.0pt;
        font-family:Times;
        mso-bidi-font-family:"Times New Roman";}
</style>
<![endif]-->
</head>

<body lang=EN-US style='tab-interval:.5in'>

----------------------------------------------------------

Right after this tag comes:
(Everything before this tag must go)

<div class=Section1>

And I observed that this is pretty much constant in the other HTML
pages, so I guess the script could take this as a stopping cue.

And the very end is 

</div> 

So after this tag, everything must go.
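
Using those two tags as cues, the cut could be sketched with plain string
functions instead of regular expressions. The helper name slice_section()
and the sample page are invented for this example, and it assumes the
opening tag is always written exactly as <div class=Section1>:

```php
<?php
// Hypothetical helper: keeps the span from the first <div class=Section1>
// through the last </div>, and drops everything outside it.
function slice_section(string $html): ?string
{
    $open  = '<div class=Section1>';
    $close = '</div>';

    $start = stripos($html, $open);    // first opening marker
    $end   = strripos($html, $close);  // last closing marker
    if ($start === false || $end === false || $end < $start) {
        return null;                   // markers missing or out of order
    }
    return substr($html, $start, $end + strlen($close) - $start);
}

// Made-up page fragment with a nested div inside the section.
$page = "<body lang=EN-US>\n\n<div class=Section1>\n<p>One</p>\n"
      . "<div><p>Nested</p></div>\n</div>\n\n</body>\n\n</html>";

echo slice_section($page), "\n";
```

Searching for the last </div> rather than the first keeps any nested
divs inside the section intact.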

Erasing these tags and previewing the document in the browser, I saw
that for some reason all paragraphs had a huge space between them
(perhaps 3 paragraph spans), so it would be cool if someone could also
tell me how I could get rid of the tags inside the document and replace
them with "NORMAL" HTML tags.


Yes, if you just tell me to RTFM it's all cool, but tell me where at
least!
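
For the tag cleanup asked about above, one possible starting point is a
chain of preg_replace() patterns. This is a rough sketch, not a complete
Word cleaner: normalize_word_html() is a made-up name, the list of
patterns is a guess at what typical Word output contains, and the sample
paragraph is invented. Whether it cures the paragraph spacing depends on
what is actually causing it.

```php
<?php
// Hypothetical helper: strips Word-specific markup and leaves plain tags.
// The regexes are deliberately crude and apply in the order listed.
function normalize_word_html(string $html): string
{
    $patterns = [
        '~<!--\[if.*?<!\[endif\]-->~is'               => '', // conditional comments
        '~</?(?:o|w|st1):[^>]*>~i'                    => '', // Office tags such as <o:p>
        '~\s+class=(?:"[^"]*"|\'[^\']*\'|[^\s>]+)~i'  => '', // class=MsoNormal etc.
        '~\s+style=(?:"[^"]*"|\'[^\']*\')~i'          => '', // inline mso-* styles
        '~</?span\b[^>]*>~i'                          => '', // leftover span wrappers
        '~<p>(?:\s|&nbsp;)*</p>~i'                    => '', // paragraphs left empty
    ];
    $clean = preg_replace(array_keys($patterns), array_values($patterns), $html);
    return trim($clean ?? $html);  // fall back to the input on a regex error
}

// Made-up sample in the style of Word's output.
$word = "<p class=MsoNormal style='margin:0in'><span lang=EN-AU>Hello "
      . "<st1:City>Perth</st1:City></span></p>\n"
      . "<p class=MsoNormal><o:p>&nbsp;</o:p></p>";

echo normalize_word_html($word), "\n";
```

Note that the class pattern also removes class=Section1, so run any
marker-based cut before this step if it depends on that attribute.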

THANQOOMAGIG!

- Vic




-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
