Actually, I think there is even less work to be done: the paragraph spacing problem I talked about earlier can be avoided if I leave all those foobar tags in place and erase only:

<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<link rel=File-List href="community_files/filelist.xml">
<title>Community</title>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags" name="City"/>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags" name="place"/>
<--------------[Need everything here]-------------> <- in there, not the following tags ;)
</head>
<body lang=EN-US style='tab-interval:.5in'>
<---------------[Need everything in here]--------------> <- in there, not the following tags ;)
</body>
</html>

Thanks. If someone can tell me how to learn to work with some search and replace function, that would be neat.

- Vic

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 14, 2002 5:10 PM
To: '[EMAIL PROTECTED]'
Subject: html parsing from html file through php

Hello,

I am making an app that reads from an HTML file output by MS Word (yes, it's for those people who need to make web pages but don't know how to write HTML). Anyway, using MS Word is a requirement. After the user saves their .doc file as a web page (now an .htm file), the PHP script will take that HTML file from a directory on the server, open it, read it, and ignore everything from the beginning of the file up to and including the opening body tag. It must then also ignore everything at the end of the page, from the closing body tag through the closing html tag. So basically, after it's done doing its thing, I would have all the content of the page ready to be echoed inside another page that would be a sort of shell or template.
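
The step described above (drop everything up to and including the opening body tag, and everything from the closing body tag onward) could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes PHP 7.1 or later, the function name `extract_body` and the sample string are made up for the example, and it assumes the body tag's attribute values contain no `>` character.

```php
<?php
// Hypothetical helper (not from the original post): returns what sits
// between the opening <body ...> tag and the last </body>, or null when
// either tag is missing.
function extract_body(string $html): ?string
{
    $open = stripos($html, '<body');          // start of the opening tag
    if ($open === false) {
        return null;
    }
    $start = strpos($html, '>', $open);       // end of the opening tag
    if ($start === false) {
        return null;
    }
    $end = strripos($html, '</body');         // last closing body tag
    if ($end === false || $end < $start) {
        return null;
    }
    return trim(substr($html, $start + 1, $end - $start - 1));
}

// Made-up stand-in for a Word-exported page.
$sample = "<html><head><title>Community</title></head>\n"
        . "<body lang=EN-US style='tab-interval:.5in'>\n"
        . "<div class=Section1><p class=MsoNormal>Hello</p></div>\n"
        . "</body></html>";

echo extract_body($sample), "\n";
// prints: <div class=Section1><p class=MsoNormal>Hello</p></div>
```

For a real file, the string would come from something like `file_get_contents()` on the saved .htm file, and the result could then be echoed inside the template page. Plain string functions are used here rather than one big regular expression so that long documents do not run into PCRE backtracking limits.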

I am looking right now at regular expressions and the file functions (fopen etc.), but just to give you an idea and to see if anybody has any helpful pointers, this (yes, can you believe it?) is the beginning of the Word-to-HTML translation that MS Word does (BAH!). I have to get rid of this, remember?

----------------------------------------------------------------------
<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<link rel=File-List href="community_files/filelist.xml">
<title>Community</title>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags" name="City"/>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags" name="place"/>
<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Author>Jim Weathers</o:Author>
<o:LastAuthor>vic</o:LastAuthor>
<o:Revision>2</o:Revision>
<o:TotalTime>1</o:TotalTime>
<o:Created>2002-08-14T19:54:00Z</o:Created>
<o:LastSaved>2002-08-14T19:54:00Z</o:LastSaved>
<o:Pages>1</o:Pages>
<o:Words>79</o:Words>
<o:Characters>451</o:Characters>
<o:Company>x-core</o:Company>
<o:Lines>3</o:Lines>
<o:Paragraphs>1</o:Paragraphs>
<o:CharactersWithSpaces>529</o:CharactersWithSpaces>
<o:Version>10.2625</o:Version>
</o:DocumentProperties>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:WordDocument>
<w:SpellingState>Clean</w:SpellingState>
<w:GrammarState>Clean</w:GrammarState>
<w:DisplayHorizontalDrawingGridEvery>0</w:DisplayHorizontalDrawingGridEvery>
<w:DisplayVerticalDrawingGridEvery>0</w:DisplayVerticalDrawingGridEvery>
<w:UseMarginsForDrawingGridOrigin/>
<w:Compatibility>
<w:FootnoteLayoutLikeWW8/>
<w:ShapeLayoutLikeWW8/>
<w:AlignTablesRowByRow/>
<w:ForgetLastTabAlignment/>
<w:LayoutRawTableWidth/>
<w:LayoutTableRowsApart/>
<w:UseWord97LineBreakingRules/>
</w:Compatibility>
<w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]--><!--[if !mso]><object
classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D" id=ieooui></object>
<style>
st1\:*{behavior:url(#ieooui) }
</style>
<![endif]-->
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Times;
panose-1:2 2 6 3 5 4 5 2 3 4;
mso-font-charset:0;
mso-generic-font-family:roman;
mso-font-pitch:variable;
mso-font-signature:536902279 -2147483648 8 0 511 0;}
@font-face
{font-family:Verdana;
panose-1:2 11 6 4 3 5 4 4 2 4;
mso-font-charset:0;
mso-generic-font-family:swiss;
mso-font-pitch:variable;
mso-font-signature:536871559 0 0 0 415 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
mso-bidi-font-size:10.0pt;
font-family:Times;
mso-fareast-font-family:Times;
mso-bidi-font-family:"Times New Roman";}
p.MsoBodyTextIndent, li.MsoBodyTextIndent, div.MsoBodyTextIndent
{margin:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
mso-bidi-font-size:10.0pt;
font-family:Verdana;
mso-fareast-font-family:"Times New Roman";
mso-bidi-font-family:"Times New Roman";
mso-ansi-language:EN-AU;
mso-fareast-language:EN-US;}
span.SpellE
{mso-style-name:"";
mso-spl-e:yes;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;
mso-header-margin:.5in;
mso-footer-margin:.5in;
mso-paper-source:0;}
div.Section1
{page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0in 5.4pt 0in 5.4pt;
mso-para-margin:0in;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:Times;
mso-bidi-font-family:"Times New Roman";}
</style>
<![endif]-->
</head>
<body lang=EN-US style='tab-interval:.5in'>
----------------------------------------------------------

Right after this tag comes (everything before this tag must go):

<div class=Section1>

I observed that this is pretty much constant in the other HTML pages, so I guess the script could take this as a stopping cue. The very end is:

</div>

So after this tag, everything must go.

After erasing these tags and previewing the document in the browser, I saw that for some reason all paragraphs had a huge space between them (perhaps 3 paragraph spans), so it would be kewl if someone could also tell me how I could get rid of tags inside the document and replace them with "NORMAL" HTML tags.

Yes, if you just tell me to RTFM it's all cool, but tell me where at least!

THANQOOMAGIG!

- Vic

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
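
One possible way to use `<div class=Section1>` and the final `</div>` as the cut points, and then swap the Word-specific markup for plain tags, is sketched below. It assumes PHP 7.1 or later; the helper names `extract_section` and `strip_word_markup` and the sample string are hypothetical, and the patterns only cover the kinds of markup visible in the dump above (conditional comments, `o:`/`w:`/`st1:` tags, `Mso*` classes, quoted `style` attributes, `lang` attributes). Since the stylesheet quoted above sets `margin:0in` on `p.MsoNormal`, the big gaps probably come from the browser's default paragraph margins taking over once that stylesheet is removed, so the sketch also drops paragraphs that hold nothing but `&nbsp;`.

```php
<?php
// Hypothetical helper: keep what sits between <div class=Section1 ...>
// and the last </div> in the file; null if either marker is missing.
function extract_section(string $html): ?string
{
    $re = '/<div\b[^>]*class=["\']?Section1\b[^>]*>/i';
    if (!preg_match($re, $html, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $start = $m[0][1] + strlen($m[0][0]);     // just past the opening div
    $end = strripos($html, '</div');          // last closing div
    if ($end === false || $end < $start) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

// Hypothetical helper: remove Word-only markup, leaving plain tags.
function strip_word_markup(string $html): string
{
    $patterns = [
        '/<!--\[if[^\]]*\]>.*?<!\[endif\]-->/is',               // conditional comments
        '/<\/?(?:o|w|st1):[^>]*>/i',                            // Office namespace tags
        '/\s+class=(?:"Mso[^"]*"|\'Mso[^\']*\'|Mso[^\s>]*)/i',  // Mso* classes
        '/\s+style=(?:"[^"]*"|\'[^\']*\')/i',                   // quoted inline styles
        '/\s+lang=[^\s>]+/i',                                   // lang attributes
    ];
    $clean = preg_replace($patterns, '', $html) ?? $html;
    // Drop paragraphs that contain only &nbsp; or whitespace.
    $clean = preg_replace('/<p>\s*(?:&nbsp;|\s)*<\/p>/i', '', $clean) ?? $clean;
    return trim($clean);
}

// Made-up sample in the style of Word's output.
$sample = "<body lang=EN-US><div class=Section1>"
        . "<p class=MsoNormal style='margin:0in'>Hello <st1:City>Perth</st1:City></p>"
        . "<p class=MsoNormal><o:p>&nbsp;</o:p></p>"
        . "</div></body>";

echo strip_word_markup(extract_section($sample) ?? ''), "\n";
// prints: <p>Hello Perth</p>
```

The regular expression functions used here (`preg_match`, `preg_replace`) are covered in the PCRE section of the PHP manual, which is probably the right place to RTFM. Regex cleanup like this is fragile on markup it has not seen before, so it is worth testing against several saved documents.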