On 23-Jun-98 Luiz Otavio L. Zorzella wrote:
>
> "strings" would do a good job for me, but...
>
> > and then edit wordfile.txt to clean it up. Raw "strings" will skip
> > sequences of fewer than 4 ASCII characters but these are unlikely to
> > occur in a Word document. This method will suppress all formatting
> > info except end-of-line, so you are likely to get long lines (= Word
> > paragraphs). It will also fail to recognise any non-US-ASCII character
> > codes (above 127) so accented characters and special symbols, etc,
> > will be missed. But if you simply need to read the text content of a
> > Word document containing plain English text, then this method works
> > fine.
>
> ... my text is in portuguese, and does have non-US chars. Is there a
> way to tell "strings" to accept some non-US chars?

Unfortunately not, or not well ...

"strings" works by extracting sequences of US-ASCII characters (codes
32-126) of length (by default) at least 4. If you KNEW that codes outside
this range really did represent characters (such as the accented
characters in Portuguese) then you wouldn't need to use "strings"!

However, in word-processor files (such as Word's or WordPerfect's) the
codes outside that range also have various "binary" meanings, as well as
(in Word's case) representing "special" characters. The only way you could
get at these would be to interpret the binary codes so as to locate
stretches of text. Otherwise, an approach as simple as the one used by
"strings" would simply have to output every byte in the file. Useless.

Sorry. And apologies for blindly assuming you were after plain ASCII!

Without going for programs (such as those suggested by others) which
really can at least partially interpret a Word file, the best you could do
with "strings" would be to edit in the missing accented characters
afterwards. Possible, but tedious, and maybe error-prone.

However, the handiness of "strings" as a quick utility is such that it
might be worth re-coding it so that, as well as the ASCII codes, it also
included the specific codes for the accented characters in a specific
language. This would increase the amount of garbage in the output but, so
long as one was selective, perhaps not by too much. (If you're going to do
this for Word, remember that there are two different encodings for
accented characters: "Win"-encoding and "Mac"-encoding.)

The best of luck with the other options!

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Date: 23-Jun-98                                       Time: 21:45:54
--------------------------------------------------------------------
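
For what it's worth, the re-coding idea above could be sketched roughly as
follows in Python. This is only an illustration, not the real "strings"
source: the function name extract_strings, the PORTUGUESE character list,
and the choice of Python's "cp1252" and "mac_roman" codecs to stand in for
"Win"-encoding and "Mac"-encoding are all assumptions made for the example.

```python
# Hypothetical list of accented characters to accept; adjust per language.
PORTUGUESE = "áàâãçéêíóôõúüÁÀÂÃÇÉÊÍÓÔÕÚÜ"


def extract_strings(data, encoding="cp1252", extra_chars=PORTUGUESE, min_len=4):
    """Return runs of at least min_len accepted bytes, decoded to text.

    Accepted bytes are US-ASCII codes 32-126 plus the single-byte codes
    that extra_chars map to in the given encoding (assumed "cp1252" for
    Windows files, "mac_roman" for Mac files).
    """
    allowed = set(range(32, 127)) | set(extra_chars.encode(encoding))
    runs = []
    current = bytearray()
    for byte in data:
        if byte in allowed:
            current.append(byte)
            continue
        # A non-accepted byte ends the current run.
        if len(current) >= min_len:
            runs.append(current.decode(encoding))
        current = bytearray()
    # Flush a run that reaches the end of the data.
    if len(current) >= min_len:
        runs.append(current.decode(encoding))
    return runs


if __name__ == "__main__":
    sample = (b"\x00\x01" + "informação".encode("cp1252")
              + b"\x02\x03abc\x04" + "São Paulo".encode("cp1252") + b"\xff")
    print(extract_strings(sample))                  # accented codes accepted
    print(extract_strings(sample, extra_chars=""))  # plain-ASCII behaviour
```

With the accented codes accepted, the sample yields "informação" and
"São Paulo"; with an empty extra_chars it falls back to the plain-ASCII
behaviour and returns only the fragments "informa" and "o Paulo". Any
binary byte in a real Word file that happens to share a code with an
accepted character will still show up as garbage, as noted above.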