I have this problem too when indexing some Persian PDF files.

2011/10/4 Héctor Trujillo <hecto...@gmail.com>
> Hi all,
>
> I'm indexing PDF files with SolrJ, and most of them work. But with some
> files I've got problems because strange characters are stored. This is
> the content that got stored:
>
> +++++++
>
> Starting a Search Application
>
> Abstract
>
> Starting a Search Application A Lucid Imagination White Paper ¥ April 2009 Page i
>
> Starting a Search Application A Lucid Imagination White Paper ¥ April 2009 Page ii
> Do You Need Full-text Search?
>
> ∞
>
> ∞
> ∞
>
> Starting a Search Application A Lucid Imagination White Paper ¥ April 2009 Page 1
>
> Identifying Ideal Results
>
> Starting a Search Application A Lucid Imagination White Paper ¥ April 2009 Page 2
>
> Starting a Search Application A Lucid Imagination White Paper
>
> +++++++
>
> But if I open the PDF file, I can see the content correctly.
>
> I think this is a charset encoding issue, but I don't know whether I can
> avoid this behaviour by applying a different analyzer or tokenizer at
> indexing time.
>
> I've got this problem with some documents downloaded from Lucid's website.
>
> I don't know if anyone has had the same problem and knows how to solve it.
>
> Thanks
>
> Best regards
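Since the stored value of a field is kept as it was sent, not as the analyzer tokenized it, one way to narrow this down might be to inspect the extracted text before it is sent to Solr. Below is a minimal sketch in plain Java, assuming the PDF text is already available as a String. The class ExtractedTextCheck and its methods are hypothetical names made up for illustration; they are not part of SolrJ or of any extraction library. It only counts symbol characters such as the ¥ and ∞ seen above, so it is a rough diagnostic rather than a fix.

```java
// Hypothetical helper for illustration only; not part of SolrJ.
final class ExtractedTextCheck {

    // Returns true for code points in the Unicode symbol categories
    // (math, currency, other) and for U+FFFD, the replacement character.
    // Assumption: in ordinary prose these are rare, so many of them
    // suggest the text was already garbled during extraction.
    static boolean isSuspicious(int codePoint) {
        if (codePoint == 0xFFFD) {
            return true;
        }
        switch (Character.getType(codePoint)) {
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.OTHER_SYMBOL:
                return true;
            default:
                return false;
        }
    }

    // Counts the suspicious code points in the extracted text.
    static long countSuspicious(String text) {
        return text.codePoints()
                   .filter(ExtractedTextCheck::isSuspicious)
                   .count();
    }

    public static void main(String[] args) {
        // \u00A5 is the yen sign and \u221E the infinity sign,
        // the two stray characters from the sample above.
        String extracted = "A Lucid Imagination White Paper \u00A5 April 2009 \u221E";
        System.out.println("Suspicious characters: " + countSuspicious(extracted));
    }
}
```

If the count is already high at this point, the characters were probably mangled during text extraction, and a different analyzer or tokenizer is unlikely to change what gets stored.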