PS: The \n characters are not shown in the browser, but they break how the highlighter works. They are also counted toward fragsize.
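For illustration, here is a minimal sketch of the kind of cleanup I have in mind, assuming the text is normalized on the client before it is sent to Solr. The function name clean_extracted_text and the ascii_only flag are hypothetical, not part of Solr or Tika. It collapses whitespace runs (including newlines) into single spaces and optionally drops non-ASCII characters:

```python
import re


def clean_extracted_text(text: str, ascii_only: bool = True) -> str:
    """Collapse whitespace runs and optionally drop non-ASCII characters.

    Hypothetical helper for client-side cleanup; not a Solr or Tika API.
    """
    if ascii_only:
        # Drop every character outside 7-bit ASCII, e.g. U+FFFD.
        text = text.encode("ascii", errors="ignore").decode("ascii")
    # Replace any run of spaces, tabs, or newlines with a single space.
    return re.sub(r"\s+", " ", text).strip()


sample = "CANADA \ufffd1 \n \n \n \n Place"
print(clean_extracted_text(sample))
```

Dropping all non-ASCII characters is a blunt choice: it would also remove legitimate accented letters, so passing ascii_only=False and fixing the encoding at the source may be the better option for real content.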

On Sat, Nov 26, 2016 at 9:47 PM, Furkan KAMACI <furkankam...@gmail.com> wrote:

> Hi Erick,
>
> I resolved my metadata problem with configuring solrconfig.xml However
> even I post data with post.sh I see content as like:
>
> CANADA �1 \n \n \n \n Place
>
> I have newline characters as \n and some non-ASCII characters. As far as I
> understand it is usual to have such characters because that is a pdf file
> and its newline characters are interpreted as *\n* at Solr. How can I
> remove them (\n and non-ASCII characters).
>
> Kind Regards,
> Furkan KAMACI
>
> On Thu, Nov 24, 2016 at 8:58 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Not sure. What have you tried?
>>
>> For production situations or when you want to take total control of
>> the indexing process,I strongly recommend that you put the Tika
>> parsing on the _client_.
>>
>> Here's a writeup on this topic:
>>
>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 24, 2016 at 10:37 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
>> > Hi Erick,
>> >
>> > When I check the *Solr* documentation I see that [1]:
>> >
>> > *In addition to Tika's metadata, Solr adds the following metadata (defined
>> > in ExtractingMetadataConstants):*
>> >
>> > *"stream_name" - The name of the ContentStream as uploaded to Solr.
>> > Depending on how the file is uploaded, this may or may not be set.*
>> > *"stream_source_info" - Any source info about the stream. See
>> > ContentStream.*
>> > *"stream_size" - The size of the stream in bytes(?)*
>> > *"stream_content_type" - The content type of the stream, if available.*
>> >
>> > So, it seems that these may not be added by Tika, but Solr. Do you know how
>> > to enable/disable this feature?
>> >
>> > Kind Regards,
>> > Furkan KAMACI
>> >
>> > [1] https://wiki.apache.org/solr/ExtractingRequestHandler
>> >
>> > On Thu, Nov 24, 2016 at 6:51 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> >> about PatternCaptureGroupFilterFactory. This isn't going to help. The
>> >> data you see when you return stored data is _before_ any analysis so
>> >> the Pattern....Factory won't be applied. You could do this in a
>> >> ScriptUpdateProcessorFactory. Or, just don't worry about it and have
>> >> the real app deal with it.
>> >>
>> >> I don't particularly know about the Tika settings, that's largely a guess.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Nov 24, 2016 at 8:43 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
>> >> > Hi Erick,
>> >> >
>> >> > 1) I am looking stored data via Solr Admin UI. I send the query and check
>> >> > what is in content field.
>> >> >
>> >> > 2) I can debug the Tika settings if you think that this is not the desired
>> >> > behaviour to have such metadata fields combined into content field.
>> >> >
>> >> > *PS: *Is there any solution to get rid of it except for
>> >> > using PatternCaptureGroupFilterFactory?
>> >> >
>> >> > Kind Regards,
>> >> > Furkan KAMACI
>> >> >
>> >> > On Thu, Nov 24, 2016 at 6:31 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >> >
>> >> >> 1> I'm assuming when you "see" this data you're looking at the stored
>> >> >> data, right? It's a verbatim copy of whatever you sent to the field.
>> >> >> I'm guessing it's a character-encoding mismatch between the source and
>> >> >> what you use to display.
>> >> >>
>> >> >> 2> How are you extracting this data? There are Tika options I think
>> >> >> that can/do mush fields together.
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >> On Thu, Nov 24, 2016 at 7:54 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > I'm testing Solr 4.9.1 I've indexed documents via it. Content field at
>> >> >> > schema has text_general field type which is not modified from original. I
>> >> >> > do not copy any fields to content. When I check the data I see content
>> >> >> > values as like:
>> >> >> >
>> >> >> > " \n \nstream_source_info MARLON BRANDO.rtf \nstream_content_type application/rtf \nstream_size 13580 \nstream_name MARLON BRANDO.rtf \nContent-Type application/rtf \nresourceName MARLON BRANDO.rtf \n \n \n 1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\" directed by Elia Kazan \n"
>> >> >> >
>> >> >> > My questions:
>> >> >> >
>> >> >> > 1) Is it usual to have that newline characters?
>> >> >> > 2) Is it usual to have file metadata at the beginning of the content (i.e.
>> >> >> > stream source, stream_content_type) or related to tool that I post data to
>> >> >> > Solr?
>> >> >> >
>> >> >> > Kind Regards,
>> >> >> > Furkan KAMACI