Hi Erick,

I resolved my metadata problem by configuring solrconfig.xml. However, even when I post data with post.sh, I still see content like this:
CANADA �1 \n \n \n \n Place

The newline characters show up as \n, along with some non-ASCII characters. As far as I understand, this is expected because the source is a PDF file and its newline characters are rendered as *\n* in Solr. How can I remove them (both \n and the non-ASCII characters)?

Kind Regards,
Furkan KAMACI

On Thu, Nov 24, 2016 at 8:58 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Not sure. What have you tried?
>
> For production situations or when you want to take total control of
> the indexing process, I strongly recommend that you put the Tika
> parsing on the _client_.
>
> Here's a writeup on this topic:
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Thu, Nov 24, 2016 at 10:37 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> > Hi Erick,
> >
> > When I check the *Solr* documentation I see that [1]:
> >
> > *In addition to Tika's metadata, Solr adds the following metadata
> > (defined in ExtractingMetadataConstants):*
> >
> > *"stream_name" - The name of the ContentStream as uploaded to Solr.
> > Depending on how the file is uploaded, this may or may not be set.*
> > *"stream_source_info" - Any source info about the stream. See
> > ContentStream.*
> > *"stream_size" - The size of the stream in bytes(?)*
> > *"stream_content_type" - The content type of the stream, if available.*
> >
> > So it seems that these may be added by Solr rather than by Tika. Do you
> > know how to enable/disable this feature?
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > [1] https://wiki.apache.org/solr/ExtractingRequestHandler
> >
> > On Thu, Nov 24, 2016 at 6:51 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> about PatternCaptureGroupFilterFactory. This isn't going to help. The
> >> data you see when you return stored data is _before_ any analysis so
> >> the Pattern....Factory won't be applied. You could do this in a
> >> ScriptUpdateProcessorFactory. Or, just don't worry about it and have
> >> the real app deal with it.
> >>
> >> I don't particularly know about the Tika settings, that's largely a
> >> guess.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Nov 24, 2016 at 8:43 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> >> > Hi Erick,
> >> >
> >> > 1) I am looking at the stored data via the Solr Admin UI. I send the
> >> > query and check what is in the content field.
> >> >
> >> > 2) I can debug the Tika settings if you think that having such
> >> > metadata fields combined into the content field is not the desired
> >> > behaviour.
> >> >
> >> > *PS: *Is there any solution to get rid of it except for
> >> > using PatternCaptureGroupFilterFactory?
> >> >
> >> > Kind Regards,
> >> > Furkan KAMACI
> >> >
> >> > On Thu, Nov 24, 2016 at 6:31 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >
> >> >> 1> I'm assuming when you "see" this data you're looking at the stored
> >> >> data, right? It's a verbatim copy of whatever you sent to the field.
> >> >> I'm guessing it's a character-encoding mismatch between the source
> >> >> and what you use to display.
> >> >>
> >> >> 2> How are you extracting this data? There are Tika options I think
> >> >> that can/do mush fields together.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Thu, Nov 24, 2016 at 7:54 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I'm testing Solr 4.9.1 and have indexed documents with it. The
> >> >> > content field in the schema has the text_general field type, which
> >> >> > is not modified from the original. I do not copy any fields to
> >> >> > content. When I check the data I see content values like:
> >> >> >
> >> >> > " \n \nstream_source_info MARLON BRANDO.rtf
> >> >> > \nstream_content_type application/rtf \nstream_size 13580
> >> >> > \nstream_name MARLON BRANDO.rtf \nContent-Type application/rtf
> >> >> > \nresourceName MARLON BRANDO.rtf \n \n \n 1. Vivien Leigh and
> >> >> > Marlon Brando in \"A Streetcar Named Desire\" directed by Elia
> >> >> > Kazan \n"
> >> >> >
> >> >> > My questions:
> >> >> >
> >> >> > 1) Is it usual to have those newline characters?
> >> >> > 2) Is it usual to have file metadata at the beginning of the
> >> >> > content (i.e. stream_source_info, stream_content_type), or is that
> >> >> > related to the tool that I use to post data to Solr?
> >> >> >
> >> >> > Kind Regards,
> >> >> > Furkan KAMACI
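
The cleanup asked about at the top of this thread (removing \n and non-ASCII characters) can also be done on the client before the text reaches Solr, in the spirit of the suggestion above to move processing out of Solr. What follows is only a minimal Python sketch, assuming the extracted text is already available as a string and that the \n shown in the Admin UI is a real newline character in the stored value. The function name clean_extracted_text and its ascii_only parameter are hypothetical and are not part of Solr or Tika.

```python
import re
import unicodedata


def clean_extracted_text(text: str, ascii_only: bool = True) -> str:
    """Collapse whitespace and optionally drop non-ASCII characters.

    Hypothetical helper for illustration; not a Solr or Tika API.
    """
    # Fold compatibility characters first, e.g. a non-breaking space
    # becomes an ordinary space.
    text = unicodedata.normalize("NFKC", text)
    # Remove the Unicode replacement character that a failed decode leaves behind.
    text = text.replace("\ufffd", "")
    if ascii_only:
        # Discard everything outside ASCII. Note that this also removes
        # legitimate accented letters, so it is optional.
        text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of whitespace (newlines, tabs, spaces) into one space.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "CANADA \ufffd1 \n \n \n \n Place"
    print(clean_extracted_text(sample))
```

Setting ascii_only=False keeps characters such as accented letters and only normalizes whitespace, which is probably the safer choice when the documents are not purely English. If the display problem is really an encoding mismatch, as guessed earlier in the thread, fixing the decoding at the source would be preferable to stripping characters afterwards.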