Re: Metadata and Newline Characters at Content

Erick Erickson Thu, 24 Nov 2016 10:59:56 -0800

Not sure. What have you tried?

For production situations, or when you want to take total control of
the indexing process, I strongly recommend that you put the Tika
parsing on the _client_.


Here's a writeup on this topic:

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
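
To illustrate the kind of control you get on the client, here is a
minimal sketch of cleaning the extracted text before it goes to Solr.
It uses only the JDK and does not call Tika or SolrJ. The class and
method names are hypothetical, and the key list is just what shows up
in the sample quoted below. I am assuming each key sits on its own
line as "<key> <value>".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Solr, SolrJ, or Tika.
final class ExtractedContentCleaner {

    // Keys seen at the start of the content value quoted in this thread.
    // Assumption: each one appears as "<key> <value>" on its own line.
    private static final Set<String> METADATA_KEYS = Set.of(
            "stream_source_info", "stream_content_type", "stream_size",
            "stream_name", "Content-Type", "resourceName");

    private ExtractedContentCleaner() {}

    /**
     * Drops blank lines and the leading run of metadata lines, then joins
     * the remaining lines with single spaces.
     */
    static String clean(String raw) {
        List<String> kept = new ArrayList<>();
        boolean inPreamble = true;
        for (String line : raw.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines everywhere
            }
            if (inPreamble && isMetadataLine(trimmed)) {
                continue; // skip metadata only before the body starts
            }
            inPreamble = false;
            kept.add(trimmed.replaceAll("\\s+", " "));
        }
        return String.join(" ", kept);
    }

    // A line counts as metadata when its first token is a known key.
    private static boolean isMetadataLine(String trimmed) {
        String firstToken = trimmed.split("\\s+", 2)[0];
        return METADATA_KEYS.contains(firstToken);
    }

    public static void main(String[] args) {
        String raw = " \n \nstream_source_info MARLON BRANDO.rtf   \n"
                + "stream_content_type application/rtf   \n"
                + "stream_size 13580   \nstream_name MARLON BRANDO.rtf \n"
                + "Content-Type application/rtf   \n"
                + "resourceName MARLON BRANDO.rtf   \n \n \n"
                + "  1. Vivien Leigh and Marlon Brando in "
                + "\"A Streetcar Named Desire\" directed by Elia Kazan \n";
        System.out.println(clean(raw));
    }
}
```

Note that this flattens all line breaks in the body, so paragraph
structure is lost. If you need it, keep the newlines and only drop
the leading metadata lines.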

Best,
Erick

On Thu, Nov 24, 2016 at 10:37 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> Hi Erick,
>
> When I check the *Solr* documentation I see that [1]:
>
> *In addition to Tika's metadata, Solr adds the following metadata (defined
> in ExtractingMetadataConstants):*
>
> *"stream_name" - The name of the ContentStream as uploaded to Solr.
> Depending on how the file is uploaded, this may or may not be set.*
> *"stream_source_info" - Any source info about the stream. See
> ContentStream.*
> *"stream_size" - The size of the stream in bytes(?)*
> *"stream_content_type" - The content type of the stream, if available.*
>
> So, it seems that these may be added not by Tika but by Solr. Do you know how
> to enable/disable this feature?
>
> Kind Regards,
> Furkan KAMACI
>
> [1] https://wiki.apache.org/solr/ExtractingRequestHandler
>
> On Thu, Nov 24, 2016 at 6:51 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> about PatternCaptureGroupFilterFactory. This isn't going to help. The
>> data you see when you return stored data is _before_ any analysis so
>> the Pattern....Factory won't be applied. You could do this in a
>> ScriptUpdateProcessorFactory. Or, just don't worry about it and have
>> the real app deal with it.
>>
>> I don't particularly know about the Tika settings, that's largely a guess.
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 24, 2016 at 8:43 AM, Furkan KAMACI <furkankam...@gmail.com>
>> wrote:
>> > Hi Erick,
>> >
>> > 1) I am looking at the stored data via the Solr Admin UI. I send the query
>> > and check what is in the content field.
>> >
>> > 2) I can debug the Tika settings if you think that this is not the desired
>> > behaviour to have such metadata fields combined into content field.
>> >
>> > *PS: *Is there any solution to get rid of it except for
>> > using PatternCaptureGroupFilterFactory?
>> >
>> > Kind Regards,
>> > Furkan KAMACI
>> >
>> > On Thu, Nov 24, 2016 at 6:31 PM, Erick Erickson <erickerick...@gmail.com>
>> > wrote:
>> >
>> >> 1> I'm assuming when you "see" this data you're looking at the stored
>> >> data, right? It's a verbatim copy of whatever you sent to the field.
>> >> I'm guessing it's a character-encoding mismatch between the source and
>> >> what you use to display.
>> >>
>> >> 2> How are you extracting this data? There are Tika options I think
>> >> that can/do mush fields together.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >>
>> >>
>> >> On Thu, Nov 24, 2016 at 7:54 AM, Furkan KAMACI <furkankam...@gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I'm testing Solr 4.9.1 and have indexed documents with it. The content
>> >> > field in the schema has the text_general field type, which is not
>> >> > modified from the original. I do not copy any fields to content. When I
>> >> > check the data, I see content values like:
>> >> >
>> >> >  " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
>> >> > application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
>> >> > \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n \n
>> >> > \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>> >> > directed by Elia Kazan \n"
>> >> >
>> >> > My questions:
>> >> >
>> >> > 1) Is it usual to have those newline characters?
>> >> > 2) Is it usual to have file metadata at the beginning of the content
>> >> > (i.e. stream_source_info, stream_content_type), or is it related to the
>> >> > tool that I use to post data to Solr?
>> >> >
>> >> > Kind Regards,
>> >> > Furkan KAMACI
>> >>
>>
