@Joel
> Can you describe how you're planning on using Streaming?

I am mostly using it for a distributed join case. We were planning to use
similar logic (hash the id and join) in Spark for our use case, but since
the data is stored in Solr, I will be using Solr streaming to perform the
same operation.

I have similar use cases for building probabilistic data structures while
streaming results. I might have to spend some time exploring query
optimization (e.g., deciding the sort order when doing a join).
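To make the join case concrete, here is a rough sketch of the kind of
parallel hash join I have in mind, written as a streaming expression
(this assumes Solr 6 expression syntax; the collection names, fields,
worker count, and ZooKeeper address are all placeholders):

```
parallel(workers,
         innerJoin(
           search(left,  q="*:*", fl="id,fieldA", sort="id asc", partitionKeys="id"),
           search(right, q="*:*", fl="id,fieldB", sort="id asc", partitionKeys="id"),
           on="id"),
         workers="4",
         zkHost="zk1:2181",
         sort="id asc")
```

The partitionKeys hash partitions both sides on id across the workers, so
each worker joins a disjoint slice of the key space.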

Please let me know if you have any feedback.

On Tue, Apr 26, 2016 at 10:30 AM, sudsport s <sudssf2...@gmail.com> wrote:

> Thanks @Reth, yes that was one of my concerns. I will look at the JIRA
> you mentioned.
>
> Thanks Joel
> I used some of the examples for the streaming client from your blog. I
> got a basic tuple stream working, but I get the following exception while
> running a parallel stream.
>
>
> java.io.IOException: java.util.concurrent.ExecutionException:
> org.noggit.JSONParser$ParseException: JSON Parse Error: char=<,position=0
> BEFORE='<' AFTER='html> <head> <meta http-equiv="Content-'
>         at org.apache.solr.client.solrj.io.stream.CloudSolrStream.openStreams(CloudSolrStream.java:332)
>         at org.apache.solr.client.solrj.io.stream.CloudSolrStream.open(CloudSolrStream.java:231)
>
>
>
> I tried to look into the Solr logs, and after turning on debug mode I
> found the following:
>
> POST /solr/collection_shard20_replica1/stream HTTP/1.1
> "HTTP/1.1 404 Not Found[\r][\n]"
>
>
> It looks like the parallel stream is trying to access /stream on each
> shard. Can someone tell me how to enable the stream handler? I have the
> export handler enabled. I will look at the latest solrconfig to see if I
> can turn it on.
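For anyone hitting the same 404: in Solr 5.x the /stream handler may need
to be registered explicitly in solrconfig.xml. A sketch, modeled on the
stock 5.x configuration (the invariants shown are assumptions):

```xml
<!-- Register the streaming handler on each core; in later Solr releases
     it is an implicit handler and needs no explicit registration. -->
<requestHandler name="/stream" class="solr.StreamHandler">
  <lst name="invariants">
    <str name="wt">json</str>
    <str name="distrib">false</str>
  </lst>
</requestHandler>
```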
>
>
>
> @Joel I am already running sizing exercises; I will run a new one with
> Solr 5.5+ and docValues enabled on id.
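For reference, enabling docValues on id is a schema.xml change along these
lines; note that existing segments will not gain docValues retroactively,
so a full reindex (not just an optimize) is needed. A sketch, assuming the
id field uses a StrField type named "string":

```xml
<field name="id" type="string" indexed="true" stored="true"
       docValues="true" required="true" multiValued="false"/>
```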
>
> BTW, Solr streaming has amazing response times. Thanks for making it so
> FAST!!!
>
> On Mon, Apr 25, 2016 at 10:54 AM, Joel Bernstein <joels...@gmail.com>
> wrote:
>
>> Can you describe how you're planning on using Streaming? I can provide
>> some feedback on how it will perform for your use case.
>>
>> When scaling out Streaming you'll get large performance boosts when you
>> increase the number of shards, replicas and workers. This is particularly
>> true if you're doing parallel relational algebra or map/reduce operations.
>>
>> As far as DocValues being expensive with unique fields, you'll want to do a
>> sizing exercise to see how many documents per-shard work best for your use
>> case. There are different docValues implementations that will allow you to
>> trade off memory for performance.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Mon, Apr 25, 2016 at 3:30 AM, Reth RM <reth.ik...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > So, is the concern related to the same field value being stored twice,
>> > with stored=true and docValues=true? If that is the case, there is a
>> > relevant JIRA for this that has been fixed [1]. If you upgrade to
>> > version 5.5/6.0, it is possible to read non-stored fields from the
>> > docValues index; check it out.
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/SOLR-8220
>> >
>> > On Mon, Apr 25, 2016 at 9:44 AM, sudsport s <sudssf2...@gmail.com> wrote:
>> >
>> > > Thanks Erick for the reply.
>> > >
>> > > Since I was storing id (it's a stored field), my guess is that after
>> > > enabling docValues it will be stored in two places. Also, as per my
>> > > understanding, docValues are great when you have values which repeat;
>> > > I am not sure how beneficial they would be for a unique id field.
>> > > I am looking at a collection of a few hundred billion documents, which
>> > > is the reason I really want to care about the expense from the design
>> > > phase.
>> > >
>> > > > On Sun, Apr 24, 2016 at 7:24 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> > >
>> > > > In a word, "yes".
>> > > >
>> > > > DocValues aren't particularly expensive, or expensive at all. The
>> > > > idea is that when you sort by a field or facet on it, the field has
>> > > > to be "uninverted", which builds the entire structure in Java's JVM
>> > > > (this is when the field is _not_ docValues).
>> > > >
>> > > > DocValues essentially serialize this structure to disk. So your
>> > > > on-disk index size is larger, but that size is MMapped rather than
>> > > > stored on Java's heap.
>> > > >
>> > > > Really, the question I'd have to ask, though, is "why do you care
>> > > > about the expense?". If you have a functional requirement that has
>> > > > to be served by returning the id via the /export handler, you really
>> > > > have no choice.
>> > > >
>> > > > Best,
>> > > > Erick
>> > > >
>> > > >
>> > > > On Sun, Apr 24, 2016 at 9:55 AM, sudsport s <sudssf2...@gmail.com> wrote:
>> > > > > I was trying to use Streaming for reading a basic tuple stream. I
>> > > > > am using sort by id asc, and I am getting the following exception.
>> > > > >
>> > > > > I am using the export search handler as per
>> > > > > https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
>> > > > >
>> > > > > null:java.io.IOException: id must have DocValues to use this feature.
>> > > > >         at org.apache.solr.response.SortingResponseWriter.getFieldWriters(SortingResponseWriter.java:241)
>> > > > >         at org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:120)
>> > > > >         at org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
>> > > > >         at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
>> > > > >         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
>> > > > >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
>> > > > >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
>> > > > >         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>> > > > >         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>> > > > >         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>> > > > >         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>> > > > >         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>> > > > >         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>> > > > >         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>> > > > >         at org.eclipse.jetty.server.session.SessionHandler.doScope(
>> > > > >
>> > > > >
>> > > > > Does it make sense to enable docValues for the unique id field?
>> > > > > How expensive is it?
>> > > > >
>> > > > >
>> > > > > If I have an existing collection, can I update the schema and
>> > > > > optimize the collection to get docValues enabled for id?
>> > > > >
>> > > > >
>> > > > > --
>> > > > >
>> > > > > Thanks
>> > > >
>> > >
>> >
>>
>
>
