Roman,

It's covered in http://wiki.apache.org/solr/ContentStream:

  | For POST requests where the content-type is not
  | "application/x-www-form-urlencoded", the raw POST body is passed as a
  | stream.

So, there is no need for encoding of binary data inside the body.
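
A rough client-side sketch of such a request, in Python. The host, port,
handler path and the choice of packing ids as big-endian 32-bit ints are
assumptions for illustration only, and I am assuming the remaining
parameters travel in the URL query string. The request is built but not
sent:

```python
import struct
import urllib.parse
import urllib.request


def build_id_request(base_url, ids, params):
    """Build (but do not send) a POST whose raw body is a packed id list."""
    # Pack ids as big-endian unsigned 32-bit ints; the bytes go into the
    # body as-is, with no base64 or URL-encoding step.
    body = struct.pack(">%dI" % len(ids), *ids)
    # Ordinary parameters stay in the query string (assumption, see above).
    url = base_url + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url,
        data=body,
        # Any content type other than application/x-www-form-urlencoded,
        # so that the body is handed over as a raw stream.
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


# Hypothetical endpoint; replace with your own Solr URL and handler.
req = build_id_request("http://localhost:8983/solr/select", [1, 2, 3], {"q": "*:*"})
```

On the Solr side the body would then be read from the request's content
streams by whatever handler or parser you plug in.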

Regarding encoding, I have had good results passing such ids encoded as
vInts, but they need to be presorted.
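
To illustrate, here is a minimal Python sketch of one common scheme:
store the gap between consecutive sorted ids as a variable-length int
(7 payload bits per byte, high bit set when more bytes follow). Using
gaps is my assumption about why presorting matters here; the function
names are made up for the example and this is not the code of any
particular parser.

```python
def encode_vint(n):
    """Encode a non-negative int, low-order 7 bits first."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_sorted_ids(ids):
    """Encode ascending non-negative ids as vInt-encoded gaps."""
    out = bytearray()
    prev = 0
    for i in ids:
        if i < prev:
            raise ValueError("ids must be non-negative and presorted")
        out += encode_vint(i - prev)  # small gaps need fewer bytes
        prev = i
    return bytes(out)


def decode_sorted_ids(data):
    """Inverse of encode_sorted_ids."""
    ids, prev, value, shift = [], 0, 0, 0
    for b in data:
        value |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7  # continuation byte, keep accumulating
        else:
            prev += value  # gap complete: add it to the previous id
            ids.append(prev)
            value, shift = 0, 0
    return ids


ids = [3, 130, 131, 100000]
encoded = encode_sorted_ids(ids)
decoded = decode_sorted_ids(encoded)
```

With dense ids most gaps fit in a single byte, which is where the saving
over fixed 4-byte ints comes from.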



On Tue, Jul 2, 2013 at 10:46 PM, Roman Chyla <roman.ch...@gmail.com> wrote:

> Hello Mikhail,
>
> Yes, GET is limited, but POST is not - so I just wanted that it works in
> both the same way. But I am not sure if I am understanding your question
> completely. Could you elaborate on the parameters/body part? Is there no
> need for encoding of binary data inside the body? Or do you mean it is
> treated as a string? Or is it just a bytestream and other parameters are
> seen as string?
>
> On a general note: my main concern was to send many ids fast, if we use
> ints (32bit), in one MB, one can fit ~250K, with bitset 32 times more (somebody
> check numbers please :)). But certainly, if the bitset is sparse or the
> collection of ids just a 'a few thousands', stream of ints/longs will be
> smaller, better to use.
>
> roman
>
>
>
> On Tue, Jul 2, 2013 at 2:00 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com
> > wrote:
>
> > Hello Roman,
> >
> > Have you considered passing the long id sequence as the request body and
> > accessing it inside Solr as a content stream? That makes base64 encoding
> > unnecessary. AFAIK URL length is limited anyway.
> >
> >
> > On Tue, Jul 2, 2013 at 9:32 PM, Roman Chyla <roman.ch...@gmail.com>
> wrote:
> >
> > > Wrong link to the parser, should be:
> > >
> > >
> >
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java
> > >
> > >
> > > On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla <roman.ch...@gmail.com>
> > wrote:
> > >
> > > > Hello @,
> > > >
> > > > > This thread 'kicked' me into finishing some long-past task of
> > > > sending/receiving large boolean (bitset) filter. We have been using
> > > bitsets
> > > > with solr before, but now I sat down and wrote it as a qparser. The
> use
> > > > cases, as you have discussed are:
> > > >
> > > >  - necessity to send loooong list of ids as a query (where it is not
> > > > possible to do it the 'normal' way)
> > > >  - or filtering ACLs
> > > >
> > > >
> > > > It works in the following way:
> > > >
> > > >   - external application constructs bitset and sends it as a query to
> > > solr
> > > > (q or fq, depends on your needs)
> > > > >   - solr unpacks the bitset (translating bits into lucene ids, if
> > > > necessary), and wraps this into a query which then has the easy job
> of
> > > > 'filtering' wanted/unwanted items
> > > >
> > > > Therefore it is good only if you can search against something that is
> > > > indexed as integer (id's often are).
> > > >
> > > > A simple benchmark shows acceptable performance, to send the bitset
> > > > (randomly populated, 10M, with 4M bits set), it takes 110ms
> (25+64+20)
> > > >
> > > > To decode this string (resulting byte size 1.5Mb!) it takes ~90ms
> > > > (5+14+68ms)
> > > >
> > > > But I haven't tested latency of sending it over the network and the
> > query
> > > > > performance, but since the query is very similar to MatchAllDocs, it
> is
> > > > probably very fast (and I know that sending many Mbs to Solr is fast
> as
> > > > well)
> > > >
> > > > I know this is not exactly 'standard' solution, and it is probably
> not
> > > > something you want to see with hundreds of millions of docs, but
> people
> > > > seem to be doing 'not the right thing' all the time;)
> > > > So if you think this is something useful for the community, please
> let
> > me
> > > > know. If somebody would be willing to test it, i can file a JIRA
> > ticket.
> > > >
> > > > Thanks!
> > > >
> > > > Roman
> > > >
> > > >
> > > > The code, if no JIRA is needed, can be found here:
> > > >
> > > >
> > >
> >
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
> > > >
> > > >
> > >
> >
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
> > > >
> > > > 839ms.  run
> > > > 154ms.  Building random bitset indexSize=10000000 fill=0.5 --
> > > > Size=15054208,cardinality=3934477 highestBit=9999999
> > > >  25ms.  Converting bitset to byte array -- resulting array
> > length=1250000
> > > > 20ms.  Encoding byte array into base64 -- resulting array
> > length=1666668
> > > > ratio=1.3333344
> > > >  62ms.  Compressing byte array with GZIP -- resulting array
> > > > length=1218602 ratio=0.9748816
> > > > 20ms.  Encoding gzipped byte array into base64 -- resulting string
> > > > length=1624804 ratio=1.2998432
> > > >  5ms.  Decoding gzipped byte array from base64
> > > > 14ms.  Uncompressing decoded byte array
> > > > 68ms.  Converting from byte array to bitset
> > > >  743ms.  running
> > > >
> > > >
> > > > On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson <
> > erickerick...@gmail.com
> > > >wrote:
> > > >
> > > >> Not necessarily. If the auth tokens are available on some
> > > >> other system (DB, LDAP, whatever), one could get them
> > > >> in the PostFilter and cache them somewhere since,
> > > >> presumably, they wouldn't be changing all that often. Or
> > > >> use a UserCache and get notified whenever a new searcher
> > > >> was opened and regenerate or purge the cache.
> > > >>
> > > >> Of course you're right if the post filter does NOT have
> > > >> access to the source of truth for the user's privileges.
> > > >>
> > > >> FWIW,
> > > >> Erick
> > > >>
> > > >> On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
> > > >> <otis.gospodne...@gmail.com> wrote:
> > > >> > Hi,
> > > >> >
> > > > >> > The unfortunate thing about this is that you still have to *pass*
> > that
> > > >> > filter from the client to the server every time you want to use
> that
> > > >> > filter.  If that filter is big/long, passing that in all the time
> > has
> > > >> > some price that could be eliminated by using "server-side named
> > > >> > filters".
> > > >> >
> > > >> > Otis
> > > >> > --
> > > >> > Solr & ElasticSearch Support
> > > >> > http://sematext.com/
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson <
> > > >> erickerick...@gmail.com> wrote:
> > > >> >> You might consider "post filters". The idea
> > > >> >> is to write a custom filter that gets applied
> > > >> >> after all other filters etc. One use-case
> > > >> >> here is exactly ACL lists, and can be quite
> > > >> >> helpful if you're not doing *:* type queries.
> > > >> >>
> > > >> >> Best
> > > >> >> Erick
> > > >> >>
> > > >> >> On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
> > > >> >> <otis.gospodne...@gmail.com> wrote:
> > > >> >>> Btw. ElasticSearch has a nice feature here.  Not sure what it's
> > > >> >>> called, but I call it "named filter".
> > > >> >>>
> > > >> >>> http://www.elasticsearch.org/blog/terms-filter-lookup/
> > > >> >>>
> > > >> >>> Maybe that's what OP was after?
> > > >> >>>
> > > >> >>> Otis
> > > >> >>> --
> > > >> >>> Solr & ElasticSearch Support
> > > >> >>> http://sematext.com/
> > > >> >>>
> > > >> >>>
> > > >> >>>
> > > >> >>>
> > > >> >>>
> > > >> >>> On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch
> > > >> >>> <arafa...@gmail.com> wrote:
> > > >> >>>> On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov <
> > ivkus...@gmail.com>
> > > >> wrote:
> > > >> >>>>> So I'm using query like
> > > >> >>>>>
> > > >>
> > >
> http://127.0.0.1:8080/solr/select?q=*:*&fq={!mqparser}id:%281%202%203%29
> > <
> > >
> >
> http://127.0.0.1:8080/solr/select?q=*:*&fq=%7B!mqparser%7Did:%281%202%203%29
> > > >
> > > >> >>>>
> > > >> >>>> If the IDs are purely numeric, I wonder if the better way is to
> > > send
> > > >> a
> > > >> >>>> bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if
> > > >> ID:2000
> > > >> >>>> is included. Even using URL-encoding rules, you can fit at
> least
> > 65
> > > >> >>>> sequential ID flags per character and I am sure there are more
> > > >> >>>> efficient encoding schemes for long empty sequences.
> > > >> >>>>
> > > >> >>>> Regards,
> > > >> >>>>    Alex.
> > > >> >>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >>>> Personal website: http://www.outerthoughts.com/
> > > >> >>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> > > >> >>>> - Time is the quality of nature that keeps events from
> happening
> > > all
> > > >> >>>> at once. Lately, it doesn't seem to be working.  (Anonymous  -
> > via
> > > >> GTD
> > > >> >>>> book)
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> >  <mkhlud...@griddynamics.com>
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhlud...@griddynamics.com>
