I hope to complete https://issues.apache.org/jira/browse/SOLR-7188 in the
near future; it should make DIH a world-class ETL tool.
For those who have already hit the bottleneck, there is a kind of steroids:
https://issues.apache.org/jira/browse/SOLR-3585

On Tue, Dec 1, 2015 at 9:05 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Yes, DIH works with SolrCloud. I don't particularly like
> it as it doesn't parallelize well, i.e. all the action
> happens on one Solr server. Admittedly it does
> send the docs to the correct shards etc.
>
> But often the bottleneck is acquiring the data,
> and there DIH becomes the limiting factor since it's a
> single-threaded process running on one server.
>
> Here's a starter for using SolrJ:
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> There are some bits with Tika in there, but you should be able
> to remove them pretty easily.
>
> Best,
> Erick
>
> On Tue, Dec 1, 2015 at 12:26 AM, Upayavira <u...@odoko.co.uk> wrote:
> > I've never used DIH in earnest. I'm not sure if/how it works with
> > SolrCloud. There is a ticket somewhere to make a 'standalone' DIH that
> > sits outside of Solr and pushes to it, which would be a much better
> > idea, I think, and would work better with SolrCloud. Others here
> > (perhaps in a separate thread with a clear "DIH and SolrCloud" title)
> > will probably be able to help better.
> >
> > Do you use SolrJ at present? What language do you use to do your
> > interaction with Solr? If you are using Java, you should be using SolrJ.
> > If not, ignore that suggestion.
> >
> > If you aren't using Java, then you have two options for interacting with
> > SolrCloud:
> >  * put all of your boxes behind a load balancer and maintain the load
> >  balancer as your network changes
> >  * use (or create) a Zookeeper aware client for your language of choice
> >  (e.g. I have a pull request open for pysolr that adds ZK awareness).
> >  With this, you point your client at Zookeeper, not Solr. It works out
> >  the location of the correct Solr to hit based upon the information in
> >  ZK.
> >
> > Upayavira
> >
> > On Tue, Dec 1, 2015, at 06:29 AM, William Bell wrote:
> >> ok.
> >>
> >> What about using the DIH handler? Does it index in a SolrCloud setup? Or
> >> how would I convert a query to use SolrJ?
> >>
> >> On Mon, Nov 30, 2015 at 5:36 AM, Upayavira <u...@odoko.co.uk> wrote:
> >>
> >> >
> >> >
> >> > On Sun, Nov 29, 2015, at 07:38 PM, William Bell wrote:
> >> > > OK. Been using Cores for 4 years. Want to migrate to collections /
> >> > > Cloud.
> >> > >
> >> > > Do we have to change our queries?
> >> > >
> >> > > http://loadbalancer:8983/solr/corename/select?q=*:*
> >> > >
> >> > > What does this become once we have the collection sharded? Do we
> >> > > need a Load Balancer or just point to one box and run the new query?
> >> > > Or would it be better to hit the LB in case one machine is no longer
> >> > > good to go?
> >> > >
> >> > > http://loadbalancer:8983/solr/collectionname/select?q=*:*
> >> > >
> >> > > What features would not yet be ready for sharded setups with
> >> > > SolrCloud? In the past, facet counts were an issue (grouping?
> >> > > stats?), as well as IDF for sorting by score. E.g.
> >> > > facet.field=specialties: we want the Cardiologist specialty to have
> >> > > one combined count across shards. So if shard1 has 4 people with
> >> > > Cardiology, and shard2 has 2 people with Cardiology, we would want
> >> > > the number to be 6. We would want facet.sort to work on counts... I
> >> > > guess we could index another collection for facets and just use 1
> >> > > machine for that? But doesn't that defeat the purpose?
> >> > >
> >> > > What is the best walkthrough for Solr 5.3.1?
> >> > >
> >> > > Looking at https://wiki.apache.org/solr/SolrCloud
> >> >
> >> > 1. Your queries should stay (more or less) the same
> >> > 2. If you name a collection the same as what you are using for a core,
> >> > your base URL will remain the same
> >> > 3. If you use SolrJ, then you would change to CloudSolrClient, which
> >> > would feel quite different, but the SolrQuery objects should be
> >> > interchangeable
> >> > 4. If you use SolrJ, then you don't need a load balancer - SolrJ will
> >> > do round robin against the Solr nodes for that collection. It will
> >> > respond to failures far faster than an LB ever could (I've seen downed
> >> > machines pulled in <200ms)
> >> > 5. Regarding sharded setups, there are two scenarios to consider -
> >> > distributed in general, and SolrCloud in particular. Every search
> >> > component must be enabled for distributed search (faceting,
> >> > highlighting, grouping, etc.). Some of the newer ones may not have
> >> > had distributed support implemented yet. Others, such as Joining, will
> >> > require particular care, and will work in only a subset of
> >> > conditions.
> >> > 6. IDF mostly balances itself across the shards. If it
> >> > doesn't, then distributed IDF is available, but that has a cost in
> >> > terms of additional network traffic.
> >> > 7. Faceting should work just fine (as you describe) across shards. I
> >> > would check specifically on newer faceting features though before
> >> > assuming anything.
> >> > 8. facet.sort+counts, have you tried it?
> >> > 9. I would consider this to be a more up-to-date place to go:
> >> > https://cwiki.apache.org/confluence/display/solr/SolrCloud
> >> >
> >> > Upayavira
> >> >
> >>
> >>
> >>
> >> --
> >> Bill Bell
> >> billnb...@gmail.com
> >> cell 720-256-8076
>

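To illustrate the URL question quoted above, here is a minimal Python
sketch (standard library only) showing that the select URL has the same
shape whether the last path segment names a core or a collection. The
helper name select_url is hypothetical, and the host and names are the
placeholders used in the thread, not a real deployment.

```python
from urllib.parse import urlencode


def select_url(base, name, params):
    """Build a /select URL for a core or a collection called `name`.

    `base` is the scheme://host:port part; `params` is a dict of query
    parameters. Hypothetical helper, for illustration only.
    """
    # Keep '*' and ':' unescaped so q=*:* stays readable in the URL.
    query = urlencode(params, safe="*:")
    return base + "/solr/" + name + "/select?" + query


# Placeholder host and names taken from the quoted question.
core_url = select_url("http://loadbalancer:8983", "corename", {"q": "*:*"})
collection_url = select_url(
    "http://loadbalancer:8983", "collectionname", {"q": "*:*"}
)

print(core_url)        # http://loadbalancer:8983/solr/corename/select?q=*:*
print(collection_url)  # http://loadbalancer:8983/solr/collectionname/select?q=*:*
```

If the collection is given the same name as the old core, the two URLs
are identical, which matches point 2 in the quoted reply.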

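The facet expectation in the quoted question (4 on shard1 plus 2 on
shard2 should report 6) amounts to summing per-shard counts and then
sorting by the total. Below is a toy Python sketch of that arithmetic.
The shard data and the function name merge_facet_counts are made up for
illustration; this shows the expected result only and is not a
description of how Solr implements distributed faceting internally.

```python
from collections import Counter


def merge_facet_counts(per_shard_counts):
    """Sum per-shard facet counts into one collection-wide Counter."""
    merged = Counter()
    for shard_counts in per_shard_counts:
        # Counter.update adds counts for keys that already exist.
        merged.update(shard_counts)
    return merged


# Hypothetical per-shard results for facet.field=specialties.
shard1 = {"Cardiology": 4, "Dermatology": 1}
shard2 = {"Cardiology": 2, "Neurology": 3}

merged = merge_facet_counts([shard1, shard2])

# Highest count first, which is the ordering asked for with facet.sort.
for value, count in merged.most_common():
    print(value, count)
```

With the sample data, Cardiology comes out as 6 and sorts first.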

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
