I hope to complete https://issues.apache.org/jira/browse/SOLR-7188, which should make DIH a world-class ETL tool, in the near future. For those who have already hit the bottleneck, there is a kind of steroids: https://issues.apache.org/jira/browse/SOLR-3585

On Tue, Dec 1, 2015 at 9:05 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Yes, DIH works with SolrCloud. I don't particularly like
> it as it doesn't parallelize well, i.e. all the action
> happens on one Solr server. Admittedly it does
> send the docs to the correct shards etc.
>
> But often the bottleneck becomes acquiring the data,
> and there DIH will be the bottleneck since it's a single
> threaded process running on one server.
>
> Here's a starter for using SolrJ:
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> There's some bits with Tika in there, but you should be able
> to remove them pretty easily.
>
> Best,
> Erick
>
> On Tue, Dec 1, 2015 at 12:26 AM, Upayavira <u...@odoko.co.uk> wrote:
> > I've never used DIH in earnest. I'm not sure if/how it works with
> > SolrCloud. There is a ticket somewhere to make a 'standalone' DIH that
> > sits outside of Solr and pushes to it, which would be a much better
> > idea, I think, and would work better with SolrCloud. Others here
> > (perhaps in a separate thread with a clear "DIH and SolrCloud" title)
> > will probably be able to help better.
> >
> > Do you use SolrJ at present? What language do you use to do your
> > interaction with Solr? If you are using Java, you should be using SolrJ.
> > If not, ignore that suggestion.
> >
> > If you aren't using Java, then you have two options for interacting with
> > SolrCloud:
> > * put all of your boxes behind a load balancer and maintain the load
> >   balancer as your network changes
> > * use (or create) a Zookeeper aware client for your language of choice
> >   (e.g. I have a pull request open for pysolr that adds ZK awareness).
> >   With this, you point your client at Zookeeper, not Solr. It works out
> >   the location of the correct Solr to hit based upon the information in
> >   ZK.
> >
> > Upayavira
> >
> > On Tue, Dec 1, 2015, at 06:29 AM, William Bell wrote:
> >> ok.
> >>
> >> What about using DIH handler? Does it index in a SolrCloud setup ? Or how
> >> would I convert a query to use SolrJ ?
> >>
> >> On Mon, Nov 30, 2015 at 5:36 AM, Upayavira <u...@odoko.co.uk> wrote:
> >>
> >> > On Sun, Nov 29, 2015, at 07:38 PM, William Bell wrote:
> >> > > OK. Been using Cores for 4 years. Want to migrate to collections /
> >> > > Cloud.
> >> > >
> >> > > Do we have to change our queries?
> >> > >
> >> > > http://loadbalancer:8983/solr/corename/select?q=*:*
> >> > >
> >> > > What does this become once we have the collection sharded? Do we
> >> > > need a Load Balancer or just point to one box and run the new query?
> >> > > Or would it be better to hit the LB in case one machine is no longer
> >> > > good to go?
> >> > >
> >> > > http://loadbalancer:8983/solr/collectionname/select?q=*:*
> >> > >
> >> > > What features would not yet be ready for sharded setups with
> >> > > SolrCloud? In the past, facet counts were an issue, grouping? stats?
> >> > > as well as IDF for sorting by scores. i.e. facet.field=specialties.
> >> > > We want the Cardiologist specialty to have unique numbers across
> >> > > shards. So if shard1 has 4 people with Cardiology, and shard2 has 2
> >> > > people with Cardiology, we would want the number to be 6. We would
> >> > > want facet.sort to work on counts... I guess we could index another
> >> > > collection for facets and just use 1 machine for that? But doesn't
> >> > > that defeat the purpose?
> >> > >
> >> > > What is the best walk thru for SOLR 5.3.1 ?
> >> > >
> >> > > Looking at https://wiki.apache.org/solr/SolrCloud
> >> >
> >> > 1. Your queries should stay (more or less) the same
> >> > 2. If you name a collection the same as what you are using for a core,
> >> > your base URL will remain the same
> >> > 3. If you use SolrJ, then you would change to CloudSolrClient, which
> >> > would feel quite different, but the SolrQuery objects should be
> >> > interchangeable
> >> > 4. If you use SolrJ, then you don't need a load balancer - SolrJ will
> >> > do round robin against the Solr nodes for that collection. It will
> >> > respond to failures far faster than an LB ever could (I've seen downed
> >> > machines pulled in <200ms)
> >> > 5. Regarding sharded setups, there's two scenarios to consider -
> >> > distributed in general, and solrcloud in particular. Every search
> >> > component must be enabled for distributed search (faceting,
> >> > highlighting, grouping, etc, etc). Some of the newer ones may not have
> >> > had distributed support implemented yet. Others, such as Joining, will
> >> > require particular concern, and will work in only a subset of
> >> > conditions.
> >> > 6. For IDF, mostly, IDF balances itself across the shards. If it
> >> > doesn't, then distributed IDF is available, but that has a cost in
> >> > terms of additional network traffic.
> >> > 7. Faceting should work just fine (as you describe) across shards. I
> >> > would check specifically on newer faceting features though before
> >> > assuming anything.
> >> > 8. facet.sort+counts, have you tried it?
> >> > 9. I would consider this to be a more up-to-date place to go:
> >> > https://cwiki.apache.org/confluence/display/solr/SolrCloud
> >> >
> >> > Upayavira
> >>
> >> --
> >> Bill Bell
> >> billnb...@gmail.com
> >> cell 720-256-8076

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
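
P.S. On the facet question in the quoted thread (4 people with Cardiology on shard1, 2 on shard2, wanting 6, sorted by count): the arithmetic can be sketched in a few lines of Python. This is only an illustration of summing per-shard counts and sorting the totals, not Solr's actual implementation. The function name `merge_facet_counts` and the "Dermatology" value are made up for the example.

```python
from collections import Counter


def merge_facet_counts(per_shard_counts):
    """Sum facet counts per value across shards, then sort by count (descending)."""
    total = Counter()
    for shard_counts in per_shard_counts:
        # Counter.update adds to existing counts instead of replacing them
        total.update(shard_counts)
    # Highest count first; ties are broken by value name so the order is stable
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))


# Cardiology counts come from the thread; Dermatology is a hypothetical value
shard1 = {"Cardiology": 4, "Dermatology": 1}
shard2 = {"Cardiology": 2, "Dermatology": 3}

merged = merge_facet_counts([shard1, shard2])
print(merged)
```

With these inputs Cardiology comes out as 6, which is the number the original question asks for.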