Solr packages in Apache BigTop.
Hi Solr. I work on the apache bigtop project, and am interested in integrating it deeper with Solr, for example for testing spark / solr integration cases. Is anyone in the Solr community interested in collaborating on testing releases with us and maintaining the Solr packaging in bigtop (with our help, of course)? The advantage here is that we can combine efforts: when new SOLR releases come out, we can test them in bigtop to guarantee that there are rpm/deb packages which work well with the hadoop ecosystem. For those that don't know, bigtop is the upstream apache bigdata packaging project: we build hadoop, spark, solr, hbase and so on in rpm/deb format, and supply puppet provisioners along with vagrant recipes for testing. -- jay vyas
Solr on S3FileSystem, Kosmos, GlusterFS, etc….
Hi folks. Does anyone deploy solr indices on other HCFS implementations (S3FileSystem, for example) regularly? If so, I'm wondering: 1) Where are the docs for doing this - or examples? It seems like everything, including the parameter names for dfs setup, is based around "hdfs". Maybe I should file a JIRA similar to https://issues.apache.org/jira/browse/FLUME-2410 (to make the generic deployment of SOLR on any file system explicit / obvious). 2) Whether there are any interesting requirements (e.g. createNonRecursive, atomic mkdirs, sharing, blocking expectations, etc.) which need to be implemented.
Re: Solr on S3FileSystem, Kosmos, GlusterFS, etc….
Hi Solr! I got this working. Here's how:

With the example jetty runner, you can extract the tarball and go to the examples/ directory, where you can launch an embedded core. Then, find the solrconfig.xml file. Edit it to contain the following xml:

    <directoryFactory name="DirectoryFactory" class="org.apache.solr.core.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">myhcfs:///solr</str>
      <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
    </directoryFactory>

The confdir is important: that is where you will have something like a core-site.xml that defines all the parameters for your filesystem (fs.defaultFS, fs.mycfs.impl… and so on; a rough sketch of that file follows at the end of this message). This tells solr, when launched, to use myhcfs as the underlying file store.

You should also make sure that the jar for your plugin (in our case glusterfs; hadoop will find it by looking up the parameters derived from the base uri "myhcfs") is on the class path, and that the hadoop-common jar is there as well (some HCFS shims need FilterFileSystem to run correctly, which is only in hadoop-common.jar).

So - how to modify the running solr core's class path? You can update the solrconfig.xml jar directives: there are a bunch of regular expression templates you can modify in the examples/.../solrconfig.xml file. You can also copy the jars in at runtime, to be really safe.

Once your example core with the gluster configuration is set up, launch it with the following properties:

    java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.data.dir=glusterfs:///solr -Dsolr.updatelog=glusterfs:///solr -Dlog4j.configuration=file:/opt/solr-4.4.0-cdh5.0.2/example/etc/logging.properties -jar start.jar

This starts a basic SOLR server on port 8983. If you are running from the simple jetty based examples I've used to describe this above, you should see the collection1 core up and running, and its index sitting inside the /solr directory of your file system.

Hope this helps those interested in expanding the use of SolrCloud outside of a single FS.

On Jun 23, 2014, at 6:16 PM, Jay Vyas wrote: > Hi folks. Does anyone deploy solr indices on other HCFS implementations > (S3FileSystem, for example) regularly ? If so I'm wondering > > 1) Where are the docs for doing this - or examples? Seems like everything, > including parameter names for dfs setup, are based around "hdfs". Maybe I > should file a JIRA similar to > https://issues.apache.org/jira/browse/FLUME-2410 (to make the generic > deployment of SOLR on any file system explicit / obvious). > > 2) if there are any interesting requirements (i.e. createNonRecursive, Atomic > mkdirs, sharing, blocking expectations etc etc) which need to be implemented
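For anyone trying to reproduce this with a different filesystem, here is a rough sketch of the two pieces referenced above - the core-site.xml that lives in the confdir, and the jar directives in solrconfig.xml. The "myhcfs" scheme, the implementation class name, and the jar paths below are made-up placeholders; substitute whatever your HCFS plugin actually provides:

    <!-- /etc/hadoop/conf/core-site.xml (sketch): wires the custom scheme to its FileSystem implementation -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>myhcfs:///</value>
      </property>
      <property>
        <!-- hadoop resolves fs.<scheme>.impl to find the plugin class for "myhcfs://" uris -->
        <name>fs.myhcfs.impl</name>
        <value>org.example.fs.MyHcfsFileSystem</value>
      </property>
    </configuration>

    <!-- solrconfig.xml (sketch): lib directives that put the plugin jar and hadoop-common on the core's class path -->
    <lib dir="/usr/lib/hadoop/" regex="hadoop-common-.*\.jar" />
    <lib dir="/usr/lib/myhcfs/" regex="myhcfs-hadoop-.*\.jar" />

With those two files in place, the same java -D... -jar start.jar command shown above should work with the "myhcfs" scheme in solr.data.dir instead of glusterfs.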
Re: Solr on S3FileSystem, Kosmos, GlusterFS, etc….
Hi Paul. I'm not using it on S3 -- but yes, I don't think S3 would be ideal for Solr at all. There are several other Hadoop Compatible File Systems, however, some of which might be ideal for certain types of SolrCloud workloads. Anyways... I would love to see a Solr wiki page on FileSystem compatibility, possibly with an entry linking here: https://wiki.apache.org/hadoop/HCFS. In the meantime, I will update this thread if I find anything interesting when we increase load size. On Wed, Jun 25, 2014 at 1:34 AM, Paul Libbrecht wrote: > I've always been under the impression that file-system-access-speed is > crucial for Lucene-based storage and have always advocated to not use NFS > for that (for which we had slowness of a factor of 5 approximately). Has > there any performance measurement made for such a setting? Is FS-caching > suddenly getting so much better that it is not a problem. > > Also, as far as I know S3 bills by the amount of (giga-)bytes exchanged…. > this gives plenty of room but if each starts needs to exchange a big part > of the index from the storage to the solr server because of cache filling, > it looks like it won't be that cheap. > > thanks for experience report. > > paul > > > On 25 juin 2014, at 07:16, Jay Vyas wrote: > > > Hi Solr ! > > > > I got this working . Here's how : > > > > With the example jetty runner, you can Extract the tarball, and go to > the examples/ directory, where you can launch an embedded core. Then, find > the solrconfig.xml file. Edit it to contain the following xml: > > > > class="org.apache.solr.core.HdfsDirectoryFactory"> > > myhcfs:///solr > > /etc/hadoop/conf > > > > > > the confdir is important: That is where you will have something like a > core-site.xml that defines all the parameters for your filesystem > (fs.defaultFS, fs.mycfs.impl…. and so on). > > > > > > This tells solr, when launched, to use myhcfs as the underlying file > store. > > > > You also should make sure that the jar for your plugin (in our case > glusterfs, but hadoop will reference it by looking up the dynamically > generated parameters that come from the base uri "myhcfs"… classes are on > the class path, and the hadoop-common jar is also there (Some HCFS shims > will need FilterFileSystem to run correctly, which is only in > hadoop-common.jar). > > > > So - how to modify the running solr core's class path? > > > > To do so – you can update the solrconfig.xml jar directives. There are a > bunch of regular expression templates you can modify in the > examples/.../solrconfig.xml file. You can also copy the jars in at runtime, > to be really safe. > > > > Once your example core with gluster configuration is setup, launch it > with the following properties: > > > > java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs > -Dsolr.data.dir=glusterfs:///solr -Dsolr.updatelog=glusterfs:///solr > -Dlog4j.configuration=file:/opt/solr-4.4.0-cdh5.0.2/example/etc/logging.properties > -jar start.jar > > > > This starts a basic SOLR server on port 8983. > > > > If you are running from the simple jetty based examples which I've used > to describe this above, then you should see the collection1 core up and > running, and you should see its index sitting inside the /solr directory of > your file system. > > > > Hope this helps those interested in expanding the use of SolrCloud > outside of a single FS. > > > > > > On Jun 23, 2014, at 6:16 PM, Jay Vyas > wrote: > > > >> Hi folks. Does anyone deploy solr indices on other HCFS > implementations (S3FileSystem, for example) regularly ? 
If so I'm wondering > >> > >> 1) Where are the docs for doing this - or examples? Seems like > everything, including parameter names for dfs setup, are based around > "hdfs". Maybe I should file a JIRA similar to > https://issues.apache.org/jira/browse/FLUME-2410 (to make the generic > deployment of SOLR on any file system explicit / obvious). > >> > >> 2) if there are any interesting requirements (i.e. createNonRecursive, > Atomic mkdirs, sharing, blocking expectations etc etc) which need to be > implemented > > > > -- jay vyas
Re: Integrating solr with Hadoop
Minor clarification: the storage of indices uses the Hadoop FileSystem API - not HDFS specifically - so the connection is actually not to HDFS as such... Solr can distribute indices for failover / reliability / scaling to any HCFS-compliant filesystem. > On Jun 30, 2014, at 11:55 AM, Erick Erickson wrote: > > Whoa! You're confusing a couple of things I think. > > The only real connection Solr <-> Hadoop _may_ > be that Solr can have its indexes stored on HDFS. > Well, you can also create map/reduce jobs that > will index the data via M/R and merge them > into a live index in Solr (assuming it's storing its > indexes there). > > But this question is very confused: > "Is this a better option for large data or better > to go ahead with tomcat or jetty server with solr." > > No matter what, you're still running Solr > in a tomcat or Jetty server. Hadoop has > nothing to do with that. Except, as I mentioned > earlier, the actual index _may_ be stored > on HDFS if you select the right directory > implementation in your solrconfig.xml file. > > So we need a better statement of what you're > trying to accomplish before anyone can say > much useful here. > > Best, > Erick > >> On Mon, Jun 30, 2014 at 2:19 AM, gurunath wrote: >> Hi, >> >> I want to setup solr in production, Initially the data set i am using is of >> small scale, the size of data will grow gradually. I have heard about using >> "*Big Data Work for Hadoop and Solr*", Is this a better option for large >> data or better to go ahead with tomcat or jetty server with solr. >> >> Thanks >> >> >> >> >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/Integrating-solr-with-Hadoop-tp4144715.html >> Sent from the Solr - User mailing list archive at Nabble.com.
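For concreteness, this is roughly what "selecting the right directory implementation in your solrconfig.xml" looks like - a minimal sketch against plain HDFS. The namenode host/port and the block cache setting are illustrative only; check the Solr HDFS documentation for your version:

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <!-- where the index lives; any HCFS scheme works here if the Hadoop config in confdir can resolve it -->
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
      <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
    </directoryFactory>

You would also switch the lock type to hdfs (e.g. -Dsolr.lock.type=hdfs, or hard-code it in <indexConfig>), since native file locks do not apply on a distributed filesystem.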
Re: [ANN] SIREn, a Lucene/Solr plugin for rich JSON data search
Querying nested data is very difficult in any modern DB that I have seen. If it works as you suggest, it would be cool to see the feature eventually maintained inside Solr. > On Jul 23, 2014, at 7:13 AM, Renaud Delbru wrote: > > One of the coolest features of Lucene/Solr is its ability to index nested > documents using a Blockjoin approach. > > While this works well for small documents and document collections, it > becomes unsustainable for larger ones: Blockjoin works by splitting the > original document in many documents, one per nested record. > > For example, a single USPTO patent (XML format converted to JSON) will end up > being over 1500 documents in the index. This has massive implications on > performance and scalability. > > Introducing SIREn > > SIREn is an open source plugin for Solr for indexing and searching rich > nested JSON data. > > SIREn uses a sophisticated "tree indexing" design which ensures that the > index is not artificially inflated. This ensures that querying on many types > of nested queries can be up to 3x faster. Further, depending on the data, > memory requirements for faceting can be up to 10x higher. As such, SIREn > allows you to use Solr for larger and more complex datasets, especially so > for sophisticated analytics. (You can read our whitepaper to find out more > [1]) > > SIREn is also truly schemaless - it even allows you to change the type of a > property between documents without being restricted by a defined mapping. > This can be very useful for data integration scenarios where data is > described in different ways in different sources. > > You only need a few minutes to download and try SIREn [2]. It comes with a > detailed manual [3] and you have access to the code on GitHub [4]. > > We look forward to hear about your feedbacks. > > [1] > http://siren.solutions/siren/resources/whitepapers/comparing-siren-1-2-and-lucenes-blockjoin-performance-a-uspto-patent-search-scenario/ > [2] http://siren.solutions/siren/downloads/ > [3] http://siren.solutions/manual/preface.html > [4] https://github.com/sindicetech/siren > -- > Renaud Delbru > CTO > SIREn Solutions
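For readers who haven't used the block-join approach the announcement compares against, here is a rough sketch of what it looks like in plain Solr. The field names (is_parent, claim_text) and ids are made up for illustration; the point is that every nested record becomes its own document in the index, which is the inflation SIREn claims to avoid:

    <add>
      <doc>
        <field name="id">patent-1</field>
        <field name="is_parent">true</field>
        <field name="title">Example patent title</field>
        <!-- each nested record is indexed as a separate child document in the same block -->
        <doc>
          <field name="id">patent-1-claim-1</field>
          <field name="claim_text">text of the first claim</field>
        </doc>
        <doc>
          <field name="id">patent-1-claim-2</field>
          <field name="claim_text">text of the second claim</field>
        </doc>
      </doc>
    </add>

Parents matching on child fields are then retrieved with the block join parent query parser, e.g. q={!parent which="is_parent:true"}claim_text:widget. A patent with over a thousand nested records therefore really does turn into over a thousand Lucene documents, which is the scalability concern described above.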