Re: Loading lineshape data into Solr
Anyone?

On 15-04-29 09:07 PM, Arthur Zubarev wrote:
> [...]
Loading lineshape data into Solr
Hi Solr community,

My immediate task at hand is to load lineshape data into Solr (the lineshape data is a set of points on a curve, in the form of latitude/longitude coordinates). The data sits in a SQL Server 2012 table. Extracting the data to a flat file is not an option, because the geometry column comes out as binary (not human-readable). The other columns hold streets, points of interest, etc. The end result of the undertaking would be a query to Solr that locates an address by latitude/longitude.

Any hints/tips are welcome! Thank you!

Regards,

Arthur
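Since SQL Server's geometry/geography types can be serialized to WKT with .STAsText(), one route is to have the DataImportHandler pull the shapes as WKT text and index them into an RPT spatial field. A minimal, untested sketch follows; the table, column, and connection details (dbo.Roads, shape, geo) are hypothetical:

<!-- data-config.xml (sketch): pull the shape as WKT text rather than binary -->
<dataConfig>
  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=gis"
              user="solr" password="..."/>
  <document>
    <entity name="roads"
            query="SELECT id, street, poi, shape.STAsText() AS shape_wkt FROM dbo.Roads">
      <field column="id" name="id"/>
      <field column="street" name="street"/>
      <field column="poi" name="poi"/>
      <field column="shape_wkt" name="geo"/>
    </entity>
  </document>
</dataConfig>

<!-- schema.xml (sketch): an RPT field that accepts WKT; non-point shapes
     such as LINESTRING need the JTS jar on the classpath. Attribute names
     vary somewhat across Solr 4.x/5.x. -->
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
           spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
           geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees"/>
<field name="geo" type="location_rpt" indexed="true" stored="true"/>

A lat/long lookup then becomes a spatial filter, e.g. fq=geo:"Intersects(POINT(-79.38 43.65))" (note that WKT uses lon lat order).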
Architectural advice & questions on using Solr XML DataImportHandlers (and Nutch) for a vertical search engine
Please bear with me here, I'm pretty new to Solr, with most of my DB experience being of the relational variety. I'm planning a new project which I believe Solr (and Nutch) will solve well. Although I've installed Solr 5.2 and Nutch 1.10 (on CentOS) and tinkered about a bit, I'd be grateful for advice and tips regarding my plan.

I'm looking to build a vertical search engine to cover a very specific and narrow dataset. Sources will number in the hundreds and will mostly be managed by hand; they will be a mixture of forums and product-based e-commerce sites. For some of these I was hoping to leverage the Solr DataImportHandler system with their RSS feeds, primarily for the ease of acquiring clean, reasonably sanitised and well-structured data. For the rest, I'm going to fall back to Nutch crawling them, with some heavy regulation of URLs via regex. So to sum up: a Solr DB populated through a couple of different routes, then searched via some custom user-facing PHP webpages. Finally, a cron job script would delete any docs older than X weeks, to keep on top of data retention. Does that sound sensible at all?

Regarding RSS feeds: many only provide a limited number of recent items, but I'd like to retain items for many weeks. I've already discovered the clean=false param on DataImport, after wondering why old RSS items vanished!

Question 1) Is there an easy way to filter which items to import in the URLDataSource entity? Or is it best to go down the route of XSLT preprocessing?

Question 2) Multiple URLDataSources: reference them all in one DataImport handler, or have multiple DataImport handlers?

What's the best approach to supplement imported data with additional static fields/keywords associated with the source feed or crawled site? E.g. all docs from sites A, B & C are of subcategory Foo. I'm guessing with RSS feeds this would be straightforward via the XSLT preprocessor, but for Nutch-submitted docs I've no idea.

Scheduling imports: do people just cron up a curl POST request (or a shell execution of the Nutch crawl script)? Or is there a more elegant solution available?

Any other, more general tips and advice on the above greatly appreciated.

--
Arthur Yarwood
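On questions 1 and 2, a minimal, untested sketch of a single DIH config covering several feeds; the feed URLs, the XSLT file name, and the subcategory value are all hypothetical. One handler can host several entities under one document; per-source static fields can be stamped on with TemplateTransformer, and item filtering can be pushed into an XSLT preprocessor via the xsl attribute of XPathEntityProcessor:

<!-- rss-data-config.xml (sketch; hypothetical names throughout) -->
<dataConfig>
  <dataSource type="URLDataSource"/>
  <document>
    <!-- Site A: the XSLT both drops unwanted items and emits Solr add/doc
         XML (hence useSolrAddSchema); the stylesheet can also inject a
         literal subcategory field for this source. -->
    <entity name="siteA" processor="XPathEntityProcessor"
            url="http://siteA.example.com/feed.rss"
            useSolrAddSchema="true"
            xsl="xslt/siteA-filter.xsl"/>
    <!-- Site B: plain XPath extraction, with a static subcategory added
         via TemplateTransformer. -->
    <entity name="siteB" processor="XPathEntityProcessor"
            url="http://siteB.example.com/rss"
            forEach="/rss/channel/item"
            transformer="TemplateTransformer">
      <field column="title" xpath="/rss/channel/item/title"/>
      <field column="link" xpath="/rss/channel/item/link"/>
      <field column="subcategory" template="Foo"/>
    </entity>
  </document>
</dataConfig>

On scheduling and retention, cron plus curl is the usual low-tech answer; something like the following (the core name and date field are made up):

# import hourly, keeping previously imported docs (clean=false)
0 * * * * curl -s 'http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false'
# weekly retention pass: delete docs older than 8 weeks by a pubdate field
0 3 * * 0 curl -s 'http://localhost:8983/solr/mycore/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<delete><query>pubdate:[* TO NOW-56DAYS]</query></delete>'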
Stuck on SEVERE: Error filterStart
I am trying Solr for the first time, and I am stuck at the error "SEVERE: Error filterStart".

My setup:
- CentOS 6.x
- OpenJDK 1.7
- Tomcat 7

From reading [1] I believe the issue is missing JAR files, but I have no idea where to put them; even the wiki is a bit vague on that.

Lib directories that I am aware of:
- /usr/share/tomcat/lib (for Tomcat)
- /opt/solr/example/solr/collection1/lib (for my instance)

This is the error I get:

Apr 15, 2014 11:35:36 PM org.apache.catalina.core.StandardContext filterStart
SEVERE: Exception starting filter SolrRequestFilter
java.lang.NoClassDefFoundError: Failed to initialize Apache Solr: Could not find necessary SLF4j logging jars. If using Jetty, the SLF4j logging jars need to go in the jetty lib/ext directory. For other containers, the corresponding directory should be used. For more information, see: http://wiki.apache.org/solr/SolrLogging
        at org.apache.solr.servlet.CheckLoggingConfiguration.check(CheckLoggingConfiguration.java:28)
        at org.apache.solr.servlet.BaseSolrFilter.<clinit>(BaseSolrFilter.java:31)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at java.lang.Class.newInstance(Class.java:374)
        at org.apache.catalina.core.DefaultInstanceManager.newInstance(DefaultInstanceManager.java:134)
        at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:256)
        at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
        at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:103)
        at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4650)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5306)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:633)
        at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:657)
        at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1637)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

I would like to get past this so I can try out Solr.

I have gone as far as putting `<lib dir="/opt/solr/example/solr/collection1/lib/" regex="*\.jar" />` into /opt/solr/example/solr/collection1/conf/solrconfig.xml but that did not help.

I have used Java before, but purely for academic purposes, so I do not have experience resolving these dependencies.

[1] https://wiki.apache.org/solr/SolrLogging
Re: Stuck on SEVERE: Error filterStart
Thank you. I had evidently misunderstood where it needed to be copied to. That helped, though that directory already contained all but one file.

On Wed, Apr 16, 2014 at 12:45 PM, David Santamauro <david.santama...@gmail.com> wrote:
> You need to copy <solr home>/example/lib/ext/*.jar into your tomcat lib
> directory (/usr/share/tomcat/lib)
>
> Also make sure a /usr/share/tomcat/conf/log4j.properties is there as well.
>
> ... then restart.
>
> HTH
>
> David
>
> On 4/16/2014 11:47 AM, Arthur Pemberton wrote:
>> [...]
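For anyone else hitting this, the fix above boils down to roughly the following; the paths assume the /opt/solr layout and Solr 4.x example directory mentioned earlier, so adjust to your install:

# copy the SLF4j/log4j jars shipped with Solr's example into Tomcat's lib
cp /opt/solr/example/lib/ext/*.jar /usr/share/tomcat/lib/
# give Tomcat a log4j configuration it can find
cp /opt/solr/example/resources/log4j.properties /usr/share/tomcat/conf/
# restart so SolrRequestFilter can initialize
service tomcat restart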
Rerank for distributed requests
Hi. We're using LTR, and after switching to multiple shards we found that reranking happens on individual shards, and during the merge phase the first-pass score isn't used. Currently our LTR model doesn't use textual match and assumes that reranked documents are already more or less good in terms of textual score, which is not always the case when documents are distributed across shards.

To avoid this I've tried to use a sort by a function that replicates the actual query, and the results I get are somewhat interesting: on individual shards the first pass happens by my sorting, then documents are reranked, and during the merge documents from the same shard are compared by "orderInShard" while documents from different shards are compared by the sort value, so that the final order follows neither the sort value nor the score.

For example, let's assume that the documents coming from shard 1 are:

doc1(first_pass_score = 1, second_pass_score = 2)
doc2(first_pass_score = 4, second_pass_score = 1)

and the documents coming from shard 2 are:

doc4(first_pass_score = 3, second_pass_score = 4)
doc3(first_pass_score = 2, second_pass_score = 3)

where first_pass_score is doc.sort_values[0] and second_pass_score is doc.score. When we try to merge all documents, this happens:

queue.insertWithOverflow(doc1)
queue.insertWithOverflow(doc2)
queue.lessThan(doc1, doc2) -> false (doc1.orderInShard = 1 < doc2.orderInShard = 2)
queue.insertWithOverflow(doc4)
queue.lessThan(doc2, doc4) -> false (doc2.first_pass_score = 4 > doc4.first_pass_score = 3)
queue.insertWithOverflow(doc3)
queue.lessThan(doc4, doc3) -> false (doc4.orderInShard = 1 < doc3.orderInShard = 2)

and the final document order will be:

doc1(first_pass_score = 1, second_pass_score = 2)
doc2(first_pass_score = 4, second_pass_score = 1)
doc4(first_pass_score = 3, second_pass_score = 4)
doc3(first_pass_score = 2, second_pass_score = 3)

Ideally I would want to see reranking happen based on a global order across all shards. I've implemented a custom component that asks shards to return Math.max(reRankDocs, offset + rows) documents, which are first sorted by first-pass score, and then only the top reRankDocs are sorted by second-pass score. I understand that this might not be the best way in terms of performance (we rerank only the top 60 documents, so it's not that big of a deal), but it's functionally equivalent to the single-shard behavior.

I'm curious whether the current behavior is intended. Typically I would expect either something like what I described above, or at least ignoring the sort during the merge and using only the doc.score generated by the LTR rescorer. Maybe the community would be interested in the approach I've implemented? Or is it considered bad design to rely on the first-pass score, and should our LTR model instead use fields from the first pass / use OriginalScoreFeature?
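For concreteness, a rough Java sketch of the merge the custom component performs; this is illustrative only, with made-up names, not the actual Solr internals:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GlobalRerankSketch {

    // minimal stand-in for a shard hit: id plus both scores
    record ShardDoc(String id, float firstPassScore, float secondPassScore) {}

    // Merge candidates from all shards: order globally by first-pass score,
    // rerank only the global top reRankDocs by the second-pass (LTR) score,
    // then page. Assumes each shard returned at least
    // Math.max(reRankDocs, offset + rows) documents.
    static List<ShardDoc> mergeAndRerank(List<ShardDoc> fromAllShards,
                                         int reRankDocs, int offset, int rows) {
        List<ShardDoc> all = new ArrayList<>(fromAllShards);
        all.sort(Comparator.comparingDouble(ShardDoc::firstPassScore).reversed());

        int n = Math.min(reRankDocs, all.size());
        List<ShardDoc> head = new ArrayList<>(all.subList(0, n));
        head.sort(Comparator.comparingDouble(ShardDoc::secondPassScore).reversed());

        List<ShardDoc> merged = new ArrayList<>(head);
        merged.addAll(all.subList(n, all.size()));
        return merged.subList(Math.min(offset, merged.size()),
                              Math.min(offset + rows, merged.size()));
    }
}

On the four-document example above, with reRankDocs covering all of them, this returns doc4, doc3, doc1, doc2, i.e. pure second-pass order, matching the single-shard behavior.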