Re: Loading lineshape data into Solr

2015-05-02 Thread Arthur

Anyone?


Loading lineshape data into Solr

2015-04-29 Thread Arthur Zubarev
Hi Solr community,
My immediate task at hand is to load lineshape data into Solr (the lineshape 
data is a set of points on a curve, in the form of lat. + long. coordinates).
The data sits in a SQL Server 2012 table. Extracting the data to a flat file is 
not an option, as the shape data comes out as binary (not readable).

The other columns hold streets, points of interest, etc.
The end result of the undertaking would be a query to Solr that locates an address 
based on lat + long.
Any hints/tips are welcome!
Thank you!
Regards,
Arthur
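
One possible route (a sketch only, untested; the table, column, core and field
names below are invented placeholders): have SQL Server hand over WKT text via
STAsText() instead of raw binary, and index it into an RPT spatial field, which
accepts LINESTRING WKT when backed by the JTS library.

<!-- data-config.xml: pull the shape out of SQL Server as WKT, not binary -->
<entity name="shapes" dataSource="mssql"
        query="SELECT id, street, poi, geom.STAsText() AS shape_wkt FROM dbo.lineshapes">
  <field column="shape_wkt" name="shape"/>
</entity>

<!-- schema.xml: RPT spatial type; assumes the JTS jar is on the classpath -->
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
           spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
           geo="true" distErrPct="0.025" maxDistErr="0.001" distanceUnits="kilometers"/>
<field name="shape" type="location_rpt" indexed="true" stored="true"/>

<!-- example filter: shapes within ~100 m of a lat/long point -->
<!-- fq={!geofilt sfield=shape pt=43.6426,-79.3871 d=0.1} -->

The geofilt query at the end is the lat+long lookup side: given a point, it
returns docs whose indexed shape falls within distance d (in km).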




Architectural advice & questions on using Solr XML DataImport Handlers (and Nutch) for a Vertical Search engine.

2015-06-29 Thread Arthur Yarwood
Please bear with me here, I'm pretty new to Solr, with most of my DB 
experience being of the relational variety. I'm planning a new project 
which I believe Solr (and Nutch) will solve well. Although I've 
installed Solr 5.2 and Nutch 1.10 (on CentOS) and tinkered about a bit, 
I'd be grateful for advice and tips regarding my plan.


I'm looking to build a vertical search engine to cover a very specific 
and narrow dataset. Sources will number in the hundreds and will mostly 
be managed by hand; they will be a mixture of forums and product-based 
e-commerce sites. For some of these I was hoping to leverage the Solr 
DataImportHandler system with their RSS feeds, primarily for the ease of 
acquiring clean, reasonably sanitised and well-structured data. For the 
rest, I'm going to fall back to Nutch crawling them, with the URLs 
heavily regulated via regex. So to sum up: a Solr DB populated in a 
couple of different ways, then searched via some custom user-facing PHP 
webpages. Finally, a cron job script would delete any docs older than X 
weeks, to keep on top of data retention.


Does that sound sensible at all?

Regarding RSS feeds:-
Many only provide a limited number of recent items, however I'd like to 
retain items for many weeks. I've already discovered the clean=false 
param on DataImport, after wondering why old RSS items vanished!
Question 1) is there an easy way to filter which items to import in the 
URLDataSource entity? Or is it best to go down the route of XSLT 
preprocessing?
Question 2) Multiple URLDataSources: reference them all in one DataImport 
handler, or have multiple DataImport handlers?
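
A data-config sketch covering both questions (untested; the feed URLs, XSL
paths and field names are made-up). If I read the DIH docs right,
XPathEntityProcessor takes an optional xsl attribute, so per-feed XSLT
preprocessing can do the filtering (Question 1), and several feed entities
can sit under one document in a single handler (Question 2):

<dataConfig>
  <dataSource type="URLDataSource"/>
  <document>
    <!-- one entity per feed; all run from the same /dataimport handler -->
    <entity name="feedA"
            processor="XPathEntityProcessor"
            url="http://site-a.example/rss.xml"
            forEach="/rss/channel/item"
            xsl="xslt/filter-feed-a.xsl">
      <!-- per the DIH wiki, the XSLT output is then parsed with forEach/xpath -->
      <field column="id"    xpath="/rss/channel/item/guid"/>
      <field column="title" xpath="/rss/channel/item/title"/>
      <field column="url"   xpath="/rss/channel/item/link"/>
    </entity>
    <entity name="feedB"
            processor="XPathEntityProcessor"
            url="http://site-b.example/rss.xml"
            forEach="/rss/channel/item">
      <field column="id"    xpath="/rss/channel/item/guid"/>
      <field column="title" xpath="/rss/channel/item/title"/>
    </entity>
  </document>
</dataConfig>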


What's the best approach to supplement imported data with additional 
static fields/keywords associated with the source feed or crawled 
site? e.g. all docs from sites A, B & C are of subcategory Foo. I'm 
guessing with RSS feeds this would be straightforward via the XSLT 
preprocessor. But for Nutch-submitted docs I've no idea.
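
Two sketches for that, both hedged: on the DIH side, TemplateTransformer can
stamp a constant field onto everything an entity emits; on the Nutch side, my
understanding is that the index-static plugin lets nutch-site.xml carry static
field:value pairs (plugin and property names as I recall them, worth verifying):

<!-- data-config.xml: constant subcategory on every doc from this feed -->
<entity name="feedA" processor="XPathEntityProcessor"
        url="http://site-a.example/rss.xml" forEach="/rss/channel/item"
        transformer="TemplateTransformer">
  <field column="subcategory" template="Foo"/>
</entity>

<!-- nutch-site.xml: assumes index-static is listed in plugin.includes -->
<property>
  <name>index.static</name>
  <value>subcategory:Foo</value>
</property>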


Scheduling imports: do people just cron up a curl post request (or a shell 
execution of the Nutch crawl script)? Or is there a more elegant solution 
available?
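
As far as I know, cron plus curl is indeed the usual pattern; a crontab sketch
(core name, paths, schedules and the retention query are placeholders, and the
bin/crawl arguments vary by Nutch version):

# hourly RSS import via DIH, keeping old docs (clean=false)
0 * * * * curl -s 'http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false'
# nightly Nutch crawl, indexing into the same core (args per Nutch 1.10 bin/crawl)
30 2 * * * /opt/nutch/bin/crawl /opt/nutch/urls /opt/nutch/crawl http://localhost:8983/solr/mycore 2
# weekly retention purge; assumes a timestamp field with default="NOW" in the schema
0 4 * * 0 curl -s 'http://localhost:8983/solr/mycore/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<delete><query>timestamp:[* TO NOW-8WEEKS]</query></delete>'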


Any other more general tips and advice on the above greatly appreciated.

--
Arthur Yarwood


Stuck on SEVERE: Error filterStart

2014-04-16 Thread Arthur Pemberton
I am trying Solr for the first time, and I am stuck at the error "SEVERE:
Error filterStart"

My setup:
 - Centos 6.x
 - OpenJDK 1.7
 - Tomcat 7

From reading [1] I believe the issue is missing JAR files, but I have no
idea where to put them; even the wiki is a bit vague on that.

Lib directories that I am aware of
 - /usr/share/tomcat/lib (for tomcat)
 - /opt/solr/example/solr/collection1/lib (for my instance)


This is the error I get:

Apr 15, 2014 11:35:36 PM org.apache.catalina.core.StandardContext filterStart
SEVERE: Exception starting filter SolrRequestFilter
java.lang.NoClassDefFoundError: Failed to initialize Apache Solr: Could not
find necessary SLF4j logging jars. If using Jetty, the SLF4j logging jars
need to go in the jetty lib/ext directory. For other containers, the
corresponding directory should be used. For more information, see:
http://wiki.apache.org/solr/SolrLogging
	at org.apache.solr.servlet.CheckLoggingConfiguration.check(CheckLoggingConfiguration.java:28)
	at org.apache.solr.servlet.BaseSolrFilter.<init>(BaseSolrFilter.java:31)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at java.lang.Class.newInstance(Class.java:374)
	at org.apache.catalina.core.DefaultInstanceManager.newInstance(DefaultInstanceManager.java:134)
	at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:256)
	at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
	at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:103)
	at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4650)
	at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5306)
	at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
	at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
	at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
	at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:633)
	at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:657)
	at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1637)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

I would like to get past this so I can try out Solr.

I have gone as far as putting `<lib dir="/opt/solr/example/solr/collection1/lib/"
regex="*\.jar" />` into /opt/solr/example/solr/collection1/conf/solrconfig.xml but
that did not help.

I have used Java before, but purely for academic purposes, so I do not have
experience resolving these dependencies.



[1] https://wiki.apache.org/solr/SolrLogging


Re: Stuck on SEVERE: Error filterStart

2014-04-16 Thread Arthur Pemberton
Thank you. I had evidently misunderstood where it needed to be copied to.
That helped, though that directory already contained all but one file.


On Wed, Apr 16, 2014 at 12:45 PM, David Santamauro <
david.santama...@gmail.com> wrote:

>
> You need to copy <solr>/example/lib/ext/*.jar into your tomcat lib
> directory (/usr/share/tomcat/lib)
>
> Also make sure a /usr/share/tomcat/conf/log4j.properties is there as well.
>
> ... then restart.
>
> HTH
>
> David
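
A shell sketch of the fix described above (paths taken from this thread; a
stock Solr 4.x example layout is assumed, so verify them locally):

# copy the SLF4j/log4j jars the error message asks for into Tomcat's lib
cp /opt/solr/example/lib/ext/*.jar /usr/share/tomcat/lib/
# give log4j a config; the Solr example ships one under example/resources
cp /opt/solr/example/resources/log4j.properties /usr/share/tomcat/conf/
# ... then restart Tomcat (service name may differ, e.g. tomcat6/tomcat7)
service tomcat restart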


-- 
Fedora 13
(www.pembo13.com)


Rerank for distributed requests

2020-07-21 Thread Arthur Gavlyukovskiy
Hi.

We're using LTR, and after switching to multiple shards we found that the
rerank happens on the individual shards, and during the merge phase the first
pass score isn't used. Currently our LTR model doesn't use textual match and
assumes that reranked documents are already more or less good in terms of
textual score, which is not always the case when documents are distributed
across shards.
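
For context, the shape of the request involved; the model name and reRankDocs
value here are placeholders:

q=running shoes&rq={!ltr model=myModel reRankDocs=60}&fl=id,score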

To avoid this I've tried to use a sort by a function that replicates the actual
query, and the results I get are somewhat interesting: on an individual shard
the first pass happens by my sort, then documents are reranked; during the
merge, documents from the same shard are compared by "orderInShard" and
documents from different shards by the sort value, so the final order follows
neither the sort value nor the score.
For example, let's assume that the documents coming from shard 1 are:
doc1(first_pass_score = 1, second_pass_score = 2)
doc2(first_pass_score = 4, second_pass_score = 1)
and documents coming from shard 2 are:
doc4(first_pass_score = 3, second_pass_score = 4)
doc3(first_pass_score = 2, second_pass_score = 3)
where first_pass_score is doc.sort_values[0] and second_pass_score is
doc.score

when we try to merge all documents, this is what happens:
queue.insertWithOverflow(doc1)
queue.insertWithOverflow(doc2)
queue.lessThan(doc1, doc2) -> false (doc1.orderInShard = 1 <
doc2.orderInShard = 2)
queue.insertWithOverflow(doc4)
queue.lessThan(doc2, doc4) -> false (doc2.first_pass_score = 4 >
doc4.first_pass_score = 3)
queue.insertWithOverflow(doc3)
queue.lessThan(doc4, doc3) -> false (doc4.orderInShard = 1 <
doc3.orderInShard = 2)

and the final document order will be:
doc1(first_pass_score = 1, second_pass_score = 2)
doc2(first_pass_score = 4, second_pass_score = 1)
doc4(first_pass_score = 3, second_pass_score = 4)
doc3(first_pass_score = 2, second_pass_score = 3)

Ideally I would want to see the rerank happen based on a global order across
all shards. I've implemented a custom component that asks the shards to
return Math.max(reRankDocs, offset + rows) documents, which are first sorted
by first pass score, and then only the top reRankDocs are re-sorted by second
pass score. I understand that it might not be the best way in terms of
performance (we rerank only the top 60 documents, so it's not that big of a
deal), but it's functionally equivalent to the single-shard behavior.
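
A toy sketch of that merge logic (plain Java, all names invented; not the
actual Solr component):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GlobalRerankMerge {
    // minimal stand-in for a shard hit: first pass score + LTR second pass score
    record Doc(String id, float firstPass, float secondPass) {}

    static List<Doc> merge(List<Doc> allShardDocs, int reRankDocs) {
        // 1) global order by first pass score, as a single shard would see it
        List<Doc> merged = new ArrayList<>(allShardDocs);
        merged.sort(Comparator.comparingDouble((Doc d) -> d.firstPass).reversed());
        // 2) rerank only the global top reRankDocs by second pass score
        int n = Math.min(reRankDocs, merged.size());
        List<Doc> head = new ArrayList<>(merged.subList(0, n));
        head.sort(Comparator.comparingDouble((Doc d) -> d.secondPass).reversed());
        for (int i = 0; i < n; i++) merged.set(i, head.get(i));
        return merged;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc("doc1", 1, 2), new Doc("doc2", 4, 1),   // shard 1
            new Doc("doc4", 3, 4), new Doc("doc3", 2, 3));  // shard 2
        // with reRankDocs=4 this prints doc4, doc3, doc1, doc2 (global second pass order)
        merge(docs, 4).forEach(d -> System.out.println(d.id()));
    }
}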

I'm curious whether the current behavior is intended or not; typically I would
expect either something like what I described above, or at least ignoring the
sort during the merge and using only the doc.score generated by the LTR
rescorer. Maybe the community would be interested in the approach I've
implemented? Or is it considered bad design to rely on the first pass score,
and should our LTR model instead use fields from the first pass / use
OriginalScoreFeature?