Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?

2016-02-15 Thread Chris Morley
Hey Solr people:
  
 Suppose that we did not want to break up our document set into separate 
indexes, but had certain cases where many versions of a document were not 
relevant for certain searches.
  
 I guess this could be thought of as an "authorization" class of problem, 
however it is not that for us.  We have a few other fields that determine 
relevancy to the current query, based on what page the query is coming 
from.  It's kind of like authorization, but not really.
  
 Anyway, I think the answer for how you would do it for authorization would 
solve it for our case too.
  
 So suppose you had 99 users and 100 documents.  Document 1 is visible to 
everybody in the same form, but each of the other 99 documents is a 
slightly different version belonging to exactly one of the 99 users: 
unique per user, but not "very" unique.  Suppose, for instance, that the 
only difference in the text of the 99 documents was that each one was 
watermarked with its user's name.  Aren't you spamming your tf/idf at that 
point?  Is there a way around this?  Is there a way to say, hey, group 
these 99 documents together and only count 1 of them for tf/idf purposes?
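  
 To put rough numbers on it, here is a minimal sketch in Java, assuming 
Lucene's classic DefaultSimilarity idf of log(numDocs / (docFreq + 1)) + 1 
(other Similarity implementations differ in detail, so treat this as 
illustrative only):
  
  public class IdfInflation {
      // Classic Lucene idf: log(numDocs / (docFreq + 1)) + 1.
      static double idf(long docFreq, long numDocs) {
          return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
      }

      public static void main(String[] args) {
          long numDocs = 100;
          // A term from the shared text appears in 1 document if the 99
          // near-duplicates are collapsed, but in 99 documents if they
          // are indexed separately.
          System.out.println("collapsed:  " + idf(1, numDocs));   // ~4.91
          System.out.println("duplicated: " + idf(99, numDocs));  // 1.0
      }
  }
  
 So a term that occurs in all 99 variants scores like a near-stopword, 
which is exactly the "spamming" effect in question.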
  
 When doing queries, each user would only ever see two documents: Document 
1, plus whichever document they specifically owned.
  
 If there are web pages or book chapters I can read or re-read that address 
this class of problem, those references would be great.
  
  
 -Chris.
  
  



German Compound Splitter words.fst causing problems.

2015-03-25 Thread Chris Morley
Hello, Chris Morley here, of Wayfair.com. I am working on the German 
compound-splitter by Dawid Weiss.

  I tried to "upgrade" the words.fst file that comes with the German 
compound-splitter using Solr 3.5, but it doesn't work. Below is the 
IndexNotFoundException that I get.

 cmorley@Caracal01:~/Work/oss/git/apache-solr-3.5.0$ java -cp 
lucene/build/lucene-core-3.5-SNAPSHOT.jar org.apache.lucene.index.IndexUpgrader 
wordsFst
 Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: 
org.apache.lucene.store.MMapDirectory@/home/cmorley/Work/oss/git/apache-solr-3.5.0/wordsFst
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@201a755e
 at 
org.apache.lucene.index.IndexUpgrader.upgrade(IndexUpgrader.java:118)
 at 
org.apache.lucene.index.IndexUpgrader.main(IndexUpgrader.java:85)

 The reason I'm attempting this at all is the answer here, 
http://stackoverflow.com/questions/25450865/migrate-solr-1-4-index-files-to-4-7,
 which says to do the upgrade in a two-step process: first with Solr 3.5, and 
then with the latest Solr version (4.10.3).  When I run the unit tests for my 
modified German compound-splitter I get this same type of error.  The thing 
is, this is an FST, not an index, which is a little confusing.  The reason 
I'm following this answer anyway is that I get the exact same message when 
building the (modified) project with Maven, at the point where it tries to 
load words.fst.  Below.

 [main] ERROR com.wayfair.lucene.analysis.de.compound.GermanCompoundSplitter - 
Format version is not supported (resource: 
com.wayfair.lucene.analysis.de.compound.InputStreamDataInput@79a66240): 0 
(needs to be between 3 and 4). This version of Lucene only supports indexes 
created with release 3.0 and later.  Failed to initialize static data 
structures for German compound splitter.
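
 As far as I can tell, IndexUpgrader only operates on index directories (it 
looks for a segments_N file, hence the IndexNotFoundException), and an FST 
carries its own format-version header, which is what the "Format version is 
not supported" check is rejecting.  The way out is presumably to rebuild 
words.fst from its source word list using the target Lucene version's FST 
writer.  A minimal sketch against the 4.x-era FST API, assuming a sorted 
plain-text word list in words.txt (an assumption on my part; the exact 
scratch types shifted between 4.x minor releases, and the real 
compound-splitter FST may carry outputs, so adapt accordingly):

  import java.io.BufferedReader;
  import java.io.File;
  import java.io.FileInputStream;
  import java.io.InputStreamReader;
  import java.nio.charset.StandardCharsets;
  import org.apache.lucene.util.BytesRef;
  import org.apache.lucene.util.IntsRef;
  import org.apache.lucene.util.fst.Builder;
  import org.apache.lucene.util.fst.FST;
  import org.apache.lucene.util.fst.NoOutputs;
  import org.apache.lucene.util.fst.Util;

  public class RebuildWordsFst {
      public static void main(String[] args) throws Exception {
          NoOutputs outputs = NoOutputs.getSingleton();
          // Inputs must be added in sorted (byte) order.
          Builder<Object> builder =
              new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
          BytesRef scratchBytes = new BytesRef();
          IntsRef scratchInts = new IntsRef();  // IntsRefBuilder on newer Lucene
          try (BufferedReader in = new BufferedReader(new InputStreamReader(
                  new FileInputStream("words.txt"), StandardCharsets.UTF_8))) {
              String word;
              while ((word = in.readLine()) != null) {
                  scratchBytes.copyChars(word);
                  builder.add(Util.toIntsRef(scratchBytes, scratchInts),
                              outputs.getNoOutput());
              }
          }
          FST<Object> fst = builder.finish();
          fst.save(new File("words.fst"));  // written in the current FST format
      }
  }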

 Thanks,
 -Chris.




re: A Synonym Searching for Phrase?

2015-05-14 Thread Chris Morley
I have implemented that but it's not open sourced yet.  It will be soon.
  
 -Chris.
  
  
  


 From: "Ryan Yacyshyn" 
Sent: Thursday, May 14, 2015 12:07 PM
To: solr-user@lucene.apache.org
Subject: A Synonym Searching for Phrase?   
Hi All,

I'm running into an issue where a single token really means the same thing
as two. For example, there are a couple of ways users might search for a
certain type of visa called the "s pass": they might query for spass or
s-pass.

I thought I could add a line in my synonym file to solve this, such as:

s-pass, spass => s pass

This doesn't seem to work. I found an Auto Phrase TokenFilter (
https://github.com/LucidWorks/auto-phrase-tokenfilter) that looks like it
might help, but it sounds like it needs to use a specific query parser as
well (we're using edismax).

Has anyone come across this specific problem before? Would really
appreciate your suggestions / help.

We're using Solr 4.8.x (and LucidWorks 2.9).

Thanks!
Ryan
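
 The likely root cause: synonym rules match tokens, and a typical tokenizer 
splits "s-pass" into "s" and "pass" before the synonym filter ever runs, so 
the single-token left-hand side "s-pass" never matches.  A minimal sketch 
that shows the split, assuming a StandardTokenizer-based field and a Lucene 
version whose StandardAnalyzer has a no-arg constructor (older releases 
take a Version argument):

  import java.io.StringReader;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class ShowTokens {
      public static void main(String[] args) throws Exception {
          try (StandardAnalyzer analyzer = new StandardAnalyzer()) {
              TokenStream ts =
                  analyzer.tokenStream("f", new StringReader("s-pass"));
              CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
              ts.reset();
              while (ts.incrementToken()) {
                  System.out.println(term.toString());  // "s", then "pass"
              }
              ts.end();
              ts.close();
          }
      }
  }

 Matching the two-token output "s pass" at query time also runs into the 
classic multi-word synonym problem, which is what the auto-phrase filter is 
trying to address.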
 



Multiple Word Synonyms with Autophrasing

2015-06-01 Thread Chris Morley
Hello everyone @ solr-user,
  
 At Wayfair, I have implemented multiple word synonyms in a clean and 
efficient way, in conjunction with a slightly modified version of 
LucidWorks' Autophrasing plugin, by also tacking on a modified version of 
edismax.  It is not released or in use on our public website yet, but it 
will be very soon.  While it is not ready to officially open source yet, I 
know some people out there are anxious to implement this type of thing.  
Please feel free to contact me if you are interested in learning how to 
accomplish this on your own.  Note that while this may have some concepts 
in common with Named Entity Recognition implementations, I think it really 
is a completely different thing.  I get a lot of spam, so please write me 
your questions privately with the subject line "MWSwA" so I can easily 
compile everyone's questions about this.  I will respond to everyone soon 
with some beta documentation, or possibly an invitation to a private 
GitHub repository, so that you can review an example.
  
 Thanks!
 -Chris.
  



Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser

2016-05-26 Thread Chris Morley
Chris Morley here, from Wayfair.  (Depahelix = my domain)

 Suyash Sonawane and I have worked on multiple word synonyms at Wayfair.  We 
worked mostly off of Ted Sullivan's work, plus some suggestions from 
Koorosh Vakhshoori.  We have gotten to a point where we have a more 
sophisticated internal implementation; however, we've found that it is very 
difficult to make it do what you want it to do and still be sufficiently 
performant.  Watch out for exceptional situations with mm (minimum should 
match).
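
 For concreteness, here is a minimal SolrJ sketch (5.x-era API; the host, 
collection, field names, and query are placeholders) of the kind of 
edismax query where mm and synonym expansion interact:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class MmPitfall {
      public static void main(String[] args) throws Exception {
          HttpSolrClient solr =
              new HttpSolrClient("http://localhost:8983/solr/products");
          SolrQuery q = new SolrQuery("sofa bed");
          q.set("defType", "edismax");
          q.set("qf", "name description");
          // mm is evaluated against the clause count after synonym
          // expansion; a multi-word synonym that rewrites "sofa bed" can
          // change that count and make mm=2 behave unexpectedly.
          q.set("mm", "2");
          QueryResponse rsp = solr.query(q);
          System.out.println(rsp.getResults().getNumFound() + " hits");
          solr.close();
      }
  }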

 Trey Grainger (now at Lucidworks) and Simon Hughes of Dice.com have also done 
work in this area.

 It should be very possible to get this kind of thing working on SolrCloud.  I 
haven't tried it yet, but I think that theoretically it should just work.  The 
synonyms work happens mostly at index time and query time.  The index-time 
part should translate to SolrCloud directly, while the query-time part might 
pose some issues, but probably nothing too bad, if there are any issues at 
all.

 I've had decent luck porting our various plugins from 4.10.x to 5.5.0 because 
a lot of stuff is just Java, and it still works within the Jetty context.

 -Chris.





 From: "John Bickerstaff" 
Sent: Thursday, May 26, 2016 1:51 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser  
Hey Jeff (or anyone interested in multi-word synonyms) here are some
potentially interesting links...

http://wiki.apache.org/solr/QueryParser (search the page for
synonym_edismax)

https://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ (blog
post about what became the synonym_edismax query parser)

https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/

This last was useful for lots of reasons and contains links to other
interesting, related web pages...

On Thu, May 26, 2016 at 11:45 AM, Jeff Wartes 
wrote:

> Oh, interesting. I've certainly encountered issues with multi-word
> synonyms, but I hadn't come across this. If you end up using it with a
> recent Solr version, I'd be glad to hear your experience.
>
> I haven't used it, but I am aware of one other project in this vein that
> you might be interested in looking at:
> https://github.com/LucidWorks/auto-phrase-tokenfilter
>
>
> On 5/26/16, 9:29 AM, "John Bickerstaff"  wrote:
>
> >Ahh - for question #3 I may have spoken too soon. This line from the
> >github repository readme suggests a way.
> >
> >Update: We have tested to run with the jar in $SOLR_HOME/lib as well, and
> >it works (Jetty).
> >
> >I'll try that and only respond back if that doesn't work.
> >
> >Questions 1 and 2 still stand of course... If anyone on the list has
> >experience in this area...
> >
> >Thanks.
> >
> >On Thu, May 26, 2016 at 10:25 AM, John Bickerstaff
> ><j...@johnbickerstaff.com> wrote:
> >
> >> Hi all,
> >>
> >> I'm creating a Solr Cloud that will index and search medical text.
> >> Multi-word synonyms are a pretty important factor.
> >>
> >> I find that there are some challenges around multi-word synonyms and I
> >> also found on the wiki that there is a recommended 3rd-party parser
> >> (synonym_edismax parser) created by Nolan Lawson and found here:
> >> https://github.com/healthonnet/hon-lucene-synonyms
> >>
> >> Here's the thing - the instructions on the github site involve bringing
> >> the jar file into the war file - which is not applicable any more... at
> >> least I think it's not...
> >>
> >> I have three questions:
> >>
> >> 1. Is this still a good solution for multi-word synonyms (i.e., Solr
> >> Cloud doesn't break it in some way)?
> >> 2. Is there a tool or plug-in out there that the contributors would
> >> recommend above this one?
> >> 3. Assuming 1 = yes and 2 = no, can anyone tell me an updated procedure
> >> for bringing it in to Solr Cloud (I'm running 5.4.x)
> >>
> >> Thanks
> >>
>
>




tlogs not deleting as usual in Solr 5.5.1?

2016-06-16 Thread Chris Morley
 The repetition below is on purpose, to show the contrast between Solr 
versions.
  
 In Solr 4.10.3, we have autocommits disabled.  We do a dataimport of a few 
hundred thousand records and have a tlog that grows to ~1.2G.
  
 In Solr 5.5.1,  we have autocommits disabled.  We do a dataimport of a few 
hundred thousand records and have a tlog that grows to ~1.6G. (same exact 
data, slightly larger tlog but who knows, that's fine)
  
 In Solr 4.10.3 tlogs ARE deleted after issuing update?commit=true.  
(And deleted immediately.)
  
 In Solr 5.5.1  tlogs ARE NOT deleted after issuing update?commit=true.
  
 We want the tlog to delete like it did in Solr 4.10.3.  Perhaps there is a 
configuration setting or feature of Solr 5.5.1 that causes this?
  
 Would appreciate any tips on configuration or code we could change to 
ensure the tlog will delete after a hard commit.
  

  
  



re: tlogs not deleting as usual in Solr 5.5.1?

2016-06-17 Thread Chris Morley
After some more searching, I found a thread online where Erick Erickson 
explains that old tlogs are kept around in case a peer needs to sync, even 
if SolrCloud is not enabled.  That makes sense, but we'll probably want to 
enable autoCommit and then trigger replication on the slaves once we know 
everything is committed after a full import.  (We disable polling.)
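  
 For anyone in the same spot, a minimal sketch in Java of that sequence, 
assuming master/slave replication with the stock ReplicationHandler (host 
and core names are placeholders): a hard commit on the master, then an 
explicit fetchindex on each slave.
  
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class CommitThenReplicate {
      static void get(String url) throws Exception {
          HttpURLConnection c =
              (HttpURLConnection) new URL(url).openConnection();
          System.out.println(url + " -> HTTP " + c.getResponseCode());
          c.disconnect();
      }

      public static void main(String[] args) throws Exception {
          // Hard commit on the master once the full import is done.
          get("http://master:8983/solr/mycore/update?commit=true");
          // With polling disabled, tell each slave to pull the new index.
          for (String slave :
                  new String[]{"http://slave1:8983", "http://slave2:8983"}) {
              get(slave + "/solr/mycore/replication?command=fetchindex");
          }
      }
  }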
  
  
  


 From: "Chris Morley" 
Sent: Thursday, June 16, 2016 3:20 PM
To: "Solr Newsgroup" 
Subject: tlogs not deleting as usual in Solr 5.5.1?   
The repetition below is on purpose to show the contrast between solr
versions.

In Solr 4.10.3, we have autocommits disabled. We do a dataimport of a few
hundred thousand records and have a tlog that grows to ~1.2G.

In Solr 5.5.1, we have autocommits disabled. We do a dataimport of a few
hundred thousand records and have a tlog that grows to ~1.6G. (same exact
data, slightly larger tlog but who knows, that's fine)

In Solr 4.10.3 tlogs ARE deleted after issuing update?commit=true.
(And deleted immediately.)

In Solr 5.5.1 tlogs ARE NOT deleted after issuing update?commit=true.

We want the tlog to delete like it did in Solr 4.10.3. Perhaps there is a
configuration setting or feature of Solr 5.5.1 that causes this?

Would appreciate any tips on configuration or code we could change to
ensure the tlog will delete after a hard commit.

 



Re: tlogs not deleting as usual in Solr 5.5.1?

2016-06-17 Thread Chris Morley
Thanks Erick - that's what we have settled on doing until we are using 
SolrCloud, which will be later this year with any luck.  We want to get up 
onto Solr 5.5.1 first (ASAP) and we tried disabling tlogs today and that 
seems to fit the bill.
  
  
  


 From: "Erick Erickson" 
Sent: Friday, June 17, 2016 2:36 PM
To: "solr-user" , ch...@depahelix.com
Subject: Re: tlogs not deleting as usual in Solr 5.5.1?   
If you are NOT using SolrCloud and don't
care about Real Time Get, you can just disable the
tlogs entirely. They're not doing you all that much
good in that case...

The tlogs are irrelevant when it comes to master/slave
replication.

FWIW,
Erick

On Fri, Jun 17, 2016 at 9:14 AM, Chris Morley  wrote:
> After some more searching, I found a thread online where Erick Erickson
> is telling someone about how there are old tlogs left around in case
> there is a need for a peer to sync even if SolrCloud is not enabled. That
> makes sense, but we'll probably want to enable autoCommit and then
> trigger replication on the slaves when we know everything is committed
> after a full import. (We disable polling.)
>
>
>
>
> 
> From: "Chris Morley" 
> Sent: Thursday, June 16, 2016 3:20 PM
> To: "Solr Newsgroup" 
> Subject: tlogs not deleting as usual in Solr 5.5.1?
> The repetition below is on purpose to show the contrast between solr
> versions.
>
> In Solr 4.10.3, we have autocommits disabled. We do a dataimport of a
> few hundred thousand records and have a tlog that grows to ~1.2G.
>
> In Solr 5.5.1, we have autocommits disabled. We do a dataimport of a few
> hundred thousand records and have a tlog that grows to ~1.6G. (same
> exact data, slightly larger tlog but who knows, that's fine)
>
> In Solr 4.10.3 tlogs ARE deleted after issuing update?commit=true.
> (And deleted immediately.)
>
> In Solr 5.5.1 tlogs ARE NOT deleted after issuing update?commit=true.
>
> We want the tlog to delete like it did in Solr 4.10.3. Perhaps there is
> a configuration setting or feature of Solr 5.5.1 that causes this?
>
> Would appreciate any tips on configuration or code we could change to
> ensure the tlog will delete after a hard commit.
>
>
>
 



changing the /solr path, additional steps needed for 6.1

2016-08-25 Thread Chris Morley
This might help some people:
  
 Changing the URL from server:port/solr to server:port/ourspecialpath is a 
bit inconvenient.  You have to change several files where the solr part of 
the request path is hardcoded:
  
 server/solr-webapp/webapp/WEB-INF/web.xml
 server/solr/solr.xml
 server/contexts/solr-jetty-context.xml
  
 Now that the new UI defaults to on in 6.1, you also have to change:
 server/solr-webapp/webapp/js/angular/services.js
 (in a bunch of places)
  
 -Chris.
  
  



re: Implementing custom analyzer for multi-language stemming

2014-07-30 Thread Chris Morley
I know BasisTech.com has a plugin for Elasticsearch that extends 
stemming/lemmatization to work across 40 natural languages.
I'm not sure what they have for Solr, but I think something like that may 
exist as well.

Cheers,
-Chris.


 From: "Eugene" 
Sent: Wednesday, July 30, 2014 1:48 PM
To: solr-user@lucene.apache.org
Subject: Implementing custom analyzer for multi-language stemming

Hello, fellow Solr and Lucene users and developers!

In our project we receive text from users in different languages. We
detect language automatically and use Google Translate APIs a lot (so
having an arbitrary number of languages in our system doesn't concern us).
However, we need to be able to search using stemming. Having nearly a
hundred fields (several fields for each language, with language-specific
stemmers) listed in our search query is not an option. So we need a way to
have a single index which has stemmed tokens for different languages. I
have two questions:

1. Are there already (third-party) custom multi-language stemming
analyzers? (I doubt that no one else ran into this issue)

2. If I'm going to implement such an analyzer myself, could you please
suggest a good way to 'pass' the detected language value into it?
Detecting language in the analyzer itself is not an option, because: a) we
already detect it elsewhere; b) we do it based on the combined values of
many fields ('name', 'topic', 'description', etc.), while the current
field can be too short for reliable detection; c) sometimes we just want
to specify the language explicitly. The obvious hack would be to prepend
the ISO 639-1 code to the field value, but I'd like to believe that Solr
allows for a cleaner solution. I can think of either: a) a custom query
parameter (but I guess that would require modifying request handlers,
etc., which is highly undesirable) or b) getting the value from another
field (we have a 'language' field and we do not have mixed-language
records). If this is possible, could you please describe the mechanism
for doing it, or point to relevant code examples?
Thank you very much and have a good day!
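
For 2b, as far as I know there is no stock hook for feeding one field's
value into another field's analyzer; the mechanism Lucene itself ships for
per-language analysis is per-field delegation. A minimal sketch, shown
mostly for contrast since it implies language-suffixed field names (the
field names here are made up; recent Lucene has no-arg analyzer
constructors, older versions also take a Version argument):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.de.GermanAnalyzer;
  import org.apache.lucene.analysis.fr.FrenchAnalyzer;
  import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;

  public class PerLanguageAnalyzer {
      public static Analyzer build() {
          Map<String, Analyzer> perField = new HashMap<>();
          perField.put("text_de", new GermanAnalyzer());
          perField.put("text_fr", new FrenchAnalyzer());
          // Fields not listed fall back to the default analyzer.
          return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
      }
  }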



re: Solr is working very slow after certain time

2014-07-31 Thread Chris Morley
The Solr Performance Factors wiki page mentions two big tips that may help 
you, but you have to read the rest of the page to make sure you understand 
the caveats there.

In general, adding many documents per update request is faster than adding 
one per update request.

Reducing the frequency of automatic commits, or disabling them entirely, 
may speed up indexing.
 Source:
 http://wiki.apache.org/solr/SolrPerformanceFactors#Indexing_Performance
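  
 As a minimal SolrJ sketch of the first tip (4.x-era API; the URL, field 
names, and batch size of 1000 are placeholders to tune):
  
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexer {
      public static void main(String[] args) throws Exception {
          HttpSolrServer solr =
              new HttpSolrServer("http://localhost:8983/solr/mycore");
          List<SolrInputDocument> batch = new ArrayList<>();
          for (int i = 0; i < 100000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", Integer.toString(i));
              doc.addField("name", "doc " + i);
              batch.add(doc);
              if (batch.size() == 1000) {   // many docs per update request
                  solr.add(batch);
                  batch.clear();
              }
          }
          if (!batch.isEmpty()) solr.add(batch);
          solr.commit();   // one explicit commit instead of frequent autocommits
          solr.shutdown();
      }
  }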
  
  


 From: "Ameya Aware" 
Sent: Thursday, July 31, 2014 1:56 PM
To: solr-user@lucene.apache.org
Subject: Solr is working very slow after certain time   
Hi,

I could index around 10 documents in a couple of hours. But after that,
the time for indexing becomes very large (around just 15-20 documents per
minute).

I have taken care of garbage collection.

I am passing the below parameters to Solr:
-Xms6144m -Xmx6144m -XX:MaxPermSize=128m -XX:+UseConcMarkSweepGC
-XX:ConcGCThreads=6 -XX:ParallelGCThreads=6
-XX:CMSInitiatingOccupancyFraction=70 -XX:NewRatio=3
-XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled
-XX:+UseCompressedOops -XX:+ParallelRefProcEnabled -XX:+UseLargePages
-XX:+AggressiveOpts -XX:-UseGCOverheadLimit

Can anyone help to solve this problem?

Thanks,
Ameya
 



re: How to accomadate huge data

2014-08-28 Thread Chris Morley
Look into SolrCloud.
  
  
  


 From: "Ethan" 
Sent: Thursday, August 28, 2014 1:59 PM
To: "solr-user" 
Subject: How to accomadate huge data   
Our index size is 110GB and growing; it has crossed our RAM capacity of
96GB, and we are seeing a lot of disk and network IO, resulting in huge
latencies and instability (one of the servers used to shut down and stay
in recovery mode when restarted). Our admin added swap space and that
seemed to mitigate the issue.

But what is the usual practice in such a scenario? Index size eventually
outgrows RAM and is pushed onto disk. Is it advisable to shard (the Solr
forum says no)? Or is there a different mechanism?

System config:
We have 3 node cluster with RAID1 SSD. Two nodes are running solr and the
other is to maintain Quorum.

-E
 



Re: reg: efficient querying using solr

2013-06-11 Thread Chris Morley
This might help (indirectly):
http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls


 From: "gururaj kosuru" 
Sent: Wednesday, June 12, 2013 12:28 AM
To: "solr-user" 
Subject: Re: reg: efficient querying using solr

Thanks Walter, Shawn and Otis for the assistance, I will look into tuning
the parameters by experimenting as seems to be the only way to go.

On 11 June 2013 19:17, Shawn Heisey  wrote:

> On 6/11/2013 12:15 AM, gururaj kosuru wrote:
> > How can one calculate an ideal max shard size for a Solr core instance
> > if I am running a cloud with multiple systems of 4GB?
>
> That question is impossible to answer without experimentation, but
> here's a good starting point.  That's all it is, a starting point:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Shawn
>
>