Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?
Hey Solr people: Suppose that we did not want to break up our document set into separate indexes, but had certain cases where many versions of a document were not relevant for certain searches. I guess this could be thought of as an "authorization" class of problem, although it is not that for us. We have a few other fields that determine relevance to the current query, based on what page the query is coming from. It's kind of like authorization, but not really. Anyway, I think the answer for how you would do it for authorization would solve our case too.

So suppose you had 99 users and 100 documents. Document 1 is the same for everybody, but each of the other 99 documents is unique to one of the 99 users, and only slightly different from the rest. Suppose, for instance, that the only difference in the text of those 99 documents is that each one is watermarked with its owner's name. Aren't you spamming your tf/idf statistics at that point? Is there a way around this? Is there a way to say: group these 99 documents together and count them only once for tf/idf purposes?

When running queries, each user would only ever see two documents: Document 1, plus whichever other document they specifically owned.

If there are web pages or book chapters I can read or re-read that address this class of problem, those references would be great.

-Chris.
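To make the visibility half of that concrete (this is not the tf/idf half), one common shape is an ownership field plus a filter query. The field name and user token here are hypothetical, not from any real schema:

  <!-- schema.xml: hypothetical ownership field -->
  <field name="owner" type="string" indexed="true" stored="false" multiValued="true"/>

  # query time: user42 sees the shared document plus the one they own
  q=dining+table&fq=owner:(public OR user42)

The open question above - grouping the 99 near-duplicates so they count once for tf/idf - is the part this does not address.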
German Compound Splitter words.fst causing problems.
Hello, Chris Morley here, of Wayfair.com. I am working on the German compound splitter by Dawid Weiss. I tried to "upgrade" the words.fst file that comes with the German compound splitter using Solr 3.5, but it doesn't work. Below is the IndexNotFoundException that I get:

cmorley@Caracal01:~/Work/oss/git/apache-solr-3.5.0$ java -cp lucene/build/lucene-core-3.5-SNAPSHOT.jar org.apache.lucene.index.IndexUpgrader wordsFst
Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: org.apache.lucene.store.MMapDirectory@/home/cmorley/Work/oss/git/apache-solr-3.5.0/wordsFst lockFactory=org.apache.lucene.store.NativeFSLockFactory@201a755e
at org.apache.lucene.index.IndexUpgrader.upgrade(IndexUpgrader.java:118)
at org.apache.lucene.index.IndexUpgrader.main(IndexUpgrader.java:85)

The reason I'm attempting this at all is the answer at http://stackoverflow.com/questions/25450865/migrate-solr-1-4-index-files-to-4-7, which says to do the upgrade in a two-step process: first using Solr 3.5, and then the latest Solr version (4.10.3). When I run the unit tests for my modified German compound splitter, I get this same type of error. The thing is, words.fst is an FST, not an index, which is a little confusing. The reason I'm following that answer anyway is that I get the exact same message when trying to build the (modified) project with Maven, at the point where it tries to load words.fst. Below:

[main] ERROR com.wayfair.lucene.analysis.de.compound.GermanCompoundSplitter - Format version is not supported (resource: com.wayfair.lucene.analysis.de.compound.InputStreamDataInput@79a66240): 0 (needs to be between 3 and 4). This version of Lucene only supports indexes created with release 3.0 and later.
Failed to initialize static data structures for German compound splitter.

Thanks,
-Chris.
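One plausible way around this (an untested sketch, not a confirmed fix): since words.fst is a standalone FST rather than an index directory, skip IndexUpgrader entirely and rebuild the FST from the original word list under the target Lucene version. This uses the Lucene 4.10-era FST API; signatures should be checked against your version, and sortedWords is a placeholder for wherever the dictionary is loaded from:

  import java.io.OutputStream;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.List;
  import org.apache.lucene.store.OutputStreamDataOutput;
  import org.apache.lucene.util.IntsRefBuilder;
  import org.apache.lucene.util.fst.Builder;
  import org.apache.lucene.util.fst.FST;
  import org.apache.lucene.util.fst.NoOutputs;
  import org.apache.lucene.util.fst.Outputs;
  import org.apache.lucene.util.fst.Util;

  public class RebuildWordsFst {
    public static void rebuild(List<String> sortedWords) throws Exception {
      // FST inputs must be added in sorted order.
      Outputs<Object> outputs = NoOutputs.getSingleton();
      Builder<Object> builder = new Builder<>(FST.INPUT_TYPE.BYTE4, outputs);
      IntsRefBuilder scratch = new IntsRefBuilder();
      for (String word : sortedWords) {
        builder.add(Util.toUTF32(word, scratch), outputs.getNoOutput());
      }
      FST<Object> fst = builder.finish();
      try (OutputStream os = Files.newOutputStream(Paths.get("words.fst"))) {
        fst.save(new OutputStreamDataOutput(os));
      }
    }
  }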
re: A Synonym Searching for Phrase?
I have implemented that, but it's not open sourced yet. It will be soon. -Chris.

From: "Ryan Yacyshyn"
Sent: Thursday, May 14, 2015 12:07 PM
To: solr-user@lucene.apache.org
Subject: A Synonym Searching for Phrase?

Hi All,

I'm running into an issue where I have some single tokens that really mean the same thing as two. For example, there are a couple of ways users might want to search for a certain type of visa called the "s pass": they might query for spass or s-pass. I thought I could add a line in my synonym file to solve this, such as:

s-pass, spass => s pass

This doesn't seem to work. I found an Auto Phrase TokenFilter (https://github.com/LucidWorks/auto-phrase-tokenfilter) that looks like it might help, but it sounds like it needs a specific query parser as well (we're using edismax). Has anyone come across this specific problem before? Would really appreciate your suggestions / help. We're using Solr 4.8.x (and LucidWorks 2.9).

Thanks!
Ryan
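The autophrase approach mentioned above generally gets wired in like this - a sketch from memory of the LucidWorks plugin, so the factory class and attribute names should be verified against the project's README:

  <!-- schema.xml: index-time autophrasing (sketch) -->
  <fieldType name="text_autophrase" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
              phrases="autophrases.txt" includeTokens="true"
              replaceWhitespaceWith="_"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  # autophrases.txt
  s pass

  # synonyms.txt - the phrase is now a single token
  s-pass, spass => s_pass

The remaining gap is the query side: a user typing "s pass" gets split into two tokens before the synonym filter ever sees them, which is exactly what the plugin's companion query parser is for.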
Multiple Word Synonyms with Autophrasing
Hello everyone @ solr-user,

At Wayfair, I have implemented multiple-word synonyms in a clean and efficient way, using a slightly modified version of LucidWorks' Autophrasing plugin together with a modified version of edismax. It is not released or in use on our public website yet, but it will be very soon. While it is not ready to be officially open sourced yet, I know some people out there are anxious to implement this type of thing. Please feel free to contact me if you are interested in learning how to theoretically accomplish this on your own. Note that while this may have some concepts in common with Named Entity Recognition implementations, I think it really is a completely different thing.

I get a lot of spam, so if you would, please write to me privately with your questions, with the subject line "MWSwA", so I can easily compile everyone's questions about this. I will respond to everyone at some point soon with some beta documentation, or possibly with an invitation to a private GitHub repository or something, so that you can review an example.

Thanks!
-Chris.
Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser
Chris Morley here, from Wayfair. (Depahelix = my domain.)

Suyash Sonawane and I have worked on multiple-word synonyms at Wayfair. We worked mostly off of Ted Sullivan's work, and also off of some suggestions from Koorosh Vakhshoori. We have gotten to a point where we have a more sophisticated internal implementation; however, we've found that it is very difficult to make it do what you want it to do and also be sufficiently performant. Watch out for exceptional situations with mm (minimum should match). Trey Grainger (now at Lucidworks) and Simon Hughes of Dice.com have also done work in this area.

It should be very possible to get this kind of thing working on SolrCloud. I haven't tried it yet, but I think that theoretically it should just work. The synonyms stuff is mostly about doing things at index time and query time. The index-time stuff should translate to SolrCloud directly, while the query-time stuff might pose some issues, but probably nothing too bad, if there are any issues at all. I've had decent luck porting our various plugins from 4.10.x to 5.5.0, because a lot of the code is just Java, and it still works within the Jetty context.

-Chris.

From: "John Bickerstaff"
Sent: Thursday, May 26, 2016 1:51 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser

Hey Jeff (or anyone interested in multi-word synonyms), here are some potentially interesting links...

http://wiki.apache.org/solr/QueryParser (search the page for synonym_edismax)
https://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ (blog post about what became the synonym_edismax query parser)
https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/

This last one was useful for lots of reasons and contains links to other interesting, related web pages...

On Thu, May 26, 2016 at 11:45 AM, Jeff Wartes wrote:
> Oh, interesting. I've certainly encountered issues with multi-word
> synonyms, but I hadn't come across this. If you end up using it with a
> recent Solr version, I'd be glad to hear your experience.
>
> I haven't used it, but I am aware of one other project in this vein that
> you might be interested in looking at:
> https://github.com/LucidWorks/auto-phrase-tokenfilter
>
>
> On 5/26/16, 9:29 AM, "John Bickerstaff" wrote:
>
> >Ahh - for question #3 I may have spoken too soon. This line from the
> >github repository readme suggests a way:
> >
> >Update: We have tested to run with the jar in $SOLR_HOME/lib as well, and
> >it works (Jetty).
> >
> >I'll try that and only respond back if that doesn't work.
> >
> >Questions 1 and 2 still stand of course... if anyone on the list has
> >experience in this area...
> >
> >Thanks.
> >
> >On Thu, May 26, 2016 at 10:25 AM, John Bickerstaff <
> j...@johnbickerstaff.com
> >> wrote:
> >
> >> Hi all,
> >>
> >> I'm creating a Solr Cloud that will index and search medical text.
> >> Multi-word synonyms are a pretty important factor.
> >>
> >> I find that there are some challenges around multi-word synonyms, and I
> >> also found on the wiki that there is a recommended 3rd-party parser
> >> (the synonym_edismax parser) created by Nolan Lawson and found here:
> >> https://github.com/healthonnet/hon-lucene-synonyms
> >>
> >> Here's the thing - the instructions on the github site involve bringing
> >> the jar file into the war file - which is not applicable any more... at
> >> least I think it's not...
> >>
> >> I have three questions:
> >>
> >> 1. Is this still a good solution for multi-word synonyms (i.e. SolrCloud
> >> doesn't break it in some way)?
> >> 2. Is there a tool or plug-in out there that the contributors would
> >> recommend above this one?
> >> 3. Assuming 1 = yes and 2 = no, can anyone tell me an updated procedure
> >> for bringing it in to Solr Cloud? (I'm running 5.4.x)
> >>
> >> Thanks
> >>
> >
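For anyone wiring this up, registering the hon-lucene-synonyms parser is a solrconfig.xml change along these lines - a sketch from memory, so verify the class name and parameter layout against the project's README for your version, with the jar dropped in $SOLR_HOME/lib as noted above:

  <!-- solrconfig.xml (sketch) -->
  <queryParser name="synonym_edismax"
               class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
    <lst name="synonymAnalyzers">
      <lst name="myAnalyzer">
        <lst name="tokenizer">
          <str name="class">solr.StandardTokenizerFactory</str>
        </lst>
        <lst name="filter">
          <str name="class">solr.SynonymFilterFactory</str>
          <str name="synonyms">synonyms.txt</str>
          <str name="ignoreCase">true</str>
          <str name="expand">true</str>
        </lst>
      </lst>
    </lst>
  </queryParser>

Queries then opt in with defType=synonym_edismax&synonyms=true.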
tlogs not deleting as usual in Solr 5.5.1?
The repetition below is on purpose, to show the contrast between Solr versions.

In Solr 4.10.3, we have autocommits disabled. We do a dataimport of a few hundred thousand records and have a tlog that grows to ~1.2G.

In Solr 5.5.1, we have autocommits disabled. We do a dataimport of a few hundred thousand records and have a tlog that grows to ~1.6G. (Same exact data, slightly larger tlog, but who knows, that's fine.)

In Solr 4.10.3, tlogs ARE deleted after issuing update?commit=true. (And deleted immediately.)

In Solr 5.5.1, tlogs ARE NOT deleted after issuing update?commit=true.

We want the tlog to delete like it did in Solr 4.10.3. Perhaps there is a configuration setting or feature of Solr 5.5.1 that causes this?

Would appreciate any tips on configuration or code we could change to ensure the tlog will delete after a hard commit.
re: tlogs not deleting as usual in Solr 5.5.1?
After some more searching, I found a thread online where Erick Erickson explains that old tlogs are left around in case a peer needs to sync, even when SolrCloud is not enabled. That makes sense, but we'll probably want to enable autoCommit and then trigger replication on the slaves when we know everything is committed after a full import. (We disable polling.)

From: "Chris Morley"
Sent: Thursday, June 16, 2016 3:20 PM
To: "Solr Newsgroup"
Subject: tlogs not deleting as usual in Solr 5.5.1?

> The repetition below is on purpose, to show the contrast between Solr
> versions.
>
> In Solr 4.10.3, we have autocommits disabled. We do a dataimport of a few
> hundred thousand records and have a tlog that grows to ~1.2G.
>
> In Solr 5.5.1, we have autocommits disabled. We do a dataimport of a few
> hundred thousand records and have a tlog that grows to ~1.6G. (Same exact
> data, slightly larger tlog, but who knows, that's fine.)
>
> In Solr 4.10.3, tlogs ARE deleted after issuing update?commit=true.
> (And deleted immediately.)
>
> In Solr 5.5.1, tlogs ARE NOT deleted after issuing update?commit=true.
>
> We want the tlog to delete like it did in Solr 4.10.3. Perhaps there is a
> configuration setting or feature of Solr 5.5.1 that causes this?
>
> Would appreciate any tips on configuration or code we could change to
> ensure the tlog will delete after a hard commit.
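For reference, enabling hard autocommits is a solrconfig.xml change; the stock 5.x form (the interval is ours to tune) is:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>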
Re: tlogs not deleting as usual in Solr 5.5.1?
Thanks, Erick - that's what we have settled on doing until we are using SolrCloud, which will be later this year with any luck. We want to get up onto Solr 5.5.1 first (ASAP), and we tried disabling tlogs today; that seems to fit the bill.

From: "Erick Erickson"
Sent: Friday, June 17, 2016 2:36 PM
To: "solr-user" , ch...@depahelix.com
Subject: Re: tlogs not deleting as usual in Solr 5.5.1?

If you are NOT using SolrCloud and don't care about Real Time Get, you can just disable the tlogs entirely. They're not doing you all that much good in that case... The tlogs are irrelevant when it comes to master/slave replication.

FWIW,
Erick

On Fri, Jun 17, 2016 at 9:14 AM, Chris Morley wrote:
> After some more searching, I found a thread online where Erick Erickson is
> telling someone about how there are old tlogs left around in case there is
> a need for a peer to sync even if SolrCloud is not enabled. That makes
> sense, but we'll probably want to enable autoCommit and then trigger
> replication on the slaves when we know everything is committed after a full
> import. (We disable polling.)
>
> From: "Chris Morley"
> Sent: Thursday, June 16, 2016 3:20 PM
> To: "Solr Newsgroup"
> Subject: tlogs not deleting as usual in Solr 5.5.1?
>
> The repetition below is on purpose to show the contrast between solr
> versions.
>
> In Solr 4.10.3, we have autocommits disabled. We do a dataimport of a few
> hundred thousand records and have a tlog that grows to ~1.2G.
>
> In Solr 5.5.1, we have autocommits disabled. We do a dataimport of a few
> hundred thousand records and have a tlog that grows to ~1.6G. (same exact
> data, slightly larger tlog but who knows, that's fine)
>
> In Solr 4.10.3 tlogs ARE deleted after issuing update?commit=true.
> (And deleted immediately.)
>
> In Solr 5.5.1 tlogs ARE NOT deleted after issuing update?commit=true.
>
> We want the tlog to delete like it did in Solr 4.10.3. Perhaps there is a
> configuration setting or feature of Solr 5.5.1 that causes this?
>
> Would appreciate any tips on configuration or code we could change to
> ensure the tlog will delete after a hard commit.
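Concretely, disabling the tlogs amounts to removing (or commenting out) the updateLog element in solrconfig.xml:

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- tlogs disabled; fine without SolrCloud or Real Time Get:
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    -->
  </updateHandler>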
changing the /solr path, additional steps needed for 6.1
This might help some people: changing the URL from server:port/solr to server:port/ourspecialpath is a bit inconvenient. You have to change several files where the solr part of the request path is hardcoded:

server/solr-webapp/webapp/WEB-INF/web.xml
server/solr/solr.xml
server/contexts/solr-jetty-context.xml

Now, with the New UI defaulted to on in 6.1, you also have to change:

server/solr-webapp/webapp/js/angular/services.js (in a bunch of places)

-Chris.
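For illustration, the contextPath setting in solr-jetty-context.xml is the key change there; this is only a sketch, since the exact file contents vary by version:

  <!-- server/contexts/solr-jetty-context.xml (sketch) -->
  <Configure class="org.eclipse.jetty.webapp.WebAppContext">
    <Set name="contextPath"><Property name="hostContext" default="/ourspecialpath"/></Set>
    <Set name="war"><Property name="jetty.base"/>/solr-webapp/webapp</Set>
  </Configure>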
re: Implementing custom analyzer for multi-language stemming
I know BasisTech.com has a plugin for Elasticsearch that extends stemming/lemmatization to work across 40 natural languages. I'm not sure what they have for Solr, but I think something like that may exist as well.

Cheers,
-Chris.

From: "Eugene"
Sent: Wednesday, July 30, 2014 1:48 PM
To: solr-user@lucene.apache.org
Subject: Implementing custom analyzer for multi-language stemming

Hello, fellow Solr and Lucene users and developers!

In our project we receive text from users in different languages. We detect the language automatically and use the Google Translate APIs a lot (so having an arbitrary number of languages in our system doesn't concern us). However, we need to be able to search using stemming. Having nearly a hundred fields (several fields for each language, with language-specific stemmers) listed in our search query is not an option. So we need a way to have a single index which has stemmed tokens for different languages.

I have two questions:

1. Are there already (third-party) custom multi-language stemming analyzers? (I doubt that no one else has run into this issue.)
2. If I'm going to implement such an analyzer myself, could you please suggest a better way to 'pass' the detected language value into the analyzer? Detecting the language in the analyzer itself is not an option, because: a) we already detect it elsewhere; b) we do it based on the combined values of many fields ('name', 'topic', 'description', etc.), while the current field can be too short for reliable detection; c) sometimes we just want to specify the language explicitly.

The obvious hack would be to prepend the ISO 639-1 code to the field value, but I'd like to believe that Solr allows for a cleaner solution. I can think of either: a) a custom query parameter (but I guess this would require modifying request handlers, etc., which is highly undesirable); or b) getting the value from another field (we obviously have a 'language' field, and we do not have mixed-language records). If that is possible, could you please describe the mechanism for doing this or point to relevant code examples?

Thank you very much and have a good day!
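As a sketch of option (b)-style routing - the indexing code reads the 'language' field and constructs an analyzer to match - something like this could work. This is untested, uses the Lucene 5.x-era API, and the package for LowerCaseFilter shifts between versions:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.core.LowerCaseFilter;
  import org.apache.lucene.analysis.snowball.SnowballFilter;
  import org.apache.lucene.analysis.standard.StandardTokenizer;

  /**
   * Stems according to a language chosen at construction time, e.g. read
   * from the document's 'language' field by the indexing code.
   */
  public final class PerLanguageStemmingAnalyzer extends Analyzer {
    private final String snowballStemmer; // e.g. "English", "German", "Russian"

    public PerLanguageStemmingAnalyzer(String snowballStemmer) {
      this.snowballStemmer = snowballStemmer;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer source = new StandardTokenizer();
      TokenStream result = new LowerCaseFilter(source);
      result = new SnowballFilter(result, snowballStemmer);
      return new TokenStreamComponents(source, result);
    }
  }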
re: Solr is working very slow after certain time
The Solr Performance Factors wiki page mentions two big tips that may help you, but you have to read the rest of the page to make sure you understand the caveats there:

In general, adding many documents per update request is faster than adding one per update request.

Reducing the frequency of automatic commits, or disabling them entirely, may speed up indexing.

Source: http://wiki.apache.org/solr/SolrPerformanceFactors#Indexing_Performance

From: "Ameya Aware"
Sent: Thursday, July 31, 2014 1:56 PM
To: solr-user@lucene.apache.org
Subject: Solr is working very slow after certain time

Hi,

I could index around 10 documents in a couple of hours. But after that, the time for indexing is very large (around just 15-20 documents per minute). I have taken care of garbage collection. I am passing the parameters below to Solr:

-Xms6144m -Xmx6144m -XX:MaxPermSize=128m -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=6 -XX:ParallelGCThreads=6 -XX:CMSInitiatingOccupancyFraction=70 -XX:NewRatio=3 -XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled -XX:+UseCompressedOops -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts -XX:-UseGCOverheadLimit

Can anyone help to solve this problem?

Thanks,
Ameya
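To make the first tip concrete, here is a minimal SolrJ sketch (4.x-era API; the URL, batch size, and generated documents are placeholders) that sends many documents per update request and commits once at the end:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexer {
    public static void main(String[] args) throws Exception {
      SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 100000; i++) {      // placeholder document source
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        batch.add(doc);
        if (batch.size() == 1000) {           // many docs per update request
          server.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        server.add(batch);
      }
      server.commit();                        // a single commit at the end
      server.shutdown();
    }
  }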
re: How to accommodate huge data
Look into SolrCloud.

From: "Ethan"
Sent: Thursday, August 28, 2014 1:59 PM
To: "solr-user"
Subject: How to accommodate huge data

Our index size is 110GB and growing. It has crossed our RAM capacity of 96GB, and we are seeing a lot of disk and network IO, resulting in huge latencies and instability (one of the servers used to shut down and stay in recovery mode when restarted). Our admin added swap space, and that seems to have mitigated the issue. But what is the usual practice in such a scenario? The index size eventually outgrows RAM and is pushed onto disk. Is it advisable to shard (the Solr forum says no)? Or is there a different mechanism?

System config: We have a 3-node cluster with RAID1 SSDs. Two nodes are running Solr and the other is there to maintain quorum.

-E
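For example, spreading the index across the nodes starts with creating a sharded collection via the Collections API; the collection name and counts below are placeholders to size against your hardware:

  curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=products&numShards=3&replicationFactor=2&maxShardsPerNode=2'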
Re: reg: efficient querying using solr
This might help (indirectly):

http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls

From: "gururaj kosuru"
Sent: Wednesday, June 12, 2013 12:28 AM
To: "solr-user"
Subject: Re: reg: efficient querying using solr

Thanks Walter, Shawn and Otis for the assistance. I will look into tuning the parameters by experimenting, as that seems to be the only way to go.

On 11 June 2013 19:17, Shawn Heisey wrote:
> On 6/11/2013 12:15 AM, gururaj kosuru wrote:
> > How can one calculate an ideal max shard size for a solr core instance if I
> > am running a cloud with multiple systems of 4GB?
>
> That question is impossible to answer without experimentation, but
> here's a good starting point. That's all it is, a starting point:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Shawn
>