fq efficiency
Hi all,

I'm wondering if filter queries are efficient enough for my use cases. I have lots and lots of users in a big, multi-tenant, sharded index. To run a search, I can use an fq on the user id and pass in the search terms. Does this scale well with the # of users? I suppose that, since user id is indexed, generating the filter data (which is cached) will be fast. And looking up search terms is fast, of course. But if the search term is a common one that many users have in their documents, then Solr may have to perform an intersection between two large sets: docs from all users with the search term and all of the current user's docs.

Also, how about auto-complete and searching with a trailing wildcard? As I understand it, these work well in a single-tenant index because keywords are sorted in the index, so it's easy to get all the search terms that match "foo*". In a multi-tenant index, all users' keywords are stored together. So if Lucene were to look at all the keywords from "foo" to "fooz" (I'm not sure if it actually does this), it would skip over a large majority of keywords that don't belong to this user.

Thanks,
Scott
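For reference, the trailing-wildcard enumeration described above can be sketched against the Lucene 4.x TermsEnum API. This is an illustration of the idea, not the actual PrefixQuery code: it seeks the sorted term dictionary to the prefix and walks forward until terms stop matching.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.StringHelper;

    final class PrefixScan {
        // Collect every term in 'field' that starts with 'prefix', e.g. "foo".
        static List<String> prefixTerms(IndexReader reader, String field, String prefix)
                throws IOException {
            List<String> out = new ArrayList<String>();
            Terms terms = MultiFields.getTerms(reader, field);
            if (terms == null) return out;
            TermsEnum te = terms.iterator(null);
            BytesRef p = new BytesRef(prefix);
            // Seek to the first term >= prefix, then walk forward in sorted order.
            if (te.seekCeil(p) == TermsEnum.SeekStatus.END) return out;
            do {
                BytesRef t = te.term();
                if (!StringHelper.startsWith(t, p)) break;  // left the "foo".."fooz" range
                out.add(t.utf8ToString());
            } while (te.next() != null);
            return out;
        }
    }

In a multi-tenant index this walk really does visit every user's terms in that range, which is the concern raised above.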
RE: fq efficiency
Thanks, that link is very helpful, especially the section "Leapfrog, anyone?" This actually seems quite slow for my use case. Suppose we have 10,000 users and 1,000,000 documents. We search for "hello" for a particular user, and let's assume that the fq set for the user is cached. "hello" is a common word and perhaps 10,000 documents will match. If the user has 100 documents, then finding the intersection requires checking each list ~100 times. If the user has 1,000 documents, we check each list ~1,000 times. That doesn't scale well.

My searches are usually in one user's data. How can I take advantage of that? I could have a separate index for each user, but loading so many indexes at once seems infeasible; and dynamically loading & unloading indexes is a pain.

Or I could create a filter that takes tokens and prepends them with the user id. That seems like a good solution, since my keyword searches always include a user id (and usually just 1 user id). Though I wonder if there is a downside I haven't thought of.

Thanks,
Scott

> -----Original Message-----
> From: Shawn Heisey [mailto:s...@elyograg.org]
> Sent: Tuesday, November 05, 2013 4:35 PM
> To: solr-user@lucene.apache.org
> Subject: Re: fq efficiency
>
> On 11/5/2013 3:36 PM, Scott Schneider wrote:
> > I'm wondering if filter queries are efficient enough for my use
> > cases. [...]
>
> From what I understand, there's not really a whole lot of difference
> between queries and filter queries when they are NOT cached, except
> that the main query and the filter queries are executed in parallel,
> which can save time.
>
> When filter queries are found in the filterCache, it's a different
> story. They get applied *before* the main query, which means that the
> main query won't have to work as hard. The filterCache stores
> information about which documents in the entire index match the
> filter. By storing it as a bitset, the amount of space required is
> relatively low. Applying filterCache results is very efficient.
>
> There are also advanced techniques, like assigning a cost to each
> filter and creating postfilters:
>
> http://yonik.com/posts/advanced-filter-caching-in-solr/
>
> Thanks,
> Shawn
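The "leapfrog" intersection from the linked article can be sketched with Lucene's DocIdSetIterator API; this is an illustration of the technique, not Solr's actual code. Each side advance()s to the other's current doc, so the number of steps is bounded by the smaller of the two lists, and skip lists make each advance() cheaper than a linear scan.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.search.DocIdSetIterator;

    final class Leapfrog {
        static List<Integer> intersect(DocIdSetIterator a, DocIdSetIterator b)
                throws IOException {
            List<Integer> hits = new ArrayList<Integer>();
            int docA = a.nextDoc();
            int docB = b.nextDoc();
            while (docA != DocIdSetIterator.NO_MORE_DOCS
                    && docB != DocIdSetIterator.NO_MORE_DOCS) {
                if (docA == docB) {        // both match: record and step both
                    hits.add(docA);
                    docA = a.nextDoc();
                    docB = b.nextDoc();
                } else if (docA < docB) {  // leapfrog a up to b's position
                    docA = a.advance(docB);
                } else {                   // leapfrog b up to a's position
                    docB = b.advance(docA);
                }
            }
            return hits;
        }
    }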
RE: fq efficiency
Digging a bit more, I think I have answered my own questions. Can someone please say if this sounds right?

http://wiki.apache.org/solr/LotsOfCores looks like a pretty good solution. If I give each user his own shard, each query can be run in only one shard. The effect of the filter query will basically be to find that shard. The requirements listed on the wiki suggest that performance will be good. But in Solr 3.x, this won't scale with the # of users/shards.

Prepending a user id to indexed keywords using an analyzer will break wildcard search. If there is a wildcard, the query analyzer doesn't run filters, so it won't prepend the user id. I could prepend the user id myself before calling Solr, but that seems... bad.

Scott

> -----Original Message-----
> From: Scott Schneider [mailto:scott_schnei...@symantec.com]
> Sent: Thursday, November 07, 2013 2:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: fq efficiency
>
> Thanks, that link is very helpful, especially the section "Leapfrog,
> anyone?" [...]
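The prepend-the-user-id idea being discussed might look like the following hypothetical TokenFilter. The class name is invented, and the tenant prefix is assumed to be wired in at construction time (e.g. by a factory reading per-core config), since analyzers cannot see request parameters.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    final class TenantPrefixFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final String prefix;  // e.g. "user42|"

        TenantPrefixFilter(TokenStream input, String prefix) {
            super(input);
            this.prefix = prefix;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;  // upstream token stream is exhausted
            }
            // Rewrite "hello" as "user42|hello" so each tenant's terms are
            // partitioned inside the shared term dictionary.
            String term = termAtt.toString();
            termAtt.setEmpty().append(prefix).append(term);
            return true;
        }
    }

As noted above, this helps ordinary term queries but not wildcards, since the query analyzer's filters are skipped for wildcard terms.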
RE: search with wildcard
I know it's documented that Lucene/Solr doesn't apply filters to queries with wildcards, but this seems to trip up a lot of users. I can see why wildcards break a number of filters, but others (e.g. charset mapping) could mostly or entirely still work.

The N-gram filter is another one that would be great to still run when there are wildcards. If you indexed 4-grams and the query is "*testp*", you currently won't get any results; but the N-gram filter could have a wildcard mode that, in this case, would return just the first 4-gram as a token. Is this something you've considered? It would have to be supported in the core analysis framework, but disabled by default; then it could be enabled one by one for existing filters.

Apologies if the dev list is a better place for this.

Scott

> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Thursday, November 21, 2013 8:40 AM
> To: solr-user@lucene.apache.org
> Subject: Re: search with wildcard
>
> Hi Andreas,
>
> If you don't want to use wildcards at query time, an alternative way
> is to use NGrams at indexing time. This will produce a lot of tokens.
> For example, the 4-grams of your example: Supertestplan => supe uper
> pert erte rtes *test* estp stpl tpla plan
>
> Is that what you want? By the way, why do you want to search inside of
> words?
>
> <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="4"/>
>
> On Thursday, November 21, 2013 5:23 PM, Andreas Owen <a...@conx.ch>
> wrote:
>
> I suppose I have to create another field with different tokenizers and
> set the boost very low so it doesn't really mess with my ranking,
> because the word is now in 2 fields. What kind of tokenizer can do the
> job?
>
> From: Andreas Owen [mailto:a...@conx.ch]
> Sent: Donnerstag, 21. November 2013 16:13
> To: solr-user@lucene.apache.org
> Subject: search with wildcard
>
> I am querying "test" in solr 4.3.1 over the field below and it's not
> finding all occurrences. It seems that if it is a substring of a word
> like "Supertestplan", it isn't found unless I use wildcards:
> "*test*". This is right because of my tokenizer, but does someone know
> a way around this? I don't want to add wildcards because that messes
> up queries with multiple words.
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     ...
>     <filter class="solr.StopFilterFactory"
>             words="lang/stopwords_de.txt" format="snowball"
>             enablePositionIncrements="true"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="German"/>
>     ...
>   </analyzer>
> </fieldType>
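To make Ahmet's example concrete, here is a minimal, library-free sketch of the character n-grams a filter with minGramSize=maxGramSize=4 would emit for one token; "supertestplan" yields supe, uper, pert, erte, rtes, test, estp, stpl, tpla, plan, so a plain query for "test" matches with no wildcards at all.

    import java.util.ArrayList;
    import java.util.List;

    final class NGrams {
        // All contiguous substrings of length n, in position order.
        static List<String> ngrams(String term, int n) {
            List<String> grams = new ArrayList<String>();
            for (int i = 0; i + n <= term.length(); i++) {
                grams.add(term.substring(i, i + n));
            }
            return grams;
        }

        public static void main(String[] args) {
            // Prints [supe, uper, pert, erte, rtes, test, estp, stpl, tpla, plan]
            System.out.println(ngrams("supertestplan", 4));
        }
    }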
Solr substring search
Hello,

I'm trying to find out how Solr runs a query for "*foo*". Google tells me that you need to use NGramFilterFactory for that kind of substring search, but I find that even with very simple fieldTypes, it just works; e.g., it works on the tutorial example. (Perhaps because I'm testing on very small data sets, Solr is willing to look through all the keywords.) Can someone tell me exactly how this works and/or point me to the Lucene code that implements this?

Thanks,
Scott
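At a high level: a wildcard query is one of Lucene's MultiTermQuery subclasses. At rewrite time it enumerates the field's term dictionary (potentially every term, when there is a leading wildcard) and keeps each term the pattern accepts, which is why it "just works" on a tiny index but slows down on a big one. A hedged, Lucene 4.x-era sketch; the classes to read are WildcardQuery, AutomatonQuery, and MultiTermQuery's rewrite methods.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.MultiTermQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    final class SubstringQueryExample {
        static Query substringQuery(String field, String fragment) {
            // "*foo*" compiles to an automaton; at rewrite time Lucene
            // intersects it with the term dictionary and collects every
            // matching term.
            WildcardQuery q = new WildcardQuery(new Term(field, "*" + fragment + "*"));
            q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT);
            return q;
        }
    }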
Querying a non-indexed field?
Hello,

Is it possible to restrict query results using a non-indexed, stored field? e.g. I might index fewer fields to reduce the index size. I query on a few indexed fields, getting a small # of results. I want to restrict this further based on values from non-indexed, stored fields. I can obviously do this myself, but it would be nice if Solr could do this for me.

Thanks,
Scott
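The "do this myself" fallback would look something like this SolrJ sketch, using the Solr 4.x-era client API; the field names and the accepted value are hypothetical.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    final class StoredFieldPostFilter {
        static List<SolrDocument> search() throws SolrServerException {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("title:hello");  // query indexed fields only
            query.setFields("id", "title", "status");        // "status" is stored, not indexed
            List<SolrDocument> kept = new ArrayList<SolrDocument>();
            for (SolrDocument doc : solr.query(query).getResults()) {
                // Restrict further on the stored value, client-side.
                if ("active".equals(doc.getFieldValue("status"))) {
                    kept.add(doc);
                }
            }
            return kept;
        }
    }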
RE: Querying a non-indexed field?
Ok, thanks for your answers!

Scott

> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
> Sent: Wednesday, September 18, 2013 5:36 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Querying a non-indexed field?
>
> Moreover, you may be trying to save/optimize in the wrong place. Maybe
> these additional indexed fields are not so costly. Maybe you can
> optimize in some other part of your setup.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
>
> On Sep 18, 2013 5:47 PM, "Chris Hostetter" wrote:
>
> > : Subject: Re: Querying a non-indexed field?
> > :
> > : No. --wunder
> >
> > To elaborate just a bit...
> >
> > : query on a few indexed fields, getting a small # of results. I want
> > : to restrict this further based on values from non-indexed, stored
> > : fields. I can obviously do this myself, but it would be nice if
> > : Solr could do this for me.
> >
> > ...you could implement this in a custom SearchComponent, or a custom
> > qparser that generates PostFilter-compatible queries that look at the
> > stored field values -- but it's extremely unlikely that you would
> > ever convince any of the lucene/solr devs to agree to commit a
> > general-purpose version of this type of logic into the code base,
> > because in the general case (arbitrary unknown number of documents
> > matching the main query) it would be extremely inefficient and would
> > encourage "bad" user behavior.
> >
> > -Hoss
Problem loading my codec sometimes
Hello,

I created my own codec and Solr can find it sometimes and not other times. When I start fresh (delete the data folder and run Solr), it all works fine. I can add data and query it. When I stop Solr and start it again, I get:

Caused by: java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.Codec with name 'MyCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [SimpleText, Appending, Lucene40, Lucene3x, Lucene41, Lucene42]

I added the JAR to the path and I'm pretty sure Java sees it, or else it would not be using my codec when I start fresh. (I've looked at the index files and verified that it's using my codec.) I suppose Solr is asking SPI for my codec based on the codec class name stored in the index files, but I don't see why this would fail when a fresh start works. Any thoughts?

Thanks,
Scott
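The failing call is Lucene's SPI-based codec resolution. A hedged sketch of what happens when a segment written with the custom codec is reopened:

    import org.apache.lucene.codecs.Codec;

    final class CodecLookup {
        static Codec resolve() {
            // Lucene reads the codec name recorded in each segment and looks
            // it up via SPI. If no jar on the classpath registers "MyCodec"
            // in META-INF/services/org.apache.lucene.codecs.Codec, this call
            // throws the IllegalArgumentException quoted above.
            return Codec.forName("MyCodec");
        }
    }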
RE: Problem loading my codec sometimes
Thanks for your quick response! My jar was in solr/lib. I removed all the <lib> directives from solrconfig.xml, but I still get the error. My solr.xml doesn't have sharedLib.

By the way, I am running Solr 4.4.0 with most of the default example files (including solr.xml). My schema.xml and solrconfig.xml are from another project using Solr 3.6. I modified them a bit to fix any obvious errors.

I still wonder why it can create a new index using my codec, but not load an index previously created with my codec. In solrconfig.xml, I specify the CodecFactory along with the package name, whereas the codec name that is read from the index file has no package name. Could that be the problem? I think that's the way it's supposed to be. Could it be that Solr has my jar in the classpath, but SPI is not registering my codec class from the jar? I'm not familiar with SPI. What else can I try?

Thanks,
Scott

> -----Original Message-----
> From: Shawn Heisey [mailto:s...@elyograg.org]
> Sent: Tuesday, September 24, 2013 5:51 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Problem loading my codec sometimes
>
> On 9/24/2013 6:32 PM, Scott Schneider wrote:
> > I created my own codec and Solr can find it sometimes and not other
> > times. [...]
>
> What I always recommend for those who want to use custom and contrib
> jars is that they put all such jars (and their dependencies) into
> ${solr.solr.home}/lib, don't use any <lib> directives in
> solrconfig.xml, and don't put the sharedLib attribute into solr.xml.
> Doing it in any other way has a tendency to trigger bugs or causes
> jars to get loaded more than once.
>
> The ${solr.solr.home} property defaults to $CWD/solr (CWD is current
> working directory for those who don't already know) and is the
> location of the solr.xml file. Note that depending on the exact
> version of Solr and which servlet container you are using, there may
> actually be two solr.xml files, one which loads Solr into your
> container and one that configures Solr. I am referring to the latter.
>
> If you are using the solr example and its directory layout, the
> directory you would need to put all jars into is example/solr/lib ...
> which is a directory that doesn't exist and has to be created.
>
> http://wiki.apache.org/solr/Solr.xml%20%28supported%20through%204.x%29
> http://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond
>
> Thanks,
> Shawn
RE: Problem loading my codec sometimes
Ah, I fixed it. I wasn't properly including the org.apache.lucene.codecs.Codec file in my jar. I wasn't sure if it was necessary in Solr, since I specify my factory in solrconfig.xml. I think that's why I could create a new index, but not load an existing one.

Scott

> -----Original Message-----
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: Wednesday, September 25, 2013 9:49 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problem loading my codec sometimes
>
> : I still wonder why it can create a new index using my codec, but not
> : load an index previously created with my codec. In solrconfig.xml, I
> : specify the CodecFactory along with the package name, whereas the
> : codec name that is read from the index file has no package name.
> : Could that be the problem? I think that's the way it's supposed to
> : be. Could it be that Solr has my jar in the classpath, but SPI is
> : not registering my codec class from the jar? I'm not familiar with
> : SPI.
>
> It's very possible that there is a classloader / SPI runtime race
> condition in looking up the codec names found in segment files. This
> sort of classpath-related runtime issue is extremely hard to write
> tests for.
>
> Could you please file a bug and include...
>
> * the source of your codec (or a simple sample codec that you can
>   also use to reproduce the problem)
> * a zipped-up copy of your entire solr home directory, including
>   the jar file containing your codec so we can verify the SPI files
>   are in there properly - no need to include an actual index here
> * some simple sample documents in XML or JSON that we can index
>   with the schema you are using
>
> -Hoss
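For anyone hitting the same thing: the missing piece is the SPI services file inside the jar. A hedged sketch with hypothetical class and package names; the jar must also contain a file META-INF/services/org.apache.lucene.codecs.Codec whose single line is com.example.MyCodec.

    package com.example;

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.FilterCodec;

    public final class MyCodec extends FilterCodec {
        // SPI requires a public no-arg constructor. The name passed to
        // super() is what gets recorded in segment files and resolved by
        // Codec.forName() on reopen; everything else is delegated to the
        // stock Lucene 4.2 codec here.
        public MyCodec() {
            super("MyCodec", Codec.forName("Lucene42"));
        }
    }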
RE: Problem loading my codec sometimes
Ok, I created SOLR-5278. Thanks again!

Scott

> -----Original Message-----
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: Wednesday, September 25, 2013 10:15 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problem loading my codec sometimes
>
> : Ah, I fixed it. I wasn't properly including the
> : org.apache.lucene.codecs.Codec file in my jar. I wasn't sure if it
> : was necessary in Solr, since I specify my factory in solrconfig.xml.
> : I think that's why I could create a new index, but not load an
> : existing one.
>
> Ah, interesting.
>
> Yes, you definitely need the SPI registration in the jar file so that
> Lucene can resolve codecs found on disk when opening them -- the
> configuration in solrconfig.xml tells Solr which codec to use when
> writing new segments, but it must respect the codec information in
> segments found on disk when opening them (that's how the index
> backcompat works), and those are looked up via SPI.
>
> Can you do me a favor please and still file an issue with these
> details. The attachments I asked about before would still be handy,
> but probably not necessary -- at a minimum, could you show us the
> "jar tf" output of your plugin jar when you were having the problem?
>
> Even if the codec factory code can find the configured codec on
> startup, we should probably throw a very loud error right away if
> that same codec can't be found by name using SPI, to prevent people
> from running into confusing problems when making mistakes like this.
>
> -Hoss