multi-valued with metadata?
I've done my best to search through the archives for this problem, and found at least one person dealing with a similar issue (with no responses). I'm sure this has been asked more than once before, but my search-fu is apparently lacking.

Essentially, I need to be able to retrieve some metadata (author IDs) along with the multi-valued fields holding document authors; e.g., I search in document titles, I get the doc ID, the list of authors, and *their* IDs, so I can drill down to other papers by these authors.

A basic sample of the XML data:

    A001
    This is a title
        111  John Smith
        222  Mary Johnson

    A002
    And this is another title
        222  Mary Johnson
        333  Alice Pocahontas

The only reasonable option I've thought of is to index the authors twice, with one list of names and one of IDs. It's not clear to me that SOLR guarantees ordered results of these multi-valued fields, though. Another option would be to delimit the ID in some manner so that I could search on and pull the ID from the author fields, but only display the name.

A final option, and the one I'm hoping for, is to find that SOLR has some built-in support for this kind of thing.

- Andrew
Re: multi-valued with metadata?
Two things come to mind, neither optimal, but...

First, index both author and ID in one value with a delimiter, something like

    Mary Johnson | 222

and break that value up again when you display the documents. Make sure your analyzer splits this up appropriately, or your searching will be "interesting".

The other would be to do the above in a different field, perhaps stored only, so you'd have info that's never displayed to the user but is still available from the doc.

I'm pretty sure that order is preserved for multi-valued fields, but I'm not 100% sure that this behavior is guaranteed in the future.

Best
Erick

On Sun, Nov 28, 2010 at 11:30 AM, Andrew Houghton wrote:
> The only reasonable option I've thought of is to index the authors
> twice, with one list of names and one of IDs. It's not clear to me
> that SOLR guarantees ordered results of these multi-valued fields,
> though. Another option would be to delimit the ID in some manner so
> that I could search on and pull the ID from the author fields, but
> only display the name.
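For the "index twice" option, a minimal Python sketch of pairing the two lists on the client side. It assumes that multi-valued fields come back in the order they were indexed, which this thread believes but does not confirm as guaranteed. The field names `author` and `author_id` and the sample document are hypothetical.

```python
def pair_authors(doc):
    """Pair author IDs with names by position in two parallel multi-valued fields.

    Assumes both lists are returned in indexing order (not a confirmed guarantee).
    """
    names = doc.get("author", [])
    ids = doc.get("author_id", [])
    # A length mismatch means the lists cannot be paired safely.
    if len(names) != len(ids):
        raise ValueError("author and author_id lists differ in length")
    return list(zip(ids, names))


# Hypothetical document as a client might receive it from a search response.
doc = {
    "id": "A001",
    "title": "This is a title",
    "author": ["John Smith", "Mary Johnson"],
    "author_id": ["111", "222"],
}
pairs = pair_authors(doc)
```

The length check catches only the crudest failure; if ordering ever changed, the pairs would be silently wrong, which is the argument for the delimited single-field approach.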
Re: Facet.query and collapsing
http://wiki.apache.org/solr/FieldCollapsing#Known_Limitations

> Hi All,
>
> I'm in a situation where I need to perform a facet on a query with field
> collapsing.
>
> Let's say the main query is something like this
>
> title:apple&fq={!tag=sources}source_id:(33 OR 44)&facet=on&facet.field={!ex=sources}source_id&facet.query=source_id:(33 OR 44)&collapse=on&collapse.field=hash_id
>
> I'd like my facet query to return the number of unique documents (based on
> the hash_id field) that are associated to either source 33 or 44
>
> Right now, the query works but the count returned is larger than expected
> since there is no collapsing performed on the facet query's result set.
>
> Is there any way of doing this? I'd like to be able to do this without
> performing a second request.
>
> Thanks
>
> NOTE: I'm using Solr 1.4.1 with patch 236
> (https://issues.apache.org/jira/browse/SOLR-236)
Re: Is this sort order possible in a single query?
> I can't see a way to do it without functionqueries at the moment, which
> doesn't mean there isn't any.

If you want to use the suggested sort method, you could probably sort first by score:

    sort=score desc, num_copies desc, num_comments desc

To let the score be influenced by the exact author match only, you can set the boosts of all other fields to 0 (DisMax):

    qf=author^0.0 title^0.0 author_exact

Personally, I doubt that these requirements will give a good user experience. If you do NOT get an exact author match, you will get a "bestseller" list that does not take your terms into account in the ranking at all.

Unless there are specific reasons for these requirements, I'd recommend using rank boosts instead of simple sorting. Boost author_exact very high using DisMax, combined with a version of the author field with KeywordTokenizerFactory and LowerCaseFilterFactory, perhaps combined with PatternReplaceFilterFactory to normalize punctuation etc. Then boost num_copies and num_comments using FunctionQueries in such a way that sold copies count more than comments. Example:

    http://localhost:8093/solr/select?defType=dismax&q=j.k. rowling&qf=author^5.0 author_exact^1000.0 title^10.0 descr^0.2&bf=log(sum(num_copies,1))^1000.0 log(sum(num_comments,1))^100.0

Another hint for fields of this kind is to disable field norms (omitNorms="true").

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
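The example request contains spaces, parentheses and carets, so it has to be URL-encoded before it is sent. Below is a minimal Python sketch of building such a request. It passes the user's terms in `q` and the function boosts in DisMax's `bf` (boost function) parameter; the helper name `build_dismax_url` is hypothetical, and the field names and boost values are simply the ones from the message above.

```python
from urllib.parse import urlencode


def build_dismax_url(base_url, user_query):
    """Build a URL-encoded DisMax request with field boosts and function boosts."""
    params = {
        "defType": "dismax",
        # Plain user terms; DisMax spreads them over the fields listed in qf.
        "q": user_query,
        # An exact author match dominates; title and description contribute less.
        "qf": "author^5.0 author_exact^1000.0 title^10.0 descr^0.2",
        # Function boosts: sold copies weigh more than comments.
        "bf": "log(sum(num_copies,1))^1000.0 log(sum(num_comments,1))^100.0",
    }
    return base_url + "?" + urlencode(params)


url = build_dismax_url("http://localhost:8093/solr/select", "j.k. rowling")
```

The boost values are a starting point only; they would need tuning against real queries.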
Re: Logging queries and hit count
You can also configure your logging framework to output the relevant logs to a separate file:

    log4j.logger.org.apache.solr.core.SolrCore=INFO, A1

This way you'll avoid too much noise from other components, but you'll get all update and admin requests as well, so you'll have to filter on core name.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 26. nov. 2010, at 21.11, Ahmet Arslan wrote:

>> Is it possible to create a lean log file for queries and
>> the number of
>> hits these queries returned?
>>
>> We are running Solr under Tomcat.
>
> I believe that many people do it at client side. But tomcat already logs that
> info. If you set tomcat's log level to INFO you can extract hits, QTime and
> query itself.
>
> INFO: [] webapp=/test path=/select/
> params={indent=on&start=0&q=solr&rows=10&version=2.2} hits=0 status=0 QTime=0
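A minimal Python sketch of turning such log lines into a lean query log. It assumes each request is logged on one line in the format of the sample quoted above and that the `params={...}` part is an ordinary query string; the function name `parse_query_log` is hypothetical. Lines without a `hits=` value, such as update requests, are skipped.

```python
import re
from urllib.parse import parse_qs

# Matches the params/hits/status/QTime portion of a request log line.
LOG_LINE = re.compile(
    r"params=\{(?P<params>.*)\} hits=(?P<hits>\d+) "
    r"status=(?P<status>\d+) QTime=(?P<qtime>\d+)"
)


def parse_query_log(line):
    """Return query, hit count and QTime for a search request line, else None."""
    match = LOG_LINE.search(line)
    if match is None:
        return None  # not a search request (e.g. an update or admin request)
    params = parse_qs(match.group("params"))
    return {
        "q": params.get("q", [""])[0],
        "hits": int(match.group("hits")),
        "qtime": int(match.group("qtime")),
    }


sample = (
    "INFO: [] webapp=/test path=/select/ "
    "params={indent=on&start=0&q=solr&rows=10&version=2.2} "
    "hits=0 status=0 QTime=0"
)
entry = parse_query_log(sample)
```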
Re: multi-valued with metadata?
The best Solr solution often involves indexing the same source fields into several Solr fields for different purposes. In this case I'd go with your idea of delimiting the author id, along with indexing it separately:

    A001
    111
    John Smith
    111|John Smith

The field author_id_name could be stored but not indexed. You would retrieve the id and author_id_name fields, and then parse the author id out of the author_id_name field and use it for drill-down searches against the author_id field. The link pointing to the drill-down search would use the name part of the author_id_name as its anchor.

Peter

On 2010/11/28 9:30 AM, "Andrew Houghton" wrote:
> The only reasonable option I've thought of is to index the authors
> twice, with one list of names and one of IDs. It's not clear to me
> that SOLR guarantees ordered results of these multi-valued fields,
> though. Another option would be to delimit the ID in some manner so
> that I could search on and pull the ID from the author fields, but
> only display the name.
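A minimal Python sketch of the client-side step described above: splitting a stored author_id_name value and building the drill-down link. The helper names and the `/solr/select` path are hypothetical, and the sketch assumes the ID itself never contains the `|` delimiter.

```python
import html
from urllib.parse import urlencode


def split_author(value, delimiter="|"):
    """Split a stored 'id|name' value into (author_id, name)."""
    author_id, _, name = value.partition(delimiter)
    return author_id, name


def drill_down_link(select_url, value):
    """Build an HTML link that searches author_id and shows the name as anchor."""
    author_id, name = split_author(value)
    query = urlencode({"q": "author_id:" + author_id})
    return '<a href="{}?{}">{}</a>'.format(select_url, query, html.escape(name))


link = drill_down_link("/solr/select", "111|John Smith")
```

Because `partition` splits on the first delimiter only, a name containing `|` survives intact.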
Re: Basic Solr Configurations and best practice
Thanks, Alexey.

I downloaded Solr 4 and implemented the TikaEntityProcessor. It worked fine with Tika 0.6, but it didn't work with Tika 0.7 or the Tika 0.8 SNAPSHOT.

On Sat, Nov 27, 2010 at 4:05 AM, Alexey Serba wrote:

> > 1- How to combine data from DIH and content extracted from file system
> > document into one document in the index?
> http://wiki.apache.org/solr/TikaEntityProcessor
> You can have one sql entity that retrieves metadata from database and
> another nested entity that parses binary file into additional fields
> in the document.
>
> > 2- Should I move the per-user permissions into a separate index? What
> > technique to implement?
> I would start with keeping permissions in the same index as the actual
> content.
>
> On Tue, Nov 23, 2010 at 11:35 AM, Darx Oman wrote:
> > Hi guys
> >
> > I'm kind of new to solr and I'm wondering how to configure solr to best
> > fulfills my requirements.
> >
> > Requirements are as follow:
> >
> > I have 2 data sources: database and file system documents. Every document in
> > the file system has related information stored in the database. Both the
> > file content and the related database fields must be indexed. Along with
> > the DB data is per-user permissions for every document. I'm using DIH for
> > the DB and Tika for the file System. The documents contents nearly never
> > change, while the DB data especially the permissions changes very
> > frequently. Total number of documents roughly around 2M and each document is
> > about 500KB.
> >
> > 1- How to combine data from DIH and content extracted from file system
> > document into one document in the index?
> >
> > 2- Should I move the per-user permissions into a separate index? What
> > technique to implement?