multi-valued with metadata?

2010-11-28 Thread Andrew Houghton
I've done my best to search through the archives for this problem, and
found at least one person dealing with a similar issue (with no
responses).  I'm sure this has been asked more than once before, but
my search-fu is apparently lacking.

Essentially, I need to be able to retrieve some metadata (author IDs)
along with the multi-valued fields holding document authors; e.g., I
search in document titles, I get the doc ID, the list of authors, and
*their* IDs, so I can drill down to other papers by these authors.

A basic sample of the XML data:



  A001
  This is a title
  
    
      111
      John Smith
    
    
      222
      Mary Johnson
    
  


  A002
  And this is another title
  
    
      222
      Mary Johnson
    
    
      333
      Alice Pocahontas
    
  



The only reasonable option I've thought of is to index the authors
twice, with one list of names and one of IDs.  It's not clear to me
that SOLR guarantees ordered results of these multi-valued fields,
though.  Another option would be to delimit the ID in some manner so
that I could search on and pull the ID from the author fields, but
only display the name.

A final option, and the one I'm hoping for, is to find that SOLR has
some built-in support for this kind of thing.

- Andrew


Re: multi-valued with metadata?

2010-11-28 Thread Erick Erickson
Two things come to mind, neither optimal, but...

First, index both author and ID with a delimiter, something like
Mary Johnson | 222
and split that value back into name and ID when you display the
documents. Make sure your analyzer breaks this up appropriately or
your searching will be "interesting".

The other would be to do the above in a different field, perhaps stored only
so you'd have info that's never displayed to the user but the info would
still be available from the doc.

I'm pretty sure that order is preserved for multi-valued fields, but I'm not
100% sure that behavior is guaranteed in the future.
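
If you do go with two parallel multi-valued fields, the client can pair them up by position. A minimal Python sketch, assuming order really is preserved as described above; the field names author_id and author_name and the helper pair_authors are made up for illustration, and the length check only guards against the two lists drifting apart:

```python
def pair_authors(doc):
    """Pair each author name with the ID stored at the same position."""
    ids = doc.get("author_id", [])      # hypothetical multi-valued field
    names = doc.get("author_name", [])  # hypothetical multi-valued field
    if len(ids) != len(names):
        # The parallel lists no longer line up, so pairing would be wrong.
        raise ValueError("author_id and author_name differ in length")
    return list(zip(ids, names))


# A document shaped like the first sample in the original post.
doc = {
    "id": "A001",
    "title": "This is a title",
    "author_id": ["111", "222"],
    "author_name": ["John Smith", "Mary Johnson"],
}
pairs = pair_authors(doc)
```

This relies entirely on positional order, which is exactly the property nobody here is willing to guarantee, so the delimiter approach is the safer of the two.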


Best
Erick


Re: Facet.query and collapsing

2010-11-28 Thread Markus Jelsma
http://wiki.apache.org/solr/FieldCollapsing#Known_Limitations

> Hi All,
> 
> I'm in a situation where I need to perform a facet on a query with field
> collapsing.
> 
> Let's say the main query is something like this
> 
> title:apple&fq={!tag=sources}source_id:(33 OR
> 44)&facet=on&facet.field={!ex=sources}source_id&facet.query=source_id:(33
> OR 44)&collapse=on&collapse.field=hash_id
> 
> I'd like my facet query to return the number of unique documents (based on
> the hash_id field) that are associated to either source 33 or 44
> 
> Right now, the query works but the count returned is larger than expected
> since there is no collapsing performed on the facet query's result set.
> 
> Is there any way of doing this? I'd like to be able to do this without
> performing a second request.
> 
> Thanks
> 
> NOTE: I'm using Solr 1.4.1 with patch 236
> (https://issues.apache.org/jira/browse/SOLR-236)


Re: Is this sort order possible in a single query?

2010-11-28 Thread Jan Høydahl / Cominvent
> I can't see a way to do it without functionqueries at the moment, which
> doesn't mean there isn't any.

If you want to use the suggested sort method, you could probably sort first by 
score:
sort=score desc, num_copies desc, num_comments desc
To let the score be influenced by exact author match only, you can set all 
other components to 0 (disMax):
qf=author^0.0 title^0.0 author_exact


Personally I doubt that the requirements will give an excellent usability 
experience. If you do NOT get an exact author match, you will get a 
"bestseller" list, not taking into account your terms in the ranking at all.

Unless there are specific reasons for these requirements I'd recommend using 
rank boosts instead of simple sorting.

Boost author_exact very high using DisMax, where author_exact is a version of 
the author field analyzed with KeywordTokenizerFactory and 
LowerCaseFilterFactory, perhaps combined with PatternReplaceFilterFactory to 
normalize punctuation etc.


Then boost num_copies and num_comments using FunctionQueries in such a way that 
sold copies count more than comments.

Example:
http://localhost:8093/solr/select?defType=dismax&q=j.k. rowling&qf=author^5.0 
author_exact^1000.0 title^10.0 descr^0.2&bf=log(sum(num_copies,1))^1000.0 
log(sum(num_comments,1))^100.0
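
Typed by hand, a URL like this is easy to get wrong because of the spaces, carets and parentheses. A minimal Python sketch of building it from a parameter map, assuming the function boosts are meant for the DisMax bf (boost function) parameter rather than a filter query; build_dismax_url is a made-up helper, and the host, port and field names are simply the ones from this thread:

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_dismax_url(base, user_query):
    """Build a DisMax request URL with field boosts and boost functions."""
    params = {
        "defType": "dismax",
        "q": user_query,
        # Per-field query boosts; author_exact dominates on an exact match.
        "qf": "author^5.0 author_exact^1000.0 title^10.0 descr^0.2",
        # Boost functions: sold copies weigh more than comments.
        "bf": "log(sum(num_copies,1))^1000.0 log(sum(num_comments,1))^100.0",
    }
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return base + "?" + urlencode(params)


url = build_dismax_url("http://localhost:8093/solr/select", "j.k. rowling")
print(url)
```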

Another hint for fields of this kind is to disable field norms 
(omitNorms="true").

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Re: Logging queries and hit count

2010-11-28 Thread Jan Høydahl / Cominvent
You can also configure your logging framework to output the relevant logs to a 
separate file:

log4j.logger.org.apache.solr.core.SolrCore=INFO, A1

This way you'll avoid too much noise from other components, but you'll get all 
update and admin requests as well, so you'll have to filter on core name.
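
Once the request lines are in their own file, getting a lean "query plus hits" listing is mostly pattern matching. A minimal Python sketch, assuming the lines look like the Tomcat sample quoted further down in this message; the regular expression and the name parse_request_line are illustrative and not part of Solr:

```python
import re
from urllib.parse import parse_qs

# Matches the tail of a request log line: path, params, hits, status, QTime.
LOG_RE = re.compile(
    r"path=(?P<path>\S+)\s+params=\{(?P<params>.*?)\}\s+"
    r"hits=(?P<hits>\d+)\s+status=(?P<status>\d+)\s+QTime=(?P<qtime>\d+)"
)


def parse_request_line(line):
    """Return query, hits and QTime for a request line, or None if no match."""
    m = LOG_RE.search(line)
    if m is None:
        return None  # e.g. update or admin lines that carry no hit count
    params = parse_qs(m.group("params"))
    return {
        "path": m.group("path"),
        "query": params.get("q", [""])[0],
        "hits": int(m.group("hits")),
        "qtime": int(m.group("qtime")),
    }


line = (
    "INFO: [] webapp=/test path=/select/ "
    "params={indent=on&start=0&q=solr&rows=10&version=2.2} "
    "hits=0 status=0 QTime=0"
)
entry = parse_request_line(line)
```

Lines without a hits= part are skipped by returning None, which also drops most of the update and admin noise.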

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 26. nov. 2010, at 21.11, Ahmet Arslan wrote:

>> Is it possible to create a lean log file for queries and
>> the number of
>> hits these queries returned?
>> 
>> We are running Solr under Tomcat.
> 
> I believe that many people do it at client side. But tomcat already logs that 
> info. If you set tomcat's log level to INFO you can extract hits, QTime and 
> query itself.
> 
> INFO: [] webapp=/test path=/select/ 
> params={indent=on&start=0&q=solr&rows=10&version=2.2} hits=0 status=0 QTime=0
> 
> 
> 



Re: multi-valued with metadata?

2010-11-28 Thread Binkley, Peter
The best Solr solution often involves indexing the same source fields into
several Solr fields for different purposes. In this case I'd go with your
idea of delimiting the author id, along with indexing it separately:

A001
111
John Smith
111|John Smith

The field author_id_name could be stored but not indexed. You would retrieve
the id and author_id_name fields, and then parse the author id out of the
author_id_name field and use it for drill-down searches against the
author_id field. The link pointing to the drill-down search would use the
name part of the author_id_name as its anchor.
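
On the client side, the parsing step described above could look roughly like the following Python sketch. It assumes the stored values use "|" as the delimiter with the ID first, as in the sample; split_author, drilldown_links and the base URL are made up for illustration:

```python
from urllib.parse import urlencode


def split_author(value, sep="|"):
    """Split a stored 'id|name' value into (author_id, name)."""
    author_id, _, name = value.partition(sep)
    return author_id.strip(), name.strip()


def drilldown_links(doc, base="http://localhost:8983/solr/select"):
    """Return (anchor text, URL) pairs, one per author on the document."""
    links = []
    for value in doc.get("author_id_name", []):
        author_id, name = split_author(value)
        # The drill-down search runs against the indexed author_id field.
        query = urlencode({"q": f"author_id:{author_id}"})
        links.append((name, f"{base}?{query}"))
    return links


doc = {"id": "A001", "author_id_name": ["111|John Smith", "222|Mary Johnson"]}
links = drilldown_links(doc)
```

Putting the ID first means a name that happens to contain the delimiter still splits correctly, since only the first "|" is used.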

Peter





Re: Basic Solr Configurations and best practice

2010-11-28 Thread Darx Oman
Thanks, Alexey.
I downloaded Solr 4 and implemented the TikaEntityProcessor. It worked fine
with Tika 0.6, but it didn't work with Tika 0.7 or the Tika 0.8 SNAPSHOT.


On Sat, Nov 27, 2010 at 4:05 AM, Alexey Serba  wrote:

> > 1-  How to combine data from DIH and content extracted from file system
> > document into one document in the index?
> http://wiki.apache.org/solr/TikaEntityProcessor
> You can have one sql entity that retrieves metadata from database and
> another nested entity that parses binary file into additional fields
> in the document.
>
> > 2-  Should I move the per-user permissions into a separate index? What
> > technique to implement?
> I would start with keeping permissions in the same index as the actual
> content.
>
>
> On Tue, Nov 23, 2010 at 11:35 AM, Darx Oman  wrote:
> > Hi guys
> >
> > I'm kind of new to Solr and I'm wondering how to configure Solr to best
> > fulfill my requirements.
> >
> > Requirements are as follows:
> >
> > I have 2 data sources: a database and file system documents. Every
> > document in the file system has related information stored in the
> > database.  Both the file content and the related database fields must be
> > indexed.  Along with the DB data there are per-user permissions for every
> > document.  I'm using DIH for the DB and Tika for the file system.  The
> > document contents almost never change, while the DB data, especially the
> > permissions, changes very frequently. The total number of documents is
> > roughly 2M and each document is about 500KB.
> >
> > 1-  How to combine data from DIH and content extracted from file system
> > document into one document in the index?
> >
> > 2-  Should I move the per-user permissions into a separate index? What
> > technique to implement?
> >
>