Re: MoreLikeThis and dynamicField

2010-06-12 Thread Peter Karich
Hi Lance,

> You cannot give it a wildcard to go find dynamic fields;

ah, ok. But it would be nice to use wildcards.
You correctly guessed that I meant the wildcard querying. The
dynamicField definition is not the problem ... So one can define
 and
then the query via myfield_t would work.
The termVectors is only for performance reasons, I think (it worked
without this attribute).
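
A definition of that kind might look like the sketch below. The `*_t`
pattern, the `text` field type, and the attribute values are assumptions
for illustration, not the configuration actually used here:

```xml
<!-- schema.xml (sketch): pattern, type and attribute values are assumed -->
<dynamicField name="*_t" type="text" indexed="true" stored="true"
              termVectors="true"/>
```

With such a definition, a concrete field like myfield_t can be named
explicitly in the mlt.fl parameter.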

Now, regarding "stored" or "not-stored": mlt worked for me if I did:


But not if I do


Or do I need to configure the field in a different way?

Regards,
Peter.

> MoreLikeThis works on any field you can name. You cannot give it a
> wildcard to go find dynamic fields; no query feature can do this
> (would be handy!).
>
> The field must have term vectors configured. In the
> solr/example/solr/conf schema, the 'includes' field has this set and
> you can do MLT on that. Try 'usb' as a search term for the electronics
> store example in solr/example/solr.
>
> I don't know if the field needs to be stored. Also, whether it can work on
> terms in a multi-valued field. You'll have to reconfigure the 'text'
> copyField target to have term vectors.
>
> On Fri, Jun 11, 2010 at 1:06 PM, Peter Karich  wrote:
>   
>> Hi,
>>
>> it seems to me that the MoreLikeThis component doesn't work for dynamic
>> fields. Is that correct?
>> And it also doesn't work for fields which are indexed but not stored,
>> right? e.g. 'text' where dynamic fields could be copied to.
>>
>> Or did I create an incorrect example?
>>
>> Regards,
>> Peter.
>>
>> --
>> http://karussell.wordpress.com/
>>
>>
>> 
>
>
>   


-- 
http://karussell.wordpress.com/



Solr and Nutch/Droids - to use or not to use?

2010-06-12 Thread MitchK

Hello community, and have a nice Saturday,

after several discussions about Solr and Nutch, I have some questions about a
virtual (hypothetical) web search engine.

The requirements:
I. I need a scalable solution for a growing index that becomes larger than
one machine can handle. If I add more hardware, I want performance to
improve linearly.

II. I want to use technologies like the OPIC-algorithm (default algorithm in
Nutch) or PageRank or... whatever is out there to improve the ranking of the
webpages. 

III. I want to be able to easily add more fields to my documents. Imagine
one retrieves information from a webpage's content; then I want to make it
searchable.

IV. While fetching my data, I want to make special searches possible. For
example, I want to retrieve pictures from a webpage, index the
picture-related content into another search index, and save a small
thumbnail of the picture itself. Btw: this is (as far as I know) not
possible with Solr, because Solr was not intended to do such special
indexing logic.

V. I want to use filter queries (e.g. the main query "christopher lee"
returns 1.5 million results; for the subquery "action", the main query would
become a filter query and "action" would be the actual query, so a search
within search results would be easy to offer).

VI. I want to be able to use different logic for different pages. Maybe I
have a pool of 100 domains that I know better than others, and I have special
scripts that retrieve more specific information from those 100 domains. Then I
want to apply my special logic to those 100 domains, while every other domain
uses the default logic.

-

The project is only virtual. So why am I asking?
I want to learn more about web search and gain some new experience.

What do I know about Solr + Nutch:
As it is said on lucidimagination.com, Solr + Nutch does not scale if the
index is too large.
The article is somewhat old, and I don't know whether this problem has been
fixed by the new distributed abilities of Solr.

Furthermore, I don't want to index the pages with Nutch and reindex them with
Solr.
The only exception would be: if the content of a webpage gets indexed by
Nutch, I want to use the already tokenized content of the body with some
Solr copyField operations to extend the search (e.g. making fuzzy search
possible). At the moment, I don't think this is possible.

I don't know much about the Droids project and how well it is documented.
But from what I can read in some of Otis's posts, it seems to be usable as a
crawler framework.


Pros for Nutch: it is very scalable! Thanks to Hadoop and MapReduce it
is a scaling monster (from what I've read).

Cons: the search is not as rich as what is possible with Solr. Extending
Nutch's search abilities *seems* to be more complicated than with Solr.
Furthermore, if I want to use Solr to search Nutch's index, then given my
requirements I would need to reindex the whole thing, without the benefits
of Hadoop.

What I don't know at the moment is how to use algorithms like those
mentioned in II. with Solr.

I hope you understand the problem here: Solr does not *seem* to me to be the
best solution for a web search engine, because of scaling limits in
indexing.


Where should I dive deeper? 
Solr + Droids?
Solr + Nutch?
Nutch + howToExtendNutchToMakeSearchBetter?


Thanks for the discussion!
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p890640.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr DataConfig / DIH Question

2010-06-12 Thread Holmes, Charles V.
I'm putting together an entity.  A simplified version of the database schema is 
below.  There is a 1-[0,1] relationship between Person and Address with 
address_id being the nullable foreign key.  If it makes any difference, I'm 
using SQL Server 2005 on the backend.

Person [id (pk), name, address_id (fk)]
Address [id (pk), zipcode]

My data config looks like the one below.  This naturally fails when the 
address_id is null since the query ends up being "select * from user.address 
where id = ". 


  


I've worked around it by using a config like this one.  However, this makes the 
queries quite complex for some of my larger joins.


  
  


Is there a cleaner / better way of handling this type of relationship?  I've 
also tried to specify a default in the Solr schema, but that seems to only work 
after all the data is indexed which makes sense but surprised me initially.  
BTW, thanks for the great DIH tutorial on the wiki!

Thanks!
Charles


Re: MoreLikeThis and dynamicField

2010-06-12 Thread Lance Norskog
MLT needs term vectors. You may either store the term vectors while
indexing, or store the text so that the MLT handler can pull it and
analyze it. (Highlighting has the same problem.)

I don't know if you still need the stored text with term vectors in
the index, but storing neither will definitely not work.
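
The two options above can be sketched in schema.xml as follows. The field
name `text`, the `text` type, and the `*_t` copyField source are assumptions
based on the stock example schema; only one of the two definitions would be
active at a time:

```xml
<!-- Option A (assumed values): keep term vectors, skip stored text -->
<field name="text" type="text" indexed="true" stored="false"
       multiValued="true" termVectors="true"/>

<!-- Option B (assumed values): store the text so it can be re-analyzed
<field name="text" type="text" indexed="true" stored="true"
       multiValued="true"/>
-->

<!-- Copy every *_t dynamic field into the catch-all field -->
<copyField source="*_t" dest="text"/>
```

Option A is the case left open in the paragraph above, so it is worth
testing before relying on it.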

On Sat, Jun 12, 2010 at 1:37 AM, Peter Karich  wrote:
> Hi Lance,
>
>> You cannot give it a wildcard to go find dynamic fields;
>
> ah, ok. But it would be nice to use wildcards.
> You correctly guessed that I meant the wildcard querying. The
> dynamicField definition is not the problem ... So one can define
>  and
> then the query via myfield_t would work.
> The termVectors is only for performance reasons, I think (it worked
> without this attribute).
>
> Now, regarding "stored" or "not-stored": mlt worked for me if I did:
> 
>
> But not if I do
> 
>
> Or do I need to configure the field in a different way?
>
> Regards,
> Peter.
>
>> MoreLikeThis works on any field you can name. You cannot give it a
>> wildcard to go find dynamic fields; no query feature can do this
>> (would be handy!).
>>
>> The field must have term vectors configured. In the
>> solr/example/solr/conf schema, the 'includes' field has this set and
>> you can do MLT on that. Try 'usb' as a search term for the electronics
>> store example in solr/example/solr.
>>
>> I don't know if the field needs to be stored. Also, whether it can work on
>> terms in a multi-valued field. You'll have to reconfigure the 'text'
>> copyField target to have term vectors.
>>
>> On Fri, Jun 11, 2010 at 1:06 PM, Peter Karich  wrote:
>>
>>> Hi,
>>>
>>> it seems to me that the MoreLikeThis component doesn't work for dynamic
>>> fields. Is that correct?
>>> And it also doesn't work for fields which are indexed but not stored,
>>> right? e.g. 'text' where dynamic fields could be copied to.
>>>
>>> Or did I create an incorrect example?
>>>
>>> Regards,
>>> Peter.
>>>
>>> --
>>> http://karussell.wordpress.com/
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> http://karussell.wordpress.com/
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr DataConfig / DIH Question

2010-06-12 Thread Lance Norskog
This is a slow way to do this; databases are capable of doing this
join and feeding the results very efficiently.

The 'skipDoc' feature allows you to break out of the processing chain
after the first query. It is used in the wikipedia example.

http://wiki.apache.org/solr/DataImportHandler
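
One way to let the database do the join is a single entity with a LEFT JOIN,
so that rows with a null address_id still come through, just without a
zipcode. This is a sketch only: the entity name, the column list, and the
Solr field names are assumptions based on the simplified schema in the
question, and the dataSource element is omitted:

```xml
<dataConfig>
  <document>
    <!-- One query; the LEFT JOIN keeps persons without an address -->
    <entity name="person"
            query="select p.id, p.name, a.zipcode
                   from user.person p
                   left join user.address a on a.id = p.address_id">
      <!-- Solr field names below are assumed to match the column names -->
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="zipcode" name="zipcode"/>
    </entity>
  </document>
</dataConfig>
```

This avoids the per-row sub-query entirely, so no query is ever built with
an empty id.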

On Sat, Jun 12, 2010 at 6:37 PM, Holmes, Charles V.  wrote:
> I'm putting together an entity.  A simplified version of the database schema 
> is below.  There is a 1-[0,1] relationship between Person and Address with 
> address_id being the nullable foreign key.  If it makes any difference, I'm 
> using SQL Server 2005 on the backend.
>
> Person [id (pk), name, address_id (fk)]
> Address [id (pk), zipcode]
>
> My data config looks like the one below.  This naturally fails when the 
> address_id is null since the query ends up being "select * from user.address 
> where id = ".
>
>         Query="select * from user.person">
>            Query="select * from user.address where id = ${person.address_id}"
>  
> 
>
> I've worked around it by using a config like this one.  However, this makes 
> the queries quite complex for some of my larger joins.
>
>         Query="select * from user.person">
>            Query="select * from user.address where id = (select address_id from 
> user.person where id = ${person.id})">
>  
> 
>
> Is there a cleaner / better way of handling these type of relationships?  
> I've also tried to specify a default in the Solr schema, but that seems to 
> only work after all the data is indexed which makes sense but surprised me 
> initially.  BTW, thanks for the great DIH tutorial on the wiki!
>
> Thanks!
> Charles
>



-- 
Lance Norskog
goks...@gmail.com


how to patch?

2010-06-12 Thread Li Li
I want to use the fast highlighter in Solr 1.4 and found an issue at
https://issues.apache.org/jira/browse/SOLR-1268
  File Name                    Date Attached        Attached By      Size
  SOLR-1268.patch              2010-02-05 10:32 PM  Koji Sekiguchi    4 kB
  SOLR-1268-0_fragsize.patch   2010-02-04 10:43 PM  Koji Sekiguchi    2 kB
  SOLR-1268-0_fragsize.patch   2010-01-29 11:51 PM  Koji Sekiguchi    1 kB
  SOLR-1268.patch              2010-01-03 11:27 PM  Koji Sekiguchi   48 kB
  SOLR-1268.patch              2010-01-02 10:42 PM  Koji Sekiguchi   35 kB

I am not familiar with patches. There are 3 patches named SOLR-1268; which
one should I use? I guess I should use the newest one, but it's so small.
SOLR-1268-0_fragsize also has 2 files.


Re: Dynamic dataConfig files in DIH

2010-06-12 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Fri, Jun 11, 2010 at 11:13 PM, Chris Hostetter
wrote:

>
> : Is there a way to dynamically point which dataConfig file to use to
> : import using DIH without using the defaults hardcoded in solrconfig.xml?
>
> what do you mean by "dynamically" ? ... it's a query param, so you can
> specify the file name in the url when you issue the command.

No, it is not. It is not reloaded for every request. We should enhance DIH
to do so. But the whole data-config file can be sent as a request param, and
it works (this is used by the DIH debug mode).
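
For context, the hardcoded default in question usually lives in the DIH
handler registration in solrconfig.xml. A minimal sketch, where the handler
path and the file name data-config.xml are assumptions taken from the stock
examples rather than from this thread:

```xml
<!-- solrconfig.xml (sketch): handler path and file name are assumed -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- Read when the handler is initialized, not on every request -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

Registering a second handler under another path with a different config
file is one possible way to switch between files without sending the whole
config as a request param.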

>
> -Hoss
>
>


-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


Re: Solr DataConfig / DIH Question

2010-06-12 Thread Noble Paul നോബിള്‍ नोब्ळ्
This looks like a common problem. I guess DIH should handle this more
gracefully. Instead of firing a query and failing, it should not fire a query
if any of the values are missing. This can be made configurable if needed.

On Sun, Jun 13, 2010 at 9:14 AM, Lance Norskog  wrote:

> This is a slow way to do this; databases are capable of doing this
> join and feeding the results very efficiently.
>
> The 'skipDoc' feature allows you to break out of the processing chain
> after the first query. It is used in the wikipedia example.
>
> http://wiki.apache.org/solr/DataImportHandler
>
> On Sat, Jun 12, 2010 at 6:37 PM, Holmes, Charles V. 
> wrote:
> > I'm putting together an entity.  A simplified version of the database
> schema is below.  There is a 1-[0,1] relationship between Person and Address
> with address_id being the nullable foreign key.  If it makes any difference,
> I'm using SQL Server 2005 on the backend.
> >
> > Person [id (pk), name, address_id (fk)]
> > Address [id (pk), zipcode]
> >
> > My data config looks like the one below.  This naturally fails when the
> address_id is null since the query ends up being "select * from user.address
> where id = ".
> >
> >  >Query="select * from user.person">
> >   >  Query="select * from user.address where id =
> ${person.address_id}"
> >  
> > 
> >
> > I've worked around it by using a config like this one.  However, this
> makes the queries quite complex for some of my larger joins.
> >
> >  >Query="select * from user.person">
> >   >  Query="select * from user.address where id = (select address_id
> from user.person where id = ${person.id})">
> >  
> > 
> >
> > Is there a cleaner / better way of handling these type of relationships?
>  I've also tried to specify a default in the Solr schema, but that seems to
> only work after all the data is indexed which makes sense but surprised me
> initially.  BTW, thanks for the great DIH tutorial on the wiki!
> >
> > Thanks!
> > Charles
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com