Fwd: Using MLT Handler to find similar documents but also filter similar documents by a keyword.

2012-03-10 Thread Ravish Bhagdev
I will appreciate any comments or help on this. Thanks.

Rav

-- Forwarded message --
From: Ravish Bhagdev 
Date: Fri, Mar 2, 2012 at 12:12 AM
Subject: Using MLT Handler to find similar documents but also filter
similar documents by a keyword.
To: solr-user@lucene.apache.org


Hi,

Apologies if this has been answered before, I tried searching for it and
didn't find anything answering this exactly.

I want to find similar documents with the MLT Handler using some specified
fields, but I also want to filter the returned matches by some keywords as
well.

I looked at the example provided at
http://wiki.apache.org/solr/MoreLikeThisHandler :

/solr/mlt?q=id:SP2514N&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fq=inStock:true&mlt.interestingTerms=details

which specifies a filter query using fq to filter (something).

I understand that the first document returned as a result of the query
(q=id:SP2514N) is used for performing the matching, and that fq actually
affects this result rather than the matched documents returned by MLT.  Am I
right or wrong?

That is, is the "fq" in the above example going to filter the MLT match results
by the fq query, or will it just affect the initial query used to get the first
document to match against?  If the former, that is what I want to do, but is fq
the way to do it?  Can I use this fq on any kind of text/string field?
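For reference, the wiki example's request can be assembled like this (a sketch in Python; it assumes a local Solr at localhost:8983 — the point is simply that fq is a separate parameter from the seed query q):

```python
from urllib.parse import urlencode

# Parameters from the MoreLikeThisHandler wiki example: q selects the
# seed document, fq is the filter query whose scope the thread asks about.
params = {
    "q": "id:SP2514N",
    "mlt.fl": "manu,cat",
    "mlt.mindf": 1,
    "mlt.mintf": 1,
    "fq": "inStock:true",
    "mlt.interestingTerms": "details",
}
url = "http://localhost:8983/solr/mlt?" + urlencode(params)
print(url)
```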

I hope my question is making sense, it is a bit hard to explain so I am
sorry if not!

Thanks,
Ravish


Re: Xml representation of indexed document

2012-03-10 Thread Paul Libbrecht
Chamnap,

that'd be a view of the stored fields only (although Luke has some more to 
extract unstored fields).
In my search projects I have an indexer and that component (not DIH) can 
display an "indexed view" of a document.

maybe it helps.

paul


Le 10 mars 2012 à 08:57, Anupam Bhattacharya a écrit :

> You can use Luke to view Lucene Indexes.
> 
> Anupam
> 
> On Sat, Mar 10, 2012 at 12:27 PM, Chamnap Chhorn 
> wrote:
> 
>> Hi all,
>> 
>> I'm doing data import using DIH in solr 3.5. I'm curious to know whether it
>> is possible to see the xml representation of indexed data from the browser.
>> I just want to make sure this data is correctly indexed with correct values,
>> for debugging purposes.
>> 
>> --
>> Chamnap
>> 
> 
> 
> 
> -- 
> Thanks & Regards
> Anupam Bhattacharya



how to ignore indexing of duplicated documents?

2012-03-10 Thread nagarjuna
Hi all,

  I am new to solr... I would like to know how to avoid indexing
duplicate documents.
I have one table in the database, with one column that holds keywords which
are frequently repeated. When I tried to index it, it indexed all the terms in
the database.
I would like to skip indexing the duplicate terms.
Please help me.


thanks in advance
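As a sketch of one client-side option (a hypothetical helper — Solr also offers server-side deduplication via update processors), repeated keywords can be dropped before they are sent for indexing:

```python
def unique_keywords(keywords):
    # Keep the first occurrence of each keyword, comparing
    # case-insensitively so "Solr" and "solr" count as duplicates.
    seen = set()
    result = []
    for kw in keywords:
        key = kw.strip().lower()
        if key and key not in seen:
            seen.add(key)
            result.append(kw.strip())
    return result
```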


--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-ignore-indexing-of-duplicated-documents-tp3814858p3814858.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Xml representation of indexed document

2012-03-10 Thread Chamnap Chhorn
Thanks Anupam and Paul.

Yes, it can't display unstored fields. I can't find the way to extract
unstored fields in Luke. Any idea?
In your project, which indexer do you use? Previously, I wrote a ruby
script to index, but it took a lot of time. That's why I changed to DIH.


Chamnap


On Sat, Mar 10, 2012 at 4:41 PM, Paul Libbrecht  wrote:

> Chamnap,
>
> that'd be a view of the stored fields only (although Luke has some more to
> extract unstored fields).
> In my search projects I have an indexer and that component (not DIH) can
> display an "indexed view" of a document.
>
> maybe it helps.
>
> paul
>
>
> Le 10 mars 2012 à 08:57, Anupam Bhattacharya a écrit :
>
> > You can use Luke to view Lucene Indexes.
> >
> > Anupam
> >
> > On Sat, Mar 10, 2012 at 12:27 PM, Chamnap Chhorn <
> chamnapchh...@gmail.com>wrote:
> >
> >> Hi all,
> >>
> >> I'm doing data import using DIH in solr 3.5. I'm curious to know
> whether it
> >> is see the xml representation of indexed data from the browser. Is it
> >> possible?
> >> I just want to make sure these data is correctly indexed with correct
> value
> >> or for debugging purpose.
> >>
> >> --
> >> Chamnap
> >>
> >
> >
> >
> > --
> > Thanks & Regards
> > Anupam Bhattacharya
>
>


-- 
Chhorn Chamnap
http://chamnapchhorn.blogspot.com/


Accessing other entities from DIH

2012-03-10 Thread Chamnap Chhorn
Hi all,

I'm using DIH with Solr 3.5 to import data from MySQL. In my document, I have
some fields: name, category, text_spell, ...
text_spell is a multi-valued field combined from name and category
(category is a multi-valued field as well).


<entity name="listing" query="SELECT uuid, name from listings" pk="uuid">
  <entity name="category"
          query="SELECT `categories`.`name` FROM categories INNER JOIN
                 `listing_categories` ON `categories`.`uuid`=`listing_categories`.`category_uuid`
                 WHERE `listing_categories`.`listing_uuid`='${listing.uuid}'">
  </entity>
</entity>

In this case, I would use ScriptTransformer to produce a new array of
[name, category], but from the example in the solr
wiki,
it seems it can only access the current row in the current entity.
Is it possible to access other entities?

If that is not possible, how could I solve this problem? I know I could use a
UNION statement, but it duplicates the query and would degrade performance
as well. Any idea?

-- 
Chamnap


Re: Accessing other entities from DIH

2012-03-10 Thread Mikhail Khludnev
Hello,

First of all, you can get access to the context, from which the parent entity
fields can be obtained (following your link):

The semantics of execution is the same as that of a Java transformer. The
method can have two arguments, as in 'transformRow(Map<String, Object> row,
Context context)' in the abstract class 'Transformer'. As it is JavaScript,
the second argument may be omitted and it still works.

then,

generally it sounds like a copyfield
http://wiki.apache.org/solr/SchemaXml#Copy_Fields have you considered it?

On Sat, Mar 10, 2012 at 3:42 PM, Chamnap Chhorn wrote:

> Hi all,
>
> I'm using DIH solr 3.5 to import data from mysql. In my document, I have
> some fields: name, category, text_spell, ...
> text_spell is a multi-valued field which combines from name and category
> (category is a multi-value field as well).
>
> query="SELECT uuid, name from listings" pk="uuid">
> query="SELECT `categories`.`name` FROM categories INNER JOIN
> `listing_categories` ON
> `categories`.`uuid`=`listing_categories`.`category_uuid`) WHERE
> `listing_categories`.`listing_uuid`='${listing.uuid}'">
>
>   
> 
>
> In this case, I would use ScriptTransformer to produce a new array of
> [name, category], but the from the example in solr
> wiki,
> it seems it could only access the current row in the current entity.
> Is it possible to access other entities?
>
> If not possible, how could i solve this problem. I know I could use UNION
> statement, but it duplicates the query and it would degrade the performance
> as well. Any idea?
>
> --
> Chamnap
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics


 


Re: Xml representation of indexed document

2012-03-10 Thread Mikhail Khludnev
Hello,

DIH has a cute interactive ui with debug/verbose features. Have you checked
them?

On Sat, Mar 10, 2012 at 10:57 AM, Chamnap Chhorn wrote:

> Hi all,
>
> I'm doing data import using DIH in solr 3.5. I'm curious to know whether it
> is see the xml representation of indexed data from the browser. Is it
> possible?
> I just want to make sure these data is correctly indexed with correct value
> or for debugging purpose.
>
> --
> Chamnap
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics


 


Re: Xml representation of indexed document

2012-03-10 Thread Paul Libbrecht

I made my own indexed doc representation using JDOM then represented that 
web-based.

paul

Le 10 mars 2012 à 12:08, Chamnap Chhorn a écrit :

> Thanks Anupam and Paul.
> 
> Yes, it can't display unstored fields. I can't find the way to extract
> unstored fields in Luke. Any idea?
> In your project, which indexer do you use? Previously, I wrote a ruby
> script to index, but it took a lot of time. That's why I changed to DIH.
> 
> 
> Chamnap
> 
> 
> On Sat, Mar 10, 2012 at 4:41 PM, Paul Libbrecht  wrote:
> 
>> Chamnap,
>> 
>> that'd be a view of the stored fields only (although Luke has some more to
>> extract unstored fields).
>> In my search projects I have an indexer and that component (not DIH) can
>> display an "indexed view" of a document.
>> 
>> maybe it helps.
>> 
>> paul
>> 
>> 
>> Le 10 mars 2012 à 08:57, Anupam Bhattacharya a écrit :
>> 
>>> You can use Luke to view Lucene Indexes.
>>> 
>>> Anupam
>>> 
>>> On Sat, Mar 10, 2012 at 12:27 PM, Chamnap Chhorn <
>> chamnapchh...@gmail.com>wrote:
>>> 
 Hi all,
 
 I'm doing data import using DIH in solr 3.5. I'm curious to know
>> whether it
 is see the xml representation of indexed data from the browser. Is it
 possible?
 I just want to make sure these data is correctly indexed with correct
>> value
 or for debugging purpose.
 
 --
 Chamnap
 
>>> 
>>> 
>>> 
>>> --
>>> Thanks & Regards
>>> Anupam Bhattacharya
>> 
>> 
> 
> 
> -- 
> Chhorn Chamnap
> http://chamnapchhorn.blogspot.com/



Faster Solr Indexing

2012-03-10 Thread Peyman Faratin
Hi

I am trying to index 12MM docs faster than is currently happening in Solr 
(using solrj). We have identified solr's add method as the bottleneck (and not 
commit - which is tuned ok through mergeFactor and maxRamBufferSize and jvm 
ram). 

Adding 1000 docs is taking approximately 25 seconds. We are making sure we add 
and commit in batches. And we've tried both CommonsHttpSolrServer and 
EmbeddedSolrServer (assuming removing HTTP overhead would speed things up with 
embedding) but the difference is marginal. 

The docs being indexed are on average 20 fields long, mostly indexed but none 
stored. The major size contributors are two fields:

- content, and
- shingledContent (populated using copyField of content).

The length of the content field is (likely) gaussian distributed (few large 
docs 50-80K tokens, but majority around 2k tokens). We use shingledContent to 
support phrase queries and content for unigram queries (following the advice of 
Solr Enterprise search server advice - p. 305, section "The Solution: 
Shingling"). 

Clearly the size of the docs is a contributor to the slow adds (confirmed by 
removing these 2 fields resulting in halving the indexing time). We've tried 
compressed=true also but that is not working. 

Any guidance on how to support our application logic (without having to change 
the schema too much) and speed up indexing (from the current 212 days for 
12MM docs) would be much appreciated. 
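As a sketch of the batching pattern (the add_batch callable is illustrative — it stands in for a SolrJ server.add(collection) call), the idea is that each network round trip carries many documents instead of one:

```python
def index_in_batches(docs, add_batch, batch_size=1000):
    """Accumulate documents and send them in fixed-size batches."""
    batch = []
    sent = 0
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            add_batch(batch)
            sent += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        add_batch(batch)
        sent += len(batch)
    return sent
```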

thank you

Peyman 



Re: Accessing other entities from DIH

2012-03-10 Thread Chamnap Chhorn
Thanks Mikhail.

Yeah, in this case CopyField is better. I can combine multiple fields into
a new field, right? Something like this:

<copyField source="name" dest="text_spell"/>
<copyField source="category" dest="text_spell"/>

Anyway, I might need to access the child entity and parent entity. Can you
provide me some examples on how to use context? I'm not a java developer,
it's a little abstract to me in the solr wiki.
Or, could you give some links that explain this into more details?

Chamnap

On Sat, Mar 10, 2012 at 7:11 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello,
>
> First of all you can have an access to the context, where the parent entity
> fields can be obtained from (following your link):
>
> The semantics of execution is same as that of a java transformer. The
> method can have two arguments as in 'transformRow(Map ,
> Context context) in the abstract class 'Transformer' . As it is javascript
> the second argument may be omittted and it still works.
>
> then,
>
> generally it sounds like a copyfield
> http://wiki.apache.org/solr/SchemaXml#Copy_Fields have you considered it?
>
> On Sat, Mar 10, 2012 at 3:42 PM, Chamnap Chhorn  >wrote:
>
> > Hi all,
> >
> > I'm using DIH solr 3.5 to import data from mysql. In my document, I have
> > some fields: name, category, text_spell, ...
> > text_spell is a multi-valued field which combines from name and category
> > (category is a multi-value field as well).
> >
> >  >query="SELECT uuid, name from listings" pk="uuid">
> >>  query="SELECT `categories`.`name` FROM categories INNER JOIN
> > `listing_categories` ON
> > `categories`.`uuid`=`listing_categories`.`category_uuid`) WHERE
> > `listing_categories`.`listing_uuid`='${listing.uuid}'">
> >
> >   
> > 
> >
> > In this case, I would use ScriptTransformer to produce a new array of
> > [name, category], but the from the example in solr
> > wiki,
> > it seems it could only access the current row in the current entity.
> > Is it possible to access other entities?
> >
> > If not possible, how could i solve this problem. I know I could use UNION
> > statement, but it duplicates the query and it would degrade the
> performance
> > as well. Any idea?
> >
> > --
> > Chamnap
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Lucid Certified
> Apache Lucene/Solr Developer
> Grid Dynamics
>
> 
>  
>


Re: Accessing other entities from DIH

2012-03-10 Thread Mikhail Khludnev
Chamnap,

The Context approach is kind of an experimental, as-is thing, and the only way
to explore it is to use a debugger or be ready to debug the JavaScript
manually. It is not well documented.
The common approach is copyField.

With Best Wishes.

On Sat, Mar 10, 2012 at 8:24 PM, Chamnap Chhorn wrote:

> Thanks Mikhail.
>
> Yeah, in this case CopyField is better. I can combine multiple fields into
> a new field, right? Something like this:
> <copyField source="name" dest="text_spell"/>
> <copyField source="category" dest="text_spell"/>
> 
>
> Anyway, I might need to access the child entity and parent entity. Can you
> provide me some examples on how to use context? I'm not a java developer,
> it's a little abstract to me in the solr wiki.
> Or, could you give some links that explain this into more details?
>
> Chamnap
>
> On Sat, Mar 10, 2012 at 7:11 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
> > Hello,
> >
> > First of all you can have an access to the context, where the parent
> entity
> > fields can be obtained from (following your link):
> >
> > The semantics of execution is same as that of a java transformer. The
> > method can have two arguments as in 'transformRow(Map ,
> > Context context) in the abstract class 'Transformer' . As it is
> javascript
> > the second argument may be omittted and it still works.
> >
> > then,
> >
> > generally it sounds like a copyfield
> > http://wiki.apache.org/solr/SchemaXml#Copy_Fields have you considered
> it?
> >
> > On Sat, Mar 10, 2012 at 3:42 PM, Chamnap Chhorn  > >wrote:
> >
> > > Hi all,
> > >
> > > I'm using DIH solr 3.5 to import data from mysql. In my document, I
> have
> > > some fields: name, category, text_spell, ...
> > > text_spell is a multi-valued field which combines from name and
> category
> > > (category is a multi-value field as well).
> > >
> > >  > >query="SELECT uuid, name from listings" pk="uuid">
> > >> >  query="SELECT `categories`.`name` FROM categories INNER
> JOIN
> > > `listing_categories` ON
> > > `categories`.`uuid`=`listing_categories`.`category_uuid`) WHERE
> > > `listing_categories`.`listing_uuid`='${listing.uuid}'">
> > >
> > >   
> > > 
> > >
> > > In this case, I would use ScriptTransformer to produce a new array of
> > > [name, category], but the from the example in solr
> > > wiki,
> > > it seems it could only access the current row in the current entity.
> > > Is it possible to access other entities?
> > >
> > > If not possible, how could i solve this problem. I know I could use
> UNION
> > > statement, but it duplicates the query and it would degrade the
> > performance
> > > as well. Any idea?
> > >
> > > --
> > > Chamnap
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Lucid Certified
> > Apache Lucene/Solr Developer
> > Grid Dynamics
> >
> > 
> >  
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics


 


Vector based queries

2012-03-10 Thread Pat Ferrel
I have a case where I'd like to get documents which most closely match a 
particular vector. The RowSimilarityJob of Mahout is ideal for 
precalculating similarity between existing documents but in my case the 
query is constructed at run time. So the UI constructs a vector to be 
used as a query. We have this running in prototype using a run time 
calculation of cosine similarity but the implementation is not scalable 
to large doc stores.


One thought is to calculate fairly small clusters. The UI will know 
which cluster to target for the vector query. So we might be able to 
narrow down the number of docs per query to a reasonable size.


It seems like a place for multiple hash functions maybe? Could we use 
some kind of hack of the boost feature of Solr or some other approach?


Does anyone have a suggestion?
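For reference, the run-time calculation described above is essentially this brute-force sketch, which is O(N) per query over the whole doc store — the part that does not scale:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_matches(query_vec, docs, k=10):
    # Brute force over every document vector in the store.
    return sorted(docs, key=lambda d: cosine(query_vec, d), reverse=True)[:k]
```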


Re: Xml representation of indexed document

2012-03-10 Thread Chamnap Chhorn
Mikhail, the DIH interactive ui doesn't work for me because I can't see the
xml of the indexed documents. I need to see it to make sure I'm doing it right.

How do you make sure you're doing it right by using the DIH interactive ui?

On Sat, Mar 10, 2012 at 7:14 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello,
>
> DIH has a cute interactive ui with debug/verbose features. Have you checked
> them?
>
> On Sat, Mar 10, 2012 at 10:57 AM, Chamnap Chhorn  >wrote:
>
> > Hi all,
> >
> > I'm doing data import using DIH in solr 3.5. I'm curious to know whether
> it
> > is see the xml representation of indexed data from the browser. Is it
> > possible?
> > I just want to make sure these data is correctly indexed with correct
> value
> > or for debugging purpose.
> >
> > --
> > Chamnap
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Lucid Certified
> Apache Lucene/Solr Developer
> Grid Dynamics
>
> 
>  
>



-- 
Chhorn Chamnap
http://chamnapchhorn.blogspot.com/


How to index a single big file?

2012-03-10 Thread neosky
Hello, I have a great challenge here. I have a big file (1.2 GB) with more than
200 million records that need to be indexed.  It might be a file of more than
9 GB with more than 1000 million records later.
One record contains 3 fields. I am quite new to solr and lucene, so I
have some questions:
1. It seems that solr only works with xml files, so must I transform the
text file into xml?
2. Even if I transform the file into the xml format, can solr deal with
such a big file?
So, I have some ideas here. Maybe I should split the big file first. 
1. One option is to split one record into one file, but it seems that this will
produce millions of files, which are still hard to store and index.
2. Another option is to split the file into smaller files of about 10 MB.
But it seems that it is also difficult to split based on file size without
messing up the record format.
Do you guys have any experience indexing this kind of big file?  Any ideas
or suggestions are helpful.
Thanks in advance!

attached one record sample
original raw data:
>A0B531 A0B531_METTP^|^^|^Putative uncharacterized
protein^|^^|^^|^Methanosaeta thermophila PT^|^349307^|^Arch/Euryar^|^28890
MLFALALSLLILTSGSRSIELNNATVIDLAEGKAVIEQPVSGKIFNITAIARIENISVIH
NSHTARCSVEESFWRGVYRYRITADSPVSGILRYEAPLRGQQFISPIVLNGTVVVAIPEG
YTTGARALGIPRPEPYEIFHENRTVVVWRLERESIVEVGFYRNDAPQILGYFFVLLLAAG
IFLAAGYYSSIKKLEAMRRGLK

I plan to format it like this:

>A0B531


A0B531_METTP^|^^|^Putative uncharacterized protein^|^^|^^|^Methanosaeta
thermophila PT^|^349307^|^Arch/Euryar^|^28890


MLFALALSLLILTSGSRSIELNNATVIDLAEGKAVIEQPVSGKIFNITAIARIENISVIH
NSHTARCSVEESFWRGVYRYRITADSPVSGILRYEAPLRGQQFISPIVLNGTVVVAIPEG
YTTGARALGIPRPEPYEIFHENRTVVVWRLERESIVEVGFYRNDAPQILGYFFVLLLAAG
IFLAAGYYSSIKKLEAMRRGLK
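One way to split a '>'-delimited file like the sample above is by record count rather than byte size, so that no record is ever cut in half (a sketch, assuming every record starts with a '>' header line):

```python
def split_records(lines, records_per_chunk):
    # Group records into chunks; a new chunk starts only at a '>' header,
    # so record boundaries are never broken by the split.
    chunks, current, n = [], [], 0
    for line in lines:
        if line.startswith(">"):
            if n and n % records_per_chunk == 0:
                chunks.append(current)
                current = []
            n += 1
        current.append(line)
    if current:
        chunks.append(current)
    return chunks
```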






--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-index-a-single-big-file-tp3815540p3815540.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to index a single big file?

2012-03-10 Thread Grant Ingersoll

On Mar 10, 2012, at 1:52 PM, neosky wrote:

> Hello, I have a great challenge here. I have a big file(1.2G) with more than
> 200 million records need to index.  It might more than 9 G file with more
> than 1000 million record later.
> One record contains 3 fields. I am quite newer for solr and lucene, so I
> have some questions:
> 1. It seems that solr only works with the xml file, so I must transform the
> text file into xml?

There are other formats supported, including just using the SolrJ client and 
some of your own code that loops through the files.   I wouldn't bother 
converting to XML; just have your SolrJ program take in a record, convert it 
to a SolrInputDocument, and then send it in.


> 2. Even I transform the file into the xml format, can the solr deal with
> this big file?
> So, I have some ideas here.Maybe I should split the big file first. 
> 1. One option is I split one record into one file, but it seems that it will
> produce million files and it still hard to store and index.
> 2. Another option is that I split the file into some smaller file about 10M.
> But it seems that it is also difficult to split based on file size that
> doesn't mess up the format.
> Do you guys have any experience on indexing this kind of big file?  Any idea
> or suggestion are helpful.

I would likely split into some subset of smaller files (I would guess in the 
range of 10-30M recs per file) and then process those files in parallel 
(multithreaded) using SolrJ and sending in batches of documents at once or 
using the StreamingUpdateSolrServer. 

There are lots of good tutorials on using SolrJ available.
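The approach Grant describes — smaller files processed in parallel, each worker sending batches — can be sketched like this (send_batch stands in for the SolrJ/HTTP call; all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def index_files_in_parallel(paths, send_batch, workers=4, batch_size=500):
    # Each worker thread reads one file and ships its records in batches.
    def index_file(path):
        batch, total = [], 0
        with open(path) as f:
            for line in f:
                batch.append(line.rstrip("\n"))
                if len(batch) >= batch_size:
                    send_batch(batch)
                    total += len(batch)
                    batch = []
        if batch:  # flush the final partial batch for this file
            send_batch(batch)
            total += len(batch)
        return total
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(index_file, paths))
```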



Grant Ingersoll
http://www.lucidimagination.com





3 Way Solr Join . . ?

2012-03-10 Thread Angelina Bola
Does "Solr" support a 3-way join? i.e.
http://wiki.apache.org/solr/Join (I have the 2-way join working)

For example, I am pulling 3 different tables from a RDBMS into one Solr core:

   Table#1: Customers (parent table)
   Table#2: Addresses  (child table with foreign key to customers)
   Table#3: Phones (child table with foreign key to customers)

with a ONE to MANY relationship between:

Customers and Addresses
Customers and Phones

When I pull them into Solr I cannot denormalize the relationships as a
given customers can have many addresses and many phones.

When they come into the my single core (customerInfo), each document
gets a customerInfo_type and a uid corresponding to that type, for
example:

Customer Document
customerInfo_type='customer'
customer_id

Address Document
customerInfo_type='address' 
fk_address_customer_id

Phone Document
customerInfo_type='phone'   
fk_phone_customer_id

Logically, I need to query in Solr for Customers who:

- Have an address in a given state
- Have a phone in a given area code
- Are a given gender

Syntactically, I would think it would look like:

 - http://localhost:8983/solr/customerInfo/select/?
 q={!join from=fk_address_customer_id to=customer_id}address_State:Maine&
 fq={!join from=customer_id to=fk_phone_customer_id}phone_area_code:212&
 fq=customer_gender:female

But that does not work for me.

Appreciate any thoughts,

Angelyna


Re: 3 Way Solr Join . . ?

2012-03-10 Thread Walter Underwood
Fields can be multi-valued. Put multiple phone numbers in a field and match all 
of them. 
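The denormalized shape Walter suggests — one document per customer, with the child rows folded into multiValued fields — looks roughly like this (field names are illustrative):

```python
# One Solr document per customer; addresses and phones become
# multiValued fields instead of separate join documents.
customer_doc = {
    "customer_id": "c-100",
    "customer_gender": "female",
    "address_state": ["Maine", "Vermont"],   # multiValued
    "phone_area_code": ["212", "207"],       # multiValued
}

def matches(doc, state, area_code, gender):
    # The three-way join collapses into three plain field filters.
    return (state in doc["address_state"]
            and area_code in doc["phone_area_code"]
            and doc["customer_gender"] == gender)
```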

wunder

On Mar 10, 2012, at 4:58 PM, Angelina Bola wrote:

> Does "Solr" support a 3-way join? i.e.
> http://wiki.apache.org/solr/Join (I have the 2-way join working)
> 
> For example, I am pulling 3 different tables from a RDBMS into one Solr core:
> 
>   Table#1: Customers (parent table)
>   Table#2: Addresses  (child table with foreign key to customers)
>   Table#3: Phones (child table with foreign key to customers)
> 
> with a ONE to MANY relationship between:
> 
>   Customers and Addresses
>   Customers and Phones
> 
> When I pull them into Solr I cannot denormalize the relationships as a
> given customers can have many addresses and many phones.
> 
> When they come into the my single core (customerInfo), each document
> gets a customerInfo_type and a uid corresponding to that type, for
> example:
> 
>   Customer Document
>   customerInfo_type='customer'
>   customer_id
> 
>   Address Document
>   customerInfo_type='address' 
>   fk_address_customer_id
> 
>   Phone Document
>   customerInfo_type='phone'   
>   fk_phone_customer_id
> 
> Logically, I need to query in Solr for Customers who:
> 
>   - Have an address in a given state
>   - Have a phone in a given area code
>   - Are a given gender
> 
> Syntactically, it would think it would look like:
> 
> - http://localhost:8983/solr/customerInfo/select/?
> q={!join from=fk_address_customer_id to=customer_id}address_State:Maine&
> fq={!join from=customer_id to=fk_phone_customer_id}phone_area_code:212&
> fq=customer_gender:female
> 
> But that does not work for me.
> 
> Appreciate any thoughts,
> 
> Angelyna

--
Walter Underwood
wun...@wunderwood.org





Re: Stemmer Question

2012-03-10 Thread Jamie Johnson
Barring the horrible name I am wondering if folks would be interested
in having something like this as an alternative to the standard
kstemmer.  This is largely based on the SynonymFilter except it builds
tokens using the kstemmer and the original input.  I've created a JIRA
for this to start discussion.  I'd be really interested in
comments/thoughts on this.

https://issues.apache.org/jira/browse/SOLR-3231
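The behaviour under discussion — emit both the original token and the stemmed token at the same position — can be sketched like this (with a toy stemmer standing in for KStem; this is an illustration, not the JIRA patch itself):

```python
def toy_stem(token):
    # Toy stand-in for KStem: just strip a trailing 's'.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def analyze_preserving_original(tokens):
    # Yield (position, token) pairs; the stemmed variant shares the
    # position of the original, like preserveOriginal in the
    # WordDelimiterFilter.
    out = []
    for pos, tok in enumerate(tokens):
        out.append((pos, tok))
        stemmed = toy_stem(tok)
        if stemmed != tok:
            out.append((pos, stemmed))
    return out
```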


On Fri, Mar 9, 2012 at 4:04 PM, Jamie Johnson  wrote:
> So I've thrown something together fairly quickly which is based on
> what Ahmet had sent that I believe will preserve the original token as
> well as the stemmed version.  I didn't go as far as weighting them
> differently using the payloads however.  I am not sure how to use the
> preserveOriginal attribute from WordDelimeterFilterFactory, can anyone
> provide guidance on that?
>
> On Fri, Mar 9, 2012 at 2:53 PM, Jamie Johnson  wrote:
>> Further digging leads me to believe this is not the case.  The Synonym
>> Filter supports this, but the Stemming Filter does not.
>>
>> Ahmet,
>>
>> Would you be willing to provide your filter as well?  I wonder if we
>> can make it aware of the preserveOriginal attribute on
>> WordDelimterFilterFactory?
>>
>>
>> On Fri, Mar 9, 2012 at 2:27 PM, Jamie Johnson  wrote:
>>> Ok, so I'm digging through the code and I noticed in
>>> org.apache.lucene.analysis.synonym.SynonymFilter there are mentions of
>>> a keepOrig attribute.  Doing some googling led me to
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters which
>>> speaks of an attribute preserveOriginal="1" on
>>> solr.WordDelimiterFilterFactory.  So it seems like I can get the
>>> functionality I am looking for by setting preserveOriginal, is that
>>> correct?
>>>
>>>
>>> On Fri, Mar 9, 2012 at 9:53 AM, Ahmet Arslan  wrote:
> I'd be very interested to see how you
> did this if it is available. Does
> this seem like something useful to the community at large?

 I PMed it to you. Filter is not a big deal. Just modified from {@link 
 org.apache.lucene.wordnet.SynonymTokenFilter}. If requested,  I can 
 provide it publicly too.


More explanation on row in DIH

2012-03-10 Thread Chamnap Chhorn
Hi all,

Can anyone please help explain to me how a row works in DIH?
Let's say a listing can have multiple keyphrase_assets. A keyphrase_asset
is a comma-separated value ("hotel,bank,..."). I need to index it and split by
comma into a multi-valued keyphrase field.

function fKeyphrasePosition(row) {
}


So, will my fKeyphrasePosition function be executed for every row of
keyphrase_asset, or will it execute only one time?
Because keyphrase is a dynamic field in schema.xml ("keyphrase_0",
"keyphrase_1", ...), I need to access the current index as well. I know I
can write a subquery to return the row number of each keyphrase_asset. Is that
the only way to go?
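As a sketch of what the transformer has to do per row (Python for illustration — in DIH this would be the JavaScript function, or possibly the built-in RegexTransformer with splitBy):

```python
def transform_row(row):
    # Split the comma-separated keyphrase_asset column into a list,
    # which DIH maps onto a multiValued field.
    raw = row.get("keyphrase_asset", "")
    row["keyphrase"] = [p.strip() for p in raw.split(",") if p.strip()]
    return row
```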

It's really difficult for me to debug my javascript code. I tried to use
LogTransformer, but couldn't see where it outputs to.

-- 
Chamnap


Re: Vector based queries

2012-03-10 Thread Lance Norskog
Look at the MoreLikeThis feature in Lucene. I believe it does roughly
what you describe.

On Sat, Mar 10, 2012 at 9:58 AM, Pat Ferrel  wrote:
> I have a case where I'd like to get documents which most closely match a
> particular vector. The RowSimilarityJob of Mahout is ideal for
> precalculating similarity between existing documents but in my case the
> query is constructed at run time. So the UI constructs a vector to be used
> as a query. We have this running in prototype using a run time calculation
> of cosine similarity but the implementation is not scalable to large doc
> stores.
>
> One thought is to calculate fairly small clusters. The UI will know which
> cluster to target for the vector query. So we might be able to narrow down
> the number of docs per query to a reasonable size.
>
> It seems like a place for multiple hash functions maybe? Could we use some
> kind of hack of the boost feature of Solr or some other approach?
>
> Does anyone have a suggestion?



-- 
Lance Norskog
goks...@gmail.com


Re: 3 Way Solr Join . . ?

2012-03-10 Thread William Bell
Yeah I am a bit afraid when people want to use the join() feature. To
get good performance you really need to try to stick to the
recommendation of denormalizing your database into multiValued search
fields.

You can also use external fields, or store formatted info into a
String field in json or xml format.


On Sat, Mar 10, 2012 at 6:22 PM, Walter Underwood  wrote:
> Fields can be multi-valued. Put multiple phone numbers in a field and match 
> all of them.
>
> wunder
>
> On Mar 10, 2012, at 4:58 PM, Angelina Bola wrote:
>
>> Does "Solr" support a 3-way join? i.e.
>> http://wiki.apache.org/solr/Join (I have the 2-way join working)
>>
>> For example, I am pulling 3 different tables from a RDBMS into one Solr core:
>>
>>   Table#1: Customers     (parent table)
>>   Table#2: Addresses  (child table with foreign key to customers)
>>   Table#3: Phones     (child table with foreign key to customers)
>>
>> with a ONE to MANY relationship between:
>>
>>       Customers and Addresses
>>       Customers and Phones
>>
>> When I pull them into Solr I cannot denormalize the relationships as a
>> given customers can have many addresses and many phones.
>>
>> When they come into the my single core (customerInfo), each document
>> gets a customerInfo_type and a uid corresponding to that type, for
>> example:
>>
>>       Customer Document
>>               customerInfo_type='customer'
>>               customer_id
>>
>>       Address Document
>>               customerInfo_type='address'
>>               fk_address_customer_id
>>
>>       Phone Document
>>               customerInfo_type='phone'
>>               fk_phone_customer_id
>>
>> Logically, I need to query in Solr for Customers who:
>>
>>       - Have an address in a given state
>>       - Have a phone in a given area code
>>       - Are a given gender
>>
>> Syntactically, it would think it would look like:
>>
>> - http://localhost:8983/solr/customerInfo/select/?
>>     q={!join from=fk_address_customer_id to=customer_id}address_State:Maine&
>>     fq={!join from=customer_id to=fk_phone_customer_id}phone_area_code:212&
>>     fq=customer_gender:female
>>
>> But that does not work for me.
>>
>> Appreciate any thoughts,
>>
>> Angelyna
>
> --
> Walter Underwood
> wun...@wunderwood.org
>
>
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: does solr have a mechanism for intercepting requests - before they are handed off to a request handler

2012-03-10 Thread William Bell
Why not wrap the call into a service and then call the right handler?

On Fri, Mar 9, 2012 at 10:11 AM, geeky2  wrote:
> hello all,
>
> does solr have a mechanism that could intercept a request (before it is
> handed off to a request handler).
>
> the intent (from the business) is to send in a generic request - then
> pre-parse the url and send it off to a specific request handler.
>
> thank you,
> mark
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/does-solr-have-a-mechanism-for-intercepting-requests-before-they-are-handed-off-to-a-request-handler-tp3813255p3813255.html
> Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Knowing which fields matched a search

2012-03-10 Thread William Bell
debugQuery tells you.

On Fri, Mar 9, 2012 at 1:05 PM, Russell Black  wrote:
> When searching across multiple fields, is there a way to identify which 
> field(s) resulted in a match without using highlighting or stored fields?



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Lucene vs Solr design decision

2012-03-10 Thread William Bell
Great answer Robert.

On Fri, Mar 9, 2012 at 12:06 PM, Robert Stewart  wrote:
> Split up index into say 100 cores, and then route each search to a specific 
> core by some mod operator on the user id:
>
> core_number = userid % num_cores
>
> core_name = "core"+core_number
>
> That way each index core is relatively small (maybe 100 million docs or less).
>
>
> On Mar 9, 2012, at 2:02 PM, Glen Newton wrote:
>
>> millions of cores will not work...
>> ...yet.
>>
>> -glen
>>
>> On Fri, Mar 9, 2012 at 1:46 PM, Lan  wrote:
>>> Solr has no limitation on the number of cores. It's limited by your 
>>> hardware,
>>> inodes and how many files you could keep open.
>>>
>>> I think even if you went the Lucene route you would run into same hardware
>>> limits.
>>>
>>> --
>>> View this message in context: 
>>> http://lucene.472066.n3.nabble.com/Lucene-vs-Solr-design-decision-tp3813457p3813511.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>>
>> --
>> -
>> http://zzzoot.blogspot.com/
>> -
>
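The routing Robert describes amounts to this (a sketch):

```python
def core_for_user(userid, num_cores=100):
    # Route every search/index request for a user to one fixed core,
    # keeping each core relatively small.
    core_number = userid % num_cores
    return "core" + str(core_number)
```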



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076