Re: Which is a good XPath generator?

2010-07-25 Thread Geert-Jan Brits
I am assuming (like Li, I think) that you want to induce a structure/schema
from an HTML example, so you can use that schema to extract data from
similarly structured HTML pages.

Another term often used in the literature for this is "wrapper induction".
Besides the DOM, CSS classes often give a good distinction, and they are
often more stable under small redesigns.

Besides Li's suggestions, have a look at this thread for an open-source
Python implementation (I have never tested it):
http://www.holovaty.com/writing/templatemaker/
Also make sure to read all the comments for links to other products, etc.
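As a rough illustration of the "pick an element, get an XPath" idea, here is a sketch that builds an absolute XPath with positional predicates from a DOM node. The class name and approach are my own, not from any of the linked tools, and it only handles well-formed XHTML; real-world tag soup needs an HTML parser first:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class XPathGenerator {
    /** Builds an absolute XPath (with positional predicates) for the given element. */
    static String xpathFor(Node node) {
        if (node == null || node.getNodeType() == Node.DOCUMENT_NODE) return "";
        // 1-based position among same-named element siblings
        int pos = 1;
        for (Node sib = node.getPreviousSibling(); sib != null; sib = sib.getPreviousSibling()) {
            if (sib.getNodeType() == Node.ELEMENT_NODE
                    && sib.getNodeName().equals(node.getNodeName())) pos++;
        }
        return xpathFor(node.getParentNode()) + "/" + node.getNodeName() + "[" + pos + "]";
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><div id='a'/><div id='b'><span>x</span></div></body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
        Node span = doc.getElementsByTagName("span").item(0);
        System.out.println(xpathFor(span)); // /html[1]/body[1]/div[2]/span[1]
    }
}
```

A wrapper-induction tool would then generalize such paths across example pages (e.g. dropping the positional predicates that vary), which is where the CSS-class hints come in.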

HTH,
Geert-Jan



2010/7/25 Li Li 

> it's not a topic related to Solr. Maybe you should read some papers
> about wrapper generation or automatic web data extraction. If you
> want to generate XPath, you could read Bing Liu's papers, such as
> "Structured Data Extraction from the Web Based on Partial Tree
> Alignment". Besides the DOM tree, visual clues may also be used. But
> none of them will be a perfect solution, because of the diversity of
> web pages.
>
> 2010/7/25 Savannah Beckett :
> > Hi,
> >   I am looking for an XPath generator that can generate an XPath by
> > picking a specific tag inside an HTML page. Do you know a good XPath
> > generator? If possible, a free one would be great.
> > Thanks.
> >
> >
> >
>


Solr 4.0 and lucene-analyzers

2010-07-25 Thread Pavel Minchenkov
Hi,
If I generate the Solr Maven artifacts from trunk, they have a dependency
on lucene-analyzers:4.0-dev, which can't be resolved.
Maybe I'm doing something wrong?

Thanks.


-- 
Pavel Minchenkov


Re: Novice seeking help to change filters to search without diacritics

2010-07-25 Thread Erick Erickson
Use copyField in your schema file. The copy goes to a separate field whose
type has its own analyzer, so one field can fold diacritics while the
other keeps them.
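A minimal schema.xml sketch of that setup (the field and type names are my own illustration):

```xml
<!-- folded variant: strips diacritics at index and query time -->
<fieldType name="text_folded" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- original keeps diacritics; the copy folds them -->
<field name="title"        type="text"        indexed="true" stored="true"/>
<field name="title_folded" type="text_folded" indexed="true" stored="false"/>
<copyField source="title" dest="title_folded"/>
```

With dismax you can then search both fields at once, e.g. qf=title^2 title_folded, giving the exact (diacritic) field the boost Steve suggested.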

dismax might help you at query time on this...

HTH
Erick

On Sat, Jul 24, 2010 at 11:40 PM, HSingh  wrote:

>
>
> : Usually people set up two fields, one with diacritics and one without.
> : Then searches are against both fields.  If you think a match against the
> field
> : with diacritics is more valuable, you can give that field a boost.
>
> Hi Steve, where can one setup these two fields?  Thank you for your kind
> assistance!
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Novice-seeking-help-to-change-filters-to-search-without-diacritics-tp971263p993150.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: filter query on timestamp slowing query???

2010-07-25 Thread oferiko


britske wrote:
> 
> just wanted to mention a possible other route, which might be entirely
> hypothetical :-)
> 
> *If* you could query on the internal docid (I'm not sure it's
> available out of the box, or whether you can at all), your original
> problem, quoted below, could IMO be simplified to asking for the last
> docid inserted (that matches the other criteria from your use case)
> and, in the next call, filtering from that docid forward.
> 
that sounds great, is there really a way to do that? 


britske wrote:
> 
>> Every 30 minutes, I ask the index for the documents that were added to
>> it since the last time I queried it and that match certain criteria.
>> From time to time, once a week or so, I ask the index for ALL the
>> documents that match those criteria. (I also do this not for just one
>> query, but several.)
>> This is why I need the timestamp filter.
> 
> Again, I'm not entirely sure that querying / filtering on internal
> docids is possible (perhaps someone can comment), but if it is, it
> would perhaps perform better.
> Big IF, I know.
> 
> Geert-Jan
> 
> 2010/7/23 Chris Hostetter 
> 
>> : On top of using trie dates, you might consider separating the timestamp
>> : portion and the type portion of the fq into seperate fq parameters --
>> : that will allow them to to be stored in the filter cache seperately. So
>> : for instance, if you include "type:x OR type:y" in queries a lot, but
>> : with different date ranges, then when you make a new query, the set for
>> : "type:x OR type:y" can be pulled from the filter cache and intersected
>>
>> definitely ... that's the one big thing that jumped out at me once you
>> showed us *how* you were constructing these queries.
>>
>>
>>
>> -Hoss
>>
>>
> 
> 
That's also something I'll integrate into my testing environment,
thanks
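Spelled out as request parameters, the fq split Hoss describes looks like this (field names taken from the quoted discussion):

```
# one combined filter: cached as a single entry, rarely reusable
fq=(type:x OR type:y) AND timestamp:[2010-07-23T00:00:00Z TO *]

# split filters: each cached independently, so the type clause is
# reused across queries even when the date range changes
fq=type:x OR type:y
fq=timestamp:[2010-07-23T00:00:00Z TO *]
```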
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/filter-query-on-timestamp-slowing-query-tp977280p994679.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: filter query on timestamp slowing query???

2010-07-25 Thread Jonathan Rochkind
britske wrote:
>> *If* you could query on the internal docid (I'm not sure it's
>> available out of the box, or whether you can at all), your original
>> problem, quoted below, could IMO be simplified to asking for the last
>> docid inserted (that matches the other criteria from your use case)
>> and, in the next call, filtering from that docid forward.
>
>that sounds great, is there really a way to do that?

I don't know about internal docids, but there's no reason you can't use
the same technique with timestamps, if you want to take the two-query,
remember-30-minutes-ago's-last-doc approach.

Query for the latest timestamp by sorting by timestamp descending with
rows=1; the one row you get back has the greatest timestamp.

30 minutes later, query with fq=timestamp:{that_one_we_remembered TO *}.
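As request parameters, the two queries might look like this (assuming a stored, sortable field named timestamp; the example date is a placeholder):

```
# query 1: remember the newest timestamp
q=*:*&sort=timestamp desc&rows=1&fl=timestamp

# query 2, 30 minutes later: only documents newer than the remembered value
q=<your criteria>&fq=timestamp:{2010-07-25T12:00:00Z TO *}
```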

Would this be any slower with timestamps than with docids?  I don't think so, 
but one way to find out. 

Also, with any sorting, you probably want to include a warming query that
sorts on the field you are going to sort on. I haven't figured out yet
whether a warming query that sorts on a field will also help speed up
later range queries (rather than just later sorts) on that field, but
I'm thinking it might.

Jonathan

how to Protect data

2010-07-25 Thread Girish Pandit

Hi,

I was asked about protecting data: the search index is stored in index
files, and when you open those files you can clearly see the data as raw
text, e.g. names, addresses, postal codes, etc.

Is there any way I can hide the data? Some kind of encoding or
encryption, so that no raw text is visible at all?


-Girish



Re: a bug of solr distributed search

2010-07-25 Thread Li Li
Where is the link to this patch?

2010/7/24 Yonik Seeley :
> On Fri, Jul 23, 2010 at 2:23 PM, MitchK  wrote:
>> why do we not send the output of TermsComponent from every node in the
>> cluster to a Hadoop instance?
>> Since TermsComponent does the map part of the map-reduce concept, Hadoop
>> only needs to do the reduce. Maybe we do not even need Hadoop for this.
>> After reducing, every node in the cluster gets the current values to compute
>> the IDF.
>> We can store this information in a HashMap-based SolrCache (or something
>> like that) to provide constant-time access. To keep the values up to date,
>> we can repeat that every x minutes.
>
> There's already a patch in JIRA that does distributed IDF.
> Hadoop wouldn't be the right tool for that anyway... it's for batch
> oriented systems, not low-latency queries.
>
>> If we got that, it would not matter whether we use doc_X from shard_A or
>> shard_B, since they would all have the same scores.
>
> That only works if the docs are exactly the same - they may not be.
>
> -Yonik
> http://www.lucidimagination.com
>


Re: a bug of solr distributed search

2010-07-25 Thread Li Li
The Solr version I used is 1.4.

2010/7/26 Li Li :
> Where is the link to this patch?
>
> 2010/7/24 Yonik Seeley :
>> On Fri, Jul 23, 2010 at 2:23 PM, MitchK  wrote:
>>> why do we not send the output of TermsComponent from every node in the
>>> cluster to a Hadoop instance?
>>> Since TermsComponent does the map part of the map-reduce concept, Hadoop
>>> only needs to do the reduce. Maybe we do not even need Hadoop for this.
>>> After reducing, every node in the cluster gets the current values to compute
>>> the IDF.
>>> We can store this information in a HashMap-based SolrCache (or something
>>> like that) to provide constant-time access. To keep the values up to date,
>>> we can repeat that every x minutes.
>>
>> There's already a patch in JIRA that does distributed IDF.
>> Hadoop wouldn't be the right tool for that anyway... it's for batch
>> oriented systems, not low-latency queries.
>>
>>> If we got that, it would not matter whether we use doc_X from shard_A or
>>> shard_B, since they would all have the same scores.
>>
>> That only works if the docs are exactly the same - they may not be.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>


"SELECT" on a Rich Document to download/display content

2010-07-25 Thread Girish Pandit

Hi,

I indexed a Word document; when I do a select, it shows the file name.
How can I display the content? Also, if I add "hl=true", will this show
me the matching line, highlighted, from the Word document?


I am using the URL below to do the select:

http://localhost:8983/solr/select/?q=Management

It shows a response like below:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params"><str name="q">Management</str></lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc><str name="id">Mgmt.doc</str></doc>
  </result>
</response>



Indexing was done with the Java code below:

import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

    public void SolrCellRequestDemo() throws IOException, SolrServerException {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("/Users/Girish/Development/Web Server/apache-solr-1.4.1/example/exampledocs/Mgmt.doc"));
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        req.setParam("literal.id", "Mgmt.doc");
        NamedList<Object> result = server.request(req);
        System.out.println("Result: " + result);
    }
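For what it's worth, a sketch of a select that would return and highlight the extracted text, assuming the body was stored in a field named content (that field name is an assumption; where the extract handler puts the text depends on its configuration and whether that field is stored):

```
http://localhost:8983/solr/select/?q=Management&fl=id,content&hl=true&hl.fl=content
```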




Re: a bug of solr distributed search

2010-07-25 Thread MitchK

Good morning,

https://issues.apache.org/jira/browse/SOLR-1632

- Mitch


Li Li wrote:
> 
> Where is the link to this patch?
> 
> 2010/7/24 Yonik Seeley :
>> On Fri, Jul 23, 2010 at 2:23 PM, MitchK  wrote:
>>> why do we not send the output of TermsComponent from every node in the
>>> cluster to a Hadoop instance?
>>> Since TermsComponent does the map part of the map-reduce concept, Hadoop
>>> only needs to do the reduce. Maybe we do not even need Hadoop for this.
>>> After reducing, every node in the cluster gets the current values to compute
>>> the IDF.
>>> We can store this information in a HashMap-based SolrCache (or something
>>> like that) to provide constant-time access. To keep the values up to date,
>>> we can repeat that every x minutes.
>>
>> There's already a patch in JIRA that does distributed IDF.
>> Hadoop wouldn't be the right tool for that anyway... it's for batch
>> oriented systems, not low-latency queries.
>>
>>> If we got that, it would not matter whether we use doc_X from shard_A or
>>> shard_B, since they would all have the same scores.
>>
>> That only works if the docs are exactly the same - they may not be.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
> 
> 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p995407.html
Sent from the Solr - User mailing list archive at Nabble.com.


question about relevance

2010-07-25 Thread Bharat Jain
Hello All,

I have an index which stores multiple objects belonging to a user, e.g.:

  one field identifies the user
  objType gives the object type, e.g. userBasic or userAdv:
      userBasic -> maps to userBasicInfoObject
      userAdv   -> maps to userAdvInfoObject

Now when I run a query, I get multiple records (mapping to Java objects
identified by objType) that belong to the same user.

I want to show the most relevant users at the top of the list. I am
thinking of adding up the Lucene scores of the different result
documents for each user to rank them. Is this a correct approach for
computing the relevance of a user?
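As a rough sketch of that aggregation step (plain Java, outside Solr; it assumes you have already collected (userId, score) pairs from the query response, and the class and method names are mine):

```java
import java.util.*;

public class UserScoreAggregator {
    /** Sums per-document scores by user id and returns users sorted by total score, descending. */
    static List<Map.Entry<String, Double>> rankUsers(List<Map.Entry<String, Double>> docs) {
        Map<String, Double> totals = new HashMap<>();
        for (Map.Entry<String, Double> d : docs) {
            // each entry is one matching document: (userId, lucene score)
            totals.merge(d.getKey(), d.getValue(), Double::sum);
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> docs = Arrays.asList(
                new AbstractMap.SimpleEntry<>("u1", 1.25),  // e.g. userBasic doc for u1
                new AbstractMap.SimpleEntry<>("u2", 0.90),
                new AbstractMap.SimpleEntry<>("u1", 0.50)); // e.g. userAdv doc for u1
        for (Map.Entry<String, Double> e : rankUsers(docs)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

One caveat with the approach itself: summing favors users who simply have more indexed objects, so you may want to normalize by the number of documents per user.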

Thanks
Bharat Jain