Re: Which is a good XPath generator?
I am assuming (like Li, I think) that you want to induce a structure/schema from an HTML example so you can use that schema to extract data from similarly structured HTML pages. Another term often used in the literature for that is "wrapper induction". Besides the DOM, CSS classes often give good distinction, and they are often more stable under small redesigns. Besides Li's suggestions, have a look at this thread for an open-source Python implementation (I have never tested it): http://www.holovaty.com/writing/templatemaker/ Also make sure to read all the comments for links to other products, etc. HTH, Geert-Jan

2010/7/25 Li Li
> it's not a related topic in solr. maybe you should read some papers
> about wrapper generation or automatic web data extraction. If you
> want to generate XPath, you could possibly read Bing Liu's papers such
> as "Structured Data Extraction from the Web based on Partial Tree
> Alignment". Besides the DOM tree, visual clues may also be used. But none
> of them will be a perfect solution because of the diversity of web
> pages.
>
> 2010/7/25 Savannah Beckett :
> > Hi,
> > I am looking for an XPath generator that can generate an XPath by picking a
> > specific tag inside an HTML page. Do you know a good XPath generator? If
> > possible, a free XPath generator would be great.
> > Thanks.
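On the "pick a tag, get an XPath" part of the question: the core of such a generator is just walking from the chosen node up to the root and recording each step's tag and sibling position. A toy sketch using only the Python standard library (the sample HTML and the way the target node is chosen are made up; a real page would need an HTML-tolerant parser):

```python
import xml.etree.ElementTree as ET

def xpath_for(root, target):
    """Build an indexed XPath (e.g. /html/body[1]/div[2]) for `target` under `root`."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parents.get(node)
        if parent is None:
            steps.append("/" + node.tag)  # the root step carries no index
        else:
            # 1-based position among same-tag siblings, as XPath expects
            same_tag = [c for c in parent if c.tag == node.tag]
            steps.append("/%s[%d]" % (node.tag, same_tag.index(node) + 1))
        node = parent
    return "".join(reversed(steps))

html = "<html><body><div><span>a</span></div><div><span>price</span></div></body></html>"
root = ET.fromstring(html)
target = root.findall(".//span")[1]  # pretend the user clicked this node
print(xpath_for(root, target))  # /html/body[1]/div[2]/span[1]
```

A wrapper-induction tool would then generalize such paths across example pages (e.g. dropping unstable indexes), which is where CSS classes come in handy as more redesign-stable anchors.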
Solr 4.0 and lucene-analyzers
Hi, If I generate Solr Maven artifacts from trunk, they have a dependency on lucene-analyzers:4.0-dev, which can't be resolved. Maybe I'm doing something wrong? Thanks. -- Pavel Minchenkov
Re: Novice seeking help to change filters to search without diacritics
Use copyField in your schema file. The copy's destination field has its own field type, and therefore its own analyzer, so the original can fold diacritics and the copy need not. dismax might help you at query time with this... HTH Erick

On Sat, Jul 24, 2010 at 11:40 PM, HSingh wrote:
>
> : Usually people set up two fields, one with diacritics and one without.
> : Then searches are against both fields. If you think a match against the field
> : with diacritics is more valuable, you can give that field a boost.
>
> Hi Steve, where can one set up these two fields? Thank you for your kind
> assistance!
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Novice-seeking-help-to-change-filters-to-search-without-diacritics-tp971263p993150.html
> Sent from the Solr - User mailing list archive at Nabble.com.
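To make that concrete, a hedged sketch of what the two fields and the copyField might look like in schema.xml (field and type names here are made up; "text_folded" is assumed to be a type whose analyzer chain includes ASCIIFoldingFilterFactory):

```xml
<!-- original field keeps diacritics intact -->
<field name="title" type="text" indexed="true" stored="true"/>
<!-- copy is analyzed with a folding filter, so it matches with or without diacritics -->
<field name="title_folded" type="text_folded" indexed="true" stored="false"/>
<copyField source="title" dest="title_folded"/>
```

At query time you would then search both fields, e.g. with dismax's qf parameter, optionally boosting the unfolded field so exact diacritic matches rank higher.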
Re: filter query on timestamp slowing query???
britske wrote:
>
> just wanted to mention a possible other route, which might be entirely
> hypothetical :-)
>
> *If* you could query on the internal docid (I'm not sure it's available
> out of the box, or if you can at all),
> your original problem, quoted below, could IMO be simplified to asking for
> the last docid inserted (that matches the other criteria from your use case)
> and, in the next call, filtering from that docid forward.
>

that sounds great, is there really a way to do that?

britske wrote:
>
>> Every 30 minutes, I ask the index for the documents that were added to
>> it since the last time I queried it and that match a certain criterion.
>> From time to time, once a week or so, I ask the index for ALL the documents
>> that match that criterion. (I also do this for not only one query, but
>> several.)
>> This is why I need the timestamp filter.
>
> Again, I'm not entirely sure that querying / filtering on internal docids is
> possible (perhaps someone can comment), but if it is, it would perhaps be
> more performant.
> Big IF, I know.
>
> Geert-Jan
>
> 2010/7/23 Chris Hostetter
>
>> : On top of using trie dates, you might consider separating the timestamp
>> : portion and the type portion of the fq into separate fq parameters --
>> : that will allow them to be stored in the filter cache separately. So
>> : for instance, if you include "type:x OR type:y" in queries a lot, but
>> : with different date ranges, then when you make a new query, the set for
>> : "type:x OR type:y" can be pulled from the filter cache and intersected
>>
>> definitely ... that's the one big thing that jumped out at me once you
>> showed us *how* you were constructing these queries.
>>
>> -Hoss
>>

that's also something that I'll integrate into my testing environment, thanks
--
View this message in context: http://lucene.472066.n3.nabble.com/filter-query-on-timestamp-slowing-query-tp977280p994679.html
Sent from the Solr - User mailing list archive at Nabble.com.
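Hoss's suggestion about separating the filters can be written out as illustrative request parameters (the query values are made up):

```
# one filter per fq parameter: each gets its own filterCache entry
# and can be reused independently across queries
q=foo&fq=type:x OR type:y&fq=timestamp:[NOW/DAY-7DAYS TO NOW]

# versus one combined filter, which is only reusable as a whole
# and misses the cache whenever the date range changes
q=foo&fq=(type:x OR type:y) AND timestamp:[NOW/DAY-7DAYS TO NOW]
```

With the first form, a new query with a different date range still pulls the cached "type:x OR type:y" set and only computes the date filter.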
RE: filter query on timestamp slowing query???
britske wrote:
>> *If* you could query on the internal docid (I'm not sure it's available
>> out of the box, or if you can at all),
>> your original problem, quoted below, could IMO be simplified to asking for
>> the last docid inserted (that matches the other criteria from your use case)
>> and, in the next call, filtering from that docid forward.
>
> that sounds great, is there really a way to do that?

I don't know about internal docids, but there is no reason you can't use that same technique with timestamps, if you want to do the two-query, remember-the-last-doc-from-30-minutes-ago approach. Query for the latest timestamp by sorting by timestamp descending with rows=1; the row you get back has the greatest timestamp. 30 minutes later, query with an fq requiring timestamp greater than the one you remembered. Would this be any slower with timestamps than with docids? I don't think so, but there's one way to find out. Also, with any sorting, you probably want to include a warming query that sorts on the field you are going to sort on. I haven't figured out yet whether a warming query that sorts on a field will also help speed up later range queries (rather than just later sorts) on that field, but I'm thinking it might. Jonathan
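The remember-the-last-timestamp bookkeeping above can be sketched as a tiny helper (the field name and date format are assumptions; Lucene's range syntax uses curly braces for an exclusive bound, which skips the remembered document itself):

```python
def newdocs_filter(last_seen, field="timestamp"):
    """Filter query matching only docs strictly newer than the remembered timestamp.

    last_seen: the timestamp of the newest doc from the previous poll,
    obtained by sorting on the field descending with rows=1.
    """
    # {a TO *} is an exclusive lower bound in Lucene/Solr range syntax
    return "%s:{%s TO *}" % (field, last_seen)

print(newdocs_filter("2010-07-25T12:00:00Z"))
# timestamp:{2010-07-25T12:00:00Z TO *}
```

Each 30-minute poll would pass this string as an fq parameter, then remember the max timestamp from the new results for the next round.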
how to Protect data
Hi, I was asked about protecting data. The search index data is stored in index files, and when you open those files you can clearly see the data: text such as names, addresses, postal codes, etc. Is there any way I can hide the data, i.e. some kind of data encoding so that no raw text is visible at all? -Girish
Re: a bug of solr distributed search
where is the link to this patch?

2010/7/24 Yonik Seeley :
> On Fri, Jul 23, 2010 at 2:23 PM, MitchK wrote:
>> why don't we send the output of TermsComponent of every node in the
>> cluster to a Hadoop instance?
>> Since TermsComponent does the map part of the map-reduce concept, Hadoop
>> only needs to reduce the stuff. Maybe we don't even need Hadoop for this.
>> After reducing, every node in the cluster gets the current values to compute
>> the idf.
>> We can store this information in a HashMap-based SolrCache (or something
>> like that) to provide constant-time access. To keep the values up to date,
>> we can repeat that every x minutes.
>
> There's already a patch in JIRA that does distributed IDF.
> Hadoop wouldn't be the right tool for that anyway... it's for batch-oriented
> systems, not low-latency queries.
>
>> If we have that, it does not matter whether we use doc_X from shard_A or
>> shard_B, since they will all have got the same scores.
>
> That only works if the docs are exactly the same - they may not be.
>
> -Yonik
> http://www.lucidimagination.com
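The reduce step being debated boils down to summing per-shard document frequencies and recomputing idf from the totals. A toy sketch of that merge (the shard numbers are made up, and the idf formula is the simplified Lucene-style one, not whatever the JIRA patch actually implements):

```python
import math

def global_idf(shard_stats):
    """shard_stats: list of (num_docs, {term: doc_freq}) tuples, one per shard.

    Returns {term: idf} computed from the cluster-wide totals, so every
    shard scores a given term identically.
    """
    total_docs = sum(n for n, _ in shard_stats)
    df = {}
    for _, freqs in shard_stats:
        for term, f in freqs.items():
            df[term] = df.get(term, 0) + f
    # Lucene-style idf: 1 + ln(N / (df + 1))
    return {t: 1.0 + math.log(total_docs / (f + 1.0)) for t, f in df.items()}

stats = [(100, {"solr": 10, "hadoop": 1}),   # shard_A
         (200, {"solr": 20})]                # shard_B
idf = global_idf(stats)
```

Each node could cache the resulting map (the HashMap-based SolrCache idea) and refresh it every x minutes; Yonik's caveat still applies, since identical scores only deduplicate docs that really are identical across shards.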
Re: a bug of solr distributed search
the Solr version I used is 1.4

2010/7/26 Li Li :
> where is the link to this patch?
>
> 2010/7/24 Yonik Seeley :
>> On Fri, Jul 23, 2010 at 2:23 PM, MitchK wrote:
>>> why don't we send the output of TermsComponent of every node in the
>>> cluster to a Hadoop instance?
>>> Since TermsComponent does the map part of the map-reduce concept, Hadoop
>>> only needs to reduce the stuff. Maybe we don't even need Hadoop for this.
>>> After reducing, every node in the cluster gets the current values to compute
>>> the idf.
>>> We can store this information in a HashMap-based SolrCache (or something
>>> like that) to provide constant-time access. To keep the values up to date,
>>> we can repeat that every x minutes.
>>
>> There's already a patch in JIRA that does distributed IDF.
>> Hadoop wouldn't be the right tool for that anyway... it's for batch-oriented
>> systems, not low-latency queries.
>>
>>> If we have that, it does not matter whether we use doc_X from shard_A or
>>> shard_B, since they will all have got the same scores.
>>
>> That only works if the docs are exactly the same - they may not be.
>>
>> -Yonik
>> http://www.lucidimagination.com
"SELECT" on a Rich Document to download/display content
Hi, I indexed a Word document; when I do a select, it shows the file name. How can I display the content? Also, if I add "hl=true", is this going to show me the line with the highlight from the Word document? I am using the URL below to do the select:

http://localhost:8983/solr/select/?q=Management

It shows a response like this:

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params"><str name="q">Management</str></lst>
    </lst>
    <result name="response" numFound="1" start="0">
      <doc><str name="id">Mgmt.doc</str></doc>
    </result>
  </response>

Indexing was done with the Java code below:

  import java.io.File;
  import java.io.IOException;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
  import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
  import org.apache.solr.common.util.NamedList;

  public void solrCellRequestDemo() throws IOException, SolrServerException {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      ContentStreamUpdateRequest req =
          new ContentStreamUpdateRequest("/update/extract");
      req.addFile(new File("/Users/Girish/Development/Web Server/apache-solr-1.4.1/example/exampledocs/Mgmt.doc"));
      req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
      req.setParam("literal.id", "Mgmt.doc");
      NamedList<Object> result = server.request(req);
      System.out.println("Result: " + result);
  }
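On the hl=true question: highlighting can only return snippets from a field that is stored, so the text Tika extracts has to end up in a stored field first. A hedged illustration (the field name "content" is an assumption; which field the extracted text actually lands in depends on your /update/extract configuration and schema):

```
http://localhost:8983/solr/select/?q=Management&fl=id,content&hl=true&hl.fl=content
```

If the extracted body is mapped to an unstored field, the doc will match but neither fl nor hl can show its text.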
Re: a bug of solr distributed search
Good morning,

https://issues.apache.org/jira/browse/SOLR-1632

- Mitch

Li Li wrote:
>
> where is the link to this patch?
>
> 2010/7/24 Yonik Seeley :
>> On Fri, Jul 23, 2010 at 2:23 PM, MitchK wrote:
>>> why don't we send the output of TermsComponent of every node in the
>>> cluster to a Hadoop instance?
>>> Since TermsComponent does the map part of the map-reduce concept, Hadoop
>>> only needs to reduce the stuff. Maybe we don't even need Hadoop for this.
>>> After reducing, every node in the cluster gets the current values to
>>> compute the idf.
>>> We can store this information in a HashMap-based SolrCache (or something
>>> like that) to provide constant-time access. To keep the values up to
>>> date, we can repeat that every x minutes.
>>
>> There's already a patch in JIRA that does distributed IDF.
>> Hadoop wouldn't be the right tool for that anyway... it's for batch-oriented
>> systems, not low-latency queries.
>>
>>> If we have that, it does not matter whether we use doc_X from shard_A or
>>> shard_B, since they will all have got the same scores.
>>
>> That only works if the docs are exactly the same - they may not be.
>>
>> -Yonik
>> http://www.lucidimagination.com

--
View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p995407.html
Sent from the Solr - User mailing list archive at Nabble.com.
question about relevance
Hello All, I have an index which stores multiple objects belonging to a user, with a field that identifies the user object type, e.g. userBasic or userAdv: userBasic maps to a userBasicInfoObject and userAdv maps to a userAdvInfoObject. Now when I run a query I get multiple records, mapping to Java objects (identified by objType), that belong to the same user. I want to show the most relevant users at the top of the list, and I am thinking of adding up the Lucene scores of the different result documents for a user to get the best scores. Is this a correct approach to getting the relevance of a user? Thanks Bharat Jain
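The score-adding idea can be sketched as a plain aggregation over the result rows (the row shape and names are made up; note that raw Lucene scores are not normalized across queries, so summing them is a heuristic rather than a principled relevance measure):

```python
from collections import defaultdict

def rank_users(results):
    """results: list of (user_id, obj_type, score) rows from the search.

    Sums each user's document scores and returns users ordered by the total.
    """
    totals = defaultdict(float)
    for user_id, _obj_type, score in results:
        totals[user_id] += score
    # highest combined score first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

rows = [("u1", "userBasic", 1.25),
        ("u2", "userAdv", 0.75),
        ("u1", "userAdv", 0.5)]
print(rank_users(rows))  # [('u1', 1.75), ('u2', 0.75)]
```

One caveat with summing is that a user with many weak matches can outrank a user with one strong match; taking the max score per user instead is the usual alternative if that is not desired.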