Hi,
I'm after a bit of clarification about the 'limitations' section of the
distributed search page on the wiki.
The first two limitations say:
* Documents must have a unique key and the unique key must be stored
(stored="true" in schema.xml)
* When duplicate doc IDs are received, Solr chooses
Mark Miller-3 wrote:
>
> The 'doc ID' in the second point refers to the unique key in the first
> point.
>
I thought so but thanks for clarifying. Maybe a wording change on the wiki
would be good?
Cheers,
Andrew.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Duplicat
Another question...
I have a series of cores representing historical data, only the most recent
of which gets indexed to.
I'd like to alias the most recent one to 'current' so that when they roll
over I can just change the alias, and the cron jobs etc. which manage
indexing don't have to change.
Mark Miller-3 wrote:
>
> On 7/4/10 12:49 PM, Andrew Clegg wrote:
>> I thought so but thanks for clarifying. Maybe a wording change on the
>> wiki
>
> Sounds like a good idea - go ahead and make the change if you'd like.
>
That page seems to be marked immutable.
Chris Hostetter-3 wrote:
>
> a cleaner way to deal with this would be do use something like
> RewriteRule -- either in your appserver (if it supports a feature like
> that) or in a proxy sitting in front of Solr.
>
I think we'll go with this -- seems like the most bulletproof way.
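For anyone finding this thread later, a minimal sketch of the RewriteRule idea, assuming Apache httpd with mod_rewrite and mod_proxy sitting in front of Solr; the core name and port are made up for illustration, not from the thread:

```apache
# Hypothetical sketch: alias /solr/current/ to whichever core is newest.
# When the cores roll over, only this one rule needs updating; the cron
# jobs keep indexing to /solr/current/. Core name and port are illustrative.
RewriteEngine On
RewriteRule ^/solr/current/(.*)$ http://localhost:8983/solr/core2011/$1 [P]
```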
Cheers,
Is anyone using ZooKeeper-based Solr Cloud in production yet? Any war
stories? Any problematic missing features?
Thanks,
Andrew.
Hi,
I'm a little confused about how the tuning params in solrconfig.xml actually
work.
My index currently has mergeFactor=25 and maxMergeDocs=2147483647.
So this means that up to 25 segments can be created before a merge happens,
and each segment can have up to 2bn docs in, right?
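In solrconfig.xml terms, the settings being asked about would look something like this (the values are the ones quoted above; in Solr 1.x these live in the indexDefaults section):

```xml
<!-- Sketch of the merge settings under discussion: up to 25 segments
     accumulate before a merge, and no merged segment is capped below
     Integer.MAX_VALUE documents. -->
<mergeFactor>25</mergeFactor>
<maxMergeDocs>2147483647</maxMergeDocs>
```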
But this pag
Okay, thanks Marc. I don't really have any complaints about performance
(yet!) but I'm still wondering how the mechanics work, e.g. when you have a
number of segments equal to mergeFactor, and each contains maxMergeDocs
documents.
The docs are a bit fuzzy on this...
Hi,
First off, sorry about previous accidental post, had a sausage-fingered
moment.
Anyway...
If I merge two indices with CoreAdmin, as detailed here...
http://wiki.apache.org/solr/MergingSolrIndexes
What happens to duplicate documents between the two? i.e. those that have
the same unique key
(Many apologies if this appears twice, I tried to send it via Nabble
first but it seems to have got stuck, and is fairly urgent/serious.)
Hi,
I'm trying to use the replication handler to take snapshots, then
archive them and ship them off-site.
Just now I got a message from tar that worried me:
January 2011 12:30, Andrew Clegg wrote:
> (Many apologies if this appears twice, I tried to send it via Nabble
> first but it seems to have got stuck, and is fairly urgent/serious.)
>
> Hi,
>
> I'm trying to use the replication handler to take snapshots, then
> archive th
ripts.conf,solrconfig_slave.xml:solrconfig.xml,stopwords.txt,synonyms.txt
00:00:10
Thanks,
Andrew.
On 16 January 2011 12:55, Andrew Clegg wrote:
> PS one other point I didn't mention is that this server has a very
> fast autocommit limit (2 seconds max time).
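The "very fast autocommit" mentioned would look something like this in solrconfig.xml, with 2000 ms matching the "2 seconds max time" quoted:

```xml
<!-- Sketch of a 2-second time-based autocommit (solrconfig.xml,
     inside the updateHandler section). -->
<autoCommit>
  <maxTime>2000</maxTime>
</autoCommit>
```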
First of all, apologies if you get this twice. I posted it by email an hour
ago but it hasn't appeared in any of the archives, so I'm worried it's got
junked somewhere.
I'm trying to use a DataImportHandler to merge some data from a database
with some other fields from a collection of XML files,
Chantal Ackermann wrote:
>
> Hi Andrew,
>
> your inner entity uses an XML type datasource. The default entity
> processor is the SQL one, however.
>
> For your inner entity, you have to specify the correct entity processor
> explicitly. You do that by adding the attribute "processor", and th
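A hedged sketch of what Chantal describes — an inner XML entity with its processor set explicitly; the entity names, field names, datasource name, and URL pattern here are all illustrative, not taken from the thread:

```xml
<entity name="item" query="SELECT id, code FROM items">
  <!-- Inner entity: without processor="XPathEntityProcessor" the default
       SQL entity processor would be assumed, which is the problem above. -->
  <entity name="details"
          processor="XPathEntityProcessor"
          dataSource="xmlSource"
          url="${item.code}.xml"
          forEach="/doc">
    <field column="title" xpath="/doc/title"/>
  </entity>
</entity>
```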
Erik Hatcher wrote:
>
>
> On Jul 30, 2009, at 11:54 AM, Andrew Clegg wrote:
>>> url="${domain.pdb_code}-noatom.xml" processor="XPathEntityProcessor"
>> forEach="/">
>>> xpath="//*[local-name(
Chantal Ackermann wrote:
>
>
> my experience with XPathEntityProcessor is non-existent. ;-)
>
>
Don't worry -- your hints put me on the right track :-)
I got it working with:
Now, to get it to ignore missing files without an error... Hmm...
Cheers,
A couple of questions about the DIH XPath syntax...
The docs say it supports:
xpath="/a/b/subject[@qualifier='fullTitle']"
xpath="/a/b/subject/@qualifier"
xpath="/a/b/c"
Does the second one mean "select the value of the attribute called qualifier
in the /a/b/subject element"?
e.g. For
Andrew Clegg wrote:
>
>
>
Sorry, Nabble swallowed my XML example. That was supposed to be
[a]
[b]
[subject qualifier="some text" /]
[/b]
[/a]
... but in XML.
Andrew.
Noble Paul നോബിള് नोब्ळ्-2 wrote:
>
> On Thu, Aug 13, 2009 at 6:35 PM, Andrew Clegg
> wrote:
>
>> Does the second one mean "select the value of the attribute called
>> qualifier
>> in the /a/b/subject element"?
>
> yes you are right. Isn
Noble Paul നോബിള് नोब्ळ्-2 wrote:
>
> yes. look at the 'flatten' attribute in the field. It should give you
> all the text (not attributes) under a given node.
>
>
I missed that one -- many thanks.
Andrew.
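For reference, the flatten usage Noble describes looks roughly like this (field and xpath names are illustrative):

```xml
<!-- Sketch of the flatten attribute: concatenates all the text beneath
     the matched node (attribute values are not included). -->
<field column="body" xpath="/a/b" flatten="true"/>
```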
Hi folks,
I'm trying to use the Debug Now button in the development console to test
the effects of some changes in my data import config (see attached).
However, each time I click it, the right-hand frame fails to load -- it just
gets replaced with the standard 'connection reset' message from Firefox.
Noble Paul നോബിള് नोब्ळ्-2 wrote:
>
> apparently I do not see any command full-import, delta-import being
> fired. Is that true?
>
It seems that way -- they're not appearing in the logs. I've tried Debug Now
with both full and delta selected from the dropdown, no difference either
way.
If
ahammad wrote:
>
> Is it possible to add a prefix to the data in a Solr field? For example,
> right now, I have a field called "id" that gets data from a DB through the
> DataImportHandler. The DB returns a 4-character string like "ag5f". Would
> it be possible to add a prefix to the data that
Try an sdouble or sfloat field type?
Andrew.
johan.sjoberg wrote:
>
> Hi,
>
> we're performing range queries of a field which is of type double. Some
> queries which should generate results do not, and I think it's best
> explained by the following examples; it's also expected to exist data
Paul Tomblin wrote:
>
> Is there such a thing as a wildcard search? If I have a simple
> solr.StrField with no analyzer defined, can I query for "foo*" or
> "foo.*" and get everything that starts with "foo" such as 'foobar" and
> "foobaz"?
>
Yes. foo* is fine even on a simple string field.
You can use the Data Import Handler to pull data out of any XML or SQL data
source:
http://wiki.apache.org/solr/DataImportHandler
Andrew.
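For Elaine's use case, a minimal data-config.xml along these lines is a reasonable starting point; the file path, element names, and forEach expression are all placeholders to adapt to the actual XML:

```xml
<!-- Hedged sketch of a DIH config for indexing local XML files. -->
<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="doc"
            processor="XPathEntityProcessor"
            url="/data/example.xml"
            forEach="/records/record">
      <field column="id"   xpath="/records/record/@id"/>
      <field column="text" xpath="/records/record/text"/>
    </entity>
  </document>
</dataConfig>
```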
Elaine Li wrote:
>
> Hi,
>
> I am new solr user. I want to use solr search to run query against
> many xml files I have.
> I have set up the solr server
Hi all, I'm having problems getting Solr to start on Tomcat 6.
Tomcat is installed in /opt/apache-tomcat , solr is in
/opt/apache-tomcat/webapps/solr , and my Solr home directory is /opt/solr .
My config file is in /opt/solr/conf/solrconfig.xml .
I have a Solr-specific context file in
/opt/apach
Constantijn Visinescu wrote:
>
> This might be a bit of a hack but I got this in the web.xml of my
> application and it works great.
>
>
>
> <env-entry>
>   <env-entry-name>solr/home</env-entry-name>
>   <env-entry-value>/Solr/WebRoot/WEB-INF/solr</env-entry-value>
>   <env-entry-type>java.lang.String</env-entry-type>
> </env-entry>
>
>
That worked, thanks. You're right though, it is a
hossman wrote:
>
>
> : Hi all, I'm having problems getting Solr to start on Tomcat 6.
>
> which version of Solr?
>
>
Sorry -- a nightly build from about a month ago. Re. your other message, I
was sure the two machines had the same version on, but maybe not -- when I'm
back in the office tomorrow.
Andrew Clegg wrote:
>
>
> hossman wrote:
>>
>>
>> This is why the examples of using context files on the wiki talk about
>> keeping the war *outside* of the webapps directory, and using docBase in
>> your Context declaration...
>
Hi folks,
I'm using the 2009-09-30 build, and any single or double quotes in the query
string cause an NPE. Is this normal behaviour? I never tried it with my
previous installation.
Example:
http://myserver:8080/solr/select/?title:%22Creatine+kinase%22
(I've also tried without the URL encoding
Erik Hatcher-4 wrote:
>
> don't forget q=... :)
>
> Erik
>
> On Oct 1, 2009, at 9:49 AM, Andrew Clegg wrote:
>
>>
>> Hi folks,
>>
>> I'm using the 2009-09-30 build, and any single or double quotes in
>> the query
>> string cause an NPE
Hi,
I have a field in my index called related_ids, indexed and stored, with the
following field type:
Several records in my index contain the token 1cuk in the related_ids field,
but only *some* of them are returned.
> Is there a chance that you're hitting that
> limit? That 1cuk is past the 10,000th term
> in record 2.40?
>
> For this to be possible, I have to assume that the FieldAnalysis
> tool ignores this limit
>
> FWIW
> Erick
>
> On Fri, Oct 23, 2009 at 12:01 PM, Andrew Clegg
> wrote:
Morning,
Last week I was having a problem with terms visible in my search results in
large documents not causing query hits:
http://www.nabble.com/Result-missing-from-query%2C-but-match-shows-in-Field-Analysis-tool-td26029040.html#a26029351
Erick suggested it might be related to maxFieldLength,
I can reproduce a problem with maxFieldLength being
> ignored.
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> On Mon, Oct 26, 2009 at 7:11 AM, Andrew Clegg
> wrote:
>>
>> Morning,
>>
>> Last week I was having a problem with terms visible
Yonik Seeley-2 wrote:
>
> Sorry Andrew, this is something that's bitten people before.
> search for maxFieldLength and you will see *2* of them in your config
> - one for indexDefaults and one for mainIndex.
> The one in mainIndex is set at 10000 and hence overrides the one in
> indexDefaults.
Yonik Seeley-2 wrote:
>
> If you could, it would be great if you could test commenting out the
> one in mainIndex and see if it inherits correctly from
> indexDefaults... if so, I can comment it out in the example and remove
> one other little thing that people could get wrong.
>
Yep, it seems
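To summarise the gotcha Yonik describes, the two competing settings look like this (the large indexDefaults value is illustrative):

```xml
<!-- maxFieldLength appears twice in solrconfig.xml; the mainIndex
     value silently wins over the indexDefaults one. -->
<indexDefaults>
  <maxFieldLength>2147483647</maxFieldLength>
</indexDefaults>
<mainIndex>
  <maxFieldLength>10000</maxFieldLength> <!-- overrides indexDefaults -->
</mainIndex>
```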
Hi,
If I have a DataImportHandler query with a greater-than sign in, like this:
Everything's fine. However, if it contains a less-than sign:
I get this exception:
INFO: Processing configuration from solrconfig.xml: {config=dataconfig.xml}
[Fatal Error] :240:129: The value o
e, so it has to obey XML encoding rules which
> make it ugly but whatcha gonna do?
>
> Erik
>
> On Oct 27, 2009, at 11:50 AM, Andrew Clegg wrote:
>
>>
>> Hi,
>>
>> If I have a DataImportHandler query with a greater-than sign in,
>> like th
Hi,
If I give a query that matches a single document, and facet on a particular
field, I get a list of all the terms in that field which appear in that
document.
(I also get some with a count of zero, I don't really understand where they
come from... ?)
Is it possible with faceting, or a simila
> For terms - http://wiki.apache.org/solr/TermsComponent
>
> Helps?
>
> Cheers
> Avlesh
>
> On Wed, Oct 28, 2009 at 11:32 PM, Andrew Clegg
> wrote:
>
>>
>> Hi,
>>
>> If I give a query that matches a single document, and facet on a
>> partic
Morning,
Can someone clarify how dismax queries work under the hood? I couldn't work
this particular point out from the documentation...
I get that they pretty much issue the user's query against all of the fields
in the schema -- or rather, all of the fields you've specified in the qf
parameter
set of analyzers
> assigned to that particular field for queries (as opposed to indexing).
> For
> example, if "test" is matched against a "string" vs "text" field,
> different
> analyzers may be applied to "string" or "text"
>
optimize, there should be no 0-value facets.
>
> On Wed, Oct 28, 2009 at 11:36 AM, Andrew Clegg
> wrote:
>>
>>
>> Isn't the TermVectorComponent more for one document at a time, and the
>> TermsComponent for the whole index?
>>
>> Actually -- having
Actually Avlesh pointed me at that, earlier in the thread. But thanks :-)
Yonik Seeley-2 wrote:
>
> On Wed, Oct 28, 2009 at 2:02 PM, Andrew Clegg
> wrote:
>> If I give a query that matches a single document, and facet on a
>> particular
>> field, I get a list of
Hi,
I've recently added the TermVectorComponent as a separate handler, following
the example in the supplied config file, i.e.:
  <bool name="tv">true</bool>
  <arr name="last-components"><str>tvComponent</str></arr>
It works, but with one quirk. When you use tf.all=true, you
Hi everyone,
I'm experimenting with highlighting for the first time, and it seems
shockingly slow for some queries.
For example, this query:
http://server:8080/solr/select/?q=transferase&qt=dismax&version=2.2&start=0&rows=10&indent=on
takes 313ms. But when I add highlighting:
http://server:80
although not with those really long response
> times). Fixed by moving to JRE 1.6 and tuning garbage collection.
>
> Bye,
>
> Jaco.
>
> 2009/11/3 Andrew Clegg
>
>>
>> Hi everyone,
>>
>> I'm experimenting with highlighting for the first tim
Nicolas Dessaigne wrote:
>
> Alternatively, you could use a copyfield with a maxChars limit as your
> highlighting field. Works well in my case.
>
Thanks for the tip. We did think about doing something similar (only
enabling highlighting for certain shorter fields) but we decided that
perhaps
Hi,
If I run a MoreLikeThis query like the following:
http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1
one of the hits in the results is "and" (I don't do any stopword removal on
this field).
Howev
Lukáš Vlček wrote:
>
> I am looking for good arguments to justify implementation a search for
> sites
> which are available on the public internet. There are many sites in
> "powered
> by Solr" section which are indexed by Google and other search engines but
> still they decided to invest resour
Morning all,
I'm having problems joining a child entity from one database to a
parent from another...
My entity definitions look like this (names changed for brevity):
c is getting indexed fine (it's stored, I can see field 'c' in the search
results) but child.d isn't. I know
Lukáš Vlček wrote:
>
> When you need to search for something Lucene or Solr related, which one do
> you use:
> - generic Google
> - go to a particular mail list web site and search from here (if there is
> any search form at all)
>
Both of these (Nabble in the second case) in case any recent p
Any ideas on this? Is it worth sending a bug report?
Those links are live, by the way, in case anyone wants to verify that MLT is
returning suggestions with very low tf.idf.
Cheers,
Andrew.
Andrew Clegg wrote:
>
> Hi,
>
> If I run a MoreLikeThis query like the followin
Noble Paul നോബിള് नोब्ळ्-2 wrote:
>
> no obvious issues.
> you may post your entire data-config.xml
>
Here it is, exactly as last attempt but with usernames etc. removed.
Ignore the comments and the unused FileDataSource...
http://old.nabble.com/file/p26335171/dataimport.temp.xml dataimpo
Chantal Ackermann wrote:
>
> no idea, I'm afraid - but could you sent the output of
> interestingTerms=details?
> This at least would show what MoreLikeThis uses, in comparison to the
> TermVectorComponent you've already pasted.
>
I can, but I'm afraid they're not very illuminating!
http://
Chantal Ackermann wrote:
>
> your URL does not include the parameter mlt.boost. Setting that to
> "true" made a noticeable difference for my queries.
>
Hmm, I'm really not sure if this is doing the right thing either. When I add
it I get:
1.0
0.60737264
0.27599618
0.2476748
0.24487767
aerox7 wrote:
>
> Hi Andrew,
> I download the last build of solr (1.4) and i have the same probleme with
> DebugNow in Dataimport dev Console. have you found a solution ?
>
Sorry about the slow reply, I've been on holiday. No, I never found a
solution; it worked in some nightlies but not in others.
Hi,
I'm interested in near-dupe removal as mentioned (briefly) here:
http://wiki.apache.org/solr/Deduplication
However the link for TextProfileSignature hasn't been filled in yet.
Does anyone have an example of using TextProfileSignature that demonstrates
the tunable parameters mentioned in th
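For reference, a sketch of what a TextProfileSignature update chain might look like in solrconfig.xml; the minTokenLen and quantRate values here are illustrative guesses at the tunables, not documented defaults:

```xml
<!-- Hedged sketch of a near-dupe update chain; field names are made up. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <str name="fields">body</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
    <str name="minTokenLen">3</str>
    <str name="quantRate">0.3</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```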
I'm missing something.
Thanks again,
Andrew.
Erik Hatcher-4 wrote:
>
>
> On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:
>> I'm interested in near-dupe removal as mentioned (briefly) here:
>>
>> http://wiki.apache.org/solr/Deduplication
>>
>> Howe
Erik Hatcher-4 wrote:
>
>
> On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
>> Thanks Erik, but I'm still a little confused as to exactly where in
>> the Solr
>> config I set these parameters.
>
> You'd configure them within the element, so
Hi,
Is there a way to get the DataImportHandler to skip already-seen records
rather than reindexing them?
The UpdateHandler has a capability which (as I
understand it) means that a document whose uniqueKey matches one already in
the index will be skipped instead of overwritten.
Can the DIH be
Marc Sturlese wrote:
>
> You can use deduplication to do that. Create the signature based on the
> unique field or any field you want.
>
Cool, thanks, I hadn't thought of that.
Hi,
I'm trying to get the Velocity / Solritas feature to work for one core of a
two-core Solr instance, but it's not playing nice.
I know the right jars are being loaded, because I can see them mentioned in
the log, but still I get a class not found exception:
09-May-2010 15:34:02 org.apache.so
Erik Hatcher-4 wrote:
>
> What version of Solr? Try switching to
> class="solr.VelocityResponseWriter", and if that doesn't work use
> class="org.apache.solr.request.VelocityResponseWriter". The first
> form is the recommended way to do it. The actual package changed in
> trunk not t
Sorry -- in the second of those error messages (the NPE) I meant
lucene
not standard.
Andrew Clegg wrote:
>
>
> Erik Hatcher-4 wrote:
>>
>> What version of Solr? Try switching to
>> class="solr.VelocityResponseWriter", an
in or /solr/itas and insert your core name in the
middle.
(Does anyone know if there'd be a simple way to make that automatic?)
Andrew Clegg wrote:
>
>
> Erik Hatcher-4 wrote:
>>
>> What version of Solr? Try switching to
>> class="solr.Velocit
Hi folks,
I had a Solr instance (in Jetty on Linux) taken down by a process monitoring
tool (God) with a SIGKILL recently.
How bad is this? Can it cause index corruption if it's in the middle of
indexing something? Or will it just lose uncommitted changes? What if the
signal arrives in the middle
Hi Solr gurus,
I'm wondering if there is an easy way to keep the targets of hyperlinks from
a field which may contain HTML fragments, while stripping the HTML.
e.g. if I had a field that looked like this:
"This is the entire content of my field, but http://example.com/ some of
the words are a
Lance Norskog-2 wrote:
>
> The PatternReplace and HTMLStrip tokenizers might be the right bet.
> The easiest way to go about this is to make a bunch of text fields
> with different analysis stacks and investigate them in the Schema
> Browser. You can paste an HTML document into the text box and s
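A sketch of the kind of analysis stack Lance suggests trying; note that this strips markup including href targets, so preserving the link URLs would still need a separate pattern-based step. The field type name is made up:

```xml
<!-- Hedged sketch: strip HTML markup before tokenizing. -->
<fieldType name="html_text" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```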
findbestopensource wrote:
>
> Could you tell us your schema used for indexing. In my opinion, using
> standardanalyzer / Snowball analyzer will do the best. They will not break
> the URLs. Add href, and other related html tags as part of stop words and
> it
> will removed while indexing.
>
Thi
Neeb wrote:
>
> Just wondering if you ever managed to run TextProfileSignature based
> deduplication. I would appreciate it if you could send me the code
> fragment for it from solrconfig.
>
Actually the project that was for got postponed and I got distracted by
other things, for now at least
Andrew Clegg wrote:
>
> Re. your config, I don't see a minTokenLength in the wiki page for
> deduplication, is this a recent addition that's not documented yet?
>
Sorry about this -- stupid question -- I should have read back through the
thread and refreshed my memory.
Markus Jelsma wrote:
>
> Well, it got me too! KMail didn't properly order this thread. Can't seem
> to
> find Hatcher's reply anywhere. ??!!?
>
Whole thread here:
http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039.html