indexing Chinese language

2009-02-16 Thread revathy arun
Hi, When I index Chinese content using the Chinese tokenizer and analyzer in Solr 1.3, some of the Chinese text files are getting indexed but others are not. Since Chinese has many different language subtypes, such as standard Chinese, simplified Chinese, etc., which of these does the Chinese tokenizer

DIH transformers

2009-02-16 Thread Fergus McMenemie
Hello. I have been beating my head around the data-config.xml listed at the end of this message. It breaks in a few different ways. 1) I have bodged TemplateTransformer to allow it to return when one of the variables is undefined. This ensures my uniqueKey is always defined. But thin

Re: spellcheck.onlyMorePopular

2009-02-16 Thread Marcus Stratmann
Shalin Shekhar Mangar wrote: The implementation is a bit more complicated.
1. Read all tokens from the specified field in the solr index.
2. Create n-grams of the terms read in #1 and index them into a separate Lucene index (spellcheck index).
3. When asked for suggestions, create n-grams of the
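The n-gram idea in the steps above can be sketched roughly as follows. This is a toy illustration in plain Python, not the Solr or Lucene spellchecker itself: the dictionary stands in for the separate spellcheck index, and ranking candidates by the number of shared n-grams is an assumption, since the quoted message is cut off before it describes how suggestions are scored. The names `char_ngrams`, `build_ngram_index` and `suggest` are hypothetical.

```python
from collections import Counter


def char_ngrams(term, n=2):
    """Return the overlapping character n-grams of a term."""
    if len(term) < n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def build_ngram_index(terms, n=2):
    """Map each n-gram to the set of terms containing it (stand-in for the spellcheck index)."""
    index = {}
    for term in terms:
        for gram in char_ngrams(term, n):
            index.setdefault(gram, set()).add(term)
    return index


def suggest(word, index, n=2, limit=3):
    """Rank indexed terms by how many n-grams they share with the query word (assumed scoring)."""
    scores = Counter()
    for gram in set(char_ngrams(word, n)):
        for term in index.get(gram, ()):
            scores[term] += 1
    scores.pop(word, None)  # never suggest the word itself
    return [term for term, _ in scores.most_common(limit)]


terms = ["lucene", "solr", "search", "index"]
ngram_index = build_ngram_index(terms)
suggestions = suggest("lucne", ngram_index)
```

Here the misspelling "lucne" shares the grams "lu", "uc" and "ne" with "lucene" and none with the other terms, so "lucene" is the only candidate.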

Re: almost realtime updates with replication

2009-02-16 Thread sunnyfr
Hi Hoss, Is it a problem if the snappuller misses one snapshot before the last one? Cheers, Have a nice day, hossman wrote: > > : > : There are a couple queries that we would like to run almost realtime so > : I would like to have it so our client sends an update on every new > : document and

snapshot created if there is no document updated/new?

2009-02-16 Thread sunnyfr
Hi, I would like to know if a snapshot is automatically created even if no document is updated or added? Thanks a lot,

Re: almost realtime updates with replication

2009-02-16 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess , it should not be a problem --Noble On Mon, Feb 16, 2009 at 3:28 PM, sunnyfr wrote: > > Hi Hoss, > > Is it a problem if the snappuller miss one snapshot before the last one ?? > > Cheer, > Have a nice day, > > > hossman wrote: >> >> : >> : There are a couple queries that we would like to

Re: facet count on partial results

2009-02-16 Thread Karl Wettin
On 15 Feb 2009, at 20:15, Yonik Seeley wrote: On Sat, Feb 14, 2009 at 6:45 AM, karl wettin wrote: Also, as my threshold is based on the distance in score between the first result it sounds like using a result start position greater than 0 is something I have to look out for. Or? Hmmm - th

Re: DIH transformers

2009-02-16 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie wrote: > Hello. > > I have been beating my head around the data-config.xml listed > at the end of this message. It breaks in a few different ways. > > 1) I have bodged TemplateTransformer to allow it to return > when one of the variables is un

Distributed search

2009-02-16 Thread revathy arun
Hi, Can we use multicore to have several indexes per webapp and use distributed search to merge the indexes? For example, if we have 3 cores (core0, core1 and core2) for 3 different languages, and to search across all 3 indexes use the shard parameter as shard=localhost:8080/solr/core0,localhost:
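A minimal sketch of what such a request URL could look like, assuming the three cores named in the question all live under the same localhost:8080 webapp. Note that Solr's distributed search parameter is spelled `shards` (plural). The entries for core1 and core2 are filled in by analogy with the core0 entry, since the message is truncated, and the helper name `distributed_query_url` and the example query are hypothetical.

```python
from urllib.parse import urlencode


def distributed_query_url(base_url, shards, query, rows=10):
    """Build a /select URL that asks one core to fan the query out to all listed shards."""
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated host:port/path entries, without the http:// prefix.
        "shards": ",".join(shards),
    }
    return base_url + "/select?" + urlencode(params)


url = distributed_query_url(
    "http://localhost:8080/solr/core0",
    [
        "localhost:8080/solr/core0",
        "localhost:8080/solr/core1",  # assumed by analogy with core0
        "localhost:8080/solr/core2",  # assumed by analogy with core0
    ],
    "title:solr",
)
```

The request is sent to any one core, which then queries every entry in `shards` and merges the results.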

Re: Release of solr 1.4 & autosuggest

2009-02-16 Thread Grant Ingersoll
On Feb 16, 2009, at 12:05 AM, Pooja Verlani wrote: Hi All, I am interested in TermComponent addition in solr 1.4 ( http://wiki.apache.org/solr/TermsComponent). When should we expect solr 1.4 to be available for use? Also, can this Termcomponent be made available as a plugin for solr 1.3? I'

Re: Word Locations & Search Components

2009-02-16 Thread Grant Ingersoll
On Feb 15, 2009, at 10:33 PM, Johnny X wrote: Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' fie

Re: Multilanguage

2009-02-16 Thread Erick Erickson
I recommend that you search both this and the Lucene list. You'll find that this topic has been discussed many times, and several approaches have been outlined. The searchable archives are linked to from here: http://lucene.apache.org/java/docs/mailinglists.html. Best Erick On Mon, Feb 16, 2009

Re: almost realtime updates with replication

2009-02-16 Thread sunnyfr
Hi Noble, OK, I don't really mind if it misses one; if it gets the last one, that's good. I was wondering as well if a snapshot is created even if no document has been updated? Thanks a lot Noble, Wish you a very nice day, Noble Paul നോബിള്‍ नोब्ळ् wrote: > > I guess , it should not be a probl

snapshot as big as the index folder?

2009-02-16 Thread sunnyfr
Hi, Is it normal or did I miss something ??
5.8G book/data/snapshot.20090216153346
12K book/data/spellchecker2
4.0K book/data/index
12K book/data/spellcheckerFile
12K book/data/spellchecker1
5.8G book/data/
Last update ? 92562 45492 0 2009-02-16 15:20:01 2009-02-16 15:20:0

Re: snapshot as big as the index folder?

2009-02-16 Thread sunnyfr
It changes a lot in a few minutes. Is that normal? Thanks
5.8G book/data/snapshot.20090216153346
4.0K book/data/index
5.8G book/data/
r...@search-07:/data/solr# du -h book/data/
5.8G book/data/snapshot.20090216153346
3.7G book/data/index
4.0K book/data/snapshot.20090216153759
9.4G

Re: Word Locations & Search Components

2009-02-16 Thread Alexander Ramos Jardim
I would go for a business logic solution and not a Solr customization in this case, as you need to filter information that you actually would like to see in different fields on your index. Have you already tried to split the email in several fields like subject, from, to, content, signature, etc etc

Re: Word Locations & Search Components

2009-02-16 Thread Johnny X
Basically I'm working on the Enron dataset, and I've already de-duplicated the collection and applied a spam filter. All the e-mails after this have been parsed to XML and each field (so To, From, Date etc) has been separated, along with one large field for the remaining e-mail content (called Con

Re: Word Locations & Search Components

2009-02-16 Thread Erick Erickson
I think you essentially have to do much of the same work either way, so take whatever comes easiest. Personally, I think that pre-processing the data (and using two fields) would be easiest, but it's up to you. Using a custom analyzer would involve collecting all the contents, deciding what is "re

Re: facet count on partial results

2009-02-16 Thread Yonik Seeley
>> On Sat, Feb 14, 2009 at 6:45 AM, karl wettin >> wrote: >>> Also, as my threshold is based on the distance in score between the >>> first result it sounds like using a result start position greater than >>> 0 is something I have to look out for. Or? >> >> Hmmm - this isn't that easy in general

Re: Release of solr 1.4 & autosuggest

2009-02-16 Thread Noble Paul നോബിള്‍ नोब्ळ्
The logging was changed from j.u.l to slf4j. That is the only problem I can see. If you drop in that jar as well it should just work On Mon, Feb 16, 2009 at 6:49 PM, Grant Ingersoll wrote: > > On Feb 16, 2009, at 12:05 AM, Pooja Verlani wrote: > >> Hi All, >> I am interested in TermComponent add

Re: almost realtime updates with replication

2009-02-16 Thread Noble Paul നോബിള്‍ नोब्ळ्
Yes, it does. It just blindly creates hard links irrespective of whether a document is added or not. But no snappull will happen because there is no new file to be downloaded On Mon, Feb 16, 2009 at 7:40 PM, sunnyfr wrote: > > Hi Noble, > > So ok I don't mind really if it miss one, if it get the last o

Re: delete snapshot??

2009-02-16 Thread sunnyfr
Hi, OK, but can I use it more often than every day, like every three hours, because the snapshots are quite big? Thanks a lot, Bill Au wrote: > > The --delete option of the rsync command deletes extraneous files from the > destination directory. It does not delete Solr snapshots. To do that you >

Re: delete snapshot??

2009-02-16 Thread Noble Paul നോബിള്‍ नोब्ळ्
They are just hard links. They do not consume space on disk On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr wrote: > > Hi, > > Ok but can I use it more often then every day like every three hours, > because snapshot are quite big. > > Thanks a lot, > > > Bill Au wrote: >> >> The --delete option of the r

Input XML duplicate fields uniqueness

2009-02-16 Thread Adi_Jinx
Hi, I have an Input XML as now for SOLR indexing converted it to 1 12-Feb-2009 1 NJ safsafsd#sf08 Dev 1 NJ CP 2 KL 080jnkdfhjwf Int 0 080jnkdfhjwf 08dedf I was able to index it. Just put this single x

Re: delete snapshot??

2009-02-16 Thread sunnyfr
Hi Noble, But how come I get a space error? :( Thanks a lot,
Feb 16 18:28:34 search-07 jsvc.exec[8872]: ataImporter.java:361) Caused by: java.io.IOException: No space left on device
    at java.io.RandomAccessFile.writeBytes(Native Method)
    at java.io.RandomAccessFile.write(RandomAccessFile.java:

Re: Input XML duplicate fields uniqueness

2009-02-16 Thread Shalin Shekhar Mangar
On Mon, Feb 16, 2009 at 11:47 PM, Adi_Jinx wrote: > > I was able to index it. Just put this single xml and searched based on > rec.id and response xml returned however input xml tag order was not > maintained. So I was unable to identify which attributes of account belongs > to which account. Is

can the TermsComponent be used in combination with fq?

2009-02-16 Thread Peter Wolanin
We have been trying to figure out how to construct, for example, a directory page with an overview of available facets for several fields. Looking at the issue and wiki http://wiki.apache.org/solr/TermsComponent https://issues.apache.org/jira/browse/SOLR-877 It would seem like this component wou

Re: term offsets not returned with tv=true

2009-02-16 Thread Koji Sekiguchi
Your request seems to be fine. Have you reindexed after setting termOffsets definition to document field? Koji Jeffrey Baker wrote: I'm trying to exercise the termOffset functions in the nightly build (2009-02-11) but it doesn't seem to do anything. I have an item in my schema like so: An

Re: Release of solr 1.4 & autosuggest

2009-02-16 Thread David Smiley @MITRE.org
Sorry for butting in on this thread but what value is added by TermComponent when you can use faceting for auto-suggest? And with faceting, you can limit the suggestion by existing words before the word the user is typing by using it for "q". ~ David Smiley Pooja Verlani wrote: > > Hi All, >

Re: delete snapshot??

2009-02-16 Thread sunnyfr
Hi Noble, Maybe I'm missing something. OK, if they are hard links, how come I get a "No space left on device" error and 30G shown on the data folder? Sorry, I'm quite new
6.0G /data/solr/book/data/snapshot.20090216214502
35M /data/solr/book/data/snapshot.20090216195003
12M /data/solr/book

Re: Release of solr 1.4 & autosuggest

2009-02-16 Thread Grant Ingersoll
On Feb 16, 2009, at 6:13 PM, David Smiley @MITRE.org wrote: Sorry for butting in on this thread but what value is added by TermComponent when you can use faceting for auto-suggest? Yeah, you can do auto-suggest w/ faceting, no doubt. In fact the TermComponent could just as well be cal

Re: Input XML duplicate fields uniqueness

2009-02-16 Thread Adi_Jinx
Shalin Shekhar Mangar wrote: > > On Mon, Feb 16, 2009 at 11:47 PM, Adi_Jinx wrote: > > How about creating a Solr document for each account and adding the recid > and > updt attributes from the record tag? > > -- > Regards, > Shalin Shekhar Mangar. > > However then I do need to allow dupl

Re: Distributed search

2009-02-16 Thread Otis Gospodnetic
Hi, That should work, yes, though it may not be a wise thing to do performance-wise, if the number of CPU cores that solr server has is lower than the number of Solr cores. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: revathy arun

Re: indexing Chinese language

2009-02-16 Thread Otis Gospodnetic
Hi, While some of the characters in simplified and traditional Chinese do differ, the Chinese tokenizer doesn't care - it simply creates ngram tokens. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: revathy arun To: solr-user@lucene.
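To illustrate why an n-gram tokenizer is indifferent to the script variant, here is a rough sketch in plain Python. It is not the Lucene tokenizer: the function name `cjk_ngram_tokens`, the choice of n=2, the restriction to the U+4E00..U+9FFF block, and the handling of Latin words are all assumptions made for the example.

```python
import re

# Runs of CJK ideographs (basic block only) or runs of ASCII letters/digits.
_RUNS = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def cjk_ngram_tokens(text, n=2):
    """Split CJK runs into overlapping character n-grams; keep Latin words whole."""
    tokens = []
    for run in _RUNS.findall(text):
        if "\u4e00" <= run[0] <= "\u9fff":
            if len(run) <= n:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
        else:
            tokens.append(run.lower())
    return tokens


simplified = cjk_ngram_tokens("简体中文")
traditional = cjk_ngram_tokens("繁體中文")
mixed = cjk_ngram_tokens("Solr 中文搜索")
```

Both variants go through exactly the same code path. No dictionary or language knowledge is involved, so the tokenizer never needs to know which form of Chinese it is looking at.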

Re: Multilanguage

2009-02-16 Thread Otis Gospodnetic
Hi, The best option would be to identify the language after parsing the PDF and then index it using an appropriate analyzer defined in schema.xml. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: revathy arun To: solr-user@lucene.apac

Re: Outofmemory error for large files

2009-02-16 Thread Otis Gospodnetic
Siddharth, At the end of your email you said: "One option I see is to break the file in chunks, but with this, I won't be able to search with multiple words if they are distributed in different documents." Unless I'm missing something unusual about your application, I don't think the above is

Re: Word Locations & Search Components

2009-02-16 Thread Otis Gospodnetic
Hi, Wouldn't this be as easy as:
- split email into "paragraphs"
- for each paragraph compute signature (MD5 or something fuzzier, like in SOLR-799)
- for each signature look for other emails with this signature
- when you find an email with an identical signature, you know you've found the "ban
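The signature approach outlined above could look something like the following sketch. It uses exact MD5 over whitespace-normalized, lowercased paragraphs rather than the fuzzier signature from SOLR-799. Splitting paragraphs on blank lines and the function names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def paragraph_signatures(email_body):
    """Return (md5 signature, paragraph) pairs, one per blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in email_body.split("\n\n") if p.strip()]
    pairs = []
    for para in paragraphs:
        normalized = " ".join(para.split()).lower()  # collapse whitespace, ignore case
        signature = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        pairs.append((signature, para))
    return pairs


def repeated_paragraphs(emails):
    """Map each paragraph that occurs in more than one email to the ids of those emails."""
    ids_by_signature = defaultdict(set)
    first_text = {}
    for email_id, body in emails.items():
        for signature, para in paragraph_signatures(body):
            ids_by_signature[signature].add(email_id)
            first_text.setdefault(signature, para)
    return {
        first_text[sig]: sorted(ids)
        for sig, ids in ids_by_signature.items()
        if len(ids) > 1
    }


emails = {
    "a": "Hello Bob\n\nThis email is confidential.",
    "b": "Lunch?\n\nThis  email is CONFIDENTIAL.",
}
shared = repeated_paragraphs(emails)
```

Paragraphs that turn up in several emails are candidates for stripping or for indexing in a separate field. The normalization step is what lets the two slightly different copies in the example match.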

RE: Outofmemory error for large files

2009-02-16 Thread Gargate, Siddharth
Otis, I haven't tried it yet but what I meant is: If we divide the content in multiple parts, then words will be split across two different SOLR documents. If the main document contains 'Hello World' then these two words might get indexed in two different documents. Searching for 'Hello world

Re: Outofmemory error for large files

2009-02-16 Thread Otis Gospodnetic
Siddharth, But does your 150MB file represent a single Document?  That doesn't sound right. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: "Gargate, Siddharth" To: solr-user@lucene.apache.org Sent: Tuesday, February 17, 2009 12:39:53

How to fetch all matching records :urgent

2009-02-16 Thread Neha Bhardwaj
Hello, I am using the getResults method of the QueryResponse class on a keyword that has more than a hundred matching records. But this method returns only 10 results, and then throws an array index out of bounds exception. How can I fetch all the results? It's really important and urgent for me ,

Re: How to fetch all matching records :urgent

2009-02-16 Thread Walter Underwood
Increment the start value by 10 and make another request. wunder On 2/16/09 9:13 PM, "Neha Bhardwaj" wrote: > Hello, > > I am using getResults method of queryResponse class, on a keyword that has > more than hundred of matching records. Bit this method returns me only 10 > results. And then
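As a small sketch of that paging loop: given the total hit count that Solr reports as `numFound`, the following generates the `start` and `rows` values for each request. The helper name `page_params` is hypothetical, and actually sending the requests is left out.

```python
def page_params(num_found, rows=10):
    """Yield the start/rows parameters for each request needed to cover num_found hits."""
    for start in range(0, num_found, rows):
        yield {"start": start, "rows": rows}


pages = list(page_params(25))
```

For 25 hits this gives three requests, with `start` at 0, 10 and 20. Raising `rows` reduces the number of round trips, at the cost of larger responses.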

Re: delete snapshot??

2009-02-16 Thread Noble Paul നോബിള്‍ नोब्ळ्
The hardlinks will prevent the unused files from getting cleaned up. So the diskspace is consumed for unused index files also. You may need to delete unused snapshots from time to time --Noble On Tue, Feb 17, 2009 at 5:24 AM, sunnyfr wrote: > > Hi Noble, > > I maybe don't get something > Ok if it
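The behaviour described here can be reproduced with a short experiment. It is a sketch that assumes a POSIX filesystem with hard link support, and the file names are made up. The second name for the file costs no extra space, but the data stays on disk until the last name is removed.

```python
import os
import tempfile


def hardlink_demo():
    """Create a file, hard-link it as a 'snapshot', delete the original, and report what remains."""
    with tempfile.TemporaryDirectory() as tmp:
        index_file = os.path.join(tmp, "segment.dat")              # hypothetical index file
        snapshot_file = os.path.join(tmp, "snapshot.segment.dat")  # hypothetical snapshot entry
        with open(index_file, "wb") as f:
            f.write(b"x" * 1024)

        os.link(index_file, snapshot_file)  # second directory entry, same data blocks
        links_after_snapshot = os.stat(index_file).st_nlink
        same_inode = os.stat(index_file).st_ino == os.stat(snapshot_file).st_ino

        os.remove(index_file)  # the "index" drops the file, e.g. after a merge
        size_after_delete = os.path.getsize(snapshot_file)  # the data is still there
        return links_after_snapshot, same_inode, size_after_delete


result = hardlink_demo()
```

This is why old snapshots start to cost real disk space once the live index has moved on to new segment files: the snapshot is then the only thing keeping the old files alive.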

Re: Outofmemory error for large files

2009-02-16 Thread Shalin Shekhar Mangar
On Tue, Feb 17, 2009 at 10:26 AM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > Siddharth, > > But does your 150MB file represent a single Document? That doesn't sound > right. > Otis, Solrj writes the whole XML in memory before writing it to server. That may be one reason behind Sidhh

Re: Outofmemory error for large files

2009-02-16 Thread Otis Gospodnetic
Right.  But I was trying to point out that a single 150MB Document is not in fact what the o.p. wants to do.  For example, if your 150MB represents, say, a whole book, should that really be a single document?  Or should individual chapters be separate documents, for example? Otis -- Sematext --

Re: Need help with DictionaryCompoundWordTokenFilterFactory

2009-02-16 Thread Otis Gospodnetic
Ralf, Not sure if you got this working or not, but perhaps a simple solution is changing the default boolean operator from OR to AND. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: "Kraus, Ralf | pixelhouse GmbH" To: solr-user@lucen