RE: Indexing very large files.

2008-02-24 Thread Jon Lehto
Hi Dave,

A couple more thoughts -

Security is separate from file size.
Maybe assign your users to membership classes, which will cut down on the
amount of updating needed over time as people enter, leave, and change
roles. For instance, 'Bob' was in operations with full access, moved to
tech support with access restricted to one major customer's content, then
moved to supporting that customer's direct competitor, with no access to
the first customer's content and full access to the competitor's. Ideally,
changing Bob's permissions can be done without touching the index. There
are commercial products available for this sort of thing; Netegrity
SiteMinder is one, and it had the largest market share. Maybe read up on
how they handle it and, depending on your resources, build or buy what you
need. The biggest piece of work (not that big) is the integration with the
permissions/single-sign-on system - LDAP or other. Indexed docs just get a
security token. Maybe read about access control lists (ACLs) if you've not
worked with them before.
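
To make that concrete, here is a minimal SolrJ sketch of token-based
filtering at query time; the acl_token field name and the example tokens
are hypothetical, and in practice the tokens would come from the SSO/LDAP
lookup for the logged-in user:

    import org.apache.solr.client.solrj.SolrQuery;

    public class AclQueryExample {

        // Build a query restricted to the security tokens granted to a user.
        // "acl_token" is a hypothetical field populated at index time; the
        // tokens themselves would come from the SSO/LDAP lookup for the user.
        static SolrQuery buildQuery(String userQuery, String... userTokens) {
            SolrQuery query = new SolrQuery(userQuery);
            // Filter queries never affect relevance scoring, so permission
            // changes only alter what is passed here - the index is untouched.
            query.addFilterQuery("acl_token:(" + String.join(" OR ", userTokens) + ")");
            return query;
        }

        public static void main(String[] args) {
            System.out.println(buildQuery("quarterly report", "operations", "techsupport"));
        }
    }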

Breaking up big files can be done blindly during index loading, with a
scheme that clues users or the UI in on how to access the other sections.
Document conversion to text should happen in the indexing pipeline.
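
As a rough sketch of that blind chunking (the field names parent_id,
chunk_no, and body are made up for illustration; the parent id plus chunk
number is the clue the UI needs to reach the other sections):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.common.SolrInputDocument;

    public class BlindChunker {

        // Split extracted text into fixed-size chunks, one Solr doc per chunk.
        static List<SolrInputDocument> chunk(String parentId, String text, int chunkChars) {
            List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            for (int start = 0, n = 0; start < text.length(); start += chunkChars, n++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", parentId + "_" + n);
                doc.addField("parent_id", parentId);   // ties chunks back together
                doc.addField("chunk_no", n);           // ordering for the UI
                doc.addField("body", text.substring(start,
                        Math.min(start + chunkChars, text.length())));
                docs.add(doc);
            }
            return docs;
        }
    }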

Another approach could be to index a summary and point to the large doc
in the file system or database.
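
And a minimal sketch of the summary-plus-pointer variant, again with
hypothetical field names:

    import org.apache.solr.common.SolrInputDocument;

    public class SummaryPointerDoc {

        // Index a small searchable summary plus a pointer back to the
        // full document; field names are placeholders.
        static SolrInputDocument summaryDoc(String id, String summary, String location) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("summary", summary);      // the only text that is searched
            doc.addField("source_url", location);  // file-system path or DB key
            return doc;
        }
    }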

Cheers,
Jon

-Original Message-
From: David Thibault [mailto:[EMAIL PROTECTED] 
Sent: Saturday, February 23, 2008 9:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing very large files.

Thanks.  I'm trying to build a general-purpose secure enterprise search
system.  Specifically, it needs to be able to crawl web pages (which are
almost all small files) and filesystems (which may have widely varying file
sizes).  I realize other projects exist that have done similar things, but
none take the original file permissions into account, index those too, and
then limit search results to documents the searching party should have
access to (hiding the results they should not see).  Since the types of
files are not known in advance, I can't exactly split them up into logical
units.  I could possibly just limit my indexing to the first X MB of any
file, though.  I hadn't thought of the implications for relevance or
post-processing that you bring up above.
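
For what it's worth, the "first X MB" cap could be as simple as this plain
Java sketch (error handling kept minimal):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Arrays;

    public class CappedReader {

        // Read at most maxBytes from a file so that huge files contribute
        // only their leading content to the indexing pipeline.
        static byte[] readHead(String path, int maxBytes) throws IOException {
            FileInputStream in = new FileInputStream(path);
            try {
                byte[] buf = new byte[maxBytes];
                int off = 0, read;
                while (off < maxBytes && (read = in.read(buf, off, maxBytes - off)) != -1) {
                    off += read;
                }
                return Arrays.copyOf(buf, off);  // trim to the bytes actually read
            } finally {
                in.close();
            }
        }
    }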
Thanks,
Dave

On 2/23/08, Jon Lehto <[EMAIL PROTECTED]> wrote:
>
> Dave
>
> You may want to break large docs into chunks, say by chapter or other
> logical segment.
>
> This will help in:
>   - relevance ranking - the term frequency of large docs will cause
>     uneven weighting unless the relevance calculation does log normalization
>   - finer granularity of retrieval - for example, a dictionary, thesaurus,
>     and encyclopedia probably have what you want, but how do you get at it
>     quickly?
>   - post-processing - things like highlighting can be a performance killer,
>     as the search/replace scans the entire large file for matching strings
>
>
> Jon
>
>
> -Original Message-
> From: David Thibault [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 21, 2008 7:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing very large files.
>
> All,
> A while back I was running into an issue with a Java heap out-of-memory
> error while indexing large files.  I figured out that was my own error due
> to a misconfiguration of my NetBeans memory settings.
>
> However, now that is fixed and I have stumbled upon a new error.  When
> trying to upload files which include a Solr TextField value of 32 MB or
> more in size, I get the following error (uploading with SimplePostTool):
>
>
> Solr returned an error: error reading input, returned 0
> javax.xml.stream.XMLStreamException: error reading input, returned 0
>   at com.bea.xml.stream.MXParser.fillBuf(MXParser.java:3709)
>   at com.bea.xml.stream.MXParser.more(MXParser.java:3715)
>   at com.bea.xml.stream.MXParser.nextImpl(MXParser.java:1936)
>   at com.bea.xml.stream.MXParser.next(MXParser.java:1333)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:318)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:902)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:280)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:237)
>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>   at org.apache.catalina.c

Re: will hardlinks work across partitions?

2008-02-24 Thread James Brady

Unfortunately, you cannot hard link across mount points.

Snapshooter uses "cp -lr", which, on my Linux machine at least, fails with:
cp: cannot create link `/mnt2/myuser/linktest': Invalid cross-device link
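
For the curious, the same failure is reproducible from Java; a small
sketch using the later java.nio.file API, with placeholder paths:

    import java.io.IOException;
    import java.nio.file.FileSystemException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class HardLinkCheck {

        // Hard links only work within a single filesystem, which is why a
        // snapshot cannot land on a different partition.
        public static void main(String[] args) {
            try {
                Files.createLink(Paths.get("/mnt2/myuser/linktest"),   // new link
                                 Paths.get("/home/myuser/original"));  // existing file
            } catch (FileSystemException e) {
                // On Linux this surfaces the same EXDEV "Invalid cross-device
                // link" error that "cp -lr" reports.
                System.err.println("cross-device link failed: " + e.getMessage());
            } catch (IOException e) {
                System.err.println("I/O error: " + e);
            }
        }
    }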


James

On 23 Feb 2008, at 14:34, Brian Whitman wrote:

Will the hardlink snapshot scheme work across physical disk partitions?
Can I snapshoot to a different partition than the one holding the live
solr index?




Re: will hardlinks work across partitions?

2008-02-24 Thread Leonardo Santagada

On 23 Feb 2008, at 14:34, Brian Whitman wrote:

Will the hardlink snapshot scheme work across physical disk partitions?
Can I snapshoot to a different partition than the one holding the live
solr index?




On 24/02/2008, at 15:32, James Brady wrote:


Unfortunately, you cannot hard link across mount points.

Snapshooter uses "cp -lr", which, on my Linux machine at least, fails with:
cp: cannot create link `/mnt2/myuser/linktest': Invalid cross-device link


James



http://en.wikipedia.org/wiki/Hard_link

:)

--
Leonardo Santagada





Re: Field Search

2008-02-24 Thread Chris Hostetter

: Now, if I try to search for
: title:Advertise --- I am getting the following results:
...
: I forgot to put my dismax request handler.

dismax doesn't support any special syntax in the query string; if you want
to search for words in only a single field, make sure that is the only
field in qf.
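
For example, a SolrJ sketch of such a request (the dismax handler name and
the title field are taken from this thread):

    import org.apache.solr.client.solrj.SolrQuery;

    public class TitleOnlySearch {

        // Search only the title field via dismax: plain words in q, and
        // title as the only field listed in qf.
        static SolrQuery titleSearch(String words) {
            SolrQuery query = new SolrQuery(words); // no field:value syntax here
            query.set("qt", "dismax");              // route to the dismax handler
            query.set("qf", "title");               // the one and only query field
            return query;
        }
    }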


-Hoss



Re: Problem using wildcard characters ? and *

2008-02-24 Thread Chris Hostetter

If you are using the "text" field from the example schema with the
EnglishPorterFilterFactory, the word "create" gets stemmed to "creat",
which would explain your problem.

In general, stemming and wildcard queries don't work together.

:  When I give a search for au?it (audit is the name of the block) it
: shows correct results.
:  But when I try the same thing with crea?e (create is the name of the
: block) no results are displayed.
:  Both audit and create are stored in the same place.

-Hoss



Re: DisMax deprecated?

2008-02-24 Thread Chris Hostetter

: That is one of my peeves with the Solr Javadocs.  Few of the @deprecated tags
: (if any) tell what you should be using instead.  In this particular case, the
: answer is very simple.  The class merely moved to a new package:

I've added deprecation comments to the "old" instances of the major 
request handlers.

Keep in mind: Solr is maintained entirely by volunteers; if you don't like
something about the code, feel free to submit a patch.  In the case of
deprecation messages, they sometimes get overlooked when people use
automatic refactoring tools.


-Hoss



Re: Solr 1.2 example apps

2008-02-24 Thread Chris Hostetter

: Is there a repository of sample applications for Solr, something more
: than what comes with the installer?  I'm particularly interested in
: faceted browsing.  I have located an example from developerWorks, but it
: seems not to work with 1.2.

I provided an example with some faceting in my ApacheCon talk that was
entirely "discovered" data (i.e., I just found a CSV online and created a
demo around it) ...

http://people.apache.org/~hossman/apachecon2007us/
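
If you want to experiment with faceting programmatically, a minimal SolrJ
sketch (the category field is a placeholder for whatever your schema
defines):

    import org.apache.solr.client.solrj.SolrQuery;

    public class FacetedBrowse {

        // A minimal faceted-browsing request via SolrJ.
        static SolrQuery facetedQuery(String userQuery) {
            SolrQuery query = new SolrQuery(userQuery);
            query.setFacet(true);            // turn faceting on
            query.addFacetField("category"); // count matching docs per value
            query.setFacetMinCount(1);       // hide empty buckets
            return query;
        }
    }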




-Hoss



Re: *newbie* Wildcards are not working correctly

2008-02-24 Thread Chris Hostetter

Wildcard, prefix, and fuzzy queries do not use analyzers -- they can't;
the concepts are incompatible.  So if you want to use these types of
queries, you need a really simple field type: no stemming, no word
delimiter filter, all lowercase, etc.
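
A common workaround is to normalize the wildcard term client-side before
sending it; a SolrJ sketch, assuming an all-lowercase, non-stemmed field
(the block field name comes from the earlier thread and is otherwise an
assumption):

    import java.util.Locale;

    import org.apache.solr.client.solrj.SolrQuery;

    public class WildcardHelper {

        // Wildcard terms bypass analysis, so lowercase them ourselves to
        // line up with an all-lowercase, non-stemmed field.
        static SolrQuery wildcardQuery(String term) {
            return new SolrQuery("block:" + term.toLowerCase(Locale.ROOT));
        }
    }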

: Subject: *newbie* Wildcards are not working correctly


-Hoss



Re: Threads in Solr

2008-02-24 Thread Chris Hostetter

: I was thinking maybe I could run those queries not one by one but in
: parallel, in separate threads.  But it appears that it takes longer than
: running them one by one.

: Do you have any idea why?  Do you think the idea of running those queries
: in separate threads is good in general?  Are SolrIndexSearcher and
: SimpleFacets thread safe?

SolrIndexSearcher is thread safe ... SimpleFacets should be thread safe,
but I won't swear to it off the top of my head.  Without seeing exactly
how you set up your threads, it's hard to guess ... in general, multiple
threads are only useful if you are I/O bound, or have hardware that can
take advantage of parallelization (i.e., multiple cores).

But it's also possible that things take just as long because all of your
threads wind up computing the same DocSets at the same time -- or block on
generating the same FieldCache arrays at the same time.
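
For reference, a sketch of running queries in parallel with SolrJ and an
ExecutorService; whether it actually helps depends on the contention
described above (the URL and query strings are placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ParallelQueries {

        // Run several queries concurrently against one Solr instance.
        public static void main(String[] args) throws Exception {
            final SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Future<QueryResponse>> futures = new ArrayList<Future<QueryResponse>>();
            for (final String q : new String[] {"solr", "lucene", "facet"}) {
                futures.add(pool.submit(new Callable<QueryResponse>() {
                    public QueryResponse call() throws Exception {
                        return server.query(new SolrQuery(q));
                    }
                }));
            }
            for (Future<QueryResponse> f : futures) {
                System.out.println(f.get().getResults().getNumFound());
            }
            pool.shutdown();
        }
    }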



-Hoss



Re: Search terms in the result

2008-02-24 Thread Chris Hostetter

: clarification about the XML format returned by a search query: in
: fact, I cannot find any official document describing the precise
: format of the document returned.

That's because no one has ever really taken much time to document it ...
in general, the XML format is considered fairly "self-describing" ... the
interesting variations come about depending on what request handler you
use (the XML format is extremely generic; request handlers can put almost
any data into it, as long as the data can be described in terms of simple
data structures).


: Particularly, I'm interested in understanding if search terms of the
: query would be returned in the result, as happens in Lucene.

I'm not sure what you mean by "as happens in Lucene", but there is a core
request param that can be used to force the response writer to include
input params in the response: "echoParams" ... you can also use
"debugQuery" to get debugging-oriented information in the response (that
includes info on the query executed).

http://wiki.apache.org/solr/CoreQueryParameters
http://wiki.apache.org/solr/CommonQueryParameters
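
A small SolrJ sketch wiring up both params:

    import org.apache.solr.client.solrj.SolrQuery;

    public class EchoParamsExample {

        // Echo the input params and include debug info in the response so
        // the original query terms come back alongside the results.
        static SolrQuery withEcho(String userQuery) {
            SolrQuery query = new SolrQuery(userQuery);
            query.set("echoParams", "explicit"); // echo explicitly-set params
            query.set("debugQuery", "true");     // parsed query + scoring detail
            return query;
        }
    }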




-Hoss