Re: Lucene 2.9.x vs 3.x

2011-01-16 Thread Simon Willnauer
On Sat, Jan 15, 2011 at 2:19 PM, Salman Akram wrote:
> Hi,
>
> Solr 1.4.1 uses Lucene 2.9.3 by default (I think so). I have a few questions:
>
> Are there any major performance (or other) improvements in Lucene
> 3.0.3/Lucene 2.9.4?

You can see all the major changes here:
http://lucene.apache.org/java/3_0_3/changes/Changes.html

>
> Does 3.x have major compatibility issues moving from 2.9.x?
I assume you mean 3.0.x instead of 3.x? The answer is no - nothing
major! It's mainly the cutover to Java 5: generics, varargs, etc.
>
> Will SOLR 1.4.1 build work fine with Lucene 3.0.3?

Phew... I am not sure, but it could work. That should be very easy to
try: just get the sources here
http://svn.apache.org/repos/asf/lucene/solr/tags/release-1.4.1/
swap in the new Lucene jar, and run a test build.

simon
>
> Thanks!
>
> --
> Regards,
>
> Salman Akram
> Senior Software Engineer - Tech Lead
> 80-A, Abu Bakar Block, Garden Town, Pakistan
> Cell: +92-321-4391210
>


TVF file

2011-01-16 Thread Salman Akram
Hi,

From my understanding, the TVF file stores the term vectors
(positions/offsets), so if no field has Field.TermVector set (the
default is NO), it shouldn't be created, right?
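As far as I know, a field only gets term vectors if the schema asks for them explicitly; something like the following (a sketch using the Solr 1.4 schema attribute names - none of my fields have any of these):

```xml
<!-- Term vectors must be requested per field; field name here is illustrative. -->
<field name="content" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```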

I have an index created through Solr in which no field sets a value for
TermVectors, so by default they shouldn't be saved. All the fields are
either String or Text, with just the indexed and stored attributes set
to true. String fields have omitNorms=true as well.

Even Luke doesn't show the V (term vector) flag, but I have a big TVF
file in my index. It's almost 30% of the total index (around 60% is the
PRX positions file).

Also, Luke shows the 'f' (omitTF) flag for String fields but not for Text fields.

Any ideas what's going on? Thanks!

-- 
Regards,

Salman Akram
Senior Software Engineer - Tech Lead
80-A, Abu Bakar Block, Garden Town, Pakistan
Cell: +92-321-4391210


Re: TVF file

2011-01-16 Thread Salman Akram
Some more info, copied from Luke; below is what it says for...

Text Fields --> stored/uncompressed,indexed,tokenized
String Fields --> stored/uncompressed,indexed,omitTermFreqAndPositions

The main contents field is not stored, so it doesn't show up in Luke, but it
is analyzed and tokenized for searching.

On Sun, Jan 16, 2011 at 3:50 PM, Salman Akram <salman.ak...@northbaysolutions.net> wrote:
> [...]



-- 
Regards,

Salman Akram
Senior Software Engineer - Tech Lead
80-A, Abu Bakar Block, Garden Town, Pakistan
Cell: +92-321-4391210


Replication snapshot, tar says "file changed as we read it"

2011-01-16 Thread Andrew Clegg
(Many apologies if this appears twice, I tried to send it via Nabble
first but it seems to have got stuck, and is fairly urgent/serious.)

Hi,

I'm trying to use the replication handler to take snapshots, then
archive them and ship them off-site.

Just now I got a message from tar that worried me:

tar: snapshot.20110115035710/_70b.tis: file changed as we read it
tar: snapshot.20110115035710: file changed as we read it

The relevant bit of script that does it looks like this (error
checking removed):

curl 'http://localhost:8983/solr/core1/replication?command=backup'
PREFIX=''
if [[ "$START_TIME" =~ 'Sun' ]]
then
PREFIX='weekly.'
fi
cd $SOLR_DATA_DIR
for snapshot in `ls -d -1 snapshot.*`
do
TARGET="${LOCAL_BACKUP_DIR}/${PREFIX}${snapshot}.tar.bz2"
echo "Archiving ${snapshot} into $TARGET"
tar jcf $TARGET $snapshot
echo "Deleting ${snapshot}"
rm -rf $snapshot
done

I was under the impression that files in the snapshot were guaranteed
to never change, right? Otherwise what's the point of the replication
backup command?

I tried putting in a 30-second sleep after the snapshot and before the
tar, but the error occurred again anyway.
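In case it helps anyone else, GNU tar reserves exit status 1 for warnings like "file changed as we read it" and 2 for fatal errors, so one workaround sketch is to tolerate status 1 but fail on anything worse (the paths below are made up, not my real script):

```shell
# Sketch: treat GNU tar exit status 1 ("file changed as we read it")
# as a warning, but fail on status 2 (fatal error). Paths are illustrative.
snapshot=$(mktemp -d /tmp/snapshot.XXXXXX)
echo "segment data" > "$snapshot/_70b.tis"
TARGET="$snapshot.tar.bz2"

tar jcf "$TARGET" -C "$(dirname "$snapshot")" "$(basename "$snapshot")"
status=$?
if [ "$status" -le 1 ]; then
    echo "archive ok: $TARGET"
else
    echo "tar failed with status $status" >&2
    exit "$status"
fi
```

This at least distinguishes a snapshot that was merely touched mid-read from a genuinely failed archive.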

There was a message from Lance N. with a similar error in it, years ago:

http://www.mail-archive.com/solr-user@lucene.apache.org/msg06104.html

but that would be pre-replication anyway, right?

This is on Ubuntu 10.10 using java 1.6.0_22 and Solr 1.4.0.

Thanks,

Andrew.


-- 

:: http://biotext.org.uk/ :: http://twitter.com/andrew_clegg/ ::


Re: TVF file

2011-01-16 Thread Otis Gospodnetic
Is it possible that the tvf file you are looking at is old (i.e. not part of 
your active index)?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Salman Akram
> Sent: Sun, January 16, 2011 6:17:23 AM
> Subject: Re: TVF file
> [...]


Re: TVF file

2011-01-16 Thread Salman Akram
Nope. I optimized it with the standard file format and cleaned up the index
dir through Luke. It adds up to the same total size when I optimize it with
the compound file format.

On Sun, Jan 16, 2011 at 5:46 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> [...]



-- 
Regards,

Salman Akram


Re: Replication snapshot, tar says "file changed as we read it"

2011-01-16 Thread Andrew Clegg
PS: one other point I didn't mention is that this server has a very
fast autocommit limit (2 seconds max time).

But I don't know if this is relevant -- I thought the files in the
snapshot wouldn't be committed to again. Please correct me if this is
a huge misunderstanding.

On 16 January 2011 12:30, Andrew Clegg wrote:
> [...]



-- 

:: http://biotext.org.uk/ :: http://twitter.com/andrew_clegg/ ::


Re: TVF file

2011-01-16 Thread Otis Gospodnetic
Hm, want to email the index dir listing (ls -lah) + the field type and field 
definitions from your schema.xml?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Salman Akram
> Sent: Sun, January 16, 2011 7:51:15 AM
> Subject: Re: TVF file
> [...]


Solr 1.4.1 and carrot2 clustering

2011-01-16 Thread Patrick Pekczynski

Dear all,

I really enjoy using Solr so far. Over the last few days I tried to activate the
ClusteringComponent in Solr as described here

http://wiki.apache.org/solr/ClusteringComponent

and copied all the relevant Java libraries into the WEB-INF/lib folder of my
Tomcat installation of Solr.

But every time I try to issue a request to my Solr server using

http://localhost:9005/apache-solr-1.4.1/job0/select?q=*:*&fl=title,score,url&start=0&rows=100&indent=on&clustering=true

I get the following error message:

java.lang.NoClassDefFoundError: bak/pcj/set/IntSet
	at org.carrot2.text.preprocessing.PreprocessingPipeline.<init>(PreprocessingPipeline.java:47)
	at org.carrot2.clustering.lingo.LingoClusteringAlgorithm.<init>(LingoClusteringAlgorithm.java:108)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at java.lang.Class.newInstance0(Class.java:355)
	at java.lang.Class.newInstance(Class.java:308)
	at org.carrot2.util.pool.SoftUnboundedPool.borrowObject(SoftUnboundedPool.java:114)
	at org.carrot2.core.CachingController.borrowProcessingComponent(CachingController.java:329)



I have therefore downloaded the corresponding pcj-1.2.jar, which provides the
interface bak.pcj.set.IntSet, and put it in the WEB-INF/lib folder as well.

But I still keep getting this error message, even though the interface MUST
be on the classpath now.

Can anyone help me out with this one? I'm really eager to give this clustering 
extension a try from within Solr using the 1.4.1 version that I have already
running on my server.

Thanks for a brief feedback.

Best regards,

Patrick


Re: Solr 1.4.1 and carrot2 clustering

2011-01-16 Thread Otis Gospodnetic
Patrick,

I went to http://search-lucene.com/solr and searched for: pcj

Hit # 2 shows this response:
  http://search-lucene.com/m/SUTgW1ELRsZ


Note where pcj & friends should be placed.  I hope this fixes the problem.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Patrick Pekczynski
> Sent: Sun, January 16, 2011 8:57:41 AM
> Subject: Solr 1.4.1 and carrot2 clustering
> [...]


Re: TVF file

2011-01-16 Thread Salman Akram
Please see below the dir listing and the relevant part of the schema file (I
have removed the field names for obvious reasons).

Also, regarding the .frq file: why exactly is it needed? Is it required for
phrase searching too (I am not using highlighting or MoreLikeThis on this
index)? And is it not written if all fields use omitTF?

Thanks a lot!

--Dir Listing--
01/16/2011  06:05 AM  .
01/16/2011  06:05 AM  ..
01/15/2011  03:58 PM  log
04/22/2010  12:42 AM   549 luke.jnlp
01/16/2011  04:58 AM20 segments.gen
01/16/2011  04:58 AM   287 segments_5hl
01/16/2011  02:17 AM 4,760,716,827 _36w.fdt
01/16/2011  02:17 AM   107,732,836 _36w.fdx
01/16/2011  02:15 AM 4,032 _36w.fnm
01/16/2011  04:36 AM25,221,109,245 _36w.frq
01/16/2011  04:38 AM 4,457,445,928 _36w.nrm
01/16/2011  04:36 AM   126,866,227,056 _36w.prx
01/16/2011  04:36 AM22,510,915 _36w.tii
01/16/2011  04:36 AM 1,635,096,862 _36w.tis
01/16/2011  04:58 AM18,341,750 _36w.tvd
01/16/2011  04:58 AM78,450,397,739 _36w.tvf
01/16/2011  04:58 AM   215,465,668 _36w.tvx
  14 File(s) 241,755,049,714 bytes
   3 Dir(s)  1,072,112,025,600 bytes free


-Schema File--

[schema.xml field type and field definitions stripped by the mail archiver]

On Sun, Jan 16, 2011 at 6:52 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> [...]



-- 
Regards,

Salman Akram


Single value vs multi value setting in tokenized field

2011-01-16 Thread kenf_nc

I have to support both general searches (free form text) and directed
searches (field:val field2:val). To do the general search I have a field
defined as:
  [field definition stripped by the mail archiver]
and several copyField commands like:
  [copyField definitions stripped by the mail archiver]
Note that tags and features are multi-value themselves. So after indexing I
have a 'general text' bucket with numerous (usually in the 20 to 30 range)
rows of strings. 

My question is would it be better, for indexing speed and search
speed/quality, to concatenate all the text into a single string and store it
in "content" as one value? What are the implications on search results? If
Description is say a couple paragraphs of text and tags are
"Cuisine","Italian","Romantic" would the tags get lost in the muck of the
bigger text?

One thing to keep in mind. I'm sure some of you are going to say 'Dismax'
and in some situations I will, but my index has numerous document types that
have vastly different schemas. Another document may not have "title" and
"features" but might have "recommendations" and "location". In a general
query it wouldn't make sense to include every possible field in a dismax
query, I don't even know what all the fields are, new ones are added all the
time.

Has anyone got advice, suggestions on this topic (blending directed search
with general search)? 
Thanks in advance,
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Single-value-vs-multi-value-setting-in-tokenized-field-tp2268635p2268635.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problem with Tika and ExtractingRequestHandler (How to from lucidimagination)

2011-01-16 Thread Lance Norskog
You need to add another parameter which defines the 'id' field. 'id'
is required - it must be unique for every document. Usually you can pick
the filename.
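A hedged sketch of what that could look like with the 1.4 ExtractingRequestHandler's literal.<field> parameters (the host and file name are taken from the quoted message below; the command is printed rather than executed here so it can be reviewed first):

```shell
# Sketch (assumption: literal.id maps a request parameter onto the
# required 'id' field; using the file name as the unique id, as above).
FILE="test.txt"
URL="http://192.168.105.66:8983/solr/update/extract?literal.id=${FILE}&commit=true"
echo "curl \"$URL\" -F \"myfile=@${FILE}\""
```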

Lance

On Fri, Jan 14, 2011 at 3:59 AM, Jörg Agatz  wrote:
> OK, on the 4th test it now works? OK... I don't know... it works. But now I
> have another problem: I can't send content to the server.
>
> When I try to send content to Solr I get:
>
> Error 400
> HTTP ERROR: 400
> Document [null] missing required field: id
> RequestURI=/solr/update/extract
> Powered by Jetty://
>
>
> I do:
> curl "http://192.168.105.66:8983/solr/update/extract?ext.idx.attr=true&ext.def.fl=text" \
>   -F "myfile=@test.txt"
>
> some ideas?
>



-- 
Lance Norskog
goks...@gmail.com


Re: solrj & http client 4

2011-01-16 Thread Stevo Slavić
In those POMs, not all modules have an explicit version and groupId, which
is bad practice. Also, some parent references contain an invalid default
(../pom.xml) relativePath to their parent pom.xml. The paths to the build
directories look suspicious to me. The lucene-bdb module references a
missing library, com.sleepycat:berkeleydb:jar:4.7.25 - I see
lib/db-4.7.25.jar; if it's supposed to be installed in the local
repository then a pom would be handy.
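If installing it locally is the route taken, a sketch of the command (coordinates copied from the missing-dependency message above; printed rather than executed here, and meant to be run from the lucene-bdb module directory):

```shell
# Sketch: install lib/db-4.7.25.jar under the coordinates the pom expects.
JAR="lib/db-4.7.25.jar"
echo mvn install:install-file \
    -DgroupId=com.sleepycat -DartifactId=berkeleydb \
    -Dversion=4.7.25 -Dpackaging=jar -Dfile="$JAR"
```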

The wiki page http://wiki.apache.org/solr/HowToContribute references this
mail, http://markmail.org/message/yb5qgeamosvdscao, but the files
(.classpath) in the archives attached to that email are very outdated. The
eclipse target in the base ant build script generates .classpath and
.settings, so it seems the wiki page is outdated too.

Steps to get Lucene/Solr trunk in eclipse IDE for me were:
1) In SVN Repository Exploring perspective add repository with
http://svn.apache.org/repos/asf/lucene/dev
2) Right-click trunk and choose "Find/Check Out As..."
3) Choose "Check out as a project configured using the New Project Wizard"
4) Choose "Java Project" wizard
5) Enter lucene-solr as project name, make sure Java 1.6 is selected
execution environment and "Create separate folders for sources and
class files" is selected layout, and click Finish
6) After checkout is complete, delete src directory that eclipse
created in project root directory
7) Turn on ant view (Window --> Show View --> Ant)
8) In ant view add build.xml from checked-out trunk root and double
click eclipse target
9) Once ant completes right-click project and choose refresh

Regards,
Stevo.

On Wed, Dec 22, 2010 at 6:29 PM, Steven A Rowe  wrote:
> Stevo,
>
> You may be interested in LUCENE-2657, which provides full POMs
> for Lucene/Solr trunk.
>
> I don't use Eclipse, but I think it can use POMs to bootstrap project 
> configuration.  (I know IntelliJ can do this.)
>
> Steve
>
>> -Original Message-
>> From: Stevo Slavić [mailto:ssla...@gmail.com]
>> Sent: Wednesday, December 22, 2010 9:17 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: solrj & http client 4
>>
>> Tried to check out lucene/solr and set up projects and classpath in
>> eclipse - there seems to be a circular dependency between modules. This
>> is not possible/allowed in a maven-built project and would require
>> refactoring.
>>
>> Regards,
>> Stevo.
>>
>> On Wed, Dec 8, 2010 at 1:42 PM, Stevo Slavić  wrote:
>>
>> > OK, thanks. Can't promise anything, but would love to contribute. First
>> > impression on the source code - ant is used as build tool, wish it was
>> > maven. If it was maven then
>> > https://issues.apache.org/jira/browse/SOLR-1218 would be trivial or
>> > wouldn't exist in the first place.
>> >
>> > Regards,
>> > Stevo.
>> >
>> >
>> > On Wed, Dec 8, 2010 at 10:25 AM, Chantal Ackermann <
>> > chantal.ackerm...@btelligent.de> wrote:
>> >
>> >> SOLR-2020 addresses upgrading to HttpComponents (from HttpClient). I
>> >> have had no time to work more on it yet, though. I also don't have that
>> >> much experience with the new version, so any help is much appreciated.
>> >>
>> >> Cheers,
>> >> Chantal
>> >>
>> >> On Tue, 2010-12-07 at 18:35 +0100, Yonik Seeley wrote:
>> >> > On Tue, Dec 7, 2010 at 12:32 PM, Stevo Slavić 
>> >> wrote:
>> >> > > Hello solr users and developers,
>> >> > >
>> >> > > Are there any plans to upgraded http client dependency in solrj
>> from
>> >> 3.x to
>> >> > > 4.x?
>> >> >
>> >> > I'd certainly be for moving to 4.x (and I think everyone else would
>> >> too).
>> >> > The issue is that it's not a drop-in replacement, so someone needs to
>> >> > do the work.
>> >> >
>> >> > -Yonik
>> >> > http://www.lucidimagination.com
>> >> >
>> >> > > Found this ticket - judging by the comments in it, an upgrade might
>> >> > > help fix the issue. I have a project in jar hell, getting different
>> >> > > versions of http client as a transitive dependency...
>> >> > >
>> >> > > Regards,
>> >> > > Stevo.
>> >>
>> >>
>> >>
>> >>
>> >
>


Re: TVF file

2011-01-16 Thread Otis Gospodnetic
Hm, this is a mystery to me - I don't see anything that would turn on Term 
Vectors...

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Salman Akram 
> To: solr-user@lucene.apache.org
> Sent: Sun, January 16, 2011 2:26:53 PM
> Subject: Re: TVF file
> 
> Please see below the dir listing and relevant part of schema file (I  have
> removed the name part from fields for obvious reasons).
> 
> Also  regarding .frq file why exactly is it needed? Is it required in  phrase
> searching (I am not using highlighting or MoreLikeThis on this index  file)
> too? and this is not made if all fields are using omitTF?
> 
> Thanks  alot!
> 
> --Dir Listing--
> 01/16/2011   06:05 AM   .
> 01/16/2011  06:05 AM   ..
> 01/15/2011  03:58 PM   log
> 04/22/2010  12:42 AM549 luke.jnlp
> 01/16/2011  04:58 AM 20  segments.gen
> 01/16/2011  04:58 AM287 segments_5hl
> 01/16/2011  02:17 AM  4,760,716,827 _36w.fdt
> 01/16/2011  02:17 AM107,732,836 _36w.fdx
> 01/16/2011  02:15 AM  4,032 _36w.fnm
> 01/16/2011  04:36 AM 25,221,109,245 _36w.frq
> 01/16/2011  04:38 AM 4,457,445,928  _36w.nrm
> 01/16/2011  04:36 AM   126,866,227,056  _36w.prx
> 01/16/2011  04:36 AM22,510,915  _36w.tii
> 01/16/2011  04:36 AM 1,635,096,862  _36w.tis
> 01/16/2011  04:58 AM18,341,750  _36w.tvd
> 01/16/2011  04:58 AM78,450,397,739  _36w.tvf
> 01/16/2011  04:58 AM   215,465,668  _36w.tvx
>   14 File(s)  241,755,049,714 bytes
>3  Dir(s)  1,072,112,025,600 bytes free
> 
> 
> -Schema  File--
> 
> F:\IndexingAppsRealTime\index>
> 
> 
>  
>  omitNorms="true"/>
>
>  
>luceneMatchVersion="LUCENE_29"/>
>
>
>-->
>  
>   
>  
>   
>
>  
> 
> 
>  
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Sun, Jan 16, 2011 at 6:52 PM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
> 
> > Hm, want to email the index dir listing (ls -lah) + the field type and
> > field definitions from your schema.xml?
> >
> > Otis
> > 
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> > - Original Message 
> > > From: Salman Akram
> > > To: solr-user@lucene.apache.org
> > > Sent: Sun, January 16, 2011 7:51:15 AM
> > > Subject: Re: TVF file
> > >
> > > Nope. I optimized it with the Standard File Format and cleaned up the
> > > index dir through Luke. It adds up to the total size when I optimized
> > > it with the Compound File Format.
> > >
> > > On Sun, Jan 16, 2011 at 5:46 PM, Otis Gospodnetic <
> > > otis_gospodne...@yahoo.com> wrote:
> > >
> > > > Is it possible that the tvf file you are looking at is old (i.e. not
> > > > part of your active index)?
> > > >
> > > > Otis
> > > > 
> > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > > Lucene ecosystem search :: http://search-lucene.com/
> > > >
> > > > - Original Message 
> > > > > From: Salman Akram
> > > > > To: solr-user@lucene.apache.org
> > > > > Sent: Sun, January 16, 2011 6:17:23 AM
> > > > > Subject: Re: TVF file
> > > > >
> > > > > Some more info: I copied it from Luke, and below is what it says
> > > > > for...
> > > > >
> > > > > Text fields   --> stored/uncompressed,indexed,tokenized
> > > > > String fields --> stored/uncompressed,indexed,omitTermFreqAndPositions
> > > > >
> > > > > The main contents field is not stored, so it doesn't show up in
> > > > > Luke, but it is analyzed and tokenized for searching.
> > > > >
> > > > > On Sun, Jan 16, 2011 at 3:50 PM, Salman Akram <
> > > > > salman.ak...@northbaysolutions.net> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > From my understanding the TVF file stores the term vectors
> > > > > > (positions/offsets), so if no field has Field.TermVector set
> > > > > > (the default is NO) it shouldn't be created, right?
> > > > > >
> > > > > > I have an index created through SOLR on which no field had any
> > > > > > value for TermVectors, so by default they shouldn't be saved.
> > > > > > All the fields are either String or Text. All fields have just
> > > > > > the indexed and stored attributes set to true. String fields
> > > > > > have omitNorms = true as well.
> > > > > >
> > >

Re: Single value vs multi value setting in tokenized field

2011-01-16 Thread Otis Gospodnetic
Hi,

I'm not a big fan of putting all fields in a single field (bye bye dismax, as 
you say), but if you are asking whether doing it via copyField or "directly" 
will make a difference - not really.
If you do it with copyField, you still get to keep your individual fields, 
which could serve you down the road, but the index is bigger because you have 
duplicate data for those fields.
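
As a sketch, the copyField setup being discussed looks like this in
schema.xml (the field names here are illustrative, not taken from anyone's
actual schema):

```xml
<!-- Each source field stays independently searchable on its own... -->
<field name="title" type="text" indexed="true" stored="true"/>
<field name="tags"  type="text" indexed="true" stored="true" multiValued="true"/>

<!-- ...and is also copied into one catch-all field for general search.
     The copied data is indexed a second time, which is why the index grows. -->
<field name="content" type="text" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="title" dest="content"/>
<copyField source="tags"  dest="content"/>
```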

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: kenf_nc 
> To: solr-user@lucene.apache.org
> Sent: Sun, January 16, 2011 3:47:56 PM
> Subject: Single value vs multi value setting in tokenized field
> 
> 
> I have to support both general searches (free form text) and directed
> searches (field:val field2:val). To do the general search I have a field
> defined as:
> [field definition stripped by the archive; surviving attributes:]
> termVectors="true" multiValued="true" />
> and several copyField commands like:
> [copyField definitions stripped by the archive]
> Note that tags and features are multi-value themselves. So after indexing I
> have a 'general text' bucket with numerous (usually in the 20 to 30 range)
> rows of strings.
> 
> My question is: would it be better, for indexing speed and search
> speed/quality, to concatenate all the text into a single string and store it
> in "content" as one value? What are the implications for search results? If
> Description is, say, a couple of paragraphs of text and the tags are
> "Cuisine", "Italian", "Romantic", would the tags get lost in the muck of the
> bigger text?
> 
> One thing to keep in mind: I'm sure some of you are going to say 'Dismax',
> and in some situations I will, but my index has numerous document types that
> have vastly different schemas. Another document may not have "title" and
> "features" but might have "recommendations" and "location". In a general
> query it wouldn't make sense to include every possible field in a dismax
> query; I don't even know what all the fields are, new ones are added all the
> time.
> 
> Has anyone got advice or suggestions on this topic (blending directed search
> with general search)?
> Thanks in advance,
> Ken
> -- 
> View this message in context:
> http://lucene.472066.n3.nabble.com/Single-value-vs-multi-value-setting-in-tokenized-field-tp2268635p2268635.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


Re: TVF file

2011-01-16 Thread Salman Akram
Well, anyway, thanks for the help.

Also, can you please reply to this question about the .frq file (since that's
quite big too):

"Also regarding .frq file why exactly is it needed? Is it required in phrase
searching (I am not using highlighting or MoreLikeThis on this index file)
too? and this is not made if all fields are using omitTF?"

On Mon, Jan 17, 2011 at 10:18 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hm, this is a mystery to me - I don't see anything that would turn on Term
> Vectors...
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>

Regardind subfield Enquiry

2011-01-16 Thread Isha Garg

Hi,
Can we define a subfield of a field in the schema? For example, if I
define a field name=Person, can I define the subfields Male and Female for
the Person field? Please tell me how to implement this type of requirement
in schema.xml.


Re: Regardind subfield Enquiry

2011-01-16 Thread Grijesh

Define a new field, gender, and at query time use the filter
gender:male or gender:female.
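
A minimal sketch of that in schema.xml (the field name is an assumption):

```xml
<!-- One plain field per attribute, rather than a "subfield" of Person. -->
<field name="gender" type="string" indexed="true" stored="true"/>
```

At query time, restrict results with a filter query such as fq=gender:male.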

-
Thanx:
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Regardind-subfield-Enquiry-tp2270574p2270636.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: spell suggest response

2011-01-16 Thread Grijesh

I have not implemented it on any public site.
I have just seen Google's auto-suggest list, which contains both
auto-suggestions and spell-check corrections.

For example, just type "jave" on Google and it will give suggestions like:

java script
java se
jave staffing
java string
jave s

So I was trying to implement that type of combined autosuggest containing
corrected words.
I used the spellcheck component together with an autosuggest implementation
(using the terms component or an NGramFilter) and got both the autosuggest
results and the corrected words from the spellcheck component.

-
Thanx:
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/spell-suggest-response-tp2233409p2270669.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: spell suggest response

2011-01-16 Thread satya swaroop
Hi Grijesh,
As you said, you have implemented this. Can you briefly tell us how
you did it?

Regards,
satya


Re: Regardind subfield Enquiry

2011-01-16 Thread Isha Garg

On Monday 17 January 2011 12:10 PM, Grijesh wrote:

Define a new field gender for that and for query use filter
gender:male/female

-
Thanx:
Grijesh
   
Actually, I want to ask about the subfield concept in the schema. For
example, if I have a field name=clause, can it be divided into the
subfields Subject, Verb, and Object?


Re: spell suggest response

2011-01-16 Thread Grijesh

I have configured a request handler for autosuggest (using the terms
component) with the spellcheck component as the last component.

I query it with terms.prefix and spellcheck=true.

It gives the output from the terms component and also the output from the
spellcheck component. I don't have the exact handler configuration at hand.
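
A hypothetical sketch of such a handler in solrconfig.xml, since the exact
configuration is not at hand (the wiring follows the standard TermsComponent
example; the suggest field name is an assumption, and a spellcheck
searchComponent is assumed to be defined as in the example solrconfig.xml):

```xml
<!-- Hypothetical handler combining term suggestions and spell checking -->
<searchComponent name="termsComp" class="solr.TermsComponent"/>

<requestHandler name="/autosuggest" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <str name="terms.fl">suggest_field</str>
    <bool name="spellcheck">true</bool>
  </lst>
  <!-- terms runs first; spellcheck is the last component in the chain -->
  <arr name="components">
    <str>termsComp</str>
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

Querying it with something like
/autosuggest?terms.prefix=jav&spellcheck.q=jave&spellcheck=true should then
return both the terms suggestions and the spellcheck corrections in one
response.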


-
Thanx:
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/spell-suggest-response-tp2233409p2270743.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Regardind subfield Enquiry

2011-01-16 Thread Grijesh

In the schema you can define as many fields as you want; there is no concept
of a subfield. If you need to add many values, use a multiValued field for
that.
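
For example, a multiValued declaration in schema.xml (the field name is
illustrative):

```xml
<!-- A multiValued field accepts any number of values per document. -->
<field name="roles" type="string" indexed="true" stored="true"
       multiValued="true"/>
```

A document can then be indexed with several values for this field, each of
them individually searchable.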

-
Thanx:
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Regardind-subfield-Enquiry-tp2270574p2270869.html
Sent from the Solr - User mailing list archive at Nabble.com.