How to use Solr in my project

2013-12-26 Thread Fatima Issawi
Hello,

First off, I apologize if this was sent twice. I was having issues subscribing 
to the list.

I'm a complete noob in Solr (and indexing), so I'm hoping someone can help me 
figure out how to implement Solr in my project. I have gone through some 
tutorials online and I was able to import and query text in some Arabic PDF 
documents.

We have some scans of historical handwritten Arabic documents whose text will 
be extracted into a database (or PDF). We would like the user to be able to 
search the documents for text, then have the scanned image show up in a viewer 
with the text highlighted. I would like to use Solr to index the text in the 
documents, but I'm unsure how to store and retrieve the "word location" in Solr 
(the area of the image that needs to be highlighted).

Do I index and store the full document in Solr? How do I link the "search 
term" to the "word location" on the page?
The only way I can figure out how to do this involves querying the database for 
the "word" and "location" after querying Solr for the search term, but doesn't 
that defeat the purpose of using Solr?

I would really appreciate help figuring this out.

Thank you,
Fatima



RE: How to use Solr in my project

2013-12-26 Thread Fatima Issawi
Hi,

I should clarify. We have another application extracting the text from the 
document. The full text from each document will be stored in a database either 
at the document level or page level (this hasn't been decided yet). We will 
also be storing word location of each word on the page in the database. 

What I'm having problems with is deciding on the schema. We want a user to be 
able to search for a word and get a list of the documents that contain it, 
along with the location of the word in each document. When the user selects a 
search result, we want the scanned picture to show that word highlighted on 
the page. 

I want to index the document using Solr, but I'm having trouble figuring out 
how to design the schema to return that "word location" of a search term on the 
scanned picture in order to highlight it.

Does this make more sense?

Fatima

-Original Message-
From: Gora Mohanty [mailto:g...@mimirtech.com] 
Sent: Thursday, December 26, 2013 1:00 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Solr in my project

On 26 December 2013 10:54, Fatima Issawi  wrote:
> Hello,
>
> First off, I apologize if this was sent twice. I was having issues 
> subscribing to the list.
>
> I'm a complete noob in Solr (and indexing), so I'm hoping someone can help me 
> figure out how to implement Solr in my project. I have gone through some 
> tutorials online and I was able to import and query text in some Arabic PDF 
> documents.
>
> We have some scans of Historical Handwritten Arabic documents that will have 
> text extracted into a database (or PDF). We would like the user to be able to 
> search the document for text, then have the scanned image show up in a viewer 
> with the text highlighted.

This will not work for scanned images which do not actually contain the text. 
If you have the text of the documents, the best that you can do is break the 
text into pages corresponding to the scanned images, and index into Solr the 
text from the pages and the scanned image that should be linked to the text. 
For a user search, you will need to show the scanned image for the entire page: 
Highlighting of the search term in an image is not possible without optical 
character recognition (OCR).

Similarly, if you are indexing from PDFs, you will need to ensure that they 
contain text, and not just images.

Regards,
Gora
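
The page-level layout described in this reply can be sketched roughly as follows. This is only an illustration: the field names (`document_id`, `pagenumber`, `pagetext`, `image_path`) and the image path pattern are assumptions, not something specified in the thread, and the resulting dicts would still have to be sent to Solr with whatever client you use.

```python
import json


def build_page_docs(document_id, page_texts,
                    image_pattern="scans/{doc}/page_{page:04d}.png"):
    """Return one Solr-style document (a plain dict) per scanned page.

    page_texts is the extracted text, already split so that each entry
    corresponds to one scanned image. image_pattern is a hypothetical
    naming scheme for the image files.
    """
    docs = []
    for page_number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{document_id}_p{page_number}",  # unique per page
            "document_id": document_id,
            "pagenumber": page_number,
            "pagetext": text,                        # the searchable text
            "image_path": image_pattern.format(doc=document_id,
                                               page=page_number),
        })
    return docs


if __name__ == "__main__":
    pages = ["نص الصفحة الأولى", "نص الصفحة الثانية"]
    print(json.dumps(build_page_docs("ms001", pages),
                     ensure_ascii=False, indent=2))
```

A search hit on `pagetext` then gives back the page number and image path, which is enough to display the whole scanned page for that hit.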


RE: How to use Solr in my project

2013-12-28 Thread Fatima Issawi
> What do you mean by "word location"? The number on the page? What
> purpose would this serve?

I mean the (x, y) coordinates of the word on the page. We want to be able to 
highlight the image of the word that was extracted from the text.

> I think that you might be confusing things:
> * If you have the full text, you can highlight where the word was found.
>   Solr highlighting handles this for you, and there is no need to store
>   the word location
> * You can have different images (presumably, individual scanned pages)
>   linked to different sections of text, and show the entire image.
>   Highlighting in the image is not possible, unless by "word location"
>   you mean the (x, y) coordinates of the word on the page. Even then:
>   - It will be prohibitively expensive to store the location of every
>     word in every image for a large number of documents
>   - Some image processing will be required to handle the highlighting
>     after the scanned image is retrieved

We will have the full text stored, but we want to highlight the text in the 
original image. I expect to process the image after retrieval. We do plan on 
storing the (x, y) coordinates of the words in a database - I suspected that it 
would be too expensive to store them in Solr. I guess I'm still confused about 
how to use Solr to index the document, but then retrieve the (x, y) coordinates 
of the search term from the database. Is this possible? If so, can you give 
an example of how this can be done?

Thank you!


RE: How to use Solr in my project

2013-12-28 Thread Fatima Issawi
Hello,

Our pages are images of handwritten text in Arabic so OCR'ing is not possible. 
We will be extracting the text during pre-processing and storing the words and 
(x, y) coordinates in a database. Would your process apply to our images?

> Step 1:
> For sending the extracted text content from a text PDF to Solr, use a
> low-level PDF converter such as poppler-utils (pdftotext or pdftohtml) to
> correctly get the coordinates and page number of each word. Store this in a
> separate file as a word map. The word map will map each word's occurrence
> number to its page and coordinates.

Can we generate a word map manually? Is it used by Solr, and does it require a 
specific format?

> Step 2:
> The Solr highlighter needs to be changed to return the words and their
> occurrence numbers in the text document, rather than the character offsets
> for each hit.

How is this done? I read the solr highlighting wiki, but don't see how this can 
be done.

> Step 3:
> Combine the Solr output with the word map created in step 1, and the PDF page
> and coordinates can be generated for the original PDF document, which can
> then be highlighted by any viewer.

Can I get more information about how to do this?

Thanks!
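
Steps 1 and 3 quoted above can be sketched in plain Python. Since step 1 describes the word map as a separate file, this sketch treats it as ordinary application data rather than anything Solr reads, so it could be built from the manually extracted words and (x, y) coordinates. The function names, the tuple layout, and the idea that hits arrive as (word, occurrence number) pairs are all assumptions made for illustration; step 2 (changing the highlighter) is not covered here.

```python
from collections import defaultdict


def build_word_map(extracted_words):
    """Build a word map from words listed in reading order.

    extracted_words: iterable of (word, page, (x1, y1, x2, y2)).
    Returns {word: [(page, box), ...]}, where the list index is the
    occurrence number of that word within the document.
    """
    word_map = defaultdict(list)
    for word, page, box in extracted_words:
        word_map[word].append((page, box))
    return dict(word_map)


def locate_hits(word_map, hits):
    """Turn (word, occurrence_number) hits into (page, box) locations.

    Hits that do not appear in the word map are skipped.
    """
    return [
        word_map[word][occurrence]
        for word, occurrence in hits
        if word in word_map and 0 <= occurrence < len(word_map[word])
    ]


if __name__ == "__main__":
    words = [
        ("كتاب", 1, (10, 20, 60, 40)),
        ("قلم", 1, (70, 20, 110, 40)),
        ("كتاب", 2, (15, 80, 65, 100)),
    ]
    word_map = build_word_map(words)
    # Second occurrence of the first word is on page 2.
    print(locate_hits(word_map, [("كتاب", 1)]))
```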


RE: How to use Solr in my project

2013-12-29 Thread Fatima Issawi
Hi again,

We have another program that will be extracting the text, and it will be 
extracting the top right and bottom left corners of the words. You are right, I 
do expect to have a lot of data.

When would solr start experiencing issues in performance? Is it better to:

INDEX: 
- document metadata 
- words  

STORE: 
- document metadata
- words 
- coordinates 

in Solr rather than in the database? How would I set up the schema in order to 
store the coordinates?

If storing the coordinates in solr is not recommended, what would be the best 
process to get the coordinates after indexing the words and metadata? Do I 
search in solr and then use the documentID to then search the database for the 
words and coordinates?

Thanks for your patience. I don't have much choice in the use case. 


> -Original Message-
> From: Gora Mohanty [mailto:g...@mimirtech.com]
> Sent: Sunday, December 29, 2013 2:48 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to use Solr in my project
> 
> On 29 December 2013 11:10, Fatima Issawi  wrote:
> [...]
> > We will have the full text stored, but we want to highlight the text in the
> > original image. I expect to process the image after retrieval. We do plan on
> > storing the (x, y) coordinates of the words in a database - I suspected that
> > it would be too expensive to store them in Solr. I guess I'm still confused
> > about how to use Solr to index the document, but then retrieve the (x, y)
> > coordinates of the search term from the database. Is this possible? If it
> > can, can you give an example how this can be done?
> 
> Storing and retrieving the coordinates from Solr will likely be faster than
> from the database. However, I still think that you should think more carefully
> about your use case of highlighting the images. It can be done, but it is a
> significant amount of work, and will need storage and computational
> resources.
> 1. For highlighting in the image, you will need to store two sets
>    of coordinates (e.g., top right and bottom left corners), as you
>    do not know the length of the word in the image. Thus, say with
>    15 words per line, 50 lines per page, 100 pages per document,
>    you will need to store:
>      4 x 15 x 50 x 100 = 300,000 coordinates/document
> 2. Also, how are you going to get the coordinates in the first
>    place?
> 
> Regards,
> Gora
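
The storage estimate in this reply is simple enough to check in a few lines. The function name and the `values_per_word` parameter are made up for this sketch; the figures are the ones quoted above (two corners, each with an x and a y value, so four numbers per word).

```python
def coordinates_per_document(words_per_line, lines_per_page,
                             pages_per_document, values_per_word=4):
    """Number of coordinate values needed to locate every word.

    values_per_word is 4 because two corners (x, y) are stored per word.
    """
    return (values_per_word * words_per_line
            * lines_per_page * pages_per_document)


if __name__ == "__main__":
    print(coordinates_per_document(15, 50, 100))  # 300000
```

Real documents will differ in page count and line density, so this is an order-of-magnitude figure rather than a sizing rule.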


RE: How to use Solr in my project

2013-12-30 Thread Fatima Issawi
I think we may have up to 100,000 books, but I don't think the site will have a 
lot of traffic.

Thank you for your help. I think it is a little clearer now, and I will try to 
implement it.

> -Original Message-
> From: Gora Mohanty [mailto:g...@mimirtech.com]
> Sent: Monday, December 30, 2013 11:46 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to use Solr in my project
> 
> On 30 December 2013 11:27, Fatima Issawi  wrote:
> > Hi again,
> >
> > We have another program that will be extracting the text, and it will be
> > extracting the top right and bottom left corners of the words. You are
> > right, I do expect to have a lot of data.
> >
> > When would solr start experiencing issues in performance? Is it better to:
> >
> > INDEX:
> > - document metadata
> > - words
> >
> > STORE:
> > - document metadata
> > - words
> > - coordinates
> >
> > in Solr rather than in the database? How would I set up the schema in order
> > to store the coordinates?
> 
> You do not mention the number of documents, but for a few tens of
> thousands of documents, your problem should be tractable in Solr. Not sure
> what document metadata you have, and if you need to search through it, but
> what I would do is index the words, and store the coordinates in Solr, the
> assumption being that words are searched but not retrieved from Solr, while
> coordinates are retrieved but never searched.
> 
> Off the top of my head, each record can be:
>   
>
> ...
>   ...
> ...
>  ...
> 
> *  and  from Solr search results let you retrieve the image
>   from the filesystem
> * The coordinates allow post-processing to highlight the word in the image
> 
> As always, set up a prototype system with a subset of the records in order to
> measure performance.
> 
> > If storing the coordinates in solr is not recommended, what would be the
> > best process to get the coordinates after indexing the words and metadata?
> > Do I search in solr and then use the documentID to then search the database
> > for the words and coordinates?
> 
> You could do that, but Solr by itself should be fine.
> 
> Regards,
> Gora
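
One way to picture the "search the words, retrieve the coordinates" record suggested in this reply is sketched below. The field names in the archived message did not survive, so `doc_id`, `page_id`, `words`, and `coordinates` are hypothetical stand-ins. The reply also does not say how to pick out the coordinates of the matching word from a page's stored list; this sketch assumes each stored coordinate entry carries its word in a `word|x1,y1,x2,y2` string, which is only one possible encoding.

```python
def make_page_record(doc_id, page_id, words_with_boxes):
    """Build one record per page.

    words_with_boxes: list of (word, (x1, y1, x2, y2)) in reading order.
    "words" is meant to be indexed (searched, not retrieved);
    "coordinates" is meant to be stored (retrieved, not searched).
    """
    return {
        "id": f"{doc_id}_{page_id}",
        "doc_id": doc_id,
        "page_id": page_id,
        "words": " ".join(word for word, _ in words_with_boxes),
        "coordinates": [
            f"{word}|{box[0]},{box[1]},{box[2]},{box[3]}"
            for word, box in words_with_boxes
        ],
    }


def boxes_for_term(record, term):
    """Return the bounding boxes of every occurrence of term on the page."""
    boxes = []
    for entry in record["coordinates"]:
        word, _, raw_box = entry.partition("|")
        if word == term:
            boxes.append(tuple(int(value) for value in raw_box.split(",")))
    return boxes


if __name__ == "__main__":
    record = make_page_record("doc1", "p3", [
        ("كتاب", (10, 20, 60, 40)),
        ("قلم", (70, 20, 110, 40)),
    ])
    print(boxes_for_term(record, "كتاب"))
```

The exact-match comparison in `boxes_for_term` ignores whatever analysis (stemming, normalization) Solr applies to the indexed words, so a prototype would need to check that the two stay in step.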


Highlighting not working

2014-01-22 Thread Fatima Issawi
Hello,

I'm trying to highlight content that is returned from a Solr query, but I can't 
seem to get it working.

I would like to highlight the "documentname" and the "pagetext" or "content" 
results, but when I run the search I don't get anything returned. I thought 
that the "content" field is supposed to be used for highlighting, and that 
[termVectors="true" termPositions="true" termOffsets="true"] needs to be added 
to the fields that need to be highlighted. Is there something else I'm missing?


Here is my schema:

   
   
   
   
  
   
   
   />
   

   

   

   
   
   
   
   
   
   


Thanks,
Fatima


RE: Highlighting not working

2014-01-22 Thread Fatima Issawi
Also my highlighting defaults...

  
 

   
   on
   content documentname
   html
   <b>
   </b>
   0
   documentname
   3
   200
   content
   750

> -Original Message-
> From: Fatima Issawi [mailto:issa...@qu.edu.qa]
> Sent: Wednesday, January 22, 2014 11:34 AM
> To: solr-user@lucene.apache.org
> Subject: Highlighting not working
> 
> Hello,
> 
> I'm trying to highlight content that is returned from a Solr query, but I 
> can't
> seem to get it working.
> 
> I would like to highlight the "documentname" and the "pagetext" or
> "content" results, but when I run the search I don't get anything returned. I
> thought that the "content" field is supposed to be used for hightlighting?
> And that [termVectors="true" termPositions="true" termOffsets="true"]
> needs to be added to the fields that need to be highlighted? Is there
> something else I'm missing?
> 
> 
> Here is my schema:
> 
> required="true" multiValued="false" />
> omitNorms="true"/>
> stored="true" termVectors="true"  termPositions="true"
> termOffsets="true"/>
>
>   
>
>
>/>
> termVectors="true" termPositions="true" termOffsets="true"/>
> 
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true"/>
> 
> multiValued="true"/>
> 
>
>
>
>
>
>
>
> 
> 
> Thanks,
> Fatima


RE: Highlighting not working

2014-01-22 Thread Fatima Issawi
Hi,

I have stored=true for my "content" field, but I get an error saying there is a 
mismatch of settings on that field (I think) because of the "term*=true"  
settings.

Thanks again,
Fatima



> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Wednesday, January 22, 2014 5:02 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Highlighting not working
> 
> Hi Fatima,
> 
> To enable highlighting (both standard and FastVector), you need to set
> stored="true" on the field.
> 
> Term vectors may speed up the standard highlighter. Plus, they are mandatory
> for FastVectorHighlighter.
> 
> https://cwiki.apache.org/confluence/display/solr/Field+Properties+by+Use+Case
> 
> Ahmet
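
For testing highlighting independently of the handler defaults, the parameters can also be passed on the request. The sketch below only builds the query string. `hl`, `hl.fl`, `hl.simple.pre`, `hl.simple.post`, `hl.snippets`, and `hl.fragsize` are standard Solr highlighting parameters; the choice of the "content" field for `q` and the values used are assumptions based on the defaults posted in this thread. The fields listed in `hl.fl` still need stored="true", as noted above.

```python
from urllib.parse import parse_qs, urlencode


def build_highlight_query(term, fields=("content", "documentname"),
                          snippets=3, fragsize=200):
    """Return a URL-encoded query string that asks Solr for highlights.

    Note: term is placed inside a quoted phrase as-is; quotes or
    backslashes inside it are not escaped in this sketch.
    """
    params = {
        "q": f'content:"{term}"',
        "wt": "json",
        "hl": "true",
        "hl.fl": ",".join(fields),   # fields to highlight
        "hl.simple.pre": "<b>",      # marker before each match
        "hl.simple.post": "</b>",    # marker after each match
        "hl.snippets": snippets,
        "hl.fragsize": fragsize,
    }
    return urlencode(params)


if __name__ == "__main__":
    query = build_highlight_query("فيشر")
    print(query)
    print(parse_qs(query)["hl.fl"])
```

The string would be appended to the select URL of your own core; the thread does not give that URL, so it is left out here.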


Highlight results in Arabic are backword

2014-02-06 Thread Fatima Issawi
Hello,

I am getting highlight results in Arabic, but the order of the words is 
backwards. Querying on that field gives me the correct result, though. Is there 
a setting I’m missing?

An extract from an example query from my Solr Console is below:

{
  "responseHeader": {
"status": 0,
"QTime": 1,
"params": {
  "indent": "true",
  "q": "author:\"فيشر\"",
  "_": "1391692704242",
  "hl.simple.pre": "",
  "hl.simple.post": "",
  "hl.fl": "author",
  "wt": "json",
  "hl": "true"
}
  },
  "response": {
"numFound": 4,
"start": 0,
"docs": [
  {
"pagenumber": 1,
"id": "1",
"author": "د. فيشر السعر",
"author_s": "د. فيشر السعر",
"collector": "فاطمة عيساوي",
  },
  "highlighting": {
"1": {
  "author": [
"د. فيشر السعر"
  ]


RE: Highlight results in Arabic are backword

2014-02-08 Thread Fatima Issawi
Thank you both for responding. 

Is there a way to tell Solr to add those attributes to a field when it returns 
results (e.g. the language is Arabic or English, or the direction is LTR or 
RTL)?  

Right now I only have Arabic content indexed, but we plan to add English in the 
near future. I don't want to have to re-do everything later if there is a 
better way of designing this now.

Regards,
Fatima

> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Friday, February 07, 2014 3:48 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Highlight results in Arabic are backword
> 
> Arabic is complex. Basically, don't trust anything you see until you put that
> content on the screen with the surrounding tag marked with attribute
> dir='rtl' (e.g. arabic test).
> 
> Regards,
>Alex.
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at once.
> Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
> 
> 
> On Thu, Feb 6, 2014 at 10:12 PM, Steve Rowe  wrote:
> > Hi Fatima,
> >
> > I don’t think there’s an actual problem, it just looks like it because the
> > program you’re using to look at the JSON makes a different choice for
> > laying out the highlighting results than it does for the field values.
> >
> > In fact, all the bytes are the same, and in the same order for both the
> > “author” field text and the highlighting text, though some space characters
> > are ASCII space (U+0020) in one and non-breaking space (U+00A0) in the
> > other.
> >
> > By the way, I see the same thing as you in my email client (OS X Mail.app).
> > I assume there is a rule shared by our programs about complex layout like
> > this, where right-to-left text is mixed with left-to-right text, likely
> > based on the proportion of each, that triggers a left-to-right word
> > sequencing instead of the expected right-to-left word sequencing.
> >
> > Anyway, I pulled out the author field and highlighting texts into an HTML
> > document and viewed it in my browser (Safari), and both are laid out the
> > same (with the exception of the emphasis given the highlighted word):
> >
> > ——
> > 
> > 
> > "author": "د. فيشر السعر",
> > "highlighting": { "1": { "author": [ "د. فيشر السعر" ] }
> > }   ——
> >
> > Steve
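
The check described in this reply (same characters in the same order, differing only in the kind of space) can be reproduced without trusting how a console or mail client lays the text out. In the sketch below the two sample strings are made up to mimic the situation: which space is U+00A0 in the real response is not known from the thread. The function names are likewise invented for illustration.

```python
import unicodedata


def describe_differences(a, b):
    """List (index, name_in_a, name_in_b) where two strings differ.

    Only positions present in both strings are compared, so strings of
    different lengths should be checked for length separately.
    """
    return [
        (i, unicodedata.name(x, "UNKNOWN"), unicodedata.name(y, "UNKNOWN"))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


def same_apart_from_spaces(a, b):
    """True if the strings match once non-breaking spaces become spaces."""
    def normalise(text):
        return text.replace("\u00a0", " ")
    return normalise(a) == normalise(b)


# Hypothetical sample data: the author value with ordinary spaces, and a
# "highlighted" copy whose first space is a non-breaking space.
field_value = "\u062f. \u0641\u064a\u0634\u0631 \u0627\u0644\u0633\u0639\u0631"
highlighted = "\u062f.\u00a0\u0641\u064a\u0634\u0631 \u0627\u0644\u0633\u0639\u0631"

if __name__ == "__main__":
    print(describe_differences(field_value, highlighted))
    print(same_apart_from_spaces(field_value, highlighted))
```

Comparing code points this way separates a real ordering problem in the data from a display-only effect of mixed right-to-left and left-to-right layout.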


RE: Highlight results in Arabic are backword

2014-02-08 Thread Fatima Issawi
Thank you. I will look into that.

> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Sunday, February 09, 2014 9:35 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Highlight results in Arabic are backword
> 
> You will most probably put your English and Arabic content into different
> fields, mostly because you will want to apply different field type definitions
> to your English and Arabic text (tokenizers, etc.).
> 
> Also, I would search the web for articles on multilingual approaches to
> Solr, if you are doing some deliberate design now. There are some deeper
> issues. Some good questions are covered here:
> http://info.basistech.com/blog/bid/171842/Indexing-Strategies-for-Multilingual-Search-with-Solr-and-Rosette
> (even if it is talking about a commercial tool). There is also a series of
> 12 blog posts on dealing with Solr for CJK in libraries.
> Your issues will be different, but there will be overlap.
> 
> Regards,
>Alex.
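
Nothing in this thread shows Solr attaching a language or direction attribute to returned text, so one option is to decide the direction on the client when rendering. The sketch below uses a common heuristic (the first strongly directional character decides) and wraps the text in a tag with a `dir` attribute, in the spirit of the dir='rtl' advice earlier in the thread. The function names and the choice of a `<span>` tag are assumptions made for illustration.

```python
import unicodedata
from html import escape


def text_direction(text):
    """Return 'rtl' or 'ltr' based on the first strongly directional char.

    Arabic letters have bidirectional class 'AL', Hebrew letters 'R',
    and Latin letters 'L'. Text with no strong characters (digits,
    punctuation only) falls back to 'ltr'.
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "ltr"


def wrap_for_display(text):
    """Wrap plain text in a span carrying the detected direction.

    The text is HTML-escaped, so this is meant for plain field values;
    highlighted snippets that already contain markup would need
    different handling.
    """
    return f'<span dir="{text_direction(text)}">{escape(text)}</span>'


if __name__ == "__main__":
    print(wrap_for_display("د. فيشر السعر"))
    print(wrap_for_display("Solr highlighting"))
```

If Arabic and English end up in separate fields as suggested above, the field name itself could also tell the client which direction to use, which avoids guessing from the text.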