Re: Does Apache Solr store the file?

2017-12-07 Thread Charlie Hull

On 06/12/2017 10:10, Gora Mohanty wrote:

On 6 December 2017 at 10:39, Munish Kumar Arora
 wrote:


So the questions are:
1. Can I get the PDF content?
2. Does Solr store the actual file somewhere?
   a. If it does, where does it store it?
   b. If it does not, is there a way to store THE FILE?


Normal practice would be to store the PDF file somewhere on the file
system where it can be served through an HTTP request. Then, store the
filesystem path to the PDF file in Solr so that it can be returned in
a Solr search request.

Regards,
Gora

Yes you *can* store the entire contents of an indexed file in Solr. No, 
you really, really shouldn't. Always make sure you can regenerate your 
index from the original sources if you need to - a search engine is not 
a database.


I'll just write that again: a search engine is not a database.

The method described above is the usual way to deal with this situation.

Best

Charlie
--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
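
As a concrete illustration of the pattern Gora describes - index the extracted
text plus a stored path field, and serve the file itself over HTTP - here is a
minimal SolrJ sketch (the collection name and field names are assumptions, not
from the thread):

    // Minimal SolrJ sketch: index extracted PDF text plus the filesystem path,
    // and serve the PDF itself over HTTP from that path. Collection "docs" and
    // the field names are illustrative assumptions.
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexPdfPath {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/docs").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "report-42");
                doc.addField("content", "text extracted from the PDF goes here");
                doc.addField("path_s", "/var/data/pdfs/report-42.pdf"); // returned by searches
                solr.add(doc);
                solr.commit();
            }
        }
    }

A search result then carries path_s, which the application turns into an HTTP
link to the file; the index itself can always be rebuilt from the originals.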


Time-Series data indexing into Solr

2017-12-07 Thread Greenhorn Techie
Hi,

Is there any recommended approach to index and search time-series data in
Solr?

Thanks in Advance.


Re: Solr DR Replication

2017-12-07 Thread Greenhorn Techie
Any thoughts / help on this please.

Thanks in advance.

On Wed, 6 Dec 2017 at 16:21 Greenhorn Techie 
wrote:

> Hi,
>
> We are on Solr 5.5.2 and wondering what the best mechanism is for
> replicating Solr indexes from a Disaster Recovery perspective. As I
> understand, CDCR is only available from Solr 6 onwards. However, I couldn't find
> much content around index replication management for older versions.
> Wondering if any such documented solution is available.
>
> From a replication perspective, apart from SolrCloud collection data, what
> other information needs to be copied over from the source cluster to the target
> cluster? Should we copy the ZK data as well for the collection?
>
> TIA
>


Re: Issue while searching with escape characters

2017-12-07 Thread Emir Arnautović
Hi Roopesh,
If escaping the special chars with \ does not result in an error but in no results, 
then it might be worth checking whether your indexing is OK - does it strip 
the parentheses?

Can you share an example query and a schema snippet where you define your field and 
fieldType?

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
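
As an aside, SolrJ can take care of the Lucene escaping itself via
ClientUtils.escapeQueryChars; a minimal sketch (the collection and field names
are assumptions, not from the thread):

    // Minimal sketch: let SolrJ escape Lucene special characters such as ( and ).
    // Collection "Metadata" and field "userid" are illustrative assumptions.
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapedQuery {
        public static void main(String[] args) throws Exception {
            // escapeQueryChars("(DVeto1)") yields "\(DVeto1\)"; the HTTP layer
            // then URL-encodes the whole request, so no manual %5C%28 is needed.
            String q = "userid:" + ClientUtils.escapeQueryChars("(DVeto1)");
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/Metadata").build()) {
                long hits = solr.query(new SolrQuery(q)).getResults().getNumFound();
                System.out.println(hits + " matching documents");
            }
        }
    }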



> On 6 Dec 2017, at 16:39, Roopesh Uniyal  wrote:
> 
> Oh, that might be because I made DVeto1 in bold but it converted bold into
> *. So you can ignore both *.
> 
> On Wed, Dec 6, 2017 at 10:35 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Roopesh,
>> What are the *? Are they wildcards or special chars as well? The examples you
>> provided are not what you said you want to search - the * are not in the same
>> position. If you are not finding anything, that can be due to your analysis
>> - are you sure that your analysis does not trim parentheses?
>> 
>> Regards,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 6 Dec 2017, at 16:13, Roopesh Uniyal 
>> wrote:
>>> 
>>> Thanks Emir & Jan!
>>> 
>>> I have a situation where I need to search for a field value that is
>>> between parentheses (), like *(DVeto1)*
>>> 
>>> Based on the documentation
>>> <http://lucene.apache.org/core/7_1_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Escaping_Special_Characters>
>>> parentheses just need an escape character, but no matter which way I provide
>>> it, it's not returning a result set.
>>> 
>>> %28*DVeto1*%29
>>> %5C%28*DVeto1*%5C%29
>>> %22%28%22*DVeto1*%22%29%22
>>> %22%5C%28%22*DVeto1*%22%5C%29%22
>>> 
>>> 
>>> Thanks,
>>> Roopesh
>>> 
>>> On Wed, Dec 6, 2017 at 7:44 AM, Emir Arnautović <
>>> emir.arnauto...@sematext.com> wrote:
>>> 
 Hi,
 You need to escape special chars with \ and if you are sending it in a URL
 you can URL-encode it, but that is a URL-related thing, not Solr.
 
 Here is the list of Lucene characters that need to be escaped:
 http://lucene.apache.org/core/7_1_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Escaping_Special_Characters
 
 HTH,
 Emir
 --
 Monitoring - Log Management - Alerting - Anomaly Detection
 Solr & Elasticsearch Consulting Support Training - http://sematext.com/
 
 
 
> On 6 Dec 2017, at 10:33, Roopesh Uniyal 
 wrote:
> 
> Thanks Jan. It must be a late night. Not sure what I was thinking.
> 
> I provided *%5C%28DVeto1%5C%29* but am still not able to get the search
> results.
> 
> I also have a situation where I have to search for something like
> *(ID#DVeto2)*
> and I am providing *%5C%28ID%23DVeto2%5C%2*9 and am still not able to get
> the resultsets.
> 
> It's not throwing any error, but no results are found in these two scenarios,
> although we know there should be some records.
> 
> Am I missing anything?
> 
> Thanks!
> 
> 
> On Wed, Dec 6, 2017 at 4:13 AM, <
> jan.christopher.schluchtmann-...@continental-corporation.com> wrote:
> 
>> hmm ... it seems, you are using XML/HTML-encoding, but you need
>> HTTP-encoding, which looks like this:
>> 
>> 
>> ␣   !   "   #   $   %   &   '   (   )
>> %20 %21 %22 %23 %24 %25 %26 %27 %28 %29
>> 
>> *   +   ,   -   .   /   :   ;   <   =
>> %2A %2B %2C %2D %2E %2F %3A %3B %3C %3D
>> 
>> >   ?   @   [   \   ]   {   |   }
>> %3E %3F %40 %5B %5C %5D %7B %7C %7D
>> 
>> 
>> good luck! :)
>> 
>> 
>> Mit freundlichen Grüßen/ With kind regards
>> 
>> Jan Schluchtmann
>> Systems Engineering Cluster Instruments
>> VW Group
>> Continental Automotive GmbH
>> Division Interior
>> ID S3 RM
>> VDO-Strasse 1, 64832 Babenhausen, Germany
>> 
>> Telefon/Phone: +49 6073 12-4346
>> Telefax: +49 6073 12-79-4346
>> 
>> 
>> 
>> From:    Roopesh Uniyal 
>> To:      solr-user@lucene.apache.org,
>> Date:    06.12.2017 09:57
>> Subject: Issue while searching with escape characters
>> 
>> 
>> 
>> Hello, I am searching Solr 6 via an HTTP call by providing a "UserID".
>> 
>> It's just that the data is in the format of (DVeto1).
>> 
>> So, in my call I have to provide parentheses, but since they are special
>> characters I need to provide the escape as well. It looks like it's not working.
>> 
>> Provided the search string over http like t

TransformerFactory does not support SolrCoreAware

2017-12-07 Thread Markus Jelsma
Hi, 

I'd love to have this supported, but SOLR-8311 states there are issues, and I 
lack the understanding of the mentioned issues. So, can I add it?

Many thanks,
Markus



Re: Issue while searching with escape characters

2017-12-07 Thread Roopesh Uniyal
Thanks Emir.
Got it fixed. The end customer's Solr did not have the records in the first place. They
were trying to compare apples with oranges.

On Thu, Dec 7, 2017 at 7:43 AM Emir Arnautović 
wrote:

> Hi Roopesh,
> If escaping special char with \ does not result in error but in no
> results, then it might be worth checking if your indexing is ok - does it
> strip parenthesis.
>
> Can you share example query and schema snippet where you define your field
> and fieldType.
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 6 Dec 2017, at 16:39, Roopesh Uniyal 
> wrote:
> >
> > Oh, that might be because I made DVeto1 in bold but it converted bold
> into
> > *. So you can ignore both *.
> >
> > On Wed, Dec 6, 2017 at 10:35 AM, Emir Arnautović <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Roopesh,
> >> What are the *? Are they wildcards or special chars as well? The examples you
> >> provided are not what you said you want to search - the * are not in the same
> >> position. If you are not finding anything, that can be due to your analysis
> >> - are you sure that your analysis does not trim parentheses?
> >>
> >> Regards,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 6 Dec 2017, at 16:13, Roopesh Uniyal 
> >> wrote:
> >>>
> >>> Thanks Emir & Jan!
> >>>
> >>> I have a situation where I need to search for a field value that is
> >>> between parentheses (), like *(DVeto1)*
> >>>
> >>> Based on the documentation
> >>> <http://lucene.apache.org/core/7_1_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Escaping_Special_Characters>
> >>> parentheses just need an escape character, but no matter which way I provide
> >>> it, it's not returning a result set.
> >>>
> >>> %28*DVeto1*%29
> >>> %5C%28*DVeto1*%5C%29
> >>> %22%28%22*DVeto1*%22%29%22
> >>> %22%5C%28%22*DVeto1*%22%5C%29%22
> >>>
> >>>
> >>> Thanks,
> >>> Roopesh
> >>>
> >>> On Wed, Dec 6, 2017 at 7:44 AM, Emir Arnautović <
> >>> emir.arnauto...@sematext.com> wrote:
> >>>
>  Hi,
>  You need to escape special chars with \ and if you are sending it in a URL
>  you can URL-encode it, but that is a URL-related thing, not Solr.
> 
>  Here is the list of Lucene characters that need to be escaped:
>  http://lucene.apache.org/core/7_1_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Escaping_Special_Characters
> 
>  HTH,
>  Emir
>  --
>  Monitoring - Log Management - Alerting - Anomaly Detection
>  Solr & Elasticsearch Consulting Support Training -
> http://sematext.com/
> 
> 
> 
> > On 6 Dec 2017, at 10:33, Roopesh Uniyal 
>  wrote:
> >
> > Thanks Jan. It must be a late night. Not sure what I was thinking.
> >
> > I provided *%5C%28DVeto1%5C%29* but am still not able to get the search
> > results.
> >
> > I also have a situation where I have to search for something like
> > *(ID#DVeto2)*
> > and I am providing *%5C%28ID%23DVeto2%5C%2*9 and am still not able to get
> > the resultsets.
> >
> > It's not throwing any error, but no results are found in these two scenarios,
> > although we know there should be some records.
> >
> > Am I missing anything?
> >
> > Thanks!
> >
> >
> > On Wed, Dec 6, 2017 at 4:13 AM, <
> > jan.christopher.schluchtmann-...@continental-corporation.com> wrote:
> >
> >> hmm ... it seems, you are using XML/HTML-encoding, but you need
> >> HTTP-encoding, which looks like this:
> >>
> >>
> >> ␣   !   "   #   $   %   &   '   (   )
> >> %20 %21 %22 %23 %24 %25 %26 %27 %28 %29
> >>
> >> *   +   ,   -   .   /   :   ;   <   =
> >> %2A %2B %2C %2D %2E %2F %3A %3B %3C %3D
> >>
> >> >   ?   @   [   \   ]   {   |   }
> >> %3E %3F %40 %5B %5C %5D %7B %7C %7D
> >>
> >>
> >> good luck! :)
> >>
> >>
> >> Mit freundlichen Grüßen/ With kind regards
> >>
> >> Jan Schluchtmann
> >> Systems Engineering Cluster Instruments
> >> VW Group
> >> Continental Automotive GmbH
> >> Division Interior
> >> ID S3 RM
> >> VDO-Strasse 1, 64832 Babenhausen, Germany
> >>
> >> Telefon/Phone: +49 6073 12-4346
> >> Telefax: +49 6073 12-79-4346
> >>
> >>
> >>
> >> From:    Roopesh Uniyal 
> >> To:      solr-user@lucene.apache.org,
> >> Date:    06.12.2017 09:

Where can I find documentation to migrate Solr 4 to 5?

2017-12-07 Thread Gilcan Machado
Hi.

I have Solr 4 in production (+ Drupal).

And I want to migrate Solr to version 7 (in the end).

But I guess it's safer to migrate from 4 to 5 first.

Anyway, I've searched a lot and couldn't find documentation that shows
how to take a Solr 4 (in full production) and upgrade it to Solr 5.

[]s
Gil


RE: TransformerFactory does not support SolrCoreAware

2017-12-07 Thread Markus Jelsma
Created SOLR-11735 for tracking.
https://issues.apache.org/jira/browse/SOLR-11735
 
 
-Original message-
> From:Markus Jelsma 
> Sent: Thursday 7th December 2017 14:49
> To: Solr-user 
> Subject: TransformerFactory does not support SolrCoreAware
> 
> Hi, 
> 
> I'd love to have this supported, but SOLR-8311 states there are issues, and I 
> lack the understanding of the mentioned issues. So, can I add it?
> 
> Many thanks,
> Markus
> 
> 


RE: Where can I find documentation to migrate Solr 4 to 5?

2017-12-07 Thread Markus Jelsma
https://lucene.apache.org/solr/5_0_0/changes/Changes.html

 
 
-Original message-
> From:Gilcan Machado 
> Sent: Thursday 7th December 2017 14:55
> To: solr-user@lucene.apache.org
> Subject: Where can I find documentation to migrate Solr 4 to 5?
> 
> Hi.
> 
> I have a Solr 4 in production (+ Drupal).
> 
> And I want to migrate Solr to version 7 (in the end).
> 
> But I guess  it's more safe to migrate from 4 to 5 first.
> 
> Anyway, I'm searching a lot and I couldn't find a documentation that shows
> how to pick a Solr 4 (in full production) and upgrade to a Solr 5.
> 
> []s
> Gil
> 


Re: Where can I find documentation to migrate Solr 4 to 5?

2017-12-07 Thread Gilcan Machado
Jesus... Thank you very much!!!

[]s
Gil

2017-12-07 11:58 GMT-02:00 Markus Jelsma :

> https://lucene.apache.org/solr/5_0_0/changes/Changes.html
>
>
>
> -Original message-
> > From:Gilcan Machado 
> > Sent: Thursday 7th December 2017 14:55
> > To: solr-user@lucene.apache.org
> > Subject: Where can I find documentation to migrate Solr 4 to 5?
> >
> > Hi.
> >
> > I have a Solr 4 in production (+ Drupal).
> >
> > And I want to migrate Solr to version 7 (in the end).
> >
> > But I guess  it's more safe to migrate from 4 to 5 first.
> >
> > Anyway, I'm searching a lot and I couldn't find a documentation that
> shows
> > how to pick a Solr 4 (in full production) and upgrade to a Solr 5.
> >
> > []s
> > Gil
> >
>


RE: Time-Series data indexing into Solr

2017-12-07 Thread Markus Jelsma
One of our collections is time-series data, processing hundreds of queries per 
second. But apart from having a time field that is indexed and has docValues 
enabled, I wouldn't know of any specific recommendations.

 
-Original message-
> From:Greenhorn Techie 
> Sent: Thursday 7th December 2017 12:42
> To: solr-user@lucene.apache.org
> Subject: Time-Series data indexing into Solr
> 
> Hi,
> 
> Is there any recommended approach to index and search time-series data in
> Solr?
> 
> Thanks in Advance.
> 


Re: No Live SolrServer available to handle this request

2017-12-07 Thread Steve Rowe
Hi Selvam,

This sounds like it may be a bug - could you please create a JIRA?  (See 

 for more info.)

Thanks,

--
Steve
www.lucidworks.com

> On Dec 6, 2017, at 9:56 PM, Selvam Raman  wrote:
> 
> Yes, you are right. We are using a preanalyzed field and that is causing the
> problem.
> The actual problem is the preanalyzed field with the highlight option: if I
> disable the highlight option it works fine. Please let me know if there is a
> workaround to solve it.
> 
> On Wed, Dec 6, 2017 at 10:19 PM, Erick Erickson 
> wrote:
> 
>> This looks like you're using "pre analyzed fields" which have a very
>> specific format. PreAnalyzedFields are actually pretty rarely used,
>> did you enable them by mistake?
>> 
>> On Tue, Dec 5, 2017 at 11:37 PM, Selvam Raman  wrote:
>>> When I look at the Solr logs I find the below exception:
>>> 
>>> Caused by: java.io.IOException: Invalid JSON type java.lang.String, expected Map
>>> at org.apache.solr.schema.JsonPreAnalyzedParser.parse(JsonPreAnalyzedParser.java:86)
>>> at org.apache.solr.schema.PreAnalyzedField$PreAnalyzedTokenizer.decodeInput(PreAnalyzedField.java:345)
>>> at org.apache.solr.schema.PreAnalyzedField$PreAnalyzedTokenizer.access$000(PreAnalyzedField.java:280)
>>> at org.apache.solr.schema.PreAnalyzedField$PreAnalyzedAnalyzer$1.setReader(PreAnalyzedField.java:375)
>>> at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:202)
>>> at org.apache.lucene.search.uhighlight.AnalysisOffsetStrategy.tokenStream(AnalysisOffsetStrategy.java:58)
>>> at org.apache.lucene.search.uhighlight.MemoryIndexOffsetStrategy.getOffsetsEnums(MemoryIndexOffsetStrategy.java:106)
>>> ... 37 more
>>> 
>>> 
>>> 
>>> I am setting up a lot of parameters (fq, score, highlight, etc.) and then
>>> putting them into the SolrQuery.
>>> 
>>> On Wed, Dec 6, 2017 at 11:22 AM, Selvam Raman  wrote:
>>> 
 When I fire a query it returns the docs as expected. (Example:
 q=synthesis)
 
 I am facing the problem when I include a wildcard character in the query.
 (Example: q=synthesi*)
 
 
 org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
 Error from server at http://localhost:8983/solr/Metadata2:
 org.apache.solr.client.solrj.SolrServerException:
 
 No live SolrServers available to handle this request:[/solr/Metadata2_
 shard1_replica1,
  solr/Metadata2_shard2_replica2,
  solr/Metadata2_shard1_replica2]
 
 --
 Selvam Raman
 "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
 
>>> 
>>> 
>>> 
>>> --
>>> Selvam Raman
>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>> 
> 
> 
> 
> -- 
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"



Re: indexing XML stored on HDFS

2017-12-07 Thread Matthew Roth
Yes, the post tool would also be an acceptable option and one I am familiar
with. However, I also am not seeing exactly how I would query HDFS. The
hadoop-solr [0] tool by Lucidworks looks the most promising. I have a meeting
to attend shortly, and maybe I can explore that further in the afternoon.

I also would like to look further into SolrJ. I have no real reason to
store the results of the XSLT transformation anywhere other than Solr. I am
simply not familiar with it. But on the surface it seems like it might be
the most performant way to handle this problem.

If I do pursue this with SolrJ and Spark, will Solr handle multiple SolrJ
connections all trying to add documents?

[0] https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers
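
On the multiple-connections question: a single SolrJ client instance is
thread-safe and is usually shared, and adds are best sent in batches. A
minimal sketch (the collection and field names are assumptions):

    // Minimal sketch: batch documents and send them to Solr with SolrJ. A single
    // client instance is thread-safe and can be shared by concurrent writers.
    // Collection "xmldocs" and the field names are illustrative assumptions.
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchedIndexer {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/xmldocs").build()) {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 10_000; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "doc-" + i);
                    doc.addField("text_t", "transformed XML payload " + i);
                    batch.add(doc);
                    if (batch.size() == 1000) { // send every 1000 docs
                        solr.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) solr.add(batch);
                solr.commit(); // one commit at the end, not per batch
            }
        }
    }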

On Wed, Dec 6, 2017 at 5:36 PM, Erick Erickson 
wrote:

> Perhaps the bin/post tool? See:
> https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
>
> On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth  wrote:
> > Hi All,
> >
> > Is there a DIH for HDFS? I see this old feature request [0] that never
> > seems to have gone anywhere. Google searches and searches on this list
> > don't get me too far.
> >
> > Essentially my workflow is that I have many thousands of XML documents
> > stored in HDFS. I run an XSLT transformation in Spark [1]. This transforms
> > to the expected Solr input of <add><doc>...</doc></add>. This is
> > then written back to HDFS. Now how do I get it back to Solr? I suppose
> > I could move the data back to the local fs, but on the surface that feels
> > like the wrong way.
> >
> > I don't need to store the documents in HDFS after the Spark transformation;
> > I wonder if I can write them using SolrJ. However, I am not really familiar
> > with SolrJ. I am also running a single node. Most of the material I have
> > read on spark-solr expects you to be running SolrCloud.
> >
> > Best,
> > Matt
> >
> >
> >
> > [0] https://issues.apache.org/jira/browse/SOLR-2096
> > [1] https://github.com/elsevierlabs-os/spark-xml-utils
>


Re: indexing XML stored on HDFS

2017-12-07 Thread Rick Leir
Matthew,
Do you have some sort of script calling the XSLT? Sorry, I do not know Scala and I 
did not have time to look into your Spark utils.  The script or Scala could 
then shell out to curl, or if it is Python it could use the requests library to 
send a doc to Solr. Extra points for batching the documents. 

Erick,
The last time I used the post tool, it was spinning up a JVM each time I called 
it (natch). Is there a simple way to launch it from a Java app server so you 
can call it repeatedly without the start-up overhead? It has been a few years, 
maybe I am wrong.
Cheers -- Rick

On December 6, 2017 5:36:51 PM EST, Erick Erickson  
wrote:
>Perhaps the bin/post tool? See:
>https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
>
>On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth 
>wrote:
>> Hi All,
>>
>> Is there a DIH for HDFS? I see this old feature request [0] that never
>> seems to have gone anywhere. Google searches and searches on this list
>> don't get me too far.
>>
>> Essentially my workflow is that I have many thousands of XML documents
>> stored in HDFS. I run an XSLT transformation in Spark [1]. This transforms
>> to the expected Solr input of <add><doc>...</doc></add>. This is
>> then written back to HDFS. Now how do I get it back to Solr? I suppose
>> I could move the data back to the local fs, but on the surface that feels
>> like the wrong way.
>>
>> I don't need to store the documents in HDFS after the Spark transformation;
>> I wonder if I can write them using SolrJ. However, I am not really familiar
>> with SolrJ. I am also running a single node. Most of the material I have
>> read on spark-solr expects you to be running SolrCloud.
>>
>> Best,
>> Matt
>>
>>
>>
>> [0] https://issues.apache.org/jira/browse/SOLR-2096
>> [1] https://github.com/elsevierlabs-os/spark-xml-utils

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: indexing XML stored on HDFS

2017-12-07 Thread Rick Leir
Matthew, Oops, I should have mentioned re-indexing. With Solr, you want to be 
able to re-index quickly so you can try out different analysis chains. XSLT may 
not be fast enough for this if you have millions of docs. So I would be 
inclined to save the docs to a normal filesystem, perhaps in JSONL. Then use 
DIH or post tool or Python to post the docs to Solr.
Rick

On December 7, 2017 10:14:37 AM EST, Rick Leir  wrote:
>Matthew,
>Do you have some sort of script calling xslt? Sorry, I do not know
>Scala and I did not have time to look into your spark utils.  The
>script or Scala could then shell out to curl, or if it is python it
>could use the request library to send a doc to Solr. Extra points for
>batching the documents. 
>
>Erick
>The last time I used the post tool, it was spinning up a jvm each time
>I called it (natch). Is there a simple way to launch it from a Java app
>server so you can call it repeatedly without the start-up overhead? It
>has been a few years, maybe I am wrong.
>Cheers -- Rick
>
>On December 6, 2017 5:36:51 PM EST, Erick Erickson
> wrote:
>>Perhaps the bin/post tool? See:
>>https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
>>
>>On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth 
>>wrote:
>>> Hi All,
>>>
>>> Is there a DIH for HDFS? I see this old feature request [0] that never
>>> seems to have gone anywhere. Google searches and searches on this list
>>> don't get me too far.
>>>
>>> Essentially my workflow is that I have many thousands of XML documents
>>> stored in HDFS. I run an XSLT transformation in Spark [1]. This transforms
>>> to the expected Solr input of <add><doc>...</doc></add>. This is
>>> then written back to HDFS. Now how do I get it back to Solr? I suppose
>>> I could move the data back to the local fs, but on the surface that feels
>>> like the wrong way.
>>>
>>> I don't need to store the documents in HDFS after the Spark transformation;
>>> I wonder if I can write them using SolrJ. However, I am not really familiar
>>> with SolrJ. I am also running a single node. Most of the material I have
>>> read on spark-solr expects you to be running SolrCloud.
>>>
>>> Best,
>>> Matt
>>>
>>>
>>>
>>> [0] https://issues.apache.org/jira/browse/SOLR-2096
>>> [1] https://github.com/elsevierlabs-os/spark-xml-utils
>
>-- 
>Sorry for being brief. Alternate email is rickleir at yahoo dot com

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Howto search for § character

2017-12-07 Thread Shawn Heisey
On 12/6/2017 9:09 AM, Bernd Schmidt wrote:
> we have defined a field named "_text_" for a full text search based on 
> field-type "text_general":
> <field name="_text_" type="text_general" stored="false"/>
>
> When trying to search for the "§" character, we have strange behaviour:
>
> q=_text_:§ AND entityClass:StructureNodeImpl  => numFound:469 (all nodes 
> where entityClass:StructureNodeImpl)
> q=_text_:§ => numFound:0
>
> How can we search for the occurrence of the § character?

We can't see how your "text_general" type is defined, but if it is
anything like the same type included in Solr examples, then it probably
is using StandardTokenizerFactory.  It appears that this tokenizer
treats the § character as a word break and removes it from the token
stream.  Most likely, the reason the search with the extra clause works
is that the part with that character is removed, and the query ends up
ONLY being the extra clause.

You will need a fieldType with an analysis chain that doesn't remove the
§ character, and it's almost guaranteed that you'll need to reindex. 
Unless you do that, searching for that character is not going to be
possible.

Also keep in mind that searching for a single character may not do what
you expect if that character is not a single word in the text, and that
certain filters can end up trimming out really short terms like that.

Thanks,
Shawn



Re: Howto search for § character

2017-12-07 Thread Erick Erickson
The admin UI/(select core)/analysis page will help you see exactly
what happens. Additionally, the "schema browser" bit will show you
exactly what's in the index, i.e. the terms as they actually appear
after all the analysis chain is completed. Those will definitively
tell you what exactly happens with that character.

Best,
Erick

On Thu, Dec 7, 2017 at 7:37 AM, Shawn Heisey  wrote:
> On 12/6/2017 9:09 AM, Bernd Schmidt wrote:
>> we have defined a field named "_text_" for a full text search based on 
>> field-type "text_general":
>> <field name="_text_" type="text_general" stored="false"/>
>>
>> When trying to search for the "§" character, we have strange behaviour:
>>
>> q=_text_:§ AND entityClass:StructureNodeImpl  => numFound:469 (all nodes 
>> where entityClass:StructureNodeImpl)
>> q=_text_:§ => numFound:0
>>
>> How can we search for the occurrence of the § character?
>
> We can't see how your "text_general" type is defined, but if it is
> anything like the same type included in Solr examples, then it probably
> is using StandardTokenizerFactory.  It appears that this tokenizer
> treats the § character as a word break and removes it from the token
> stream.  Most likely, the reason the search with the extra clause works
> is that the part with that character is removed, and the query ends up
> ONLY being the extra clause.
>
> You will need a fieldType with an analysis chain that doesn't remove the
> § character, and it's almost guaranteed that you'll need to reindex.
> Unless you do that, searching for that character is not going to be
> possible.
>
> Also keep in mind that searching for a single character may not do what
> you expect if that character is not a single word in the text, and that
> certain filters can end up trimming out really short terms like that.
>
> Thanks,
> Shawn
>


Re: Howto search for § character

2017-12-07 Thread Bernd Schmidt

Indeed, I saw in the analysis tab of the Solr admin UI that the § char will be 
removed when using type text_general.
But in this use case we want to make a full text search like "_text_:§45" or 
"_text_:§*" to find words starting with §.
We need a text field here, not a string field!
What is your recommended way to deal with this? 
Is it possible to remove the word-break behaviour for the § char?
Or is the best way to encode all § chars when indexing and searching?



Thanks, Bernd



 Mit freundlichen Grüßen / With kind regards

 Bernd Schmidt
 SOFTWARE-ENTWICKLUNG 

 b.schm...@eggheads.de



 From:   Shawn Heisey  
 To:
 Sent:   07.12.2017 16:37 
 Subject:   Re: Howto search for § character 

On 12/6/2017 9:09 AM, Bernd Schmidt wrote: 
> we have defined a field named "_text_" for a full text search based on 
> field-type "text_general": 
> <field name="_text_" type="text_general" stored="false"/> 
> 
> When trying to search for the "§" character, we have strange behaviour: 
> 
> q=_text_:§ AND entityClass:StructureNodeImpl  => numFound:469 (all nodes 
> where entityClass:StructureNodeImpl) 
> q=_text_:§ => numFound:0 
> 
> How can we search for the occurrence of the § character? 
 
We can't see how your "text_general" type is defined, but if it is 
anything like the same type included in Solr examples, then it probably 
is using StandardTokenizerFactory.  It appears that this tokenizer 
treats the § character as a word break and removes it from the token 
stream.  Most likely, the reason the search with the extra clause works 
is that the part with that character is removed, and the query ends up 
ONLY being the extra clause. 
 
You will need a fieldType with an analysis chain that doesn't remove the 
§ character, and it's almost guaranteed that you'll need to reindex.  
Unless you do that, searching for that character is not going to be 
possible. 
 
Also keep in mind that searching for a single character may not do what 
you expect if that character is not a single word in the text, and that 
certain filters can end up trimming out really short terms like that. 
 
Thanks, 
Shawn 
 




 eggheads GmbH
 Herner Straße 370
44807 Bochum

Fon +49 234 89397-0
Fax +49 234 89397-28
 
 www.eggheads.de


Re: Time-Series data indexing into Solr

2017-12-07 Thread Erick Erickson
You can also use "implicit" (sometimes called "manual") routing. This
allows you to create shards on the fly so one pattern is to create,
say, a shard per day. Say you have 30 day retention requirements: You
can create a new shard every day and delete any shards 31 or more days
old.

There are pros and cons to this. Personally unless you want to take
total control over the topology I prefer Shawn's approach but YMMV.

Best,
Erick
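
The create/delete cycle described above looks roughly like this SolrJ sketch
(the collection name, shard naming scheme, and 30-day window are assumptions):

    // Minimal sketch of the shard-per-day pattern on an implicitly routed
    // collection. Collection "timeseries" and the "day_yyyy_MM_dd" naming
    // are illustrative assumptions.
    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class DailyShards {
        private static final DateTimeFormatter FMT =
                DateTimeFormatter.ofPattern("'day_'yyyy_MM_dd");

        public static void main(String[] args) throws Exception {
            try (CloudSolrClient solr = new CloudSolrClient.Builder()
                    .withZkHost("localhost:2181").build()) {
                LocalDate today = LocalDate.now();
                // Create today's shard ...
                CollectionAdminRequest
                        .createShard("timeseries", today.format(FMT))
                        .process(solr);
                // ... and drop the one that just left the 30-day window.
                CollectionAdminRequest
                        .deleteShard("timeseries", today.minusDays(31).format(FMT))
                        .process(solr);
            }
        }
    }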

On Thu, Dec 7, 2017 at 6:02 AM, Markus Jelsma
 wrote:
> One of our collections is time-series data, processing hundreds of queries 
> per second. But apart from having a time field, set it indexed and docValues 
> enabled, I wouldn't know about any specific recommendations.
>
>
> -Original message-
>> From:Greenhorn Techie 
>> Sent: Thursday 7th December 2017 12:42
>> To: solr-user@lucene.apache.org
>> Subject: Time-Series data indexing into Solr
>>
>> Hi,
>>
>> Is there any recommended approach to index and search time-series data in
>> Solr?
>>
>> Thanks in Advance.
>>


Re: Howto search for § character

2017-12-07 Thread Erick Erickson
You have to use a different analysis chain. There are about a zillion
options, here's a _start_:
https://lucene.apache.org/solr/guide/6_6/understanding-analyzers-tokenizers-and-filters.html
You'll probably be defining one similar to how text_general is
defined, a <fieldType>, then use your new type in your <field>. This is
really the heart of how you make Solr do what you want when it comes
to what's searchable and what's not.

When you use the admin/analysis page, hover over the light gray
two-letter abbreviations and it'll pop up the class used for that
transformation.

You can start with WhitespaceTokenizerFactory which will break only on
whitespace. Be aware that other filters can then also manipulate the
tokens created by the tokenizer. WhitespaceTokenizerFactory will _not_
remove punctuation for instance, so you have to deal with that. For
example periods at the end of a sentence "I Like Cake." would be
included in the emitted tokens, so you'd have
I
Like
Cake.

You can use one of the filters to deal with that.
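
A minimal Lucene sketch (outside any Solr schema) of what WhitespaceTokenizer
emits, using the § case from this thread:

    // Minimal Lucene sketch: WhitespaceTokenizer splits only on whitespace, so a
    // token like "§45" survives intact (StandardTokenizer would drop the "§").
    import java.io.StringReader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenizeDemo {
        public static void main(String[] args) throws Exception {
            Tokenizer tok = new WhitespaceTokenizer();
            tok.setReader(new StringReader("see §45 of I Like Cake."));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                System.out.println(term); // see / §45 / of / I / Like / Cake.
            }
            tok.end();
            tok.close();
        }
    }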

I would be very reluctant to use the "string" type, it's not analyzed
in any way and is almost always the wrong solution for something like
this. So input like this
I Like Cake.
would match _only_ I\ Like\ Cake.
You couldn't search on just the term "like", or even "Like" but only
"*Like*" which rather defeats the purpose of using tokenized search.

Best,
Erick

On Thu, Dec 7, 2017 at 8:37 AM, Bernd Schmidt  wrote:
>
> Indeed, I saw in the analysis tab of the solr admin that the § char will be 
> removed when using type text_general.
> But in this use case we want to make a full text search like "_text_:§45" or 
> "_text_:§*" to find words starting with §.
> We need a text field here, not a string field!
> What is your recommended way to deal with it?
> Is it possible to remove the word break behaviour for the  § char?
> Or is the best way to encode all § chars when indexing and searching?
>
>
>
> Thanks, Bernd
>
>
>
>  Mit freundlichen Grüßen / With kind regards
>
>  Bernd Schmidt
>  SOFTWARE-ENTWICKLUNG
>
>  b.schm...@eggheads.de
>
>
>
>  From:   Shawn Heisey 
>  To:   
>  Sent:   07.12.2017 16:37
>  Subject:   Re: Howto search for § character
>
> On 12/6/2017 9:09 AM, Bernd Schmidt wrote:
>> we have defined a field named "_text_" for a full text search based on 
>> field-type "text_general":
>> <field name="_text_" type="text_general" stored="false"/>
>>
>> When trying to search for the "§" character, we have strange behaviour:
>>
>> q=_text_:§ AND entityClass:StructureNodeImpl  => numFound:469 (all nodes 
>> where entityClass:StructureNodeImpl)
>> q=_text_:§ => numFound:0
>>
>> How can we search for the occurrence of the § character?
>
> We can't see how your "text_general" type is defined, but if it is
> anything like the same type included in Solr examples, then it probably
> is using StandardTokenizerFactory.  It appears that this tokenizer
> treats the § character as a word break and removes it from the token
> stream.  Most likely, the reason the search with the extra clause works
> is that the part with that character is removed, and the query ends up
> ONLY being the extra clause.
>
> You will need a fieldType with an analysis chain that doesn't remove the
> § character, and it's almost guaranteed that you'll need to reindex.
> Unless you do that, searching for that character is not going to be
> possible.
>
> Also keep in mind that searching for a single character may not do what
> you expect if that character is not a single word in the text, and that
> certain filters can end up trimming out really short terms like that.
>
> Thanks,
> Shawn
>
>
>
>
>
>  eggheads GmbH
>  Herner Straße 370
> 44807 Bochum
>
> Fon +49 234 89397-0
> Fax +49 234 89397-28
>
>  www.eggheads.de


Re: Howto search for § character

2017-12-07 Thread Shawn Heisey
On 12/7/2017 9:37 AM, Bernd Schmidt wrote:
> Indeed, I saw in the analysis tab of the solr admin that the § char will be 
> removed when using type text_general.
> But in this use case we want to make a full text search like "_text_:§45" or 
> "_text_:§*" to find words starting with §.
> We need a text field here, not a string field!
> What is your recommended way to deal with it? 
> Is it possible to remove the word break behaviour for the  § char?
> Or is the best way to encode all § chars when indexing and searching?

This character is classified by Unicode as punctuation:

http://www.fileformat.info/info/unicode/char/00a7/index.htm

Almost any example field type for full-text search that you're likely to
encounter is going to be designed to split on punctuation and remove it
from the token stream.  That's one of the most common things that
full-text search engines do.

You're going to need to design a new analysis chain that *doesn't* do
this, apply the fieldType containing that analysis to your field,
restart/reload, and reindex.

Designing analysis chains is an art form, and tends to be one of the
hardest parts of setting up a production Solr install.  It took me at
least a month of almost constant work to settle on the schema design for
the indexes that I maintain.  All of the "solr.TextField" types in my
schema are completely custom -- none of the analysis chains in Solr
examples are in that schema.

Thanks,
Shawn



RE: TransformerFactory does not support SolrCoreAware

2017-12-07 Thread Markus Jelsma
cc list:

Hello Mikhail,

Well, disregarding the warning notes in SolrResourceLoader, my meager patch 
adds TransformerFactory, and the code now runs well. I obviously lack the 
understanding of this patch with regard to SOLR-8311, but we are fine.

So, the patch and the custom code using it are doing fine so far. Will it be 
fine regarding SOLR-8311, i do not know. I am hoping an expert on these 
specifics can share their opinion.

Thanks,
Markus

 
 
-Original message-
> From:Mikhail Khludnev 
> Sent: Thursday 7th December 2017 21:50
> To: Markus Jelsma 
> Subject: Re: TransformerFactory does not support SolrCoreAware
> 
> Do you know the workaround?  
> 
> On Thu, Dec 7, 2017 at 4:56 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > Created SOLR-11735 for tracking.
> > https://issues.apache.org/jira/browse/SOLR-11735
> > 
> > -Original message-
> > > From:Markus Jelsma <markus.jel...@openindex.io>
> > > Sent: Thursday 7th December 2017 14:49
> > > To: Solr-user <solr-user@lucene.apache.org>
> > > Subject: TransformerFactory does not support SolrCoreAware
> > > 
> > > Hi,
> > > 
> > > I'd love to have this supported, but SOLR-8311 states there are issues,
> > > and I lack the understanding of the mentioned issues. So, can I add it?
> > > 
> > > Many thanks,
> > > Markus
> 
> -- 
> Sincerely yours
> Mikhail Khludnev


Index size optimization between 4.5.1 and 4.10.4 Solr

2017-12-07 Thread Natarajan, Rajeswari
Hi,

We have upgraded Solr from 4.5.1 to 4.10.4 and we see an index size reduction.  
We are trying to see whether any optimization was done to
decrease index sizes, but couldn't locate anything.  If anyone knows why, please share.


Thank you,
Rajeswari


Re: Index size optimization between 4.5.1 and 4.10.4 Solr

2017-12-07 Thread Shawn Heisey
On 12/7/2017 1:27 PM, Natarajan, Rajeswari wrote:
> We have upgraded solr from 4.5.1 to 4.10.4 and we see index size reduction.  
> Trying to see if any optimization done to decrease the index sizes , couldn’t 
> locate.  If anyone knows why please share.

Here's a history where you can see a summary of the changes in
Lucene's index format in various versions:

https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/codecs/lucene70/package-summary.html#History

Looking over the history, I would guess that the changes mentioned
between 4.5 and 4.10 would make little difference in most indexes, but
for some configurations, might actually *increase* index size slightly. 
Chances are that the change would only happen after performing some kind
of operation on the whole index, though.

Did you do anything other than simply open the 4.5.1 index in 4.10.4
with the same config/schema?  This would include things like running an
optimize operation on the index, running IndexUpgrader on the index,
completely reindexing from scratch rather than using the old index, or
any number of other possibilities.  Operations like those I mentioned
would have eliminated deleted documents from the index, which can result
in a size reduction.  If you changed your schema at all, that can have
an effect on index size -- in either direction.

Thanks,
Shawn



Re: Index size optimization between 4.5.1 and 4.10.4 Solr

2017-12-07 Thread Natarajan, Rajeswari
Thanks a lot for the response. We did not change schema or config. We simply 
opened 4.5 indexes with 4.10 libraries.
Thank you,
Rajeswari

On 12/7/17, 3:17 PM, "Shawn Heisey"  wrote:

On 12/7/2017 1:27 PM, Natarajan, Rajeswari wrote:
> We have upgraded Solr from 4.5.1 to 4.10.4 and we see an index size 
> reduction.  Trying to see whether any optimization was done to decrease the 
> index sizes, couldn't locate.  If anyone knows why please share.

Here's a history where you can see a summary of the changes in
Lucene's index format in various versions:


https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/codecs/lucene70/package-summary.html#History

Looking over the history, I would guess that the changes mentioned
between 4.5 and 4.10 would make little difference in most indexes, but
for some configurations, might actually *increase* index size slightly. 
Chances are that the change would only happen after performing some kind
of operation on the whole index, though.

Did you do anything other than simply open the 4.5.1 index in 4.10.4
with the same config/schema?  This would include things like running an
optimize operation on the index, running IndexUpgrader on the index,
completely reindexing from scratch rather than using the old index, or
any number of other possibilities.  Operations like those I mentioned
would have eliminated deleted documents from the index, which can result
in a size reduction.  If you changed your schema at all, that can have
an effect on index size -- in either direction.

Thanks,
Shawn





Re: Where can I find documentation to migrate Solr 4 to 5?

2017-12-07 Thread Shawn Heisey
On 12/7/2017 6:55 AM, Gilcan Machado wrote:
> I have a Solr 4 in production (+ Drupal).
>
> And I want to migrate Solr to version 7 (in the end).
>
> But I guess  it's more safe to migrate from 4 to 5 first.
>
> Anyway, I'm searching a lot and I couldn't find a documentation that shows
> how to pick a Solr 4 (in full production) and upgrade to a Solr 5.

The approach that I personally would use for this scenario is to create
a new config/schema for 7.x (either based on the 7.x examples, or
received from whoever wrote the Solr plugin for Drupal), upgrade
directly to the final version, and reindex from scratch into a fresh index.

Compared to version 4, version 7 is a VERY significant upgrade, and it
is likely that at some point in the 4>5>6>7 upgrade process you're
going to want to change something that's going to require a reindex
anyway.  There's even a good chance that you'll be FORCED to do a
reindex, because something in your config/schema might not work in a
newer version.  Whether you're forced into a reindex would depend on how
your indexes are configured.  In three major versions, a lot of older
functionality gets removed, so many 4.x configs will not work in 7.x. 
Many of the changes that are required to successfully upgrade are NOT
compatible with an existing index.

I have never used Drupal, but their Solr plugin probably has an option
to reindex the entire database for situations where the old index is
missing or doesn't have the right info in it.

Thanks,
Shawn



Re: SolrIndexSearcher count

2017-12-07 Thread Shawn Heisey
On 12/5/2017 6:02 AM, Rick Dig wrote:
> is it normal to have many instances (100+) of SolrIndexSearchers to be open
> at the same time? Our Heap Analysis shows this to be the case.
>
> We have autoCommit for every 5 minutes, with openSearcher=true, would this
> close the old searcher and create a new one or just create a new one with
> the old one still not getting dereferenced? if so, when do the older
> searchers get cleaned up ?

How many cores is that Solr instance hosting?

Also, have you verified that those searcher objects are actually live
and not slated for garbage collection?  If not, then what I'm saying
below may not apply, and you should find out how many are live before
going any further.  I can imagine a situation where a heap might have
over 100 searcher objects, but most of them are actually dead and ready
for GC.

Each core will have at least one active searcher object at all times. 
When a commit is made that opens a new searcher, you'll temporarily have
an extra searcher object on that core.  The old one SHOULD be removed as
soon as everything using it (which would include queries) is done with
it and the reference counter is fully decremented.  This is why Erick
asked about custom code.  Sometimes when there are Solr plugins that use
the searcher object, the author doesn't close them properly, and that
can lead to the objects building up and never getting removed.
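
For reference, the pattern custom code should follow looks roughly like this
sketch; omitting the decref() in a finally block is exactly the kind of leak
described above:

    // Hedged sketch of the reference-counting pattern for custom code that
    // borrows a core's searcher. Skipping the finally/decref() keeps old
    // searchers alive after every new-searcher commit.
    import org.apache.solr.core.SolrCore;
    import org.apache.solr.search.SolrIndexSearcher;
    import org.apache.solr.util.RefCounted;

    public class SearcherUser {
        public static int docCount(SolrCore core) {
            RefCounted<SolrIndexSearcher> ref = core.getSearcher(); // increments the count
            try {
                SolrIndexSearcher searcher = ref.get();
                return searcher.getIndexReader().numDocs();
            } finally {
                ref.decref(); // always release, or the searcher can never be closed
            }
        }
    }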

If there are a large number of very slow queries at the same time as
updates are happening and there are frequent commits that open a new
searcher, then you can end up in a situation where there are several
searcher objects which are all running the slow queries at the same time.

I'm not going to claim that a bug in Solr where searchers don't get
closed is impossible, but we haven't seen any evidence in current
versions that this is the case.  Resource leaks have happened in Solr,
but they're very rare.

When I check the list history for the past three years, I only see one
other thread you've been involved in, about a month ago.  The problems
you described at that time and the problem with lots of searcher objects
*might* be related to each other, but I can't be sure.  Can you put a
solr.log file on a sharing/paste website so we can see if there's
anything unusual in it?  You've said that your autoCommit interval is
five minutes, but I suspect that you may have explicit commits happening
on a much more frequent basis.  The solr.log file would reveal what's
actually happening.

Do you have autoSoftCommit configured?  If so, what is the config?  Are
you including a commitWithin parameter on your indexing requests?

Is there ever any time when the Solr instance is mostly idle, when there
are no updates happening and the query rate is zero or very low?  If so,
can you see how many searchers are found on the heap at that time?  Make
sure that the heap dump only includes live objects, so objects that can
be collected as garbage are not included.

Thanks,
Shawn



Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
I am indexing PDFs, and a separate process has converted any image PDFs to 
searchable PDFs before Solr gets near them. I notice that Tika is very slow at parsing 
some PDFs. I don't need any metadata (which I suspect is slowing Tika down), 
just the text. Has anyone used an alternative PDF text extraction library in a 
SolrJ context?


Re: Howto search for § character

2017-12-07 Thread Tim Casey
At my last company we ended up writing a custom analyzer to handle
punctuation.  But this was for Lucene 2 or 3.  That analyzer was carried
forward as we updated and was used for all human-derived text.

Although now there are way better analyzers and way better ways to hook
them up, as noted above by Erick, we really cared about how this was done
and all of the work put into the analyzer paid off.

I would expect there to be an analyzer which would maintain punctuation
tokens for search.  One of the issues which comes up is whether you want
multiple runs of punctuation to be a single token or separate tokens.  So
what happens to "§!"  or "§?" or "?§", and in the case of things like
text/email what happens to "§".

In any event, my 2 pence worth

tim

On Thu, Dec 7, 2017 at 10:00 AM, Shawn Heisey  wrote:

> On 12/7/2017 9:37 AM, Bernd Schmidt wrote:
> > Indeed, I saw in the analysis tab of the solr admin that the § char will
> be removed when using type text_general.
> > But in this use case we want to make a full text search like
> "_text_:§45" or "_text_:§*" to find words starting with §.
> > We need a text field here, not a string field!
> > What is your recommended way to deal with it?
> > Is it possible to remove the word break behaviour for the  § char?
> > Or is the best way to encode all § chars when indexing and searching?
>
> This character is classified by Unicode as punctuation:
>
> http://www.fileformat.info/info/unicode/char/00a7/index.htm
>
> Almost any example field type for full-text search that you're likely to
> encounter is going to be designed to split on punctuation and remove it
> from the token stream.  That's one of the most common things that
> full-text search engines do.
>
> You're going to need to design a new analysis chain that *doesn't* do
> this, apply the fieldType containing that analysis to your field,
> restart/reload, and reindex.
>
> Designing analysis chains is an art form, and tends to be one of the
> hardest parts of setting up a production Solr install.  It took me at
> least a month of almost constant work to settle on the schema design for
> the indexes that I maintain.  All of the "solr.TextField" types in my
> schema are completely custom -- none of the analysis chains in Solr
> examples are in that schema.
>
> Thanks,
> Shawn
>
>


Re: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Erick Erickson
I'm going to guess it's the exact opposite. The meta-data is the "semi
structured" part which is much easier to collect than the PDF. I mean
there are parameters to tweak that consider how much space between
letters in words (in the body text) should be allowed and still
consider it a single word. I'm not quite sure how to prove that, but
I'd be willing to make a bet ;)

Erick

On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden  wrote:
> I am indexing PDFs and a separate process has converted any image PDFs to 
> search PDF before solr gets near it. I notice that tika is very slow at 
> parsing some PDFs. I don't need any metadata (which I suspect is slowing tika 
> down), just the text. Has anyone used an alternative PDF text extraction 
> library in a SOLRJ context?


Re: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Walter Underwood
No need to prove it. More modern PDF formats are easier to decode, but for many 
years the text was move-print-move-print, so the font metrics were necessary to 
guess at spaces.  Plus, the glyph IDs had to be mapped to characters, so some 
PDFs were effectively a substitution code. Our team joked about using cbw 
(crypt breakers workbench) for PDF decoding, but decided it would be a problem 
for export.

I saw one two-column PDF where the glyphs were laid out strictly top to bottom, 
across both columns. Whee!

A friend observed that turning a PDF into a structured document is like turning 
hamburger back into a cow. The PDF standard has improved a lot, but then you 
get an OCR’ed PDF. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 7, 2017, at 5:29 PM, Erick Erickson  wrote:
> 
> I'm going to guess it's the exact opposite. The meta-data is the "semi
> structured" part which is much easier to collect than the PDF. I mean
> there are parameters to tweak that consider how much space between
> letters in words (in the body text) should be allowed and still
> consider it a single word. I'm not quite sure how to prove that, but
> I'd be willing to make a bet ;)
> 
> Erick
> 
> On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden  wrote:
>> I am indexing PDFs and a separate process has converted any image PDFs to 
>> search PDF before solr gets near it. I notice that tika is very slow at 
>> parsing some PDFs. I don't need any metadata (which I suspect is slowing 
>> tika down), just the text. Has anyone used an alternative PDF text 
>> extraction library in a SOLRJ context?



RE: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
Well, I have a lot of OCRed PDFs, but the extremely slow text extraction is hard to 
pin down. The bulk of the OCRed ones aren't too slow, but then I have one that will 
take several minutes.  I use a little utility, pdftotext.exe, to make a crude guess 
at whether OCR is necessary, and it is much faster (but not that easy to use in the 
indexing workflow). Some of the big modern ones (fully digital) can also be very 
slow. Maybe the amount of inline imagery? It doesn't seem to bother pdftotext.
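
One candidate worth testing is PDFBox used directly (it is the library
underneath Tika's PDF parser), skipping the metadata machinery entirely. A
minimal sketch, with no claim that it is faster on the problem files:

    // Minimal sketch: extract only the text from a PDF with PDFBox 2.x,
    // bypassing Tika's metadata extraction entirely.
    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfTextOnly {
        public static void main(String[] args) throws Exception {
            try (PDDocument pdf = PDDocument.load(new File(args[0]))) {
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.setSortByPosition(false); // position sorting costs time
                String text = stripper.getText(pdf);
                System.out.println(text.length() + " characters extracted");
            }
        }
    }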

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Friday, 8 December 2017 3:36 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Alternatives to tika for extracting text out of PDFs

No need to prove it. More modern PDF formats are easier to decode, but for many 
years the text was move-print-move-print, so the font metrics were necessary to 
guess at spaces.  Plus, the glyph IDs had to be mapped to characters, so some 
PDFs were effectively a substitution code. Our team joked about using cow 
(crypt breakers workbench) for PDF decoding, but decided it would be a problem 
for export.

I saw one two-column PDF where the glyphs were laid out strictly top to bottom, 
across both columns. Whee!

A friend observed that turning a PDF into a structured document is like turning 
hamburger back into a cow. The PDF standard has improved a lot, but then you 
get an OCR’ed PDF.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 7, 2017, at 5:29 PM, Erick Erickson  wrote:
>
> I'm going to guess it's the exact opposite. The meta-data is the "semi
> structured" part which is much easier to collect than the PDF. I mean
> there are parameters to tweak that consider how much space between
> letters in words (in the body text) should be allowed and still
> consider it a single word. I'm not quite sure how to prove that, but
> I'd be willing to make a bet ;)
>
> Erick
>
> On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden  wrote:
>> I am indexing PDFs and a separate process has converted any image PDFs to 
>> search PDF before solr gets near it. I notice that tika is very slow at 
>> parsing some PDFs. I don't need any metadata (which I suspect is slowing 
>> tika down), just the text. Has anyone used an alternative PDF text 
>> extraction library in a SOLRJ context?



Re: TransformerFactory does not support SolrCoreAware

2017-12-07 Thread Mikhail Khludnev
I haven't looked at SOLR-8311. But for those who need any plugin class to be
SolrCoreAware, you can mark it as "implements QueryResponseWriter"; this
allows working around the SolrCoreAware restrictions for any class.
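
A hedged sketch of that marker-interface trick (the class name and stub bodies
are illustrative):

    // Hedged sketch: the resource loader only delivers inform(SolrCore) to
    // whitelisted plugin types, and QueryResponseWriter is on that list, so
    // implementing it as a marker lets an arbitrary plugin become core-aware.
    import java.io.IOException;
    import java.io.Writer;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.core.SolrCore;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.QueryResponseWriter;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.util.plugin.SolrCoreAware;

    public class MyCoreAwarePlugin implements SolrCoreAware, QueryResponseWriter {
        private SolrCore core;

        @Override
        public void inform(SolrCore core) { // now actually invoked at core load
            this.core = core;
        }

        // QueryResponseWriter methods are unused stubs, present only as a marker.
        @Override
        public void init(NamedList args) {}

        @Override
        public void write(Writer writer, SolrQueryRequest req, SolrQueryResponse rsp)
                throws IOException {}

        @Override
        public String getContentType(SolrQueryRequest req, SolrQueryResponse rsp) {
            return "text/plain";
        }
    }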

On Thu, Dec 7, 2017 at 11:56 PM, Markus Jelsma 
wrote:

> cc list:
>
> Hello Mikhail,
>
> Well, disregarding the warning notes in SolrResourceLoader, my meager
> patch adds TransformerFactory, and the code now runs well. I obviously lack
> the understanding of this patch with regard to SOLR-8311, but we are fine.
>
> So, the patch and the custom code using it are doing fine so far. Will it
> be fine regarding SOLR-8311, i do not know. I am hoping an expert on these
> specifics can share their opinion.
>
> Thanks,
> Markus
>
>
>
> -Original message-
> > From:Mikhail Khludnev 
> > Sent: Thursday 7th December 2017 21:50
> > To: Markus Jelsma 
> > Subject: Re: TransformerFactory does not support SolrCoreAware
> >
> > Do you know the workaround?
> >
> > On Thu, Dec 7, 2017 at 4:56 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > Created SOLR-11735 for tracking.
> > https://issues.apache.org/jira/browse/SOLR-11735
> >
> > -Original message-
> > > From:Markus Jelsma <markus.jel...@openindex.io>
> > > Sent: Thursday 7th December 2017 14:49
> > > To: Solr-user <solr-user@lucene.apache.org>
> > > Subject: TransformerFactory does not support SolrCoreAware
> > >
> > > Hi,
> > >
> > > I'd love to have this supported, but SOLR-8311 states there are issues,
> > > and I lack the understanding of the mentioned issues. So, can I add it?
> > >
> > > Many thanks,
> > > Markus
> >
> > 
> > --
> > Sincerely yours
> > Mikhail Khludnev
>



-- 
Sincerely yours
Mikhail Khludnev