PDF writer

2016-10-17 Thread Matthew Roth
Hi Group,

Is there a documented or preferred path to have a PDF response writer? I am
using solr 5.3.x for an internal project. I have an XSL-FO transformation
that I am able to return via the XSLT response writer. Is there a
documented way to produce a PDF via solr? Alternatively, I was thinking of
passing the response through an eXist-db instance [0] we have running.
However, a pdf response writer would be ideal.

Best,
Matt

[0] http://exist-db.org/


Re: PDF writer

2016-10-17 Thread Matthew Roth
Thanks Erick. That is as anticipated. Scouring my other resources didn't
indicate the existence of a PDF writer. I thought I'd try the group before
embarking on a custom solution.


Matt

On Mon, Oct 17, 2016 at 11:58 AM, Erick Erickson 
wrote:

> There's no PDF writer that I know of, and I doubt there's much
> enthusiasm for creating one as part of Solr. ResponseWriters are
> pluggable so this would certainly be possible.
>
> At root, in a response writer you just have a map of key/value pairs
> (it's a little more complicated than that, but not much) that you can
> do whatever you want with, either on Solr or on a SolrJ client.
>
> Not much help I know...
>
> Best,
> Erick
>
> On Mon, Oct 17, 2016 at 10:01 AM, Matthew Roth 
> wrote:
> > Hi Group,
> >
> > Is there a documented or preferred path to have a PDF response writer? I
> am
> > using solr 5.3.x for an internal project. I have an XSL-FO transformation
> > that I am able to return via the XSLT response writer. Is there a
> > documented way to produce a PDF via solr? Alternatively, I was thinking
> of
> > passing the response through an eXist-db instance [0] we have running.
> > However, a pdf response writer would be ideal.
> >
> > Best,
> > Matt
> >
> > [0] http://exist-db.org/
>
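
For anyone who does go the plugin route Erick describes: a binary format
like PDF would implement BinaryQueryResponseWriter rather than the plain
QueryResponseWriter. What follows is a rough, untested sketch, assuming
Apache FOP for the FO-to-PDF step; the class name and the fo.xsl stylesheet
path are made-up placeholders, not a real plugin.

    import java.io.*;
    import javax.xml.transform.*;
    import javax.xml.transform.sax.SAXResult;
    import javax.xml.transform.stream.StreamSource;
    import org.apache.fop.apps.*;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.BinaryQueryResponseWriter;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.response.XMLWriter;

    public class PdfResponseWriter implements BinaryQueryResponseWriter {
        private FopFactory fopFactory;

        @Override
        public void init(NamedList args) {
            fopFactory = FopFactory.newInstance(new File(".").toURI());
        }

        @Override
        public void write(OutputStream out, SolrQueryRequest req,
                          SolrQueryResponse rsp) throws IOException {
            // Serialize the response to XML, apply the XSL-FO stylesheet,
            // and let FOP render the result directly onto the output stream.
            StringWriter xml = new StringWriter();
            XMLWriter.writeResponse(xml, req, rsp);
            try {
                Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
                Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new File("fo.xsl")));
                t.transform(new StreamSource(new StringReader(xml.toString())),
                            new SAXResult(fop.getDefaultHandler()));
            } catch (TransformerException | FOPException e) {
                throw new IOException(e);
            }
        }

        @Override
        public void write(Writer writer, SolrQueryRequest req,
                          SolrQueryResponse rsp) {
            throw new UnsupportedOperationException("PDF output is binary");
        }

        @Override
        public String getContentType(SolrQueryRequest req,
                                     SolrQueryResponse rsp) {
            return "application/pdf";
        }
    }

Registered in solrconfig.xml with a queryResponseWriter element, such a
class would be selected with wt=pdf on the request.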


Re: PDF writer

2016-10-21 Thread Matthew Roth
Hi Shawn,

Thanks for the thoughtful response on middleware and the solr philosophy.
You are correct and I intend to handle this outside of Solr. This inquiry
was me doing some forethought on a distant project. When I see an
XSLTResponseWriter the jump-to-conclusions part of my brain jumps to PDF.
The separation you are describing is very logical.

At this point I intend to make use of an XSLT response to produce formatting
objects that I will process at a later point in the application. Or maybe I
won't. Solr isn't my upstream source. The data is relational, but my
indexes are in solr. I could always process the upstream relational data to
produce my PDF reports.


Matt

On Wed, Oct 19, 2016 at 10:53 AM, Shawn Heisey  wrote:

> On 10/17/2016 8:01 AM, Matthew Roth wrote:
> > Is there a documented or preferred path to have a PDF response writer?
> > I am using solr 5.3.x for an internal project. I have an XSL-FO
> > transformation that I am able to return via the XSLT response writer.
> > Is there a documented way to produce a PDF via solr? Alternatively, I
> > was thinking of passing the response through an eXist-db instance [0]
> > we have running. However, a pdf response writer would be ideal.
>
> Solr responses are designed to be processed by a program making a search
> query, not read by an end user.  Solr is middleware.  There are multiple
> formats (json, xml, javabin) because we do not know what kind of program
> will consume the response.
>
> https://en.wikipedia.org/wiki/Middleware
>
> PDF is an end-user format for display and print, not a middleware
> response format.  Creating content like that is best handled by other
> pieces of software, not Solr.
>
> For best results that fit your needs perfectly, that software is likely
> to be something you write yourself.  The Solr project has absolutely no
> idea how you will define your schema, or how you would like the data in
> a Solr response transformed, integrated, and formatted in a PDF.
>
> Designing the feature you want would be something best handled as a
> software project separate from Solr.  The software would take a Solr
> response and turn it into a PDF.  It doesn't fit into Solr's core usage,
> so making it a part of Solr is not a good fit and unlikely to happen.
>
> No matter where the development for a general feature like that happens,
> it would likely take weeks or months of work just to reach alpha
> quality.  After that, it would take weeks or months of additional work
> to reach release quality ... and even then it probably wouldn't produce
> the exact results you want without extensive and complicated
> configuration.  Handling complicated configuration is itself very
> complicated, which is one reason why development would take so long.
>
> Thanks,
> Shawn
>
>
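
To make the separation Shawn describes concrete: the layer outside Solr can
be very thin. A minimal, untested sketch, assuming Apache FOP; the URL, core
name, and fo.xsl stylesheet (which would sit in the core's conf/xslt
directory for the XSLT response writer to find) are placeholders.

    import java.io.*;
    import java.net.URL;
    import javax.xml.transform.*;
    import javax.xml.transform.sax.SAXResult;
    import javax.xml.transform.stream.StreamSource;
    import org.apache.fop.apps.*;

    public class ReportBuilder {
        public static void main(String[] args) throws Exception {
            // Ask Solr for XSL-FO via the XSLT response writer, then feed
            // the FO stream to FOP through an identity transform.
            URL url = new URL("http://localhost:8983/solr/mycore/select"
                    + "?q=*:*&wt=xslt&tr=fo.xsl");
            FopFactory fopFactory = FopFactory.newInstance(new File(".").toURI());
            try (InputStream fo = url.openStream();
                 OutputStream out = new FileOutputStream("report.pdf")) {
                Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
                Transformer identity = TransformerFactory.newInstance()
                        .newTransformer();
                identity.transform(new StreamSource(fo),
                        new SAXResult(fop.getDefaultHandler()));
            }
        }
    }

This keeps Solr as middleware: it only serves the FO, and the report
formatting lives in a separate piece of software.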


Re: PDF writer

2016-10-21 Thread Matthew Roth
> I think this is the best option.

I really do too, once I think about it some more. Rubber Ducky strikes
again: once I say it aloud--in this case, type it out--it seems much clearer
what the answer to this question is.

Thanks again. I've really appreciated all the feedback on this question.

Matt


On Fri, Oct 21, 2016 at 10:44 AM, Davis, Daniel (NIH/NLM) [C] <
daniel.da...@nih.gov> wrote:

> If the PDF report is truly a report, I agree with this.  We have a
> use-case with IBM InfoSphere Watson Explorer where our users want a PDF
> report on the results for their query to be generated on the fly.  They
> can then save the query and have the report emailed to them :)  Not only
> is Solr middleware - search engines in general should be middleware,
> because these sorts of business requirements keep coming up.  We've
> invested a lot in IBM InfoSphere Watson Explorer because it can create a
> GUI for us, but that often ends up biting you in the end.
>
> This creates search UIs that are maintained by the "search team" while
> the corresponding application is maintained by the "developer team", and
> so the look and feel can often be replicated, but using different HTML,
> JavaScript, and CSS.  So, updates can be hard, and achieving the same
> mobile responsive behavior can be nearly impossible.
>
> Search engines *should* be middleware.   I value having a back-office for
> crawling the web that allows a crawl to be defined entirely through a GUI,
> but question whether it really is much better than a FOSS architecture.
>
> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Friday, October 21, 2016 10:35 AM
> To: solr-user 
> Subject: Re: PDF writer
>
> On 21 October 2016 at 09:58, Matthew Roth  wrote:
> > I could always process the upstream relational data to produce my
> > PDF reports.
>
> I think this is the best option. This allows you to mangle/de-normalize
> your data stored in Solr to be the best fit for search.
>
> Regards,
>Alex.
> 
> Solr Example reading group is starting November 2016, join us at
> http://j.mp/SolrERG Newsletter and resources for Solr beginners and
> intermediates:
> http://www.solr-start.com/
>


indexing XML stored on HDFS

2017-12-06 Thread Matthew Roth
Hi All,

Is there a DIH for HDFS? I see this old feature request [0] that never seems
to have gone anywhere. Google searches and searches on this list don't get me
too far.

Essentially my workflow is that I have many thousands of XML documents
stored in HDFS. I run an XSLT transformation in Spark [1]. This transforms to
the expected solr input of <add><doc>...</doc></add>. This is then written
back to HDFS. Now how do I get it back to solr? I suppose I could move the
data back to the local fs, but on the surface that feels like the wrong way.

I don't need to store the documents in HDFS after the Spark transformation,
so I wonder if I can write them using solrj. However, I am not really
familiar with solrj. I am also running a single node. Most of the material I
have read on spark-solr expects you to be running SolrCloud.

Best,
Matt



[0] https://issues.apache.org/jira/browse/SOLR-2096
[1] https://github.com/elsevierlabs-os/spark-xml-utils
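
For the solrj option mentioned above, very little code is needed to push
documents into a standalone node. A minimal, untested sketch, assuming
SolrJ 6.x; the URL and field names are placeholders.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SimpleIndexer {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                doc.addField("title_t", "example title");
                client.add(doc);    // repeat per document, batching as needed
                client.commit();    // commit once at the end, not per add
            }
        }
    }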


Re: indexing XML stored on HDFS

2017-12-07 Thread Matthew Roth
Yes, the post tool would also be an acceptable option, and one I am familiar
with. However, I am also not seeing exactly how I would query hdfs. The
hadoop-solr [0
<https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers>] tool by
lucidworks looks the most promising. I have a meeting to attend to shortly,
and maybe I can explore that further in the afternoon.

I also would like to look further into solrj. I have no real reason to
store the results of the XSL transformation anywhere other than solr. I am
simply not familiar with it. But on the surface it seems like it might be
the most performant way to handle this problem.

If I do pursue this with solrj and spark, will solr handle multiple solrj
connections all trying to add documents?

[0] https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers

On Wed, Dec 6, 2017 at 5:36 PM, Erick Erickson 
wrote:

> Perhaps the bin/post tool? See:
> https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
>
> On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth  wrote:
> > Hi All,
> >
> > Is there a DIH for HDFS? I see this old feature request [0
> > <https://issues.apache.org/jira/browse/SOLR-2096>] that never seems to
> > have gone anywhere. Google searches and searches on this list don't get
> > me too far.
> >
> > Essentially my workflow is that I have many thousands of XML documents
> > stored in hdfs. I run an xslt transformation in spark [1
> > <https://github.com/elsevierlabs-os/spark-xml-utils>]. This transforms
> > to the expected solr input of <add><doc>...</doc></add>. This is then
> > written back to HDFS. Now how do I get it back to solr? I
> suppose
> > I could move the data back to the local fs, but on the surface that feels
> > like the wrong way.
> >
> > I don't need to store the documents in HDFS after the spark
> transformation,
> > I wonder if I can write them using solrj. However, I am not really
> familiar
> > with solrj. I am also running a single node. Most of the material I have
> > read on spark-solr expects you to be running SolrCloud.
> >
> > Best,
> > Matt
> >
> >
> >
> > [0] https://issues.apache.org/jira/browse/SOLR-2096
> > [1] https://github.com/elsevierlabs-os/spark-xml-utils
>
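
On the closing question about multiple solrj connections: Solr itself
accepts concurrent update requests without trouble, and SolrJ ships
ConcurrentUpdateSolrClient for exactly this bulk pattern, buffering adds in
a queue and streaming them to Solr from background threads. A sketch under
those assumptions (SolrJ 6.x Builder API); the URL, queue size, and thread
count are placeholders to tune.

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            try (ConcurrentUpdateSolrClient client =
                    new ConcurrentUpdateSolrClient.Builder(
                            "http://localhost:8983/solr/mycore")
                        .withQueueSize(1000)
                        .withThreadCount(4)
                        .build()) {
                for (int i = 0; i < 10_000; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "doc-" + i);
                    client.add(doc);  // queued; sent by background threads
                }
                client.commit();
            }
        }
    }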


Re: indexing XML stored on HDFS

2017-12-08 Thread Matthew Roth
Thanks Rick,

While long-term storage of the documents in HDFS is not necessary, you do
raise the point that easy access to these documents during the development
phase will be useful.

Cassandra,

With spark-solr I am under the impression that I must be running SolrCloud.
At this time I need some features that are not available in SolrCloud, e.g.
joining across cores. Additionally, the projected demands on solr mean
running it as a single node will be acceptable.

The hadoop-solr project does look the most promising at the moment. I am
hoping to play with it some this afternoon, but it may have to wait until
next week.

Thanks for the help.

Best,
Matt

On Fri, Dec 8, 2017 at 1:36 PM, Cassandra Targett 
wrote:

> Matthew,
>
> The hadoop-solr project you mention would give you the ability to index
> files in HDFS. It's a Job Jar, so you submit it to Hadoop with the params
> you need and it processes the files and sends them to Solr. It might not be
> the fastest thing in the world since it uses MapReduce but we (I work at
> Lucidworks) do have a number of people using it.
>
> However, you mention that you're already processing your files with Spark,
> and you don't really need them in HDFS in the long run - have you seen the
> Spark-Solr project at https://github.com/lucidworks/spark-solr/? It has an
> RDD for indexing docs to Solr, so you would be able to get the files from
> wherever they originate, transform them in Spark, and get them into Solr.
> It might be a better solution for your existing workflow.
>
> Hope it helps -
> Cassandra
>
> On Thu, Dec 7, 2017 at 9:03 AM, Matthew Roth  wrote:
>
> > Yes the post tool would also be an acceptable option and one I am
> familiar
> > with. However, I also am not seeing exactly how I would query hdfs. The
> > hadoop-solr [0
> > <https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers>] tool by
> > lucidworks looks the most promising. I have a meeting to attend to
> shortly,
> > and maybe I can explore that further in the afternoon.
> >
> > I also would like to look further into solrj. I have no real reason to
> > store the results of the XSL transformation anywhere other than solr. I
> am
> > simply not familiar with it. But on the surface it seems like it might be
> > the most performant way to handle this problem.
> >
> > If I do pursue this with solrj and spark will solr handle multiple solrj
> > connections all trying to add documents?
> >
> > [0] https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers
> >
> > On Wed, Dec 6, 2017 at 5:36 PM, Erick Erickson 
> > wrote:
> >
> > > Perhaps the bin/post tool? See:
> > > https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
> > >
> > > On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth 
> wrote:
> > > > Hi All,
> > > >
> > > > Is there a DIH for HDFS? I see this old feature request [0
> > > > <https://issues.apache.org/jira/browse/SOLR-2096>] that never seems
> > > > to have gone anywhere. Google searches and searches on this list
> > > > don't get me too far.
> > > >
> > > > Essentially my workflow is that I have many thousands of XML
> documents
> > > > stored in hdfs. I run an xslt transformation in spark [1
> > > > <https://github.com/elsevierlabs-os/spark-xml-utils>]. This
> > > > transforms to the expected solr input of <add><doc>...</doc></add>.
> > > > This is then written back to HDFS. Now how do I get it back to solr? I
> > > suppose
> > > > I could move the data back to the local fs, but on the surface that
> > feels
> > > > like the wrong way.
> > > >
> > > > I don't need to store the documents in HDFS after the spark
> > > transformation,
> > > > I wonder if I can write them using solrj. However, I am not really
> > > familiar
> > > > with solrj. I am also running a single node. Most of the material I
> > have
> > > > read on spark-solr expects you to be running SolrCloud.
> > > >
> > > > Best,
> > > > Matt
> > > >
> > > >
> > > >
> > > > [0] https://issues.apache.org/jira/browse/SOLR-2096
> > > > [1] https://github.com/elsevierlabs-os/spark-xml-utils
> > >
> >
>
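
If spark-solr's SolrCloud orientation rules it out here, plain SolrJ from
inside Spark partitions is a workable fallback against a standalone node.
An untested sketch, assuming the XSLT step can be made to yield a JavaRDD
of SolrInputDocument; the URL and batch size are placeholders.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.spark.api.java.JavaRDD;

    public class SparkToSolr {
        // One client per partition, batched adds, one commit at the end.
        static void index(JavaRDD<SolrInputDocument> rdd, String solrUrl)
                throws Exception {
            rdd.foreachPartition(docs -> {
                try (SolrClient client =
                        new HttpSolrClient.Builder(solrUrl).build()) {
                    List<SolrInputDocument> batch = new ArrayList<>();
                    while (docs.hasNext()) {
                        batch.add(docs.next());
                        if (batch.size() >= 500) {
                            client.add(batch);
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        client.add(batch);
                    }
                }
            });
            try (SolrClient client =
                    new HttpSolrClient.Builder(solrUrl).build()) {
                client.commit();  // single commit from the driver
            }
        }
    }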


Build suggester in different directory (not /tmp).

2017-12-20 Thread Matthew Roth
Hi List,

I am building a few suggesters and I am receiving an error that I have no
space left on device.



No space left on device

java.io.IOException: No space left on device at
sun.nio.ch.FileDispatcherImpl.write0(Native Method) at
...



At first this threw me: df showed I had over 100G free, and the /data dir
the suggester is being constructed from is only 4G. On a subsequent run I
noticed that the suggester is first being built in /tmp. When setting up the
LVM I only allotted 2G to that directory, and I prefer to keep it that way.
Is there a way to build the suggesters in an alternative dir? I am not
seeing anything in the documentation
(https://lucene.apache.org/solr/guide/6_6/suggester.html)

I should note that I am using solr 6.6.0

Best,
Matt


Re: No space left on device - When I execute suggester component.

2017-12-20 Thread Matthew Roth
Oh, this seems relevant to my recent post to the list. My problem is that
the suggesters are first being built in /tmp and then moved to /var. /tmp
has a total of 2G free whereas /var has nearly 100G.

Perhaps you are running into the same problem I am in this regard? How does
your /tmp dir look when building?

Matt


On Wed, Dec 20, 2017 at 2:59 AM, Shawn Heisey  wrote:

> On 12/20/2017 12:21 AM, Fiz Newyorker wrote:
>
>> I tried df -h , during suggest.build command.
>>
>> Size.   Used   Avail Use%  Mounted on
>>
>>   63G   17G 44G  28% /ngs/app
>>
>
> That cannot be the entire output of that command.  Here's what I get when
> I do it:
>
> root@smeagol:~# df -h
> Filesystem  Size  Used Avail Use% Mounted on
> udev 12G 0   12G   0% /dev
> tmpfs   2.4G  251M  2.2G  11% /run
> /dev/sda5   220G   15G  194G   8% /
> tmpfs12G  412K   12G   1% /dev/shm
> tmpfs   5.0M 0  5.0M   0% /run/lock
> tmpfs12G 0   12G   0% /sys/fs/cgroup
> /dev/sda147G  248M   45G   1% /boot
> tmpfs   2.4G   84K  2.4G   1% /run/user/1000
> tmpfs   2.4G 0  2.4G   0% /run/user/141
> tmpfs   2.4G 0  2.4G   0% /run/user/0
>
> If the disk has enough free space, then there is probably something else
> at work, like a filesystem quota for the user that is running Solr, or some
> other kind of limitation that has been configured.
>
> Thanks,
> Shawn
>


Re: Build suggester in different directory (not /tmp).

2017-12-20 Thread Matthew Roth
I have an incomplete solution. I was trying to build three suggesters at
once. If I added the suggest.dictionary parameter and built one at a time,
it worked out fine. However, this means I will need to set buildOnCommit and
buildOnStartup to false. This is less than ideal.
Building in a different directory would still be preferable.


Best,
Matt

On Wed, Dec 20, 2017 at 12:05 PM, Matthew Roth  wrote:

> Hi List,
>
> I am building a few suggesters and I am receiving an error that I have
> no space left on device.
>
>
> 
> No space left on device
> 
> java.io.IOException: No space left on device at
> sun.nio.ch.FileDispatcherImpl.write0(Native Method) at
> ...
>
>
>
> At first this threw me: df showed I had over 100G free, and the /data dir
> the suggester is being constructed from is only 4G. On a subsequent run I
> noticed that the suggester is first being built in /tmp. When setting up
> the LVM I only allotted 2G to that directory, and I prefer to keep it that
> way. Is there a way to build the suggesters in an alternative dir? I am
> not seeing anything in the documentation (https://lucene.apache.org/
> solr/guide/6_6/suggester.html)
>
> I should note that I am using solr 6.6.0
>
> Best,
> Matt
>
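
With buildOnCommit and buildOnStartup set to false, the builds described
above can be fired explicitly, one dictionary at a time, once indexing
finishes. A sketch using SolrJ, assuming a /suggest handler is configured;
the URL and dictionary names are placeholders.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class SuggesterBuilder {
        public static void main(String[] args) throws Exception {
            String[] dictionaries = {"titleSuggester", "authorSuggester",
                                     "subjectSuggester"};
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build()) {
                for (String dict : dictionaries) {
                    SolrQuery q = new SolrQuery();
                    q.setRequestHandler("/suggest");
                    q.set("suggest", true);
                    q.set("suggest.build", true);
                    q.set("suggest.dictionary", dict);
                    client.query(q);  // builds one dictionary at a time
                }
            }
        }
    }

Building serially like this keeps only one suggester's temporary files in
java.io.tmpdir at any moment, which also eases the /tmp space pressure.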


Re: Build suggester in different directory (not /tmp).

2017-12-20 Thread Matthew Roth
Thanks Erick,

I'll heed your warning. Ultimately, the index will be rather static, so I do
not fear much from buildOnCommit. But I think buildOnStartup would likely be
set to false regardless.

Shawn,

Thank you as well. That is very informative regarding java.io.tmpdir. I am
starting this as a service, but I think I can handle making the required
changes.

Best,
Matt

On Wed, Dec 20, 2017 at 2:58 PM, Shawn Heisey  wrote:

> On 12/20/2017 10:05 AM, Matthew Roth wrote:
> > I am building a few suggesters and I am receiving an error that I have
> > no space left on device.
>
> 
>
> > At first this threw me: df showed I had over 100G free, and the /data
> > dir the suggester is being constructed from is only 4G. On a subsequent
> > run I noticed that the suggester is first being built in /tmp. When
> > setting up the LVM I only allotted 2G to that directory, and I prefer to
> > keep it that way.
>
> The code is utilizing the "java.io.tmpdir" system property to determine
> a temporary directory location to use for the build, before it is put in
> the final location.  On POSIX platforms, this will default to /tmp.
>
> If you are starting Solr manually, then you would just need to add the
> following parameter to the bin/solr commandline (including the quotes)
> to change this location:
>
> -a "-Djava.io.tmpdir=/other/tmp/path"
>
> If you've installed Solr as a service, then I do not think there's any
> easy way to adjust this property, other than manually editing bin/solr
> to add the -D option to the startup commandline.  We'll need an
> enhancement issue in Jira to modify the script so it can set
> java.io.tmpdir from an environment variable.
>
> Note that adjusting this property may result in other things that Solr
> creates being moved away from /tmp.
>
> Since most POSIX operating systems will automatically delete old files
> in /tmp, it's always possible that when you move Java's temp directory,
> you'll end up with cruft in the new location that never gets deleted.
> Developers do generally try to clean up temporary files, but sometimes
> things go wrong that weren't anticipated.  If that does happen and a
> temporary file is created by Lucene/Solr that doesn't get deleted, then
> I would consider that a bug that should be fixed.
>
> On Windows systems, Java asks the OS where the temp directory is.  The
> info I've found says that the TMP environment variable will override
> this location for Windows, but not for other platforms.
>
> Thanks,
> Shawn
>
>


Re: Build suggester in different directory (not /tmp).

2017-12-20 Thread Matthew Roth
Erick,

Oh, yes, I think I was misunderstanding buildOnCommit. I presumed it would
run following the completion of my DIH. The behavior you described would be
very problematic!

Thank you for taking the time to point that out!

Best,
Matt

On Wed, Dec 20, 2017 at 3:47 PM, Erick Erickson 
wrote:

> Matthew:
>
> I think you'll be awfully unhappy with buildOnCommit. Say you're
> bulk-indexing and committing every 15 seconds...
>
> buildOnStartup is problematical as well since it'd rebuild everytime
> you bounced Solr even if the index hadn't changed.
>
> Personally I'd alter my indexing process to fire a build command when
> it was done.
>
> Or, if you can afford to optimize after _every_ set of updates (say
> you only update every day or less often) then buildOnOptimize makes
> sense.
>
> Best,
> Erick
>
> On Wed, Dec 20, 2017 at 12:40 PM, Matthew Roth  wrote:
> > Thanks Erick,
> >
> > I'll heed your warning. Ultimately, the index will be rather static, so
> > I do not fear much from buildOnCommit. But I think buildOnStartup would
> > likely be set to false regardless.
> >
> > Shawn,
> >
> > Thank you as well. That is very informative regarding java.io.tmpdir. I
> am
> > starting this as a service, but I think I can handle making the required
> > changes.
> >
> > Best,
> > Matt
> >
> > On Wed, Dec 20, 2017 at 2:58 PM, Shawn Heisey 
> wrote:
> >
> >> On 12/20/2017 10:05 AM, Matthew Roth wrote:
> >> > I am building a few suggesters and I am receiving an error that I
> >> > have no space left on device.
> >>
> >> 
> >>
> >> > At first this threw me: df showed I had over 100G free, and the
> >> > /data dir the suggester is being constructed from is only 4G. On a
> >> > subsequent run I noticed that the suggester is first being built in
> >> > /tmp. When setting up the LVM I only allotted 2G to that directory,
> >> > and I prefer to keep it that way.
> >>
> >> The code is utilizing the "java.io.tmpdir" system property to determine
> >> a temporary directory location to use for the build, before it is put in
> >> the final location.  On POSIX platforms, this will default to /tmp.
> >>
> >> If you are starting Solr manually, then you would just need to add the
> >> following parameter to the bin/solr commandline (including the quotes)
> >> to change this location:
> >>
> >> -a "-Djava.io.tmpdir=/other/tmp/path"
> >>
> >> If you've installed Solr as a service, then I do not think there's any
> >> easy way to adjust this property, other than manually editing bin/solr
> >> to add the -D option to the startup commandline.  We'll need an
> >> enhancement issue in Jira to modify the script so it can set
> >> java.io.tmpdir from an environment variable.
> >>
> >> Note that adjusting this property may result in other things that Solr
> >> creates being moved away from /tmp.
> >>
> >> Since most POSIX operating systems will automatically delete old files
> >> in /tmp, it's always possible that when you move Java's temp directory,
> >> you'll end up with cruft in the new location that never gets deleted.
> >> Developers do generally try to clean up temporary files, but sometimes
> >> things go wrong that weren't anticipated.  If that does happen and a
> >> temporary file is created by Lucene/Solr that doesn't get deleted, then
> >> I would consider that a bug that should be fixed.
> >>
> >> On Windows systems, Java asks the OS where the temp directory is.  The
> >> info I've found says that the TMP environment variable will override
> >> this location for Windows, but not for other platforms.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>