Fastest database solution

2009-02-06 Thread Curt Hash
I'm writing a small application for detecting source code plagiarism that
currently relies on a database to store lines of code.

The application has two primary functions: adding a new file to the database
and comparing a file to those that are already stored in the database.

I started out using sqlite3, but was not satisfied with the performance
results. I then tried using psycopg2 with a local postgresql server, and the
performance got even worse. My simple benchmarks show that sqlite3 is on
average 3.5 times faster than psycopg2 at inserting a file, and on average
less than a tenth of a second slower at matching one.
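
For reference, here is a stripped-down version of the insert path I am
timing (the schema here is illustrative, not my actual one):

import sqlite3
import time

conn = sqlite3.connect('lines.db')
# illustrative schema, not the real one
conn.execute('CREATE TABLE IF NOT EXISTS lines (hash TEXT, line TEXT)')

# placeholder data standing in for the real hashed source lines
rows = [('hash%d' % i, 'line of code %d' % i) for i in range(10000)]

start = time.time()
conn.executemany('INSERT INTO lines VALUES (?, ?)', rows)
conn.commit()  # one commit for the whole batch; per-row commits are far slower
print 'inserted %d rows in %.3fs' % (len(rows), time.time() - start)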

I expected postgresql to be a lot faster ... is there some peculiarity in
psycopg2 that could be causing the slowdown? Are these performance results
typical? Any suggestions on what to try from here? I don't think my
code/queries are inherently slow, but I'm not a DBA or a very accomplished
Python developer, so I could be wrong.

Any advice is appreciated.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Fastest database solution

2009-02-06 Thread Curt Hash
On Fri, Feb 6, 2009 at 2:12 AM, Roger Binns  wrote:
>
> Curt Hash wrote:
> > I started out using sqlite3, but was not satisfied with the performance
> > results. I then tried using psycopg2 with a local postgresql server, and
> > the performance got even worse.
>
> SQLite is in the same process.  Communication with postgres is via
> another process so marshalling the traffic and context switches will
> impose overhead as you found.
>
> > I don't think
> > my code/queries are inherently slow, but I'm not a DBA or a very
> > accomplished Python developer, so I could be wrong.
>
> It doesn't sound like a database is the best solution to your issue
> anyway.  A better solution would likely be some form of hashing of the
> lines, with the hashes stored in something that gives quick lookups.  The
> hashing would have to ignore details like which variable names are used, etc.
>
> There are already lots of plagiarism detectors out there, so it may be
> more prudent to use one of them, or at least to learn how they work so
> your own system can improve on them.

Currently, I am stripping extra whitespace and end-of-line characters
from each line of source code and storing that in addition to its hash
in a table. That table is used for exact-match comparisons. I am also
passing the source code through flex/bison to canonicalize identifiers
-- the resulting lines are also hashed and stored in a table. That
table is used for structural matching. Both tables are queried to find
matching hashes. I'm not sure how I could make the hash lookups
faster...
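
A condensed sketch of the exact-match path (the names and files here are
illustrative, not the actual code):

import hashlib

def line_hash(line):
    # collapse internal whitespace and drop end-of-line characters
    normalized = ' '.join(line.split())
    return hashlib.sha1(normalized).hexdigest()

# hash every stored line once, then do set-membership tests per suspect line
known = set(line_hash(l) for l in open('stored.py'))
matches = [l for l in open('suspect.py') if line_hash(l) in known]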

On my small test dataset, this solution has detected all of the
plagiarism with high confidence.

It's also beneficial to me to use this Python application as I can
easily integrate it with other Python scripts I use to prepare code
for review.

>
> Roger
>
> --
> http://mail.python.org/mailman/listinfo/python-list
--
http://mail.python.org/mailman/listinfo/python-list


Re: Fastest database solution

2009-02-06 Thread Curt Hash
On Fri, Feb 6, 2009 at 5:19 AM, M.-A. Lemburg  wrote:
> On 2009-02-06 09:10, Curt Hash wrote:
>> I'm writing a small application for detecting source code plagiarism that
>> currently relies on a database to store lines of code.
>>
>> The application has two primary functions: adding a new file to the database
>> and comparing a file to those that are already stored in the database.
>>
>> I started out using sqlite3, but was not satisfied with the performance
>> results. I then tried using psycopg2 with a local postgresql server, and the
>> performance got even worse. My simple benchmarks show that sqlite3 is on
>> average 3.5 times faster than psycopg2 at inserting a file, and on average
>> less than a tenth of a second slower at matching one.
>>
>> I expected postgresql to be a lot faster ... is there some peculiarity in
>> psycopg2 that could be causing the slowdown? Are these performance results
>> typical? Any suggestions on what to try from here? I don't think my
>> code/queries are inherently slow, but I'm not a DBA or a very accomplished
>> Python developer, so I could be wrong.
>>
>> Any advice is appreciated.
>
> In general, if you do bulk inserts into a large table, you should consider
> turning off indexing on the table and recreating/updating the indexes in
> one go afterwards.
>
> But regardless of this detail, I think you should consider a filesystem
> based approach. This is going to be a lot faster than using a
> database to store the source code line by line. You can still use
> a database for the administration and indexing of the data, e.g.
> by storing a hash of each line in the database.
>

I can see how reconstructing source code from individual lines in the
database would be much slower than a filesystem-based approach.
However, what is of particular importance is that the matching itself
be fast. While the original lines of code are stored in the database,
I am performing matching based on only hashes. Would storing the
original code in the same table as the hash cause significant slowdown
if I am querying by hash only?

I think I may try this approach anyway, just to make retrieving the
original source code faster after a match is found, but I am still
primarily concerned with the speed of the hash lookups.
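
For what it's worth, the layout I have in mind is roughly this (schema
simplified; with an index on the hash column, the width of the source
column shouldn't matter much for lookups):

import sqlite3

conn = sqlite3.connect('lines.db')
conn.execute('CREATE TABLE IF NOT EXISTS lines (hash TEXT, source TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_hash ON lines (hash)')

some_hash = '9993a364ddf570ddff78e68eb0a01d36a60d52b5'  # a hash computed as above
cur = conn.execute('SELECT source FROM lines WHERE hash = ?', (some_hash,))
for (source,) in cur:
    print source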

> --
> Marc-Andre Lemburg
> eGenix.com
>
> Professional Python Services directly from the Source  (#1, Feb 06 2009)
>>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/
> 
>
> ::: Try our new mxODBC.Connect Python Database Interface for free ! 
>
>
>   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
>D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
>   Registered at Amtsgericht Duesseldorf: HRB 46611
>   http://www.egenix.com/company/contact/
>
--
http://mail.python.org/mailman/listinfo/python-list


Re: Extract an image from a RTF file

2009-02-14 Thread Curt Hash
On Sat, Feb 14, 2009 at 11:01 AM, Terry Reedy  wrote:
>
> [email protected] wrote:
>>
> >> I have a large number of RTF files where the only thing in them is an
> >> image.  I would like to extract them and save them as a png.
>> Eventually, I would like to also grab some text that is on the image.
>> I think PIL has something for this.
>>
>> Does anyone have any suggestion on how to start this?
>
> The Wikipedia article on Rich Text Format has several links, which lead to
> http://pyrtf.sourceforge.net/
> http://code.google.com/p/pyrtf-ng/
> The former advertises RTF generation, including images.
> The latter advertises RTF generation and parsing, but only claims to be
> a rewrite of the former.
>
> --
> http://mail.python.org/mailman/listinfo/python-list

I've written an RTF parser in Python before, but for the purpose of
filtering and discarding content rather than extracting it.

Take a look at the specification here:
http://www.microsoft.com/downloads/details.aspx?familyid=dd422b8d-ff06-4207-b476-6b5396a18a2b&displaylang=en

You will find that images are specified by one or more RTF control
words followed by a long string of hex data. For this special purpose,
you will not need to write a parser for the entire specification. Just
search the file for the correct sequence of control words, extract the
hex data that follows, and save it to a file.
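
A quick-and-dirty sketch for the common case, where the image data is a
run of plain hex inside a {\pict ...} group (this is an assumption about
your files; \bin-encoded data and nested groups would need a real parser):

import re
import binascii

def extract_pict(rtf_text):
    # grab the first long run of hex digits that follows a \pict control word
    m = re.search(r'\\pict[^{}]*?([0-9a-fA-F\s]{32,})\}', rtf_text)
    if m is None:
        return None
    return binascii.unhexlify(''.join(m.group(1).split()))

data = extract_pict(open('doc.rtf').read())
if data is not None:
    # the actual format depends on the control word (\pngblip, \jpegblip, ...)
    open('image.bin', 'wb').write(data)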

It helps if you open the RTF document in a text editor and locate the
specific control group that contains the image, as the format and
order of the control words vary depending on the application that
created the document. If all of your documents are created with the same
application, it will be much easier.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Searching Google?

2009-02-17 Thread Curt Hash
On Tue, Feb 17, 2009 at 4:15 PM, Oltmans  wrote:
>
> Hey all,
>
> I want to search Google.com using a specific keyword and I just want
> to read back the response using Python. After some thorough Googling I
> realized that I probably need a Search API key to do that. Is that
> correct? Now, I don't have a search key so is there a workaround?
> Please enlighten me.
>
> Thanks,
> Oltmans
> --
> http://mail.python.org/mailman/listinfo/python-list

You just need to change your User-Agent so that Google doesn't know a
Python script is making the request:

import urllib
import urllib2

headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; '
                         'rv:1.9.0.6) Gecko/2009020911 Ubuntu/8.04 '
                         '(hardy) Firefox/3.0.6'}
query = urllib.quote_plus('foo')  # escape the search terms for use in a URL
req = urllib2.Request('http://www.google.com/search?q=' + query,
                      headers=headers)
response = urllib2.urlopen(req)
results = response.read()
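
Note that what comes back is the raw HTML of the results page, so you will
still have to parse the hits out of it yourself.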
--
http://mail.python.org/mailman/listinfo/python-list


Re: Newby Question for reading a file

2009-02-19 Thread Curt Hash
On Thu, Feb 19, 2009 at 12:07 PM, steven.oldner  wrote:
>
> On Feb 19, 12:40 pm, Mike Driscoll  wrote:
> > On Feb 19, 12:32 pm, "steven.oldner"  wrote:
> >
> > > Simple question but I haven't found an answer.  I program in ABAP, and
> > > in ABAP you define the data structure of the file and move the file
> > > line into the structure, and then do something to the fields.  That's
> > > my mental reference.
> >
> > > How do I separate or address each field in the file line with PYTHON?
> > > What's the correct way of thinking?
> >
> > > Thanks!
> >
> > I don't really follow what you mean since I've never used ABAP, but
> > here's how I typically read a file in Python:
> >
> > f = open("someFile.txt")
> > for line in f:
> >     # do something with the line
> >     print line
> > f.close()
> >
> > Of course, you can read just portions of the file too, using something
> > like this:
> >
> > f.read(64)
> >
> > Which will read 64 bytes. For more info, check the following out:
> >
> > http://www.diveintopython.org/file_handling/file_objects.html
> >
> >  - Mike
>
> Hi Mike,
>
> ABAP is loosely based on COBOL.
>
> Here is what I was trying to do, but ended up just coding in ABAP.
>
> Read a 4 column text file of about 1,000 lines and compare the 2
> middle fields of each line.  If there is a difference, output the line.
>
> The line's definition in ABAP is PERNR(8) type c, ENDDA(10) type c,
> BEGDA(10) type c, and LGART(4) type c.
> In ABAP the code is:
> LOOP AT in_file.
>   IF in_file-endda <> in_file-begda.
>     WRITE: / in_file. " that's the same as python's print
>   ENDIF.
> ENDLOOP.
>
> I can read the file, but didn't know how to look at the fields in the
> line.  From what you wrote, I need to read each segment/field of the
> line?
>
> Thanks,
>
> Steve
> --
> http://mail.python.org/mailman/listinfo/python-list

You could do something like this:

f = open('file.txt', 'r')
for line in f:
    # split the line into its four whitespace-separated fields and keep
    # the two middle values (ENDDA and BEGDA) in a and b
    a, b = line.split()[1:-1]
    if a != b:
        print line
f.close()
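
If the fields actually run together with no whitespace between them (which
the fixed-width ABAP declaration suggests), split() won't work; in that
case, slice each line by position instead. A sketch, assuming the PERNR(8),
ENDDA(10), BEGDA(10), LGART(4) layout:

f = open('file.txt', 'r')
for line in f:
    endda = line[8:18]   # ENDDA: the 10 characters after the 8-char PERNR
    begda = line[18:28]  # BEGDA: the next 10 characters
    if endda != begda:
        print line
f.close()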
--
http://mail.python.org/mailman/listinfo/python-list