Fastest database solution
I'm writing a small application for detecting source code plagiarism that
currently relies on a database to store lines of code.

The application has two primary functions: adding a new file to the database
and comparing a file to those that are already stored in the database.

I started out using sqlite3, but was not satisfied with the performance
results. I then tried using psycopg2 with a local PostgreSQL server, and the
performance got even worse. My simple benchmarks show that sqlite3 is an
average of 3.5 times faster at inserting a file, and on average less than a
tenth of a second slower than psycopg2 at matching a file.

I expected PostgreSQL to be a lot faster ... is there some peculiarity in
psycopg2 that could be causing the slowdown? Are these performance results
typical? Any suggestions on what to try from here? I don't think my
code/queries are inherently slow, but I'm not a DBA or a very accomplished
Python developer, so I could be wrong.

Any advice is appreciated.
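For concreteness, here is a minimal sketch of what the insert path might look
like (the schema, file names, and hashing scheme are assumptions for
illustration, not the poster's actual code). With sqlite3, batching all of a
file's rows into a single transaction via executemany is usually the first
thing to verify when insert performance disappoints:

import hashlib
import sqlite3

conn = sqlite3.connect('plagiarism.db')  # hypothetical database file
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS lines (hash TEXT, line TEXT)')

def add_file(path):
    f = open(path)
    # hash each stripped line; a single executemany call keeps the
    # whole file's insert inside one transaction
    rows = [(hashlib.sha1(line.strip()).hexdigest(), line.strip())
            for line in f]
    f.close()
    cur.executemany('INSERT INTO lines VALUES (?, ?)', rows)
    conn.commit()

add_file('sample.py')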
Re: Fastest database solution
On Fri, Feb 6, 2009 at 2:12 AM, Roger Binns wrote:
> Curt Hash wrote:
> > I started out using sqlite3, but was not satisfied with the performance
> > results. I then tried using psycopg2 with a local postgresql server, and
> > the performance got even worse.
>
> SQLite is in the same process. Communication with postgres is via another
> process, so marshalling the traffic and context switches will impose
> overhead, as you found.
>
> > I don't think my code/queries are inherently slow, but I'm not a DBA or
> > a very accomplished Python developer, so I could be wrong.
>
> It doesn't sound like a database is the best solution to your issue
> anyway. A better solution would likely be some form of hashing the lines
> and storing something that gives quick hash lookups. The hash would have
> to do things like not care what variable names are used, etc.
>
> There are already lots of plagiarism detectors out there, so it may be
> more prudent to use one of them, or at least to learn how they do things
> so your own system could improve on them.

Currently, I am stripping extra whitespace and end-of-line characters from
each line of source code and storing that, in addition to its hash, in a
table. That table is used for exact-match comparisons. I am also passing the
source code through flex/bison to canonicalize identifiers -- the resulting
lines are also hashed and stored in a table. That table is used for
structural matching. Both tables are queried to find matching hashes. I'm
not sure how I could make the hash lookups faster...

On my small test dataset, this solution has detected all of the plagiarism
with high confidence. It's also beneficial to me to use this Python
application, as I can easily integrate it with other Python scripts I use to
prepare code for review.
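A rough sketch of the two-table lookup described above (the table and column
names are illustrative, and the flex/bison pass is stood in for by a
placeholder function):

import hashlib

def normalize(line):
    # strip extra whitespace and end-of-line characters
    return ' '.join(line.split())

def canonicalize(line):
    # placeholder for the flex/bison pass that renames identifiers
    return line

def match_line(cur, line):
    # query both tables by hash alone
    exact = hashlib.sha1(normalize(line)).hexdigest()
    structural = hashlib.sha1(canonicalize(normalize(line))).hexdigest()
    cur.execute('SELECT file_id FROM exact_hashes WHERE hash = ?',
                (exact,))
    exact_matches = cur.fetchall()
    cur.execute('SELECT file_id FROM struct_hashes WHERE hash = ?',
                (structural,))
    structural_matches = cur.fetchall()
    return exact_matches, structural_matches

Making these lookups faster is then mostly a question of indexing the hash
columns.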
Re: Fastest database solution
On Fri, Feb 6, 2009 at 5:19 AM, M.-A. Lemburg wrote:
> On 2009-02-06 09:10, Curt Hash wrote:
> > I'm writing a small application for detecting source code plagiarism
> > that currently relies on a database to store lines of code.
> >
> > The application has two primary functions: adding a new file to the
> > database and comparing a file to those that are already stored in the
> > database.
> >
> > I started out using sqlite3, but was not satisfied with the performance
> > results. I then tried using psycopg2 with a local postgresql server,
> > and the performance got even worse. My simple benchmarks show that
> > sqlite3 is an average of 3.5 times faster at inserting a file, and on
> > average less than a tenth of a second slower than psycopg2 at matching
> > a file.
> >
> > I expected postgresql to be a lot faster ... is there some peculiarity
> > in psycopg2 that could be causing the slowdown? Are these performance
> > results typical? Any suggestions on what to try from here? I don't
> > think my code/queries are inherently slow, but I'm not a DBA or a very
> > accomplished Python developer, so I could be wrong.
> >
> > Any advice is appreciated.
>
> In general, if you do a bulk insert into a large table, you should
> consider turning off indexing on the table and recreating/updating the
> indexes in one go afterwards.
>
> But regardless of this detail, I think you should consider a
> filesystem-based approach. This is going to be a lot faster than using a
> database to store the source code line by line. You can still use a
> database for the administration and indexing of the data, e.g. by storing
> a hash of each line in the database.

I can see how reconstructing source code from individual lines in the
database would be much slower than a filesystem-based approach. However,
what is of particular importance is that the matching itself be fast. While
the original lines of code are stored in the database, I am performing
matching based only on hashes. Would storing the original code in the same
table as the hash cause significant slowdown if I am querying by hash only?

I think I may try this approach anyway, just to make retrieving the original
source code after finding a match faster, but I am still primarily concerned
with the speed of the hash lookups.
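In sqlite3 terms, the bulk-load advice might look like the sketch below (the
index and table names are made up for illustration). As for the question
above: with an index on the hash column, a query that filters on the hash
alone searches only the index and fetches just the matching rows, so keeping
the original line in the same table should not noticeably slow the lookup.

import sqlite3

conn = sqlite3.connect('plagiarism.db')  # hypothetical database file
cur = conn.cursor()

def bulk_load(rows):
    # rows: a list of (hash, line) tuples covering many files
    cur.execute('DROP INDEX IF EXISTS idx_hash')  # no per-row index upkeep
    cur.executemany('INSERT INTO lines VALUES (?, ?)', rows)
    cur.execute('CREATE INDEX idx_hash ON lines (hash)')  # rebuild once
    conn.commit()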
Re: Extract an image from a RTF file
On Sat, Feb 14, 2009 at 11:01 AM, Terry Reedy wrote:
> [email protected] wrote:
> > I have a large amount of RTF files where the only thing in them is an
> > image. I would like to extract them and save them as PNGs. Eventually,
> > I would like to also grab some text that is on the image. I think PIL
> > has something for this.
> >
> > Does anyone have any suggestions on how to start this?
>
> The Wikipedia article on Rich Text Format has several links, which lead to
> http://pyrtf.sourceforge.net/
> http://code.google.com/p/pyrtf-ng/
> The former says rtf generation, including images.
> The latter says rtf generation and parsing, but only claims to be a
> rewrite of the former.

I've written an RTF parser in Python before, but for the purpose of
filtering and discarding content rather than extracting it. Take a look at
the specification here:

http://www.microsoft.com/downloads/details.aspx?familyid=dd422b8d-ff06-4207-b476-6b5396a18a2b&displaylang=en

You will find that images are specified by one or more RTF control words
followed by a long string of hex data. For this special purpose, you will
not need to write a parser for the entire specification. Just search the
file for the correct sequence of control words, extract the hex data that
follows, and save it to a file.

It helps if you open the RTF document in a text editor and locate the
specific control group that contains the image, as the format and order of
control words vary depending on the application that created the document.
If all of your documents were created with the same application, this will
be much easier.
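As a rough sketch of that search-and-extract approach -- assuming a PNG
stored in a {\pict ... \pngblip ...} group with no nested groups before the
hex run; the file names and pattern are illustrative only:

import re
import binascii

rtf = open('scan.rtf').read()  # hypothetical input file

# A \pict group ends with a run of hex digit pairs just before its
# closing brace; grab that run.
match = re.search(r'\{\\pict[^}]*?((?:[0-9a-fA-F]{2}\s*)+)\}', rtf)
if match:
    hexdata = ''.join(match.group(1).split())  # drop embedded line breaks
    out = open('scan.png', 'wb')
    out.write(binascii.unhexlify(hexdata))
    out.close()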
Re: Searching Google?
On Tue, Feb 17, 2009 at 4:15 PM, Oltmans wrote:
>
> Hey all,
>
> I want to search Google.com using a specific keyword and I just want
> to read back the response using Python. After some thorough Googling I
> realized that I probably need a Search API key to do that. Is that
> correct? Now, I don't have a search key, so is there a workaround?
> Please enlighten me.
>
> Thanks,
> Oltmans
You just need to change your User-Agent so that Google doesn't know a
Python script is making the request:
import urllib
import urllib2

# Spoof a browser User-Agent; the default Python-urllib agent is what
# gives the script away.
headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; '
                         'rv:1.9.0.6) Gecko/2009020911 Ubuntu/8.04 '
                         '(hardy) Firefox/3.0.6'}
query = urllib.quote_plus('foo')  # URL-encode the search terms
req = urllib2.Request('http://www.google.com/search?q=' + query,
                      headers=headers)
response = urllib2.urlopen(req)
results = response.read()
Re: Newby Question for reading a file
On Thu, Feb 19, 2009 at 12:07 PM, steven.oldner wrote:
>
> On Feb 19, 12:40 pm, Mike Driscoll wrote:
> > On Feb 19, 12:32 pm, "steven.oldner" wrote:
> >
> > > Simple question but I haven't found an answer. I program in ABAP, and
> > > in ABAP you define the data structure of the file and move the file
> > > line into the structure, and then do something to the fields. That's
> > > my mental reference.
> >
> > > How do I separate or address each field in the file line with PYTHON?
> > > What's the correct way of thinking?
> >
> > > Thanks!
> >
> > I don't really follow what you mean since I've never used ABAP, but
> > here's how I typically read a file in Python:
> >
> > f = open("someFile.txt")
> > for line in f:
> >     # do something with the line
> >     print line
> > f.close()
> >
> > Of course, you can read just portions of the file too, using something
> > like this:
> >
> > f.read(64)
> >
> > That will read 64 bytes. For more info, check the following out:
> >
> > http://www.diveintopython.org/file_handling/file_objects.html
> >
> > - Mike
>
> Hi Mike,
>
> ABAP is loosely based on COBOL.
>
> Here is what I was trying to do, but ended up just coding in ABAP.
>
> Read a 4-column text file of about 1,000 lines and compare the 2
> middle fields of each line. If there is a difference, output the line.
>
> The line's definition in ABAP is PERNR(8) type c, ENDDA(10) type c,
> BEGDA(10) type c, and LGART(4) type c.
> In ABAP the code is:
> LOOP AT in_file.
>   IF in_file-endda <> in_file-begda.
>     WRITE:/ in_file. " that's the same as Python's print
>   ENDIF.
> ENDLOOP.
>
> I can read the file, but didn't know how to look at the fields in the
> line. From what you wrote, I need to read each segment/field of the
> line?
>
> Thanks,
>
> Steve
You could do something like this:

f = open('file.txt', 'r')
for line in f:
    # split the line into its whitespace-separated fields and keep the
    # two middle values (ENDDA and BEGDA)
    a, b = line.split()[1:-1]
    if a != b:
        print line
f.close()
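If the file is fixed-width rather than whitespace-delimited, as the ABAP
definition (PERNR(8), ENDDA(10), BEGDA(10), LGART(4)) suggests it may be,
slicing each line by position maps more directly onto the ABAP structure; a
sketch under that assumption:

f = open('file.txt', 'r')
for line in f:
    # fixed-width layout: PERNR(8), ENDDA(10), BEGDA(10), LGART(4)
    endda = line[8:18]
    begda = line[18:28]
    if endda != begda:
        print line
f.close()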
