Helmut Grohne [Sat, 8 Mar 2014 16:38:26 +0000 (17:38 +0100)]
add sqlalchemy.text wrapper
Without the wrapper sqlalchemy chokes on the query and tries to index
some dictionary on postgres.
Helmut Grohne [Sat, 8 Mar 2014 16:26:53 +0000 (17:26 +0100)]
get rid of lastrowid usage
On psycopg2 the lastrowid attribute is always 0. The documentation
advises to use inserted_primary_key instead, but in order to use that,
the sqlalchemy expression language must be used.
Helmut Grohne [Sat, 8 Mar 2014 12:30:20 +0000 (13:30 +0100)]
enable result buffering for postgres
Helmut Grohne [Sat, 8 Mar 2014 12:26:42 +0000 (13:26 +0100)]
restrict sqlite-specific configuration to sqlite databases
Helmut Grohne [Sat, 8 Mar 2014 11:54:37 +0000 (12:54 +0100)]
autoimport: fix --database option broken in merge
Helmut Grohne [Sat, 8 Mar 2014 11:39:32 +0000 (12:39 +0100)]
Merge branch 'master' into sqlalchemy
In the meantime, the master branch evolved quite a bit and the schema
changed again (eqclass added to function table). The main reason for the
merge is to resolve the large number of conflicts once, so development
of the sqlalchemy branch can continue and still benefit from changes in
the master branch such as schema compatibility and the adapted indent
level in the web app, where the use of contextlib.closing resembles
sqlalchemy's "with db.begin() as conn:".
Conflicts:
autoimport.py
dedup/utils.py
readyaml.py
update_sharing.py
webapp.py
Helmut Grohne [Sat, 8 Mar 2014 08:48:17 +0000 (09:48 +0100)]
schema: make syntax compatible with postgres
Helmut Grohne [Sun, 23 Feb 2014 19:12:18 +0000 (20:12 +0100)]
Merge branch updatesharing-eqclass
Helmut Grohne [Sun, 23 Feb 2014 17:19:35 +0000 (18:19 +0100)]
spell check comments
Helmut Grohne [Sun, 23 Feb 2014 16:29:41 +0000 (17:29 +0100)]
fix spelling mistake
Reported-By: Stefan Kaltenbrunner
Helmut Grohne [Sun, 23 Feb 2014 14:44:03 +0000 (15:44 +0100)]
webapp: fix eqclass usage in package comparison
When comparing two packages, objects would be considered duplicates
without considering whether the respective hash functions are comparable
by checking their equivalence classes. The current set of hash functions
does not expose this bug.
Helmut Grohne [Fri, 21 Feb 2014 20:59:04 +0000 (21:59 +0100)]
update_sharing: weaken assumptions about db layout
Hash functions are partitioned into equivalence classes. We are
generally only interested in sharing among hash functions with the same
equivalence class, but the algorithm would compute any sharing. While
the current layout never produces the same hashes for functions in
different equivalence classes (for different output length), that may
change in the future.
Also allow hash functions that belong to no equivalence class at all
(eqclass = NULL) as a means to add additional metadata to content
without computing any sharing for it.
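A minimal sketch of the restriction described above, using plain sqlite3.
The table layout, the function names and their class numbers are
simplified assumptions for illustration, not the project's real schema.
It relies on the SQL rule that "NULL = NULL" is not true, so functions
with eqclass NULL never take part in sharing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced tables: only the columns needed for the example.
conn.execute("CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT, eqclass INTEGER)")
conn.execute("CREATE TABLE hash (cid INTEGER, function_id INTEGER, hash TEXT)")
conn.executemany(
    "INSERT INTO function (id, name, eqclass) VALUES (?, ?, ?)",
    [(1, "sha512", 1), (2, "gzip_sha512", 1), (3, "png_sha512", 2),
     (4, "metadata_a", None), (5, "metadata_b", None)],
)
# Every entry carries the same hash value on purpose.
conn.executemany(
    "INSERT INTO hash (cid, function_id, hash) VALUES (?, ?, ?)",
    [(10, 1, "aa"), (11, 2, "aa"), (12, 3, "aa"), (13, 4, "aa"), (14, 5, "aa")],
)

# Equal hashes only count as sharing if both functions are in the same
# equivalence class. Rows with eqclass NULL drop out of the comparison.
shared = conn.execute(
    """
    SELECT h1.cid, h2.cid
    FROM hash AS h1
    JOIN function AS f1 ON f1.id = h1.function_id
    JOIN hash AS h2 ON h2.hash = h1.hash AND h1.cid < h2.cid
    JOIN function AS f2 ON f2.id = h2.function_id
    WHERE f1.eqclass = f2.eqclass
    ORDER BY h1.cid, h2.cid
    """
).fetchall()
```

Only the pair (10, 11) remains in "shared", although all five hashes are
equal.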
Helmut Grohne [Wed, 19 Feb 2014 13:21:20 +0000 (14:21 +0100)]
blacklist content rather than hashes
Otherwise the gzip hash cannot tell the empty stream and the
compressed empty stream apart.
Helmut Grohne [Wed, 19 Feb 2014 13:19:56 +0000 (14:19 +0100)]
GzipDecompressor: don't treat checksum as garbage trailer
Helmut Grohne [Wed, 19 Feb 2014 06:54:21 +0000 (07:54 +0100)]
DecompressedHash should fail on trailing input
Otherwise all files smaller than 10 bytes are successfully hashed to the
hash of the empty input when using the GzipDecompressor.
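A sketch of the intended behaviour, written against the standard zlib
module rather than the project's own GzipDecompressor and DecompressedHash
classes. The function name strict_gzip_sha512 is made up for this example.
It refuses input that ends before the gzip stream is complete as well as
input with bytes after the end of the stream.

```python
import gzip
import hashlib
import zlib


def strict_gzip_sha512(data):
    """Return the sha512 of the decompressed payload or raise ValueError."""
    # 16 + MAX_WBITS tells zlib to expect gzip framing (header and trailer).
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        payload = decomp.decompress(data) + decomp.flush()
    except zlib.error as err:
        raise ValueError("invalid gzip data") from err
    if not decomp.eof:
        # Covers inputs shorter than the 10 byte gzip header.
        raise ValueError("truncated gzip stream")
    if decomp.unused_data:
        # This sketch also rejects multi-member gzip files.
        raise ValueError("trailing input after gzip stream")
    return hashlib.sha512(payload).hexdigest()


sample = gzip.compress(b"hello")
digest = strict_gzip_sha512(sample)
```

Passing sample[:5] or sample + b"junk" raises ValueError instead of
silently yielding the hash of the empty input.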
Reported-By: Olly Betts
Helmut Grohne [Thu, 3 Oct 2013 06:51:41 +0000 (08:51 +0200)]
work around python-debian's #670679
Helmut Grohne [Wed, 11 Sep 2013 06:35:41 +0000 (08:35 +0200)]
webapp: open cursors less often
On the main instance opening cursors equals initiating a connection.
Unfortunately sqlite3.Connection.close does not close filedescriptors.
So just open fewer cursors to leak filedescriptors less often.
Helmut Grohne [Tue, 10 Sep 2013 07:39:40 +0000 (09:39 +0200)]
webapp: close database cursors
Leaking them can result in running out of available filedescriptors.
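One way to make sure a cursor gets closed is contextlib.closing, shown
here as a sketch with plain sqlite3. The package table and the helper
name package_names are assumptions for the example and not taken from
webapp.py.

```python
import contextlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (name TEXT)")
conn.executemany("INSERT INTO package (name) VALUES (?)", [("foo",), ("bar",)])


def package_names(conn):
    """Return all package names, closing the cursor on every exit path."""
    # closing() calls cur.close() when the block is left, even on errors.
    with contextlib.closing(conn.cursor()) as cur:
        cur.execute("SELECT name FROM package ORDER BY name")
        return [row[0] for row in cur.fetchall()]
```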
Helmut Grohne [Wed, 4 Sep 2013 08:15:59 +0000 (10:15 +0200)]
webapp: serve static files from /static
Helmut Grohne [Mon, 2 Sep 2013 16:51:20 +0000 (18:51 +0200)]
add option -d --database for db path to all scripts
Helmut Grohne [Mon, 2 Sep 2013 08:00:44 +0000 (10:00 +0200)]
autoimport: avoid hard coded temporary directory
Helmut Grohne [Mon, 2 Sep 2013 07:30:05 +0000 (09:30 +0200)]
importpkg: move library-like parts to dedup.debpkg
Helmut Grohne [Mon, 19 Aug 2013 09:52:39 +0000 (11:52 +0200)]
importpkg: don't blacklist boring gzip_sha512 hashes
* In practice there are very few compressed files with trivial hashes.
* Blacklisting these values results in false positives in the gzip
issues.
Helmut Grohne [Fri, 16 Aug 2013 20:45:18 +0000 (22:45 +0200)]
make debian version_compare available in sql
Helmut Grohne [Fri, 16 Aug 2013 20:36:04 +0000 (22:36 +0200)]
webapp templates: add an anchor for file issues
Helmut Grohne [Sat, 3 Aug 2013 20:14:40 +0000 (22:14 +0200)]
convert remaining code to sqlalchemy
No explicit "import sqlite3" left. It's still a bit rough around the
edges, particularly since sqlalchemy's support for executemany is
totally broken.
Helmut Grohne [Fri, 2 Aug 2013 13:21:56 +0000 (15:21 +0200)]
model comparability as an equivalence relation
webapp has had a relation hash_functions that modeled "comparable
functions". Images should not be compared to other files, since it makes
no sense to store them as the RGBA stream that is being hashed. This
comparability property resembles an equivalence relation. So the
function table gains a column eqclass. Each class is represented by a
number and functions are statically assigned to these classes. Now the
filtering happens in SQL instead of Python.
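A sketch of how such a column lets SQL answer "which functions are
comparable to this one". The reduced table, the function names and the
class numbers are assumptions for illustration, as is the helper
comparable_functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced function table.
conn.execute(
    "CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT UNIQUE, eqclass INTEGER)"
)
conn.executemany(
    "INSERT INTO function (name, eqclass) VALUES (?, ?)",
    [("sha512", 1), ("gzip_sha512", 1), ("png_sha512", 2), ("gif_sha512", 2)],
)


def comparable_functions(conn, name):
    """Return the names of all functions in the same class as `name`."""
    rows = conn.execute(
        "SELECT f2.name FROM function AS f1"
        " JOIN function AS f2 ON f1.eqclass = f2.eqclass"
        " WHERE f1.name = ? ORDER BY f2.name",
        (name,),
    )
    return [row[0] for row in rows]
```

Since the classes partition the functions, every function is comparable
to itself and the relation is symmetric and transitive by construction.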
Helmut Grohne [Fri, 2 Aug 2013 06:40:49 +0000 (08:40 +0200)]
Merge branch master into sqlalchemy
This makes the sqlalchemy branch schema-compatible with master again.
The biggest change on master was the introduction of the function table.
It caused most of the conflicts. Note that webapp had one conflict not
detected by git: The selecting of issues in show_package needed
sqlalchemy conversion.
Conflicts:
README
update_sharing.py
webapp.py
Helmut Grohne [Thu, 1 Aug 2013 21:06:26 +0000 (23:06 +0200)]
support hashing gif images
* Rename "image_sha512" to "png_sha512".
* dedup.image.ImageHash is now a base class for image hashes such as
PNGHash and GIFHash.
* Enable both hashes in importpkg.
* Fix README.
* Add new hash combinations to webapp.
* Add "gif file not named *.gif" to issues in update_sharing.
* Add redirect for "image_sha512" to webapp for backwards
compatibility.
Helmut Grohne [Tue, 30 Jul 2013 16:15:56 +0000 (18:15 +0200)]
templates/binary: space between package and compare
Helmut Grohne [Tue, 30 Jul 2013 14:03:16 +0000 (16:03 +0200)]
templates: wiki.d.o redirects to https now
Helmut Grohne [Tue, 30 Jul 2013 13:52:22 +0000 (15:52 +0200)]
fix update_sharing to work after functionid merge
Helmut Grohne [Mon, 29 Jul 2013 19:44:56 +0000 (21:44 +0200)]
importpkg.py: support uncompressed data.tar
Helmut Grohne [Sat, 27 Jul 2013 07:39:14 +0000 (09:39 +0200)]
also move the static directory into the dedup package
Helmut Grohne [Sat, 27 Jul 2013 07:32:03 +0000 (09:32 +0200)]
move templates to dedup package
They cluttered webapp.py and now vim can give proper highlighting for
the templates.
Helmut Grohne [Fri, 26 Jul 2013 19:53:11 +0000 (21:53 +0200)]
verify package hashes when importing via http
Helmut Grohne [Fri, 26 Jul 2013 13:04:02 +0000 (15:04 +0200)]
Merge branch functionid
Actual savings on the full data set are around 7%.
Conflicts:
README
Helmut Grohne [Thu, 25 Jul 2013 11:28:19 +0000 (13:28 +0200)]
display "issues" with files in package view
Currently this is invalid .gz files and png files not named .png.
Helmut Grohne [Thu, 25 Jul 2013 10:48:45 +0000 (12:48 +0200)]
README: foo.PNG is also a valid png name
Helmut Grohne [Wed, 24 Jul 2013 07:07:39 +0000 (09:07 +0200)]
sqlalchemy's fetchmany defaults to being fetchall
This voids the benefits of processing rows during row generation as has
been observed on postgres.
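A sketch of batched row processing with an explicit batch size, written
against a plain DB-API cursor (sqlite3) instead of a sqlalchemy result.
The helper name fetch_iter and the default size are assumptions for this
example.

```python
import sqlite3


def fetch_iter(cursor, size=2048):
    """Yield rows one at a time while fetching them in batches of `size`."""
    while True:
        # Passing the size explicitly avoids depending on any default.
        rows = cursor.fetchmany(size)
        if not rows:
            break
        for row in rows:
            yield row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers (n) VALUES (?)", [(i,) for i in range(5)])
```

Usage: "for row in fetch_iter(conn.execute(...), 2)" handles the first
rows before the later ones have been fetched.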
Helmut Grohne [Wed, 24 Jul 2013 05:20:19 +0000 (07:20 +0200)]
readyaml: cache the whole function table
This should reduce the query bandwidth to the rdbms.
Helmut Grohne [Tue, 23 Jul 2013 21:32:00 +0000 (23:32 +0200)]
webapp: make html for index valid
Helmut Grohne [Tue, 23 Jul 2013 21:26:52 +0000 (23:26 +0200)]
README: fix typo in query
Helmut Grohne [Tue, 23 Jul 2013 21:26:28 +0000 (23:26 +0200)]
webapp: remove unused function
Helmut Grohne [Tue, 23 Jul 2013 19:54:41 +0000 (21:54 +0200)]
adapt queries in README to new schema
Helmut Grohne [Tue, 23 Jul 2013 16:53:55 +0000 (18:53 +0200)]
schema: reference hash functions by integer key
This already worked quite well for package.id. On a test data set of 5%
size this transformation reduces the database size by about 4%.
Helmut Grohne [Mon, 22 Jul 2013 10:03:35 +0000 (12:03 +0200)]
schema: extend content_package_index
We can avoid a b-tree sort in the package comparison of the web app if
the package index also provides a size.
Helmut Grohne [Sat, 20 Jul 2013 12:11:45 +0000 (14:11 +0200)]
another missing sqlalchemy.text wrapper
Helmut Grohne [Sat, 20 Jul 2013 12:09:30 +0000 (14:09 +0200)]
use sqlalchemy.text
Without using this wrapper the sql statements are not munged by
sqlalchemy. Specifically paramstyle is not translated. For sqlite3 this
did not matter, because it allows the changed paramstyle, but for
postgres it fails without sqlalchemy.text wrappers.
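The sqlite3 side of this can be checked without sqlalchemy. The sketch
below uses a made-up package table. The module declares the qmark
paramstyle, yet it also accepts named placeholders, which is why the
missing wrapper went unnoticed there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT)")

# qmark style, the paramstyle that sqlite3 declares
conn.execute("INSERT INTO package (name) VALUES (?)", ("foo",))
# named style, which sqlite3 accepts as well
conn.execute("INSERT INTO package (name) VALUES (:name)", {"name": "bar"})

names = [row[0] for row in conn.execute("SELECT name FROM package ORDER BY id")]
```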
Helmut Grohne [Wed, 17 Jul 2013 14:27:08 +0000 (16:27 +0200)]
Merge branch master into sqlalchemy
This basically pulls the packageid branch into sqlalchemy. The merge was
complex, because many sql statements diverged. The merge brings us one
step closer to supporting postgres, because an "INSERT OR REPLACE" was
removed from readyaml.py in the packageid branch.
Conflicts:
update_sharing.py
webapp.py
Helmut Grohne [Mon, 15 Jul 2013 05:21:09 +0000 (07:21 +0200)]
Merge branch 'packageid'
Helmut Grohne [Fri, 12 Jul 2013 13:24:09 +0000 (15:24 +0200)]
importpkg: simplify state logic
Helmut Grohne [Fri, 12 Jul 2013 13:12:09 +0000 (15:12 +0200)]
importpkg: split process_package to process_control
Helmut Grohne [Wed, 10 Jul 2013 20:01:13 +0000 (22:01 +0200)]
use sqlalchemy paramstyle
By using the :name syntax inside sql statements, sqlalchemy will replace
the contents with whatever paramstyle the underlying dbapi2 module
needs. In case of psycopg2 the paramstyle is not qmark for instance.
Helmut Grohne [Wed, 10 Jul 2013 20:00:17 +0000 (22:00 +0200)]
webapp: fix handling of total_size
The expression "total_size and 0" masks any positive integer to 0.
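The effect is easy to reproduce in isolation. The replacement shown here,
"total_size or 0", is an assumption about the intended fix and is not
quoted from the commit.

```python
def masked(total_size):
    # "and" returns its first operand if that is falsy, else the second.
    # Any positive integer therefore turns into 0.
    return total_size and 0


def defaulted(total_size):
    # "or" keeps a truthy value and falls back to 0 for None.
    return total_size or 0
```

masked(5) gives 0 and masked(None) gives None, while defaulted(5) gives 5
and defaulted(None) gives 0.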
Helmut Grohne [Wed, 10 Jul 2013 14:16:45 +0000 (16:16 +0200)]
schema: reference package table by integer key
One approach to improve performance is to reduce the database size. A
package name takes up 15 bytes on average. A number of a package takes
up two bytes. Multiply that difference by the number of references and
it should be noticeable. A small test set shows a reduction by 10%.
Helmut Grohne [Wed, 10 Jul 2013 13:23:15 +0000 (15:23 +0200)]
schema.sql: drop unused index
sharing_package_index is a sub-index of sharing_insert_index and
therefore unnecessary.
Helmut Grohne [Wed, 3 Jul 2013 19:56:11 +0000 (21:56 +0200)]
README: explain update_sharing.py
Helmut Grohne [Sun, 23 Jun 2013 14:33:19 +0000 (16:33 +0200)]
update_sharing: postgres does not support "INSERT OR IGNORE"
Helmut Grohne [Sun, 23 Jun 2013 11:30:04 +0000 (13:30 +0200)]
dedup.utils: add enable_sqlite_foreign_keys helper
Makes usage of sqlalchemy easier, cause I can invoke it once and it
works for all connections.
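For context: sqlite only enforces foreign keys after a pragma has been
issued on each connection. The sketch below does this with plain sqlite3
at connect time. It does not show how the helper hooks into sqlalchemy,
and the name connect_with_foreign_keys as well as the two tables are made
up for the example.

```python
import sqlite3


def connect_with_foreign_keys(path):
    """Open a sqlite database with foreign key enforcement turned on."""
    conn = sqlite3.connect(path)
    # The pragma is per connection, so it has to run for every new one.
    conn.execute("PRAGMA foreign_keys = ON")
    return conn


conn = connect_with_foreign_keys(":memory:")
conn.execute("CREATE TABLE package (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE content (pid INTEGER REFERENCES package(id))")
try:
    # No package with id 42 exists, so this insert is rejected.
    conn.execute("INSERT INTO content (pid) VALUES (42)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```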
Helmut Grohne [Sun, 23 Jun 2013 11:04:30 +0000 (13:04 +0200)]
Merge master into sqlalchemy
This is necessary to avoid severe merge conflicts when converting
importpkg.py to sqlalchemy. The actual sql invocation has moved to a
different file in master.
Conflicts:
README (diverged set of dependencies)
Helmut Grohne [Sun, 23 Jun 2013 11:01:56 +0000 (13:01 +0200)]
port update_sharing.py to sqlalchemy
Helmut Grohne [Sun, 23 Jun 2013 10:00:36 +0000 (12:00 +0200)]
Merge branch yamlimport
+ Way faster on multiple cores.
+ More reliable, cause http connections do not time out when the db
blocks.
- Way slower on single core with contended io path. No clue why.
Still update_sharing.py makes up the bulk of processing time.
Helmut Grohne [Wed, 19 Jun 2013 06:35:26 +0000 (08:35 +0200)]
webapp: fix hash example link after git upload
The git binary changed and so did its hash. Choosing a more stable
example now: The GPL-3.
Helmut Grohne [Thu, 13 Jun 2013 13:00:39 +0000 (15:00 +0200)]
webapp: use sqlalchemy
* Arguably the interface is nicer.
* Actually closes connections. => wal files get deleted.
* Permits switching from sqlite to anything.
Helmut Grohne [Tue, 11 Jun 2013 21:22:10 +0000 (23:22 +0200)]
autoimport: don't fork for readyaml
This appears to be a huge performance boost.
Helmut Grohne [Tue, 11 Jun 2013 21:11:39 +0000 (23:11 +0200)]
autoimport: support processing individual files
This gets back the original functionality of importpkg.py.
Helmut Grohne [Mon, 10 Jun 2013 16:22:29 +0000 (18:22 +0200)]
split the import phase to a yaml stream
importpkg.py now emits a yaml stream instead of updating the database.
The actual updating now happens in readyaml.py. In this process
autoimport.py was significantly reworked to import packages in parallel.
Helmut Grohne [Mon, 27 May 2013 09:59:33 +0000 (11:59 +0200)]
dedup.image: img.convert can also raise that crazy stuff
Helmut Grohne [Thu, 9 May 2013 07:20:03 +0000 (09:20 +0200)]
webapp: declare html5 and utf-8
Helmut Grohne [Thu, 9 May 2013 06:32:14 +0000 (08:32 +0200)]
webapp: enrich comparison page with version info
Helmut Grohne [Wed, 8 May 2013 13:52:42 +0000 (15:52 +0200)]
fix attribution of logo
I remembered the wrong name. The logo was made by Sune Vuorela.
Helmut Grohne [Sun, 5 May 2013 15:24:50 +0000 (17:24 +0200)]
webapp: markup error in /source template
Helmut Grohne [Sun, 5 May 2013 15:22:33 +0000 (17:22 +0200)]
webapp: validator complained about <link> with sizes
Helmut Grohne [Sun, 5 May 2013 15:10:24 +0000 (17:10 +0200)]
webapp: reference favicon from base.html
Helmut Grohne [Sun, 5 May 2013 14:19:10 +0000 (16:19 +0200)]
added favicon.ico
Authored: Cyril Brulebois
Helmut Grohne [Thu, 2 May 2013 17:28:24 +0000 (19:28 +0200)]
webapp: use jinja's filesizeformat
Except it doesn't work, so replace it with our version. At least we
might be able to drop this code in a future update.
Helmut Grohne [Thu, 2 May 2013 16:48:14 +0000 (18:48 +0200)]
webapp: reduce size of comparison output
Only add rowspan when it carries a meaning.
Helmut Grohne [Sat, 27 Apr 2013 08:55:21 +0000 (10:55 +0200)]
webapp: add a css class binary-package
Helmut Grohne [Thu, 25 Apr 2013 12:19:58 +0000 (14:19 +0200)]
webapp: total_size is None if num_files is 0
Helmut Grohne [Thu, 25 Apr 2013 12:10:47 +0000 (14:10 +0200)]
webapp: color filenames when hovering them
Helmut Grohne [Thu, 25 Apr 2013 12:10:18 +0000 (14:10 +0200)]
webapp: turn the <br> after filename into a style
Helmut Grohne [Thu, 25 Apr 2013 12:02:48 +0000 (14:02 +0200)]
move css to /style.css
Helmut Grohne [Thu, 25 Apr 2013 12:01:11 +0000 (14:01 +0200)]
webapp: make filenames css styleable
Helmut Grohne [Thu, 25 Apr 2013 07:33:03 +0000 (09:33 +0200)]
webapp: top-align fields in /compare pages
Suggested by Paul Wise.
Helmut Grohne [Thu, 25 Apr 2013 07:32:46 +0000 (09:32 +0200)]
fix markup in base.html
Helmut Grohne [Wed, 24 Apr 2013 18:56:46 +0000 (20:56 +0200)]
implement the /compare/pkg1/pkg2 page differently
The original version had two major drawbacks:
1) The SQL query used would cause a btree sort, so the time waiting
for the first output was rather long.
2) For packages with many equal files, the output would grow with
O(n^2).
Thanks to the suggestions by Christine Grohne and Klaus Aehlig. The
approach now groups files in package1 by their main hash value (sha512).
It now also does by hand some work that SQL was designed to do. To speed
up page generation a new caching table was added identifying which files
have corresponding shared files.
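The grouping idea can be sketched in a few lines of Python. The input
format (pairs of filename and sha512 value) and the function names are
assumptions for illustration. The caching table is left out.

```python
from collections import defaultdict


def group_by_hash(files):
    """Map each sha512 value to the list of filenames carrying it."""
    groups = defaultdict(list)
    for filename, sha512 in files:
        groups[sha512].append(filename)
    return dict(groups)


def compare(files1, files2):
    """Yield one (names in package1, names in package2) entry per shared hash."""
    other = group_by_hash(files2)
    for sha512, names in sorted(group_by_hash(files1).items()):
        if sha512 in other:
            yield names, other[sha512]


files1 = [("a", "h1"), ("b", "h1"), ("c", "h2")]
files2 = [("x", "h1"), ("y", "h1"), ("z", "h3")]
result = list(compare(files1, files2))
```

With n equal files on both sides this produces one entry listing n + n
names instead of n * n pairs.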
Helmut Grohne [Sun, 14 Apr 2013 08:31:55 +0000 (10:31 +0200)]
webapp: added some useful notes
Helmut Grohne [Sat, 13 Apr 2013 07:59:45 +0000 (09:59 +0200)]
base.html: add link to wiki.debian.org
Helmut Grohne [Mon, 8 Apr 2013 12:41:23 +0000 (14:41 +0200)]
README: improve query after schemachange
Helmut Grohne [Tue, 26 Mar 2013 15:23:37 +0000 (16:23 +0100)]
webapp: fix problem from the previous merge
Helmut Grohne [Tue, 26 Mar 2013 14:59:48 +0000 (15:59 +0100)]
Merge branch schemachange
Helmut Grohne [Wed, 20 Mar 2013 18:19:50 +0000 (19:19 +0100)]
webapp: report correct sizes
Helmut Grohne [Wed, 20 Mar 2013 18:12:25 +0000 (19:12 +0100)]
webapp: remove broken assert
Fails on long inputs.
Helmut Grohne [Mon, 18 Mar 2013 15:51:17 +0000 (16:51 +0100)]
dedup.image: mask errors from PIL
Helmut Grohne [Tue, 12 Mar 2013 07:38:57 +0000 (08:38 +0100)]
dedup.arreader: missing bytes marker
Helmut Grohne [Tue, 12 Mar 2013 07:24:49 +0000 (08:24 +0100)]
move ArReader from importpkg to dedup.arreader
Also document it.
Helmut Grohne [Sun, 10 Mar 2013 06:38:22 +0000 (07:38 +0100)]
README: update queries to match content table split
Helmut Grohne [Sat, 9 Mar 2013 17:43:47 +0000 (18:43 +0100)]
split content table to a hash table
In the old content table (package, filename, size) would be the same for
multiple hash functions. Now the schema represents that each file has
precisely one size, but multiple hashes.
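A reduced sketch of the split, with column names chosen for the example
rather than copied from schema.sql. The size is stored once per file in
content, and each hash function adds one row to hash.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical columns: one row per file
    CREATE TABLE content (
        id INTEGER PRIMARY KEY,
        package TEXT NOT NULL,
        filename TEXT NOT NULL,
        size INTEGER NOT NULL
    );
    -- One row per (file, hash function)
    CREATE TABLE hash (
        cid INTEGER NOT NULL REFERENCES content(id),
        function TEXT NOT NULL,
        hash TEXT NOT NULL
    );
    """
)
conn.execute(
    "INSERT INTO content (id, package, filename, size) VALUES (?, ?, ?, ?)",
    (1, "foo", "usr/share/doc/foo/changelog.gz", 1234),
)
conn.executemany(
    "INSERT INTO hash (cid, function, hash) VALUES (?, ?, ?)",
    [(1, "sha512", "aa"), (1, "gzip_sha512", "bb")],
)
rows = conn.execute(
    "SELECT c.filename, c.size, h.function FROM content AS c"
    " JOIN hash AS h ON h.cid = c.id ORDER BY h.function"
).fetchall()
```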
Helmut Grohne [Sat, 9 Mar 2013 17:37:24 +0000 (18:37 +0100)]
webapp: drop unused function compute_sharedstats
The sharing table works great and I don't want to adapt it for the next
step in the schema change.