Helmut Grohne [Fri, 16 Aug 2013 20:45:18 +0000 (22:45 +0200)]
make debian version_compare available in sql
Helmut Grohne [Fri, 16 Aug 2013 20:36:04 +0000 (22:36 +0200)]
webapp templates: add an anchor for file issues
Helmut Grohne [Fri, 2 Aug 2013 13:21:56 +0000 (15:21 +0200)]
model comparability as an equivalence relation
The webapp had a relation hash_functions that modeled "comparable
functions". Images should not be compared to other files, since it makes
no sense to store them as the RGBA stream that is being hashed. This
comparability property resembles an equivalence relation, so the
function table gains a column eqclass. Each class is represented by a
number and functions are statically assigned to these classes. Now the
filtering happens in SQL instead of Python.
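A minimal sketch of the eqclass idea, using Python's sqlite3 module. The
table layout, the ids, the name gif_sha512 and the assignment of
functions to classes are assumptions made for illustration; only the
column name eqclass and the other function names appear in this log.

```python
import sqlite3

# Hypothetical, simplified function table: only the columns needed here.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE function (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    eqclass INTEGER NOT NULL
);
-- Assumed class assignment: 1 = plain file hashes, 2 = image hashes.
INSERT INTO function (id, name, eqclass) VALUES
    (1, 'sha512', 1), (2, 'gzip_sha512', 1),
    (3, 'png_sha512', 2), (4, 'gif_sha512', 2);
""")


def comparable_pairs(db):
    """Return all (name1, name2) pairs whose functions share an eqclass."""
    cur = db.execute("""
        SELECT f1.name, f2.name
        FROM function AS f1
        JOIN function AS f2 ON f1.eqclass = f2.eqclass
        ORDER BY f1.name, f2.name
    """)
    return cur.fetchall()


pairs = comparable_pairs(db)
```

The join condition on eqclass does the filtering that would otherwise be
a lookup in Python: sha512 pairs with gzip_sha512, but never with
png_sha512.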
Helmut Grohne [Thu, 1 Aug 2013 21:06:26 +0000 (23:06 +0200)]
support hashing gif images
* Rename "image_sha512" to "png_sha512".
* dedup.image.ImageHash is now a base class for image hashes such as
PNGHash and GIFHash.
* Enable both hashes in importpkg.
* Fix README.
* Add new hash combinations to webapp.
* Add "gif file not named *.gif" to issues in update_sharing.
* Add redirect for "image_sha512" to webapp for backwards
compatibility.
Helmut Grohne [Tue, 30 Jul 2013 16:15:56 +0000 (18:15 +0200)]
templates/binary: space between package and compare
Helmut Grohne [Tue, 30 Jul 2013 14:03:16 +0000 (16:03 +0200)]
templates: wiki.d.o redirects to https now
Helmut Grohne [Tue, 30 Jul 2013 13:52:22 +0000 (15:52 +0200)]
fix update_sharing to work after functionid merge
Helmut Grohne [Mon, 29 Jul 2013 19:44:56 +0000 (21:44 +0200)]
importpkg.py: support uncompressed data.tar
Helmut Grohne [Sat, 27 Jul 2013 07:39:14 +0000 (09:39 +0200)]
also move the static directory into the dedup package
Helmut Grohne [Sat, 27 Jul 2013 07:32:03 +0000 (09:32 +0200)]
move templates to dedup package
They cluttered webapp.py and now vim can give proper highlighting for
the templates.
Helmut Grohne [Fri, 26 Jul 2013 19:53:11 +0000 (21:53 +0200)]
verify package hashes when importing via http
Helmut Grohne [Fri, 26 Jul 2013 13:04:02 +0000 (15:04 +0200)]
Merge branch functionid
Actual savings on the full data set are around 7%.
Conflicts:
README
Helmut Grohne [Thu, 25 Jul 2013 11:28:19 +0000 (13:28 +0200)]
display "issues" with files in package view
Currently these are invalid .gz files and png files not named .png.
Helmut Grohne [Thu, 25 Jul 2013 10:48:45 +0000 (12:48 +0200)]
README: foo.PNG is also a valid png name
Helmut Grohne [Wed, 24 Jul 2013 05:20:19 +0000 (07:20 +0200)]
readyaml: cache the whole function table
This should reduce the query bandwidth to the rdbms.
Helmut Grohne [Tue, 23 Jul 2013 21:32:00 +0000 (23:32 +0200)]
webapp: make html for index valid
Helmut Grohne [Tue, 23 Jul 2013 21:26:52 +0000 (23:26 +0200)]
README: fix typo in query
Helmut Grohne [Tue, 23 Jul 2013 21:26:28 +0000 (23:26 +0200)]
webapp: remove unused function
Helmut Grohne [Tue, 23 Jul 2013 19:54:41 +0000 (21:54 +0200)]
adapt queries in README to new schema
Helmut Grohne [Tue, 23 Jul 2013 16:53:55 +0000 (18:53 +0200)]
schema: reference hash functions by integer key
This already worked quite well for package.id. On a test data set of 5%
size this transformation reduces the database size by about 4%.
Helmut Grohne [Mon, 22 Jul 2013 10:03:35 +0000 (12:03 +0200)]
schema: extend content_package_index
We can avoid a b-tree sort in the package comparison of the web app if
the package index also provides a size.
Helmut Grohne [Mon, 15 Jul 2013 05:21:09 +0000 (07:21 +0200)]
Merge branch 'packageid'
Helmut Grohne [Fri, 12 Jul 2013 13:24:09 +0000 (15:24 +0200)]
importpkg: simplify state logic
Helmut Grohne [Fri, 12 Jul 2013 13:12:09 +0000 (15:12 +0200)]
importpkg: split process_package to process_control
Helmut Grohne [Wed, 10 Jul 2013 14:16:45 +0000 (16:16 +0200)]
schema: reference package table by integer key
One approach to improve performance is to reduce the database size. A
package name takes up 15 bytes on average. A package number takes up
two bytes. Multiply that difference by the number of references and
it should be noticeable. A small test set shows a reduction by 10%.
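The arithmetic behind this estimate, as a small Python sketch. The two
byte counts come from the commit message; the number of references is an
illustrative assumption, not a measurement.

```python
AVG_NAME_BYTES = 15  # average package name length, from the commit message
ID_BYTES = 2         # size of an integer package key, from the commit message


def bytes_saved(references):
    """Estimate the bytes saved when each reference stores an id, not a name."""
    return (AVG_NAME_BYTES - ID_BYTES) * references


# One million references is a made-up figure chosen for illustration.
saved = bytes_saved(1000000)
```

Under these assumptions each reference shrinks by 13 bytes. The estimate
ignores index overhead and the storage format of the database.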
Helmut Grohne [Wed, 10 Jul 2013 13:23:15 +0000 (15:23 +0200)]
schema.sql: drop unused index
sharing_package_index is a sub-index of sharing_insert_index and
therefore unnecessary.
Helmut Grohne [Wed, 3 Jul 2013 19:56:11 +0000 (21:56 +0200)]
README: explain update_sharing.py
Helmut Grohne [Sun, 23 Jun 2013 10:00:36 +0000 (12:00 +0200)]
Merge branch yamlimport
+ Way faster on multiple cores.
+ More reliable, because http connections do not time out when the db
blocks.
- Way slower on single core with contended io path. No clue why.
Still update_sharing.py makes up the bulk of processing time.
Helmut Grohne [Wed, 19 Jun 2013 06:35:26 +0000 (08:35 +0200)]
webapp: fix hash example link after git upload
The git binary changed and so did its hash. Choosing a more stable
example now: The GPL-3.
Helmut Grohne [Tue, 11 Jun 2013 21:22:10 +0000 (23:22 +0200)]
autoimport: don't fork for readyaml
This appears to be a huge performance boost.
Helmut Grohne [Tue, 11 Jun 2013 21:11:39 +0000 (23:11 +0200)]
autoimport: support processing individual files
This gets back the original functionality of importpkg.py.
Helmut Grohne [Mon, 10 Jun 2013 16:22:29 +0000 (18:22 +0200)]
split the import phase to a yaml stream
importpkg.py now emits a yaml stream instead of updating the database.
The actual updating now happens in readyaml.py. In this process
autoimport.py was significantly reworked to import packages in parallel.
Helmut Grohne [Mon, 27 May 2013 09:59:33 +0000 (11:59 +0200)]
dedup.image: img.convert can also raise that crazy stuff
Helmut Grohne [Thu, 9 May 2013 07:20:03 +0000 (09:20 +0200)]
webapp: declare html5 and utf-8
Helmut Grohne [Thu, 9 May 2013 06:32:14 +0000 (08:32 +0200)]
webapp: enrich comparison page with version info
Helmut Grohne [Wed, 8 May 2013 13:52:42 +0000 (15:52 +0200)]
fix attribution of logo
I remembered the wrong name. The logo was made by Sune Vuorela.
Helmut Grohne [Sun, 5 May 2013 15:24:50 +0000 (17:24 +0200)]
webapp: markup error in /source template
Helmut Grohne [Sun, 5 May 2013 15:22:33 +0000 (17:22 +0200)]
webapp: validator complained about <link> with sizes
Helmut Grohne [Sun, 5 May 2013 15:10:24 +0000 (17:10 +0200)]
webapp: reference favicon from base.html
Helmut Grohne [Sun, 5 May 2013 14:19:10 +0000 (16:19 +0200)]
added favicon.ico
Authored: Cyril Brulebois
Helmut Grohne [Thu, 2 May 2013 17:28:24 +0000 (19:28 +0200)]
webapp: use jinja's filesizeformat
Except it doesn't work, so replace it with our version. At least we
might be able to drop this code in a future update.
Helmut Grohne [Thu, 2 May 2013 16:48:14 +0000 (18:48 +0200)]
webapp: reduce size of comparison output
Only add rowspan when it carries a meaning.
Helmut Grohne [Sat, 27 Apr 2013 08:55:21 +0000 (10:55 +0200)]
webapp: add a css class binary-package
Helmut Grohne [Thu, 25 Apr 2013 12:19:58 +0000 (14:19 +0200)]
webapp: total_size is None if num_files is 0
Helmut Grohne [Thu, 25 Apr 2013 12:10:47 +0000 (14:10 +0200)]
webapp: color filenames when hovering them
Helmut Grohne [Thu, 25 Apr 2013 12:10:18 +0000 (14:10 +0200)]
webapp: turn the <br> after filename into a style
Helmut Grohne [Thu, 25 Apr 2013 12:02:48 +0000 (14:02 +0200)]
move css to /style.css
Helmut Grohne [Thu, 25 Apr 2013 12:01:11 +0000 (14:01 +0200)]
webapp: make filenames css styleable
Helmut Grohne [Thu, 25 Apr 2013 07:33:03 +0000 (09:33 +0200)]
webapp: top-align fields in /compare pages
Suggested by Paul Wise.
Helmut Grohne [Thu, 25 Apr 2013 07:32:46 +0000 (09:32 +0200)]
fix markup in base.html
Helmut Grohne [Wed, 24 Apr 2013 18:56:46 +0000 (20:56 +0200)]
implement the /compare/pkg1/pkg2 page differently
The original version had two major drawbacks:
1) The SQL query used would cause a btree sort, so the time waiting
for the first output was rather long.
2) For packages with many equal files, the output would grow with
O(n^2).
Thanks to the suggestions by Christine Grohne and Klaus Aehlig. The
approach now groups files in package1 by their main hash value (sha512).
It also now does manually some work that SQL was designed to solve. To
speed up page generation, a new caching table was added identifying
which files have corresponding shared files.
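A sketch of the grouping idea in Python. It is not the webapp's code:
the function names are hypothetical, and both packages are grouped in
memory here to keep the example short.

```python
from collections import defaultdict


def group_by_hash(files):
    """Group (filename, sha512) pairs into {sha512: [filenames]}."""
    groups = defaultdict(list)
    for filename, digest in files:
        groups[digest].append(filename)
    return dict(groups)


def compare(files1, files2):
    """Yield one (names in package1, names in package2) row per shared hash."""
    groups2 = group_by_hash(files2)
    for digest, names1 in group_by_hash(files1).items():
        names2 = groups2.get(digest)
        if names2:
            yield names1, names2


# Two packages that each contain two copies of the same file.
pkg1 = [("a", "h1"), ("b", "h1"), ("c", "h2")]
pkg2 = [("x", "h1"), ("y", "h1"), ("z", "h3")]
rows = list(compare(pkg1, pkg2))
```

Pairing every equal file with every other would give four rows for the
hash h1. Grouping gives a single row that lists both sides, which is how
the quadratic growth is avoided.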
Helmut Grohne [Sun, 14 Apr 2013 08:31:55 +0000 (10:31 +0200)]
webapp: added some useful notes
Helmut Grohne [Sat, 13 Apr 2013 07:59:45 +0000 (09:59 +0200)]
base.html: add link to wiki.debian.org
Helmut Grohne [Mon, 8 Apr 2013 12:41:23 +0000 (14:41 +0200)]
README: improve query after schemachange
Helmut Grohne [Tue, 26 Mar 2013 15:23:37 +0000 (16:23 +0100)]
webapp: fix problem from the previous merge
Helmut Grohne [Tue, 26 Mar 2013 14:59:48 +0000 (15:59 +0100)]
Merge branch schemachange
Helmut Grohne [Wed, 20 Mar 2013 18:19:50 +0000 (19:19 +0100)]
webapp: report correct sizes
Helmut Grohne [Wed, 20 Mar 2013 18:12:25 +0000 (19:12 +0100)]
webapp: remove broken assert
Fails on long inputs.
Helmut Grohne [Mon, 18 Mar 2013 15:51:17 +0000 (16:51 +0100)]
dedup.image: mask errors from PIL
Helmut Grohne [Tue, 12 Mar 2013 07:38:57 +0000 (08:38 +0100)]
dedup.arreader: missing bytes marker
Helmut Grohne [Tue, 12 Mar 2013 07:24:49 +0000 (08:24 +0100)]
move ArReader from importpkg to dedup.arreader
Also document it.
Helmut Grohne [Sun, 10 Mar 2013 06:38:22 +0000 (07:38 +0100)]
README: update queries to match content table split
Helmut Grohne [Sat, 9 Mar 2013 17:43:47 +0000 (18:43 +0100)]
split content table to a hash table
In the old content table, (package, filename, size) would be the same for
multiple hash functions. Now the schema represents that each file has
precisely one size, but multiple hashes.
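A sketch of such a split with Python's sqlite3 module. The table and
column names below are assumptions chosen for illustration and are not
taken from schema.sql.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- One row per file: the size is stored exactly once.
CREATE TABLE content (
    id INTEGER PRIMARY KEY,
    package TEXT NOT NULL,
    filename TEXT NOT NULL,
    size INTEGER NOT NULL
);
-- One row per (file, hash function) combination.
CREATE TABLE file_hash (
    cid INTEGER NOT NULL REFERENCES content(id),
    function_name TEXT NOT NULL,
    digest TEXT NOT NULL
);
""")

cur = db.execute(
    "INSERT INTO content (package, filename, size) VALUES (?, ?, ?)",
    ("foo", "./usr/share/doc/foo/copyright", 1234),
)
cid = cur.lastrowid
db.executemany(
    "INSERT INTO file_hash (cid, function_name, digest) VALUES (?, ?, ?)",
    [(cid, "sha512", "aaaa"), (cid, "gzip_sha512", "bbbb")],
)

rows = db.execute("""
    SELECT c.size, h.function_name
    FROM content AS c JOIN file_hash AS h ON h.cid = c.id
    ORDER BY h.function_name
""").fetchall()
```

The file has one content row and two hash rows, so the size can no
longer differ between hash functions.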
Helmut Grohne [Sat, 9 Mar 2013 17:37:24 +0000 (18:37 +0100)]
webapp: drop unused function compute_sharedstats
The sharing table works great and I don't want to adapt it for the next
step in the schema change.
Helmut Grohne [Thu, 7 Mar 2013 08:05:48 +0000 (09:05 +0100)]
use "ON DELETE CASCADE" clauses
Helmut Grohne [Thu, 7 Mar 2013 07:43:15 +0000 (08:43 +0100)]
enable enforcing foreign keys
Helmut Grohne [Thu, 7 Mar 2013 07:41:35 +0000 (08:41 +0100)]
schema.sql: remove unsatisfiable foreign key
In the dependency table we will insert dependencies on packages which
are not tracked. This happens during initial import and for virtual
packages. Therefore the "required" column cannot be a foreign key.
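Why the key is unsatisfiable can be shown with a small sqlite3 sketch.
The two-table schema is a simplified assumption, not the project's
schema.sql, and mail-transport-agent merely stands in for a virtual
package that has no row in the package table.

```python
import sqlite3


def try_insert(required_is_foreign_key):
    """Insert a dependency on an untracked package; return True on success."""
    db = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when asked to, per connection.
    db.execute("PRAGMA foreign_keys = ON")
    fk = " REFERENCES package(name)" if required_is_foreign_key else ""
    db.executescript("""
        CREATE TABLE package (name TEXT PRIMARY KEY);
        CREATE TABLE dependency (
            package TEXT NOT NULL REFERENCES package(name),
            required TEXT NOT NULL%s
        );
        INSERT INTO package (name) VALUES ('foo');
    """ % fk)
    try:
        db.execute(
            "INSERT INTO dependency (package, required) VALUES (?, ?)",
            ("foo", "mail-transport-agent"),
        )
        return True
    except sqlite3.IntegrityError:
        return False


with_key = try_insert(True)
without_key = try_insert(False)
```

With the constraint declared the insert is rejected, because no package
row named mail-transport-agent exists. Without it the dependency is
recorded as plain text.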
Helmut Grohne [Thu, 7 Mar 2013 07:28:56 +0000 (08:28 +0100)]
schema.sql: annotate foreign keys of sharing
Helmut Grohne [Thu, 7 Mar 2013 07:24:44 +0000 (08:24 +0100)]
integrate the source table into the package table
Helmut Grohne [Thu, 7 Mar 2013 07:12:01 +0000 (08:12 +0100)]
README: explain queries
Helmut Grohne [Wed, 6 Mar 2013 14:36:49 +0000 (15:36 +0100)]
README: added interesting query
Helmut Grohne [Tue, 5 Mar 2013 07:39:06 +0000 (08:39 +0100)]
webapp: added /source/<pkg> page
Helmut Grohne [Tue, 5 Mar 2013 07:38:39 +0000 (08:38 +0100)]
webapp: helper function function_combination
Helmut Grohne [Tue, 5 Mar 2013 07:21:13 +0000 (08:21 +0100)]
importpkg: source header may contain a version
Helmut Grohne [Mon, 4 Mar 2013 17:53:23 +0000 (18:53 +0100)]
webapp: fix index template
Apparently not all browsers understand <a ... /> in all rendering modes.
Helmut Grohne [Mon, 4 Mar 2013 17:49:54 +0000 (18:49 +0100)]
webapp: use caching table "shared" for /binary page
Helmut Grohne [Mon, 4 Mar 2013 12:49:22 +0000 (13:49 +0100)]
webapp: generate /comparison pages in constant-space
Helmut Grohne [Mon, 4 Mar 2013 10:44:24 +0000 (11:44 +0100)]
importpkg: record the source package relationship
Helmut Grohne [Sat, 2 Mar 2013 21:33:39 +0000 (22:33 +0100)]
update_sharing: wrong database name
Helmut Grohne [Sat, 2 Mar 2013 21:29:04 +0000 (22:29 +0100)]
add sharing table
The sharing table is a cache for the /binary web pages. It essentially
contains the numbers presented. This caching table is not automatically
populated. It needs to be reconstructed after every (group of) package
imports.
Helmut Grohne [Sat, 2 Mar 2013 20:46:47 +0000 (21:46 +0100)]
update README
* Tell about schema.sql.
* Explain WAL.
Helmut Grohne [Sat, 2 Mar 2013 20:24:18 +0000 (21:24 +0100)]
move fetchiter from webapp to dedup.utils
Helmut Grohne [Sat, 2 Mar 2013 20:18:14 +0000 (21:18 +0100)]
move sql schema to a separate file
Helmut Grohne [Sat, 2 Mar 2013 10:25:53 +0000 (11:25 +0100)]
added html form to main page
Thanks to Jan Luehr for doing the work.
Helmut Grohne [Mon, 25 Feb 2013 10:56:09 +0000 (11:56 +0100)]
webapp: open database cursor lazily
Makes things more correct when using Application in a multiprocessing
context.
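A sketch of lazy cursor creation, assuming the Application object is
given an sqlite3 connection. The attribute names db, _cursor and cur are
hypothetical. The presumed benefit is that the cursor is created on
first use, in whichever process uses it, rather than at construction.

```python
import sqlite3


class Application:
    """Hold a database connection and create its cursor on first use."""

    def __init__(self, db):
        self.db = db
        self._cursor = None  # nothing is opened at construction time

    @property
    def cur(self):
        # Create the cursor on first access, then reuse it.
        if self._cursor is None:
            self._cursor = self.db.cursor()
        return self._cursor


app = Application(sqlite3.connect(":memory:"))
opened_before_use = app._cursor is not None
row = app.cur.execute("SELECT 1").fetchone()
```

Constructing the object is cheap and opens nothing; the first query
pays for the cursor.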
Helmut Grohne [Mon, 25 Feb 2013 10:52:05 +0000 (11:52 +0100)]
webapp: pass database to Application class
Helmut Grohne [Mon, 25 Feb 2013 10:49:27 +0000 (11:49 +0100)]
README: another interesting query
Helmut Grohne [Mon, 25 Feb 2013 09:00:50 +0000 (10:00 +0100)]
Merge branch 'crosshash'
Conflicts in webapp.py:
* The fetchall -> fetchiter change caused big conflicts.
* New hash combination (image_sha512, image_sha512) added.
Helmut Grohne [Mon, 25 Feb 2013 08:55:35 +0000 (09:55 +0100)]
webapp: complete cross hash support
Helmut Grohne [Mon, 25 Feb 2013 07:55:53 +0000 (08:55 +0100)]
autoimport: this is not how foreign key constraints work
Helmut Grohne [Sun, 24 Feb 2013 00:03:30 +0000 (01:03 +0100)]
hash image contents
Helmut Grohne [Sun, 24 Feb 2013 00:02:38 +0000 (01:02 +0100)]
README: fix mistake
Helmut Grohne [Sat, 23 Feb 2013 08:53:33 +0000 (09:53 +0100)]
importpkg: ignore filenames with encoding errors
Helmut Grohne [Sat, 23 Feb 2013 08:36:15 +0000 (09:36 +0100)]
autoimport: log which packages are dropped
Helmut Grohne [Fri, 22 Feb 2013 18:59:00 +0000 (19:59 +0100)]
autoimport: fix version check to actually work
Don't fail on new packages and skip versions already processed again.
Helmut Grohne [Fri, 22 Feb 2013 18:55:31 +0000 (19:55 +0100)]
autoimport: skip old versions entirely
Presumably this is responsible for the blocking curl processes, since
importpkg will terminate early when processing an old version.
Helmut Grohne [Fri, 22 Feb 2013 17:33:22 +0000 (18:33 +0100)]
webapp: add caching headers
Helmut Grohne [Fri, 22 Feb 2013 17:21:44 +0000 (18:21 +0100)]
webapp: stream responses
Maybe this gets memory usage down for large responses.
Helmut Grohne [Fri, 22 Feb 2013 16:47:14 +0000 (17:47 +0100)]
webapp: attempt to reduce memory usage
Helmut Grohne [Fri, 22 Feb 2013 13:12:33 +0000 (14:12 +0100)]
webapp: support matching sha512 against gzip_sha512
This covers only the /binary page. The comparison may still be empty.