~helmut/debian-dedup.git
7 years ago  webapp: allow git-like hash truncation
Helmut Grohne [Sun, 11 May 2014 13:25:46 +0000 (15:25 +0200)]
webapp: allow git-like hash truncation
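A minimal sketch of what git-like truncation could look like, not the webapp's actual code: a shortened hash is accepted only if it identifies exactly one stored value. The table `hash`, its column `hash`, the function name `resolve_prefix` and the minimum length of 6 are assumptions made for illustration.

```python
import sqlite3

HEX_DIGITS = set("0123456789abcdef")


def resolve_prefix(conn, prefix, min_length=6):
    """Return the single stored hash starting with prefix, or None.

    None is returned when the prefix is too short, is not lowercase hex,
    matches nothing, or matches more than one distinct hash.
    """
    if len(prefix) < min_length or not set(prefix) <= HEX_DIGITS:
        return None
    # For lowercase hex strings every value with this prefix sorts in the
    # half-open range [prefix, prefix + "g"), so an index on the column
    # can serve the lookup.
    cur = conn.execute(
        "SELECT DISTINCT hash FROM hash WHERE hash >= ? AND hash < ? LIMIT 2",
        (prefix, prefix + "g"),
    )
    rows = cur.fetchall()
    cur.close()
    return rows[0][0] if len(rows) == 1 else None
```

Fetching at most two rows is enough to tell an ambiguous prefix from a unique one.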

7 years ago  autoimport: support protocols besides http
Helmut Grohne [Mon, 21 Apr 2014 10:50:15 +0000 (12:50 +0200)]
autoimport: support protocols besides http

7 years ago  schema: make syntax compatible with postgres
Helmut Grohne [Sat, 8 Mar 2014 08:48:17 +0000 (09:48 +0100)]
schema: make syntax compatible with postgres

7 years ago  Merge branch updatesharing-eqclass
Helmut Grohne [Sun, 23 Feb 2014 19:12:18 +0000 (20:12 +0100)]
Merge branch updatesharing-eqclass

7 years ago  spell check comments
Helmut Grohne [Sun, 23 Feb 2014 17:19:35 +0000 (18:19 +0100)]
spell check comments

7 years ago  fix spelling mistake
Helmut Grohne [Sun, 23 Feb 2014 16:29:41 +0000 (17:29 +0100)]
fix spelling mistake

Reported-By: Stefan Kaltenbrunner

7 years ago  webapp: fix eqclass usage in package comparison
Helmut Grohne [Sun, 23 Feb 2014 14:44:03 +0000 (15:44 +0100)]
webapp: fix eqclass usage in package comparison

When comparing two packages, objects were considered duplicates without
checking, via their equivalence classes, whether the respective hash
functions are comparable. The current set of hash functions does not
expose this bug.

7 years ago  update_sharing: weaken assumptions about db layout
Helmut Grohne [Fri, 21 Feb 2014 20:59:04 +0000 (21:59 +0100)]
update_sharing: weaken assumptions about db layout

Hash functions are partitioned into equivalence classes. We are
generally only interested in sharing among hash functions with the same
equivalence class, but the algorithm would compute any sharing. While
the current layout never produces the same hashes for functions in
different equivalence classes (their output lengths differ), that may
change in the future.

Also allow hash functions that belong to no equivalence class at all
(eqclass = NULL) as a means to add additional metadata to content
without computing any sharing for it.
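The restriction described above can be sketched as a join that only pairs hashes whose functions share an equivalence class. This is an illustration, not the query update_sharing uses: the tables `function` and `hash`, their columns, and the class numbers assigned in the usage example are assumptions. Because `NULL = NULL` is not true in SQL, functions with `eqclass = NULL` drop out of the result without any extra condition.

```python
import sqlite3

# Hypothetical, simplified layout for illustration only.
SCHEMA = """
CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT, eqclass INTEGER);
CREATE TABLE hash (
    cid INTEGER,
    function_id INTEGER REFERENCES function(id),
    hash TEXT
);
"""

SHARING_QUERY = """
SELECT h1.cid, h2.cid, f1.name, f2.name
FROM hash AS h1
JOIN hash AS h2 ON h1.hash = h2.hash AND h1.cid < h2.cid
JOIN function AS f1 ON f1.id = h1.function_id
JOIN function AS f2 ON f2.id = h2.function_id
WHERE f1.eqclass = f2.eqclass
ORDER BY h1.cid, h2.cid
"""


def find_sharing(conn):
    """Return pairs of content ids whose equal hashes are comparable."""
    return conn.execute(SHARING_QUERY).fetchall()
```

Equal hash values from different classes, or from a function without a class, are never reported as sharing.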

7 years ago  blacklist content rather than hashes
Helmut Grohne [Wed, 19 Feb 2014 13:21:20 +0000 (14:21 +0100)]
blacklist content rather than hashes

Otherwise the gzip hash cannot tell the empty stream and the
compressed empty stream apart.

7 years ago  GzipDecompressor: don't treat checksum as garbage trailer
Helmut Grohne [Wed, 19 Feb 2014 13:19:56 +0000 (14:19 +0100)]
GzipDecompressor: don't treat checksum as garbage trailer

7 years ago  DecompressedHash should fail on trailing input
Helmut Grohne [Wed, 19 Feb 2014 06:54:21 +0000 (07:54 +0100)]
DecompressedHash should fail on trailing input

Otherwise all files smaller than 10 bytes are successfully hashed to the
hash of the empty input when using the GzipDecompressor.

Reported-By: Olly Betts
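The failure mode can be illustrated with Python's zlib module standing in for the project's GzipDecompressor (a substitution; the real class is not shown in this log). A gzip header is at least 10 bytes long, so a shorter input can leave the decompressor waiting for more header bytes with no output produced so far. Unless the caller checks that the stream actually ended, that looks like a successful decompression of the empty input. The function name `gzip_content_hash` is hypothetical.

```python
import hashlib
import zlib


def gzip_content_hash(data):
    """Return the sha512 hex digest of the gunzipped data.

    None is returned unless data is exactly one complete gzip stream.
    """
    # 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        content = decompressor.decompress(data)
    except zlib.error:
        return None
    # eof is only set once the trailer (CRC32 and length) was verified;
    # unused_data holds any bytes after the end of the stream.
    if not decompressor.eof or decompressor.unused_data:
        return None
    return hashlib.sha512(content).hexdigest()
```

This sketch also rejects concatenated gzip members, which is stricter than the gzip command line tool.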
8 years ago  work around python-debian's #670679
Helmut Grohne [Thu, 3 Oct 2013 06:51:41 +0000 (08:51 +0200)]
work around python-debian's #670679

8 years ago  webapp: open cursors less often
Helmut Grohne [Wed, 11 Sep 2013 06:35:41 +0000 (08:35 +0200)]
webapp: open cursors less often

On the main instance opening cursors equals initiating a connection.
Unfortunately sqlite3.Connection.close does not close file descriptors.
So just open fewer cursors to leak file descriptors less often.

8 years ago  webapp: close database cursors
Helmut Grohne [Tue, 10 Sep 2013 07:39:40 +0000 (09:39 +0200)]
webapp: close database cursors

Leaking them can result in running out of available file descriptors.
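One common way to make sure a cursor is always closed, shown here as a sketch rather than the webapp's actual code, is contextlib.closing. The helper name `fetch_all` is hypothetical.

```python
import contextlib
import sqlite3


def fetch_all(conn, query, params=()):
    """Run a query and return all rows, closing the cursor in every case."""
    with contextlib.closing(conn.cursor()) as cur:
        cur.execute(query, params)
        return cur.fetchall()
```

The cursor is closed even when execute raises, so no code path leaves it open.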

8 years ago  webapp: serve static files from /static
Helmut Grohne [Wed, 4 Sep 2013 08:15:59 +0000 (10:15 +0200)]
webapp: serve static files from /static

8 years ago  add option -d --database for db path to all scripts
Helmut Grohne [Mon, 2 Sep 2013 16:51:20 +0000 (18:51 +0200)]
add option -d --database for db path to all scripts

8 years ago  autoimport: avoid hard coded temporary directory
Helmut Grohne [Mon, 2 Sep 2013 08:00:44 +0000 (10:00 +0200)]
autoimport: avoid hard coded temporary directory

8 years ago  importpkg: move library-like parts to dedup.debpkg
Helmut Grohne [Mon, 2 Sep 2013 07:30:05 +0000 (09:30 +0200)]
importpkg: move library-like parts to dedup.debpkg

8 years ago  importpkg: don't blacklist boring gzip_sha512 hashes
Helmut Grohne [Mon, 19 Aug 2013 09:52:39 +0000 (11:52 +0200)]
importpkg: don't blacklist boring gzip_sha512 hashes

 * In practice there are very few compressed files with trivial hashes.
 * Blacklisting these values results in false positives in the gzip
   issues.

8 years ago  make debian version_compare available in sql
Helmut Grohne [Fri, 16 Aug 2013 20:45:18 +0000 (22:45 +0200)]
make debian version_compare available in sql

8 years ago  webapp templates: add an anchor for file issues
Helmut Grohne [Fri, 16 Aug 2013 20:36:04 +0000 (22:36 +0200)]
webapp templates: add an anchor for file issues

8 years ago  model comparability as an equivalence relation
Helmut Grohne [Fri, 2 Aug 2013 13:21:56 +0000 (15:21 +0200)]
model comparability as an equivalence relation

webapp has had a relation hash_functions that modeled "comparable
functions". Images should not be compared to other files, since it makes
no sense to store them as the RGBA stream that is being hashed. This
comparability property resembles an equivalence relation. So the
function table gains a column eqclass. Each class is represented by a
number and functions are statically assigned to these classes. Now the
filtering happens in SQL instead of Python.

8 years ago  support hashing gif images
Helmut Grohne [Thu, 1 Aug 2013 21:06:26 +0000 (23:06 +0200)]
support hashing gif images

 * Rename "image_sha512" to "png_sha512".
 * dedup.image.ImageHash is now a base class for image hashes such as
   PNGHash and GIFHash.
 * Enable both hashes in importpkg.
 * Fix README.
 * Add new hash combinations to webapp.
 * Add "gif file not named *.gif" to issues in update_sharing.
 * Add redirect for "image_sha512" to webapp for backwards
   compatibility.

8 years ago  templates/binary: space between package and compare
Helmut Grohne [Tue, 30 Jul 2013 16:15:56 +0000 (18:15 +0200)]
templates/binary: space between package and compare

8 years ago  templates: wiki.d.o redirects to https now
Helmut Grohne [Tue, 30 Jul 2013 14:03:16 +0000 (16:03 +0200)]
templates: wiki.d.o redirects to https now

8 years ago  fix update_sharing to work after functionid merge
Helmut Grohne [Tue, 30 Jul 2013 13:52:22 +0000 (15:52 +0200)]
fix update_sharing to work after functionid merge

8 years ago  importpkg.py: support uncompressed data.tar
Helmut Grohne [Mon, 29 Jul 2013 19:44:56 +0000 (21:44 +0200)]
importpkg.py: support uncompressed data.tar

8 years ago  also move the static directory into the dedup package
Helmut Grohne [Sat, 27 Jul 2013 07:39:14 +0000 (09:39 +0200)]
also move the static directory into the dedup package

8 years ago  move templates to dedup package
Helmut Grohne [Sat, 27 Jul 2013 07:32:03 +0000 (09:32 +0200)]
move templates to dedup package

They cluttered webapp.py and now vim can give proper highlighting for
the templates.

8 years ago  verify package hashes when importing via http
Helmut Grohne [Fri, 26 Jul 2013 19:53:11 +0000 (21:53 +0200)]
verify package hashes when importing via http

8 years ago  Merge branch functionid
Helmut Grohne [Fri, 26 Jul 2013 13:04:02 +0000 (15:04 +0200)]
Merge branch functionid

Actual savings on the full data set are around 7%.

Conflicts:
    README

8 years ago  display "issues" with files in package view
Helmut Grohne [Thu, 25 Jul 2013 11:28:19 +0000 (13:28 +0200)]
display "issues" with files in package view

Currently these are invalid .gz files and png files not named .png.

8 years ago  README: foo.PNG is also a valid png name
Helmut Grohne [Thu, 25 Jul 2013 10:48:45 +0000 (12:48 +0200)]
README: foo.PNG is also a valid png name

8 years ago  readyaml: cache the whole function table
Helmut Grohne [Wed, 24 Jul 2013 05:20:19 +0000 (07:20 +0200)]
readyaml: cache the whole function table

This should reduce the query bandwidth to the rdbms.

8 years ago  webapp: make html for index valid
Helmut Grohne [Tue, 23 Jul 2013 21:32:00 +0000 (23:32 +0200)]
webapp: make html for index valid

8 years ago  README: fix typo in query
Helmut Grohne [Tue, 23 Jul 2013 21:26:52 +0000 (23:26 +0200)]
README: fix typo in query

8 years ago  webapp: remove unused function
Helmut Grohne [Tue, 23 Jul 2013 21:26:28 +0000 (23:26 +0200)]
webapp: remove unused function

8 years ago  adapt queries in README to new schema
Helmut Grohne [Tue, 23 Jul 2013 19:54:41 +0000 (21:54 +0200)]
adapt queries in README to new schema

8 years ago  schema: reference hash functions by integer key
Helmut Grohne [Tue, 23 Jul 2013 16:53:55 +0000 (18:53 +0200)]
schema: reference hash functions by integer key

This already worked quite well for package.id. On a test data set of 5%
size this transformation reduces the database size by about 4%.

8 years ago  schema: extend content_package_index
Helmut Grohne [Mon, 22 Jul 2013 10:03:35 +0000 (12:03 +0200)]
schema: extend content_package_index

We can avoid a b-tree sort in the package comparison of the web app if
the package index also provides a size.

8 years ago  Merge branch 'packageid'
Helmut Grohne [Mon, 15 Jul 2013 05:21:09 +0000 (07:21 +0200)]
Merge branch 'packageid'

8 years ago  importpkg: simplify state logic
Helmut Grohne [Fri, 12 Jul 2013 13:24:09 +0000 (15:24 +0200)]
importpkg: simplify state logic

8 years ago  importpkg: split process_package to process_control
Helmut Grohne [Fri, 12 Jul 2013 13:12:09 +0000 (15:12 +0200)]
importpkg: split process_package to process_control

8 years ago  schema: reference package table by integer key
Helmut Grohne [Wed, 10 Jul 2013 14:16:45 +0000 (16:16 +0200)]
schema: reference package table by integer key

One approach to improve performance is to reduce the database size. A
package name takes up 15 bytes on average. A number of a package takes
up two bytes. Multiply that difference by the number of references and
it should be noticeable. A small test set shows a reduction by 10%.
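The estimate in this message can be written out as a small calculation. The 15 and 2 byte figures come from the message itself; the function name `estimated_savings` and any reference count passed to it are hypothetical, and the sketch ignores per-row overhead as well as the cost of the added id column.

```python
def estimated_savings(references, name_bytes=15, id_bytes=2):
    """Bytes saved by storing an integer id instead of the package name
    in every referencing row."""
    return references * (name_bytes - id_bytes)
```

With a hypothetical one million referencing rows this gives 13,000,000 bytes, roughly 13 MB.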

8 years ago  schema.sql: drop unused index
Helmut Grohne [Wed, 10 Jul 2013 13:23:15 +0000 (15:23 +0200)]
schema.sql: drop unused index

sharing_package_index is a sub-index of sharing_insert_index and
therefore unnecessary.

8 years ago  README: explain update_sharing.py
Helmut Grohne [Wed, 3 Jul 2013 19:56:11 +0000 (21:56 +0200)]
README: explain update_sharing.py

8 years ago  Merge branch yamlimport
Helmut Grohne [Sun, 23 Jun 2013 10:00:36 +0000 (12:00 +0200)]
Merge branch yamlimport

 + Way faster on multiple cores.
 + More reliable, because http connections do not time out when the db
   blocks.
 - Way slower on single core with contended io path. No clue why.
   Still update_sharing.py makes up the bulk of processing time.

8 years ago  webapp: fix hash example link after git upload
Helmut Grohne [Wed, 19 Jun 2013 06:35:26 +0000 (08:35 +0200)]
webapp: fix hash example link after git upload

The git binary changed and so did its hash. Choosing a more stable
example now: The GPL-3.

8 years ago  autoimport: don't fork for readyaml
Helmut Grohne [Tue, 11 Jun 2013 21:22:10 +0000 (23:22 +0200)]
autoimport: don't fork for readyaml

This appears to be a huge performance boost.

8 years ago  autoimport: support processing individual files
Helmut Grohne [Tue, 11 Jun 2013 21:11:39 +0000 (23:11 +0200)]
autoimport: support processing individual files

This gets back the original functionality of importpkg.py.

8 years ago  split the import phase to a yaml stream
Helmut Grohne [Mon, 10 Jun 2013 16:22:29 +0000 (18:22 +0200)]
split the import phase to a yaml stream

importpkg.py now emits a yaml stream instead of updating the database.
The actual updating now happens in readyaml.py. In this process
autoimport.py was significantly reworked to import packages in parallel.

8 years ago  dedup.image: img.convert can also raise that crazy stuff
Helmut Grohne [Mon, 27 May 2013 09:59:33 +0000 (11:59 +0200)]
dedup.image: img.convert can also raise that crazy stuff

8 years ago  webapp: declare html5 and utf-8
Helmut Grohne [Thu, 9 May 2013 07:20:03 +0000 (09:20 +0200)]
webapp: declare html5 and utf-8

8 years ago  webapp: enrich comparison page with version info
Helmut Grohne [Thu, 9 May 2013 06:32:14 +0000 (08:32 +0200)]
webapp: enrich comparison page with version info

8 years ago  fix attribution of logo
Helmut Grohne [Wed, 8 May 2013 13:52:42 +0000 (15:52 +0200)]
fix attribution of logo

I remembered the wrong name. The logo was made by Sune Vuorela.

8 years ago  webapp: markup error in /source template
Helmut Grohne [Sun, 5 May 2013 15:24:50 +0000 (17:24 +0200)]
webapp: markup error in /source template

8 years ago  webapp: validator complained about <link> with sizes
Helmut Grohne [Sun, 5 May 2013 15:22:33 +0000 (17:22 +0200)]
webapp: validator complained about <link> with sizes

8 years ago  webapp: reference favicon from base.html
Helmut Grohne [Sun, 5 May 2013 15:10:24 +0000 (17:10 +0200)]
webapp: reference favicon from base.html

8 years ago  added favicon.ico
Helmut Grohne [Sun, 5 May 2013 14:19:10 +0000 (16:19 +0200)]
added favicon.ico

Authored: Cyril Brulebois

8 years ago  webapp: use jinja's filesizeformat
Helmut Grohne [Thu, 2 May 2013 17:28:24 +0000 (19:28 +0200)]
webapp: use jinja's filesizeformat

Except it doesn't work, so replace it with our version. At least we
might be able to drop this code in a future update.

8 years ago  webapp: reduce size of comparison output
Helmut Grohne [Thu, 2 May 2013 16:48:14 +0000 (18:48 +0200)]
webapp: reduce size of comparison output

Only add rowspan when it carries a meaning.

8 years ago  webapp: add a css class binary-package
Helmut Grohne [Sat, 27 Apr 2013 08:55:21 +0000 (10:55 +0200)]
webapp: add a css class binary-package

8 years ago  webapp: total_size is None if num_files is 0
Helmut Grohne [Thu, 25 Apr 2013 12:19:58 +0000 (14:19 +0200)]
webapp: total_size is None if num_files is 0

8 years ago  webapp: color filenames when hovering them
Helmut Grohne [Thu, 25 Apr 2013 12:10:47 +0000 (14:10 +0200)]
webapp: color filenames when hovering them

8 years ago  webapp: turn the <br> after filename into a style
Helmut Grohne [Thu, 25 Apr 2013 12:10:18 +0000 (14:10 +0200)]
webapp: turn the <br> after filename into a style

8 years ago  move css to /style.css
Helmut Grohne [Thu, 25 Apr 2013 12:02:48 +0000 (14:02 +0200)]
move css to /style.css

8 years ago  webapp: make filenames css styleable
Helmut Grohne [Thu, 25 Apr 2013 12:01:11 +0000 (14:01 +0200)]
webapp: make filenames css styleable

8 years ago  webapp: top-align fields in /compare pages
Helmut Grohne [Thu, 25 Apr 2013 07:33:03 +0000 (09:33 +0200)]
webapp: top-align fields in /compare pages

Suggested by Paul Wise.

8 years ago  fix markup in base.html
Helmut Grohne [Thu, 25 Apr 2013 07:32:46 +0000 (09:32 +0200)]
fix markup in base.html

8 years ago  implement the /compare/pkg1/pkg2 page differently
Helmut Grohne [Wed, 24 Apr 2013 18:56:46 +0000 (20:56 +0200)]
implement the /compare/pkg1/pkg2 page differently

The original version had two major drawbacks:
 1) The SQL query used would cause a btree sort, so the time waiting
    for the first output was rather long.
 2) For packages with many equal files, the output would grow with
    O(n^2).

Thanks to the suggestions by Christine Grohne and Klaus Aehlig. The
approach now groups files in package1 by their main hash value (sha512).
It also now does by hand some work that SQL was designed for. To speed
up page generation a new caching table was added identifying which files
have corresponding shared files.
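The grouping idea can be sketched in a few lines. This is an illustration under assumptions, not the webapp's implementation: each package is given as a list of (filename, sha512) pairs, and the function names are hypothetical. Emitting one entry per shared hash value, rather than one per matching pair of files, keeps the output linear in the number of files.

```python
from collections import defaultdict


def group_by_hash(files):
    """Map each hash value to the list of filenames carrying it."""
    groups = defaultdict(list)
    for filename, digest in files:
        groups[digest].append(filename)
    return groups


def compare_packages(files1, files2):
    """Return one (hash, names in package1, names in package2) entry
    per hash value that occurs in both packages."""
    groups1 = group_by_hash(files1)
    groups2 = group_by_hash(files2)
    return [
        (digest, sorted(names1), sorted(groups2[digest]))
        for digest, names1 in sorted(groups1.items())
        if digest in groups2
    ]
```

Two packages that each contain n copies of the same file yield a single entry listing 2n names instead of n^2 pairs.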

8 years ago  webapp: added some useful notes
Helmut Grohne [Sun, 14 Apr 2013 08:31:55 +0000 (10:31 +0200)]
webapp: added some useful notes

8 years ago  base.html: add link to wiki.debian.org
Helmut Grohne [Sat, 13 Apr 2013 07:59:45 +0000 (09:59 +0200)]
base.html: add link to wiki.debian.org

8 years ago  README: improve query after schemachange
Helmut Grohne [Mon, 8 Apr 2013 12:41:23 +0000 (14:41 +0200)]
README: improve query after schemachange

8 years ago  webapp: fix problem from the previous merge
Helmut Grohne [Tue, 26 Mar 2013 15:23:37 +0000 (16:23 +0100)]
webapp: fix problem from the previous merge

8 years ago  Merge branch schemachange
Helmut Grohne [Tue, 26 Mar 2013 14:59:48 +0000 (15:59 +0100)]
Merge branch schemachange

8 years ago  webapp: report correct sizes
Helmut Grohne [Wed, 20 Mar 2013 18:19:50 +0000 (19:19 +0100)]
webapp: report correct sizes

8 years ago  webapp: remove broken assert
Helmut Grohne [Wed, 20 Mar 2013 18:12:25 +0000 (19:12 +0100)]
webapp: remove broken assert

Fails on long inputs.

8 years ago  dedup.image: mask errors from PIL
Helmut Grohne [Mon, 18 Mar 2013 15:51:17 +0000 (16:51 +0100)]
dedup.image: mask errors from PIL

8 years ago  dedup.arreader: missing bytes marker
Helmut Grohne [Tue, 12 Mar 2013 07:38:57 +0000 (08:38 +0100)]
dedup.arreader: missing bytes marker

8 years ago  move ArReader from importpkg to dedup.arreader
Helmut Grohne [Tue, 12 Mar 2013 07:24:49 +0000 (08:24 +0100)]
move ArReader from importpkg to dedup.arreader

Also document it.

8 years ago  README: update queries to match content table split
Helmut Grohne [Sun, 10 Mar 2013 06:38:22 +0000 (07:38 +0100)]
README: update queries to match content table split

8 years ago  split content table to a hash table
Helmut Grohne [Sat, 9 Mar 2013 17:43:47 +0000 (18:43 +0100)]
split content table to a hash table

In the old content table (package, filename, size) would be the same for
multiple hash functions. Now the schema represents that each file has
precisely one size, but multiple hashes.
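A sketch of such a split, using a hypothetical simplified layout rather than the project's schema.sql: the table and column names below and the helper `add_file` are assumptions. Each file gets one row carrying its size and one row per hash function.

```python
import sqlite3

# Hypothetical layout for illustration only.
SCHEMA = """
CREATE TABLE content (
    id INTEGER PRIMARY KEY,
    package TEXT NOT NULL,
    filename TEXT NOT NULL,
    size INTEGER NOT NULL
);
CREATE TABLE hash (
    cid INTEGER NOT NULL REFERENCES content(id),
    function TEXT NOT NULL,
    hash TEXT NOT NULL
);
"""


def add_file(conn, package, filename, size, hashes):
    """Insert one content row and one hash row per hash function.

    hashes maps a function name to the hash value; returns the content id.
    """
    cur = conn.execute(
        "INSERT INTO content (package, filename, size) VALUES (?, ?, ?)",
        (package, filename, size),
    )
    cid = cur.lastrowid
    conn.executemany(
        "INSERT INTO hash (cid, function, hash) VALUES (?, ?, ?)",
        [(cid, function, value) for function, value in hashes.items()],
    )
    return cid
```

The (package, filename, size) triple is stored once per file no matter how many hash functions apply to it.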

8 years ago  webapp: drop unused function compute_sharedstats
Helmut Grohne [Sat, 9 Mar 2013 17:37:24 +0000 (18:37 +0100)]
webapp: drop unused function compute_sharedstats

The sharing table works great and I don't want to adapt it for the next
step in the schema change.

8 years ago  use "ON DELETE CASCADE" clauses
Helmut Grohne [Thu, 7 Mar 2013 08:05:48 +0000 (09:05 +0100)]
use "ON DELETE CASCADE" clauses

8 years ago  enable enforcing foreign keys
Helmut Grohne [Thu, 7 Mar 2013 07:43:15 +0000 (08:43 +0100)]
enable enforcing foreign keys
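In SQLite, foreign key clauses are parsed but by default not enforced until enforcement is switched on for each connection. A minimal sketch of that, together with a cascading delete, follows; the two tables and the helper name `connect` are hypothetical and not taken from the project's schema.

```python
import sqlite3

# Hypothetical tables for illustration only.
SCHEMA = """
CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE content (
    id INTEGER PRIMARY KEY,
    pid INTEGER NOT NULL REFERENCES package(id) ON DELETE CASCADE,
    filename TEXT NOT NULL
);
"""


def connect(path):
    """Open a database with foreign key enforcement switched on."""
    conn = sqlite3.connect(path)
    # The setting is per connection and has to be issued outside of a
    # transaction, so do it right after connecting.
    conn.execute("PRAGMA foreign_keys = ON")
    return conn
```

With enforcement on, deleting a package row also removes its content rows, and inserting a content row for an unknown package raises sqlite3.IntegrityError.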

8 years ago  schema.sql: remove unsatisfiable foreign key
Helmut Grohne [Thu, 7 Mar 2013 07:41:35 +0000 (08:41 +0100)]
schema.sql: remove unsatisfiable foreign key

In the dependency table we will insert dependencies on packages which
are not tracked. This happens during initial import and for virtual
packages. Therefore the "required" column cannot be a foreign key.

8 years ago  schema.sql: annotate foreign keys of sharing
Helmut Grohne [Thu, 7 Mar 2013 07:28:56 +0000 (08:28 +0100)]
schema.sql: annotate foreign keys of sharing

8 years ago  integrate the source table into the package table
Helmut Grohne [Thu, 7 Mar 2013 07:24:44 +0000 (08:24 +0100)]
integrate the source table into the package table

8 years ago  README: explain queries
Helmut Grohne [Thu, 7 Mar 2013 07:12:01 +0000 (08:12 +0100)]
README: explain queries

8 years ago  README: added interesting query
Helmut Grohne [Wed, 6 Mar 2013 14:36:49 +0000 (15:36 +0100)]
README: added interesting query

8 years ago  webapp: added /source/<pkg> page
Helmut Grohne [Tue, 5 Mar 2013 07:39:06 +0000 (08:39 +0100)]
webapp: added /source/<pkg> page

8 years ago  webapp: helper function function_combination
Helmut Grohne [Tue, 5 Mar 2013 07:38:39 +0000 (08:38 +0100)]
webapp: helper function function_combination

8 years ago  importpkg: source header may contain a version
Helmut Grohne [Tue, 5 Mar 2013 07:21:13 +0000 (08:21 +0100)]
importpkg: source header may contain a version

8 years ago  webapp: fix index template
Helmut Grohne [Mon, 4 Mar 2013 17:53:23 +0000 (18:53 +0100)]
webapp: fix index template

Apparently not all browsers understand <a ... /> in all rendering modes.

8 years ago  webapp: use caching table "shared" for /binary page
Helmut Grohne [Mon, 4 Mar 2013 17:49:54 +0000 (18:49 +0100)]
webapp: use caching table "shared" for /binary page

8 years ago  webapp: generate /comparison pages in constant-space
Helmut Grohne [Mon, 4 Mar 2013 12:49:22 +0000 (13:49 +0100)]
webapp: generate /comparison pages in constant-space

8 years ago  importpkg: record the source package relationship
Helmut Grohne [Mon, 4 Mar 2013 10:44:24 +0000 (11:44 +0100)]
importpkg: record the source package relationship

8 years ago  update_sharing: wrong database name
Helmut Grohne [Sat, 2 Mar 2013 21:33:39 +0000 (22:33 +0100)]
update_sharing: wrong database name

8 years ago  add sharing table
Helmut Grohne [Sat, 2 Mar 2013 21:29:04 +0000 (22:29 +0100)]
add sharing table

The sharing table is a cache for the /binary web pages. It essentially
contains the numbers presented. This caching table is not automatically
populated. It needs to be reconstructed after every (group of) package
imports.

8 years ago  update README
Helmut Grohne [Sat, 2 Mar 2013 20:46:47 +0000 (21:46 +0100)]
update README

 * Tell about schema.sql.
 * Explain WAL.