Helmut Grohne [Tue, 25 Feb 2014 21:29:56 +0000 (22:29 +0100)]
explain what can be done with the new data
Helmut Grohne [Tue, 25 Feb 2014 06:17:39 +0000 (07:17 +0100)]
record package metadata that describes co-installability
Specifically all entries in the Conflicts header are saved in the
conflict table, all entries in the Provides header are saved in the
provide table (to cover conflicts with virtual packages) and packages
using dpkg-divert in preinst get a magic "_dpkg-divert" entry in their
conflict table. With this metadata it should be possible to compute
undeclared file conflicts.
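A minimal sketch of how undeclared file conflicts could be computed from this metadata. The table and column names (package, content, conflict, provide) and the sample packages are assumptions made for illustration, not the project's schema. A pair is treated as declared if either package conflicts with the other's name or with a name it provides, or carries the "_dpkg-divert" marker.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE content (pid INTEGER, filename TEXT);
CREATE TABLE conflict (pid INTEGER, conflicted TEXT);
CREATE TABLE provide (pid INTEGER, virtual TEXT);
INSERT INTO package VALUES (1, 'mta-a'), (2, 'mta-b'), (3, 'other');
INSERT INTO content VALUES (1, '/usr/sbin/sendmail'),
                           (2, '/usr/sbin/sendmail'),
                           (3, '/usr/sbin/sendmail');
INSERT INTO conflict VALUES (1, 'mail-transport-agent');
INSERT INTO provide VALUES (1, 'mail-transport-agent'),
                           (2, 'mail-transport-agent');
""")

def undeclared_conflicts(db):
    """Yield (pkg_a, pkg_b, filename) for shared filenames without a declaration."""
    names = dict(db.execute("SELECT id, name FROM package"))
    conflicts = {}
    for pid, target in db.execute("SELECT pid, conflicted FROM conflict"):
        conflicts.setdefault(pid, set()).add(target)
    # Every package "provides" its own name plus its virtual packages.
    provides = {pid: {name} for pid, name in names.items()}
    for pid, virtual in db.execute("SELECT pid, virtual FROM provide"):
        provides[pid].add(virtual)

    def declared(a, b):
        targets = conflicts.get(a, set())
        return "_dpkg-divert" in targets or bool(targets & provides[b])

    query = """SELECT a.pid, b.pid, a.filename FROM content AS a
               JOIN content AS b ON a.filename = b.filename AND a.pid < b.pid
               ORDER BY a.pid, b.pid, a.filename"""
    for a, b, filename in db.execute(query):
        if not (declared(a, b) or declared(b, a)):
            yield names[a], names[b], filename

result = list(undeclared_conflicts(db))
```

Here mta-a and mta-b are skipped because one of them conflicts with a virtual package the other provides, while both pairings with "other" are reported.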
Helmut Grohne [Sun, 23 Feb 2014 19:12:18 +0000 (20:12 +0100)]
Merge branch updatesharing-eqclass
Helmut Grohne [Sun, 23 Feb 2014 17:19:35 +0000 (18:19 +0100)]
spell check comments
Helmut Grohne [Sun, 23 Feb 2014 16:29:41 +0000 (17:29 +0100)]
fix spelling mistake
Reported-By: Stefan Kaltenbrunner
Helmut Grohne [Sun, 23 Feb 2014 14:44:03 +0000 (15:44 +0100)]
webapp: fix eqclass usage in package comparison
When comparing two packages, objects would be considered duplicates
without considering whether the respective hash functions are comparable
by checking their equivalence classes. The current set of hash functions
does not expose this bug.
Helmut Grohne [Fri, 21 Feb 2014 20:59:04 +0000 (21:59 +0100)]
update_sharing: weaken assumptions about db layout
Hash functions are partitioned into equivalence classes. We are
generally only interested in sharing among hash functions with the same
equivalence class, but the algorithm would compute any sharing. While
the current layout never produces the same hashes for functions in
different equivalence classes (their output lengths differ), that may
change in the future.
Also allow hash functions that belong to no equivalence class at all
(eqclass = NULL) as a means to add additional metadata to content
without computing any sharing for it.
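A minimal sketch of restricting sharing to a single equivalence class. The simplified function and hash tables, the class numbers and the "metadata" function are assumptions for illustration. Because NULL = NULL is not true in SQL, functions with eqclass NULL never match, not even each other.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT, eqclass INTEGER);
CREATE TABLE hash (cid INTEGER, fid INTEGER, hash TEXT);
INSERT INTO function VALUES (1, 'sha512', 1), (2, 'gzip_sha512', 1),
                            (3, 'png_sha512', 2), (4, 'metadata', NULL);
-- All five rows carry the same hash value on purpose.
INSERT INTO hash VALUES (10, 1, 'aa'), (11, 2, 'aa'), (12, 3, 'aa'),
                        (13, 4, 'aa'), (14, 4, 'aa');
""")

query = """
SELECT h1.cid, h2.cid FROM hash AS h1
JOIN hash AS h2 ON h1.hash = h2.hash AND h1.cid < h2.cid
JOIN function AS f1 ON f1.id = h1.fid
JOIN function AS f2 ON f2.id = h2.fid
WHERE f1.eqclass = f2.eqclass  -- never true when either side is NULL
ORDER BY h1.cid, h2.cid
"""
pairs = db.execute(query).fetchall()
```

Only contents 10 and 11 are paired: content 12 is in another class, and 13 and 14 belong to no class at all.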
Helmut Grohne [Wed, 19 Feb 2014 13:21:20 +0000 (14:21 +0100)]
blacklist content rather than hashes
Otherwise the gzip hash cannot tell the empty stream and the
compressed empty stream apart.
Helmut Grohne [Wed, 19 Feb 2014 13:19:56 +0000 (14:19 +0100)]
GzipDecompressor: don't treat checksum as garbage trailer
Helmut Grohne [Wed, 19 Feb 2014 06:54:21 +0000 (07:54 +0100)]
DecompressedHash should fail on trailing input
Otherwise all files smaller than 10 bytes are successfully hashed to the
hash of the empty input when using the GzipDecompressor.
Reported-By: Olly Betts
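A minimal sketch of the failure mode, assuming a decompressor built directly on zlib rather than the project's GzipDecompressor. The hash is only returned if the gzip stream actually ended and nothing follows it, so short or truncated inputs are rejected instead of hashing as empty. The function name gzip_content_hash is hypothetical.

```python
import gzip
import hashlib
import zlib

def gzip_content_hash(data):
    """Return the sha512 hex digest of the gunzipped data, or None if invalid."""
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip wrapper
    try:
        content = decompressor.decompress(data)
        content += decompressor.flush()
    except zlib.error:
        return None
    # Reject streams that never finished as well as trailing input.
    if not decompressor.eof or decompressor.unused_data:
        return None
    return hashlib.sha512(content).hexdigest()

good = gzip_content_hash(gzip.compress(b"hello"))
short = gzip_content_hash(b"abc")
trailing = gzip_content_hash(gzip.compress(b"hello") + b"x")
```

Note that this strict check also rejects concatenated gzip members, which may or may not be desired.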
Helmut Grohne [Thu, 3 Oct 2013 06:51:41 +0000 (08:51 +0200)]
work around python-debian's #670679
Helmut Grohne [Wed, 11 Sep 2013 06:35:41 +0000 (08:35 +0200)]
webapp: open cursors less often
On the main instance, opening a cursor equals initiating a connection.
Unfortunately sqlite3.Connection.close does not close file descriptors.
So just open fewer cursors to leak file descriptors less often.
Helmut Grohne [Tue, 10 Sep 2013 07:39:40 +0000 (09:39 +0200)]
webapp: close database cursors
Leaking them can result in running out of available file descriptors.
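One way to make sure cursors get closed is contextlib.closing. This is a sketch under assumed names (the package table and the package_names helper are hypothetical), not the webapp's actual code.

```python
import contextlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO package (name) VALUES ('hello')")

def package_names(conn):
    """Return all package names; the cursor is closed even on errors."""
    with contextlib.closing(conn.cursor()) as cur:
        cur.execute("SELECT name FROM package ORDER BY name")
        return [row[0] for row in cur.fetchall()]

names = package_names(db)
```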
Helmut Grohne [Wed, 4 Sep 2013 08:15:59 +0000 (10:15 +0200)]
webapp: serve static files from /static
Helmut Grohne [Mon, 2 Sep 2013 16:51:20 +0000 (18:51 +0200)]
add option -d --database for db path to all scripts
Helmut Grohne [Mon, 2 Sep 2013 08:00:44 +0000 (10:00 +0200)]
autoimport: avoid hard coded temporary directory
Helmut Grohne [Mon, 2 Sep 2013 07:30:05 +0000 (09:30 +0200)]
importpkg: move library-like parts to dedup.debpkg
Helmut Grohne [Mon, 19 Aug 2013 09:52:39 +0000 (11:52 +0200)]
importpkg: don't blacklist boring gzip_sha512 hashes
* In practice there are very few compressed files with trivial hashes.
* Blacklisting these values results in false positives in the gzip
issues.
Helmut Grohne [Fri, 16 Aug 2013 20:45:18 +0000 (22:45 +0200)]
make debian version_compare available in sql
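A sketch of exposing a comparison function to SQL via sqlite3's create_function. The comparator below is a stand-in that only understands dotted numeric versions; it does not implement Debian version ordering. A real setup would presumably register a proper Debian comparison instead, for example the one from python-debian's debian_support module. The SQL function name version_compare and the table layout are assumptions.

```python
import sqlite3

def simple_version_compare(a, b):
    """Stand-in comparator: dotted numeric versions only, NOT Debian ordering."""
    left = [int(part) for part in a.split(".")]
    right = [int(part) for part in b.split(".")]
    return (left > right) - (left < right)

db = sqlite3.connect(":memory:")
# Register the Python function under a SQL-visible name taking two arguments.
db.create_function("version_compare", 2, simple_version_compare)
db.execute("CREATE TABLE package (name TEXT, version TEXT)")
db.executemany("INSERT INTO package VALUES (?, ?)",
               [("old", "1.9"), ("new", "1.10")])
newer = [name for (name,) in db.execute(
    "SELECT name FROM package WHERE version_compare(version, '1.9') > 0")]
```

A plain string comparison would sort '1.10' before '1.9', which is why the comparison has to be delegated to a function.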
Helmut Grohne [Fri, 16 Aug 2013 20:36:04 +0000 (22:36 +0200)]
webapp templates: add an anchor for file issues
Helmut Grohne [Fri, 2 Aug 2013 13:21:56 +0000 (15:21 +0200)]
model comparability as an equivalence relation
The webapp has had a relation hash_functions that modeled "comparable
functions". Images should not be compared to other files, since it makes
no sense to store them as the RGBA stream that is being hashed. This
comparability property resembles an equivalence relation. So the
function table gains a column eqclass. Each class is represented by a
number and functions are statically assigned to these classes. Now the
filtering happens in SQL instead of Python.
Helmut Grohne [Thu, 1 Aug 2013 21:06:26 +0000 (23:06 +0200)]
support hashing gif images
* Rename "image_sha512" to "png_sha512".
* dedup.image.ImageHash is now a base class for image hashes such as
PNGHash and GIFHash.
* Enable both hashes in importpkg.
* Fix README.
* Add new hash combinations to webapp.
* Add "gif file not named *.gif" to issues in update_sharing.
* Add redirect for "image_sha512" to webapp for backwards
compatibility.
Helmut Grohne [Tue, 30 Jul 2013 16:15:56 +0000 (18:15 +0200)]
templates/binary: space between package and compare
Helmut Grohne [Tue, 30 Jul 2013 14:03:16 +0000 (16:03 +0200)]
templates: wiki.d.o redirects to https now
Helmut Grohne [Tue, 30 Jul 2013 13:52:22 +0000 (15:52 +0200)]
fix update_sharing to work after functionid merge
Helmut Grohne [Mon, 29 Jul 2013 19:44:56 +0000 (21:44 +0200)]
importpkg.py: support uncompressed data.tar
Helmut Grohne [Sat, 27 Jul 2013 07:39:14 +0000 (09:39 +0200)]
also move the static directory into the dedup package
Helmut Grohne [Sat, 27 Jul 2013 07:32:03 +0000 (09:32 +0200)]
move templates to dedup package
They cluttered webapp.py and now vim can give proper highlighting for
the templates.
Helmut Grohne [Fri, 26 Jul 2013 19:53:11 +0000 (21:53 +0200)]
verify package hashes when importing via http
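A sketch of verifying a download against an expected digest while reading it in chunks. The choice of SHA-256, the helper name copy_verified and the use of io.BytesIO in place of an HTTP response are assumptions; the commit does not say which hash is checked.

```python
import hashlib
import io

def copy_verified(stream, expected_sha256, chunk_size=65536):
    """Read stream fully; return its bytes or raise ValueError on mismatch."""
    digest = hashlib.sha256()
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        chunks.append(chunk)
    if digest.hexdigest() != expected_sha256:
        raise ValueError("package hash mismatch")
    return b"".join(chunks)

payload = b"not really a .deb"
expected = hashlib.sha256(payload).hexdigest()
data = copy_verified(io.BytesIO(payload), expected)
```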
Helmut Grohne [Fri, 26 Jul 2013 13:04:02 +0000 (15:04 +0200)]
Merge branch functionid
Actual savings on the full data set are around 7%.
Conflicts:
README
Helmut Grohne [Thu, 25 Jul 2013 11:28:19 +0000 (13:28 +0200)]
display "issues" with files in package view
Currently these are invalid .gz files and png files not named .png.
Helmut Grohne [Thu, 25 Jul 2013 10:48:45 +0000 (12:48 +0200)]
README: foo.PNG is also a valid png name
Helmut Grohne [Wed, 24 Jul 2013 05:20:19 +0000 (07:20 +0200)]
readyaml: cache the whole function table
This should reduce the query bandwidth to the rdbms.
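The idea can be sketched as loading the table into a dict once. The function table layout and the variable name function_ids are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
db.executemany("INSERT INTO function (name) VALUES (?)",
               [("sha512",), ("gzip_sha512",), ("png_sha512",)])

# One query up front instead of one lookup per imported hash.
function_ids = dict(db.execute("SELECT name, id FROM function"))
```

Later lookups such as function_ids["sha512"] then need no database round trip.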
Helmut Grohne [Tue, 23 Jul 2013 21:32:00 +0000 (23:32 +0200)]
webapp: make html for index valid
Helmut Grohne [Tue, 23 Jul 2013 21:26:52 +0000 (23:26 +0200)]
README: fix typo in query
Helmut Grohne [Tue, 23 Jul 2013 21:26:28 +0000 (23:26 +0200)]
webapp: remove unused function
Helmut Grohne [Tue, 23 Jul 2013 19:54:41 +0000 (21:54 +0200)]
adapt queries in README to new schema
Helmut Grohne [Tue, 23 Jul 2013 16:53:55 +0000 (18:53 +0200)]
schema: reference hash functions by integer key
This already worked quite well for package.id. On a test data set of 5%
size this transformation reduces the database size by about 4%.
Helmut Grohne [Mon, 22 Jul 2013 10:03:35 +0000 (12:03 +0200)]
schema: extend content_package_index
We can avoid a b-tree sort in the package comparison of the web app if
the package index also provides a size.
Helmut Grohne [Mon, 15 Jul 2013 05:21:09 +0000 (07:21 +0200)]
Merge branch 'packageid'
Helmut Grohne [Fri, 12 Jul 2013 13:24:09 +0000 (15:24 +0200)]
importpkg: simplify state logic
Helmut Grohne [Fri, 12 Jul 2013 13:12:09 +0000 (15:12 +0200)]
importpkg: split process_package to process_control
Helmut Grohne [Wed, 10 Jul 2013 14:16:45 +0000 (16:16 +0200)]
schema: reference package table by integer key
One approach to improve performance is to reduce the database size. A
package name takes up 15 bytes on average. A package number takes
up two bytes. Multiply that difference by the number of references and
it should be noticeable. A small test set shows a reduction by 10%.
Helmut Grohne [Wed, 10 Jul 2013 13:23:15 +0000 (15:23 +0200)]
schema.sql: drop unused index
sharing_package_index is a sub-index of sharing_insert_index and
therefore unnecessary.
Helmut Grohne [Wed, 3 Jul 2013 19:56:11 +0000 (21:56 +0200)]
README: explain update_sharing.py
Helmut Grohne [Sun, 23 Jun 2013 10:00:36 +0000 (12:00 +0200)]
Merge branch yamlimport
+ Way faster on multiple cores.
+ More reliable, because http connections do not time out when the db
blocks.
- Way slower on single core with contended io path. No clue why.
Still update_sharing.py makes up the bulk of processing time.
Helmut Grohne [Wed, 19 Jun 2013 06:35:26 +0000 (08:35 +0200)]
webapp: fix hash example link after git upload
The git binary changed and so did its hash. Choosing a more stable
example now: The GPL-3.
Helmut Grohne [Tue, 11 Jun 2013 21:22:10 +0000 (23:22 +0200)]
autoimport: don't fork for readyaml
This appears to be a huge performance boost.
Helmut Grohne [Tue, 11 Jun 2013 21:11:39 +0000 (23:11 +0200)]
autoimport: support processing individual files
This gets back the original functionality of importpkg.py.
Helmut Grohne [Mon, 10 Jun 2013 16:22:29 +0000 (18:22 +0200)]
split the import phase to a yaml stream
importpkg.py now emits a yaml stream instead of updating the database.
The actual updating now happens in readyaml.py. In this process
autoimport.py was significantly reworked to import packages in parallel.
Helmut Grohne [Mon, 27 May 2013 09:59:33 +0000 (11:59 +0200)]
dedup.image: img.convert can also raise that crazy stuff
Helmut Grohne [Thu, 9 May 2013 07:20:03 +0000 (09:20 +0200)]
webapp: declare html5 and utf-8
Helmut Grohne [Thu, 9 May 2013 06:32:14 +0000 (08:32 +0200)]
webapp: enrich comparison page with version info
Helmut Grohne [Wed, 8 May 2013 13:52:42 +0000 (15:52 +0200)]
fix attribution of logo
I remembered the wrong name. The logo was made by Sune Vuorela.
Helmut Grohne [Sun, 5 May 2013 15:24:50 +0000 (17:24 +0200)]
webapp: markup error in /source template
Helmut Grohne [Sun, 5 May 2013 15:22:33 +0000 (17:22 +0200)]
webapp: validator complained about <link> with sizes
Helmut Grohne [Sun, 5 May 2013 15:10:24 +0000 (17:10 +0200)]
webapp: reference favicon from base.html
Helmut Grohne [Sun, 5 May 2013 14:19:10 +0000 (16:19 +0200)]
added favicon.ico
Authored: Cyril Brulebois
Helmut Grohne [Thu, 2 May 2013 17:28:24 +0000 (19:28 +0200)]
webapp: use jinja's filesizeformat
Except it doesn't work, so replace it with our version. At least we
might be able to drop this code in a future update.
Helmut Grohne [Thu, 2 May 2013 16:48:14 +0000 (18:48 +0200)]
webapp: reduce size of comparison output
Only add rowspan when it carries a meaning.
Helmut Grohne [Sat, 27 Apr 2013 08:55:21 +0000 (10:55 +0200)]
webapp: add a css class binary-package
Helmut Grohne [Thu, 25 Apr 2013 12:19:58 +0000 (14:19 +0200)]
webapp: total_size is None if num_files is 0
Helmut Grohne [Thu, 25 Apr 2013 12:10:47 +0000 (14:10 +0200)]
webapp: color filenames when hovering them
Helmut Grohne [Thu, 25 Apr 2013 12:10:18 +0000 (14:10 +0200)]
webapp: turn the <br> after filename into a style
Helmut Grohne [Thu, 25 Apr 2013 12:02:48 +0000 (14:02 +0200)]
move css to /style.css
Helmut Grohne [Thu, 25 Apr 2013 12:01:11 +0000 (14:01 +0200)]
webapp: make filenames css styleable
Helmut Grohne [Thu, 25 Apr 2013 07:33:03 +0000 (09:33 +0200)]
webapp: top-align fields in /compare pages
Suggested by Paul Wise.
Helmut Grohne [Thu, 25 Apr 2013 07:32:46 +0000 (09:32 +0200)]
fix markup in base.html
Helmut Grohne [Wed, 24 Apr 2013 18:56:46 +0000 (20:56 +0200)]
implement the /compare/pkg1/pkg2 page differently
The original version had two major drawbacks:
1) The SQL query used would cause a btree sort, so the time waiting
for the first output was rather long.
2) For packages with many equal files, the output would grow with
O(n^2).
Thanks to the suggestions by Christine Grohne and Klaus Aehlig. The
approach now groups files in package1 by their main hash value (sha512).
It also now does by hand some work that SQL was designed for. To speed
up page generation a new caching table was added identifying which files
have corresponding shared files.
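The grouping idea can be sketched as follows. The function names and the (filename, hash) input format are assumptions; this is an illustration of grouping by hash, not the webapp's implementation. Each shared hash yields one output row listing the files on both sides, rather than one row per pair of equal files.

```python
from collections import defaultdict

def group_by_hash(files):
    """Map each hash value to the sorted list of filenames carrying it."""
    groups = defaultdict(list)
    for filename, digest in files:
        groups[digest].append(filename)
    return {digest: sorted(names) for digest, names in groups.items()}

def compare(files1, files2):
    """Return one (names in package1, names in package2) row per shared hash."""
    groups1 = group_by_hash(files1)
    groups2 = group_by_hash(files2)
    return [(groups1[d], groups2[d]) for d in sorted(groups1) if d in groups2]

files1 = [("a", "h1"), ("b", "h1"), ("c", "h2")]
files2 = [("x", "h1"), ("y", "h1"), ("z", "h3")]
rows = compare(files1, files2)
```

With two equal files on each side, the pairwise output would have four rows; the grouped output has one.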
Helmut Grohne [Sun, 14 Apr 2013 08:31:55 +0000 (10:31 +0200)]
webapp: added some useful notes
Helmut Grohne [Sat, 13 Apr 2013 07:59:45 +0000 (09:59 +0200)]
base.html: add link to wiki.debian.org
Helmut Grohne [Mon, 8 Apr 2013 12:41:23 +0000 (14:41 +0200)]
README: improve query after schemachange
Helmut Grohne [Tue, 26 Mar 2013 15:23:37 +0000 (16:23 +0100)]
webapp: fix problem from the previous merge
Helmut Grohne [Tue, 26 Mar 2013 14:59:48 +0000 (15:59 +0100)]
Merge branch schemachange
Helmut Grohne [Wed, 20 Mar 2013 18:19:50 +0000 (19:19 +0100)]
webapp: report correct sizes
Helmut Grohne [Wed, 20 Mar 2013 18:12:25 +0000 (19:12 +0100)]
webapp: remove broken assert
Fails on long inputs.
Helmut Grohne [Mon, 18 Mar 2013 15:51:17 +0000 (16:51 +0100)]
dedup.image: mask errors from PIL
Helmut Grohne [Tue, 12 Mar 2013 07:38:57 +0000 (08:38 +0100)]
dedup.arreader: missing bytes marker
Helmut Grohne [Tue, 12 Mar 2013 07:24:49 +0000 (08:24 +0100)]
move ArReader from importpkg to dedup.arreader
Also document it.
Helmut Grohne [Sun, 10 Mar 2013 06:38:22 +0000 (07:38 +0100)]
README: update queries to match content table split
Helmut Grohne [Sat, 9 Mar 2013 17:43:47 +0000 (18:43 +0100)]
split content table to a hash table
In the old content table, (package, filename, size) would be the same for
multiple hash functions. Now the schema represents that each file has
precisely one size, but multiple hashes.
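A sketch of what such a split might look like. Column names and sample values are assumptions, not copied from schema.sql. The size is stored once per file, and each hash function contributes one row to the hash table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE content (
    id INTEGER PRIMARY KEY,
    package TEXT NOT NULL,
    filename TEXT NOT NULL,
    size INTEGER NOT NULL);
CREATE TABLE hash (
    cid INTEGER NOT NULL REFERENCES content(id),
    function TEXT NOT NULL,
    hash TEXT NOT NULL);
INSERT INTO content VALUES (1, 'hello', './usr/share/doc/hello/NEWS.gz', 1234);
INSERT INTO hash VALUES (1, 'sha512', 'aaaa'), (1, 'gzip_sha512', 'bbbb');
""")

# Joining the two tables recovers the shape of the old content table.
rows = db.execute("""
    SELECT content.filename, content.size, hash.function, hash.hash
    FROM content JOIN hash ON hash.cid = content.id
    ORDER BY hash.function""").fetchall()
```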
Helmut Grohne [Sat, 9 Mar 2013 17:37:24 +0000 (18:37 +0100)]
webapp: drop unused function compute_sharedstats
The sharing table works great and I don't want to adapt it for the next
step in the schema change.
Helmut Grohne [Thu, 7 Mar 2013 08:05:48 +0000 (09:05 +0100)]
use "ON DELETE CASCADE" clauses
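A sketch of the effect, assuming simplified package and content tables. SQLite only enforces foreign keys, and therefore only cascades deletes, once "PRAGMA foreign_keys = ON" has been issued on the connection.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
db.executescript("""
CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE content (
    id INTEGER PRIMARY KEY,
    pid INTEGER NOT NULL REFERENCES package(id) ON DELETE CASCADE,
    filename TEXT NOT NULL);
INSERT INTO package (id, name) VALUES (1, 'hello');
INSERT INTO content (pid, filename) VALUES (1, './usr/bin/hello');
DELETE FROM package WHERE id = 1;
""")
# The content row went away together with its package.
remaining = db.execute("SELECT count(*) FROM content").fetchone()[0]

# With enforcement on, dangling references are rejected.
try:
    db.execute("INSERT INTO content (pid, filename) VALUES (99, './x')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```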
Helmut Grohne [Thu, 7 Mar 2013 07:43:15 +0000 (08:43 +0100)]
enable enforcing foreign keys
Helmut Grohne [Thu, 7 Mar 2013 07:41:35 +0000 (08:41 +0100)]
schema.sql: remove unsatisfiable foreign key
In the dependency table we will insert dependencies on packages which
are not tracked. This happens during initial import and for virtual
packages. Therefore the "required" column cannot be a foreign key.
Helmut Grohne [Thu, 7 Mar 2013 07:28:56 +0000 (08:28 +0100)]
schema.sql: annotate foreign keys of sharing
Helmut Grohne [Thu, 7 Mar 2013 07:24:44 +0000 (08:24 +0100)]
integrate the source table into the package table
Helmut Grohne [Thu, 7 Mar 2013 07:12:01 +0000 (08:12 +0100)]
README: explain queries
Helmut Grohne [Wed, 6 Mar 2013 14:36:49 +0000 (15:36 +0100)]
README: added interesting query
Helmut Grohne [Tue, 5 Mar 2013 07:39:06 +0000 (08:39 +0100)]
webapp: added /source/<pkg> page
Helmut Grohne [Tue, 5 Mar 2013 07:38:39 +0000 (08:38 +0100)]
webapp: helper function function_combination
Helmut Grohne [Tue, 5 Mar 2013 07:21:13 +0000 (08:21 +0100)]
importpkg: source header may contain a version
Helmut Grohne [Mon, 4 Mar 2013 17:53:23 +0000 (18:53 +0100)]
webapp: fix index template
Apparently not all browsers understand <a ... /> in all rendering modes.
Helmut Grohne [Mon, 4 Mar 2013 17:49:54 +0000 (18:49 +0100)]
webapp: use caching table "shared" for /binary page
Helmut Grohne [Mon, 4 Mar 2013 12:49:22 +0000 (13:49 +0100)]
webapp: generate /comparison pages in constant-space
Helmut Grohne [Mon, 4 Mar 2013 10:44:24 +0000 (11:44 +0100)]
importpkg: record the source package relationship
Helmut Grohne [Sat, 2 Mar 2013 21:33:39 +0000 (22:33 +0100)]
update_sharing: wrong database name
Helmut Grohne [Sat, 2 Mar 2013 21:29:04 +0000 (22:29 +0100)]
add sharing table
The sharing table is a cache for the /binary web pages. It essentially
contains the numbers presented. This caching table is not automatically
populated. It needs to be reconstructed after each package import or
group of imports.
Helmut Grohne [Sat, 2 Mar 2013 20:46:47 +0000 (21:46 +0100)]
update README
* Tell about schema.sql.
* Explain WAL.
Helmut Grohne [Sat, 2 Mar 2013 20:24:18 +0000 (21:24 +0100)]
move fetchiter from webapp to dedup.utils
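A helper of this kind could look like the sketch below, which iterates over a DB-API cursor in batches so that a large result never has to be held in memory at once. This is an assumption about its shape, not a copy of dedup.utils.

```python
import sqlite3

def fetchiter(cursor):
    """Yield rows from cursor in batches of cursor.arraysize."""
    rows = cursor.fetchmany()
    while rows:
        for row in rows:
            yield row
        rows = cursor.fetchmany()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (n INTEGER)")
db.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5)])
cur = db.cursor()
cur.arraysize = 2  # small batch size to exercise several fetchmany calls
cur.execute("SELECT n FROM t ORDER BY n")
values = [n for (n,) in fetchiter(cur)]
```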