Helmut Grohne [Fri, 29 Jul 2016 15:04:12 +0000 (17:04 +0200)]
repository moved
Helmut Grohne [Thu, 9 Jun 2016 20:48:46 +0000 (22:48 +0200)]
DecompressedStream: fix decompression without flush
In Python 3.x, lzma.LZMADecompressor doesn't have a flush method.
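A minimal sketch of how the missing flush method can be tolerated. The helper name finish is hypothetical and not taken from dedup; only the standard library zlib and lzma modules are used.

```python
import lzma
import zlib

def finish(decompressor):
    """Return any remaining output, tolerating objects without flush()."""
    flush = getattr(decompressor, "flush", None)
    return flush() if flush is not None else b""

# zlib decompression objects provide flush() ...
zobj = zlib.decompressobj()
zdata = zobj.decompress(zlib.compress(b"hello")) + finish(zobj)

# ... whereas lzma.LZMADecompressor on Python 3.x does not.
lobj = lzma.LZMADecompressor()
ldata = lobj.decompress(lzma.compress(b"hello")) + finish(lobj)
```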
Helmut Grohne [Thu, 9 Jun 2016 20:44:04 +0000 (22:44 +0200)]
autoimport: fix hash check
Fixes:
2f12a6e2f426 ("autoimport: add option to skip hash checking")
Helmut Grohne [Wed, 25 May 2016 17:27:35 +0000 (19:27 +0200)]
autoimport: improve fetching package lists
Move the fetching part into dedup.utils. Instead of hard coding the
gzip compressed copy, try xz, gz and plain in that order. Also take care
to actually close the connection.
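The fallback order could look roughly like the following sketch. It targets Python 3 and reads the whole response instead of streaming it, which is a simplification. The names CANDIDATES, fetch_url and fetch_package_list are hypothetical and do not come from dedup.utils.

```python
import contextlib
import gzip
import lzma
from urllib.request import urlopen

# Extensions to try, most preferred first, each with a matching decompressor.
CANDIDATES = [
    (".xz", lzma.decompress),
    (".gz", gzip.decompress),
    ("", lambda data: data),
]

def fetch_url(url):
    """Fetch url completely and make sure the connection gets closed."""
    with contextlib.closing(urlopen(url)) as conn:
        return conn.read()

def fetch_package_list(base_url, fetch=fetch_url):
    """Return the decompressed list from the first variant that exists."""
    for extension, decompress in CANDIDATES:
        try:
            raw = fetch(base_url + extension)
        except IOError:  # HTTPError and URLError are IOError subclasses
            continue
        return decompress(raw)
    raise IOError("no variant of %s could be fetched" % base_url)
```

Passing the fetch function as a parameter keeps the fallback logic testable without network access.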
Helmut Grohne [Tue, 24 May 2016 15:50:57 +0000 (17:50 +0200)]
use urlopen from urllib2 on py2
This causes non-successful fetches to result in HTTPErrors like it does
in py3 already.
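One common way to get the same urlopen behaviour on both versions is a guarded import. This is a sketch of the usual idiom, not necessarily the exact code in the repository.

```python
try:
    # Python 3.x
    from urllib.request import urlopen
except ImportError:
    # Python 2.x: urllib2.urlopen raises HTTPError on unsuccessful fetches,
    # unlike urllib.urlopen.
    from urllib2 import urlopen
```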
Helmut Grohne [Mon, 23 May 2016 19:49:43 +0000 (21:49 +0200)]
move dedup.debpkg.process_control back into importpkg
After all, it isn't that generic. It knows what information is necessary
for running dedup. Thus it really belongs to the extractor subclass.
By building on handle_control_info, not that much parsing logic is left
in the extractor subclass.
Helmut Grohne [Mon, 23 May 2016 19:48:15 +0000 (21:48 +0200)]
DebExtractor: implement parsing of control.tar
Helmut Grohne [Mon, 23 May 2016 19:09:38 +0000 (21:09 +0200)]
importpkg: fix --hash broken in previous commit
Helmut Grohne [Mon, 23 May 2016 19:03:52 +0000 (21:03 +0200)]
remove curl dependency
Teach importpkg how to download urls using urlopen and thus remove the
need for invoking curl.
Helmut Grohne [Mon, 23 May 2016 13:33:40 +0000 (15:33 +0200)]
autoimport: add option to skip hash checking
For variations of dedup that do not consume the data.tar member, this
option can save significant bandwidth.
Helmut Grohne [Sun, 22 May 2016 21:21:16 +0000 (23:21 +0200)]
autoimport: stream package list and use generic decompressor
* streaming means that we do not need to hold the entire package list
in memory (but the pkgs dict will become large anyway).
* The decompress utility allows easily switching to e.g. xz which is
the only compression format for the dbgsym suites.
Helmut Grohne [Sun, 22 May 2016 21:18:54 +0000 (23:18 +0200)]
DecompressedStream: implement readline
Iteration over the file-like object is required by deb822.Packages.iter_paragraphs.
Helmut Grohne [Sat, 21 May 2016 15:54:04 +0000 (17:54 +0200)]
move from deprecated optparse to argparse
Helmut Grohne [Thu, 5 May 2016 19:21:48 +0000 (21:21 +0200)]
treat Pre-Depends like regular Depends
Previously they were ignored. The intended use for dedup is to
know whenever a package unconditionally requires another package.
Helmut Grohne [Sun, 1 May 2016 12:31:56 +0000 (14:31 +0200)]
push more functionality into DebExtractor
The handle_ar_member and handle_ar_end methods now have a default
implementation adding further handlers handle_debversion,
handle_control_tar and handle_data_tar.
In that process two additional bugs were fixed:
* decompress_tar was wrongly passing errors="surrogateescape" for
Python 2.x even though that's only supported for Python 3.x.
* The use of decompress actually passes the extension as unicode.
Helmut Grohne [Sun, 1 May 2016 12:26:20 +0000 (14:26 +0200)]
use same Python version for autoimport and importpkg
The autoimport tool runs the Python interpreter explicitly. Instead of
invoking just "python" and thus calling whatever the current default is,
use sys.executable which is the interpreter used to run autoimport, thus
locking both to the same Python version.
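A small sketch of the idea, using a hypothetical helper run_with_same_python and an inline script in place of importpkg:

```python
import subprocess
import sys

def run_with_same_python(args):
    """Run a child Python using the interpreter that runs this script."""
    return subprocess.check_output([sys.executable] + list(args))

# The child reports the same major version as the parent.
output = run_with_same_python(["-c", "import sys; print(sys.version_info[0])"])
```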
Helmut Grohne [Thu, 28 Apr 2016 19:35:42 +0000 (21:35 +0200)]
support Python 3.x in importpkg
In Python 2.x, TarInfo.name is a bytes object. In Python 3.x,
TarInfo.name always is a unicode object. To avoid importpkg crashing
with an exception, we direct the Python 3.x decoding to use
surrogateescapes. Thus decoding the name boils down to checking whether
it contains surrogates.
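The surrogate check can be illustrated as follows for Python 3.x. The function decode_member_name is a hypothetical stand-in that decodes raw bytes the way surrogateescape decoding would and then tests whether the result is clean UTF-8.

```python
def decode_member_name(raw):
    """Decode raw bytes and report whether they were valid UTF-8.

    Undecodable bytes become lone surrogates, which a strict UTF-8
    encoder refuses to encode.
    """
    name = raw.decode("utf-8", "surrogateescape")
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return name, False
    return name, True

good_name, good_ok = decode_member_name(b"caf\xc3\xa9")  # valid UTF-8
bad_name, bad_ok = decode_member_name(b"caf\xe9")        # latin-1 byte
```

The original bytes remain recoverable by encoding with surrogateescape again.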
Helmut Grohne [Thu, 28 Apr 2016 18:50:12 +0000 (20:50 +0200)]
decouple a function decompress out of decompress_tar
Building on the previous commit, add a decompress function that turns a
compressed filelike into a decompressed filelike. Use it to decouple the
decompression step.
Helmut Grohne [Thu, 28 Apr 2016 18:28:11 +0000 (20:28 +0200)]
extend functionality of DecompressedStream
It now supports:
* tell()
* seek(absolute_position), forward only
* close()
* closed
This is sufficient for putting it as a fileobj into tarfile.TarFile. By
doing so we can decouple decompression from tar processing, which eases
papering over the Python 2.x vs Python 3.x differences.
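The listed operations can be sketched as below. This is an illustrative reimplementation under the name ForwardOnlyStream, not the actual DecompressedStream from dedup; the block size and error message are assumptions.

```python
import io
import zlib

class ForwardOnlyStream(object):
    """File-like view of decompressed data with tell() and forward seek()."""
    blocksize = 65536

    def __init__(self, fileobj, decompressor):
        self.fileobj = fileobj
        self.decompressor = decompressor
        self.buff = b""
        self.pos = 0
        self.closed = False

    def read(self, length=None):
        # Refill the buffer until enough data is available or input ends.
        while self.decompressor is not None and (
                length is None or len(self.buff) < length):
            data = self.fileobj.read(self.blocksize)
            if data:
                self.buff += self.decompressor.decompress(data)
            else:
                flush = getattr(self.decompressor, "flush", None)
                if flush is not None:
                    self.buff += flush()
                self.decompressor = None
        if length is None:
            length = len(self.buff)
        result, self.buff = self.buff[:length], self.buff[length:]
        self.pos += len(result)
        return result

    def tell(self):
        return self.pos

    def seek(self, pos):
        if pos < self.pos:
            raise ValueError("cannot seek backwards")
        # Seeking forward means reading and discarding data.
        while self.pos < pos:
            if not self.read(min(pos - self.pos, self.blocksize)):
                break

    def close(self):
        self.closed = True
        self.fileobj.close()

payload = bytes(bytearray(range(256))) * 10
stream = ForwardOnlyStream(io.BytesIO(zlib.compress(payload)),
                           zlib.decompressobj())
```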
Helmut Grohne [Thu, 21 Apr 2016 21:15:22 +0000 (23:15 +0200)]
importpkg: move the hash function list to the extractor class
They really are an aspect of the particular extractor and can easily be
changed by subclassing.
Helmut Grohne [Tue, 19 Apr 2016 20:48:02 +0000 (22:48 +0200)]
add a class DebExtractor for guiding feature extraction
It is supposed to separate the parsing of Debian packages (understanding
how the format works) from the actual feature extraction. Its goal is to
simplify writing custom extractors for different feature sets.
Helmut Grohne [Sat, 16 Apr 2016 09:22:18 +0000 (11:22 +0200)]
add a validate method to HashedStream
Helmut Grohne [Sat, 16 Apr 2016 09:14:40 +0000 (11:14 +0200)]
importpkg: use yaml dumper directly
Instead of carefully crafting an iterator to pass to yaml.safe_dump_all,
we simply take control on our own and call represent on a yaml dumper
object where needed.
Helmut Grohne [Sat, 16 Apr 2016 07:03:51 +0000 (09:03 +0200)]
importpkg: refactor commit handling out of process_package*
Helmut Grohne [Fri, 8 Apr 2016 18:56:42 +0000 (20:56 +0200)]
urlopen moved from urllib to urllib.request in py3k
Helmut Grohne [Thu, 16 Apr 2015 15:58:56 +0000 (17:58 +0200)]
process_control: do not encode to ascii
Otherwise the yaml will contain binary strings on py3k which end up as
binary data in the sqlite database. In py2, yaml can handle those
unicode objects just fine.
Helmut Grohne [Thu, 16 Apr 2015 15:56:24 +0000 (17:56 +0200)]
tempfile.mkdtemp does not like bytes in py3k
Helmut Grohne [Thu, 16 Apr 2015 15:56:02 +0000 (17:56 +0200)]
unquote moved from urllib to urllib.parse in py3k
Helmut Grohne [Thu, 16 Apr 2015 15:47:20 +0000 (17:47 +0200)]
element access on bytes yields int in py3k
Helmut Grohne [Thu, 16 Apr 2015 15:46:07 +0000 (17:46 +0200)]
zlib.crc32 behaves inconsistently on py2 vs py3
zlib.crc32 returns an int32_t on py2 and a uint32_t on py3.
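The usual way to get consistent results is to mask the value to 32 bits. The wrapper name crc32 below is only for illustration.

```python
import zlib

def crc32(data, value=0):
    """Return the CRC-32 as an unsigned 32 bit integer on py2 and py3."""
    return zlib.crc32(data, value) & 0xffffffff

# 0xCBF43926 is the standard CRC-32 check value for b"123456789".
check = crc32(b"123456789")
```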
Helmut Grohne [Thu, 16 Apr 2015 15:44:31 +0000 (17:44 +0200)]
there is no itertools.imap in py3k
Helmut Grohne [Thu, 16 Apr 2015 15:43:48 +0000 (17:43 +0200)]
use binary stdin on py3k
Helmut Grohne [Thu, 16 Apr 2015 15:43:11 +0000 (17:43 +0200)]
distinguish bytes from unicode for py3k
Helmut Grohne [Wed, 23 Jul 2014 16:07:39 +0000 (18:07 +0200)]
importpkg: be more liberal in control file naming
While in current sid packages the control file in control.tar is always
named "./control", some older packages name it "control".
Helmut Grohne [Sat, 14 Jun 2014 10:08:09 +0000 (12:08 +0200)]
improve schema documentation
wording, more NOT NULLs, some more explanations
Helmut Grohne [Sat, 14 Jun 2014 08:19:55 +0000 (10:19 +0200)]
add documentation to schema.sql
Thanks to Peter Palfrader for explaining what information is needed and
reviewing the documentation.
Helmut Grohne [Sun, 11 May 2014 13:59:46 +0000 (15:59 +0200)]
update copyright information
Helmut Grohne [Sun, 11 May 2014 13:57:36 +0000 (15:57 +0200)]
importpkg: reduce copy&paste
Guillem Jover [Wed, 7 May 2014 23:50:48 +0000 (01:50 +0200)]
importpkg: add support for data.tar.lzma
Creating packages with lzma compression has been deprecated since dpkg
1.16.4, but there might be some of those in the wild and supporting them
is straightforward when xz is already supported.
Signed-off-by: Guillem Jover <guillem@debian.org>
Guillem Jover [Wed, 7 May 2014 19:06:38 +0000 (21:06 +0200)]
importpkg: add support for control.tar and control.tar.xz
dpkg supports those since 1.17.6.
Signed-off-by: Guillem Jover <guillem@debian.org>
Guillem Jover [Wed, 7 May 2014 23:46:21 +0000 (01:46 +0200)]
dedup.arreader: remove trailing slash from ar members
The GNU ar format adds a trailing slash to the member names. Normalize
the member names to take this into account.
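A sketch of such a normalization, assuming the raw 16 byte, space padded name field of an ar header as input. The function name is hypothetical and not taken from dedup.arreader.

```python
def normalize_member_name(field):
    """Strip the space padding and an optional GNU-style trailing slash."""
    name = field.rstrip(b" ")
    if name.endswith(b"/"):
        name = name[:-1]
    return name

gnu_style = normalize_member_name(b"debian-binary/  ")
plain_style = normalize_member_name(b"control.tar.gz  ")
```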
Signed-off-by: Guillem Jover <guillem@debian.org>
Helmut Grohne [Sun, 11 May 2014 13:25:46 +0000 (15:25 +0200)]
webapp: allow git-like hash truncation
Helmut Grohne [Mon, 21 Apr 2014 10:50:15 +0000 (12:50 +0200)]
autoimport: support protocols besides http
Helmut Grohne [Sat, 8 Mar 2014 08:48:17 +0000 (09:48 +0100)]
schema: make syntax compatible with postgres
Helmut Grohne [Sun, 23 Feb 2014 19:12:18 +0000 (20:12 +0100)]
Merge branch updatesharing-eqclass
Helmut Grohne [Sun, 23 Feb 2014 17:19:35 +0000 (18:19 +0100)]
spell check comments
Helmut Grohne [Sun, 23 Feb 2014 16:29:41 +0000 (17:29 +0100)]
fix spelling mistake
Reported-By: Stefan Kaltenbrunner
Helmut Grohne [Sun, 23 Feb 2014 14:44:03 +0000 (15:44 +0100)]
webapp: fix eqclass usage in package comparison
When comparing two packages, objects would be considered duplicates
without considering whether the respective hash functions are comparable
by checking their equivalence classes. The current set of hash functions
does not expose this bug.
Helmut Grohne [Fri, 21 Feb 2014 20:59:04 +0000 (21:59 +0100)]
update_sharing: weaken assumptions about db layout
Hash functions are partitioned into equivalence classes. We are
generally only interested in sharing among hash functions with the same
equivalence class, but the algorithm would compute any sharing. While
the current layout never produces the same hashes for functions in
different equivalence classes (for different output length), that may
change in future.
Also allow hash functions that belong to no equivalence class at all
(eqclass = NULL) as a means to add additional metadata to content
without computing any sharing for it.
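The effect of restricting sharing to one equivalence class can be illustrated with a toy database. The tables func and hash and their columns are hypothetical and much simpler than the real schema; the point is that comparing eqclass values in SQL also excludes NULL classes, because NULL = NULL is not true.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE func (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                   eqclass INTEGER);
CREATE TABLE hash (content INTEGER NOT NULL, function_id INTEGER NOT NULL,
                   hash TEXT NOT NULL);
""")
db.executemany("INSERT INTO func VALUES (?, ?, ?)", [
    (1, "sha512", 1),
    (2, "gzip_sha512", 1),
    (3, "png_sha512", 2),
    (4, "metadata", None),  # no equivalence class: never shared
])
# Every entry deliberately has the same hash value.
db.executemany("INSERT INTO hash VALUES (?, ?, ?)", [
    (10, 1, "aa"), (11, 2, "aa"), (12, 3, "aa"), (13, 4, "aa"), (14, 4, "aa"),
])
rows = db.execute("""
SELECT h1.content, h2.content
FROM hash AS h1
JOIN hash AS h2 ON h1.hash = h2.hash AND h1.content < h2.content
JOIN func AS f1 ON f1.id = h1.function_id
JOIN func AS f2 ON f2.id = h2.function_id
WHERE f1.eqclass = f2.eqclass
ORDER BY h1.content, h2.content
""").fetchall()
```

Only the pair hashed by sha512 and gzip_sha512 is reported, although all five entries carry the same hash value.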
Helmut Grohne [Wed, 19 Feb 2014 13:21:20 +0000 (14:21 +0100)]
blacklist content rather than hashes
Otherwise the gzip hash cannot tell the empty stream and the
compressed empty stream apart.
Helmut Grohne [Wed, 19 Feb 2014 13:19:56 +0000 (14:19 +0100)]
GzipDecompressor: don't treat checksum as garbage trailer
Helmut Grohne [Wed, 19 Feb 2014 06:54:21 +0000 (07:54 +0100)]
DecompressedHash should fail on trailing input
Otherwise all files smaller than 10 bytes are successfully hashed to the
hash of the empty input when using the GzipDecompressor.
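The failure mode can be demonstrated with the standard library. The sketch below uses zlib in gzip mode rather than dedup's own GzipDecompressor, and the function name gunzip_strict is hypothetical. It rejects both truncated input and trailing garbage instead of silently returning empty output.

```python
import gzip
import zlib

def gunzip_strict(data):
    """Decompress one gzip stream, failing on truncation or trailing data."""
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        output = decompressor.decompress(data) + decompressor.flush()
    except zlib.error as exc:
        raise ValueError(str(exc))
    if not decompressor.eof:
        raise ValueError("truncated gzip stream")
    if decompressor.unused_data:
        raise ValueError("trailing data after gzip stream")
    return output

compressed = gzip.compress(b"x")
```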
Reported-By: Olly Betts
Helmut Grohne [Thu, 3 Oct 2013 06:51:41 +0000 (08:51 +0200)]
work around python-debian's #670679
Helmut Grohne [Wed, 11 Sep 2013 06:35:41 +0000 (08:35 +0200)]
webapp: open cursors less often
On the main instance opening cursors equals initiating a connection.
Unfortunately sqlite3.Connection.close does not close filedescriptors.
So just open fewer cursors to leak filedescriptors less often.
Helmut Grohne [Tue, 10 Sep 2013 07:39:40 +0000 (09:39 +0200)]
webapp: close database cursors
Leaking them can result in running out of available filedescriptors.
Helmut Grohne [Wed, 4 Sep 2013 08:15:59 +0000 (10:15 +0200)]
webapp: serve static files from /static
Helmut Grohne [Mon, 2 Sep 2013 16:51:20 +0000 (18:51 +0200)]
add option -d --database for db path to all scripts
Helmut Grohne [Mon, 2 Sep 2013 08:00:44 +0000 (10:00 +0200)]
autoimport: avoid hard coded temporary directory
Helmut Grohne [Mon, 2 Sep 2013 07:30:05 +0000 (09:30 +0200)]
importpkg: move library-like parts to dedup.debpkg
Helmut Grohne [Mon, 19 Aug 2013 09:52:39 +0000 (11:52 +0200)]
importpkg: don't blacklist boring gzip_sha512 hashes
* In practice there are very few compressed files with trivial hashes.
* Blacklisting these values results in false positives in the gzip
issues.
Helmut Grohne [Fri, 16 Aug 2013 20:45:18 +0000 (22:45 +0200)]
make debian version_compare available in sql
Helmut Grohne [Fri, 16 Aug 2013 20:36:04 +0000 (22:36 +0200)]
webapp templates: add an anchor for file issues
Helmut Grohne [Fri, 2 Aug 2013 13:21:56 +0000 (15:21 +0200)]
model comparability as an equivalence relation
The webapp has had a relation hash_functions that modeled "comparable
functions". Images should not be compared to other files, since it makes
no sense to store them as the RGBA stream that is being hashed. This
comparability property resembles an equivalence relation. So the
function table gains a column eqclass. Each class is represented by a
number and functions are statically assigned to these classes. Now the
filtering happens in SQL instead of Python.
Helmut Grohne [Thu, 1 Aug 2013 21:06:26 +0000 (23:06 +0200)]
support hashing gif images
* Rename "image_sha512" to "png_sha512".
* dedup.image.ImageHash is now a base class for image hashes such as
PNGHash and GIFHash.
* Enable both hashes in importpkg.
* Fix README.
* Add new hash combinations to webapp.
* Add "gif file not named *.gif" to issues in update_sharing.
* Add redirect for "image_sha512" to webapp for backwards
compatibility.
Helmut Grohne [Tue, 30 Jul 2013 16:15:56 +0000 (18:15 +0200)]
templates/binary: space between package and compare
Helmut Grohne [Tue, 30 Jul 2013 14:03:16 +0000 (16:03 +0200)]
templates: wiki.d.o redirects to https now
Helmut Grohne [Tue, 30 Jul 2013 13:52:22 +0000 (15:52 +0200)]
fix update_sharing to work after functionid merge
Helmut Grohne [Mon, 29 Jul 2013 19:44:56 +0000 (21:44 +0200)]
importpkg.py: support uncompressed data.tar
Helmut Grohne [Sat, 27 Jul 2013 07:39:14 +0000 (09:39 +0200)]
also move the static directory into the dedup package
Helmut Grohne [Sat, 27 Jul 2013 07:32:03 +0000 (09:32 +0200)]
move templates to dedup package
They cluttered webapp.py and now vim can give proper highlighting for
the templates.
Helmut Grohne [Fri, 26 Jul 2013 19:53:11 +0000 (21:53 +0200)]
verify package hashes when importing via http
Helmut Grohne [Fri, 26 Jul 2013 13:04:02 +0000 (15:04 +0200)]
Merge branch functionid
Actual savings on the full data set are around 7%.
Conflicts:
README
Helmut Grohne [Thu, 25 Jul 2013 11:28:19 +0000 (13:28 +0200)]
display "issues" with files in package view
Currently these are invalid .gz files and png files not named .png.
Helmut Grohne [Thu, 25 Jul 2013 10:48:45 +0000 (12:48 +0200)]
README: foo.PNG is also a valid png name
Helmut Grohne [Wed, 24 Jul 2013 05:20:19 +0000 (07:20 +0200)]
readyaml: cache the whole function table
This should reduce the query bandwidth to the rdbms.
Helmut Grohne [Tue, 23 Jul 2013 21:32:00 +0000 (23:32 +0200)]
webapp: make html for index valid
Helmut Grohne [Tue, 23 Jul 2013 21:26:52 +0000 (23:26 +0200)]
README: fix typo in query
Helmut Grohne [Tue, 23 Jul 2013 21:26:28 +0000 (23:26 +0200)]
webapp: remove unused function
Helmut Grohne [Tue, 23 Jul 2013 19:54:41 +0000 (21:54 +0200)]
adapt queries in README to new schema
Helmut Grohne [Tue, 23 Jul 2013 16:53:55 +0000 (18:53 +0200)]
schema: reference hash functions by integer key
This already worked quite well for package.id. On a test data set of 5%
size this transformation reduces the database size by about 4%.
Helmut Grohne [Mon, 22 Jul 2013 10:03:35 +0000 (12:03 +0200)]
schema: extend content_package_index
We can avoid a b-tree sort in the package comparison of the web app if
the package index also provides a size.
Helmut Grohne [Mon, 15 Jul 2013 05:21:09 +0000 (07:21 +0200)]
Merge branch 'packageid'
Helmut Grohne [Fri, 12 Jul 2013 13:24:09 +0000 (15:24 +0200)]
importpkg: simplify state logic
Helmut Grohne [Fri, 12 Jul 2013 13:12:09 +0000 (15:12 +0200)]
importpkg: split process_package to process_control
Helmut Grohne [Wed, 10 Jul 2013 14:16:45 +0000 (16:16 +0200)]
schema: reference package table by integer key
One approach to improve performance is to reduce the database size. A
package name takes up 15 bytes on average. A package number takes up
two bytes. Multiply that difference by the number of references and
it should be noticeable. A small test set shows a reduction of 10%.
Helmut Grohne [Wed, 10 Jul 2013 13:23:15 +0000 (15:23 +0200)]
schema.sql: drop unused index
sharing_package_index is a sub-index of sharing_insert_index and
therefore unnecessary.
Helmut Grohne [Wed, 3 Jul 2013 19:56:11 +0000 (21:56 +0200)]
README: explain update_sharing.py
Helmut Grohne [Sun, 23 Jun 2013 10:00:36 +0000 (12:00 +0200)]
Merge branch yamlimport
+ Way faster on multiple cores.
+ More reliable, because http connections do not time out when the db
blocks.
- Way slower on single core with contended io path. No clue why.
Still update_sharing.py makes up the bulk of processing time.
Helmut Grohne [Wed, 19 Jun 2013 06:35:26 +0000 (08:35 +0200)]
webapp: fix hash example link after git upload
The git binary changed and so did its hash. Choosing a more stable
example now: The GPL-3.
Helmut Grohne [Tue, 11 Jun 2013 21:22:10 +0000 (23:22 +0200)]
autoimport: don't fork for readyaml
This appears to be a huge performance boost.
Helmut Grohne [Tue, 11 Jun 2013 21:11:39 +0000 (23:11 +0200)]
autoimport: support processing individual files
This gets back the original functionality of importpkg.py.
Helmut Grohne [Mon, 10 Jun 2013 16:22:29 +0000 (18:22 +0200)]
split the import phase to a yaml stream
importpkg.py now emits a yaml stream instead of updating the database.
The actual updating now happens in readyaml.py. In this process
autoimport.py was significantly reworked to import packages in parallel.
Helmut Grohne [Mon, 27 May 2013 09:59:33 +0000 (11:59 +0200)]
dedup.image: img.convert can also raise that crazy stuff
Helmut Grohne [Thu, 9 May 2013 07:20:03 +0000 (09:20 +0200)]
webapp: declare html5 and utf-8
Helmut Grohne [Thu, 9 May 2013 06:32:14 +0000 (08:32 +0200)]
webapp: enrich comparison page with version info
Helmut Grohne [Wed, 8 May 2013 13:52:42 +0000 (15:52 +0200)]
fix attribution of logo
I remembered the wrong name. The logo was made by Sune Vuorela.
Helmut Grohne [Sun, 5 May 2013 15:24:50 +0000 (17:24 +0200)]
webapp: markup error in /source template
Helmut Grohne [Sun, 5 May 2013 15:22:33 +0000 (17:22 +0200)]
webapp: validator complained about <link> with sizes
Helmut Grohne [Sun, 5 May 2013 15:10:24 +0000 (17:10 +0200)]
webapp: reference favicon from base.html
Helmut Grohne [Sun, 5 May 2013 14:19:10 +0000 (16:19 +0200)]
added favicon.ico
Authored: Cyril Brulebois