mirror of https://github.com/djohnlewis/stackdump synced 2024-12-04 15:07:36 +00:00

Updated README with v1.2 changes and SO import stats.

Sam 2013-12-01 03:33:40 +11:00
parent 9613caa8d1
commit ce3eb04270


@@ -28,6 +28,21 @@ Version 1.1 fixes a few bugs, the major one being the inability to import the 20
Because changes have been made to the search schema and the search indexer has been upgraded (to Solr 4.5), all data will need to be re-indexed. Therefore there is no upgrade path; follow the instructions below to set up Stackdump again. It is recommended to install this new version in a new directory, instead of overwriting the existing one.
h2. Changes and upgrading from v1.1 to v1.2.
The major change in the v1.2 release is faster data importing. There are some smaller changes as well, including new PowerShell scripts to start and manage Stackdump on Windows and a few fixes for bugs that occurred when running on Windows. The search indexing side of things has not changed, so data imported using v1.1 will continue to work in v1.2. _Data from older versions, however, needs to be re-indexed. See the above section on upgrading to v1.1 for more details._
h3. Importing the StackOverflow data dump, September 2013
The StackOverflow data dump has grown significantly since I started this project back in 2011. With the improvements in v1.2, on a VM with two cores and 4GB of RAM running CentOS 5.7 on a single, standard hard drive containing spinning pieces of metal,
* it took 84719.565491 seconds to import it, or 23 hours, 31 minutes and 59.565491 seconds
* once completed, it requires 20GB of disk space
* during the import, roughly 30GB of disk space was needed
* at its peak, the import process used around 2GB of RAM.
In total, the StackOverflow data dump has 15,933,529 posts (questions and answers), 2,332,403 users and a very large number of comments.
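The elapsed time quoted above can be checked with a few lines of arithmetic. The Python sketch below is illustrative only and is not part of Stackdump; the constant and function names are made up for this example. It also derives an average posts-per-second figure, which is not a measurement reported here and understates the real work, since the import processes users and comments as well.

```python
# Figures copied from the import statistics above.
IMPORT_SECONDS = 84719.565491
TOTAL_POSTS = 15_933_529  # questions and answers only


def split_duration(total_seconds):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return int(hours), int(minutes), seconds


hours, minutes, seconds = split_duration(IMPORT_SECONDS)

# Derived average, not a measured figure: posts imported per second of wall time.
posts_per_second = TOTAL_POSTS / IMPORT_SECONDS

print(f"{hours} hours, {minutes} minutes and {seconds:.6f} seconds")
print(f"about {posts_per_second:.0f} posts per second on average")
```

On the hardware described, this works out to an average of roughly 188 posts per second over the whole run.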
h2. Setting up
Stackdump was designed for offline environments or environments with poor internet access, so it is bundled with all the dependencies it requires (with the exception of Python, Java and 7-zip).