mirror of https://github.com/djohnlewis/stackdump synced 2025-12-16 21:03:26 +00:00

6 Commits

5 changed files with 29 additions and 7 deletions


@@ -28,6 +28,21 @@ Version 1.1 fixes a few bugs, the major one being the inability to import the 20
Because changes have been made to the search schema and the search indexer has been upgraded (to Solr 4.5), all data will need to be re-indexed. Therefore there is no upgrade path; follow the instructions below to set up Stackdump again. It is recommended to install this new version in a new directory, instead of overwriting the existing one.
h2. Changes and upgrading from v1.1 to v1.2
The major changes in the v1.2 release are improvements to the speed of importing data. There are some other smaller changes, including new PowerShell scripts to start and manage Stackdump on Windows, as well as a few bug fixes when running on Windows. The search indexing side of things has not changed, so data imported using v1.1 will continue to work in v1.2. _Data from older versions, however, needs to be re-indexed. See the above section on upgrading to v1.1 for more details._
h3. Importing the StackOverflow data dump, September 2013
The StackOverflow data dump has grown significantly since I started this project back in 2011. With the improvements in v1.2, on a VM with two cores and 4GB of RAM running CentOS 5.7 on a single, standard hard drive containing spinning pieces of metal,
* it took *84719.565491 seconds* to import it, or *23 hours, 31 minutes and 59.565491 seconds*
* once completed, it used up *20GB* of disk space
* during the import, roughly *30GB* of disk space was needed
* the import process used, at max, around *2GB* of RAM.
In total, the StackOverflow data dump has *15,933,529 posts* (questions and answers), *2,332,403 users* and a very large number of comments.
h2. Setting up
Stackdump was designed for offline environments or environments with poor internet access, so it is bundled with all the dependencies it requires (with the exception of Python, Java and 7-zip).
@@ -43,6 +58,12 @@ As long as you have:
To provide a better experience, Stackdump can use the RSS feed content to pre-fill some of the required details during the import process, as well as to display the site logos in the app. Stackdump comes bundled with a script that downloads these bits and puts them in the right places. If you're in a completely offline environment, however, it may be worth running this script on a connected box first.
h3. Windows users
If you're using Windows, you will need to substitute the appropriate PowerShell equivalents for the Stackdump scripts used below. These equivalent PowerShell scripts are in the Stackdump root directory, alongside their Unix counterparts. The names are roughly the same, with the exception of @manage.sh@, which in PowerShell has been broken up into two scripts, @List-StackdumpCommands.ps1@ and @Run-StackdumpCommand.ps1@.
Remember to set your PowerShell execution policy to at least @RemoteSigned@ first, as these scripts are not signed. Use the @Get-ExecutionPolicy@ cmdlet to see the current policy, and @Set-ExecutionPolicy@ to set it. You will need administrative privileges to set it.
h3. Extract Stackdump
Stackdump was designed to be self-contained, so to get it up and running, simply extract the Stackdump download to an appropriate location.

Binary file not shown.


@@ -45,7 +45,7 @@
that avoids logging every request
-->
<schema name="example" version="1.5">
<schema name="stackdump" version="1.5">
<!-- attribute "name" is the name of this schema and is only used for display purposes.
version="x.y" is Solr's version number for the schema syntax and
semantics. It should not normally be changed by applications.


@@ -265,19 +265,20 @@ class PostContentHandler(xml.sax.ContentHandler):
             if 'AcceptedAnswerId' in attrs:
                 d['acceptedAnswerId'] = int(attrs.get('AcceptedAnswerId', 0))
-            d['creationDate'] = datetime.strptime(attrs.get('CreationDate'), ISO_DATE_FORMAT)
+            # Solr accepts ISO dates, but must be UTC as indicated by trailing Z
+            d['creationDate'] = attrs.get('CreationDate') + 'Z'
             d['score'] = int(attrs.get('Score', 0))
             d['body'] = attrs.get('Body', '')
             d['ownerUserId'] = int(attrs.get('OwnerUserId', 0))
             if 'LastEditorUserId' in attrs:
                 d['lastEditorUserId'] = int(attrs.get('LastEditorUserId', 0))
             if 'LastEditDate' in attrs:
-                d['lastEditDate'] = datetime.strptime(attrs.get('LastEditDate'), ISO_DATE_FORMAT)
-            d['lastActivityDate'] = datetime.strptime(attrs.get('LastActivityDate'), ISO_DATE_FORMAT)
+                d['lastEditDate'] = attrs.get('LastEditDate') + 'Z'
+            d['lastActivityDate'] = attrs.get('LastActivityDate') + 'Z'
             if 'CommunityOwnedDate' in attrs:
-                d['communityOwnedDate'] = datetime.strptime(attrs.get('CommunityOwnedDate'), ISO_DATE_FORMAT)
+                d['communityOwnedDate'] = attrs.get('CommunityOwnedDate') + 'Z'
             if 'ClosedDate' in attrs:
-                d['closedDate'] = datetime.strptime(attrs.get('ClosedDate'), ISO_DATE_FORMAT)
+                d['closedDate'] = attrs.get('ClosedDate') + 'Z'
             d['title'] = attrs.get('Title', '')
             if 'Tags' in attrs:
                 d['tags'] = attrs.get('Tags', '')
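
The change above drops the datetime.strptime round trip: the timestamps in the data dump are already ISO 8601 strings, and Solr accepts them directly as long as they end in 'Z' to mark them as UTC, so each value can be passed through as a plain string. A minimal sketch of the idea (the sample timestamp and the exact ISO_DATE_FORMAT pattern here are assumptions, not taken verbatim from the importer):

from datetime import datetime

ISO_DATE_FORMAT = '%Y-%m-%dT%H:%M:%S.%f'   # assumed pattern for the dump's timestamps
creation_date = '2013-09-08T10:01:43.123'  # e.g. the CreationDate attribute value

# old approach: parse the string into a datetime object before indexing
parsed = datetime.strptime(creation_date, ISO_DATE_FORMAT)

# new approach: hand the ISO string straight to Solr with a trailing 'Z' (UTC)
solr_date = creation_date + 'Z'            # '2013-09-08T10:01:43.123Z'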


@@ -22,4 +22,4 @@ then
fi
cd "$SCRIPT_DIR/java/solr/server"
"$JAVA_CMD" -Xmx2048M -XX:MaxPermSize=512M -jar start.jar
"$JAVA_CMD" -Xmx2048M -XX:MaxPermSize=512M -Djetty.host=127.0.0.1 -jar start.jar