mirror of https://github.com/djohnlewis/stackdump synced 2025-12-16 21:03:26 +00:00

6 Commits

5 changed files with 29 additions and 7 deletions


@@ -28,6 +28,21 @@ Version 1.1 fixes a few bugs, the major one being the inability to import the 20
Because changes have been made to the search schema and the search indexer has been upgraded (to Solr 4.5), all data will need to be re-indexed. Therefore there is no upgrade path; follow the instructions below to set up Stackdump again. It is recommended to install this new version in a new directory, instead of overwriting the existing one.
h2. Changes and upgrading from v1.1 to v1.2
The major changes in the v1.2 release are improvements to the speed of importing data. There are some other smaller changes, including new PowerShell scripts to start and manage Stackdump on Windows, as well as a few bug fixes when running on Windows. The search indexing side of things has not changed, so data imported using v1.1 will continue to work in v1.2. _Data from older versions, however, needs to be re-indexed. See the above section on upgrading to v1.1 for more details._
h3. Importing the StackOverflow data dump, September 2013
The StackOverflow data dump has grown significantly since I started this project back in 2011. With the improvements in v1.2, on a VM with two cores and 4GB of RAM running CentOS 5.7 on a single, standard hard drive containing spinning pieces of metal,
* it took *84719.565491 seconds* to import it, or *23 hours, 31 minutes and 59.565491 seconds*
* once completed, it used up *20GB* of disk space
* during the import, roughly *30GB* of disk space was needed
* the import process used, at max, around *2GB* of RAM.
In total, the StackOverflow data dump has *15,933,529 posts* (questions and answers), *2,332,403 users* and a very large number of comments.
h2. Setting up
Stackdump was designed for offline environments or environments with poor internet access, so it is bundled with all the dependencies it requires (with the exception of Python, Java and 7-zip).
@@ -43,6 +58,12 @@ As long as you have:
To provide a better experience, Stackdump can use the RSS feed content to pre-fill some of the required details during the import process, as well as to display the site logos in the app. Stackdump comes bundled with a script that downloads these bits and puts them in the right places. If you're in a completely offline environment, however, it may be worth running this script on a connected box first.
h3. Windows users
If you're using Windows, you will need to substitute the appropriate PowerShell equivalents for the Stackdump scripts used below. These equivalent PowerShell scripts are in the Stackdump root directory, alongside their Unix counterparts. The names are roughly the same, with the exception of @manage.sh@, which in PowerShell has been broken up into two scripts, @List-StackdumpCommands.ps1@ and @Run-StackdumpCommand.ps1@.
Remember to set your PowerShell execution policy to at least @RemoteSigned@ first, as these scripts are not signed. Use the @Get-ExecutionPolicy@ cmdlet to see the current policy, and @Set-ExecutionPolicy@ to set it. You will need administrative privileges to set it.
h3. Extract Stackdump
Stackdump was designed to be self-contained, so to get it up and running, simply extract the Stackdump download to an appropriate location.

Binary file not shown.


@@ -45,7 +45,7 @@
that avoids logging every request
-->
<schema name="example" version="1.5">
<schema name="stackdump" version="1.5">
<!-- attribute "name" is the name of this schema and is only used for display purposes.
version="x.y" is Solr's version number for the schema syntax and
semantics. It should not normally be changed by applications.


@@ -265,19 +265,20 @@ class PostContentHandler(xml.sax.ContentHandler):
             if 'AcceptedAnswerId' in attrs:
                 d['acceptedAnswerId'] = int(attrs.get('AcceptedAnswerId', 0))
-            d['creationDate'] = datetime.strptime(attrs.get('CreationDate'), ISO_DATE_FORMAT)
+            # Solr accepts ISO dates, but must be UTC as indicated by trailing Z
+            d['creationDate'] = attrs.get('CreationDate') + 'Z'
             d['score'] = int(attrs.get('Score', 0))
             d['body'] = attrs.get('Body', '')
             d['ownerUserId'] = int(attrs.get('OwnerUserId', 0))
             if 'LastEditorUserId' in attrs:
                 d['lastEditorUserId'] = int(attrs.get('LastEditorUserId', 0))
             if 'LastEditDate' in attrs:
-                d['lastEditDate'] = datetime.strptime(attrs.get('LastEditDate'), ISO_DATE_FORMAT)
-            d['lastActivityDate'] = datetime.strptime(attrs.get('LastActivityDate'), ISO_DATE_FORMAT)
+                d['lastEditDate'] = attrs.get('LastEditDate') + 'Z'
+            d['lastActivityDate'] = attrs.get('LastActivityDate') + 'Z'
             if 'CommunityOwnedDate' in attrs:
-                d['communityOwnedDate'] = datetime.strptime(attrs.get('CommunityOwnedDate'), ISO_DATE_FORMAT)
+                d['communityOwnedDate'] = attrs.get('CommunityOwnedDate') + 'Z'
             if 'ClosedDate' in attrs:
-                d['closedDate'] = datetime.strptime(attrs.get('ClosedDate'), ISO_DATE_FORMAT)
+                d['closedDate'] = attrs.get('ClosedDate') + 'Z'
             d['title'] = attrs.get('Title', '')
             if 'Tags' in attrs:
                 d['tags'] = attrs.get('Tags', '')
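
The change above drops the datetime.strptime round trip: the timestamps in the data dump are already ISO 8601 strings, and Solr accepts them directly as long as they end in 'Z' to mark them as UTC, so each value can be passed through as a plain string. A minimal sketch of the idea (the sample timestamp and the exact ISO_DATE_FORMAT pattern here are assumptions, not taken verbatim from the importer):

from datetime import datetime

ISO_DATE_FORMAT = '%Y-%m-%dT%H:%M:%S.%f'   # assumed pattern for the dump's timestamps
creation_date = '2013-09-08T10:01:43.123'  # e.g. the CreationDate attribute value

# old approach: parse the string into a datetime object before indexing
parsed = datetime.strptime(creation_date, ISO_DATE_FORMAT)

# new approach: hand the ISO string straight to Solr with a trailing 'Z' (UTC)
solr_date = creation_date + 'Z'            # '2013-09-08T10:01:43.123Z'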


@@ -22,4 +22,4 @@ then
fi
cd "$SCRIPT_DIR/java/solr/server"
"$JAVA_CMD" -Xmx2048M -XX:MaxPermSize=512M -jar start.jar
"$JAVA_CMD" -Xmx2048M -XX:MaxPermSize=512M -Djetty.host=127.0.0.1 -jar start.jar