mirror of https://github.com/djohnlewis/stackdump synced 2025-12-17 13:23:25 +00:00

39 Commits
v1.1 ... v1.3

Author SHA1 Message Date
Samuel Lai
4d6343584a Minor README tweaks. 2014-03-03 17:07:26 +11:00
Samuel Lai
9d1d6b135a Grrr. More textile issues. 2014-02-27 22:02:04 +11:00
Samuel Lai
96b06f7b35 Oops, textile syntax mistake. 2014-02-27 22:00:48 +11:00
Samuel Lai
28d79ea089 Added notes on using supervisor with stackdump. 2014-02-27 21:58:22 +11:00
Samuel Lai
ce7edf1ca0 Minor README tweaks. 2014-02-27 20:44:55 +11:00
Samuel Lai
4254f31859 Updated the README for the next release.
Fixes #8 by updating the URL to the data dumps.
2014-02-27 20:39:32 +11:00
Samuel Lai
c11fcfacf6 Fixes #9. Added ability for import_site command to resume importing if the connection to Solr is lost and restored. 2014-02-27 20:12:53 +11:00
Samuel Lai
7764f088c2 Added a setting to disable the rewriting of links and image URLs. 2014-02-27 18:52:25 +11:00
Samuel Lai
a4c6c2c7ba Certain ignored post type IDs are now recognised by the error handler and messages printed as such. 2014-02-27 18:13:04 +11:00
Samuel Lai
01f9b10c27 Fixed #7. Turns out post IDs are not unique across sites.
This change will require re-indexing of all sites unfortunately. On the upside, more questions to browse!
2014-02-27 17:57:34 +11:00
Sam
cdb93e6f68 Merged changes. 2014-02-16 01:04:19 +11:00
Sam
0990e00852 Added an original copy of pysolr.py so the custom changes can be worked out. 2014-02-16 01:03:05 +11:00
Samuel Lai
92e359174a Added some notes on importing StackOverflow on Windows. 2013-12-12 17:29:55 +11:00
Samuel Lai
c521fc1627 Added tag v1.2 for changeset 240affa260a1 2013-11-30 18:06:37 +11:00
Sam
722d4125e7 Added section in README re new PowerShell scripts.
Also fixed formatting and wording.
2013-12-01 03:43:58 +11:00
Sam
ce3eb04270 Updated README with v1.2 changes and SO import stats. 2013-12-01 03:33:40 +11:00
Samuel Lai
9613caa8d1 Changed settings so Solr now only listens on localhost, not all interfaces. 2013-11-29 15:18:55 +11:00
Samuel Lai
2583afeb90 Removed more redundant date/time parsing. 2013-11-29 15:11:32 +11:00
Samuel Lai
522e1ff4f2 Fixed bug in script where the directory change was not reverted when script exited. 2013-11-29 15:06:10 +11:00
Samuel Lai
36eb8d3980 Changed the name of the stackdump schema to something better than 'Example'. 2013-11-29 15:05:31 +11:00
Samuel Lai
a597b2e588 Merge import-perf-improvements branch to default. 2013-11-29 13:01:41 +11:00
Samuel Lai
4a9c4504b3 Updated bad docs. 2013-11-29 12:57:06 +11:00
Samuel Lai
77dd2def42 Oops, forgot to re-instate the comment index during the backout. 2013-11-29 01:42:17 +11:00
Samuel Lai
75a216f5a4 Backed out the comments-batching change.
It was causing weird perf issues and errors. Didn't really seem like it made things faster; if anything, things became slower.
2013-11-29 01:12:09 +11:00
Samuel Lai
bf09e36928 Changed other models to avoid unnecessary date/time parsing.
Added PRAGMA statements for comments table and changed flow so the siteId_postId index is now created after data has been inserted.
2013-11-29 00:18:54 +11:00
Samuel Lai
cdb8d96508 Comments are now committed in batches and using a 'prepared' statement via executemany.
Also fixed a Windows compatibility bug with the new temp comments db and a bug with the webapp now that the Comment model has moved. Dates are also no longer parsed from their ISO form for comments; instead left as strings and parsed by SQLObject internally as needed.
2013-11-28 23:51:53 +11:00
Samuel Lai
5868c8e328 Fixed settings for Windows compatibility. 2013-11-28 22:06:33 +11:00
Samuel Lai
8e3d21f817 Fixed settings for Windows compatibility. 2013-11-28 22:06:33 +11:00
Samuel Lai
2fea457b06 Added PowerShell equivalents to launch and manage Stackdump on Windows. 2013-11-28 21:53:45 +11:00
Samuel Lai
6469691e4b Added PowerShell equivalents to launch and manage Stackdump on Windows. 2013-11-28 21:53:45 +11:00
Samuel Lai
65394ac516 More minor fixes. Really should get Stackdump set-up on my dev machine. 2013-11-28 15:07:05 +11:00
Samuel Lai
bcf1d7c71a Again. Forgot to fix site->siteId rename. 2013-11-28 14:39:25 +11:00
Samuel Lai
d36146ae46 More bugs - forgot to rename uses when renaming Comment.site to siteId 2013-11-28 14:38:21 +11:00
Samuel Lai
e1272ce58a Oops, bug with closing temp_db file handle. 2013-11-28 14:35:24 +11:00
Samuel Lai
bff7e13d83 Comment data used during importing is now stored in a separate database to make it easier to delete them afterwards. 2013-11-28 14:23:55 +11:00
Samuel Lai
c0766de8d4 Skips valid XML character scrubbing if configured for faster performance. 2013-11-28 14:01:00 +11:00
Samuel Lai
644269dd5d Added PyCharm project files to the ignore list. 2013-11-28 13:54:47 +11:00
Sam
6bbf0d7b28 Removed a big duplicate file in Solr. 2013-10-22 23:36:46 +11:00
Sam
71c875437e Added tag v1.1 for changeset 3ad1ff15b528 2013-10-22 23:21:20 +11:00
17 changed files with 2091 additions and 883 deletions

View File

@@ -22,3 +22,6 @@ tutorial/.*$
# ignore the downloaded logos
^python/media/images/logos/.*
# PyCharm project files
^.idea/

BIN
List-StackdumpCommands.ps1 Normal file

Binary file not shown.

View File

@@ -28,6 +28,29 @@ Version 1.1 fixes a few bugs, the major one being the inability to import the 20
Because changes have been made to the search schema and the search indexer has been upgraded (to Solr 4.5), all data will need to be re-indexed. Therefore there is no upgrade path; follow the instructions below to set up Stackdump again. It is recommended to install this new version in a new directory, instead of overwriting the existing one.
h2. Changes and upgrading from v1.1 to v1.2
The major change in the v1.2 release is a set of improvements to the speed of importing data. There are some other smaller changes, including new PowerShell scripts to start and manage Stackdump on Windows, as well as a few bug fixes when running on Windows. The search indexing side of things has not changed, so data imported using v1.1 will continue to work in v1.2. _Data from older versions, however, needs to be re-indexed. See the above section on upgrading to v1.1 for more details._
h2. Changes and upgrading from v1.2 to v1.3
v1.3 is primarily a bugfix release, for a fairly serious bug. It turns out Stackdump had been subtly overwriting questions as more sites were imported, because it assumed post IDs were unique across all sites when they in fact were not. This meant that as more sites were imported, previously imported sites started to lose questions. The fix required a change to the search index, so *the data directory will need to be deleted and all data will need to be re-imported after installing this version*. Thanks to @yammesicka for reporting the issue.
Other changes include a new setting that allows disabling the link and image URL rewriting, and a change to the @import_site@ command so it no longer bails immediately if there is a Solr connection issue - it will prompt and allow the import to resume once the connection issue has been resolved.
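For those curious how the fix works: each question is now indexed under a new @documentId@ field that combines the site key with the post ID, so the same post ID on two different sites no longer collides. A minimal sketch of the scheme (the real scheme is the @documentId@ field added to the Solr schema and populated by the @import_site@ command) -
bc. document_id = site_key + '-' + str(post_id)   # hypothetical site key 'android' and post 42 -> 'android-42'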
h3. Importing the StackOverflow data dump, September 2013
The StackOverflow data dump has grown significantly since I started this project back in 2011. With the improvements in v1.2, on a VM with two cores and 4GB of RAM running CentOS 5.7 on a single, standard hard drive containing spinning pieces of metal,
* it took *84719.565491 seconds* to import it, or *23 hours, 31 minutes and 59.565491 seconds*
* once completed, it used up *20GB* of disk space
* during the import, roughly *30GB* of disk space was needed
* the import process used, at max, around *2GB* of RAM.
In total, the StackOverflow data dump has *15,933,529 posts* (questions and answers), *2,332,403 users* and a very large number of comments.
I attempted this on a similarly spec'ed Windows 7 64-bit VM as well - 23 hours later it was still trying to process the comments. Either SQLite, Python or just plain disk performance is very poor on Windows for some reason, so if you intend to import StackOverflow, I would advise running Stackdump on Linux instead. The smaller sites all complete within a reasonable time though, and as far as I'm aware there are no perceptible performance issues on Windows.
h2. Setting up
Stackdump was designed for offline environments or environments with poor internet access, therefore it is bundled with all the dependencies it requires (with the exception of Python, Java and 7-zip).
@@ -36,16 +59,22 @@ As long as you have:
* "Python":http://python.org/download/,
* "Java":http://java.com/en/download/manual.jsp,
* "Stackdump":https://bitbucket.org/samuel.lai/stackdump/downloads,
* the "StackExchange Data Dump":http://www.clearbits.net/creators/146-stack-exchange-data-dump (Note: this is only available as a torrent), and
* the "StackExchange Data Dump":https://archive.org/details/stackexchange (download the sites you wish to import - note that StackOverflow is split into 7 archive files; only Comments, Posts and Users are required), and
* "7-zip":http://www.7-zip.org/ (needed to extract the data dump files)
...you should be able to get an instance up and running.
To provide a better experience, Stackdump can use the RSS feed content to pre-fill some of the required details during the import process, as well as to display the site logos in the app. Stackdump comes bundled with a script that downloads and places these bits in the right places. If you're in a completely offline environment however, it may be worth running this script on a connected box first.
h3. Windows users
If you're using Windows, you will need to substitute the appropriate PowerShell equivalent command for the Stackdump scripts used below. These equivalent PowerShell scripts are in the Stackdump root directory, alongside their Unix counterparts. The names are roughly the same, with the exception of @manage.sh@, which in PowerShell has been broken up into two scripts, @List-StackdumpCommands.ps1@ and @Run-StackdumpCommand.ps1@.
Remember to set your PowerShell execution policy to at least @RemoteSigned@ first as these scripts are not signed. Use the @Get-ExecutionPolicy@ cmdlet to see the current policy, and @Set-ExecutionPolicy@ to set it. You will need to have administrative privileges to set it.
h3. Extract Stackdump
Stackdump was to be self-contained, so to get it up and running, simply extract the Stackdump download to an appropriate location.
Stackdump was designed to be self-contained, so to get it up and running, simply extract the Stackdump download archive to an appropriate location.
h3. Verify dependencies
@@ -87,15 +116,15 @@ To start the import process, execute the following command -
@stackdump_dir/manage.sh import_site --base-url site_url --dump-date dump_date path_to_xml_files@
... where site_url is the URL of the site you're importing, e.g. __android.stackexchange.com__; dump_date is the date of the data dump you're importing, e.g. __August 2012__, and finally path_to_xml_files is the path to the XML files you just extracted. The dump_date is a text string that is shown in the app only, so it can be in any format you want.
... where @site_url@ is the URL of the site you're importing, e.g. __android.stackexchange.com__; @dump_date@ is the date of the data dump you're importing, e.g. __August 2012__, and finally @path_to_xml_files@ is the path to the directory containing the XML files that were just extracted. The @dump_date@ is a text string that is shown in the app only, so it can be in any format you want.
For example, to import the August 2012 data dump of the Android StackExchange site, you would execute -
For example, to import the August 2012 data dump of the Android StackExchange site, with the files extracted into @/tmp/android@, you would execute -
@stackdump_dir/manage.sh import_site --base-url android.stackexchange.com --dump-date "August 2012" /tmp/android@
It is normal to get messages about unknown PostTypeIds and missing comments and answers. These errors are likely due to those posts being hidden via moderation.
This can take anywhere between a minute to 10 hours or more depending on the site you're importing. As a rough guide, __android.stackexchange.com__ took a minute on my VM, while __stackoverflow.com__ took just over 10 hours.
This can take anywhere between a minute to 20 hours or more depending on the site you're importing. As a rough guide, __android.stackexchange.com__ took a minute on my VM, while __stackoverflow.com__ took just under 24 hours.
Repeat these steps for each site you wish to import. Do not attempt to import multiple sites at the same time; it will not work and you may end up with half-imported sites.
@@ -109,19 +138,49 @@ To start Stackdump, execute the following command -
... and visit port 8080 on that machine. That's it - your own offline, read-only instance of StackExchange.
If you need to change the port that it runs on, modify @stackdump_dir/python/src/stackdump/settings.py@ and restart the app.
If you need to change the port that it runs on, or modify other settings that control how Stackdump works, see the 'Optional configuration' section below for more details.
The aforementioned @settings.py@ file also contains some other settings that control how Stackdump works.
Both the search indexer and the app need to be running for Stackdump to work.
h2. Optional configuration
There are a few settings for those who like to tweak. There's no need to adjust them normally though; the default settings should be fine.
The settings file is located in @stackdump_dir/python/src/stackdump/settings.py@. The web component will need to be restarted after changes have been made for them to take effect.
* *SERVER_HOST* - the network interface to run the Stackdump web app on. Use _'0.0.0.0'_ for all interfaces, or _'127.0.0.1'_ for localhost only. By default, it runs on all interfaces.
* *SERVER_PORT* - the port to run the Stackdump web app on. The default port is _8080_.
* *SOLR_URL* - the URL to the Solr instance. The default assumes Solr is running on the same system. Change this if Solr is running on a different system.
* *NUM_OF_DEFAULT_COMMENTS* - the number of comments shown by default for questions and answers before the remaining comments are hidden (and shown when clicked). The default is _3_ comments.
* *NUM_OF_RANDOM_QUESTIONS* - the number of random questions shown on the home page of Stackdump and the site pages. The default is _3_ questions.
* *REWRITE_LINKS_AND_IMAGES* - by default, all links are rewritten to either point internally or be marked as an external link, and image URLs are rewritten to point to a placeholder image. Set this setting to _False_ to disable this behaviour.
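As a rough sketch, the relevant excerpt of @settings.py@ with the defaults described above would look something like this (the actual file contains a few more settings) -
bc.. # stackdump_dir/python/src/stackdump/settings.py (excerpt, default values)
SERVER_HOST = '0.0.0.0'           # all interfaces; use '127.0.0.1' for localhost only
SERVER_PORT = 8080
SOLR_URL = 'http://localhost:8983/solr/stackdump/'
NUM_OF_DEFAULT_COMMENTS = 3       # comments shown before the rest are hidden behind a click
NUM_OF_RANDOM_QUESTIONS = 3       # random questions on the home and site pages
REWRITE_LINKS_AND_IMAGES = True   # set to False to leave links and image URLs untouched
p. Remember to restart the web component after changing any of these.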
h2. Running Stackdump as a service
Stackdump also comes bundled with some init.d scripts, which were tested on CentOS 5. These are located in the @init.d@ directory. To use these, you will need to modify them to specify the path to the Stackdump root directory and the user to run under.
Both the search indexer and the app need to be running for Stackdump to work.
Another option is to use "Supervisor":http://supervisord.org/ with a simple configuration file, e.g.,
bc.. [program:stackdump-solr]
command=/path/to/stackdump/start_solr.sh
priority=900
user=stackdump_user
stdout_logfile=/path/to/stackdump/solr_stdout.log
stderr_logfile=/path/to/stackdump/solr_stderr.log
[program:stackdump-web]
command=/path/to/stackdump/start_web.sh
user=stackdump_user
stdout_logfile=/path/to/stackdump/web_stdout.log
stderr_logfile=/path/to/stackdump/web_stderr.log
p. Yet another option for those using newer Linux distributions is to create native "systemd service definitions":http://www.freedesktop.org/software/systemd/man/systemd.service.html of type _simple_ for each of the components.
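As a rough, untested sketch (reusing the hypothetical paths and user from the Supervisor example above), a unit for the web component might look like -
bc.. [Unit]
Description=Stackdump web app
After=network.target
[Service]
Type=simple
User=stackdump_user
ExecStart=/path/to/stackdump/start_web.sh
[Install]
WantedBy=multi-user.target
p. A second unit pointing at @start_solr.sh@ covers the search indexer; enable both so they start on boot.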
h2. Maintenance
Stackdump stores all its data in the @data@ directory under its root directory. If you want to start fresh, just stop the app and the search indexer, delete that directory and restart the app and search indexer.
To delete certain sites from Stackdump, use the manage_sites management command -
To delete certain sites from Stackdump, use the @manage_sites@ management command -
@stackdump_dir/manage.sh manage_sites -l@ to list the sites (and their site keys) currently in the system;
@stackdump_dir/manage.sh manage_sites -d site_key@ to delete a particular site.
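For example, assuming the Android site from earlier was imported with the site key _android_ (the actual key may differ; check the @-l@ listing), deleting it would be -
@stackdump_dir/manage.sh manage_sites -d android@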

BIN
Run-StackdumpCommand.ps1 Normal file

Binary file not shown.

BIN
Start-Python.ps1 Normal file

Binary file not shown.

BIN
Start-Solr.ps1 Normal file

Binary file not shown.

BIN
Start-StackdumpWeb.ps1 Normal file

Binary file not shown.

Binary file not shown.

View File

@@ -45,7 +45,7 @@
that avoids logging every request
-->
<schema name="example" version="1.5">
<schema name="stackdump" version="1.5">
<!-- attribute "name" is the name of this schema and is only used for display purposes.
version="x.y" is Solr's version number for the schema syntax and
semantics. It should not normally be changed by applications.
@@ -110,6 +110,10 @@
<!-- we'll get the values out of the JSON, so most fields are not stored -->
<!-- fields are listed here so searches can be performed against them -->
<!-- this is used by Lucene to uniquely identify a post across all sites.
It is of the form "siteKey-id" and is necessary because post IDs are
reused across sites. -->
<field name="documentId" type="string" indexed="true" stored="true" required="true" />
<!-- the ID field needs to be a string for the QueryElevationComponent -->
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="siteKey" type="string" indexed="true" stored="true" required="true" />
@@ -196,7 +200,7 @@
<!-- Field to use to determine and enforce document uniqueness.
Unless this field is marked with required="false", it will be a required field
-->
<uniqueKey>id</uniqueKey>
<uniqueKey>documentId</uniqueKey>
<!-- DEPRECATED: The defaultSearchField is consulted by various query parsers when
parsing a query string that isn't explicit about the field. Machine (non-user)

View File

@@ -246,6 +246,11 @@ class Solr(object):
Optionally accepts ``timeout`` for wait seconds until giving up on a
request. Default is ``60`` seconds.
Optionally accepts ``assume_clean`` to skip cleaning request of invalid XML
characters. This offers a slight performance improvement, but only set this
to ``True`` if you know your request is clean (e.g. coming from other XML
data). Bad things will happen otherwise. Default is ``False``.
Usage::
solr = pysolr.Solr('http://localhost:8983/solr')
@@ -253,10 +258,11 @@ class Solr(object):
solr = pysolr.Solr('http://localhost:8983/solr', timeout=10)
"""
def __init__(self, url, decoder=None, timeout=60):
def __init__(self, url, decoder=None, timeout=60, assume_clean=False):
self.decoder = decoder or json.JSONDecoder()
self.url = url
self.timeout = timeout
self.assume_clean = assume_clean
self.log = self._get_log()
self.session = requests.Session()
self.session.stream = False
@@ -506,7 +512,10 @@ class Solr(object):
value = "{0}".format(value)
return clean_xml_string(value)
if self.assume_clean:
return value
else:
return clean_xml_string(value)
def _to_python(self, value):
"""

File diff suppressed because it is too large.

View File

@@ -26,7 +26,7 @@ import html5lib
from html5lib.filters._base import Filter as HTML5LibFilterBase
import markdown
from stackdump.models import Site, Badge, Comment, User
from stackdump.models import Site, Badge, User
from stackdump import settings
# STATIC VARIABLES
@@ -410,7 +410,8 @@ def view_question(site_key, question_id, answer_id=None):
result = results.docs[0]
convert_comments_to_html(result)
rewrite_result(result)
if settings.REWRITE_LINKS_AND_IMAGES:
rewrite_result(result)
sort_answers(result)
context['result'] = result

View File

@@ -12,15 +12,18 @@ from datetime import datetime
import re
import urllib2
import socket
import tempfile
import traceback
from optparse import OptionParser
from xml.etree import ElementTree
from sqlobject import sqlhub, connectionForURI, AND, OR, IN, SQLObject
from sqlobject import sqlhub, connectionForURI, AND, IN, SQLObject, \
UnicodeCol, DateTimeCol, IntCol, DatabaseIndex, dbconnection
from sqlobject.sqlbuilder import Delete, Insert
from sqlobject.styles import DefaultStyle
from pysolr import Solr
from pysolr import Solr, SolrError
from stackdump.models import Site, Badge, Comment, User
from stackdump.models import Site, Badge, User
from stackdump import settings
try:
@@ -108,7 +111,7 @@ class BadgeContentHandler(BaseContentHandler):
d['sourceId'] = int(attrs['Id'])
d['userId'] = int(attrs.get('UserId', 0))
d['name'] = attrs.get('Name', '')
d['date'] = datetime.strptime(attrs.get('Date'), ISO_DATE_FORMAT)
d['date'] = attrs.get('Date')
except Exception, e:
# could not parse this, so ignore the row completely
self.cur_props = None
@@ -135,12 +138,12 @@ class CommentContentHandler(BaseContentHandler):
return
try:
d = self.cur_props = { 'site' : self.site }
d = self.cur_props = { 'siteId' : self.site.id }
d['sourceId'] = int(attrs['Id'])
d['postId'] = int(attrs.get('PostId', 0))
d['score'] = int(attrs.get('Score', 0))
d['text'] = attrs.get('Text', '')
d['creationDate'] = datetime.strptime(attrs.get('CreationDate'), ISO_DATE_FORMAT)
d['creationDate'] = attrs.get('CreationDate')
d['userId'] = int(attrs.get('UserId', 0))
except Exception, e:
@@ -181,10 +184,10 @@ class UserContentHandler(BaseContentHandler):
d = self.cur_props = { 'site' : self.site }
d['sourceId'] = int(attrs['Id'])
d['reputation'] = int(attrs.get('Reputation', 0))
d['creationDate'] = datetime.strptime(attrs.get('CreationDate'), ISO_DATE_FORMAT)
d['creationDate'] = attrs.get('CreationDate')
d['displayName'] = attrs.get('DisplayName', '')
d['emailHash'] = attrs.get('EmailHash', '')
d['lastAccessDate'] = datetime.strptime(attrs.get('LastAccessDate'), ISO_DATE_FORMAT)
d['lastAccessDate'] = attrs.get('LastAccessDate')
d['websiteUrl'] = attrs.get('WebsiteUrl', '')
d['location'] = attrs.get('Location', '')
d['age'] = int(attrs.get('Age', 0))
@@ -258,24 +261,37 @@ class PostContentHandler(xml.sax.ContentHandler):
d['answers'] = [ ]
d['answerCount'] = int(attrs.get('AnswerCount', 0))
d['viewCount'] = int(attrs.get('ViewCount', 0))
elif attrs['PostTypeId'] == '3':
raise ValueError('Skipping row ID [%s] as it is an orphaned tag wiki page (PostTypeId [3]).' % (attrs.get('Id', -1)))
elif attrs['PostTypeId'] == '4':
raise ValueError('Skipping row ID [%s] as it is a tag wiki excerpt (PostTypeId [4]).' % (attrs.get('Id', -1)))
elif attrs['PostTypeId'] == '5':
raise ValueError('Skipping row ID [%s] as it is a tag wiki page (PostTypeId [5]).' % (attrs.get('Id', -1)))
elif attrs['PostTypeId'] == '6':
raise ValueError('Skipping row ID [%s] as it is a moderator nomination post (PostTypeId [6]).' % (attrs.get('Id', -1)))
elif attrs['PostTypeId'] == '7':
raise ValueError('Skipping row ID [%s] as it is a wiki placeholder page (PostTypeId [7]).' % (attrs.get('Id', -1)))
elif attrs['PostTypeId'] == '8':
raise ValueError('Skipping row ID [%s] as it is a privilege wiki page (PostTypeId [8]).' % (attrs.get('Id', -1)))
else:
raise ValueError('Unknown PostTypeId [%s] for row ID [%s]. Probably a tag wiki page.' % (attrs.get('PostTypeId', -1), attrs.get('Id', -1)))
raise ValueError('Unknown PostTypeId [%s] for row ID [%s].' % (attrs.get('PostTypeId', -1), attrs.get('Id', -1)))
if 'AcceptedAnswerId' in attrs:
d['acceptedAnswerId'] = int(attrs.get('AcceptedAnswerId', 0))
d['creationDate'] = datetime.strptime(attrs.get('CreationDate'), ISO_DATE_FORMAT)
# Solr accepts ISO dates, but must be UTC as indicated by trailing Z
d['creationDate'] = attrs.get('CreationDate') + 'Z'
d['score'] = int(attrs.get('Score', 0))
d['body'] = attrs.get('Body', '')
d['ownerUserId'] = int(attrs.get('OwnerUserId', 0))
if 'LastEditorUserId' in attrs:
d['lastEditorUserId'] = int(attrs.get('LastEditorUserId', 0))
if 'LastEditDate' in attrs:
d['lastEditDate'] = datetime.strptime(attrs.get('LastEditDate'), ISO_DATE_FORMAT)
d['lastActivityDate'] = datetime.strptime(attrs.get('LastActivityDate'), ISO_DATE_FORMAT)
d['lastEditDate'] = attrs.get('LastEditDate') + 'Z'
d['lastActivityDate'] = attrs.get('LastActivityDate') + 'Z'
if 'CommunityOwnedDate' in attrs:
d['communityOwnedDate'] = datetime.strptime(attrs.get('CommunityOwnedDate'), ISO_DATE_FORMAT)
d['communityOwnedDate'] = attrs.get('CommunityOwnedDate') + 'Z'
if 'ClosedDate' in attrs:
d['closedDate'] = datetime.strptime(attrs.get('ClosedDate'), ISO_DATE_FORMAT)
d['closedDate'] = attrs.get('ClosedDate') + 'Z'
d['title'] = attrs.get('Title', '')
if 'Tags' in attrs:
d['tags'] = attrs.get('Tags', '')
@@ -342,8 +358,9 @@ class PostContentHandler(xml.sax.ContentHandler):
if self.row_count % 1000 == 0:
print('%-10s Processed %d rows.' % ('[post]', self.row_count))
# only check for finished questions every 1000 rows to speed things up
if self.row_count % 1000 == 0:
# only check for finished questions every 10000 rows to speed things up
if self.row_count % 10000 == 0:
print('Committing completed questions...')
self.commit_finished_questions()
def commit_finished_questions(self):
@@ -400,7 +417,7 @@ class PostContentHandler(xml.sax.ContentHandler):
post_ids.add(a['id'])
# get the comments
comment_objs = Comment.select(AND(Comment.q.site == self.site,
comment_objs = Comment.select(AND(Comment.q.siteId == self.site.id,
IN(Comment.q.postId, list(post_ids))))
# sort the comments out into a dict keyed on the post id
@@ -455,6 +472,9 @@ class PostContentHandler(xml.sax.ContentHandler):
doc['answers-json'] = [ json.dumps(a, default=self.json_default_handler) for a in q['answers'] ]
# map other fields to search index doc
# this is the ID for Solr to uniquely identify this question across all
# sites
doc['documentId'] = self.site.key + '-' + str(q['id'])
doc['id'] = str(q['id'])
doc['siteKey'] = self.site.key
doc['creationDate'] = q['creationDate']
@@ -514,9 +534,29 @@ class PostContentHandler(xml.sax.ContentHandler):
def commit_questions(self, questions, commit=True):
"""
Commits the given list of questions to solr.
Adds the given list of questions to solr.
By default, they are committed immediately. Set the ``commit`` argument
to False to disable this behaviour.
"""
self.solr.add(questions, commit=commit)
while True:
try:
self.solr.add(questions, commit=commit)
break
except SolrError, e:
print('An exception occurred while committing questions - ')
traceback.print_exc(file=sys.stdout)
print('')
while True:
response = raw_input('Try committing the questions again? (y/n) ').lower()
if response not in ('y', 'n'):
print("Answer either 'y' or 'n'. Answering 'n' will abort the import process.")
else:
print('')
if response == 'y':
break
else:
raise
def commit_all_questions(self):
"""
@@ -551,6 +591,25 @@ class PostContentHandler(xml.sax.ContentHandler):
for question_id, answers in self.orphan_answers.items():
print('There are %d answers for missing question [ID# %d]. Ignoring orphan answers.' % (len(answers), question_id))
# TEMP COMMENT DATABASE DEFINITION
comment_db_sqlhub = dbconnection.ConnectionHub()
class Comment(SQLObject):
sourceId = IntCol()
siteId = IntCol()
postId = IntCol()
score = IntCol()
text = UnicodeCol()
creationDate = DateTimeCol(datetimeFormat=ISO_DATE_FORMAT)
userId = IntCol()
siteId_postId_index = DatabaseIndex(siteId, postId)
_connection = comment_db_sqlhub
json_fields = [ 'id', 'score', 'text', 'creationDate', 'userId' ]
# METHODS
def get_file_path(dir_path, filename):
"""
@@ -593,14 +652,14 @@ def import_site(xml_root, site_name, dump_date, site_desc, site_key,
sys.exit(1)
# connect to the database
print('Connecting to the database...')
print('Connecting to the Stackdump database...')
conn_str = settings.DATABASE_CONN_STR
sqlhub.processConnection = connectionForURI(conn_str)
print('Connected.\n')
# connect to solr
print('Connecting to solr...')
solr = Solr(settings.SOLR_URL)
solr = Solr(settings.SOLR_URL, assume_clean=True)
# pysolr doesn't try to connect until a request is made, so we'll make a ping request
try:
solr._send_request('GET', 'admin/ping')
@@ -614,7 +673,6 @@ def import_site(xml_root, site_name, dump_date, site_desc, site_key,
print("Creating tables if they don't exist...")
Site.createTable(ifNotExists=True)
Badge.createTable(ifNotExists=True)
Comment.createTable(ifNotExists=True)
User.createTable(ifNotExists=True)
print('Created.\n')
@@ -742,8 +800,6 @@ def import_site(xml_root, site_name, dump_date, site_desc, site_key,
sqlhub.threadConnection = sqlhub.processConnection.transaction()
conn = sqlhub.threadConnection
# these deletions are done in this order to avoid FK constraint issues
print('\tDeleting comments...')
conn.query(conn.sqlrepr(Delete(Comment.sqlmeta.table, where=(Comment.q.site==site))))
print('\tDeleting badges...')
conn.query(conn.sqlrepr(Delete(Badge.sqlmeta.table, where=(Badge.q.site==site))))
print('\tDeleting users...')
@@ -758,11 +814,26 @@ def import_site(xml_root, site_name, dump_date, site_desc, site_key,
solr.commit(expungeDeletes=True)
print('Deleted.\n')
# create the temporary comments database
print('Connecting to the temporary comments database...')
temp_db_file, temp_db_path = tempfile.mkstemp('.sqlite', 'temp_comment_db-' + re.sub(r'[^\w]', '_', site_key) + '-', settings.TEMP_COMMENTS_DATABASE_DIR)
os.close(temp_db_file)
conn_str = 'sqlite:///' + temp_db_path
comment_db_sqlhub.processConnection = connectionForURI(conn_str)
print('Connected.')
Comment.createTable()
print('Schema created.')
comment_db_sqlhub.processConnection.getConnection().execute('PRAGMA synchronous = OFF')
comment_db_sqlhub.processConnection.getConnection().execute('PRAGMA journal_mode = MEMORY')
print('Pragma configured.\n')
timing_start = time.time()
# start a new transaction
sqlhub.threadConnection = sqlhub.processConnection.transaction()
conn = sqlhub.threadConnection
comment_db_sqlhub.threadConnection = comment_db_sqlhub.processConnection.transaction()
temp_db_conn = comment_db_sqlhub.threadConnection
# create a new Site
site = Site(name=site_name, desc=site_desc, key=site_key, dump_date=dump_date,
@@ -785,7 +856,7 @@ def import_site(xml_root, site_name, dump_date, site_desc, site_key,
print('[comment] PARSING COMMENTS...')
xml_path = get_file_path(xml_root, 'comments.xml')
print('[comment] start parsing comments.xml...')
handler = CommentContentHandler(conn, site)
handler = CommentContentHandler(temp_db_conn, site)
xml.sax.parse(xml_path, handler)
print('%-10s Processed %d rows.' % ('[comment]', handler.row_count))
print('[comment] FINISHED PARSING COMMENTS.\n')
@@ -812,8 +883,10 @@ def import_site(xml_root, site_name, dump_date, site_desc, site_key,
print('[post] FINISHED PARSING POSTS.\n')
# DELETE COMMENTS
print('[comment] DELETING COMMENTS FROM DATABASE (they are no longer needed)...')
conn.query(conn.sqlrepr(Delete(Comment.sqlmeta.table, where=(Comment.q.site == site))))
print('[comment] DELETING TEMPORARY COMMENTS DATABASE (they are no longer needed)...')
temp_db_conn.commit(close=True)
comment_db_sqlhub.processConnection.close()
os.remove(temp_db_path)
print('[comment] FINISHED DELETING COMMENTS.\n')
# commit transaction

View File

@@ -18,7 +18,8 @@ SERVER_PORT = 8080
SOLR_URL = 'http://localhost:8983/solr/stackdump/'
import os
DATABASE_CONN_STR = 'sqlite://%s/../../../data/stackdump.sqlite' % os.path.dirname(__file__)
DATABASE_CONN_STR = 'sqlite:///' + os.path.join(os.path.dirname(__file__), '..', '..', '..', 'data', 'stackdump.sqlite')
TEMP_COMMENTS_DATABASE_DIR = os.path.join(os.path.dirname(__file__), '..', '..', '..', 'data')
# if the website is hosted under a subpath, specify it here. It must end with a
# slash.
@@ -31,6 +32,9 @@ NUM_OF_DEFAULT_COMMENTS = 3
# number of random questions to show on search query pages
NUM_OF_RANDOM_QUESTIONS = 3
# rewrite links and images to point internally or to a placeholder respectively
REWRITE_LINKS_AND_IMAGES = True
# settings that are available in templates
TEMPLATE_SETTINGS = [
'APP_URL_ROOT',

View File

@@ -5,6 +5,10 @@
from sqlobject import SQLObject, UnicodeCol, DateTimeCol, IntCol, ForeignKey, \
DatabaseIndex
ISO_DATE_FORMAT = '%Y-%m-%dT%H:%M:%S.%f'
class Site(SQLObject):
name = UnicodeCol()
desc = UnicodeCol()
@@ -15,34 +19,23 @@ class Site(SQLObject):
siteKey_index = DatabaseIndex(key, unique=True)
class Badge(SQLObject):
sourceId = IntCol()
site = ForeignKey('Site', cascade=True)
userId = IntCol()
name = UnicodeCol()
date = DateTimeCol()
date = DateTimeCol(datetimeFormat=ISO_DATE_FORMAT)
class Comment(SQLObject):
sourceId = IntCol()
site = ForeignKey('Site', cascade=True)
postId = IntCol()
score = IntCol()
text = UnicodeCol()
creationDate = DateTimeCol()
userId = IntCol()
siteId_postId_index = DatabaseIndex(site, postId)
json_fields = [ 'id', 'score', 'text', 'creationDate', 'userId' ]
class User(SQLObject):
sourceId = IntCol()
site = ForeignKey('Site', cascade=True)
reputation = IntCol()
creationDate = DateTimeCol()
creationDate = DateTimeCol(datetimeFormat=ISO_DATE_FORMAT)
displayName = UnicodeCol()
emailHash = UnicodeCol()
lastAccessDate = DateTimeCol()
lastAccessDate = DateTimeCol(datetimeFormat=ISO_DATE_FORMAT)
websiteUrl = UnicodeCol()
location = UnicodeCol()
age = IntCol()

View File

@@ -19,9 +19,10 @@ from default_settings import *
# uncomment if the default host and port for Solr is different.
#SOLR_URL = 'http://localhost:8983/solr/stackdump/'
# uncomment if the database for Stackdump is not the default SQLite one
#import os
#DATABASE_CONN_STR = 'sqlite://%s/../../../data/stackdump.sqlite' % os.path.dirname(__file__)
# uncomment if the database for Stackdump is not the default SQLite one or you
# wish to have the database at a different path to the stackdump_root/data
# directory
#DATABASE_CONN_STR = 'sqlite:///' + path_to_the_database
# if the website is hosted under a subpath, specify it here. It must end with a
# slash.
@@ -33,3 +34,6 @@ from default_settings import *
# number of random questions to show on search query pages
#NUM_OF_RANDOM_QUESTIONS = 3
# rewrite links and images to point internally or to a placeholder respectively
#REWRITE_LINKS_AND_IMAGES = True

View File

@@ -22,4 +22,4 @@ then
fi
cd "$SCRIPT_DIR/java/solr/server"
"$JAVA_CMD" -Xmx2048M -XX:MaxPermSize=512M -jar start.jar
"$JAVA_CMD" -Xmx2048M -XX:MaxPermSize=512M -Djetty.host=127.0.0.1 -jar start.jar