mirror of https://github.com/djohnlewis/stackdump synced 2025-12-17 05:13:32 +00:00

7 Commits

Author SHA1 Message Date
Samuel Lai
1f9546e4b9 Made some minor amendments to the instructions in the README. 2012-08-19 12:53:42 +10:00
Samuel Lai
049e857159 Handled another exception that may occur if no data has been imported. 2012-08-19 12:47:42 +10:00
Samuel Lai
16e5530a82 Modified download_site_info script to create the data directory if it doesn't exist. 2012-08-19 12:30:33 +10:00
Samuel Lai
c1ae870e3d Startup scripts now create the data directory if it doesn't exist. 2012-08-19 12:27:45 +10:00
Samuel Lai
651f97255e More rendering fixes to README. 2012-08-19 12:15:35 +10:00
Samuel Lai
527d5deb05 Fixed some minor bugs with README and it being rendered by bitbucket. 2012-08-19 12:13:06 +10:00
Samuel Lai
1e6718d850 Merged the cpython-only branch into the default branch.
The cPython will be the default version; not really much need for the Jython version anymore.
2012-08-19 11:49:38 +10:00
5 changed files with 55 additions and 15 deletions

View File

@@ -1,14 +1,14 @@
h1. Stackdump - an offline browser for StackExchange sites.
-Stackdump was conceived for those who work in work environments that do not allow easy access to the StackExchange family of websites. It allows you to host a read-only instance of the StackExchange sites locally, accessible via a web browser.
+Stackdump was conceived for those who work in environments that do not have easy access to the StackExchange family of websites. It allows you to host a read-only instance of the StackExchange sites locally, accessible via a web browser.
Stackdump comprises two components - the search indexer ("Apache Solr":http://lucene.apache.org/solr/) and the web application. It uses the "StackExchange Data Dumps":http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/, published quarterly by StackExchange, as its source of data.
h2. Screenshots
!http://edgylogic.com/dynmedia/301/640x480/!
!http://edgylogic.com/dynmedia/303/640x480/!
!http://edgylogic.com/dynmedia/302/640x480/!
"Stackdump home":http://edgylogic.com/dynmedia/301/
"Stackdump search results":http://edgylogic.com/dynmedia/303/
"Stackdump question view":http://edgylogic.com/dynmedia/302/
h2. System Requirements
@@ -27,9 +27,9 @@ Stackdump was designed for offline environments or environments with poor intern
As long as you have:
* "Python":http://python.org/download/,
* "Java":http://java.com/en/download/manual.jsp,
-* Stackdump,
+* "Stackdump":https://bitbucket.org/samuel.lai/stackdump/downloads,
* the "StackExchange Data Dump":http://www.clearbits.net/creators/146-stack-exchange-data-dump (Note: this is only available as a torrent), and
* "7-zip" (needed to extract the data dump files)
* "7-zip":http://www.7-zip.org/ (needed to extract the data dump files)
...you should be able to get an instance up and running.
@@ -41,13 +41,13 @@ Stackdump was to be self-contained, so to get it up and running, simply extract
h3. Verify dependencies
-Next, you should verify that the required Java and Python versions are accessible in the path.
+Next, you should verify that the required Java and Python versions are accessible in the PATH. (If you haven't installed them yet, now is a good time to do so.)
Type @java -version@ and check that it is at least version 1.6.
Then type @python -V@ and check that it is version 2.5 or later (and not Python 3).
-If you would rather not put these versions in the path (e.g. you don't want to override the default version of Python in your Linux distribution), you can tell Stackdump which Java and/or Python to use explicitly by creating a file named @JAVA_CMD@ or @PYTHON_CMD@ respectively in the Stackdump root directory, and placing the path to the executable in there.
+If you would rather not put these versions in the PATH (e.g. you don't want to override the default version of Python in your Linux distribution), you can tell Stackdump which Java and/or Python to use explicitly by creating a file named @JAVA_CMD@ or @PYTHON_CMD@ respectively in the Stackdump root directory, and placing the path to the executable in there.
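For example, if the interpreter you want lives at @/usr/bin/python2.6@ (an illustrative path - substitute your own), the override is a one-line file:
@echo /usr/bin/python2.6 > stackdump_dir/PYTHON_CMD@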
h3. Download additional site information
@@ -60,9 +60,9 @@ To start the download, execute the following command in the Stackdump root direc
If Stackdump will be running in a completely offline environment, it is recommended that you extract Stackdump and run this command in a connected environment first. If that is not possible, you can manually download the required pieces -
* download the "RSS feed":http://stackexchange.com/feeds/sites to a file
-* for each site you will be importing, work out the __site key__ and download the logo by substituting the site key into this URL: http://sstatic.net/<strong>site_key</strong>/img/icon-48.png where *site_key* is the site key. The site key is generally the bit in the URL before .stackexchange.com, or just the domain without the TLD, e.g. for the Salesforce StackExchange at http://salesforce.stackexchange.com, it is just __salesforce__, while for Server Fault at http://serverfault.com, it is __serverfault__.
+* for each site you will be importing, work out the __site key__ and download the logo by substituting the site key into this URL: @http://sstatic.net/site_key/img/icon-48.png@ where *site_key* is the site key. The site key is generally the bit in the URL before .stackexchange.com, or just the domain without the TLD, e.g. for the Salesforce StackExchange at http://salesforce.stackexchange.com, it is just __salesforce__, while for Server Fault at http://serverfault.com, it is __serverfault__.
-The RSS feed file should be copied to the file @stackdump_dir/data/sites@, and the logos should be copied to @stackdump_dir/python/media/images/logos@ and named with the site key and extension, e.g. @serverfault.png@.
+The RSS feed file should be copied to the file @stackdump_dir/data/sites@ (create the @data@ directory if it doesn't exist), and the logos should be copied to the @stackdump_dir/python/media/images/logos@ directory and named with the site key and file type extension, e.g. @serverfault.png@.
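If @wget@ is available, the manual downloads might look like this (the __serverfault__ site key is just an example):
@wget -O stackdump_dir/data/sites http://stackexchange.com/feeds/sites@
@wget -O stackdump_dir/python/media/images/logos/serverfault.png http://sstatic.net/serverfault/img/icon-48.png@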
h3. Import sites
@@ -74,11 +74,19 @@ To start the import process, execute the following command -
@stackdump_dir/manage.sh import_site --base-url site_url --dump-date dump_date path_to_xml_files@
-... where __site_url__ is the URL of the site you're importing, e.g. __android.stackexchange.com__; __dump_date__ is the date of the data dump you're importing, e.g. __August 2012__, and finally __path_to_xml_files__ is the path to the XML files you just extracted. The __dump_date__ is a text string that is shown in the app only, so it can be in any format you want.
+... where site_url is the URL of the site you're importing, e.g. __android.stackexchange.com__; dump_date is the date of the data dump you're importing, e.g. __August 2012__, and finally path_to_xml_files is the path to the XML files you just extracted. The dump_date is a text string that is shown in the app only, so it can be in any format you want.
For example, to import the August 2012 data dump of the Android StackExchange site, you would execute -
@stackdump_dir/manage.sh import_site --base-url android.stackexchange.com --dump-date "August 2012" /tmp/android@
+It is normal to get messages about unknown PostTypeIds and missing comments and answers. These errors are likely due to those posts being hidden via moderation.
+This can take anywhere from a minute to 10 hours or more depending on the site you're importing. As a rough guide, __android.stackexchange.com__ took a minute on my VM, while __stackoverflow.com__ took just over 10 hours.
-Repeat these steps for each site you wish to import.
+Repeat these steps for each site you wish to import. Do not attempt to import multiple sites at the same time; it will not work and you may end up with half-imported sites.
+The import process can be cancelled at any time without any adverse effect; however, on the next run it will have to start from scratch again.
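If you have several extracted dumps waiting, a small shell loop keeps the imports strictly one-at-a-time (the site keys, dump date and paths below are illustrative):
@for site in android gaming; do stackdump_dir/manage.sh import_site --base-url $site.stackexchange.com --dump-date "August 2012" /tmp/$site; done@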
h3. Start the app
@@ -86,7 +94,7 @@ To start Stackdump, execute the following command -
@stackdump_dir/start_web.sh@
-... and visit port 8080 on that machine.
+... and visit port 8080 on that machine. That's it - your own offline, read-only instance of StackExchange.
If you need to change the port that it runs on, modify @stackdump_dir/python/src/stackdump/settings.py@ and restart the app.
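The relevant line isn't shown in this diff; assuming the port is a plain constant in @settings.py@, the edit might be as small as (the variable name is a guess - check the file for the real one):
@SERVER_PORT = 8081@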
@@ -94,6 +102,17 @@ Stackdump comes bundled with some init.d scripts as well which were tested on Ce
Both the search indexer and the app need to be running for Stackdump to work.
+h2. Maintenance
+Stackdump stores all its data in the @data@ directory under its root directory. If you want to start fresh, just stop the app and the search indexer, delete that directory and restart the app and search indexer.
+To delete certain sites from Stackdump, use the manage_sites management command -
+@stackdump_dir/manage.sh manage_sites -l@ to list the sites (and their site keys) currently in the system;
+@stackdump_dir/manage.sh manage_sites -d site_key@ to delete a particular site.
+It is not necessary to delete a site before importing a new data dump of it though; the import process will automatically purge the old copy as it runs.
h2. Credits
Stackdump leverages several open-source projects to do various things, including -
@@ -108,7 +127,7 @@ Stackdump leverages several open-source projects to do various things, including
* "iso8601":http://pypi.python.org/pypi/iso8601/ for date parsing
* "httplib2":http://code.google.com/p/httplib2/ as a dependency of pysolr
-h2. Things not supported
+h2. Things not supported... yet
* searching or browsing by tags
* tag wiki pages

View File

@@ -245,6 +245,10 @@ def error500(error):
    # HACK: the exception object doesn't seem to provide a better way though.
    if 'database is locked' in ex.args:
        return render_template('importinprogress.html')
+    # check if we get a 'no such table' error. If so, this means we haven't
+    # had any data imported yet.
+    if ex.message.startswith('no such table:'):
+        return render_template('nodata.html')
    if isinstance(ex, socket.error):
        # if the error is connection refused, then it is likely because Solr is
        # not running. Show a nice error message.
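Pulling the checks together, the handler pattern looks roughly like this - a sketch only: @render_template@ and the first two template names appear in the diff, while the surrounding function body and the Solr template name are assumptions:

import socket

# render_template is the app's own helper, defined elsewhere in app.py

def error500(error):
    # assumption: under Bottle, error.exception carries the original
    # exception, which matches how `ex` is used in the diff above
    ex = error.exception
    # another process (e.g. a site import) holds the SQLite write lock
    if 'database is locked' in ex.args:
        return render_template('importinprogress.html')
    # no tables yet means nothing has ever been imported
    if ex.message.startswith('no such table:'):
        return render_template('nodata.html')
    # connection refused usually means Solr is not running
    if isinstance(ex, socket.error):
        return render_template('solrdown.html')  # hypothetical template name
    raise ex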

View File

@@ -10,6 +10,10 @@ import sys
script_dir = os.path.dirname(sys.argv[0])
sites_file_path = os.path.join(script_dir, '../../../../data/sites')
+# ensure the data directory exists
+if not os.path.exists(os.path.dirname(sites_file_path)):
+    os.mkdir(os.path.dirname(sites_file_path))
# download the sites RSS file
print 'Downloading StackExchange sites RSS file...',
urllib.urlretrieve('http://stackexchange.com/feeds/sites', sites_file_path)
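@os.mkdir@ only creates the final path component and raises if it already exists, hence the existence check above. A more defensive variant - purely illustrative, not from the repo - also covers nested paths and the check-then-create race:

import errno
import os

def ensure_dir(path):
    # makedirs creates intermediate directories too; ignoring EEXIST makes
    # the call safe if another process created the directory first
    try:
        os.makedirs(path)
    except OSError, e:
        if e.errno != errno.EEXIST:
            raise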

View File

@@ -10,9 +10,16 @@ fi
if [ -z "`which $JAVA_CMD 2>/dev/null`" ]
then
echo "Java not found. Try specifying path in a file named JAVA_CMD in the script dir."
echo "Java not found. Try specifying the path to the Java executable in a file named"
echo "JAVA_CMD in this script's directory."
exit 1
fi
+# ensure the data directory exists
+if [ ! -e "$SCRIPT_DIR/data" ]
+then
+    mkdir "$SCRIPT_DIR/data"
+fi
cd $SCRIPT_DIR/java/solr/server
$JAVA_CMD -server -Xmx2048M -XX:MaxPermSize=512M -jar start.jar
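The diff doesn't show how $JAVA_CMD gets its value; a plausible sketch of the override logic (the JAVA_CMD file name comes from the README, the rest is assumed):

# use the java binary named in an optional JAVA_CMD file, falling back to
# whatever "java" resolves to on the PATH
if [ -f "$SCRIPT_DIR/JAVA_CMD" ]
then
    JAVA_CMD=`cat "$SCRIPT_DIR/JAVA_CMD"`
else
    JAVA_CMD=java
fi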

View File

@@ -2,4 +2,10 @@
SCRIPT_DIR=`dirname $0`
+# ensure the data directory exists
+if [ ! -e "$SCRIPT_DIR/data" ]
+then
+    mkdir "$SCRIPT_DIR/data"
+fi
$SCRIPT_DIR/start_python.sh $SCRIPT_DIR/python/src/stackdump/app.py
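The same ensure-the-data-directory block now appears in both startup scripts; a terser near-equivalent (behaviour differs only if a non-directory named data already exists at that path) would be:

# -p creates parents as needed and is a no-op if the directory already exists
mkdir -p "$SCRIPT_DIR/data"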