mirror of https://github.com/djohnlewis/stackdump synced 2024-12-04 23:17:37 +00:00

Added an original copy of pysolr.py so the custom changes can be worked out.

Sam 2014-02-16 01:03:05 +11:00
commit 0990e00852
1153 changed files with 170165 additions and 0 deletions

27
.hgignore Normal file

@ -0,0 +1,27 @@
^JAVA_CMD$
^PYTHON_CMD$
# ignore any data
^data/.*$
# ignore working bytecode
\.class$
\.pyc$
^datadump/.*
# ignore test and tutorial directories
test/.*$
tests/.*$
testsuite/.*$
tutorial/.*$
# Solr/Jetty
^java/solr/server/solr-webapp/.*
^java/solr/server/logs/.*
# ignore the downloaded logos
^python/media/images/logos/.*
# PyCharm project files
^\.idea/

BIN
List-StackdumpCommands.ps1 Normal file

Binary file not shown.

179
README.textile Normal file

@ -0,0 +1,179 @@
h1. Stackdump - an offline browser for StackExchange sites.
Stackdump was conceived for those who work in environments that do not have easy access to the StackExchange family of websites. It allows you to host a read-only instance of the StackExchange sites locally, accessible via a web browser.
Stackdump comprises two components - the search indexer ("Apache Solr":http://lucene.apache.org/solr/) and the web application. It uses the "StackExchange Data Dumps":http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/, published quarterly by StackExchange, as its source of data.
h2. Screenshots
"Stackdump home":http://edgylogic.com/dynmedia/301/
"Stackdump search results":http://edgylogic.com/dynmedia/303/
"Stackdump question view":http://edgylogic.com/dynmedia/302/
h2. System Requirements
Stackdump was written in Python and requires Python 2.5 or later (but not Python 3). It leverages Apache Solr, which requires the Java runtime (JRE), version 6 or later.
Beyond that, Stackdump has no OS-specific dependencies and should work on any platform that Python and Java run on. It comes bundled with Linux shell scripts and, as of v1.2, PowerShell scripts for Windows. It was, however, developed and tested on CentOS 5 running Python 2.7 and JRE 6 update 27.
You will also need "7-zip":http://www.7-zip.org/ to extract the data dump files, but Stackdump does not use it directly so you can perform the extraction on another machine first.
It is recommended that Stackdump be run on a system with at least 3GB of RAM, particularly if you intend to import StackOverflow into Stackdump. Apache Solr requires a fair bit of memory during the import process. It should also have a fair bit of space available; having at least roughly the space used by the raw, extracted, data dump XML files is a good rule of thumb (note that once imported, the raw data dump XML files are not needed by Stackdump any more).
Finally, Stackdump has been tested and works in the latest browsers (IE9, FF10+, Chrome, Safari). It degrades fairly gracefully in older browsers, although some will have rendering issues, e.g. IE8.
h2. Changes and upgrading to v1.1
Version 1.1 fixes a few bugs, the major one being the inability to import the 2013 data dumps due to changes in the case of the filenames. It also adds a couple of minor features, including support for resolving and rewriting short question and answer permalinks.
Because changes have been made to the search schema and the search indexer has been upgraded (to Solr 4.5), all data will need to be re-indexed. Therefore there is no upgrade path; follow the instructions below to set up Stackdump again. It is recommended to install this new version in a new directory, instead of overwriting the existing one.
h2. Changes and upgrading from v1.1 to v1.2
The major change in the v1.2 release is faster data imports. There are some other smaller changes, including new PowerShell scripts to start and manage Stackdump on Windows, as well as a few bug fixes for running on Windows. The search indexing side of things has not changed, so data imported using v1.1 will continue to work in v1.2. _Data from older versions, however, needs to be re-indexed. See the above section on upgrading to v1.1 for more details._
h3. Importing the StackOverflow data dump, September 2013
The StackOverflow data dump has grown significantly since I started this project back in 2011. With the improvements in v1.2, on a VM with two cores and 4GB of RAM running CentOS 5.7 on a single, standard hard drive containing spinning pieces of metal,
* it took *84719.565491 seconds* to import it, or *23 hours, 31 minutes and 59.565491 seconds*
* once completed, it used up *20GB* of disk space
* during the import, roughly *30GB* of disk space was needed
* the import process used, at max, around *2GB* of RAM.
In total, the StackOverflow data dump has *15,933,529 posts* (questions and answers), *2,332,403 users* and a very large number of comments.
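The conversion of the import time quoted above can be checked with a little shell arithmetic. This is only a sketch: @format_duration@ is a hypothetical helper, not part of Stackdump, and it works on whole seconds, so the fractional part is dropped.

```shell
# Convert a whole number of seconds into hours, minutes and seconds.
# format_duration is a hypothetical helper, not a Stackdump command.
format_duration() {
    total=$1
    hours=$(( total / 3600 ))
    minutes=$(( (total % 3600) / 60 ))
    seconds=$(( total % 60 ))
    printf '%dh %dm %ds\n' "$hours" "$minutes" "$seconds"
}

# The import time reported above, truncated to whole seconds.
format_duration 84719   # prints "23h 31m 59s"
```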
h2. Setting up
Stackdump was designed for offline environments or environments with poor internet access, therefore it is bundled with all the dependencies it requires (with the exception of Python, Java and 7-zip).
As long as you have:
* "Python":http://python.org/download/,
* "Java":http://java.com/en/download/manual.jsp,
* "Stackdump":https://bitbucket.org/samuel.lai/stackdump/downloads,
* the "StackExchange Data Dump":http://www.clearbits.net/creators/146-stack-exchange-data-dump (Note: this is only available as a torrent), and
* "7-zip":http://www.7-zip.org/ (needed to extract the data dump files)
...you should be able to get an instance up and running.
To provide a better experience, Stackdump can use the RSS feed content to pre-fill some of the required details during the import process, as well as to display the site logos in the app. Stackdump comes bundled with a script that downloads and places these bits in the right places. If you're in a completely offline environment however, it may be worth running this script on a connected box first.
h3. Windows users
If you're using Windows, you will need to substitute the appropriate PowerShell equivalent command for the Stackdump scripts used below. These equivalent PowerShell scripts are in the Stackdump root directory, alongside their Unix counterparts. The names are roughly the same, with the exception of @manage.sh@, which in PowerShell has been broken up into two scripts, @List-StackdumpCommands.ps1@ and @Run-StackdumpCommand.ps1@.
Remember to set your PowerShell execution policy to at least @RemoteSigned@ first as these scripts are not signed. Use the @Get-ExecutionPolicy@ cmdlet to see the current policy, and @Set-ExecutionPolicy@ to set it. You will need to have administrative privileges to set it.
h3. Extract Stackdump
Stackdump was designed to be self-contained, so to get it up and running, simply extract the Stackdump download to an appropriate location.
h3. Verify dependencies
Next, you should verify that the required Java and Python versions are accessible in the PATH. (If you haven't installed them yet, now is a good time to do so.)
Type @java -version@ and check that it is at least version 1.6.
bq. If you're using Java 7 on Linux and you see an error similar to the following -
@ Error: failed /opt/jre1.7.0_40/lib/i386/server/libjvm.so, because /opt/jre1.7.0_40/lib/i386/server/libjvm.so: cannot restore segment prot after reloc: Permission denied @
this is because you have SELinux enabled. You will need to tell SELinux to allow Java to run by using the following command as root (amending the path as necessary) -
@chcon -t textrel_shlib_t /opt/jre1.7.0_40/lib/i386/server/libjvm.so@
Then type @python -V@ and check that it is version 2.5 or later (and not Python 3).
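If you would like to script the Python check, something along these lines may help. It is a minimal sketch: @python_version_ok@ is a hypothetical helper that takes a version string such as @2.7.5@ and succeeds only for Python 2.5 or later in the 2.x series.

```shell
# Succeeds if the given version string is Python 2.5 or later, but not Python 3.
# python_version_ok is a hypothetical helper, not part of Stackdump.
python_version_ok() {
    major=${1%%.*}      # text before the first dot, e.g. "2"
    rest=${1#*.}        # text after the first dot, e.g. "7.5"
    minor=${rest%%.*}   # e.g. "7"
    [ "$major" -eq 2 ] && [ "$minor" -ge 5 ]
}

# Example usage (Python 2 prints its version to stderr, hence 2>&1):
#   python_version_ok "$(python -V 2>&1 | awk '{print $2}')" && echo "Python is OK"
```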
If you would rather not put these versions in the PATH (e.g. you don't want to override the default version of Python in your Linux distribution), you can tell Stackdump which Java and/or Python to use explicitly by creating a file named @JAVA_CMD@ or @PYTHON_CMD@ respectively in the Stackdump root directory, and placing the path to the executable in there.
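As an illustration, the file can be created from the shell as below. This is a sketch under assumptions: @pin_interpreter@ is a hypothetical helper, the paths in the example are made up, and it assumes Stackdump tolerates a trailing newline in the file.

```shell
# Write the path of an interpreter into JAVA_CMD or PYTHON_CMD in the
# Stackdump root directory. pin_interpreter is a hypothetical helper.
pin_interpreter() {
    stackdump_dir=$1
    name=$2
    executable=$3
    case "$name" in
        JAVA_CMD|PYTHON_CMD) ;;
        *) echo "name must be JAVA_CMD or PYTHON_CMD" >&2; return 1 ;;
    esac
    printf '%s\n' "$executable" > "$stackdump_dir/$name"
}

# Example (hypothetical paths):
#   pin_interpreter /opt/stackdump PYTHON_CMD /usr/local/bin/python2.7
```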
h3. Download additional site information
As mentioned earlier, Stackdump can use additional information available in the StackExchange RSS feed to pre-fill required details during the site import process and to show the logos for each site.
To start the download, execute the following command in the Stackdump root directory -
@./manage.sh download_site_info@
If Stackdump will be running in a completely offline environment, it is recommended that you extract and run this command in a connected environment first. If that is not possible, you can manually download the required pieces -
* download the "RSS feed":http://stackexchange.com/feeds/sites to a file
* for each site you will be importing, work out the __site key__ and download the logo by substituting the site key into this URL: @http://sstatic.net/site_key/img/icon-48.png@ where *site_key* is the site key. The site key is generally the bit in the URL before .stackexchange.com, or just the domain without the TLD, e.g. for the Salesforce StackExchange at http://salesforce.stackexchange.com, it is just __salesforce__, while for Server Fault at http://serverfault.com, it is __serverfault__.
The RSS feed file should be copied to the file @stackdump_dir/data/sites@ (create the @data@ directory if it doesn't exist), and the logos should be copied to the @stackdump_dir/python/media/images/logos@ directory and named with the site key and file type extension, e.g. @serverfault.png@.
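The site key rule described above can be sketched in shell as follows. Both @site_key@ and @logo_url@ are hypothetical helpers, and the rule is only the general heuristic given above, so check the result against the real site before relying on it.

```shell
# Derive a site key from a site URL or host name using the heuristic above:
# the part before .stackexchange.com, otherwise the domain without its TLD.
site_key() {
    host=${1#*://}      # drop the scheme, if any
    host=${host%%/*}    # drop any path
    case "$host" in
        *.stackexchange.com) printf '%s\n' "${host%.stackexchange.com}" ;;
        *)                   printf '%s\n' "${host%.*}" ;;
    esac
}

# Build the logo URL for a given site key.
logo_url() {
    printf 'http://sstatic.net/%s/img/icon-48.png\n' "$1"
}

site_key http://salesforce.stackexchange.com   # prints "salesforce"
logo_url "$(site_key http://serverfault.com)"  # prints "http://sstatic.net/serverfault/img/icon-48.png"
```

The logo would then be saved as @stackdump_dir/python/media/images/logos/serverfault.png@, as described above.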
h3. Import sites
Each data dump for a StackExchange site is a "7-zip":http://www.7-zip.org/ file. Extract the file corresponding to the site you wish to import into a temporary directory. It should have a bunch of XML files in it when complete.
Now make sure you have the search indexer up and running. This can be done by simply executing the @stackdump_dir/start_solr.sh@ command.
To start the import process, execute the following command -
@stackdump_dir/manage.sh import_site --base-url site_url --dump-date dump_date path_to_xml_files@
... where site_url is the URL of the site you're importing, e.g. __android.stackexchange.com__; dump_date is the date of the data dump you're importing, e.g. __August 2012__, and finally path_to_xml_files is the path to the XML files you just extracted. The dump_date is a text string that is shown in the app only, so it can be in any format you want.
For example, to import the August 2012 data dump of the Android StackExchange site, you would execute -
@stackdump_dir/manage.sh import_site --base-url android.stackexchange.com --dump-date "August 2012" /tmp/android@
It is normal to get messages about unknown PostTypeIds and missing comments and answers. These errors are likely due to those posts being hidden via moderation.
The import can take anywhere from a minute to 10 hours or more, depending on the site you're importing. As a rough guide, __android.stackexchange.com__ took a minute on my VM, while __stackoverflow.com__ took just over 10 hours.
Repeat these steps for each site you wish to import. Do not attempt to import multiple sites at the same time; it will not work and you may end up with half-imported sites.
The import process can be cancelled at any time without any adverse effect; however, the next run will have to start from scratch again.
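Since sites must be imported one at a time, a simple loop is one way to queue several imports. The sketch below makes some assumptions: @import_all@ is a hypothetical helper, each directory is assumed to be named after the site's subdomain on stackexchange.com, and the loop stops at the first failed import.

```shell
# Import several extracted data dumps one after another, never in parallel.
# Usage: import_all <path to manage.sh> <dump date> <xml dir>...
# import_all is a hypothetical helper; it assumes each directory name is the
# site's subdomain, e.g. /tmp/android for android.stackexchange.com.
import_all() {
    manage=$1
    dump_date=$2
    shift 2
    for dir in "$@"; do
        site=$(basename "$dir")
        "$manage" import_site --base-url "$site.stackexchange.com" \
            --dump-date "$dump_date" "$dir" || return 1
    done
}

# Example (hypothetical paths):
#   import_all /opt/stackdump/manage.sh "August 2012" /tmp/android /tmp/cooking
```

Sites that do not live under stackexchange.com, such as Server Fault, do not fit this naming assumption and are easier to import by hand.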
h3. Start the app
To start Stackdump, execute the following command -
@stackdump_dir/start_web.sh@
... and visit port 8080 on that machine. That's it - your own offline, read-only instance of StackExchange.
If you need to change the port that it runs on, modify @stackdump_dir/python/src/stackdump/settings.py@ and restart the app.
The aforementioned @settings.py@ file also contains some other settings that control how Stackdump works.
Stackdump comes bundled with some init.d scripts as well which were tested on CentOS 5. These are located in the @init.d@ directory. To use these, you will need to modify them to specify the path to the Stackdump root directory and the user to run under.
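The two values to change are the @STACKDUMP_HOME@ and @STACKDUMP_USER@ variables near the top of each script. They can be edited by hand, or with something like the sketch below. @configure_init_script@ is a hypothetical helper, and it assumes neither value contains a @|@ character, since that is used as the sed delimiter.

```shell
# Rewrite the STACKDUMP_HOME and STACKDUMP_USER lines of an init.d script.
# Usage: configure_init_script <script> <stackdump root dir> <user>
# configure_init_script is a hypothetical helper, not part of Stackdump.
configure_init_script() {
    script=$1
    home=$2
    user=$3
    tmp=$(mktemp) || return 1
    sed -e "s|^STACKDUMP_HOME=.*|STACKDUMP_HOME=$home|" \
        -e "s|^STACKDUMP_USER=.*|STACKDUMP_USER=$user|" \
        "$script" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$script"
    rm -f "$tmp"
}

# Example (hypothetical values):
#   configure_init_script init.d/stackdump_solr /srv/stackdump/ stackdump
```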
Both the search indexer and the app need to be running for Stackdump to work.
h2. Maintenance
Stackdump stores all its data in the @data@ directory under its root directory. If you want to start fresh, just stop the app and the search indexer, delete that directory and restart the app and search indexer.
To delete certain sites from Stackdump, use the manage_sites management command -
@stackdump_dir/manage.sh manage_sites -l@ to list the sites (and their site keys) currently in the system;
@stackdump_dir/manage.sh manage_sites -d site_key@ to delete a particular site.
It is not necessary to delete a site before importing a new data dump of it, though; the import process automatically purges the old copy.
h2. Credits
Stackdump leverages several open-source projects to do various things, including -
* "twitter-bootstrap":http://github.com/twitter/bootstrap for the UI
* "jQuery":http://jquery.com for the UI
* "bottle.py":http://bottlepy.org for the web framework
* "cherrypy":http://cherrypy.org for the built-in web server
* "pysolr":https://github.com/toastdriven/pysolr/ to connect from Python to the search indexer, Apache Solr
* "html5lib":http://code.google.com/p/html5lib/ for parsing HTML
* "Jinja2":http://jinja.pocoo.org/ for templating
* "SQLObject":http://www.sqlobject.org/ for writing and reading from the database
* "iso8601":http://pypi.python.org/pypi/iso8601/ for date parsing
* "markdown":http://pypi.python.org/pypi/Markdown for rendering comments
* "mathjax":http://www.mathjax.org/ for displaying mathematical expressions properly
* "httplib2":http://code.google.com/p/httplib2/ as a dependency of pysolr
* "Apache Solr":http://lucene.apache.org/solr/ for search functionality
h2. Things not supported... yet
* searching or browsing by tags
* tag wiki pages
* badges
* post history, e.g. the reasons why a post was closed are not listed
h2. License
Stackdump is licensed under the "MIT License":http://en.wikipedia.org/wiki/MIT_License.

BIN
Run-StackdumpCommand.ps1 Normal file

Binary file not shown.

BIN
Start-Python.ps1 Normal file

Binary file not shown.

BIN
Start-Solr.ps1 Normal file

Binary file not shown.

BIN
Start-StackdumpWeb.ps1 Normal file

Binary file not shown.

142
init.d/stackdump_solr Executable file

@ -0,0 +1,142 @@
#! /bin/bash
#
# stackdump_solr: Starts the Solr instance for Stackdump
#
# chkconfig: 345 99 01
# description: This daemon provides the search engine capability for Stackdump.\
# It is a required part of Stackdump; Stackdump will not work \
# without it.
# Source function library.
. /etc/init.d/functions
# this needs to be the path of the Stackdump root directory.
STACKDUMP_HOME=/opt/stackdump/
# this is the user that Stackdump runs under
STACKDUMP_USER=stackdump
SOLR_PID_FILE=/var/run/stackdump_solr.pid
if [ ! -d "$STACKDUMP_HOME" ]
then
    echo "The STACKDUMP_HOME variable does not point to a valid directory."
    exit 1
fi

base=${0##*/}

start() {
    echo -n $"Starting Stackdump - Solr... "
    # create the logs directory if it doesn't already exist
    if [ ! -d "$STACKDUMP_HOME/logs" ]
    then
        runuser -s /bin/bash $STACKDUMP_USER -c "mkdir $STACKDUMP_HOME/logs"
    fi
    # check if it is already running
    SOLR_PID=`cat $SOLR_PID_FILE 2>/dev/null`
    if [ ! -z "$SOLR_PID" ]
    then
        if [ ! -z "$(pgrep -P $SOLR_PID)" ]
        then
            echo
            echo "Stackdump - Solr is already running."
            exit 2
        else
            # the PID is stale.
            rm $SOLR_PID_FILE
        fi
    fi
    # run it!
    runuser -s /bin/bash $STACKDUMP_USER -c "$STACKDUMP_HOME/start_solr.sh >> $STACKDUMP_HOME/logs/solr.log 2>&1" &
    SOLR_PID=$!
    RETVAL=$?
    if [ $RETVAL = 0 ]
    then
        echo $SOLR_PID > $SOLR_PID_FILE
        success $"$base startup"
    else
        failure $"$base startup"
    fi
    echo
    return $RETVAL
}

stop() {
    # check if it is running
    SOLR_PID=`cat $SOLR_PID_FILE 2>/dev/null`
    if [ -z "$SOLR_PID" ] || [ -z "$(pgrep -P $SOLR_PID)" ]
    then
        echo "Stackdump - Solr is not running."
        exit 2
    fi
    echo -n $"Shutting down Stackdump - Solr... "
    # it is running, so shut it down.
    # there are many levels of processes here and the kill signal needs to
    # be sent to the actual Java process for the process to stop, so let's
    # just kill the whole process group.
    RUNUSER_CMD_PID=`pgrep -P $SOLR_PID`
    RUNUSER_CMD_PGRP=`ps -o pgrp --no-headers -p $RUNUSER_CMD_PID`
    pkill -g $RUNUSER_CMD_PGRP
    RETVAL=$?
    [ $RETVAL = 0 ] && success $"$base shutdown" || failure $"$base shutdown"
    rm -f $SOLR_PID_FILE
    echo
    return $RETVAL
}

status() {
    # check if it is running
    SOLR_PID=`cat $SOLR_PID_FILE 2>/dev/null`
    if [ -z "$SOLR_PID" ]
    then
        echo "Stackdump - Solr is not running."
        exit 0
    else
        if [ -z "$(pgrep -P $SOLR_PID)" ]
        then
            rm -f $SOLR_PID_FILE
            echo "Stackdump - Solr is not running."
            exit 0
        else
            echo "Stackdump - Solr is running."
            exit 0
        fi
    fi
}

restart() {
    stop
    start
}

RETVAL=0

# See how we were called.
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    status)
        status
        ;;
    restart)
        restart
        ;;
    *)
        echo $"Usage: $0 {start|stop|status|restart}"
        exit 1
esac

exit $RETVAL

141
init.d/stackdump_web Normal file

@ -0,0 +1,141 @@
#! /bin/bash
#
# stackdump_web: Starts the Stackdump web app
#
# chkconfig: 345 99 01
# description: This daemon is the web server for Stackdump.\
# It requires the Solr instance to be running to function.
# Source function library.
. /etc/init.d/functions
# this needs to be the path of the Stackdump root directory.
STACKDUMP_HOME=/opt/stackdump/
# this is the user that Stackdump runs under
STACKDUMP_USER=stackdump
WEB_PID_FILE=/var/run/stackdump_web.pid
if [ ! -d "$STACKDUMP_HOME" ]
then
    echo "The STACKDUMP_HOME variable does not point to a valid directory."
    exit 1
fi

base=${0##*/}

start() {
    echo -n $"Starting Stackdump - Web... "
    # create the logs directory if it doesn't already exist
    if [ ! -d "$STACKDUMP_HOME/logs" ]
    then
        runuser -s /bin/bash $STACKDUMP_USER -c "mkdir $STACKDUMP_HOME/logs"
    fi
    # check if it is already running
    WEB_PID=`cat $WEB_PID_FILE 2>/dev/null`
    if [ ! -z "$WEB_PID" ]
    then
        if [ ! -z "$(pgrep -P $WEB_PID)" ]
        then
            echo
            echo "Stackdump - Web is already running."
            exit 2
        else
            # the PID is stale.
            rm $WEB_PID_FILE
        fi
    fi
    # run it!
    runuser -s /bin/bash $STACKDUMP_USER -c "$STACKDUMP_HOME/start_web.sh >> $STACKDUMP_HOME/logs/web.log 2>&1" &
    WEB_PID=$!
    RETVAL=$?
    if [ $RETVAL = 0 ]
    then
        echo $WEB_PID > $WEB_PID_FILE
        success $"$base startup"
    else
        failure $"$base startup"
    fi
    echo
    return $RETVAL
}

stop() {
    # check if it is running
    WEB_PID=`cat $WEB_PID_FILE 2>/dev/null`
    if [ -z "$WEB_PID" ] || [ -z "$(pgrep -P $WEB_PID)" ]
    then
        echo "Stackdump - Web is not running."
        exit 2
    fi
    echo -n $"Shutting down Stackdump - Web... "
    # it is running, so shut it down.
    # there are many levels of processes here and the kill signal needs to
    # be sent to the actual Python process for the process to stop, so let's
    # just kill the whole process group.
    RUNUSER_CMD_PID=`pgrep -P $WEB_PID`
    RUNUSER_CMD_PGRP=`ps -o pgrp --no-headers -p $RUNUSER_CMD_PID`
    pkill -g $RUNUSER_CMD_PGRP
    RETVAL=$?
    [ $RETVAL = 0 ] && success $"$base shutdown" || failure $"$base shutdown"
    rm -f $WEB_PID_FILE
    echo
    return $RETVAL
}

status() {
    # check if it is running
    WEB_PID=`cat $WEB_PID_FILE 2>/dev/null`
    if [ -z "$WEB_PID" ]
    then
        echo "Stackdump - Web is not running."
        exit 0
    else
        if [ -z "$(pgrep -P $WEB_PID)" ]
        then
            rm -f $WEB_PID_FILE
            echo "Stackdump - Web is not running."
            exit 0
        else
            echo "Stackdump - Web is running."
            exit 0
        fi
    fi
}

restart() {
    stop
    start
}

RETVAL=0

# See how we were called.
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    status)
        status
        ;;
    restart)
        restart
        ;;
    *)
        echo $"Usage: $0 {start|stop|status|restart}"
        exit 1
esac

exit $RETVAL

7412
java/solr/CHANGES.txt Normal file

File diff suppressed because it is too large.

226
java/solr/LICENSE.txt Normal file

@ -0,0 +1,226 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==========================================================================
The following license applies to the JQuery JavaScript library
--------------------------------------------------------------------------
Copyright (c) 2010 John Resig, http://jquery.com/
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

564
java/solr/NOTICE.txt Normal file

File diff suppressed because it is too large.

120
java/solr/README.txt Normal file

@ -0,0 +1,120 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Welcome to the Apache Solr project!
-----------------------------------
Solr is the popular, blazing fast open source enterprise search platform
from the Apache Lucene project.
For a complete description of the Solr project, team composition, source
code repositories, and other details, please see the Solr web site at
http://lucene.apache.org/solr
Getting Started
---------------
See the "example" directory for an example Solr setup. A tutorial
using the example setup can be found at
http://lucene.apache.org/solr/tutorial.html
or linked from "docs/index.html" in a binary distribution.
Also, there are Solr clients for many programming languages, see
http://wiki.apache.org/solr/IntegratingSolr
Files included in an Apache Solr binary distribution
----------------------------------------------------
example/
A self-contained example Solr instance, complete with a sample
configuration, documents to index, and the Jetty Servlet container.
Please see example/README.txt for information about running this
example.
dist/solr-XX.war
The Apache Solr Application. Deploy this WAR file to any servlet
container to run Apache Solr.
dist/solr-<component>-XX.jar
The Apache Solr libraries. To compile Apache Solr Plugins,
one or more of these will be required. The core library is
required at a minimum. (see http://wiki.apache.org/solr/SolrPlugins
for more information).
docs/index.html
The Apache Solr Javadoc API documentation and Tutorial
Instructions for Building Apache Solr from Source
-------------------------------------------------
1. Download the Java SE 6 JDK (Java Development Kit) or later from http://java.sun.com/
You will need the JDK installed, and the $JAVA_HOME/bin (Windows: %JAVA_HOME%\bin)
folder included on your command path. To test this, issue a "java -version" command
from your shell (command prompt) and verify that the Java version is 1.6 or later.
2. Download the Apache Ant binary distribution (1.8.2+) from
http://ant.apache.org/ You will need Ant installed and the $ANT_HOME/bin (Windows:
%ANT_HOME%\bin) folder included on your command path. To test this, issue an
"ant -version" command from your shell (command prompt) and verify that Ant is
available.
You will also need to install Apache Ivy binary distribution (2.2.0) from
http://ant.apache.org/ivy/ and place ivy-2.2.0.jar file in ~/.ant/lib -- if you skip
this step, the Solr build system will offer to do it for you.
3. Download the Apache Solr distribution, linked from the above web site.
Unzip the distribution to a folder of your choice, e.g. C:\solr or ~/solr
Alternately, you can obtain a copy of the latest Apache Solr source code
directly from the Subversion repository:
http://lucene.apache.org/solr/versioncontrol.html
4. Navigate to the "solr" folder and issue an "ant" command to see the available options
for building, testing, and packaging Solr.
NOTE:
To see Solr in action, you may want to use the "ant example" command to build
and package Solr into the example/webapps directory. See also example/README.txt.
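
The Java version check from step 1 can also be scripted. The following is a minimal sketch, assuming the banner text has already been captured from "java -version" (which the JVM writes to standard error). The function names are hypothetical and are not part of Solr or Stackdump.

```python
import re


def parse_java_version(banner):
    """Return (major, minor) parsed from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        raise ValueError("no version string found in: %r" % banner)
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return major, minor


def meets_java_requirement(banner, minimum=(1, 6)):
    """True if the reported version is at least `minimum` (1.6, per the text above)."""
    return parse_java_version(banner) >= minimum


if __name__ == "__main__":
    # Sample banner matching the JRE mentioned in the Stackdump README.
    print(meets_java_requirement('java version "1.6.0_27"'))
```

Tuple comparison keeps the check simple: newer banners such as "17.0.2" parse to (17, 0), which still compares greater than (1, 6).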
Export control
-------------------------------------------------
This distribution includes cryptographic software. The country in
which you currently reside may have restrictions on the import,
possession, use, and/or re-export to another country, of
encryption software. BEFORE using any encryption software, please
check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to
see if this is permitted. See <http://www.wassenaar.org/> for more
information.
The U.S. Government Department of Commerce, Bureau of Industry and
Security (BIS), has classified this software as Export Commodity
Control Number (ECCN) 5D002.C.1, which includes information security
software using or performing cryptographic functions with asymmetric
algorithms. The form and manner of this Apache Software Foundation
distribution makes it eligible for export under the License Exception
ENC Technology Software Unrestricted (TSU) exception (see the BIS
Export Administration Regulations, Section 740.13) for both object
code and source code.
The following provides more details on the included cryptographic
software:
Apache Solr uses Apache Tika, which uses the Bouncy Castle generic encryption libraries for
extracting text content and metadata from encrypted PDF files.
See http://www.bouncycastle.org/ for more details on Bouncy Castle.

@ -0,0 +1,13 @@
# System Requirements
Apache Solr runs on Java 6 or greater. When using Java 7, be sure to
install at least Update 1! With all Java versions it is strongly
recommended to not use experimental `-XX` JVM options. It is also
recommended to always use the latest update version of your Java VM,
because bugs may affect Solr. An overview of known JVM bugs can be
found on http://wiki.apache.org/lucene-java/JavaBugs.
CPU, disk and memory requirements are based on the many choices made in
implementing Solr (document size, number of documents, and number of
hits retrieved to name a few). The benchmarks page has some information
related to performance on particular platforms.

Binary file not shown.

BIN
java/solr/dist/solr-cell-4.5.0.jar vendored Normal file

Binary file not shown.

BIN
java/solr/dist/solr-clustering-4.5.0.jar vendored Normal file

Binary file not shown.

BIN
java/solr/dist/solr-core-4.5.0.jar vendored Normal file

Binary file not shown.

BIN
java/solr/dist/solr-langid-4.5.0.jar vendored Normal file

Binary file not shown.

BIN
java/solr/dist/solr-solrj-4.5.0.jar vendored Normal file

Binary file not shown.

BIN
java/solr/dist/solr-uima-4.5.0.jar vendored Normal file

Binary file not shown.

BIN
java/solr/dist/solr-velocity-4.5.0.jar vendored Normal file

Binary file not shown.

BIN
java/solr/dist/solrj-lib/noggit-0.5.jar vendored Normal file

Binary file not shown.

@ -0,0 +1,6 @@
The Solr test-framework provides base classes and utility classes for
writing JUnit tests exercising Solr functionality.
This test framework relies on the Lucene components found in the
./lucene-libs/ directory, as well as the third-party libraries found
in the ./lib directory.

Binary file not shown.

@ -0,0 +1,8 @@
<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN" "http://www.eclipse.org/jetty/configure.dtd">
<Configure class="org.eclipse.jetty.webapp.WebAppContext">
<Set name="contextPath"><SystemProperty name="hostContext" default="/solr"/></Set>
<Set name="war"><SystemProperty name="jetty.home"/>/webapps/solr.war</Set>
<Set name="defaultsDescriptor"><SystemProperty name="jetty.home"/>/etc/webdefault.xml</Set>
<Set name="tempDirectory"><Property name="jetty.home" default="."/>/solr-webapp</Set>
</Configure>

@ -0,0 +1,37 @@
#!/bin/bash -ex
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
############
# This script shows how the solrtest.keystore file used for solr tests
# and these example configs was generated.
#
# Running this script should only be necessary if the keystore file
# needs to be replaced, which shouldn't be required until sometime around
# the year 4751.
#
# NOTE: the "-ext" option used in the "keytool" command requires that you have
# the java7 version of keytool, but the generated key will work with any
# version of java
echo "### remove old keystore"
rm -f solrtest.keystore
echo "### create keystore and keys"
keytool -keystore solrtest.keystore -storepass "secret" -alias solrtest -keypass "secret" -genkey -keyalg RSA -dname "cn=localhost, ou=SolrTest, o=lucene.apache.org, c=US" -ext "san=ip:127.0.0.1" -validity 999999
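
The "year 4751" remark in the script's comments follows from the `-validity 999999` argument, which is a number of days. The sketch below works through that arithmetic. The generation date is an assumption, since the script does not record when it was run, and the function name is hypothetical.

```python
from datetime import date, timedelta

VALIDITY_DAYS = 999999  # value passed to keytool's -validity option above


def expiry_date(generated_on, validity_days=VALIDITY_DAYS):
    """Date on which a key generated on `generated_on` stops being valid."""
    return generated_on + timedelta(days=validity_days)


if __name__ == "__main__":
    # Assumed generation date (hypothetical); adjust to the real one if known.
    assumed_start = date(2013, 6, 1)
    print(expiry_date(assumed_start))
```

999999 days is roughly 2738 years, so any generation date in the early 2010s puts the expiry in or near the year the comment names.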

@ -0,0 +1,205 @@
<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN" "http://www.eclipse.org/jetty/configure.dtd">
<!-- =============================================================== -->
<!-- Configure the Jetty Server -->
<!-- -->
<!-- Documentation of this file format can be found at: -->
<!-- http://wiki.eclipse.org/Jetty/Reference/jetty.xml_syntax -->
<!-- -->
<!-- =============================================================== -->
<Configure id="Server" class="org.eclipse.jetty.server.Server">
<!-- =========================================================== -->
<!-- Server Thread Pool -->
<!-- =========================================================== -->
<Set name="ThreadPool">
<!-- Default queued blocking threadpool -->
<New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
<Set name="minThreads">10</Set>
<Set name="maxThreads">10000</Set>
<Set name="detailedDump">false</Set>
</New>
</Set>
<!-- =========================================================== -->
<!-- Set connectors -->
<!-- =========================================================== -->
<!--
<Call name="addConnector">
<Arg>
<New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
<Set name="host"><SystemProperty name="jetty.host" /></Set>
<Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
<Set name="maxIdleTime">50000</Set>
<Set name="Acceptors">2</Set>
<Set name="statsOn">false</Set>
<Set name="confidentialPort">8443</Set>
<Set name="lowResourcesConnections">5000</Set>
<Set name="lowResourcesMaxIdleTime">5000</Set>
</New>
</Arg>
</Call>
-->
<!-- This connector is currently being used for Solr because it
showed better performance than nio.SelectChannelConnector
for typical Solr requests. -->
<Call name="addConnector">
<Arg>
<New class="org.eclipse.jetty.server.bio.SocketConnector">
<Call class="java.lang.System" name="setProperty"> <Arg>log4j.configuration</Arg> <Arg>etc/log4j.properties</Arg> </Call>
<Set name="host"><SystemProperty name="jetty.host" /></Set>
<Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
<Set name="maxIdleTime">50000</Set>
<Set name="lowResourceMaxIdleTime">1500</Set>
<Set name="statsOn">false</Set>
</New>
</Arg>
</Call>
<!-- if the connector below is uncommented, then jetty will also accept SSL
connections on port 8984, using a self signed certificate and can
optionally require the client to authenticate with a certificate.
(which can be the same as the server certificate)
# Run solr example with SSL on port 8984
java -jar start.jar
#
# Run post.jar so that it trusts the server cert...
java -Djavax.net.ssl.trustStore=../etc/solrtest.keystore -Durl=https://localhost:8984/solr/update -jar post.jar *.xml
# Run solr example with SSL requiring client certs on port 8984
java -Djetty.ssl.clientAuth=true -jar start.jar
#
# Run post.jar so that it trusts the server cert,
# and authenticates with a client cert
java -Djavax.net.ssl.keyStorePassword=secret -Djavax.net.ssl.keyStore=../etc/solrtest.keystore -Djavax.net.ssl.trustStore=../etc/solrtest.keystore -Durl=https://localhost:8984/solr/update -jar post.jar *.xml
-->
<!--
<Call name="addConnector">
<Arg>
<New class="org.eclipse.jetty.server.ssl.SslSelectChannelConnector">
<Arg>
<New class="org.eclipse.jetty.http.ssl.SslContextFactory">
<Set name="keyStore"><SystemProperty name="jetty.home" default="."/>/etc/solrtest.keystore</Set>
<Set name="keyStorePassword">secret</Set>
<Set name="needClientAuth"><SystemProperty name="jetty.ssl.clientAuth" default="false"/></Set>
</New>
</Arg>
<Set name="port"><SystemProperty name="jetty.ssl.port" default="8984"/></Set>
<Set name="maxIdleTime">30000</Set>
</New>
</Arg>
</Call>
-->
<!-- =========================================================== -->
<!-- Set handler Collection Structure -->
<!-- =========================================================== -->
<Set name="handler">
<New id="Handlers" class="org.eclipse.jetty.server.handler.HandlerCollection">
<Set name="handlers">
<Array type="org.eclipse.jetty.server.Handler">
<Item>
<New id="Contexts" class="org.eclipse.jetty.server.handler.ContextHandlerCollection"/>
</Item>
<Item>
<New id="DefaultHandler" class="org.eclipse.jetty.server.handler.DefaultHandler"/>
</Item>
<Item>
<New id="RequestLog" class="org.eclipse.jetty.server.handler.RequestLogHandler"/>
</Item>
</Array>
</Set>
</New>
</Set>
<!-- =========================================================== -->
<!-- Configure Request Log -->
<!-- =========================================================== -->
<!--
<Ref id="Handlers">
<Call name="addHandler">
<Arg>
<New id="RequestLog" class="org.eclipse.jetty.server.handler.RequestLogHandler">
<Set name="requestLog">
<New id="RequestLogImpl" class="org.eclipse.jetty.server.NCSARequestLog">
<Set name="filename">
logs/request.yyyy_mm_dd.log
</Set>
<Set name="filenameDateFormat">yyyy_MM_dd</Set>
<Set name="retainDays">90</Set>
<Set name="append">true</Set>
<Set name="extended">false</Set>
<Set name="logCookies">false</Set>
<Set name="LogTimeZone">UTC</Set>
</New>
</Set>
</New>
</Arg>
</Call>
</Ref>
-->
<!-- =========================================================== -->
<!-- extra options -->
<!-- =========================================================== -->
<Set name="stopAtShutdown">true</Set>
<Set name="sendServerVersion">false</Set>
<Set name="sendDateHeader">false</Set>
<Set name="gracefulShutdown">1000</Set>
<Set name="dumpAfterStart">false</Set>
<Set name="dumpBeforeStop">false</Set>
<Call name="addBean">
<Arg>
<New id="DeploymentManager" class="org.eclipse.jetty.deploy.DeploymentManager">
<Set name="contexts">
<Ref id="Contexts" />
</Set>
<Call name="setContextAttribute">
<Arg>org.eclipse.jetty.server.webapp.ContainerIncludeJarPattern</Arg>
<Arg>.*/servlet-api-[^/]*\.jar$</Arg>
</Call>
<!-- Add a customize step to the deployment lifecycle -->
<!-- uncomment and replace DebugBinding with your extended AppLifeCycle.Binding class
<Call name="insertLifeCycleNode">
<Arg>deployed</Arg>
<Arg>starting</Arg>
<Arg>customise</Arg>
</Call>
<Call name="addLifeCycleBinding">
<Arg>
<New class="org.eclipse.jetty.deploy.bindings.DebugBinding">
<Arg>customise</Arg>
</New>
</Arg>
</Call>
-->
</New>
</Arg>
</Call>
<Ref id="DeploymentManager">
<Call name="addAppProvider">
<Arg>
<New class="org.eclipse.jetty.deploy.providers.ContextProvider">
<Set name="monitoredDirName"><SystemProperty name="jetty.home" default="."/>/contexts</Set>
<Set name="scanInterval">0</Set>
</New>
</Arg>
</Call>
</Ref>
</Configure>
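
The connector definitions above take the listening port from the `jetty.port` system property and fall back to 8983. The following minimal sketch reads that setting back out of a configuration fragment with Python's standard XML parser. The embedded snippet is a trimmed copy of the active connector, and `connector_port` is a hypothetical helper, not a Jetty API.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the active SocketConnector definition from the file above.
SNIPPET = """
<Configure id="Server" class="org.eclipse.jetty.server.Server">
  <Call name="addConnector">
    <Arg>
      <New class="org.eclipse.jetty.server.bio.SocketConnector">
        <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
        <Set name="maxIdleTime">50000</Set>
      </New>
    </Arg>
  </Call>
</Configure>
"""


def connector_port(xml_text, system_properties=None):
    """Return the first connector port, honouring a SystemProperty override."""
    props = system_properties or {}
    root = ET.fromstring(xml_text)
    for setter in root.iter("Set"):
        if setter.get("name") != "port":
            continue
        prop = setter.find("SystemProperty")
        if prop is not None:
            # Use the supplied property if present, else the declared default.
            return int(props.get(prop.get("name"), prop.get("default")))
        return int(setter.text)
    return None


if __name__ == "__main__":
    print(connector_port(SNIPPET))
    print(connector_port(SNIPPET, {"jetty.port": "9000"}))
```

Passing a dictionary stands in for `-Djetty.port=...` on the Java command line. The sketch only looks at the first `port` setter it finds, which is enough for the single active connector here.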

@ -0,0 +1,38 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# To use this log config, start solr with the following system property:
# -Djava.util.logging.config.file=etc/logging.properties
## Default global logging level:
.level = INFO
## Log every update command (add, delete, commit, ...)
#org.apache.solr.update.processor.LogUpdateProcessor.level = FINE
## Where to log (space separated list).
handlers = java.util.logging.FileHandler
java.util.logging.FileHandler.level = FINE
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
# 1 GB limit per file
java.util.logging.FileHandler.limit = 1073741824
# Log to the logs directory, with log files named solrxxx.log
java.util.logging.FileHandler.pattern = ./logs/solr%u.log

Binary file not shown.

File diff suppressed because it is too large.

Binary file not shown.

@ -0,0 +1,24 @@
# Logging level
solr.log=logs/
log4j.rootLogger=INFO, file, CONSOLE
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%-4r [%t] %-5p %c %x \u2013 %m%n
#- size rotation with log cleanup.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.MaxFileSize=4MB
log4j.appender.file.MaxBackupIndex=9
#- File to log to and log format
log4j.appender.file.File=${solr.log}/solr.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n
log4j.logger.org.apache.zookeeper=WARN
log4j.logger.org.apache.hadoop=WARN
# set to INFO to enable infostream log messages
log4j.logger.org.apache.solr.update.LoggingInfoStream=OFF
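
The rolling settings above bound how much disk space the Solr log can use. This sketch estimates that bound, assuming log4j 1.x behaviour: the appender keeps the active file plus up to `MaxBackupIndex` rolled files, and size suffixes are powers of 1024. The helper names are hypothetical.

```python
# Size suffixes as interpreted by log4j 1.x (assumed: binary multiples).
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Convert a value such as '4MB' into a number of bytes."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)


def max_log_footprint(max_file_size, max_backup_index):
    """Approximate upper bound in bytes: active file plus all backups."""
    return parse_size(max_file_size) * (max_backup_index + 1)


if __name__ == "__main__":
    # Values taken from the properties above.
    print(max_log_footprint("4MB", 9))
```

With the values in this file the bound is about 40 MB. The real figure can be slightly higher, because a file is rolled only after a write has pushed it past the limit.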

@ -0,0 +1,63 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Example Solr Home Directory
=============================
This directory is provided as an example of what a "Solr Home" directory
should look like.
It's not strictly necessary that you copy all of the files in this
directory when setting up a new instance of Solr, but it is recommended.
Basic Directory Structure
-------------------------
The Solr Home directory typically contains the following...
* solr.xml *
This is the primary configuration file Solr looks for when starting.
This file specifies the list of "SolrCores" it should load, and high
level configuration options that should be used for all SolrCores.
Please see the comments in ./solr.xml for more details.
If no solr.xml file is found, then Solr assumes that there should be
a single SolrCore named "collection1" and that the "Instance Directory"
for collection1 should be the same as the Solr Home Directory.
* Individual SolrCore Instance Directories *
Although solr.xml can be configured to look for SolrCore Instance Directories
in any path, simple sub-directories of the Solr Home Dir using relative paths
are common for many installations. In this directory you can see the
"./collection1" Instance Directory.
* A Shared 'lib' Directory *
Although solr.xml can be configured with an optional "sharedLib" attribute
that can point to any path, it is common to use a "./lib" sub-directory of the
Solr Home Directory.
* ZooKeeper Files *
When running SolrCloud with the embedded ZooKeeper option, it is
common to have a "zoo.cfg" file and "zoo_data" directories in the Solr Home
Directory. Please see the SolrCloud wiki page for more details...
https://wiki.apache.org/solr/SolrCloud

@ -0,0 +1,45 @@
<?xml version="1.0" encoding="UTF-8" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!--
This is an example of a simple "solr.xml" file for configuring one or
more Solr Cores, as well as allowing Cores to be added, removed, and
reloaded via HTTP requests.
More information about options available in this configuration file,
and Solr Core administration can be found online:
http://wiki.apache.org/solr/CoreAdmin
-->
<solr>
<solrcloud>
<str name="host">${host:}</str>
<int name="hostPort">${jetty.port:8983}</int>
<str name="hostContext">${hostContext:solr}</str>
<int name="zkClientTimeout">${zkClientTimeout:15000}</int>
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
</solrcloud>
<shardHandlerFactory name="shardHandlerFactory"
class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:0}</int>
<int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>
</solr>
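
The `${name:default}` placeholders in this file are filled in from system properties, with the text after the colon used when the property is not set. The following is a minimal sketch of that substitution rule, written as an illustration only. It is not Solr's implementation, and `substitute` is a hypothetical name.

```python
import re

# Matches ${name} and ${name:default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{([^:}]+)(?::([^}]*))?\}")


def substitute(text, properties):
    """Replace each placeholder with the property value or its default."""
    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in properties:
            return str(properties[name])
        if default is not None:
            return default
        raise KeyError("no value or default for property %r" % name)
    return _PLACEHOLDER.sub(replace, text)


if __name__ == "__main__":
    line = '<int name="hostPort">${jetty.port:8983}</int>'
    print(substitute(line, {}))
    print(substitute(line, {"jetty.port": 9000}))
```

An empty default, as in `${host:}`, resolves to an empty string rather than an error, which is why the `host` entry above can be left unset.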

@ -0,0 +1,50 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Example SolrCore Instance Directory
=============================
This directory is provided as an example of what an "Instance Directory"
should look like for a SolrCore.
It's not strictly necessary that you copy all of the files in this
directory when setting up a new SolrCore, but it is recommended.
Basic Directory Structure
-------------------------
The SolrCore Instance Directory typically contains the following sub-directories...
conf/
This directory is mandatory and must contain your solrconfig.xml
and schema.xml. Any other optional configuration files would also
be kept here.
data/
This directory is the default location where Solr will keep your
index, and is used by the replication scripts for dealing with
snapshots. You can override this location in the
conf/solrconfig.xml. Solr will create this directory if it does not
already exist.
lib/
This directory is optional. If it exists, Solr will load any Jars
found in this directory and use them to resolve any "plugins"
specified in your solrconfig.xml or schema.xml (ie: Analyzers,
Request Handlers, etc...). Alternatively you can use the <lib>
syntax in conf/solrconfig.xml to direct Solr to your plugins. See
the example conf/solrconfig.xml file for details.
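
The layout described above can be checked before starting Solr. This is a minimal sketch, assuming only what the text states: conf/ is mandatory and must hold solrconfig.xml and schema.xml. The function name is hypothetical.

```python
import os

# Files the README above lists as required inside conf/.
REQUIRED_CONF_FILES = ("solrconfig.xml", "schema.xml")


def missing_conf_files(instance_dir):
    """Return the required configuration files absent from <instance_dir>/conf."""
    conf_dir = os.path.join(instance_dir, "conf")
    return [name for name in REQUIRED_CONF_FILES
            if not os.path.isfile(os.path.join(conf_dir, name))]


if __name__ == "__main__":
    # "collection1" is the example instance directory named in the Solr Home README.
    print(missing_conf_files("collection1"))
```

An empty list means both files are present. The data/ and lib/ directories are not checked because the text marks them as created on demand or optional.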

@ -0,0 +1,24 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- The content of this page will be statically included into the top-
right box of the cores overview page. Uncomment this as an example to
see where the content will show up.
<img src="img/ico/construction.png"> This line will appear at the top-
right box on collection1's Overview
-->

@ -0,0 +1,25 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- admin-extra.menu-bottom.html -->
<!--
<li>
<a href="#" style="background-image: url(img/ico/construction.png);">
LAST ITEM
</a>
</li>
-->

@ -0,0 +1,25 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- admin-extra.menu-top.html -->
<!--
<li>
<a href="#" style="background-image: url(img/ico/construction.png);">
FIRST ITEM
</a>
</li>
-->

@ -0,0 +1,67 @@
<?xml version="1.0" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- Example exchange rates file for CurrencyField type named "currency" in example schema -->
<currencyConfig version="1.0">
<rates>
<!-- Updated from http://www.exchangerate.com/ at 2011-09-27 -->
<rate from="USD" to="ARS" rate="4.333871" comment="ARGENTINA Peso" />
<rate from="USD" to="AUD" rate="1.025768" comment="AUSTRALIA Dollar" />
<rate from="USD" to="EUR" rate="0.743676" comment="European Euro" />
<rate from="USD" to="BRL" rate="1.881093" comment="BRAZIL Real" />
<rate from="USD" to="CAD" rate="1.030815" comment="CANADA Dollar" />
<rate from="USD" to="CLP" rate="519.0996" comment="CHILE Peso" />
<rate from="USD" to="CNY" rate="6.387310" comment="CHINA Yuan" />
<rate from="USD" to="CZK" rate="18.47134" comment="CZECH REP. Koruna" />
<rate from="USD" to="DKK" rate="5.515436" comment="DENMARK Krone" />
<rate from="USD" to="HKD" rate="7.801922" comment="HONG KONG Dollar" />
<rate from="USD" to="HUF" rate="215.6169" comment="HUNGARY Forint" />
<rate from="USD" to="ISK" rate="118.1280" comment="ICELAND Krona" />
<rate from="USD" to="INR" rate="49.49088" comment="INDIA Rupee" />
<rate from="USD" to="XDR" rate="0.641358" comment="INTNL MON. FUND SDR" />
<rate from="USD" to="ILS" rate="3.709739" comment="ISRAEL Sheqel" />
<rate from="USD" to="JPY" rate="76.32419" comment="JAPAN Yen" />
<rate from="USD" to="KRW" rate="1169.173" comment="KOREA (SOUTH) Won" />
<rate from="USD" to="KWD" rate="0.275142" comment="KUWAIT Dinar" />
<rate from="USD" to="MXN" rate="13.85895" comment="MEXICO Peso" />
<rate from="USD" to="NZD" rate="1.285159" comment="NEW ZEALAND Dollar" />
<rate from="USD" to="NOK" rate="5.859035" comment="NORWAY Krone" />
<rate from="USD" to="PKR" rate="87.57007" comment="PAKISTAN Rupee" />
<rate from="USD" to="PEN" rate="2.730683" comment="PERU Sol" />
<rate from="USD" to="PHP" rate="43.62039" comment="PHILIPPINES Peso" />
<rate from="USD" to="PLN" rate="3.310139" comment="POLAND Zloty" />
<rate from="USD" to="RON" rate="3.100932" comment="ROMANIA Leu" />
<rate from="USD" to="RUB" rate="32.14663" comment="RUSSIA Ruble" />
<rate from="USD" to="SAR" rate="3.750465" comment="SAUDI ARABIA Riyal" />
<rate from="USD" to="SGD" rate="1.299352" comment="SINGAPORE Dollar" />
<rate from="USD" to="ZAR" rate="8.329761" comment="SOUTH AFRICA Rand" />
<rate from="USD" to="SEK" rate="6.883442" comment="SWEDEN Krona" />
<rate from="USD" to="CHF" rate="0.906035" comment="SWITZERLAND Franc" />
<rate from="USD" to="TWD" rate="30.40283" comment="TAIWAN Dollar" />
<rate from="USD" to="THB" rate="30.89487" comment="THAILAND Baht" />
<rate from="USD" to="AED" rate="3.672955" comment="U.A.E. Dirham" />
<rate from="USD" to="UAH" rate="7.988582" comment="UKRAINE Hryvnia" />
<rate from="USD" to="GBP" rate="0.647910" comment="UNITED KINGDOM Pound" />
<!-- Cross-rates for some common currencies -->
<rate from="EUR" to="GBP" rate="0.869914" />
<rate from="EUR" to="NOK" rate="7.800095" />
<rate from="GBP" to="NOK" rate="8.966508" />
</rates>
</currencyConfig>
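
A rates file in this format can be read with a few lines of code. The sketch below loads three of the entries above and converts an amount, using a direct rate when one is listed and the inverse of the opposite rate otherwise. That lookup order is a simplification chosen for illustration and is not claimed to match Solr's own rate provider. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Three entries copied from the rates file above.
RATES_XML = """
<currencyConfig version="1.0">
  <rates>
    <rate from="USD" to="EUR" rate="0.743676" comment="European Euro" />
    <rate from="USD" to="GBP" rate="0.647910" comment="UNITED KINGDOM Pound" />
    <rate from="EUR" to="GBP" rate="0.869914" />
  </rates>
</currencyConfig>
"""


def load_rates(xml_text):
    """Map (from, to) currency pairs to their exchange rate."""
    root = ET.fromstring(xml_text)
    return {(r.get("from"), r.get("to")): float(r.get("rate"))
            for r in root.iter("rate")}


def convert(amount, source, target, rates):
    """Convert using a listed rate, or the inverse of the opposite rate."""
    if source == target:
        return amount
    if (source, target) in rates:
        return amount * rates[(source, target)]
    if (target, source) in rates:
        return amount / rates[(target, source)]
    raise KeyError("no rate between %s and %s" % (source, target))


if __name__ == "__main__":
    rates = load_rates(RATES_XML)
    print(convert(100, "USD", "EUR", rates))
```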

@ -0,0 +1,38 @@
<?xml version="1.0" encoding="UTF-8" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- If this file is found in the config directory, it will only be
loaded once at startup. If it is found in Solr's data
directory, it will be re-loaded every commit.
See http://wiki.apache.org/solr/QueryElevationComponent for more info
-->
<elevate>
<query text="foo bar">
<doc id="1" />
<doc id="2" />
<doc id="3" />
</query>
<query text="ipod">
<doc id="MA147LL/A" /> <!-- put the actual ipod at the top -->
<doc id="IW-02" exclude="true" /> <!-- exclude this cable -->
</query>
</elevate>
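
To make the structure of the elevation file concrete, here is a small sketch that reads it into a mapping from query text to elevated and excluded document ids. The helper name `load_elevations` and the `SAMPLE` string are hypothetical; this only illustrates the file layout and is not the QueryElevationComponent's own loader.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the "ipod" entry above.
SAMPLE = """<elevate>
  <query text="ipod">
    <doc id="MA147LL/A" />
    <doc id="IW-02" exclude="true" />
  </query>
</elevate>"""


def load_elevations(xml_text):
    """Map each query text to a pair (elevated ids, excluded ids)."""
    rules = {}
    for query in ET.fromstring(xml_text).iter("query"):
        elevated, excluded = [], []
        for doc in query.iter("doc"):
            # Documents marked exclude="true" are removed rather than boosted.
            target = excluded if doc.get("exclude") == "true" else elevated
            target.append(doc.get("id"))
        rules[query.get("text")] = (elevated, excluded)
    return rules


print(load_elevations(SAMPLE))
```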


@ -0,0 +1,8 @@
# Set of Catalan contractions for ElisionFilter
# TODO: load this as a resource from the analyzer and sync it in build.xml
d
l
m
n
s
t


@ -0,0 +1,15 @@
# Set of French contractions for ElisionFilter
# TODO: load this as a resource from the analyzer and sync it in build.xml
l
m
t
qu
n
s
j
d
c
jusqu
quoiqu
lorsqu
puisqu


@ -0,0 +1,5 @@
# Set of Irish contractions for ElisionFilter
# TODO: load this as a resource from the analyzer and sync it in build.xml
d
m
b


@ -0,0 +1,23 @@
# Set of Italian contractions for ElisionFilter
# TODO: load this as a resource from the analyzer and sync it in build.xml
c
l
all
dall
dell
nell
sull
coll
pell
gl
agl
dagl
degl
negl
sugl
un
m
t
s
v
d


@ -0,0 +1,5 @@
# Set of Irish hyphenations for StopFilter
# TODO: load this as a resource from the analyzer and sync it in build.xml
h
n
t


@ -0,0 +1,6 @@
# Set of overrides for the Dutch stemmer
# TODO: load this as a resource from the analyzer and sync it in build.xml
fiets fiets
bromfiets bromfiets
ei eier
kind kinder


@ -0,0 +1,420 @@
#
# This file defines a Japanese stoptag set for JapanesePartOfSpeechStopFilter.
#
# Any token whose part-of-speech tag exactly matches one defined in this
# file is removed from the token stream.
#
# Set your own stoptags by uncommenting the lines below. Note that comments are
# not allowed on the same line as a stoptag. See LUCENE-3745 for frequency lists,
# etc. that can be useful for building your own stoptag set.
#
# The entire possible tagset is provided below for convenience.
#
#####
# noun: unclassified nouns
#名詞
#
# noun-common: Common nouns or nouns where the sub-classification is undefined
#名詞-一般
#
# noun-proper: Proper nouns where the sub-classification is undefined
#名詞-固有名詞
#
# noun-proper-misc: miscellaneous proper nouns
#名詞-固有名詞-一般
#
# noun-proper-person: Personal names where the sub-classification is undefined
#名詞-固有名詞-人名
#
# noun-proper-person-misc: names that cannot be divided into surname and
# given name; foreign names; names where the surname or given name is unknown.
# e.g. お市の方
#名詞-固有名詞-人名-一般
#
# noun-proper-person-surname: Mainly Japanese surnames.
# e.g. 山田
#名詞-固有名詞-人名-姓
#
# noun-proper-person-given_name: Mainly Japanese given names.
# e.g. 太郎
#名詞-固有名詞-人名-名
#
# noun-proper-organization: Names representing organizations.
# e.g. 通産省, NHK
#名詞-固有名詞-組織
#
# noun-proper-place: Place names where the sub-classification is undefined
#名詞-固有名詞-地域
#
# noun-proper-place-misc: Place names excluding countries.
# e.g. アジア, バルセロナ, 京都
#名詞-固有名詞-地域-一般
#
# noun-proper-place-country: Country names.
# e.g. 日本, オーストラリア
#名詞-固有名詞-地域-国
#
# noun-pronoun: Pronouns where the sub-classification is undefined
#名詞-代名詞
#
# noun-pronoun-misc: miscellaneous pronouns:
# e.g. それ, ここ, あいつ, あなた, あちこち, いくつ, どこか, なに, みなさん, みんな, わたくし, われわれ
#名詞-代名詞-一般
#
# noun-pronoun-contraction: Spoken language contraction made by combining a
# pronoun and the particle 'wa'.
# e.g. ありゃ, こりゃ, こりゃあ, そりゃ, そりゃあ
#名詞-代名詞-縮約
#
# noun-adverbial: Temporal nouns such as names of days or months that behave
# like adverbs. Nouns that represent amount or ratios and can be used adverbially,
# e.g. 金曜, 一月, 午後, 少量
#名詞-副詞可能
#
# noun-verbal: Nouns that take arguments with case and can appear followed by
# 'suru' and related verbs (する, できる, なさる, くださる)
# e.g. インプット, 愛着, 悪化, 悪戦苦闘, 一安心, 下取り
#名詞-サ変接続
#
# noun-adjective-base: The base form of adjectives, words that appear before な ("na")
# e.g. 健康, 安易, 駄目, だめ
#名詞-形容動詞語幹
#
# noun-numeric: Arabic numbers, Chinese numerals, and counters like 何 (回), 数.
# e.g. 0, 1, 2, 何, 数, 幾
#名詞-数
#
# noun-affix: noun affixes where the sub-classification is undefined
#名詞-非自立
#
# noun-affix-misc: Of adnominalizers, the case-marker の ("no"), and words that
# attach to the base form of inflectional words, words that cannot be classified
# into any of the other categories below. This category includes indefinite nouns.
# e.g. あかつき, 暁, かい, 甲斐, 気, きらい, 嫌い, くせ, 癖, こと, 事, ごと, 毎, しだい, 次第,
# 順, せい, 所為, ついで, 序で, つもり, 積もり, 点, どころ, の, はず, 筈, はずみ, 弾み,
# 拍子, ふう, ふり, 振り, ほう, 方, 旨, もの, 物, 者, ゆえ, 故, ゆえん, 所以, わけ, 訳,
# わり, 割り, 割, ん-口語/, もん-口語/
#名詞-非自立-一般
#
# noun-affix-adverbial: noun affixes that can behave as adverbs.
# e.g. あいだ, 間, あげく, 挙げ句, あと, 後, 余り, 以外, 以降, 以後, 以上, 以前, 一方, うえ,
# 上, うち, 内, おり, 折り, かぎり, 限り, きり, っきり, 結果, ころ, 頃, さい, 際, 最中, さなか,
# 最中, じたい, 自体, たび, 度, ため, 為, つど, 都度, とおり, 通り, とき, 時, ところ, 所,
# とたん, 途端, なか, 中, のち, 後, ばあい, 場合, 日, ぶん, 分, ほか, 他, まえ, 前, まま,
# 儘, 侭, みぎり, 矢先
#名詞-非自立-副詞可能
#
# noun-affix-aux: noun affixes treated as 助動詞 ("auxiliary verb") in school grammars
# with the stem よう(だ) ("you(da)").
# e.g. よう, やう, 様 (よう)
#名詞-非自立-助動詞語幹
#
# noun-affix-adjective-base: noun affixes that can connect to the indeclinable
# connection form な (aux "da").
# e.g. みたい, ふう
#名詞-非自立-形容動詞語幹
#
# noun-special: special nouns where the sub-classification is undefined.
#名詞-特殊
#
# noun-special-aux: The そうだ ("souda") stem form that is used for reporting news, is
# treated as 助動詞 ("auxiliary verb") in school grammars, and attaches to the base
# form of inflectional words.
# e.g. そう
#名詞-特殊-助動詞語幹
#
# noun-suffix: noun suffixes where the sub-classification is undefined.
#名詞-接尾
#
# noun-suffix-misc: Of the nouns or stem forms of other parts of speech that connect
# to ガル or タイ and can combine into compound nouns, words that cannot be classified into
# any of the other categories below. In general, this category is more inclusive than
# 接尾語 ("suffix") and is usually the last element in a compound noun.
# e.g. おき, かた, 方, 甲斐 (がい), がかり, ぎみ, 気味, ぐるみ, (~した) さ, 次第, 済 (ず) み,
# よう, (でき)っこ, 感, 観, 性, 学, 類, 面, 用
#名詞-接尾-一般
#
# noun-suffix-person: Suffixes that form nouns and attach to person names more often
# than other nouns.
# e.g. 君, 様, 著
#名詞-接尾-人名
#
# noun-suffix-place: Suffixes that form nouns and attach to place names more often
# than other nouns.
# e.g. 町, 市, 県
#名詞-接尾-地域
#
# noun-suffix-verbal: Of the suffixes that attach to nouns and form nouns, those that
# can appear before スル ("suru").
# e.g. 化, 視, 分け, 入り, 落ち, 買い
#名詞-接尾-サ変接続
#
# noun-suffix-aux: The stem form of そうだ (様態) that is used to indicate conditions,
# is treated as 助動詞 ("auxiliary verb") in school grammars, and attaches to the
# conjunctive form of inflectional words.
# e.g. そう
#名詞-接尾-助動詞語幹
#
# noun-suffix-adjective-base: Suffixes that attach to other nouns or the conjunctive
# form of inflectional words and appear before the copula だ ("da").
# e.g. 的, げ, がち
#名詞-接尾-形容動詞語幹
#
# noun-suffix-adverbial: Suffixes that attach to other nouns and can behave as adverbs.
# e.g. 後 (ご), 以後, 以降, 以前, 前後, 中, 末, 上, 時 (じ)
#名詞-接尾-副詞可能
#
# noun-suffix-classifier: Suffixes that attach to numbers and form nouns. This category
# is more inclusive than 助数詞 ("classifier") and includes common nouns that attach
# to numbers.
# e.g. 個, つ, 本, 冊, パーセント, cm, kg, カ月, か国, 区画, 時間, 時半
#名詞-接尾-助数詞
#
# noun-suffix-special: Special suffixes that mainly attach to inflecting words.
# e.g. (楽し) さ, (考え) 方
#名詞-接尾-特殊
#
# noun-suffix-conjunctive: Nouns that behave like conjunctions and join two words
# together.
# e.g. (日本) 対 (アメリカ), 対 (アメリカ), (3) 対 (5), (女優) 兼 (主婦)
#名詞-接続詞的
#
# noun-verbal_aux: Nouns that attach to the conjunctive particle て ("te") and are
# semantically verb-like.
# e.g. ごらん, ご覧, 御覧, 頂戴
#名詞-動詞非自立的
#
# noun-quotation: text that cannot be segmented into words, proverbs, Chinese poetry,
# dialects, English, etc. Currently, the only entry for 名詞 引用文字列 ("noun quotation")
# is いわく ("iwaku").
#名詞-引用文字列
#
# noun-nai_adjective: Words that appear before the auxiliary verb ない ("nai") and
# behave like an adjective.
# e.g. 申し訳, 仕方, とんでも, 違い
#名詞-ナイ形容詞語幹
#
#####
# prefix: unclassified prefixes
#接頭詞
#
# prefix-nominal: Prefixes that attach to nouns (including adjective stem forms)
# excluding numerical expressions.
# e.g. お (水), 某 (氏), 同 (社), 故 (~氏), 高 (品質), お (見事), ご (立派)
#接頭詞-名詞接続
#
# prefix-verbal: Prefixes that attach to the imperative form of a verb or a verb
# in conjunctive form followed by なる/なさる/くださる.
# e.g. お (読みなさい), お (座り)
#接頭詞-動詞接続
#
# prefix-adjectival: Prefixes that attach to adjectives.
# e.g. お (寒いですねえ), バカ (でかい)
#接頭詞-形容詞接続
#
# prefix-numerical: Prefixes that attach to numerical expressions.
# e.g. 約, およそ, 毎時
#接頭詞-数接続
#
#####
# verb: unclassified verbs
#動詞
#
# verb-main:
#動詞-自立
#
# verb-auxiliary:
#動詞-非自立
#
# verb-suffix:
#動詞-接尾
#
#####
# adjective: unclassified adjectives
#形容詞
#
# adjective-main:
#形容詞-自立
#
# adjective-auxiliary:
#形容詞-非自立
#
# adjective-suffix:
#形容詞-接尾
#
#####
# adverb: unclassified adverbs
#副詞
#
# adverb-misc: Words that can be segmented into one unit and where adnominal
# modification is not possible.
# e.g. あいかわらず, 多分
#副詞-一般
#
# adverb-particle_conjunction: Adverbs that can be followed by の, は, に,
# な, する, だ, etc.
# e.g. こんなに, そんなに, あんなに, なにか, なんでも
#副詞-助詞類接続
#
#####
# adnominal: Words that only have noun-modifying forms.
# e.g. この, その, あの, どの, いわゆる, なんらかの, 何らかの, いろんな, こういう, そういう, ああいう,
# どういう, こんな, そんな, あんな, どんな, 大きな, 小さな, おかしな, ほんの, たいした,
# 「(, も) さる (ことながら)」, 微々たる, 堂々たる, 単なる, いかなる, 我が」「同じ, 亡き
#連体詞
#
#####
# conjunction: Conjunctions that can occur independently.
# e.g. が, けれども, そして, じゃあ, それどころか
接続詞
#
#####
# particle: unclassified particles.
助詞
#
# particle-case: case particles where the subclassification is undefined.
助詞-格助詞
#
# particle-case-misc: Case particles.
# e.g. から, が, で, と, に, へ, より, を, の, にて
助詞-格助詞-一般
#
# particle-case-quote: the "to" that appears after nouns, a person's speech,
# quotation marks, expressions of decisions from a meeting, reasons, judgements,
# conjectures, etc.
# e.g. ( だ) と (述べた.), ( である) と (して執行猶予...)
助詞-格助詞-引用
#
# particle-case-compound: Compounds of particles and verbs that mainly behave
# like case particles.
# e.g. という, といった, とかいう, として, とともに, と共に, でもって, にあたって, に当たって, に当って,
# にあたり, に当たり, に当り, に当たる, にあたる, において, に於いて,に於て, における, に於ける,
# にかけ, にかけて, にかんし, に関し, にかんして, に関して, にかんする, に関する, に際し,
# に際して, にしたがい, に従い, に従う, にしたがって, に従って, にたいし, に対し, にたいして,
# に対して, にたいする, に対する, について, につき, につけ, につけて, につれ, につれて, にとって,
# にとり, にまつわる, によって, に依って, に因って, により, に依り, に因り, による, に依る, に因る,
# にわたって, にわたる, をもって, を以って, を通じ, を通じて, を通して, をめぐって, をめぐり, をめぐる,
# って-口語/, ちゅう-関西弁「という」/, (何) ていう (人)-口語/, っていう-口語/, といふ, とかいふ
助詞-格助詞-連語
#
# particle-conjunctive:
# e.g. から, からには, が, けれど, けれども, けど, し, つつ, て, で, と, ところが, どころか, とも, ども,
# ながら, なり, ので, のに, ば, ものの, や ( した), やいなや, (ころん) じゃ(いけない)-口語/,
# (行っ) ちゃ(いけない)-口語/, (言っ) たって (しかたがない)-口語/, (それがなく)ったって (平気)-口語/
助詞-接続助詞
#
# particle-dependency:
# e.g. こそ, さえ, しか, すら, は, も, ぞ
助詞-係助詞
#
# particle-adverbial:
# e.g. がてら, かも, くらい, 位, ぐらい, しも, (学校) じゃ(これが流行っている)-口語/,
# (それ)じゃあ (よくない)-口語/, ずつ, (私) なぞ, など, (私) なり (に), (先生) なんか (大嫌い)-口語/,
# (私) なんぞ, (先生) なんて (大嫌い)-口語/, のみ, だけ, (私) だって-口語/, だに,
# (彼)ったら-口語/, (お茶) でも (いかが), 等 (とう), (今後) とも, ばかり, ばっか-口語/, ばっかり-口語/,
# ほど, 程, まで, 迄, (誰) も (が)([助詞-格助詞] および [助詞-係助詞] の前に位置する「も」)
助詞-副助詞
#
# particle-interjective: particles with interjective grammatical roles.
# e.g. (松島) や
助詞-間投助詞
#
# particle-coordinate:
# e.g. と, たり, だの, だり, とか, なり, や, やら
助詞-並立助詞
#
# particle-final:
# e.g. かい, かしら, さ, ぜ, (だ)っけ-口語/, (とまってる) で-方言/, な, ナ, なあ-口語/, ぞ, ね, ネ,
# ねぇ-口語/, ねえ-口語/, ねん-方言/, の, のう-口語/, や, よ, ヨ, よぉ-口語/, わ, わい-口語/
助詞-終助詞
#
# particle-adverbial/conjunctive/final: The particle "ka" when unknown whether it is
# adverbial, conjunctive, or sentence final. For example:
# (a) 「A か B か」. Ex:「(国内で運用する) か,(海外で運用する) か (.)」
# (b) Inside an adverb phrase. Ex:「(幸いという) か (, 死者はいなかった.)」
# 「(祈りが届いたせい) か (, 試験に合格した.)」
# (c) 「かのように」. Ex:「(何もなかった) か (のように振る舞った.)」
# e.g. か
助詞-副助詞/並立助詞/終助詞
#
# particle-adnominalizer: The "no" that attaches to nouns and modifies
# non-inflectional words.
助詞-連体化
#
# particle-adverbializer: The "ni" and "to" that appear following nouns and adverbs
# that are giongo, giseigo, or gitaigo.
# e.g. に, と
助詞-副詞化
#
# particle-special: A particle that does not fit into one of the above classifications.
# This includes particles that are used in Tanka, Haiku, and other poetry.
# e.g. かな, けむ, ( しただろう) に, (あんた) にゃ(わからん), (俺) ん (家)
助詞-特殊
#
#####
# auxiliary-verb:
助動詞
#
#####
# interjection: Greetings and other exclamations.
# e.g. おはよう, おはようございます, こんにちは, こんばんは, ありがとう, どうもありがとう, ありがとうございます,
# いただきます, ごちそうさま, さよなら, さようなら, はい, いいえ, ごめん, ごめんなさい
#感動詞
#
#####
# symbol: unclassified Symbols.
記号
#
# symbol-misc: A general symbol not in one of the categories below.
# e.g. [○◎@$〒→+]
記号-一般
#
# symbol-comma: Commas
# e.g. [,、]
記号-読点
#
# symbol-period: Periods and full stops.
# e.g. [..。]
記号-句点
#
# symbol-space: Full-width whitespace.
記号-空白
#
# symbol-open_bracket:
# e.g. [({‘“『【]
記号-括弧開
#
# symbol-close_bracket:
# e.g. [)}’”』」】]
記号-括弧閉
#
# symbol-alphabetic:
#記号-アルファベット
#
#####
# other: unclassified other
#その他
#
# other-interjection: Words that are hard to classify as noun-suffixes or
# sentence-final particles.
# e.g. (だ)ァ
その他-間投
#
#####
# filler: Aizuchi that occurs during a conversation or sounds inserted as filler.
# e.g. あの, うんと, えと
フィラー
#
#####
# non-verbal: non-verbal sound.
非言語音
#
#####
# fragment:
#語断片
#
#####
# unknown: unknown part of speech.
#未知語
#
##### End of file
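
The header of the stoptag file describes its format: `#` starts a comment line, every other non-empty line is a tag, and a token is dropped only when its tag matches exactly. A minimal sketch of that rule follows. The names `load_stoptags` and `remove_stoptagged`, the `SAMPLE` string, and the (surface, tag) token representation are assumptions for illustration, not the JapanesePartOfSpeechStopFilter API.

```python
# Hypothetical excerpt: one commented-out tag and two active tags.
SAMPLE = """#
# noun: unclassified nouns
#名詞
#
# conjunction: Conjunctions that can occur independently.
接続詞
#
# particle: unclassified particles.
助詞
"""


def load_stoptags(text):
    """Collect every non-empty line that is not a '#' comment."""
    tags = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tags.add(line)
    return tags


def remove_stoptagged(tokens, stoptags):
    """Keep the surface forms whose tag is not an exact member of stoptags."""
    return [surface for surface, tag in tokens if tag not in stoptags]


stoptags = load_stoptags(SAMPLE)
tokens = [("猫", "名詞-一般"), ("は", "助詞-係助詞"), ("そして", "接続詞")]
print(remove_stoptagged(tokens, stoptags))
```

Because matching is exact, listing only the parent tag 助詞 does not remove a token tagged 助詞-係助詞; each subcategory has to be listed on its own line, which is why the file enumerates them individually.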


@ -0,0 +1,125 @@
# This file was created by Jacques Savoy and is distributed under the BSD license.
# See http://members.unine.ch/jacques.savoy/clef/index.html.
# Also see http://www.opensource.org/licenses/bsd-license.html
# Cleaned on October 11, 2009 (not normalized, so use before normalization)
# This means that when modifying this list, you might need to add some
# redundant entries, for example containing forms with both أ and ا
من
ومن
منها
منه
في
وفي
فيها
فيه
و
ف
ثم
او
أو
ب
بها
به
ا
أ
اى
اي
أي
أى
لا
ولا
الا
ألا
إلا
لكن
ما
وما
كما
فما
عن
مع
اذا
إذا
ان
أن
إن
انها
أنها
إنها
انه
أنه
إنه
بان
بأن
فان
فأن
وان
وأن
وإن
التى
التي
الذى
الذي
الذين
الى
الي
إلى
إلي
على
عليها
عليه
اما
أما
إما
ايضا
أيضا
كل
وكل
لم
ولم
لن
ولن
هى
هي
هو
وهى
وهي
وهو
فهى
فهي
فهو
انت
أنت
لك
لها
له
هذه
هذا
تلك
ذلك
هناك
كانت
كان
يكون
تكون
وكانت
وكان
غير
بعض
قد
نحو
بين
بينما
منذ
ضمن
حيث
الان
الآن
خلال
بعد
قبل
حتى
عند
عندما
لدى
جميع


@ -0,0 +1,193 @@
# This file was created by Jacques Savoy and is distributed under the BSD license.
# See http://members.unine.ch/jacques.savoy/clef/index.html.
# Also see http://www.opensource.org/licenses/bsd-license.html
а
аз
ако
ала
бе
без
беше
би
бил
била
били
било
близо
бъдат
бъде
бяха
в
вас
ваш
ваша
вероятно
вече
взема
ви
вие
винаги
все
всеки
всички
всичко
всяка
във
въпреки
върху
г
ги
главно
го
д
да
дали
до
докато
докога
дори
досега
доста
е
едва
един
ето
за
зад
заедно
заради
засега
затова
защо
защото
и
из
или
им
има
имат
иска
й
каза
как
каква
какво
както
какъв
като
кога
когато
което
които
кой
който
колко
която
къде
където
към
ли
м
ме
между
мен
ми
мнозина
мога
могат
може
моля
момента
му
н
на
над
назад
най
направи
напред
например
нас
не
него
нея
ни
ние
никой
нито
но
някои
някой
няма
обаче
около
освен
особено
от
отгоре
отново
още
пак
по
повече
повечето
под
поне
поради
после
почти
прави
пред
преди
през
при
пък
първо
с
са
само
се
сега
си
скоро
след
сме
според
сред
срещу
сте
съм
със
също
т
тази
така
такива
такъв
там
твой
те
тези
ти
тн
то
това
тогава
този
той
толкова
точно
трябва
тук
тъй
тя
тях
у
харесва
ч
че
често
чрез
ще
щом
я


@ -0,0 +1,220 @@
# Catalan stopwords from http://github.com/vcl/cue.language (Apache 2 Licensed)
a
abans
ací
ah
així
això
al
als
aleshores
algun
alguna
algunes
alguns
alhora
allà
allí
allò
altra
altre
altres
amb
ambdós
ambdues
apa
aquell
aquella
aquelles
aquells
aquest
aquesta
aquestes
aquests
aquí
baix
cada
cadascú
cadascuna
cadascunes
cadascuns
com
contra
d'un
d'una
d'unes
d'uns
dalt
de
del
dels
des
després
dins
dintre
donat
doncs
durant
e
eh
el
els
em
en
encara
ens
entre
érem
eren
éreu
es
és
esta
està
estàvem
estaven
estàveu
esteu
et
etc
ets
fins
fora
gairebé
ha
han
has
havia
he
hem
heu
hi
ho
i
igual
iguals
ja
l'hi
la
les
li
li'n
llavors
m'he
ma
mal
malgrat
mateix
mateixa
mateixes
mateixos
me
mentre
més
meu
meus
meva
meves
molt
molta
moltes
molts
mon
mons
n'he
n'hi
ne
ni
no
nogensmenys
només
nosaltres
nostra
nostre
nostres
o
oh
oi
on
pas
pel
pels
per
però
perquè
poc
poca
pocs
poques
potser
propi
qual
quals
quan
quant
que
què
quelcom
qui
quin
quina
quines
quins
s'ha
s'han
sa
semblant
semblants
ses
seu
seus
seva
seva
seves
si
sobre
sobretot
sóc
solament
sols
son
són
sons
sota
sou
t'ha
t'han
t'he
ta
tal
també
tampoc
tan
tant
tanta
tantes
teu
teus
teva
teves
ton
tons
tot
tota
totes
tots
un
una
unes
uns
us
va
vaig
vam
van
vas
veu
vosaltres
vostra
vostre
vostres


@ -0,0 +1,172 @@
a
s
k
o
i
u
v
z
dnes
cz
tímto
budeš
budem
byli
jseš
můj
svým
ta
tomto
tohle
tuto
tyto
jej
zda
proč
máte
tato
kam
tohoto
kdo
kteří
mi
nám
tom
tomuto
mít
nic
proto
kterou
byla
toho
protože
asi
ho
naši
napište
re
což
tím
takže
svých
její
svými
jste
aj
tu
tedy
teto
bylo
kde
ke
pravé
ji
nad
nejsou
či
pod
téma
mezi
přes
ty
pak
vám
ani
když
však
neg
jsem
tento
článku
články
aby
jsme
před
pta
jejich
byl
ještě
bez
také
pouze
první
vaše
která
nás
nový
tipy
pokud
může
strana
jeho
své
jiné
zprávy
nové
není
vás
jen
podle
zde
být
více
bude
již
než
který
by
které
co
nebo
ten
tak
při
od
po
jsou
jak
další
ale
si
se
ve
to
jako
za
zpět
ze
do
pro
je
na
atd
atp
jakmile
přičemž
on
ona
ono
oni
ony
my
vy
ji
mne
jemu
tomu
těm
těmu
němu
němuž
jehož
jíž
jelikož
jež
jakož
načež


@ -0,0 +1,108 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/danish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
| A Danish stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.
| This is a ranked list (commonest to rarest) of stopwords derived from
| a large text sample.
og | and
i | in
jeg | I
det | that (dem. pronoun)/it (pers. pronoun)
at | that (in front of a sentence)/to (with infinitive)
en | a/an
den | it (pers. pronoun)/that (dem. pronoun)
til | to/at/for/until/against/by/of/into, more
er | present tense of "to be"
som | who, as
på | on/upon/in/on/at/to/after/of/with/for, on
de | they
med | with/by/in, along
han | he
af | of/by/from/off/for/in/with/on, off
for | at/for/to/from/by/of/ago, in front/before, because
ikke | not
der | who/which, there/those
var | past tense of "to be"
mig | me/myself
sig | oneself/himself/herself/itself/themselves
men | but
et | a/an/one, one (number), someone/somebody/one
har | present tense of "to have"
om | round/about/for/in/a, about/around/down, if
vi | we
min | my
havde | past tense of "to have"
ham | him
hun | she
nu | now
over | over/above/across/by/beyond/past/on/about, over/past
da | then, when/as/since
fra | from/off/since, off, since
du | you
ud | out
sin | his/her/its/one's
dem | them
os | us/ourselves
op | up
man | you/one
hans | his
hvor | where
eller | or
hvad | what
skal | must/shall etc.
selv | myself/yourself/herself/ourselves etc., even
her | here
alle | all/everyone/everybody etc.
vil | will (verb)
blev | past tense of "to stay/to remain/to get/to become"
kunne | could
ind | in
når | when
være | to be (infinitive)
dog | however/yet/after all
noget | something
ville | would
jo | you know/you see (adv), yes
deres | their/theirs
efter | after/behind/according to/for/by/from, later/afterwards
ned | down
skulle | should
denne | this
end | than
dette | this
mit | my/mine
også | also
under | under/beneath/below/during, below/underneath
have | have
dig | you
anden | other
hende | her
mine | my
alt | everything
meget | much/very, plenty of
sit | his, her, its, one's
sine | his, her, its, one's
vor | our
mod | against
disse | these
hvis | if
din | your/yours
nogle | some
hos | by/at
blive | be/become
mange | many
ad | by/through
bliver | present tense of "to be/to become"
hendes | her/hers
været | been
thi | for (conj)
jer | you
sådan | such, like this/like that
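
The Snowball-style format used by this list is described in its header: a vertical bar starts a comment, and each stop word sits at the start of a line. The sketch below shows one way to read such a file. The function name `load_snowball_stopwords` and the `SAMPLE` string are hypothetical, and splitting the remaining text on whitespace (so that a line may carry several words) is an assumption of this sketch rather than a statement about Lucene's loader.

```python
# Hypothetical excerpt in the same format as the Danish list above.
SAMPLE = """ | A Danish stop word list. Comments begin with vertical bar.
og | and
i | in
jeg | I
"""


def load_snowball_stopwords(text):
    """Return the set of words found before the '|' comment marker."""
    words = set()
    for line in text.splitlines():
        # Everything from the first vertical bar onward is a comment.
        content = line.split("|", 1)[0]
        words.update(content.split())
    return words


print(sorted(load_snowball_stopwords(SAMPLE)))
```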


@ -0,0 +1,292 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/german/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
| A German stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.
| The number of forms in this list is reduced significantly by passing it
| through the German stemmer.
aber | but
alle | all
allem
allen
aller
alles
als | than, as
also | so
am | an + dem
an | at
ander | other
andere
anderem
anderen
anderer
anderes
anderm
andern
anderr
anders
auch | also
auf | on
aus | out of
bei | by
bin | am
bis | until
bist | art
da | there
damit | with it
dann | then
der | the
den
des
dem
die
das
daß | that
derselbe | the same
derselben
denselben
desselben
demselben
dieselbe
dieselben
dasselbe
dazu | to that
dein | thy
deine
deinem
deinen
deiner
deines
denn | because
derer | of those
dessen | of him
dich | thee
dir | to thee
du | thou
dies | this
diese
diesem
diesen
dieser
dieses
doch | (several meanings)
dort | (over) there
durch | through
ein | a
eine
einem
einen
einer
eines
einig | some
einige
einigem
einigen
einiger
einiges
einmal | once
er | he
ihn | him
ihm | to him
es | it
etwas | something
euer | your
eure
eurem
euren
eurer
eures
für | for
gegen | towards
gewesen | p.p. of sein
hab | have
habe | have
haben | have
hat | has
hatte | had
hatten | had
hier | here
hin | there
hinter | behind
ich | I
mich | me
mir | to me
ihr | you, to her
ihre
ihrem
ihren
ihrer
ihres
euch | to you
im | in + dem
in | in
indem | while
ins | in + das
ist | is
jede | each, every
jedem
jeden
jeder
jedes
jene | that
jenem
jenen
jener
jenes
jetzt | now
kann | can
kein | no
keine
keinem
keinen
keiner
keines
können | can
könnte | could
machen | do
man | one
manche | some, many a
manchem
manchen
mancher
manches
mein | my
meine
meinem
meinen
meiner
meines
mit | with
muss | must
musste | had to
nach | to(wards)
nicht | not
nichts | nothing
noch | still, yet
nun | now
nur | only
ob | whether
oder | or
ohne | without
sehr | very
sein | his
seine
seinem
seinen
seiner
seines
selbst | self
sich | herself
sie | they, she
ihnen | to them
sind | are
so | so
solche | such
solchem
solchen
solcher
solches
soll | shall
sollte | should
sondern | but
sonst | else
über | over
um | about, around
und | and
uns | us
unse
unsem
unsen
unser
unses
unter | under
viel | much
vom | von + dem
von | from
vor | before
während | while
war | was
waren | were
warst | wast
was | what
weg | away, off
weil | because
weiter | further
welche | which
welchem
welchen
welcher
welches
wenn | when
werde | will
werden | will
wie | how
wieder | again
will | want
wir | we
wird | will
wirst | willst
wo | where
wollen | want
wollte | wanted
würde | would
würden | would
zu | to
zum | zu + dem
zur | zu + der
zwar | indeed
zwischen | between


@ -0,0 +1,78 @@
# Lucene Greek Stopwords list
# Note: by default this file is used after GreekLowerCaseFilter,
# so when modifying this file use 'σ' instead of 'ς'
ο
η
το
οι
τα
του
τησ
των
τον
την
και
κι
κ
ειμαι
εισαι
ειναι
ειμαστε
ειστε
στο
στον
στη
στην
μα
αλλα
απο
για
προσ
με
σε
ωσ
παρα
αντι
κατα
μετα
θα
να
δε
δεν
μη
μην
επι
ενω
εαν
αν
τοτε
που
πωσ
ποιοσ
ποια
ποιο
ποιοι
ποιεσ
ποιων
ποιουσ
αυτοσ
αυτη
αυτο
αυτοι
αυτων
αυτουσ
αυτεσ
αυτα
εκεινοσ
εκεινη
εκεινο
εκεινοι
εκεινεσ
εκεινα
εκεινων
εκεινουσ
οπωσ
ομωσ
ισωσ
οσο
οτι
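
The note at the top of the Greek list says entries are matched after GreekLowerCaseFilter has run, which is why they are written with 'σ' where ordinary spelling has a final 'ς' and without accents. The following is a rough approximation of that normalization for checking new entries, not a port of the Lucene filter; the function name `normalize_greek` is hypothetical.

```python
import unicodedata


def normalize_greek(token):
    """Lowercase, drop combining accents, and fold final sigma into sigma."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    # Remove combining marks such as the tonos left behind by decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.replace("ς", "σ")


print(normalize_greek("της"))
print(normalize_greek("Είναι"))
```

Running a candidate word through a helper like this before adding it to the list is one way to keep entries in the form the note asks for.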


@ -0,0 +1,54 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# a couple of test stopwords to test that the words are really being
# configured from this file:
stopworda
stopwordb
# Standard english stop words taken from Lucene's StopAnalyzer
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with


@ -0,0 +1,354 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/spanish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
| A Spanish stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.
| The following is a ranked list (commonest to rarest) of stopwords
| deriving from a large sample of text.
| Extra words have been added at the end.
de | from, of
la | the, her
que | who, that
el | the
en | in
y | and
a | to
los | the, them
del | de + el
se | himself, from him etc
las | the, them
por | for, by, etc
un | a
para | for
con | with
no | no
una | a
su | his, her
al | a + el
| es from SER
lo | him
como | how
más | more
pero | but
sus | su plural
le | to him, her
ya | already
o | or
| fue from SER
este | this
| ha from HABER
sí | himself etc
porque | because
esta | this
| son from SER
entre | between
| está from ESTAR
cuando | when
muy | very
sin | without
sobre | on
| ser from SER
| tiene from TENER
también | also
me | me
hasta | until
hay | there is/are
donde | where
| han from HABER
quien | whom, that
| están from ESTAR
| estado from ESTAR
desde | from
todo | all
nos | us
durante | during
| estados from ESTAR
todos | all
uno | a
les | to them
ni | nor
contra | against
otros | other
| fueron from SER
ese | that
eso | that
| había from HABER
ante | before
ellos | they
e | and (variant of y)
esto | this
mí | me
antes | before
algunos | some
qué | what?
unos | a
yo | I
otro | other
otras | other
otra | other
él | he
tanto | so much, many
esa | that
estos | these
mucho | much, many
quienes | who
nada | nothing
muchos | many
cual | who
| sea from SER
poco | few
ella | she
estar | to be
| haber from HABER
estas | these
| estaba from ESTAR
| estamos from ESTAR
algunas | some
algo | something
nosotros | we
| other forms
mi | me
mis | mi plural
tú | thou
te | thee
ti | thee
tu | thy
tus | tu plural
ellas | they
nosotras | we
vosotros | you
vosotras | you
os | you
mío | mine
mía |
míos |
mías |
tuyo | thine
tuya |
tuyos |
tuyas |
suyo | his, hers, theirs
suya |
suyos |
suyas |
nuestro | ours
nuestra |
nuestros |
nuestras |
vuestro | yours
vuestra |
vuestros |
vuestras |
esos | those
esas | those
| forms of estar, to be (not including the infinitive):
estoy
estás
está
estamos
estáis
están
esté
estés
estemos
estéis
estén
estaré
estarás
estará
estaremos
estaréis
estarán
estaría
estarías
estaríamos
estaríais
estarían
estaba
estabas
estábamos
estabais
estaban
estuve
estuviste
estuvo
estuvimos
estuvisteis
estuvieron
estuviera
estuvieras
estuviéramos
estuvierais
estuvieran
estuviese
estuvieses
estuviésemos
estuvieseis
estuviesen
estando
estado
estada
estados
estadas
estad
| forms of haber, to have (not including the infinitive):
he
has
ha
hemos
habéis
han
haya
hayas
hayamos
hayáis
hayan
habré
habrás
habrá
habremos
habréis
habrán
habría
habrías
habríamos
habríais
habrían
había
habías
habíamos
habíais
habían
hube
hubiste
hubo
hubimos
hubisteis
hubieron
hubiera
hubieras
hubiéramos
hubierais
hubieran
hubiese
hubieses
hubiésemos
hubieseis
hubiesen
habiendo
habido
habida
habidos
habidas
| forms of ser, to be (not including the infinitive):
soy
eres
es
somos
sois
son
sea
seas
seamos
seáis
sean
seré
serás
será
seremos
seréis
serán
sería
serías
seríamos
seríais
serían
era
eras
éramos
erais
eran
fui
fuiste
fue
fuimos
fuisteis
fueron
fuera
fueras
fuéramos
fuerais
fueran
fuese
fueses
fuésemos
fueseis
fuesen
siendo
sido
| sed also means 'thirst'
| forms of tener, to have (not including the infinitive):
tengo
tienes
tiene
tenemos
tenéis
tienen
tenga
tengas
tengamos
tengáis
tengan
tendré
tendrás
tendrá
tendremos
tendréis
tendrán
tendría
tendrías
tendríamos
tendríais
tendrían
tenía
tenías
teníamos
teníais
tenían
tuve
tuviste
tuvo
tuvimos
tuvisteis
tuvieron
tuviera
tuvieras
tuviéramos
tuvierais
tuvieran
tuviese
tuvieses
tuviésemos
tuvieseis
tuviesen
teniendo
tenido
tenida
tenidos
tenidas
tened


@ -0,0 +1,99 @@
# example set of basque stopwords
al
anitz
arabera
asko
baina
bat
batean
batek
bati
batzuei
batzuek
batzuetan
batzuk
bera
beraiek
berau
berauek
bere
berori
beroriek
beste
bezala
da
dago
dira
ditu
du
dute
edo
egin
ere
eta
eurak
ez
gainera
gu
gutxi
guzti
haiei
haiek
haietan
hainbeste
hala
han
handik
hango
hara
hari
hark
hartan
hau
hauei
hauek
hauetan
hemen
hemendik
hemengo
hi
hona
honek
honela
honetan
honi
hor
hori
horiei
horiek
horietan
horko
horra
horrek
horrela
horretan
horri
hortik
hura
izan
ni
noiz
nola
non
nondik
nongo
nor
nora
ze
zein
zen
zenbait
zenbat
zer
zergatik
ziren
zituen
zu
zuek
zuen
zuten


@ -0,0 +1,313 @@
# This file was created by Jacques Savoy and is distributed under the BSD license.
# See http://members.unine.ch/jacques.savoy/clef/index.html.
# Also see http://www.opensource.org/licenses/bsd-license.html
# Note: by default this file is used after normalization, so when adding entries
# to this file, use the Arabic 'ي' instead of 'ی'
انان
نداشته
سراسر
خياه
ايشان
وي
تاكنون
بيشتري
دوم
پس
ناشي
وگو
يا
داشتند
سپس
هنگام
هرگز
پنج
نشان
امسال
ديگر
گروهي
شدند
چطور
ده
و
دو
نخستين
ولي
چرا
چه
وسط
ه
كدام
قابل
يك
رفت
هفت
همچنين
در
هزار
بله
بلي
شايد
اما
شناسي
گرفته
دهد
داشته
دانست
داشتن
خواهيم
ميليارد
وقتيكه
امد
خواهد
جز
اورده
شده
بلكه
خدمات
شدن
برخي
نبود
بسياري
جلوگيري
حق
كردند
نوعي
بعري
نكرده
نظير
نبايد
بوده
بودن
داد
اورد
هست
جايي
شود
دنبال
داده
بايد
سابق
هيچ
همان
انجا
كمتر
كجاست
گردد
كسي
تر
مردم
تان
دادن
بودند
سري
جدا
ندارند
مگر
يكديگر
دارد
دهند
بنابراين
هنگامي
سمت
جا
انچه
خود
دادند
زياد
دارند
اثر
بدون
بهترين
بيشتر
البته
به
براساس
بيرون
كرد
بعضي
گرفت
توي
اي
ميليون
او
جريان
تول
بر
مانند
برابر
باشيم
مدتي
گويند
اكنون
تا
تنها
جديد
چند
بي
نشده
كردن
كردم
گويد
كرده
كنيم
نمي
نزد
روي
قصد
فقط
بالاي
ديگران
اين
ديروز
توسط
سوم
ايم
دانند
سوي
استفاده
شما
كنار
داريم
ساخته
طور
امده
رفته
نخست
بيست
نزديك
طي
كنيد
از
انها
تمامي
داشت
يكي
طريق
اش
چيست
روب
نمايد
گفت
چندين
چيزي
تواند
ام
ايا
با
ان
ايد
ترين
اينكه
ديگري
راه
هايي
بروز
همچنان
پاعين
كس
حدود
مختلف
مقابل
چيز
گيرد
ندارد
ضد
همچون
سازي
شان
مورد
باره
مرسي
خويش
برخوردار
چون
خارج
شش
هنوز
تحت
ضمن
هستيم
گفته
فكر
بسيار
پيش
براي
روزهاي
انكه
نخواهد
بالا
كل
وقتي
كي
چنين
كه
گيري
نيست
است
كجا
كند
نيز
يابد
بندي
حتي
توانند
عقب
خواست
كنند
بين
تمام
همه
ما
باشند
مثل
شد
اري
باشد
اره
طبق
بعد
اگر
صورت
غير
جاي
بيش
ريزي
اند
زيرا
چگونه
بار
لطفا
مي
درباره
من
ديده
همين
گذاري
برداري
علت
گذاشته
هم
فوق
نه
ها
شوند
اباد
همواره
هر
اول
خواهند
چهار
نام
امروز
مان
هاي
قبل
كنم
سعي
تازه
را
هستند
زير
جلوي
عنوان
بود


@ -0,0 +1,95 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/finnish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
| forms of BE
olla
olen
olet
on
olemme
olette
ovat
ole | negative form
oli
olisi
olisit
olisin
olisimme
olisitte
olisivat
olit
olin
olimme
olitte
olivat
ollut
olleet
en | negation
et
ei
emme
ette
eivät
|Nom Gen Acc Part Iness Elat Illat Adess Ablat Allat Ess Trans
minä minun minut minua minussa minusta minuun minulla minulta minulle | I
sinä sinun sinut sinua sinussa sinusta sinuun sinulla sinulta sinulle | you
hän hänen hänet häntä hänessä hänestä häneen hänellä häneltä hänelle | he she
me meidän meidät meitä meissä meistä meihin meillä meiltä meille | we
te teidän teidät teitä teissä teistä teihin teillä teiltä teille | you
he heidän heidät heitä heissä heistä heihin heillä heiltä heille | they
tämä tämän tätä tässä tästä tähän tallä tältä tälle tänä täksi | this
tuo tuon tuotä tuossa tuosta tuohon tuolla tuolta tuolle tuona tuoksi | that
se sen sitä siinä siitä siihen sillä siltä sille sinä siksi | it
nämä näiden näitä näissä näistä näihin näillä näiltä näille näinä näiksi | these
nuo noiden noita noissa noista noihin noilla noilta noille noina noiksi | those
ne niiden niitä niissä niistä niihin niillä niiltä niille niinä niiksi | they
kuka kenen kenet ketä kenessä kenestä keneen kenellä keneltä kenelle kenenä keneksi| who
ketkä keiden ketkä keitä keissä keistä keihin keillä keiltä keille keinä keiksi | (pl)
mikä minkä minkä mitä missä mistä mihin millä miltä mille minä miksi | which what
mitkä | (pl)
joka jonka jota jossa josta johon jolla jolta jolle jona joksi | who which
jotka joiden joita joissa joista joihin joilla joilta joille joina joiksi | (pl)
| conjunctions
että | that
ja | and
jos | if
koska | because
kuin | than
mutta | but
niin | so
sekä | and
sillä | for
tai | or
vaan | but
vai | or
vaikka | although
| prepositions
kanssa | with
mukaan | according to
noin | about
poikki | across
yli | over, across
| other
kun | when
niin | so
nyt | now
itse | self

@ -0,0 +1,184 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/french/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
| A French stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.
au | a + le
aux | a + les
avec | with
ce | this
ces | these
dans | in
de | of
des | de + les
du | de + le
elle | she
en | `of them' etc
et | and
eux | them
il | he
je | I
la | the
le | the
leur | their
lui | him
ma | my (fem)
mais | but
me | me
même | same; as in moi-même (myself) etc
mes | my (pl)
moi | me
mon | my (masc)
ne | not
nos | our (pl)
notre | our
nous | we
on | one
ou | or
par | by
pas | not
pour | for
qu | que before vowel
que | that
qui | who
sa | his, her (fem)
se | oneself
ses | his (pl)
son | his, her (masc)
sur | on
ta | thy (fem)
te | thee
tes | thy (pl)
toi | thee
ton | thy (masc)
tu | thou
un | a
une | a
vos | your (pl)
votre | your
vous | you
| single letter forms
c | c'
d | d'
j | j'
l | l'
à | to, at
m | m'
n | n'
s | s'
t | t'
y | there
| forms of être (not including the infinitive):
été
étée
étées
étés
étant
suis
es
est
sommes
êtes
sont
serai
seras
sera
serons
serez
seront
serais
serait
serions
seriez
seraient
étais
était
étions
étiez
étaient
fus
fut
fûmes
fûtes
furent
sois
soit
soyons
soyez
soient
fusse
fusses
fût
fussions
fussiez
fussent
| forms of avoir (not including the infinitive):
ayant
eu
eue
eues
eus
ai
as
avons
avez
ont
aurai
auras
aura
aurons
aurez
auront
aurais
aurait
aurions
auriez
auraient
avais
avait
avions
aviez
avaient
eut
eûmes
eûtes
eurent
aie
aies
ait
ayons
ayez
aient
eusse
eusses
eût
eussions
eussiez
eussent
| Later additions (from Jean-Christophe Deschamps)
ceci | this
cela | that
celà | that
cet | this
cette | this
ici | here
ils | they
les | the (pl)
leurs | their (pl)
quel | which
quels | which
quelle | which
quelles | which
sans | without
soi | oneself

@ -0,0 +1,110 @@
a
ach
ag
agus
an
aon
ar
arna
as
b'
ba
beirt
bhúr
caoga
ceathair
ceathrar
chomh
chtó
chuig
chun
cois
céad
cúig
cúigear
d'
daichead
dar
de
deich
deichniúr
den
dhá
do
don
dtí
dár
faoi
faoin
faoina
faoinár
fara
fiche
gach
gan
go
gur
haon
hocht
i
iad
idir
in
ina
ins
inár
is
le
leis
lena
lenár
m'
mar
mo
na
nach
naoi
naonúr
níor
nócha
ocht
ochtar
os
roimh
sa
seacht
seachtar
seachtó
seasca
seisear
siad
sibh
sinn
sna
tar
thar
thú
triúr
trí
trína
trínár
tríocha
um
ár
é
éis
í
ó
ón
óna
ónár

@ -0,0 +1,161 @@
# Galician stopwords
a
aínda
alí
aquel
aquela
aquelas
aqueles
aquilo
aquí
ao
aos
as
así
á
ben
cando
che
co
coa
comigo
con
connosco
contigo
convosco
coas
cos
cun
cuns
cunha
cunhas
da
dalgunha
dalgunhas
dalgún
dalgúns
das
de
del
dela
delas
deles
desde
deste
do
dos
dun
duns
dunha
dunhas
e
el
ela
elas
eles
en
era
eran
esa
esas
ese
eses
esta
estar
estaba
está
están
este
estes
estiven
estou
eu
é
facer
foi
foron
fun
había
hai
iso
isto
la
las
lle
lles
lo
los
mais
me
meu
meus
min
miña
miñas
moi
na
nas
neste
nin
no
non
nos
nosa
nosas
noso
nosos
nós
nun
nunha
nuns
nunhas
o
os
ou
ó
ós
para
pero
pode
pois
pola
polas
polo
polos
por
que
se
senón
ser
seu
seus
sexa
sido
sobre
súa
súas
tamén
tan
te
ten
teñen
teño
ter
teu
teus
ti
tido
tiña
tiven
túa
túas
un
unha
unhas
uns
vos
vosa
vosas
voso
vosos
vós

@ -0,0 +1,235 @@
# Also see http://www.opensource.org/licenses/bsd-license.html
# See http://members.unine.ch/jacques.savoy/clef/index.html.
# This file was created by Jacques Savoy and is distributed under the BSD license.
# Note: by default this file also contains forms normalized by HindiNormalizer
# for spelling variation (see section below), such that it can be used whether or
# not you enable that feature. When adding additional entries to this list,
# please add the normalized form as well.
अंदर
अत
अपना
अपनी
अपने
अभी
आदि
आप
इत्यादि
इन
इनका
इन्हीं
इन्हें
इन्हों
इस
इसका
इसकी
इसके
इसमें
इसी
इसे
उन
उनका
उनकी
उनके
उनको
उन्हीं
उन्हें
उन्हों
उस
उसके
उसी
उसे
एक
एवं
एस
ऐसे
और
कई
कर
करता
करते
करना
करने
करें
कहते
कहा
का
काफ़ी
कि
कितना
किन्हें
किन्हों
किया
किर
किस
किसी
किसे
की
कुछ
कुल
के
को
कोई
कौन
कौनसा
गया
घर
जब
जहाँ
जा
जितना
जिन
जिन्हें
जिन्हों
जिस
जिसे
जीधर
जैसा
जैसे
जो
तक
तब
तरह
तिन
तिन्हें
तिन्हों
तिस
तिसे
तो
था
थी
थे
दबारा
दिया
दुसरा
दूसरे
दो
द्वारा
नहीं
ना
निहायत
नीचे
ने
पर
पर
पहले
पूरा
पे
फिर
बनी
बही
बहुत
बाद
बाला
बिलकुल
भी
भीतर
मगर
मानो
मे
में
यदि
यह
यहाँ
यही
या
यिह
ये
रखें
रहा
रहे
ऱ्वासा
लिए
लिये
लेकिन
वर्ग
वह
वह
वहाँ
वहीं
वाले
वुह
वे
वग़ैरह
संग
सकता
सकते
सबसे
सभी
साथ
साबुत
साभ
सारा
से
सो
ही
हुआ
हुई
हुए
है
हैं
हो
होता
होती
होते
होना
होने
# additional normalized forms of the above
अपनि
जेसे
होति
सभि
तिंहों
इंहों
दवारा
इसि
किंहें
थि
उंहों
ओर
जिंहें
वहिं
अभि
बनि
हि
उंहिं
उंहें
हें
वगेरह
एसे
रवासा
कोन
निचे
काफि
उसि
पुरा
भितर
हे
बहि
वहां
कोइ
यहां
जिंहों
तिंहें
किसि
कइ
यहि
इंहिं
जिधर
इंहें
अदि
इतयादि
हुइ
कोनसा
इसकि
दुसरे
जहां
अप
किंहों
उनकि
भि
वरग
हुअ
जेसा
नहिं

@ -0,0 +1,209 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/hungarian/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
| Hungarian stop word list
| prepared by Anna Tordai
a
ahogy
ahol
aki
akik
akkor
alatt
által
általában
amely
amelyek
amelyekben
amelyeket
amelyet
amelynek
ami
amit
amolyan
amíg
amikor
át
abban
ahhoz
annak
arra
arról
az
azok
azon
azt
azzal
azért
aztán
azután
azonban
bár
be
belül
benne
cikk
cikkek
cikkeket
csak
de
e
eddig
egész
egy
egyes
egyetlen
egyéb
egyik
egyre
ekkor
el
elég
ellen
elő
először
előtt
első
én
éppen
ebben
ehhez
emilyen
ennek
erre
ez
ezt
ezek
ezen
ezzel
ezért
és
fel
felé
hanem
hiszen
hogy
hogyan
igen
így
illetve
ill.
ill
ilyen
ilyenkor
ison
ismét
itt
jól
jobban
kell
kellett
keresztül
keressünk
ki
kívül
között
közül
legalább
lehet
lehetett
legyen
lenne
lenni
lesz
lett
maga
magát
majd
majd
már
más
másik
meg
még
mellett
mert
mely
melyek
mi
mit
míg
miért
milyen
mikor
minden
mindent
mindenki
mindig
mint
mintha
mivel
most
nagy
nagyobb
nagyon
ne
néha
nekem
neki
nem
néhány
nélkül
nincs
olyan
ott
össze
ő
ők
őket
pedig
persze
s
saját
sem
semmi
sok
sokat
sokkal
számára
szemben
szerint
szinte
talán
tehát
teljes
tovább
továbbá
több
úgy
ugyanis
új
újabb
újra
után
utána
utolsó
vagy
vagyis
valaki
valami
valamint
való
vagyok
van
vannak
volt
voltam
voltak
voltunk
vissza
vele
viszont
volna

@ -0,0 +1,46 @@
# example set of Armenian stopwords.
այդ
այլ
այն
այս
դու
դուք
եմ
են
ենք
ես
եք
է
էի
էին
էինք
էիր
էիք
էր
ըստ
թ
ի
ին
իսկ
իր
կամ
համար
հետ
հետո
մենք
մեջ
մի
ն
նա
նաև
նրա
նրանք
որ
որը
որոնք
որպես
ու
ում
պիտի
վրա
և

Some files were not shown because too many files have changed in this diff.
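
The Snowball-derived lists in this diff (Finnish, French, Hungarian) note that comments begin with a vertical bar, and the Finnish pronoun table puts several words on one line. A minimal sketch of reading that format is shown below. It assumes Python 3 and whitespace-separated words; the function name `parse_snowball_stopwords` and the sample text are hypothetical and are not part of Stackdump or Solr.

```python
def parse_snowball_stopwords(text):
    """Return the set of stop words found in a Snowball-style list."""
    words = set()
    for line in text.splitlines():
        # Everything from the first vertical bar onward is a comment.
        content = line.split("|", 1)[0]
        # A line may carry several words, as in the Finnish pronoun table,
        # so split on whitespace rather than taking the whole line.
        words.update(content.split())
    return words


# Hypothetical sample assembled from lines in the lists above.
sample = """\
| forms of BE
olla
ole | negative form
minä minun minut | I
au | a + le
"""

stopwords = parse_snowball_stopwords(sample)
print(sorted(stopwords))
```

The lists that use `#` for comments (Galician, Hindi, Armenian) follow a different convention, with one word per line, so this sketch would not strip their comment lines.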